
Recursive Partitioning: Add function to report importance scores #295

Merged
merged 2 commits into apache:master from feature/output-importance on Aug 1, 2018

Conversation

njayaram2
Contributor

JIRA: MADLIB-925

This commit adds a new MADlib function (get_var_importance) to report the
importance scores of decision tree and random forest models. RF models prior
to MADlib 1.15 reported only out-of-bag (oob) variable importance; from 1.15
onwards they also report impurity variable importance. This function reports
both scores for >=1.15 RF models, and only the oob variable importance score
for <1.15 RF models. When called on a DT model, it returns the impurity
variable importance score for >=1.15 DT models.

Co-authored-by: Jingyi Mei <jmei@pivotal.io>
Co-authored-by: Orhan Kislal <okislal@pivotal.io>
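
A minimal usage sketch (table names here are placeholders; the function takes the model table produced by tree_train/forest_train and the name of an output table to create):

DROP TABLE IF EXISTS rf_importance;
-- 'rf_model' is the output table of madlib.tree_train()/madlib.forest_train()
SELECT madlib.get_var_importance('rf_model', 'rf_importance');
SELECT * FROM rf_importance ORDER BY impurity_var_importance DESC;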

iyerr3 added a commit to madlib/madlib that referenced this pull request Jul 18, 2018
JIRA: MADLIB-1254

If tree_train/forest_train is run with grouping enabled and if one of
the groups has a categorical feature with just single level, then the
categorical feature is eliminated for that group. If other groups retain
that feature, then we end up with incorrect "bins" data structure built
as part of DT.

This commit fixes this issue by recording the categorical features
present in each group separately.

Closes apache#295
@asfgit

asfgit commented Jul 19, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/573/

@asfgit

asfgit commented Jul 19, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/576/

Contributor

@iyerr3 iyerr3 left a comment

Suggested a few changes. IMO, the Python functions are better suited for random_forest.py_in, since that file knows about both RF and DT, whereas decision_tree.py_in should (ideally) be RF-agnostic.

_assert(table_exists(group_table),
        "Recursive Partitioning: Model group table does not exist.")
# this flag has to be set to true for RF to report importance scores.
isImportance = plpy.execute("SELECT importance FROM {summary_table}".
Contributor

Our convention is to use snake case for Python (i.e. is_importance). I also suggest changing it to is_importance_set to make it more explicit.


def _is_model_for_RF(summary_table):
    # Only an RF model (and not DT) would have num_trees column in summary
    return columns_exist_in_table(summary_table, ['num_trees'])
Contributor

We have a method_name column in the summary table that makes this clearer.
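
A minimal sketch of the suggested check (the exact value stored in method_name is an assumption here):

-- hypothetical: read the recorded training method from the summary table,
-- rather than inferring the model type from the presence of num_trees
SELECT method_name = 'forest_train' AS is_rf_model
FROM model_summary;  -- placeholder name for the model's summary table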

    # Only an RF model (and not DT) would have num_trees column in summary
    return columns_exist_in_table(summary_table, ['num_trees'])

def _is_RF_model_with_imp_pre_1_15(group_table, summary_table):
Contributor

Since the goal of the function is to check if impurity_var_importance exists in the group table, let's name it to reflect that.

unnest(regexp_split_to_array(cat_features, ',')) AS feature,
unnest(cat_var_importance) AS var_importance
FROM {group_table}, {summary_table}
""".format(**locals()))
Contributor

IMO, it's cleaner to see the two queries UNIONed together to get the output.
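
A sketch of the UNIONed form (the cat_* columns appear in the snippet above; the con_* counterparts for continuous features are assumed by analogy):

SELECT unnest(regexp_split_to_array(cat_features, ',')) AS feature,
       unnest(cat_var_importance) AS var_importance
FROM {group_table}, {summary_table}
UNION ALL
SELECT unnest(regexp_split_to_array(con_features, ',')) AS feature,
       unnest(con_var_importance) AS var_importance
FROM {group_table}, {summary_table}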

@njayaram2
Contributor Author

Thank you for the comments, @iyerr3; will make the necessary changes.

njayaram2 added a commit to madlib/madlib that referenced this pull request Jul 19, 2018
@asfgit

asfgit commented Jul 19, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/579/

@asfgit

asfgit commented Jul 19, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/580/

@fmcquillan99

fmcquillan99 commented Jul 19, 2018

Should impurity_var_importance always add up to 100?
From the regression example in the user docs:

DROP TABLE IF EXISTS mt_imp_output;
SELECT madlib.get_var_importance('mt_cars_output','mt_imp_output');
SELECT am, impurity_var_importance FROM mt_imp_output ORDER BY am, impurity_var_importance DESC;

results in


 am | impurity_var_importance 
----+-------------------------
  0 |        35.7664683110879
  0 |        24.7481977075922
  0 |        12.4401197123678
  0 |        12.1559096708347
  0 |        4.88929809351791
  1 |        31.7259035495099
  1 |        29.6146492693988
  1 |        14.9602257795489
  1 |        7.01369118455985
  1 |        6.68552870777581
(10 rows)

which does not add up to 100

		grp 0			grp 1
		35.76646831		31.72590355
		24.74819771		29.61464927
		12.44011971		14.96022578
		12.15590967		7.013691185
		4.889298094		6.685528708
total	89.9999935		89.99999849

The forest was built with 10 trees, so I'm not sure if it is a coincidence that both add up to 90 (maybe 1 tree does not contribute in some way?).

@fmcquillan99

fmcquillan99 commented Jul 19, 2018

Another run I got

			grp 0				grp 1
			31.01364943			31.66666576
			22.85881741			33.33333245
			13.70257438			0
			6.344527751			3.333333304
			26.0804244			11.66666654
total		99.99999336			79.99999806

so this does seem to be about trees contributing or not. Please check on this and how the normalization is done.

@njayaram2
Contributor Author

@fmcquillan99 the importance scores for all variables within a single tree do sum to 100. But in the case of a forest, we average the scores for each variable across all trees in the forest, and the sum of those averages might not be 100. One reason is that some variables may get an importance score of 0 in some trees; in particular, in a tree with only a single node, all variables get a score of 0.
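
A toy example of the effect, runnable as plain SQL (not MADlib code): in a 2-tree forest where the second tree is a single node, every score from that tree is 0, so the per-forest averages sum to 50 rather than 100. By the same arithmetic, 10 trees of which one is a single node give 900/10 = 90, matching the totals above.

SELECT feature, avg(score) AS avg_importance
FROM (VALUES ('x1', 60.0), ('x2', 40.0),  -- tree 1: scores sum to 100
             ('x1',  0.0), ('x2',  0.0))  -- tree 2: single node, all zeros
     AS t(feature, score)
GROUP BY feature;
-- x1 -> 30, x2 -> 20; forest total = 50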

@iyerr3
Contributor

iyerr3 commented Jul 19, 2018

Considering the above situation, I suggest the variable importance values not be scaled to sum to 100. We can perform the normalization within get_var_importance just for reporting (which is the behavior in rpart). In other words, the output table would keep the original values (for DT and RF), but the helper function would rescale during the report for ease of reading the values.
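
At report time the rescaling could be a single window-function pass over the helper's output (a sketch, reusing the mt_imp_output table and the am grouping column from the example above):

SELECT am, feature,
       100 * impurity_var_importance /
           NULLIF(sum(impurity_var_importance) OVER (PARTITION BY am), 0)
           AS impurity_var_importance_scaled
FROM mt_imp_output;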

@fmcquillan

Would this apply to oob too?
Or just impurity?

@njayaram2
Contributor Author

@iyerr3, as you noted, the importance scores in the DT/RF output table would then differ from those in the get_var_importance output table. Wouldn't that be a little confusing?
One other option is to do the scaling outside the C++ function, in Python just before writing to the DT/RF output table (although this would force us to call that normalization code in both DT and RF).

@njayaram2
Contributor Author

@fmcquillan only impurity, I don't think we scale oob to 100.

@iyerr3
Contributor

iyerr3 commented Jul 20, 2018

So the way R does this is that it keeps the original values in its model, but it has a "report" function that outputs multiple things, including importance, and in that report the values are normalized. The user can access the original values from the model if needed, but would mostly use the report to understand the importance.

We could essentially do the same and be clear about this in the documentation. Alternatively, we could drop the normalization to 100 altogether; it was done solely to compare with R.

@fmcquillan99

I like this last suggestion from @iyerr3: report raw values for oob and impurity variable importance in the model output table. (OK to keep the shifted oob > 0 as we do now.)

For the helper/reporting function, compute and report the scaled/normalized values (0-100) for both oob and impurity variable importance. These should always add up to 100 unless there are corner cases; if so, please let us know.

@asfgit

asfgit commented Jul 23, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/591/

Contributor Author

@njayaram2 njayaram2 left a comment

The overall code and refactoring LGTM. A couple of things to address:

  • Two unit test cases are failing; please fix those. The command to run the unit tests is python src/ports/postgres/9.4/modules/recursive_partitioning/test/unit_tests/test_random_forest.py, where postgres/9.4 must be replaced with whatever is appropriate in your environment.
  • The documentation in DT and RF must change to reflect the unnormalized values for impurity variable importance when the model is trained.

{impurity_var_importance_str}
FROM {importance_model_table}, {summary_table}
""".format(**locals()))
# ------------------------------------------------------------------------------
Contributor Author

+1

@asfgit

asfgit commented Jul 24, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/592/

@iyerr3 iyerr3 force-pushed the feature/output-importance branch from 2f6acc4 to 7f6e291 Compare July 25, 2018 22:10
iyerr3 added a commit to madlib/madlib that referenced this pull request Jul 25, 2018
JIRA: MADLIB-925

This commit adds a new MADlib function (get_var_importance) to report the
importance scores in decision tree and random forest, by unnesting the
importance values along with corresponding features.

Closes apache#295

Co-authored-by: Rahul Iyer <riyer@apache.org>
Co-authored-by: Jingyi Mei <jmei@pivotal.io>
Co-authored-by: Orhan Kislal <okislal@pivotal.io>
@asfgit

asfgit commented Jul 25, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/601/

Contributor Author

@njayaram2 njayaram2 left a comment

LGTM, but for one requested change.

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.get_var_importance(
    message TEXT
) RETURNS TEXT AS $$
PythonFunction(recursive_partitioning, random_forest, tree_importance_help_message)
Contributor Author

tree_importance_help_message is defined in decision_tree.py_in, so SELECT madlib.get_var_importance() fails at the moment.
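
A possible fix, assuming the help message stays in decision_tree.py_in, would be to point the dispatcher at that module instead:

PythonFunction(recursive_partitioning, decision_tree, tree_importance_help_message)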

Contributor

Good catch

@iyerr3 iyerr3 force-pushed the feature/output-importance branch from 7f6e291 to 3ab7554 Compare July 26, 2018 18:43
iyerr3 added a commit to madlib/madlib that referenced this pull request Jul 26, 2018
JIRA: MADLIB-925

This commit adds a new MADlib function (get_var_importance) to report the
importance scores in decision tree and random forest, by unnesting the
importance values along with corresponding features.

Closes apache#295

Co-authored-by: Rahul Iyer <riyer@apache.org>
Co-authored-by: Jingyi Mei <jmei@pivotal.io>
Co-authored-by: Orhan Kislal <okislal@pivotal.io>
@asfgit

asfgit commented Jul 26, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/610/

@fmcquillan99

fmcquillan99 commented Aug 1, 2018

LGTM, here is an RF example that seems correct and adds up to 100 per group:

SELECT * FROM mt_imp_output ORDER BY am, oob_var_importance DESC;
 am | feature | oob_var_importance | impurity_var_importance 
----+---------+--------------------+-------------------------
  0 | cyl     |   31.6266798136018 |        8.99888201496216
  0 | disp    |   21.3534749649495 |        30.5938284017064
  0 | vs      |   20.2312669968611 |        25.4855561460076
  0 | wt      |   16.3410741245189 |        19.7783684870616
  0 | qsec    |   10.4475041000687 |        15.1433649502623
  1 | wt      |    34.239597267579 |        24.9348163610914
  1 | disp    |   29.4316514472623 |        31.1638455198447
  1 | cyl     |   21.9435741528927 |        20.1221371309527
  1 | vs      |   14.3851771322661 |        17.5142973837102
  1 | qsec    |                  0 |        6.26490360440106
(10 rows)

njayaram2 and others added 2 commits August 1, 2018 12:58
JIRA: MADLIB-925

This commit adds a new MADlib function (get_var_importance) to report the
importance scores in decision tree and random forest by unnesting the
importance values along with corresponding features.

Closes apache#295

Co-authored-by: Rahul Iyer <riyer@apache.org>
Co-authored-by: Jingyi Mei <jmei@pivotal.io>
Co-authored-by: Orhan Kislal <okislal@pivotal.io>
@asfgit asfgit merged commit 186390f into apache:master Aug 1, 2018
@asfgit

asfgit commented Aug 1, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/630/
