Recursive Partitioning: Add function to report importance scores #295
Conversation
JIRA: MADLIB-1254 If tree_train/forest_train is run with grouping enabled and one of the groups has a categorical feature with just a single level, then the categorical feature is eliminated for that group. If other groups retain that feature, we end up with an incorrect "bins" data structure built as part of DT. This commit fixes the issue by recording the categorical features present in each group separately. Closes apache#295
Refer to this link for build results (access rights to CI server needed).
Suggested a few changes. IMO, the Python functions are better suited for random_forest.py_in, since that file knows about both RF and DT, whereas decision_tree.py_in should (ideally) be RF-agnostic.
_assert(table_exists(group_table),
        "Recursive Partitioning: Model group table does not exist.")
# this flag has to be set to true for RF to report importance scores.
isImportance = plpy.execute("SELECT importance FROM {summary_table}".
Our convention is to use snake case for Python (i.e. is_importance). I also suggest changing it to is_importance_set to make it more explicit.
def _is_model_for_RF(summary_table):
    # Only an RF model (and not DT) would have num_trees column in summary
    return columns_exist_in_table(summary_table, ['num_trees'])
We have a method_name in the summary table that makes this clearer.
def _is_RF_model_with_imp_pre_1_15(group_table, summary_table):
Since the goal of this function is to check whether impurity_var_importance exists in the group table, let's name it to reflect that.
unnest(regexp_split_to_array(cat_features, ',')) AS feature,
unnest(cat_var_importance) AS var_importance
FROM {group_table}, {summary_table}
""".format(**locals()))
IMO, it's cleaner to see the two queries UNIONed together to get the output.
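One way to read the UNION suggestion, sketched as a plain query builder. The cat_features/cat_var_importance columns appear in the snippet above; the mirrored con_features/con_var_importance columns for continuous features are an assumption about the schema.

```python
def build_importance_query(group_table, summary_table):
    # Combine categorical and continuous importances with a single UNION ALL,
    # so the result is one (feature, var_importance) relation.
    # The continuous-feature column names are assumed, not confirmed.
    return """
        SELECT unnest(regexp_split_to_array(cat_features, ',')) AS feature,
               unnest(cat_var_importance) AS var_importance
        FROM {group_table}, {summary_table}
        UNION ALL
        SELECT unnest(regexp_split_to_array(con_features, ',')) AS feature,
               unnest(con_var_importance) AS var_importance
        FROM {group_table}, {summary_table}
        """.format(group_table=group_table, summary_table=summary_table)
```

Building the string separately keeps it unit-testable before handing it to plpy.execute.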
Thank you for the comments @iyerr3, will make the necessary changes.
Should impurity_var_importance always add up to 100?
results in
which does not add up to 100
The forest was built with 10 trees, so not sure if it is a coincidence that both add up to 90 (maybe 1 tree does not contribute in some way?)
Another run I got
so this does seem to be about trees contributing or not. Please check on this and how the normalization is done.
@fmcquillan99 the importance scores for all variables within a tree would sum up to 100. But in the case of a forest, we average over the scores for each variable across all trees in the forest, and the sum of those averages might not be 100. One reason could be that some variables might get an importance score of
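One mechanism consistent with the 90-out-of-100 observation above can be reproduced with toy numbers: if one of ten trees contributes nothing, the per-variable averages sum to 90. All values below are illustrative, not from the actual run.

```python
# Ten trees; each contributing tree's scores sum to 100, but suppose one
# tree contributes nothing (all zeros). Illustrative numbers only.
trees = [{'a': 50.0, 'b': 30.0, 'c': 20.0} for _ in range(9)]
trees.append({'a': 0.0, 'b': 0.0, 'c': 0.0})  # the non-contributing tree

# Forest score = per-variable average across all trees.
forest = {v: sum(t[v] for t in trees) / len(trees) for v in ('a', 'b', 'c')}
total = sum(forest.values())
print(total)  # 90.0, not 100
```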
Considering the above situation, I suggest the variable importance values not be scaled to sum to 100. We can make the normalization within
Would this apply to oob too?
@iyerr3, as you noted, that would change the importance scores that appear in the DT/RF output table, and the
@fmcquillan only impurity, I don't think we scale oob to 100.
So the way R does this is it keeps the original values in its model, but it has a "report" function that outputs multiple things including importance, and in this report the values are normalized. The user can access the original values from the model if needed, but would mostly use the report to understand the importance. We could essentially do the same and be clear about this in the documentation. Alternatively, we drop the normalization to 100 altogether; it was done solely to compare with R.
I like this last suggestion from @iyerr3, that we report raw values for oob and impurity VI in the model output table. (OK to keep the shifted oob > 0 as we do now.) For the helper/reporting function, compute and report the scaled/normalized values 0-100 for both oob and impurity VI. These should always add up to 100 unless there are corner cases; if so, please let us know.
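The reporting-function approach amounts to keeping raw scores in the model table and scaling only at report time. A minimal sketch, with a made-up function name and a guarded corner case:

```python
def normalize_importance(raw_scores):
    """Scale raw importance values so they sum to 100 for reporting.

    Raw values stay untouched in the model table; only the report is
    scaled. If everything is zero (a corner case the caller should
    surface), the raw values are returned unchanged instead of
    dividing by zero. Function name and contract are illustrative.
    """
    total = sum(raw_scores.values())
    if total == 0:
        return dict(raw_scores)
    return {k: 100.0 * v / total for k, v in raw_scores.items()}
```

For the averages from the 10-tree example (45, 27, 18), this reports 50, 30, and 20.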
The overall code and refactoring LGTM. A couple of things to address:
- Two unit test cases are failing; please fix those. The command to run the unit tests is python src/ports/postgres/9.4/modules/recursive_partitioning/test/unit_tests/test_random_forest.py, where postgres/9.4 must be replaced with whatever is appropriate in your environment.
- The documentation in DT and RF must change to reflect the unnormalized values for impurity var importance when the model is trained.
{impurity_var_importance_str}
FROM {importance_model_table}, {summary_table}
""".format(**locals()))
# ------------------------------------------------------------------------------
+1
Force-pushed 2f6acc4 to 7f6e291.
JIRA: MADLIB-925 This commit adds a new MADlib function (get_var_importance) to report the importance scores in decision tree and random forest, by unnesting the importance values along with corresponding features. Closes apache#295 Co-authored-by: Rahul Iyer <riyer@apache.org> Co-authored-by: Jingyi Mei <jmei@pivotal.io> Co-authored-by: Orhan Kislal <okislal@pivotal.io>
LGTM, but for one requested change.
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.get_var_importance(
    message TEXT
) RETURNS TEXT AS $$
PythonFunction(recursive_partitioning, random_forest, tree_importance_help_message)
tree_importance_help_message is defined in decision_tree.py_in, so select madlib.get_var_importance() fails at the moment.
Good catch
Force-pushed 7f6e291 to 3ab7554.
LGTM, here is an RF example that seems correct and adds up to 100 per group:
Force-pushed fec0e6d to 186390f.
JIRA: MADLIB-925
This commit adds a new MADlib function (get_var_importance) to report the
importance scores in decision tree and random forest. RF models prior to
MADlib 1.15 used to have variable importance scores reported, but they
also have impurity variable importance from 1.15 onwards. This function
reports both those scores for >=1.15 RF models, and only the oob variable
importance score for <1.15 RF models.
When called for a DT model, this function returns the impurity
variable importance score for >=1.15 DT models.
Co-authored-by: Jingyi Mei jmei@pivotal.io
Co-authored-by: Orhan Kislal okislal@pivotal.io