Recursive Partitioning: Add function to report importance scores #295
Conversation
JIRA: MADLIB-1254 If tree_train/forest_train is run with grouping enabled and one of the groups has a categorical feature with just a single level, then the categorical feature is eliminated for that group. If other groups retain that feature, we end up with an incorrect "bins" data structure built as part of DT. This commit fixes the issue by recording the categorical features present in each group separately. Closes apache#295
Refer to this link for build results (access rights to CI server needed).
Suggested a few changes. IMO, the Python functions are better suited for random_forest.py_in, since that file knows about both RF and DT, whereas decision_tree.py_in should (ideally) be RF-agnostic.
_assert(table_exists(group_table),
        "Recursive Partitioning: Model group table does not exist.")
# this flag has to be set to true for RF to report importance scores.
isImportance = plpy.execute("SELECT importance FROM {summary_table}".
Our convention is to use snake case for Python (i.e. is_importance). I also suggest changing it to is_importance_set to make it more explicit.
def _is_model_for_RF(summary_table):
    # Only an RF model (and not DT) would have num_trees column in summary
    return columns_exist_in_table(summary_table, ['num_trees'])
We have a method_name in the summary table that makes this clearer.
def _is_RF_model_with_imp_pre_1_15(group_table, summary_table):
Since the goal of this function is to check whether impurity_var_importance exists in the group table, let's name it to reflect that.
unnest(regexp_split_to_array(cat_features, ',')) AS feature,
unnest(cat_var_importance) AS var_importance
FROM {group_table}, {summary_table}
""".format(**locals()))
IMO, it's cleaner to see the two queries UNIONed together to get the output.
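One way to read the UNION suggestion, sketched as a plain query builder. The cat_features/cat_var_importance columns appear in the snippet above; the mirrored con_features/con_var_importance columns for continuous features are an assumption about the schema.

```python
def build_importance_query(group_table, summary_table):
    # Combine categorical and continuous importances with a single UNION ALL,
    # so the result is one (feature, var_importance) relation.
    # The continuous-feature column names are assumed, not confirmed.
    return """
        SELECT unnest(regexp_split_to_array(cat_features, ',')) AS feature,
               unnest(cat_var_importance) AS var_importance
        FROM {group_table}, {summary_table}
        UNION ALL
        SELECT unnest(regexp_split_to_array(con_features, ',')) AS feature,
               unnest(con_var_importance) AS var_importance
        FROM {group_table}, {summary_table}
        """.format(group_table=group_table, summary_table=summary_table)
```

Building the string separately keeps it unit-testable before handing it to plpy.execute.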
Thank you for the comments @iyerr3, will make the necessary changes.
Should impurity_var_importance always add up to 100?
results in
which does not add up to 100
The forest was built with 10 trees, so not sure if it is a coincidence that both add up to 90 (maybe 1 tree does not contribute in some way?)
Another run I got
so this does seem to be about trees contributing or not. Please check on this and how the normalization is done.
@fmcquillan99 the importance scores for all variables within a tree would sum up to 100. But in the case of a forest, we average over the scores for each variable across all trees in the forest, and the sum of those averages might not be 100. One reason could be that some variables might get an importance score of
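One mechanism consistent with the 90-out-of-100 observation above can be reproduced with toy numbers: if one of ten trees contributes nothing, the per-variable averages sum to 90. All values below are illustrative, not from the actual run.

```python
# Ten trees; each contributing tree's scores sum to 100, but suppose one
# tree contributes nothing (all zeros). Illustrative numbers only.
trees = [{'a': 50.0, 'b': 30.0, 'c': 20.0} for _ in range(9)]
trees.append({'a': 0.0, 'b': 0.0, 'c': 0.0})  # the non-contributing tree

# Forest score = per-variable average across all trees.
forest = {v: sum(t[v] for t in trees) / len(trees) for v in ('a', 'b', 'c')}
total = sum(forest.values())
print(total)  # 90.0, not 100
```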
Considering the above situation, I suggest the variable importance values not be scaled to sum to 100. We can make the normalization within
Would this apply to oob too?
@iyerr3, as you noted, that would change the importance scores that appear in the DT/RF output table, and the
@fmcquillan only impurity, I don't think we scale oob to 100.
So the way R does this is it keeps the original values in its model, but it has a "report" function that outputs multiple things including importance, and in this report the values are normalized. The user can access the original values from the model if needed, but would mostly use the report to understand the importance. We could essentially do the same and be clear about this in the documentation. Alternatively, we drop the normalization to 100 altogether; it was done solely to compare with R.
I like this last suggestion from @iyerr3, that we report raw values for oob and impurity VI in the model output table. (OK to keep the shifted oob > 0 as we do now.) For the helper/reporting function, compute and report the scaled/normalized values 0-100 for both oob and impurity VI. These should always add up to 100 unless there are corner cases; if so, please let us know.
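The reporting-function approach amounts to keeping raw scores in the model table and scaling only at report time. A minimal sketch, with a made-up function name and a guarded corner case:

```python
def normalize_importance(raw_scores):
    """Scale raw importance values so they sum to 100 for reporting.

    Raw values stay untouched in the model table; only the report is
    scaled. If everything is zero (a corner case the caller should
    surface), the raw values are returned unchanged instead of
    dividing by zero. Function name and contract are illustrative.
    """
    total = sum(raw_scores.values())
    if total == 0:
        return dict(raw_scores)
    return {k: 100.0 * v / total for k, v in raw_scores.items()}
```

For the averages from the 10-tree example (45, 27, 18), this reports 50, 30, and 20.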
The overall code and refactoring LGTM. A couple of things to address:
- Two unit test cases are failing; please fix those. The command to run the unit tests is python src/ports/postgres/9.4/modules/recursive_partitioning/test/unit_tests/test_random_forest.py, where postgres/9.4 must be replaced with whatever is appropriate in your environment.
- The documentation in DT and RF must change to reflect the unnormalized values for impurity var importance when the model is trained.
{impurity_var_importance_str}
FROM {importance_model_table}, {summary_table}
""".format(**locals()))
# ------------------------------------------------------------------------------
+1
Force-pushed 2f6acc4 to 7f6e291.
JIRA: MADLIB-925 This commit adds a new MADlib function (get_var_importance) to report the importance scores in decision tree and random forest, by unnesting the importance values along with corresponding features. Closes apache#295 Co-authored-by: Rahul Iyer <riyer@apache.org> Co-authored-by: Jingyi Mei <jmei@pivotal.io> Co-authored-by: Orhan Kislal <okislal@pivotal.io>
LGTM, but for one requested change.
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.get_var_importance(
    message TEXT
) RETURNS TEXT AS $$
PythonFunction(recursive_partitioning, random_forest, tree_importance_help_message)
tree_importance_help_message is defined in decision_tree.py_in, so select madlib.get_var_importance() fails at the moment.
Good catch
Force-pushed 7f6e291 to 3ab7554.
LGTM, here is an RF example that seems correct and adds up to 100 per group:
Force-pushed fec0e6d to 186390f.
JIRA: MADLIB-925
This commit adds a new MADlib function (get_var_importance) to report the
importance scores in decision tree and random forest. RF models prior to
MADlib 1.15 used to have variable importance scores reported, but they
also have impurity variable importance from 1.15 onwards. This function
reports both those scores for >=1.15 RF models, and only the oob variable
importance score for <1.15 RF models.
When called for a DT model, this function returns the impurity
variable importance score for >=1.15 DT models.
Co-authored-by: Jingyi Mei jmei@pivotal.io
Co-authored-by: Orhan Kislal okislal@pivotal.io