[SPARK-11861][ML] Add feature importances for decision trees #9912
Conversation
Test build #46540 has finished for PR 9912 at commit
After some further review, it seems generally accepted in the literature that this method of computing feature importance for a single decision tree has high variance when predictors are correlated. One way to compensate would be to incorporate surrogate splits into the computation, but surrogate splits are not currently tracked in spark.ml. Despite this shortcoming, since scikit-learn and R (package: rpart) both offer single-tree importances, I think this is still appropriate. We could include a warning message... thoughts?
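For reference, the importance measure being discussed can be sketched in a few lines of plain Scala. This is a simplified illustration, not Spark's actual implementation: the `Node`, `Leaf`, and `Split` types below are hypothetical stand-ins for spark.ml's tree nodes. The idea is to credit each feature with the weighted impurity decrease of every split on that feature, then normalize so the importances sum to 1 (the sklearn-style "Gini importance").

```scala
// Sketch only: minimal tree types standing in for spark.ml's internal nodes.
sealed trait Node { def count: Int; def impurity: Double }
case class Leaf(count: Int, impurity: Double) extends Node
case class Split(feature: Int, count: Int, impurity: Double,
                 left: Node, right: Node) extends Node

object TreeImportance {
  /** Sum weighted impurity decreases per feature, then normalize to sum to 1. */
  def featureImportances(root: Node, numFeatures: Int): Array[Double] = {
    val gains = Array.fill(numFeatures)(0.0)
    def visit(node: Node): Unit = node match {
      case Leaf(_, _) => ()
      case Split(f, n, imp, l, r) =>
        // Weighted impurity decrease contributed by this internal node.
        val decrease = n * imp - l.count * l.impurity - r.count * r.impurity
        gains(f) += decrease
        visit(l); visit(r)
    }
    visit(root)
    val total = gains.sum
    if (total > 0) gains.map(_ / total) else gains
  }
}
```

On a toy tree that splits first on feature 1 and then on feature 0, the normalized vector sums to 1.0 and its argmax picks out the feature whose splits removed the most impurity, which is exactly the invariant the tests below check.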
Test build #46550 has finished for PR 9912 at commit
Test build #51383 has finished for PR 9912 at commit
Thanks for the PR! I'd like to get this into 2.0. I just had a couple of small comments.
```scala
 *
 * This generalizes the idea of "Gini" importance to other losses,
 * following the explanation of Gini importance from "Random Forests" documentation
 * by Leo Breiman and Adele Cutler, and following the implementation from scikit-learn.
```
+1 for including a note about feature importance having high variance for individual trees, and recommending that users use Random Forests to calculate importance more precisely (here and in DecisionTreeRegressor)
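The variance-reduction argument behind recommending Random Forests can be made concrete: a forest importance is essentially the normalized average of the per-tree importance vectors, so tree-to-tree fluctuations cancel out. The helper below is a hedged sketch (an assumption about the aggregation, not Spark's code; `ForestImportance` and `average` are hypothetical names).

```scala
// Sketch (assumption, not spark.ml's implementation): aggregate per-tree
// importance vectors by summing and renormalizing, which dampens the high
// variance a single tree shows in the presence of correlated predictors.
object ForestImportance {
  def average(perTree: Seq[Array[Double]]): Array[Double] = {
    require(perTree.nonEmpty, "need at least one tree")
    val numFeatures = perTree.head.length
    // Elementwise sum across trees.
    val summed = Array.tabulate(numFeatures)(i => perTree.map(_(i)).sum)
    val total = summed.sum
    // Renormalize so the forest-level importances again sum to 1.
    if (total > 0) summed.map(_ / total) else summed
  }
}
```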
I added a note in the docs for DecisionTreeRegressor and DecisionTreeClassifier. I can update the format or the wording if needed.
```scala
val categoricalFeatures = Map.empty[Int, Int]
val df: DataFrame = TreeTests.setMetadata(data, categoricalFeatures, numClasses)
val importances = rf.fit(df).featureImportances
val mostImportantFeature = importances.argmax
assert(mostImportantFeature === 1)
assert(importances.toArray.sum === 1.0)
```
I updated the feature importance tests here as well, with additional checks.
Thank you!
Test build #52531 has finished for PR 9912 at commit
Test build #52573 has finished for PR 9912 at commit
@jkbradley I addressed comments and added
Test build #52673 has finished for PR 9912 at commit
LGTM
This patch adds an API entry point for single decision tree feature importances.

Author: sethah <seth.hendrickson16@gmail.com>

Closes apache#9912 from sethah/SPARK-11861.