[SPARK-5133] [ml] Added featureImportance to RandomForestClassifier and Regressor #7838
Conversation
Test build #39260 has finished for PR 7838.
```scala
val importances = new OpenHashMap[Int, Double]()
computeFeatureImportance(tree.rootNode, importances)
// Normalize importance vector for this tree, and add it to total.
val treeNorm = tree.rootNode.impurityStats.count
```
Looking more closely at the sklearn implementation, it looks like they normalize to make "importances" sum to 1 here, rather than normalizing by the number of instances.
Here is where "normalize" is set to true when called from RandomForest: [https://github.com/scikit-learn/scikit-learn/blob/a95203b249c1cf392f86d001ad999e29b2392739/sklearn/tree/_tree.pyx#L3370]
I'm not sure I'm interpreting your comment correctly, but I believe the code is written correctly.
- The normalization that makes 'importances' sum to 1 (the one in your comment) is done at line 1154 below, on the final vector of feature importances.
- The normalization in this piece of code (starting at line 1144) normalizes the weight of each node in the tree by dividing the #instances at the node by the total #instances. The weighted importance scores are then summed up to produce the final vector of feature importances (see the sketch below).
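For readers following along, here is a minimal, self-contained sketch of the per-tree accumulation and node-weight normalization being discussed. `SimpleNode`, `accumulate`, and `treeImportances` are hypothetical stand-ins for Spark's internal tree API, not the PR's actual code:

```scala
sealed trait SimpleNode { def count: Long }
case class Leaf(count: Long) extends SimpleNode
case class Internal(
    featureIndex: Int,
    gain: Double,      // impurity reduction achieved by this split
    count: Long,       // number of training instances reaching this node
    left: SimpleNode,
    right: SimpleNode) extends SimpleNode

// Accumulate gain * (#instances at the node) for each feature used in a split.
def accumulate(node: SimpleNode, acc: scala.collection.mutable.Map[Int, Double]): Unit =
  node match {
    case Internal(f, gain, count, l, r) =>
      acc(f) = acc.getOrElse(f, 0.0) + gain * count
      accumulate(l, acc)
      accumulate(r, acc)
    case _: Leaf => ()
  }

// Per-tree normalization: divide by the instance count at the root, so each
// node's contribution is weighted by the fraction of the data it sees.
def treeImportances(root: SimpleNode): Map[Int, Double] = {
  val acc = scala.collection.mutable.Map.empty[Int, Double]
  accumulate(root, acc)
  val treeNorm = root.count.toDouble
  acc.map { case (f, v) => f -> v / treeNorm }.toMap
}

// Example: a single root split on feature 0 over 10 instances.
val root = Internal(0, 0.5, 10L, Leaf(6L), Leaf(4L))
treeImportances(root)  // Map(0 -> 0.5): gain * 10 / 10
```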
In sklearn decision trees, you can either (a) normalize the feature importance vector by # instances, or (b) normalize the importance vector to sum to 1. When that is called to compute feature importance for forests, sklearn does (b) by default. Originally, I had implemented this using (a), but I switched to (b) to match sklearn.
@jkbradley Right, I agree with you.
I see. The way I look at it is that (a) is a must: the impurity at each node is weighted by the #instances there divided by the total #instances, otherwise the importance score would be too big. Then, for the final normalization, there are several options: (1) make the sum equal to 1 (as provided in sklearn and some R packages), or (2) scale the highest importance to 1 (as provided in Hastie's book). Both are sketched below. Anyway, the code looks good to me :).
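A hedged sketch of those two final normalizations, applied to a hypothetical per-feature total `raw` accumulated across all trees (illustrative only; the PR implements the sum-to-1 variant):

```scala
// (1) Scale so the importances sum to 1 (sklearn's default for forests).
def normalizeSumToOne(raw: Map[Int, Double]): Map[Int, Double] = {
  val total = raw.values.sum
  if (total != 0.0) raw.map { case (f, v) => f -> v / total } else raw
}

// (2) Scale so the largest importance equals 1 (as in Hastie et al.).
def normalizeMaxToOne(raw: Map[Int, Double]): Map[Int, Double] = {
  val max = if (raw.nonEmpty) raw.values.max else 0.0
  if (max != 0.0) raw.map { case (f, v) => f -> v / max } else raw
}
```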
Test build #39292 has finished for PR 7838.
@feynmanliang Thanks for writing those tests. I could not think of a good way to make the tests robust. The issue is that Random Forests could be run in the same way for both MLlib and sklearn, but that would require not resampling on each iteration. If we did that, then all of the trees in the forest would be the same, so it would not be much of a test of the feature importance calculation. So I wrote some tests by hand instead. Not a great solution, but hopefully good enough for now.
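In that spirit, here is a hedged sketch of what such a hand-checkable test can look like against the public Spark ML API (Spark 2.x-era names; the PR's actual test suite differs): make one feature perfectly predictive, keep another as pure noise, and assert that the predictive feature dominates the importance vector.

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object FeatureImportanceSanityCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("fi-check").getOrCreate()
    import spark.implicits._

    // Feature 0 determines the label exactly; feature 1 is uniform noise.
    val rng = new scala.util.Random(42)
    val data = Seq.tabulate(200) { i =>
      val x0 = (i % 2).toDouble
      LabeledPoint(x0, Vectors.dense(x0, rng.nextDouble()))
    }.toDF()

    val model = new RandomForestClassifier()
      .setNumTrees(10)
      .setFeatureSubsetStrategy("all") // consider all features, so the signal is always available
      .setSeed(7)
      .fit(data)

    // The predictive feature should dominate the learned importances.
    val imp = model.featureImportances
    assert(imp(0) > imp(1), s"expected feature 0 to dominate, got $imp")
    spark.stop()
  }
}
```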
Test build #39324 has finished for PR 7838.
Test build #39372 has finished for PR 7838.
OK, that should finally do it. @namma Would you mind taking a final look? Thank you!
Test build #39384 has finished for PR 7838.
Test build #39385 has finished for PR 7838.
LGTM.
I just realized the Map is not Java-friendly. I might modify this to store numFeatures in the model so that we can use Vector instead.
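For illustration, a minimal sketch of that conversion (names here are hypothetical; `Vectors.sparse` is from `ml.linalg` in newer Spark, `mllib.linalg` in the 1.5 line):

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}

// With numFeatures stored in the model, a Map[Int, Double] of importances
// can be exposed as a (Java-friendly) sparse Vector instead.
def importancesAsVector(importances: Map[Int, Double], numFeatures: Int): Vector =
  Vectors.sparse(numFeatures, importances.toSeq)

// Example: importancesAsVector(Map(0 -> 0.9, 3 -> 0.1), numFeatures = 4)
```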
@yanboliang Would you mind checking those updates? I think that should be it (and it worked locally). That will be Java-friendly.
Test build #39438 has finished for PR 7838.
```diff
@@ -32,7 +29,7 @@ import org.apache.spark.mllib.tree.model.{RandomForestModel => OldRandomForestMo
 import org.apache.spark.rdd.RDD
 import org.apache.spark.sql.DataFrame
 import org.apache.spark.sql.functions._
 import org.apache.spark.sql.types.DoubleType

```
Redundant blank line
Looks good except some trivial issues.
@yanboliang Thank you for taking a look! I just fixed some merge issues.
Test build #39563 has finished for PR 7838.
Test build #1317 has finished for PR 7838.
Merging with master and branch-1.5
[SPARK-5133] [ml] Added featureImportance to RandomForestClassifier and Regressor

Added featureImportance to RandomForestClassifier and Regressor. This follows the scikit-learn implementation here: [https://github.com/scikit-learn/scikit-learn/blob/a95203b249c1cf392f86d001ad999e29b2392739/sklearn/tree/_tree.pyx#L3341]

CC: yanboliang Would you mind taking a look? Thanks!

Author: Joseph K. Bradley <joseph@databricks.com>
Author: Feynman Liang <fliang@databricks.com>

Closes #7838 from jkbradley/dt-feature-importance and squashes the following commits:

72a167a [Joseph K. Bradley] fixed unit test
86cea5f [Joseph K. Bradley] Modified RF featuresImportances to return Vector instead of Map
5aa74f0 [Joseph K. Bradley] finally fixed unit test for real
33df5db [Joseph K. Bradley] fix unit test
42a2d3b [Joseph K. Bradley] fix unit test
fe94e72 [Joseph K. Bradley] modified feature importance unit tests
cc693ee [Feynman Liang] Add classifier tests
79a6f87 [Feynman Liang] Compare dense vectors in test
21d01fc [Feynman Liang] Added failing SKLearn test
ac0b254 [Joseph K. Bradley] Added featureImportance to RandomForestClassifier/Regressor. Need to add unit tests

(cherry picked from commit ff9169a)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
@namma Thanks for taking a look! If there are other options, types of feature importance, or other metrics which are important for your use, please make JIRAs and ping me on them.
Hi, thanks!
Added featureImportance to RandomForestClassifier and Regressor.
This follows the scikit-learn implementation here: [https://github.com/scikit-learn/scikit-learn/blob/a95203b249c1cf392f86d001ad999e29b2392739/sklearn/tree/_tree.pyx#L3341]
CC: @yanboliang Would you mind taking a look? Thanks!