
[SPARK-5133] [ml] Added featureImportance to RandomForestClassifier and Regressor #7838

Closed
wants to merge 10 commits

Conversation

jkbradley
Member

Added featureImportance to RandomForestClassifier and Regressor.

This follows the scikit-learn implementation here: [https://github.com/scikit-learn/scikit-learn/blob/a95203b249c1cf392f86d001ad999e29b2392739/sklearn/tree/_tree.pyx#L3341]

CC: @yanboliang Would you mind taking a look? Thanks!

@SparkQA

SparkQA commented Jul 31, 2015

Test build #39260 has finished for PR 7838 at commit dc33a1d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val importances = new OpenHashMap[Int, Double]()
computeFeatureImportance(tree.rootNode, importances)
// Normalize importance vector for this tree, and add it to total.
val treeNorm = tree.rootNode.impurityStats.count
Member Author

Looking more closely at the sklearn implementation, it looks like they normalize to make "importances" sum to 1 here, rather than normalizing by the number of instances.

Here's where "normalize" is set to true when called from RandomForest: [https://github.com/scikit-learn/scikit-learn/blob/a95203b249c1cf392f86d001ad999e29b2392739/sklearn/tree/_tree.pyx#L3370]


I am not really sure I'm interpreting your comment correctly, but I believe the code is written correctly.

  • The normalization that makes 'importances' sum to 1 (the one in your comment) is done at line 1154 below, on the final vector of feature importances.
  • The normalization in this piece of code (starting at line 1144) normalizes the weight of each node in the tree by dividing the #instances at the node by the total #instances. The importance scores are then summed up to give the final vector of feature importances.
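The two-stage scheme described above can be sketched in Python. This is a toy model, not the Spark code: `split_nodes` is an invented flat representation of a tree's split nodes, with per-node counts and impurities, and the function names are hypothetical.

```python
import numpy as np

def tree_importances(split_nodes, total_count, num_features):
    """Accumulate one tree's impurity-based importances: each split
    contributes its impurity decrease, weighted by
    (#instances at the node) / (total #instances)."""
    imp = np.zeros(num_features)
    for n in split_nodes:
        gain = (n["count"] * n["impurity"]
                - n["left_count"] * n["left_impurity"]
                - n["right_count"] * n["right_impurity"])
        imp[n["feature"]] += gain / total_count
    return imp

def forest_importances(trees, total_count, num_features):
    """Sum the per-tree vectors, then apply the final
    sum-to-1 normalization discussed in this thread."""
    total = sum(tree_importances(t, total_count, num_features) for t in trees)
    return total / total.sum()
```

With a single tree containing one pure split on feature 0, all importance mass lands on that feature and the final vector sums to 1.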

Member Author


In sklearn decision trees, you can either (a) normalize the feature importance vector by # instances, or (b) normalize the importance vector to sum to 1. When that is called to compute feature importance for forests, sklearn does (b) by default. Originally, I had implemented this using (a), but I switched to (b) to match sklearn.
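A toy illustration of options (a) and (b) above (all numbers are made up; `N` stands for the training-set size):

```python
raw = [4.0, 2.0, 2.0]   # invented unnormalized per-feature importances
N = 100                 # hypothetical number of training instances

# (a) normalize by the number of instances
by_instances = [v / N for v in raw]

# (b) normalize the vector to sum to 1 (sklearn's default for forests)
by_sum = [v / sum(raw) for v in raw]

print(by_instances)  # [0.04, 0.02, 0.02]
print(by_sum)        # [0.5, 0.25, 0.25]
```

Both preserve the relative ordering of features; (b) makes the scores comparable across forests trained on datasets of different sizes.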

Contributor


@jkbradley Correct, I agree with you.


I see. The way I look at it is that (a) is a must: the impurity at each node is weighted by #instances divided by the total #instances, otherwise the importance scores would be too large. Then, for normalization, there are several ways: (1) make the sum equal to 1 (as provided in sklearn or in some R packages), or (2) scale the highest importance to 1 (as provided in Hastie's book). Anyway, the code looks good to me :).
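The two final-scaling conventions just mentioned differ only by a constant factor; a minimal sketch (the raw scores are invented):

```python
raw = [0.8, 0.4, 0.2, 0.2]   # invented raw importance scores

sum_to_one = [v / sum(raw) for v in raw]  # (1) sklearn / R style: sums to 1
max_to_one = [v / max(raw) for v in raw]  # (2) Hastie style: top feature = 1

print(sum_to_one)  # [0.5, 0.25, 0.125, 0.125]
print(max_to_one)  # [1.0, 0.5, 0.25, 0.25]
```

Either way the ranking of features is unchanged; only the interpretation of the absolute values differs.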

@SparkQA

SparkQA commented Jul 31, 2015

Test build #39292 has finished for PR 7838 at commit 1ee49f0.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member Author

@feynmanliang Thanks for writing those tests. I could not think of a good way to make the tests robust.

The issue is that Random Forests could be run in the same way for both MLlib and sklearn, but that would require not resampling on each iteration. If we did that, then all of the trees in the forest would be the same, so it would not be much of a test of the feature importance calculation.

So I wrote some tests by hand instead. Not a great solution, but hopefully good enough for now.

@SparkQA

SparkQA commented Aug 1, 2015

Test build #39324 has finished for PR 7838 at commit 97d3de3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 1, 2015

Test build #39372 has finished for PR 7838 at commit 41c28d0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley jkbradley changed the title [SPARK-5133] [ml] [WIP] Added featureImportance to RandomForestClassifier and Regressor [SPARK-5133] [ml] Added featureImportance to RandomForestClassifier and Regressor Aug 1, 2015
@jkbradley
Member Author

OK that should finally do it. @namma Would you mind taking a final look? Thank you!
@yanboliang If you have a chance, that would be valuable as well. Thanks!

@SparkQA

SparkQA commented Aug 1, 2015

Test build #39384 has finished for PR 7838 at commit cd86e18.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 1, 2015

Test build #39385 has finished for PR 7838 at commit 610dd57.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Contributor

LGTM.

@jkbradley
Member Author

I just realized the Map is not Java-friendly. I might modify this to store numFeatures in the model so that we can use Vector instead.
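The Java-friendliness fix described above replaces the sparse `Map[Int, Double]` of importances with a fixed-length dense vector, which is why `numFeatures` needs to be stored in the model. A rough Python analogue of that conversion (the helper name is invented; the actual Spark code returns an MLlib `Vector`):

```python
def importances_as_dense(importance_map, num_features):
    """Expand a sparse {featureIndex: importance} map into a
    fixed-length dense vector; absent features get importance 0.0."""
    vec = [0.0] * num_features
    for idx, value in importance_map.items():
        vec[idx] = value
    return vec

importances_as_dense({0: 0.7, 3: 0.3}, 5)  # -> [0.7, 0.0, 0.0, 0.3, 0.0]
```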

@jkbradley
Member Author

@yanboliang Would you mind checking those updates? I think that should be it (and it worked locally). That will be Java-friendly.

@SparkQA

SparkQA commented Aug 2, 2015

Test build #39438 has finished for PR 7838 at commit 1442c2b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -32,7 +29,7 @@ import org.apache.spark.mllib.tree.model.{RandomForestModel => OldRandomForestMo
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.DoubleType


Contributor


Redundant blank line

@yanboliang
Contributor

Looks good except some trivial issues.

@jkbradley
Member Author

@yanboliang Thank you for taking a look! I just fixed some merge issues.
After tests pass, I'll merge this with master.

@SparkQA

SparkQA commented Aug 3, 2015

Test build #39563 has finished for PR 7838 at commit 86cea5f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 3, 2015

Test build #1317 has finished for PR 7838 at commit 72a167a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member Author

Merging with master and branch-1.5

asfgit pushed a commit that referenced this pull request Aug 3, 2015
…nd Regressor

Added featureImportance to RandomForestClassifier and Regressor.

This follows the scikit-learn implementation here: [https://github.com/scikit-learn/scikit-learn/blob/a95203b249c1cf392f86d001ad999e29b2392739/sklearn/tree/_tree.pyx#L3341]

CC: yanboliang  Would you mind taking a look?  Thanks!

Author: Joseph K. Bradley <joseph@databricks.com>
Author: Feynman Liang <fliang@databricks.com>

Closes #7838 from jkbradley/dt-feature-importance and squashes the following commits:

72a167a [Joseph K. Bradley] fixed unit test
86cea5f [Joseph K. Bradley] Modified RF featuresImportances to return Vector instead of Map
5aa74f0 [Joseph K. Bradley] finally fixed unit test for real
33df5db [Joseph K. Bradley] fix unit test
42a2d3b [Joseph K. Bradley] fix unit test
fe94e72 [Joseph K. Bradley] modified feature importance unit tests
cc693ee [Feynman Liang] Add classifier tests
79a6f87 [Feynman Liang] Compare dense vectors in test
21d01fc [Feynman Liang] Added failing SKLearn test
ac0b254 [Joseph K. Bradley] Added featureImportance to RandomForestClassifier/Regressor.  Need to add unit tests

(cherry picked from commit ff9169a)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
@asfgit asfgit closed this in ff9169a Aug 3, 2015
@jkbradley jkbradley deleted the dt-feature-importance branch August 3, 2015 19:25
@jkbradley
Member Author

@namma Thanks for taking a look! If there are other options, types of feature importance, or other metrics which are important for your use, please make JIRAs and ping me on them.

@Malouke

Malouke commented May 10, 2017

Hi,
Can someone tell me whether this feature is available in PySpark version 1.6.0?

Thanks!
