
[SPARK-5133] [ml] Added featureImportance to RandomForestClassifier and Regressor #7838

Closed
wants to merge 10 commits

Conversation

jkbradley
Member

Added featureImportance to RandomForestClassifier and Regressor.

This follows the scikit-learn implementation here: [https://github.com/scikit-learn/scikit-learn/blob/a95203b249c1cf392f86d001ad999e29b2392739/sklearn/tree/_tree.pyx#L3341]

CC: @yanboliang Would you mind taking a look? Thanks!

@SparkQA

SparkQA commented Jul 31, 2015

Test build #39260 has finished for PR 7838 at commit dc33a1d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val importances = new OpenHashMap[Int, Double]()
computeFeatureImportance(tree.rootNode, importances)
// Normalize importance vector for this tree, and add it to total.
val treeNorm = tree.rootNode.impurityStats.count
Member Author

Looking more closely at the sklearn implementation, it looks like they normalize to make "importances" sum to 1 here, rather than normalizing by the number of instances.

Here's where "normalize" is set to true when called from RandomForest: [https://github.com/scikit-learn/scikit-learn/blob/a95203b249c1cf392f86d001ad999e29b2392739/sklearn/tree/_tree.pyx#L3370]


I am not really sure I'm interpreting your comment correctly, but I believe the code is written correctly.

  • The normalization that makes 'importances' sum to 1 (the one in your comment) is done at line 1154 below, on the final vector of feature importances.
  • The normalization in this piece of code (starting at line 1144) normalizes the weight of each node in the tree by dividing the #instances at the node by the total #instances. The importance scores are then summed up to give the final vector of feature importances.
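The two-stage scheme described above can be sketched in Python. This is a toy model, not the Spark code: `split_nodes` is an invented flat representation of a tree's split nodes, with per-node counts and impurities, and the function names are hypothetical.

```python
import numpy as np

def tree_importances(split_nodes, total_count, num_features):
    """Accumulate one tree's impurity-based importances: each split
    contributes its impurity decrease, weighted by
    (#instances at the node) / (total #instances)."""
    imp = np.zeros(num_features)
    for n in split_nodes:
        gain = (n["count"] * n["impurity"]
                - n["left_count"] * n["left_impurity"]
                - n["right_count"] * n["right_impurity"])
        imp[n["feature"]] += gain / total_count
    return imp

def forest_importances(trees, total_count, num_features):
    """Sum the per-tree vectors, then apply the final
    sum-to-1 normalization discussed in this thread."""
    total = sum(tree_importances(t, total_count, num_features) for t in trees)
    return total / total.sum()
```

With a single tree containing one pure split on feature 0, all importance mass lands on that feature and the final vector sums to 1.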

Member Author


In sklearn decision trees, you can either (a) normalize the feature importance vector by # instances, or (b) normalize the importance vector to sum to 1. When that is called to compute feature importance for forests, sklearn does (b) by default. Originally, I had implemented this using (a), but I switched to (b) to match sklearn.
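A toy illustration of options (a) and (b) above (all numbers are made up; `N` stands for the training-set size):

```python
raw = [4.0, 2.0, 2.0]   # invented unnormalized per-feature importances
N = 100                 # hypothetical number of training instances

# (a) normalize by the number of instances
by_instances = [v / N for v in raw]

# (b) normalize the vector to sum to 1 (sklearn's default for forests)
by_sum = [v / sum(raw) for v in raw]

print(by_instances)  # [0.04, 0.02, 0.02]
print(by_sum)        # [0.5, 0.25, 0.25]
```

Both preserve the relative ordering of features; (b) makes the scores comparable across forests trained on datasets of different sizes.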

Contributor


@jkbradley Correct, I agree with you.


I see. The way I look at it is that (a) is a must: the impurity at each node is weighted by #instances divided by the total #instances, otherwise the importance scores would be too large. Then, for normalization, there are several ways: (1) make the sum equal to 1 (as provided in sklearn or in some R packages), or (2) scale the highest importance to 1 (as provided in Hastie's book). Anyway, the code looks good to me :).
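The two final-scaling conventions just mentioned differ only by a constant factor; a minimal sketch (the raw scores are invented):

```python
raw = [0.8, 0.4, 0.2, 0.2]   # invented raw importance scores

sum_to_one = [v / sum(raw) for v in raw]  # (1) sklearn / R style: sums to 1
max_to_one = [v / max(raw) for v in raw]  # (2) Hastie style: top feature = 1

print(sum_to_one)  # [0.5, 0.25, 0.125, 0.125]
print(max_to_one)  # [1.0, 0.5, 0.25, 0.25]
```

Either way the ranking of features is unchanged; only the interpretation of the absolute values differs.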

@SparkQA

SparkQA commented Jul 31, 2015

Test build #39292 has finished for PR 7838 at commit 1ee49f0.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member Author

@feynmanliang Thanks for writing those tests. I could not think of a good way to make the tests robust.

The issue is that Random Forests could be run in the same way for both MLlib and sklearn, but that would require not resampling on each iteration. If we did that, then all of the trees in the forest would be the same, so it would not be much of a test of the feature importance calculation.

So I wrote some tests by hand instead. Not a great solution, but hopefully good enough for now.

@SparkQA

SparkQA commented Aug 1, 2015

Test build #39324 has finished for PR 7838 at commit 97d3de3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 1, 2015

Test build #39372 has finished for PR 7838 at commit 41c28d0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley jkbradley changed the title [SPARK-5133] [ml] [WIP] Added featureImportance to RandomForestClassifier and Regressor [SPARK-5133] [ml] Added featureImportance to RandomForestClassifier and Regressor Aug 1, 2015
@jkbradley
Member Author

OK that should finally do it. @namma Would you mind taking a final look? Thank you!
@yanboliang If you have a chance, that would be valuable as well. Thanks!

@SparkQA

SparkQA commented Aug 1, 2015

Test build #39384 has finished for PR 7838 at commit cd86e18.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 1, 2015

Test build #39385 has finished for PR 7838 at commit 610dd57.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Contributor

LGTM.

@jkbradley
Member Author

I just realized the Map is not Java-friendly. I might modify this to store numFeatures in the model so that we can use Vector instead.
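The Java-friendliness fix described above replaces the sparse `Map[Int, Double]` of importances with a fixed-length dense vector, which is why `numFeatures` needs to be stored in the model. A rough Python analogue of that conversion (the helper name is invented; the actual Spark code returns an MLlib `Vector`):

```python
def importances_as_dense(importance_map, num_features):
    """Expand a sparse {featureIndex: importance} map into a
    fixed-length dense vector; absent features get importance 0.0."""
    vec = [0.0] * num_features
    for idx, value in importance_map.items():
        vec[idx] = value
    return vec

importances_as_dense({0: 0.7, 3: 0.3}, 5)  # -> [0.7, 0.0, 0.0, 0.3, 0.0]
```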

@jkbradley
Member Author

@yanboliang Would you mind checking those updates? I think that should be it (and it worked locally). That will be Java-friendly.

@SparkQA

SparkQA commented Aug 2, 2015

Test build #39438 has finished for PR 7838 at commit 1442c2b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -32,7 +29,7 @@ import org.apache.spark.mllib.tree.model.{RandomForestModel => OldRandomForestMo
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.DoubleType


Contributor


Redundant blank line

@yanboliang
Contributor

Looks good except some trivial issues.

@jkbradley
Member Author

@yanboliang Thank you for taking a look! I just fixed some merge issues.
After tests pass, I'll merge this with master.

@SparkQA

SparkQA commented Aug 3, 2015

Test build #39563 has finished for PR 7838 at commit 86cea5f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 3, 2015

Test build #1317 has finished for PR 7838 at commit 72a167a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member Author

Merging with master and branch-1.5

asfgit pushed a commit that referenced this pull request Aug 3, 2015
…nd Regressor

Added featureImportance to RandomForestClassifier and Regressor.

This follows the scikit-learn implementation here: [https://github.com/scikit-learn/scikit-learn/blob/a95203b249c1cf392f86d001ad999e29b2392739/sklearn/tree/_tree.pyx#L3341]

CC: yanboliang  Would you mind taking a look?  Thanks!

Author: Joseph K. Bradley <joseph@databricks.com>
Author: Feynman Liang <fliang@databricks.com>

Closes #7838 from jkbradley/dt-feature-importance and squashes the following commits:

72a167a [Joseph K. Bradley] fixed unit test
86cea5f [Joseph K. Bradley] Modified RF featuresImportances to return Vector instead of Map
5aa74f0 [Joseph K. Bradley] finally fixed unit test for real
33df5db [Joseph K. Bradley] fix unit test
42a2d3b [Joseph K. Bradley] fix unit test
fe94e72 [Joseph K. Bradley] modified feature importance unit tests
cc693ee [Feynman Liang] Add classifier tests
79a6f87 [Feynman Liang] Compare dense vectors in test
21d01fc [Feynman Liang] Added failing SKLearn test
ac0b254 [Joseph K. Bradley] Added featureImportance to RandomForestClassifier/Regressor.  Need to add unit tests

(cherry picked from commit ff9169a)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
@asfgit asfgit closed this in ff9169a Aug 3, 2015
@jkbradley jkbradley deleted the dt-feature-importance branch August 3, 2015 19:25
@jkbradley
Member Author

@namma Thanks for taking a look! If there are other options, types of feature importance, or other metrics which are important for your use, please make JIRAs and ping me on them.

@Malouke

Malouke commented May 10, 2017

Hi,
Can someone tell me whether this feature is available in PySpark version 1.6.0?

Thanks!
