[SPARK-11861][ML] Add feature importances for decision trees #9912

sethah · 2015-11-23T18:00:00Z

This patch adds an API entry point for single decision tree feature importances.

SparkQA · 2015-11-23T18:56:39Z

Test build #46540 has finished for PR 9912 at commit 9378203.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sethah · 2015-11-23T20:41:13Z

After some further review, it seems generally accepted in the literature that this method of computing feature importance for decision trees has high variance due to correlated predictors. One way to compensate for this would be to incorporate surrogate splits in the computation, but surrogate splits are not currently tracked in spark.ml.

Despite the shortcomings, since scikit-learn and R (package: rpart) both offer it, I think this is still appropriate. We could include a warning message... thoughts?

SparkQA · 2015-11-23T21:25:38Z

Test build #46550 has finished for PR 9912 at commit 0e8b223.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-02-16T22:00:57Z

Test build #51383 has finished for PR 9912 at commit 0c80d8e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2016-03-04T01:59:48Z

Thanks for the PR! I'd like to get this into 2.0. I just had a couple of small comments.

jkbradley · 2016-03-04T01:59:52Z

mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala

+   *
+   * This generalizes the idea of "Gini" importance to other losses,
+   * following the explanation of Gini importance from "Random Forests" documentation
+   * by Leo Breiman and Adele Cutler, and following the implementation from scikit-learn.


+1 for including a note about feature importance having high variance for individual trees, and recommending that users use Random Forests to calculate importance more precisely (here and in DecisionTreeRegressor)

I added a note in the docs for DecisionTreeRegressor and DecisionTreeClassifier. I can update the format or the wording if needed.
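For readers following the scaladoc above, here is a minimal sketch of the kind of computation it describes: each internal node contributes its impurity gain, weighted by the number of instances reaching it, to the feature it splits on, and the totals are then normalized to sum to 1. The `Node`, `Leaf`, and `Internal` types and the `ImportanceSketch` object are hypothetical stand-ins for illustration, not the classes touched by this PR, and the weighting by instance count is an assumption about the approach rather than a copy of the patch.

```scala
// Hypothetical, simplified tree model for illustration only;
// these are NOT Spark's tree node classes.
sealed trait Node
case class Leaf(count: Double) extends Node
case class Internal(
    feature: Int,   // index of the feature this node splits on
    gain: Double,   // impurity decrease achieved by the split
    count: Double,  // number of training instances reaching this node
    left: Node,
    right: Node) extends Node

object ImportanceSketch {
  // Accumulate gain * count per feature over all internal nodes,
  // then normalize so the importances sum to 1 (assumed scheme).
  def featureImportances(root: Node, numFeatures: Int): Array[Double] = {
    val totals = Array.fill(numFeatures)(0.0)

    def visit(node: Node): Unit = node match {
      case Internal(feature, gain, count, left, right) =>
        totals(feature) += gain * count
        visit(left)
        visit(right)
      case _: Leaf => () // leaves do not split, so they contribute nothing
    }

    visit(root)
    val sum = totals.sum
    // A tree with no splits has no information to rank features by.
    if (sum > 0) totals.map(_ / sum) else totals
  }
}
```

As a worked case, a root split on feature 1 with gain 0.4 over 100 instances and one child split on feature 0 with gain 0.1 over 50 instances give unnormalized totals of 40 and 5, so feature 1 receives 40/45 of the importance and feature 0 receives 5/45. Because a single tree picks only one split per node, small changes in the data can move importance between correlated features, which is the variance concern raised earlier in the thread.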

sethah · 2016-03-06T20:49:46Z

mllib/src/test/scala/org/apache/spark/ml/classification/RandomForestClassifierSuite.scala

    val categoricalFeatures = Map.empty[Int, Int]
    val df: DataFrame = TreeTests.setMetadata(data, categoricalFeatures, numClasses)

    val importances = rf.fit(df).featureImportances
    val mostImportantFeature = importances.argmax
    assert(mostImportantFeature === 1)
+    assert(importances.toArray.sum === 1.0)


I updated the feature importance tests here, as well, with additional checks.

SparkQA · 2016-03-06T21:30:58Z

Test build #52531 has finished for PR 9912 at commit cc2eb44.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-07T19:03:31Z

Test build #52573 has finished for PR 9912 at commit 57cbfb5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sethah · 2016-03-08T18:18:46Z

@jkbradley I addressed comments and added @Since annotations to decision trees and random forests. Let me know if you see anything else, thanks!

SparkQA · 2016-03-08T18:57:56Z

Test build #52673 has finished for PR 9912 at commit 30637d4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2016-03-09T22:44:15Z

LGTM
Thanks a lot for the PR!
Merging with master

This patch adds an API entry point for single decision tree feature importances. Author: sethah <seth.hendrickson16@gmail.com> Closes apache#9912 from sethah/SPARK-11861.

Malouke · 2017-05-05T03:42:47Z

Hi,
Thank you for the work, but I don't know where I can find documentation about the feature importances.
I use PySpark 1.6.

sethah added 6 commits February 16, 2016 11:26

adding feature importance to decision trees

4ec5c98

adding/fixing some docs and cleaning imports

0537b16

minor fix

f6614a0

changing some scaladocs

58ae3c4

generalizing error message

663d372

merge master

0c80d8e

sethah force-pushed the SPARK-11861 branch from 0e8b223 to 0c80d8e Compare February 16, 2016 21:03

jkbradley reviewed Mar 4, 2016
addressing comments

cc2eb44

sethah reviewed Mar 6, 2016
remove extra blank line

57cbfb5

adding since tags to feature importances

30637d4

asfgit closed this in e1772d3 Mar 9, 2016
