[SPARK-10524][ML] Use the soft prediction to order categories' bins #8734

viirya · 2015-09-13T15:30:55Z

JIRA: https://issues.apache.org/jira/browse/SPARK-10524

Currently we use the hard prediction (ImpurityCalculator.predict) to order categories' bins. But we should use the soft prediction.

SparkQA · 2015-09-13T17:08:54Z

Test build #42382 has finished for PR 8734 at commit 84260ca.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2015-10-06T21:40:26Z

mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala

+      (topNode.id, new RandomForest.NodeIndexInfo(0, None))
+    )))
+    val nodeQueue = new mutable.Queue[(Int, Node)]()
+    DecisionTree.findBestSplits(baggedInput, metadata, Array(topNode),


Can you please update this test to call binsToBestSplit directly? You can change it to be private[tree] so that it's callable from this test suite.

To call binsToBestSplit directly, we would also need to expose many internals of findBestSplits (e.g., binSeqOp, getNodeToFeatures, and partitionAggregates), because binsToBestSplit takes binAggregates, featuresForNode, and so on as parameters. Is that a good idea?

jkbradley · 2015-12-30T17:05:00Z

I'll have bandwidth to get this merged now, so I'll watch for updates. Thanks!

jkbradley · 2016-01-14T01:16:41Z

Ping! Please let me know if you don't have time to work on this, and I can take it over. Thanks

viirya · 2016-01-14T01:56:06Z

@jkbradley Sorry for replying late. I will try to finish this soon. Thanks.

jkbradley · 2016-01-14T01:59:20Z

OK thanks!

SparkQA · 2016-01-20T04:59:31Z

Test build #49757 has finished for PR 8734 at commit 2c32350.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sethah · 2016-01-20T17:15:36Z

mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala

@@ -740,7 +740,7 @@ private[ml] object RandomForest extends Logging {
              val categoryStats =
                binAggregates.getImpurityCalculator(nodeFeatureOffset, featureValue)
              val centroid = if (categoryStats.count != 0) {
-                categoryStats.predict
+                categoryStats.prob(categoryStats.predict)


I don't believe this is correct. Ordering by the probability of the prediction is essentially the same as ordering by impurity. That's because when the impurity is low, the predicted value will have high probability and vice versa.

From Hastie, Tibshirani, and Friedman:
"We order the predictor classes according to the proportion falling in outcome class 1. Then we split this predictor as if it were an ordered predictor."

For binary category, I think it should be categoryStats.stats(1), as @jkbradley suggested.
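The quoted procedure can be sketched in a few lines of plain Scala (no Spark). This is only an illustration under stated assumptions: the object name, the category names, and the per-category counts are all made up, and every category is assumed to be non-empty. It sorts the categories by the proportion of their points in outcome class 1, then treats the sorted list as an ordered predictor, so only prefix splits are considered.

```scala
object OrderedCategorySplits {
  // Made-up (category, class 0 count, class 1 count) triples; purely illustrative.
  val counts: Seq[(String, Double, Double)] = Seq(
    ("red", 2.0, 8.0),
    ("green", 7.0, 3.0),
    ("blue", 5.0, 5.0)
  )

  // Proportion of a category's points that fall in outcome class 1.
  // Assumes the category is non-empty (class0 + class1 > 0).
  def class1Proportion(class0: Double, class1: Double): Double =
    class1 / (class0 + class1)

  // Categories sorted by class-1 proportion, ascending.
  val ordered: Seq[String] =
    counts.sortBy { case (_, c0, c1) => class1Proportion(c0, c1) }.map(_._1)

  // Treat the sorted categories as an ordered predictor: each candidate split
  // sends a non-empty prefix to the left and the remaining categories to the right.
  val candidateSplits: Seq[(Seq[String], Seq[String])] =
    (1 until ordered.length).map(i => ordered.splitAt(i))
}
```

With k categories this yields k - 1 candidate splits instead of every possible two-way partition of the categories.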

From what I see in the implementation, categoryStats.stats(1) is just the count of class 1, not the proportion falling in outcome class 1. Are we going to order bins by that?

Finding the proportion falling in outcome class 1 simply requires division of the counts by a constant. Since we're just using that number for an ordering, constant division won't matter. They are the same.

My initial comment has a typo. It should say for a "binary outcome", not "binary category".

Yeah, I see. I was thinking we were going to order them by the soft prediction of each bin. What we actually want is to order them by the soft prediction for a certain class.
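To make the three ordering keys from this thread concrete, here is a minimal standalone sketch (plain Scala, no Spark). `CategoryStats` is a hypothetical stand-in for per-category label counts, not the real `ImpurityCalculator`, and the counts are made up for a binary outcome.

```scala
// Hypothetical stand-in for per-category label counts; NOT Spark's ImpurityCalculator.
final case class CategoryStats(stats: Array[Double]) {
  def count: Double = stats.sum
  // Hard prediction: index of the most frequent class.
  def predict: Double = stats.indexOf(stats.max).toDouble
  // Soft prediction for one class: that class's proportion in this category.
  def prob(label: Int): Double = if (count == 0) 0.0 else stats(label) / count
}

object OrderingKeys {
  // Made-up (class 0, class 1) counts for four categories of one feature.
  val categories: Seq[(String, CategoryStats)] = Seq(
    "a" -> CategoryStats(Array(6.0, 4.0)),
    "b" -> CategoryStats(Array(9.0, 1.0)),
    "c" -> CategoryStats(Array(1.0, 9.0)),
    "d" -> CategoryStats(Array(4.0, 6.0))
  )

  // Sort category names by the given key; ties are broken by name.
  private def orderBy(key: CategoryStats => Double): Seq[String] =
    categories.sortBy { case (name, s) => (key(s), name) }.map(_._1)

  // Original behavior: hard prediction (only two distinct values here).
  val byHardPrediction: Seq[String] = orderBy(_.predict)
  // First revision: probability of the predicted class.
  val byProbOfPrediction: Seq[String] = orderBy(s => s.prob(s.predict.toInt))
  // Soft prediction for a fixed class (class 1).
  val byClass1Proportion: Seq[String] = orderBy(_.prob(1))
}
```

With these counts, ordering by the probability of the predicted class puts "b" (mostly class 0) right next to "c" (mostly class 1) because both are nearly pure, which is the impurity-like behavior objected to above. Ordering by the class-1 proportion runs from the most class-0-heavy category to the most class-1-heavy one.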

SparkQA · 2016-01-21T07:37:52Z

Test build #49865 has finished for PR 8734 at commit cd25214.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-01-21T09:48:43Z

Test build #49874 has finished for PR 8734 at commit a37d3d8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sethah · 2016-01-22T18:35:36Z

mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala

@@ -740,7 +740,11 @@ private[ml] object RandomForest extends Logging {
              val categoryStats =
                binAggregates.getImpurityCalculator(nodeFeatureOffset, featureValue)
              val centroid = if (categoryStats.count != 0) {
-                categoryStats.predict
+                if (categoryStats.count == 2) {


I think you meant categoryStats.stats.length == 2. categoryStats.count is the number of data points falling into that particular bin. Since we are trying to determine here whether this is regression or binary classification, I think checking if (binAggregates.metadata.isClassification) is clearer.

Additionally, the code under the if and else statements of centroidForCategories is identical except for a single line. It seems cleaner to restructure to something like:

    val centroidForCategories = Range(0, numCategories).map { case featureValue =>
      val categoryStats =
        binAggregates.getImpurityCalculator(nodeFeatureOffset, featureValue)
      val centroid = if (categoryStats.count != 0) {
        if (binAggregates.metadata.isMulticlass) {
          // multiclass classification
          categoryStats.calculate()
        } else if (binAggregates.metadata.isClassification) {
          // binary classification
          categoryStats.stats(1)
        } else {
          // regression
          categoryStats.predict
        }
      } else {
        Double.MaxValue
      }
      (featureValue, centroid)
    }

@sethah Thanks. You are right. I didn't read this part of the code thoroughly.
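A small standalone sketch of why testing `count == 2` does not detect a binary outcome. `BinStats` is a hypothetical simplification (an array of per-class counts) assumed only for illustration; it is not Spark's `ImpurityCalculator`. The point is that `count` is the number of data points in the bin, whereas the length of `stats` is the number of classes.

```scala
// Hypothetical simplification of a classification bin's statistics:
// stats(k) holds the number of data points with label k in this bin.
final case class BinStats(stats: Array[Double]) {
  def count: Long = stats.sum.toLong  // data points in the bin
  def numClasses: Int = stats.length  // number of outcome classes
}

object CountVsClasses {
  // A binary-outcome bin holding 7 points: count is 7, not 2.
  val binaryBin: BinStats = BinStats(Array(3.0, 4.0))
  // A three-class bin that happens to hold exactly 2 points.
  val multiclassBin: BinStats = BinStats(Array(1.0, 0.0, 1.0))

  // The check as first written: compares the number of points to 2.
  def looksBinaryByCount(b: BinStats): Boolean = b.count == 2
  // The intended check: compares the number of classes to 2.
  def isBinaryByClasses(b: BinStats): Boolean = b.numClasses == 2
}
```

Here the count-based check misses the real binary bin and fires on the three-class bin, while the class-based check gets both right.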

SparkQA · 2016-01-23T09:14:03Z

Test build #49934 has finished for PR 8734 at commit 5c44e23.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2016-02-09T23:04:54Z

@viirya Thanks for the updates.

I think the code is correct now, but I'm going to send you a PR (to update this PR) in order to improve the test. I agree with @sethah that the current test does not really test anything.

@sethah Does it look good to you, other than the test?

sethah · 2016-02-09T23:46:08Z

Yes, LGTM pending the improved test, thanks!

jkbradley · 2016-02-09T23:56:28Z

Here it is: https://github.com/viirya/spark-1/pull/1

Fixed unit test and added one to spark.ml

viirya · 2016-02-10T00:12:22Z

@jkbradley Great thanks. I've merged your PR.

SparkQA · 2016-02-10T00:55:37Z

Test build #51012 has finished for PR 8734 at commit 2bbe037.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2016-02-10T01:09:55Z

LGTM! Thanks @viirya and @sethah

I'll merge with master and see how far back I can backport it easily.

JIRA: https://issues.apache.org/jira/browse/SPARK-10524

Currently we use the hard prediction (`ImpurityCalculator.predict`) to order categories' bins. But we should use the soft prediction.

Author: Liang-Chi Hsieh <viirya@gmail.com>
Author: Liang-Chi Hsieh <viirya@appier.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #8734 from viirya/dt-soft-centroids.

(cherry picked from commit 9267bc6)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>

Use the soft prediction to order categories' bins.

84260ca

jkbradley reviewed Oct 6, 2015

viirya added 2 commits January 20, 2016 12:08

Merge remote-tracking branch 'upstream/master' into dt-soft-centroids

dfa114c

Instead of calculate(), we should call prob() to get soft prediction.

2c32350

sethah reviewed Jan 20, 2016

For comments.

cd25214

Merge remote-tracking branch 'upstream/master' into dt-soft-centroids

a37d3d8

sethah reviewed Jan 22, 2016

For comments.

5c44e23

fixed unit test

c10872b

Merge pull request #1 from jkbradley/viirya-dt-soft-centroids

2bbe037

Fixed unit test and added one to spark.ml

asfgit closed this in 9267bc6 Feb 10, 2016

viirya deleted the dt-soft-centroids branch December 27, 2023 18:33
