
[SPARK-2207][SPARK-3272][MLLib] Add minimum information gain and minimum instances per node as training parameters for decision tree. #2332

Closed
wants to merge 17 commits

Conversation

chouqin (Contributor) commented Sep 9, 2014

These two parameters act as early stopping rules for pre-pruning. When a split would cause the left or right child to have fewer than minInstancesPerNode instances, or would have less information gain than minInfoGain, the current node will not be split with that split.

When no possible split satisfies these requirements, there are no useful information gain stats, but we still need to calculate the predict value for the current node. So I separated the calculation of the predict value from the calculation of information gain, which can also save computation when the number of possible splits is large. Please see SPARK-3272 for more details.
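For context, a rough sketch of how these two parameters could be set when training a tree after this patch (the Strategy constructor arguments and parameter names below are assumptions based on the PR description, not quoted from the patch):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.DecisionTree
    import org.apache.spark.mllib.tree.configuration.{Algo, Strategy}
    import org.apache.spark.mllib.tree.impurity.Gini

    object MinGainExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("MinGainExample").setMaster("local[2]"))

        // Toy binary-classification data.
        val data = sc.parallelize(Seq(
          LabeledPoint(0.0, Vectors.dense(0.0, 1.0)),
          LabeledPoint(0.0, Vectors.dense(1.0, 0.0)),
          LabeledPoint(1.0, Vectors.dense(2.0, 1.0)),
          LabeledPoint(1.0, Vectors.dense(3.0, 0.0))))

        // Pre-pruning: a node is not split if either child would receive fewer
        // than minInstancesPerNode instances, or if the best split's gain is
        // below minInfoGain. Parameter names are assumed, not confirmed.
        val strategy = new Strategy(
          algo = Algo.Classification,
          impurity = Gini,
          maxDepth = 5,
          numClassesForClassification = 2,
          minInstancesPerNode = 2,
          minInfoGain = 0.01)

        val model = DecisionTree.train(data, strategy)
        println(model)
        sc.stop()
      }
    }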

CC: @mengxr @manishamde @jkbradley, please help me review this, thanks.

SparkQA commented Sep 9, 2014

Can one of the admins verify this patch?

@manishamde (Contributor)

@chouqin Thanks for the PR. I won't be able to comment since I am on a break now. Reviews from @jkbradley and @mengxr should be sufficient. :-)

mengxr (Contributor) commented Sep 9, 2014

Jenkins, add to whitelist.

mengxr (Contributor) commented Sep 9, 2014

this is ok to test

val gainStats =
calculateGainForSplit(leftChildStats, rightChildStats, nodeImpurity, level, metadata)
(splitIdx, gainStats)
}.maxBy(_._2.gain)
(splits(featureIndex)(bestFeatureSplitIndex), bestFeatureGainStats)
if (bestFeatureGainStats == InformationGainStats.invalidInformationGainStats) {
Review comment (Member):

I think you could avoid explicitly checking for invalidInformationGainStats since the gain is Double.MinValue. At the very end of the maxBy calls, you could then check whether the information gain is Double.MinValue, in which case we know that no split is worth doing. That should simplify the code here and in the other maxBy calls below.
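As a self-contained illustration of that suggestion (a toy sketch, not the patch itself): give rejected candidate splits a gain of Double.MinValue, take maxBy once, and check the winner's gain instead of comparing against a sentinel stats object.

    // All three hypothetical candidates violate minInstancesPerNode, so they
    // carry gain = Double.MinValue and the single check after maxBy detects
    // that no split is worth doing.
    case class GainStats(gain: Double)

    object BestSplitSketch {
      def main(args: Array[String]): Unit = {
        val candidates = Seq(
          0 -> GainStats(Double.MinValue),
          1 -> GainStats(Double.MinValue),
          2 -> GainStats(Double.MinValue))

        val (bestIdx, bestStats) = candidates.maxBy(_._2.gain)
        if (bestStats.gain == Double.MinValue) {
          println("no split is worth doing; keep the node as a leaf")
        } else {
          println(s"split on candidate $bestIdx with gain ${bestStats.gain}")
        }
      }
    }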

@jkbradley (Member)

@chouqin Thanks for this update---it will be great to have these 2 options supported. My comments are mostly about simplifying the code: removing Predict and Split.noSplit, and the related simplifications. We will not save much computation by avoiding calculating predictions. However, we will definitely save a lot of computation by supporting these 2 options to allow early stopping on some datasets!

SparkQA commented Sep 9, 2014

QA tests have started for PR 2332 at commit efcc736.

  • This patch merges cleanly.

SparkQA commented Sep 9, 2014

QA tests have finished for PR 2332 at commit efcc736.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Predict(

chouqin (Contributor, Author) commented Sep 9, 2014

@jkbradley Thanks for your comments; I will change my code accordingly. As for the Predict class, I still think it is needed, for the following reasons:

  1. Saving of computation: for each split, it would traverse an array of bins twice (once to aggregate, once to find the index with the maximum value). I don't think this saving is trivial.
  2. Code simplicity: the predict value for a node should not be tied to the information gain of a split. I have read Weka's and scikit-learn's decision tree code, and they don't store a predict value along with a split's information gain stats. I think the changed code may be easier to understand.
  3. Early return in calculateGainForSplit: when the left or right child has fewer than minInstancesPerNode instances, we can just return invalid information gain stats without calculating the predict value. If all splits are rejected this way, we still need a way to calculate the predict value for the current node.

If you don't like creating a new Predict class, I can use a tuple instead, but that seems harder to understand.
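For reference, a minimal sketch of the kind of separation being discussed, with the prediction kept apart from the split's gain statistics (the field names, package, and visibility shown here are illustrative assumptions, not the exact code in this patch):

    package sketch.tree.model

    // A node's prediction, computed once per node even when no valid split exists.
    private[tree] class Predict(
        val predict: Double,     // predicted label (classification) or value (regression)
        val prob: Double = 0.0)  // probability of the predicted label, when meaningful
      extends Serializable {
      override def toString: String = s"$predict (prob = $prob)"
    }

    // Gain statistics for a candidate split; the prediction no longer lives here.
    private[tree] class InformationGainStats(
        val gain: Double,
        val impurity: Double,
        val leftImpurity: Double,
        val rightImpurity: Double) extends Serializable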

@jkbradley (Member)

@chouqin Thanks for your responses. I think you've convinced me that Predict is reasonable, since it is a different concept from info gain. Could you please make it private[tree] though?

Clarification for 1.: By "array of bins," do you mean the array of classes to calculate the prediction (for classification)? Unless there are a very large number of classes, I do not think the savings will be that much.

Thanks!

@jkbradley (Member)

@chouqin Could you also please add the [mllib] tag to the PR title?

SparkQA commented Sep 10, 2014

QA tests have started for PR 2332 at commit d593ec7.

  • This patch merges cleanly.

SparkQA commented Sep 10, 2014

QA tests have finished for PR 2332 at commit d593ec7.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Predict(

chouqin (Contributor, Author) commented Sep 10, 2014

@jkbradley I have removed the noSplit object and added private[tree] to Predict.

SparkQA commented Sep 10, 2014

QA tests have started for PR 2332 at commit 0278a11.

  • This patch merges cleanly.

SparkQA commented Sep 10, 2014

QA tests have finished for PR 2332 at commit 0278a11.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Last(child: Expression) extends PartialAggregate with trees.UnaryNode[Expression]
    • case class LastFunction(expr: Expression, base: AggregateExpression) extends AggregateFunction
    • case class Abs(child: Expression) extends UnaryExpression

@jkbradley (Member)

@chouqin Thanks for the updates! This looks basically ready, except for the edge cases in the test suite. I tested it and it ran fine. I think those complaints about public classes are unrelated. Once the test suite is updated, I'd say it is ready.

chouqin (Contributor, Author) commented Sep 10, 2014

@jkbradley Thanks for your replies. As I replied in your comments, I have changed minInstancesPerNode to 2 in the test cases and added one more test case, which checks that when a split doesn't satisfy the min-instances-per-node requirement it will not be chosen, even though its info gain is large (in this test case, the total number of instances is 4, and we can find a split that leaves both the left and right child with 2 instances). Do you think this is OK?
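A rough sketch of the kind of test being described (the data, Strategy parameter names, and assertions here are illustrative assumptions, not the actual suite):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.DecisionTree
    import org.apache.spark.mllib.tree.configuration.{Algo, Strategy}
    import org.apache.spark.mllib.tree.impurity.Gini

    object MinInstancesPerNodeSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("MinInstancesPerNodeSketch").setMaster("local[2]"))

        // Four instances: with minInstancesPerNode = 2, any split that leaves a
        // child with a single instance must be rejected; the 2-vs-2 split that
        // separates the labels perfectly remains admissible.
        val data = sc.parallelize(Seq(
          LabeledPoint(0.0, Vectors.dense(0.0)),
          LabeledPoint(0.0, Vectors.dense(1.0)),
          LabeledPoint(1.0, Vectors.dense(2.0)),
          LabeledPoint(1.0, Vectors.dense(3.0))))

        val strategy = new Strategy(
          algo = Algo.Classification,
          impurity = Gini,
          maxDepth = 2,
          numClassesForClassification = 2,
          minInstancesPerNode = 2)

        val model = DecisionTree.train(data, strategy)
        // Expect a root split with two instances on each side, i.e. both children exist.
        assert(model.topNode.leftNode.isDefined && model.topNode.rightNode.isDefined)
        sc.stop()
      }
    }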

SparkQA commented Sep 10, 2014

QA tests have started for PR 2332 at commit f1d11d1.

  • This patch merges cleanly.

SparkQA commented Sep 10, 2014

QA tests have finished for PR 2332 at commit f1d11d1.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

mengxr (Contributor) commented Sep 10, 2014

test this please

@jkbradley (Member)

@chouqin Thanks for the updates! LGTM

@@ -898,6 +928,10 @@ object DecisionTree extends Serializable with Logging {
(bestFeatureSplit, bestFeatureGainStats)
}
}.maxBy(_._2.gain)

require(predict.isDefined, "must calculate predict for each node")
Review comment (Contributor):

nit: Use assert instead of require. The latter throws IllegalArgumentException, which doesn't apply here. (not necessary to update)
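A tiny illustration of the distinction (not part of the patch): require signals a bad argument from the caller and throws IllegalArgumentException, while assert signals a broken internal invariant and throws AssertionError.

    object RequireVsAssert {
      def main(args: Array[String]): Unit = {
        try {
          require(args.nonEmpty, "caller must pass at least one argument")
        } catch {
          case e: IllegalArgumentException => println(s"require failed: ${e.getMessage}")
        }
        try {
          assert(1 + 1 == 3, "internal invariant violated")
        } catch {
          case e: AssertionError => println(s"assert failed: ${e.getMessage}")
        }
      }
    }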

mengxr (Contributor) commented Sep 10, 2014

@chouqin I made very minor inline comments. It is not necessary to update the PR. I'm going to merge this if Jenkins is happy, and @jkbradley will make those changes in his follow-up PR.

SparkQA commented Sep 10, 2014

QA tests have started for PR 2332 at commit f1d11d1.

  • This patch merges cleanly.

SparkQA commented Sep 10, 2014

QA tests have finished for PR 2332 at commit f1d11d1.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit closed this in 79cdb9b on Sep 10, 2014
mengxr (Contributor) commented Sep 10, 2014

Merged into master. Thanks!
