[SPARK-2207][SPARK-3272][MLLib]Add minimum information gain and minimum instances per node as training parameters for decision tree. #2332
Conversation
…rune Conflicts: mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala
Can one of the admins verify this patch?
@chouqin Thanks for the PR. I won't be able to comment since I am on a break now. @jkbradley and @mengxr reviews should be sufficient. :-)
Jenkins, add to whitelist.
this is ok to test
val gainStats =
  calculateGainForSplit(leftChildStats, rightChildStats, nodeImpurity, level, metadata)
(splitIdx, gainStats)
}.maxBy(_._2.gain)
(splits(featureIndex)(bestFeatureSplitIndex), bestFeatureGainStats)
if (bestFeatureGainStats == InformationGainStats.invalidInformationGainStats) {
I think you could avoid explicitly checking for invalidInformationGainStats since the gain is Double.MinValue. At the very end of the maxBy calls, you could then check whether the information gain is Double.MinValue, in which case we know that no split is worth doing. That should simplify the code here and in the other maxBy calls below.
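A minimal sketch of the simplification the reviewer suggests, with hypothetical stand-in names (GainStats, candidateGains, BestSplitSketch are illustrative, not Spark's internal API): give every disallowed split the sentinel gain Double.MinValue, run maxBy once, and make a single check at the end instead of comparing against a dedicated invalid-stats object at each site.

```scala
// Hypothetical sketch, not Spark's actual implementation: invalid splits get a
// gain of Double.MinValue, so a single check after maxBy decides whether to split.
case class GainStats(gain: Double)

object BestSplitSketch {
  def bestSplit(candidateGains: Seq[Double], minInfoGain: Double): Option[(Int, GainStats)] = {
    val stats = candidateGains.zipWithIndex.map { case (g, i) =>
      // A split failing the minimum-gain requirement gets the sentinel gain.
      val gain = if (g >= minInfoGain) g else Double.MinValue
      (i, GainStats(gain))
    }
    val (bestIdx, bestStats) = stats.maxBy(_._2.gain)
    // One check at the very end: if even the best gain is the sentinel,
    // no split is worth doing at this node.
    if (bestStats.gain == Double.MinValue) None else Some((bestIdx, bestStats))
  }
}
```

The payoff is that the per-candidate code never needs to know about an invalid-stats object; the sentinel ordering does the work inside maxBy.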
@chouqin Thanks for this update; it will be great to have these 2 options supported. My comments are mostly about simplifying the code: removing Predict and Split.noSplit, and the related simplifications. We will not save much computation by avoiding calculating predictions. However, we will definitely save a lot of computation by supporting these 2 options to allow early stopping on some datasets!
QA tests have started for PR 2332 at commit
QA tests have finished for PR 2332 at commit
@jkbradley thanks for your comments, I will change my code accordingly. As for the Predict class, I still think it is needed, for the following reasons:
If you don't like creating a new
@chouqin Thanks for your responses. I think you've convinced me that Predict is reasonable, since it is a different concept from info gain. Could you please make it private[tree] though? Clarification for 1: By "array of bins," do you mean the array of classes used to calculate the prediction (for classification)? Unless there are a very large number of classes, I do not think the savings will be that much. Thanks!
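For context, a minimal sketch of the kind of Predict wrapper being discussed (field names here are assumed for illustration, not copied from the PR): it carries the node's predicted value and, for classification, the probability of that prediction, kept separate from the information-gain statistics. In the PR it would additionally be marked private[tree], which is omitted here only so the snippet is self-contained.

```scala
// Hypothetical sketch of a Predict class: prediction state separated from gain stats.
// In Spark's tree package this would be declared private[tree].
class Predict(val predict: Double, val prob: Double = 0.0) {
  override def toString: String = s"predict = $predict, prob = $prob"
}
```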
@chouqin Could you also please add tag [mllib] to the PR title?
QA tests have started for PR 2332 at commit
QA tests have finished for PR 2332 at commit
@jkbradley I have removed
QA tests have started for PR 2332 at commit
QA tests have finished for PR 2332 at commit
@chouqin Thanks for the updates! This looks basically ready, except for the edge cases in the test suite. I tested it and it ran fine. I think those complaints about public classes are unrelated. Once the test suite is updated, I'd say it is ready.
@jkbradley thanks for your replies. As I replied in your comments, I have changed
QA tests have started for PR 2332 at commit
QA tests have finished for PR 2332 at commit
test this please
@chouqin Thanks for the updates! LGTM
@@ -898,6 +928,10 @@ object DecisionTree extends Serializable with Logging {
    (bestFeatureSplit, bestFeatureGainStats)
  }
}.maxBy(_._2.gain)

require(predict.isDefined, "must calculate predict for each node")
nit: Use assert instead of require. The latter throws IllegalArgumentException, which doesn't apply here. (not necessary to update)
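The distinction behind this nit, as a small runnable illustration (the helper name whichError is hypothetical): in Scala, require signals a bad caller-supplied argument by throwing IllegalArgumentException, while assert signals a broken internal invariant by throwing AssertionError. "predict was calculated for each node" is an internal invariant, hence assert fits better.

```scala
// Illustrates the difference between Scala's require and assert from Predef:
// require -> IllegalArgumentException (caller passed a bad argument),
// assert  -> AssertionError (an internal invariant was violated).
object RequireVsAssert {
  def whichError(useRequire: Boolean): String =
    try {
      if (useRequire) require(false, "bad argument")
      else assert(false, "broken invariant")
      "no error"
    } catch {
      case _: IllegalArgumentException => "IllegalArgumentException"
      case _: AssertionError           => "AssertionError"
    }
}
```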
@chouqin I made very minor inline comments. It is not necessary to update the PR. I'm going to merge this if Jenkins is happy, and @jkbradley will make those changes in his following PR.
QA tests have started for PR 2332 at commit
QA tests have started for PR 2332 at commit
QA tests have finished for PR 2332 at commit
QA tests have finished for PR 2332 at commit
Merged into master. Thanks!
These two parameters can act as early-stopping rules for pre-pruning. When a split would cause the left or right child to have fewer than minInstancesPerNode instances, or would have less information gain than minInfoGain, the current node will not be split by that split. When no possible split satisfies these requirements, there are no useful information gain stats, but we still need to calculate the predict value for the current node. So I separated the calculation of predict from the calculation of information gain, which can also save computation when the number of possible splits is large. Please see SPARK-3272 for more details.
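The two pre-pruning rules described above can be sketched as a single predicate. The parameter names mirror the PR's minInstancesPerNode and minInfoGain; the function and its argument types are illustrative, not Spark's internal API.

```scala
// Hedged sketch of the PR's two early-stopping (pre-pruning) rules:
// a candidate split is rejected if either child would be too small,
// or if the split's information gain is below the configured minimum.
object PrePruningSketch {
  def isValidSplit(leftCount: Long, rightCount: Long, infoGain: Double,
                   minInstancesPerNode: Int, minInfoGain: Double): Boolean = {
    leftCount >= minInstancesPerNode &&
    rightCount >= minInstancesPerNode &&
    infoGain >= minInfoGain
  }
}
```

When no candidate split passes this predicate, the node becomes a leaf, which is why the node's prediction must be computable independently of any gain statistics.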
CC: @mengxr @manishamde @jkbradley, please help me review this, thanks.