
[SPARK-4486][MLLIB] Improve GradientBoosting APIs and doc #3374

Closed
mengxr wants to merge 6 commits into apache:master from mengxr:SPARK-4486

Conversation

@mengxr (Contributor) commented Nov 20, 2014

There are some inconsistencies in the gradient boosting APIs. The target is a general boosting meta-algorithm, but the implementation is attached to trees. This was partially due to the delay of SPARK-1856. But for the 1.2 release, we should make the APIs consistent.

  1. `WeightedEnsembleModel` -> `private[tree] TreeEnsembleModel` and renamed members accordingly.
  2. `GradientBoosting` -> `GradientBoostedTrees`
  3. Add `RandomForestModel` and `GradientBoostedTreesModel` and hide `CombiningStrategy`.
  4. Slightly refactored `TreeEnsembleModel` (Vote takes weights into consideration.)
  5. Remove `trainClassifier` and `trainRegressor` from `GradientBoostedTrees` because they are the same as `train`.
  6. Rename the class `train` method to `run` because it hides the static methods with the same name in Java. Deprecated the `DecisionTree.train` class method.
  7. Simplify `BoostingStrategy` and make sure the input strategy is not modified. Users should put `algo` and `numClasses` in `treeStrategy`. We create `ensembleStrategy` inside boosting.
  8. Fix a bug in `GradientBoostedTreesSuite` with `AbsoluteError`.
  9. Doc updates.
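
For reference, a minimal sketch of the updated API in Scala (assuming `data: RDD[LabeledPoint]`; parameter values are illustrative):

```scala
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy

// Users now put algo/numClasses in treeStrategy; defaultParams fills in the rest.
val boostingStrategy = BoostingStrategy.defaultParams("Regression")
boostingStrategy.numIterations = 10
boostingStrategy.treeStrategy.maxDepth = 5

// `train` replaces the removed trainClassifier/trainRegressor and returns
// the new GradientBoostedTreesModel.
val model = GradientBoostedTrees.train(data, boostingStrategy)

// Equivalently, the instance method formerly named `train` is now `run`:
val sameModel = new GradientBoostedTrees(boostingStrategy).run(data)
```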

@manishamde @jkbradley


val boostingStrategy = new BoostingStrategy(Regression, numIterations, SquaredError,
Contributor Author:

It was SquaredError before.

Contributor:

Thanks for fixing this. I am taking a look at it.

Contributor:

Here are my findings. I added two more test cases with numIterations = 100.

| numIterations | learningRate | subsamplingRate | metric |
| --- | --- | --- | --- |
| 10 | 1.0 | 1.0 | 0.8400000000000005 |
| 100 | 1.0 | 1.0 | 0.5344090056285183 |
| 10 | 0.1 | 1.0 | 0.08399999999999984 |
| 10 | 1.0 | 0.75 | 0.8102205882352937 |
| 100 | 1.0 | 0.75 | 0.565608647936787 |
| 10 | 0.1 | 0.75 | 0.11179411764705861 |

A learning rate of 1 doesn't work very well, especially with a low number of iterations. Our default learning rate is 0.1, which should be fine.
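
For illustration, a sketch of dialing the rate down via the `BoostingStrategy` vars from this PR (values illustrative):

```scala
import org.apache.spark.mllib.tree.configuration.BoostingStrategy

// A rate of 1.0 overshoots with few iterations (see the runs above);
// the default shrinkage of 0.1 behaves much better here.
val boostingStrategy = BoostingStrategy.defaultParams("Regression")
boostingStrategy.learningRate = 0.1
boostingStrategy.numIterations = 100
```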

Suggestion: we remove the learningRate = 1 option from the absolute error test. I can do more testing to check which settings work well for our GBT model and include that as part of the documentation. I will also compare with scikit-learn to see how much additional loss we incur relative to an ideal implementation during the documentation phase.

cc: @jkbradley

Contributor Author:

@manishamde Thanks for checking this test! Let's fix it in a separate PR. We are going to cut a release candidate, and I hope we can update the API before that. Let me know when you finish a pass, and I will update the PR following your suggestions.

@SparkQA commented Nov 20, 2014

Test build #23643 has started for PR 3374 at commit 4aae3b7.

  • This patch merges cleanly.

@manishamde (Contributor)

Will we have to rename GradientBoostedTrees back to GradientBoosting when we add generic weak-learner support? I think we should not make the algorithm's name tree-specific, to avoid renaming it in the future.

@mengxr (Contributor, Author) commented Nov 20, 2014

@manishamde The current implementation is attached to trees. Even if we renamed it back to GradientBoosting, it would have to live under mllib.tree instead of mllib.ensemble. When we have a generalized boosting implementation in the future, we won't rename GradientBoostedTrees. Instead, we can add mllib.ensemble.GradientBoosting and let tree.GradientBoostedTrees extend it.
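
A rough sketch of that hypothetical future layout (speculative names, not part of this PR):

```scala
package org.apache.spark.mllib.ensemble {
  // Generic boosting meta-algorithm over arbitrary weak learners.
  abstract class GradientBoosting
}

package org.apache.spark.mllib.tree {
  // The tree-specialized implementation keeps its name and extends the generic one.
  class GradientBoostedTrees extends org.apache.spark.mllib.ensemble.GradientBoosting
}
```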

@manishamde (Contributor)

@mengxr The plan to move to the mllib.ensemble namespace with a new class sounds good to me.

@manishamde (Contributor)

Should the trainClassifier and trainRegressor methods from the DecisionTree and RandomForest classes also be deprecated?
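
For context, a sketch of the deprecation pattern this PR applies to the class method (illustrative, not the exact diff):

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.configuration.Strategy
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.rdd.RDD

class DecisionTree(strategy: Strategy) {
  def run(input: RDD[LabeledPoint]): DecisionTreeModel = ??? // actual training logic

  // The old name stays but forwards to run and warns at compile time.
  @deprecated("Use run instead.", "1.2.0")
  def train(input: RDD[LabeledPoint]): DecisionTreeModel = run(input)
}
```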

@@ -45,146 +43,92 @@ import org.apache.spark.storage.StorageLevel
* but weak hypothesis weights are not computed correctly for LogLoss or AbsoluteError.
Contributor:

This should now read something like: "but tree predictions are not computed accurately for LogLoss or AbsoluteError loss functions, since they use the mean of the samples at each leaf node of the decision tree".

cc: @jkbradley

Member:

@manishamde The current explanation is correct for the original gradient boosting algorithm, which uses weak hypothesis weights and is oblivious to the weak learner being used. Your suggested explanation really describes TreeBoost, Friedman's improvement to the original algorithm that is specialized for trees (which we should add at some point, but isn't what we're claiming to have now, I'd say).
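
For concreteness, a sketch of the distinction in standard notation (textbook formulation, not from this PR):

```latex
% Original gradient boosting: one weight \gamma_m per iteration,
% regardless of the weak learner h_m.
F_m(x) = F_{m-1}(x) + \gamma_m h_m(x), \qquad
\gamma_m = \arg\min_{\gamma} \sum_i L\bigl(y_i,\ F_{m-1}(x_i) + \gamma\, h_m(x_i)\bigr)

% TreeBoost (Friedman): a separate constant \gamma_{jm} per leaf region R_{jm}.
F_m(x) = F_{m-1}(x) + \sum_j \gamma_{jm}\, \mathbf{1}\{x \in R_{jm}\}, \qquad
\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L\bigl(y_i,\ F_{m-1}(x_i) + \gamma\bigr)
```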

Contributor:

@jkbradley Agree. Having said that, I am not sure whether predictions change with the loss function for other weak learners such as LR. Let's refine this later.

@SparkQA commented Nov 20, 2014

Test build #23643 has finished for PR 3374 at commit 4aae3b7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23643/

@manishamde (Contributor)

Completed my pass. LGTM! 👍

@SparkQA commented Nov 20, 2014

Test build #23662 has started for PR 3374 at commit 98dea09.

  • This patch merges cleanly.

* Currently, gradients are computed correctly for the available loss functions,
* but weak hypothesis weights are not computed correctly for LogLoss or AbsoluteError.
* Running with those losses will likely behave reasonably, but lacks the same guarantees.
* but tree predictions are not computed correctly for LogLoss or AbsoluteError since they
Member:

(copying comment here since it was on an outdated diff)
The original explanation is correct for the original Gradient Boosting algorithm, which uses weak hypothesis weights and is oblivious to the weak learner being used. This updated explanation is really for TreeBoost, Friedman's improvement to the original algorithm which is specialized for trees (which we should add at some point but isn't what we're claiming to have now, I'd say). So I think the original explanation is more accurate since we do not claim to implement TreeBoost.

Contributor:

Agree.

Contributor Author:

Reverted the changes.

@jkbradley (Member)

@mengxr Thanks for the updates! Just added a few small comments. Other than those, LGTM

@SparkQA commented Nov 20, 2014

Test build #23663 has started for PR 3374 at commit 7097251.

  • This patch merges cleanly.

@SparkQA commented Nov 20, 2014

Test build #23662 has finished for PR 3374 at commit 98dea09.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23662/

@SparkQA commented Nov 20, 2014

Test build #23663 has finished for PR 3374 at commit 7097251.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23663/

@mengxr (Contributor, Author) commented Nov 20, 2014

@manishamde @jkbradley Thanks! Merged into master and branch-1.2.

asfgit pushed a commit that referenced this pull request Nov 20, 2014

Author: Xiangrui Meng <meng@databricks.com>

Closes #3374 from mengxr/SPARK-4486 and squashes the following commits:

7097251 [Xiangrui Meng] address joseph's comments
98dea09 [Xiangrui Meng] address manish's comments
4aae3b7 [Xiangrui Meng] add RandomForestModel and GradientBoostedTreesModel, hide CombiningStrategy
ea4c467 [Xiangrui Meng] fix unit tests
751da4e [Xiangrui Meng] rename class method train -> run
19030a5 [Xiangrui Meng] update boosting public APIs

(cherry picked from commit 15cacc8)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
asfgit closed this in 15cacc8 on Nov 20, 2014