[SPARK-5094][MLlib] Add Python API for Gradient Boosted Trees #3951

Closed
wants to merge 1 commit

Conversation

7 participants
Contributor

kazk1018 commented Jan 8, 2015

This PR implements a Python API for Gradient Boosted Trees.

Can one of the admins verify this patch?

Contributor

mengxr commented Jan 8, 2015

add to whitelist

Contributor

mengxr commented Jan 8, 2015

ok to test

SparkQA commented Jan 8, 2015

Test build #25255 has started for PR 3951 at commit 2b6a8b0.

  • This patch merges cleanly.
Contributor

jkbradley commented Jan 8, 2015

@kazk1018 It would be nice to support some of the key parameters from BoostingStrategy and tree.Strategy:

loss
numIterations
learningRate
maxDepth
categoricalFeaturesInfo

Would you mind adding those? (A sketch of what the resulting signature might look like follows below.)

Also, could you please add a unit test to mllib/tests.py? Thank you!
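
For reference, a rough sketch of how those parameters could surface on the Python side (the method name, keyword arguments, and defaults below mirror what this PR eventually converged on, and sc is just a freshly created SparkContext, so treat this as illustrative rather than the final merged interface):

    from pyspark import SparkContext
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.tree import GradientBoostedTrees

    sc = SparkContext(appName="gbtSketch")
    # Toy dataset; an empty categoricalFeaturesInfo means all features are continuous.
    data = sc.parallelize([LabeledPoint(0.0, [0.0]), LabeledPoint(1.0, [2.0])])
    model = GradientBoostedTrees.trainClassifier(
        data, categoricalFeaturesInfo={},
        loss="logLoss", numIterations=100, learningRate=0.1, maxDepth=3)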

SparkQA commented Jan 8, 2015

Test build #25255 has finished for PR 3951 at commit 2b6a8b0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25255/
Test FAILed.

@kazk1018 kazk1018 changed the title from [SPARK-5094][MLlib] Add Pythoin API for Gradient Boosted Trees to [SPARK-5094][MLlib] Add Python API for Gradient Boosted Trees Jan 9, 2015

SparkQA commented Jan 9, 2015

Test build #25310 has started for PR 3951 at commit d1ef58b.

  • This patch merges cleanly.

SparkQA commented Jan 9, 2015

Test build #25310 has finished for PR 3951 at commit d1ef58b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25310/
Test PASSed.

Contributor

jkbradley commented Jan 9, 2015

@kazk1018 It looks like there are merge issues. Can you please fix these? Thanks!

SparkQA commented Jan 10, 2015

Test build #25357 has started for PR 3951 at commit a34bec5.

  • This patch merges cleanly.

SparkQA commented Jan 10, 2015

Test build #25357 has finished for PR 3951 at commit a34bec5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class GradientBoostedTreesModel(JavaModelWrapper):
    • class GradientBoostedTrees(object):

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25357/
Test PASSed.

Contributor

jkbradley commented Jan 14, 2015

Taking a look now & will add comments soon!

Contributor

jkbradley commented Jan 14, 2015

@kazk1018 Thanks for the PR! A few high-level items:

  • Would it reduce duplicated code to abstract out the "TreeEnsembleModel" concept, as in Scala? Forests and boosting produce very similar models; GradientBoostedTreesModel and RandomForestModel could both wrap the abstract class (see the sketch below).
  • Default parameter values: You state default parameter values in the docs for trainClassifier/Regressor, but they are not actually set in the method declarations. Could you please fix that?
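
As a rough sketch of the shared-base-class idea from the first item (mirroring the TreeEnsembleModel / RandomForestModel / GradientBoostedTreesModel layering that later revisions of this PR adopt, with method bodies simply delegating to the JVM model via JavaModelWrapper.call):

    from pyspark.rdd import RDD
    from pyspark.mllib.common import JavaModelWrapper
    from pyspark.mllib.linalg import _convert_to_vector


    class TreeEnsembleModel(JavaModelWrapper):
        """Shared wrapper for tree ensembles (random forests and boosted trees)."""

        def predict(self, x):
            """Predict for a single data point or an RDD of points."""
            if isinstance(x, RDD):
                return self.call("predict", x.map(_convert_to_vector))
            return self.call("predict", _convert_to_vector(x))

        def numTrees(self):
            """Number of trees in the ensemble."""
            return self.call("numTrees")

        def totalNumNodes(self):
            """Total number of nodes, summed over all trees in the ensemble."""
            return self.call("totalNumNodes")


    class RandomForestModel(TreeEnsembleModel):
        """Represents a random forest model."""


    class GradientBoostedTreesModel(TreeEnsembleModel):
        """Represents a gradient-boosted tree model."""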

@jkbradley jkbradley commented on an outdated diff Jan 14, 2015

...rg/apache/spark/mllib/api/python/PythonMLLibAPI.scala
@@ -21,6 +21,8 @@ import java.io.OutputStream
import java.nio.{ByteBuffer, ByteOrder}
import java.util.{ArrayList => JArrayList, List => JList, Map => JMap}
+import org.apache.spark.mllib.tree.loss.Losses
@jkbradley

jkbradley Jan 14, 2015

Contributor

Organize imports, ordered as: scala/java, outside libraries, spark (alphabetized within groups)

SparkQA commented Jan 15, 2015

Test build #25587 has started for PR 3951 at commit bb3357d.

  • This patch does not merge cleanly.

SparkQA commented Jan 15, 2015

Test build #25589 has started for PR 3951 at commit f2b77d8.

  • This patch merges cleanly.

SparkQA commented Jan 15, 2015

Test build #25589 has finished for PR 3951 at commit f2b77d8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class TreeEnsembleModel(JavaModelWrapper):
    • class RandomForestModel(TreeEnsembleModel):
    • class GradientBoostedTreesModel(TreeEnsembleModel):
    • class GradientBoostedTrees(object):

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25589/
Test PASSed.

SparkQA commented Jan 15, 2015

Test build #25587 has finished for PR 3951 at commit bb3357d.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25587/
Test PASSed.

@jkbradley jkbradley commented on an outdated diff Jan 25, 2015

...rg/apache/spark/mllib/api/python/PythonMLLibAPI.scala
+      algoStr: String,
+      categoricalFeaturesInfo: JMap[Int, Int],
+      lossStr: String,
+      numIterations: Int,
+      learningRate: Double,
+      maxDepth: Int): GradientBoostedTreesModel = {
+    val boostingStrategy = BoostingStrategy.defaultParams(algoStr)
+    boostingStrategy.setLoss(Losses.fromString(lossStr))
+    boostingStrategy.setNumIterations(numIterations)
+    boostingStrategy.setLearningRate(learningRate)
+    boostingStrategy.treeStrategy.setMaxDepth(maxDepth)
+    boostingStrategy.treeStrategy.categoricalFeaturesInfo = categoricalFeaturesInfo.asScala.toMap
+
+    val cached = data.rdd.persist(StorageLevel.MEMORY_AND_DISK)
+    try {
+      GradientBoostedTrees.train(data, boostingStrategy)
@jkbradley

jkbradley Jan 25, 2015

Contributor

"data" --> "cached"

@jkbradley jkbradley commented on an outdated diff Jan 25, 2015

python/pyspark/mllib/tree.py
@@ -24,7 +24,41 @@
from pyspark.mllib.linalg import _convert_to_vector
from pyspark.mllib.regression import LabeledPoint
-__all__ = ['DecisionTreeModel', 'DecisionTree', 'RandomForestModel', 'RandomForest']
+__all__ = ['TreeEnsembleModel', 'DecisionTreeModel', 'DecisionTree', 'RandomForestModel',
+ 'RandomForest', 'GradientBoostedTrees']
+
+
+class TreeEnsembleModel(JavaModelWrapper):
@jkbradley

jkbradley Jan 25, 2015

Contributor

In Scala, TreeEnsembleModel is private[tree]. I'd add an underscore before the name (and maybe also add some documentation noting that this class is a private API).

@jkbradley jkbradley commented on an outdated diff Jan 25, 2015

python/pyspark/mllib/tree.py
+
+class TreeEnsembleModel(JavaModelWrapper):
+    def predict(self, x):
+        """
+        Predict values for a single data point or an RDD of points using
+        the model trained.
+        """
+        if isinstance(x, RDD):
+            return self.call("predict", x.map(_convert_to_vector))
+
+        else:
+            return self.call("predict", _convert_to_vector(x))
+
+    def numTrees(self):
+        """
+        Get number of trees in forest.
@jkbradley

jkbradley Jan 25, 2015

Contributor

"forest" --> "ensemble"

@jkbradley jkbradley commented on an outdated diff Jan 25, 2015

python/pyspark/mllib/tree.py
+ """
+ if isinstance(x, RDD):
+ return self.call("predict", x.map(_convert_to_vector))
+
+ else:
+ return self.call("predict", _convert_to_vector(x))
+
+ def numTrees(self):
+ """
+ Get number of trees in forest.
+ """
+ return self.call("numTrees")
+
+ def totalNumNodes(self):
+ """
+ Get total number of nodes, summed over all trees in the forest.
@jkbradley

jkbradley Jan 25, 2015

Contributor

"forest" --> ensemble"

@jkbradley jkbradley commented on an outdated diff Jan 25, 2015

python/pyspark/mllib/tree.py
@@ -383,6 +387,129 @@ def trainRegressor(cls, data, categoricalFeaturesInfo, numTrees, featureSubsetSt
featureSubsetStrategy, impurity, maxDepth, maxBins, seed)
+class GradientBoostedTreesModel(TreeEnsembleModel):
+ """
+ Represents a gradient-boosted tree model.
+
+ EXPERIMENTAL: This is an experimental API.
@jkbradley

jkbradley Jan 25, 2015

Contributor

Would you mind correcting the experimental tag here and in TreeEnsembleModel and RandomForestModel? (I know it was there before your PR.) I'd follow the example in stat.py's chiSqTest:

        .. note:: Experimental

That makes for better formatting in the generated docs. Thanks!
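
For illustration, the suggested convention would look roughly like this in the class docstring (exact wording up to the author):

    class GradientBoostedTreesModel(TreeEnsembleModel):
        """
        .. note:: Experimental

        Represents a gradient-boosted tree model.
        """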

@jkbradley jkbradley commented on the diff Jan 25, 2015

python/pyspark/mllib/tree.py
+ """
+
+
+class GradientBoostedTrees(object):
+
+ @classmethod
+ def _train(cls, data, algo, categoricalFeaturesInfo,
+ loss, numIterations, learningRate, maxDepth):
+ first = data.first()
+ assert isinstance(first, LabeledPoint), "the data should be RDD of LabeledPoint"
+ model = callMLlibFunc("trainGradientBoostedTreesModel", data, algo, categoricalFeaturesInfo,
+ loss, numIterations, learningRate, maxDepth)
+ return GradientBoostedTreesModel(model)
+
+ @classmethod
+ def trainClassifier(cls, data, categoricalFeaturesInfo,
@jkbradley

jkbradley Jan 25, 2015

Contributor

Can you please use the same defaults as in the Scala API (here and for trainRegressor)?

Contributor

jkbradley commented Jan 25, 2015

@kazk1018 Thanks for the updates; sorry for the delayed response. Please ping me if updates are added & ready for review.

The two other items that would be good to have are (a) unit tests in python/pyspark/mllib/tests.py and (b) an example in examples/src/main/python/mllib/. The code freeze for the next release is in one week (Jan 31). Would you have time to add these by then? Thanks again!
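
As a sketch of the kind of test being asked for, following the pattern of the existing test_classification in mllib/tests.py (the dataset, iteration count, and assertions are illustrative, and self.sc is assumed to be the SparkContext supplied by the test base class):

    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.tree import GradientBoostedTrees

    def test_gradient_boosted_trees(self):
        # Tiny, linearly separable dataset: label is 1.0 iff the single feature is positive.
        data = [
            LabeledPoint(0.0, [-1.0]),
            LabeledPoint(1.0, [1.0]),
            LabeledPoint(0.0, [-2.0]),
            LabeledPoint(1.0, [2.0]),
        ]
        rdd = self.sc.parallelize(data)
        gbt_model = GradientBoostedTrees.trainClassifier(
            rdd, categoricalFeaturesInfo={}, numIterations=10)
        self.assertEqual(gbt_model.numTrees(), 10)
        self.assertTrue(gbt_model.predict([3.0]) > 0)
        self.assertTrue(gbt_model.predict([-3.0]) <= 0)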

cthom commented Jan 27, 2015

Is there any way to maintain some kind of state about the model as it's being built? For GBT models, one usually sees a plot of the error vs. the number of trees in the model. If the model setup incorporates a hold-out or test/validation data set, we can determine after the fact the optimal number of trees in the model (any more and it starts to overfit).

At the moment, my solution has been to extract the trees from the model, iteratively recreate a sub-model, and score the test data against it. But this is fairly expensive. I figure there must be an internal assessment of model performance at each step of the building phase; if this were retained, I think there would be a lot of value. I'm a little unsure how to implement it, though.
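
For what it's worth, the bookkeeping described above can be done in one pass over the hold-out set without retraining sub-models, because the k-tree sub-model's prediction is just the running weighted sum of per-tree predictions. A plain-Python sketch of the idea for a regression GBT with squared error, where tree_predict_fns and tree_weights are hypothetical stand-ins for whatever per-tree access the model exposes:

    def staged_validation_mse(tree_predict_fns, tree_weights, validation):
        """Validation MSE for every prefix (first k trees) of a boosted ensemble.

        tree_predict_fns -- one callable per tree, mapping a feature vector to a float
        tree_weights     -- matching list of (learning-rate-scaled) tree weights
        validation       -- list of (label, features) pairs held out from training
        """
        partial = [0.0] * len(validation)   # running weighted-sum prediction per point
        errors = []
        for predict, weight in zip(tree_predict_fns, tree_weights):
            for i, (_, features) in enumerate(validation):
                partial[i] += weight * predict(features)
            mse = sum((label - pred) ** 2
                      for (label, _), pred in zip(validation, partial)) / len(validation)
            errors.append(mse)
        return errors   # errors[k-1] is the MSE of the sub-model using the first k trees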

Contributor

jkbradley commented Jan 27, 2015

@cthom Validation on-the-fly during training would be great to have. Let's discuss it in a separate JIRA; I just created one: https://issues.apache.org/jira/browse/SPARK-5436

SparkQA commented Jan 28, 2015

Test build #26201 has started for PR 3951 at commit 7dc1aab.

  • This patch merges cleanly.

SparkQA commented Jan 28, 2015

Test build #26201 has finished for PR 3951 at commit 7dc1aab.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class TreeEnsembleModel(JavaModelWrapper):
    • class DecisionTreeModel(JavaModelWrapper):
    • class RandomForestModel(TreeEnsembleModel):
    • class GradientBoostedTreesModel(TreeEnsembleModel):
    • class GradientBoostedTrees(object):

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26201/
Test FAILed.

Any chance this one will make it into the 1.3 release? We'd really like to see this one!

SparkQA commented Jan 28, 2015

Test build #26208 has started for PR 3951 at commit 6e4ead8.

  • This patch merges cleanly.

SparkQA commented Jan 28, 2015

Test build #26208 has finished for PR 3951 at commit 6e4ead8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class TreeEnsembleModel(JavaModelWrapper):
    • class DecisionTreeModel(JavaModelWrapper):
    • class RandomForestModel(TreeEnsembleModel):
    • class GradientBoostedTreesModel(TreeEnsembleModel):
    • class GradientBoostedTrees(object):

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26208/
Test PASSed.

@mengxr mengxr commented on an outdated diff Jan 28, 2015

examples/src/main/python/mllib/gradient_boosted_trees.py
+"""
+
+import sys
+
+from pyspark.context import SparkContext
+from pyspark.mllib.tree import GradientBoostedTrees
+from pyspark.mllib.util import MLUtils
+
+
+def testClassification(trainingData, testData):
+    # Train a GradientBoostedTrees model.
+    # Empty categoricalFeaturesInfo indicates all features are continuous.
+    model = GradientBoostedTrees.trainClassifier(trainingData,
+                                                 categoricalFeaturesInfo={},
+                                                 numIterations=30,
+                                                 maxDepth=4)
@mengxr

mengxr Jan 28, 2015

Contributor

For the code style, we don't chop down arguments in method calls. For example: https://github.com/apache/spark/blob/master/python/pyspark/mllib/tree.py#L137

So this should be

    model = GradientBoostedTrees.trainClassifier(trainingData, categoricalFeaturesInfo={},
             numIterations=30, maxDepth=4)

or

    model = GradientBoostedTrees.trainClassifier(
             trainingData, categoricalFeaturesInfo={}, numIterations=30, maxDepth=4)

@mengxr mengxr commented on an outdated diff Jan 28, 2015

python/pyspark/mllib/tests.py
@@ -179,10 +179,27 @@ def test_classification(self):
self.assertTrue(dt_model.predict(features[2]) <= 0)
self.assertTrue(dt_model.predict(features[3]) > 0)
+ rf_model = \
@mengxr

mengxr Jan 28, 2015

Contributor

Similarly, this should be

        rf_model = RandomForest.trainClassifier(
            rdd, numClasses=2, categoricalFeaturesInfo=categoricalFeaturesInfo, numTrees=100)

SparkQA commented Jan 28, 2015

Test build #26220 has started for PR 3951 at commit 56f6c97.

  • This patch merges cleanly.

SparkQA commented Jan 28, 2015

Test build #26220 has finished for PR 3951 at commit 56f6c97.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class TreeEnsembleModel(JavaModelWrapper):
    • class DecisionTreeModel(JavaModelWrapper):
    • class RandomForestModel(TreeEnsembleModel):
    • class GradientBoostedTreesModel(TreeEnsembleModel):
    • class GradientBoostedTrees(object):

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26220/
Test PASSed.

@mengxr mengxr commented on the diff Jan 28, 2015

python/pyspark/mllib/tree.py
+        model = callMLlibFunc("trainGradientBoostedTreesModel", data, algo, categoricalFeaturesInfo,
+                              loss, numIterations, learningRate, maxDepth)
+        return GradientBoostedTreesModel(model)
+
+    @classmethod
+    def trainClassifier(cls, data, categoricalFeaturesInfo,
+                        loss="logLoss", numIterations=100, learningRate=0.1, maxDepth=3):
+        """
+        Method to train a gradient-boosted trees model for classification.
+
+        :param data: Training dataset: RDD of LabeledPoint. Labels should take values {0, 1}.
+        :param categoricalFeaturesInfo: Map storing arity of categorical
+               features. E.g., an entry (n -> k) indicates that feature
+               n is categorical with k categories indexed from 0:
+               {0, 1, ..., k-1}.
+        :param loss: Loss function used for minimization during gradient boosting.
@mengxr

mengxr Jan 28, 2015

Contributor

What losses are available to users? This needs documentation.
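
For context, the loss strings recognized by MLlib's Losses.fromString at this point were, as far as I know, "logLoss", "leastSquaresError", and "leastAbsoluteError"; the :param loss: entry could spell that out, e.g.:

    :param loss: Loss function used for minimization during gradient boosting.
                 Supported values: "logLoss", "leastSquaresError", "leastAbsoluteError".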

@mengxr mengxr commented on the diff Jan 28, 2015

python/pyspark/mllib/tree.py
+        return cls._train(data, "classification", categoricalFeaturesInfo,
+                          loss, numIterations, learningRate, maxDepth)
+
+    @classmethod
+    def trainRegressor(cls, data, categoricalFeaturesInfo,
+                       loss="leastSquaresError", numIterations=100, learningRate=0.1, maxDepth=3):
+        """
+        Method to train a gradient-boosted trees model for regression.
+
+        :param data: Training dataset: RDD of LabeledPoint. Labels are
+               real numbers.
+        :param categoricalFeaturesInfo: Map storing arity of categorical
+               features. E.g., an entry (n -> k) indicates that feature
+               n is categorical with k categories indexed from 0:
+               {0, 1, ..., k-1}.
+        :param loss: Loss function used for minimization during gradient boosting.
@mengxr

mengxr Jan 28, 2015

Contributor

Same here.

@mengxr mengxr commented on an outdated diff Jan 28, 2015

python/pyspark/mllib/tree.py
+ >>> model.numTrees()
+ 100
+ >>> model.totalNumNodes()
+ 300
+ >>> print model, # it already has newline
+ TreeEnsembleModel classifier with 100 trees
+ >>> model.predict([2.0])
+ 1.0
+ >>> model.predict([0.0])
+ 0.0
+ >>> rdd = sc.parallelize([[2.0], [0.0]])
+ >>> model.predict(rdd).collect()
+ [1.0, 0.0]
+ """
+ return cls._train(data, "classification", categoricalFeaturesInfo,
+ loss, numIterations, learningRate, maxDepth)
@mengxr

mengxr Jan 28, 2015

Contributor

remove one space after loss,

@mengxr mengxr commented on the diff Jan 28, 2015

python/pyspark/mllib/tree.py
+ >>> model = GradientBoostedTrees.trainClassifier(sc.parallelize(data), {})
+ >>> model.numTrees()
+ 100
+ >>> model.totalNumNodes()
+ 300
+ >>> print model, # it already has newline
+ TreeEnsembleModel classifier with 100 trees
+ >>> model.predict([2.0])
+ 1.0
+ >>> model.predict([0.0])
+ 0.0
+ >>> rdd = sc.parallelize([[2.0], [0.0]])
+ >>> model.predict(rdd).collect()
+ [1.0, 0.0]
+ """
+ return cls._train(data, "classification", categoricalFeaturesInfo,
@mengxr

mengxr Jan 28, 2015

Contributor

Add "classification" to BoostingStrategy.defaultParams, which only recognizes Classification. We have some inconsistency here. I prefer using the lowercase "classification" as in RandomForest and DecisionTree. But since we already take "Classification" in BoostingStrategy.defaultParams, we should make it backward-compatible.

@kazk1018

kazk1018 Jan 29, 2015

Contributor

@mengxr The parameter of BoostingStrategy.defaultParams in the master branch was changed from algo: String to algo: Algo (converted with Algo.fromString). So if I change "classification" to "Classification", the CI test will fail.

@mengxr

mengxr Jan 30, 2015

Contributor

I see. I didn't merge master when I ran this example code. Thanks for pointing it out! That was actually an incompatible change. I'm going to merge this PR and then submit a PR to accept both "classification" and "Classification".

@mengxr mengxr commented on the diff Jan 28, 2015

python/pyspark/mllib/tree.py
+ ... ]
+ >>>
+ >>> model = GradientBoostedTrees.trainRegressor(sc.parallelize(sparse_data), {})
+ >>> model.numTrees()
+ 100
+ >>> model.totalNumNodes()
+ 102
+ >>> model.predict(SparseVector(2, {1: 1.0}))
+ 1.0
+ >>> model.predict(SparseVector(2, {0: 1.0}))
+ 0.0
+ >>> rdd = sc.parallelize([[0.0, 1.0], [1.0, 0.0]])
+ >>> model.predict(rdd).collect()
+ [1.0, 0.0]
+ """
+ return cls._train(data, "regression", categoricalFeaturesInfo,
@mengxr

mengxr Jan 28, 2015

Contributor

Same about "regression".

[SPARK-5094][MLlib] Add Python API for Gradient Boosted Trees
Check lint-python and lint-scala
[SPARK-5094][MLlib] Add some key params for Gradient Boosted Trees in Python API
Fix issues
Fix some issues
Fix the issues (for changing BoostingStrategy.defaultParams() in master)
Fix the issues

Added comments about loss functions

SparkQA commented Jan 30, 2015

Test build #26364 has started for PR 3951 at commit 620d247.

  • This patch merges cleanly.

SparkQA commented Jan 30, 2015

Test build #26364 has finished for PR 3951 at commit 620d247.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class TreeEnsembleModel(JavaModelWrapper):
    • class DecisionTreeModel(JavaModelWrapper):
    • class RandomForestModel(TreeEnsembleModel):
    • class GradientBoostedTreesModel(TreeEnsembleModel):
    • class GradientBoostedTrees(object):

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26364/
Test PASSed.

@asfgit asfgit closed this in bc1fc9b Jan 30, 2015

Contributor

mengxr commented Jan 30, 2015

LGTM. Merged into master. Thanks!!

zhzhan added a commit to zhzhan/spark that referenced this pull request Feb 18, 2015

[SPARK-5094][MLlib] Add Python API for Gradient Boosted Trees
This PR is implementing the Gradient Boosted Trees for Python API.

Author: Kazuki Taniguchi <kazuki.t.1018@gmail.com>

Closes #3951 from kazk1018/gbt_for_py and squashes the following commits:

620d247 [Kazuki Taniguchi] [SPARK-5094][MLlib] Add Python API for Gradient Boosted Trees

preaudc added a commit to preaudc/spark that referenced this pull request Apr 17, 2015

[SPARK-5094][MLlib] Add Python API for Gradient Boosted Trees
This PR is implementing the Gradient Boosted Trees for Python API.

Author: Kazuki Taniguchi <kazuki.t.1018@gmail.com>

Closes #3951 from kazk1018/gbt_for_py and squashes the following commits:

620d247 [Kazuki Taniguchi] [SPARK-5094][MLlib] Add Python API for Gradient Boosted Trees