[SPARK-14682][ML] Provide evaluateEachIteration method or equivalent for spark.ml GBTs #21097
Conversation
Test build #89499 has finished for PR 21097 at commit
Test build #89500 has finished for PR 21097 at commit
@@ -365,6 +365,20 @@ class GBTClassifierSuite extends MLTest with DefaultReadWriteTest {
    assert(mostImportantFeature !== mostIF)
  }

  test("model evaluateEachIteration") {
    for (lossType <- Seq("logistic")) {
There is only one lossType, so the for loop is not necessary.
Yes, but I think it can be useful in the future, if we add more loss types for the GBT classifier.
OK. It makes sense.
Thanks! Just a few comments.
  .setLossType(lossType)
val model = gbt.fit(trainData.toDF)
val eval1 = model.evaluateEachIteration(validationData.toDF)
val eval2 = GradientBoostedTrees.evaluateEachIteration(validationData,
This is testing the spark.ml implementation against itself. I was about to recommend using the old spark.mllib implementation as a reference. However, the old implementation is not tested at all. Would you be able to test against a standard implementation in R or scikit-learn (following the patterns used elsewhere in MLlib)?
I searched the scikit-learn docs, and there seems to be no method similar to evaluateEachIteration; we can only use staged_predict in sklearn.ensemble.GradientBoostingRegressor and then apply metric functions to evaluate the staged predictions. I also suspect that slight implementation differences in other libraries would be troublesome. I could not find this method in an R package either.
Now I have updated the unit test to simply compare against hardcoded results.
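The staged evaluation being discussed can be sketched without any ML library: keep a running weighted sum of per-tree margins and compute the loss after each tree is added. This is a toy Python illustration of the idea behind evaluateEachIteration (and sklearn's staged_predict); the function name, toy data, and exact logistic-loss form are illustrative assumptions, not Spark's actual internals.

```python
import math

def staged_log_loss(tree_preds, tree_weights, labels):
    """Logistic loss after each boosting iteration (toy sketch, not Spark code).

    tree_preds[i][j] is the raw margin contribution of tree i on example j;
    the staged prediction after k trees is the weighted sum of the first k.
    Labels are in {-1, +1}.
    """
    n = len(labels)
    margins = [0.0] * n
    losses = []
    for preds, w in zip(tree_preds, tree_weights):
        # accumulate this tree's weighted contribution into the running margin
        margins = [m + w * p for m, p in zip(margins, preds)]
        # mean logistic loss log(1 + exp(-2 * y * F(x))), a form similar in
        # spirit to GBT log loss (assumed here for illustration)
        loss = sum(math.log1p(math.exp(-2.0 * y * m))
                   for y, m in zip(labels, margins)) / n
        losses.append(loss)
    return losses
```

One loss value per boosting stage comes out, which is exactly the shape of result a hardcoded-comparison unit test would pin down.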
 * @param dataset Dataset for validation.
 */
@Since("2.4.0")
def evaluateEachIteration(dataset: Dataset[_]): Array[Double] = {
Do we want to support evaluation on other losses, as in the old API? It might be nice to be able to without having to modify the Model's loss Param value.
Test build #90188 has finished for PR 21097 at commit
For unit tests, what about this?
- Use a fixed random seed.
- Run for maxIter = 3
- Create models with 1 and 2 trees by manually getting the trees and constructing new GBT models.
- Check to make sure the loss for a model with 1 tree matches the first value returned by evaluateEachIteration for the other 2 models.
- Check to make sure the loss for a model with 2 trees matches the second value returned by evaluateEachIteration for the model with 3 trees.
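The invariant behind this suggested test can be checked in miniature: a model built from the first k trees of a larger ensemble must report the same loss as the k-th entry of the larger model's per-iteration evaluation. A hedged Python sketch with made-up "trees" (plain callables) and squared error, assumed purely for illustration rather than Spark's API:

```python
def squared_error_each_iteration(trees, weights, xs, ys):
    """Mean squared error of the cumulative ensemble after each stage.

    `trees` are plain callables standing in for regression trees (toy
    stand-ins, not Spark objects); `weights` are the per-tree weights.
    """
    preds = [0.0] * len(xs)
    errs = []
    for tree, w in zip(trees, weights):
        # add this stage's weighted contribution to the running prediction
        preds = [p + w * tree(x) for p, x in zip(preds, xs)]
        errs.append(sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys))
    return errs

# three toy "trees" with shrinking contributions
trees = [lambda x: x, lambda x: 0.5 * x, lambda x: 0.25 * x]
weights = [1.0, 1.0, 1.0]
xs, ys = [1.0, 2.0], [2.0, 4.0]

# full 3-tree model vs. models truncated to the first 1 and 2 trees
full = squared_error_each_iteration(trees, weights, xs, ys)
```

Truncating the lists plays the role of "manually getting the trees and constructing new GBT models" in the suggestion above: full[0] must equal the loss of the 1-tree model and full[1] the loss of the 2-tree model.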
/**
 * Method to compute error or loss for every iteration of gradient boosting.
 *
 * @param dataset Dataset for validation.
Add doc for "loss" arg, including what the options are
Test build #90351 has finished for PR 21097 at commit
Just a tiny comment left. Thanks!
val model2 = new GBTClassificationModel("gbt-cls-model-test2",
  model3.trees.take(2), model3.treeWeights.take(2), model3.numFeatures, model3.numClasses)

for (evalLossType <- GBTClassifier.supportedLossTypes) {
evalLossType is not used, so I'd remove this loop.
Test build #90404 has finished for PR 21097 at commit

LGTM
What changes were proposed in this pull request?
Provide evaluateEachIteration method or equivalent for spark.ml GBTs.
How was this patch tested?
UT.
Please review http://spark.apache.org/contributing.html before opening a pull request.