[SPARK-18194][ML] Log instrumentation in OneVsRest, CrossValidator, TrainValidationSplit #16480

sueann · 2017-01-05T21:45:49Z

What changes were proposed in this pull request?

Added instrumentation logging for OneVsRest classifier, CrossValidator, TrainValidationSplit fit() functions.

How was this patch tested?

Ran unit tests and checked the log file (see output in comments).

sueann · 2017-01-05T21:51:30Z

@jkbradley @thunterdb

sueann · 2017-01-05T21:52:21Z

mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala

@@ -344,6 +344,10 @@ final class OneVsRest @Since("1.4.0") (
      multiclassLabeled.unpersist()
    }

+    val instrLog = Instrumentation.create(this, multiclassLabeled)


I think multiclassLabeled is the dataset we want, but I'm not sure what it's used for in Instrumentation.scala, so I could be wrong.

jkbradley

Thanks for the PR! I added some comments.

Also, for all of these, I'd move the logging earlier in the train/fit methods so that you log info as soon as it's available.
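A minimal sketch of the "log early" suggestion, assuming a hypothetical `fit` that takes a plain logging callback (`LogEarlySketch`, `fit`, and the `maxIter` parameter are illustrative names, not Spark code). Values known up front are logged before the expensive step, so they still reach the log if training fails:

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for an estimator's fit(); not Spark code.
object LogEarlySketch {
  def fit(data: Seq[Double], maxIter: Int, log: String => Unit): Double = {
    // Everything known before training starts is logged first.
    log("training: numRows=" + data.size)
    log("maxIter=" + maxIter)

    // Stand-in for the expensive training step (here: just the mean).
    val model = if (data.isEmpty) 0.0 else data.sum / data.size

    log("training finished")
    model
  }

  def main(args: Array[String]): Unit = {
    val lines = ArrayBuffer.empty[String]
    fit(Seq(1.0, 2.0, 3.0), 10, line => lines += line)
    lines.foreach(println)
  }
}
```

The only point of the sketch is the ordering: the parameter lines precede any work that could throw.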

jkbradley · 2017-01-05T21:56:44Z

mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala

@@ -344,6 +344,10 @@ final class OneVsRest @Since("1.4.0") (
      multiclassLabeled.unpersist()
    }

+    val instrLog = Instrumentation.create(this, multiclassLabeled)


I'd actually use the input "dataset" since it has more information (columns), though either should work.

jkbradley · 2017-01-05T21:58:18Z

mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala

@@ -344,6 +344,10 @@ final class OneVsRest @Since("1.4.0") (
      multiclassLabeled.unpersist()
    }

+    val instrLog = Instrumentation.create(this, multiclassLabeled)


Btw, can you please rename this to "instr" to match other classes? I see ALS is also named instrLog, but it's the only one. Could you change ALS to "instr" as well in this PR?

jkbradley · 2017-01-05T22:00:05Z

mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala

@@ -344,6 +344,10 @@ final class OneVsRest @Since("1.4.0") (
      multiclassLabeled.unpersist()
    }

+    val instrLog = Instrumentation.create(this, multiclassLabeled)
+    instrLog.logParams(labelCol, featuresCol, predictionCol)
+    instrLog.logNumClasses(numClasses)


Also log numFeatures, which you can get from models.head.numFeatures

jkbradley · 2017-01-05T22:02:04Z

mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala

+    val instrLog = Instrumentation.create(this, multiclassLabeled)
+    instrLog.logParams(labelCol, featuresCol, predictionCol)
+    instrLog.logNumClasses(numClasses)
+    instrLog.logNamedValue("classifier", $(classifier).getClass.getSimpleName)


Use getCanonicalName instead
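For reference, a small sketch of the difference between the two `java.lang.Class` methods. It uses a JDK class as a stand-in for the classifier, since the behaviour is the same for any top-level class (`ClassNameDemo` is an illustrative name):

```scala
// Stand-in example: java.util.ArrayList plays the role of the classifier.
object ClassNameDemo {
  private val cls: Class[_] = new java.util.ArrayList[String]().getClass

  // Short name only; two classes in different packages can collide.
  val simple: String = cls.getSimpleName

  // Fully qualified name, which identifies the class unambiguously.
  val canonical: String = cls.getCanonicalName

  def main(args: Array[String]): Unit = {
    println(simple)    // ArrayList
    println(canonical) // java.util.ArrayList
  }
}
```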

jkbradley · 2017-01-05T22:04:43Z

mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala

@@ -116,13 +116,17 @@ class TrainValidationSplit @Since("1.5.0") (@Since("1.5.0") override val uid: St
    }
    validationDataset.unpersist()

+    val instrLog = Instrumentation.create(this, dataset)
+    instrLog.logParams(trainRatio)


jkbradley · 2017-01-05T22:06:30Z

mllib/src/main/scala/org/apache/spark/ml/util/Instrumentation.scala

   * Logs the value with customized name field.
   */
-  def logNamedValue(name: String, num: Long): Unit = {
+  def logNamedValue(name: String, num: JValue): Unit = {


Since this class doesn't expose json4s APIs, let's stick with basic types in the public API. Just do String for now (since the Long version is not yet used).

"num" -> "value"
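A rough sketch of what a String-only public signature could look like. `MiniInstrumentation` is a hypothetical, heavily simplified stand-in for the real class; it writes the JSON by hand (escaping only quotes and backslashes) so that no JSON library type appears in the public API:

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical, simplified stand-in for Spark's Instrumentation class.
final class MiniInstrumentation(prefix: String, sink: String => Unit) {
  def log(msg: String): Unit = sink(prefix + ": " + msg)

  // Public signature uses only basic types: callers pass a name and a String value.
  def logNamedValue(name: String, value: String): Unit =
    log("{" + quote(name) + ":" + quote(value) + "}")

  // Minimal escaping for illustration only (quotes and backslashes).
  private def quote(s: String): String = {
    val sb = new StringBuilder("\"")
    s.foreach {
      case '"'  => sb.append("\\\"")
      case '\\' => sb.append("\\\\")
      case c    => sb.append(c)
    }
    sb.append('"').toString
  }
}

object MiniInstrumentationDemo {
  def run(): List[String] = {
    val lines = ArrayBuffer.empty[String]
    val instr = new MiniInstrumentation("OneVsRest-demo", line => lines += line)
    instr.logNamedValue("classifier", "org.apache.spark.ml.classification.LogisticRegression")
    lines.toList
  }

  def main(args: Array[String]): Unit = run().foreach(println)
}
```

Any JSON rendering stays an internal detail, so it can change without affecting callers.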

jkbradley · 2017-01-05T22:08:05Z

mllib/src/main/scala/org/apache/spark/ml/util/Instrumentation.scala

+   * @param estimatorParamMaps different params tried by the tuning estimator
+   * @param evaluator evaluator used to compute the metric for each estimator param value
+   */
+  def logTuningParams(


I like that this is separated out for all tuning algorithms, but it belongs in ml.tuning. How about as a static (object) method in ValidatorParams?

moved into ValidatorParams as a class method -- let me know if you feel strongly about it being static. thanks!

sueann · 2017-01-05T22:15:45Z

Unit test runs & logs:

TrainValidationSplit

$ build/sbt "test-only org.apache.spark.ml.tuning.TrainValidationSplitSuite"
$ less mllib/target/unit-tests.log
...
17/01/05 11:21:51.877 pool-1-thread-1-ScalaTest-running-TrainValidationSplitSuite INFO Instrumentation: TrainValidationSplit-tvs_8e7090288e01-206645650-7: training: numPartitions=2 storageLevel=StorageLevel(1 replicas)
17/01/05 11:21:51.877 pool-1-thread-1-ScalaTest-running-TrainValidationSplitSuite INFO Instrumentation: TrainValidationSplit-tvs_8e7090288e01-206645650-7: {"trainRatio":0.5}
17/01/05 11:21:51.877 pool-1-thread-1-ScalaTest-running-TrainValidationSplitSuite INFO Instrumentation: TrainValidationSplit-tvs_8e7090288e01-206645650-7: {"estimator":"LinearRegression","evaluator":"RegressionEvaluator","numModels":4}
17/01/05 11:21:51.877 pool-1-thread-1-ScalaTest-running-TrainValidationSplitSuite INFO TrainValidationSplit: Train validation split metrics: WrappedArray(7.79718164555543, 7.734602725775261, 7.79718164555543, 0.11150408019144314)
17/01/05 11:21:51.877 pool-1-thread-1-ScalaTest-running-TrainValidationSplitSuite INFO TrainValidationSplit: Best set of parameters:
{
        linReg_8e7bf3d889f1-maxIter: 10,
        linReg_8e7bf3d889f1-regParam: 0.001
}
17/01/05 11:21:51.877 pool-1-thread-1-ScalaTest-running-TrainValidationSplitSuite INFO TrainValidationSplit: Best train validation split metric: 0.11150408019144314.
...
17/01/05 11:21:52.087 pool-1-thread-1-ScalaTest-running-TrainValidationSplitSuite INFO Instrumentation: TrainValidationSplit-tvs_8e7090288e01-206645650-7: training finished

CrossValidator

$ build/sbt "test-only org.apache.spark.ml.tuning.CrossValidatorSuite"
$ less mllib/target/unit-tests.log 
...
17/01/05 11:27:26.860 pool-1-thread-1-ScalaTest-running-CrossValidatorSuite INFO Instrumentation: CrossValidator-cv_f60fd2c0f5ce-428174643-13: training: numPartitions=2 storageLevel=StorageLevel(1 replicas)
17/01/05 11:27:26.860 pool-1-thread-1-ScalaTest-running-CrossValidatorSuite INFO Instrumentation: CrossValidator-cv_f60fd2c0f5ce-428174643-13: {"numFolds":3}
17/01/05 11:27:26.860 pool-1-thread-1-ScalaTest-running-CrossValidatorSuite INFO Instrumentation: CrossValidator-cv_f60fd2c0f5ce-428174643-13: {"estimator":"LogisticRegression","evaluator":"BinaryClassificationEvaluator","numModels":4}
17/01/05 11:27:26.860 pool-1-thread-1-ScalaTest-running-CrossValidatorSuite INFO CrossValidator: Average cross-validation metrics: WrappedArray(0.5, 0.5, 0.8333037031406596, 0.8333037031406596)
17/01/05 11:27:26.862 pool-1-thread-1-ScalaTest-running-CrossValidatorSuite INFO CrossValidator: Best set of parameters:
{
        logreg_e763c2efb948-maxIter: 10,
        logreg_e763c2efb948-regParam: 0.001
}
17/01/05 11:27:26.863 pool-1-thread-1-ScalaTest-running-CrossValidatorSuite INFO CrossValidator: Best cross-validation metric: 0.8333037031406596.
...
17/01/05 11:27:27.040 pool-1-thread-1-ScalaTest-running-CrossValidatorSuite INFO Instrumentation: CrossValidator-cv_f60fd2c0f5ce-428174643-13: training finished
...

OneVsRest

$ build/sbt "test-only org.apache.spark.ml.classification.OneVsRestSuite"
$ less mllib/target/unit-tests.log 
...
17/01/05 13:26:36.017 pool-1-thread-1-ScalaTest-running-OneVsRestSuite INFO Instrumentation: OneVsRest-oneVsRest_551a9c8d41e4-1366242214-12: training: numPartitions=2 storageLevel=StorageLevel(1 replicas)
17/01/05 13:26:36.018 pool-1-thread-1-ScalaTest-running-OneVsRestSuite INFO Instrumentation: OneVsRest-oneVsRest_551a9c8d41e4-1366242214-12: {"labelCol":"indexed","featuresCol":"f","predictionCol":"p"}
17/01/05 13:26:36.018 pool-1-thread-1-ScalaTest-running-OneVsRestSuite INFO Instrumentation: OneVsRest-oneVsRest_551a9c8d41e4-1366242214-12: {"numClasses":3}
17/01/05 13:26:36.018 pool-1-thread-1-ScalaTest-running-OneVsRestSuite INFO Instrumentation: OneVsRest-oneVsRest_551a9c8d41e4-1366242214-12: {"classifier":"LogisticRegression"}
17/01/05 13:26:36.018 pool-1-thread-1-ScalaTest-running-OneVsRestSuite INFO Instrumentation: OneVsRest-oneVsRest_551a9c8d41e4-1366242214-12: training finished
...

sueann · 2017-01-05T23:00:04Z

Now the logs show the full class path for the estimator/evaluator/classifier:

INFO Instrumentation: CrossValidator-cv_acb968c4de59-968285285-1: {"estimator":"org.apache.spark.ml.classification.LogisticRegression","evaluator":"org.apache.spark.ml.evaluation.BinaryClassificationEvaluator","numModels":4}

jkbradley

Thanks! Just a few comments

jkbradley · 2017-01-05T23:08:20Z

mllib/src/main/scala/org/apache/spark/ml/tuning/ValidatorParams.scala

+  /**
+   * Instrumentation logging for tuning params including the inner estimator and evaluator info.
+   *
+   * @param instrumentation instrumentation logger


I'd remove this comment since it doesn't add info.

jkbradley · 2017-01-05T23:10:10Z

mllib/src/main/scala/org/apache/spark/ml/tuning/ValidatorParams.scala

+   * @param instrumentation instrumentation logger
+   */
+  protected def logTuningParams(instrumentation: Instrumentation[_]): Unit = {
+    instrumentation.log(compact(render(map2jvalue(Map[String, JValue](


I'd say just use instrumentation.logNamedValue for each of these, rather than handling JSON here.
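One way this suggestion could look, as a sketch under stated assumptions. `NamedValueLogger`, `TuningParamsSketch`, and `DemoTuner` are hypothetical stand-ins, not the Spark traits; the key names follow the log output posted earlier in the thread, and JDK objects stand in for the estimator and evaluator:

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical minimal logger interface; the real Instrumentation class has more methods.
trait NamedValueLogger {
  def logNamedValue(name: String, value: String): Unit
}

// Hypothetical stand-in for the shared tuning params trait.
trait TuningParamsSketch {
  def estimator: AnyRef
  def evaluator: AnyRef
  def numModels: Int

  // One logNamedValue call per value; no JSON handling in this trait.
  protected def logTuningParams(instr: NamedValueLogger): Unit = {
    instr.logNamedValue("estimator", estimator.getClass.getCanonicalName)
    instr.logNamedValue("evaluator", evaluator.getClass.getCanonicalName)
    instr.logNamedValue("numModels", numModels.toString)
  }
}

// JDK objects are used as stand-ins so the example runs without Spark.
final class DemoTuner extends TuningParamsSketch {
  val estimator: AnyRef = new java.lang.StringBuilder()
  val evaluator: AnyRef = new java.util.ArrayList[String]()
  val numModels: Int = 4
  def logTo(instr: NamedValueLogger): Unit = logTuningParams(instr)
}

object TuningParamsDemo {
  def run(): List[(String, String)] = {
    val seen = ArrayBuffer.empty[(String, String)]
    val instr = new NamedValueLogger {
      def logNamedValue(name: String, value: String): Unit = seen += ((name, value))
    }
    new DemoTuner().logTo(instr)
    seen.toList
  }

  def main(args: Array[String]): Unit = run().foreach(println)
}
```

Each value then produces its own log line, and the JSON formatting stays inside the logger.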

jkbradley · 2017-01-05T23:12:03Z

mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala


    if (handlePersistence) {
      multiclassLabeled.unpersist()
    }

+


remove extra newline

jkbradley · 2017-01-06T07:11:58Z

ok to test

jkbradley · 2017-01-06T07:12:06Z

add to whitelist

jkbradley · 2017-01-06T07:12:22Z

LGTM pending Jenkins tests
Thanks @sueann !

SparkQA · 2017-01-06T07:12:45Z

Test build #70970 has started for PR 16480 at commit aab8dd7.

SparkQA · 2017-01-06T07:18:40Z

Test build #70972 has started for PR 16480 at commit aab8dd7.

SparkQA · 2017-01-06T19:15:40Z

Test build #3522 has finished for PR 16480 at commit aab8dd7.

This patch fails Spark unit tests.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2017-01-07T01:12:23Z

Test build #70996 has finished for PR 16480 at commit 0034461.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2017-01-07T02:52:45Z

Merging with master.
Thanks @sueann !

…rainValidationSplit

## What changes were proposed in this pull request?

Added instrumentation logging for OneVsRest classifier, CrossValidator, TrainValidationSplit fit() functions.

## How was this patch tested?

Ran unit tests and checked the log file (see output in comments).

Author: sueann <sueann@databricks.com>

Closes apache#16480 from sueann/SPARK-18194.

sueann added 4 commits January 6, 2017 16:10

724dfa6 [SPARK-18194] log stuff
1acd8a6 [SPARK-18194] instrumentation logging for OneVsRest, CrossValidator, … …TrainValidationSplit
23bfc44 addressing comments
0034461 log each tuning param separately

sueann force-pushed the SPARK-18194 branch from aab8dd7 to 0034461 Compare January 7, 2017 00:10

asfgit closed this in d60f6f6 Jan 7, 2017

sueann deleted the SPARK-18194 branch January 10, 2017 23:44
