[SPARK-18194][ML] Log instrumentation in OneVsRest, CrossValidator, TrainValidationSplit #16480

Closed
sueann wants to merge 4 commits

sueann
Contributor

@sueann sueann commented Jan 5, 2017

What changes were proposed in this pull request?

Added instrumentation logging for OneVsRest classifier, CrossValidator, TrainValidationSplit fit() functions.

How was this patch tested?

Ran unit tests and checked the log file (see output in comments).

@sueann
Contributor Author

sueann commented Jan 5, 2017

@jkbradley @thunterdb

@@ -344,6 +344,10 @@ final class OneVsRest @Since("1.4.0") (
multiclassLabeled.unpersist()
}

val instrLog = Instrumentation.create(this, multiclassLabeled)
Contributor Author

I think multiclassLabeled is the dataset we want, but I'm not sure what it's supposed to be used for in instrumentation.scala, so I could be wrong.

Member

I'd actually use the input "dataset" since it has more information (columns), though either should work.

Member

Btw, can you please rename this to "instr" to match other classes? I see ALS is also named instrLog, but it's the only one. Could you change ALS to "instr" as well in this PR?

Member

@jkbradley jkbradley left a comment

Thanks for the PR! I added some comments.

Also, for all of these, I'd move the logging earlier in the train/fit methods so that you log info as soon as it's available.
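
A minimal sketch of why the ordering matters, using a hypothetical `SketchLogger` and `fit` helper rather than Spark's actual `Instrumentation` class: when parameters are logged before the expensive work begins, they still reach the log if training fails partway through.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.Try

// Hypothetical stand-in for an instrumentation logger; not Spark's API.
final class SketchLogger {
  val lines: ArrayBuffer[String] = ArrayBuffer.empty[String]
  def log(msg: String): Unit = lines += msg
}

object EarlyLoggingSketch {
  // Logs the parameters first, then runs the (possibly failing) training step.
  def fit(logger: SketchLogger, numFolds: Int)(train: () => Unit): Unit = {
    logger.log(s"numFolds=$numFolds") // known before any work starts
    train()
    logger.log("training finished")
  }

  def main(args: Array[String]): Unit = {
    val logger = new SketchLogger
    // Simulate a training failure; the parameter line is already recorded.
    Try(fit(logger, numFolds = 3)(() => throw new RuntimeException("boom")))
    logger.lines.foreach(println) // prints only: numFolds=3
  }
}
```

Had the parameter line been written after `train()`, the failed run would have left no trace of its configuration.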

@@ -344,6 +344,10 @@ final class OneVsRest @Since("1.4.0") (
multiclassLabeled.unpersist()
}

val instrLog = Instrumentation.create(this, multiclassLabeled)
instrLog.logParams(labelCol, featuresCol, predictionCol)
instrLog.logNumClasses(numClasses)
Member

Also log numFeatures, which you can get from models.head.numFeatures

val instrLog = Instrumentation.create(this, multiclassLabeled)
instrLog.logParams(labelCol, featuresCol, predictionCol)
instrLog.logNumClasses(numClasses)
instrLog.logNamedValue("classifier", $(classifier).getClass.getSimpleName)
Member

Use getCanonicalName instead
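
For reference, a small sketch of the difference between the two calls, using a JDK class so that it runs without Spark on the classpath (the object name `ClassNameSketch` is just for illustration):

```scala
object ClassNameSketch {
  def main(args: Array[String]): Unit = {
    val obj = new java.util.ArrayList[String]()
    // Simple name: the class name without its package.
    println(obj.getClass.getSimpleName)    // ArrayList
    // Canonical name: fully qualified, so classes with the same
    // simple name in different packages stay distinguishable in logs.
    println(obj.getClass.getCanonicalName) // java.util.ArrayList
  }
}
```

Note that `getCanonicalName` returns `null` for anonymous and local classes, so a caller that may see such classes would need a fallback.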

@@ -116,13 +116,17 @@ class TrainValidationSplit @Since("1.5.0") (@Since("1.5.0") override val uid: St
}
validationDataset.unpersist()

val instrLog = Instrumentation.create(this, dataset)
instrLog.logParams(trainRatio)
Member

log seed

* Logs the value with customized name field.
*/
- def logNamedValue(name: String, num: Long): Unit = {
+ def logNamedValue(name: String, num: JValue): Unit = {
Member

Since this class doesn't expose json4s APIs, let's stick with basic types in the public API. Just do String for now (since the Long version is not yet used).
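
A sketch of what a basic-types signature could look like, assuming a hypothetical `SketchInstrumentation` class (Spark's real `Instrumentation` is internal and may differ). The public method accepts only `String` arguments, and the JSON rendering, hand-rolled here instead of using json4s, stays an implementation detail.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical logger; the class name and JSON rendering are assumptions.
final class SketchInstrumentation(prefix: String) {
  val lines: ArrayBuffer[String] = ArrayBuffer.empty[String]

  // Minimal escaping for backslashes and double quotes only;
  // control characters are not handled in this sketch.
  private def quote(s: String): String =
    "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\""

  /** Logs the value with customized name field. */
  def logNamedValue(name: String, value: String): Unit =
    lines += s"$prefix: {${quote(name)}:${quote(value)}}"
}

object LogNamedValueSketch {
  def main(args: Array[String]): Unit = {
    val instr = new SketchInstrumentation("OneVsRest")
    instr.logNamedValue("classifier", "LogisticRegression")
    instr.lines.foreach(println)
  }
}
```

Callers never see a JSON type, so the rendering library can change without touching the method signature.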

Member

"num" -> "value"

* @param estimatorParamMaps different params tried by the tuning estimator
* @param evaluator evaluator used to compute the metric for each estimator param value
*/
def logTuningParams(
Member

I like that this is separated out for all tuning algorithms, but it belongs in ml.tuning. How about as a static (object) method in ValidatorParams?

Contributor Author

Moved into ValidatorParams as a class method -- let me know if you feel strongly about it being static. Thanks!
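
To make the two placements concrete, here is a sketch with hypothetical names (`SketchParams`, `SketchValidator`); it is not the code from this PR. A class method on the trait reads the instance's fields directly, while a static method on the companion object takes the instance as an argument.

```scala
// Hypothetical names; this only illustrates the two placements discussed.
trait SketchParams {
  def numModels: Int

  // Class (instance) method: uses the instance's own members.
  protected def describeTuning(): String = s"numModels=$numModels"
}

object SketchParams {
  // Static (object) method: the instance is passed in explicitly.
  def describeTuning(params: SketchParams): String =
    s"numModels=${params.numModels}"
}

final class SketchValidator(val numModels: Int) extends SketchParams {
  // Calls the inherited instance method.
  def viaInstance: String = describeTuning()
  // Calls the companion-object method with this instance.
  def viaObject: String = SketchParams.describeTuning(this)
}
```

Both produce the same string; the difference is mainly who may call the helper and whether it can be invoked without subclassing the trait.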

@sueann
Contributor Author

sueann commented Jan 5, 2017

Unit test runs & logs:

  • TrainValidationSplit
$ build/sbt "test-only org.apache.spark.ml.tuning.TrainValidationSplitSuite"
$ less mllib/target/unit-tests.log 
...
17/01/05 11:21:51.877 pool-1-thread-1-ScalaTest-running-TrainValidationSplitSuite INFO Instrumentation: TrainValidationSplit-tvs_8e7090288e01-206645650-7: training: numPartitions=2 storageLevel=StorageLevel(1 replicas)
17/01/05 11:21:51.877 pool-1-thread-1-ScalaTest-running-TrainValidationSplitSuite INFO Instrumentation: TrainValidationSplit-tvs_8e7090288e01-206645650-7: {"trainRatio":0.5}
17/01/05 11:21:51.877 pool-1-thread-1-ScalaTest-running-TrainValidationSplitSuite INFO Instrumentation: TrainValidationSplit-tvs_8e7090288e01-206645650-7: {"estimator":"LinearRegression","evaluator":"RegressionEvaluator","numModels":4}
17/01/05 11:21:51.877 pool-1-thread-1-ScalaTest-running-TrainValidationSplitSuite INFO TrainValidationSplit: Train validation split metrics: WrappedArray(7.79718164555543, 7.734602725775261, 7.79718164555543, 0.11150408019144314)
17/01/05 11:21:51.877 pool-1-thread-1-ScalaTest-running-TrainValidationSplitSuite INFO TrainValidationSplit: Best set of parameters:
{
        linReg_8e7bf3d889f1-maxIter: 10,
        linReg_8e7bf3d889f1-regParam: 0.001
}
17/01/05 11:21:51.877 pool-1-thread-1-ScalaTest-running-TrainValidationSplitSuite INFO TrainValidationSplit: Best train validation split metric: 0.11150408019144314.
...
17/01/05 11:21:52.087 pool-1-thread-1-ScalaTest-running-TrainValidationSplitSuite INFO Instrumentation: TrainValidationSplit-tvs_8e7090288e01-206645650-7: training finished
  • CrossValidator
$ build/sbt "test-only org.apache.spark.ml.tuning.CrossValidatorSuite"
$ less mllib/target/unit-tests.log 
...
17/01/05 11:27:26.860 pool-1-thread-1-ScalaTest-running-CrossValidatorSuite INFO Instrumentation: CrossValidator-cv_f60fd2c0f5ce-428174643-13: training: numPartitions=2 storageLevel=StorageLevel(1 replicas)
17/01/05 11:27:26.860 pool-1-thread-1-ScalaTest-running-CrossValidatorSuite INFO Instrumentation: CrossValidator-cv_f60fd2c0f5ce-428174643-13: {"numFolds":3}
17/01/05 11:27:26.860 pool-1-thread-1-ScalaTest-running-CrossValidatorSuite INFO Instrumentation: CrossValidator-cv_f60fd2c0f5ce-428174643-13: {"estimator":"LogisticRegression","evaluator":"BinaryClassificationEvaluator","numModels":4}
17/01/05 11:27:26.860 pool-1-thread-1-ScalaTest-running-CrossValidatorSuite INFO CrossValidator: Average cross-validation metrics: WrappedArray(0.5, 0.5, 0.8333037031406596, 0.8333037031406596)
17/01/05 11:27:26.862 pool-1-thread-1-ScalaTest-running-CrossValidatorSuite INFO CrossValidator: Best set of parameters:
{
        logreg_e763c2efb948-maxIter: 10,
        logreg_e763c2efb948-regParam: 0.001
}
17/01/05 11:27:26.863 pool-1-thread-1-ScalaTest-running-CrossValidatorSuite INFO CrossValidator: Best cross-validation metric: 0.8333037031406596.
...
17/01/05 11:27:27.040 pool-1-thread-1-ScalaTest-running-CrossValidatorSuite INFO Instrumentation: CrossValidator-cv_f60fd2c0f5ce-428174643-13: training finished
...
  • OneVsRest
$ build/sbt "test-only org.apache.spark.ml.classification.OneVsRestSuite"
$ less mllib/target/unit-tests.log 
...
17/01/05 13:26:36.017 pool-1-thread-1-ScalaTest-running-OneVsRestSuite INFO Instrumentation: OneVsRest-oneVsRest_551a9c8d41e4-1366242214-12: training: numPartitions=2 storageLevel=StorageLevel(1 replicas)
17/01/05 13:26:36.018 pool-1-thread-1-ScalaTest-running-OneVsRestSuite INFO Instrumentation: OneVsRest-oneVsRest_551a9c8d41e4-1366242214-12: {"labelCol":"indexed","featuresCol":"f","predictionCol":"p"}
17/01/05 13:26:36.018 pool-1-thread-1-ScalaTest-running-OneVsRestSuite INFO Instrumentation: OneVsRest-oneVsRest_551a9c8d41e4-1366242214-12: {"numClasses":3}
17/01/05 13:26:36.018 pool-1-thread-1-ScalaTest-running-OneVsRestSuite INFO Instrumentation: OneVsRest-oneVsRest_551a9c8d41e4-1366242214-12: {"classifier":"LogisticRegression"}
17/01/05 13:26:36.018 pool-1-thread-1-ScalaTest-running-OneVsRestSuite INFO Instrumentation: OneVsRest-oneVsRest_551a9c8d41e4-1366242214-12: training finished
...

@sueann
Contributor Author

sueann commented Jan 5, 2017

Now the logs show the full class path for the estimator/evaluator/classifier:

INFO Instrumentation: CrossValidator-cv_acb968c4de59-968285285-1: {"estimator":"org.apache.spark.ml.classification.LogisticRegression","evaluator":"org.apache.spark.ml.evaluation.BinaryClassificationEvaluator","numModels":4}

Member

@jkbradley jkbradley left a comment

Thanks! Just a few comments

/**
* Instrumentation logging for tuning params including the inner estimator and evaluator info.
*
* @param instrumentation instrumentation logger
Member

I'd remove this comment since it doesn't add info.

* @param instrumentation instrumentation logger
*/
protected def logTuningParams(instrumentation: Instrumentation[_]): Unit = {
instrumentation.log(compact(render(map2jvalue(Map[String, JValue](
Member

I'd say just use instrumentation.logNamedValue for each of these, rather than handling JSON here.
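
A sketch of that shape, with a hypothetical `NamedValueLog` standing in for the instrumentation class and `AnyRef` parameters standing in for the estimator and evaluator: each field goes through one `logNamedValue` call, so the caller assembles no JSON.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical minimal logger; not Spark's Instrumentation class.
final class NamedValueLog {
  val lines: ArrayBuffer[String] = ArrayBuffer.empty[String]
  def logNamedValue(name: String, value: String): Unit =
    lines += s"$name=$value"
}

object TuningParamsSketch {
  // One call per field; formatting is left entirely to the logger.
  def logTuningParams(
      instr: NamedValueLog,
      estimator: AnyRef,
      evaluator: AnyRef,
      numModels: Int): Unit = {
    instr.logNamedValue("estimator", estimator.getClass.getCanonicalName)
    instr.logNamedValue("evaluator", evaluator.getClass.getCanonicalName)
    instr.logNamedValue("numModels", numModels.toString)
  }
}
```

The trade-off is three log lines instead of one combined record, in exchange for keeping the tuning code free of any JSON dependency.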


if (handlePersistence) {
multiclassLabeled.unpersist()
}


Member

remove extra newline

@jkbradley
Member

ok to test

@jkbradley
Member

add to whitelist

@jkbradley
Member

LGTM pending Jenkins tests
Thanks @sueann !

@SparkQA

SparkQA commented Jan 6, 2017

Test build #70970 has started for PR 16480 at commit aab8dd7.

@SparkQA

SparkQA commented Jan 6, 2017

Test build #70972 has started for PR 16480 at commit aab8dd7.

@SparkQA

SparkQA commented Jan 6, 2017

Test build #3522 has finished for PR 16480 at commit aab8dd7.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 7, 2017

Test build #70996 has finished for PR 16480 at commit 0034461.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

jkbradley commented Jan 7, 2017

Merging with master.
Thanks @sueann !

@asfgit asfgit closed this in d60f6f6 Jan 7, 2017
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Jan 9, 2017
…rainValidationSplit

## What changes were proposed in this pull request?

Added instrumentation logging for OneVsRest classifier, CrossValidator, TrainValidationSplit fit() functions.

## How was this patch tested?

Ran unit tests and checked the log file (see output in comments).

Author: sueann <sueann@databricks.com>

Closes apache#16480 from sueann/SPARK-18194.
@sueann sueann deleted the SPARK-18194 branch January 10, 2017 23:44
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017