[SPARK-8484] [ML]. Added TrainValidationSplit for hyper-parameter tuning. #6996

zapletal-martin · 2015-06-24T20:54:14Z

Added TrainValidationSplit for hyper-parameter tuning. It randomly splits the input dataset into train and validation sets and uses an evaluation metric on the validation set to select the best model. It should be similar to CrossValidator, but simpler and less expensive.
Refactored CrossValidator so that both validators share code.
The external API of CrossValidator should stay unchanged.
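
For readers unfamiliar with the approach, the selection logic described above can be sketched in plain Scala without Spark. This is only an illustration under stated assumptions, not the code in this PR: the toy model (a least-squares slope with an L2 penalty `lambda`), the names `split`, `fit`, `evaluate` and `selectBest`, and the use of negative mean squared error as the metric are all hypothetical.

```scala
import scala.util.Random

object TrainValidationSketch {
  type Row = (Double, Double) // (feature x, label y)

  /** Randomly assign each row to train (probability trainRatio) or validation. */
  def split(data: Seq[Row], trainRatio: Double, seed: Long): (Seq[Row], Seq[Row]) = {
    val rng = new Random(seed)
    data.partition(_ => rng.nextDouble() < trainRatio)
  }

  /** Toy "estimator": slope of a line through the origin, shrunk by an L2 penalty. */
  def fit(train: Seq[Row], lambda: Double): Double = {
    val sumXY = train.map { case (x, y) => x * y }.sum
    val sumXX = train.map { case (x, _) => x * x }.sum
    sumXY / (sumXX + lambda)
  }

  /** Toy "evaluator": negative mean squared error, so that larger is better. */
  def evaluate(slope: Double, validation: Seq[Row]): Double = {
    val mse = validation.map { case (x, y) => math.pow(slope * x - y, 2) }.sum / validation.size
    -mse
  }

  /** Fit once per candidate on the train split and keep the best validation score. */
  def selectBest(data: Seq[Row], lambdas: Seq[Double], trainRatio: Double, seed: Long): (Double, Double) = {
    require(lambdas.nonEmpty, "need at least one candidate")
    val (train, validation) = split(data, trainRatio, seed)
    require(train.nonEmpty && validation.nonEmpty, "both splits must be non-empty")
    val scored = lambdas.map(lambda => (lambda, evaluate(fit(train, lambda), validation)))
    scored.reduceLeft((best, next) => if (next._2 > best._2) next else best)
  }
}
```

Each candidate is fitted exactly once, which is why this is cheaper than k-fold cross-validation, where each candidate is fitted k times. On noise-free data such as `(1 to 100).map(i => (i.toDouble, 2.0 * i))`, the unpenalised candidate `lambda = 0.0` should be the one selected.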

SparkQA · 2015-06-24T21:00:11Z

Test build #35720 has finished for PR 6996 at commit dff51c7.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class CrossValidator(uid: String)
- class TrainValidationSplit(uid: String)

mengxr · 2015-06-24T22:06:41Z

@harsha2010 Do you have time to help review this PR? Thanks!

SparkQA · 2015-06-24T22:21:04Z

Test build #35721 has finished for PR 6996 at commit d033da4.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class CrossValidator(uid: String)
- class TrainValidationSplit(uid: String)

harsha2010 · 2015-06-25T14:59:28Z

mllib/src/main/scala/org/apache/spark/ml/tuning/Validation.scala

+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.types.StructType
+
+import scala.reflect.ClassTag


sort imports in the order specified in the style guide here:
https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Imports
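
As an illustration of the requested ordering, the three imports from the snippet above could be arranged as follows. This assumes the style guide's grouping of `java`, then `scala`, then third-party, then `org.apache.spark` imports, with a blank line between groups and alphabetical order within each group; it is a fragment of a Spark source file, not a standalone program.

```scala
// scala.* imports come before project (org.apache.spark) imports.
import scala.reflect.ClassTag

// Project imports, sorted alphabetically within the group.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType
```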

NAVER - http://www.naver.com/

Your mail <Re: [spark] [SPARK-8484] [ML]. Added TrainValidationSplit for hyper-parameter tuning. (#6996)> to sujkh@naver.com could not be delivered for the following reason:

The recipient has blocked your mail.

harsha2010 · 2015-06-25T17:30:36Z

mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala

+   * Default: 0.75
+   * @group param
+   */
+  val trainRatio: DoubleParam = new DoubleParam(this, "numFolds",


param name should be trainRatio instead of numFolds

SparkQA · 2015-06-25T18:26:20Z

Test build #35799 has finished for PR 6996 at commit ead6212.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class CrossValidator(uid: String)
- class TrainValidationSplit(uid: String)

SparkQA · 2015-06-25T18:41:42Z

Test build #35804 has finished for PR 6996 at commit 7992881.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class CrossValidator(uid: String)
- class TrainValidatorSplit(uid: String)

SparkQA · 2015-06-25T19:57:54Z

Test build #35805 has finished for PR 6996 at commit be64a13.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class CrossValidator(uid: String)
- class TrainValidatorSplit(uid: String)

mengxr · 2015-07-07T04:43:45Z

mllib/src/main/scala/org/apache/spark/ml/tuning/Validator.scala

+    logInfo(s"Average validation metrics: ${metrics.toSeq}")
+    val (bestMetric, bestIndex) = metrics.zipWithIndex.maxBy(_._1)
+    logInfo(s"Best set of parameters:\n${epm(bestIndex)}")
+    logInfo(s"Best cross-validation metric: $bestMetric.")


Same here. The "cross-validation" wording would also appear in the log output of TrainValidationSplit.
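
One way to address this is to keep the shared log message neutral about which validator produced it. The sketch below is plain Scala and only illustrative: the helper name `bestMetricMessage` and the wording "Best validation metric" are assumptions, and the wording actually adopted is not shown in this thread.

```scala
object BestMetricLogSketch {
  /** Build a validator-neutral message for the best metric and its parameters. */
  def bestMetricMessage(metrics: Seq[Double], paramDescriptions: Seq[String]): String = {
    require(metrics.nonEmpty && metrics.size == paramDescriptions.size,
      "need one description per metric")
    // Index of the largest metric (first one wins on ties).
    val bestIndex = metrics.indices.reduceLeft((i, j) => if (metrics(j) > metrics(i)) j else i)
    s"Best set of parameters:\n${paramDescriptions(bestIndex)}\n" +
      s"Best validation metric: ${metrics(bestIndex)}."
  }
}
```

Because the message no longer names a specific validator, it reads correctly whether it is emitted from CrossValidator or from TrainValidationSplit.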

mengxr · 2015-07-07T04:51:47Z

@zapletal-martin Sorry for my late comment! But this PR contains much more content than I expected. We should try to keep each PR minimal. For example, implementing TrainValidationSplit alone would make the PR much easier to review. Using ValidatorParams for both CrossValidator and TrainValidationSplit sounds good to me. But the Validator refactoring could be done in a follow-up PR, and the changes to MLUtils could be in a separate PR as well. Does it sound good to you?

zapletal-martin · 2015-07-07T19:18:14Z

@mengxr sure, sounds good. There was quite a lot of code duplication so I decided to refactor it. I will create a new PR just for TrainValidationSplit without the refactor and we can address the resulting duplication in a later PR.

mengxr · 2015-07-07T19:35:03Z

@zapletal-martin We can use ValidatorParams in the new PR, just without Validator refactoring and MLUtils changes. Thanks for your understanding!

- [X] Added TrainValidationSplit for hyper-parameter tuning. It randomly splits the input dataset into train and validation and use evaluation metric on the validation set to select the best model. It should be similar to CrossValidator, but simpler and less expensive.
- [X] Simplified replacement of #6996

Author: martinzapletal <zapletal-martin@email.cz>

Closes #7337 from zapletal-martin/SPARK-8484-TrainValidationSplit and squashes the following commits:

cafc949 [martinzapletal] Review comments #7337.
511b398 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-8484-TrainValidationSplit
f4fc9c4 [martinzapletal] SPARK-8484 Resolved feedback to #7337
00c4f5a [martinzapletal] SPARK-8484. Styling.
d699506 [martinzapletal] SPARK-8484. Styling.
93ed2ee [martinzapletal] Styling.
3bc1853 [martinzapletal] SPARK-8484. Styling.
2aa6f43 [martinzapletal] SPARK-8484. Added TrainValidationSplit for hyper-parameter tuning. It randomly splits the input dataset into train and validation and use evaluation metric on the validation set to select the best model.
21662eb [martinzapletal] SPARK-8484. Added TrainValidationSplit for hyper-parameter tuning. It randomly splits the input dataset into train and validation and use evaluation metric on the validation set to select the best model.

zapletal-martin added 2 commits June 24, 2015 16:36

SPARK-8484. Added TrainValidationSplit for hyper-parameter tuning. It randomly splits the input dataset into train and validation and use evaluation metric on the validation set to select the best model.

1161a2e

SPARK-8484. Naming.

dff51c7

SPARK-8484. Newlines.

d033da4

harsha2010 reviewed Jun 25, 2015

Import sorting.

ead6212

harsha2010 reviewed Jun 25, 2015

SPARK-8484. PR comments apache#6996

7992881

SPARK-8484. PR comments apache#6996

be64a13

mengxr reviewed Jul 7, 2015

zapletal-martin closed this Jul 10, 2015

zapletal-martin mentioned this pull request Jul 10, 2015

[SPARK-8484] [ML]. Added TrainValidationSplit for hyper-parameter tuning. #7337
