[SPARK-8484] [ML]. Added TrainValidationSplit for hyper-parameter tuning. #6996

Closed
wants to merge 6 commits into from

zapletal-martin
Contributor

  • Added TrainValidationSplit for hyper-parameter tuning. It randomly splits the input dataset into training and validation sets and uses an evaluation metric on the validation set to select the best model. It should be similar to CrossValidator, but simpler and less expensive.
  • Refactored CrossValidator so that both validators share code
  • External API of CrossValidator should stay unchanged
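
The selection procedure described above can be sketched in plain Scala without any Spark dependency. This is only an illustration under stated assumptions, not the code in this PR: the object name, `split`, `selectBest`, and the `fit`/`evaluate` function parameters are all hypothetical, and the split here shuffles and cuts at an exact size, whereas a per-row random split would only hit the ratio approximately.

```scala
import scala.util.Random

// Hypothetical, Spark-free sketch of train/validation model selection.
object TrainValidationSplitSketch {

  // Shuffle the rows with a fixed seed, then cut at trainRatio.
  // Assumption: an exact-size cut is used here for simplicity.
  def split[A](data: Seq[A], trainRatio: Double, seed: Long): (Seq[A], Seq[A]) = {
    require(trainRatio > 0.0 && trainRatio < 1.0, "trainRatio must be in (0, 1)")
    val shuffled = new Random(seed).shuffle(data.toVector)
    val trainSize = (shuffled.size * trainRatio).toInt
    shuffled.splitAt(trainSize)
  }

  // Fit every candidate parameter set once on the training part, score it on
  // the validation part, and return (index, metric) of the largest metric.
  def selectBest[A, P, M](
      data: Seq[A],
      candidates: Seq[P],
      trainRatio: Double,
      seed: Long)(fit: (Seq[A], P) => M)(evaluate: (M, Seq[A]) => Double): (Int, Double) = {
    require(candidates.nonEmpty, "at least one candidate is required")
    val (train, validation) = split(data, trainRatio, seed)
    val metrics = candidates.map(p => evaluate(fit(train, p), validation))
    val (bestMetric, bestIndex) = metrics.zipWithIndex.maxBy(_._1)
    (bestIndex, bestMetric)
  }
}
```

Each candidate is trained exactly once, which is where the cost saving relative to k-fold cross-validation comes from: k-fold trains every candidate k times.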

@SparkQA

SparkQA commented Jun 24, 2015

Test build #35720 has finished for PR 6996 at commit dff51c7.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class CrossValidator(uid: String)
    • class TrainValidationSplit(uid: String)

@mengxr
Contributor

mengxr commented Jun 24, 2015

@harsha2010 Do you have time to help review this PR? Thanks!

@SparkQA

SparkQA commented Jun 24, 2015

Test build #35721 has finished for PR 6996 at commit d033da4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class CrossValidator(uid: String)
    • class TrainValidationSplit(uid: String)

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType

import scala.reflect.ClassTag

* Default: 0.75
* @group param
*/
val trainRatio: DoubleParam = new DoubleParam(this, "numFolds",

param name should be trainRatio instead of numFolds

@SparkQA

SparkQA commented Jun 25, 2015

Test build #35799 has finished for PR 6996 at commit ead6212.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class CrossValidator(uid: String)
    • class TrainValidationSplit(uid: String)

@SparkQA

SparkQA commented Jun 25, 2015

Test build #35804 has finished for PR 6996 at commit 7992881.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class CrossValidator(uid: String)
    • class TrainValidatorSplit(uid: String)

@SparkQA

SparkQA commented Jun 25, 2015

Test build #35805 has finished for PR 6996 at commit be64a13.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class CrossValidator(uid: String)
    • class TrainValidatorSplit(uid: String)

logInfo(s"Average validation metrics: ${metrics.toSeq}")
val (bestMetric, bestIndex) = metrics.zipWithIndex.maxBy(_._1)
logInfo(s"Best set of parameters:\n${epm(bestIndex)}")
logInfo(s"Best cross-validation metric: $bestMetric.")
Same here: the log message says "cross-validation", and it would also appear under TrainValidationSplit.
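
One way to address this kind of comment is to have the shared selection code take the validator's name as an argument rather than hard-coding it in the log text. The sketch below is a hypothetical illustration, not the change made in this PR: `BestModelSelection`, `Selection`, and the `validatorLabel` parameter are assumed names, and the log line is returned as a string instead of being passed to `logInfo`.

```scala
// Hypothetical helper shared by both validators.
object BestModelSelection {

  // Outcome of picking the best parameter set from a sequence of metrics.
  final case class Selection(bestIndex: Int, bestMetric: Double, logLine: String)

  // validatorLabel is an assumed parameter: each caller passes its own name,
  // so the message never mentions the wrong validator.
  def select(metrics: Seq[Double], validatorLabel: String): Selection = {
    require(metrics.nonEmpty, "at least one metric is required")
    // Same selection rule as in the diff: the largest metric wins.
    val (bestMetric, bestIndex) = metrics.zipWithIndex.maxBy(_._1)
    Selection(bestIndex, bestMetric, s"Best $validatorLabel metric: $bestMetric.")
  }
}
```

A caller in CrossValidator would pass "cross-validation" and a caller in TrainValidationSplit would pass its own label, so the two validators can share the selection logic without sharing a misleading message.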

@mengxr
Contributor

mengxr commented Jul 7, 2015

@zapletal-martin Sorry for my late comment! But this PR contains much more content than I expected. We should try to keep each PR minimal. For example, implementing TrainValidationSplit alone would make the PR much easier to review. Using ValidatorParams for both CrossValidator and TrainValidationSplit sounds good to me. But the Validator refactoring could be done in a follow-up PR, and the changes to MLUtils could be in a separate PR as well. Does it sound good to you?

@zapletal-martin
Contributor Author

@mengxr sure, sounds good. There was quite a lot of code duplication so I decided to refactor it. I will create a new PR just for TrainValidationSplit without the refactor and we can address the resulting duplication in a later PR.

@mengxr
Contributor

mengxr commented Jul 7, 2015

@zapletal-martin We can use ValidatorParams in the new PR, just without Validator refactoring and MLUtils changes. Thanks for your understanding!

asfgit pushed a commit that referenced this pull request Jul 23, 2015
- [X] Added TrainValidationSplit for hyper-parameter tuning. It randomly splits the input dataset into train and validation and use evaluation metric on the validation set to select the best model. It should be similar to CrossValidator, but simpler and less expensive.
- [X] Simplified replacement of #6996

Author: martinzapletal <zapletal-martin@email.cz>

Closes #7337 from zapletal-martin/SPARK-8484-TrainValidationSplit and squashes the following commits:

cafc949 [martinzapletal] Review comments #7337.
511b398 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-8484-TrainValidationSplit
f4fc9c4 [martinzapletal] SPARK-8484 Resolved feedback to #7337
00c4f5a [martinzapletal] SPARK-8484. Styling.
d699506 [martinzapletal] SPARK-8484. Styling.
93ed2ee [martinzapletal] Styling.
3bc1853 [martinzapletal] SPARK-8484. Styling.
2aa6f43 [martinzapletal] SPARK-8484. Added TrainValidationSplit for hyper-parameter tuning. It randomly splits the input dataset into train and validation and use evaluation metric on the validation set to select the best model.
21662eb [martinzapletal] SPARK-8484. Added TrainValidationSplit for hyper-parameter tuning. It randomly splits the input dataset into train and validation and use evaluation metric on the validation set to select the best model.