[SPARK-8484] [ML]. Added TrainValidationSplit for hyper-parameter tuning. #6996
zapletal-martin commented Jun 24, 2015
- Added TrainValidationSplit for hyper-parameter tuning. It randomly splits the input dataset into train and validation sets and uses an evaluation metric on the validation set to select the best model. It should be similar to CrossValidator, but simpler and less expensive.
- Refactored CrossValidator to have both validators share code
- External API of CrossValidator should stay unchanged
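
The idea described in the list above can be illustrated without Spark. The following is only a minimal sketch in plain Scala, not the code from this PR: the object name `TrainValidationSplitSketch`, the `split` and `select` helpers, and the toy shrinkage model are all hypothetical, and the sketch assumes that a larger metric is better.

```scala
import scala.util.Random

// Hypothetical, Spark-free illustration; none of these names come from the PR.
object TrainValidationSplitSketch {

  /** Randomly splits `data` into (train, validation); `trainRatio` is the train fraction. */
  def split[A](data: Seq[A], trainRatio: Double, seed: Long): (Seq[A], Seq[A]) = {
    require(trainRatio > 0.0 && trainRatio < 1.0, "trainRatio must be in (0, 1)")
    val shuffled = new Random(seed).shuffle(data)
    val nTrain = math.round(shuffled.size * trainRatio).toInt
    shuffled.splitAt(nTrain)
  }

  /**
   * Fits one model per candidate parameter on the train split, scores each model
   * on the validation split, and returns the best parameter with its metric.
   * Assumption: a larger metric is better.
   */
  def select[A, P, M](data: Seq[A], paramGrid: Seq[P], trainRatio: Double, seed: Long)(
      fit: (Seq[A], P) => M)(
      evaluate: (M, Seq[A]) => Double): (P, Double) = {
    require(paramGrid.nonEmpty, "paramGrid must not be empty")
    val (train, validation) = split(data, trainRatio, seed)
    // A single train/validation pass per candidate, unlike k-fold cross-validation.
    val metrics = paramGrid.map(p => evaluate(fit(train, p), validation))
    val (bestMetric, bestIndex) = metrics.zipWithIndex.maxBy(_._1)
    (paramGrid(bestIndex), bestMetric)
  }

  def main(args: Array[String]): Unit = {
    // Toy problem: predict a constant; the hyper-parameter shrinks it toward zero.
    val data = Seq.fill(100)(2.0)
    val (bestLambda, bestMetric) =
      select(data, Seq(0.0, 10.0, 100.0), trainRatio = 0.75, seed = 42L) {
        (train, lambda) => train.sum / (train.size + lambda)
      } { (prediction, validation) =>
        val mse = validation.map(y => (y - prediction) * (y - prediction)).sum / validation.size
        -mse // negate so that a larger value means a better fit
      }
    println(s"Best lambda: $bestLambda, validation metric: $bestMetric")
  }
}
```

Each candidate is fitted and scored once, which is why this approach should cost less than k-fold cross-validation, at the price of a noisier estimate from a single validation set.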
Test build #35720 has finished for PR 6996 at commit
@harsha2010 Do you have time to help review this PR? Thanks!
Test build #35721 has finished for PR 6996 at commit
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType

import scala.reflect.ClassTag
sort imports in the order specified in the style guide here:
https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Imports
 * Default: 0.75
 * @group param
 */
val trainRatio: DoubleParam = new DoubleParam(this, "numFolds",
param name should be trainRatio instead of numFolds
Test build #35799 has finished for PR 6996 at commit
Test build #35804 has finished for PR 6996 at commit
Test build #35805 has finished for PR 6996 at commit
logInfo(s"Average validation metrics: ${metrics.toSeq}")
val (bestMetric, bestIndex) = metrics.zipWithIndex.maxBy(_._1)
logInfo(s"Best set of parameters:\n${epm(bestIndex)}")
logInfo(s"Best cross-validation metric: $bestMetric.")
Same here. "cross-validation" would appear under TrainValidationSplit.
@zapletal-martin Sorry for my late comment! But this PR contains much more content than I expected. We should try to keep each PR minimal. For example, implementing
@mengxr sure, sounds good. There was quite a lot of code duplication so I decided to refactor it. I will create a new PR just for TrainValidationSplit without the refactor and we can address the resulting duplication in a later PR.
@zapletal-martin We can use
- [X] Added TrainValidationSplit for hyper-parameter tuning. It randomly splits the input dataset into train and validation and use evaluation metric on the validation set to select the best model. It should be similar to CrossValidator, but simpler and less expensive.
- [X] Simplified replacement of #6996

Author: martinzapletal <zapletal-martin@email.cz>

Closes #7337 from zapletal-martin/SPARK-8484-TrainValidationSplit and squashes the following commits:

cafc949 [martinzapletal] Review comments #7337.
511b398 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-8484-TrainValidationSplit
f4fc9c4 [martinzapletal] SPARK-8484 Resolved feedback to #7337
00c4f5a [martinzapletal] SPARK-8484. Styling.
d699506 [martinzapletal] SPARK-8484. Styling.
93ed2ee [martinzapletal] Styling.
3bc1853 [martinzapletal] SPARK-8484. Styling.
2aa6f43 [martinzapletal] SPARK-8484. Added TrainValidationSplit for hyper-parameter tuning. It randomly splits the input dataset into train and validation and use evaluation metric on the validation set to select the best model.
21662eb [martinzapletal] SPARK-8484. Added TrainValidationSplit for hyper-parameter tuning. It randomly splits the input dataset into train and validation and use evaluation metric on the validation set to select the best model.