[SPARK-8484] [ML]. Added TrainValidationSplit for hyper-parameter tuning. #7337
zapletal-martin commented on Jul 10, 2015
- Added TrainValidationSplit for hyper-parameter tuning. It randomly splits the input dataset into train and validation sets and uses an evaluation metric on the validation set to select the best model. It should be similar to CrossValidator, but simpler and less expensive.
- Simplified replacement of #6996 ([SPARK-8484] [ML]. Added TrainValidationSplit for hyper-parameter tuning.)
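The selection procedure described above can be sketched in plain Scala without any Spark dependency. This is only an illustration under stated assumptions, not the code in this PR: the object `TrainValidationSketch`, the toy one-parameter model, and the `lambda` hyper-parameter are all hypothetical names chosen for the example.

```scala
import scala.util.Random

object TrainValidationSketch {
  // Hypothetical toy model: slope of a line through the origin, fitted by
  // least squares with an L2 penalty `lambda` (the hyper-parameter to tune).
  def fit(train: Seq[(Double, Double)], lambda: Double): Double = {
    val sxy = train.map { case (x, y) => x * y }.sum
    val sxx = train.map { case (x, _) => x * x }.sum
    sxy / (sxx + lambda)
  }

  // Evaluation metric: mean squared error (smaller is better).
  def mse(w: Double, data: Seq[(Double, Double)]): Double =
    data.map { case (x, y) => math.pow(y - w * x, 2) }.sum / data.size

  // Shuffle once, then cut at trainRatio: one train set, one validation set.
  def split[A](data: Seq[A], trainRatio: Double, seed: Long): (Seq[A], Seq[A]) = {
    val shuffled = new Random(seed).shuffle(data)
    shuffled.splitAt(math.round(data.size * trainRatio).toInt)
  }

  // Fit every candidate on the training part and keep the one with the
  // lowest error on the validation part.
  def selectBest(data: Seq[(Double, Double)], lambdas: Seq[Double],
                 trainRatio: Double, seed: Long): Double = {
    val (train, validation) = split(data, trainRatio, seed)
    lambdas.minBy(lambda => mse(fit(train, lambda), validation))
  }
}
```

Each candidate is fitted exactly once, which is where the "simpler and less expensive" comparison with k-fold cross-validation comes from.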
Test build #36991 has finished for PR 7337 at commit
Test build #36993 has finished for PR 7337 at commit
 * :: Experimental ::
 * Validation for hyper-parameter tuning.
 * Randomly splits the input dataset into train and validation sets.
 * And uses evaluation metric on the validation set to select the best model.
nit: "...validation sets, and uses ..." (comma instead of period)
Done for now. This looks like it's in good shape.
Actually, is there much difference between this and Unless I'm missing something, maybe we can simply extend |
@feynmanliang thanks for your comments. Yes there is quite a lot of duplicated code. I attempted to refactor that slightly under #6996. It is an interesting idea to just call CrossValidator instead of implementing the logic. I will have a look into that, it could simplify the code even a bit more. But I assume we still need the Params and Model specific for TrainValidationSplit? We also need to decide if we want to do that as part of this review or separately. cc @mengxr
Both CrossValidator and TrainValidationSplit use sampling to split the data into training and validation sets. Currently CrossValidator does
TrainValidationSplit does
Therefore the logic is different and using TrainValidationSplit is not the same as just calling CrossValidator. Please let me know if the logic implemented by TrainValidationSplit is what was expected. We can then potentially address the code duplication.
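The snippets from the comment above did not survive on this page. As a rough illustration of how a k-fold split and a single train/validation split differ, here is a plain-Scala sketch. It is not the Spark code under review; `SplitStrategies`, `kFold`, and `trainValidation` are hypothetical names, and the shuffle-then-slice approach is an assumption made to keep the example small.

```scala
import scala.util.Random

object SplitStrategies {
  // k-fold style: every element lands in the validation part of exactly one
  // fold, so this yields k (train, validation) pairs and k fits per candidate.
  def kFold[A](data: Seq[A], k: Int, seed: Long): Seq[(Seq[A], Seq[A])] = {
    val indexed = new Random(seed).shuffle(data).zipWithIndex
    (0 until k).map { fold =>
      val (validation, train) = indexed.partition { case (_, i) => i % k == fold }
      (train.map(_._1), validation.map(_._1))
    }
  }

  // Single random split: one (train, validation) pair and one fit per candidate.
  def trainValidation[A](data: Seq[A], trainRatio: Double, seed: Long): (Seq[A], Seq[A]) = {
    val shuffled = new Random(seed).shuffle(data)
    shuffled.splitAt(math.round(data.size * trainRatio).toInt)
  }
}
```

For example, `SplitStrategies.kFold(1 to 12, 4, 7L)` produces four pairs with nine training and three validation elements each, while `SplitStrategies.trainValidation(1 to 12, 0.75, 7L)` produces a single pair of the same sizes.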
If we give a Note that in this case we would have to change the default |
Sorry, didn't address your questions.
Thanks @feynmanliang. As I mentioned, I tried to address the code duplication in my previous PR differently than you propose, but we decided to go with the simplest option for now. I agree that what you are proposing makes sense. The only thing that worries me is the unclear purpose of trainRatio when numFolds != 1. In that case CrossValidator splits the dataset into numFolds subsets of the same size, so the ratio of training to validation data is already given (e.g. with numFolds set to 4 the training set is 0.75 and validation is 0.25) and therefore the trainRatio param would not be used? We could do that, document that approach and essentially get rid of TrainValidatorSplit, or the other option would be to preserve TrainValidatorSplit as a wrapper around the functionality to avoid the confusion of CrossValidator having the trainRatio param.
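The arithmetic in the example above (numFolds of 4 giving 0.75 and 0.25) can be written out as a small sketch. `FoldRatio` and `impliedTrainRatio` are hypothetical names, and the requirement of at least two folds is an assumption of this sketch rather than something stated in the thread.

```scala
object FoldRatio {
  // With numFolds equal-sized folds, one fold validates and the rest train,
  // so the training fraction is fixed by numFolds alone.
  def impliedTrainRatio(numFolds: Int): Double = {
    // Assumption for this sketch: fewer than two folds leaves nothing to train on.
    require(numFolds >= 2, "k-fold needs at least two folds")
    (numFolds - 1).toDouble / numFolds
  }
}
```

Because the ratio follows directly from numFolds, a separate trainRatio setting would have nothing left to control in that case.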
Ah, I overlooked your point about when I agree with you that |
Test build #37057 has finished for PR 7337 at commit
I think we need a decision on how to approach this. I would prefer to focus on the public API and avoid the refactor in this review, and then address that in another review as discussed in #6996.
/**
 * Params for [[TrainValidatorSplit]] and [[TrainValidatorSplitModel]].
 */
private[ml] trait TrainValidatorSplitParams extends ValidatorParams {
TrainValidatorSplit -> TrainValidationSplit
@feynmanliang @zapletal-martin The changes in this PR look good to me except a few minor comments. As discussed in #6996, let's focus on the public API to get this merged first. We can have another PR for code reuse. There would be more discussion, e.g., having a base class handling arbitrary slicing of the input data and making |
Test build #38015 has finished for PR 7337 at commit
LGTM. Merged into master. Please create JIRAs for follow-up work, e.g., Python API, user guide, and refactoring (if it is useful). Thanks!