
TrainTestSplit random seed #1635

Closed
petterton opened this issue Nov 15, 2018 · 7 comments
Labels: bug (Something isn't working)
petterton commented Nov 15, 2018

I am repeatedly calling TrainTestSplit for a data set (for cross validation) and see that the resulting split is the same every call. In sklearn, the train_test_split function has the possibility of taking a seed for a random number generator as an input. Could this be added also in ML.NET?

@najeeb-kazmi (Member)
Hi @petterton - in ML.NET, the seed is set at the environment level. You can set and change the seed when you create the MLContext object, as in this sample.
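To illustrate, a minimal sketch of setting the environment-level seed (the `seed` parameter is part of the MLContext constructor; the value here is a placeholder):

```csharp
using Microsoft.ML;

// Passing a seed makes the environment's random number generator
// deterministic; omitting it leaves the environment non-deterministic.
var mlContext = new MLContext(seed: 42);
```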

Does that answer your question?


petterton commented Nov 15, 2018

Hi @najeeb-kazmi , I wondered whether that was the case, but changing the seed when creating the MLContext did not change the split for me. This also raises another question: if I call TrainTestSplit multiple times, is every call expected to start from the same seed and produce the same split (which is what I see now), or do I have to recreate the MLContext for every split?

@najeeb-kazmi (Member)
Actually, I was wrong about the seed in MLContext. It does not affect the behavior of TrainTestSplit, whose deterministic behavior is implemented in TrainContextBase here.

@Zruty0 any thoughts on how we can get this in? Currently we use RangeFilter to produce the splits both for TrainTestSplit and for creating the CV folds in the CrossValidateTrain methods. Random splits of the data are a pretty common scenario and are supported by most other toolkits; we should find a way to support them in ML.NET as well.

@petterton if you are trying to get different splits for doing cross validation specifically, you can use the CrossValidate methods in all the training contexts, e.g. MLContext.BinaryClassification.CrossValidate().
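As a sketch of that cross-validation route (the `data` IDataView and `pipeline` estimator are placeholders built elsewhere; the result shape here assumes a recent ML.NET release):

```csharp
// Cross-validate a binary classification pipeline over 5 folds.
// `data` and `pipeline` are assumed to have been created earlier.
var cvResults = mlContext.BinaryClassification.CrossValidate(
    data, pipeline, numberOfFolds: 5);

// Inspect per-fold metrics.
foreach (var fold in cvResults)
    Console.WriteLine(fold.Metrics.Accuracy);
```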

@najeeb-kazmi najeeb-kazmi added the bug Something isn't working label Nov 16, 2018
Zruty0 (Contributor) commented Nov 18, 2018

Yep, it's a bug. We need to make TrainTestSplit take a random seed.

@petterton (Author)

@Zruty0 : Will this be fixed in 0.8? (Not assigned to a milestone yet...)

@petterton (Author)

@Ivanidzo4ka , @Zruty0 : I tested this in v0.9, and it still does not work as I expected. If I don't set a stratificationColumn, the seed works as expected (changing the seed changes the split), but if stratificationColumn is set, changing the seed seems to have no effect.
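For context, later ML.NET releases (1.x) renamed stratificationColumn to samplingKeyColumnName and exposed a per-call seed on the split itself; a sketch assuming those versions (the column name is a placeholder):

```csharp
// Split with both a sampling-key column and an explicit seed.
var split = mlContext.Data.TrainTestSplit(
    dataView,
    testFraction: 0.2,
    samplingKeyColumnName: "GroupId", // placeholder: rows sharing a key land on the same side
    seed: 123);                       // changing the seed changes the split

IDataView trainSet = split.TrainSet;
IDataView testSet = split.TestSet;
```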

dckorben commented Jan 29, 2019

The usage of seeds still seems unclear in v0.9. If the Context seed is provided, you are probably aiming for deterministic outcomes (e.g., for testing); if it isn't, you are probably doing real training. The Context seed doesn't seem to have any effect on the split. In some cases you might need to call split multiple times, in which case you would probably provide a seed. But if you load, split, and train with a null Context seed, shouldn't we expect a different split outcome each time?

@shauheen shauheen added this to the 0119 milestone Feb 6, 2019
@ghost ghost locked as resolved and limited conversation to collaborators Mar 26, 2022
Milestone: v0.9 (Done)
6 participants