docs: correct readme example #50
README.md
Outdated
@@ -38,10 +43,10 @@ sample_df = train_df.sample(frac=0.01, random_state=0)
 sample_df.sort_values("AvSigVersion", inplace=True)

 # define the validation scheme
-cv = KFold(n_splits=4, shuffle=False, random_state=0)
+cv = KFold(n_splits=4, shuffle=True, random_state=None)
Why did you change kfold params?
Maybe it changed with a new sklearn version. Anyway, you can set the random_state to None but shuffle should stay as False since it is doing a lazy time split.
Yep, it seems like the API has changed.
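For readers following the thread, here is a minimal sketch of the behaviour being discussed. It assumes a recent scikit-learn release and uses a toy array (hypothetical, not the competition data) in place of the rows sorted by `AvSigVersion`. With `shuffle=False`, `KFold` keeps the row order, so each validation fold is a contiguous block. As far as I know, recent releases also reject a fixed `random_state` when `shuffle=False`, while older ones may only warn, so the sketch handles both cases.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data standing in for rows already sorted by a time proxy (hypothetical).
X = np.arange(12).reshape(-1, 1)

# shuffle=False preserves row order: every validation fold is a contiguous block.
cv = KFold(n_splits=4, shuffle=False)
val_folds = [val_idx.tolist() for _, val_idx in cv.split(X)]
print(val_folds)

# Recent scikit-learn releases raise ValueError for this combination;
# older releases may accept it (possibly with a warning), so catch both outcomes.
try:
    KFold(n_splits=4, shuffle=False, random_state=0)
    random_state_rejected = False
except ValueError:
    random_state_rejected = True
print("random_state rejected with shuffle=False:", random_state_rejected)
```

If the second check prints `True`, dropping `random_state` and keeping `shuffle=False` is enough to keep the original fold layout.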
Maybe good to add a comment and make the example a bit more general:
sample_df.sort_values("AvSigVersion", inplace=True)
->
sample_df = sample_df.sample(frac=1) # Shuffling rows before CV
and
cv = KFold(n_splits=4, shuffle=False) # No shuffling to keep the same folds for each feature
I thought anyone could search for the competition name and get the data, but we can also share the link to the competition in the README. We can also add a comment that AvSigVersion is a proxy for time, because the data had no time column.
> And having a time split validation is maybe not the best readme example you want to show.
Why?
Not the most common use case.
OK, now that I have read more of the README, you are right: that section is called "Example on Kaggle's Microsoft Malware Prediction Competition".
So
sample_df.sort_values("AvSigVersion", inplace=True) # Sort by time for time split validation
cv = KFold(n_splits=4, shuffle=False) # Don't shuffle to keep the time split validation
Maybe good to add a generic example based on sklearn data like Iris.
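As a rough illustration of what the validation part of such a generic example might look like, here is a sketch on Iris. It covers only the cross-validation scheme, not LOFO itself. The model choice (`LogisticRegression`) and the scoring metric are my assumptions, not something from the README. Iris rows are ordered by class, so this is a case where `shuffle=True` is the sensible choice, unlike the time-split example above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Iris is a random-split problem: rows carry no time ordering.
X, y = load_iris(return_X_y=True)

# Rows are grouped by class, so shuffle before splitting;
# a fixed random_state keeps the folds reproducible across runs.
cv = KFold(n_splits=4, shuffle=True, random_state=0)

# Assumed model for illustration only.
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(scores.mean())
```

The same `cv` object could then be passed to whatever consumes the validation scheme, so that every evaluation sees identical folds.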
These comments should make it easier to understand 👍
LOFO is usually more useful for non-random split problems, but such an example could also be nice.
@KameniAlexNea can you please add the 2 comments above that @stephanecollot shared, since you are already updating the README? Then I can merge it.
I just pushed the changes.
The target attribute does not exist in the first code example.