docs: correct readme example #50

Merged
merged 4 commits into from Dec 8, 2022

Conversation

KameniAlexNea
Contributor

The target attribute does not exist in the first code example.

README.md Outdated
@@ -38,10 +43,10 @@ sample_df = train_df.sample(frac=0.01, random_state=0)
sample_df.sort_values("AvSigVersion", inplace=True)

# define the validation scheme
-cv = KFold(n_splits=4, shuffle=False, random_state=0)
+cv = KFold(n_splits=4, shuffle=True, random_state=None)
Owner

Why did you change the KFold params?

Contributor Author

image

We can't set shuffle to False together with random_state set to 0, as shown in this capture from my notebook.

Owner

Maybe it changed with a new sklearn version. Anyway, you can set the random_state to None but shuffle should stay as False since it is doing a lazy time split.
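
For readers following the thread, here is a minimal sketch of the behaviour being discussed. It assumes a recent scikit-learn release, where passing `random_state` together with `shuffle=False` raises a `ValueError`; older releases may only warn or accept it silently, so the `try` block is written to tolerate both. The variable names are illustrative only.

```python
from sklearn.model_selection import KFold

# Recent scikit-learn releases reject a seed when shuffle=False,
# because the seed would have no effect on the split.
try:
    KFold(n_splits=4, shuffle=False, random_state=0)
    rejected = False
except ValueError as err:
    rejected = True
    print(f"ValueError: {err}")

# Dropping random_state (it defaults to None) keeps the unshuffled split.
cv = KFold(n_splits=4, shuffle=False)
print(cv)
```

This is the variant the owner asks for: `shuffle` stays `False`, and only the seed is removed.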

Contributor Author

Yep, it seems like the API has changed.

Collaborator

Maybe good to add a comment and make the example a bit more general:

sample_df.sort_values("AvSigVersion", inplace=True)
->
sample_df = sample_df.sample(frac=1) # Shuffling rows before CV

and
cv = KFold(n_splits=4, shuffle=False) # No shuffling to keep the same folds for each feature
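
A small sketch of the idea in this suggestion, on toy data rather than the competition sample: shuffle the rows once up front, then use an unshuffled `KFold` so that every call to `split()` yields the same folds. The frame, its column names, and the seed passed to `sample` are assumptions made only so the sketch is repeatable.

```python
import pandas as pd
from sklearn.model_selection import KFold

# Toy data standing in for the real sample (hypothetical columns).
df = pd.DataFrame({"feature": range(10), "target": [0, 1] * 5})

# Shuffle the rows once, before cross-validation.
shuffled = df.sample(frac=1, random_state=0).reset_index(drop=True)

# No shuffling inside the splitter: the folds depend only on row order.
cv = KFold(n_splits=4, shuffle=False)

# Two independent passes over the same splitter give identical test folds.
first = [test.tolist() for _, test in cv.split(shuffled)]
second = [test.tolist() for _, test in cv.split(shuffled)]
same_folds = first == second
print(same_folds)
```

Because the folds are fixed, scores computed with and without a given feature are compared on the same train/test partitions.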

Owner

@aerdem4 aerdem4 Dec 6, 2022

I thought anyone could search for the competition name and get the data, but should we also share the link to the competition in the README? We can also add a comment that AvSigVersion is a proxy for time, because the data had no time column.

> And having a time split validation is maybe not the best readme example you want to show.

Why?

Collaborator

It is not the most common use case.
OK, now that I have read more of the README, you are right: that section is called "Example on Kaggle's Microsoft Malware Prediction Competition".
So:

sample_df.sort_values("AvSigVersion", inplace=True)  # Sort by time for time split validation

cv = KFold(n_splits=4, shuffle=False)  # Don't shuffle, to keep the time split validation
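
To make the effect of these two lines concrete, here is a hedged sketch on toy data. The `version` column is a hypothetical integer stand-in for `AvSigVersion`; the real column and its ordering are not reproduced here. After sorting, each test fold of an unshuffled `KFold` covers one contiguous block of the time proxy.

```python
import pandas as pd
from sklearn.model_selection import KFold

# Toy stand-in for the sample; "version" plays the role of the time proxy.
toy_df = pd.DataFrame({"version": [5, 1, 7, 3, 0, 6, 2, 4], "value": range(8)})

# Sort by the time proxy, as in the README example.
toy_df = toy_df.sort_values("version").reset_index(drop=True)

# Without shuffling, each test fold is a contiguous slice of the sorted rows.
cv = KFold(n_splits=4, shuffle=False)
fold_ranges = []
for _, test_idx in cv.split(toy_df):
    versions = toy_df.loc[test_idx, "version"]
    fold_ranges.append((int(versions.min()), int(versions.max())))

print(fold_ranges)
```

Note that the training part of each fold is simply every row outside the test block, so the earlier folds still train on later rows. That is consistent with the "lazy time split" wording used earlier in the thread, as opposed to a strictly forward-in-time split.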

Maybe good to add a generic example based on sklearn data like Iris.
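
One possible shape for such a generic example, sketched with scikit-learn only. This is a hand-rolled illustration of the leave-one-feature-out idea and not the library's own API; the helper `mean_cv_score`, the choice of `LogisticRegression`, and the accuracy metric are all assumptions made for the sketch.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

iris = load_iris(as_frame=True)
X, y = iris.data, iris.target

# Iris rows are ordered by class, so this random-split example does shuffle.
# A fixed seed keeps the folds identical across every evaluation below.
cv = KFold(n_splits=4, shuffle=True, random_state=0)
model = LogisticRegression(max_iter=1000)


def mean_cv_score(features):
    """Mean cross-validated accuracy using only the given feature columns."""
    return cross_val_score(model, X[features], y, cv=cv, scoring="accuracy").mean()


all_features = list(X.columns)
baseline = mean_cv_score(all_features)

# Importance of a feature: baseline score minus the score without that feature.
importance = {
    feature: baseline - mean_cv_score([f for f in all_features if f != feature])
    for feature in all_features
}

for feature, drop in sorted(importance.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{feature}: {drop:+.4f}")
```

A positive value means the score got worse when the feature was left out; values near zero or below suggest the feature adds little under this model and split.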

Owner

These comments should make it easier to understand 👍

LOFO is usually more useful for non-random split problems, but such an example could also be nice.

Owner

@KameniAlexNea can you please add the 2 comments above that @stephanecollot shared, since you are already updating the README? Then I can merge it.

Contributor Author

I just pushed the change.

@aerdem4 aerdem4 merged commit c735439 into aerdem4:master Dec 8, 2022
@KameniAlexNea KameniAlexNea deleted the alex branch December 8, 2022 18:50
3 participants