Always shuffle data in default automl data split strategies #1265
Conversation
Codecov Report
@@ Coverage Diff @@
##             main    #1265   +/- ##
=======================================
  Coverage   99.93%   99.93%
=======================================
  Files         207      208      +1
  Lines       13157    13211     +54
=======================================
+ Hits        13149    13203     +54
  Misses          8        8
Continue to review full report at Codecov.
@dsherry This looks good to me!
# if shuffle is disabled, the mean value learned on each CV fold's training data will be
# incredibly inaccurate, thus yielding an R^2 well below 0.

n = 100000
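The failure mode described in that comment can be reproduced outside EvalML with plain scikit-learn (a sketch, not the PR's actual test; using DummyRegressor as a stand-in estimator is my assumption):

```python
# Sketch: why an unshuffled KFold split on data ordered by target yields
# R^2 well below 0. Each fold's training mean is far from the values in
# the held-out contiguous block.
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import KFold, cross_val_score

n = 100000
X = np.zeros((n, 1))           # features carry no signal; the model learns the mean
y = np.arange(n, dtype=float)  # target is ordered by row index

unshuffled = cross_val_score(
    DummyRegressor(strategy="mean"), X, y,
    cv=KFold(n_splits=3, shuffle=False), scoring="r2")
shuffled = cross_val_score(
    DummyRegressor(strategy="mean"), X, y,
    cv=KFold(n_splits=3, shuffle=True, random_state=0), scoring="r2")

print(unshuffled.mean())  # far below 0: each fold's training mean misses the test block
print(shuffled.mean())    # close to 0: training mean approximates the global mean
```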
If this test doesn't take too long it might be worth running it with n=1000 so that the KFold CV is used?
Ah shoot, you're right. I'll track this down. I upped the number because I wasn't reproducing it as reliably at lower values.
@@ -645,6 +645,32 @@ def generate_fake_dataset(rows):
     assert automl.data_split.test_size == (automl._LARGE_DATA_PERCENT_VALIDATION)


+def test_data_split_shuffle():
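A minimal sketch of what such a test could verify (illustrative only; the PR's actual test body is not shown in this excerpt):

```python
# Hypothetical sketch: with shuffling on, KFold's test folds should not be
# the contiguous index blocks an unshuffled split would produce.
import numpy as np
from sklearn.model_selection import KFold

def test_data_split_shuffle():
    X = np.zeros((100, 1))
    shuffled_folds = [set(test) for _, test in
                      KFold(n_splits=3, shuffle=True, random_state=0).split(X)]
    contiguous_folds = [set(test) for _, test in
                        KFold(n_splits=3, shuffle=False).split(X)]
    # at least one shuffled test fold must differ from the contiguous blocks
    assert shuffled_folds != contiguous_folds

test_data_split_shuffle()
```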
great test!
LGTM
@@ -5,7 +5,7 @@
 class TrainingValidationSplit(BaseCrossValidator):
     """Split the training data into training and validation sets"""

-    def __init__(self, test_size=None, train_size=None, shuffle=True, stratify=None, random_state=0):
+    def __init__(self, test_size=None, train_size=None, shuffle=False, stratify=None, random_state=0):
Should this be shuffle=True?
@gsheni I updated this to be the same as the other sklearn-defined CV methods, which have shuffle=False by default. Then, in the automl code which calls this, I set shuffle=True. I just did it to be consistent with sklearn.
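For reference, the sklearn convention being matched here can be checked directly (a quick sketch, not part of the PR):

```python
# sklearn's built-in CV splitters default to shuffle=False; callers opt in.
from sklearn.model_selection import KFold, StratifiedKFold

print(KFold().shuffle, StratifiedKFold().shuffle)  # → False False
```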
Great, LGTM!
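Putting the thread together, a holdout splitter following this convention might look like the following sketch (hypothetical code, not EvalML's implementation; the class name and the commented automl call site are illustrative):

```python
# Hypothetical sketch: a train/validation holdout splitter that mirrors
# sklearn CV conventions, defaulting shuffle=False and letting the caller
# (here, the automl code) opt in to shuffling.
import numpy as np
from sklearn.model_selection import train_test_split

class SimpleTrainingValidationSplit:
    def __init__(self, test_size=None, train_size=None, shuffle=False,
                 stratify=None, random_state=0):
        self.test_size = test_size
        self.train_size = train_size
        self.shuffle = shuffle
        self.stratify = stratify
        self.random_state = random_state

    def split(self, X, y=None):
        # yields a single (train, validation) index pair, like a 1-split CV
        train_idx, test_idx = train_test_split(
            np.arange(len(X)),
            test_size=self.test_size,
            train_size=self.train_size,
            shuffle=self.shuffle,
            stratify=self.stratify,
            random_state=self.random_state if self.shuffle else None,
        )
        yield train_idx, test_idx

# Automl-side call sites would then opt in explicitly:
# splitter = SimpleTrainingValidationSplit(shuffle=True, random_state=0)
```

With shuffle=False the split is sequential (first rows train, last rows validate), matching train_test_split's own behavior when shuffling is disabled.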
Force-pushed from e3887a4 to 935bb1c
Fix #1259