Fix bug with TrainingValidationSplit and custom index #1348
Conversation
Codecov Report
@@ Coverage Diff @@
## main #1348 +/- ##
==========================================
+ Coverage 99.95% 99.95% +0.01%
==========================================
Files 213 213
Lines 13632 13642 +10
==========================================
+ Hits 13625 13635 +10
Misses 7 7
Continue to review full report at Codecov.
Force-pushed 0385dde to 192f254
Looks good to me. Are there any side effects of using a numpy array instead of pandas?
@jeremyliweishih Not that I'm aware of!
LGTM!
splitter = TrainingValidationSplit(train_size=0.75, shuffle=True, random_state=random_state)
splits = splitter.split(X, y=None)
assert np.all(np.logical_and(splits[0][0] < N, splits[0][0] >= 0))
assert np.all(np.logical_and(splits[0][1] < N, splits[0][1] >= 0))
@freddyaboulton I see you defined a custom index here. That makes sense. Shouldn't we check that the output still has the custom indices?
np.testing.assert_equal(20000 + np.arange(0, round(int(0.75*N))), splits[0][0].index)
np.testing.assert_equal(20000 + np.arange(round(int(0.75*N)), N), splits[0][1].index)
Our splitters don't split the data; they just provide the indices at which the data should be split. But after we call .iloc with those indices, the custom index is preserved. Should I add that to the test?
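To illustrate the point above (a minimal standalone sketch, not code from this PR): even though the split is computed over row positions, calling `.iloc` with those positions yields frames that still carry the custom index.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy illustration: the split is computed over row positions (0..N-1),
# but .iloc maps those positions back onto whatever index X carries.
N = 8
X = pd.DataFrame({"col1": np.arange(N)}, index=np.arange(100, 100 + N))

# shuffle=False keeps the example deterministic
train, test = train_test_split(np.arange(N), train_size=0.75, shuffle=False)

X_train = X.iloc[train]
custom_index = list(X_train.index)  # custom labels survive the positional split
```

So the splitter itself never needs to know about the custom index; the `.iloc` call downstream restores it.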
Ah, you're right, never mind!
If I were going to add another unit test for this, it'd be an automl test which runs with max_pipelines=2 and a dataset with a custom index, mocks pipeline fit, and makes sure that the data handed to the pipelines is exactly what we expect it to be. Not required for this PR though.
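A sketch of that test pattern using stand-in names rather than the real automl API (`ToyPipeline` and `toy_search` here are hypothetical placeholders): mock the pipeline's fit and assert that the data it receives still carries the custom index.

```python
import numpy as np
import pandas as pd
from unittest import mock

# Stand-ins for the real pipeline/search machinery, just to show the
# mocking pattern described above.
class ToyPipeline:
    def fit(self, X, y=None):
        pass

def toy_search(pipeline, X, train_size=0.75):
    # Split positionally, then hand each side to the pipeline via .iloc
    n_train = int(train_size * len(X))
    pipeline.fit(X.iloc[:n_train])   # training split
    pipeline.fit(X.iloc[n_train:])   # validation split

N = 8
X = pd.DataFrame({"col1": np.arange(N)}, index=np.arange(100, 100 + N))

with mock.patch.object(ToyPipeline, "fit") as mock_fit:
    toy_search(ToyPipeline(), X)
    # Record the index of every frame handed to fit
    seen_indices = [list(call.args[0].index) for call in mock_fit.call_args_list]
```

The assertion would then check `seen_indices` against the expected slices of the custom index, exactly as the comment above suggests.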
@freddyaboulton nice! Thanks for chasing this down. I left one testing request.
@@ -41,5 +42,5 @@ def split(self, X, y=None):
        Returns:
            list: indices to split data into training and test set
        """
-        train, test = train_test_split(X.index, test_size=self.test_size, train_size=self.train_size, shuffle=self.shuffle, stratify=self.stratify, random_state=self.random_state)
+        train, test = train_test_split(np.arange(X.shape[0]), test_size=self.test_size, train_size=self.train_size, shuffle=self.shuffle, stratify=self.stratify, random_state=self.random_state)
Ok, yep, I think this makes sense and aligns with sklearn's behavior. The following (imports added for completeness; random_state dropped since it has no effect without shuffle):

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

N = 11000
X = pd.DataFrame({'col1': np.arange(0, N)}, index=np.arange(20000, 20000 + N))
sk_splits = list(KFold(2).split(X))
sk_splits[0]

outputs splits generated in terms of the row number, ignoring the custom index:

(array([ 5500,  5501,  5502, ..., 10997, 10998, 10999]),
 array([   0,    1,    2, ..., 5497, 5498, 5499]))

and therefore our automl search code, which uses iloc to make the dataframes for each split, will work correctly.

For the TrainingValidationSplit, using np.arange(X.shape[0]) as input to train_test_split as you've done here produces the same behavior: the result is the row number, completely ignoring the custom index, which is appropriate for iloc.
Great fix!
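The same check can be run against the fixed behavior directly. A sketch using train_test_split on np.arange(X.shape[0]), mirroring the patched line (shuffle=False is used here only to make the expected ranges deterministic):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# With np.arange(X.shape[0]) as input, the returned splits are row
# positions (0..N-1), regardless of the DataFrame's custom index.
N = 11000
X = pd.DataFrame({"col1": np.arange(N)}, index=np.arange(20000, 20000 + N))

train, test = train_test_split(np.arange(X.shape[0]), train_size=0.75, shuffle=False)

positional_range = (int(train.min()), int(train.max()))  # positional, not 20000-based
validation_index = list(X.iloc[test].index)              # custom labels reappear after .iloc
```

With train_size=0.75 and N=11000, the train split is the first 8250 row positions, so the validation frame's custom index starts at 20000 + 8250 = 28250.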
Force-pushed 192f254 to 800831d
Pull Request Description
Fixes #1126
After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:`123`.