
Fix bug with TrainingValidationSplit and custom index #1348

Merged Oct 27, 2020 (3 commits)

Conversation

freddyaboulton
Contributor

Pull Request Description

Fixes #1126



@codecov

codecov bot commented Oct 26, 2020

Codecov Report

Merging #1348 into main will increase coverage by 0.01%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1348      +/-   ##
==========================================
+ Coverage   99.95%   99.95%   +0.01%     
==========================================
  Files         213      213              
  Lines       13632    13642      +10     
==========================================
+ Hits        13625    13635      +10     
  Misses          7        7              
Impacted Files Coverage Δ
...automl/data_splitters/training_validation_split.py 100.00% <100.00%> (ø)
...sts/automl_tests/test_training_validation_split.py 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d757545...800831d.

Collaborator

@jeremyliweishih jeremyliweishih left a comment


Looks good to me. Are there any side effects of using a numpy array instead of pandas?

@freddyaboulton
Contributor Author

@jeremyliweishih Not that I'm aware of! iloc works with pandas, numpy, and lists, and I believe the sklearn splitters return numpy arrays anyway.
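To illustrate the point, a minimal sketch with made-up data: `.iloc` takes row positions from a numpy array (or a list), independent of the frame's index labels, and the custom index survives the selection:

```python
import numpy as np
import pandas as pd

# DataFrame with a custom (non-zero-based) index
X = pd.DataFrame({"col1": range(5)}, index=np.arange(100, 105))

positions = np.array([0, 2, 4])   # row numbers, not index labels
subset = X.iloc[positions]        # numpy arrays and lists both work here

print(list(subset.index))         # the custom labels 100, 102, 104 are preserved
```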

Contributor

@bchen1116 bchen1116 left a comment


LGTM!

splitter = TrainingValidationSplit(train_size=0.75, shuffle=True, random_state=random_state)
splits = splitter.split(X, y=None)
assert np.all(np.logical_and(splits[0][0] < N, splits[0][0] >= 0))
assert np.all(np.logical_and(splits[0][1] < N, splits[0][1] >= 0))
Contributor


@freddyaboulton I see you defined a custom index here. That makes sense. Shouldn't we check that the output still has the custom indices?

np.testing.assert_equal(20000 + np.arange(0, round(int(0.75*N))), splits[0][0].index)
np.testing.assert_equal(20000 + np.arange(round(int(0.75*N)), N), splits[0][1].index)

Contributor Author


Our splitters don't split the data; they just provide the indices at which the data should be split. But after we call .iloc on these indices, the custom index will be preserved. Should I add that to the test?
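A small sketch of that two-step flow on hypothetical data, using sklearn's `train_test_split` directly to stand in for the splitter: the split yields row positions only, and `.iloc` on those positions keeps the custom index intact:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

N = 8
X = pd.DataFrame({"col1": np.arange(N)}, index=np.arange(20000, 20000 + N))

# The splitter only returns row positions (0..N-1)...
train_idx, test_idx = train_test_split(np.arange(N), train_size=0.75, shuffle=False)

# ...and selecting with .iloc on those positions preserves the custom index.
X_train = X.iloc[train_idx]
print(list(X_train.index))  # labels starting at 20000, not 0
```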

Contributor


Ah, you're right, never mind!

Contributor


If I were going to add another unit test for this, it would be an automl test that runs with max_pipelines=2 on a dataset with a custom index, mocks pipeline fit, and verifies that the data handed to the pipelines is exactly what we expect. Not required for this PR, though.

Contributor

@dsherry dsherry left a comment


@freddyaboulton nice! Thanks for chasing this down. I left one testing request.

@@ -41,5 +42,5 @@ def split(self, X, y=None):
Returns:
list: indices to split data into training and test set
"""
train, test = train_test_split(X.index, test_size=self.test_size, train_size=self.train_size, shuffle=self.shuffle, stratify=self.stratify, random_state=self.random_state)
train, test = train_test_split(np.arange(X.shape[0]), test_size=self.test_size, train_size=self.train_size, shuffle=self.shuffle, stratify=self.stratify, random_state=self.random_state)
Contributor


Ok, yep, I think this makes sense and aligns with sklearn's behavior.

The following

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

N = 11000
X = pd.DataFrame({'col1': np.arange(0, N)}, index=np.arange(20000, 20000 + N))
sk_splits = list(KFold(2, random_state=0).split(X))
sk_splits[0]

outputs splits generated in terms of the row number, ignoring the custom index:

(array([ 5500,  5501,  5502, ..., 10997, 10998, 10999]),
 array([   0,    1,    2, ..., 5497, 5498, 5499]))

and therefore our automl search code, which uses iloc to build the dataframes for each split, will work correctly.

For the TrainingValidationSplit, using np.arange(X.shape[0]) as input to train_test_split as you've done here produces the same behavior: the result is the row number, and completely ignores the custom index, which is appropriate for iloc.
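To make the before/after concrete, here is a hedged sketch (made-up data, with sklearn's `train_test_split` called directly as the splitter does) contrasting the old and new behavior:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

N = 10
X = pd.DataFrame({"col1": np.arange(N)}, index=np.arange(20000, 20000 + N))

# Old behavior: splitting X.index hands back custom labels like 20000,
# which are out of bounds when later used as positions with .iloc.
bad_train, _ = train_test_split(X.index, train_size=0.75, shuffle=False)
try:
    X.iloc[bad_train]
    raised = False
except IndexError:  # "positional indexers are out-of-bounds"
    raised = True

# Fixed behavior: splitting row numbers keeps every value in [0, N),
# so .iloc is valid regardless of the frame's index.
train, test = train_test_split(np.arange(X.shape[0]), train_size=0.75, shuffle=False)
```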

Great fix!

@freddyaboulton freddyaboulton deleted the 1126-index-error-large-dataset branch June 3, 2021 14:03
Successfully merging this pull request may close these issues.

IndexError: positional indexers are out-of-bounds
4 participants