Fix bug with TrainingValidationSplit and custom index #1348

Merged
merged 3 commits into main from 1126-index-error-large-dataset on Oct 27, 2020

Conversation

freddyaboulton (Contributor)

Pull Request Description

Fixes #1126


After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:`123`.

codecov bot commented Oct 26, 2020

Codecov Report

Merging #1348 into main will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #1348      +/-   ##
==========================================
+ Coverage   99.95%   99.95%   +0.01%     
==========================================
  Files         213      213              
  Lines       13632    13642      +10     
==========================================
+ Hits        13625    13635      +10     
  Misses          7        7              
Impacted Files                                            Coverage            Δ
...automl/data_splitters/training_validation_split.py    100.00% <100.00%>   (ø)
...sts/automl_tests/test_training_validation_split.py    100.00% <100.00%>   (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d757545...800831d

jeremyliweishih (Contributor) left a comment

Looks good to me. Are there any side effects of using a numpy array instead of pandas?

@freddyaboulton (Contributor Author)

@jeremyliweishih Not that I'm aware of! iloc works with pandas, numpy, and lists, and I believe the sklearn splitters return numpy arrays anyway.
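
A quick illustration of that point (made up here, not from the thread): iloc accepts row positions from any of these container types, even when the DataFrame has a custom index.

import numpy as np
import pandas as pd

X = pd.DataFrame({'col1': range(5)}, index=np.arange(20000, 20005))  # custom index
positions = [0, 2, 4]
# A Python list, a numpy array, and a pandas Index all select the same rows:
assert X.iloc[positions].equals(X.iloc[np.array(positions)])
assert X.iloc[positions].equals(X.iloc[pd.Index(positions)])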

bchen1116 (Contributor) left a comment

LGTM!

splitter = TrainingValidationSplit(train_size=0.75, shuffle=True, random_state=random_state)
splits = splitter.split(X, y=None)
assert np.all(np.logical_and(splits[0][0] < N, splits[0][0] >= 0))
assert np.all(np.logical_and(splits[0][1] < N, splits[0][1] >= 0))
Contributor

@freddyaboulton I see you defined a custom index here. That makes sense. Shouldn't we check that the output still has the custom indices?

np.testing.assert_equal(20000 + np.arange(0, round(int(0.75*N))), splits[0][0].index)
np.testing.assert_equal(20000 + np.arange(round(int(0.75*N)), N), splits[0][1].index)

Contributor Author (freddyaboulton)

Our splitters don't split the data; they just provide the indices at which the data should be split. But after we call .iloc on these indices, the custom index will be preserved. Should I add that to the test?
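
A small sketch of this (made-up data, not from the PR): the splitter hands back row positions, .iloc consumes them, and the selected rows keep their custom index labels.

import numpy as np
import pandas as pd

N = 10
X = pd.DataFrame({'col1': np.arange(N)}, index=np.arange(20000, 20000 + N))
train_positions = np.arange(7)        # what a splitter returns: row positions
X_train = X.iloc[train_positions]     # selection is positional...
assert list(X_train.index) == list(range(20000, 20007))  # ...but the labels survive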

Contributor

Ah, you're right, never mind!

Contributor

If I were gonna add another unit test concerning this stuff, it'd be an automl test which runs with max_pipelines=2 and a dataset with a custom index, mocks pipeline fit and makes sure that the data handed to the pipelines is exactly what we expect it to be. Not required for this PR though.
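
A rough sketch of the test described above (not part of this PR; the AutoMLSearch signature, the max_pipelines parameter, and the patch target are assumptions about evalml's API at the time, and a real test would likely also stub out pipeline scoring):

from unittest.mock import patch

import numpy as np
import pandas as pd

from evalml.automl import AutoMLSearch  # API details assumed, not verified

def test_pipelines_receive_data_with_custom_index():
    N = 100
    X = pd.DataFrame({'col1': np.arange(N)}, index=np.arange(20000, 20000 + N))
    y = pd.Series([0, 1] * (N // 2), index=X.index)
    # Stub out fit so no real training happens (patch target is an assumption).
    with patch('evalml.pipelines.PipelineBase.fit') as mock_fit:
        automl = AutoMLSearch(problem_type='binary', max_pipelines=2)
        automl.search(X, y)
    # Each X handed to a pipeline should be a positional subset of the original,
    # so its index labels must come from the custom index.
    for call in mock_fit.call_args_list:
        X_passed = call[0][0]
        assert set(X_passed.index).issubset(set(X.index))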

dsherry (Contributor) left a comment

@freddyaboulton nice! Thanks for chasing this down. I left one testing request.

@@ -41,5 +42,5 @@ def split(self, X, y=None):
         Returns:
             list: indices to split data into training and test set
         """
-        train, test = train_test_split(X.index, test_size=self.test_size, train_size=self.train_size, shuffle=self.shuffle, stratify=self.stratify, random_state=self.random_state)
+        train, test = train_test_split(np.arange(X.shape[0]), test_size=self.test_size, train_size=self.train_size, shuffle=self.shuffle, stratify=self.stratify, random_state=self.random_state)

Contributor

Ok, yep, I think this makes sense and aligns with sklearn's behavior.

The following

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

N = 11000
X = pd.DataFrame({'col1': np.arange(0, N)}, index=np.arange(20000, 20000 + N))
sk_splits = list(KFold(n_splits=2).split(X))  # random_state only applies when shuffle=True
sk_splits[0]

outputs splits generated in terms of the row number, ignoring the custom index:

(array([ 5500,  5501,  5502, ..., 10997, 10998, 10999]),
 array([   0,    1,    2, ..., 5497, 5498, 5499]))

and therefore our automl search code, which uses iloc to build the dataframes for each split, will work correctly.

For the TrainingValidationSplit, using np.arange(X.shape[0]) as input to train_test_split as you've done here produces the same behavior: the result is the row number, and completely ignores the custom index, which is appropriate for iloc.

Great fix!
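
For context, a minimal reproduction (made up here, not from the PR) of the failure the diff above fixes: with a non-default index, X.index holds labels, and feeding labels to iloc raises the IndexError reported in #1126.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

N = 1000
X = pd.DataFrame({'col1': np.arange(N)}, index=np.arange(20000, 20000 + N))

# Old behavior: splitting X.index yields index *labels* (20000, 20001, ...),
# which are out of range when treated as positions.
train_labels, _ = train_test_split(X.index, train_size=0.75, shuffle=False)
try:
    X.iloc[train_labels]
except IndexError:
    print('IndexError: positional indexers are out-of-bounds')

# New behavior: splitting np.arange(X.shape[0]) yields positions 0..N-1,
# which are valid for iloc no matter what the index labels are.
train_pos, _ = train_test_split(np.arange(X.shape[0]), train_size=0.75, shuffle=False)
X.iloc[train_pos]  # works, and preserves the custom index on the result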

freddyaboulton deleted the 1126-index-error-large-dataset branch on June 3, 2021.
Successfully merging this pull request may close these issues:

IndexError: positional indexers are out-of-bounds (#1126)