Large Data Training Validation Split #877

Merged: eccabay merged 27 commits into master from 840_large_data on Jun 29, 2020
Conversation

@eccabay (Contributor) commented Jun 24, 2020

Closes #840 by moving the cross validation argument in automl to a more generic data splitting argument that can handle a single training/validation split, to which it defaults if there are more than 100k rows in the training dataset.
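The default-splitter behavior described above can be sketched in plain Python. This is an illustrative stand-in, not evalml's actual API: the function name `make_data_split` and the 100k-row threshold constant are hypothetical names for the logic the PR describes (honor an explicit splitter, otherwise fall back to a single train/validation split for large data and cross-validation for smaller data).

```python
# Hypothetical sketch of the splitter-selection logic described in the PR.
_LARGE_DATA_ROW_THRESHOLD = 100_000

def make_data_split(n_rows, user_splitter=None):
    """Pick a data splitter: honor an explicit user choice, otherwise
    use a single train/validation split above the row threshold and
    cross-validation below it."""
    if user_splitter is not None:
        return user_splitter
    if n_rows > _LARGE_DATA_ROW_THRESHOLD:
        return "TrainingValidationSplit"   # single train/validation split
    return "StratifiedKFold"               # the existing CV default

print(make_data_split(150_000))  # TrainingValidationSplit
print(make_data_split(5_000))    # StratifiedKFold
```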

@codecov (bot) commented Jun 24, 2020

Codecov Report

Merging #877 into master will increase coverage by 0.01%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #877      +/-   ##
==========================================
+ Coverage   99.77%   99.78%   +0.01%     
==========================================
  Files         193      195       +2     
  Lines        8881     8950      +69     
==========================================
+ Hits         8861     8931      +70     
+ Misses         20       19       -1     
Impacted Files Coverage Δ
evalml/automl/__init__.py 100.00% <100.00%> (ø)
evalml/automl/automl_search.py 98.73% <100.00%> (+0.28%) ⬆️
evalml/automl/data_splitters/__init__.py 100.00% <100.00%> (ø)
...automl/data_splitters/training_validation_split.py 100.00% <100.00%> (ø)
evalml/tests/automl_tests/test_automl.py 100.00% <100.00%> (ø)
.../automl_tests/test_automl_search_classification.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 96aecd8...c5fcb43. Read the comment docs.

@eccabay eccabay marked this pull request as ready for review June 24, 2020 17:08
@eccabay eccabay requested review from angela97lin and dsherry June 24, 2020 17:08
@dsherry (Contributor) commented Jun 24, 2020

@eccabay cool, I'm excited to look at this!

I see there's one test failing right now, evalml.tests.automl_tests.test_auto_classification_search::test_categorical_classification:

ValueError: Cannot use median strategy with non-numeric data: could not convert string to float: 'Braund, Mr. Owen Harris'

I skimmed the PR and there's no obvious reason why this would be failing given your changes. My hunch is that this is happening because you currently have shuffle=True in the StratifiedKFold, which changed the order of calls to the random number generator such that median is now being recommended for the SimpleImputer. @angela97lin and I were just talking about this in slack yesterday. FYI I just filed this as a bug in #881
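The hunch above — that an extra shuffle perturbs every later consumer of the same random number generator — can be demonstrated with a minimal stdlib sketch (no sklearn or evalml involved; `downstream_choice` is a hypothetical name):

```python
import random

def downstream_choice(seed, extra_shuffle):
    """Show how an unrelated RNG call (analogous to shuffle=True in a
    CV splitter) changes the values that later consumers draw."""
    rng = random.Random(seed)
    if extra_shuffle:
        data = list(range(10))
        rng.shuffle(data)  # consumes random state, like StratifiedKFold(shuffle=True)
    # a later component (e.g. a tuner picking an imputer strategy)
    # draws from the same generator
    return rng.random()

a = downstream_choice(42, extra_shuffle=False)
b = downstream_choice(42, extra_shuffle=True)
print(a == b)  # False: the extra shuffle shifted the RNG stream
```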

@angela97lin (Contributor) left a comment

Left minor comments but excited to see this work get in as a great start to supporting larger data sets :D

Comment thread docs/source/changelog.rst Outdated
Comment thread evalml/automl/auto_regression_search.py Outdated
Comment thread evalml/tests/automl_tests/test_autobase.py Outdated
@dsherry (Contributor) left a comment

@eccabay great! I liked your error handling and the unit tests. I left comments and tips for updating the limit check and the evaluation impl logic, as well as a suggestion on mocking in the tests. From the looks of it, the next round will be good to go.

Comment thread docs/source/changelog.rst Outdated
Comment thread evalml/automl/auto_regression_search.py Outdated
Comment thread evalml/automl/auto_search_base.py Outdated
Comment thread evalml/automl/auto_search_base.py Outdated
Comment thread evalml/automl/auto_search_base.py Outdated
Comment thread evalml/automl/auto_search_base.py
Comment thread evalml/automl/data_splitters/training_validation_split.py
Comment thread evalml/tests/automl_tests/test_autobase.py Outdated
@dsherry dsherry mentioned this pull request Jun 26, 2020
Comment thread evalml/automl/auto_search_base.py
Comment thread evalml/automl/data_splitters/training_validation_split.py Outdated
@dsherry (Contributor) commented Jun 26, 2020

@eccabay #871 was just merged and I have a lot of automl test conflicts in my refit PR! Looks like you've got some too. I hope they're easy to resolve. If you need a hand feel free to ping someone!

@eccabay eccabay requested a review from dsherry June 29, 2020 12:44
@dsherry (Contributor) left a comment

Looks great!

I left a few more comments. The only ones blocking merge from my perspective are one final simplification to the data split logic in AutoMLSearch.__init__, and backing out the changes in _compute_cv_scores. The rest could be addressed in a separate PR once this one's merged.

Comment thread docs/source/changelog.rst
**Breaking Changes**
* Pipelines' static ``component_graph`` field must contain either ``ComponentBase`` subclasses or ``str``, instead of ``ComponentBase`` subclass instances :pr:`850`
* Rename ``handle_component`` to ``handle_component_class``. Now standardizes to ``ComponentBase`` subclasses instead of ``ComponentBase`` subclass instances :pr:`850`
* Renamed ``automl``'s ``cv`` argument to ``data_split`` :pr:`877`
Contributor

👍

Comment thread evalml/automl/automl_search.py Outdated
if self.data_split is not None and not issubclass(self.data_split.__class__, BaseCrossValidator):
raise ValueError("Not a valid data splitter")
if self.problem_type != self.objective.problem_type:
raise ValueError("Given objective {} is not compatible with a {} problem.".format(self.objective.name, self.problem_type.value))
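The validity check quoted above can be sketched without sklearn, using a stand-in for `BaseCrossValidator`. Everything here is hypothetical scaffolding for illustration; note also that `issubclass(x.__class__, Base)` is equivalent to the more idiomatic `isinstance(x, Base)` used below.

```python
class BaseCrossValidator:
    """Stand-in for sklearn.model_selection.BaseCrossValidator."""
    def split(self, X, y=None):
        raise NotImplementedError

class TrainingValidationSplit(BaseCrossValidator):
    """Toy splitter: one 75/25 train/validation split by position."""
    def split(self, X, y=None):
        cut = int(len(X) * 0.75)
        return [(list(range(cut)), list(range(cut, len(X))))]

def validate_data_split(data_split):
    """Mirror the check in the diff: reject anything that is not a
    BaseCrossValidator instance (None means 'use the default')."""
    if data_split is not None and not isinstance(data_split, BaseCrossValidator):
        raise ValueError("Not a valid data splitter")
    return data_split

validate_data_split(TrainingValidationSplit())  # accepted
try:
    validate_data_split("kfold")                # rejected
except ValueError as err:
    print(err)  # Not a valid data splitter
```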
Contributor

Nice. This is a part of the validation tracked by #865. I'm in favor of adding this, even if it's not required for this PR.

Comment thread evalml/automl/automl_search.py Outdated
Comment thread evalml/automl/automl_search.py Outdated
list: training and testing split of both X and y inputs
"""
train, test = train_test_split(X.index, test_size=self.test_size, train_size=self.train_size, shuffle=self.shuffle, stratify=self.stratify, random_state=self.random_state)
return [(train, test)]
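The key design point in the snippet above is that `split` returns a one-element list of `(train, test)` index pairs, so a single train/validation split is iterable exactly like the folds from sklearn's `KFold.split`. A self-contained, stdlib-only sketch of that idea (not evalml's actual implementation, which delegates to `train_test_split`):

```python
import random

class TrainingValidationSplit:
    """Toy single train/validation splitter, API-compatible with CV splitters."""
    def __init__(self, test_size=0.25, shuffle=False, random_state=0):
        self.test_size = test_size
        self.shuffle = shuffle
        self.random_state = random_state

    def split(self, X, y=None):
        """Return one (train, test) index pair, wrapped in a list so
        callers can iterate over it like cross-validation folds."""
        indices = list(range(len(X)))
        if self.shuffle:
            random.Random(self.random_state).shuffle(indices)
        cut = int(len(indices) * (1 - self.test_size))
        return [(indices[:cut], indices[cut:])]

splitter = TrainingValidationSplit(test_size=0.2)
folds = splitter.split(range(100))
print(len(folds))                          # 1
print(len(folds[0][0]), len(folds[0][1]))  # 80 20
```

Because the return shape matches CV splitters, the automl evaluation loop can treat a single split and k-fold CV identically.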
Contributor

👍 awesome!

Comment thread evalml/automl/data_splitters/training_validation_split.py Outdated
AutoMLSearch(problem_type='binary', data_split=data_splitter)


@patch('evalml.pipelines.BinaryClassificationPipeline.score')
Contributor

@eccabay : can you please also add this line to all the tests where you mock score

@patch('evalml.pipelines.BinaryClassificationPipeline.fit')

This means pipelines' fit methods won't run, and can speed up tests by 5-10x
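The mocking pattern suggested here — patch both `fit` and `score` so pipelines never actually train — can be sketched with a toy class standing in for `evalml.pipelines.BinaryClassificationPipeline` (the class and score dictionary below are illustrative, not evalml's real interface):

```python
from unittest.mock import patch

class Pipeline:
    """Toy stand-in for an evalml pipeline with expensive fit/score."""
    def fit(self, X, y):
        raise RuntimeError("expensive training we want to skip in tests")
    def score(self, X, y, objectives):
        raise RuntimeError("expensive scoring we want to skip in tests")

# Decorators apply bottom-up, so the mock for `score` (the inner
# decorator) is passed as the first argument.
@patch.object(Pipeline, "fit")
@patch.object(Pipeline, "score", return_value={"Log Loss Binary": 0.2})
def run_test(mock_score, mock_fit):
    p = Pipeline()
    p.fit([[1]], [0])  # no-op thanks to the mock
    result = p.score([[1]], [0], objectives=["Log Loss Binary"])
    assert mock_fit.called
    return result

print(run_test())  # {'Log Loss Binary': 0.2}
```

With `fit` mocked out, the test exercises only the automl search and scoring plumbing, which is where the 5-10x speedup comes from.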

Contributor (Author)

Interestingly enough, when I add this, none of the tests pass - the expected scores I set with mock_score don't match and/or an error is thrown complaining about the pipelines not being fit, and there's no change in the amount of time the tests take.

Contributor

Strange! Fine to ignore

with pytest.raises(ValueError, match="is not compatible with a"):
auto = AutoMLSearch(problem_type='binary', objective='R2')
auto.search(X, y, data_checks=EmptyDataChecks())
AutoMLSearch(problem_type='binary', objective='R2')
Contributor

👍

@eccabay eccabay merged commit 24cc793 into master Jun 29, 2020
@angela97lin angela97lin mentioned this pull request Jun 30, 2020
@dsherry dsherry deleted the 840_large_data branch October 29, 2020 23:49


Development

Successfully merging this pull request may close these issues.

Large data support: add a threshold to switch default automl data splitting from CV to TV

3 participants