
Large Data Training Validation Split #877

Merged
merged 27 commits into from Jun 29, 2020

Conversation

@eccabay (Contributor) commented Jun 24, 2020

Closes #840 by replacing automl's cross-validation argument with a more generic data-splitting argument that can also perform a single training/validation split, which becomes the default when the training dataset has more than 100k rows.
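The threshold behavior described above can be sketched as follows. This is a minimal illustration, not evalml's exact implementation; the class, function, and parameter names (`TrainingValidationSplit`, `choose_data_split`, `threshold`) are assumptions drawn from the PR description:

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

class TrainingValidationSplit:
    """Minimal single train/validation splitter (a sketch, not evalml's exact class)."""

    def __init__(self, test_size=0.25, shuffle=False, random_state=0):
        self.test_size = test_size
        self.shuffle = shuffle
        self.random_state = random_state

    def split(self, X, y=None):
        # Return a one-element list so callers can iterate over "folds"
        # exactly as they would with a cross-validator.
        train, test = train_test_split(
            np.arange(X.shape[0]),
            test_size=self.test_size,
            shuffle=self.shuffle,
            random_state=self.random_state,
        )
        return [(train, test)]

def choose_data_split(X, threshold=100_000):
    """Default to one train/validation split above the row threshold, CV below it."""
    if X.shape[0] > threshold:
        return TrainingValidationSplit()
    return KFold(n_splits=3)
```

The point of the single split is cost: above the threshold, each pipeline is fit once instead of k times.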

codecov bot commented Jun 24, 2020

Codecov Report

Merging #877 into master will increase coverage by 0.01%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #877      +/-   ##
==========================================
+ Coverage   99.77%   99.78%   +0.01%     
==========================================
  Files         193      195       +2     
  Lines        8881     8950      +69     
==========================================
+ Hits         8861     8931      +70     
+ Misses         20       19       -1     
Impacted Files Coverage Δ
evalml/automl/__init__.py 100.00% <100.00%> (ø)
evalml/automl/automl_search.py 98.73% <100.00%> (+0.28%) ⬆️
evalml/automl/data_splitters/__init__.py 100.00% <100.00%> (ø)
...automl/data_splitters/training_validation_split.py 100.00% <100.00%> (ø)
evalml/tests/automl_tests/test_automl.py 100.00% <100.00%> (ø)
.../automl_tests/test_automl_search_classification.py 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 96aecd8...c5fcb43.

@eccabay eccabay marked this pull request as ready for review Jun 24, 2020
@eccabay eccabay requested review from dsherry and angela97lin Jun 24, 2020
@dsherry (Collaborator) commented Jun 24, 2020

@eccabay cool, I'm excited to look at this!

I see there's one test failing right now, evalml.tests.automl_tests.test_auto_classification_search::test_categorical_classification:

ValueError: Cannot use median strategy with non-numeric data: could not convert string to float: 'Braund, Mr. Owen Harris'

I skimmed the PR and there's no obvious reason your changes would cause this failure. My hunch is that it's happening because you currently have shuffle=True in the StratifiedKFold, which changes the order of calls to the random number generator so that the median strategy gets recommended for the SimpleImputer. @angela97lin and I were just talking about this in Slack yesterday. FYI, I just filed this as a bug in #881.

@angela97lin (Contributor) left a comment

Left minor comments, but I'm excited to see this work get in as a great start toward supporting larger datasets :D

Resolved review comments on: docs/source/changelog.rst, evalml/automl/auto_regression_search.py, evalml/tests/automl_tests/test_autobase.py
@dsherry (Collaborator) left a comment

@eccabay great! I liked your error handling and the unit tests. I left comments and tips for updating the limit check and the evaluation impl logic, as well as a suggestion on mocking in the tests. From the looks of it, the next round will be good to go.

Resolved review comments on: docs/source/changelog.rst, evalml/automl/auto_regression_search.py, evalml/automl/auto_search_base.py (×4), evalml/tests/automl_tests/test_autobase.py
@dsherry dsherry mentioned this pull request Jun 26, 2020
Resolved review comment on: evalml/automl/auto_search_base.py
@dsherry (Collaborator) commented Jun 26, 2020

@eccabay #871 was just merged and I have a lot of automl test conflicts in my refit PR! Looks like you've got some too. I hope they're easy to resolve. If you need a hand feel free to ping someone!

@eccabay eccabay requested a review from dsherry Jun 29, 2020
@dsherry (Collaborator) left a comment

Looks great!

I left a few more comments. The only ones blocking merge from my perspective are one final simplification to the data split logic in AutoMLSearch.__init__, and backing out the changes in _compute_cv_scores. The rest could be addressed in a separate PR once this one's merged.

@@ -47,6 +48,7 @@ Changelog
**Breaking Changes**
* Pipelines' static ``component_graph`` field must contain either ``ComponentBase`` subclasses or ``str``, instead of ``ComponentBase`` subclass instances :pr:`850`
* Rename ``handle_component`` to ``handle_component_class``. Now standardizes to ``ComponentBase`` subclasses instead of ``ComponentBase`` subclass instances :pr:`850`
* Renamed `automl` `cv` argument to `data_split` :pr:`877`
@dsherry (Collaborator) commented Jun 29, 2020:

👍

Resolved review comment on: evalml/automl/automl_search.py
if self.data_split is not None and not issubclass(self.data_split.__class__, BaseCrossValidator):
raise ValueError("Not a valid data splitter")
if self.problem_type != self.objective.problem_type:
raise ValueError("Given objective {} is not compatible with a {} problem.".format(self.objective.name, self.problem_type.value))
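The data-splitter guard above can be exercised on its own. A minimal sketch, assuming sklearn's `BaseCrossValidator` as the base class; the standalone `validate_data_split` function is hypothetical, extracted here purely for illustration:

```python
from sklearn.model_selection import BaseCrossValidator, KFold

def validate_data_split(data_split):
    """Reject anything that isn't a sklearn-style cross-validator (or None)."""
    # Mirrors the issubclass check quoted from AutoMLSearch.__init__ above.
    if data_split is not None and not issubclass(data_split.__class__, BaseCrossValidator):
        raise ValueError("Not a valid data splitter")
    return data_split
```

Note that `isinstance(data_split, BaseCrossValidator)` is the more idiomatic spelling of the same check.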
@dsherry (Collaborator) commented Jun 29, 2020:

Nice. This is part of the validation tracked by #865. I'm in favor of adding it, even if it's not required for this PR.

Resolved review comments on: evalml/automl/automl_search.py (×2)
list: training and testing split of both X and y inputs
"""
train, test = train_test_split(X.index, test_size=self.test_size, train_size=self.train_size, shuffle=self.shuffle, stratify=self.stratify, random_state=self.random_state)
return [(train, test)]
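Because `split()` returns a one-element list of (train, test) index pairs, calling code can loop over it exactly as it loops over sklearn's `KFold.split()` output, so cross-validation and the single split share one code path. A small sketch; the `OneSplit` class is illustrative, not evalml's actual splitter:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

class OneSplit:
    """Illustrative splitter with the same one-fold contract as shown above."""

    def __init__(self, test_size=0.2, random_state=0):
        self.test_size = test_size
        self.random_state = random_state

    def split(self, X, y=None):
        train, test = train_test_split(
            X.index, test_size=self.test_size,
            shuffle=True, random_state=self.random_state)
        return [(train, test)]

X = pd.DataFrame({"a": range(10)})
folds = OneSplit(test_size=0.2).split(X)
# The caller's loop is identical for CV (many folds) and TV (one fold).
for train_idx, test_idx in folds:
    X_train, X_test = X.loc[train_idx], X.loc[test_idx]
```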
@dsherry (Collaborator) commented Jun 29, 2020:

👍 awesome!

Resolved review comment on: evalml/automl/data_splitters/training_validation_split.py
AutoMLSearch(problem_type='binary', data_split=data_splitter)


@patch('evalml.pipelines.BinaryClassificationPipeline.score')
@dsherry (Collaborator) commented Jun 29, 2020:

@eccabay: can you please also add this line to all the tests where you mock score?

@patch('evalml.pipelines.BinaryClassificationPipeline.fit')

This means pipelines' fit methods won't run, and can speed up tests by 5-10x.
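The suggested pattern can be sketched with `unittest.mock`; the context-manager form below is equivalent to stacking `@patch` decorators on a test function. The toy `Pipeline` class is hypothetical and only stands in for evalml's real pipeline classes:

```python
from unittest.mock import patch

class Pipeline:
    """Toy stand-in for a pipeline whose fit/score are expensive (hypothetical)."""

    def fit(self, X, y):
        raise RuntimeError("expensive training we want to skip in tests")

    def score(self, X, y, objectives):
        raise RuntimeError("real scoring we also want to skip")

def run_fast_test():
    # Patching fit alongside score means the real (slow) fit never executes.
    with patch.object(Pipeline, "fit") as mock_fit, \
         patch.object(Pipeline, "score") as mock_score:
        mock_score.return_value = {"Log Loss Binary": 0.42}
        pipeline = Pipeline()
        pipeline.fit(None, None)  # intercepted by mock_fit: a fast no-op
        assert mock_fit.call_count == 1
        return pipeline.score(None, None, objectives=[])
```

In evalml's own tests this corresponds to adding the quoted `@patch('evalml.pipelines.BinaryClassificationPipeline.fit')` line above the existing score patch.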

@eccabay (Contributor, Author) commented Jun 29, 2020:

Interestingly enough, when I add this, none of the tests pass: the expected scores I set with mock_score don't match, and/or an error is thrown complaining that the pipelines haven't been fit. There's also no change in how long the tests take.

@dsherry (Collaborator) commented Jun 29, 2020:

Strange! Fine to ignore

@@ -423,8 +489,7 @@ def test_verifies_allowed_pipelines(X_y_reg):
def test_obj_matches_problem_type(X_y):
X, y = X_y
with pytest.raises(ValueError, match="is not compatible with a"):
auto = AutoMLSearch(problem_type='binary', objective='R2')
auto.search(X, y, data_checks=EmptyDataChecks())
AutoMLSearch(problem_type='binary', objective='R2')
@dsherry (Collaborator) commented Jun 29, 2020:

👍

@eccabay eccabay merged commit 24cc793 into master Jun 29, 2020
2 checks passed
@angela97lin angela97lin mentioned this pull request Jun 30, 2020
@dsherry dsherry deleted the 840_large_data branch Oct 29, 2020

Successfully merging this pull request may close these issues.

Large data support: add a threshold to switch default automl data splitting from CV to TV