Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changed default large dataset train/test splitting behavior #1205

Merged
merged 6 commits into from
Sep 29, 2020

Conversation

christopherbunn
Copy link
Contributor

Initial pass of managing large datasets. If a data splitter is not set, the default data splitter will only use 25% of the dataset or up to 100k rows (whichever one is lower).

The values here are parameterized in the code. If users want to change these values, they can pass over their own data splitter.

Resolves #1061

@christopherbunn christopherbunn changed the title Added initial tweaks to large dataset train/test splitting Changed default large dataset train/test splitting behavior Sep 21, 2020
@codecov
Copy link

codecov bot commented Sep 22, 2020

Codecov Report

Merging #1205 into main will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #1205   +/-   ##
=======================================
  Coverage   99.92%   99.92%           
=======================================
  Files         201      201           
  Lines       12489    12511   +22     
=======================================
+ Hits        12480    12502   +22     
  Misses          9        9           
Impacted Files Coverage Δ
evalml/automl/automl_search.py 99.58% <100.00%> (+<0.01%) ⬆️
evalml/tests/automl_tests/test_automl.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7e3ca42...fb38a84. Read the comment docs.

Copy link
Contributor

@dsherry dsherry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@christopherbunn I left a few things, should be ready to merge next round!

@@ -67,7 +67,9 @@
class AutoMLSearch:
"""Automated Pipeline search."""
_MAX_NAME_LEN = 40
_MAX_TRAINING_ROWS = int(1e5)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@christopherbunn what's the difference between _MAX_TRAINING_ROWS and _LARGE_DATA_ROW_THRESHOLD? They have the same value right now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_LARGE_DATA_ROW_THRESHOLD defines the number of rows a dataset needs to have before it's considered "large" (and thus uses the large dataset default splitter). _MAX_TRAINING_ROWS defines the maximum number of rows that is used for the training dataset. Regardless of the entire size of the dataset, we should only take _MAX_TRAINING_ROWS of data.

They just so happen to be the same value, I'm definitely open to adjusting these values as necessary.

_LARGE_DATA_ROW_THRESHOLD = int(1e5)
_LARGE_DATA_PERCENT_TEST = 0.75
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the training or validation split size? If its training, I think we wanted to lower it, right? 10% would be a good value to start with.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see now that you included "TEST" in the name. Got it. Can you please say "VALIDATION" instead? That's consistent with the rest of our automl code. When I hear "test" I think "holdout" and we don't have a representation for that yet in automl.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have one more consideration. By defining this as a class property, the only pythonic way to change it is to redefine the class. We could set it directly, but you'd be in trouble with the python police ⚠😂

Let's add this as an arg to the constructor: _large_data_percent_validation

Eventually we'll probably give up and have the constructor take a config object, but for now feels fine to add more parameters there. We can keep it private since its a property of the automl algo.

Sound good?

Copy link
Contributor Author

@christopherbunn christopherbunn Sep 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see how the name can be confusing; I'll rename it to _LARGE_DATA_PERCENT_VALIDATION.

Re: setting it as a constructor arg, I purposefully make it hard to change this variable. My reasoning is that if a user wanted to change the percentage, they could just define their own data_split. Especially since this is only applicable to large datasets, it might be confusing as what constitutes a "large" dataset (and the value of which is a class property anyways) and thus when it would kick in.

It might be worthwhile to update the docs to include a section on how we handle large datasets. Here, it might make sense to describe this default behavior and mention defining a new data_splitter as the best way to change this percentage.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@christopherbunn thanks for following up!

RE the docs, I agree. We're currently missing a User Guide page describing the automl algorithm, and the heuristics we use to make decisions about the automl algorithm like what data splitting strategy to use and when we use CV vs train-validation splitting. We will get around to adding that. However, I don't think that information is critical for users. Users need to be able to pass in data, get models and select one; the details of how the models were discovered is only of interest to advanced users.

RE constructor arg: gotcha, I follow. The advantage of adding it as a "private" constructor arg is that we don't need to preserve backwards compatibility for it if we decide to remove it in the future, but we can easily change it in our perf tests, or direct users on how to change it. Not required to do it that way, but I think it'd keep us flexible.

@@ -389,7 +391,8 @@ def search(self, X, y, data_checks="auto", feature_types=None, show_iteration_pl
default_data_split = StratifiedKFold(n_splits=3, random_state=self.random_state)

if X.shape[0] > self._LARGE_DATA_ROW_THRESHOLD:
default_data_split = TrainingValidationSplit(test_size=0.25)
test_size = min(self._LARGE_DATA_PERCENT_TEST, float(self._MAX_TRAINING_ROWS / X.shape[0]))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@christopherbunn I suggest we update this to

default_data_split = TrainingValidationSplit(self._large_data_percent_validation)

Let's keep it simple and see what kind of perf test results we can get on larger datasets. If we find evidence for adding the max rows cap we can add it. Or, if you wanna add it in order to perf test it later, that sounds good, but let's disable it by default to start. Sound good?

Copy link
Contributor Author

@christopherbunn christopherbunn Sep 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's definitely worthwhile to revert to this simpler logic for now to get the perf data then evaluate if it makes sense to have the row limiting logic. I'll change it in the next push.

Copy link
Contributor

@dsherry dsherry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@christopherbunn looks great! I left a couple minor test comments, but good to merge.

@@ -68,6 +68,7 @@ class AutoMLSearch:
"""Automated Pipeline search."""
_MAX_NAME_LEN = 40
_LARGE_DATA_ROW_THRESHOLD = int(1e5)
_LARGE_DATA_PERCENT_VALIDATION = 0.75
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@christopherbunn so, 25% training, 75% validation? Sure, that seems like a fine starting point, and certainly better than 75% training which is what we were doing before. Let's do it.

@@ -585,6 +585,31 @@ def test_large_dataset_regression(mock_score):
assert automl.results['pipeline_results'][pipeline_id]['cv_data'][0]['score'] == 1.234


@patch('evalml.pipelines.RegressionPipeline.score')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@christopherbunn please also mock fit, that will save us the fit time in automl.search

mock_score.return_value = {automl.objective.name: 1.234}
assert automl.data_split is None

under_max_rows = automl._LARGE_DATA_ROW_THRESHOLD + 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean to call this over_max_rows?

under_max_rows = automl._LARGE_DATA_ROW_THRESHOLD + 1
X, y = generate_fake_dataset(under_max_rows)
automl.search(X, y)
assert isinstance(automl.data_split, TrainingValidationSplit)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great. Do we have coverage where we check that if we're under the max row threshold, the result is a CrossValidationSplit ? If not, let's add it.

@christopherbunn christopherbunn merged commit 615026c into main Sep 29, 2020
@christopherbunn christopherbunn deleted the 1061_tv_split_parameter branch September 29, 2020 18:04
@angela97lin angela97lin mentioned this pull request Sep 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Large datasets: update TV split amount, and make it configurable
2 participants