Changed default large dataset train/test splitting behavior #1205
Conversation
Force-pushed from `b1d99b8` to `10c2225`
Codecov Report
@@ Coverage Diff @@
## main #1205 +/- ##
=======================================
Coverage 99.92% 99.92%
=======================================
Files 201 201
Lines 12489 12511 +22
=======================================
+ Hits 12480 12502 +22
Misses 9 9
Continue to review full report at Codecov.
@christopherbunn I left a few things, should be ready to merge next round!
evalml/automl/automl_search.py
Outdated
@@ -67,7 +67,9 @@
class AutoMLSearch:
    """Automated Pipeline search."""
    _MAX_NAME_LEN = 40
    _MAX_TRAINING_ROWS = int(1e5)
@christopherbunn what's the difference between `_MAX_TRAINING_ROWS` and `_LARGE_DATA_ROW_THRESHOLD`? They have the same value right now.
`_LARGE_DATA_ROW_THRESHOLD` defines the number of rows a dataset needs to have before it's considered "large" (and thus uses the large dataset default splitter). `_MAX_TRAINING_ROWS` defines the maximum number of rows used for the training dataset. Regardless of the total size of the dataset, we should only take `_MAX_TRAINING_ROWS` of data.
They just so happen to be the same value; I'm definitely open to adjusting these values as necessary.
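The interplay between the two constants could be sketched like this. This is a toy illustration of the intent described above (cap the training rows), not the actual evalml code; `choose_test_size` is a hypothetical helper:

```python
# Toy sketch: _LARGE_DATA_ROW_THRESHOLD decides *whether* the large-data
# splitter kicks in; _MAX_TRAINING_ROWS caps how many rows go to training.
_LARGE_DATA_ROW_THRESHOLD = int(1e5)
_MAX_TRAINING_ROWS = int(1e5)
_LARGE_DATA_PERCENT_TEST = 0.75

def choose_test_size(n_rows):
    """Return None for small data (cross-validation), else a test fraction."""
    if n_rows <= _LARGE_DATA_ROW_THRESHOLD:
        return None  # small dataset: the default CV splitter applies
    # grow the test fraction so training never exceeds _MAX_TRAINING_ROWS
    return max(_LARGE_DATA_PERCENT_TEST, 1 - _MAX_TRAINING_ROWS / n_rows)
```

Note this follows the stated intent ("only take `_MAX_TRAINING_ROWS` of data") rather than necessarily matching the exact expression in the diff.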
evalml/automl/automl_search.py
Outdated
_LARGE_DATA_ROW_THRESHOLD = int(1e5)
_LARGE_DATA_PERCENT_TEST = 0.75
Is this the training or validation split size? If it's training, I think we wanted to lower it, right? 10% would be a good value to start with.
Oh I see now that you included "TEST" in the name. Got it. Can you please say "VALIDATION" instead? That's consistent with the rest of our automl code. When I hear "test" I think "holdout" and we don't have a representation for that yet in automl.
I have one more consideration. By defining this as a class property, the only pythonic way to change it is to redefine the class. We could set it directly, but you'd be in trouble with the python police ⚠😂
Let's add this as an arg to the constructor: _large_data_percent_validation
Eventually we'll probably give up and have the constructor take a config object, but for now it feels fine to add more parameters there. We can keep it private since it's a property of the automl algo.
Sound good?
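The suggestion could look roughly like this sketch, where `AutoMLSearch` is a minimal stand-in, not the real evalml class signature:

```python
# Sketch of moving the class-level constant into a "private" constructor arg.
# The underscore prefix signals there is no backwards-compatibility promise.
class AutoMLSearch:
    _LARGE_DATA_ROW_THRESHOLD = int(1e5)

    def __init__(self, _large_data_percent_validation=0.75):
        self._large_data_percent_validation = _large_data_percent_validation

default_automl = AutoMLSearch()
tuned_automl = AutoMLSearch(_large_data_percent_validation=0.9)
```

This keeps the default behavior unchanged while letting perf tests (or advanced users) override the fraction without redefining the class.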
I can see how the name can be confusing; I'll rename it to `_LARGE_DATA_PERCENT_VALIDATION`.
Re: setting it as a constructor arg, I purposefully made it hard to change this variable. My reasoning is that if a user wanted to change the percentage, they could just define their own `data_split`. Especially since this is only applicable to large datasets, it might be confusing as to what constitutes a "large" dataset (the value of which is a class property anyway) and thus when it would kick in.
It might be worthwhile to update the docs to include a section on how we handle large datasets. There, it might make sense to describe this default behavior and mention defining a new `data_splitter` as the best way to change this percentage.
@christopherbunn thanks for following up!
RE the docs, I agree. We're currently missing a User Guide page describing the automl algorithm, and the heuristics we use to make decisions about the automl algorithm like what data splitting strategy to use and when we use CV vs train-validation splitting. We will get around to adding that. However, I don't think that information is critical for users. Users need to be able to pass in data, get models and select one; the details of how the models were discovered are only of interest to advanced users.
RE constructor arg: gotcha, I follow. The advantage of adding it as a "private" constructor arg is that we don't need to preserve backwards compatibility for it if we decide to remove it in the future, but we can easily change it in our perf tests, or direct users on how to change it. Not required to do it that way, but I think it'd keep us flexible.
evalml/automl/automl_search.py
Outdated
@@ -389,7 +391,8 @@ def search(self, X, y, data_checks="auto", feature_types=None, show_iteration_pl
    default_data_split = StratifiedKFold(n_splits=3, random_state=self.random_state)

    if X.shape[0] > self._LARGE_DATA_ROW_THRESHOLD:
        default_data_split = TrainingValidationSplit(test_size=0.25)
        test_size = min(self._LARGE_DATA_PERCENT_TEST, float(self._MAX_TRAINING_ROWS / X.shape[0]))
@christopherbunn I suggest we update this to
default_data_split = TrainingValidationSplit(self._large_data_percent_validation)
Let's keep it simple and see what kind of perf test results we can get on larger datasets. If we find evidence for adding the max rows cap we can add it. Or, if you wanna add it in order to perf test it later, that sounds good, but let's disable it by default to start. Sound good?
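The simplified selection logic being proposed might read like the following sketch. The splitter classes are stand-ins that return names and kwargs rather than real evalml objects:

```python
_LARGE_DATA_ROW_THRESHOLD = int(1e5)

def default_data_split(n_rows, percent_validation=0.75):
    """Pick a default splitter from the row count alone -- no max-rows cap."""
    if n_rows > _LARGE_DATA_ROW_THRESHOLD:
        # large data: single train/validation split
        return ("TrainingValidationSplit", {"test_size": percent_validation})
    # small data: 3-fold cross-validation
    return ("StratifiedKFold", {"n_splits": 3})
```

The max-rows cap from the earlier diff is deliberately absent here, matching the "keep it simple until perf tests say otherwise" suggestion.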
I think it's definitely worthwhile to revert to this simpler logic for now to get the perf data, then evaluate whether it makes sense to have the row-limiting logic. I'll change it in the next push.
Force-pushed from `3c26ec3` to `8a0b894`
@christopherbunn looks great! I left a couple minor test comments, but good to merge.
@@ -68,6 +68,7 @@ class AutoMLSearch:
    """Automated Pipeline search."""
    _MAX_NAME_LEN = 40
    _LARGE_DATA_ROW_THRESHOLD = int(1e5)
    _LARGE_DATA_PERCENT_VALIDATION = 0.75
@christopherbunn so, 25% training, 75% validation? Sure, that seems like a fine starting point, and certainly better than 75% training which is what we were doing before. Let's do it.
@@ -585,6 +585,31 @@ def test_large_dataset_regression(mock_score):
    assert automl.results['pipeline_results'][pipeline_id]['cv_data'][0]['score'] == 1.234


@patch('evalml.pipelines.RegressionPipeline.score')
@christopherbunn please also mock `fit`, that will save us the fit time in `automl.search`
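A minimal sketch of what mocking both methods buys, using a local stand-in `Pipeline` rather than the real `evalml.pipelines.RegressionPipeline`:

```python
from unittest.mock import patch

class Pipeline:
    """Stand-in with an expensive fit we never want to run in tests."""
    def fit(self, X, y):
        raise RuntimeError("real fit is slow -- should be mocked")
    def score(self, X, y, objectives):
        raise RuntimeError("real score should be mocked")

with patch.object(Pipeline, "fit") as mock_fit, \
        patch.object(Pipeline, "score", return_value={"R2": 1.234}):
    p = Pipeline()
    p.fit(None, None)                  # no-op: the mock swallows the call
    result = p.score(None, None, ["R2"])

assert mock_fit.call_count == 1
assert result == {"R2": 1.234}
```

With both methods patched, the test only exercises the search plumbing, not model training.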
mock_score.return_value = {automl.objective.name: 1.234}
assert automl.data_split is None

under_max_rows = automl._LARGE_DATA_ROW_THRESHOLD + 1
Did you mean to call this `over_max_rows`?
under_max_rows = automl._LARGE_DATA_ROW_THRESHOLD + 1
X, y = generate_fake_dataset(under_max_rows)
automl.search(X, y)
assert isinstance(automl.data_split, TrainingValidationSplit)
This is great. Do we have coverage where we check that if we're under the max row threshold, the result is a `CrossValidationSplit`? If not, let's add it.
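The requested coverage could be sketched like this, with a toy chooser in place of `AutoMLSearch`; the real test would assert on `automl.data_split` instead:

```python
_LARGE_DATA_ROW_THRESHOLD = int(1e5)

def choose_splitter(n_rows):
    """Toy stand-in for the default data-split selection."""
    if n_rows > _LARGE_DATA_ROW_THRESHOLD:
        return "TrainingValidationSplit"
    return "StratifiedKFold"

def test_splitter_choice_around_threshold():
    # at or under the threshold we expect cross-validation...
    assert choose_splitter(_LARGE_DATA_ROW_THRESHOLD) == "StratifiedKFold"
    # ...and one row over it, the train/validation splitter
    assert choose_splitter(_LARGE_DATA_ROW_THRESHOLD + 1) == "TrainingValidationSplit"

test_splitter_choice_around_threshold()
```

Testing both sides of the boundary pins down the threshold semantics (strictly greater than, per the diff).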
Force-pushed from `5820016` to `d962f7b`
Initial pass at managing large datasets. If a data splitter is not set, the default data splitter will only use 25% of the dataset or up to 100k rows (whichever is lower) for training.
The values here are parameterized in the code. If users want to change these values, they can pass in their own data splitter.
Resolves #1061
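The default behavior described above can be sketched as follows; `training_rows` is a hypothetical helper illustrating the heuristic, not evalml API:

```python
def training_rows(n_rows, train_fraction=0.25, max_rows=100_000):
    """Rows used for training under the default large-dataset splitter:
    25% of the data, capped at 100k rows (whichever is lower)."""
    return min(int(n_rows * train_fraction), max_rows)
```

For example, a 200k-row dataset trains on 50k rows, while a 1M-row dataset hits the 100k cap.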