Changed default large dataset train/test splitting behavior #1205

christopherbunn · 2020-09-21T15:07:25Z

Initial pass at managing large datasets. If a data splitter is not set, the default data splitter will use only 25% of the dataset, or up to 100k rows, for training (whichever is lower).

The values here are parameterized in the code. If users want to change these values, they can pass in their own data splitter.
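
For reference, the "25% or up to 100k rows, whichever is lower" rule can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic, not the code in this PR: the function name `training_row_count` and its parameter names are hypothetical, and the defaults simply mirror the values quoted above.

```python
def training_row_count(n_rows, train_fraction=0.25, max_training_rows=100_000):
    """Rows used for training: a fraction of the data, capped at a maximum.

    Hypothetical helper for illustration; names and defaults are assumptions
    taken from the numbers in the description, not from evalml itself.
    """
    if n_rows < 0:
        raise ValueError("n_rows must be non-negative")
    # Take the fraction first, then apply the hard cap.
    return min(int(n_rows * train_fraction), max_training_rows)


if __name__ == "__main__":
    for n in (200_000, 400_000, 1_000_000):
        print(n, training_row_count(n))
```

With these defaults the cap starts to bind once the dataset has more than 400k rows; below that, the 25% fraction is the smaller of the two.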

Resolves #1061

codecov · 2020-09-22T15:54:39Z

Codecov Report

Merging #1205 into main will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main    #1205   +/-   ##
=======================================
  Coverage   99.92%   99.92%           
=======================================
  Files         201      201           
  Lines       12489    12511   +22     
=======================================
+ Hits        12480    12502   +22     
  Misses          9        9

Impacted Files	Coverage Δ
evalml/automl/automl_search.py	`99.58% <100.00%> (+<0.01%)`	⬆️
evalml/tests/automl_tests/test_automl.py	`100.00% <100.00%> (ø)`

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7e3ca42...fb38a84.

dsherry

@christopherbunn I left a few things, should be ready to merge next round!

dsherry · 2020-09-24T22:45:16Z

evalml/automl/automl_search.py

@@ -67,7 +67,9 @@
 class AutoMLSearch:
    """Automated Pipeline search."""
    _MAX_NAME_LEN = 40
+    _MAX_TRAINING_ROWS = int(1e5)


@christopherbunn what's the difference between _MAX_TRAINING_ROWS and _LARGE_DATA_ROW_THRESHOLD? They have the same value right now.

_LARGE_DATA_ROW_THRESHOLD defines the number of rows a dataset needs to have before it's considered "large" (and thus uses the large dataset default splitter). _MAX_TRAINING_ROWS defines the maximum number of rows that are used for the training dataset. Regardless of the total size of the dataset, we should take at most _MAX_TRAINING_ROWS rows of data.

They just so happen to be the same value; I'm definitely open to adjusting these values as necessary.

dsherry · 2020-09-24T22:47:00Z

evalml/automl/automl_search.py

    _LARGE_DATA_ROW_THRESHOLD = int(1e5)
+    _LARGE_DATA_PERCENT_TEST = 0.75


Is this the training or validation split size? If it's training, I think we wanted to lower it, right? 10% would be a good value to start with.

Oh I see now that you included "TEST" in the name. Got it. Can you please say "VALIDATION" instead? That's consistent with the rest of our automl code. When I hear "test" I think "holdout" and we don't have a representation for that yet in automl.

I have one more consideration. By defining this as a class property, the only pythonic way to change it is to redefine the class. We could set it directly, but you'd be in trouble with the python police ⚠😂

Let's add this as an arg to the constructor: _large_data_percent_validation

Eventually we'll probably give up and have the constructor take a config object, but for now it feels fine to add more parameters there. We can keep it private since it's a property of the automl algo.

Sound good?

I can see how the name can be confusing; I'll rename it to _LARGE_DATA_PERCENT_VALIDATION.

Re: setting it as a constructor arg, I purposefully made it hard to change this variable. My reasoning is that if a user wanted to change the percentage, they could just define their own data_split. Especially since this is only applicable to large datasets, it might be confusing as to what constitutes a "large" dataset (the threshold for which is a class property anyway) and thus when it would kick in.

It might be worthwhile to update the docs to include a section on how we handle large datasets. Here, it might make sense to describe this default behavior and mention defining a new data_splitter as the best way to change this percentage.

@christopherbunn thanks for following up!

RE the docs, I agree. We're currently missing a User Guide page describing the automl algorithm and the heuristics we use to make decisions within it, such as which data splitting strategy to use and when to use CV vs train-validation splitting. We will get around to adding that. However, I don't think that information is critical for users. Users need to be able to pass in data, get models and select one; the details of how the models were discovered are only of interest to advanced users.

RE constructor arg: gotcha, I follow. The advantage of adding it as a "private" constructor arg is that we don't need to preserve backwards compatibility for it if we decide to remove it in the future, but we can easily change it in our perf tests, or direct users on how to change it. Not required to do it that way, but I think it'd keep us flexible.
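
To make the "private constructor arg" idea concrete, here is a minimal sketch. `SearchSketch` is a hypothetical stand-in, not the real `AutoMLSearch`; only the argument name `_large_data_percent_validation` and the 0.75 default come from this thread, and the range check is an added assumption.

```python
class SearchSketch:
    """Hypothetical stand-in for AutoMLSearch showing only the argument pattern."""

    # Class-level constant: changing it affects every instance.
    _LARGE_DATA_ROW_THRESHOLD = int(1e5)

    def __init__(self, data_split=None, _large_data_percent_validation=0.75):
        # Assumption: the value is a fraction strictly between 0 and 1.
        if not 0 < _large_data_percent_validation < 1:
            raise ValueError("_large_data_percent_validation must be in (0, 1)")
        self.data_split = data_split
        # Per-instance value: a perf test can override it without
        # touching the class or any other instance.
        self._large_data_percent_validation = _large_data_percent_validation


default_search = SearchSketch()
perf_test_search = SearchSketch(_large_data_percent_validation=0.5)
```

The leading underscore signals that the argument carries no backwards-compatibility promise, while still letting a perf test set it per instance instead of mutating a class attribute.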

dsherry · 2020-09-25T00:19:39Z

evalml/automl/automl_search.py

@@ -389,7 +391,8 @@ def search(self, X, y, data_checks="auto", feature_types=None, show_iteration_pl
            default_data_split = StratifiedKFold(n_splits=3, random_state=self.random_state)

        if X.shape[0] > self._LARGE_DATA_ROW_THRESHOLD:
-            default_data_split = TrainingValidationSplit(test_size=0.25)
+            test_size = min(self._LARGE_DATA_PERCENT_TEST, float(self._MAX_TRAINING_ROWS / X.shape[0]))


@christopherbunn I suggest we update this to

default_data_split = TrainingValidationSplit(self._large_data_percent_validation)

Let's keep it simple and see what kind of perf test results we can get on larger datasets. If we find evidence for adding the max rows cap we can add it. Or, if you wanna add it in order to perf test it later, that sounds good, but let's disable it by default to start. Sound good?

I think it's definitely worthwhile to revert to this simpler logic for now to get the perf data, then evaluate whether it makes sense to have the row-limiting logic. I'll change it in the next push.
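
A rough sketch of the simplified selection logic agreed on above, using only the standard library. The two dataclasses are hypothetical stand-ins for the real splitter objects, and `choose_default_split` is an illustrative name; the strict `>` comparison, the 3 folds, the 1e5 threshold and the 0.75 validation fraction are taken from the diff in this thread.

```python
from dataclasses import dataclass


@dataclass
class KFoldStandIn:
    """Stand-in for a 3-fold cross-validation splitter."""
    n_splits: int = 3


@dataclass
class TrainingValidationStandIn:
    """Stand-in for a single train/validation splitter."""
    test_size: float


def choose_default_split(n_rows, large_threshold=100_000, percent_validation=0.75):
    """Pick the default splitter from the row count alone (no max-rows cap)."""
    if n_rows > large_threshold:
        # Large dataset: one train/validation split with a fixed fraction.
        return TrainingValidationStandIn(test_size=percent_validation)
    # At or below the threshold: keep cross-validation.
    return KFoldStandIn(n_splits=3)
```

Note that a dataset with exactly `large_threshold` rows still gets cross-validation, because the comparison is strict.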

dsherry

@christopherbunn looks great! I left a couple minor test comments, but good to merge.

dsherry · 2020-09-29T17:06:17Z

evalml/automl/automl_search.py

@@ -68,6 +68,7 @@ class AutoMLSearch:
    """Automated Pipeline search."""
    _MAX_NAME_LEN = 40
    _LARGE_DATA_ROW_THRESHOLD = int(1e5)
+    _LARGE_DATA_PERCENT_VALIDATION = 0.75


@christopherbunn so, 25% training, 75% validation? Sure, that seems like a fine starting point, and certainly better than 75% training which is what we were doing before. Let's do it.

dsherry · 2020-09-29T17:11:46Z

evalml/tests/automl_tests/test_automl.py

@@ -585,6 +585,31 @@ def test_large_dataset_regression(mock_score):
        assert automl.results['pipeline_results'][pipeline_id]['cv_data'][0]['score'] == 1.234


+@patch('evalml.pipelines.RegressionPipeline.score')


@christopherbunn please also mock fit, that will save us the fit time in automl.search
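
For illustration, mocking both `fit` and `score` might look like the sketch below. `ToyPipeline` and `run_search` are hypothetical stand-ins so the example runs on its own; in the real test the patch targets would be the evalml pipeline methods, as in the `@patch` decorator in the diff above.

```python
from unittest.mock import patch


class ToyPipeline:
    """Hypothetical pipeline whose real methods are too slow for unit tests."""

    def fit(self, X, y):
        raise RuntimeError("expensive fit should not run in a unit test")

    def score(self, X, y):
        raise RuntimeError("expensive score should not run in a unit test")


def run_search(pipeline, X, y):
    """Stand-in for the search loop: fit, then score."""
    pipeline.fit(X, y)
    return pipeline.score(X, y)


def run_mocked_search():
    # Patch both methods so neither the fit nor the score cost is paid.
    with patch.object(ToyPipeline, "fit") as mock_fit, \
            patch.object(ToyPipeline, "score", return_value={"R2": 1.234}) as mock_score:
        result = run_search(ToyPipeline(), [[0], [1]], [0, 1])
    mock_fit.assert_called_once()
    mock_score.assert_called_once()
    return result
```

Because `fit` is replaced by a mock, the test exercises the search control flow without training anything.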

dsherry · 2020-09-29T17:12:30Z

evalml/tests/automl_tests/test_automl.py

+    mock_score.return_value = {automl.objective.name: 1.234}
+    assert automl.data_split is None
+
+    under_max_rows = automl._LARGE_DATA_ROW_THRESHOLD + 1


Did you mean to call this over_max_rows?

dsherry · 2020-09-29T17:13:15Z

evalml/tests/automl_tests/test_automl.py

+    under_max_rows = automl._LARGE_DATA_ROW_THRESHOLD + 1
+    X, y = generate_fake_dataset(under_max_rows)
+    automl.search(X, y)
+    assert isinstance(automl.data_split, TrainingValidationSplit)


This is great. Do we have coverage where we check that if we're under the max row threshold, the result is a CrossValidationSplit? If not, let's add it.

christopherbunn changed the title ~~Added initial tweaks to large dataset train/test splitting~~ Changed default large dataset train/test splitting behavior Sep 21, 2020

christopherbunn force-pushed the 1061_tv_split_parameter branch from b1d99b8 to 10c2225 Compare September 21, 2020 19:21

christopherbunn requested review from jeremyliweishih, dsherry and freddyaboulton September 23, 2020 13:36

dsherry suggested changes Sep 25, 2020


christopherbunn force-pushed the 1061_tv_split_parameter branch from 3c26ec3 to 8a0b894 Compare September 25, 2020 18:28

christopherbunn requested a review from dsherry September 29, 2020 14:19

dsherry approved these changes Sep 29, 2020


christopherbunn added 5 commits September 29, 2020 13:30

Added initial tweaks to train/test

3a8be4f

Updated changelog

695a83e

Changed large split to be percentage across the board and var name

e9ba40f

Updated test case

2f0b988

Removed max data rows

d962f7b

christopherbunn force-pushed the 1061_tv_split_parameter branch from 5820016 to d962f7b Compare September 29, 2020 17:31

Refactored tests

fb38a84

christopherbunn merged commit 615026c into main Sep 29, 2020

christopherbunn deleted the 1061_tv_split_parameter branch September 29, 2020 18:04

angela97lin mentioned this pull request Sep 29, 2020

Release v0.14.1 #1241

Merged
