AutoMLSearch: hold reference to training data#1597
Conversation
Codecov Report
@@ Coverage Diff @@
## main #1597 +/- ##
=========================================
+ Coverage 100.0% 100.0% +0.1%
=========================================
Files 240 240
Lines 18056 18092 +36
=========================================
+ Hits 18048 18084 +36
Misses 8 8
Continue to review full report at Codecov.
|
|
|
||
| if not isinstance(y, ww.DataColumn): | ||
| logger.warning("`y` passed was not a DataColumn. EvalML will try to convert the input as a Woodwork DataTable and types will be inferred. To control this behavior, please pass in a Woodwork DataTable instead.") | ||
| y = _convert_to_woodwork_structure(y) |
There was a problem hiding this comment.
I removed these warnings because we were planning on doing so (#1555) and because it was easier than fixing the failing test coverage for them 😅
| return {objective_name: float(score) / n_folds for objective_name, score in scores.items()} | ||
|
|
||
| def _compute_cv_scores(self, pipeline, X, y): | ||
| def _compute_cv_scores(self, pipeline): |
There was a problem hiding this comment.
It would be great to get more eyes on the changes here. I think I did it right but can't hurt to check.
| ordered_scores.update(scores) | ||
| ordered_scores.update({"# Training": y_train.shape[0]}) | ||
| ordered_scores.update({"# Testing": y_test.shape[0]}) | ||
| ordered_scores.update({"# Validation": y_valid.shape[0]}) |
There was a problem hiding this comment.
I renamed this field because I think "validation" is a better description
| gap (int): Gap used in time series problem. Time series pipelines shift the target variable by gap rows | ||
| since we are interested in | ||
| n_splits (int): number of data splits to make. | ||
| test_size (float): What percentage of data points should be included in the test set. Defaults to None. |
There was a problem hiding this comment.
Oops, I will remove test_size, left over
| """ | ||
| train, test = train_test_split(np.arange(X.shape[0]), test_size=self.test_size, train_size=self.train_size, shuffle=self.shuffle, stratify=self.stratify, random_state=self.random_state) | ||
| return [(train, test)] | ||
| return iter([(train, test)]) |
There was a problem hiding this comment.
I believe this was necessary because BaseCrossValidator subclasses are expected to return iterators. Some tests failed without it. I had to update other tests to call list(...) 🤷♂️
| train, test = next(CV_method.split(X.to_dataframe(), y.to_series())) | ||
| data_splitter = None | ||
| if is_time_series(problem_type): | ||
| data_splitter = TrainingValidationSplit(test_size=test_size, shuffle=False, stratify=None, random_state=random_state) |
There was a problem hiding this comment.
This is the change @freddyaboulton and I were discussion. Once this PR is merged we can update looking glass to use split_data
bchen1116
left a comment
There was a problem hiding this comment.
LGTM! Left a couple comments but I think you caught everything!
| "\n", | ||
| "Below EvalML utilizes Bayesian optimization (EvalML's default optimizer) to search and find the best pipeline defined by the given objective.\n", | ||
| "\n", | ||
| "EvalML provides a number of parameters to control the search process. `max_batches` is one of the parameters which controls the stopping criterion for the AutoML search. It indicates the maximum number of rounds of AutoML to evaluate, where each round may train and score a variable number of pipelines. In this example, `max_batches` is set to 1.\n", |
There was a problem hiding this comment.
-
it might be better to move the explanation of
max_batchesuntil after the code example. it doesn't mean anything to see that parameter defined until it's actually used. for example, see in the featuretools docs how we 1. show the function 2. explain what the function did 3. explain the parameters that controlled the function.

-
not sure it's helpful to introduce the word "round" here since it's basically functioning as a synonym for batch without actually defining the word batch.
-
We also don't give the user any intuition of when or why they'd want to use the parameter i.e it controls the "budget" evalml has to look for the best model. increasing the value allows evalml to search deeper for the best model, at the risk of overfitting.
There was a problem hiding this comment.
i see that this text wasn't actually change in the PR, so maybe best done as a follow up
…size, shuffle, random_state
…upervised learning
3bdad96 to
49c3038
Compare
* First round * Removed RMSLE, MSLE, and MAPE from non_core_objectives * Add objective parameter to all data_check subclasses * Add new data_check_message_code for a target that is incompatible with an objective * Add check to invalid_target_data_check for RMSLE, MSLE, and MAPE * Invalid Target Data Check update * Fix test case to not include invalid objectives in regression_core_objectives * Test updates * Invalid target data check update * Lint * Release notes * Tests and updates to code * Release notes * Latest Dependencies Version update * Lint fix * Dependency check * Fixing minor issues * test none objective update * Update to reflect changes in #1597 * Make objective mandatory for DefaultDataChecks and InvalidTargetDataCheck * Mock data check to see if objective is being passed from AutoML * Fix breaking tests due to mandatory objective param in DefaultDataChecks, and check error output from DefaultDataChecks * Change label to all positive so RMSLE, MSLE, and MAPE can be included in regression_core_objectives * Raise error for None objective in InvalidTargetDataCheck * Docstring example fix * Lint and docstring example * Jupyter notebook update for build_docs * Jupyter notebook update * remove print statement * Added test cases and docstring, updated error message * test case fixes * Test update * lint error * Add positive_only class attribute for ObjectiveBase * Adding @classproperty to ObjectiveBase for positive_only check * Lint error * Fix conflicts with #1551 * Lint
Changes
AutoMLSearchconstructor to take training data asX_trainandy_train. Updatesearchandadd_to_leaderboardto no longer require datasplit_datahelper to acceptproblem_type,problem_configuration,test_size,shuffle, and to defaultrandom_stateto 0evalml/automl/data_splitters/toevalml/preprocessing/data_splitters/to avoid circular importis_regression,is_classification,is_timeseries,is_binary,is_multiclass