AutoMLSearch: hold reference to training data #1597
Conversation
Codecov Report
@@            Coverage Diff            @@
##             main    #1597     +/-   ##
=========================================
+ Coverage   100.0%   100.0%    +0.1%
=========================================
  Files         240      240
  Lines       18056    18092      +36
=========================================
+ Hits        18048    18084      +36
  Misses          8        8
Continue to review full report at Codecov.
text_columns = list(text_column_vals.to_dataframe().columns)
if len(text_columns) == 0:
    text_columns = None

if not isinstance(y, ww.DataColumn):
    logger.warning("`y` passed was not a DataColumn. EvalML will try to convert the input as a Woodwork DataTable and types will be inferred. To control this behavior, please pass in a Woodwork DataTable instead.")
    y = _convert_to_woodwork_structure(y)
I removed these warnings because we were planning on doing so (#1555) and because it was easier than fixing the failing test coverage for them 😅
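For reference, a rough sketch of the conversion those warnings used to announce, assuming the Woodwork 0.0.x API (`ww.DataTable` / `ww.DataColumn`) that evalml depended on at the time; the body of `_convert_to_woodwork_structure` below is illustrative, not the actual implementation:

import pandas as pd
import woodwork as ww

def _convert_to_woodwork_structure(data):
    # Illustrative only: wrap plain pandas objects in Woodwork so logical
    # types get inferred; inputs that are already Woodwork pass through.
    if isinstance(data, (ww.DataTable, ww.DataColumn)):
        return data
    if isinstance(data, pd.DataFrame):
        return ww.DataTable(data)
    return ww.DataColumn(pd.Series(data))

y = pd.Series([0, 1, 1, 0])
y = _convert_to_woodwork_structure(y)  # now converted silently, with no warning logged
print(type(y))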
@@ -614,29 +599,26 @@ def _get_mean_cv_scores_for_all_objectives(cv_data):
            scores[field] += value
        return {objective_name: float(score) / n_folds for objective_name, score in scores.items()}

    def _compute_cv_scores(self, pipeline, X, y):
    def _compute_cv_scores(self, pipeline):
It would be great to get more eyes on the changes here. I think I did it right but can't hurt to check.
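For anyone reviewing, a minimal sketch of the pattern this moves to, so the signature change is easier to follow. Only the `X_train`/`y_train` constructor arguments come from this PR; the class and method bodies below are toy stand-ins, not evalml internals:

import numpy as np

class ToyPipeline:
    def fit(self, X, y):
        return self
    def score(self, X, y):
        return float(np.mean(y))  # placeholder metric

class AutoMLSearchSketch:
    def __init__(self, X_train, y_train):
        # Training data is stored once, at construction time...
        self.X_train = X_train
        self.y_train = y_train

    def _compute_cv_scores(self, pipeline):
        # ...so per-pipeline scoring reads self.X_train / self.y_train instead
        # of taking X and y as parameters, matching the new signature above.
        pipeline.fit(self.X_train, self.y_train)
        return pipeline.score(self.X_train, self.y_train)

    def search(self):
        # search() no longer needs data passed in either.
        return [self._compute_cv_scores(ToyPipeline())]

automl = AutoMLSearchSketch(X_train=np.random.rand(20, 3), y_train=np.random.randint(0, 2, 20))
print(automl.search())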
@@ -683,7 +667,7 @@ def _compute_cv_scores(self, pipeline, X, y):
        ordered_scores.update({self.objective.name: score})
        ordered_scores.update(scores)
        ordered_scores.update({"# Training": y_train.shape[0]})
        ordered_scores.update({"# Testing": y_test.shape[0]})
        ordered_scores.update({"# Validation": y_valid.shape[0]})
I renamed this field because I think "validation" is a better description
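To make the rename concrete, this is the rough shape of one fold's entry after the change; the values below are invented:

# Illustrative contents of ordered_scores for a single fold (values made up):
ordered_scores = {
    "Log Loss Binary": 0.43,   # self.objective.name: score
    "AUC": 0.91,               # remaining objective scores
    "# Training": 800,         # y_train.shape[0]
    "# Validation": 200,       # y_valid.shape[0], previously reported as "# Testing"
}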
@@ -21,6 +21,8 @@ def __init__(self, max_delay=0, gap=0, n_splits=3):
            of rows of the current split to avoid "throwing out" more data than is necessary.
        gap (int): Gap used in time series problem. Time series pipelines shift the target variable by gap rows
            since we are interested in
        n_splits (int): number of data splits to make.
        test_size (float): What percentage of data points should be included in the test set. Defaults to None.
Oops, I will remove `test_size`, it's left over.
@@ -43,4 +43,4 @@ def split(self, X, y=None):
            list: indices to split data into training and test set
        """
        train, test = train_test_split(np.arange(X.shape[0]), test_size=self.test_size, train_size=self.train_size, shuffle=self.shuffle, stratify=self.stratify, random_state=self.random_state)
        return [(train, test)]
        return iter([(train, test)])
I believe this was necessary because `BaseCrossValidator` subclasses are expected to return iterators. Some tests failed without it. I had to update other tests to call `list(...)` 🤷♂️
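A minimal sketch of that expectation, assuming an sklearn-style splitter similar to the one in the diff below; everything except the iterator-vs-list detail is illustrative:

import numpy as np
from sklearn.model_selection import BaseCrossValidator, train_test_split

class SingleSplitSketch(BaseCrossValidator):
    """One train/validation split exposed through the cross-validator interface."""

    def __init__(self, test_size=0.25, random_state=0):
        self.test_size = test_size
        self.random_state = random_state

    def get_n_splits(self, X=None, y=None, groups=None):
        return 1

    def split(self, X, y=None, groups=None):
        train, test = train_test_split(np.arange(X.shape[0]),
                                       test_size=self.test_size,
                                       random_state=self.random_state)
        # Callers written as next(splitter.split(X)) expect an iterator,
        # so wrap the single (train, test) pair instead of returning a list.
        return iter([(train, test)])

X = np.zeros((10, 2))
splitter = SingleSplitSketch(test_size=0.2)
train, test = next(splitter.split(X))  # works because split() returns an iterator
folds = list(splitter.split(X))        # tests that want a list wrap the call in list(...)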
        train, test = next(CV_method.split(X.to_dataframe(), y.to_series()))
    data_splitter = None
    if is_time_series(problem_type):
        data_splitter = TrainingValidationSplit(test_size=test_size, shuffle=False, stratify=None, random_state=random_state)
This is the change @freddyaboulton and I were discussing. Once this PR is merged we can update looking glass to use `split_data`.
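A rough sketch of what that call could look like, assuming the `split_data` signature described in this PR's summary (`problem_type`, `problem_configuration`, `test_size`, `shuffle`, with `random_state` defaulting to 0); the import path, return order, and the time series configuration keys are assumptions, not checked against the final API:

import pandas as pd
from evalml.preprocessing import split_data

X = pd.DataFrame({"feature": range(20)})
y = pd.Series(range(20))

# For a time series problem type, the helper would pick a non-shuffled
# TrainingValidationSplit under the hood, as in the snippet above.
X_train, X_valid, y_train, y_valid = split_data(
    X, y,
    problem_type="time series regression",
    problem_configuration={"gap": 0, "max_delay": 2},
    test_size=0.2,
)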
LGTM! Left a couple comments but I think you caught everything!
"\n", | ||
"Below EvalML utilizes Bayesian optimization (EvalML's default optimizer) to search and find the best pipeline defined by the given objective.\n", | ||
"\n", | ||
"EvalML provides a number of parameters to control the search process. `max_batches` is one of the parameters which controls the stopping criterion for the AutoML search. It indicates the maximum number of rounds of AutoML to evaluate, where each round may train and score a variable number of pipelines. In this example, `max_batches` is set to 1.\n", |
- It might be better to move the explanation of `max_batches` until after the code example. It doesn't mean anything to see that parameter defined until it's actually used. For example, see in the featuretools docs how we 1. show the function, 2. explain what the function did, 3. explain the parameters that controlled the function.
- Not sure it's helpful to introduce the word "round" here, since it's basically functioning as a synonym for batch without actually defining the word batch.
- We also don't give the user any intuition of when or why they'd want to use the parameter, i.e. it controls the "budget" evalml has to look for the best model. Increasing the value allows evalml to search deeper for the best model, at the risk of overfitting.
I see that this text wasn't actually changed in the PR, so maybe best done as a follow-up.
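Something along these lines is the shape the comment suggests, with the code shown before the explanation; it assumes the `X_train`/`y_train` constructor arguments and `split_data` helper from this PR, and is a sketch rather than the notebook's actual cell:

import evalml
from evalml.automl import AutoMLSearch

X, y = evalml.demos.load_breast_cancer()
X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(X, y, problem_type="binary", test_size=0.2)

# max_batches caps how many batches of pipelines AutoML will evaluate;
# each batch trains and scores a group of pipelines.
automl = AutoMLSearch(X_train=X_train, y_train=y_train, problem_type="binary", max_batches=1)
automl.search()
automl.rankings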
* First round
* Removed RMSLE, MSLE, and MAPE from non_core_objectives
* Add objective parameter to all data_check subclasses
* Add new data_check_message_code for a target that is incompatible with an objective
* Add check to invalid_target_data_check for RMSLE, MSLE, and MAPE
* Invalid Target Data Check update
* Fix test case to not include invalid objectives in regression_core_objectives
* Test updates
* Invalid target data check update
* Lint
* Release notes
* Tests and updates to code
* Release notes
* Latest Dependencies Version update
* Lint fix
* Dependency check
* Fixing minor issues
* test none objective update
* Update to reflect changes in #1597
* Make objective mandatory for DefaultDataChecks and InvalidTargetDataCheck
* Mock data check to see if objective is being passed from AutoML
* Fix breaking tests due to mandatory objective param in DefaultDataChecks, and check error output from DefaultDataChecks
* Change label to all positive so RMSLE, MSLE, and MAPE can be included in regression_core_objectives
* Raise error for None objective in InvalidTargetDataCheck
* Docstring example fix
* Lint and docstring example
* Jupyter notebook update for build_docs
* Jupyter notebook update
* remove print statement
* Added test cases and docstring, updated error message
* test case fixes
* Test update
* lint error
* Add positive_only class attribute for ObjectiveBase
* Adding @classproperty to ObjectiveBase for positive_only check
* Lint error
* Fix conflicts with #1551
* Lint
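For context on the objective-related items in that list, a simplified sketch of the positive_only idea: objectives such as RMSLE, MSLE, and MAPE are not defined for targets at or below zero, so a data check can flag the target before search. The class and function names below are stand-ins, not evalml's actual implementations:

import numpy as np

class MeanSquaredLogError:
    # Log-based (and percentage-based) objectives are undefined for targets <= 0.
    positive_only = True

class R2:
    positive_only = False

def check_target_compatible_with_objective(y, objective):
    """Return data-check style error messages for incompatible targets."""
    y = np.asarray(y)
    if getattr(objective, "positive_only", False) and (y <= 0).any():
        return [f"Target has non-positive values, which are invalid for {objective.__name__}"]
    return []

print(check_target_compatible_with_objective([1, 2, -3], MeanSquaredLogError))  # one error message
print(check_target_compatible_with_objective([1, 2, -3], R2))                   # []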
Changes
- `AutoMLSearch` constructor now takes training data as `X_train` and `y_train`. Update `search` and `add_to_leaderboard` to no longer require data
- Update `split_data` helper to accept `problem_type`, `problem_configuration`, `test_size`, `shuffle`, and to default `random_state` to 0
- Move `evalml/automl/data_splitters/` to `evalml/preprocessing/data_splitters/` to avoid a circular import
- Add `is_regression`, `is_classification`, `is_timeseries`, `is_binary`, `is_multiclass` (see the sketch below)
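A small sketch of how the new problem-type helpers might be used, assuming they are exposed from `evalml.problem_types` next to `ProblemTypes`; the import path is an assumption:

from evalml.problem_types import ProblemTypes, is_classification, is_regression

for problem_type in [ProblemTypes.BINARY, ProblemTypes.MULTICLASS, ProblemTypes.REGRESSION]:
    # The helpers replace ad hoc checks like
    # `problem_type in [ProblemTypes.BINARY, ProblemTypes.MULTICLASS]`.
    print(problem_type, is_classification(problem_type), is_regression(problem_type))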