
AutoMLSearch: hold reference to training data #1597

Merged
merged 35 commits into main from ds_blitz_automl on Dec 29, 2020
Conversation

@dsherry (Contributor) commented Dec 23, 2020

Changes

  • Update AutoMLSearch constructor to take training data as X_train and y_train. Update search and add_to_leaderboard to no longer require data (see the sketch after this list)
  • Remove the woodwork warnings when pandas dtypes are provided (@angela97lin heads up: this will cause mild conflicts with #1555, "Allow users to set feature types without having to learn about woodwork directly". I can try to back it out of this PR if you'd like, but it felt easier in the moment to just remove it here heh)
  • Update split_data helper to accept problem_type, problem_configuration, test_size, shuffle, and to default random_state to 0
  • Rename the "Testing" split (during CV) in the automl log output to "Validation"
  • Move evalml/automl/data_splitters/ to evalml/preprocessing/data_splitters/ to avoid a circular import
  • Add problem type utils: is_regression, is_classification, is_time_series, is_binary, is_multiclass
  • Update documentation
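
A hedged sketch of the updated workflow (the demo dataset is an illustrative choice; parameter names follow this PR's description):

import evalml
from evalml.automl import AutoMLSearch
from evalml.preprocessing import split_data
from evalml.problem_types import is_classification

X, y = evalml.demos.load_breast_cancer()

# split_data now accepts problem_type and test_size, and defaults random_state to 0.
X_train, X_test, y_train, y_test = split_data(X, y, problem_type="binary", test_size=0.2)

# One of the new problem type utils.
assert is_classification("binary")

# Training data is now passed to the constructor and held by the search object...
automl = AutoMLSearch(X_train=X_train, y_train=y_train, problem_type="binary")
# ...so search() and add_to_leaderboard() no longer take X and y.
automl.search()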

@dsherry added the enhancement label on Dec 23, 2020

codecov bot commented Dec 23, 2020

Codecov Report

Merging #1597 (05a6c74) into main (54bf169) will increase coverage by 0.1%.
The diff coverage is 100.0%.


@@            Coverage Diff            @@
##             main    #1597     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         240      240             
  Lines       18056    18092     +36     
=========================================
+ Hits        18048    18084     +36     
  Misses          8        8             
Impacted Files Coverage Δ
evalml/automl/__init__.py 100.0% <ø> (ø)
evalml/preprocessing/data_splitters/__init__.py 100.0% <ø> (ø)
.../preprocessing/data_splitters/time_series_split.py 100.0% <ø> (ø)
evalml/automl/automl_search.py 99.7% <100.0%> (-<0.1%) ⬇️
evalml/automl/utils.py 100.0% <100.0%> (ø)
evalml/preprocessing/__init__.py 100.0% <100.0%> (ø)
...essing/data_splitters/training_validation_split.py 100.0% <100.0%> (ø)
evalml/preprocessing/utils.py 100.0% <100.0%> (ø)
evalml/problem_types/__init__.py 100.0% <100.0%> (ø)
evalml/problem_types/utils.py 100.0% <100.0%> (ø)
... and 12 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 54bf169...05a6c74.

@dsherry marked this pull request as ready for review December 23, 2020 21:30

    text_columns = list(text_column_vals.to_dataframe().columns)
    if len(text_columns) == 0:
        text_columns = None

    if not isinstance(y, ww.DataColumn):
        logger.warning("`y` passed was not a DataColumn. EvalML will try to convert the input as a Woodwork DataTable and types will be inferred. To control this behavior, please pass in a Woodwork DataTable instead.")
    y = _convert_to_woodwork_structure(y)

@dsherry (Contributor, Author) commented:

I removed these warnings because we were planning on doing so (#1555) and because it was easier than fixing the failing test coverage for them 😅
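
For context, a minimal sketch of the now-silent conversion path (using woodwork's DataTable API from this era; illustrative, not the PR's code):

import pandas as pd
import woodwork as ww

# Plain pandas input no longer triggers a warning: EvalML converts it to a
# Woodwork structure and infers types from the pandas dtypes.
df = pd.DataFrame({"a": [1, 2, 3], "b": [0.1, 0.2, 0.3]})
dt = ww.DataTable(df)  # pass a DataTable yourself to control the types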

@@ -614,29 +599,26 @@ def _get_mean_cv_scores_for_all_objectives(cv_data):
        scores[field] += value
    return {objective_name: float(score) / n_folds for objective_name, score in scores.items()}

-    def _compute_cv_scores(self, pipeline, X, y):
+    def _compute_cv_scores(self, pipeline):

@dsherry (Contributor, Author) commented:

It would be great to get more eyes on the changes here. I think I did it right, but it can't hurt to check.
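
For reviewers, a stripped-down sketch of the pattern (the class and scoring logic are hypothetical stand-ins, not evalml's implementation):

class AutoMLSearchSketch:
    # Illustration only: the search object now owns the training data.
    def __init__(self, X_train, y_train):
        # Hold a reference to the training data for the lifetime of the search.
        self.X_train = X_train
        self.y_train = y_train

    def _compute_cv_scores(self, pipeline):
        # Before this PR, X and y arrived as arguments; now they are read from
        # the instance, so callers only pass the pipeline.
        pipeline.fit(self.X_train, self.y_train)
        return pipeline.score(self.X_train, self.y_train)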

@@ -683,7 +667,7 @@ def _compute_cv_scores(self, pipeline, X, y):
         ordered_scores.update({self.objective.name: score})
         ordered_scores.update(scores)
         ordered_scores.update({"# Training": y_train.shape[0]})
-        ordered_scores.update({"# Testing": y_test.shape[0]})
+        ordered_scores.update({"# Validation": y_valid.shape[0]})

@dsherry (Contributor, Author) commented:

I renamed this field because I think "validation" is a better description.

@@ -21,6 +21,8 @@ def __init__(self, max_delay=0, gap=0, n_splits=3):
            of rows of the current split to avoid "throwing out" more data than is necessary.
        gap (int): Gap used in time series problem. Time series pipelines shift the target variable by gap rows
            since we are interested in
        n_splits (int): number of data splits to make.
        test_size (float): What percentage of data points should be included in the test set. Defaults to None.

@dsherry (Contributor, Author) commented:

Oops, I will remove test_size; it's a leftover.

@@ -43,4 +43,4 @@ def split(self, X, y=None):
            list: indices to split data into training and test set
        """
        train, test = train_test_split(np.arange(X.shape[0]), test_size=self.test_size, train_size=self.train_size, shuffle=self.shuffle, stratify=self.stratify, random_state=self.random_state)
-        return [(train, test)]
+        return iter([(train, test)])

@dsherry (Contributor, Author) commented Dec 23, 2020:

I believe this was necessary because BaseCrossValidator subclasses are expected to return iterators from split(). Some tests failed without it. I had to update other tests to call list(...) 🤷‍♂️
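
A tiny illustration of the contract (not the PR's test code):

import numpy as np
from sklearn.model_selection import train_test_split

indices = np.arange(100)
train, test = train_test_split(indices, test_size=0.25, random_state=0)

# Callers advance the result of split() with next(), which requires an
# iterator; next() on a plain list raises TypeError.
splits = iter([(train, test)])
train_idx, test_idx = next(splits)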

train, test = next(CV_method.split(X.to_dataframe(), y.to_series()))
data_splitter = None
if is_time_series(problem_type):
    data_splitter = TrainingValidationSplit(test_size=test_size, shuffle=False, stratify=None, random_state=random_state)

@dsherry (Contributor, Author) commented:

This is the change @freddyaboulton and I were discussing. Once this PR is merged, we can update looking glass to use split_data.
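
A hedged sketch of the splitter selection shown above (choose_splitter is a hypothetical wrapper; the real branch lives in evalml's utils):

from evalml.preprocessing.data_splitters import TrainingValidationSplit
from evalml.problem_types import is_time_series

def choose_splitter(problem_type, test_size, random_state=0):
    # Time series data must keep its row order, so this split never shuffles
    # or stratifies; other problem types fall through to CV splitters.
    if is_time_series(problem_type):
        return TrainingValidationSplit(test_size=test_size, shuffle=False,
                                       stratify=None, random_state=random_state)
    return None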

@bchen1116 (Contributor) left a comment:

LGTM! Left a couple comments but I think you caught everything!

docs/source/start.ipynb (review thread resolved; outdated)
evalml/problem_types/utils.py (review thread resolved)
README.md (review thread resolved; outdated)
"\n",
"Below EvalML utilizes Bayesian optimization (EvalML's default optimizer) to search and find the best pipeline defined by the given objective.\n",
"\n",
"EvalML provides a number of parameters to control the search process. `max_batches` is one of the parameters which controls the stopping criterion for the AutoML search. It indicates the maximum number of rounds of AutoML to evaluate, where each round may train and score a variable number of pipelines. In this example, `max_batches` is set to 1.\n",

@kmax12 (Contributor) commented Dec 28, 2020:

  1. It might be better to move the explanation of max_batches until after the code example; it doesn't mean anything to see that parameter defined until it's actually used. For example, see how the featuretools docs (1) show the function, (2) explain what the function did, and (3) explain the parameters that controlled the function.
     [screenshot of the featuretools docs]

  2. Not sure it's helpful to introduce the word "round" here, since it's basically functioning as a synonym for "batch" without actually defining the word "batch".

  3. We also don't give the user any intuition of when or why they'd want to use the parameter, i.e. that it controls the "budget" EvalML has to look for the best model: increasing the value allows EvalML to search deeper for the best model, at the risk of overfitting.
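
For reference, a hedged example of the parameter being discussed (the demo dataset is an illustrative choice):

import evalml
from evalml.automl import AutoMLSearch

X, y = evalml.demos.load_breast_cancer()

# max_batches caps how many batches of pipelines the search will evaluate.
# It acts as a budget: larger values search deeper at higher compute cost.
automl = AutoMLSearch(X_train=X, y_train=y, problem_type="binary", max_batches=1)
automl.search()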

A contributor commented:

I see that this text wasn't actually changed in the PR, so maybe this is best done as a follow-up.

@dsherry merged commit edb6ed6 into main Dec 29, 2020
@dsherry deleted the ds_blitz_automl branch December 29, 2020 16:20
ParthivNaresh added a commit that referenced this pull request Dec 29, 2020
@dsherry mentioned this pull request Dec 29, 2020
ParthivNaresh added a commit that referenced this pull request Jan 6, 2021
* First round

* Removed RMSLE, MSLE, and MAPE from non_core_objectives

* Add objective parameter to all data_check subclasses

* Add new data_check_message_code for a target that is incompatible with an objective

* Add check to invalid_target_data_check for RMSLE, MSLE, and MAPE

* Invalid Target Data Check update

* Fix test case to not include invalid objectives in regression_core_objectives

* Test updates

* Invalid target data check update

* Lint

* Release notes

* Tests and updates to code

* Release notes

* Latest Dependencies Version update

* Lint fix

* Dependency check

* Fixing minor issues

* test none objective update

* Update to reflect changes in #1597

* Make objective mandatory for DefaultDataChecks and InvalidTargetDataCheck

* Mock data check to see if objective is being passed from AutoML

* Fix breaking tests due to mandatory objective param in DefaultDataChecks, and check error output from DefaultDataChecks

* Change label to all positive so RMSLE, MSLE, and MAPE can be included in regression_core_objectives

* Raise error for None objective in InvalidTargetDataCheck

* Docstring example fix

* Lint and docstring example

* Jupyter notebook update for build_docs

* Jupyter notebook update

* remove print statement

* Added test cases and docstring, updated error message

* test case fixes

* Test update

* lint error

* Add positive_only class attribute for ObjectiveBase

* Adding @classproperty to ObjectiveBase for positive_only check

* Lint error

* Fix conflicts with #1551

* Lint
Labels: enhancement (An improvement to an existing feature.)
Projects: none yet
3 participants