
Add problem_configuration parameter to AutoMLSearch #1457

Merged: 11 commits merged into main from 1382-automl-problem-configuration, Nov 24, 2020

Conversation

freddyaboulton (Contributor)

Pull Request Description

Fixes #1382


After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:`123`.

@codecov (codecov bot) commented Nov 23, 2020

Codecov Report

Merging #1457 (830058f) into main (67b534f) will increase coverage by 0.1%.
The diff coverage is 100.0%.

Impacted file tree graph

@@            Coverage Diff            @@
##             main    #1457     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         223      223             
  Lines       14930    15001     +71     
=========================================
+ Hits        14923    14994     +71     
  Misses          7        7             
Impacted Files Coverage Δ
evalml/automl/utils.py 100.0% <ø> (ø)
evalml/automl/__init__.py 100.0% <100.0%> (ø)
...lml/automl/automl_algorithm/iterative_algorithm.py 100.0% <100.0%> (ø)
evalml/automl/automl_search.py 99.7% <100.0%> (+0.1%) ⬆️
evalml/data_checks/default_data_checks.py 100.0% <100.0%> (ø)
evalml/problem_types/problem_types.py 100.0% <100.0%> (ø)
evalml/problem_types/utils.py 100.0% <100.0%> (ø)
evalml/tests/automl_tests/test_automl.py 100.0% <100.0%> (ø)
...lml/tests/automl_tests/test_iterative_algorithm.py 100.0% <100.0%> (ø)
evalml/tests/data_checks_tests/test_data_checks.py 100.0% <100.0%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@freddyaboulton changed the title from "Passing problem configuration parameters to created pipelines." to "Add problem_configuration field to AutoMLSearch" on Nov 23, 2020
@freddyaboulton changed the title from "Add problem_configuration field to AutoMLSearch" to "Add problem_configuration parameter to AutoMLSearch" on Nov 23, 2020
@freddyaboulton force-pushed the 1382-automl-problem-configuration branch from 33a3faa to 5560011 on November 23, 2020 21:48
@freddyaboulton marked this pull request as ready for review on November 23, 2020 22:10
@@ -119,6 +122,8 @@ def add_result(self, score_to_minimize, pipeline):
    def _transform_parameters(self, pipeline_class, proposed_parameters):
        """Given a pipeline parameters dict, make sure n_jobs and number_features are set."""
        parameters = {}
        if self._pipeline_params:
freddyaboulton (Contributor, Author):

Need the `if` here so that the pipeline params are only passed to the pipeline if they are needed.

@@ -32,7 +32,7 @@ def __init__(self, problem_type):
        Arguments:
            problem_type (str): The problem type that is being validated. Can be regression, binary, or multiclass.
        """
-       if handle_problem_types(problem_type) == ProblemTypes.REGRESSION:
+       if handle_problem_types(problem_type) in [ProblemTypes.REGRESSION, ProblemTypes.TIME_SERIES_REGRESSION]:
freddyaboulton (Contributor, Author):

This should have been done in #1378 😬

Contributor:

Good catch!

We must've been missing unit test coverage of the default data checks for the time series regression problem type. Could we add that? We should just be able to clone an existing test and ensure the right data checks show up, just like regression.

Contributor:

I see you added that. Champion! 🏅 🤣
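
A minimal sketch of the kind of test being asked for here, assuming DefaultDataChecks exposes its checks via a data_checks attribute (that attribute name is an assumption, not confirmed by this diff):

from evalml.data_checks import DefaultDataChecks

def test_default_data_checks_time_series_regression():
    # Time series regression should produce the same default data checks as regression.
    regression_checks = DefaultDataChecks("regression").data_checks
    ts_checks = DefaultDataChecks("time series regression").data_checks
    assert [type(check) for check in ts_checks] == [type(check) for check in regression_checks]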

@@ -21,6 +21,10 @@ def __str__(self):
ProblemTypes.TIME_SERIES_REGRESSION.name: "time series regression"}
return problem_type_dict[self.name]

@classproperty
freddyaboulton (Contributor, Author):

Need this so that users can specify the problem type as "time series regression" (which matches the enum value) as opposed to "time_series_regression".

Contributor:

Ah got it. Where were the underscores coming from previously?

Contributor:

Oh is this because ProblemTypes.TIME_SERIES_REGRESSION.name is "time_series_regression", whereas ProblemTypes.TIME_SERIES_REGRESSION.value is "time series regression"?

freddyaboulton (Contributor, Author):

Yes, exactly! By default you can't look up the enum by .value, only by .name, but we'd rather users not have to use underscores, to keep it consistent with the .value.
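
To make the distinction concrete, a small standalone illustration (a simplified stand-in, not the actual evalml class definition):

from enum import Enum

class ProblemTypes(Enum):
    TIME_SERIES_REGRESSION = "time series regression"

print(ProblemTypes.TIME_SERIES_REGRESSION.name)   # the underscored identifier form
print(ProblemTypes.TIME_SERIES_REGRESSION.value)  # "time series regression"

# A value -> member map (what the new classproperty enables) lets the handler accept
# the human-readable string with spaces:
by_value = {pt.value: pt for pt in ProblemTypes}
print(by_value["time series regression"])         # ProblemTypes.TIME_SERIES_REGRESSION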

@freddyaboulton force-pushed the 1382-automl-problem-configuration branch from cc1c7df to 7c06b74 on November 23, 2020 22:45
-    def __init__(self, dummy_parameter='default', random_state=0):
-        super().__init__(parameters={'dummy_parameter': dummy_parameter}, component_obj=None, random_state=random_state)
+    def __init__(self, dummy_parameter='default', random_state=0, **kwargs):
+        super().__init__(parameters={'dummy_parameter': dummy_parameter, **kwargs},
freddyaboulton (Contributor, Author):

I need this to accept kwargs for one of my tests, but we should do this anyway because our convention is to allow kwargs in estimators.

@freddyaboulton force-pushed the 1382-automl-problem-configuration branch from 7c06b74 to c6fdecf on November 24, 2020 15:34
@freddyaboulton self-assigned this Nov 24, 2020
dsherry (Contributor) left a comment:

Great!! I didn't have any suggestions other than deleting a test, 🚢 !

            # Pass the pipeline params to the components that need them
            for param_name, value in self._pipeline_params.items():
                if param_name in init_params:
                    component_parameters[param_name] = value
Contributor:

@freddyaboulton got it, looks good to me!
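
Putting the two hunks together, a rough standalone sketch of the behavior (names mirror the diff, but this is an illustration, not the exact evalml implementation):

import inspect

def transform_parameters(component_classes, proposed_parameters, pipeline_params=None):
    """Merge automl-level pipeline params (e.g. gap, max_delay) into each component's
    parameters, but only for components whose __init__ actually accepts them."""
    parameters = {}
    for component_class in component_classes:
        component_parameters = dict(proposed_parameters.get(component_class.__name__, {}))
        if pipeline_params:  # the guard from the diff: do nothing when no config was given
            init_params = inspect.signature(component_class.__init__).parameters
            for param_name, value in pipeline_params.items():
                if param_name in init_params:
                    component_parameters[param_name] = value
        parameters[component_class.__name__] = component_parameters
    return parameters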

@@ -163,6 +167,9 @@ def __init__(self,
max_batches (int): The maximum number of batches of pipelines to search. Parameters max_time, and
max_iterations have precedence over stopping the search.

problem_configuration (dict, None): Additional parameters needed to configure the search. For example,
in time series problems, values should be passed in for the gap and max_delay variables.
Contributor:

👍
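
For reference, this is roughly what the new parameter looks like from the user's side; the gap/max_delay values are illustrative, and X and y are assumed to be a time-ordered feature matrix and target defined elsewhere:

from evalml.automl import AutoMLSearch

automl = AutoMLSearch(problem_type="time series regression",
                      problem_configuration={"gap": 1, "max_delay": 2},
                      max_iterations=5)
automl.search(X, y)  # X, y: time-ordered features and target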

    def _validate_problem_configuration(self, problem_configuration=None):
        if self.problem_type in [ProblemTypes.TIME_SERIES_REGRESSION]:
            required_parameters = {'gap', 'max_delay'}
            if not problem_configuration or not all(p in problem_configuration for p in required_parameters):
Contributor:

Ooh, fancy usage of all

This validation logic lgtm!
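
For completeness, the validation written out as a standalone sketch; the error message and the fallback return value are assumptions, not the exact implementation:

from evalml.problem_types import ProblemTypes

def validate_problem_configuration(problem_type, problem_configuration=None):
    # Time series regression needs gap and max_delay up front: the data splitter and the
    # generated pipelines both depend on them before the search starts.
    if problem_type in [ProblemTypes.TIME_SERIES_REGRESSION]:
        required_parameters = {"gap", "max_delay"}
        if not problem_configuration or not all(p in problem_configuration for p in required_parameters):
            raise ValueError(f"problem_configuration must include values for {required_parameters} "
                             "when the problem type is time series regression.")
    return problem_configuration or {}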

@@ -593,7 +613,7 @@ def _add_baseline_pipelines(self, X, y):
            baseline = ModeBaselineBinaryPipeline(parameters={})
        elif self.problem_type == ProblemTypes.MULTICLASS:
            baseline = ModeBaselineMulticlassPipeline(parameters={})
-       elif self.problem_type == ProblemTypes.REGRESSION:
+       else:
Contributor:

Ah, got it. This is great. I do wonder if we'll want to update our timeseries "baseline" to a weighted moving average or something. We can wait and see!

freddyaboulton (Contributor, Author):

Yep, lots of options here! Another naive thing we could do is just use the previous target value for "today's" prediction.
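
As a quick illustration of the "previous value" idea (plain pandas, not evalml code):

import pandas as pd

def naive_previous_value_forecast(y: pd.Series) -> pd.Series:
    # Predict each step with the immediately preceding observed target value;
    # the first step has no history, so fall back to the series mean.
    return y.shift(1).fillna(y.mean())

y = pd.Series([10, 12, 11, 13, 15])
print(naive_previous_value_forecast(y).tolist())  # [12.2, 10.0, 12.0, 11.0, 13.0]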

@@ -363,6 +379,9 @@ def _set_data_split(self, X):
            default_data_split = KFold(n_splits=3, random_state=self.random_state, shuffle=True)
        elif self.problem_type in [ProblemTypes.BINARY, ProblemTypes.MULTICLASS]:
            default_data_split = StratifiedKFold(n_splits=3, random_state=self.random_state, shuffle=True)
+       elif self.problem_type in [ProblemTypes.TIME_SERIES_REGRESSION]:
+           default_data_split = TimeSeriesSplit(n_splits=3, gap=self.problem_configuration['gap'],
+                                                max_delay=self.problem_configuration['max_delay'])
Contributor:

👍
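
A conceptual sketch of why gap has to be known before splitting; this is not evalml's TimeSeriesSplit, just an illustration of time-ordered folds with a held-out gap between train and test:

def rolling_splits(n_samples, n_splits=3, gap=2):
    # Training data always precedes the test window; `gap` rows between them are skipped
    # so that no target information leaks across the boundary.
    fold_size = n_samples // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train_end = i * fold_size
        test_start = train_end + gap
        test_end = min(test_start + fold_size, n_samples)
        yield list(range(train_end)), list(range(test_start, test_end))

for train, test in rolling_splits(20, n_splits=3, gap=2):
    print(train[-1], test[0], test[-1])  # last train index, first/last test index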

problem_params = {"gap": 3, "max_delay": 2, "extra": "foo"}
automl = AutoMLSearch(problem_type=problem_type, problem_configuration=problem_params, max_iterations=1)
automl.search(X, y)
assert automl._automl_algorithm._pipeline_params == problem_params
Contributor:

@freddyaboulton this is great. But I think the real test would be, do the pipelines created by the automl algo contain the correct parameters?

Contributor:

Oh lol I see that's your next test. Cool!

So in that case, between the iterative algo test and the test below this, is this test necessary?

freddyaboulton (Contributor, Author):

No I don't think we need it! Good catch. I added this test before realizing we needed something more thorough. I'll delete!
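
For the curious, the more thorough test could look roughly like this; the IterativeAlgorithm constructor argument, the fixture, and the next_batch()/parameters attributes are assumptions here, and the fixture's components are assumed to accept gap and max_delay:

from evalml.automl.automl_algorithm import IterativeAlgorithm

def test_pipeline_params_reach_created_pipelines(pipeline_classes_accepting_time_series_params):
    algo = IterativeAlgorithm(allowed_pipelines=pipeline_classes_accepting_time_series_params,
                              pipeline_params={"gap": 3, "max_delay": 2})
    # Every pipeline the algorithm proposes should carry the configured values through
    # to the components that accept them.
    for pipeline in algo.next_batch():
        for component_params in pipeline.parameters.values():
            assert component_params.get("gap") == 3
            assert component_params.get("max_delay") == 2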

@freddyaboulton force-pushed the 1382-automl-problem-configuration branch from 05f3d61 to 830058f on November 24, 2020 16:26
@freddyaboulton merged commit ef01e1b into main Nov 24, 2020
@freddyaboulton deleted the 1382-automl-problem-configuration branch November 24, 2020 17:03
@dsherry mentioned this pull request Nov 24, 2020
Successfully merging this pull request may close these issues: Add a problem_configuration parameter to AutoMLSearch

2 participants