Add problem_configuration parameter to AutoMLSearch #1457
Conversation
Codecov Report
@@           Coverage Diff            @@
##             main    #1457    +/-   ##
=========================================
+ Coverage   100.0%   100.0%    +0.1%
=========================================
  Files         223      223
  Lines       14930    15001      +71
=========================================
+ Hits        14923    14994      +71
  Misses          7        7
Continue to review full report at Codecov.
33a3faa to 5560011
@@ -119,6 +122,8 @@ def add_result(self, score_to_minimize, pipeline):
    def _transform_parameters(self, pipeline_class, proposed_parameters):
        """Given a pipeline parameters dict, make sure n_jobs and number_features are set."""
        parameters = {}
        if self._pipeline_params:
Need the if check here so that the parameters are only passed to the pipeline if they are needed.
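As a rough, self-contained sketch of that intent (the inspect-based signature check and the component traversal below are illustrative assumptions, not the actual evalml implementation):

import inspect

def transform_parameters(pipeline_params, component_classes, proposed_parameters):
    """Merge user-supplied pipeline params (e.g. {'gap': 1, 'max_delay': 2}) into each
    component's parameters, but only for components whose __init__ accepts them."""
    parameters = {}
    if pipeline_params:  # the guard under discussion: do nothing when no params were configured
        for component_class in component_classes:
            init_params = inspect.signature(component_class.__init__).parameters
            component_parameters = dict(proposed_parameters.get(component_class.__name__, {}))
            for param_name, value in pipeline_params.items():
                if param_name in init_params:
                    component_parameters[param_name] = value
            parameters[component_class.__name__] = component_parameters
    return parameters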
@@ -32,7 +32,7 @@ def __init__(self, problem_type):
    Arguments:
        problem_type (str): The problem type that is being validated. Can be regression, binary, or multiclass.
    """
    if handle_problem_types(problem_type) == ProblemTypes.REGRESSION:
    if handle_problem_types(problem_type) in [ProblemTypes.REGRESSION, ProblemTypes.TIME_SERIES_REGRESSION]:
This should have been done in #1378 😬
Good catch!
We must've been missing unit test coverage for the default data checks for the time series regression problem type. Could we add that? It should just be a matter of cloning an existing test and ensuring the right data checks show up, just like for regression.
I see you added that. Champion! 🏅 🤣
@@ -21,6 +21,10 @@ def __str__(self):
                             ProblemTypes.TIME_SERIES_REGRESSION.name: "time series regression"}
        return problem_type_dict[self.name]

    @classproperty
Need this so that users can specify the problem type as "time series regression" (which matches the enum value) as opposed to "time_series_regression".
Ah got it. Where were the underscores coming from previously?
Oh, is this because ProblemTypes.TIME_SERIES_REGRESSION.name is "time_series_regression", whereas ProblemTypes.TIME_SERIES_REGRESSION.value is "time series regression"?
Yes, exactly! By default, you can't look up the enum by .value, only the .name, but we prefer that users not have to use underscores, to keep it consistent with the .value!
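A quick stdlib illustration of the .name/.value distinction; the alias mapping at the end is only a guess at what the new classproperty exposes:

from enum import Enum

class ProblemTypes(Enum):
    TIME_SERIES_REGRESSION = "time series regression"

print(ProblemTypes.TIME_SERIES_REGRESSION.name)    # TIME_SERIES_REGRESSION (underscores)
print(ProblemTypes.TIME_SERIES_REGRESSION.value)   # time series regression (spaces)
print(ProblemTypes["TIME_SERIES_REGRESSION"])      # name-based lookup needs the underscore form

# Hypothetical alias table so users can pass the space-separated value instead:
aliases = {t.value: t for t in ProblemTypes}
print(aliases["time series regression"])           # ProblemTypes.TIME_SERIES_REGRESSION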
cc1c7df to 7c06b74
def __init__(self, dummy_parameter='default', random_state=0):
    super().__init__(parameters={'dummy_parameter': dummy_parameter}, component_obj=None, random_state=random_state)

def __init__(self, dummy_parameter='default', random_state=0, **kwargs):
    super().__init__(parameters={'dummy_parameter': dummy_parameter, **kwargs},
I need this to accept kwargs for one of my tests but we should do this anyway because our convention is to allow kwargs to estimators.
7c06b74 to c6fdecf
Great!! I didn't have any suggestions other than deleting a test, 🚢 !
# Pass the pipeline params to the components that need them
for param_name, value in self._pipeline_params.items():
    if param_name in init_params:
        component_parameters[param_name] = value
@freddyaboulton got it, looks good to me!
@@ -163,6 +167,9 @@ def __init__(self,
        max_batches (int): The maximum number of batches of pipelines to search. Parameters max_time, and
            max_iterations have precedence over stopping the search.

        problem_configuration (dict, None): Additional parameters needed to configure the search. For example,
            in time series problems, values should be passed in for the gap and max_delay variables.
👍
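A usage sketch built from the docstring above (the import path and toy data are assumptions; the gap/max_delay names come from this PR):

import pandas as pd
from evalml.automl import AutoMLSearch

X = pd.DataFrame({"feature": range(20)})   # placeholder time-ordered features
y = pd.Series(range(20))                   # placeholder target

automl = AutoMLSearch(problem_type="time series regression",
                      problem_configuration={"gap": 1, "max_delay": 2},
                      max_iterations=1)
automl.search(X, y)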
def _validate_problem_configuration(self, problem_configuration=None):
    if self.problem_type in [ProblemTypes.TIME_SERIES_REGRESSION]:
        required_parameters = {'gap', 'max_delay'}
        if not problem_configuration or not all(p in problem_configuration for p in required_parameters):
Ooh, fancy usage of all. This validation logic lgtm!
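The hunk above cuts off before the error is raised; a standalone sketch of the same check (the exception type and message are assumptions) might look like:

def validate_problem_configuration(problem_type, problem_configuration=None):
    if problem_type == "time series regression":
        required_parameters = {"gap", "max_delay"}
        if not problem_configuration or not all(p in problem_configuration for p in required_parameters):
            raise ValueError("problem_configuration must contain 'gap' and 'max_delay' "
                             "for time series regression problems.")
    return problem_configuration or {}

validate_problem_configuration("time series regression", {"gap": 1, "max_delay": 2})  # passes
# validate_problem_configuration("time series regression", {"gap": 1})                # raises ValueError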
@@ -593,7 +613,7 @@ def _add_baseline_pipelines(self, X, y):
        baseline = ModeBaselineBinaryPipeline(parameters={})
    elif self.problem_type == ProblemTypes.MULTICLASS:
        baseline = ModeBaselineMulticlassPipeline(parameters={})
    elif self.problem_type == ProblemTypes.REGRESSION:
    else:
Ah, got it. This is great. I do wonder if we'll want to update our time series "baseline" to a weighted moving average or something. We can wait and see!
Yep, lots of options here! Another naive thing we could do is just use the previous target value for "today's" prediction.
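For example, a last-value (lag-1) baseline is just the target shifted by one step; a minimal pandas sketch of that idea (not the actual baseline pipeline):

import pandas as pd

y = pd.Series([10, 12, 11, 13, 15])
naive_prediction = y.shift(1)       # "today's" prediction is yesterday's observed value
print(naive_prediction.tolist())    # [nan, 10.0, 12.0, 11.0, 13.0]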
@@ -363,6 +379,9 @@ def _set_data_split(self, X):
        default_data_split = KFold(n_splits=3, random_state=self.random_state, shuffle=True)
    elif self.problem_type in [ProblemTypes.BINARY, ProblemTypes.MULTICLASS]:
        default_data_split = StratifiedKFold(n_splits=3, random_state=self.random_state, shuffle=True)
    elif self.problem_type in [ProblemTypes.TIME_SERIES_REGRESSION]:
        default_data_split = TimeSeriesSplit(n_splits=3, gap=self.problem_configuration['gap'],
                                             max_delay=self.problem_configuration['max_delay'])
👍
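To make the gap concrete, here is a toy, index-only illustration of time-ordered folds with a gap between the training and testing windows (it mimics the idea only, not evalml's TimeSeriesSplit implementation):

# Expanding-window folds: train on everything up to a cutoff, skip `gap` periods, then test.
n_samples, n_splits, gap = 12, 3, 2
fold_size = n_samples // (n_splits + 1)
for i in range(1, n_splits + 1):
    train_end = i * fold_size
    test_start = train_end + gap
    test_end = min(test_start + fold_size, n_samples)
    print("train:", list(range(train_end)), "test:", list(range(test_start, test_end)))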
problem_params = {"gap": 3, "max_delay": 2, "extra": "foo"}
automl = AutoMLSearch(problem_type=problem_type, problem_configuration=problem_params, max_iterations=1)
automl.search(X, y)
assert automl._automl_algorithm._pipeline_params == problem_params
@freddyaboulton this is great. But I think the real test would be, do the pipelines created by the automl algo contain the correct parameters?
Oh lol I see that's your next test. Cool!
So in that case, between the iterative algo test and the test below this, is this test necessary?
No I don't think we need it! Good catch. I added this test before realizing we needed something more thorough. I'll delete!
05f3d61 to 830058f
Pull Request Description
Fixes #1382
After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:`123`.