Require date index parameter for time series #3041
Conversation
Codecov Report

    @@           Coverage Diff            @@
    ##            main   #3041     +/-   ##
    =======================================
    - Coverage   99.8%   99.7%    -0.0%
    =======================================
      Files        312     312
      Lines      30255   30344      +89
    =======================================
    + Hits       30167   30252      +85
    - Misses        88      92      +4

Continue to review full report at Codecov.
Force-pushed from 6226531 to 731a3f4
        Returns:
            list[Transformer]: A list of applicable preprocessing components to use with the estimator.
        """
        if is_time_series(problem_type):
This is a refactor aimed at letting us tweak the pipeline structure for time series without making the implementation for non-time series pipelines more confusing. I suspect we'll need something like this too for #2511
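The kind of refactor described here can be sketched as dispatching pipeline construction on the problem type instead of interleaving branches in one long function. This is a minimal illustrative sketch, assuming hypothetical function and component names (not EvalML's actual API):

```python
from typing import Callable, List


def _ts_preprocessing(problem_type: str) -> List[str]:
    # Time-series-specific preprocessing, kept in its own function so
    # tweaks here don't complicate the non-time-series path.
    return ["Imputer", "TimeSeriesFeaturizer"]


def _default_preprocessing(problem_type: str) -> List[str]:
    return ["DateTimeFeaturizer", "Imputer"]


def get_preprocessing_components(problem_type: str) -> List[str]:
    # Dispatch once on problem type; each builder owns its own structure.
    builder: Callable[[str], List[str]] = (
        _ts_preprocessing
        if problem_type.startswith("time series")
        else _default_preprocessing
    )
    return builder(problem_type)
```

The payoff is that a change to the time-series pipeline structure touches only `_ts_preprocessing`.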
Force-pushed from 8c345a8 to 716d6c5
This looks really good. I'm a big fan of the refactoring of make_pipeline. Haven't taken a thorough look at the tests yet, but I have a bunch of questions :P
@@ -105,7 +105,12 @@ def fit(self, X, y=None):

        Returns:
            self

        Raises:
            ValueError: if self.date_index is None
Can't comment in the place I want, but is the docstring for the init method still valid?

    date_index (str): Name of the column containing the datetime information used to order the data. Ignored.

Why do we check this here? Can we not do these checks at init time / make date_index a required parameter rather than a keyword parameter with a default of None?
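The two options being weighed can be sketched with an illustrative stub (this is not EvalML's real class; the names follow the diff above). Validating in `__init__` makes the parameter effectively required; validating in `fit`, as the diff does, lets AutoML still construct the component with defaults:

```python
class DelayedFeatureTransformer:
    """Illustrative stub showing the two validation strategies discussed."""

    def __init__(self, date_index=None):
        # Option A (eager): uncomment to make date_index effectively required.
        # if date_index is None:
        #     raise ValueError("date_index cannot be None")
        self.date_index = date_index

    def fit(self, X, y=None):
        # Option B (lazy, as in the diff): defer the check to fit time so the
        # component can still be instantiated with default parameters.
        if self.date_index is None:
            raise ValueError("date_index cannot be None for DelayedFeatureTransformer")
        return self
```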
        # Normalize the data into pandas objects
        X_ww = infer_feature_types(X)
        cols_to_delay = list(
So what we're doing here is first selecting categorical (semantic tag) columns, and then selecting numeric, category, and boolean logical types. Why does this need to change compared to what we were previously doing?
I talk about this a little bit in the comment in _get_preprocessing_components, but we don't want to engineer the date_index. It's a little unfortunate we need to let that special-case logic seep into the components, but I think that's the best way to go.
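The column-selection being discussed (pick the delayable columns while passing the date index through untouched) can be sketched like this. The function name `columns_to_delay` and the exact dtype set are assumptions based on the snippet above, not the actual implementation:

```python
import pandas as pd


def columns_to_delay(X: pd.DataFrame, date_index: str) -> list:
    """Return the numeric/categorical/boolean columns eligible for
    delayed-feature engineering, excluding the date index column."""
    eligible = X.select_dtypes(include=["number", "category", "bool"]).columns
    # Exclude the date index even if it would otherwise qualify, so it
    # flows through the transformer unmodified.
    return [col for col in eligible if col != date_index]
```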
    if is_classification(problem_type):
        pp_components.append(LabelEncoder)
RIP pp_components
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But also, thank you for breaking this into more sizeable chunks :) it was definitely getting to be too much.
            list[Transformer]: A list of applicable preprocessing components to use with the estimator.
        """
        if is_time_series(problem_type):
            components_functions = [
Is the only difference between these two the order of the list and _get_time_series_featurizer? If so, is it possible to change our non-time series pipeline to match the order of our time series pipeline, without the featurizer?
Curious because from afar it's hard to understand why the order needs to be the order that it is, and I have a feeling looking back at this in a few months I'll still feel that way 😂
For non-time series we want to impute after the DateTimeFeaturizer so that we can impute any NaNs in the datetime features.

For time series, it's a bit trickier. We want the date_index to be present in the DelayedFeatureTransformer (which means that component should be before DateTimeFeaturizer), but we can't impute after DelayedFeatureTransformer because it'll impute the NaNs created by the DelayedFeatureTransformer. Those rows with NaNs should be dropped instead.

In a nutshell, we want to pass the date index through DelayedFeatureTransformer without transforming it. I think the simplest way to do that is to switch the order of the DelayedFeatureTransformer and DateTimeFeaturizer in the pipeline.
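The ordering constraints above can be written down explicitly. This is a sketch of the two orders as described in the discussion, not EvalML's literal pipeline definitions; the NaN-row-dropping step at the end is a hypothetical name standing in for "those rows should be dropped instead":

```python
# Non-time series: featurize datetimes first, then impute any NaNs
# those features contain.
NON_TIME_SERIES_ORDER = [
    "DateTimeFeaturizer",
    "Imputer",
]

# Time series: impute before delaying; the delayer needs the raw
# date_index, so it runs before DateTimeFeaturizer; the leading NaNs
# the delayer creates are dropped, not imputed.
TIME_SERIES_ORDER = [
    "Imputer",
    "DelayedFeatureTransformer",
    "DateTimeFeaturizer",
    "DropNaNRows",  # hypothetical name for the row-dropping step
]
```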
    AutoMLTestEnv,
    ts_data_binary,
    ts_data_multi,
    ts_data,
):
    if is_binary(problem_type):
Now I feel like we're at a point where checking is_binary(problem_type) is no longer cleaner than just checking problem_type == ProblemTypes.BINARY, etc etc 😅
Force-pushed from 13d81fa to 4384eb4
    def fit_transform(self, X, y):
        """Fit the component and transform the input data.

        Args:
-           X (pd.DataFrame or None): Data to transform. None is expected when only the target variable is being used.
+           X (pd.DataFrame): Data to transform. None is expected when only the target variable is being used.
Does this docstring still hold?
Good catch, let me fix this!
This was a big one, but it looks good to me. I'm warming up to the changes to get_preprocessing_components. Looks good, man.
@@ -796,9 +796,11 @@ def _validate_problem_configuration(self, problem_configuration=None):
                p in problem_configuration for p in required_parameters
            ):
                raise ValueError(
-                   "user_parameters must be a dict containing values for at least the date_index, gap, max_delay, "
+                   "problem_configuration must be a dict containing values for at least the date_index, gap, max_delay, "
Thanks for changing this!
        ]
        components = []
        for function in components_functions:
            components.extend(function(X, y, problem_type, estimator_class, sampler_name))
OK, I can get on board with this...it's clear in which circumstances you're generating the list in which order.
Force-pushed from 7b2fd35 to ce1e408
Force-pushed from ce1e408 to bcaaabc
Pull Request Description
Fixes #3031
Planning on filing a separate issue for refactoring ARIMA/Prophet in light of this: #3046
After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:`123`.