Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Require date index parameter for time series #3041

Merged
merged 11 commits into from
Nov 17, 2021
Merged

Conversation

freddyaboulton
Copy link
Contributor

@freddyaboulton freddyaboulton commented Nov 12, 2021

Pull Request Description

Fixes #3031

Planning on filing a separate issue for refactoring ARIMA/Prophet in light of this: #3046


After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:123.

@codecov
Copy link

codecov bot commented Nov 12, 2021

Codecov Report

Merging #3041 (bcaaabc) into main (51c8914) will decrease coverage by 0.1%.
The diff coverage is 100.0%.

Impacted file tree graph

@@           Coverage Diff           @@
##            main   #3041     +/-   ##
=======================================
- Coverage   99.8%   99.7%   -0.0%     
=======================================
  Files        312     312             
  Lines      30255   30344     +89     
=======================================
+ Hits       30167   30252     +85     
- Misses        88      92      +4     
Impacted Files Coverage Δ
.../pipelines/time_series_classification_pipelines.py 100.0% <ø> (ø)
...valml/pipelines/time_series_regression_pipeline.py 100.0% <ø> (ø)
...s/prediction_explanations_tests/test_algorithms.py 100.0% <ø> (ø)
...lml/tests/model_understanding_tests/test_graphs.py 100.0% <ø> (ø)
evalml/automl/automl_search.py 99.9% <100.0%> (+0.1%) ⬆️
...rmers/preprocessing/delayed_feature_transformer.py 100.0% <100.0%> (ø)
evalml/pipelines/time_series_pipeline_base.py 99.1% <100.0%> (-0.9%) ⬇️
evalml/pipelines/utils.py 99.7% <100.0%> (+0.1%) ⬆️
...ts/automl_tests/parallel_tests/test_automl_dask.py 100.0% <100.0%> (ø)
evalml/tests/automl_tests/test_automl.py 99.5% <100.0%> (+0.1%) ⬆️
... and 15 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 51c8914...bcaaabc. Read the comment docs.

Returns:
list[Transformer]: A list of applicable preprocessing components to use with the estimator.
"""
if is_time_series(problem_type):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a refactor aimed at letting us tweak the pipeline structure for time series without making the implementation for non-time series pipelines more confusing. I suspect we'll need something like this too for #2511

Copy link
Contributor

@angela97lin angela97lin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really good. I'm a big fan of the refactorization for make_pipeline.

Haven't taken a thorough look at the tests yet, but have a bunch of questions :P

@@ -105,7 +105,12 @@ def fit(self, X, y=None):

Returns:
self

Raises:
ValueError: if self.date_index is None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't comment in the place I want but is the docstring for the init method still valid?

date_index (str): Name of the column containing the datetime information used to order the data. Ignored.

Why do we check this here? Can we not do these checks during init time / make date_index a required parameter rather than a keyword parameter with a default of None?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it is accurate because the date_index is ignored. It won't be ignored in #3028 .

The problem with checking in the init is that the default value of date_index is that DelayedFeatureTransformer() will raise an exception and that will cause AutoMLSearch to error out until #1637 is done.

# Normalize the data into pandas objects
X_ww = infer_feature_types(X)
cols_to_delay = list(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what we're doing here is first selecting categorical (semantic tag) columns, and then selecting numeric, category, and boolean logical types. Why does this need to change compared to what we were previously doing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I talk about this a little bit in the comment in _get_preprocessing_components but we don't want to engineer the date_index.

It's a little unfortunate we need to seep that special-case logic into the components but I think that's the best way to go.

if is_classification(problem_type):
pp_components.append(LabelEncoder)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RIP pp_components

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But also, thank you for breaking this into more sizeable chunks :) it was definitely getting to be too much.

list[Transformer]: A list of applicable preprocessing components to use with the estimator.
"""
if is_time_series(problem_type):
components_functions = [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the only difference between these two the order of the list and _get_time_series_featurizer? If so, is it possible to change our non-time series pipeline to match the order of our time series pipeline, without the featurizer?

Curious because from afar it's hard to understand why the order needs to be the order that it is, and I have a feeling looking back at this in a few months I'll still feel that way 😂

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For non-time series we want to impute after the date time featurizer so that we can impute any NaNs in the datetime features.

For time series, it's a bit trickier. We want the date_index to be present in the DelayedFeatureTransformer (which means that component should be before DateTimeFeaturizer) but we can't impute after DelayedFeatureTransformer because it'll impute the NaNs created by the DelayedFeatureTransformer. Those rows with NaNs should be dropped instead.

In a nutshell, we want to pass the date index through DelayedFeatureTransformer without transforming it. I think the simplest way to do that is to switch the order of the delayed feature transformer and Date Time Featurizer in the pipeline.

AutoMLTestEnv,
ts_data_binary,
ts_data_multi,
ts_data,
):
if is_binary(problem_type):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I feel like we're at a point where checking is_binary(problem_type) is no longer cleaner than just checking problem_type == ProblemTypes.BINARY, etc etc 😅


def fit_transform(self, X, y):
"""Fit the component and transform the input data.

Args:
X (pd.DataFrame or None): Data to transform. None is expected when only the target variable is being used.
X (pd.DataFrame): Data to transform. None is expected when only the target variable is being used.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this docstring still hold?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch let me fix this!

Copy link
Contributor

@chukarsten chukarsten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a big one, but it looks good to me. I'm warming up on the changes to get_preprocessing_components. Looks good, man.

@@ -796,9 +796,11 @@ def _validate_problem_configuration(self, problem_configuration=None):
p in problem_configuration for p in required_parameters
):
raise ValueError(
"user_parameters must be a dict containing values for at least the date_index, gap, max_delay, "
"problem_configuration must be a dict containing values for at least the date_index, gap, max_delay, "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for changing this!

]
components = []
for function in components_functions:
components.extend(function(X, y, problem_type, estimator_class, sampler_name))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I can get on board with this...it's clear in which circumstances you're generating the list in which order.

@freddyaboulton freddyaboulton merged commit 24779e0 into main Nov 17, 2021
@freddyaboulton freddyaboulton deleted the require-date-index branch November 17, 2021 19:21
@chukarsten chukarsten mentioned this pull request Nov 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Require date_index for time series problems
4 participants