Require date index parameter for time series #3041

freddyaboulton · 2021-11-12T15:01:06Z

Pull Request Description

Planning on filing a separate issue for refactoring ARIMA/Prophet in light of this: #3046

After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:123.

codecov · 2021-11-12T15:04:19Z

Codecov Report

Merging #3041 (bcaaabc) into main (51c8914) will decrease coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #3041     +/-   ##
=======================================
- Coverage   99.8%   99.7%   -0.0%     
=======================================
  Files        312     312             
  Lines      30255   30344     +89     
=======================================
+ Hits       30167   30252     +85     
- Misses        88      92      +4

Impacted Files	Coverage Δ
.../pipelines/time_series_classification_pipelines.py	`100.0% <ø> (ø)`
...valml/pipelines/time_series_regression_pipeline.py	`100.0% <ø> (ø)`
...s/prediction_explanations_tests/test_algorithms.py	`100.0% <ø> (ø)`
...lml/tests/model_understanding_tests/test_graphs.py	`100.0% <ø> (ø)`
evalml/automl/automl_search.py	`99.9% <100.0%> (+0.1%)`	⬆️
...rmers/preprocessing/delayed_feature_transformer.py	`100.0% <100.0%> (ø)`
evalml/pipelines/time_series_pipeline_base.py	`99.1% <100.0%> (-0.9%)`	⬇️
evalml/pipelines/utils.py	`99.7% <100.0%> (+0.1%)`	⬆️
...ts/automl_tests/parallel_tests/test_automl_dask.py	`100.0% <100.0%> (ø)`
evalml/tests/automl_tests/test_automl.py	`99.5% <100.0%> (+0.1%)`	⬆️
... and 15 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

freddyaboulton · 2021-11-15T19:35:11Z

evalml/pipelines/utils.py

+    Returns:
+        list[Transformer]: A list of applicable preprocessing components to use with the estimator.
+    """
+    if is_time_series(problem_type):


This is a refactor aimed at letting us tweak the pipeline structure for time series without making the implementation for non-time series pipelines more confusing. I suspect we'll need something like this for #2511 too.

angela97lin

This looks really good. I'm a big fan of the refactoring of make_pipeline.

Haven't taken a thorough look at the tests yet, but have a bunch of questions :P

angela97lin · 2021-11-15T21:49:27Z

evalml/pipelines/components/transformers/preprocessing/delayed_feature_transformer.py

@@ -105,7 +105,12 @@ def fit(self, X, y=None):

        Returns:
            self
+
+        Raises:
+            ValueError: if self.date_index is None


Can't comment in the place I want but is the docstring for the init method still valid?

date_index (str): Name of the column containing the datetime information used to order the data. Ignored.

Why do we check this here? Can we not do these checks at init time and make date_index a required parameter rather than a keyword parameter with a default of None?

I guess it is accurate because the date_index is ignored. It won't be ignored in #3028.

The problem with checking in the init is that the default value of date_index is None, so DelayedFeatureTransformer() will raise an exception, and that will cause AutoMLSearch to error out until #1637 is done.
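For readers following the thread, here is a minimal sketch of the fit-time check being discussed. `SketchDelayedTransformer`, its `max_delay` default, and the error message are hypothetical stand-ins for illustration, not evalml's actual implementation. The point is only that a no-argument construction keeps working while fit still rejects a missing date_index.

```python
class SketchDelayedTransformer:
    """Hypothetical stand-in used only to illustrate fit-time validation."""

    def __init__(self, date_index=None, max_delay=2):
        # No validation here, so constructing with defaults still succeeds.
        self.date_index = date_index
        self.max_delay = max_delay

    def fit(self, X, y=None):
        # The check is deferred to fit, when the value is actually needed.
        if self.date_index is None:
            raise ValueError("date_index cannot be None!")
        return self


default_component = SketchDelayedTransformer()  # does not raise

try:
    default_component.fit(X=None)
    fit_error = None
except ValueError as err:
    fit_error = str(err)

# With a date_index supplied, fit returns the component itself.
configured = SketchDelayedTransformer(date_index="date").fit(X=None)
```

Moving the check into `__init__` would make the `default_component` line raise, which is the failure mode described above for code that instantiates components with default arguments.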

angela97lin · 2021-11-15T21:52:34Z

evalml/pipelines/components/transformers/preprocessing/delayed_feature_transformer.py

        # Normalize the data into pandas objects
        X_ww = infer_feature_types(X)
+        cols_to_delay = list(


So what we're doing here is first selecting categorical (semantic tag) columns, and then selecting numeric, category, and boolean logical types. Why does this need to change compared to what we were previously doing?

I talk about this a little bit in the comment in _get_preprocessing_components, but we don't want to engineer delayed features from the date_index.

It's a little unfortunate we need to seep that special-case logic into the components but I think that's the best way to go.
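A rough sketch of the selection described in this thread, written with plain pandas dtypes rather than Woodwork logical types. The helper name `select_columns_to_delay` and the sample frame are assumptions made for illustration. The date column is deliberately stored as a categorical so that it would be picked up without the explicit exclusion.

```python
import pandas as pd


def select_columns_to_delay(X, date_index):
    """Hypothetical helper: numeric, categorical, and boolean columns, minus date_index."""
    candidates = X.select_dtypes(include=["number", "category", "bool"]).columns
    # The special case from the thread: never build lags of the date column.
    return [col for col in candidates if col != date_index]


X = pd.DataFrame(
    {
        # Stored as a categorical on purpose, so it matches the dtype filter.
        "date": pd.Categorical(["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"]),
        "sales": [10.0, 12.0, 9.0, 11.0],
        "promo": [True, False, False, True],
        "store": pd.Categorical(["a", "b", "a", "b"]),
        "note": ["w", "x", "y", "z"],  # object dtype, not selected
    }
)

cols_to_delay = select_columns_to_delay(X, date_index="date")
```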

angela97lin · 2021-11-15T21:53:22Z

evalml/pipelines/utils.py

    if is_classification(problem_type):
-        pp_components.append(LabelEncoder)


RIP pp_components

But also, thank you for breaking this into more manageable chunks :) it was definitely getting to be too much.

angela97lin · 2021-11-15T21:59:51Z

evalml/pipelines/utils.py

+        list[Transformer]: A list of applicable preprocessing components to use with the estimator.
+    """
+    if is_time_series(problem_type):
+        components_functions = [


Is the only difference between these two the order of the list and _get_time_series_featurizer? If so, is it possible to change our non-time series pipeline to match the order of our time series pipeline, without the featurizer?

Curious because from afar it's hard to understand why the order needs to be the order that it is, and I have a feeling looking back at this in a few months I'll still feel that way 😂

For non-time series we want to impute after the date time featurizer so that we can impute any NaNs in the datetime features.

For time series, it's a bit trickier. We want the date_index to be present in the DelayedFeatureTransformer (which means that component should be before DateTimeFeaturizer) but we can't impute after DelayedFeatureTransformer because it'll impute the NaNs created by the DelayedFeatureTransformer. Those rows with NaNs should be dropped instead.

In a nutshell, we want to pass the date index through DelayedFeatureTransformer without transforming it. I think the simplest way to do that is to switch the order of DelayedFeatureTransformer and DateTimeFeaturizer in the pipeline.
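The ordering argument can be illustrated with plain pandas. This is a sketch under assumptions, not the evalml components themselves: mean imputation, a single lag of 1, and the column names are all made up for the example. It shows why imputation has to happen before lagging, and why the NaNs introduced by the lag are dropped rather than imputed.

```python
import pandas as pd

X = pd.DataFrame(
    {
        "date": pd.date_range("2021-01-01", periods=5),
        "feature": [1.0, None, 3.0, 4.0, 5.0],
    }
)

# Step 1: impute genuinely missing values before any lagging happens.
imputed = X.copy()
imputed["feature"] = imputed["feature"].fillna(imputed["feature"].mean())

# Step 2: build the lagged column; the date column passes through untouched.
delayed = imputed.copy()
delayed["feature_delay_1"] = delayed["feature"].shift(1)

# Step 3: the first row has no history, so it is dropped instead of imputed.
ready = delayed.dropna().reset_index(drop=True)
```

Running an imputer after step 2 would fill `feature_delay_1` in the first row with a value that never existed in the series, which is the outcome the comment above is trying to avoid.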

angela97lin · 2021-11-15T22:01:08Z

evalml/tests/automl_tests/parallel_tests/test_automl_dask.py

+    AutoMLTestEnv,
+    ts_data_binary,
+    ts_data_multi,
+    ts_data,
 ):
    if is_binary(problem_type):


Now I feel like we're at a point where checking is_binary(problem_type) is no longer cleaner than just checking problem_type == ProblemTypes.BINARY, etc etc 😅

eccabay · 2021-11-16T18:43:11Z

evalml/pipelines/components/transformers/preprocessing/delayed_feature_transformer.py


    def fit_transform(self, X, y):
        """Fit the component and transform the input data.

        Args:
-            X (pd.DataFrame or None): Data to transform. None is expected when only the target variable is being used.
+            X (pd.DataFrame): Data to transform. None is expected when only the target variable is being used.


Does this docstring still hold?

Good catch, let me fix this!

chukarsten

This was a big one, but it looks good to me. I'm warming up to the changes to get_preprocessing_components. Looks good, man.

chukarsten · 2021-11-17T17:04:51Z

evalml/automl/automl_search.py

@@ -796,9 +796,11 @@ def _validate_problem_configuration(self, problem_configuration=None):
                p in problem_configuration for p in required_parameters
            ):
                raise ValueError(
-                    "user_parameters must be a dict containing values for at least the date_index, gap, max_delay, "
+                    "problem_configuration must be a dict containing values for at least the date_index, gap, max_delay, "


Thanks for changing this!

chukarsten · 2021-11-17T17:38:02Z

evalml/pipelines/utils.py

+        ]
+    components = []
+    for function in components_functions:
+        components.extend(function(X, y, problem_type, estimator_class, sampler_name))


OK, I can get on board with this...it's clear in which circumstances you're generating the list in which order.
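The dispatch pattern in this diff can be sketched in isolation. Only `_get_time_series_featurizer` and the loop body come from the thread. The other helper names, the string problem types, the string component names, and the exact orderings are assumptions based on the ordering discussion earlier in this review, not the merged code.

```python
def _get_datetime(X, y, problem_type, estimator_class, sampler_name=None):
    return ["DateTime Featurizer"]  # hypothetical helper


def _get_imputer(X, y, problem_type, estimator_class, sampler_name=None):
    return ["Imputer"]  # hypothetical helper


def _get_time_series_featurizer(X, y, problem_type, estimator_class, sampler_name=None):
    return ["Delayed Feature Transformer"]


def get_preprocessing_sketch(X, y, problem_type, estimator_class, sampler_name=None):
    """Pick an ordered list of helper functions, then flatten their results."""
    if problem_type.startswith("time series"):
        # Assumed order: impute, then lag, then featurize the date column.
        components_functions = [_get_imputer, _get_time_series_featurizer, _get_datetime]
    else:
        # Assumed order: featurize dates first so the imputer can fill their NaNs.
        components_functions = [_get_datetime, _get_imputer]

    components = []
    for function in components_functions:
        components.extend(function(X, y, problem_type, estimator_class, sampler_name))
    return components


ts_components = get_preprocessing_sketch(None, None, "time series regression", None)
regular_components = get_preprocessing_sketch(None, None, "regression", None)
```

Because every helper shares one signature and returns a list, changing the pipeline structure for one family of problems means reordering a list rather than editing a long chain of conditionals.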

freddyaboulton force-pushed the require-date-index branch 2 times, most recently from 6226531 to 731a3f4 Compare November 15, 2021 15:41

freddyaboulton mentioned this pull request Nov 15, 2021

Refactor ARIMA/Prophet to use only use date_index parameter #3046

Closed

freddyaboulton marked this pull request as ready for review November 15, 2021 17:10

auto-assign bot assigned freddyaboulton Nov 15, 2021

freddyaboulton requested review from angela97lin, bchen1116, chukarsten, christopherbunn, dsherry, eccabay, jeremyliweishih and ParthivNaresh and removed request for angela97lin and bchen1116 November 15, 2021 17:11

freddyaboulton commented Nov 15, 2021

freddyaboulton force-pushed the require-date-index branch from 8c345a8 to 716d6c5 Compare November 15, 2021 20:18

angela97lin reviewed Nov 15, 2021

freddyaboulton force-pushed the require-date-index branch from 13d81fa to 4384eb4 Compare November 16, 2021 16:19

eccabay reviewed Nov 16, 2021

chukarsten approved these changes Nov 17, 2021

freddyaboulton force-pushed the require-date-index branch from 7b2fd35 to ce1e408 Compare November 17, 2021 18:32

freddyaboulton added 5 commits November 17, 2021 13:59

Fix some tests and do implementation

6259afc

Fix tests

9c88207

Fix tests

0f79ebf

Add to release notes

23aeb45

Try to fix windows + delete commented code

7a4d3fe

freddyaboulton added 6 commits November 17, 2021 13:59

Delete uncommented out code

1a73e58

Fix lint

d838d71

Fix codecov + lint

fa81ac8

Fix pipeline structure for ts

8ca75d8

Delete unused code

f7fa864

Updating docstring

bcaaabc

freddyaboulton force-pushed the require-date-index branch from ce1e408 to bcaaabc Compare November 17, 2021 18:59

freddyaboulton merged commit 24779e0 into main Nov 17, 2021

freddyaboulton deleted the require-date-index branch November 17, 2021 19:21

chukarsten mentioned this pull request Nov 29, 2021

Release v.0.38.0 #3102

Merged
