Add a DelayedFeaturesTransformer #1396
Conversation
Codecov Report

@@            Coverage Diff            @@
##             main    #1396     +/-   ##
=========================================
+ Coverage   100.0%   100.0%    +0.1%
=========================================
  Files         214      216       +2
  Lines       14133    14228      +95
=========================================
+ Hits        14126    14221      +95
  Misses          7        7

Continue to review full report at Codecov.
y = pd.Series(y)

original_columns = X.columns
X = X.assign(**{f"{col}_delay_{t}": X[col].shift(t)
If the time-series is irregularly spaced, shifting may give you data that is too "far in the past", e.g. we have daily data, but for some reason, a whole month is missing. We've decided that irregularly-spaced time series are out-of-scope for this first release.
https://alteryx.quip.com/AM04ASOaQS4v/Time-Series-November-Design-Document
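For context, here is a minimal pandas sketch (hypothetical data, not from the PR) of the concern: .shift() works on row positions, so a hole in an irregularly spaced index silently pulls a value from much further in the past.

```python
import pandas as pd

# Hypothetical daily series with a month-long hole in the index.
idx = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03", "2020-02-03"])
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

# shift(1) is positional, not calendar-aware: the 2020-02-03 row receives
# the value observed on 2020-01-03, a month earlier than "yesterday".
print(s.shift(1))
```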
Name looks good
hyperparameter_ranges = {}
needs_fitting = False

def __init__(self, max_delay=2, random_state=0, **kwargs):
In the future, we may want to add a parameter for features that should not be lagged (they don't change with time). I think proceeding without that functionality is good enough for an MVP though.
Yep agreed. We have per-column max/min delay listed in the future items for timeseries support.
assert delayed_features.parameters == {"max_delay": 4}


def test_lagged_feature_extractor_maxdelay3_gap1(delayed_features_data):
These tests could be condensed into one test with parametrize, but I think that being explicit here makes it easier to see that the output matches what's in the design doc.
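For illustration, a hedged sketch of what the parametrized version could look like (the fixture name comes from the test above; the parameter values and assertion are placeholders, not the real expected frames):

```python
import pandas as pd
import pytest

# Assumes DelayedFeaturesTransformer is importable in the test module.
@pytest.mark.parametrize("max_delay", [1, 3, 5])
def test_delayed_features(max_delay, delayed_features_data):
    X, y = delayed_features_data
    output = DelayedFeaturesTransformer(max_delay=max_delay).fit_transform(X, y)
    # Placeholder check; the explicit per-case tests compare against the
    # exact frames from the design doc, which is what parametrizing would obscure.
    assert isinstance(output, pd.DataFrame)
```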
LGTM!
""" | ||
if not isinstance(X, pd.DataFrame): | ||
if y is None: | ||
X = pd.DataFrame(X, columns=["target"]) |
Why are we passing only one column for DataFrame X?
In time series it's possible to fit estimators based only on the target variable, so we need to detect when we are in that case.
If X is not a dataframe and y is None (no second argument was passed in), then I'm inferring that only the target variable was passed in, so I convert it to a dataframe with only one column.
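A hedged sketch of that inference pulled out as a standalone helper (the name _coerce_inputs is mine, not the PR's; the logic mirrors the snippet above):

```python
import pandas as pd

def _coerce_inputs(X, y=None):
    if not isinstance(X, pd.DataFrame):
        if y is None:
            # Only the target was passed in: wrap it in a one-column frame
            # so the delaying logic can treat it like any other feature.
            X = pd.DataFrame(X, columns=["target"])
        else:
            X = pd.DataFrame(X)
    if y is not None and not isinstance(y, pd.Series):
        y = pd.Series(y)
    return X, y
```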
                for col in X})
X.drop(columns=original_columns, inplace=True)

# Handle cases where the label was not passed
Handle cases where the label was passed?
I'll fix this and add some more comments to explain what's going on based on your previous comment! ^
Very cool!
I left a few comments. The impl and tests look solid, just following up on details.
from evalml.pipelines.components.transformers.transformer import Transformer


class DelayedFeaturesTransformer(Transformer):
What do you think of DelayedFeatureTransformer? I think it's nice to keep plural for special cases.
Also could you please add class and init docstrings?
Will change the name! I thought plural was appropriate because it will lag more than one feature if there are multiple input features.
Thanks. Yep got it, that makes sense. I'm fine either way but prefer the singular here because I think it's more in line with our other components. I also recognize it's a nit-pick / opinion haha
parameters = {"max_delay": max_delay}
parameters.update(kwargs)
super().__init__(parameters=parameters, random_state=random_state)
self.max_delay = max_delay
Style nit-pick: do this before super, perhaps at the beginning.
self.max_delay = max_delay

def fit(self, X, y=None):
    """Fits the LaggedFeatureExtractor."""
Pls update this name
Arguments:
    X (pd.DataFrame): Data to transform.
    y (pd.Series, optional): Targets.
"Target"
Hmm I didn't think napoleon docstring format had an "optional" value type
Changed to None!
X = X.assign(**{f"target_delay_{t}": y.shift(t)
                for t in range(self.max_delay + 1)})

return X
👍 awesome, so we're adding the delayed target features to the returned X. Yep.
Note the target delay range should be for t in range(1, self.max_delay + 1), to avoid delay==0!
The pipeline will shift the target variable by the gap amount to take care of target leakage. The only time this would cause a problem is when gap=0, which I haven't personally seen in practice, but it'd be a good idea to support. In that case, I'll start the target delay at 1.
So in short (see the sketch below):
- Add a gap parameter to the transformer init method.
- If gap = 0, the features will be delayed from [0, max_delay] and the target will be delayed from [1, max_delay] (square brackets mean the range is inclusive).
- Else, features and target will be delayed from [0, max_delay] (current behavior).
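A hedged sketch of that plan as a free function (the function name and signature are assumptions for illustration, not the merged implementation):

```python
import pandas as pd

def delay_features(X, y=None, max_delay=2, gap=1):
    # Delay every input feature from 0 up to max_delay.
    X = X.assign(**{f"{col}_delay_{t}": X[col].shift(t)
                    for col in X
                    for t in range(max_delay + 1)})
    if y is not None:
        # When gap == 0 the un-shifted target would leak the label,
        # so start the target delays at 1 in that case.
        start = 1 if gap == 0 else 0
        X = X.assign(**{f"target_delay_{t}": y.shift(t)
                        for t in range(start, max_delay + 1)})
    return X
```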
Ah very good point, you're right of course, I agree with your plan! Thanks.
I agree gap=0 for timeseries is not gonna see heavy use. You'd still get the delayed features, which is nice, but that's essentially saying "given today's info and historical features, predict today's target". Not invalid per se, but not the main point of timeseries.
hyperparameter_ranges = {}
needs_fitting = False

def __init__(self, max_delay=2, random_state=0, **kwargs):
@freddyaboulton can we add parameters to control delaying the target vs delaying the features?
delay_target=False, delay_features=True
I think it'll be helpful to be able to control the behavior like this.
@dsherry Should these be tuned by automl?
@freddyaboulton good question. Thinking about that, I think we should:
- Default these both to True in this PR
- Once we start performance testing for timeseries, we can determine whether or not we should make these tuneable. But my default assumption would be to leave them non-tuneable.
We need the ability to generate both delayed features and delayed target features. But for many datasets, target[n-1] will be strongly correlated with target[n], and users may not want their models to rely so heavily on the delayed target and will want to disable the delayed target features. Hence why I suggested we add these two parameters.
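A hedged sketch of an __init__ with the two flags defaulted to True per the comment above, plus the gap parameter from the earlier thread (the exact signature and the name attribute are assumptions, not the merged code):

```python
from evalml.pipelines.components.transformers.transformer import Transformer

class DelayedFeatureTransformer(Transformer):
    name = "Delayed Feature Transformer"  # display name is a guess
    hyperparameter_ranges = {}
    needs_fitting = False

    def __init__(self, max_delay=2, delay_features=True, delay_target=True,
                 gap=1, random_state=0, **kwargs):
        # Set attributes before calling super, per the style nit above.
        self.max_delay = max_delay
        self.delay_features = delay_features
        self.delay_target = delay_target
        self.gap = gap
        parameters = {"max_delay": max_delay,
                      "delay_features": delay_features,
                      "delay_target": delay_target,
                      "gap": gap}
        parameters.update(kwargs)
        super().__init__(parameters=parameters, random_state=random_state)
```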
Great point about the previous time step being correlated with the current time step!
I agree with your suggestion to leave the question about tuning until we have performance tests. I was thinking about the case when there are many delayed features. In that situation, not all of them will be useful for modeling the target and so we may want to intelligently not delay those features. In that case, there are other, and probably better, ways of parametrizing that apart from the coarse "should we delay or not delay" decision.
@freddyaboulton agreed. These two binary parameters are not going to be enough for us for tuning in the long-term.
elif isinstance(component, DelayedFeaturesTransformer):
    # We just want to check that DelayedFeaturesTransformer outputs a DataFrame
    # The dataframe shape and index are checked in test_delayed_features_transformer.py
    continue
👍
X_np = X.values
y_np = y.values

# Example 3 from the design document
I'd delete these comments, the examples stand on their own here!
"target_delay_3": y.shift(3), | ||
"target_delay_4": y.shift(4), | ||
"target_delay_5": y.shift(5)}) | ||
pd.testing.assert_frame_equal(DelayedFeaturesTransformer(max_delay=5, gap=1).fit_transform(y), answer_only_y) |
In our components we should always be able to assume that X contains features and y contains the target.
So I'd expect this call to be tf.fit_transform(None, y=y) or perhaps tf.fit_transform(pd.DataFrame(), y=y).
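A hedged usage sketch of that convention (the import path and toy data are assumptions for illustration; whether None or an empty frame is accepted depends on the final implementation):

```python
import pandas as pd
from evalml.pipelines.components import DelayedFeaturesTransformer  # assumed path

y = pd.Series(range(10))
tf = DelayedFeaturesTransformer(max_delay=3, gap=1)

# Pass an explicit placeholder for X when only the target is being modeled.
delayed = tf.fit_transform(pd.DataFrame(), y=y)
```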
Good point! I thought it'd be easier for users to avoid having to type None for an ARIMA-like problem where only the y is being modeled, but it's best to be explicit and follow our convention.
Great!
self.delay_target = delay_target

# If 0, start at 1
self.start_delay_for_target = gap == 0
Do you need to convert to int here?
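For what it's worth, a quick standalone illustration (not the PR's code) of why the conversion is optional but clearer:

```python
gap = 0

# `gap == 0` is a bool; since bool subclasses int (True == 1), range()
# accepts it directly, but an explicit int() makes the intent obvious.
start_delay_for_target = int(gap == 0)
assert start_delay_for_target == 1
assert list(range(gap == 0, 4)) == [1, 2, 3]
```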
Pull Request Description
Fix #1379
After creating the pull request: in order to pass the release_notes_updated check, you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:123.