
Add a DelayedFeaturesTransformer #1396

Merged — 13 commits merged into main from 1379-delayed-features-transformer on Nov 10, 2020
Conversation

freddyaboulton (Contributor)

Pull Request Description

Fix #1379



@freddyaboulton changed the title from "1379 delayed features transformer" to "Add a DelayedFeaturesTransformer" on Nov 2, 2020
codecov bot commented on Nov 2, 2020

Codecov Report

Merging #1396 (c6ac9e1) into main (af6807a) will increase coverage by 0.1%.
The diff coverage is 100.0%.


@@            Coverage Diff            @@
##             main    #1396     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         214      216      +2     
  Lines       14133    14228     +95     
=========================================
+ Hits        14126    14221     +95     
  Misses          7        7             
Impacted Files Coverage Δ
evalml/pipelines/__init__.py 100.0% <ø> (ø)
evalml/pipelines/components/__init__.py 100.0% <ø> (ø)
...alml/pipelines/components/transformers/__init__.py 100.0% <100.0%> (ø)
.../components/transformers/preprocessing/__init__.py 100.0% <100.0%> (ø)
...mponents/transformers/preprocessing/time_series.py 100.0% <100.0%> (ø)
evalml/tests/component_tests/test_components.py 100.0% <100.0%> (ø)
...mponent_tests/test_delayed_features_transformer.py 100.0% <100.0%> (ø)
evalml/tests/component_tests/test_utils.py 100.0% <100.0%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

y = pd.Series(y)

original_columns = X.columns
X = X.assign(**{f"{col}_delay_{t}": X[col].shift(t)
freddyaboulton (author):

If the time-series is irregularly spaced, shifting may give you data that is too "far in the past", e.g. we have daily data, but for some reason, a whole month is missing. We've decided that irregularly-spaced time series are out-of-scope for this first release.

https://alteryx.quip.com/AM04ASOaQS4v/Time-Series-November-Design-Document
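
As a minimal illustration of the shift-based delays described above (toy data and column names are hypothetical, not the component's code), note that shift() counts rows, not timestamps, which is why irregular spacing is a problem:

```python
import pandas as pd

# Daily data with a month-long hole: shift(1) pulls the row-wise previous value,
# which across the gap is ~30 days in the past rather than 1 day.
idx = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-03-01", "2020-03-02"])
X = pd.DataFrame({"temp": [1.0, 2.0, 3.0, 4.0]}, index=idx)

max_delay = 2
delayed = X.assign(**{f"temp_delay_{t}": X["temp"].shift(t)
                      for t in range(1, max_delay + 1)})
print(delayed)
```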

Contributor:

Name looks good

hyperparameter_ranges = {}
needs_fitting = False

def __init__(self, max_delay=2, random_state=0, **kwargs):
freddyaboulton (author):

In the future, we may want to add a parameter for features that should not be lagged (they don't change with time). I think proceeding without that functionality is good enough for an MVP though.

Contributor:

Yep agreed. We have per-column max/min delay listed in the future items for timeseries support.

assert delayed_features.parameters == {"max_delay": 4}


def test_lagged_feature_extractor_maxdelay3_gap1(delayed_features_data):
freddyaboulton (author):

These tests could be condensed into one test with parametrize but I think that being explicit here makes it easier to see that the output matches what's in the design doc.
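
For reference, a condensed version might look roughly like the following (a hypothetical sketch assuming the delayed_features_data fixture returns X and y, and that DelayedFeaturesTransformer is imported in the test module; not the tests in this PR):

```python
import pytest

@pytest.mark.parametrize("max_delay,gap", [(1, 1), (3, 1), (5, 1)])
def test_delayed_features(max_delay, gap, delayed_features_data):
    X, y = delayed_features_data
    output = DelayedFeaturesTransformer(max_delay=max_delay, gap=gap).fit_transform(X, y)
    # The expected frame would be rebuilt per-case from X/y shifts here.
    assert f"target_delay_{max_delay}" in output.columns
```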

CLAassistant commented Nov 4, 2020

CLA assistant check
All committers have signed the CLA.

bchen1116 (Contributor) left a review comment:

LGTM!

"""
if not isinstance(X, pd.DataFrame):
if y is None:
X = pd.DataFrame(X, columns=["target"])
Contributor:

why are we passing only 1 column for dataframe X?

freddyaboulton (author):

In time series it's possible to fit estimators based only on the target variable, so we need to detect when we are in that case.

If X is not a dataframe and y is None (no second argument was passed in), then I'm inferring that only the target variable was passed in and so I convert it to a dataframe with only one column.
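
A minimal sketch of that inference (hypothetical helper name; the surrounding fit/transform signature is assumed):

```python
import pandas as pd

def _convert_inputs(X, y=None):
    # If X isn't a DataFrame and no y was passed, assume the single array we
    # received is actually the target and wrap it in a one-column frame.
    if not isinstance(X, pd.DataFrame):
        if y is None:
            X = pd.DataFrame(X, columns=["target"])
        else:
            X = pd.DataFrame(X)
    if y is not None and not isinstance(y, pd.Series):
        y = pd.Series(y)
    return X, y
```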

for col in X})
X.drop(columns=original_columns, inplace=True)

# Handle cases where the label was not passed
Contributor:

Handle cases where the label was passed?

freddyaboulton (author):

I'll fix this and add some more comments to explain what's going on based on your previous comment! ^

@freddyaboulton force-pushed the 1379-delayed-features-transformer branch from 320c6ae to 8095437 on November 5, 2020 16:18
@freddyaboulton self-assigned this on Nov 5, 2020
@freddyaboulton force-pushed the 1379-delayed-features-transformer branch from 8095437 to 1f7291b on November 9, 2020 14:49
dsherry (Contributor) left a review comment:

Very cool!

I left a few comments. The impl and tests look solid, just following up on details.

from evalml.pipelines.components.transformers.transformer import Transformer


class DelayedFeaturesTransformer(Transformer):
Contributor:

What do you think of DelayedFeatureTransformer? I think it's nice to keep plural for special cases.

Contributor:

Also could you please add class and init docstrings?

freddyaboulton (author):

Will change the name! I thought plural was appropriate because it will lag more than one feature if there are multiple input features.

Contributor:

Thanks. Yep, got it, that makes sense. I'm fine either way but prefer the singular here because I think it's more in line with our other components. I also recognize it's a nit-pick / opinion haha


parameters = {"max_delay": max_delay}
parameters.update(kwargs)
super().__init__(parameters=parameters, random_state=random_state)
self.max_delay = max_delay
Contributor:

Style nit-pick: do this before super, perhaps at the beginning

self.max_delay = max_delay

def fit(self, X, y=None):
"""Fits the LaggedFeatureExtractor."""
Contributor:

Pls update this name


Arguments:
X (pd.DataFrame): Data to transform.
y (pd.Series, optional): Targets.
Contributor:

"Target"

Contributor:

Hmm I didn't think napoleon docstring format had an "optional" value type

freddyaboulton (author):

Changed to None!
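
For illustration, the agreed-on docstring style would read roughly like this (method name and wording are a sketch of the change being discussed, not the merged text):

```python
def transform(self, X, y=None):
    """Computes the delayed features.

    Arguments:
        X (pd.DataFrame): Data to transform.
        y (pd.Series, or None): Target.
    """
```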

X = X.assign(**{f"target_delay_{t}": y.shift(t)
for t in range(self.max_delay + 1)})

return X
Contributor:

👍 awesome, so we're adding the delayed target features to the returned X. Yep.

Note the target delay range should be for t in range(1, self.max_delay + 1), to avoid delay==0!

freddyaboulton (author):

The pipeline will shift the target variable by the gap amount to take care of target leakage. The only time this would cause a problem is when gap=0 which I haven't personally seen in practice but it'd be a good idea to support. In that case, I'll start the target delay at 1.

So in short:

  • Add a gap parameter to the transformer init method.
  • If gap = 0, the features will be delayed from [0, max_delay] and the target will be delayed from [1, max_delay] (square brackets mean the range is inclusive)
  • Else, features and targets will be delayed from [0, max_delay] (current behavior)

Contributor:

Ah very good point, you're right of course, I agree with your plan! Thanks.

I agree gap=0 for timeseries is not gonna see heavy use. You'd still get the delayed features, which is nice, but that's essentially saying "given today's info, and historical features, predict today's target". Not invalid per se, but not the main point of timeseries.
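
A rough sketch of the agreed behavior (hypothetical standalone helper; the merged component wires this through its parameters instead):

```python
import pandas as pd

def make_delays(X, y, max_delay, gap):
    # Features are delayed over [0, max_delay].
    delayed = X.assign(**{f"{col}_delay_{t}": X[col].shift(t)
                          for col in X for t in range(max_delay + 1)})
    # When gap == 0 the pipeline does not shift the target, so target delays
    # start at 1 to avoid leaking the current target into the features.
    start = 1 if gap == 0 else 0
    return delayed.assign(**{f"target_delay_{t}": y.shift(t)
                             for t in range(start, max_delay + 1)})
```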

hyperparameter_ranges = {}
needs_fitting = False

def __init__(self, max_delay=2, random_state=0, **kwargs):
dsherry (Contributor):

@freddyaboulton can we add parameters to control delaying the target vs delaying the features?

delay_target=False,
delay_features=True

I think it'll be helpful to be able to control the behavior like this.

freddyaboulton (author):

@dsherry Should these be tuned by automl?

dsherry (Contributor):

@freddyaboulton good question. Thinking about that, I think we should:

  • Default these both to True in this PR
  • Once we start performance testing for timeseries, we can determine whether or not we should make these tuneable. But my default assumption would be to leave them non-tuneable

We need the ability to generate both delayed features and delayed target features. But for many datasets, target[n-1] will be strongly correlated with target[n], and users may not want their models to rely so heavily on the delayed target, and will want to disable the delayed target features. Hence why I suggested we add these two parameters.

freddyaboulton (author) commented on Nov 10, 2020:

Great point about the previous time step being correlated with the current time step!

I agree with your suggestion to leave the question about tuning until we have performance tests. I was thinking about the case when there are many delayed features. In that situation, not all of them will be useful for modeling the target and so we may want to intelligently not delay those features. In that case, there are other, and probably better, ways of parametrizing that apart from the coarse "should we delay or not delay" decision.

dsherry (Contributor):

@freddyaboulton agreed. These two binary parameters are not going to be enough for us for tuning in the long-term.
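
An illustrative __init__ reflecting the agreement above (defaults per the discussion; the rest of the body is an assumption rather than the merged code, and Transformer is the base class imported in the snippet shown earlier):

```python
class DelayedFeaturesTransformer(Transformer):
    hyperparameter_ranges = {}   # delay_features / delay_target left non-tuneable for now
    needs_fitting = False

    def __init__(self, max_delay=2, delay_features=True, delay_target=True,
                 random_state=0, **kwargs):
        self.max_delay = max_delay
        self.delay_features = delay_features
        self.delay_target = delay_target
        parameters = {"max_delay": max_delay,
                      "delay_features": delay_features,
                      "delay_target": delay_target}
        parameters.update(kwargs)
        super().__init__(parameters=parameters, random_state=random_state)
```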

elif isinstance(component, DelayedFeaturesTransformer):
# We just want to check that DelayedFeaturesTransformer outputs a DataFrame
# The dataframe shape and index are checked in test_delayed_features_transformer.py
continue
Contributor:

👍

X_np = X.values
y_np = y.values

# Example 3 from the design document
Contributor:

I'd delete these comments, the examples stand on their own here!

"target_delay_3": y.shift(3),
"target_delay_4": y.shift(4),
"target_delay_5": y.shift(5)})
pd.testing.assert_frame_equal(DelayedFeaturesTransformer(max_delay=5, gap=1).fit_transform(y), answer_only_y)
Contributor:

In our components we should always be able to assume that X contains features and y contains the target.

So I'd expect this call to be

tf.fit_transform(None, y=y)

or perhaps

tf.fit_transform(pd.DataFrame(), y=y)

freddyaboulton (author):

Good point! I thought it'd be easier for users to avoid having to type None for an ARIMA-like problem where only the y is being modeled, but it's best to be explicit and follow our convention.
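
To make the convention concrete, here is a pandas-only sketch of the target-only case (the call form suggested above, e.g. fit_transform(pd.DataFrame(), y=y), passes the target explicitly as y; this helper is purely illustrative and not the component's code):

```python
import pandas as pd

def delayed_target_features(y, max_delay):
    # Target-only case: X carries no feature columns, so the output is just
    # delayed copies of y aligned to y's index.
    X = pd.DataFrame(index=y.index)
    return X.assign(**{f"target_delay_{t}": y.shift(t)
                       for t in range(max_delay + 1)})

y = pd.Series([10.0, 11.0, 12.0, 13.0])
print(delayed_target_features(y, max_delay=2))
```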

@freddyaboulton force-pushed the 1379-delayed-features-transformer branch from 486a9ff to 95dfa12 on November 10, 2020 16:28
dsherry (Contributor) left a review comment:

Great!

self.delay_target = delay_target

# If 0, start at 1
self.start_delay_for_target = gap == 0
Contributor:

Do you need to convert to int here?
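
A quick note on that: bool is a subclass of int in Python, so `gap == 0` already works as a range start; an explicit cast just makes the intent clearer. A small self-contained check (hypothetical helper name):

```python
def target_delay_range(gap, max_delay):
    start_delay_for_target = int(gap == 0)   # 1 when gap == 0, else 0
    return range(start_delay_for_target, max_delay + 1)

assert list(target_delay_range(gap=0, max_delay=2)) == [1, 2]
assert list(target_delay_range(gap=1, max_delay=2)) == [0, 1, 2]
```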

@freddyaboulton merged commit 78137f4 into main on Nov 10, 2020
@dsherry mentioned this pull request on Nov 24, 2020
@freddyaboulton deleted the 1379-delayed-features-transformer branch on June 17, 2021 21:26