Add a DelayedFeaturesTransformer #1396
Conversation
Codecov Report

@@            Coverage Diff            @@
##             main    #1396     +/-   ##
=========================================
+ Coverage   100.0%   100.0%    +0.1%
=========================================
  Files         214      216       +2
  Lines       14133    14228      +95
=========================================
+ Hits        14126    14221      +95
  Misses          7        7

Continue to review full report at Codecov.
y = pd.Series(y)

original_columns = X.columns
X = X.assign(**{f"{col}_delay_{t}": X[col].shift(t)
If the time-series is irregularly spaced, shifting may give you data that is too "far in the past", e.g. we have daily data, but for some reason, a whole month is missing. We've decided that irregularly-spaced time series are out-of-scope for this first release.
https://alteryx.quip.com/AM04ASOaQS4v/Time-Series-November-Design-Document
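For context, here is a minimal pandas sketch (hypothetical data, not from the PR) of the concern: .shift() works on row positions, so a hole in an irregularly spaced index silently pulls a value from much further in the past.

```python
import pandas as pd

# Hypothetical daily series with a month-long hole in the index.
idx = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03", "2020-02-03"])
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

# shift(1) is positional, not calendar-aware: the 2020-02-03 row receives
# the value observed on 2020-01-03, a month earlier than "yesterday".
print(s.shift(1))
```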
Name looks good
hyperparameter_ranges = {}
needs_fitting = False

def __init__(self, max_delay=2, random_state=0, **kwargs):
In the future, we may want to add a parameter for features that should not be lagged (they don't change with time). I think proceeding without that functionality is good enough for an MVP though.
Yep agreed. We have per-column max/min delay listed in the future items for timeseries support.
assert delayed_features.parameters == {"max_delay": 4}


def test_lagged_feature_extractor_maxdelay3_gap1(delayed_features_data):
These tests could be condensed into one test with parametrize, but I think that being explicit here makes it easier to see that the output matches what's in the design doc.
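For illustration, a hedged sketch of what the parametrized version could look like (the fixture name comes from the test above; the parameter values and assertion are placeholders, not the real expected frames):

```python
import pandas as pd
import pytest

# Assumes DelayedFeaturesTransformer is importable in the test module.
@pytest.mark.parametrize("max_delay", [1, 3, 5])
def test_delayed_features(max_delay, delayed_features_data):
    X, y = delayed_features_data
    output = DelayedFeaturesTransformer(max_delay=max_delay).fit_transform(X, y)
    # Placeholder check; the explicit per-case tests compare against the
    # exact frames from the design doc, which is what parametrizing would obscure.
    assert isinstance(output, pd.DataFrame)
```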
LGTM!
""" | ||
if not isinstance(X, pd.DataFrame): | ||
if y is None: | ||
X = pd.DataFrame(X, columns=["target"]) |
Why are we passing only one column for DataFrame X?
In time series it's possible to fit estimators based only on the target variable, so we need to detect when we are in that case.
If X is not a dataframe and y is None (no second argument was passed in), then I'm inferring that only the target variable was passed in, so I convert it to a dataframe with only one column.
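A hedged sketch of that inference pulled out as a standalone helper (the name _coerce_inputs is mine, not the PR's; the logic mirrors the snippet above):

```python
import pandas as pd

def _coerce_inputs(X, y=None):
    if not isinstance(X, pd.DataFrame):
        if y is None:
            # Only the target was passed in: wrap it in a one-column frame
            # so the delaying logic can treat it like any other feature.
            X = pd.DataFrame(X, columns=["target"])
        else:
            X = pd.DataFrame(X)
    if y is not None and not isinstance(y, pd.Series):
        y = pd.Series(y)
    return X, y
```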
                for col in X})
X.drop(columns=original_columns, inplace=True)

# Handle cases where the label was not passed
Handle cases where the label was passed?
I'll fix this and add some more comments to explain what's going on based on your previous comment! ^
Very cool!
I left a few comments. The impl and tests look solid, just following up on details.
from evalml.pipelines.components.transformers.transformer import Transformer


class DelayedFeaturesTransformer(Transformer):
What do you think of DelayedFeatureTransformer? I think it's nice to keep plural for special cases.
Also could you please add class and init docstrings?
Will change the name! I thought plural was appropriate because it will lag more than one feature if there are multiple input features.
Thanks. Yep got it, that makes sense. I'm fine either way but prefer the singular here because I think it's more in line with our other components. I also recognize it's a nit-pick / opinion haha
parameters = {"max_delay": max_delay}
parameters.update(kwargs)
super().__init__(parameters=parameters, random_state=random_state)
self.max_delay = max_delay
Style nit-pick: do this before super, perhaps at the beginning.
self.max_delay = max_delay

def fit(self, X, y=None):
    """Fits the LaggedFeatureExtractor."""
Pls update this name
Arguments:
    X (pd.DataFrame): Data to transform.
    y (pd.Series, optional): Targets.
"Target"
Hmm I didn't think napoleon docstring format had an "optional" value type
Changed to None!
X = X.assign(**{f"target_delay_{t}": y.shift(t)
                for t in range(self.max_delay + 1)})

return X
👍 awesome, so we're adding the delayed target features to the returned X. Yep.
Note the target delay range should be for t in range(1, self.max_delay + 1), to avoid delay==0!
The pipeline will shift the target variable by the gap amount to take care of target leakage. The only time this would cause a problem is when gap=0, which I haven't personally seen in practice, but it'd be a good idea to support. In that case, I'll start the target delay at 1.
So in short (see the sketch below):
- Add a gap parameter to the transformer init method.
- If gap = 0, the features will be delayed from [0, max_delay] and the target will be delayed from [1, max_delay] (square brackets mean the range is inclusive).
- Else, features and target will be delayed from [0, max_delay] (current behavior).
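A hedged sketch of that plan as a free function (the function name and signature are assumptions for illustration, not the merged implementation):

```python
import pandas as pd

def delay_features(X, y=None, max_delay=2, gap=1):
    # Delay every input feature from 0 up to max_delay.
    X = X.assign(**{f"{col}_delay_{t}": X[col].shift(t)
                    for col in X
                    for t in range(max_delay + 1)})
    if y is not None:
        # When gap == 0 the un-shifted target would leak the label,
        # so start the target delays at 1 in that case.
        start = 1 if gap == 0 else 0
        X = X.assign(**{f"target_delay_{t}": y.shift(t)
                        for t in range(start, max_delay + 1)})
    return X
```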
Ah very good point, you're right of course, I agree with your plan! Thanks.
I agree gap=0 for timeseries is not gonna see heavy use. You'd still get the delayed features, which is nice, but that's essentially saying "given today's info and historical features, predict today's target". Not invalid per se, but not the main point of timeseries.
hyperparameter_ranges = {}
needs_fitting = False

def __init__(self, max_delay=2, random_state=0, **kwargs):
@freddyaboulton can we add parameters to control delaying the target vs delaying the features?
delay_target=False, delay_features=True
I think it'll be helpful to be able to control the behavior like this.
@dsherry Should these be tuned by automl?
@freddyaboulton good question. Thinking about that, I think we should:
- Default these both to True in this PR
- Once we start performance testing for timeseries, we can determine whether or not we should make these tuneable. But my default assumption would be to leave them non-tuneable.
We need the ability to generate both delayed features and delayed target features. But for many datasets, target[n-1] will be strongly correlated with target[n], and users may not want their models to rely so heavily on the delayed target and will want to disable the delayed target features. Hence why I suggested we add these two parameters.
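A hedged sketch of an __init__ with the two flags defaulted to True per the comment above, plus the gap parameter from the earlier thread (the exact signature and the name attribute are assumptions, not the merged code):

```python
from evalml.pipelines.components.transformers.transformer import Transformer

class DelayedFeatureTransformer(Transformer):
    name = "Delayed Feature Transformer"  # display name is a guess
    hyperparameter_ranges = {}
    needs_fitting = False

    def __init__(self, max_delay=2, delay_features=True, delay_target=True,
                 gap=1, random_state=0, **kwargs):
        # Set attributes before calling super, per the style nit above.
        self.max_delay = max_delay
        self.delay_features = delay_features
        self.delay_target = delay_target
        self.gap = gap
        parameters = {"max_delay": max_delay,
                      "delay_features": delay_features,
                      "delay_target": delay_target,
                      "gap": gap}
        parameters.update(kwargs)
        super().__init__(parameters=parameters, random_state=random_state)
```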
Great point about the previous time step being correlated with the current time step!
I agree with your suggestion to leave the question about tuning until we have performance tests. I was thinking about the case when there are many delayed features. In that situation, not all of them will be useful for modeling the target and so we may want to intelligently not delay those features. In that case, there are other, and probably better, ways of parametrizing that apart from the coarse "should we delay or not delay" decision.
@freddyaboulton agreed. These two binary parameters are not going to be enough for us for tuning in the long-term.
elif isinstance(component, DelayedFeaturesTransformer):
    # We just want to check that DelayedFeaturesTransformer outputs a DataFrame
    # The dataframe shape and index are checked in test_delayed_features_transformer.py
    continue
👍
X_np = X.values
y_np = y.values

# Example 3 from the design document
I'd delete these comments, the examples stand on their own here!
"target_delay_3": y.shift(3), | ||
"target_delay_4": y.shift(4), | ||
"target_delay_5": y.shift(5)}) | ||
pd.testing.assert_frame_equal(DelayedFeaturesTransformer(max_delay=5, gap=1).fit_transform(y), answer_only_y) |
In our components we should always be able to assume that X contains features and y contains the target.
So I'd expect this call to be tf.fit_transform(None, y=y) or perhaps tf.fit_transform(pd.DataFrame(), y=y).
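A hedged usage sketch of that convention (the import path and toy data are assumptions for illustration; whether None or an empty frame is accepted depends on the final implementation):

```python
import pandas as pd
from evalml.pipelines.components import DelayedFeaturesTransformer  # assumed path

y = pd.Series(range(10))
tf = DelayedFeaturesTransformer(max_delay=3, gap=1)

# Pass an explicit placeholder for X when only the target is being modeled.
delayed = tf.fit_transform(pd.DataFrame(), y=y)
```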
Good point! I thought it'd be easier for users to avoid having to type None for an ARIMA-like problem where only the y is being modeled, but it's best to be explicit and follow our convention.
Great!
self.delay_target = delay_target

# If 0, start at 1
self.start_delay_for_target = gap == 0
Do you need to convert to int here?
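For what it's worth, a quick standalone illustration (not the PR's code) of why the conversion is optional but clearer:

```python
gap = 0

# `gap == 0` is a bool; since bool subclasses int (True == 1), range()
# accepts it directly, but an explicit int() makes the intent obvious.
start_delay_for_target = int(gap == 0)
assert start_delay_for_target == 1
assert list(range(gap == 0, 4)) == [1, 2, 3]
```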
Pull Request Description
Fix #1379
After creating the pull request: in order to pass the release_notes_updated check, you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:123.