
Timeseries regression pipeline #1418

Merged 21 commits into main from 1380-timeseries-regression-pipeline on Nov 12, 2020

Conversation

freddyaboulton
Contributor

Pull Request Description

Fix #1380



@freddyaboulton freddyaboulton self-assigned this Nov 9, 2020
@freddyaboulton freddyaboulton force-pushed the 1380-timeseries-regression-pipeline branch 2 times, most recently from e7a47d4 to c03e186 on November 9, 2020 20:10

codecov bot commented Nov 9, 2020

Codecov Report

Merging #1418 (35af009) into main (87cc77f) will increase coverage by 0.1%.
The diff coverage is 100.0%.


@@            Coverage Diff            @@
##             main    #1418     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         216      218      +2     
  Lines       14226    14438    +212     
=========================================
+ Hits        14219    14431    +212     
  Misses          7        7             
Impacted Files Coverage Δ
evalml/pipelines/__init__.py 100.0% <100.0%> (ø)
evalml/pipelines/classification_pipeline.py 100.0% <100.0%> (ø)
.../components/ensemble/stacked_ensemble_regressor.py 100.0% <100.0%> (ø)
...onents/estimators/regressors/baseline_regressor.py 100.0% <100.0%> (ø)
...onents/estimators/regressors/catboost_regressor.py 100.0% <100.0%> (ø)
...s/estimators/regressors/decision_tree_regressor.py 100.0% <100.0%> (ø)
...ents/estimators/regressors/elasticnet_regressor.py 100.0% <100.0%> (ø)
...s/components/estimators/regressors/et_regressor.py 100.0% <100.0%> (ø)
...mponents/estimators/regressors/linear_regressor.py 100.0% <100.0%> (ø)
...s/components/estimators/regressors/rf_regressor.py 100.0% <100.0%> (ø)
... and 18 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@freddyaboulton freddyaboulton changed the title 1380 timeseries regression pipeline Timeseries regression pipeline Nov 9, 2020
@@ -11,7 +11,7 @@ class StackedEnsembleRegressor(StackedEnsembleBase):
     """Stacked Ensemble Regressor."""
     name = "Stacked Ensemble Regressor"
     model_family = ModelFamily.ENSEMBLE
-    supported_problem_types = [ProblemTypes.REGRESSION]
+    supported_problem_types = [ProblemTypes.REGRESSION, ProblemTypes.TIME_SERIES_REGRESSION]
Contributor Author

Need to make this change because of _validate_estimator_problem_type

         return X_t

-    def _fit(self, X, y):
+    def _compute_features_during_fit(self, X, y):
Contributor Author

Need to make this small refactor because we need to make sure the NaNs are dropped before the estimator is fit. I think it makes the most sense to do that in the time series pipeline as opposed to the base class.
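For reference, here is a minimal sketch of the fit flow this refactor enables, based on the snippets quoted later in this review (helper names follow the final drop_rows_with_nans naming; treat it as an illustration, not the exact implementation):

    def fit(self, X, y):
        # Compute features for everything but the estimator; delayed features
        # introduce NaNs at the start of X_t.
        X_t = self._compute_features_during_fit(X, y)
        # Shift the target up by `gap` so row t is paired with the target at t + gap.
        y_shifted = y.shift(-self.gap)
        # Drop rows that are NaN in either the features or the shifted target
        # before the estimator is fit.
        X_t, y_shifted = drop_rows_with_nans(X_t, y_shifted)
        self.estimator.fit(X_t, y_shifted)
        return self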

@@ -81,6 +82,8 @@ def _get_pipeline_base_class(problem_type):
         return MulticlassClassificationPipeline
     elif problem_type == ProblemTypes.REGRESSION:
         return RegressionPipeline
+    elif problem_type == ProblemTypes.TIME_SERIES_REGRESSION:
Contributor Author

Need to make this change so that the unit tests for make_pipeline don't error out for TIME_SERIES_REGRESSION problem types.

I'll modify the logic of make_pipeline to handle time series when we integrate time series regression into automl search in #1382

@@ -60,6 +60,9 @@ def test_scikit_learn_wrapper(X_y_binary, X_y_multi, X_y_regression):
         num_classes = 3
     elif problem_type == ProblemTypes.REGRESSION:
         X, y = X_y_regression
+    elif problem_type == ProblemTypes.TIME_SERIES_REGRESSION:
+        # Skipping because make_pipeline_from_components does not yet work for time series.
Contributor Author

This is because of the pipeline-level parameters gap and max_delay. I imagine we'll use the same pattern to specify them that we will introduce in #1382.

    return X, y


class MockDelayedFeatures(Transformer):
Contributor Author

@freddyaboulton freddyaboulton Nov 9, 2020

A lot of the complexity in time series comes from the NaNs that are introduced into the features. We need a component that does this to make the tests more robust. I'll delete this in favor of the bona-fide DelayedFeaturesTransformer when #1379 is merged.

Deleted
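For illustration, a mock along these lines might look like the following sketch (not the deleted code itself; the import path and the base-class constructor call are assumptions):

    import pandas as pd

    from evalml.pipelines.components import Transformer


    class MockDelayedFeatures(Transformer):
        """Shift every column down two rows, leaving NaNs at the top, to mimic
        the NaNs a real delayed-features component introduces."""
        name = "Mock Delayed Features"
        hyperparameter_ranges = {}

        def __init__(self, random_state=0):
            super().__init__(parameters={}, component_obj=None, random_state=random_state)

        def fit(self, X, y=None):
            return self

        def transform(self, X, y=None):
            # shift(2) pushes values down two rows; the vacated rows become NaN
            return pd.DataFrame(X).shift(2)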

Contributor

@dsherry dsherry left a comment

Cool! I read the impl and left some thoughts and questions.

@@ -190,17 +190,24 @@ def compute_estimator_features(self, X):
         """
         X_t = X
         for component in self.component_graph[:-1]:
-            X_t = component.transform(X_t)
+            X_t = component.transform(X_t, y)
Contributor

This is 100% a style nit-pick, but do you think we should say y=y given that y is a named arg to transform, just for clarity/consistency?

@@ -0,0 +1,82 @@
+import pandas as pd
Contributor

@freddyaboulton our current pattern for file naming means we should call this file time_series_regression_pipeline.py

We can certainly change that pattern, just calling that out.

Contributor Author

I was thinking of adding all of the time series pipelines in one file but let's stick with the regression vs classification separation we have with our files!

@@ -44,3 +44,5 @@
     BaselineRegressionPipeline,
     MeanBaselineRegressionPipeline
 )
+
+from .time_series import TimeSeriesRegressionPipeline
Contributor

👍

    def __init__(self, parameters, gap, max_delay, random_state=0):
        super().__init__(parameters, random_state)
        self.gap = gap
        self.max_delay = max_delay
Contributor

Style nit-pick, do super last

class TimeSeriesRegressionPipeline(RegressionPipeline):
    problem_type = ProblemTypes.TIME_SERIES_REGRESSION

    def __init__(self, parameters, gap, max_delay, random_state=0):
Contributor

Please add a docstring

Contributor

Also didn't we say gap/max_delay/min_delay would go into parameters under the key 'pipeline' as "pipeline parameters"?

Contributor Author

I was thinking of doing that once we got to the AutoML integration but it's way better to do this now. I just pushed this change up. The one thing I'd like to hear your thoughts on is whether we should raise an exception when the pipeline key is missing. I vote yes because it doesn't make sense to specify default arguments for these pipeline-level arguments.

Contributor

Thanks!

Yeah, the timeseries pipelines need those timeseries parameters in order to function. So yes, if gap and max_delay aren't included in the parameters dict under the pipeline key, I think we should raise an exception.

Should we prefix these params with timeseries? I.e. timeseries_gap, timeseries_max_delay. An argument for this would be that we may add other pipeline-level params in the future.

Contributor Author

@freddyaboulton freddyaboulton Nov 11, 2020

Can you explain what you mean? This is how a user would instantiate a TS regression pipeline right now:

    pl = Pipeline({"Delayed Feature Transformer": {"gap": gap, "max_delay": max_delay,
                                                   "delay_features": True,
                                                   "delay_target": True},
                   "pipeline": {"gap": gap, "max_delay": max_delay}})

I think renaming the "pipeline" dict keys to "timeseries_gap" and "timeseries_max_delay" is a bit too verbose. We can add more pipeline-level params to the "pipeline" dict as we need them without conflicts.

Contributor

Ok, I agree! Let's stick with the example you've provided here.
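For concreteness, a minimal sketch of the agreed-upon __init__ (the exact exception type and message are assumptions for illustration):

    def __init__(self, parameters, random_state=0):
        if "pipeline" not in parameters:
            # gap and max_delay have no sensible defaults, so fail loudly.
            raise ValueError("TimeSeriesRegressionPipeline requires a 'pipeline' key "
                             "in parameters containing 'gap' and 'max_delay'.")
        pipeline_params = parameters["pipeline"]
        self.gap = pipeline_params["gap"]
        self.max_delay = pipeline_params["max_delay"]
        super().__init__(parameters, random_state)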

            X_t, _ = self._compute_features_during_fit(X, y=None)
            y = X.squeeze()
        else:
            y = pd.Series(y)
Contributor

Can you move this out, like we're doing for X on line 29? It will make it easier to normalize to a dataframe when that comes in.

Contributor

Ooh actually @freddyaboulton we should be converting these to woodwork now that @angela97lin merged #1393. Check out the code in the other pipeline classes. That just got merged recently.

Contributor Author

Thank you for reminding me! Just pushed this up.

        X_t, _ = self._compute_features_during_fit(X, y)

        y_shifted = y.shift(-self.gap)
        X_t, y_shifted = drop_nan(X_t, y_shifted)
Contributor

Nice.

Perhaps drop_nan_rows or drop_rows_with_nans would be a good name for this helper function

Contributor Author

Definitely agree that this needs a clearer name. Let me know what you think of my suggestion below to use keep_non_nan_rows.

        if any_values_are_nan(features):
            return pad_with_nans(predictions, self.max_delay)
        else:
            return predictions
Contributor

Why not just always run pad_with_nans, and in the case where there are no nans, nothing happens? Maybe I'm not following this code right yet

Contributor Author

@freddyaboulton freddyaboulton Nov 10, 2020

predictions will never have NaNs. I'm doing this because our design says that the data passed into this method and the predictions should always have the same number of rows. Since we drop NaN rows before calling predict on the estimator, I need to pad with NaNs.

There are other ways to do this:

    return pad_with_nans(predictions, self.max_delay if any_values_are_nan(features) else 0)

or

    if features.shape[0] != len(predictions):
        return pad_with_nans(predictions, self.max_delay)
    else:
        return predictions

or

    return pad_with_nans(predictions, self.max_delay if features.shape[0] != len(predictions) else 0)

Which do you think is best? Open to suggestions!

Contributor

Ah got it, thanks, will respond on new code since this comment is now outdated

            pd.Series: Predicted values.
        """
        features = self.compute_estimator_features(X, y)
        predictions = self.estimator.predict(features.dropna(axis=0, how="any"))
Contributor

Should we use the drop_nan helper here too? Maybe I'm just confused about the names.

Contributor Author

@freddyaboulton freddyaboulton Nov 10, 2020

Maybe I should rename the drop_nan helper to keep_non_nan_rows? Since the target and features will have NaNs in different places in score, we need a way to select the rows that are not NaN in both of them. As far as I know, there isn't a pandas method that does this.

In this case, we just need to drop the NaN rows from a dataframe and pandas has a built-in for it.

Contributor

Yep I like that name. Right, I don't think there's a convenient builtin in pandas for that particular use-case. I've always either used dropna or isnan.

Contributor Author

Great! I changed the name

"""
y_predicted = self.predict(X, y)
if y is None:
y = X.copy()
Contributor

Why do this? If y isn't provided, that's an error...

Contributor Author

This is for the case where there are no features and only y is used but I just cleaned this up!

        X = _convert_to_woodwork_structure(X)
        y = _convert_to_woodwork_structure(y)
        X = _convert_woodwork_types_wrapper(X.to_pandas())
        y = _convert_woodwork_types_wrapper(y.to_pandas())
Contributor

Nice! @angela97lin does this look good to you?

Contributor

Yup, looks good!

Contributor

@angela97lin angela97lin left a comment

Looking good, just left a few docstring comments / questions for my own curiosity haha

        Returns:
            self
        """
        if X is None:
Contributor

Interesting, what's the point of this? Is there a case where the user only passes in y and X is None for time series regression problems? 🤔

Contributor Author

Yes! It's common to build ts models based on only delayed values of the target. Like let's say you have a sequence of daily stock prices or temperature readings, you could model future values based solely on the previous values.

Contributor

Yep. One common model type like this is ARIMA.

I think in practice, most timeseries pipelines will use the input features.
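For example, a target-only pipeline could be fit roughly like this (a sketch; the component graph and data are illustrative, and the parameters mirror the example earlier in this thread):

    import pandas as pd

    from evalml.pipelines import TimeSeriesRegressionPipeline


    class TargetOnlyPipeline(TimeSeriesRegressionPipeline):
        component_graph = ["Delayed Feature Transformer", "Random Forest Regressor"]


    y = pd.Series(range(100), dtype="float64")  # e.g. daily temperature readings
    pl = TargetOnlyPipeline({"Delayed Feature Transformer": {"gap": 1, "max_delay": 2,
                                                             "delay_features": False,
                                                             "delay_target": True},
                             "pipeline": {"gap": 1, "max_delay": 2}})
    pl.fit(None, y)  # X=None: future values are modeled solely from delayed target values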

        Arguments:
            parameters (dict): Dictionary with component names as keys and dictionary of that component's parameters as values.
                An empty dictionary {} implies using all default values for component parameters. Pipeline-level
                parameters such as gap and max_delay must be specified as a dictionary with the "pipeline" key.
Contributor

"must be specified in a dictionary with the "pipeline" key"? Hmm. This was a tad bit confusing to me--not sure best way to clear it up via wording. Maybe would be good to give an example?

Contributor Author

Good suggestion - I added an example!

Comment on lines 319 to 361
def pad_with_nans(pd_data, num_to_pad):
    """Pad the beginning num_to_pad rows with nans.

    Arguments:
        pd_data (pd.DataFrame or pd.Series): Data to pad.
        num_to_pad (int): Number of rows of NaNs to prepend.

    Returns:
        pd.DataFrame or pd.Series
    """
    if isinstance(pd_data, pd.Series):
        padding = pd.Series([None] * num_to_pad)
    else:
        padding = pd.DataFrame({col: [None] * num_to_pad
                                for col in pd_data.columns})
    return pd.concat([padding, pd_data], ignore_index=True).infer_objects()


def keep_non_nan_rows(pd_data_1, pd_data_2):
    """Keep nows that do not have any NaNs in both pd_data_1 and pd_data_2.

    Returns:
        tuple of pd.DataFrame or pd.Series
    """

    def _not_nan(pd_data):
        if isinstance(pd_data, pd.Series):
            return ~pd_data.isna().values
        else:
            return ~pd_data.isna().any(axis=1).values

    mask = np.logical_and(_not_nan(pd_data_1), _not_nan(pd_data_2))
    return pd_data_1.iloc[mask], pd_data_2.iloc[mask]


def any_values_are_nan(pd_data):
    """Determine if any rows have at least one NaN.
    Args:
        pd_data (pd.Dataframe or pd.Series): Pandas data to check.

    Returns:
        bool: True if any value in pd_data is NaN.
    """
    return pd_data.isna().values.any()
Contributor

Let's either specify these in the API docs or keep them private? :d

Contributor Author

Added them to the API docs! Thanks for reminding me.
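A quick usage sketch of these helpers (illustrative only, assuming the definitions above):

    import numpy as np
    import pandas as pd

    features = pd.DataFrame({"a": [np.nan, 1.0, 2.0], "b": [np.nan, np.nan, 3.0]})
    target = pd.Series([0.5, np.nan, 1.5])

    X_clean, y_clean = keep_non_nan_rows(features, target)
    # Only the last row survives: it is the only row that is NaN-free in both inputs.

    padded = pad_with_nans(pd.Series([7.0, 8.0]), 2)
    # -> NaN, NaN, 7.0, 8.0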


def any_values_are_nan(pd_data):
    """Determine if any rows have at least one NaN.
    Args:
Contributor

Args --> Arguments

    Returns:
        bool: True if any value in pd_data is NaN.
    """
    return pd_data.isna().values.any()
Contributor

Arguably necessary/not since it's just one line? :d

Contributor Author

Deleted!



def keep_non_nan_rows(pd_data_1, pd_data_2):
    """Keep nows that do not have any NaNs in both pd_data_1 and pd_data_2.
Contributor

Keep nows --> Keep rows

        pipeline_params = parameters.pop("pipeline")
        self.gap = pipeline_params['gap']
        self.max_delay = pipeline_params['max_delay']
        super().__init__(parameters, random_state)
Contributor

If we do parameters.pop as above, the pipeline-level params won't be saved in the parameters attribute of the pipeline instance. Can we make it so those parameters are saved just like any other component's parameters?

Contributor

Agreed; we should add a test to make sure gap and max_delay are in the parameters dict after initialization to check for this

Contributor Author

@freddyaboulton freddyaboulton Nov 11, 2020

Great point! I had to modify PipelineBase a little to get this to work (the pipeline parameters method would only read parameters from the component graph) but I'll push this up momentarily!

I decided to only save the pipeline-level parameters if they are present to avoid introducing breaking changes.

Contributor

I decided to only save the pipeline-level parameters if they are present to avoid introducing breaking changes.

I.e. whatever the user passes in is what's saved on the pipeline? If so, yep I agree. We shouldn't create an empty "pipeline" entry in parameters dicts, only create it if it's populated.

Contributor Author

Yes that's what I meant!
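A rough sketch of that change (hypothetical names: _pipeline_params stands for wherever __init__ stashes the "pipeline" dict; the real PipelineBase property may differ):

    import copy

    @property
    def parameters(self):
        """Parameters for each component, plus pipeline-level parameters if given."""
        params = {component.name: copy.copy(component.parameters)
                  for component in self.component_graph if component.parameters}
        # Only add a "pipeline" entry when pipeline-level parameters were supplied,
        # so pipelines without them keep the same parameters dict shape as before.
        if getattr(self, "_pipeline_params", None):
            params["pipeline"] = self._pipeline_params
        return params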


        features = self.compute_estimator_features(X, y)
        predictions = self.estimator.predict(features.dropna(axis=0, how="any"))
        return pad_with_nans(predictions, self.max_delay if features.isna().values.any() else 0)
Contributor

@freddyaboulton this looks good!

Why not say pad_with_nans(predictions, self.max_delay) ? I figure we should rely on the configuration rather than on properties of the data. Fundamentally, we need the shape of the output to match the number of rows in X and y.

Contributor Author

If the user doesn't have any delayed features (either the DelayedFeatureTransformer is missing from the component graph, or they passed delay_features: False and delay_target: False to the DelayedFeatureTransformer), then we don't need to pad the predictions.

pad_with_nans(predictions, self.max_delay) will always add max_delay rows of NaNs to the beginning.

Contributor

Got it, thanks.

Edge cases to consider:

  • delay_features and delay_target were False. And, due to a bug or unexpected component behavior, features happens to have a nan value somewhere, even though there were no nans created by the delayed featurizer.
  • delay_features or delay_target were True. And, due to a bug or unexpected component behavior, features happens to not have any nans. Maybe this case isn't important, but it could happen if someone put an imputer after the delayed feature eng.

Is there a foolproof way to avoid these sorts of scenarios? What about

n_rows_padding_pre = X.shape[0] - predictions.shape[0]
return pad_with_nans(predictions, n_rows_padding_pre)

Contributor Author

Yea I think that works! Good suggestion.
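Putting that together, predict would look roughly like this (a sketch combining the snippets above, not the final merged code):

    def predict(self, X, y=None):
        features = self.compute_estimator_features(X, y)
        # Rows with NaNs (e.g. from delayed features) are dropped before predicting.
        predictions = self.estimator.predict(features.dropna(axis=0, how="any"))
        # Pad so the output has exactly as many rows as the input, no matter
        # how many rows the NaN-drop removed.
        n_rows_padding_pre = X.shape[0] - predictions.shape[0]
        return pad_with_nans(predictions, n_rows_padding_pre)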

@freddyaboulton freddyaboulton force-pushed the 1380-timeseries-regression-pipeline branch from f54673c to 198400a on November 12, 2020 18:43
Contributor

@dsherry dsherry left a comment

@freddyaboulton colossal!

I like that you put parametrize in place in the unit tests in prep for the other problem types 👏
and I love the coverage which checks the data provided to the estimator is correct.
I think we've rounded out most of the open stuff. I left a comment in predict about the padding -- ideally we should find a way to tighten that up. I left a couple unit test suggestions too.

It would be nice to have coverage for using a time series pipeline which happens to not use delayed feature transformer, but this isn't high priority since we know we want these features.

        X_t = self._compute_features_during_fit(X, y)

        y_shifted = y.shift(-self.gap)
        X_t, y_shifted = keep_non_nan_rows(X_t, y_shifted)
Contributor

Sorry to nit-pick on naming again lol, not critical, but: does this make more sense as a positive ("drop_rows_with_nans") rather than a negation ("keep_non_nan_rows")?

Contributor Author

Changed! No worries I agree ehehe


        X = _convert_woodwork_types_wrapper(X.to_dataframe())

        y_predicted = self.predict(X, y)
        y_shifted = y.shift(-self.gap)
Contributor

Yep, makes sense. So we shift the target down by gap rows, so that the target value at row n headed into the estimator comes originally from row n + gap. And the final gap rows are nan.
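A tiny worked example of that shift:

    import pandas as pd

    y = pd.Series([10, 11, 12, 13, 14])
    gap = 2
    y_shifted = y.shift(-gap)
    # Row 0 now holds the target from row 2, row 1 from row 3, and so on.
    # y_shifted: [12.0, 13.0, 14.0, NaN, NaN] -- the final `gap` rows are NaN.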


@pytest.mark.parametrize("only_use_y", [True, False])
@pytest.mark.parametrize("include_delayed_features", [True, False])
@pytest.mark.parametrize("gap,max_delay", [(1, 2), (2, 2), (7, 3), (2, 4)])
Contributor

Do we have coverage where gap is set to 0 ("given historical features and today's features, predict today's target"), or where max_delay is set to 0 ("given today's features, predict a future target") or 1?

I think we should add this to the tests in this file, fit/predict/score

Contributor Author

Added these combinations of gap and max_delay to the tests!

(0, 0), (1, 0), (0, 2), (1, 1)

        estimator_name, gap, max_delay, include_delayed_features, only_use_y, ts_data):

    if only_use_y and not include_delayed_features:
        pytest.skip("Can't prevent label leakage when only the target is used without lagging.")
Contributor

So this would happen when max_delay=0 and delay_target=True? Perhaps we should raise an error in that case?

Contributor Author

@freddyaboulton freddyaboulton Nov 12, 2020

Yea, the problem is that all of our (current) estimators require X to be non-empty. If you pass X=None to the pipeline and you are not computing any features with a DelayedFeaturesTransformer, then our estimators will freak out. I'll raise an error in fit if we're in that case, but we may need to rework this in the future when we add time-series-specific estimators that compute their own features internally. In that case, we'll likely want to pass in X=None and not include a DelayedFeatureTransformer in the pipeline.

Contributor

That's a good point @freddyaboulton . Sounds good, raising an error in pipeline fit makes sense, and then when we add timeseries support to make_pipelines for automl we'll just have to handle this edge case and not add delayed feature eng to the pipeline.
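A minimal sketch of that fit-time guard (the _uses_delayed_features flag is hypothetical; the exact check and message are illustrative):

    def fit(self, X, y):
        if X is None and not self._uses_delayed_features:
            # No input features and no delayed-feature engineering means there
            # is nothing for the estimator to train on.
            raise RuntimeError("Pipeline computes no features: pass X or include a "
                               "DelayedFeatureTransformer in the component graph.")
        ...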

    if only_use_y:
        pl.fit(None, y)
    else:
        pl.fit(X, y)
Contributor

Nice

        expected_target = np.arange(1 + gap + max_delay, 32)
    else:
        train_index = pd.date_range(f"2020-10-01", f"2020-10-{31-gap}")
        expected_target = np.arange(1 + gap, 32)
Contributor

This data gen is great, easy to follow

@freddyaboulton freddyaboulton merged commit af9400b into main Nov 12, 2020
@freddyaboulton freddyaboulton deleted the 1380-timeseries-regression-pipeline branch November 12, 2020 23:29
christopherbunn pushed a commit that referenced this pull request Nov 19, 2020
* Adding TimeSeriesRegressionPipeline.

* Simplifying MockDelayedFeatures transformer.

* Adding PR 1418 to release notes.

* Removing artifact after rebase.

* Fixing broken tests.

* Adding docstrings for TimeSeriesRegressionPipeline.

* Removing pytest.skip that was skipping all subsequent estimators in favor of continue.

* Add docstrings for NaN helper functions in gen_utils.

* Reverting change that breaks unit tests.

* Adding time series regression pipeline to api ref.

* Adopting convention that X must be None when there are no features apart from target for time series.

* Refactoring how pipeline is defined in tests.

* Specifying pipeline parameters as the pipeline key in the parameters dict.

* Rename drop_nan to keep_non_nan_rows.

* Fixing import order in tests/utils_tests/test_gen_utils.py

* Deleting unnecessary series conversion in fit.

* Pipeline parameters are now saved in the parameters dict.

* Upgrading woodwork conversion to 0.0.5

* Adding test coverage for gap, max_delay = 0

* Changing skip message in test_time_series_pipeline.py

* Fixing typo in api ref.
@dsherry dsherry mentioned this pull request Nov 24, 2020
Successfully merging this pull request may close these issues.

Add a TimeSeriesRegressionPipeline

3 participants