-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add forecast functions to time series regression pipeline #3742
Conversation
69392fc
to
1ca503c
Compare
Codecov Report
@@ Coverage Diff @@
## main #3742 +/- ##
=======================================
+ Coverage 99.7% 99.7% +0.1%
=======================================
Files 341 341
Lines 35235 35287 +52
=======================================
+ Hits 35103 35155 +52
Misses 132 132
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
3f4d619
to
2f2a53d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes are going to be only for regression pipelines for now. Can move to base class when we revisit for time series classification problems.
X, _, y = ts_data(problem_type=ProblemTypes.TIME_SERIES_REGRESSION) | ||
|
||
X_train, y_train = X.iloc[:15], y.iloc[:15] | ||
X_validation = X.iloc[15 : (15 + gap + forecast_horizon)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This holdout dataset has a feature
column that is obviously not generated inside of get_forecast_predictions()
(i.e. it is multivariate).
) | ||
|
||
# Generate numerical index | ||
first_idx = len(X) - 1 if not isinstance(X.index.dtype, int) else X.index[-1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using this to create a int index that is sequential after X_train
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this so important to have an integer index?
forecast_preds = clf.get_forecast_predictions(X=X_train, y=y_train) | ||
X_val_preds = clf.predict(X_validation, X_train=X_train, y_train=y_train) | ||
|
||
assert_series_equal(forecast_preds, X_val_preds) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check my understanding here: the only reason these two predictions are the same is because we're using a baseline pipeline, so the extra feature in X doesn't have an impact. For other models though, this isn't going to be the case.
Can we add info/a warning in the function (and docstring, as well as tests) to make users aware that the get_forecast_predictions
function only works for univariate problems?
Is that something we're ok with, since we're forecasting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eccabay it should work in the multivariate case (when we engineer lagged features) but it doesn't work for KIA features.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These forecasting functions are meant to work for both univariate and multivariate problems. Because we have a forecast horizon and a gap with time series problems, we need to pull from past data. In the case of multivariate problems, we are pulling these features from past rows of X_train
after it is transformed through the entire pipeline. For univariate problems, just the periods generated should be fine.
I agree that we should probably use a more complex pipeline that utilizes features though. Updated to include this.
2d663a4
to
0816cb6
Compare
0816cb6
to
1949ba2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - good work!
evalml/tests/pipeline_tests/test_time_series_baseline_pipeline.py
Outdated
Show resolved
Hide resolved
forecast_preds = clf.get_forecast_predictions(X=X_train, y=y_train) | ||
X_val_preds = clf.predict(X_validation, X_train=X_train, y_train=y_train) | ||
|
||
assert_series_equal(forecast_preds, X_val_preds) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eccabay it should work in the multivariate case (when we engineer lagged features) but it doesn't work for KIA features.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, thanks Chris!
... "pipeline": {"gap": gap, "max_delay": 1, "forecast_horizon": forecast_horizon, "time_index": "date"}}, | ||
... ) | ||
>>> pipeline.fit(X, y) | ||
pipeline = TimeSeriesRegressionPipeline(component_graph={'Linear Regressor': ['Linear Regressor', 'X', 'y']}, parameters={'Linear Regressor':{'fit_intercept': True, 'normalize': True, 'n_jobs': -1}, 'pipeline':{'gap': 1, 'max_delay': 1, 'forecast_horizon': 2, 'time_index': 'date'}}, random_seed=0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need this line
) | ||
|
||
# Generate numerical index | ||
first_idx = len(X) - 1 if not isinstance(X.index.dtype, int) else X.index[-1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this so important to have an integer index?
Adds
get_forecast_periods()
andget_forecast_predictions()
to time series regression pipelines.