Support relative forecasting for ARIMA #2613

Merged: 6 commits, Aug 11, 2021

ParthivNaresh
Contributor

@ParthivNaresh ParthivNaresh commented Aug 10, 2021

Fixes: #2592

The original issue was filed under the assumption that sktime could not support time indices whose frequency is a multiple of the base unit, such as 2 days, 4 minutes, or 3 months. The problem manifests when exogenous variables (features) are included in fit and predict. It doesn't seem to persist when the is_relative parameter of ForecastingHorizon is set to True. With is_relative=True, predictions are made for a set of integer steps counted from the last date used when fitting the estimator, rather than for specific dates.

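import pandas as pd
from sktime.forecasting.arima import AutoARIMA
from sktime.forecasting.base import ForecastingHorizon
time_index_1 = pd.date_range("1/1/21", periods=100, freq="3T")
X = pd.DataFrame(range(100), index=time_index_1)
y = pd.Series(y_values, index=time_index_1)  # y_values: 100 target values, not shown here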

With the flag set to False

# The actual dates to predict for must be passed to the ForecastingHorizon
fh_ = ForecastingHorizon(X.index[75:], is_relative=False)
a_clf = AutoARIMA(suppress_warnings=True)
clf = a_clf.fit(y=y[:75])
y_pred_sk = clf.predict(fh=fh_)
print(y_pred_sk)
---------------------------------------------------
2021-03-17    0.371662
2021-03-18    0.776146
...
2021-04-09   -0.486227
2021-04-10   -0.000040

This does not work when features are passed

fh_ = ForecastingHorizon(X.index[75:], is_relative=False)
a_clf = AutoARIMA(suppress_warnings=True)
clf = a_clf.fit(X=X[:75], y=y[:75])
y_pred_sk = clf.predict(fh=fh_, X=X[75:])
print(y_pred_sk)
---------------------------------------------------
668         X = self._check_exog(X)  # type: np.ndarray
669         if X is not None and X.shape[0] != n_periods:
--> 670             raise ValueError('X array dims (n_rows) != n_periods')
671 
672         # f = self.arima_res_.forecast(steps=n_periods, exog=X)

ValueError: X array dims (n_rows) != n_periods
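
The traceback points at a row-count check: the exogenous array passed to predict must have exactly one row per forecast period. Below is a minimal sketch of that check. `check_exog_rows` is a hypothetical stand-in, not the library's function, and the value 75 is only an assumed example of an `n_periods` that disagrees with the 25 held-out rows, not a number taken from the failing run.

```python
def check_exog_rows(n_rows, n_periods):
    """Hypothetical stand-in for the row-count check shown in the traceback."""
    if n_rows != n_periods:
        raise ValueError("X array dims (n_rows) != n_periods")
    return True


held_out_rows = 100 - 75  # number of rows in X[75:]

# One row of features per forecast period: the check passes.
check_exog_rows(held_out_rows, n_periods=25)

# Assumed mismatch: the horizon resolves to more periods than there are rows.
try:
    check_exog_rows(held_out_rows, n_periods=75)
except ValueError as err:
    message = str(err)
```

Whatever the actual value of `n_periods` was in the failing call, the error means the absolute horizon resolved to a different number of periods than the 25 rows in `X[75:]`.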

With the flag set to True

# Only the "periods" (integer steps) after the last date used when fitting the model are needed
fh_ = ForecastingHorizon([i+1 for i in range(len(X.index[75:]))], is_relative=True)
a_clf = AutoARIMA(suppress_warnings=True)
clf = a_clf.fit(y=y[:75])
y_pred_sk = clf.predict(fh=fh_)
print(y_pred_sk)
---------------------------------------------------
2021-03-17    0.371662
2021-03-18    0.776146
...
2021-04-09   -0.486227
2021-04-10   -0.000040

This works even when features are passed

fh_ = ForecastingHorizon([i+1 for i in range(len(X.index[75:]))], is_relative=True)
a_clf = AutoARIMA(suppress_warnings=True)
clf = a_clf.fit(X=X[:75], y=y[:75])
y_pred_sk = clf.predict(fh=fh_, X=X[75:])
print(y_pred_sk)
---------------------------------------------------
2021-01-01 03:45:00    0.369964
2021-01-01 03:48:00    0.768921
...
2021-01-01 04:54:00   -0.458856
2021-01-01 04:57:00   -0.012400
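
The relationship between relative steps and timestamps can be sketched with the standard library alone. This is an illustration under a stated assumption, not sktime's implementation: `relative_steps_to_timestamps` is a hypothetical helper that treats step k as `cutoff + k * freq` for a regular 3-minute frequency.

```python
from datetime import datetime, timedelta


def relative_steps_to_timestamps(cutoff, steps, freq):
    """Hypothetical helper: step k is taken to mean `cutoff + k * freq`."""
    return [cutoff + k * freq for k in steps]


freq = timedelta(minutes=3)
start = datetime(2021, 1, 1)

# Mirrors the 100-point, 3-minute index used in the examples above.
time_index = [start + i * freq for i in range(100)]
train_index, test_index = time_index[:75], time_index[75:]

# Same construction as the relative horizon: steps 1..25 after the training data.
steps = [i + 1 for i in range(len(test_index))]
predicted_index = relative_steps_to_timestamps(train_index[-1], steps, freq)

print(predicted_index[0], predicted_index[-1])
# 2021-01-01 03:45:00 2021-01-01 04:57:00
```

Under that assumption, the 25 relative steps land on the same timestamps as the held-out part of the index, which matches the first and last timestamps in the output above.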

More information can be found here.

codecov bot commented Aug 10, 2021

Codecov Report

Merging #2613 (53b6c70) into main (716ea92) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #2613     +/-   ##
=======================================
+ Coverage   99.9%   99.9%   +0.1%     
=======================================
  Files        297     297             
  Lines      27027   27039     +12     
=======================================
+ Hits       26983   26995     +12     
  Misses        44      44             
Impacted Files                                           Coverage Δ
...omponents/estimators/regressors/arima_regressor.py    100.0% <100.0%> (ø)
...alml/tests/component_tests/test_arima_regressor.py    100.0% <100.0%> (ø)
evalml/tests/component_tests/test_estimators.py          100.0% <100.0%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

[i + 1 for i in range(len(y[15:]))], is_relative=True
)

a_clf = sktime_arima.AutoARIMA(start_p=2, start_q=2, max_p=2, max_q=2)
Contributor Author

These parameters reduce this test run time from 1m 14s to 20s

Collaborator

great call!

Collaborator

Do you think we would get the same coverage if we reduced freq_str to maybe half or so? I appreciate the thoroughness, but I feel like maybe we can trust sktime to handle uniformity between the different time periods.

Contributor Author

Sure, I can limit it to seconds, months, and years: seconds to cover sub-minute intervals, and months and years because they have a varying number of days, so it might be useful to stay up to date with any changes made in sktime.
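
The irregularity mentioned here is easy to see with the standard library. This is a small illustration of why month and year frequencies are worth keeping in the parametrized tests; it is not part of the PR's test suite.

```python
import calendar

# Days per month in 2021: a "1 month" step is not a fixed-length period.
days_per_month = [calendar.monthrange(2021, month)[1] for month in range(1, 13)]

# Days per year: leap years make a "1 year" step irregular too.
days_per_year = {
    year: 366 if calendar.isleap(year) else 365 for year in (2019, 2020, 2021)
}

print(sorted(set(days_per_month)))  # [28, 30, 31]
print(days_per_year)  # {2019: 365, 2020: 366, 2021: 365}
```

Second-level frequencies, by contrast, are fixed-length, so together the three cover both the regular and the calendar-dependent cases.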

Collaborator

@chukarsten chukarsten left a comment

Nice job, Parthiv! I'm not 100% sure whether we want only relative forecasting or want to enable both relative and absolute. If the former, then I think we're good to go. If the latter, we might need to rework this a little bit.

Comment on lines +155 to +157
fh_ = forecasting_.ForecastingHorizon(
[i + 1 for i in range(len(dates))], is_relative=True
)
Collaborator

So, dumb question: it seems that right now relative forecasting is all we support with this change. Is there any benefit to changing the API so that is_relative is assigned during the regressor's creation? I'm not super familiar, so maybe we only want relative forecasting, but it seems to a luddite like me that "both" might be desirable.

Contributor Author

Not dumb at all, and this is certainly something we can discuss further. Based on the time series conversations that have been going on, we want to make sure that however we implement this, it works well with the moving window strategy we want to leverage in Rolling Origin Cross Validation. Especially considering gap and forecast_horizon, I think we'd want to look a certain number of steps ahead when making our predictions. Relative forecasting would make this easier since we can determine how far ahead we want to look, and what those indices should be.
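
A rough sketch of how relative steps could line up with a rolling-origin split. Everything here is hypothetical: the helper names are made up for illustration, and the semantics are assumptions (gap is taken to be the number of periods skipped after the last training row, forecast_horizon the number of consecutive periods predicted after that).

```python
def relative_horizon(gap, forecast_horizon):
    """Hypothetical helper: relative steps to predict for one CV window.

    Assumes `gap` periods are skipped after the last training row and
    `forecast_horizon` consecutive periods are then predicted.
    """
    return [gap + step for step in range(1, forecast_horizon + 1)]


def rolling_origin_windows(n_rows, initial_train, gap, forecast_horizon):
    """Yield (train_end, target_positions) pairs for an expanding window."""
    train_end = initial_train
    while train_end + gap + forecast_horizon <= n_rows:
        steps = relative_horizon(gap, forecast_horizon)
        # The last training row sits at position train_end - 1, so relative
        # step k lands on row (train_end - 1) + k.
        yield train_end, [train_end - 1 + k for k in steps]
        train_end += forecast_horizon


windows = list(
    rolling_origin_windows(n_rows=20, initial_train=10, gap=2, forecast_horizon=3)
)
print(windows)
# [(10, [12, 13, 14]), (13, [15, 16, 17])]
```

The relative steps `[3, 4, 5]` stay the same for every window; only the training cutoff moves, which is the convenience described above.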

Contributor

@bchen1116 bchen1116 left a comment

Left a question for my own understanding, but rest of it looks good!

@@ -152,7 +152,9 @@ def _format_dates(self, dates, X, y, predict=False):
  forecasting_ = import_or_raise(
  "sktime.forecasting.base", error_msg=arima_model_msg
  )
- fh_ = forecasting_.ForecastingHorizon(dates, is_relative=False)
+ fh_ = forecasting_.ForecastingHorizon(
+ [i + 1 for i in range(len(dates))], is_relative=True
Contributor

Why are we forecasting on this array of ints rather than the dates? Do the dates become irrelevant when we use is_relative?

Contributor Author

@bchen1116 Correct, the ints represent how many "periods" after the end of the training dates we want to predict for.

@ParthivNaresh ParthivNaresh merged commit ce1830f into main Aug 11, 2021
ParthivNaresh added a commit that referenced this pull request Aug 11, 2021
@chukarsten chukarsten mentioned this pull request Aug 12, 2021
@freddyaboulton freddyaboulton deleted the Support-Relative-Freq-ARIMA branch May 13, 2022 15:18
Development

Successfully merging this pull request may close these issues.

Withholding ARIMA from AutoML if time unit is not supported by sktime
3 participants