Drop ARIMA when month_start freq is detected #2632

Merged
merged 10 commits into from
Aug 17, 2021

ParthivNaresh
Contributor

Fixes #2631

@codecov

codecov bot commented Aug 16, 2021

Codecov Report

Merging #2632 (fcedb3f) into main (06c21cf) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #2632     +/-   ##
=======================================
+ Coverage   99.9%   99.9%   +0.1%     
=======================================
  Files        297     297             
  Lines      27037   27044      +7     
=======================================
+ Hits       26993   27000      +7     
  Misses        44      44             
| Impacted Files | Coverage Δ |
| --- | --- |
| evalml/automl/automl_search.py | 99.9% <100.0%> (+0.1%) ⬆️ |
| ...omponents/estimators/regressors/arima_regressor.py | 100.0% <100.0%> (ø) |
| evalml/tests/automl_tests/test_automl.py | 99.7% <100.0%> (+0.1%) ⬆️ |
| ...ests/automl_tests/test_automl_search_regression.py | 100.0% <100.0%> (ø) |
| ...alml/tests/component_tests/test_arima_regressor.py | 100.0% <100.0%> (ø) |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@@ -379,7 +379,7 @@ def test_fit_predict_date_index_named_out_of_sample(


 @pytest.mark.parametrize("freq_num", ["1", "2"])
-@pytest.mark.parametrize("freq_str", ["S", "T", "H", "D", "M", "Y"])
+@pytest.mark.parametrize("freq_str", ["T", "M", "Y"])
Contributor Author

@ParthivNaresh ParthivNaresh Aug 16, 2021

Sneaking this in here because it didn't go through in my last ARIMA PR (just meant to save time from 20s to 10s)
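
For context on the time saving: stacked `pytest.mark.parametrize` decorators run the test once per combination of their values, so trimming `freq_str` from six values to three halves the number of cases. A quick sketch of that arithmetic using `itertools.product` (the variable names are illustrative and not taken from the test file):

```python
from itertools import product

# Values from the two parametrize decorators in the diff above.
freq_nums = ["1", "2"]
old_freq_strs = ["S", "T", "H", "D", "M", "Y"]
new_freq_strs = ["T", "M", "Y"]

# Stacked parametrize decorators generate the cartesian product of their values.
old_cases = list(product(freq_nums, old_freq_strs))
new_cases = list(product(freq_nums, new_freq_strs))

print(len(old_cases), len(new_cases))  # 12 6
```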

Contributor

@bchen1116 bchen1116 left a comment

LGTM! Just left a question about one of the tests

@@ -3452,7 +3452,8 @@ def test_automl_validates_problem_configuration(X_y_binary):
problem_type="time series regression",
problem_configuration={"max_delay": 2, "gap": 3},
)

_, y = ts_data
X = pd.DataFrame(pd.date_range("2020-10-01", "2020-10-31"), columns=["Date"])
Contributor

Why did we need to add this to the test here?

Contributor Author

Originally I had added it because the code I included in AutoMLSearch would throw an error if the data didn't have a date_index column specified, which X_y_binary didn't have. Decided to just clean up the test and replace the data with what I added, thanks!

Contributor

@freddyaboulton freddyaboulton left a comment

@ParthivNaresh When I run this repro for ARIMA with MS frequency on main, I get a ValueError, but it's raised by our own ARIMA implementation.

from evalml.pipelines.components import ARIMARegressor
from evalml.demos import load_diabetes
import pandas as pd

X, y = load_diabetes()
X.ww["Date"] = pd.Series(pd.date_range(start="1/1/2018", periods=X.shape[0], freq="MS"))
arima = ARIMARegressor(date_index="Date")
arima.fit(X, y)

When I delete the line that converts the MS frequency to M, so that only freq = pd.infer_freq(dates) remains, the above repro doesn't throw an error anymore and ARIMA is able to fit:

[Screenshot: the ARIMA fit completing without error]

Unit tests also pass. So I'm wondering whether we need to make the changes to AutoMLSearch?

@ParthivNaresh
Contributor Author

@freddyaboulton If that line is changed to freq = pd.infer_freq(dates) the error just gets pushed back to predict.

from evalml.pipelines.components import ARIMARegressor
from evalml.demos import load_diabetes
import pandas as pd
import numpy as np
X = pd.DataFrame()
y = pd.Series(np.random.randint(1, 10, 10))
X["Date"] = pd.Series(pd.date_range(start="1/1/2018", periods=10, freq="MS"))

arima = ARIMARegressor(date_index="Date")
arima.fit(X, y)
arima.predict(X)
------------------------------------------
    481         msg = str(e)
    482         if "Invalid frequency" in msg or "_period_dtype_code" in msg:
--> 483             raise ValueError(
    484                 "Invalid frequency. Please select a frequency that can "
    485                 "be converted to a regular `pd.PeriodIndex`. For other "

ValueError: Invalid frequency. Please select a frequency that can be converted to a regular `pd.PeriodIndex`. For other frequencies, basic arithmetic operation to compute durations currently do not work reliably.
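
To confirm which frequency pandas reports for dates like those in the repro, here is a minimal sketch that uses only pandas. It does not touch evalml or sktime, so it does not reproduce the error itself; it only shows the `"MS"` alias that the discussion refers to.

```python
import pandas as pd

# Same kind of dates as in the repro: ten consecutive month starts.
dates = pd.date_range(start="1/1/2018", periods=10, freq="MS")

# infer_freq returns the offset alias pandas detects for the index.
freq = pd.infer_freq(dates)
print(freq)

# Every timestamp falls on the first day of its month.
print(bool(dates.is_month_start.all()))
```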

Contributor

@freddyaboulton freddyaboulton left a comment

Thank you @ParthivNaresh !

@@ -505,6 +505,16 @@ def __init__(
allowed_estimators = get_estimators(
self.problem_type, self.allowed_model_families
)
if (
Contributor

I'm on the fence as to whether this should live here or in get_estimators. I'm ok keeping this here. Is this a bug that sktime can fix in the future or is it a limitation of the MS frequency?

Contributor Author

@ParthivNaresh ParthivNaresh Aug 17, 2021

The only reason I chose to put it here instead of in get_estimators was to avoid passing problem_configuration to get_estimators just for this check.
It's a limitation with MS unfortunately: forecasting is unreliable over intervals whose length varies (months have different numbers of days). This doesn't seem to be an issue with month end.
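
The uneven spacing mentioned here is easy to see with pandas alone. The sketch below only measures the gaps between consecutive month-start timestamps; it is an illustration of the varying interval length, not a reproduction of the sktime failure.

```python
import pandas as pd

# Five consecutive month starts in 2018 (not a leap year).
dates = pd.date_range(start="2018-01-01", periods=5, freq="MS")

# Number of days between each timestamp and the previous one.
gaps = dates.to_series().diff().dropna().dt.days.tolist()
print(gaps)  # [31, 28, 31, 30]
```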

Contributor

Yea let's not change the api of get_estimators yet! Ok, I think there will be other frequencies with that limitation so I will file an issue for following up about that. Might make sense to add yet another data check for this hehe
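
As a rough illustration of the kind of check discussed in this thread, the sketch below filters ARIMA out of a list of estimators when the inferred date frequency is `"MS"`. It is not the code merged in this PR: `drop_arima_for_month_start` is a hypothetical helper, and the estimator names are placeholder strings rather than evalml component classes.

```python
import pandas as pd


def drop_arima_for_month_start(estimator_names, dates):
    """Hypothetical helper: omit ARIMA when the dates have month-start frequency."""
    freq = pd.infer_freq(pd.DatetimeIndex(dates))
    if freq == "MS":
        return [name for name in estimator_names if name != "ARIMA Regressor"]
    return list(estimator_names)


# Placeholder names standing in for the allowed estimators.
names = ["ARIMA Regressor", "Elastic Net Regressor"]

month_start = pd.date_range("2018-01-01", periods=10, freq="MS")
daily = pd.date_range("2020-10-01", "2020-10-31")

print(drop_arima_for_month_start(names, month_start))  # ARIMA removed
print(drop_arima_for_month_start(names, daily))        # list unchanged
```

Keeping the check outside `get_estimators`, as the thread settles on, means the date column only has to be available where the search is configured.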

@ParthivNaresh ParthivNaresh merged commit ec657f8 into main Aug 17, 2021
@freddyaboulton freddyaboulton deleted the ARIMA-Freq-Support branch May 13, 2022 15:20
Development

Successfully merging this pull request may close these issues.

ARIMA doesn't support MS as a frequency
3 participants