Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automatic Periodicty Determination #3729

Merged
merged 32 commits into from
Sep 27, 2022
Merged

Automatic Periodicty Determination #3729

merged 32 commits into from
Sep 27, 2022

Conversation

chukarsten
Copy link
Contributor

@chukarsten chukarsten commented Sep 22, 2022

In order to better calculate the seasonal signal, the Decomposer child objects should be updated with a function to estimate the period of the seasonal component of that signal. This PR adds a function to estimate the period using either an autocorrelation or partial autocorrelation based method. Currently, the autocorrelation method works best. Follow-on stories are filed to continue work on these functions.

@codecov
Copy link

codecov bot commented Sep 22, 2022

Codecov Report

Merging #3729 (6fa0157) into main (7a995be) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #3729     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        338     339      +1     
  Lines      34672   34786    +114     
=======================================
+ Hits       34540   34655    +115     
+ Misses       132     131      -1     
Impacted Files Coverage Δ
...ransformers/preprocessing/polynomial_decomposer.py 100.0% <ø> (+0.8%) ⬆️
...omponents/transformers/preprocessing/decomposer.py 100.0% <100.0%> (ø)
...omponent_tests/decomposer_tests/test_decomposer.py 100.0% <100.0%> (ø)
...sts/decomposer_tests/test_polynomial_decomposer.py 100.0% <100.0%> (ø)
evalml/tests/conftest.py 98.1% <100.0%> (+0.1%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@@ -26,10 +32,122 @@ def __init__(self, parameters=None, component_obj=None, random_seed=0, **kwargs)
**kwargs,
)

def _set_time_index(self, X, y):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shifted _set_time_index() up from PolynomialDecomposer class to the parent since it can probably be reused.

@chukarsten chukarsten changed the title Determine periodicity Automatic Periodicty Determination Sep 27, 2022
Copy link
Contributor

@eccabay eccabay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome to see this, and I'm glad the implementation isn't overly-complicated. My questions are mainly about the intended flow within the decomposer. When are the periodic and detrender functions expected to be called within the flow fitting and predicting on a pipeline? This might be buried in the code somewhere but it's currently unclear to me.

"cell_type": "markdown",
"metadata": {},
"source": [
"The children of the `Decomposer` class, if not set in the constructor, sets its `seasonal_period` parameter to -1. If the `Decomposer` object is fit with a `seasonal_period` of -1, it will take on a default value of 7, which is good for daily data signals that have a known seasonal component period that is weekly. \n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused by this. The default is -1, but actually 7? Why have the separate values and not just default to 7 immediately?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another note - why leave the seasonal_period to the children? Seems like it would fit better in the parent Decomposer class, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With respect to leaving the seasonal_period to the children, I initially did that because I was not entirely sure whether the other children of Decomposer would need this functionality or not. I filed this issue to come back to it after we do one or two more Decomposer children.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With respect to the "default" value, I understand the confusion. I changed the wording in the notebook as the default value is not really even 7. "-1" means "let statsmodels' freq_to_period figure it out base on the inferred frequency of the datetime index." Setting any other integer makes that the seasonal period.

"source": [
"The children of the `Decomposer` class, if not set in the constructor, sets its `seasonal_period` parameter to -1. If the `Decomposer` object is fit with a `seasonal_period` of -1, it will take on a default value of 7, which is good for daily data signals that have a known seasonal component period that is weekly. \n",
"\n",
"However, if the seasonal period is not known before hand, the `set_seasonal_period()` convenience function will look at the target data, determine the period and set the `Decomposer` object appropriately."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sentence seems to conflict with the above. Can you clarify the flow of how the parameter gets set if the user does not set it themselves?

Unrelated nit: "before hand" -> "beforehand"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of this in the cell:

The children of the `Decomposer` class, if not explicitly set in the constructor, set their `seasonal_period` parameter
based on a `statsmodels` function, `freq_to_period`, that considers the frequency of the datetime data.  If the
`PolynomialDecomposer` object is fit with `seasonal_period` not explicitly set, it will take on a default value of 7,
which is good for daily data signals that have a known seasonal component period that is weekly.

However, if the seasonal period is not known beforehand, the `set_seasonal_period()` convenience function will look at the target data,
determine a better guess for the period and set the `Decomposer` object appropriately.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes a bit more sense. I'm still slightly confused though, does the set_seasonal_period() function replace the freq_to_period function with a better guess, or are they completely unrelated?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

set_seasonal_period() does indeed set seasonal_period with a better guess. freq_to_period basically just looks at the data and says "oh, the granularity of the time series is days, let's use 7." and has a similar mapping for other frequencies. set_seasonal_period() will actually invoke the seasonal period detection to make a good guess, then set it.

"pdc.fit(X_train_time, y_train_time)\n",
"assert pdc.seasonal_period == 7\n",
"pdc.set_seasonal_period(X_train_time, y_train_time)\n",
"assert 363 < pdc.seasonal_period < 368"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this range? Would the seasonal period ever be something different than 365 or 366? (And if so, does it make sense to fix it since the data in this example is known?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just trying to show the "correct" period of ~365 +- a little fuzziness due to the algorithm.

pacf = sm.tsa.pacf(y)
return argrelextrema(pacf, np.greater)[0]

def _detrend_on_fly(X, y):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm worried about this - these are lots of computationally expensive calls. Remind me, when in the process do we expect to be using the periodicity?

If I understand correctly, we need the period to properly detrend, so this is a pre-detrending to get the period? What a fun catch-22... If this is the case, can we reduce this in any way by enforcing this is run after either the component is fit or the time index is already set?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly, it cannot be run after the component is fit...as the intent is to use the output to fit the component properly. This does seem a bit like a catch-22 because I think really this is kind of an iterative process. I am not sure what you mean though for the time index, though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. My thinking with the time index is that set_time_index can do a lot of extra work if it's run multiple times - ideally there'd be some way we can make sure its called as few times as possible within the component.

X, y = load_weather()
y = y.set_axis(X["Date"]).asfreq(pd.infer_freq(X["Date"]))
X = X.set_index("Date").asfreq(pd.infer_freq(X["Date"]))
return X, y
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without looking at this closely, is there any value in initializing woodwork here?

@chukarsten chukarsten merged commit ae43681 into main Sep 27, 2022
@chukarsten chukarsten deleted the determine_periodicity branch September 27, 2022 23:45
@chukarsten chukarsten mentioned this pull request Sep 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants