-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Automatic Periodicty Determination #3729
Conversation
Codecov Report
@@ Coverage Diff @@
## main #3729 +/- ##
=======================================
+ Coverage 99.7% 99.7% +0.1%
=======================================
Files 338 339 +1
Lines 34672 34786 +114
=======================================
+ Hits 34540 34655 +115
+ Misses 132 131 -1
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
…r of the relative extrema, filtering out negative ACF values and just grabbing the max.
@@ -26,10 +32,122 @@ def __init__(self, parameters=None, component_obj=None, random_seed=0, **kwargs) | |||
**kwargs, | |||
) | |||
|
|||
def _set_time_index(self, X, y): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shifted _set_time_index()
up from PolynomialDecomposer
class to the parent since it can probably be reused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome to see this, and I'm glad the implementation isn't overly-complicated. My questions are mainly about the intended flow within the decomposer. When are the periodic and detrender functions expected to be called within the flow fitting and predicting on a pipeline? This might be buried in the code somewhere but it's currently unclear to me.
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"The children of the `Decomposer` class, if not set in the constructor, sets its `seasonal_period` parameter to -1. If the `Decomposer` object is fit with a `seasonal_period` of -1, it will take on a default value of 7, which is good for daily data signals that have a known seasonal component period that is weekly. \n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused by this. The default is -1, but actually 7? Why have the separate values and not just default to 7 immediately?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another note - why leave the seasonal_period
to the children? Seems like it would fit better in the parent Decomposer
class, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With respect to leaving the seasonal_period
to the children, I initially did that because I was not entirely sure whether the other children of Decomposer
would need this functionality or not. I filed this issue to come back to it after we do one or two more Decomposer
children.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With respect to the "default" value, I understand the confusion. I changed the wording in the notebook as the default value is not really even 7. "-1" means "let statsmodels
' freq_to_period
figure it out base on the inferred frequency of the datetime index." Setting any other integer makes that the seasonal period.
"source": [ | ||
"The children of the `Decomposer` class, if not set in the constructor, sets its `seasonal_period` parameter to -1. If the `Decomposer` object is fit with a `seasonal_period` of -1, it will take on a default value of 7, which is good for daily data signals that have a known seasonal component period that is weekly. \n", | ||
"\n", | ||
"However, if the seasonal period is not known before hand, the `set_seasonal_period()` convenience function will look at the target data, determine the period and set the `Decomposer` object appropriately." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sentence seems to conflict with the above. Can you clarify the flow of how the parameter gets set if the user does not set it themselves?
Unrelated nit: "before hand" -> "beforehand"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think of this in the cell:
The children of the `Decomposer` class, if not explicitly set in the constructor, set their `seasonal_period` parameter
based on a `statsmodels` function, `freq_to_period`, that considers the frequency of the datetime data. If the
`PolynomialDecomposer` object is fit with `seasonal_period` not explicitly set, it will take on a default value of 7,
which is good for daily data signals that have a known seasonal component period that is weekly.
However, if the seasonal period is not known beforehand, the `set_seasonal_period()` convenience function will look at the target data,
determine a better guess for the period and set the `Decomposer` object appropriately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes a bit more sense. I'm still slightly confused though, does the set_seasonal_period()
function replace the freq_to_period
function with a better guess, or are they completely unrelated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
set_seasonal_period()
does indeed set seasonal_period
with a better guess. freq_to_period
basically just looks at the data and says "oh, the granularity of the time series is days, let's use 7." and has a similar mapping for other frequencies. set_seasonal_period()
will actually invoke the seasonal period detection to make a good guess, then set it.
"pdc.fit(X_train_time, y_train_time)\n", | ||
"assert pdc.seasonal_period == 7\n", | ||
"pdc.set_seasonal_period(X_train_time, y_train_time)\n", | ||
"assert 363 < pdc.seasonal_period < 368" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this range? Would the seasonal period ever be something different than 365 or 366? (And if so, does it make sense to fix it since the data in this example is known?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was just trying to show the "correct" period of ~365 +- a little fuzziness due to the algorithm.
pacf = sm.tsa.pacf(y) | ||
return argrelextrema(pacf, np.greater)[0] | ||
|
||
def _detrend_on_fly(X, y): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm worried about this - these are lots of computationally expensive calls. Remind me, when in the process do we expect to be using the periodicity?
If I understand correctly, we need the period to properly detrend, so this is a pre-detrending to get the period? What a fun catch-22... If this is the case, can we reduce this in any way by enforcing this is run after either the component is fit or the time index is already set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sadly, it cannot be run after the component is fit...as the intent is to use the output to fit the component properly. This does seem a bit like a catch-22 because I think really this is kind of an iterative process. I am not sure what you mean though for the time index, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha. My thinking with the time index is that set_time_index
can do a lot of extra work if it's run multiple times - ideally there'd be some way we can make sure its called as few times as possible within the component.
evalml/pipelines/components/transformers/preprocessing/decomposer.py
Outdated
Show resolved
Hide resolved
evalml/pipelines/components/transformers/preprocessing/decomposer.py
Outdated
Show resolved
Hide resolved
evalml/tests/component_tests/decomposer_tests/test_polynomial_decomposer.py
Outdated
Show resolved
Hide resolved
X, y = load_weather() | ||
y = y.set_axis(X["Date"]).asfreq(pd.infer_freq(X["Date"])) | ||
X = X.set_index("Date").asfreq(pd.infer_freq(X["Date"])) | ||
return X, y |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without looking at this closely, is there any value in initializing woodwork here?
In order to better calculate the seasonal signal, the
Decomposer
child objects should be updated with a function to estimate the period of the seasonal component of that signal. This PR adds a function to estimate the period using either an autocorrelation or partial autocorrelation based method. Currently, the autocorrelation method works best. Follow-on stories are filed to continue work on these functions.