-
Notifications
You must be signed in to change notification settings - Fork 86
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Select Smarter Lags for Time Series #3005
Conversation
5b166dc
to
6a9afb3
Compare
Codecov Report
@@ Coverage Diff @@
## main #3005 +/- ##
=======================================
+ Coverage 99.7% 99.7% +0.1%
=======================================
Files 312 312
Lines 29775 29853 +78
=======================================
+ Hits 29684 29762 +78
Misses 91 91
Continue to review full report at Codecov.
|
X=X, y=y | ||
), | ||
DelayedFeatureTransformer( | ||
max_delay=3, gap=0, forecast_horizon=1, conf_level=1.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using a conf_level
of 1.0 selects all lags. Some of these tests rely on the DelayedFeatureTransformer selecting all lags.
465c297
to
cd7dbb5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work on this! You got this out so fast. I think one thing that we should add before merging this would be some documentation, maybe here, that would allow users to understand what this component is doing and how changing conf_level could change the lags used. This would help with un-black-boxing the component.
Otherwise, left some nits and some questions. Good stuff though!
@@ -14,6 +18,9 @@ class DelayedFeatureTransformer(Transformer): | |||
date_index (str): Name of the column containing the datetime information used to order the data. Ignored. | |||
max_delay (int): Maximum number of time units to delay each feature. Defaults to 2. | |||
forecast_horizon (int): The number of time periods the pipeline is expected to forecast. | |||
conf_level (float, None): Float between 0 and 1 that determines the confidence interval size used to select |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: add default value here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely. Will get rid of the outdated None
as well as adding some validation that conf_level is in the range (0, 1].
# Return lags that are significant peaks or the first 10 significant lags | ||
index = np.arange(len(acf_values)) | ||
significant = np.logical_or(ci_intervals[:, 0] > 0, ci_intervals[:, 1] < 0) | ||
first_significant_10 = index[:10][significant[:10]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This won't necessarily return the first 10 significant lags, right? It will only return the 10 lags if they're all significant, but if a few of them aren't significant, returns less lags. If that's what's happening, can we update the comment above to say that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea I can see why this is confusing now. The intention is to select the significant lags in the range [0, 10] not the first 10 significant lags.
significant_lags = ( | ||
set(index[significant]).intersection(peaks).union(first_significant_10) | ||
) | ||
# If not lags are significant get the first lag |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: if no lags are significant...
@@ -14,6 +18,9 @@ class DelayedFeatureTransformer(Transformer): | |||
date_index (str): Name of the column containing the datetime information used to order the data. Ignored. | |||
max_delay (int): Maximum number of time units to delay each feature. Defaults to 2. | |||
forecast_horizon (int): The number of time periods the pipeline is expected to forecast. | |||
conf_level (float, None): Float between 0 and 1 that determines the confidence interval size used to select | |||
which lags to compute from the set of [1, max_delay]. A delay of 1 will always be computed. If 1, | |||
selects all possible lags in the set of [1, max_delay], inclusive. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Added some parameter validation as well.
|
||
first_significant_10 = [l for l in significant_lags if l < 10] | ||
expected_lags = ( | ||
set(significant_lags + peaks).intersection(peaks).union(first_significant_10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesnt set(significant_lags + peaks).intersection(peaks)
just reduce to set(peaks)
?
I think since we wanted the significant peaks, it would be just set(significant_lags).intersection(peaks)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, making this change now!
@bchen1116 Added details on the algorithm to the component docstring and added a link from the user guide to the component reference since I didn't want to jump deep into the specifics of a component in the user guide. Thanks for the feedback though, the documentation definitely needed improvement. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! In your perf tests it showed that Daily Female Births
had slightly lower validation results while Southern Oscillations
had slightly better. It would be interesting to see how Prophet and ARIMA perform (once we enable DelayedFeatureTransformer for ARIMA).
@staticmethod | ||
def _find_significant_lags(y, conf_level, max_delay): | ||
all_lags = np.arange(max_delay + 1) | ||
if conf_level is not None and y is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what happened originally I set the default value to None
to mean "no lags" but that doesn't work well with our tuners so I made it be a float. I'll move the none check to the init! Thank you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for making the changes! Agreed with @ParthivNaresh about conf_level=None
, but other than that, great work!
b3af9cd
to
b9930e6
Compare
Pull Request Description
Fixes #2733
After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of
docs/source/release_notes.rst
to include this pull request by adding :pr:123
.