
Select Smarter Lags for Time Series #3005

Merged 11 commits into main on Nov 4, 2021
Conversation

freddyaboulton (Contributor)

Pull Request Description

Fixes #2733



codecov bot commented Nov 2, 2021

Codecov Report

Merging #3005 (b9930e6) into main (ae89b27) will increase coverage by 0.1%.
The diff coverage is 100.0%.

Impacted file tree graph

@@           Coverage Diff           @@
##            main   #3005     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        312     312             
  Lines      29775   29853     +78     
=======================================
+ Hits       29684   29762     +78     
  Misses        91      91             
Impacted Files Coverage Δ
evalml/tests/automl_tests/test_automl.py 99.5% <ø> (ø)
.../automl_tests/test_automl_search_classification.py 100.0% <ø> (ø)
evalml/tests/component_tests/test_components.py 98.9% <ø> (ø)
...rmers/preprocessing/delayed_feature_transformer.py 100.0% <100.0%> (ø)
...mponent_tests/test_delayed_features_transformer.py 100.0% <100.0%> (ø)
.../tests/pipeline_tests/test_time_series_pipeline.py 99.8% <100.0%> (+0.1%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

X=X, y=y
),
DelayedFeatureTransformer(
    max_delay=3, gap=0, forecast_horizon=1, conf_level=1.0
Contributor Author

Using a conf_level of 1.0 selects all lags. Some of these tests rely on the DelayedFeatureTransformer selecting all lags.

@bchen1116 (Contributor) left a comment

Great work on this! You got this out so fast. I think one thing that we should add before merging this would be some documentation, maybe here, that would allow users to understand what this component is doing and how changing conf_level could change the lags used. This would help with un-black-boxing the component.

Otherwise, left some nits and some questions. Good stuff though!

@@ -14,6 +18,9 @@ class DelayedFeatureTransformer(Transformer):
date_index (str): Name of the column containing the datetime information used to order the data. Ignored.
max_delay (int): Maximum number of time units to delay each feature. Defaults to 2.
forecast_horizon (int): The number of time periods the pipeline is expected to forecast.
conf_level (float, None): Float between 0 and 1 that determines the confidence interval size used to select
Contributor

nit: add default value here

Contributor Author

Absolutely. Will get rid of the outdated None and add some validation that conf_level is in the range (0, 1].

# Return lags that are significant peaks or the first 10 significant lags
index = np.arange(len(acf_values))
significant = np.logical_or(ci_intervals[:, 0] > 0, ci_intervals[:, 1] < 0)
first_significant_10 = index[:10][significant[:10]]
Contributor

This won't necessarily return the first 10 significant lags, right? It will only return 10 lags if all of them are significant; if a few of them aren't significant, it returns fewer lags. If that's what's happening, can we update the comment above to say that?

Contributor Author

Yeah, I can see why this is confusing now. The intention is to select the significant lags in the range [0, 10], not the first 10 significant lags.
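The distinction is easy to illustrate with a small, self-contained numpy example (the significance mask here is made up for demonstration):

```python
import numpy as np

# Hypothetical significance mask over 15 lags (True = CI excludes zero).
significant = np.zeros(15, dtype=bool)
significant[[2, 5, 12]] = True
index = np.arange(15)

# index[:10][significant[:10]] keeps the significant lags *within*
# the range [0, 10), not the first 10 significant lags overall.
first_significant_10 = index[:10][significant[:10]]
print(first_significant_10)  # lag 12 falls outside the window: [2 5]
```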

significant_lags = (
    set(index[significant]).intersection(peaks).union(first_significant_10)
)
# If not lags are significant get the first lag
Contributor

nit: if no lags are significant...

@@ -14,6 +18,9 @@ class DelayedFeatureTransformer(Transformer):
date_index (str): Name of the column containing the datetime information used to order the data. Ignored.
max_delay (int): Maximum number of time units to delay each feature. Defaults to 2.
forecast_horizon (int): The number of time periods the pipeline is expected to forecast.
conf_level (float, None): Float between 0 and 1 that determines the confidence interval size used to select
which lags to compute from the set of [1, max_delay]. A delay of 1 will always be computed. If 1,
selects all possible lags in the set of [1, max_delay], inclusive.
Contributor

I think we should clarify here that the range is actually (0, 1], since passing in 0 here throws an error, while passing 1 works.

Contributor Author

Good catch. Added some parameter validation as well.
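A minimal sketch of what such validation could look like — the function name and error messages here are assumptions for illustration, not the actual evalml code:

```python
def _check_conf_level(conf_level):
    # Hypothetical validation: conf_level must lie in the half-open
    # interval (0, 1]. None and 0 are rejected; 1.0 is allowed and
    # selects all lags.
    if conf_level is None:
        raise ValueError("Parameter conf_level cannot be None.")
    if not 0 < conf_level <= 1:
        raise ValueError(
            f"Parameter conf_level must be in range (0, 1], received {conf_level}."
        )

_check_conf_level(1.0)   # ok: selects all lags
_check_conf_level(0.05)  # ok
```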


first_significant_10 = [l for l in significant_lags if l < 10]
expected_lags = (
    set(significant_lags + peaks).intersection(peaks).union(first_significant_10)
Contributor

Doesn't set(significant_lags + peaks).intersection(peaks) just reduce to set(peaks)?
I think since we wanted the significant peaks, it should just be set(significant_lags).intersection(peaks)?

Contributor Author

Yep, making this change now!
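The set-algebra point is easy to verify in isolation (the lists below are arbitrary sample values):

```python
significant_lags = [1, 3, 7]
peaks = [3, 5]

# set(significant_lags + peaks) is always a superset of peaks, so
# intersecting it with peaks collapses to set(peaks) regardless of
# which lags were significant.
reduced = set(significant_lags + peaks).intersection(peaks)
print(reduced == set(peaks))  # True

# The intended "significant peaks" drop the union with peaks first:
significant_peaks = set(significant_lags).intersection(peaks)
print(sorted(significant_peaks))  # [3]
```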

@freddyaboulton (Contributor Author)

@bchen1116 Added details on the algorithm to the component docstring and added a link from the user guide to the component reference since I didn't want to jump deep into the specifics of a component in the user guide. Thanks for the feedback though, the documentation definitely needed improvement.

@ParthivNaresh (Contributor) left a comment

Looks good! In your perf tests it showed that Daily Female Births had slightly lower validation results while Southern Oscillations had slightly better. It would be interesting to see how Prophet and ARIMA perform (once we enable DelayedFeatureTransformer for ARIMA).

@staticmethod
def _find_significant_lags(y, conf_level, max_delay):
    all_lags = np.arange(max_delay + 1)
    if conf_level is not None and y is not None:
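Piecing this truncated hunk together with the fragments quoted earlier, the selection logic might look roughly like the sketch below. This is an illustration under assumptions, not the actual evalml implementation: acf_values, ci_intervals, and peaks are taken as precomputed inputs (in the real component they would be derived from y), and the function name is invented.

```python
import numpy as np

def find_significant_lags(acf_values, ci_intervals, peaks, max_delay):
    # A lag is "significant" when its confidence interval excludes zero,
    # i.e. the whole interval sits above or below the x-axis.
    index = np.arange(len(acf_values))
    significant = np.logical_or(ci_intervals[:, 0] > 0, ci_intervals[:, 1] < 0)
    # Significant lags within [0, 10), not the first 10 significant lags.
    first_significant_10 = index[:10][significant[:10]]
    # Keep peaks that are significant, plus the early significant lags.
    significant_lags = (
        set(index[significant]).intersection(peaks).union(first_significant_10)
    )
    # If no lags are significant, fall back to the first lag.
    if not significant_lags:
        significant_lags = {1}
    return sorted(lag for lag in significant_lags if lag <= max_delay)
```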
Contributor

Is it necessary to check that conf_level is not None here?


Contributor Author

I think what happened is that originally I set the default value to None to mean "no lags", but that doesn't work well with our tuners, so I made it a float. I'll move the None check to the init! Thank you.

@bchen1116 (Contributor) left a comment

LGTM! Thanks for making the changes! Agreed with @ParthivNaresh about conf_level=None, but other than that, great work!

@freddyaboulton freddyaboulton merged commit 8985ea2 into main Nov 4, 2021
@freddyaboulton freddyaboulton deleted the 2733-smarter-lags-time-series branch November 4, 2021 17:34
@chukarsten chukarsten mentioned this pull request Nov 9, 2021
Successfully merging this pull request may close these issues.

Smarter lagging for time series
3 participants