-
Notifications
You must be signed in to change notification settings - Fork 85
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support time series in DefaultAlgorithm
#3177
Conversation
Codecov Report
@@ Coverage Diff @@
## main #3177 +/- ##
=======================================
- Coverage 99.7% 99.0% -0.7%
=======================================
Files 326 326
Lines 31390 31268 -122
=======================================
- Hits 31286 30939 -347
- Misses 104 329 +225
Continue to review full report at Codecov.
|
DefaultAlgorithm
for Time SeriesDefaultAlgorithm
next_batch = self._create_n_pipelines( | ||
self._top_n_pipelines, self.num_long_pipelines_per_batch | ||
) | ||
if self._batch_number == 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This means DefaultAlgorithm
will have different batches for TS and non-TS problems as ensembling is turned off for TS. Moreover, the pipeline structures are different for TS problems as well (we defer pipeline creation to _make_pipeline_time_series
which incorporates the KIA work @freddyaboulton did). For the future but this begs the question: should we have a separate automl algorithm implementation for TS problems vs non-TS problems? I can envision the batch structure being the same (once ensembling for TS is fixed) but with a different implementation (pipelines are different etc.).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the mechanics of this algorithm for non-time series pipelines yield good results when applied to time series pipelines without violating any "rules of machine learning", e.g. data-leakage etc, then we should use the same algorithm.
I suspect that will be the case when we get ensembling to work for time series.
], | ||
) | ||
@patch("evalml.pipelines.components.FeatureSelector.get_names") | ||
def test_default_algorithm_time_series( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once #2867 goes in, any AutoML test that used IterativeAlgorithm
will use DefaultAlgorithm
instead. So I won't duplicate coverage for those tests in this PR.
@@ -56,7 +56,7 @@ def test_nullable_types_builds_pipelines( | |||
if automl_algorithm == "iterative": | |||
pipelines = [pl.name for pl in automl.allowed_pipelines] | |||
elif automl_algorithm == "default": | |||
# TODO: Upon resolution of GH Issue #2691, increase the num of batches. | |||
# TODO: Upon resolution of GH Issue #3186, increase the num of batches. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to the new issue I set up to track adding ensembling support for TS problems.
@@ -110,6 +110,7 @@ def __init__( | |||
self.verbose = verbose | |||
self._selected_cat_cols = [] | |||
self._split = False | |||
self._ensembling = True if not is_time_series(self.problem_type) else False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can add ensembling as a parameter to DefaultAlgorithm
as well but currently I don't believe we have a use case for it.
) | ||
for estimator in estimators | ||
] | ||
if is_time_series(self.problem_type): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Theres a couple is_time_series
scattered in this PR and I like how explicit it is (although it does make this code more verbose). An alternative would be to move is_time_series
into _make_split_pipeline
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work, Jeremy. Nothing blocking from me, just some refactor suggestions. I think we should try to remove some of the duplicate code if the complexity isn't unmanageable! I'll leave it up to you which suggestions to take and which to pass on!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! I left a few suggestions on ways to clean up code, but nothing blocking!
Fixes #2691.
Performance test results:
ts_comp.html.zip
Results look good and matches up with past Default tests. This case is even simpler in that the pipelines are exactly the same, but just in a different search order. You can see that the only difference is very minor (<5%) variations in estimator performance in change in fit time due to having 2 additional pipelines searched in Default.