Support time series in DefaultAlgorithm #3177
Codecov Report
```diff
@@           Coverage Diff           @@
##            main   #3177     +/-  ##
=======================================
- Coverage   99.7%   99.0%    -0.7%
=======================================
  Files        326     326
  Lines      31390   31268     -122
=======================================
- Hits       31286   30939     -347
- Misses       104     329     +225
```
Continue to review full report at Codecov.
DefaultAlgorithm for Time Series
```python
next_batch = self._create_n_pipelines(
    self._top_n_pipelines, self.num_long_pipelines_per_batch
)
if self._batch_number == 0:
```
This means DefaultAlgorithm will have different batches for TS and non-TS problems, since ensembling is turned off for TS. Moreover, the pipeline structures differ for TS problems as well (we defer pipeline creation to _make_pipeline_time_series, which incorporates the KIA work @freddyaboulton did). Looking ahead, this raises the question: should we have a separate automl algorithm implementation for TS problems vs non-TS problems? I can envision the batch structure being the same (once ensembling for TS is fixed) but with a different implementation (pipelines are different, etc.).
If the mechanics of this algorithm for non-time-series pipelines yield good results when applied to time series pipelines without violating any rules of machine learning (e.g., data leakage), then we should use the same algorithm.
I suspect that will be the case when we get ensembling to work for time series.
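A toy sketch of that "shared batch mechanics, different pipeline factory" idea discussed above. All names here are illustrative stand-ins, not evalml's actual API:

```python
def is_time_series(problem_type: str) -> bool:
    # Stand-in for evalml's is_time_series utility.
    return problem_type.startswith("time series")

def make_pipeline(estimator: str, problem_type: str) -> list:
    # Only pipeline construction branches on the problem type.
    if is_time_series(problem_type):
        return ["Time Series Featurizer", estimator]
    return ["Feature Selector", estimator]

def next_batch(estimators: list, problem_type: str) -> list:
    # Shared batch logic: one pipeline per estimator, regardless of type.
    return [make_pipeline(e, problem_type) for e in estimators]

print(next_batch(["Elastic Net", "XGBoost"], "time series regression"))
```

The batch loop never needs to know which kind of problem it is serving; the branching is isolated in one factory function.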
```python
    ],
)
@patch("evalml.pipelines.components.FeatureSelector.get_names")
def test_default_algorithm_time_series(
```
Once #2867 goes in, any AutoML test that used IterativeAlgorithm will use DefaultAlgorithm instead. So I won't duplicate coverage for those tests in this PR.
```diff
     pipelines = [pl.name for pl in automl.allowed_pipelines]
 elif automl_algorithm == "default":
-    # TODO: Upon resolution of GH Issue #2691, increase the num of batches.
+    # TODO: Upon resolution of GH Issue #3186, increase the num of batches.
```
Changed to the new issue I set up to track adding ensembling support for TS problems.
```python
self.verbose = verbose
self._selected_cat_cols = []
self._split = False
self._ensembling = True if not is_time_series(self.problem_type) else False
```
I can add ensembling as a parameter to DefaultAlgorithm as well but currently I don't believe we have a use case for it.
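As a side note, the ternary in the diff (`True if not is_time_series(...) else False`) reduces to a plain negation. A minimal, self-contained sketch, with a stand-in for evalml's `is_time_series`:

```python
def is_time_series(problem_type: str) -> bool:
    # Stand-in for evalml's is_time_series utility.
    return "time series" in problem_type

# Equivalent to `True if not is_time_series(...) else False`:
ensembling_ts = not is_time_series("time series regression")
ensembling_reg = not is_time_series("regression")
print(ensembling_ts, ensembling_reg)  # -> False True
```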
```python
    )
    for estimator in estimators
]
if is_time_series(self.problem_type):
```
There are a couple of is_time_series checks scattered in this PR, and I like how explicit they are (although they do make this code more verbose). An alternative would be to move is_time_series into _make_split_pipeline.
chukarsten left a comment
Nice work, Jeremy. Nothing blocking from me, just some refactor suggestions. I think we should try to remove some of the duplicate code if the complexity isn't unmanageable! I'll leave it up to you which suggestions to take and which to pass on!
bchen1116 left a comment
LGTM! I left a few suggestions on ways to clean up code, but nothing blocking!
Fixes #2691.
Performance test results:
ts_comp.html.zip
Results look good and match past DefaultAlgorithm test results. This case is even simpler in that the pipelines are exactly the same, just searched in a different order. The only differences are very minor (<5%) variations in estimator performance and changes in fit time, due to the 2 additional pipelines searched in Default.