Support time series in DefaultAlgorithm #3177

Merged 22 commits on Jan 11, 2022

Collaborator

@jeremyliweishih jeremyliweishih commented Jan 3, 2022

Fixes #2691.

Performance test results:
ts_comp.html.zip

Results look good and match past Default tests. This case is even simpler because the pipelines are exactly the same, just searched in a different order. The only differences are very minor (<5%) variations in estimator performance and a change in fit time, which comes from Default searching 2 additional pipelines.

codecov bot commented Jan 3, 2022

Codecov Report

Merging #3177 (f36726a) into main (b469733) will decrease coverage by 0.8%.
The diff coverage is 99.7%.

@@           Coverage Diff           @@
##            main   #3177     +/-   ##
=======================================
- Coverage   99.7%   99.0%   -0.7%     
=======================================
  Files        326     326             
  Lines      31390   31268    -122     
=======================================
- Hits       31286   30939    -347     
- Misses       104     329    +225     
| Impacted Files | Coverage Δ |
|---|---|
| evalml/automl/utils.py | 98.5% <ø> (-1.5%) ⬇️ |
| ...lml/tests/automl_tests/test_iterative_algorithm.py | 92.3% <ø> (-7.7%) ⬇️ |
| evalml/tests/component_tests/test_utils.py | 95.7% <ø> (ø) |
| ...lml/tests/integration_tests/test_nullable_types.py | 100.0% <ø> (ø) |
| .../prediction_explanations_tests/test_force_plots.py | 100.0% <ø> (ø) |
| ...l_understanding_tests/test_feature_explanations.py | 100.0% <ø> (ø) |
| evalml/tests/pipeline_tests/test_pipeline_utils.py | 99.7% <ø> (ø) |
| ...ests/automl_tests/test_automl_search_regression.py | 84.3% <66.7%> (-15.7%) ⬇️ |
| ...valml/automl/automl_algorithm/default_algorithm.py | 100.0% <100.0%> (ø) |
| .../transformers/preprocessing/datetime_featurizer.py | 100.0% <100.0%> (ø) |
... and 32 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@jeremyliweishih changed the title from "[SPIKE] Changes to DefaultAlgorithm for Time Series" to "Support time series in DefaultAlgorithm" on Jan 4, 2022
@jeremyliweishih jeremyliweishih marked this pull request as ready for review January 5, 2022 21:16
next_batch = self._create_n_pipelines(
    self._top_n_pipelines, self.num_long_pipelines_per_batch
)
if self._batch_number == 0:
Collaborator Author

This means DefaultAlgorithm will have different batches for TS and non-TS problems, since ensembling is turned off for TS. The pipeline structures are also different for TS problems (we defer pipeline creation to _make_pipeline_time_series, which incorporates the KIA work @freddyaboulton did). This is a question for the future, but it raises the point: should we have a separate AutoML algorithm implementation for TS problems vs. non-TS problems? I can envision the batch structure being the same (once ensembling for TS is fixed) but with a different implementation (different pipelines, etc.).

Contributor

If the mechanics of this algorithm for non-time series pipelines yield good results when applied to time series pipelines without violating any "rules of machine learning" (for example, by introducing data leakage), then we should use the same algorithm.

I suspect that will be the case when we get ensembling to work for time series.
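
To make the difference in batch structure concrete, here is a minimal sketch of how an ensembling flag could change the sequence of batches. The batch labels, their order, and the helper names `batch_labels` and `first_batches` are illustrative assumptions, not the actual DefaultAlgorithm implementation. The only detail taken from this thread is that ensembling is turned off for time series problems.

```python
from itertools import islice
from typing import Iterator, List


def batch_labels(ensembling: bool) -> Iterator[str]:
    """Yield illustrative batch labels; the labels and order are assumptions."""
    yield "naive"
    yield "naive + feature selection"
    yield "fast search"
    if ensembling:
        yield "ensemble"
    # After the initial batches, alternate long searches with optional ensembles.
    while True:
        yield "long search"
        if ensembling:
            yield "ensemble"


def first_batches(is_time_series: bool, n: int) -> List[str]:
    """Return the first n batch labels; ensembling is skipped for time series."""
    return list(islice(batch_labels(ensembling=not is_time_series), n))


print(first_batches(is_time_series=False, n=6))
print(first_batches(is_time_series=True, n=5))
```

With the same generator serving both cases, the batch structure would converge again once ensembling is supported for time series; only the flag would need to change.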

],
)
@patch("evalml.pipelines.components.FeatureSelector.get_names")
def test_default_algorithm_time_series(
Collaborator Author

Once #2867 goes in, any AutoML test that used IterativeAlgorithm will use DefaultAlgorithm instead. So I won't duplicate coverage for those tests in this PR.

@@ -56,7 +56,7 @@ def test_nullable_types_builds_pipelines(
 if automl_algorithm == "iterative":
     pipelines = [pl.name for pl in automl.allowed_pipelines]
 elif automl_algorithm == "default":
-    # TODO: Upon resolution of GH Issue #2691, increase the num of batches.
+    # TODO: Upon resolution of GH Issue #3186, increase the num of batches.
Collaborator Author

Changed to the new issue I set up to track adding ensembling support for TS problems.

@@ -110,6 +110,7 @@ def __init__(
 self.verbose = verbose
 self._selected_cat_cols = []
 self._split = False
+self._ensembling = True if not is_time_series(self.problem_type) else False
Collaborator Author

I can add ensembling as a parameter to DefaultAlgorithm as well but currently I don't believe we have a use case for it.
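
For reference, a rough sketch of what an `ensembling` parameter could look like if a use case comes up. Everything here is hypothetical: `DefaultAlgorithmSketch` is a stand-in class, `is_time_series` is a simplified string-based stand-in for the evalml helper, and raising `ValueError` when ensembling is requested for a time series problem is an assumed design choice, not current behavior.

```python
from typing import Optional

# Assumed string values for the time series problem types (stand-in only).
TIME_SERIES_PROBLEM_TYPES = {
    "time series regression",
    "time series binary",
    "time series multiclass",
}


def is_time_series(problem_type: str) -> bool:
    """Simplified stand-in for evalml's helper of the same name."""
    return problem_type in TIME_SERIES_PROBLEM_TYPES


class DefaultAlgorithmSketch:
    """Hypothetical constructor showing an optional ensembling parameter."""

    def __init__(self, problem_type: str, ensembling: Optional[bool] = None):
        self.problem_type = problem_type
        supported = not is_time_series(problem_type)
        if ensembling is None:
            # Default matches the diff above: on unless the problem is time series.
            ensembling = supported
        elif ensembling and not supported:
            raise ValueError(
                "Ensembling is not supported for time series problems."
            )
        self._ensembling = ensembling
```

Defaulting to `None` keeps the existing behavior for callers that never pass the argument, and `not is_time_series(...)` is equivalent to the `True if not ... else False` expression in the diff.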

)
for estimator in estimators
]
if is_time_series(self.problem_type):
Collaborator Author

There are a couple of is_time_series checks scattered through this PR, and I like how explicit they are (although they do make the code more verbose). An alternative would be to move the is_time_series check into _make_split_pipeline.
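
A small sketch of that alternative, under stated assumptions: the function names, signatures, and tuple return values below are hypothetical stand-ins (the real `_make_split_pipeline` signature is not shown in this thread), and `is_time_series` is simplified to operate on plain strings.

```python
from typing import List, Tuple


def is_time_series(problem_type: str) -> bool:
    """Simplified stand-in for evalml's helper (assumes plain string types)."""
    return problem_type.startswith("time series")


# Hypothetical builders; the real pipeline construction is not shown here.
def _standard_pipeline(estimator: str) -> Tuple[str, str]:
    return ("standard", estimator)


def _time_series_pipeline(estimator: str) -> Tuple[str, str]:
    return ("time series", estimator)


def make_split_pipeline(problem_type: str, estimator: str) -> Tuple[str, str]:
    """Single place that decides which builder to use."""
    if is_time_series(problem_type):
        return _time_series_pipeline(estimator)
    return _standard_pipeline(estimator)


def build_batch(problem_type: str, estimators: List[str]) -> List[Tuple[str, str]]:
    # The caller no longer needs its own is_time_series check.
    return [make_split_pipeline(problem_type, e) for e in estimators]
```

The trade-off is the one described above: call sites get shorter, but the time series branch is no longer visible where the batch is assembled.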

Contributor

@chukarsten left a comment

Nice work, Jeremy. Nothing blocking from me, just some refactor suggestions. I think we should try to remove some of the duplicate code if the complexity isn't unmanageable! I'll leave it up to you which suggestions to take and which to pass on!

Contributor

@bchen1116 left a comment

LGTM! I left a few suggestions on ways to clean up code, but nothing blocking!

@jeremyliweishih jeremyliweishih enabled auto-merge (squash) January 11, 2022 22:18
@jeremyliweishih jeremyliweishih merged commit 70d8dc9 into main Jan 11, 2022
@chukarsten chukarsten mentioned this pull request Jan 18, 2022
@freddyaboulton freddyaboulton deleted the js_2691_ts_ensemble branch May 13, 2022 15:03
Linked issue (closed by this pull request): Support time series in new automl algo