Integrate Time Series Classification Pipelines into AutoMLSearch #1666
Conversation
assert problem_config == {"max_delay": 2, "gap": 3}
...
@patch('evalml.pipelines.TimeSeriesRegressionPipeline.score', return_value={"R2": 0.3})
Moving this test to test_automl_search_regression.py
Codecov Report
@@ Coverage Diff @@
## main #1666 +/- ##
=========================================
+ Coverage 100.0% 100.0% +0.1%
=========================================
Files 240 240
Lines 18625 18652 +27
=========================================
+ Hits 18617 18644 +27
Misses 8 8
pp_components.append(DateTimeFeaturizer)
...
- if problem_type in [ProblemTypes.TIME_SERIES_REGRESSION]:
+ if is_time_series(problem_type):
So that AutoML can create pipelines for time series classification when allowed_pipelines=None.
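The change discussed above can be sketched as follows; the `ProblemTypes` enum here is a simplified stand-in for evalml's actual enum, and the old check it replaces is shown in the comment.

```python
from enum import Enum


class ProblemTypes(Enum):
    # simplified stand-in for evalml's ProblemTypes enum
    REGRESSION = "regression"
    BINARY = "binary"
    MULTICLASS = "multiclass"
    TIME_SERIES_REGRESSION = "time series regression"
    TIME_SERIES_BINARY = "time series binary"
    TIME_SERIES_MULTICLASS = "time series multiclass"


def is_time_series(problem_type):
    # covers regression *and* classification, unlike the old
    # `problem_type in [ProblemTypes.TIME_SERIES_REGRESSION]` check
    return problem_type in (
        ProblemTypes.TIME_SERIES_REGRESSION,
        ProblemTypes.TIME_SERIES_BINARY,
        ProblemTypes.TIME_SERIES_MULTICLASS,
    )
```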
class TimeSeriesBaselineRegressionPipeline(TimeSeriesRegressionPipeline):
Consolidating all of the baseline pipelines for ts into the same file
I like this
@pytest.mark.parametrize("pipeline_class", [TimeSeriesBinaryClassificationPipeline,
                                            TimeSeriesMulticlassClassificationPipeline])
@pytest.mark.parametrize("use_none_X", [True, False])
def test_score_works_with_estimator_uses_y(use_none_X, pipeline_class, X_y_binary, X_y_multi):
We have coverage for this case in test_time_series_baseline_pipeline.py
chukarsten left a comment:
Just some small, comment-related stuff. Looks good though. Hefty, hefty.
evalml/pipelines/components/estimators/regressors/time_series_baseline_estimator.py (review thread resolved)
def predict_proba(self, X, y=None):
    if y is None:
        raise ValueError("Cannot predict Time Series Baseline Regressor if y is None")
Regressor -> Estimator?
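One way to address this naming nit, sketched with a hypothetical stub rather than the real evalml class: derive the name in the error message from the class itself, so it stays accurate after the Regressor-to-Estimator rename.

```python
class TimeSeriesBaselineEstimator:
    """Hypothetical stub illustrating the error-message fix suggested above."""

    def predict_proba(self, X, y=None):
        if y is None:
            # build the message from the class name so a future rename
            # cannot leave a stale "Regressor" in the error text
            raise ValueError(f"Cannot predict {type(self).__name__} if y is None")
        return y  # placeholder; the real estimator computes probabilities
```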
- from evalml.preprocessing.utils import is_time_series
- from evalml.problem_types import ProblemTypes
+ from evalml.problem_types import ProblemTypes, is_time_series
That's probably a smart move.
ParthivNaresh left a comment:
Looks good to me! Just a few docstring updates
class TimeSeriesBaselineRegressionPipeline(TimeSeriesRegressionPipeline):
I like this
@property
def feature_importance(self):
    """Returns importance associated with each feature. Since baseline regressors do not use input features to calculate predictions, returns an array of zeroes.
Might need to change regressors to estimators here as well
# In case gap is 0 and this is a baseline pipeline, we drop the nans in the
# predictions before decoding them
predictions = pd.Series(self._decode_targets(predictions.dropna()), name=self.input_target_name)
Good catch!
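The fix being praised here: when gap is 0, the baseline's first prediction is NaN, so NaNs must be dropped before decoding predictions back to their original labels. A minimal illustration with pandas; the label mapping and values are made up for the example.

```python
import pandas as pd

# raw baseline predictions: the first entry is NaN when gap == 0
predictions = pd.Series([float("nan"), 0.0, 1.0, 1.0])

# hypothetical encoder mapping used for decoding
decode = {0: "no_fraud", 1: "fraud"}

# drop NaNs first, then decode; otherwise the NaN cannot be cast to int
decoded = predictions.dropna().astype(int).map(decode)
# → ["no_fraud", "fraud", "fraud"] at index positions 1..3
```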
bchen1116 left a comment:
Not sure if this is supposed to work currently, but:
[screenshot]
Additionally, something we didn't catch previously: we should change the data checks we use for time series problems, specifically the class_imbalance_data_check, since TimeSeriesSplit doesn't shuffle/stratify the data the way the other data splitters do:
[screenshot]
This can definitely be filed as a separate issue.
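The concern about class_imbalance_data_check can be seen directly with scikit-learn's TimeSeriesSplit, which evalml's splitter is built on: training indices are always a contiguous prefix of the series, so nothing is shuffled or stratified, and a minority class clustered late in the series can be entirely absent from early training folds.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)
splits = list(TimeSeriesSplit(n_splits=3).split(X))
for train_idx, test_idx in splits:
    # train indices always start at 0 and grow forward in time
    print(train_idx, test_idx)
# [0 1 2 3] [4 5]
# [0 1 2 3 4 5] [6 7]
# [0 1 2 3 4 5 6 7] [8 9]
```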
The rest looks good to me!
y = _convert_woodwork_types_wrapper(y.to_series())
preds = self.predict(X, y).dropna(axis=0, how='any').astype('int')
proba_arr = np.zeros((len(preds), y.max() + 1))
proba_arr[np.arange(len(preds)), preds] = 1
Nice!
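The trick being applauded, in isolation: NumPy fancy indexing builds a one-hot probability matrix from integer class predictions in a single vectorized assignment. The prediction values below are illustrative.

```python
import numpy as np

preds = np.array([0, 2, 1, 2])           # integer class predictions
n_classes = preds.max() + 1

proba = np.zeros((len(preds), n_classes))
# set exactly one entry per row: row i, column preds[i]
proba[np.arange(len(preds)), preds] = 1
```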
"Time Series Baseline Estimator": {'gap': gap, 'max_delay': 1}})
expected_y = y
if gap == 0:
    expected_y = y.shift(1)
could shorten this to:
expected_y = y.shift(1) if gap == 0 else y
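For reference, the suggested one-liner behaves like this on a toy series (values made up): with gap == 0 the expected target is the series shifted forward one step, leaving a NaN in the first position.

```python
import pandas as pd

y = pd.Series([10, 20, 30])
gap = 0

# conditional expression replacing the two-branch if above
expected_y = y.shift(1) if gap == 0 else y
# with gap == 0: NaN, 10.0, 20.0
```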
@bchen1116 Thanks for the comments! I filed #1681 to track updates to the class imbalance datacheck. The issue with the fraud dataset is closely related to #1507: the pipelines are delaying categorical features (which introduces NaNs) and then trying to OHE the delayed categorical columns (which can't handle NaNs). I think we need a way of identifying which features depend on time and which don't. For fraud, we may want a record of the currencies used in the last k transactions, but we shouldn't delay features that don't depend on time (like customer_id, if present). I'll add some notes to #1507; we may have to file a separate issue.
Pull Request Description
Fixes #1580
- Modified _set_data_split to choose TimeSeriesSplit for time series classification pipelines
- Deleted test_score_works_with_estimator_uses_y in test_time_series_pipeline.py and reused the tests in test_time_series_baseline_regression.py. See https://github.com/alteryx/evalml/pull/1651/files#r553425518

Minor cosmetic changes that make the diff look bigger than it is:
- Renamed TimeSeriesBaselineRegressor to TimeSeriesBaselineEstimator since it works for all time series problem types
- Moved evalml/pipelines/regression/time_series_baseline_regression.py to evalml/pipelines/time_series_baselines.py since it contains the definitions for all ts baseline pipelines
- Moved evalml/tests/pipeline_tests/regression_pipeline_tests/test_time_series_baseline_regression.py to evalml/tests/pipeline_tests/test_time_series_baseline_pipeline.py

After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:123.