Integrate Time Series Classification Pipelines into AutoMLSearch #1666

Merged
merged 12 commits into main from 1580-integrate-ts-classification-into-automl on Jan 13, 2021

Conversation

freddyaboulton
Contributor

@freddyaboulton freddyaboulton commented Jan 7, 2021

Pull Request Description

Fixes #1580

  • Update _set_data_split to choose TimeSeriesSplit for time series classification pipelines (see the sketch after this list)
  • Set log loss as the default objective for time series classification problems
  • Update the baseline time series regressor to work for classification problems
  • See if we can delete test_score_works_with_estimator_uses_y in test_time_series_pipeline.py and reuse the tests in test_time_series_baseline_regression.py. See https://github.com/alteryx/evalml/pull/1651/files#r553425518
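
For illustration, the first two bullets amount to a dispatch on problem type. The following is a minimal, self-contained sketch of that idea rather than the evalml implementation: it uses scikit-learn's TimeSeriesSplit and a toy ProblemType enum as stand-ins for evalml's internal splitter and ProblemTypes, and choose_data_splitter/default_objective are hypothetical helper names.

# Hypothetical sketch of the dispatch described above -- not the evalml source.
from enum import Enum

from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit


class ProblemType(Enum):
    BINARY = "binary"
    MULTICLASS = "multiclass"
    REGRESSION = "regression"
    TIME_SERIES_BINARY = "time series binary"
    TIME_SERIES_MULTICLASS = "time series multiclass"
    TIME_SERIES_REGRESSION = "time series regression"


def is_time_series(problem_type):
    return problem_type.value.startswith("time series")


def choose_data_splitter(problem_type, n_splits=3):
    # Time series problems must keep temporal order, so they get TimeSeriesSplit;
    # everything else can use a shuffled, stratified splitter.
    if is_time_series(problem_type):
        return TimeSeriesSplit(n_splits=n_splits)
    return StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)


def default_objective(problem_type):
    # Log loss becomes the default objective for (time series) classification problems.
    if problem_type in (ProblemType.BINARY, ProblemType.TIME_SERIES_BINARY):
        return "Log Loss Binary"
    if problem_type in (ProblemType.MULTICLASS, ProblemType.TIME_SERIES_MULTICLASS):
        return "Log Loss Multiclass"
    return "R2"


print(choose_data_splitter(ProblemType.TIME_SERIES_BINARY))   # TimeSeriesSplit(...)
print(default_objective(ProblemType.TIME_SERIES_MULTICLASS))  # Log Loss Multiclass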

Minor cosmetic changes that make the diff bigger than it is:

  • Rename TimeSeriesBaselineRegressor to TimeSeriesBaselineEstimator since it works for all time series problem types
  • Move evalml/pipelines/regression/time_series_baseline_regression.py to evalml/pipelines/time_series_baselines.py since it contains the definitions for all ts baseline pipelines.
  • Move evalml/tests/pipeline_tests/regression_pipeline_tests/test_time_series_baseline_regression.py to evalml/tests/pipeline_tests/test_time_series_baseline_pipeline.py

After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:`123`.

@@ -1976,32 +1972,6 @@ def test_automl_validates_problem_configuration(X_y_binary):
assert problem_config == {"max_delay": 2, "gap": 3}


@patch('evalml.pipelines.TimeSeriesRegressionPipeline.score', return_value={"R2": 0.3})
Contributor Author

Moving this test to test_automl_search_regression.py

@codecov

codecov bot commented Jan 7, 2021

Codecov Report

Merging #1666 (c7fcabe) into main (ba7590f) will increase coverage by 0.1%.
The diff coverage is 100.0%.

Impacted file tree graph

@@            Coverage Diff            @@
##             main    #1666     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         240      240             
  Lines       18625    18652     +27     
=========================================
+ Hits        18617    18644     +27     
  Misses          8        8             
Impacted Files Coverage Δ
evalml/pipelines/components/__init__.py 100.0% <ø> (ø)
evalml/pipelines/components/estimators/__init__.py 100.0% <ø> (ø)
evalml/pipelines/regression/__init__.py 100.0% <ø> (ø)
.../tests/pipeline_tests/test_time_series_pipeline.py 100.0% <ø> (ø)
evalml/automl/automl_search.py 99.7% <100.0%> (+0.1%) ⬆️
evalml/automl/utils.py 100.0% <100.0%> (ø)
evalml/pipelines/__init__.py 100.0% <100.0%> (ø)
...lines/components/estimators/regressors/__init__.py 100.0% <100.0%> (ø)
...ators/regressors/time_series_baseline_estimator.py 100.0% <100.0%> (ø)
evalml/pipelines/time_series_baselines.py 100.0% <100.0%> (ø)
... and 13 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ba7590f...c7fcabe. Read the comment docs.

@freddyaboulton freddyaboulton self-assigned this Jan 8, 2021
@freddyaboulton freddyaboulton added the enhancement ("An improvement to an existing feature.") label Jan 8, 2021
@@ -67,7 +71,7 @@ def _get_preprocessing_components(X, y, problem_type, text_columns, estimator_cl
if add_datetime_featurizer:
pp_components.append(DateTimeFeaturizer)

- if problem_type in [ProblemTypes.TIME_SERIES_REGRESSION]:
+ if is_time_series(problem_type):
Contributor Author

So that AutoML can create pipelines for ts classification when allowed_pipelines=None
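
For context, this is roughly the search call the change enables when allowed_pipelines is left as None. Illustrative usage only: the argument names (X_train/y_train, the problem_configuration keys) are assumed from the surrounding diff and tests and may differ slightly across evalml releases.

# Illustrative usage -- argument names assumed, not taken from this diff.
import pandas as pd
from evalml.automl import AutoMLSearch

X = pd.DataFrame({"feature": range(100)})
y = pd.Series([0, 1] * 50)

automl = AutoMLSearch(
    X_train=X,
    y_train=y,
    problem_type="time series binary",
    problem_configuration={"gap": 0, "max_delay": 2},
    allowed_pipelines=None,  # AutoML now builds the ts classification pipelines itself
)
automl.search()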

)


class TimeSeriesBaselineRegressionPipeline(TimeSeriesRegressionPipeline):
Contributor Author

Consolidating all of the baseline pipelines for ts into the same file

Contributor

I like this

@pytest.mark.parametrize("pipeline_class", [TimeSeriesBinaryClassificationPipeline,
TimeSeriesMulticlassClassificationPipeline])
@pytest.mark.parametrize("use_none_X", [True, False])
def test_score_works_with_estimator_uses_y(use_none_X, pipeline_class, X_y_binary, X_y_multi):
Contributor Author

We have coverage for this case in test_time_series_baseline_pipeline.py

@freddyaboulton freddyaboulton force-pushed the 1580-integrate-ts-classification-into-automl branch from 87ff77b to 0a51e85 Compare January 8, 2021 18:03
@freddyaboulton freddyaboulton marked this pull request as ready for review January 8, 2021 18:51
Collaborator

@chukarsten chukarsten left a comment

Just some small, comment-related stuff. Looks good though. Hefty, hefty.

evalml/automl/automl_search.py (resolved review thread, not shown)
@@ -63,6 +65,16 @@ def predict(self, X, y=None):

return y

def predict_proba(self, X, y=None):
if y is None:
raise ValueError("Cannot predict Time Series Baseline Regressor if y is None")
Collaborator

Regressor -> Estimator?

@@ -51,8 +51,7 @@
make_pipeline,
make_pipeline_from_components
)
- from evalml.preprocessing.utils import is_time_series
- from evalml.problem_types import ProblemTypes
+ from evalml.problem_types import ProblemTypes, is_time_series
Collaborator

That's probably a smart move.

Contributor

@ParthivNaresh ParthivNaresh left a comment

Looks good to me! Just a few docstring updates

)


class TimeSeriesBaselineRegressionPipeline(TimeSeriesRegressionPipeline):
Contributor

I like this

proba_arr = np.zeros((len(preds), y.max() + 1))
proba_arr[np.arange(len(preds)), preds] = 1
return pad_with_nans(pd.DataFrame(proba_arr), len(y) - len(preds))

@property
def feature_importance(self):
"""Returns importance associated with each feature. Since baseline regressors do not use input features to calculate predictions, returns an array of zeroes.
Contributor

Might need to change regressors to estimators here as well


# In case gap is 0 and this is a baseline pipeline, we drop the nans in the
# predictions before decoding them
predictions = pd.Series(self._decode_targets(predictions.dropna()), name=self.input_target_name)
Contributor

Good catch!
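
As a toy illustration of that gap-0 case (made-up values and a hypothetical decode_map, not pipeline code): the baseline effectively predicts the previous target, so the first prediction is NaN and has to be dropped before the encoded labels are mapped back to the original classes.

# Toy illustration of the gap=0 baseline behaviour described above.
import pandas as pd

y_encoded = pd.Series([0, 1, 1, 0, 1])    # targets after label encoding
predictions = y_encoded.shift(1)          # baseline with gap=0: [NaN, 0, 1, 1, 0]

decode_map = {0: "no_fraud", 1: "fraud"}  # hypothetical label decoding
decoded = predictions.dropna().astype(int).map(decode_map)
print(decoded.tolist())                   # ['no_fraud', 'fraud', 'fraud', 'no_fraud']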

@freddyaboulton freddyaboulton force-pushed the 1580-integrate-ts-classification-into-automl branch from 0a51e85 to 8ab890b Compare January 11, 2021 15:17
Contributor

@bchen1116 bchen1116 left a comment

Not sure if this is supposed to work currently, but:
[screenshot]

Additionally, something we didn't catch previously: we should change the data checks we use for time series problems, specifically the class_imbalance_data_check, since TimeSeriesSplit doesn't necessarily shuffle/stratify the data the way the other data splitters do:
[screenshot]
This can definitely be filed as a separate issue.

The rest looks good to me!

y = _convert_woodwork_types_wrapper(y.to_series())
preds = self.predict(X, y).dropna(axis=0, how='any').astype('int')
proba_arr = np.zeros((len(preds), y.max() + 1))
proba_arr[np.arange(len(preds)), preds] = 1
Contributor

Nice!
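
For readers skimming the diff, the two lines above build a one-hot "probability" matrix out of hard class predictions. A standalone NumPy rendering of the same pattern, with made-up data:

# Standalone version of the one-hot probability trick used above.
import numpy as np

preds = np.array([2, 0, 1, 2])       # hard class predictions
n_classes = preds.max() + 1          # assumes classes are labelled 0..max, as in the diff

proba = np.zeros((len(preds), n_classes))
proba[np.arange(len(preds)), preds] = 1   # row i gets probability 1 for class preds[i]
print(proba)
# [[0. 0. 1.]
#  [1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]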

"Time Series Baseline Estimator": {'gap': gap, 'max_delay': 1}})
expected_y = y
if gap == 0:
expected_y = y.shift(1)
Contributor

could shorten this to

expected_y = y.shift(1) if gap == 0 else y

@freddyaboulton
Contributor Author

freddyaboulton commented Jan 12, 2021

@bchen1116 Thanks for the comments! I filed #1681 to track updates to the class imbalance data check. The issue with the fraud dataset is closely related to #1507: the pipelines delay categorical features (which introduces NaNs) and then try to one-hot encode the delayed categorical columns, and the OHE can't handle NaNs. I think the underlying problem is that we need a way of identifying which features depend on time and which don't: for fraud we may want a record of the currencies used in the last k transactions, but we shouldn't delay features that don't depend on time (like customer_id, if present). I'll add some notes to #1507; we may have to file a separate issue.
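
A minimal toy reproduction of that failure mode (made-up data, not the fraud dataset): delaying a categorical column introduces NaNs in the first rows, which the downstream one-hot encoder can't handle, while a time-independent column like customer_id arguably shouldn't be delayed at all.

# Toy data showing how delaying a categorical feature introduces NaNs.
import pandas as pd

X = pd.DataFrame({"currency": ["USD", "EUR", "USD", "GBP"],
                  "customer_id": [101, 102, 101, 103]})

# Delay (lag) the categorical feature by one step, as the ts pipelines do.
X["currency_delay_1"] = X["currency"].shift(1)
print(X)
#   currency  customer_id currency_delay_1
# 0      USD          101              NaN   <- the NaN the one-hot encoder chokes on
# 1      EUR          102              USD
# 2      USD          101              EUR
# 3      GBP          103              USD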

@freddyaboulton freddyaboulton force-pushed the 1580-integrate-ts-classification-into-automl branch from 8ab890b to 307830e Compare January 12, 2021 19:59
@freddyaboulton freddyaboulton force-pushed the 1580-integrate-ts-classification-into-automl branch from 307830e to c636e3b Compare January 12, 2021 22:08
@freddyaboulton freddyaboulton merged commit f49c327 into main Jan 13, 2021
@freddyaboulton freddyaboulton deleted the 1580-integrate-ts-classification-into-automl branch January 13, 2021 17:50
@bchen1116 bchen1116 mentioned this pull request Jan 26, 2021