Classification Pipelines for Time Series #1528
Conversation
Codecov Report
@@            Coverage Diff             @@
##             main    #1528      +/-   ##
==========================================
+ Coverage   100.0%   100.0%     +0.1%
==========================================
  Files         238      239        +1
  Lines       17136    17406      +270
==========================================
+ Hits        17128    17398      +270
  Misses          8        8
Continue to review full report at Codecov.
@@ -24,5 +25,9 @@ def _check_for_fit(self, X=None, y=None):
    elif y is None:
        return method(self, X)
    else:
    # For time series classification pipelines, predict will take X, y, objective
The problem is that TimeSeriesBinaryClassification.predict accepts X, y, and objective. All other pipeline classes accept X and y (TimeSeriesRegressionPipeline), X and objective (BinaryClassificationPipeline, MulticlassClassificationPipeline), or just X (ClassificationPipeline).
I think adding this extra check for the number of arguments in predict is the lowest-touch way of handling the complexity (as opposed to adding another metaclass), especially since this class is not user-facing and we don't expect users to define their own pipeline base classes.
That being said, maybe we can refactor this metaclass definition to make it a little clearer what's going on?
I think checking the number of arguments is a good idea. Maybe we can just refactor it into the usual if/else logic?
if not self._is_fitted:
    raise PipelineNotYetFittedError(f'This {klass} is not fitted yet. You must fit {klass} before calling {method.__name__}.')
elif X is None and y is None:
    return method(self)
elif y is None:
    return method(self, X)
# For time series classification pipelines, predict will take X, y, objective
elif len(inspect.getfullargspec(method).args) == 4:
    return method(self, X, y, objective)
else:
    # For other pipelines, predict will take X, y or X, objective
    return method(self, X, y)
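As a quick self-contained illustration of how the argument count separates the signatures (the names below are only for demonstration, not evalml code):

import inspect

class _Demo:
    def predict_ts(self, X, y, objective):        # time series classification style
        pass

    def predict_other(self, X, objective=None):   # binary/multiclass classification style
        pass

# getfullargspec(...).args includes `self`, so the time series signature has 4 entries
print(len(inspect.getfullargspec(_Demo.predict_ts).args))     # 4
print(len(inspect.getfullargspec(_Demo.predict_other).args))  # 3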
Done!
X = _convert_woodwork_types_wrapper(X.to_dataframe())
y = _convert_woodwork_types_wrapper(y.to_series())
self._encoder.fit(y)
y = self._encode_targets(y)
Throughout, we need to encode targets before computing features because we allow the user to pass in string-valued targets. If there is a DelayedFeatureTransformer in the pipeline, these string values will get added to the feature matrix, and if there isn't a OneHotEncoder, the estimator will crash.
Long story short, this is a repeat of #1477. I think it would be better if we pass this complexity to the DelayedFeatureTransformer and have it encode the target before delaying, but we should not let that block this PR. Let me know what you guys think. I'm happy to file something for updating the DelayedFeatureTransformer!
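A minimal sketch of that idea, assuming a hypothetical helper rather than the actual DelayedFeatureTransformer API: encode the (possibly string-valued) target to integers first, so the lagged columns built from it are numeric:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

def delayed_target_features(y, max_delay):
    # Encode string labels to integers, then build lagged copies of the encoded target.
    y_encoded = pd.Series(LabelEncoder().fit_transform(y), index=y.index)
    return pd.DataFrame({f"target_delay_{d}": y_encoded.shift(d) for d in range(1, max_delay + 1)})

y = pd.Series(["cat", "dog", "dog", "cat", "dog"])
print(delayed_target_features(y, max_delay=2))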
Agree with you here! I like the idea of filing an issue and updating the DelayedFeatureTransformer as well as removing encoding in the pipelines.
@freddyaboulton ah agreed that it would be great to have DelayedFeatureTransformer be able to handle this.
Can you call super().fit(X, y) here instead of duplicating code with ClassificationPipeline.fit? ClassificationPipeline already sets up the encoder.
Oh, my bad, I forgot about _compute_features_during_fit.
In any case, I wonder if there's a way to avoid duplicating code for the classification target encoder.
When the dust settles and we build more features, I want to see if there is a way to refactor these classes to avoid duplication between classification and time series classification pipelines.
It's not apparent right now because the "time-series-specific" code gets sprinkled in between the "non-time-series" code so it's not easy to just isolate the time series code into a helper function/mix-in yet. But maybe a new design will be clearer when this one becomes more stable and we test it!
    self.estimator.fit(X_t, y_shifted)
    return self

def _predict(self, X, y, objective=None, pad=False):
I have added a pad parameter to _predict because in predict we want to return un-padded predictions, or else self._decode_targets will error out. However, in _compute_predictions (called by score) we want to return padded predictions, so that predictions, predicted probabilities, and the shifted target all have the same length; this makes dropping nans much easier.
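Roughly the pattern, as a sketch (pad_with_nans here is a stand-in for whatever padding helper the pipeline uses, and the other names are illustrative):

import numpy as np
import pandas as pd

def pad_with_nans(pd_data, num_to_pad):
    # Prepend NaN rows so the output lines up with the original index length.
    padding = pd.Series([np.nan] * num_to_pad)
    return pd.concat([padding, pd_data], ignore_index=True)

def _predict_sketch(estimator, features, num_dropped_rows, pad=False):
    preds = pd.Series(estimator.predict(features))
    # predict() wants the un-padded values (decoding NaNs would fail);
    # score() wants padded values so predictions, probabilities, and the shifted target share a length.
    return pad_with_nans(preds, num_dropped_rows) if pad else preds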
y_shifted = y_encoded.shift(-self.gap)
y_pred, y_pred_proba = self._compute_predictions(X, y, objectives)
non_nan_mask = _get_rows_without_nans(y_shifted, y_pred, y_pred_proba)
if y_pred is not None:
This is a bit ugly, but depending on the objectives that are passed in, y_pred or y_pred_proba can be None, so we can only drop nans where possible.
I've thought about avoiding this complexity by having _compute_predictions return un-padded predictions so that we don't have to worry about dropping anything. However, that doesn't really work because the predictions and y_shifted will have a different number of nans depending on the value of gap and the number of delayed features computed in the pipeline. The least complex way of finding the common non-nan subset is to look at the nans in the predictions and targets.
In short, we may be able to hide this complexity in a helper function but I don't think we can avoid it.
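To make the mismatch concrete, a toy illustration with made-up gap and max_delay values: shifting the target puts NaNs at the end, while delayed features put NaNs at the start, so only the combination of the two reveals the common valid rows:

import pandas as pd

gap, max_delay = 2, 3
y = pd.Series(range(10), dtype="float64")

y_shifted = y.shift(-gap)              # NaNs appear in the last `gap` rows
delayed_feature = y.shift(max_delay)   # NaNs appear in the first `max_delay` rows

print(y_shifted.isna().sum(), delayed_feature.isna().sum())  # 2 3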
if not objective.is_defined_for_problem_type(self.problem_type):
    raise ValueError(f"Objective {objective.name} is not defined for time series binary classification.")

if self.threshold is None:
This looks a bit different from BinaryClassificationPipeline so that I only have to call pad_with_nans once.
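Loosely, the shape of that logic as a sketch (illustrative only; the real method can also route predictions through an objective's decision function):

import numpy as np
import pandas as pd

def _predict_binary_sketch(proba_positive, threshold=None, num_to_pad=0):
    # Threshold the positive-class probabilities (0.5 when no threshold has been tuned).
    cutoff = 0.5 if threshold is None else threshold
    predictions = (pd.Series(proba_positive) > cutoff).astype(int)
    # Pad a single time at the very end rather than once per branch.
    padding = pd.Series([np.nan] * num_to_pad)
    return pd.concat([padding, predictions], ignore_index=True)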
@patch('evalml.objectives.BinaryClassificationObjective.decision_function')
@patch('evalml.pipelines.components.Estimator.predict_proba', return_value=pd.DataFrame({0: [1.]}))
@patch('evalml.pipelines.components.Estimator.predict', return_value=pd.Series([1.]))
def test_binary_classification_predictions_thresholded_properly(mock_predict, mock_predict_proba,
This is to achieve 100% coverage on the binary classification _predict method.
    (TimeSeriesRegressionPipeline, ['R2']),
    (TimeSeriesRegressionPipeline, ['R2', "Mean Absolute Percentage Error"])])
@pytest.mark.parametrize("use_ww", [True, False])
def test_score_works(pipeline_class, objectives, use_ww, X_y_binary, X_y_multi, X_y_regression):
I think it's good to have a unit test where we pass data through a minimal time series pipeline and make sure nothing errors out. There's a lot of stuff happening in these pipelines lol. I'm setting n_jobs=1 here and the test takes 2.5 seconds to run so I think the overall impact on test runtime is small.
LGTM @freddyaboulton! Just left some questions and comments.
"""Pipeline base class for time series classifcation problems.""" | ||
|
||
def __init__(self, parameters, random_state=0): | ||
"""Machine learning pipeline for time series regression problems made out of transformers and a classifier. |
classification instead of regression
Good catch!
X_t = self._compute_features_during_fit(X, y)
if X_t.empty:
    raise RuntimeError("Pipeline computed empty features during call to .fit. This means "
@freddyaboulton do we want behavior here to be the same as the current regression pipelines or should it be like after the fixes for #1517? I think this is fine for now but if we want to change behavior in the future we should add that info to #1517.
Great work, Freddy! The logic for keeping track of when to add padding versus not seemed challenging, but everything looks good to me! I left a couple of questions and a comment, but solid implementation and test coverage!
    return X, y

def fit(self, X, y):
    """Fit a time series regression pipeline.
classification instead of regression
    else:
        return pd_data

mask = reduce(lambda a, b: np.logical_and(_not_nan(a), _not_nan(b)), data)
This is cool! So this means that if the second Dataframe/Series in the sequence had a nan value in the last column, then the last value in the mask will be False, like below?
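For instance, something like this sketch (with a simplified stand-in for _not_nan and made-up data):

from functools import reduce
import numpy as np
import pandas as pd

def _not_nan(pd_data):
    # Simplified: True per row when a Series/DataFrame row has no NaNs.
    if isinstance(pd_data, pd.DataFrame):
        return ~pd_data.isna().any(axis=1).values
    return ~pd.Series(pd_data).isna().values

data = [pd.Series([1.0, 2.0, 3.0]),
        pd.DataFrame({"a": [4.0, 5.0, np.nan]})]  # nan in the last row of the second frame
mask = reduce(lambda a, b: np.logical_and(_not_nan(a), _not_nan(b)), data)
print(mask)  # [ True  True False]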
Might be nice to update the return docstring to "mask where each entry is True if and only if all corresponding entries in that index in data are non-nan" or something of the sort. I wasn't sure what this was supposed to do.
Yes exactly. Great suggestion on the docstring!
    mock_regressor_predict.side_effect = mock_predict
else:
    mock_classifier_predict.side_effect = mock_predict

if only_use_y:
    pl.fit(None, y)
How does this work? Looking at the code for fit(), it seems like X is necessary to pass in, since you do call estimator.fit() on the computed features of X?
Great question! When X is None, it's assumed that the pipeline will create some features from y before passing them to the estimator (typically via a DelayedFeatureTransformer). We're getting rid of this assumption in #1517, however, because some (future) estimators will naturally support the X=None case.
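A rough sketch of that assumption (hypothetical helper names; the real pipeline builds these features with its components):

import pandas as pd

def fit_from_target_only(estimator, y, max_delay):
    # With X=None, the feature matrix is built purely from lagged copies of y.
    X_t = pd.DataFrame({f"target_delay_{d}": y.shift(d) for d in range(1, max_delay + 1)})
    X_t, y_t = X_t.iloc[max_delay:], y.iloc[max_delay:]  # drop rows made NaN by the shifts
    estimator.fit(X_t, y_t)
    return estimator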
Gotcha, that makes sense. Thanks!
@freddyaboulton I left a few comments, got halfway through the impl so far :)
@@ -50,7 +50,7 @@ def fit(self, X, y):
     def _encode_targets(self, y):
         """Converts target values from their original values to integer values that can be processed."""
         try:
-            return pd.Series(self._encoder.transform(y))
+            return pd.Series(self._encoder.transform(y), index=y.index)
Oh neat.
Do we need to do this elsewhere?
I know LightGBM and Catboost also use encoders internally. I'll make a note to see if we need to do something similar there and if so file an issue!
Great thinking!
klass = type(self).__name__
if not self._is_fitted:
    raise PipelineNotYetFittedError(f'This {klass} is not fitted yet. You must fit {klass} before calling {method.__name__}.')
elif X is None and y is None:
    return method(self)
elif y is None:
    return method(self, X)
# For time series classification pipelines, predict will take X, y, objective
elif len(inspect.getfullargspec(method).args) == 4:
@freddyaboulton does args list positional args only? Or does it include named args?
Positional arguments! So it includes any arguments that don't have * or ** in front of their name.
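For reference, a quick demonstration with the standard library (no evalml specifics):

import inspect

def predict(self, X, y=None, objective=None, *args, **kwargs):
    pass

spec = inspect.getfullargspec(predict)
print(spec.args)     # ['self', 'X', 'y', 'objective'] -- named parameters are included, defaults or not
print(spec.varargs)  # 'args'   -- *args is reported separately
print(spec.varkw)    # 'kwargs' -- **kwargs is reported separately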
# For time series classification pipelines, predict will take X, y, objective
elif len(inspect.getfullargspec(method).args) == 4:
    return method(self, X, y, objective)
# For other pipelines, predict will take X, y or X, objective
Is this correct? The else below doesn't pass along the objective.
Yea it's correct but confusing 😅. For non-timeseries classification pipelines, the arguments will be X and objective, so the y in this else is actually the objective the user passed in!
I left this comment earlier but I guess it got overwritten when I pushed new changes:
The problem is that TimeSeriesBinaryClassification.predict accepts X, y and objective. All other pipeline classes accept X and y (TimeSeriesRegressionPipeline) or X and objective (BinaryClassificationPipeline, MulticlassClassificationPipeline) or X (ClassificationPipeline).
I think adding this extra check for the number of arguments in predict is the lowest-touch way of handling the complexity (as opposed to adding another metaclass) especially since this class is not user-facing and we don't expect users to define their own pipeline base classes.
That being said, maybe we can refactor this metaclass definition to make it a little clearer what's going on?
""" | ||
if "pipeline" not in parameters: | ||
raise ValueError("gap and max_delay parameters cannot be omitted from the parameters dict. " | ||
"Please specify them as a dictionary with the key 'pipeline'.") |
👍 awesome
    X = pd.DataFrame()
X = _convert_to_woodwork_structure(X)
y = _convert_to_woodwork_structure(y)
return X, y
Yep. I wonder if this would be good to move to PipelineBase or even utils, and use it throughout the codebase. @angela97lin FYI :)
Hm... could be, but the conversion to an empty DataFrame if X is None is a time series specific thing, no? In other cases, we'd probably want to raise something if the user passes None as X?
@@ -0,0 +1,204 @@
import pandas as pd
@freddyaboulton typo in filename: evalml/pipelines/time_series_classifcation_pipelines.py --> evalml/pipelines/time_series_classification_pipelines.py
Great catch! 😬
X, y = self._convert_to_woodwork(X, y)
X = _convert_woodwork_types_wrapper(X.to_dataframe())
y = _convert_woodwork_types_wrapper(y.to_series())
n_features = max(len(y), X.shape[0])
@freddyaboulton this is the same nan padding math we use for TS regression, yes?
Yes!
Left a few nit-picky comments but otherwise looks good! (Will come back and read more into impl later when I get the chance hehe)
parameters (dict): Dictionary with component names as keys and dictionary of that component's parameters as values.
    An empty dictionary {} implies using all default values for component parameters. Pipeline-level
    parameters such as gap and max_delay must be specified with the "pipeline" key. For example:
    Pipeline(parameters={"pipeline": {"max_delay": 4, "gap": 2}}).
Great example, nice and clear :)
    Pipeline(parameters={"pipeline": {"max_delay": 4, "gap": 2}}).
random_state (int, np.random.RandomState): The random seed/state. Defaults to 0.
"""
if "pipeline" not in parameters:
Really nit-picky, but what if a user passes in a "pipeline" key mapping to a dictionary with "gap" or "max_delay" missing? Ex: {"pipeline": {"gap": 1}}. I think we get a KeyError in the following lines, right?
Yep you're right!
We'll have to fix the same thing in regression pipelines so I'll just file an issue and change it for both after this is merged!
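One possible shape for that fix, as a sketch (the eventual evalml change may look different):

def _validate_pipeline_parameters(parameters):
    # Raise one clear error whether the "pipeline" dict or its required keys are missing.
    pipeline_params = parameters.get("pipeline", {})
    missing = [key for key in ("gap", "max_delay") if key not in pipeline_params]
    if missing:
        raise ValueError(
            "gap and max_delay parameters cannot be omitted from the parameters dict. "
            "Please specify them as a dictionary with the key 'pipeline'. "
            f"Missing: {', '.join(missing)}"
        )
    return pipeline_params["gap"], pipeline_params["max_delay"]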
@@ -50,7 +50,7 @@ def fit(self, X, y):
     def _encode_targets(self, y):
         """Converts target values from their original values to integer values that can be processed."""
         try:
-            return pd.Series(self._encoder.transform(y))
+            return pd.Series(self._encoder.transform(y), index=y.index)
Curious, did this break / how did you catch this? If so, could be good to add a test for it if there isn't one already 😬
The time series tests caught this! We pass a y with a custom index there.
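To illustrate why the index matters here (a toy example, not the actual test): without index=y.index the re-wrapped Series gets a fresh 0..n-1 index and no longer aligns with the original target:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

y = pd.Series(["a", "b", "a"], index=[10, 11, 12])  # custom, non-default index
encoder = LabelEncoder().fit(y)

lost_index = pd.Series(encoder.transform(y))                 # index becomes 0, 1, 2
kept_index = pd.Series(encoder.transform(y), index=y.index)  # index stays 10, 11, 12

print(lost_index.index.tolist())  # [0, 1, 2]
print(kept_index.index.tolist())  # [10, 11, 12]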
Pull Request Description
Fixes #1474
I'll add to AutoML in a separate PR since this is a big diff.
Most of the changes come from adding time series binary and time series multiclass problem types to the list of supported problem types for each existing classifier. I would ignore the diff in the estimators and estimator tests and focus on the diff in pipeline_base_meta.py, time_series_classification_pipelines.py, and test_time_series_pipeline.py.
After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:`123`.