Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow empty features in time series pipelines #1651

Merged
merged 10 commits into from
Jan 7, 2021

Conversation

freddyaboulton
Copy link
Contributor

@freddyaboulton freddyaboulton commented Jan 5, 2021

Pull Request Description

Fixes #1517.

@dsherry and @bchen1116 Can't tag you as reviewers but please take a look when you can!


After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:123.

@@ -52,12 +53,8 @@ def fit(self, X, y=None):
return self

def predict(self, X, y=None):
if X is None:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleting these because it's not needed



class TimeSeriesBaselineRegressionPipeline(TimeSeriesRegressionPipeline):
"""Baseline Pipeline for time series regression problems."""
_name = "Time Series Baseline Regression Pipeline"
component_graph = ["Time Series Baseline Regressor"]

def fit(self, X, y):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code was needed to accept the X=None case so we can delete it now! This means adding baselines for time series classification will be easy now since we can just reuse the time series baseline regressor and not worry about implementing pipeline fit or predict !

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooh that's great!

@codecov
Copy link

codecov bot commented Jan 5, 2021

Codecov Report

Merging #1651 (81f4278) into main (de316b8) will increase coverage by 0.1%.
The diff coverage is 100.0%.

Impacted file tree graph

@@            Coverage Diff            @@
##             main    #1651     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         240      240             
  Lines       18476    18485      +9     
=========================================
+ Hits        18468    18477      +9     
  Misses          8        8             
Impacted Files Coverage Δ
...ines/regression/time_series_baseline_regression.py 100.0% <ø> (ø)
evalml/tests/utils_tests/test_gen_utils.py 100.0% <ø> (ø)
...valml/pipelines/components/estimators/estimator.py 100.0% <100.0%> (ø)
...ators/regressors/time_series_baseline_regressor.py 100.0% <100.0%> (ø)
.../pipelines/time_series_classification_pipelines.py 100.0% <100.0%> (ø)
...valml/pipelines/time_series_regression_pipeline.py 100.0% <100.0%> (ø)
.../tests/pipeline_tests/test_time_series_pipeline.py 100.0% <100.0%> (ø)
evalml/utils/gen_utils.py 100.0% <100.0%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update de316b8...81f4278. Read the comment docs.

@@ -21,6 +21,8 @@ class Estimator(ComponentBase):

To see some examples, check out the definitions of any Estimator component.
"""
# We can't use the inspect module to dynamically determine this because of issue 1582
predict_uses_y = False
Copy link
Contributor Author

@freddyaboulton freddyaboulton Jan 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a way of knowing whether or not to pass y as an argument to predict or predict_proba since the baseline (Prophet #1499 and ARIMA #1498 in the future) estimator allows this and others don't. I think this is the lowest-touch way of doing this.

One alternative is to list y as an argument for all estimators. However, I think that would be confusing for users to list y if it's not used because they might think our estimators are "cheating" by looking at the target value. Maybe I'm just being paranoid though.

@freddyaboulton freddyaboulton force-pushed the 1517-allow-x-none-in-timeseries branch from 1016fd4 to 19c0719 Compare January 5, 2021 21:08
@freddyaboulton freddyaboulton self-assigned this Jan 5, 2021
@freddyaboulton freddyaboulton force-pushed the 1517-allow-x-none-in-timeseries branch from 19c0719 to 91c23d9 Compare January 5, 2021 21:12
@freddyaboulton freddyaboulton marked this pull request as ready for review January 5, 2021 21:50
Copy link
Contributor

@chukarsten chukarsten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems conceptually sound. Definitely a fan of removing unused code and simplification.

@freddyaboulton freddyaboulton force-pushed the 1517-allow-x-none-in-timeseries branch 3 times, most recently from daf1ccf to dd41786 Compare January 6, 2021 19:02
Copy link
Contributor

@bchen1116 bchen1116 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

X = None

pl.fit(X, y)
if expected_unique_values:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this always return true?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, removing the if statement!

@@ -401,7 +401,7 @@ def _not_nan(pd_data):
return mask


def drop_rows_with_nans(pd_data_1, pd_data_2):
def drop_rows_with_nans(*pd_data):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

return self._estimator_predict_helper(features, y, return_probabilities=False)

def _estimator_predict_proba(self, features, y):
"""Get estimator prediced probabilities.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo here :d

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

Comment on lines 142 to 143
y_predicted = pd.Series([])
y_predicted_proba = pd.DataFrame([])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be confused but why this change? This'll always return a valid prediction value, even if its empty?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drop_rows_with_nans (which is called in score) needs a pandas data type and not None. I changed drop_rows_with_nans to accept None and so I changed these back to None.

@freddyaboulton freddyaboulton force-pushed the 1517-allow-x-none-in-timeseries branch from dd41786 to 94d8cc4 Compare January 6, 2021 19:54
@dsherry dsherry self-requested a review January 6, 2021 23:23

This helper passes y as an argument if needed by the estimator.
"""
return self._estimator_predict_helper(features, y, return_probabilities=True)
Copy link
Contributor

@dsherry dsherry Jan 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@freddyaboulton could we do this

def _estimator_predict(self, features, y):
        """Get estimator predictions.
        This helper passes y as an argument if needed by the estimator.
        """
        y_arg = None
        if self.estimator.predict_uses_y:
            y_arg = y
        return self.estimator.predict(features, y=y_arg, return_probabilities=False)

And similar for predict_proba?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes we can! I originally thought we couldn't because some estimators like XGBoost don't have y as an argument in their predict signature but the metaclass adds it!

else:
ypred_proba = self.estimator.predict_proba(features_no_nan).iloc[:, 1]
proba = self._estimator_predict_proba(features_no_nan, y_encoded)
proba = proba.iloc[:, 1]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this necessary? Not covered by _estimator_predict_proba?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, _estimator_predict_proba returns the probabilities for all classes. For binary pipelines in predict, we threshold the predictions for the "positive" class if the threshold is not None. Hence the need for .iloc[:, 1].

objectives = ["MCC Multiclass", "Log Loss Multiclass"]

if use_none_X:
X = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we already have coverage elsewhere for the case when use_none_X is True? I.e. do we need @pytest.mark.parametrize("use_none_X", [True, False])?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I think we do for now. In the other tests with use_none_X, we always have a DelayedFeaturesTransformer in the pipeline. I think it's important to have coverage when that's not the case.

That being said, we have coverage for this case for ts regression pipelines in test_time_series_baseline_regression.py. In #1580 , I plan on extending the time baseline regressor to classification problems so in that case, I think we can move fold in this coverage into the tests for the baseline.

@@ -401,19 +401,24 @@ def _not_nan(pd_data):
return mask


def drop_rows_with_nans(pd_data_1, pd_data_2):
def drop_rows_with_nans(*pd_data):
"""Drop rows that have any NaNs in both pd_data_1 and pd_data_2.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit-pick: update this docstring too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

raise RuntimeError("Pipeline computed empty features during call to .fit. This means "
"that either 1) you passed in X=None to fit and don't have a DelayFeatureTransformer "
"in your pipeline or 2) you do have a DelayFeatureTransformer but gap=0 and max_delay=0. "
"Please add a DelayFeatureTransformer or change the values of gap and max_delay")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah got it, great that we can eliminate this!

y_pred = y_pred.iloc[non_nan_mask]
if y_pred_proba is not None:
y_pred_proba = y_pred_proba.iloc[non_nan_mask]
y_labels = y_shifted.iloc[non_nan_mask]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it necessary to remove the iloc stuff from where this code got moved to?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The calls to iloc are now happening in drop_rows_with_nans!

Copy link
Contributor

@dsherry dsherry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a suggestion in the time series classification pipeline, and a couple questions, otherwise 🚢

@freddyaboulton freddyaboulton force-pushed the 1517-allow-x-none-in-timeseries branch 2 times, most recently from 0edaa86 to 59a9589 Compare January 7, 2021 16:50
@freddyaboulton freddyaboulton force-pushed the 1517-allow-x-none-in-timeseries branch from 59a9589 to 81f4278 Compare January 7, 2021 18:16
@freddyaboulton freddyaboulton merged commit 2e4f91f into main Jan 7, 2021
@freddyaboulton freddyaboulton deleted the 1517-allow-x-none-in-timeseries branch January 7, 2021 18:57
@bchen1116 bchen1116 mentioned this pull request Jan 26, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow X to be None in Time Series Pipelines
5 participants