-
Notifications
You must be signed in to change notification settings - Fork 86
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow empty features in time series pipelines #1651
Conversation
@@ -52,12 +53,8 @@ def fit(self, X, y=None): | |||
return self | |||
|
|||
def predict(self, X, y=None): | |||
if X is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deleting these because it's not needed
|
||
|
||
class TimeSeriesBaselineRegressionPipeline(TimeSeriesRegressionPipeline): | ||
"""Baseline Pipeline for time series regression problems.""" | ||
_name = "Time Series Baseline Regression Pipeline" | ||
component_graph = ["Time Series Baseline Regressor"] | ||
|
||
def fit(self, X, y): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code was needed to accept the X=None
case so we can delete it now! This means adding baselines for time series classification will be easy now since we can just reuse the time series baseline regressor and not worry about implementing pipeline fit
or predict
!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ooh that's great!
Codecov Report
@@ Coverage Diff @@
## main #1651 +/- ##
=========================================
+ Coverage 100.0% 100.0% +0.1%
=========================================
Files 240 240
Lines 18476 18485 +9
=========================================
+ Hits 18468 18477 +9
Misses 8 8
Continue to review full report at Codecov.
|
@@ -21,6 +21,8 @@ class Estimator(ComponentBase): | |||
|
|||
To see some examples, check out the definitions of any Estimator component. | |||
""" | |||
# We can't use the inspect module to dynamically determine this because of issue 1582 | |||
predict_uses_y = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need a way of knowing whether or not to pass y
as an argument to predict
or predict_proba
since the baseline (Prophet #1499 and ARIMA #1498 in the future) estimator allows this and others don't. I think this is the lowest-touch way of doing this.
One alternative is to list y
as an argument for all estimators. However, I think that would be confusing for users to list y
if it's not used because they might think our estimators are "cheating" by looking at the target value. Maybe I'm just being paranoid though.
1016fd4
to
19c0719
Compare
19c0719
to
91c23d9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems conceptually sound. Definitely a fan of removing unused code and simplification.
daf1ccf
to
dd41786
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
X = None | ||
|
||
pl.fit(X, y) | ||
if expected_unique_values: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this always return true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, removing the if statement!
@@ -401,7 +401,7 @@ def _not_nan(pd_data): | |||
return mask | |||
|
|||
|
|||
def drop_rows_with_nans(pd_data_1, pd_data_2): | |||
def drop_rows_with_nans(*pd_data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice!
return self._estimator_predict_helper(features, y, return_probabilities=False) | ||
|
||
def _estimator_predict_proba(self, features, y): | ||
"""Get estimator prediced probabilities. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo here :d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
y_predicted = pd.Series([]) | ||
y_predicted_proba = pd.DataFrame([]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may be confused but why this change? This'll always return a valid prediction value, even if its empty?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
drop_rows_with_nans
(which is called in score
) needs a pandas data type and not None
. I changed drop_rows_with_nans
to accept None and so I changed these back to None
.
dd41786
to
94d8cc4
Compare
|
||
This helper passes y as an argument if needed by the estimator. | ||
""" | ||
return self._estimator_predict_helper(features, y, return_probabilities=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@freddyaboulton could we do this
def _estimator_predict(self, features, y):
"""Get estimator predictions.
This helper passes y as an argument if needed by the estimator.
"""
y_arg = None
if self.estimator.predict_uses_y:
y_arg = y
return self.estimator.predict(features, y=y_arg, return_probabilities=False)
And similar for predict_proba?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes we can! I originally thought we couldn't because some estimators like XGBoost
don't have y
as an argument in their predict signature but the metaclass adds it!
else: | ||
ypred_proba = self.estimator.predict_proba(features_no_nan).iloc[:, 1] | ||
proba = self._estimator_predict_proba(features_no_nan, y_encoded) | ||
proba = proba.iloc[:, 1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this necessary? Not covered by _estimator_predict_proba
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, _estimator_predict_proba
returns the probabilities for all classes. For binary pipelines in predict
, we threshold the predictions for the "positive" class if the threshold is not None. Hence the need for .iloc[:, 1]
.
objectives = ["MCC Multiclass", "Log Loss Multiclass"] | ||
|
||
if use_none_X: | ||
X = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we already have coverage elsewhere for the case when use_none_X
is True
? I.e. do we need @pytest.mark.parametrize("use_none_X", [True, False])
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I think we do for now. In the other tests with use_none_X
, we always have a DelayedFeaturesTransformer
in the pipeline. I think it's important to have coverage when that's not the case.
That being said, we have coverage for this case for ts regression pipelines in test_time_series_baseline_regression.py
. In #1580 , I plan on extending the time baseline regressor to classification problems so in that case, I think we can move fold in this coverage into the tests for the baseline.
evalml/utils/gen_utils.py
Outdated
@@ -401,19 +401,24 @@ def _not_nan(pd_data): | |||
return mask | |||
|
|||
|
|||
def drop_rows_with_nans(pd_data_1, pd_data_2): | |||
def drop_rows_with_nans(*pd_data): | |||
"""Drop rows that have any NaNs in both pd_data_1 and pd_data_2. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit-pick: update this docstring too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
raise RuntimeError("Pipeline computed empty features during call to .fit. This means " | ||
"that either 1) you passed in X=None to fit and don't have a DelayFeatureTransformer " | ||
"in your pipeline or 2) you do have a DelayFeatureTransformer but gap=0 and max_delay=0. " | ||
"Please add a DelayFeatureTransformer or change the values of gap and max_delay") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah got it, great that we can eliminate this!
y_pred = y_pred.iloc[non_nan_mask] | ||
if y_pred_proba is not None: | ||
y_pred_proba = y_pred_proba.iloc[non_nan_mask] | ||
y_labels = y_shifted.iloc[non_nan_mask] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it necessary to remove the iloc
stuff from where this code got moved to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The calls to iloc
are now happening in drop_rows_with_nans
!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a suggestion in the time series classification pipeline, and a couple questions, otherwise 🚢
0edaa86
to
59a9589
Compare
…edictions to None in TimeSeriesClassificationPipeline.
…ression and classification pipelines.
59a9589
to
81f4278
Compare
Pull Request Description
Fixes #1517.
@dsherry and @bchen1116 Can't tag you as reviewers but please take a look when you can!
After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of
docs/source/release_notes.rst
to include this pull request by adding :pr:123
.