Preserve input indices on estimators and pipelines' predict/predict_proba/transform/inverse_transform #2979
Conversation
Codecov Report
```
@@           Coverage Diff           @@
##            main   #2979    +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%
=======================================
  Files        312     312
  Lines      29893   29979     +86
=======================================
+ Hits       29802   29888     +86
  Misses        91      91
=======================================
```
Continue to review full report at Codecov.
```diff
@@ -97,5 +97,7 @@ def inverse_transform(self, y):
        """
        if y is None:
            raise ValueError("y cannot be None!")
        y_ww = infer_feature_types(y)
```
Why do we need this line? It looks like we only use it to grab the index. Is it just in case the input `y` is a numpy array?
Yeah, not really related to the index work; just standardizing, since we do so for other components :')
@angela97lin Thank you, this looks great! I think there's a bug in the test, though, because `LightGBM.predict` is not preserving the index?
evalml/pipelines/components/estimators/regressors/prophet_regressor.py
```diff
@@ -741,7 +741,9 @@ def make_dict(self, index, y_pred, y_true, scores, dataframe_index):
        return {
            "probabilities": pred_values,
            "predicted_value": _make_json_serializable(self.predicted_values[index]),
            "predicted_value": _make_json_serializable(
```
Thanks for chasing this down!
```python
_predictions = self._predict(X, objective=objective)
predictions = self.inverse_transform(_predictions.astype(int))
predictions = pd.Series(
    predictions, name=self.input_target_name, index=_predictions.index
)
```
Do we need this line given that the LabelEncoder preserves the index?
Theoretically, if all of the components properly handle preserving indices, including the label encoder, then no... but I figured it would be good to have on the pipelines just in case!
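A minimal, hypothetical sketch of the "just in case" pattern described above (not evalml's actual pipeline code; `predict_with_index` is an illustrative helper): even if a downstream step hands back values without the original index, the pipeline can restore it from the predictions produced one step earlier.

```python
import pandas as pd


def predict_with_index(raw_predictions: pd.Series, decoded) -> pd.Series:
    """Wrap decoded values (which may be a numpy array or a freshly built
    Series carrying a default RangeIndex) in a Series that reuses the
    index of the upstream predictions."""
    return pd.Series(list(decoded), index=raw_predictions.index)


# Upstream predictions carry the caller's custom index.
raw = pd.Series([0, 1, 0], index=[25, 26, 27])
decoded = ["no", "yes", "no"]  # e.g. output of an inverse_transform call
result = predict_with_index(raw, decoded)
```

The cost of this belt-and-suspenders reassignment is one cheap Series construction per `predict` call, which is why doing it at the pipeline level is a reasonable safety net.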
```diff
@@ -1562,6 +1573,12 @@ def test_transformer_fit_and_transform_respect_custom_indices(
        pd.testing.assert_index_equal(
            y.index, y_original_index, check_names=check_names
        )

        if hasattr(transformer_class, "inverse_transform"):
```
Nice, thanks for adding this!
```python
    "problem_type",
    [ProblemTypes.BINARY, ProblemTypes.MULTICLASS, ProblemTypes.REGRESSION],
)
def test_estimator_fit_predict_and_predict_proba_respect_custom_indices(
```
Why doesn't this test catch this?
```python
import pandas as pd
from evalml.demos import load_breast_cancer
from evalml.pipelines.components import LightGBMClassifier

X, y = load_breast_cancer()
lgbm = LightGBMClassifier()
X.index = range(25, 25 + X.shape[0])
lgbm.fit(X, y)
pd.testing.assert_index_equal(lgbm.predict(X).index, X.index)
```
```
AssertionError: Index are different

Index values are different (100.0 %)
[left]:  RangeIndex(start=0, stop=569, step=1)
[right]: RangeIndex(start=25, stop=594, step=1)
```

`predict_proba` does preserve the index, though.
Wow, this is a really good catch! Tracking this down, it's because we have separate logic for LightGBM's and XGBoost's `predict` methods depending on whether or not we need to encode the targets (they have internal label encoders). Since my test uses the `X_y_binary` fixture, it doesn't exercise the label encoder. In the case where we do use the label encoder, its `inverse_transform` wipes the index, so we have to set it again:
```python
X, _ = super()._manage_woodwork(X)
X.ww.set_types(self._convert_bool_to_int(X))
X = _rename_column_names_to_numeric(X, flatten_tuples=False)
predictions = super().predict(X)
if not self._label_encoder:
    return predictions
predictions = pd.Series(
    self._label_encoder.inverse_transform(predictions.astype(np.int64)),
    index=predictions.index,
)
return infer_feature_types(predictions)
```
Adding another set of tests for numeric vs non-numeric targets 😅
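The bug class described above can be reproduced without any estimator at all. A small sketch (a pandas-only stand-in, not evalml or LightGBM code) showing why `index=predictions.index` is needed: encoder-style `inverse_transform` calls typically return a plain numpy array, and wrapping that in a Series without an explicit index silently resets it to a `RangeIndex` starting at 0.

```python
import numpy as np
import pandas as pd

# Predictions carrying a custom index, as produced upstream.
predictions = pd.Series([1, 0, 1], index=[25, 26, 27])

# What an encoder's inverse_transform might hand back: values only,
# as a bare numpy array with no index attached.
decoded = np.array(["b", "a", "b"])

lost = pd.Series(decoded)                           # index silently becomes 0, 1, 2
kept = pd.Series(decoded, index=predictions.index)  # index stays 25, 26, 27
```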
LGTM! Glad we're now able to preserve the index in our estimators and pipelines.
Thank you @angela97lin! Looks good to me!
Angela, this looks great, and I think it greatly enhances our test coverage with respect to index preservation. I don't see anything blocking, but I am curious whether we should do any index preservation within `infer_feature_types`? I know that would cut a few lines from the individual pipeline classes, and it also seems like something we should expect of the function. If not, nbd.
```python
y_dt = infer_feature_types(y)
y_t = self._component_obj.inverse_transform(y_dt)
y_t = infer_feature_types(pd.Series(y_t, index=y_dt.index))
y_ww = infer_feature_types(y)
```
Is part of the underlying problem here that `infer_feature_types()` should be preserving the index? Should we do it in there?
@chukarsten Unfortunately, I think that regardless of our `infer_feature_types` behavior (which I think should preserve indices, but I'm not positive), the issue is that a lot of third-party libraries return data such as numpy arrays or new pandas objects that don't have the original indices attached to them. 🥲
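A small sketch of that limitation (the `third_party_transform` function below is a made-up stand-in for an external library call, not real evalml code): once a library returns a bare numpy array, the index is simply gone, so no downstream type-inference helper could recover it. The caller has to thread the original index through explicitly.

```python
import numpy as np
import pandas as pd


def third_party_transform(y: pd.Series) -> np.ndarray:
    # Stand-in for an external library call: it computes new values
    # but returns them as a plain ndarray, discarding the index.
    return y.to_numpy() * 2


y = pd.Series([1, 2, 3], index=[100, 101, 102])
values = third_party_transform(y)  # no index survives this call

# Re-attach the index from the original input; this is the step that
# type inference alone cannot do for us.
restored = pd.Series(values, index=y.index)
```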
```python
y = y.map({val: str(val) for val in np.unique(y)})

if use_custom_index:
    gen = np.random.default_rng(seed=0)
```
I know we do this in similar tests, but it's bold. I'm always hesitant to see pseudo-randomness added into a testing suite. I think it opens the door for potential flakiness and I'm not sure about the implications across platforms. I'm pretty sure this usage is fine, though.
Heh, I had taken inspiration from our other tests but agreed! The pseudo-randomness isn't really necessary here so I'll remove it from this instance and our other tests that use this :)
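One deterministic alternative, as a sketch (illustrative only, not the final test code): a fixed, shifted range still exercises the "non-default index" behavior the test cares about, without introducing pseudo-randomness into the suite.

```python
import pandas as pd

X = pd.DataFrame({"a": [1, 2, 3]})

# Custom but fully reproducible index: shifted away from the default
# RangeIndex(start=0) so index-preservation bugs still surface.
X.index = range(100, 100 + X.shape[0])

pd.testing.assert_index_equal(X.index, pd.RangeIndex(100, 103))
```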
Closes #1639