
Preserve input indices on estimators and pipelines' predict/predict_proba/transform/inverse_transform #2979

Merged: angela97lin merged 43 commits into main from the 1639_preserve_custom_indices branch on Nov 8, 2021

Conversation

angela97lin (Contributor):

Closes #1639

@angela97lin angela97lin self-assigned this Oct 27, 2021
@angela97lin angela97lin changed the title Preserve input indices on estimators and pipelines' predict/predict_proba/inverse_transform Preserve input indices on estimators and pipelines' predict/predict_proba/transform/inverse_transform Oct 28, 2021
codecov bot commented Oct 28, 2021

Codecov Report

Merging #2979 (c8c20dc) into main (a65e026) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #2979     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        312     312             
  Lines      29893   29979     +86     
=======================================
+ Hits       29802   29888     +86     
  Misses        91      91             
Impacted Files Coverage Δ
...tanding/prediction_explanations/_user_interface.py 100.0% <ø> (ø)
...ents/estimators/classifiers/lightgbm_classifier.py 100.0% <ø> (ø)
...nents/estimators/classifiers/xgboost_classifier.py 100.0% <ø> (ø)
evalml/pipelines/classification_pipeline.py 100.0% <100.0%> (ø)
...ents/estimators/classifiers/catboost_classifier.py 100.0% <100.0%> (ø)
...valml/pipelines/components/estimators/estimator.py 100.0% <100.0%> (ø)
...ponents/estimators/regressors/prophet_regressor.py 100.0% <100.0%> (ø)
.../components/transformers/encoders/label_encoder.py 100.0% <100.0%> (ø)
...ents/transformers/preprocessing/log_transformer.py 100.0% <100.0%> (ø)
...transformers/preprocessing/polynomial_detrender.py 97.7% <100.0%> (ø)
... and 7 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a65e026...c8c20dc.

@@ -97,5 +97,7 @@ def inverse_transform(self, y):
"""
if y is None:
raise ValueError("y cannot be None!")
y_ww = infer_feature_types(y)
Contributor:

Why do we need this line? It looks like we only use it to grab the index. Is it just in case the input y is a numpy array?

angela97lin (Contributor Author):

Yeah, not really related to index work, just standardizing since we do so for other components :')
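
For reference, here is a minimal sketch of the standardization pattern being discussed (illustrative only, not code from this PR; it assumes evalml's infer_feature_types utility, which returns Woodwork-initialized pandas objects):

import numpy as np
import pandas as pd
from evalml.utils import infer_feature_types

# A numpy target has no index of its own; infer_feature_types should hand back
# a pandas Series with a default RangeIndex, so grabbing .index from it is safe.
y_numpy = np.array([0, 1, 1, 0])
assert isinstance(infer_feature_types(y_numpy), pd.Series)

# A pandas target keeps whatever index it came in with.
y_pandas = pd.Series([0, 1, 1, 0], index=[25, 26, 27, 28])
assert list(infer_feature_types(y_pandas).index) == [25, 26, 27, 28]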

@freddyaboulton (Contributor) left a comment:

@angela97lin Thank you, this looks great! I think there's a bug in the test because LightGBM.predict is not preserving the index?

@@ -741,7 +741,9 @@ def make_dict(self, index, y_pred, y_true, scores, dataframe_index):

         return {
             "probabilities": pred_values,
-            "predicted_value": _make_json_serializable(self.predicted_values[index]),
+            "predicted_value": _make_json_serializable(
Contributor:

Thanks for chasing this down!

        _predictions = self._predict(X, objective=objective)
        predictions = self.inverse_transform(_predictions.astype(int))
        predictions = pd.Series(
            predictions, name=self.input_target_name, index=_predictions.index
        )
Contributor:

Do we need this line given that the LabelEncoder preserves the index?

angela97lin (Contributor Author):

Theoretically, if all of the components (including the label encoder) properly handle preserving indices, then no... but I figured it would be good to have on the pipelines just in case!

@@ -1562,6 +1573,12 @@ def test_transformer_fit_and_transform_respect_custom_indices(
    pd.testing.assert_index_equal(
        y.index, y_original_index, check_names=check_names
    )

    if hasattr(transformer_class, "inverse_transform"):
Contributor:

Nice thanks for adding this!
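
For context, a plausible shape for the check behind that hasattr guard (a sketch, not necessarily the exact code added in this PR; transformer, y_t, and y_original_index are assumed to be objects the test already builds earlier):

    if hasattr(transformer_class, "inverse_transform"):
        # Round-trip through inverse_transform and confirm the custom index survives.
        y_inv = transformer.inverse_transform(y_t)
        pd.testing.assert_index_equal(
            y_inv.index, y_original_index, check_names=check_names
        )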

"problem_type",
[ProblemTypes.BINARY, ProblemTypes.MULTICLASS, ProblemTypes.REGRESSION],
)
def test_estimator_fit_predict_and_predict_proba_respect_custom_indices(
Contributor:

Why doesn't this test catch this?

import pandas as pd
from evalml.demos import load_breast_cancer
from evalml.pipelines.components import LightGBMClassifier

X, y = load_breast_cancer()
lgbm = LightGBMClassifier()
X.index = range(25, 25 + X.shape[0])
lgbm.fit(X, y)

pd.testing.assert_index_equal(lgbm.predict(X).index, X.index)

AssertionError: Index are different

Index values are different (100.0 %)
[left]:  RangeIndex(start=0, stop=569, step=1)
[right]: RangeIndex(start=25, stop=594, step=1)

predict_proba does preserve the index though

angela97lin (Contributor Author):

Wow, this is a really good catch! Tracking this down, it's because we have separate logic for LightGBM and XGBoost's predict methods depending on whether or not we need to encode the targets (they internally have label encoders). Since my test uses the X_y_binary fixture, it doesn't use the label encoder. In the case where we do use the label encoder, the label encoder's inverse_transform wipes the index so we have to set it again:

        X, _ = super()._manage_woodwork(X)
        X.ww.set_types(self._convert_bool_to_int(X))
        X = _rename_column_names_to_numeric(X, flatten_tuples=False)
        predictions = super().predict(X)
        if not self._label_encoder:
            return predictions
        predictions = pd.Series(
            self._label_encoder.inverse_transform(predictions.astype(np.int64)),
            index=predictions.index,
        )
        return infer_feature_types(predictions)

Adding another set of tests for numeric vs non-numeric targets 😅
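
A rough sketch of what such a test could look like (hypothetical, not the exact test from this PR; it assumes the existing X_y_binary fixture and the LightGBMClassifier component):

import numpy as np
import pandas as pd
import pytest

from evalml.pipelines.components import LightGBMClassifier


@pytest.mark.parametrize("use_string_target", [True, False])
def test_lightgbm_predict_respects_custom_indices(use_string_target, X_y_binary):
    X, y = X_y_binary
    X, y = pd.DataFrame(X), pd.Series(y)
    if use_string_target:
        # String labels force the internal label-encoder path in predict.
        y = y.map({val: str(val) for val in np.unique(y)})

    custom_index = range(100, 100 + X.shape[0])
    X.index, y.index = custom_index, custom_index

    clf = LightGBMClassifier()
    clf.fit(X, y)
    pd.testing.assert_index_equal(clf.predict(X).index, X.index)
    pd.testing.assert_index_equal(clf.predict_proba(X).index, X.index)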

@bchen1116 (Contributor) left a comment:

LGTM! Glad we're now able to preserve the index in our estimators and pipelines

@freddyaboulton (Contributor) left a comment:

Thank you @angela97lin ! Looks good to me!

@chukarsten (Contributor) left a comment:

Angela, this looks great and I think it greatly enhances our test coverage with respect to index preservation. I don't see anything that blocks, but I am curious whether we should do any index preservation within infer_feature_types? I know that would cut out a few lines from the individual pipeline classes, and it also seems like something we should expect of the function. If not, also, nbd.

        y_dt = infer_feature_types(y)
        y_t = self._component_obj.inverse_transform(y_dt)
        y_t = infer_feature_types(pd.Series(y_t, index=y_dt.index))
        y_ww = infer_feature_types(y)
Contributor:

Is part of the underlying problem here that infer_feature_types() should be preserving the index? Should we do it in there?

angela97lin (Contributor Author):

@chukarsten Unfortunately, I think regardless of our infer_feature_types behavior (which I think should preserve indices but I'm not positive), the issue is that a lot of third party libraries might return data such as np.arrays or new pandas objects that don't have the indices attached to them. 🥲
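
To illustrate that point with a standalone example (scikit-learn here purely for demonstration, not evalml code): the third-party estimator hands back a bare ndarray, so by the time infer_feature_types sees the output the original index is already gone and has to be reattached from the input.

import pandas as pd
from sklearn.linear_model import LogisticRegression

X = pd.DataFrame({"a": [0.0, 1.0, 2.0, 3.0]}, index=[25, 26, 27, 28])
y = pd.Series([0, 0, 1, 1], index=X.index)

raw = LogisticRegression().fit(X, y).predict(X)  # plain numpy.ndarray, custom index already lost
predictions = pd.Series(raw, index=X.index)      # the wrapping component must copy the index back from its input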

    y = y.map({val: str(val) for val in np.unique(y)})

    if use_custom_index:
        gen = np.random.default_rng(seed=0)
Contributor:

I know we do this in similar tests, but it's bold. I'm always hesitant to see pseudo-randomness added into a testing suite. I think it opens the door for potential flakiness and I'm not sure about the implications across platforms. I'm pretty sure this usage is fine, though.

angela97lin (Contributor Author):

Heh, I had taken inspiration from our other tests but agreed! The pseudo-randomness isn't really necessary here so I'll remove it from this instance and our other tests that use this :)
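
A minimal sketch of the deterministic alternative (illustrative only; X and y stand for the test's feature matrix and target):

    if use_custom_index:
        # A shifted range exercises "index differs from the default RangeIndex"
        # without introducing randomness into the test suite.
        custom_index = range(100, 100 + X.shape[0])
        X.index = custom_index
        y.index = custom_index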

@angela97lin angela97lin merged commit 36e0970 into main Nov 8, 2021
@angela97lin angela97lin deleted the 1639_preserve_custom_indices branch November 8, 2021 17:55
@chukarsten chukarsten mentioned this pull request Nov 9, 2021

Successfully merging this pull request may close these issues:

Pipeline predict should preserve the data index (#1639)