
Preserve input indices on estimators and pipelines' predict/predict_proba/transform/inverse_transform #2979

Merged: angela97lin merged 43 commits into main from the 1639_preserve_custom_indices branch on Nov 8, 2021

Conversation

angela97lin (Contributor):

Closes #1639

@angela97lin angela97lin self-assigned this Oct 27, 2021
@angela97lin angela97lin changed the title Preserve input indices on estimators and pipelines' predict/predict_proba/inverse_transform Preserve input indices on estimators and pipelines' predict/predict_proba/transform/inverse_transform Oct 28, 2021
codecov bot commented Oct 28, 2021

Codecov Report

Merging #2979 (c8c20dc) into main (a65e026) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #2979     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        312     312             
  Lines      29893   29979     +86     
=======================================
+ Hits       29802   29888     +86     
  Misses        91      91             
Impacted Files Coverage Δ
...tanding/prediction_explanations/_user_interface.py 100.0% <ø> (ø)
...ents/estimators/classifiers/lightgbm_classifier.py 100.0% <ø> (ø)
...nents/estimators/classifiers/xgboost_classifier.py 100.0% <ø> (ø)
evalml/pipelines/classification_pipeline.py 100.0% <100.0%> (ø)
...ents/estimators/classifiers/catboost_classifier.py 100.0% <100.0%> (ø)
...valml/pipelines/components/estimators/estimator.py 100.0% <100.0%> (ø)
...ponents/estimators/regressors/prophet_regressor.py 100.0% <100.0%> (ø)
.../components/transformers/encoders/label_encoder.py 100.0% <100.0%> (ø)
...ents/transformers/preprocessing/log_transformer.py 100.0% <100.0%> (ø)
...transformers/preprocessing/polynomial_detrender.py 97.7% <100.0%> (ø)
... and 7 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a65e026...c8c20dc.

@@ -97,5 +97,7 @@ def inverse_transform(self, y):
"""
if y is None:
raise ValueError("y cannot be None!")
y_ww = infer_feature_types(y)
Contributor:

Why do we need this line? It looks like we only use it to grab the index. Is it just in case the input y is a numpy array?

angela97lin (Contributor Author):

Yeah, not really related to index work, just standardizing since we do so for other components :')
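
For reference, here is a minimal sketch of the standardization pattern being discussed (illustrative only, not code from this PR; it assumes evalml's infer_feature_types utility, which returns Woodwork-initialized pandas objects):

import numpy as np
import pandas as pd
from evalml.utils import infer_feature_types

# A numpy target has no index of its own; infer_feature_types should hand back
# a pandas Series with a default RangeIndex, so grabbing .index from it is safe.
y_numpy = np.array([0, 1, 1, 0])
assert isinstance(infer_feature_types(y_numpy), pd.Series)

# A pandas target keeps whatever index it came in with.
y_pandas = pd.Series([0, 1, 1, 0], index=[25, 26, 27, 28])
assert list(infer_feature_types(y_pandas).index) == [25, 26, 27, 28]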

@freddyaboulton (Contributor) left a comment:

@angela97lin Thank you, this looks great! I think there's a bug in the test because LightGBM.predict is not preserving the index?

@@ -741,7 +741,9 @@ def make_dict(self, index, y_pred, y_true, scores, dataframe_index):

         return {
             "probabilities": pred_values,
-            "predicted_value": _make_json_serializable(self.predicted_values[index]),
+            "predicted_value": _make_json_serializable(
Contributor:

Thanks for chasing this down!

        _predictions = self._predict(X, objective=objective)
        predictions = self.inverse_transform(_predictions.astype(int))
        predictions = pd.Series(
            predictions, name=self.input_target_name, index=_predictions.index
        )
Contributor:

Do we need this line given that the LabelEncoder preserves the index?

angela97lin (Contributor Author):

Theoretically, if all of the components (including the label encoder) properly handle preserving indices, then no... but I figured it would be good to have on the pipelines just in case!

@@ -1562,6 +1573,12 @@ def test_transformer_fit_and_transform_respect_custom_indices(
    pd.testing.assert_index_equal(
        y.index, y_original_index, check_names=check_names
    )

    if hasattr(transformer_class, "inverse_transform"):
Contributor:

Nice thanks for adding this!
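
For context, a plausible shape for the check behind that hasattr guard (a sketch, not necessarily the exact code added in this PR; transformer, y_t, and y_original_index are assumed to be objects the test already builds earlier):

    if hasattr(transformer_class, "inverse_transform"):
        # Round-trip through inverse_transform and confirm the custom index survives.
        y_inv = transformer.inverse_transform(y_t)
        pd.testing.assert_index_equal(
            y_inv.index, y_original_index, check_names=check_names
        )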

"problem_type",
[ProblemTypes.BINARY, ProblemTypes.MULTICLASS, ProblemTypes.REGRESSION],
)
def test_estimator_fit_predict_and_predict_proba_respect_custom_indices(
Contributor:

Why doesn't this test catch this?

import pandas as pd
from evalml.demos import load_breast_cancer
from evalml.pipelines.components import LightGBMClassifier

X, y = load_breast_cancer()
lgbm = LightGBMClassifier()
X.index = range(25, 25 + X.shape[0])
lgbm.fit(X, y)

pd.testing.assert_index_equal(lgbm.predict(X).index, X.index)

AssertionError: Index are different

Index values are different (100.0 %)
[left]:  RangeIndex(start=0, stop=569, step=1)
[right]: RangeIndex(start=25, stop=594, step=1)

predict_proba does preserve the index though

angela97lin (Contributor Author):

Wow, this is a really good catch! Tracking this down, it's because we have separate logic for LightGBM and XGBoost's predict methods depending on whether or not we need to encode the targets (they internally have label encoders). Since my test uses the X_y_binary fixture, it doesn't use the label encoder. In the case where we do use the label encoder, the label encoder's inverse_transform wipes the index so we have to set it again:

        X, _ = super()._manage_woodwork(X)
        X.ww.set_types(self._convert_bool_to_int(X))
        X = _rename_column_names_to_numeric(X, flatten_tuples=False)
        predictions = super().predict(X)
        if not self._label_encoder:
            return predictions
        predictions = pd.Series(
            self._label_encoder.inverse_transform(predictions.astype(np.int64)),
            index=predictions.index,
        )
        return infer_feature_types(predictions)

Adding another set of tests for numeric vs non-numeric targets 😅
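
A rough sketch of what such a test could look like (hypothetical, not the exact test from this PR; it assumes the existing X_y_binary fixture and the LightGBMClassifier component):

import numpy as np
import pandas as pd
import pytest

from evalml.pipelines.components import LightGBMClassifier


@pytest.mark.parametrize("use_string_target", [True, False])
def test_lightgbm_predict_respects_custom_indices(use_string_target, X_y_binary):
    X, y = X_y_binary
    X, y = pd.DataFrame(X), pd.Series(y)
    if use_string_target:
        # String labels force the internal label-encoder path in predict.
        y = y.map({val: str(val) for val in np.unique(y)})

    custom_index = range(100, 100 + X.shape[0])
    X.index, y.index = custom_index, custom_index

    clf = LightGBMClassifier()
    clf.fit(X, y)
    pd.testing.assert_index_equal(clf.predict(X).index, X.index)
    pd.testing.assert_index_equal(clf.predict_proba(X).index, X.index)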

@bchen1116 (Contributor) left a comment:

LGTM! Glad we're now able to preserve the index in our estimators and pipelines

@freddyaboulton (Contributor) left a comment:

Thank you @angela97lin ! Looks good to me!

@chukarsten (Contributor) left a comment:

Angela, this looks great and I think it greatly enhances our test coverage with respect to index preservation. I don't see anything that blocks, but I am curious whether we should do any index preservation within infer_feature_types? I know that would cut out a few lines from the individual pipeline classes, and it also seems like something we should expect of the function. If not, also, nbd.

        y_dt = infer_feature_types(y)
        y_t = self._component_obj.inverse_transform(y_dt)
        y_t = infer_feature_types(pd.Series(y_t, index=y_dt.index))
        y_ww = infer_feature_types(y)
Contributor:

Is part of the underlying problem here that infer_feature_types() should be preserving the index? Should we do it in there?

angela97lin (Contributor Author):

@chukarsten Unfortunately, I think regardless of our infer_feature_types behavior (which I think should preserve indices but I'm not positive), the issue is that a lot of third party libraries might return data such as np.arrays or new pandas objects that don't have the indices attached to them. 🥲
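
To illustrate that point with a standalone example (scikit-learn here purely for demonstration, not evalml code): the third-party estimator hands back a bare ndarray, so by the time infer_feature_types sees the output the original index is already gone and has to be reattached from the input.

import pandas as pd
from sklearn.linear_model import LogisticRegression

X = pd.DataFrame({"a": [0.0, 1.0, 2.0, 3.0]}, index=[25, 26, 27, 28])
y = pd.Series([0, 0, 1, 1], index=X.index)

raw = LogisticRegression().fit(X, y).predict(X)  # plain numpy.ndarray, custom index already lost
predictions = pd.Series(raw, index=X.index)      # the wrapping component must copy the index back from its input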

    y = y.map({val: str(val) for val in np.unique(y)})

    if use_custom_index:
        gen = np.random.default_rng(seed=0)
Contributor:

I know we do this in similar tests, but it's bold. I'm always hesitant to see pseudo-randomness added into a testing suite. I think it opens the door for potential flakiness and I'm not sure about the implications across platforms. I'm pretty sure this usage is fine, though.

angela97lin (Contributor Author):

Heh, I had taken inspiration from our other tests but agreed! The pseudo-randomness isn't really necessary here so I'll remove it from this instance and our other tests that use this :)
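
A minimal sketch of the deterministic alternative (illustrative only; X and y stand for the test's feature matrix and target):

    if use_custom_index:
        # A shifted range exercises "index differs from the default RangeIndex"
        # without introducing randomness into the test suite.
        custom_index = range(100, 100 + X.shape[0])
        X.index = custom_index
        y.index = custom_index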

@angela97lin angela97lin merged commit 36e0970 into main Nov 8, 2021
@angela97lin angela97lin deleted the 1639_preserve_custom_indices branch November 8, 2021 17:55
@chukarsten chukarsten mentioned this pull request Nov 9, 2021

Successfully merging this pull request may close these issues:

Pipeline predict should preserve the data index (#1639)