Fix Imputer indexing when sklearn imputer drops a row #1009

dsherry · 2020-07-31T21:42:09Z

Background
We just merged an updated Imputer in #991, which handles numeric and categorical columns separately. We're now using it in automl instead of SimpleImputer, which is deprecated.

Problem
Suppose the index of the training dataframe doesn't range from 0 to n-1. In that case, SimpleImputer ends up resetting the index to range from 0 to n-1. However, Imputer does not. This causes the one-hot encoder to fill in the missing rows, causing issues with the estimators. The most visible symptom is that RandomForestClassifier produces a stack trace during automl search.
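
To make the failure mode concrete, here is a minimal pandas-only sketch of how mismatched indexes produce extra rows. It is not the evalml code: the column names and the `encoded` frame are hypothetical stand-ins for a downstream step that builds a fresh frame with a default 0 to n-1 index.

```python
import pandas as pd

# A frame whose index is not 0..n-1, as happens after train_test_split.
X = pd.DataFrame({"age": [22.0, 38.0, 26.0]}, index=[3, 7, 10])

# Hypothetical downstream output (e.g. encoded columns) built from raw values,
# so it carries a default RangeIndex of 0..2.
encoded = pd.DataFrame({"sex_male": [1, 0, 0]})

# pandas aligns on index labels, so the two frames do not line up:
# the result has the union of both indexes, padded with NaN.
misaligned = pd.concat([X, encoded], axis=1)

# Resetting the index first makes the rows line up positionally.
aligned = pd.concat([X.reset_index(drop=True), encoded], axis=1)

print(misaligned.shape)  # (6, 2)
print(aligned.shape)     # (3, 2)
```

The misaligned frame has six rows, half of them NaN in each column, which is the kind of padded input that could trip up an estimator.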

Repro

import pandas as pd
import evalml
from evalml import AutoMLSearch
from sklearn.model_selection import train_test_split

df = pd.read_csv("~/Downloads/titanic3.csv")
non_null = ~pd.isnull(df['survived'])
df = df[non_null]
y = df["survived"]
X = df.drop("survived", axis=1)
# this is what messes with the indexes -- train_test_split doesn't reset them
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

pipeline_class = evalml.pipelines.utils.make_pipeline(
    X_train, y_train,
    evalml.pipelines.components.estimators.classifiers.RandomForestClassifier,
    'binary')
pipeline = pipeline_class(parameters={})
# the call to fit fails
pipeline.fit(X_train, y_train)

Fix
Have Imputer reset the index at the end of transform.

angela97lin · 2020-07-31T21:46:31Z

Wow, good catch on this, and thank you for adding in the test! Makes me wonder what other tests we could benefit from since none of our unit tests had caught this...

angela97lin · 2020-07-31T22:30:40Z

evalml/pipelines/components/transformers/imputers/imputer.py

        if X_null_dropped.empty:
-            return pd.DataFrame(X_null_dropped, columns=X_null_dropped.columns)
-        return X_null_dropped.astype(dtypes)
+            transformed = pd.DataFrame(X_null_dropped, columns=X_null_dropped.columns)
+        transformed = X_null_dropped.astype(dtypes)
+        transformed.reset_index(inplace=True, drop=True)
+        return transformed


This will error out when X_null_dropped.empty is true, because the second assignment still runs and calls astype(dtypes) on the empty frame. I think you need to keep the cases separated:

if X_null_dropped.empty:
    transformed = pd.DataFrame(X_null_dropped, columns=X_null_dropped.columns)
else:
    transformed = X_null_dropped.astype(dtypes)
transformed.reset_index(inplace=True, drop=True)
return transformed

It wasn't an issue before because we were simply returning!
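
As a rough, standalone illustration of the branching above: the helper name `finalize` is hypothetical, and the sketch assumes `dtypes` is a column-to-dtype mapping captured before any columns were dropped, which the thread does not confirm. Under that assumption, pandas raises a KeyError from astype when a key is no longer a column, so the empty case has to skip the cast. The sketch uses the non-inplace form of reset_index, which has the same effect here.

```python
import pandas as pd

def finalize(X_null_dropped, dtypes):
    """Hypothetical helper mirroring the suggested branching."""
    if X_null_dropped.empty:
        # Nothing left to cast; keep the (empty) frame as-is.
        transformed = pd.DataFrame(X_null_dropped, columns=X_null_dropped.columns)
    else:
        transformed = X_null_dropped.astype(dtypes)
    # Reset to a 0..n-1 index in both cases.
    return transformed.reset_index(drop=True)

X = pd.DataFrame({"a": [1.0, 2.0]}, index=[4, 9])
dtypes = {"a": "float64"}  # assumed shape of the dtype mapping

normal = finalize(X, dtypes)

# Simulate every column being dropped: the frame has rows but no columns.
all_dropped = X.drop(columns=["a"])
empty_result = finalize(all_dropped, dtypes)

# Without the branch, the cast fails because "a" is no longer a column.
try:
    all_dropped.astype(dtypes)
    cast_failed = False
except KeyError:
    cast_failed = True
```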

angela97lin

@dsherry I've left a comment about the code necessary to fix the implementation.

In order to pass the test_imputer_empty_data test (I can't comment directly on the test since it's not edited code), I think you need to call expected.reset_index(inplace=True, drop=True) for the pd.DataFrame case. I can't say I 100% understand the difference between Index and RangeIndex, but resetting the index makes it a RangeIndex, while an empty pd.DataFrame (what we currently set as expected) has an index of type Index, hence the test failure. This is in addition to the implementation change I described above.
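
The Index versus RangeIndex distinction can be seen in a small pandas sketch. The index type of a bare pd.DataFrame() differs between pandas versions, so this example uses an empty frame sliced from explicitly labeled data instead; the frame and labels are made up for illustration.

```python
import pandas as pd

# Explicit, non-contiguous labels give a plain integer Index, not a RangeIndex.
X = pd.DataFrame({"a": [1, 2]}, index=[5, 7])

# Slicing away all rows keeps the original index type.
empty = X.iloc[0:0]

# reset_index(drop=True) replaces it with a RangeIndex, even with zero rows.
reset = empty.reset_index(drop=True)

is_range_before = isinstance(empty.index, pd.RangeIndex)
is_range_after = isinstance(reset.index, pd.RangeIndex)
print(type(empty.index).__name__, type(reset.index).__name__)
```

Both frames are empty, but their index types differ, and an equality check that compares index types strictly will treat them as unequal.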

codecov · 2020-08-01T02:15:46Z

Codecov Report

Merging #1009 into main will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main    #1009   +/-   ##
=======================================
  Coverage   99.86%   99.86%           
=======================================
  Files         181      181           
  Lines        9584     9597   +13     
=======================================
+ Hits         9571     9584   +13     
  Misses         13       13

Impacted Files	Coverage Δ
...elines/components/transformers/imputers/imputer.py	`100.00% <100.00%> (ø)`
evalml/tests/component_tests/test_imputer.py	`100.00% <100.00%> (ø)`

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d6cc78f...2b49b67.

angela97lin

LGTM, thanks for catching this! 👍

dsherry added 3 commits July 31, 2020 16:54

Add reproducers

e708ba4

Add fix, update repro

38933a5

Delete unneeded

2cda8d1

dsherry added the bug label Jul 31, 2020

dsherry added this to the July 2020 milestone Jul 31, 2020

dsherry requested review from angela97lin, ctduffy, jeremyliweishih, freddyaboulton and eccabay July 31, 2020 21:42

auto-assign bot assigned dsherry Jul 31, 2020

Release notes

b16657f

This was referenced Jul 31, 2020

Revert #991: Adds Imputer to allow different imputation strategies for numerical and categorical dtypes #1010

Closed

AutoML can configure SimpleImputer to apply invalid imputation for categorical dtype #881

Closed

angela97lin reviewed Jul 31, 2020

angela97lin suggested changes Jul 31, 2020

Special-case when all cols are dropped. Update test expected value

88ebd0c

Codecov

2b49b67

dsherry requested review from angela97lin, ctduffy, jeremyliweishih and freddyaboulton and removed request for angela97lin, ctduffy, jeremyliweishih, freddyaboulton and eccabay August 3, 2020 15:47

dsherry requested a review from eccabay August 3, 2020 15:49

angela97lin approved these changes Aug 3, 2020

dsherry merged commit 7ea8fc6 into main Aug 3, 2020

angela97lin mentioned this pull request Aug 3, 2020

Release v0.12.0 #1007

Merged

freddyaboulton deleted the ds_881_imputer_reset_index branch May 13, 2022 15:23
