Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Imputer indexing when sklearn imputer drops a row #1009

Merged
merged 6 commits into from Aug 3, 2020

Conversation

dsherry
Copy link
Contributor

@dsherry dsherry commented Jul 31, 2020

Background
We just merged an updated Imputer in #991 , which handles numeric/categorical separately. And we're using that in automl instead of SimpleImputer, which is deprecated

Problem
Suppose the index of the training dataframe doesn't range from 0 to n-1. In that case, SimpleImputer ends up resetting the index to range from 0 to n-1. However, Imputer does not. This causes the one-hot encoder to fill in the missing rows, causing issues with the estimators. The most visible symptom is that RandomForestClassifier produces a stack trace during automl search.

Repro

import pandas as pd
import evalml
from evalml import AutoMLSearch
from sklearn.model_selection import train_test_split

df = pd.read_csv("~/Downloads/titanic3.csv")
non_null = ~pd.isnull(df['survived'])
df = df[non_null]
y = df["survived"]
X = df.drop("survived", axis=1)
# this is what messes with the indexes -- train_test_split doesn't reset them
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

pipeline_class = evalml.pipelines.utils.make_pipeline(
    X_train, y_train,
    evalml.pipelines.components.estimators.classifiers.RandomForestClassifier,
    'binary')
pipeline = pipeline_class(parameters={})
# the call to fit fails
pipeline.fit(X_train, y_train)

Fix
Have Imputer reset index at the end of transform

@dsherry dsherry added the bug Issues tracking problems with existing features. label Jul 31, 2020
@dsherry dsherry added this to the July 2020 milestone Jul 31, 2020
@angela97lin
Copy link
Contributor

Wow, good catch on this, and thank you for adding in the test! Makes me wonder what other tests we could benefit from since none of our unit tests had caught this...

Comment on lines 100 to 104
if X_null_dropped.empty:
return pd.DataFrame(X_null_dropped, columns=X_null_dropped.columns)
return X_null_dropped.astype(dtypes)
transformed = pd.DataFrame(X_null_dropped, columns=X_null_dropped.columns)
transformed = X_null_dropped.astype(dtypes)
transformed.reset_index(inplace=True, drop=True)
return transformed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will error out when X_null_dropped.empty. I think you need to keep the cases separated:

        if X_null_dropped.empty:
            transformed = pd.DataFrame(X_null_dropped, columns=X_null_dropped.columns)
        else:
            transformed = X_null_dropped.astype(dtypes)
        transformed.reset_index(inplace=True, drop=True)
        return transformed

It wasn't an issue before because we were simply returning!

Copy link
Contributor

@angela97lin angela97lin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dsherry I've left a comment about the code necessary to fix the implementation.

In order to pass the test_imputer_empty_data test (can't comment directly on test since it's not edited code), I think you need to do expected.reset_index(inplace=True, drop=True) for the pd.DataFrame case. Can't say I 100% understand the difference between Index and RangeIndex but resetting the index makes it a RangeIndex while an empty pd.DataFrame (what we currently set as expected) has an index of Index, hence the test failing (along with what I wrote needed to be updated for the implementation).

@codecov
Copy link

codecov bot commented Aug 1, 2020

Codecov Report

Merging #1009 into main will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #1009   +/-   ##
=======================================
  Coverage   99.86%   99.86%           
=======================================
  Files         181      181           
  Lines        9584     9597   +13     
=======================================
+ Hits         9571     9584   +13     
  Misses         13       13           
Impacted Files Coverage Δ
...elines/components/transformers/imputers/imputer.py 100.00% <100.00%> (ø)
evalml/tests/component_tests/test_imputer.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d6cc78f...2b49b67. Read the comment docs.

@dsherry dsherry requested a review from eccabay August 3, 2020 15:49
Copy link
Contributor

@angela97lin angela97lin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for catching this! 👍

@dsherry dsherry merged commit 7ea8fc6 into main Aug 3, 2020
@angela97lin angela97lin mentioned this pull request Aug 3, 2020
@freddyaboulton freddyaboulton deleted the ds_881_imputer_reset_index branch May 13, 2022 15:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issues tracking problems with existing features.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants