Fix Imputer indexing when sklearn imputer drops a row #1009
Wow, good catch on this, and thank you for adding in the test! Makes me wonder what other tests we could benefit from since none of our unit tests had caught this...
 if X_null_dropped.empty:
-    return pd.DataFrame(X_null_dropped, columns=X_null_dropped.columns)
-return X_null_dropped.astype(dtypes)
+    transformed = pd.DataFrame(X_null_dropped, columns=X_null_dropped.columns)
+transformed = X_null_dropped.astype(dtypes)
+transformed.reset_index(inplace=True, drop=True)
+return transformed
This will error out when X_null_dropped.empty. I think you need to keep the cases separated:
    if X_null_dropped.empty:
        transformed = pd.DataFrame(X_null_dropped, columns=X_null_dropped.columns)
    else:
        transformed = X_null_dropped.astype(dtypes)
    transformed.reset_index(inplace=True, drop=True)
    return transformed
It wasn't an issue before because we were simply returning!
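For anyone who wants to try the separated branches outside the component, here is a minimal standalone sketch. The helper name finalize_transformed is hypothetical, and it assumes dtypes is a plain dict mapping column names to dtypes; it is not the actual Imputer.transform body.

```python
import pandas as pd


def finalize_transformed(X_null_dropped, dtypes):
    """Hypothetical helper mirroring the suggested branch structure."""
    if X_null_dropped.empty:
        # Empty input: skip the dtype cast and just rebuild the frame.
        transformed = pd.DataFrame(X_null_dropped, columns=X_null_dropped.columns)
    else:
        # Non-empty input: restore the recorded column dtypes.
        transformed = X_null_dropped.astype(dtypes)
    # Both branches fall through to the same index reset.
    transformed.reset_index(inplace=True, drop=True)
    return transformed


# Non-empty frame whose index is not 0..n-1
non_empty = pd.DataFrame({"a": [1, 2]}, index=[5, 7])
out_non_empty = finalize_transformed(non_empty, {"a": "float64"})

# Empty frame with the same column
empty = pd.DataFrame(columns=["a"])
out_empty = finalize_transformed(empty, {"a": "float64"})
```

Both calls come back with a RangeIndex, which is the property the downstream components rely on.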
@dsherry I've left a comment about the code necessary to fix the implementation.
In order to pass the test_imputer_empty_data test (can't comment directly on the test since it's not edited code), I think you need to do expected.reset_index(inplace=True, drop=True) for the pd.DataFrame case. Can't say I 100% understand the difference between Index and RangeIndex, but resetting the index makes it a RangeIndex, while an empty pd.DataFrame (what we currently set as expected) has an index of Index, hence the test failing (along with what I wrote needed to be updated for the implementation).
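A small sketch of the Index vs RangeIndex difference described above. It assumes an empty frame built with an explicit object-dtype Index as a stand-in for expected, because the index type a bare empty pd.DataFrame gets by default can differ between pandas versions.

```python
import pandas as pd

# Assumption: explicit object-dtype Index, standing in for the `expected`
# frame from the test; this is not the test's actual setup code.
expected = pd.DataFrame(columns=["a"], index=pd.Index([], dtype=object))
index_type_before = type(expected.index).__name__

# reset_index(drop=True) replaces the index with a default RangeIndex.
expected.reset_index(inplace=True, drop=True)
index_type_after = type(expected.index).__name__

print(index_type_before, "->", index_type_after)
```

RangeIndex is a subclass of Index that stores only start, stop, and step, so the two frames hold the same (zero) rows but carry different index types, and a strict frame comparison can flag that.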
Codecov Report
@@ Coverage Diff @@
## main #1009 +/- ##
=======================================
Coverage 99.86% 99.86%
=======================================
Files 181 181
Lines 9584 9597 +13
=======================================
+ Hits 9571 9584 +13
Misses 13 13
LGTM, thanks for catching this! 👍
Background
We just merged an updated Imputer in #991, which handles numeric and categorical features separately. We're using that in automl instead of SimpleImputer, which is deprecated.
Problem
Suppose the index of the training dataframe doesn't range from 0 to n-1. In that case, SimpleImputer ends up resetting the index to range from 0 to n-1. However, Imputer does not. This causes the one-hot encoder to fill in the missing rows, causing issues with the estimators. The most visible symptom is that RandomForestClassifier produces a stack trace during automl search.
Repro
Fix
Have Imputer reset index at the end of transform.
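To illustrate why the index matters here, below is a pandas-only sketch of one plausible way a non-default index produces extra rows downstream. It is not the evalml pipeline code: the column names, the fillna stand-in for imputation, and the get_dummies stand-in for the one-hot encoder are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical training data whose index does not run from 0 to n-1
X = pd.DataFrame(
    {"num": [1.0, None, 3.0], "cat": ["a", "b", "a"]},
    index=[10, 20, 30],
)

# Stand-in for an imputer that keeps the original index
kept_index = X[["num"]].fillna(X["num"].mean())

# Stand-in for a downstream step that rebuilds a frame from a raw array,
# which gives it a fresh default 0..n-1 index
encoded = pd.DataFrame(
    pd.get_dummies(X["cat"]).to_numpy(), columns=["cat_a", "cat_b"]
)

# Column-wise concat aligns on index labels, so mismatched indexes
# produce the union of labels, padded with NaN
misaligned = pd.concat([kept_index, encoded], axis=1)

# Resetting the index first keeps the rows lined up
aligned = pd.concat([kept_index.reset_index(drop=True), encoded], axis=1)

print(misaligned.shape, aligned.shape)
```

With the index reset, the combined frame keeps its three rows and has no missing values; without it, the row count doubles and every row contains NaN, which is the kind of input an estimator would reject.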