-
Notifications
You must be signed in to change notification settings - Fork 89
Fix SimpleImputer error which occurs when all features are bool type #1215
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## main #1215 +/- ##
=========================================
+ Coverage 100.0% 100.0% +0.1%
=========================================
Files 223 223
Lines 15025 15100 +75
=========================================
+ Hits 15018 15093 +75
Misses 7 7
Continue to review full report at Codecov.
|
ddc2b78
to
f842c7e
Compare
f842c7e
to
63c643e
Compare
aa3b6af
to
a0e8d3e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cool looks good!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@christopherbunn thank you, looks good! I left one question about the behavior of the imputer for boolean inputs, may be a bug there.
imputer = Imputer() | ||
imputer.fit(X_multi, y) | ||
X_multi_expected_arr = pd.DataFrame({ | ||
"bool with nan": pd.Series([True, True, False, True, False]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why were nans here filled with True
rather than False
? Your example data had one True
and two False
. I thought the default categorical strategy of most_frequent
would fill nans with False
.
@@ -67,6 +67,10 @@ def fit(self, X, y=None): | |||
|
|||
self._all_null_cols = set(X.columns) - set(X.dropna(axis=1, how='all').columns) | |||
X_copy = X.copy() | |||
|
|||
if (X.dtypes == bool).all(): | |||
X_copy = X_copy.astype('category') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@angela97lin heads up these changes will cause conflicts for #1423 (update all components to accept woodwork datatables). I think they should be pretty straightforward to address--when you're resolving the conflicts you could call dt.set_logical_types(...)
to do this transformation, or you could keep doing the transformation with pandas. Great that @christopherbunn is adding test coverage for this case, that'll help!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, #1423 was just merged!
@christopherbunn if you need a hand updating this, please ping @angela97lin or I
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oo yup, just saw this. Yes yes, please reach out if you need help fixing the merge conflicts!
a0e8d3e
to
b1bbe56
Compare
@dsherry Thanks for the review! I looked at the test case a bit deeper and part of the reason why that passes is that np.nan actually gets casted to True when the data type is set to bool. As an example, this snippet below
is actually equal to
Because of this, I'm not actually sure if its possible to impute a bool-only df like I think that resolving this issue should instead focus on just catching this value error. The current HEAD of this PR will just return all-bool dfs as is without imputing it and I added a new case to test this behavior. Thoughts @angela97lin and @dsherry? |
@christopherbunn Yup, I'd agree that resolving this issue to just catch the ValueError sounds good to me. It's a limitation or design choice made by pandas, and one I came across as well while implementing Woodwork stuff. Now that we have woodwork integration in place for components, users can use Woodwork structures, which use the new nullable pd.NA, and this would be converted to a category (or pandas "object") column, which you've mentioned already works. (Maybe this is a worthwhile test to add for the imputer specifically?) |
Discussed with @angela97lin @christopherbunn . Recommendations:
|
5e3ab39
to
9603b1f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a few suggestions which we should resolve before merging but LGTM!
imputer = SimpleImputer() | ||
imputer.fit(X, y) | ||
X_t = imputer.transform(X) | ||
assert_frame_equal(X_expected_arr, X_t) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is great!
I think the first two cases (i.e. above this line) are fully covered by the last case, right? I suggest you delete the first two cases, above this line, and keep the last one below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd argue for keeping all of the test cases. The first test case triggers the type conversion while the second one checks for proper boolean impution. The third one is more of an extension of the second one but it also combines a complete boolean column to make sure that the types are being processed properly.
docs/source/release_notes.rst
Outdated
@@ -25,6 +25,7 @@ Release Notes | |||
* Updated enum classes to show possible enum values as attributes :pr:`1391` | |||
* Updated calls to ``Woodwork``'s ``to_pandas()`` to ``to_series()`` and ``to_dataframe()`` :pr:`1428` | |||
* Fixed bug in OHE where column names were not guaranteed to be unique :pr:`1349` | |||
* Added conversion of all bool to categories internally for imputers :pr:`1215` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest "Fix SimpleImputer error which occurs when all features are bool type"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@christopherbunn I think this is fantastic! Thanks for making this change. I think your unit test for the Imputer
shows that SimpleImputer
and Imputer
convert dtypes between input and output differently (even though Imputer
calls SimpleImputer
under the hood).
I don't think that should block merge (maybe that behavior existed before this change) so maybe the best thing is to address @dsherry 's comments (and any others that come up) and file an issue to continue the discussion.
be01dca
to
aed294d
Compare
93ef88c
to
f81a4b4
Compare
imputer = Imputer() | ||
imputer.fit(X_multi, y) | ||
X_multi_t = imputer.transform(X_multi) | ||
assert_frame_equal(X_multi_expected_arr, X_multi_t) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unit tests discussed with @christopherbunn
- All cols are bool-type
X = pd.DataFrame([True, True, False, True, True], dtype=bool)
- Imputation strategies work for bool type
X = pd.DataFrame([True, np.nan, False, np.nan, True], dtype=object)
X = pd.DataFrame([True, np.nan, False, np.nan, False], dtype=object)
- Multi-type, with at least one bool typed col
X_multi_expected_arr = pd.DataFrame({
"bool with nan": pd.Series([True, False, False, False, False], dtype=object),
"bool no nan": pd.Series([False, False, False, False, True], dtype=bool),
})
@@ -70,6 +74,9 @@ def transform(self, X, y=None): | |||
# Convert None to np.nan, since None cannot be properly handled | |||
X = X.fillna(value=np.nan) | |||
|
|||
# Return early since bool dtype doesn't support nans and sklearn errors if all cols are bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@christopherbunn should this comment be in fit
too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@christopherbunn thanks for chasing this down!
f372da4
to
c47f85d
Compare
Fixed issues with all-bool inputs for
Imputer
andSimpleImputer
by internally converting to the category datatype and reconverting to bools at the end.Resolves #1166