Fix SimpleImputer error which occurs when all features are bool type #1215

christopherbunn · 2020-09-23T19:01:53Z

Fixed issues with all-bool inputs for Imputer and SimpleImputer by internally converting to the category datatype and reconverting to bools at the end.

Resolves #1166

codecov · 2020-09-23T19:08:29Z

Codecov Report

Merging #1215 (c47f85d) into main (b213240) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@            Coverage Diff            @@
##             main    #1215     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         223      223             
  Lines       15025    15100     +75     
=========================================
+ Hits        15018    15093     +75     
  Misses          7        7

Impacted Files	Coverage Δ
...components/transformers/imputers/simple_imputer.py	`100.0% <100.0%> (ø)`
evalml/tests/component_tests/test_imputer.py	`100.0% <100.0%> (ø)`
...valml/tests/component_tests/test_simple_imputer.py	`100.0% <100.0%> (ø)`

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

jeremyliweishih

cool looks good!

dsherry

@christopherbunn thank you, looks good! I left one question about the behavior of the imputer for boolean inputs, may be a bug there.

dsherry · 2020-11-19T00:41:41Z

evalml/tests/component_tests/test_imputer.py

+    imputer = Imputer()
+    imputer.fit(X_multi, y)
+    X_multi_expected_arr = pd.DataFrame({
+        "bool with nan": pd.Series([True, True, False, True, False]),


Why were nans here filled with True rather than False? Your example data had one True and two False. I thought the default categorical strategy of most_frequent would fill nans with False.

dsherry · 2020-11-19T00:45:51Z

evalml/pipelines/components/transformers/imputers/imputer.py

@@ -67,6 +67,10 @@ def fit(self, X, y=None):

        self._all_null_cols = set(X.columns) - set(X.dropna(axis=1, how='all').columns)
        X_copy = X.copy()
+
+        if (X.dtypes == bool).all():
+            X_copy = X_copy.astype('category')


@angela97lin heads up these changes will cause conflicts for #1423 (update all components to accept woodwork datatables). I think they should be pretty straightforward to address--when you're resolving the conflicts you could call dt.set_logical_types(...) to do this transformation, or you could keep doing the transformation with pandas. Great that @christopherbunn is adding test coverage for this case, that'll help!

Ah, #1423 was just merged!

@christopherbunn if you need a hand updating this, please ping @angela97lin or me

Oo yup, just saw this. Yes yes, please reach out if you need help fixing the merge conflicts!

christopherbunn · 2020-11-20T23:22:19Z

@dsherry Thanks for the review! I looked at the test case a bit deeper, and part of the reason it passes is that np.nan actually gets cast to True when the data type is set to bool. As an example, the snippet below

    X_multi = pd.DataFrame({
        "bool with nan": pd.Series([True, np.nan, False, np.nan, False], dtype=bool),
        "bool no nan": pd.Series([False, False, False, False, True], dtype=bool),
    })

is actually equal to

    pd.DataFrame({
        "bool with nan": pd.Series([True, True, False, True, False], dtype=bool),
        "bool no nan": pd.Series([False, False, False, False, True], dtype=bool),
    })

Because of this, I'm not actually sure if it's possible to impute a bool-only df like X_multi, since it will always be filled with values that are either True or False. X_multi could exist with True, False, and np.nan values, but only if the dtype is Categorical from the beginning. In that case, the existing code would already work.
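A minimal sketch of the casting behavior described above, using only NumPy and pandas. The variable names are illustrative and not taken from the PR. It assumes the cast goes through NumPy's bool conversion, where `bool(np.nan)` is `True`, so a bool array has no slot for a missing value, while an object-dtype Series keeps the NaNs.

```python
import numpy as np
import pandas as pd

values = [True, np.nan, False, np.nan, False]

# A NumPy bool array cannot represent "missing": each NaN is converted
# with bool(np.nan), which is True.
as_bool = np.array(values, dtype=bool)

# With object dtype the NaNs survive and can still be detected.
as_object = pd.Series(values, dtype=object)
n_missing = int(as_object.isna().sum())

print(as_bool.tolist())
print(n_missing)
```

Once the data is stored as bool, the information about which entries were missing is gone, so there is nothing left for an imputer to fill.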

I think that resolving this issue should instead focus on just catching this value error. The current HEAD of this PR will just return all-bool dfs as is without imputing them, and I added a new case to test this behavior. Thoughts @angela97lin and @dsherry?

angela97lin · 2020-11-24T16:41:01Z

@christopherbunn Yup, I'd agree that resolving this issue to just catch the ValueError sounds good to me. It's a limitation or design choice made by pandas, and one I came across as well while implementing Woodwork stuff. Now that we have woodwork integration in place for components, users can use Woodwork structures, which use the new nullable pd.NA, and this would be converted to a category (or pandas "object") column, which you've mentioned already works. (Maybe this is a worthwhile test to add for the imputer specifically?)
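For illustration only, here is a small sketch of how pandas' nullable `"boolean"` extension dtype keeps missing values as `pd.NA`. It does not show Woodwork's own conversion logic, and the variable names are hypothetical.

```python
import pandas as pd

# The nullable "boolean" extension dtype stores missing entries as pd.NA
# instead of coercing them to True.
nullable = pd.Series([True, pd.NA, False, pd.NA, False], dtype="boolean")
n_missing = int(nullable.isna().sum())

# Filling with a fixed value only to show that the NA slots are real;
# the value an imputer would pick depends on its strategy.
filled = nullable.fillna(False)

print(n_missing)
print(filled.tolist())
```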

dsherry · 2020-11-24T17:15:07Z

Discussed with @angela97lin @christopherbunn. Recommendations:

* For SimpleImputer, if we get this error, catch the error and return the original data
* Unit tests should cover both pandas as input and woodwork as input
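The first recommendation could be sketched roughly as follows. This is a hypothetical pandas-only helper (`impute_most_frequent` is not an evalml or scikit-learn function) that assumes a most-frequent strategy. It returns all-bool input untouched and otherwise fills each column's missing values with that column's mode.

```python
import numpy as np
import pandas as pd


def impute_most_frequent(X: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper for illustration; not evalml's SimpleImputer."""
    # A bool-dtype column cannot hold NaN, so an all-bool frame has
    # nothing to impute and is returned as is.
    if (X.dtypes == bool).all():
        return X

    X_t = X.copy()
    for col in X_t.columns:
        if X_t[col].isna().any():
            modes = X_t[col].mode(dropna=True)
            # Columns that are entirely missing have no mode; leave them alone.
            if not modes.empty:
                X_t[col] = X_t[col].fillna(modes.iloc[0])
    return X_t


X = pd.DataFrame({
    "bool with nan": pd.Series([True, np.nan, False, np.nan, False], dtype=object),
    "bool no nan": pd.Series([False, False, False, False, True], dtype=bool),
})
X_t = impute_most_frequent(X)
```

With one True and two False in the object-dtype column, the missing entries are filled with False, which matches the expectation raised earlier in the review.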

dsherry

I left a few suggestions which we should resolve before merging but LGTM!

evalml/pipelines/components/transformers/imputers/simple_imputer.py

evalml/tests/component_tests/test_imputer.py

dsherry · 2020-11-25T01:48:56Z

evalml/tests/component_tests/test_simple_imputer.py

+    imputer = SimpleImputer()
+    imputer.fit(X, y)
+    X_t = imputer.transform(X)
+    assert_frame_equal(X_expected_arr, X_t)


This test is great!

I think the first two cases (i.e. above this line) are fully covered by the last case, right? I suggest you delete the first two cases, above this line, and keep the last one below.

I'd argue for keeping all of the test cases. The first test case triggers the type conversion, while the second one checks for proper boolean imputation. The third one is more of an extension of the second one, but it also includes a complete boolean column to make sure that the types are being processed properly.

dsherry · 2020-11-25T01:49:45Z

docs/source/release_notes.rst

@@ -25,6 +25,7 @@ Release Notes
        * Updated enum classes to show possible enum values as attributes :pr:`1391`
        * Updated calls to ``Woodwork``'s ``to_pandas()`` to ``to_series()`` and ``to_dataframe()`` :pr:`1428`
        * Fixed bug in OHE where column names were not guaranteed to be unique :pr:`1349`
+        * Added conversion of all bool to categories internally for imputers :pr:`1215`


Suggest "Fix SimpleImputer error which occurs when all features are bool type"

freddyaboulton

@christopherbunn I think this is fantastic! Thanks for making this change. I think your unit test for the Imputer shows that SimpleImputer and Imputer convert dtypes between input and output differently (even though Imputer calls SimpleImputer under the hood).

I don't think that should block merge (maybe that behavior existed before this change) so maybe the best thing is to address @dsherry 's comments (and any others that come up) and file an issue to continue the discussion.

evalml/tests/component_tests/test_imputer.py

dsherry · 2020-12-01T15:19:12Z

evalml/tests/component_tests/test_imputer.py

+    imputer = Imputer()
+    imputer.fit(X_multi, y)
+    X_multi_t = imputer.transform(X_multi)
+    assert_frame_equal(X_multi_expected_arr, X_multi_t)


Unit tests discussed with @christopherbunn

All cols are bool-type:

    X = pd.DataFrame([True, True, False, True, True], dtype=bool)

Imputation strategies work for bool type:

    X = pd.DataFrame([True, np.nan, False, np.nan, True], dtype=object)
    X = pd.DataFrame([True, np.nan, False, np.nan, False], dtype=object)

Multi-type, with at least one bool typed col:

    X_multi_expected_arr = pd.DataFrame({
        "bool with nan": pd.Series([True, False, False, False, False], dtype=object),
        "bool no nan": pd.Series([False, False, False, False, True], dtype=bool),
    })

dsherry · 2020-12-01T15:20:21Z

evalml/pipelines/components/transformers/imputers/simple_imputer.py

@@ -70,6 +74,9 @@ def transform(self, X, y=None):
        # Convert None to np.nan, since None cannot be properly handled
        X = X.fillna(value=np.nan)

+        # Return early since bool dtype doesn't support nans and sklearn errors if all cols are bool


@christopherbunn should this comment be in fit too?

dsherry

@christopherbunn thanks for chasing this down!

christopherbunn changed the title from "Added conversion of all bool to categories" to "Added conversion of all bool to categories for imputera" on Sep 23, 2020

christopherbunn changed the title from "Added conversion of all bool to categories for imputera" to "Added conversion of all bool to categories for imputers" on Sep 23, 2020

christopherbunn changed the title from "Added conversion of all bool to categories for imputers" to "Added conversion of all bool to categories internally for imputers" on Sep 23, 2020

christopherbunn force-pushed the 1166_bool_impute branch from ddc2b78 to f842c7e on September 24, 2020 14:51

christopherbunn force-pushed the 1166_bool_impute branch from f842c7e to 63c643e on November 16, 2020 16:10

christopherbunn marked this pull request as ready for review on November 16, 2020 17:45

christopherbunn requested review from dsherry, angela97lin, freddyaboulton, bchen1116, eccabay, jeremyliweishih and ParthivNaresh on November 18, 2020 16:27

christopherbunn force-pushed the 1166_bool_impute branch from aa3b6af to a0e8d3e on November 18, 2020 16:28

jeremyliweishih approved these changes on Nov 18, 2020

dsherry approved these changes on Nov 19, 2020

christopherbunn force-pushed the 1166_bool_impute branch from a0e8d3e to b1bbe56 on November 20, 2020 23:21

christopherbunn force-pushed the 1166_bool_impute branch 2 times, most recently from 5e3ab39 to 9603b1f, on November 24, 2020 21:02

christopherbunn requested review from dsherry and jeremyliweishih on November 24, 2020 21:52

dsherry approved these changes on Nov 25, 2020

freddyaboulton approved these changes on Nov 25, 2020

christopherbunn changed the title from "Added conversion of all bool to categories internally for imputers" to "Fix SimpleImputer error which occurs when all features are bool type" on Nov 25, 2020

christopherbunn force-pushed the 1166_bool_impute branch 2 times, most recently from be01dca to aed294d, on November 25, 2020 19:24

christopherbunn force-pushed the 1166_bool_impute branch 2 times, most recently from 93ef88c to f81a4b4, on November 30, 2020 22:23

dsherry reviewed on Dec 1, 2020

dsherry approved these changes on Dec 1, 2020

christopherbunn added 11 commits on December 1, 2020 11:55:

    9d03737  Added conversion of all bool to categories
    1be655c  Updated changelog
    a77c175  Dropped extra if-else case
    512262a  Changed behavior to just return all-bool DS
    59bcb5d  Updated all bool implementation and tests
    e99342d  Remove extra line and added bool type test case
    2b97f2b  Removed extraneous comment
    a279665  Fixed original all bool test case
    d96b5cc  Added comment for imputer and updated release notes
    162ef91  Moved code comment for simple_imputer
    c47f85d  Separated out imputer unit tests and added more context

christopherbunn force-pushed the 1166_bool_impute branch from f372da4 to c47f85d on December 1, 2020 16:56

christopherbunn merged commit d94397f into main Dec 1, 2020

freddyaboulton deleted the 1166_bool_impute branch May 13, 2022 15:17
