Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix SimpleImputer error which occurs when all features are bool type #1215

Merged
merged 11 commits into from Dec 1, 2020

Conversation

christopherbunn
Copy link
Contributor

@christopherbunn christopherbunn commented Sep 23, 2020

Fixed issues with all-bool inputs for Imputer and SimpleImputer by internally converting to the category datatype and reconverting to bools at the end.

Resolves #1166

@christopherbunn christopherbunn changed the title Added conversion of all bool to categories Added conversion of all bool to categories for imputera Sep 23, 2020
@christopherbunn christopherbunn changed the title Added conversion of all bool to categories for imputera Added conversion of all bool to categories for imputers Sep 23, 2020
@christopherbunn christopherbunn changed the title Added conversion of all bool to categories for imputers Added conversion of all bool to categories internally for imputers Sep 23, 2020
@codecov
Copy link

codecov bot commented Sep 23, 2020

Codecov Report

Merging #1215 (c47f85d) into main (b213240) will increase coverage by 0.1%.
The diff coverage is 100.0%.

Impacted file tree graph

@@            Coverage Diff            @@
##             main    #1215     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         223      223             
  Lines       15025    15100     +75     
=========================================
+ Hits        15018    15093     +75     
  Misses          7        7             
Impacted Files Coverage Δ
...components/transformers/imputers/simple_imputer.py 100.0% <100.0%> (ø)
evalml/tests/component_tests/test_imputer.py 100.0% <100.0%> (ø)
...valml/tests/component_tests/test_simple_imputer.py 100.0% <100.0%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b213240...c47f85d. Read the comment docs.

Copy link
Contributor

@jeremyliweishih jeremyliweishih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool looks good!

Copy link
Contributor

@dsherry dsherry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@christopherbunn thank you, looks good! I left one question about the behavior of the imputer for boolean inputs, may be a bug there.

imputer = Imputer()
imputer.fit(X_multi, y)
X_multi_expected_arr = pd.DataFrame({
"bool with nan": pd.Series([True, True, False, True, False]),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why were nans here filled with True rather than False? Your example data had one True and two False. I thought the default categorical strategy of most_frequent would fill nans with False.

@@ -67,6 +67,10 @@ def fit(self, X, y=None):

self._all_null_cols = set(X.columns) - set(X.dropna(axis=1, how='all').columns)
X_copy = X.copy()

if (X.dtypes == bool).all():
X_copy = X_copy.astype('category')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@angela97lin heads up these changes will cause conflicts for #1423 (update all components to accept woodwork datatables). I think they should be pretty straightforward to address--when you're resolving the conflicts you could call dt.set_logical_types(...) to do this transformation, or you could keep doing the transformation with pandas. Great that @christopherbunn is adding test coverage for this case, that'll help!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, #1423 was just merged!

@christopherbunn if you need a hand updating this, please ping @angela97lin or I

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oo yup, just saw this. Yes yes, please reach out if you need help fixing the merge conflicts!

@christopherbunn
Copy link
Contributor Author

@dsherry Thanks for the review! I looked at the test case a bit deeper and part of the reason why that passes is that np.nan actually gets casted to True when the data type is set to bool. As an example, this snippet below

    X_multi = pd.DataFrame({
        "bool with nan": pd.Series([True, np.nan, False, np.nan, False], dtype=bool),
        "bool no nan": pd.Series([False, False, False, False, True], dtype=bool),
    })

is actually equal to

    pd.DataFrame({
        "bool with nan": pd.Series([True, True, False, True, False], dtype=bool),
        "bool no nan": pd.Series([False, False, False, False, True], dtype=bool),
    })

Because of this, I'm not actually sure if its possible to impute a bool-only df like X_multi since it will always be filled with values that are either True or False. X_multi could exist with True, False, and np.nan values but only if the dtype is Categorical from the beginning. In that case, the existing code would already work.

I think that resolving this issue should instead focus on just catching this value error. The current HEAD of this PR will just return all-bool dfs as is without imputing it and I added a new case to test this behavior. Thoughts @angela97lin and @dsherry?

@angela97lin
Copy link
Contributor

@christopherbunn Yup, I'd agree that resolving this issue to just catch the ValueError sounds good to me. It's a limitation or design choice made by pandas, and one I came across as well while implementing Woodwork stuff. Now that we have woodwork integration in place for components, users can use Woodwork structures, which use the new nullable pd.NA, and this would be converted to a category (or pandas "object") column, which you've mentioned already works. (Maybe this is a worthwhile test to add for the imputer specifically?)

@dsherry
Copy link
Contributor

dsherry commented Nov 24, 2020

Discussed with @angela97lin @christopherbunn . Recommendations:

  • For SimpleImputer, if we get this error, catch the error and return the original data
  • Unit tests should cover both pandas as input and woodwork as input

Copy link
Contributor

@dsherry dsherry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a few suggestions which we should resolve before merging but LGTM!

evalml/tests/component_tests/test_imputer.py Outdated Show resolved Hide resolved
imputer = SimpleImputer()
imputer.fit(X, y)
X_t = imputer.transform(X)
assert_frame_equal(X_expected_arr, X_t)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is great!

I think the first two cases (i.e. above this line) are fully covered by the last case, right? I suggest you delete the first two cases, above this line, and keep the last one below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd argue for keeping all of the test cases. The first test case triggers the type conversion while the second one checks for proper boolean impution. The third one is more of an extension of the second one but it also combines a complete boolean column to make sure that the types are being processed properly.

@@ -25,6 +25,7 @@ Release Notes
* Updated enum classes to show possible enum values as attributes :pr:`1391`
* Updated calls to ``Woodwork``'s ``to_pandas()`` to ``to_series()`` and ``to_dataframe()`` :pr:`1428`
* Fixed bug in OHE where column names were not guaranteed to be unique :pr:`1349`
* Added conversion of all bool to categories internally for imputers :pr:`1215`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest "Fix SimpleImputer error which occurs when all features are bool type"

Copy link
Contributor

@freddyaboulton freddyaboulton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@christopherbunn I think this is fantastic! Thanks for making this change. I think your unit test for the Imputer shows that SimpleImputer and Imputer convert dtypes between input and output differently (even though Imputer calls SimpleImputer under the hood).

I don't think that should block merge (maybe that behavior existed before this change) so maybe the best thing is to address @dsherry 's comments (and any others that come up) and file an issue to continue the discussion.

evalml/tests/component_tests/test_imputer.py Outdated Show resolved Hide resolved
@christopherbunn christopherbunn changed the title Added conversion of all bool to categories internally for imputers Fix SimpleImputer error which occurs when all features are bool type Nov 25, 2020
@christopherbunn christopherbunn force-pushed the 1166_bool_impute branch 2 times, most recently from be01dca to aed294d Compare November 25, 2020 19:24
@christopherbunn christopherbunn force-pushed the 1166_bool_impute branch 2 times, most recently from 93ef88c to f81a4b4 Compare November 30, 2020 22:23
imputer = Imputer()
imputer.fit(X_multi, y)
X_multi_t = imputer.transform(X_multi)
assert_frame_equal(X_multi_expected_arr, X_multi_t)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unit tests discussed with @christopherbunn

  1. All cols are bool-type
    X = pd.DataFrame([True, True, False, True, True], dtype=bool)
  2. Imputation strategies work for bool type
    X = pd.DataFrame([True, np.nan, False, np.nan, True], dtype=object)
    X = pd.DataFrame([True, np.nan, False, np.nan, False], dtype=object)
  3. Multi-type, with at least one bool typed col
    X_multi_expected_arr = pd.DataFrame({
        "bool with nan": pd.Series([True, False, False, False, False], dtype=object),
        "bool no nan": pd.Series([False, False, False, False, True], dtype=bool),
    })

@@ -70,6 +74,9 @@ def transform(self, X, y=None):
# Convert None to np.nan, since None cannot be properly handled
X = X.fillna(value=np.nan)

# Return early since bool dtype doesn't support nans and sklearn errors if all cols are bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@christopherbunn should this comment be in fit too?

Copy link
Contributor

@dsherry dsherry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@christopherbunn thanks for chasing this down!

@christopherbunn christopherbunn merged commit d94397f into main Dec 1, 2020
@freddyaboulton freddyaboulton deleted the 1166_bool_impute branch May 13, 2022 15:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Boolean column to Imputer breaks if it is the only categorical column
5 participants