Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix BooleanNullable SimpleImputer bug #3959

Merged
merged 10 commits into from
Jan 26, 2023
Merged

Fix BooleanNullable SimpleImputer bug #3959

merged 10 commits into from
Jan 26, 2023

Conversation

eccabay
Copy link
Contributor

@eccabay eccabay commented Jan 26, 2023

Fixes the bug where all-null BooleanNullable columns will break the simple imputer during transform, when fit on nullable data that has a non-null value.

@codecov
Copy link

codecov bot commented Jan 26, 2023

Codecov Report

Merging #3959 (82d29ef) into main (7e035b0) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #3959     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        347     347             
  Lines      36768   36776      +8     
=======================================
+ Hits       36647   36656      +9     
+ Misses       121     120      -1     
Impacted Files Coverage Δ
...components/transformers/imputers/simple_imputer.py 100.0% <100.0%> (+1.7%) ⬆️
evalml/pipelines/components/utils.py 96.2% <100.0%> (-<0.1%) ⬇️
...valml/tests/component_tests/test_simple_imputer.py 100.0% <100.0%> (ø)
evalml/tests/component_tests/test_utils.py 99.1% <100.0%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

Comment on lines 88 to 90
self._boolean_cols = X.ww.schema._filter_cols(
include=["Boolean", "BooleanNullable"],
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be being pedantic, but is the preferred way to call a private function on the schema? I thought there was a select function on the ww accessor?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Certainly. I stole this line directly from set_boolean_columns_to_integer, but I can switch both places over to use select instead.

@@ -124,11 +134,9 @@ def transform(self, X, y=None):

new_schema = original_schema.get_subset_schema(X_t.columns)

# TODO: Fix this after WW adds inference of object type booleans to BooleanNullable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yay!

if logical_type in [NaturalLanguage, Categorical]:
impute_strategy_to_use = "most_frequent"
if logical_type in [NaturalLanguage, Categorical, Boolean, BooleanNullable]:
impute_strategy = "most_frequent"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a huge fan of how this was originally done - with impute_strategy iterating over a subset of the total impute_strategy available and changing it in the test. But that's not your problem...we might want to think about rewriting this.

Comment on lines +613 to +615
X_train = pd.DataFrame({"a": [pd.NA] * 20 + [1.0] + [pd.NA] * 20})
y = pd.Series(range(len(X_train)))
X_test = pd.DataFrame({"a": [pd.NA] * 10})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Times like these, I think it's helpful to docstring the test to get at what exactly you're testing here. The test name doesn't seem to match what's going on. The test case here is that you're train is sparse and your test set happens to not be fully representative of all the classes available in X, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically, yeah. It's really just testing having an all-null test set when the training had non-null values. I'll update the test name and add a docstring

Copy link
Contributor

@chukarsten chukarsten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pending select change, looks great!

Copy link
Collaborator

@jeremyliweishih jeremyliweishih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with @chukarsten's comments but otherwise LGTM

evalml/tests/component_tests/test_simple_imputer.py Outdated Show resolved Hide resolved
@eccabay eccabay enabled auto-merge (squash) January 26, 2023 18:28
@eccabay eccabay merged commit 10a4980 into main Jan 26, 2023
@eccabay eccabay deleted the booleannullable-fix branch January 26, 2023 19:01
@chukarsten chukarsten mentioned this pull request Jan 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants