Fix BooleanNullable SimpleImputer bug #3959
Codecov Report
@@           Coverage Diff           @@
##            main   #3959     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%
=======================================
  Files        347     347
  Lines      36768   36776      +8
=======================================
+ Hits       36647   36656      +9
+ Misses       121     120      -1
self._boolean_cols = X.ww.schema._filter_cols(
    include=["Boolean", "BooleanNullable"],
)
I might be being pedantic, but is calling a private function on the schema the preferred way to do this? I thought there was a select function on the ww accessor.
Certainly. I stole this line directly from set_boolean_columns_to_integer, but I can switch both places over to use select instead.
@@ -124,11 +134,9 @@ def transform(self, X, y=None):

new_schema = original_schema.get_subset_schema(X_t.columns)

# TODO: Fix this after WW adds inference of object type booleans to BooleanNullable
Yay!
if logical_type in [NaturalLanguage, Categorical]:
    impute_strategy_to_use = "most_frequent"
if logical_type in [NaturalLanguage, Categorical, Boolean, BooleanNullable]:
    impute_strategy = "most_frequent"
Not a huge fan of how this was originally done, with impute_strategy iterating over a subset of the total impute_strategy values available and then being changed inside the test. But that's not your problem... we might want to think about rewriting this.
X_train = pd.DataFrame({"a": [pd.NA] * 20 + [1.0] + [pd.NA] * 20})
y = pd.Series(range(len(X_train)))
X_test = pd.DataFrame({"a": [pd.NA] * 10})
Times like these, I think it's helpful to add a docstring to the test to get at what exactly you're testing here. The test name doesn't seem to match what's going on. The test case here is that your training set is sparse and your test set happens to not be fully representative of all the classes available in X, right?
Basically, yeah. It's really just testing having an all-null test set when the training had non-null values. I'll update the test name and add a docstring
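For anyone following along, here is a rough sketch of the scenario in plain pandas. It does not use the actual SimpleImputer component: fit_most_frequent and impute are hypothetical helpers made up for illustration, and the single non-null training value is True (on a pandas "boolean" column) rather than the 1.0 used in the test.

```python
import pandas as pd


def fit_most_frequent(train_col):
    """Hypothetical helper: return the most frequent non-null value, or None."""
    modes = train_col.dropna().mode()
    return modes.iloc[0] if len(modes) > 0 else None


def impute(col, fill_value):
    """Hypothetical helper: fill nulls with the value learned during fit."""
    if fill_value is None:
        # Nothing was learned from training, so leave the column unchanged.
        return col
    return col.fillna(fill_value)


# Training data: mostly null, with one non-null value (assumed True here).
X_train = pd.DataFrame(
    {"a": pd.array([pd.NA] * 20 + [True] + [pd.NA] * 20, dtype="boolean")}
)
# Test data: entirely null.
X_test = pd.DataFrame({"a": pd.array([pd.NA] * 10, dtype="boolean")})

fill_value = fit_most_frequent(X_train["a"])
X_test_t = impute(X_test["a"], fill_value)
```

The point of the case is that transform should still succeed and fill the all-null column with the value learned during fit, while keeping the nullable boolean dtype.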
Pending select change, looks great!
Agree with @chukarsten's comments but otherwise LGTM
Co-authored-by: Jeremy Shih <jeremyliweishih@gmail.com>
…booleannullable-fix
Fixes the bug where all-null BooleanNullable columns will break the simple imputer during transform, when fit on nullable data that has a non-null value.