-
Notifications
You must be signed in to change notification settings - Fork 890
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update MultiplyNumericBoolean to work with more inputs #2393
Conversation
>>> multiply_numeric_boolean = MultiplyNumericBoolean() | ||
>>> multiply_numeric_boolean([2, 1, 2], [True, True, False]).tolist() | ||
[2, 1, 0] | ||
>>> multiply_numeric_boolean([2, None, None], [True, True, False]).tolist() | ||
[2.0, nan, nan] | ||
>>> multiply_numeric_boolean([2, 1, 2], [True, True, None]).tolist() | ||
[2.0, 1.0, nan] | ||
>>> multiply_numeric_boolean([2, 1, 2], pd.Series([True, True, pd.NA], dtype="boolean")).tolist() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the new code the old boolean series had a dtype of object and pandas wasn't converting this to Int64
. Updated here to use boolean
since we should always get WW initialized series as inputs during normal CFM operations and that will have a dtype of either bool
or boolean
.
Codecov Report
@@ Coverage Diff @@
## main #2393 +/- ##
=======================================
Coverage 99.50% 99.50%
=======================================
Files 322 322
Lines 20146 20161 +15
=======================================
+ Hits 20046 20061 +15
Misses 100 100
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
@@ -1808,7 +1808,7 @@ def test_multiply_numeric_boolean(): | |||
|
|||
multiply_numeric_boolean = MultiplyNumericBoolean() | |||
for input in test_cases: | |||
vals = pd.Series(input["val"]) | |||
vals = pd.Series(input["val"]).astype("Int64") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't covert here the input with a single pd.NA
value ends up with a dtype of object
which would never happen in real usage. An object
input to the primitive also causes an error.
@@ -41,7 +41,7 @@ dependencies = [ | |||
"distributed >= 2022.2.0, !=2022.10.1", | |||
"holidays >= 0.13", | |||
"numpy >= 1.21.0", | |||
"packaging >= 20.0", | |||
"packaging >= 20.0, <22.0", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Temporary for testing - can remove here or in PR #2389 depending on timing of merges.
# Only apply mask where the input is not null | ||
mask = mask.where(vals_not_null) | ||
result = vals.where(mask, mask.replace({False: 0})) | ||
result = vals * bools.astype("Int64") | ||
# Replace all pd.NA with np.nan to avoid WW init error | ||
result = result.replace({pd.NA: np.nan}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this line? If so, is it for a particular case? In the doctest, the result has a pd.NA
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good question. On the surface it seems like it is no longer necessary, but let me look a bit closer.
One thing I'm concerned about - the doc test doesn't actually initialize Woodwork, that happens later in the CFM process and this comment said it was done to avoid a WW init error. The doctest wouldn't cover that, and also the doctest example doesn't contain a np.nan
because the return type is Int64
which cannot hold np.nan
, so the replace call does nothing in that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran DFS/CFM using a dataframe that had all the column dtype combinations that we should run into, and it worked successfully. That leads me to believe this replace
call is no longer required. I'll try to remove and see if our CI passes for the min package versions as well.
Here are the features that were generated and computed successfully:
[<Feature: int>,
<Feature: Int_no_null>,
<Feature: Int_null>,
<Feature: float>,
<Feature: float_null>,
<Feature: bool>,
<Feature: boolean_no_null>,
<Feature: boolean_null>,
<Feature: bool * Int_no_null>,
<Feature: bool * Int_null>,
<Feature: bool * float>,
<Feature: bool * float_null>,
<Feature: bool * int>,
<Feature: boolean_no_null * Int_no_null>,
<Feature: boolean_no_null * Int_null>,
<Feature: boolean_no_null * float>,
<Feature: boolean_no_null * float_null>,
<Feature: boolean_no_null * int>,
<Feature: boolean_null * Int_no_null>,
<Feature: boolean_null * Int_null>,
<Feature: boolean_null * float>,
<Feature: boolean_null * float_null>,
<Feature: boolean_null * int>]
Update MultiplyNumericBoolean to work with more inputs
Certain combination of input dtypes would result in errors with newer pandas versions. This PR refactors the calculation in
MultiplyNumericBoolean
to resolve this issue.Fixes #2390