Update MultiplyNumericBoolean to work with more inputs#2393
Conversation
| [2.0, nan, nan] | ||
| >>> multiply_numeric_boolean([2, 1, 2], [True, True, None]).tolist() | ||
| [2.0, 1.0, nan] | ||
| >>> multiply_numeric_boolean([2, 1, 2], pd.Series([True, True, pd.NA], dtype="boolean")).tolist() |
There was a problem hiding this comment.
With the new code the old boolean series had a dtype of object and pandas wasn't converting this to Int64. Updated here to use boolean since we should always get WW initialized series as inputs during normal CFM operations and that will have a dtype of either bool or boolean.
Codecov Report
@@ Coverage Diff @@
## main #2393 +/- ##
=======================================
Coverage 99.50% 99.50%
=======================================
Files 322 322
Lines 20146 20161 +15
=======================================
+ Hits 20046 20061 +15
Misses 100 100
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
| multiply_numeric_boolean = MultiplyNumericBoolean() | ||
| for input in test_cases: | ||
| vals = pd.Series(input["val"]) | ||
| vals = pd.Series(input["val"]).astype("Int64") |
There was a problem hiding this comment.
If we don't covert here the input with a single pd.NA value ends up with a dtype of object which would never happen in real usage. An object input to the primitive also causes an error.
| "holidays >= 0.13", | ||
| "numpy >= 1.21.0", | ||
| "packaging >= 20.0", | ||
| "packaging >= 20.0, <22.0", |
There was a problem hiding this comment.
Temporary for testing - can remove here or in PR #2389 depending on timing of merges.
| result = vals.where(mask, mask.replace({False: 0})) | ||
| result = vals * bools.astype("Int64") | ||
| # Replace all pd.NA with np.nan to avoid WW init error | ||
| result = result.replace({pd.NA: np.nan}) |
There was a problem hiding this comment.
Do we need this line? If so, is it for a particular case? In the doctest, the result has a pd.NA.
There was a problem hiding this comment.
That's a good question. On the surface it seems like it is no longer necessary, but let me look a bit closer.
One thing I'm concerned about - the doc test doesn't actually initialize Woodwork, that happens later in the CFM process and this comment said it was done to avoid a WW init error. The doctest wouldn't cover that, and also the doctest example doesn't contain a np.nan because the return type is Int64 which cannot hold np.nan, so the replace call does nothing in that case.
There was a problem hiding this comment.
I ran DFS/CFM using a dataframe that had all the column dtype combinations that we should run into, and it worked successfully. That leads me to believe this replace call is no longer required. I'll try to remove and see if our CI passes for the min package versions as well.
Here are the features that were generated and computed successfully:
[<Feature: int>,
<Feature: Int_no_null>,
<Feature: Int_null>,
<Feature: float>,
<Feature: float_null>,
<Feature: bool>,
<Feature: boolean_no_null>,
<Feature: boolean_null>,
<Feature: bool * Int_no_null>,
<Feature: bool * Int_null>,
<Feature: bool * float>,
<Feature: bool * float_null>,
<Feature: bool * int>,
<Feature: boolean_no_null * Int_no_null>,
<Feature: boolean_no_null * Int_null>,
<Feature: boolean_no_null * float>,
<Feature: boolean_no_null * float_null>,
<Feature: boolean_no_null * int>,
<Feature: boolean_null * Int_no_null>,
<Feature: boolean_null * Int_null>,
<Feature: boolean_null * float>,
<Feature: boolean_null * float_null>,
<Feature: boolean_null * int>]
Update MultiplyNumericBoolean to work with more inputs
Certain combination of input dtypes would result in errors with newer pandas versions. This PR refactors the calculation in
MultiplyNumericBooleanto resolve this issue.Fixes #2390