Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update MultiplyNumericBoolean to work with more inputs #2393

Merged
merged 8 commits into from
Dec 8, 2022

Conversation

thehomebrewnerd
Copy link
Contributor

Update MultiplyNumericBoolean to work with more inputs

Certain combination of input dtypes would result in errors with newer pandas versions. This PR refactors the calculation in MultiplyNumericBoolean to resolve this issue.

Fixes #2390

>>> multiply_numeric_boolean = MultiplyNumericBoolean()
>>> multiply_numeric_boolean([2, 1, 2], [True, True, False]).tolist()
[2, 1, 0]
>>> multiply_numeric_boolean([2, None, None], [True, True, False]).tolist()
[2.0, nan, nan]
>>> multiply_numeric_boolean([2, 1, 2], [True, True, None]).tolist()
[2.0, 1.0, nan]
>>> multiply_numeric_boolean([2, 1, 2], pd.Series([True, True, pd.NA], dtype="boolean")).tolist()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the new code the old boolean series had a dtype of object and pandas wasn't converting this to Int64. Updated here to use boolean since we should always get WW initialized series as inputs during normal CFM operations and that will have a dtype of either bool or boolean.

@codecov
Copy link

codecov bot commented Dec 8, 2022

Codecov Report

Merging #2393 (d68d93e) into main (f804d4b) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main    #2393   +/-   ##
=======================================
  Coverage   99.50%   99.50%           
=======================================
  Files         322      322           
  Lines       20146    20161   +15     
=======================================
+ Hits        20046    20061   +15     
  Misses        100      100           
Impacted Files Coverage Δ
...ndard/transform/binary/multiply_numeric_boolean.py 100.00% <100.00%> (ø)
...s/tests/primitive_tests/test_transform_features.py 99.87% <100.00%> (+<0.01%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@@ -1808,7 +1808,7 @@ def test_multiply_numeric_boolean():

multiply_numeric_boolean = MultiplyNumericBoolean()
for input in test_cases:
vals = pd.Series(input["val"])
vals = pd.Series(input["val"]).astype("Int64")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't covert here the input with a single pd.NA value ends up with a dtype of object which would never happen in real usage. An object input to the primitive also causes an error.

@gsheni gsheni requested a review from a team December 8, 2022 17:20
@@ -41,7 +41,7 @@ dependencies = [
"distributed >= 2022.2.0, !=2022.10.1",
"holidays >= 0.13",
"numpy >= 1.21.0",
"packaging >= 20.0",
"packaging >= 20.0, <22.0",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Temporary for testing - can remove here or in PR #2389 depending on timing of merges.

# Only apply mask where the input is not null
mask = mask.where(vals_not_null)
result = vals.where(mask, mask.replace({False: 0}))
result = vals * bools.astype("Int64")
# Replace all pd.NA with np.nan to avoid WW init error
result = result.replace({pd.NA: np.nan})
Copy link
Contributor

@sbadithe sbadithe Dec 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this line? If so, is it for a particular case? In the doctest, the result has a pd.NA.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good question. On the surface it seems like it is no longer necessary, but let me look a bit closer.

One thing I'm concerned about - the doc test doesn't actually initialize Woodwork, that happens later in the CFM process and this comment said it was done to avoid a WW init error. The doctest wouldn't cover that, and also the doctest example doesn't contain a np.nan because the return type is Int64 which cannot hold np.nan, so the replace call does nothing in that case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran DFS/CFM using a dataframe that had all the column dtype combinations that we should run into, and it worked successfully. That leads me to believe this replace call is no longer required. I'll try to remove and see if our CI passes for the min package versions as well.

Here are the features that were generated and computed successfully:

[<Feature: int>,
 <Feature: Int_no_null>,
 <Feature: Int_null>,
 <Feature: float>,
 <Feature: float_null>,
 <Feature: bool>,
 <Feature: boolean_no_null>,
 <Feature: boolean_null>,
 <Feature: bool * Int_no_null>,
 <Feature: bool * Int_null>,
 <Feature: bool * float>,
 <Feature: bool * float_null>,
 <Feature: bool * int>,
 <Feature: boolean_no_null * Int_no_null>,
 <Feature: boolean_no_null * Int_null>,
 <Feature: boolean_no_null * float>,
 <Feature: boolean_no_null * float_null>,
 <Feature: boolean_no_null * int>,
 <Feature: boolean_null * Int_no_null>,
 <Feature: boolean_null * Int_null>,
 <Feature: boolean_null * float>,
 <Feature: boolean_null * float_null>,
 <Feature: boolean_null * int>]

@thehomebrewnerd thehomebrewnerd enabled auto-merge (squash) December 8, 2022 19:28
@thehomebrewnerd thehomebrewnerd merged commit 722fbe8 into main Dec 8, 2022
@thehomebrewnerd thehomebrewnerd deleted the issue2390-multiply-numeric-boolean-bug branch December 8, 2022 19:28
@thehomebrewnerd thehomebrewnerd mentioned this pull request Dec 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MultiplyNumericBoolean primitive doesn't work properly with BooleanNullable inputs on recent pandas versions
2 participants