Add-handling-utils #4024
Conversation
```python
# Since Objective functions don't have the same safeguards around non-Woodwork inputs,
# we'll choose to avoid the downcasting path, since we shouldn't have nullable pandas types
# without them being set by Woodwork
if isinstance(y_true, pd.Series) and y_true.ww.schema is not None:
```
Wanted to highlight this behavior. At least in tests, we often call the objective functions with inputs that don't actually have woodwork types, and the objective functions don't have much in the way of logic to initialize woodwork. Given how late in the pipeline these are used, and the fact that whether the type is nullable or not doesn't have any downstream implications here, I decided to just not initialize Woodwork if the data didn't already have types. No reason to waste the time with type inference when the non-Woodwork inputs should always be compatible anyway.
Is there any risk here of a user passing in pandas nullable types without initializing woodwork? Is that a case we want to handle as well?
A user could pass them in if they were using the objective functions directly (I don't think we're at risk of them getting passed in with woodwork types from automl search).
The good news is, we can't get pandas nullable types in numpy inputs, so I think we'd cover user inputs by changing this to pass pandas data without woodwork types to the downcast utils.
Allowed pandas series without woodwork types in this commit e5e7a52
evalml/utils/nullable_type_utils.py (Outdated)
```python
X with any incompatible nullable types downcasted to compatible equivalents.
"""
# --> consider adding param for expecting there to not be any nans present so we're
# notified if we're ever unknowingly converting to Double or Categorical when we shouldn't in automl search
```
Just a small behavior consideration that I am not going to implement unless/until a need for it arises. Likely, it'd happen during the integration into AutoMLSearch if at all.
Codecov Report
```
@@           Coverage Diff           @@
##            main    #4024   +/-   ##
=======================================
+ Coverage   99.7%    99.7%   +0.1%
=======================================
  Files        347      349      +2
  Lines      36954    37200    +246
=======================================
+ Hits       36833    37079    +246
  Misses       121      121
```
"BooleanNullable": ("Boolean", "Categorical"), | ||
"IntegerNullable": ("Integer", "Double"), | ||
# --> age fractional or double? I think AgeFractional to avoid losing info | ||
"AgeNullable": ("Age", "AgeFractional"), |
We were previously converting to Integer/Double for age nullable columns, but as far as I can tell, there's no reason to not maintain this information when possible!
What's the underlying dtype of AgeFractional?
float64! So it's pretty much the same thing as what we're doing to IntegerNullable, but with the age information maintained.
(source: https://github.com/alteryx/woodwork/blob/main/woodwork/logical_types.py#L130-L141)
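The (no-nans, has-nans) pairing discussed in this thread can be sketched as a plain lookup. The dict mirrors the mapping quoted above; `pick_downcast_ltype` is a hypothetical helper that selects the downcast target based on whether the column contains missing values:

```python
import pandas as pd

# Mirrors the downcast mapping from the diff: each nullable logical type
# maps to (target_without_nans, target_with_nans). AgeNullable keeps its
# age semantics by falling back to AgeFractional (float64-backed) rather
# than a plain Double.
DOWNCAST_MATCHES = {
    "BooleanNullable": ("Boolean", "Categorical"),
    "IntegerNullable": ("Integer", "Double"),
    "AgeNullable": ("Age", "AgeFractional"),
}

def pick_downcast_ltype(ltype_name, col):
    """Hypothetical helper: choose the downcast target for a column."""
    no_nans_ltype, has_nans_ltype = DOWNCAST_MATCHES[ltype_name]
    return has_nans_ltype if col.isna().any() else no_nans_ltype
```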
evalml/utils/nullable_type_utils.py (Outdated)
```python
Returns:
    LogicalType string to be used to downcast incompatible nullable logical types.
"""
# --> maybe this can be configurable so we could easily choose different values to downcast to for specific components?
```
Another idea that might prove useful to implement at some point in the future. Say a component needed to convert from BooleanNullable to IntegerNullable for some reason; it'd be cool if we could leave the actual downcasted logical type decisions to the components.
This would add a level of complexity, though, that we shouldn't introduce unless necessary.
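A hedged sketch of what that per-component configurability could look like: a hypothetical `overrides` parameter layered on top of default targets, so a component can swap in its own choices without the util hard-coding every decision. All names here are illustrative, not evalml's actual API:

```python
# Hypothetical default downcast targets (nan-safe variants from the diff).
DEFAULT_TARGETS = {
    "BooleanNullable": "Categorical",
    "IntegerNullable": "Double",
    "AgeNullable": "AgeFractional",
}

def downcast_target(ltype_name, overrides=None):
    """Return the downcast target, letting a component override entries."""
    targets = {**DEFAULT_TARGETS, **(overrides or {})}
    return targets[ltype_name]

# A component that wants BooleanNullable -> IntegerNullable instead:
component_choice = downcast_target(
    "BooleanNullable", overrides={"BooleanNullable": "IntegerNullable"}
)
```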
This is some awesome work! Most of my comments are related to condensing logic - there's a lot that's repeated between X and y, and I think it would be cleaner to combine them a bit.
```python
MockComponent, _, _ = test_classes
y = nullable_type_target(ltype="IntegerNullable", has_nans=False)
X = nullable_type_test_data(has_nans=False)
```
Could you parameterize this test through a variety of inputs? i.e., no incompatibilities in (X, y), only incompatibilities in one of (X, y), only incompatibilities with (Boolean, Integer), etc. That might be a lot of work with the MockComponent class, so I'm open to other ways to test this, but I do want to make sure we have the coverage somewhere.
100% agree! I will look into the best way to test. Right now, I'm just thinking that it might be better to make a separate fixture that can take inputs instead of further bastardizing the test_classes like I am right now, which are used for much, much simpler tests elsewhere in this file.
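Such a parameterizable fixture might be sketched roughly like this — a hypothetical `make_inputs` builder that controls where the nullable-type incompatibilities appear, which a test could then loop or pytest-parameterize over. Names and column choices are illustrative only:

```python
import pandas as pd

# Hypothetical builder: place nullable dtypes in X, y, both, or neither,
# instead of overloading the shared test_classes fixture.
def make_inputs(nullable_in_X, nullable_in_y):
    X = pd.DataFrame({
        "ints": pd.Series([1, 2, 3], dtype="Int64" if nullable_in_X else "int64"),
        "bools": pd.Series([True, False, True], dtype="boolean" if nullable_in_X else "bool"),
    })
    y = pd.Series([0, 1, 0], dtype="Int64" if nullable_in_y else "int64")
    return X, y

# The component test could then sweep the combinations:
for nullable_in_X, nullable_in_y in [(False, False), (True, False), (False, True), (True, True)]:
    X, y = make_inputs(nullable_in_X, nullable_in_y)
    # run the component under test on (X, y) and assert on the resulting dtypes
```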
```python
import woodwork as ww


def _downcast_nullable_X(X, handle_boolean_nullable=True, handle_integer_nullable=True):
```
There's a lot of similar logic here between this and _downcast_nullable_y. I think it would make sense to combine these into a single function, and separate the X/y logic based on whether the input is a DataFrame or a Series.
It feels a bit more flexible to have a single function rather than two separate ones, and it would be able to condense down the number of tests in test_nullable_type_utils.py, since there's a lot of repeated logic between X and y tests there as well.
I don't disagree about the shared logic. I generally like utilities to not do too many things (it helps me with readability of code and how I think about tests, but it does indeed end up with more tests, and I totally see where you're coming from that these tests are very similar).
I'm gonna play around with different ways of sharing more logic among the separate utils vs combining them and see if I can put something together that doesn't feel as repetitive. One thing that may be useful here would be to test the shared logic separately, so that the tests for the two utils are really only testing the differing APIs for series vs dataframes.
@eccabay I pulled out more shared logic in the utils (along with some variable renaming) and refactored some tests to not be so repetitive across X and y downcasters: 601535b. I still feel like it makes sense to not squish them into one util with an if/else block, since the woodwork dataframe and series APIs are different enough that the two blocks of code wouldn't change much if they were in a single util at this point. It would just put the onus on anyone trying to understand the util in the future to understand the scope of the util's abilities.
If we decide to go with one util in the end, I would keep the tests the same and just remove the downcast_util parameter in fixtures, so making the change to one util will be very quick, and I can definitely be convinced that that's the right path.
I chose not to refactor tests that would need branching logic at both the setup and assertions, as I feel like that is a good indicator of when two checks really deserve their own tests.
"BooleanNullable": ("Boolean", "Categorical"), | ||
"IntegerNullable": ("Integer", "Double"), | ||
# --> age fractional or double? I think AgeFractional to avoid losing info | ||
"AgeNullable": ("Age", "AgeFractional"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the underlying dtype of AgeFractional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved pending the remaining comments! Overall, this looks very solid and I'm excited for the final result.
513ddd2 to 0ec0d30 (compare)
LGTM other than the test misses. Looks like nullable_ltype is never an instance of y_compatible_ltypes across all the tests. Should be good to merge once it's fixed! Great work!
"AgeNullable": ("Age", "AgeFractional"), | ||
} | ||
|
||
no_nans_ltype, has_nans_ltype = downcast_matches[str(col.ww.logical_type)] |
very nice!
Pull Request Description
closes #3990
Adds the utilities needed for component-specific handling of nullable types and starts adding related properties and methods to the base component and objective classes.
Remaining To Do