Add-handling-utils #4024
Conversation
```python
# Since Objective functions don't have the same safeguards around non-Woodwork inputs,
# we'll choose to avoid the downcasting path, since we shouldn't have nullable pandas types
# without them being set by Woodwork
if isinstance(y_true, pd.Series) and y_true.ww.schema is not None:
```
Wanted to highlight this behavior. At least in tests, we often call the objective functions with inputs that don't actually have woodwork types, and the objective functions don't have much in the way of logic to initialize woodwork. Given how late in the pipeline these are used, and the fact that whether the type is nullable or not doesn't have any downstream implications here, I decided to just not initialize Woodwork if the data didn't already have types. No reason to waste the time with type inference when the non-Woodwork inputs should always be compatible anyway.
Is there any risk here of a user passing in pandas nullable types without initializing woodwork? Is that a case we want to handle as well?
A user could pass them in if they were using the objective functions directly (I don't think we're at risk of them getting passed in with woodwork types from automl search).
The good news is, we can't get pandas nullable types in numpy inputs, so I think we'd cover user inputs by changing this to pass pandas data without woodwork types to the downcast utils.
Allowed pandas series without woodwork types in this commit e5e7a52
evalml/utils/nullable_type_utils.py (Outdated)
```python
X with any incompatible nullable types downcasted to compatible equivalents.
"""
# --> consider adding param for expecting there to not be any nans present so we're
# notified if we're ever unknowingly converting to Double or Categorical when we shouldn't in automl search
```
Just a small behavior consideration that I am not going to implement unless/until a need for it arises. Likely, it'd happen during the integration into AutoMLSearch if at all.
Codecov Report
```
@@           Coverage Diff           @@
##            main    #4024   +/-   ##
=======================================
+ Coverage   99.7%    99.7%   +0.1%
=======================================
  Files        347      349      +2
  Lines      36954    37200    +246
=======================================
+ Hits       36833    37079    +246
  Misses       121      121
```
"BooleanNullable": ("Boolean", "Categorical"), | ||
"IntegerNullable": ("Integer", "Double"), | ||
# --> age fractional or double? I think AgeFractional to avoid losing info | ||
"AgeNullable": ("Age", "AgeFractional"), |
We were previously converting to Integer/Double for age nullable columns, but as far as I can tell, there's no reason to not maintain this information when possible!
What's the underlying dtype of AgeFractional?
float64! So it's pretty much the same thing as what we're doing to IntegerNullable, but with the age information maintained.
(source: https://github.com/alteryx/woodwork/blob/main/woodwork/logical_types.py#L130-L141)
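The (no-nans, has-nans) pairing discussed in this thread can be sketched as a plain lookup. The dict mirrors the mapping quoted above; `pick_downcast_ltype` is a hypothetical helper that selects the downcast target based on whether the column contains missing values:

```python
import pandas as pd

# Mirrors the downcast mapping from the diff: each nullable logical type
# maps to (target_without_nans, target_with_nans). AgeNullable keeps its
# age semantics by falling back to AgeFractional (float64-backed) rather
# than a plain Double.
DOWNCAST_MATCHES = {
    "BooleanNullable": ("Boolean", "Categorical"),
    "IntegerNullable": ("Integer", "Double"),
    "AgeNullable": ("Age", "AgeFractional"),
}

def pick_downcast_ltype(ltype_name, col):
    """Hypothetical helper: choose the downcast target for a column."""
    no_nans_ltype, has_nans_ltype = DOWNCAST_MATCHES[ltype_name]
    return has_nans_ltype if col.isna().any() else no_nans_ltype
```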
evalml/utils/nullable_type_utils.py (Outdated)
```python
Returns:
    LogicalType string to be used to downcast incompatible nullable logical types.
"""
# --> maybe this can be configurable so we could easily choose different values to downcast to for specific components?
```
Another idea that might prove useful to implement at some point in the future. Say a component needed to convert from BooleanNullable to IntegerNullable for some reason; it'd be cool if we could leave the actual downcasted logical type decisions to the components.
This would add a level of complexity, though, that we shouldn't introduce unless necessary.
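A hedged sketch of what that per-component configurability could look like: a hypothetical `overrides` parameter layered on top of default targets, so a component can swap in its own choices without the util hard-coding every decision. All names here are illustrative, not evalml's actual API:

```python
# Hypothetical default downcast targets (nan-safe variants from the diff).
DEFAULT_TARGETS = {
    "BooleanNullable": "Categorical",
    "IntegerNullable": "Double",
    "AgeNullable": "AgeFractional",
}

def downcast_target(ltype_name, overrides=None):
    """Return the downcast target, letting a component override entries."""
    targets = {**DEFAULT_TARGETS, **(overrides or {})}
    return targets[ltype_name]

# A component that wants BooleanNullable -> IntegerNullable instead:
component_choice = downcast_target(
    "BooleanNullable", overrides={"BooleanNullable": "IntegerNullable"}
)
```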
This is some awesome work! Most of my comments are related to condensing logic - there's a lot that's repeated between X and y, and I think it would be cleaner to combine them a bit.
```python
MockComponent, _, _ = test_classes
y = nullable_type_target(ltype="IntegerNullable", has_nans=False)
X = nullable_type_test_data(has_nans=False)
```
Could you parameterize this test through a variety of inputs? i.e., no incompatibilities in (X, y), only incompatibilities in one of (X, y), only incompatibilities with (Boolean, Integer), etc. That might be a lot of work with the MockComponent class, so I'm open to other ways to test this, but I do want to make sure we have the coverage somewhere.
100% agree! I will look into the best way to test. Right now, I'm just thinking that it might be better to make a separate fixture that can take inputs instead of further bastardizing the test_classes like I am right now, which are used for much, much simpler tests elsewhere in this file.
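Such a parameterizable fixture might be sketched roughly like this — a hypothetical `make_inputs` builder that controls where the nullable-type incompatibilities appear, which a test could then loop or pytest-parameterize over. Names and column choices are illustrative only:

```python
import pandas as pd

# Hypothetical builder: place nullable dtypes in X, y, both, or neither,
# instead of overloading the shared test_classes fixture.
def make_inputs(nullable_in_X, nullable_in_y):
    X = pd.DataFrame({
        "ints": pd.Series([1, 2, 3], dtype="Int64" if nullable_in_X else "int64"),
        "bools": pd.Series([True, False, True], dtype="boolean" if nullable_in_X else "bool"),
    })
    y = pd.Series([0, 1, 0], dtype="Int64" if nullable_in_y else "int64")
    return X, y

# The component test could then sweep the combinations:
for nullable_in_X, nullable_in_y in [(False, False), (True, False), (False, True), (True, True)]:
    X, y = make_inputs(nullable_in_X, nullable_in_y)
    # run the component under test on (X, y) and assert on the resulting dtypes
```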
```python
import woodwork as ww


def _downcast_nullable_X(X, handle_boolean_nullable=True, handle_integer_nullable=True):
```
There's a lot of similar logic here between this and _downcast_nullable_y. I think it would make sense to combine these into a single function, and separate the X/y logic based on whether the input is a DataFrame or a Series.
It feels a bit more flexible to have a single function rather than two separate ones, and it would be able to condense down the number of tests in test_nullable_type_utils.py, since there's a lot of repeated logic between X and y tests there as well.
I don't disagree about the shared logic. I generally like utilities to not do too many things (it helps me with readability of code and how I think about tests, but it does indeed end up with more tests, and I totally see where you're coming from that these tests are very similar).
I'm gonna play around with different ways of sharing more logic among the separate utils vs combining them and see if I can put something together that doesn't feel as repetitive. One thing that may be useful here would be to test the shared logic separately, so that the tests for the two utils are really only testing the differing APIs for series vs dataframes.
@eccabay I pulled out more shared logic in the utils (along with some variable renaming) and refactored some tests to not be so repetitive across X and y downcasters: 601535b. I still feel like it makes sense to not squish them into one util with an if/else block, since the woodwork dataframe and series APIs are different enough that the two blocks of code wouldn't change much if they were in a single util at this point. It would just put the onus on anyone trying to understand the util in the future to understand the scope of the util's abilities.
If we decide to go with one util in the end, I would keep the tests the same and just remove the downcast_util parameter in fixtures, so making the change to one util will be very quick, and I can definitely be convinced that that's the right path.
I chose not to refactor tests that would need branching logic at both the setup and assertions, as I feel like that is a good indicator of when two checks really deserve their own tests.
"BooleanNullable": ("Boolean", "Categorical"), | ||
"IntegerNullable": ("Integer", "Double"), | ||
# --> age fractional or double? I think AgeFractional to avoid losing info | ||
"AgeNullable": ("Age", "AgeFractional"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the underlying dtype of AgeFractional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved pending the remaining comments! Overall, this looks very solid and I'm excited for the final result.
513ddd2 to 0ec0d30 (compare)
LGTM other than the test misses. Looks like nullable_ltype is never an instance of y_compatible_ltypes across all the tests. Should be good to merge once it's fixed! Great work!
"AgeNullable": ("Age", "AgeFractional"), | ||
} | ||
|
||
no_nans_ltype, has_nans_ltype = downcast_matches[str(col.ww.logical_type)] |
very nice!
Pull Request Description
closes #3990
Adds the utilities needed for component-specific handling of nullable types and starts adding related properties and methods to the base component and objective classes.
Remaining To Do