Allow AutoMLSearch to handle Unknown type #2477
Conversation
Codecov Report
@@           Coverage Diff           @@
##            main   #2477    +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%
=======================================
  Files        283     283
  Lines      25808   25897     +89
=======================================
+ Hits       25707   25796     +89
  Misses       101     101
Continue to review full report at Codecov.
    for logical_type in override_types:
        # Column with Nans to boolean used to fail. Now it doesn't
-       if has_nan and logical_type == Boolean:
+       if has_nan and logical_type in [Boolean, NaturalLanguage]:
Leave out NaturalLanguage since casting this will result in np.nan becoming pd.NA, which fails the imputer
But how come this wouldn't happen before? Woodwork doesn't convert np.nan to pd.NA in 0.4.2? Is this because of the pandas upgrade?
@freddyaboulton I believe it's due to the pandas upgrade!

Looking at the 1.3.0 release docs, it seems like there are a lot of changes to NaN handling, and they're using <NA> for scalar types
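A minimal sketch of the behavior being discussed (the column name and values here are illustrative, not from the PR): casting a pandas Series containing np.nan to the nullable "string" dtype replaces np.nan with pd.NA, which is the mismatch that trips up the imputer.

```python
import numpy as np
import pandas as pd

# A free-text column with a missing value stored as np.nan.
s = pd.Series(["some text", np.nan])
print(type(s[1]))        # np.nan is a plain float

# Casting to pandas' nullable "string" dtype turns np.nan into pd.NA,
# which downstream components expecting np.nan may not handle.
cast = s.astype("string")
print(cast[1])           # <NA>
print(cast[1] is pd.NA)  # True
```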
Thanks for looking into this @bchen1116! Let's list this as a breaking change for now. I imagine we might want to file an issue to discuss whether there are any changes we need to make to the SimpleImputer? If users run it on natural language after this PR, they'll get a stacktrace they didn't get before.
@freddyaboulton updated the release notes with the breaking change and filed the issue here!
    assert nl_nan_check.validate([nl_col, nl_col_without_nan]) == expected

-   # test np.array
-   assert nl_nan_check.validate(np.array([nl_col, nl_col_without_nan])) == expected
Since we can't cast this to NL type without ww, we remove this from the tests. They'll be cast as Unknown
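A small illustration of why a raw np.array can't carry the NaturalLanguage type (the array contents here are made up): a 2-D numpy array has a single dtype for every element, so per-column logical-type metadata has nowhere to live, and the text is left to type inference.

```python
import numpy as np
import pandas as pd

# One dtype for the whole array - no per-column type info can be attached.
arr = np.array([["first sentence", "second sentence"],
                ["more text", "even more text"]])
print(arr.dtype)  # a single unicode dtype, e.g. <U15

# Round-tripping through a DataFrame yields plain object columns; without
# Woodwork typing, such text columns are left to be inferred (as Unknown).
df = pd.DataFrame(arr)
print(df.dtypes.tolist())
```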
@bchen1116 Thank you for doing this! The changes look good to me and I'm glad the performance looks better with this change for a lot of the datasets we commonly test with.
One thing I'd like to find out/cover with (small/fast) unit tests before merge though - Do our model understanding methods support Unknown features?
I think we should focus on
- partial dependence
- permutation importance
- prediction explanations
My hope is yes but I'm not 100% sure.
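As a sketch of the kind of small/fast unit check being asked for (using scikit-learn's permutation importance directly rather than evalml's own API, and a made-up tiny dataset): confirm the method runs end to end on a fitted model in well under a second.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.RandomState(0)
X = rng.rand(30, 3)
y = (X[:, 0] > 0.5).astype(int)  # label depends only on the first feature

clf = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)

# Few repeats keeps the test fast; we only care that it runs and returns
# one importance value per feature.
result = permutation_importance(clf, X, y, n_repeats=3, random_state=0)
print(result.importances_mean.shape)  # (3,)
```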
chukarsten left a comment:
A nit here and there, but overall, I like where this is headed. One thing I'd like you to consider, particularly with respect to the prediction explanations, partial dependence, and permutation importance, is whether we're adding back in more run time. I need to reacquaint myself with Freddy's testing environment/context manager to see whether it could help us there, but I'd just like to be cognizant of whether we're walking back any of his work by adding in these tests.
Overall, though, this is solid. Thanks a lot for addressing it so quickly from design to impl and getting it out there.
-   partial_dependence(pl, X, indices, grid_resolution=10)
-   return
+   s = partial_dependence(pl, X, indices, grid_resolution=10)
How low can we keep this grid_resolution?
Lowest is 2
@chukarsten two of the tests take about 1 second, while the permutation importance test adds about 3 seconds. I think it's important to have the coverage from these tests, and they don't take long enough to raise concern for now, imo. Let me know what you think!
Fixes #2426
Perf tests here