-
Notifications
You must be signed in to change notification settings - Fork 89
Adds Imputer to allow different imputation strategies for numerical and categorical dtypes #991
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
evalml/pipelines/components/transformers/imputers/simple_imputer.py
Outdated
Show resolved
Hide resolved
Codecov Report
@@ Coverage Diff @@
## main #991 +/- ##
========================================
Coverage 99.86% 99.86%
========================================
Files 179 181 +2
Lines 9436 9584 +148
========================================
+ Hits 9423 9571 +148
Misses 13 13
Continue to review full report at Codecov.
|
evalml/pipelines/components/transformers/imputers/simple_imputer.py
Outdated
Show resolved
Hide resolved
@kmax12 @dsherry @freddyaboulton Might refactor tests to try all imputation strategies in this PR or in a latter one but implementation is ready for review again and I wanted to try kicking off the perf testing so please re-review if you get a chance! :') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@angela97lin thanks for hustling to get this updated! It looks great to me.
Approved, pending perf test results comparison to ensure that a) timing and accuracy aren't degraded by this change, and b) we don't see any new errors on the perf test datasets introduced by this change.
I also left some comments I think are blocking, mostly minor:
- Let's resolve the discussion in
fit
about defining a list of allowed categorical types. I think we should do that rather than just filtering out the numerics, to avoid bugs. - Remove "constant" strategy from automl search hyperparameters
- Small docstring update
- Add another release notes entry to highlight the fact that this is a potential performance improvement on mixed-type data.
Your test coverage looks great. A couple things I thought of:
- What happens if the provided data isn't empty, but also doesn't contain any numeric or categorical cols? I.e. all datetimes or something. Related to discussion in
fit
about datatype selection. - Checking that the default parameters match what we expect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@angela97lin well done on this!!
I did some poking around at the first batch perf test results earlier. Looked good to me at first glance. I'd say since this is a low-risk PR and we're more interested in defense than we are in performance gain here, feel free to merge this now so we can start the release moving, and post the perf test plots after that. That ok?
@dsherry I just generated the plot! |
Closes #881
Notes:
fill_value
as is right now to be used by both the numeric and categorical imputers. Default ofNone
works for both right now, but could be an issue if user sets it. Could file separate issue or fix here, open to thoughts! (Filed Imputer: same fill value used for numeric/categorical for "constant" strategy #1000)