Adds Imputer to allow different imputation strategies for numerical and categorical dtypes #991

Merged
merged 34 commits into from
Jul 31, 2020

@angela97lin angela97lin commented Jul 28, 2020

Closes #881

Notes:

@angela97lin angela97lin self-assigned this Jul 28, 2020
@angela97lin angela97lin changed the title Adds TypedImputer to address allowing different imputation strategies for numerical and categorical dtypes Adds TypedImputer to allow different imputation strategies for numerical and categorical dtypes Jul 28, 2020
@angela97lin angela97lin added this to the July 2020 milestone Jul 29, 2020
codecov bot commented Jul 30, 2020

Codecov Report

Merging #991 into main will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff            @@
##             main     #991    +/-   ##
========================================
  Coverage   99.86%   99.86%            
========================================
  Files         179      181     +2     
  Lines        9436     9584   +148     
========================================
+ Hits         9423     9571   +148     
  Misses         13       13            
Impacted Files Coverage Δ
evalml/pipelines/components/__init__.py 100.00% <ø> (ø)
evalml/data_checks/invalid_targets_data_check.py 100.00% <100.00%> (ø)
evalml/data_checks/label_leakage_data_check.py 100.00% <100.00%> (ø)
evalml/data_checks/outliers_data_check.py 100.00% <100.00%> (ø)
...alml/pipelines/components/transformers/__init__.py 100.00% <100.00%> (ø)
...lines/components/transformers/imputers/__init__.py 100.00% <100.00%> (ø)
...elines/components/transformers/imputers/imputer.py 100.00% <100.00%> (ø)
...components/transformers/imputers/simple_imputer.py 100.00% <100.00%> (ø)
evalml/pipelines/regression_pipeline.py 100.00% <100.00%> (ø)
evalml/pipelines/utils.py 100.00% <100.00%> (ø)
... and 9 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 030896b...bdc623c.

@angela97lin angela97lin mentioned this pull request Jul 30, 2020
3 tasks
@angela97lin angela97lin requested a review from kmax12 July 30, 2020 15:23
@angela97lin angela97lin changed the title Adds TypedImputer to allow different imputation strategies for numerical and categorical dtypes Adds Imputer to allow different imputation strategies for numerical and categorical dtypes Jul 31, 2020
@angela97lin
Contributor Author

@kmax12 @dsherry @freddyaboulton I might refactor the tests to try all imputation strategies, either in this PR or in a later one, but the implementation is ready for review again and I wanted to kick off the perf testing, so please re-review if you get a chance! :')

@angela97lin angela97lin requested review from kmax12 and dsherry July 31, 2020 15:37
Contributor

@dsherry dsherry left a comment

@angela97lin thanks for hustling to get this updated! It looks great to me.

Approved, pending perf test results comparison to ensure that a) timing and accuracy aren't degraded by this change, and b) we don't see any new errors on the perf test datasets introduced by this change.

I also left some comments I think are blocking, mostly minor:

  • Let's resolve the discussion in fit about defining a list of allowed categorical types. I think we should do that rather than just filtering out the numerics, to avoid bugs.
  • Remove "constant" strategy from automl search hyperparameters
  • Small docstring update
  • Add another release notes entry to highlight the fact that this is a potential performance improvement on mixed-type data.
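To illustrate the first point above, here is a minimal sketch of why an explicit allow-list of categorical dtypes is safer than treating every non-numeric column as categorical. It uses only pandas' `select_dtypes`; the helper names and the contents of `CATEGORICAL_DTYPES` are assumptions made for illustration and are not taken from this PR.

```python
import pandas as pd

# Assumed allow-lists for illustration only; the PR may choose different dtypes.
NUMERIC_DTYPES = ["number"]
CATEGORICAL_DTYPES = ["object", "category", "bool"]


def split_by_exclusion(X):
    """Fragile approach: every non-numeric column is treated as categorical."""
    numeric = X.select_dtypes(include=NUMERIC_DTYPES).columns.tolist()
    categorical = X.select_dtypes(exclude=NUMERIC_DTYPES).columns.tolist()
    return numeric, categorical


def split_by_allow_list(X):
    """Safer approach: only dtypes on the allow-list count as categorical."""
    numeric = X.select_dtypes(include=NUMERIC_DTYPES).columns.tolist()
    categorical = X.select_dtypes(include=CATEGORICAL_DTYPES).columns.tolist()
    return numeric, categorical


X = pd.DataFrame({
    "age": pd.Series([31.0, None, 45.0], dtype="float64"),
    "city": pd.Series(["Boston", None, "Austin"], dtype="object"),
    "signup": pd.Series(pd.to_datetime(["2020-07-01", None, "2020-07-31"])),
})

# The datetime column leaks into the categorical group under exclusion,
# but is left out when an allow-list is used.
print(split_by_exclusion(X))
print(split_by_allow_list(X))
```

With exclusion, the `signup` datetime column would be handed to the categorical imputation strategy; with the allow-list it is simply left alone.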

Your test coverage looks great. A couple things I thought of:

  • What happens if the provided data isn't empty, but also doesn't contain any numeric or categorical cols? I.e. all datetimes or something. Related to discussion in fit about datatype selection.
  • Checking that the default parameters match what we expect.
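The two test ideas above could be sketched roughly as follows. `impute_by_type` is a hypothetical stand-in written with plain pandas, not the component added in this PR, and its default strategies are illustrative assumptions; the point is only the shape of the checks (all-datetime input passes through unchanged, and defaults are asserted explicitly).

```python
import inspect

import pandas as pd

CATEGORICAL_DTYPES = ["object", "category", "bool"]  # assumed allow-list


def impute_by_type(X, numeric_strategy="mean", categorical_strategy="most_frequent"):
    """Hypothetical helper: fill missing values separately per column type.

    Columns that are neither numeric nor on the categorical allow-list
    (for example datetimes) are returned untouched.
    """
    if numeric_strategy not in ("mean", "median"):
        raise ValueError(f"Unsupported numeric strategy: {numeric_strategy}")
    if categorical_strategy != "most_frequent":
        raise ValueError(f"Unsupported categorical strategy: {categorical_strategy}")

    X = X.copy()
    for col in X.select_dtypes(include=["number"]).columns:
        fill = X[col].mean() if numeric_strategy == "mean" else X[col].median()
        X[col] = X[col].fillna(fill)
    for col in X.select_dtypes(include=CATEGORICAL_DTYPES).columns:
        modes = X[col].mode()  # mode() ignores missing values by default
        if not modes.empty:
            X[col] = X[col].fillna(modes.iloc[0])
    return X


# Case 1: non-empty data with no numeric or categorical columns.
all_dates = pd.DataFrame(
    {"signup": pd.Series(pd.to_datetime(["2020-07-01", None]))}
)
dates_result = impute_by_type(all_dates)

# Case 2: mixed-type data, each group filled with its own strategy.
mixed = pd.DataFrame({
    "age": pd.Series([30.0, None, 50.0], dtype="float64"),
    "city": pd.Series(["Boston", "Boston", None], dtype="object"),
})
mixed_result = impute_by_type(mixed)

# Case 3: collect the default parameters so a test can pin them down.
defaults = {
    name: param.default
    for name, param in inspect.signature(impute_by_type).parameters.items()
    if param.default is not inspect.Parameter.empty
}
```

Pinning the defaults in a test means an accidental change to them fails loudly instead of silently altering pipeline behavior.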

Contributor

@dsherry dsherry left a comment

@angela97lin well done on this!!

I did some poking around at the first batch perf test results earlier. Looked good to me at first glance. I'd say since this is a low-risk PR and we're more interested in defense than we are in performance gain here, feel free to merge this now so we can start the release moving, and post the perf test plots after that. That ok?

@angela97lin
Contributor Author

angela97lin commented Jul 31, 2020

@dsherry I just generated the plot!

image

image

Successfully merging this pull request may close these issues.

AutoML can configure SimpleImputer to apply invalid imputation for categorical dtype
6 participants