Adds Imputer to allow different imputation strategies for numerical and categorical dtypes #991

Merged
angela97lin merged 34 commits into main from 881_typed_transformer on Jul 31, 2020

Conversation

angela97lin
Contributor

@angela97lin angela97lin commented Jul 28, 2020

Closes #881

Notes:

@angela97lin angela97lin self-assigned this Jul 28, 2020
@angela97lin angela97lin changed the title from "Adds TypedImputer to address allowing different imputation strategies for numerical and categorical dtypes" to "Adds TypedImputer to allow different imputation strategies for numerical and categorical dtypes" Jul 28, 2020
@angela97lin angela97lin added this to the July 2020 milestone Jul 29, 2020
@codecov

codecov bot commented Jul 30, 2020

Codecov Report

Merging #991 into main will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff            @@
##             main     #991    +/-   ##
========================================
  Coverage   99.86%   99.86%            
========================================
  Files         179      181     +2     
  Lines        9436     9584   +148     
========================================
+ Hits         9423     9571   +148     
  Misses         13       13            
Impacted Files Coverage Δ
evalml/pipelines/components/__init__.py 100.00% <ø> (ø)
evalml/data_checks/invalid_targets_data_check.py 100.00% <100.00%> (ø)
evalml/data_checks/label_leakage_data_check.py 100.00% <100.00%> (ø)
evalml/data_checks/outliers_data_check.py 100.00% <100.00%> (ø)
...alml/pipelines/components/transformers/__init__.py 100.00% <100.00%> (ø)
...lines/components/transformers/imputers/__init__.py 100.00% <100.00%> (ø)
...elines/components/transformers/imputers/imputer.py 100.00% <100.00%> (ø)
...components/transformers/imputers/simple_imputer.py 100.00% <100.00%> (ø)
evalml/pipelines/regression_pipeline.py 100.00% <100.00%> (ø)
evalml/pipelines/utils.py 100.00% <100.00%> (ø)
... and 9 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 030896b...bdc623c.

@angela97lin angela97lin mentioned this pull request Jul 30, 2020
3 tasks
@angela97lin angela97lin requested a review from kmax12 Jul 30, 2020
@angela97lin angela97lin changed the title from "Adds TypedImputer to allow different imputation strategies for numerical and categorical dtypes" to "Adds Imputer to allow different imputation strategies for numerical and categorical dtypes" Jul 31, 2020
@angela97lin
Contributor Author

angela97lin commented Jul 31, 2020

@kmax12 @dsherry @freddyaboulton I might refactor the tests to try all imputation strategies, either in this PR or in a later one. The implementation is ready for review again, and I wanted to try kicking off the perf testing, so please re-review if you get a chance! :')

@angela97lin angela97lin requested review from kmax12 and dsherry Jul 31, 2020
Collaborator

@dsherry dsherry left a comment

@angela97lin thanks for hustling to get this updated! It looks great to me.

Approved, pending a comparison of the perf test results to confirm that a) timing and accuracy aren't degraded by this change, and b) this change doesn't introduce any new errors on the perf test datasets.

I also left some comments I think are blocking, mostly minor:

  • Let's resolve the discussion in fit about defining a list of allowed categorical types. I think we should do that rather than just filtering out the numerics, to avoid bugs.
  • Remove "constant" strategy from automl search hyperparameters
  • Small docstring update
  • Add another release notes entry to highlight the fact that this is a potential performance improvement on mixed-type data.
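
To illustrate the first point above, here is a minimal, library-free sketch of the allow-list idea. It is not the evalml implementation: plain dicts of lists stand in for DataFrame columns, `None` stands in for a missing value, and `NUMERIC_TYPES`, `CATEGORICAL_TYPES`, `column_kind`, and `impute` are hypothetical names invented for this example. The strategies (mean for numeric, most frequent for categorical) are also just assumed defaults for the sketch.

```python
from collections import Counter
from datetime import datetime
from statistics import mean

# Hypothetical allow-lists for this sketch; the real component works on
# pandas dtypes, and its actual lists may differ.
NUMERIC_TYPES = (int, float)
CATEGORICAL_TYPES = (str, bool)


def column_kind(values):
    """Return 'numeric', 'categorical', or None for an unsupported column."""
    present = [v for v in values if v is not None]
    if not present:
        return None
    # bool is a subclass of int in Python, so exclude it from the numeric check.
    if all(isinstance(v, NUMERIC_TYPES) and not isinstance(v, bool) for v in present):
        return "numeric"
    if all(isinstance(v, CATEGORICAL_TYPES) for v in present):
        return "categorical"
    return None


def impute(columns):
    """Fill None with the mean (numeric) or the most frequent value (categorical)."""
    result = {}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        kind = column_kind(values)
        if kind == "numeric":
            fill = mean(present)
        elif kind == "categorical":
            fill = Counter(present).most_common(1)[0][0]
        else:
            # Not on either allow-list (e.g. datetimes): pass through untouched.
            result[name] = list(values)
            continue
        result[name] = [fill if v is None else v for v in values]
    return result


data = {
    "age": [1.0, None, 3.0],
    "color": ["red", None, "red", "blue"],
    "signup": [datetime(2020, 7, 31), None],
}
imputed = impute(data)
```

With a rule of the form "everything that is not numeric is categorical", the `signup` column would be routed to the categorical imputer. With explicit allow-lists it is left unchanged, which is the kind of bug the first comment is trying to avoid.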

Your test coverage looks great. A couple things I thought of:

  • What happens if the provided data isn't empty, but also doesn't contain any numeric or categorical cols? I.e. all datetimes or something. Related to discussion in fit about datatype selection.
  • Checking that the default parameters match what we expect.

evalml/utils/gen_utils.py (outdated, resolved)
evalml/pipelines/utils.py (resolved)
docs/source/release_notes.rst (resolved)
Collaborator

@dsherry dsherry left a comment

@angela97lin well done on this!!

I did some poking around at the first batch perf test results earlier. Looked good to me at first glance. I'd say since this is a low-risk PR and we're more interested in defense than we are in performance gain here, feel free to merge this now so we can start the release moving, and post the perf test plots after that. That ok?

docs/source/release_notes.rst (outdated, resolved)
@angela97lin
Contributor Author

angela97lin commented Jul 31, 2020

@dsherry I just generated the plot!

[Two images attached]

@angela97lin angela97lin merged commit 8a92cf5 into main Jul 31, 2020
2 checks passed
@angela97lin angela97lin mentioned this pull request Jul 31, 2020
dsherry added a commit that referenced this pull request Jul 31, 2020
…erical and categorical dtypes (#991)"

This reverts commit 8a92cf5.
@angela97lin angela97lin deleted the 881_typed_transformer branch Sep 24, 2020
Development

Successfully merging this pull request may close these issues.

AutoML can configure SimpleImputer to apply invalid imputation for categorical dtype
6 participants