
Add fill_value parameter to SimpleImputer #509

Merged: 13 commits merged into master on Mar 25, 2020

Conversation

@angela97lin (Contributor) commented Mar 18, 2020

Closes #336
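For readers skimming the thread: the behavior being added is constant imputation driven by a `fill_value` parameter. A minimal sketch of that idea in plain Python follows; this is an illustrative stand-in, not evalml's actual implementation.

```python
def impute_constant(column, fill_value=0):
    """Replace missing entries (None) with fill_value.

    Toy stand-in to illustrate the new parameter; not evalml's code.
    """
    return [fill_value if v is None else v for v in column]

impute_constant([1.0, None, 3.0, None], fill_value=2)
# → [1.0, 2, 3.0, 2]
```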

@angela97lin self-assigned this Mar 18, 2020
@dsherry (Collaborator) commented Mar 18, 2020

@angela97lin cool! Is this ready for review, and if so can you pls move it to review/qa?

@angela97lin (Contributor, Author) commented Mar 18, 2020

@dsherry Yup, almost ready--just need to fix up something really quickly. Will move and request review when I'm ready :)

@angela97lin changed the title from "Add fill_value parameter to SimpleImputer" to "[WIP] Add fill_value parameter to SimpleImputer" Mar 18, 2020
@codecov bot commented Mar 18, 2020

Codecov Report

Merging #509 into master will increase coverage by 0.02%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #509      +/-   ##
==========================================
+ Coverage   98.58%   98.60%   +0.02%     
==========================================
  Files         111      112       +1     
  Lines        3744     3805      +61     
==========================================
+ Hits         3691     3752      +61     
  Misses         53       53              
Impacted Files Coverage Δ
...sts/pipeline_tests/test_catboost_classification.py 100.00% <ø> (ø)
...l/tests/pipeline_tests/test_catboost_regression.py 100.00% <ø> (ø)
...lml/tests/pipeline_tests/test_linear_regression.py 100.00% <ø> (ø)
...l/tests/pipeline_tests/test_logistic_regression.py 100.00% <ø> (ø)
evalml/tests/pipeline_tests/test_pipelines.py 100.00% <ø> (ø)
evalml/tests/pipeline_tests/test_rf.py 100.00% <ø> (ø)
evalml/tests/pipeline_tests/test_rf_regression.py 100.00% <ø> (ø)
evalml/tests/pipeline_tests/test_xgboost.py 100.00% <ø> (ø)
...components/transformers/imputers/simple_imputer.py 100.00% <100.00%> (ø)
evalml/tests/component_tests/test_components.py 100.00% <100.00%> (ø)
... and 2 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 99a8ffd...e2cd3a1.

@angela97lin requested a review from dsherry Mar 18, 2020
@angela97lin changed the title from "[WIP] Add fill_value parameter to SimpleImputer" to "Add fill_value parameter to SimpleImputer" Mar 18, 2020

from evalml.pipelines.components import SimpleImputer


@dsherry (Collaborator) commented Mar 19, 2020

Suggestions for additional test coverage:

  • Add a test for mean
  • Add a test for most_frequent
  • Have a column which is all nans, and test with each impute strategy
  • Have a column which is all strings, test with each impute strategy

Not required for this PR. If you agree this coverage would be helpful, and if you don't end up doing it here, could you please file an issue for this and stick it in the dev backlog?
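The first two suggested cases could look roughly like this, written against a toy imputer rather than the actual evalml component (the function names and shapes here are assumptions for illustration):

```python
from collections import Counter

def impute(column, strategy, fill_value=None):
    """Toy imputer used to illustrate the suggested test cases.

    An illustrative stand-in, not the evalml component's real API.
    """
    observed = [v for v in column if v is not None]
    if strategy == "mean":
        replacement = sum(observed) / len(observed)
    elif strategy == "most_frequent":
        replacement = Counter(observed).most_common(1)[0][0]
    elif strategy == "constant":
        replacement = fill_value
    else:
        raise ValueError(f"unknown strategy {strategy!r}")
    return [replacement if v is None else v for v in column]

def test_impute_mean():
    # missing value is replaced by the mean of the observed values
    assert impute([1.0, None, 3.0], "mean") == [1.0, 2.0, 3.0]

def test_impute_most_frequent():
    # missing value is replaced by the mode of the observed values
    assert impute(["a", "b", None, "b"], "most_frequent") == ["a", "b", "b", "b"]
```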

@angela97lin (Contributor, Author) commented Mar 20, 2020

Thanks for the suggestions! Added the first two.

For a column with all nans, I filed #514. scikit-learn drops the column during transform if the column is all nans, but we currently don't support that.

For a column with all strings... I added a test, not really sure if using a numerical fill value for a column of all strings should hypothetically work or make sense but it does. 🤔
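The scikit-learn behavior mentioned above can be reproduced directly (a minimal sketch, assuming scikit-learn's `SimpleImputer`; with a statistic-based strategy, a column that is entirely NaN at fit time is dropped during transform):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# The second column is all NaN: its mean is undefined, so scikit-learn
# discards that feature during transform (with a warning) instead of filling it.
X = np.array([[1.0, np.nan],
              [2.0, np.nan],
              [3.0, np.nan]])
out = SimpleImputer(strategy="mean").fit_transform(X)
# out.shape == (3, 1): only the non-empty column survives
```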

@dsherry (Collaborator) commented Mar 24, 2020

Great, thanks!

Yeah, definitely, we don't want people, or automl, using "mean"/"median" for string types. I hope we can tackle datatype validation and stuff in the data checks. It's good to know that it throws an intelligible error though.

@angela97lin (Contributor, Author) commented Mar 24, 2020

Added as a comment to #49 for now :)

@@ -16,7 +16,8 @@ def test_rf_init(X_y):
     objective = PrecisionMicro()
     parameters = {
         'Simple Imputer': {
-            'impute_strategy': 'mean'
+            'impute_strategy': 'mean',
+            'fill_value': 2
@dsherry (Collaborator) commented Mar 19, 2020

Any particular reason this isn't None for these tests? If not, it would be helpful to keep it as None to keep the tests simple.

@angela97lin (Contributor, Author) commented Mar 19, 2020

Reasoning was to test that parameters were actually set properly, rather than just being set to the default!

@dsherry (Collaborator) commented Mar 24, 2020

Gotcha. In my opinion, it's nice to have that sort of check done in test_pipeline.py, because checking that non-default parameter settings work is something we want for all pipelines. I think we have that coverage, which is great. So this is fine too; I only pushed back because if I were a new developer opening this test, I might be confused about why that param was being set here.

@angela97lin (Contributor, Author) commented Mar 25, 2020

Got it, that makes sense to me. Will update and remove for specific pipelines then!

@dsherry (Collaborator) left a comment

Looks great, left some suggestions, will approve next round!

@jeremyliweishih (Contributor) left a comment

Looks good! Are we planning on adding this to pipelines hyperparameter ranges?

@angela97lin (Contributor, Author) commented Mar 20, 2020

@jeremyliweishih I don't know if I'd see a lot of benefit to that (range could be super large, and fill value could also be non-numerical values)... what do ya think?

@jeremyliweishih (Contributor) commented Mar 20, 2020

> @jeremyliweishih I don't know if I'd see a lot of benefit to that (range could be super large, and fill value could also be non-numerical values)... what do ya think?

The only case I can think of being useful would be imputing with 0, though I don't have much experience doing so myself. I guess when data is normalized, 0 might be a good impute value. If it's too niche we can just leave it out!

@jeremyliweishih (Contributor) commented Mar 20, 2020

Might be best to leave it out. After reading a bit more, it seems that although zero imputation is the simplest, it might bias linear models in unintended ways.
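A tiny illustration of that concern (numbers made up for the example): zero-filling drags the column mean toward zero, which shifts anything a linear model learns from that column.

```python
column = [2.0, 4.0, 6.0, None]

# Mean over the observed values only
observed = [v for v in column if v is not None]
observed_mean = sum(observed) / len(observed)            # 4.0

# Mean after zero-imputation: pulled toward zero
zero_filled = [0.0 if v is None else v for v in column]
zero_filled_mean = sum(zero_filled) / len(zero_filled)   # 3.0
```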

@angela97lin (Contributor, Author) commented Mar 20, 2020

@jeremyliweishih Yeah, I feel like the idea of optimizing for a potentially random numerical value (or even 0) could have some more tricky implications, but it was still an interesting thought :D

@dsherry (Collaborator) left a comment

Left a couple more suggestions, but LGTM!

@kmax12 removed their request for review Mar 24, 2020
@angela97lin mentioned this pull request Mar 24, 2020
@angela97lin requested a review from dsherry Mar 25, 2020
@angela97lin (Contributor, Author) commented Mar 25, 2020

@dsherry I know you already reviewed and approved, but I updated docstring + tests based on your suggestion. LMK what you think! :D

@dsherry (Collaborator) left a comment

🚢 👍 😁

@angela97lin merged commit 9c707d9 into master Mar 25, 2020
2 checks passed
@angela97lin deleted the 336_constant_imputer branch Apr 17, 2020