Add ability to only impute specific columns #3123

Merged: angela97lin merged 8 commits into main from 3039_impute_specific_only on Dec 8, 2021

angela97lin
Contributor

Closes #3039

@angela97lin angela97lin self-assigned this Dec 3, 2021

codecov bot commented Dec 3, 2021

Codecov Report

Merging #3123 (d83d1af) into main (c44f74b) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #3123     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        317     317             
  Lines      30685   30711     +26     
=======================================
+ Hits       30581   30607     +26     
  Misses       104     104             
Impacted Files                                           Coverage Δ
evalml/tests/component_tests/test_components.py          99.0% <ø> (ø)
...onents/transformers/imputers/per_column_imputer.py    100.0% <100.0%> (ø)
...l/tests/component_tests/test_per_column_imputer.py    100.0% <100.0%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Collaborator

@jeremyliweishih jeremyliweishih left a comment

looks great!

Contributor

@bchen1116 bchen1116 left a comment

LGTM!

Contributor

@eccabay eccabay left a comment

Love it, such a simple yet elegant change! Just left a few small questions.

@@ -61,7 +66,9 @@ def fit(self, X, y=None):
         """
         X = infer_feature_types(X)
         self.imputers = dict()
-        for column in X.columns:
+
+        columns_to_impute = X.columns if self.impute_all else self.impute_strategies
Contributor

Does this work correctly even though X.columns is a list and self.impute_strategies is a dict? Why not use self.impute_strategies.keys()?

Contributor Author

Heh, I think here it works because we just use it to iterate, and iterating over a dictionary will use its keys! But happy to update this to be more clear / in case we update this logic :)
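For readers following along, here is a minimal sketch of why both forms behave the same. It uses plain Python containers as stand-ins for `X.columns` and `self.impute_strategies`; the names `all_columns` and `columns_to_impute` (as a function) are illustrative assumptions, not code from this PR.

```python
# Illustrative stand-ins for X.columns and self.impute_strategies.
all_columns = ["a", "b", "c"]
impute_strategies = {
    "a": {"impute_strategy": "mean"},
    "c": {"impute_strategy": "median"},
}


def columns_to_impute(impute_all):
    # Same conditional as in the diff, written with the explicit .keys() call
    # suggested in the review.
    source = all_columns if impute_all else impute_strategies.keys()
    return [column for column in source]


# Iterating a dict directly yields its keys, so both spellings agree.
assert list(impute_strategies) == list(impute_strategies.keys())
```

The explicit `.keys()` does not change behavior when the value is only iterated; it mainly signals intent to the reader.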

Comment on lines +23 to +24
If False, only columns specified as keys in the `impute_strategies` dictionary are imputed. If False and `impute_strategies` is None,
no columns will be imputed. Defaults to True.
Contributor

Would it be worth it to warn the user in the case that the imputer does not impute anything?

Contributor Author

Added a warning during fit time for this case :)
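A rough sketch of what a fit-time warning for this case could look like. This is not the code merged in this PR: the class `SelectiveImputerSketch`, the attribute `columns_to_impute_`, and the warning text are all hypothetical, and `fit` takes a plain list of column names rather than a DataFrame to keep the example self-contained.

```python
import warnings


class SelectiveImputerSketch:
    """Hypothetical stand-in, not evalml's PerColumnImputer."""

    def __init__(self, impute_strategies=None, impute_all=True):
        self.impute_strategies = impute_strategies or {}
        self.impute_all = impute_all

    def fit(self, columns):
        # Warn when the configuration means nothing will be imputed.
        if not self.impute_all and not self.impute_strategies:
            warnings.warn(
                "No columns to impute: impute_all is False and "
                "impute_strategies is empty.",
                UserWarning,
            )
        if self.impute_all:
            self.columns_to_impute_ = list(columns)
        else:
            self.columns_to_impute_ = list(self.impute_strategies.keys())
        return self
```

Warning at fit time rather than at construction keeps `__init__` side-effect free and surfaces the message at the point where the user actually tries to use the component.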

@@ -18,6 +18,10 @@ class PerColumnImputer(Transformer):
default_impute_strategy (str): Impute strategy to fall back on when none is provided for a certain column.
Valid values include "mean", "median", "most_frequent", "constant" for numerical data,
and "most_frequent", "constant" for object data types. Defaults to "most_frequent".
impute_all (bool): Whether or not to impute all columns or just the columns that are specified in `impute_strategies`. If True,
all columns will be imputed either using the strategy specified in `impute_strategies` or using the `default_impute_strategy`.
Contributor

I am not sure what you mean by "If True, all columns will be imputed either using the strategy specified in impute_strategies". Aren't the impute_strategies defined on a per-column basis?

Contributor Author

@chukarsten Yup, impute_strategies is a dictionary that looks like { "col_name": {"impute_strategy": "mean", "fill_value": 0}, "col_name2": {"impute_strategy": "mean", "fill_value": 0} ... }, but if they're not specified in the dictionary then we use the default_impute_strategy for that column. Does that make sense / do you think this could be clarified more?
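To make the fallback concrete, a small sketch of the lookup described here. The helper `resolve_strategy` and the `default_fill_value` parameter are hypothetical names for illustration only; the dictionary shape and the "most_frequent" default come from the docstring and comment above.

```python
def resolve_strategy(
    column,
    impute_strategies,
    default_impute_strategy="most_frequent",
    default_fill_value=None,  # assumed parameter, for illustration
):
    # Look up the per-column settings; a column that is not listed gets an
    # empty dict and therefore falls back to the defaults.
    column_config = (impute_strategies or {}).get(column, {})
    return {
        "impute_strategy": column_config.get(
            "impute_strategy", default_impute_strategy
        ),
        "fill_value": column_config.get("fill_value", default_fill_value),
    }


strategies = {"col_name": {"impute_strategy": "mean", "fill_value": 0}}
listed = resolve_strategy("col_name", strategies)
unlisted = resolve_strategy("other_col", strategies)
```

With `impute_all=True`, every column would go through this kind of lookup; with `impute_all=False`, only the keys of `impute_strategies` would.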

Contributor Author

Updated the wording a bit!

@angela97lin angela97lin merged commit 3aa1b36 into main Dec 8, 2021
@angela97lin angela97lin deleted the 3039_impute_specific_only branch December 8, 2021 00:29
@chukarsten chukarsten mentioned this pull request Dec 9, 2021