-
Notifications
You must be signed in to change notification settings - Fork 91
Add ability to only impute specific columns #3123
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## main #3123 +/- ##
=======================================
+ Coverage 99.7% 99.7% +0.1%
=======================================
Files 317 317
Lines 30685 30711 +26
=======================================
+ Hits 30581 30607 +26
Misses 104 104
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks great!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love it, such a simple yet elegant change! Just left a few small questions.
@@ -61,7 +66,9 @@ def fit(self, X, y=None): | |||
""" | |||
X = infer_feature_types(X) | |||
self.imputers = dict() | |||
for column in X.columns: | |||
|
|||
columns_to_impute = X.columns if self.impute_all else self.impute_strategies |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this work correctly even though X.columns
is a list and self.impute_strategies
is a dict? Why not use self.impute_strategies.keys()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heh, I think here it works because we just use it to iterate, and iterating over a dictionary will use its keys! But happy to update this to be more clear / in case we update this logic :)
If False, only columns specified as keys in the `impute_strategies` dictionary are imputed. If False and `impute_strategies` is None, | ||
no columns will be imputed. Defaults to True. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be worth it to warn the user in the case that the imputer does not impute anything?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a warning during fit time for this case :)
@@ -18,6 +18,10 @@ class PerColumnImputer(Transformer): | |||
default_impute_strategy (str): Impute strategy to fall back on when none is provided for a certain column. | |||
Valid values include "mean", "median", "most_frequent", "constant" for numerical data, | |||
and "most_frequent", "constant" for object data types. Defaults to "most_frequent". | |||
impute_all (bool): Whether or not to impute all columns or just the columns that are specified in `impute_strategies`. If True, | |||
all columns will be imputed either using the strategy specified in `impute_strategies` or using the `default_impute_strategy`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure what you mean by "If True, all columns will be imputed either using the strategy specified in impute_strategies
". Aren't the impute_strategies
defined on a per-column basis?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chukarsten Yup, impute_strategies
is a dictionary that looks like { "col_name": {"impute_strategy": "mean", "fill_value": 0}, "col_name2": {"impute_strategy": "mean", "fill_value": 0} ... }
, but if they're not specified in the dictionary then we use the default_impute_strategy
for that column. Does that make sense / do you think this could be clarified more?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the wording a bit!
Closes #3039