ENH add option for dummying all columns with NA #44
Conversation
Overall looks good. Methodologically, it doesn't seem helpful to add an indicator column where there are no nulls, does it? A column of pure 0s doesn't add any information. If there's a missing value at prediction time in a column where there were no missing values at training time, an indicator column would probably just add noise.
civismlext/preprocessing.py (Outdated)
        - 'all': add indicator columns for all columns with missing values
          in fit data
        - 'expanded': add indicator columns for all categorically expanded
          columns (matches `True` behavior from version 1)
Suggested change:
-          columns (matches `True` behavior from version 1)
+          columns (matches `True` behavior from version 0.1)
civismlext/preprocessing.py (Outdated)
@@ -273,6 +312,9 @@ def fit(self, X, y=None):
         self._check_sentinels(X)
         self.levels_ = self._create_levels(X)
 
+        # optionally flag unexpanded columns with nans
+        self.unexpanded_nans = self._flag_unexpanded_nans(X)
Since this is set during `fit`, I think it should be `self.unexpanded_nans_`. (Or `self._unexpanded_nans` if you want to keep it as internal-only.)
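For context, a minimal sketch of the scikit-learn naming convention being referenced here (an illustrative toy transformer, not the actual `DataFrameETL` implementation; the column scan is an assumption):

```python
from sklearn.base import BaseEstimator, TransformerMixin


class ExampleETL(BaseEstimator, TransformerMixin):
    """Toy transformer illustrating the attribute-naming convention."""

    def fit(self, X, y=None):
        # State learned from the data during fit() gets a trailing
        # underscore (like levels_); purely internal state would use a
        # leading underscore instead (e.g. _unexpanded_nans).
        self.unexpanded_nans_ = [c for c in X.columns if X[c].isnull().any()]
        return self

    def transform(self, X):
        # No-op transform; only the naming convention matters here.
        return X
```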
def test_dummy_na_bad_value(data_raw):
    with pytest.raises(ValueError):
It's worth checking that the `ValueError` is the expected one. Otherwise this might be hiding an unrelated `ValueError`.
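One way to pin down the expected error with pytest is the `match` argument of `pytest.raises`, which checks the exception message against a regex. A sketch along those lines (the match string and the point where the error is raised are assumptions, not the library's actual message):

```python
import pytest

from civismlext.preprocessing import DataFrameETL


def test_dummy_na_bad_value(data_raw):
    # match= asserts on the exception message, so an unrelated ValueError
    # raised elsewhere won't silently satisfy the test. "dummy_na" is a
    # placeholder pattern; use whatever the real error message contains.
    with pytest.raises(ValueError, match="dummy_na"):
        DataFrameETL(dummy_na="bogus").fit(data_raw)
```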
@@ -157,7 +183,7 @@ def _create_col_names(self, X):
         for col in unexpanded_cnames:
             if col in self._cols_to_expand:
                 col_levels = self.levels_[col]
-                if self.dummy_na:
+                if self._dummy_na in ['expanded', 'all']:
                     # avoid exposing the sentinel to the user by replacing
                     # it with 'NaN'. If 'NaN' is already a level, use the
                     # sentinel to prevent column name duplicates.
Maybe it's worth expanding this comment to indicate that if `_dummy_na` is "all", then the sentinel will only be present if there's actually a missing value. I missed / forgot that the first time I read through the code.
I agree that it's not helpful to add indicator columns if there are no nulls, at least from a methodological perspective. I suppose it could be helpful so that the user can know in advance what the output columns will be without having to check for nulls themselves? I'm not sure if this was an intentional behavior or a spandrel. What do you think of the choice to have "expanded" match this (arguably suboptimal) behavior from 0.1? I wanted to preserve some semblance of backwards compatibility, but I could be persuaded to have both options only create columns for features with nulls.
I don't remember why we decided to always create the "is null" column. I think it was about handling null values at prediction time when there were no null values at training time, but that doesn't seem helpful. I think it would be too confusing to change the behavior of the "expanded" option -- I'd be in favor of keeping the old behavior. It doesn't seem very harmful, just not useful. I intended my comment about indicator columns with no nulls to be an agreement with your decision not to create an indicator for all columns.
Is there a test case for "no nulls at training, but nulls at transform time" with the new `'all'` option?
There isn't currently a test case for that, but I can add one.
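A hedged sketch of roughly what such a test might look like (the constructor arguments, the `dataframe_output` flag, and the shape-based assertion are assumptions about the API, not the project's confirmed surface):

```python
import numpy as np
import pandas as pd

from civismlext.preprocessing import DataFrameETL


def test_all_no_nulls_at_fit_nulls_at_transform():
    train = pd.DataFrame({'a': [1.0, 2.0, 3.0]})
    test = pd.DataFrame({'a': [1.0, np.nan, 3.0]})

    etl = DataFrameETL(dummy_na='all', dataframe_output=True)
    out = etl.fit(train).transform(test)

    # With dummy_na='all', no indicator column was created at fit time
    # because 'a' had no missing values in the training data, so the
    # transform output should not grow an extra NaN column.
    assert out.shape[1] == train.shape[1]
```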
No rush on this, but wanted to flag that I added the test case you suggested.
Thank you for the bump. I'd missed that this was ready. LGTM!
This PR adds new options to `DataFrameETL` through the `dummy_na` parameter. The old behavior was (with default `True`):

- `dummy_na=True`: create a NaN column for all categorically expanded columns
- `dummy_na=False`: do not create any NaN indicator columns

Users may want NaN indicator columns for any columns with missing data, not just categorical columns. The new behavior is (with default `'all'`):

- `dummy_na=False` and `dummy_na=None`: do not create any NaN indicator columns
- `dummy_na='expanded'`: matches the `True` behavior in the previous version
- `dummy_na='all'`: create a NaN column for all columns with missing data

Note that some columns which get a NaN indicator when `dummy_na='expanded'` may not get an indicator when `dummy_na='all'` (that is, columns which are expanded but don't have missing data). I'm concerned this may be confusing for users, but it seems more confusing to have an option which creates indicators for all expanded columns, but only for unexpanded columns with missing data. Creating an indicator for all columns regardless of missing data seems quite memory inefficient, and I wanted to keep an option that preserved backwards compatibility. I'm open to suggestions if this seems too confusing for users.
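A hypothetical usage sketch of the difference (constructor arguments other than `dummy_na`, plus the exact output column names, are assumptions for illustration):

```python
import numpy as np
import pandas as pd

from civismlext.preprocessing import DataFrameETL

df = pd.DataFrame({
    'cat': ['a', 'b', None],      # categorical, has a missing value
    'num': [1.0, np.nan, 3.0],    # unexpanded numeric, has a missing value
    'full': [4.0, 5.0, 6.0],      # no missing values
})

# 'expanded': NaN indicator columns only for categorically expanded columns,
# so 'cat' gets one but 'num' does not.
etl_expanded = DataFrameETL(cols_to_expand=['cat'], dummy_na='expanded',
                            dataframe_output=True)

# 'all': NaN indicator columns for every column with missing data at fit
# time, so both 'cat' and 'num' get one, while 'full' does not.
etl_all = DataFrameETL(cols_to_expand=['cat'], dummy_na='all',
                       dataframe_output=True)

print(etl_expanded.fit_transform(df).columns.tolist())
print(etl_all.fit_transform(df).columns.tolist())
```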