ENH add option for dummying all columns with NA #44

Merged
merged 11 commits into from Feb 25, 2019

@elsander elsander commented Feb 5, 2019

This PR adds new options to DataFrameETL through the dummy_na parameter. The old behavior was (with default True):

  • dummy_na=True: create a NaN column for all categorically expanded columns
  • dummy_na=False: do not create any NaN indicator columns

Users may want NaN indicator columns for any columns with missing data, not just categorical columns. The new behavior is (with default 'all'):

  • dummy_na=False and dummy_na=None: do not create any NaN indicator columns
  • dummy_na='expanded': matches the True behavior in the previous version
  • dummy_na='all': create a NaN column for all columns with missing data

Note that some columns that get a NaN indicator when dummy_na='expanded' may not get one when dummy_na='all' (namely, columns that are expanded but have no missing data). I'm concerned this may be confusing for users, but it seems more confusing to have an option that creates indicators for all expanded columns but only for those unexpanded columns with missing data. Creating an indicator for every column regardless of missing data seems quite memory-inefficient, and I wanted to keep an option that preserves backwards compatibility. I'm open to suggestions if this seems too confusing for users.
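To make the difference between the options concrete, here is a minimal sketch of which columns would receive a NaN indicator under each setting. It is not the DataFrameETL implementation: the `columns_with_indicator` and `is_missing` helpers, the dict-of-lists stand-in for a DataFrame, and the example column names are all hypothetical, and rejecting every other value is a choice made for this sketch only.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (assumption for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def columns_with_indicator(data, cols_to_expand, dummy_na):
    """Return the columns that would receive a NaN indicator.

    `data` maps column name -> list of values (a stand-in for a DataFrame).
    """
    if dummy_na in (False, None):
        return []
    if dummy_na == 'expanded':
        # Matches the old `True` behavior: every expanded column,
        # whether or not it contains missing data.
        return [c for c in data if c in cols_to_expand]
    if dummy_na == 'all':
        # Any column, expanded or not, that actually has missing data.
        return [c for c in data if any(is_missing(v) for v in data[c])]
    raise ValueError(f"Unrecognized dummy_na value: {dummy_na!r}")


data = {
    'pet': ['cat', 'dog', 'cat'],          # expanded, no missing data
    'city': ['Chicago', None, 'Boston'],   # expanded, missing data
    'age': [31.0, float('nan'), 47.0],     # unexpanded, missing data
    'height': [1.6, 1.8, 1.7],             # unexpanded, no missing data
}
expand = {'pet', 'city'}

print(columns_with_indicator(data, expand, 'expanded'))  # ['pet', 'city']
print(columns_with_indicator(data, expand, 'all'))       # ['city', 'age']
```

Here `pet` is the kind of column described above: it gets an indicator under 'expanded' but not under 'all'.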

@stephen-hoover stephen-hoover left a comment

Overall looks good. Methodologically, it doesn't seem helpful to add an indicator column where there are no nulls, does it? A column of pure 0s doesn't add any information. If there's a missing value at prediction time in a column where there were no missing values at training time, an indicator column would probably just add noise.

- 'all': add indicator columns for all columns with missing values
  in fit data
- 'expanded': add indicator columns for all categorically expanded
  columns (matches `True` behavior from version 1)

Suggested change
-  columns (matches `True` behavior from version 1)
+  columns (matches `True` behavior from version 0.1)

@@ -273,6 +312,9 @@ def fit(self, X, y=None):
         self._check_sentinels(X)
         self.levels_ = self._create_levels(X)

+        # optionally flag unexpanded columns with nans
+        self.unexpanded_nans = self._flag_unexpanded_nans(X)

Since this is set during the fit, I think it should be self.unexpanded_nans_. (Or self._unexpanded_nans if you want to keep it as internal-only.)



def test_dummy_na_bad_value(data_raw):
    with pytest.raises(ValueError):

It's worth checking that the ValueError is the expected one. Otherwise this might be hiding an unrelated ValueError.
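One way to do this, sketched here under assumptions, is the `match` argument of `pytest.raises`, which checks the exception message against a regular expression. The `check_dummy_na` validator and its error message are hypothetical stand-ins; the real check in DataFrameETL and its wording may differ.

```python
import pytest

VALID_DUMMY_NA = (None, False, 'expanded', 'all')


def check_dummy_na(value):
    """Hypothetical validator; the library's actual message may differ."""
    if value not in VALID_DUMMY_NA:
        raise ValueError(
            f"dummy_na must be one of {VALID_DUMMY_NA}; got {value!r}"
        )
    return value


def test_dummy_na_bad_value():
    # `match` is applied to the string form of the raised exception,
    # so a ValueError with a different message fails the test instead
    # of passing silently.
    with pytest.raises(ValueError, match="dummy_na must be one of"):
        check_dummy_na('bad_value')
```

Matching on a short, stable fragment of the message keeps the test from breaking when the rest of the wording changes.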

@@ -157,7 +183,7 @@ def _create_col_names(self, X):
         for col in unexpanded_cnames:
             if col in self._cols_to_expand:
                 col_levels = self.levels_[col]
-                if self.dummy_na:
+                if self._dummy_na in ['expanded', 'all']:
                     # avoid exposing the sentinel to the user by replacing
                     # it with 'NaN'. If 'NaN' is already a level, use the
                     # sentinel to prevent column name duplicates.

Maybe it's worth expanding this comment to indicate that if _dummy_na is "all", then the sentinel will only be present if there's actually a missing value. I missed / forgot that the first time I read through the code.
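The naming rule in that code comment, and the point about the sentinel only being present when there was a missing value, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `level_column_names` helper, the `'__nan__'` sentinel value, and the `column_level` naming pattern are not taken from the PR.

```python
def level_column_names(col, levels, sentinel):
    """Build output column names for one expanded column.

    `levels` holds the levels learned during fit. The sentinel appears
    in `levels` only if a NaN level was created for this column.
    """
    names = []
    for level in levels:
        if level == sentinel and 'NaN' not in levels:
            # Avoid exposing the sentinel to the user.
            level = 'NaN'
        # If 'NaN' is already a real level, the sentinel is kept so that
        # two columns do not end up with the same name.
        names.append(f"{col}_{level}")
    return names


SENTINEL = '__nan__'  # hypothetical sentinel value

# Under 'all', a column with no missing data at fit time has no sentinel
# level, so no NaN column name is produced for it.
print(level_column_names('pet', ['cat', 'dog'], SENTINEL))
# A column with missing data gets a user-facing 'NaN' name.
print(level_column_names('city', ['Boston', 'Chicago', SENTINEL], SENTINEL))
# A column that already has a literal 'NaN' level keeps the sentinel.
print(level_column_names('code', ['NaN', 'x', SENTINEL], SENTINEL))
```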

@stephen-hoover stephen-hoover added this to the v0.2.0 milestone Feb 12, 2019
@elsander

I agree that it's not helpful to add indicator columns if there are no nulls, at least from a methodological perspective. I suppose it could be helpful so that the user can know in advance what the output columns will be without having to check for nulls themselves? I'm not sure if this was an intentional behavior or a spandrel.

What do you think of the choice to have "expanded" match this (arguably suboptimal) behavior from 0.1? I wanted to preserve some semblance of backwards compatibility, but I could be persuaded to have both options only create columns for features with nulls.

@stephen-hoover

I don't remember why we decided to always create the "is null" column. I think it was about handling null values at prediction time when there were no null values at training time, but that doesn't seem helpful.

I think it would be too confusing to change the behavior of the "expanded" option -- I'd be in favor of keeping the old behavior. It doesn't seem very harmful, just not useful. I intended my comment about indicator columns with no nulls to be an agreement with your decision not to create an indicator for all columns.

@stephen-hoover

Is there a test case for "no nulls at training, but nulls at transform time" with the new dummy_na='all'? Is it worth adding one?
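A rough sketch of what such a test could look like follows. It uses a toy transformer rather than DataFrameETL itself, so `ToyIndicatorETL`, the `_NaN` column suffix, the column ordering, and the choice to pass unseen nulls through unchanged are all assumptions, not the behavior confirmed in this PR. The only part taken from the discussion is that indicator columns are decided from the fit data.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (assumption for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


class ToyIndicatorETL:
    """Toy stand-in for dummy_na='all' on unexpanded columns only."""

    def fit(self, data):
        # `data` maps column name -> list of values.
        self.columns_ = list(data)
        # Columns are flagged once, from the fit data.
        self.unexpanded_nans_ = [
            c for c in data if any(is_missing(v) for v in data[c])
        ]
        return self

    def transform(self, data):
        out = {c: list(data[c]) for c in self.columns_}
        for c in self.unexpanded_nans_:
            out[f"{c}_NaN"] = [int(is_missing(v)) for v in data[c]]
        return out


def test_nulls_only_at_transform_time():
    train = {'age': [31.0, 52.0], 'height': [1.6, float('nan')]}
    new = {'age': [float('nan'), 40.0], 'height': [1.7, 1.8]}
    out = ToyIndicatorETL().fit(train).transform(new)
    # 'age' had no nulls at fit time, so it gets no indicator column
    # even though it has a null at transform time.
    assert list(out) == ['age', 'height', 'height_NaN']
    assert out['height_NaN'] == [0, 0]
    # In this toy version the unseen null simply passes through.
    assert math.isnan(out['age'][0])
```

The property worth pinning down in the real test is that the output columns at transform time are the same as those determined at fit time.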

@elsander

There isn't currently a test case for that, but I can add one.

@elsander

No rush on this, but wanted to flag that I added the test case you suggested.

@stephen-hoover stephen-hoover left a comment

Thank you for the bump. I'd missed that this was ready. LGTM!

@elsander elsander merged commit a88dae1 into civisanalytics:master Feb 25, 2019
@elsander elsander deleted the is_na_all_features branch February 25, 2019 18:54