ENH add option for dummying all columns with NA #44
Conversation
Overall looks good. Methodologically, it doesn't seem helpful to add an indicator column where there are no nulls, does it? A column of pure 0s doesn't add any information. If there's a missing value at prediction time in a column where there were no missing values at training time, an indicator column would probably just add noise.
civismlext/preprocessing.py (Outdated)
        - 'all': add indicator columns for all columns with missing values
          in fit data
        - 'expanded': add indicator columns for all categorically expanded
          columns (matches `True` behavior from version 1)
Suggested change:
-          columns (matches `True` behavior from version 1)
+          columns (matches `True` behavior from version 0.1)
civismlext/preprocessing.py (Outdated)
@@ -273,6 +312,9 @@ def fit(self, X, y=None):
         self._check_sentinels(X)
         self.levels_ = self._create_levels(X)
 
+        # optionally flag unexpanded columns with nans
+        self.unexpanded_nans = self._flag_unexpanded_nans(X)
Since this is set during `fit`, I think it should be `self.unexpanded_nans_`. (Or `self._unexpanded_nans` if you want to keep it as internal-only.)
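For context, a minimal sketch of the scikit-learn naming convention being referenced here (an illustrative toy transformer, not the actual `DataFrameETL` implementation; the column scan is an assumption):

```python
from sklearn.base import BaseEstimator, TransformerMixin


class ExampleETL(BaseEstimator, TransformerMixin):
    """Toy transformer illustrating the attribute-naming convention."""

    def fit(self, X, y=None):
        # State learned from the data during fit() gets a trailing
        # underscore (like levels_); purely internal state would use a
        # leading underscore instead (e.g. _unexpanded_nans).
        self.unexpanded_nans_ = [c for c in X.columns if X[c].isnull().any()]
        return self

    def transform(self, X):
        # No-op transform; only the naming convention matters here.
        return X
```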
def test_dummy_na_bad_value(data_raw):
    with pytest.raises(ValueError):
It's worth checking that the `ValueError` is the expected one. Otherwise this might be hiding an unrelated `ValueError`.
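One way to pin down the expected error with pytest is the `match` argument of `pytest.raises`, which checks the exception message against a regex. A sketch along those lines (the match string and the point where the error is raised are assumptions, not the library's actual message):

```python
import pytest

from civismlext.preprocessing import DataFrameETL


def test_dummy_na_bad_value(data_raw):
    # match= asserts on the exception message, so an unrelated ValueError
    # raised elsewhere won't silently satisfy the test. "dummy_na" is a
    # placeholder pattern; use whatever the real error message contains.
    with pytest.raises(ValueError, match="dummy_na"):
        DataFrameETL(dummy_na="bogus").fit(data_raw)
```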
@@ -157,7 +183,7 @@ def _create_col_names(self, X):
         for col in unexpanded_cnames:
             if col in self._cols_to_expand:
                 col_levels = self.levels_[col]
-                if self.dummy_na:
+                if self._dummy_na in ['expanded', 'all']:
                     # avoid exposing the sentinel to the user by replacing
                     # it with 'NaN'. If 'NaN' is already a level, use the
                     # sentinel to prevent column name duplicates.
Maybe it's worth expanding this comment to indicate that if `_dummy_na` is "all", then the sentinel will only be present if there's actually a missing value. I missed / forgot that the first time I read through the code.
I agree that it's not helpful to add indicator columns if there are no nulls, at least from a methodological perspective. I suppose it could be helpful so that the user can know in advance what the output columns will be without having to check for nulls themselves? I'm not sure if this was an intentional behavior or a spandrel. What do you think of the choice to have "expanded" match this (arguably suboptimal) behavior from 0.1? I wanted to preserve some semblance of backwards compatibility, but I could be persuaded to have both options only create columns for features with nulls.
I don't remember why we decided to always create the "is null" column. I think it was about handling null values at prediction time when there were no null values at training time, but that doesn't seem helpful. I think it would be too confusing to change the behavior of the "expanded" option -- I'd be in favor of keeping the old behavior. It doesn't seem very harmful, just not useful. I intended my comment about indicator columns with no nulls to be an agreement with your decision not to create an indicator for all columns.
Is there a test case for "no nulls at training, but nulls at transform time" with the new `'all'` option?
There isn't currently a test case for that, but I can add one.
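A hedged sketch of roughly what such a test might look like (the constructor arguments, the `dataframe_output` flag, and the shape-based assertion are assumptions about the API, not the project's confirmed surface):

```python
import numpy as np
import pandas as pd

from civismlext.preprocessing import DataFrameETL


def test_all_no_nulls_at_fit_nulls_at_transform():
    train = pd.DataFrame({'a': [1.0, 2.0, 3.0]})
    test = pd.DataFrame({'a': [1.0, np.nan, 3.0]})

    etl = DataFrameETL(dummy_na='all', dataframe_output=True)
    out = etl.fit(train).transform(test)

    # With dummy_na='all', no indicator column was created at fit time
    # because 'a' had no missing values in the training data, so the
    # transform output should not grow an extra NaN column.
    assert out.shape[1] == train.shape[1]
```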
No rush on this, but wanted to flag that I added the test case you suggested.
Thank you for the bump. I'd missed that this was ready. LGTM!
This PR adds new options to `DataFrameETL` through the `dummy_na` parameter. The old behavior was (with default `True`):

- `dummy_na=True`: create a NaN column for all categorically expanded columns
- `dummy_na=False`: do not create any NaN indicator columns

Users may want NaN indicator columns for any columns with missing data, not just categorical columns. The new behavior is (with default `'all'`):

- `dummy_na=False` and `dummy_na=None`: do not create any NaN indicator columns
- `dummy_na='expanded'`: matches the `True` behavior in the previous version
- `dummy_na='all'`: create a NaN column for all columns with missing data

Note that some columns which get a NaN indicator when `dummy_na='expanded'` may not get an indicator when `dummy_na='all'` (that is, columns which are expanded but don't have missing data). I'm concerned this may be confusing for users, but it seems more confusing to have an option which creates indicators for all expanded columns, but only for unexpanded columns with missing data. Creating an indicator for all columns regardless of missing data seems quite memory inefficient, and I wanted to keep an option that preserved backwards compatibility. I'm open to suggestions if this seems too confusing for users.
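A hypothetical usage sketch of the difference (constructor arguments other than `dummy_na`, plus the exact output column names, are assumptions for illustration):

```python
import numpy as np
import pandas as pd

from civismlext.preprocessing import DataFrameETL

df = pd.DataFrame({
    'cat': ['a', 'b', None],      # categorical, has a missing value
    'num': [1.0, np.nan, 3.0],    # unexpanded numeric, has a missing value
    'full': [4.0, 5.0, 6.0],      # no missing values
})

# 'expanded': NaN indicator columns only for categorically expanded columns,
# so 'cat' gets one but 'num' does not.
etl_expanded = DataFrameETL(cols_to_expand=['cat'], dummy_na='expanded',
                            dataframe_output=True)

# 'all': NaN indicator columns for every column with missing data at fit
# time, so both 'cat' and 'num' get one, while 'full' does not.
etl_all = DataFrameETL(cols_to_expand=['cat'], dummy_na='all',
                       dataframe_output=True)

print(etl_expanded.fit_transform(df).columns.tolist())
print(etl_all.fit_transform(df).columns.tolist())
```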