BUG Handle Categorical columns with non-string types #20

stephen-hoover · 2018-01-04T18:12:57Z

We previously could not expand pd.Categorical columns which had non-string levels. A column with all integer levels would be marked as a numeric column, causing us to try to fill missing values with our sentinel on the column itself. This errored because the sentinel wasn't part of the existing levels. Mixed types would fail because the expansion code did sorting operations, and numbers have no defined order relative to strings.

This commit refactors the expansion code to cast columns to pd.Categorical types and use pd.get_dummies, still on a column-wise basis. Previous behavior is preserved, but now the DataFrameETL should handle all valid pd.Categorical columns in the input. This also allows some simplification in the code. Since we're handling columns as categoricals, we can use the same sentinel for all columns.

Closes #18 .

beckermr · 2018-01-04T18:29:53Z

We need to make sure to throw a really big DataFrame at this. pd.get_dummies scares me. @jmemich, didn't you have trouble with it before?

beckermr · 2018-01-04T18:34:39Z

Also, why do we need the sentinels if we use a categorical type in pandas? I think we can just keep it as np.NaN everywhere, right? The sentinels were something I invented to be able to use sklearn one-hot-encoding since it does not handle NaNs. We do not need this if we use pd.get_dummies.

beckermr · 2018-01-04T18:44:34Z

One more idea. Can we just cast all categorical columns to object types internally when we process them and leave the rest of the code as is?

Edit: This might not solve the sorting issue.

We previously could not expand `pd.Categorical` columns which had non-string levels. A column with all integer levels would be marked as a numeric column, causing us to try to fill missing values with our sentinel on the column itself. This errored because the sentinel wasn't part of the existing levels. Mixed types would fail because the expansion code did sorting operations, and numbers have no defined order relative to strings. This commit refactors the expansion code to cast columns to `pd.Categorical` types and use `pd.get_dummies`, still on a column-wise basis. Previous behavior is preserved, but now the `DataFrameETL` should handle all valid `pd.Categorical` columns in the input. This also allows some simplification in the code. Since we're handling columns as categoricals, we can use the same sentinel for all columns.

stephen-hoover · 2018-01-04T18:58:23Z

IIRC, the resource problem with get_dummies came when we used it on the entire DataFrame at once.

I kept the sentinel because the current behavior of the DataFrameETL is to completely ignore new categories found at transform time. The simplest way to make sure that we have the same levels when transforming as for fitting is to set the levels on the columns to transform to be the same as the levels seen when fitting. In this case, pandas treats new categories as missing data. Using the sentinel avoids this confusion. There would be ways to handle this without a sentinel, but I think it would involve extra mucking with the categories and removing extra columns after expansion. I think the code would end up being more complicated that way.

My first solution involved casting columns to object types, but you're right, that still runs into errors when trying to sort.

jmemich · 2018-01-04T19:00:08Z

FWIW, I was looking the pandas source code and calling pd.get_dummies one column at a time avoids this copy, which may have rendered one of my earlier implementations unworkable.

beckermr · 2018-01-04T19:03:31Z

Thanks @stephen-hoover and @jmemich! I am going to throw my standard test problem at this before we merge to make sure we don't have any hidden performance degradations. I am more convinced this is a good solution now.

jmemich · 2018-01-04T19:06:35Z

my standard test problem

@beckermr if you have a script that's in decent enough shape I think it would make sense to add it to a benchmarks/ directory. This kind of information is a good thing to tie to the codebase, in the codebase.

stephen-hoover · 2018-01-04T19:25:15Z

civismlext/test/test_preprocessing.py

    assert arr.shape == expected_array.shape
    assert_almost_equal(arr, expected_array)
+
+
+def test_categorical_looks_like_int():


This test fails before the changes in this PR.

Should we include a mixed-type categorical here as well?

You're right, we should. Added!

beckermr · 2018-01-04T19:48:11Z

Good point @jmemich. I can try and throw something together.

elsander

I'm a big fan of this change, it cleans up the code quite a bit. I do want to wait to merge this until we've tested it against a large dataset.

elsander · 2018-01-04T20:00:57Z

civismlext/preprocessing.py

@@ -176,33 +86,14 @@ def _flag_nulls(self, X, cols_to_drop):
                                 '"raise", or "warn"]')
        return cols_to_drop + null_cols

-    def _flag_numeric(self, levels):


So excited to see so many LOC go away!

elsander · 2018-01-04T20:28:30Z

civismlext/test/test_preprocessing.py

    assert arr.shape == expected_array.shape
    assert_almost_equal(arr, expected_array)
+
+
+def test_categorical_looks_like_int():


Should we include a mixed-type categorical here as well?

elsander · 2018-01-04T20:33:03Z

civismlext/test/test_preprocessing.py

+
+def test_categorical_looks_like_int():
+    # Verify that the right thing happens if the fit DataFrame has a
+    # categorical with


It looks like this comment got cut off.

beckermr · 2018-01-05T18:07:09Z

Alright. I did some testing and am super stoked. This appears to be 3x faster with no memory penalty. LGTM!

elsander

Awesome! LGTM from me as well!

stephen-hoover · 2018-01-05T19:20:45Z

I did some testing of my own. Transforming 4 columns of an input table with ~350 columns, (~170 total categories present at the larger two sizes) I find memory use of:

N rows	v0.1.5	This PR
10,000	255 MiB	255 MiB
100,000	850 MiB	850 MiB
1,000,000	7500 MiB	7200 MIB

I'm measuring memory use in Civis Platform scripts. Note that these numbers vary a bit from run to run. I find no evidence for increased memory usage, consistent with Matt's tests.

When I timed the 1M row expansion, the transform step took ~11.9 s on average (over 100 tries) with v0.1.5, and ~6.7 s on average with the code from this branch.

stephen-hoover added the bug label Jan 4, 2018

stephen-hoover force-pushed the expand-cols-with-get-dummies branch from 4dc684f to a1c4777 Compare January 4, 2018 18:16

MAINT Ignore PyCharm files in repo

bb06242

stephen-hoover force-pushed the expand-cols-with-get-dummies branch from a1c4777 to 6c9b0bb Compare January 4, 2018 18:17

stephen-hoover force-pushed the expand-cols-with-get-dummies branch from 6c9b0bb to 7f12611 Compare January 4, 2018 18:51

stephen-hoover requested a review from elsander January 4, 2018 19:11

stephen-hoover assigned elsander Jan 4, 2018

stephen-hoover commented Jan 4, 2018

View reviewed changes

stephen-hoover added this to the v0.2.0 milestone Jan 4, 2018

elsander suggested changes Jan 4, 2018

View reviewed changes

TST Add test for category with mixed-type levels

a2989a2

stephen-hoover force-pushed the expand-cols-with-get-dummies branch from 8f9ab62 to a2989a2 Compare January 4, 2018 22:56

beckermr approved these changes Jan 5, 2018

View reviewed changes

elsander approved these changes Jan 5, 2018

View reviewed changes

stephen-hoover merged commit 1b1b8a4 into civisanalytics:master Jan 5, 2018

stephen-hoover deleted the expand-cols-with-get-dummies branch January 5, 2018 19:21

stephen-hoover mentioned this pull request Jan 8, 2018

DOC Add CHANGELOG for DataFrameETL fix #21

Merged

stephen-hoover modified the milestones: v0.2.0, v0.1.6 Mar 5, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG Handle Categorical columns with non-string types #20

BUG Handle Categorical columns with non-string types #20

stephen-hoover commented Jan 4, 2018

beckermr commented Jan 4, 2018

beckermr commented Jan 4, 2018

beckermr commented Jan 4, 2018 •

edited

Loading

stephen-hoover commented Jan 4, 2018

jmemich commented Jan 4, 2018

beckermr commented Jan 4, 2018

jmemich commented Jan 4, 2018

stephen-hoover Jan 4, 2018

elsander Jan 4, 2018

beckermr Jan 4, 2018

stephen-hoover Jan 4, 2018

beckermr commented Jan 4, 2018

elsander left a comment

elsander Jan 4, 2018

elsander Jan 4, 2018

elsander Jan 4, 2018

stephen-hoover Jan 4, 2018

beckermr commented Jan 5, 2018

elsander left a comment

stephen-hoover commented Jan 5, 2018

BUG Handle Categorical columns with non-string types #20

BUG Handle Categorical columns with non-string types #20

Conversation

stephen-hoover commented Jan 4, 2018

beckermr commented Jan 4, 2018

beckermr commented Jan 4, 2018

beckermr commented Jan 4, 2018 • edited Loading

stephen-hoover commented Jan 4, 2018

jmemich commented Jan 4, 2018

beckermr commented Jan 4, 2018

jmemich commented Jan 4, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

beckermr commented Jan 4, 2018

elsander left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

beckermr commented Jan 5, 2018

elsander left a comment

Choose a reason for hiding this comment

stephen-hoover commented Jan 5, 2018

beckermr commented Jan 4, 2018 •

edited

Loading