Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG Handle Categorical columns with non-string types #20

Conversation

stephen-hoover
Copy link
Contributor

We previously could not expand pd.Categorical columns which had non-string levels. A column with all integer levels would be marked as a numeric column, causing us to try to fill missing values with our sentinel on the column itself. This errored because the sentinel wasn't part of the existing levels. Mixed types would fail because the expansion code did sorting operations, and numbers have no defined order relative to strings.

This commit refactors the expansion code to cast columns to pd.Categorical types and use pd.get_dummies, still on a column-wise basis. Previous behavior is preserved, but now the DataFrameETL should handle all valid pd.Categorical columns in the input. This also allows some simplification in the code. Since we're handling columns as categoricals, we can use the same sentinel for all columns.

Closes #18 .

@beckermr
Copy link
Contributor

beckermr commented Jan 4, 2018

We need to make sure to throw a really big DataFrame at this. pd.get_dummies scares me. @jmemich, didn't you have trouble with it before?

@beckermr
Copy link
Contributor

beckermr commented Jan 4, 2018

Also, why do we need the sentinels if we use a categorical type in pandas? I think we can just keep it as np.NaN everywhere, right? The sentinels were something I invented to be able to use sklearn one-hot-encoding since it does not handle NaNs. We do not need this if we use pd.get_dummies.

@beckermr
Copy link
Contributor

beckermr commented Jan 4, 2018

One more idea. Can we just cast all categorical columns to object types internally when we process them and leave the rest of the code as is?

Edit: This might not solve the sorting issue.

We previously could not expand `pd.Categorical` columns which had non-string levels. A column with all integer levels would be marked as a numeric column, causing us to try to fill missing values with our sentinel on the column itself. This errored because the sentinel wasn't part of the existing levels. Mixed types would fail because the expansion code did sorting operations, and numbers have no defined order relative to strings.

This commit refactors the expansion code to cast columns to `pd.Categorical` types and use `pd.get_dummies`, still on a column-wise basis. Previous behavior is preserved, but now the `DataFrameETL` should handle all valid `pd.Categorical` columns in the input. This also allows some simplification in the code. Since we're handling columns as categoricals, we can use the same sentinel for all columns.
@stephen-hoover
Copy link
Contributor Author

IIRC, the resource problem with get_dummies came when we used it on the entire DataFrame at once.

I kept the sentinel because the current behavior of the DataFrameETL is to completely ignore new categories found at transform time. The simplest way to make sure that we have the same levels when transforming as for fitting is to set the levels on the columns to transform to be the same as the levels seen when fitting. In this case, pandas treats new categories as missing data. Using the sentinel avoids this confusion. There would be ways to handle this without a sentinel, but I think it would involve extra mucking with the categories and removing extra columns after expansion. I think the code would end up being more complicated that way.

My first solution involved casting columns to object types, but you're right, that still runs into errors when trying to sort.

@jmemich
Copy link
Contributor

jmemich commented Jan 4, 2018

FWIW, I was looking the pandas source code and calling pd.get_dummies one column at a time avoids this copy, which may have rendered one of my earlier implementations unworkable.

@beckermr
Copy link
Contributor

beckermr commented Jan 4, 2018

Thanks @stephen-hoover and @jmemich! I am going to throw my standard test problem at this before we merge to make sure we don't have any hidden performance degradations. I am more convinced this is a good solution now.

@jmemich
Copy link
Contributor

jmemich commented Jan 4, 2018

my standard test problem

@beckermr if you have a script that's in decent enough shape I think it would make sense to add it to a benchmarks/ directory. This kind of information is a good thing to tie to the codebase, in the codebase.

assert arr.shape == expected_array.shape
assert_almost_equal(arr, expected_array)


def test_categorical_looks_like_int():
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test fails before the changes in this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we include a mixed-type categorical here as well?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, we should. Added!

@beckermr
Copy link
Contributor

beckermr commented Jan 4, 2018

Good point @jmemich. I can try and throw something together.

@stephen-hoover stephen-hoover added this to the v0.2.0 milestone Jan 4, 2018
Copy link
Contributor

@elsander elsander left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a big fan of this change, it cleans up the code quite a bit. I do want to wait to merge this until we've tested it against a large dataset.

@@ -176,33 +86,14 @@ def _flag_nulls(self, X, cols_to_drop):
'"raise", or "warn"]')
return cols_to_drop + null_cols

def _flag_numeric(self, levels):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So excited to see so many LOC go away!

assert arr.shape == expected_array.shape
assert_almost_equal(arr, expected_array)


def test_categorical_looks_like_int():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we include a mixed-type categorical here as well?


def test_categorical_looks_like_int():
# Verify that the right thing happens if the fit DataFrame has a
# categorical with
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this comment got cut off.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@beckermr
Copy link
Contributor

beckermr commented Jan 5, 2018

Alright. I did some testing and am super stoked. This appears to be 3x faster with no memory penalty. LGTM!

Copy link
Contributor

@elsander elsander left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! LGTM from me as well!

@stephen-hoover
Copy link
Contributor Author

I did some testing of my own. Transforming 4 columns of an input table with ~350 columns, (~170 total categories present at the larger two sizes) I find memory use of:

N rows v0.1.5 This PR
10,000 255 MiB 255 MiB
100,000 850 MiB 850 MiB
1,000,000 7500 MiB 7200 MIB

I'm measuring memory use in Civis Platform scripts. Note that these numbers vary a bit from run to run. I find no evidence for increased memory usage, consistent with Matt's tests.

When I timed the 1M row expansion, the transform step took ~11.9 s on average (over 100 tries) with v0.1.5, and ~6.7 s on average with the code from this branch.

@stephen-hoover stephen-hoover merged commit 1b1b8a4 into civisanalytics:master Jan 5, 2018
@stephen-hoover stephen-hoover deleted the expand-cols-with-get-dummies branch January 5, 2018 19:21
@stephen-hoover stephen-hoover modified the milestones: v0.2.0, v0.1.6 Mar 5, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants