
Support non-uniform categoricals#1877

Merged
jcrist merged 24 commits into dask:master from jcrist:concat-cat
Jan 23, 2017

Conversation

@jcrist
Member

@jcrist jcrist commented Jan 3, 2017

This adds support for non-uniform categoricals (different categories in each partition) to dask.dataframe. Currently this is written to only support pandas 0.19.0. It could be made backwards compatible, but we may also want to discuss dropping support for < 0.19.0.

Fixes #1836.

TODO:

  • Support non-uniform categoricals in partitions
  • Basic functionality works (reductions, groupby, etc...)
  • categorical accessor
  • get_dummies
  • pivot_table
  • concat
  • append
  • merge
  • read_csv (as per ENH: parse categoricals in read_csv pandas-dev/pandas#13406)
  • hash_pandas_object is categorical-mapping agnostic
  • Fix bug in partd
  • Maybe make backwards compatible with pandas 0.18?

@mrocklin
Member

mrocklin commented Jan 3, 2017

This looks cool. It would be nice to remove this roadblock when using categoricals with dask.dataframe.

cc @TomAugspurger @sinhrks

@mrocklin
Member

mrocklin commented Jan 6, 2017

What is the status here? @jcrist are you waiting on feedback? Please let me know if this is blocking on me.

@jcrist
Member Author

jcrist commented Jan 6, 2017

No, I'm still working on this.

jcrist added 14 commits January 12, 2017 11:46
This adds support for non-consistent categoricals in each partition.
Both `get_dummies` and `pivot_table` error properly if categories are
not fully known.
Throw a NotImplementedError if the categories are unknown. An
alternative may be to compute immediately, or to return a lazy Index
object.
Test common operations work properly
We do this because pandas frequently has inconsistent dtypes in
operations with empty dataframes. We may want to move this into
`methods.concat`, but at the same time that needs to be able to use
empty frames to ensure partitions match the desired schema. This all is
pretty mucky stuff, hopefully can find a cleaner way.
Reading directly into a categorical dtype from `read_csv` now returns an
unknown categorical.
- `known` is whether the categories are known
- `as_known` converts unknown -> known
- `as_unknown` converts known -> unknown
- pin conda to fix conda bug?
- skip csv test in pandas < 0.19.2
jreback pushed a commit to pandas-dev/pandas that referenced this pull request Jan 18, 2017
Previously categorical values were hashed using just their codes. This
meant that the hash value depended on the ordering of the categories,
rather than on the values the series represented. This caused problems
in dask, where different partitions might have different categorical
mappings. This PR makes the hashing dependent on the values the
categorical represents, rather than on the codes. The categories are
first hashed, and then the codes are remapped to the hashed values.
This is slightly slower than before (we still need to hash the
categories, where we didn't before), but allows for more consistent
hashing. Related to this work in dask: dask/dask#1877.

Author: Jim Crist <crist042@umn.edu>

Closes #15143 from jcrist/categories_hash_consistently and squashes the following commits:

f1aea13 [Jim Crist] Address comments
7878c55 [Jim Crist] Categoricals hash consistently
Due to changes to `concat`, python ints no longer convert to int32 on
int32 systems when concatenated
@jcrist jcrist changed the title [WIP] Support non-uniform categoricals Support non-uniform categoricals Jan 18, 2017
@jcrist
Member Author

jcrist commented Jan 18, 2017

Ok, this should be ready for review. A quick walkthrough of what's added here:

This adds support for different partitions having different categorical mappings. This is accomplished through union_categoricals, a method added in pandas 0.19.0. As such, support for pandas < 0.19.0 has been dropped.
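The mechanism can be illustrated with pandas alone. This is a sketch of what happens when two partitions carry different category sets; dask applies the same operation across its partitions:

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Two "partitions" whose categorical columns have different categories
part1 = pd.Categorical(['a', 'b', 'a'])
part2 = pd.Categorical(['b', 'c'])

# union_categoricals combines them into one categorical whose
# categories are the union of the inputs' categories
combined = union_categoricals([part1, part2])
print(list(combined.categories))  # ['a', 'b', 'c']
```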

Changes to metadata:

Categorical columns/indices now have two states:

  • known categoricals have the categories known statically (on the _meta attribute)
  • unknown categoricals don't know the categories statically, and are indicated by the presence of dd.utils.UNKNOWN_CATEGORIES in the categories on the meta attribute. This is nice, because operations on the categorical in most cases propagate the presence of this value, so the known/unknown state carries through inference.
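A minimal sketch of how the sentinel survives inference, using plain pandas on an empty metadata series (the sentinel's string value here is an assumption; the real object is `dd.utils.UNKNOWN_CATEGORIES`):

```python
import pandas as pd

# Stand-in for dask's sentinel; the real one lives in dask.dataframe.utils
UNKNOWN_CATEGORIES = '__UNKNOWN_CATEGORIES__'

# An "unknown" categorical's metadata carries the sentinel as its only category
meta = pd.Series(pd.Categorical([], categories=[UNKNOWN_CATEGORIES]))

# Operations on the empty metadata keep the categorical dtype, so the
# sentinel (and with it the unknown state) propagates through inference
result = meta[meta.notnull()]
print(UNKNOWN_CATEGORIES in result.cat.categories)  # True
```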

Changes to the categorical accessor

To query and convert between these states, the following has been added to the categorical accessor:

  • .cat.known is a property that indicates if the categories are known
  • .cat.as_known() converts unknown to known (and is a no-op if already known). Note that this requires computation.
  • .cat.as_unknown() converts known to unknown. This requires no computation - it's just a change in metadata.

If the categories are known, then .cat.categories returns statically (no computation needed). If they are unknown, an error is raised (rather than computing them).
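To see why `.cat.as_known()` requires computation, here is a rough pandas-only sketch of what it must do (the helper name and implementation are illustrative, not dask's actual code):

```python
import pandas as pd
from pandas.api.types import union_categoricals

def as_known(partitions):
    """Sketch: a full pass over all partitions to compute the union of
    categories, then a remap of each partition onto that union."""
    union = union_categoricals([p.values for p in partitions]).categories
    return [p.cat.set_categories(union) for p in partitions]

parts = [pd.Series(pd.Categorical(['a', 'b'])),
         pd.Series(pd.Categorical(['c', 'a']))]
known = as_known(parts)
print([list(p.cat.categories) for p in known])  # identical for every partition
```

The full pass over the data is what makes this a computation, while `.cat.as_unknown()` only edits metadata.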

Changes to concat

Concatenation of dask dataframes with different categoricals is now supported. For each column/index in a dataframe, the output dask column/index will have known categories only if all categories are known. Otherwise the categories will be unknown.

After a call to concat, each partition is guaranteed to match the metadata on the containing dask dataframe. This fixes a bug from before where in certain cases partitions could have different columns/datatypes.
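A pandas-only illustration of why concat needs special handling for categoricals (a sketch of the underlying behavior, not dask's code):

```python
import pandas as pd
from pandas.api.types import union_categoricals

a = pd.Series(pd.Categorical(['x', 'y']))
b = pd.Series(pd.Categorical(['y', 'z']))

# Plain pd.concat falls back to object dtype when the categories differ...
print(pd.concat([a, b], ignore_index=True).dtype)  # object

# ...so a categorical-aware concat unions the categories instead:
merged = union_categoricals([a.values, b.values])
print(merged.dtype)  # category
```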

Changes to pivot_table and get_dummies

These functions only work if the categories are known. If unknown, an error is raised.

Changes to read_csv

For pandas >= 0.19.2, dask now supports reading from csvs directly into categorical columns. These will have unknown categories.
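This builds on the pandas feature from pandas-dev/pandas#13406, which each partition's read uses; the column names here are just an example:

```python
import io
import pandas as pd

csv = io.StringIO("Origin,DepDelay\nLAX,5\nJFK,12\nLAX,3\n")

# Parse a column directly into a categorical dtype while reading
df = pd.read_csv(csv, dtype={'Origin': 'category'})
print(df['Origin'].dtype)  # category
```

In dask, the resulting column's categories are marked unknown, since each partition only sees its own values.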

Changes to merge

No changes are made to merge. Pandas merge doesn't support merging categoricals, even if the categoricals are identical. As such, dask doesn't error, but the categorical datatype is dropped. When this is fixed in pandas, support for merge on differing categoricals can be added.

Changes to hash_pandas_object

Previously, this would hash categories based on codes only. We now hash based on the values the categoricals represent. This was patched in pandas, and will be in the next release. The fix is backported to dask for versions 0.19.0 to 0.19.2. The faster (and more stable) object hash will be used for pandas 0.19.2.
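The difference is observable in pandas today (the fix described above has long since been released). Two series with the same values but different category orderings have different codes, yet hash equally:

```python
import pandas as pd
from pandas.util import hash_pandas_object

# Same values, different category orderings (hence different codes)
a = pd.Series(pd.Categorical(['x', 'y'], categories=['x', 'y']))
b = pd.Series(pd.Categorical(['x', 'y'], categories=['y', 'x']))
print(a.cat.codes.tolist(), b.cat.codes.tolist())  # [0, 1] vs [1, 0]

# Value-based hashing gives equal hashes despite differing codes
print((hash_pandas_object(a, index=False) ==
       hash_pandas_object(b, index=False)).all())  # True
```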

@jcrist
Member Author

jcrist commented Jan 18, 2017

And a quick demo using the airline dataset:

In [1]: import pandas as pd

In [2]: dtypes = pd.read_csv('1987.csv', dtype={'Origin': 'category', 'Dest': 'category'}).dtypes.to_dict()

In [3]: import dask.dataframe as dd

# The dtypes for `Origin` and `Dest` are `category`, so they're
# read directly in as categoricals
In [4]: ddf = dd.read_csv('198*.csv', dtype=dtypes)

In [5]: ddf
Out[5]: dd.DataFrame<from-de..., npartitions=18>

In [6]: ddf.Origin.dtype
Out[6]: category

In [7]: ddf.Origin.cat.known    # categories are unknown at this point
Out[7]: False

In [8]: worst_avg_delay = ddf.DepDelay.groupby(ddf.Origin).mean().nlargest(10)

In [9]: worst_avg_delay.index.dtype
Out[9]: category

In [10]: worst_avg_delay.index.cat.known
Out[10]: False

In [11]: worst_avg_delay.compute()
Out[11]:
Origin
PIR    85.000000
ABI    36.500000
EGE    32.363636
BFI    28.000000
ROP    24.700000
ACV    19.740223
SUN    19.000000
YAP    16.902482
RDD    15.724234
HDN    14.147350
Name: DepDelay, dtype: float64

# Convert unknown categoricals to known
In [12]: known_origin = ddf.Origin.cat.as_known()

In [13]: known_origin.cat.known
Out[13]: True

In [14]: known_origin.cat.categories
Out[14]:
Index(['ABE', 'ABQ', 'ACV', 'AGS', 'ALB', 'ALO', 'AMA', 'ANC', 'APF', 'ATL',
       ...
       'PIR', 'ACY', 'GST', 'DET', 'BFI', 'ROP', 'ABI', 'TVC', 'EGE', 'SUN'],
      dtype='object', length=246)

@TomAugspurger
Member

One question (would be nice to confirm in the docs): .cat.codes.compute() on a categorical should be "correct" in that the category-to-code mapping should be consistent across partitions. It looks like that's the case here, but want to make sure that's not a fluke:

In [300]: s = dd.from_pandas(pd.Series(pd.Categorical(['c', 'b', 'b', 'a']), name='name'), npartitions=2).cat.as_unknown()

In [301]: s.cat.codes.compute()
Out[301]:
0    2
1    1
2    1
3    0
dtype: int8

@jcrist
Member Author

jcrist commented Jan 18, 2017

Ooh, that's a good point, I didn't think of that one. This is actually true only for categoricals that are known. For unknown categoricals, we'd need a unifying step, which can't be done without a full dataset traversal. My preference is to error in this case, rather than doing something inefficient.

Thanks for coming up with all the things I hadn't thought of yet. This is a good review :).
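A small pandas-only illustration of the problem with unknown categoricals (a sketch; in dask each series below would be a separate partition):

```python
import pandas as pd

# Each "partition" assigns codes from its own local categories, so the
# same value can map to different codes in different partitions:
p1 = pd.Series(pd.Categorical(['c', 'b']))  # categories ['b', 'c']
p2 = pd.Series(pd.Categorical(['b', 'a']))  # categories ['a', 'b']

print('b ->', p1.cat.codes[p1 == 'b'].iloc[0], 'in p1,',
      p2.cat.codes[p2 == 'b'].iloc[0], 'in p2')
```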

Cleanups to categorical accessor:
- Refactor internal design of accessor
- Accessor key names include accessor name
- Move categorical accessor to `categorical.py`
- Categorical accessor works for indices

Cleanups to categorize:
- Categorize of categorical is idempotent
- Categorize transforms unknown -> known categoricals if requested
- Categorize works on index
- Categorical methods present on categorical index
- Test methods work on categorical index
- Methods present in `dir` of index if present
- Access to `cat` and `str` fail if not proper dtype
Previously non-Series/Index output from accessor methods would wrap
their return values in calls to `pd.Series`. This resulted in
inconsistencies between pandas and dask. Since we now support arrays
with unknown chunks, we can return the proper type.

Also forward the `dt` properties for datetime indices through the
accessor, instead of adding methods on directly.

Also improved error messages for categorical failures.
@jcrist
Member Author

jcrist commented Jan 21, 2017

I believe all comments have been addressed.

One new change that people may want to look at: previously we'd wrap all output from the accessors in calls to pd.Index/pd.Series if the return type was a numpy array. This caused differences between the pandas and dask interfaces. Since we now support arrays with unknown shape, I switched these to return dask arrays. This was part of a larger refactor of how the accessor class is implemented, and how the methods get forwarded to the dd.Index class.

@jcrist
Member Author

jcrist commented Jan 21, 2017

Failure is due to groupby not supporting grouping by array. Looks like an easy fix (we already support conversion). It's late here, will fix tomorrow.

@TomAugspurger
Member

TomAugspurger commented Jan 21, 2017 via email

@jcrist
Member Author

jcrist commented Jan 23, 2017

So it turns out that having the accessors return arrays (when they do in pandas) isn't the best thing, because we lose the division information (we actually have a groupby test with an accessor and known divisions). So I need to revert the behavior here :(.

Q: Previously, when an accessor returns a numpy array, we'd wrap it in the same type as the partition type (so for foo that returns a numpy array in pandas, index.foo -> index, series.foo -> series). All but one of these methods return a numpy array only on indices. I'm not sure if this is the best way to do this, or if we should be consistent in what we wrap numpy arrays with (e.g. always series/always index).

Right now I'm leaning towards always wrapping numpy array return values in Series. Thoughts @sinhrks, @TomAugspurger?

Revert to previous behavior, where `x.accessor.foo` returns the same
type as `x` if the pandas version returns a numpy array. If `x` is a
series, the output series will have the same index as `x`. This is to
allow for broadcasting to work properly.
@jcrist
Member Author

jcrist commented Jan 23, 2017

Ok, I reverted to the previous behavior, where x.accessor.foo returns the same type as x if pandas would return a numpy array. However, we now also set the index as equivalent in the case where x is a series. This is needed to support operations where pandas would align indices before computation (e.g. groupby).

After further thought, I think this is for the best. For many operations, pandas treats arrays and indices effectively as equivalent (e.g. not aligning on index beforehand), making them a better match than series. For the one series operation that returns a numpy array (str.to_pydatetime) I figured it made sense to return a series, as it matches the other string accessor methods.
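The index-alignment point above can be shown in two lines (plain pandas, just to illustrate why accessor output keeps the input's index):

```python
import pandas as pd

s = pd.Series([1, 2], index=[10, 20])
t = pd.Series([100, 200], index=[20, 10])

# pandas aligns series on their indices before computing, so an
# accessor result with the same index as its input broadcasts correctly
print((s + t).tolist())  # [201, 102]
```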

@jcrist
Member Author

jcrist commented Jan 23, 2017

All tests passed, I think this is good to go now.

@jreback
Contributor

jreback commented Jan 23, 2017

@jcrist

str.to_pydatetime

you mean .dt.to_pydatetime right? and this returns a ndarray for compat with numpy. To be honest, dask really doesn't have a need for this at all. personally I would remove it (and any non Series/Index routines from accessors).

@jcrist
Member Author

jcrist commented Jan 23, 2017

you mean .dt.to_pydatetime right?

Oop, yes, that's what I meant.

personally I would remove it (and any non Series/Index routines from accessors).

Many of the index methods/properties that match those on the cat/dt accessor return numpy arrays for indices. Since we'd like to support things like df.index.day, this is needed. The maintenance cost here is minimal, and I think the decision to return indices makes sense.
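For context, the kind of field access being forwarded looks like this in plain pandas (dask routes the equivalent call through its accessor machinery so it works on a dask Index too):

```python
import pandas as pd

idx = pd.date_range('2017-01-01', periods=3)

# Datetime fields are available directly on a DatetimeIndex
print(list(idx.day))  # [1, 2, 3]
```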

@jreback
Contributor

jreback commented Jan 23, 2017

@jcrist sure makes sense. I think pandas will soon also simply return Index/Series for these in any event; the reason they do now is compat.

@mrocklin
Member

I just gave this PR a quick test drive and was quite pleased with the experience.

@jcrist
Member Author

jcrist commented Jan 23, 2017

I'm going to merge this now, and add docs in a follow-up PR. I think it'd be good to get this on master now so others can try it out easier. Any changes/additional features can be added as needed later.

@jcrist jcrist merged commit 42986b0 into dask:master Jan 23, 2017
@jcrist jcrist deleted the concat-cat branch January 23, 2017 19:16
@mrocklin
Member

I'm glad to see this in. Thanks for taking on a large task like this @jcrist.

@sinhrks sinhrks added this to the 0.14.0 milestone Mar 4, 2017
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
