
Support non-uniform categoricals#1877

Merged
jcrist merged 24 commits into dask:master from jcrist:concat-cat
Jan 23, 2017

Conversation

@jcrist
Member

@jcrist jcrist commented Jan 3, 2017

This adds support for non-uniform categoricals (different categories in each partition) to dask.dataframe. Currently this is written to only support pandas 0.19.0. It could be made backwards compatible, but we may also want to discuss dropping support for < 0.19.0.

Fixes #1836.

TODO:

  • Support non-uniform categoricals in partitions
  • Basic functionality works (reductions, groupby, etc...)
  • categorical accessor
  • get_dummies
  • pivot_table
  • concat
  • append
  • merge
  • read_csv (as per ENH: parse categoricals in read_csv pandas-dev/pandas#13406)
  • hash_pandas_object is categorical-mapping agnostic
  • Fix bug in partd
  • Maybe make backwards compatible with pandas 0.18?

@mrocklin
Member

mrocklin commented Jan 3, 2017

This looks cool. It would be nice to remove this roadblock when using categoricals with dask.dataframe.

cc @TomAugspurger @sinhrks

@mrocklin
Member

mrocklin commented Jan 6, 2017

What is the status here? @jcrist are you waiting on feedback? Please let me know if this is blocking on me.

@jcrist
Member Author

jcrist commented Jan 6, 2017

No, I'm still working on this.

jcrist added 14 commits January 12, 2017 11:46
This adds support for non-consistent categoricals in each partition.
Both `get_dummies` and `pivot_table` error properly if categories are
not fully known.
Throw a NotImplementedError if the categories are unknown. An
alternative may be to compute immediately, or to return a lazy Index
object.
Test common operations work properly
We do this because pandas frequently has inconsistent dtypes in
operations with empty dataframes. We may want to move this into
`methods.concat`, but at the same time that needs to be able to use
empty frames to ensure partitions match the desired schema. This all is
pretty mucky stuff, hopefully can find a cleaner way.
Reading directly into a categorical dtype from `read_csv` now returns an
unknown categorical.
- `known` is whether the categories are known
- `as_known` converts unknown -> known
- `as_unknown` converts known -> unknown
- pin conda to fix conda bug?
- skip csv test in pandas < 0.19.2
jreback pushed a commit to pandas-dev/pandas that referenced this pull request Jan 18, 2017
Previously categorical values were hashed using just their codes. This
meant that the hash value depended on the ordering of the categories,
rather than on the values the series represented. This caused problems
in dask, where different partitions might have different categorical
mappings. This PR makes the hashing dependent on the values the
categorical represents, rather than on the codes. The categories are
first hashed, and then the codes are remapped to the hashed values.
This is slightly slower than before (we still need to hash the
categories, where we didn't before), but allows for more consistent
hashing. Related to this work in dask: dask/dask#1877.

Author: Jim Crist <crist042@umn.edu>

Closes #15143 from jcrist/categories_hash_consistently and squashes the following commits:

f1aea13 [Jim Crist] Address comments
7878c55 [Jim Crist] Categoricals hash consistently
Due to changes to `concat`, python ints no longer convert to int32 on
int32 systems when concatenated
@jcrist jcrist changed the title [WIP] Support non-uniform categoricals Support non-uniform categoricals Jan 18, 2017
@jcrist
Member Author

jcrist commented Jan 18, 2017

Ok, this should be ready for review. A quick walkthrough of what's added here:

This adds support for different partitions having different categorical mappings. This is accomplished through union_categoricals, a method added in pandas 0.19.0. As such, support for pandas < 0.19.0 has been dropped.
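The mechanism can be illustrated with pandas alone. This is a sketch of what happens when two partitions carry different category sets; dask applies the same operation across its partitions:

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Two "partitions" whose categorical columns have different categories
part1 = pd.Categorical(['a', 'b', 'a'])
part2 = pd.Categorical(['b', 'c'])

# union_categoricals combines them into one categorical whose
# categories are the union of the inputs' categories
combined = union_categoricals([part1, part2])
print(list(combined.categories))  # ['a', 'b', 'c']
```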

Changes to metadata:

Categorical columns/indices now have two states:

  • known categoricals have the categories known statically (on the _meta attribute)
  • unknown categoricals don't know the categories statically, and are indicated by the presence of dd.utils.UNKNOWN_CATEGORIES in the categories on the meta attribute. This is nice, because operations on the categorical in most cases propagate the presence of this value, so the known/unknown state carries through inference.
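A minimal sketch of how the sentinel survives inference, using plain pandas on an empty metadata series (the sentinel's string value here is an assumption; the real object is `dd.utils.UNKNOWN_CATEGORIES`):

```python
import pandas as pd

# Stand-in for dask's sentinel; the real one lives in dask.dataframe.utils
UNKNOWN_CATEGORIES = '__UNKNOWN_CATEGORIES__'

# An "unknown" categorical's metadata carries the sentinel as its only category
meta = pd.Series(pd.Categorical([], categories=[UNKNOWN_CATEGORIES]))

# Operations on the empty metadata keep the categorical dtype, so the
# sentinel (and with it the unknown state) propagates through inference
result = meta[meta.notnull()]
print(UNKNOWN_CATEGORIES in result.cat.categories)  # True
```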

Changes to the categorical accessor

To query and convert between these states, the following has been added to the categorical accessor:

  • .cat.known is a property that indicates if the categories are known
  • .cat.as_known() converts unknown to known (and is a no-op if already known). Note that this requires computation.
  • .cat.as_unknown() converts known to unknown. This requires no computation - it's just a change in metadata.

If the categories are known, then .cat.categories returns statically (no computation needed). If they are unknown, an error is raised (rather than computing them).
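To see why `.cat.as_known()` requires computation, here is a rough pandas-only sketch of what it must do (the helper name and implementation are illustrative, not dask's actual code):

```python
import pandas as pd
from pandas.api.types import union_categoricals

def as_known(partitions):
    """Sketch: a full pass over all partitions to compute the union of
    categories, then a remap of each partition onto that union."""
    union = union_categoricals([p.values for p in partitions]).categories
    return [p.cat.set_categories(union) for p in partitions]

parts = [pd.Series(pd.Categorical(['a', 'b'])),
         pd.Series(pd.Categorical(['c', 'a']))]
known = as_known(parts)
print([list(p.cat.categories) for p in known])  # identical for every partition
```

The full pass over the data is what makes this a computation, while `.cat.as_unknown()` only edits metadata.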

Changes to concat

Concatenation of dask dataframes with different categoricals is now supported. For each column/index in a dataframe, the output dask column/index will have known categories only if all categories are known. Otherwise the categories will be unknown.

After a call to concat, each partition is guaranteed to match the metadata on the containing dask dataframe. This fixes a bug from before where in certain cases partitions could have different columns/datatypes.
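A pandas-only illustration of why concat needs special handling for categoricals (a sketch of the underlying behavior, not dask's code):

```python
import pandas as pd
from pandas.api.types import union_categoricals

a = pd.Series(pd.Categorical(['x', 'y']))
b = pd.Series(pd.Categorical(['y', 'z']))

# Plain pd.concat falls back to object dtype when the categories differ...
print(pd.concat([a, b], ignore_index=True).dtype)  # object

# ...so a categorical-aware concat unions the categories instead:
merged = union_categoricals([a.values, b.values])
print(merged.dtype)  # category
```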

Changes to pivot_table and get_dummies

These functions only work if the categories are known. If unknown, an error is raised.

Changes to read_csv

For pandas >= 0.19.2, dask now supports reading from csvs directly into categorical columns. These will have unknown categories.
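This builds on the pandas feature from pandas-dev/pandas#13406, which each partition's read uses; the column names here are just an example:

```python
import io
import pandas as pd

csv = io.StringIO("Origin,DepDelay\nLAX,5\nJFK,12\nLAX,3\n")

# Parse a column directly into a categorical dtype while reading
df = pd.read_csv(csv, dtype={'Origin': 'category'})
print(df['Origin'].dtype)  # category
```

In dask, the resulting column's categories are marked unknown, since each partition only sees its own values.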

Changes to merge

No changes are made to merge. Pandas merge doesn't support merging categoricals, even if the categoricals are identical. As such, dask doesn't error, but the categorical datatype is dropped. When this is fixed in pandas, support for merge on differing categoricals can be added.

Changes to hash_pandas_object

Previously, this would hash categories based on codes only. We now hash based on the values the categoricals represent. This was patched in pandas, and will be in the next release. The fix is backported to dask for versions 0.19.0 to 0.19.2. The faster (and more stable) object hash will be used for pandas 0.19.2.
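The difference is observable in pandas today (the fix described above has long since been released). Two series with the same values but different category orderings have different codes, yet hash equally:

```python
import pandas as pd
from pandas.util import hash_pandas_object

# Same values, different category orderings (hence different codes)
a = pd.Series(pd.Categorical(['x', 'y'], categories=['x', 'y']))
b = pd.Series(pd.Categorical(['x', 'y'], categories=['y', 'x']))
print(a.cat.codes.tolist(), b.cat.codes.tolist())  # [0, 1] vs [1, 0]

# Value-based hashing gives equal hashes despite differing codes
print((hash_pandas_object(a, index=False) ==
       hash_pandas_object(b, index=False)).all())  # True
```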

@jcrist
Member Author

jcrist commented Jan 18, 2017

And a quick demo using the airline dataset:

In [1]: import pandas as pd

In [2]: dtypes = pd.read_csv('1987.csv', dtype={'Origin': 'category', 'Dest': 'category'}).dtypes.to_dict()

In [3]: import dask.dataframe as dd

# The dtypes for `Origin` and `Dest` are `category`, so they're
# read directly in as categoricals
In [4]: ddf = dd.read_csv('198*.csv', dtype=dtypes)

In [5]: ddf
Out[5]: dd.DataFrame<from-de..., npartitions=18>

In [6]: ddf.Origin.dtype
Out[6]: category

In [7]: ddf.Origin.cat.known    # categories are unknown at this point
Out[7]: False

In [8]: worst_avg_delay = ddf.DepDelay.groupby(ddf.Origin).mean().nlargest(10)

In [9]: worst_avg_delay.index.dtype
Out[9]: category

In [10]: worst_avg_delay.index.cat.known
Out[10]: False

In [11]: worst_avg_delay.compute()
Out[11]:
Origin
PIR    85.000000
ABI    36.500000
EGE    32.363636
BFI    28.000000
ROP    24.700000
ACV    19.740223
SUN    19.000000
YAP    16.902482
RDD    15.724234
HDN    14.147350
Name: DepDelay, dtype: float64

# Convert unknown categoricals to known
In [12]: known_origin = ddf.Origin.cat.as_known()

In [13]: known_origin.cat.known
Out[13]: True

In [14]: known_origin.cat.categories
Out[14]:
Index(['ABE', 'ABQ', 'ACV', 'AGS', 'ALB', 'ALO', 'AMA', 'ANC', 'APF', 'ATL',
       ...
       'PIR', 'ACY', 'GST', 'DET', 'BFI', 'ROP', 'ABI', 'TVC', 'EGE', 'SUN'],
      dtype='object', length=246)

@TomAugspurger
Member

One question (would be nice to confirm in the docs): .cat.codes.compute() on a categorical should be "correct" in that the category-to-code mapping should be consistent across partitions. It looks like that's the case here, but want to make sure that's not a fluke:

In [300]: s = dd.from_pandas(pd.Series(pd.Categorical(['c', 'b', 'b', 'a']), name='name'), npartitions=2).cat.as_unknown()

In [301]: s.cat.codes.compute()
Out[301]:
0    2
1    1
2    1
3    0
dtype: int8

@jcrist
Member Author

jcrist commented Jan 18, 2017

Ooh, that's a good point, I didn't think of that one. This is actually true only for categoricals that are known. For unknown categoricals, we'd need a unifying step, which can't be done without a full dataset traversal. My preference is to error in this case, rather than doing something inefficient.

Thanks for coming up with all the things I hadn't thought of yet. This is a good review :).
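A small pandas-only illustration of the problem with unknown categoricals (a sketch; in dask each series below would be a separate partition):

```python
import pandas as pd

# Each "partition" assigns codes from its own local categories, so the
# same value can map to different codes in different partitions:
p1 = pd.Series(pd.Categorical(['c', 'b']))  # categories ['b', 'c']
p2 = pd.Series(pd.Categorical(['b', 'a']))  # categories ['a', 'b']

print('b ->', p1.cat.codes[p1 == 'b'].iloc[0], 'in p1,',
      p2.cat.codes[p2 == 'b'].iloc[0], 'in p2')
```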

Cleanups to categorical accessor:
- Refactor internal design of accessor
- Accessor key names include accessor name
- Move categorical accessor to `categorical.py`
- Categorical accessor works for indices

Cleanups to categorize:
- Categorize of categorical is idempotent
- Categorize transforms unknown -> known categoricals if requested
- Categorize works on index
- Categorical methods present on categorical index
- Test methods work on categorical index
- Methods present in `dir` of index if present
- Access to `cat` and `str` fail if not proper dtype
Previously non-Series/Index output from accessor methods would wrap
their return values in calls to `pd.Series`. This resulted in
inconsistencies between pandas and dask. Since we now support arrays
with unknown chunks, we can return the proper type.

Also forward the `dt` properties for datetime indices through the
accessor, instead of adding methods on directly.

Also improved error messages for categorical failures.
@jcrist
Member Author

jcrist commented Jan 21, 2017

I believe all comments have been addressed.

One new change that people may want to look at: previously we'd wrap all output from the accessors in calls to pd.Index/pd.Series if the return type was a numpy array. This caused differences between the pandas and dask interfaces. Since we now support arrays with unknown shape, I switched these to return dask arrays. This was part of a larger refactor of how the accessor class is implemented, and how the methods get forwarded to the dd.Index class.

@jcrist
Member Author

jcrist commented Jan 21, 2017

Failure is due to groupby not supporting grouping by array. Looks like an easy fix (we already support conversion). It's late here, will fix tomorrow.

@TomAugspurger
Member

TomAugspurger commented Jan 21, 2017 via email

@jcrist
Member Author

jcrist commented Jan 23, 2017

So it turns out that having the accessors return arrays (when they do in pandas) isn't the best thing, because we lose the division information (we actually have a groupby test with an accessor and known divisions). So I need to revert the behavior here :(.

Q: Previously, when an accessor returns a numpy array, we'd wrap it in the same type as the partition type (so for foo that returns a numpy array in pandas, index.foo -> index, series.foo -> series). All but one of these methods return a numpy array only on indices. I'm not sure if this is the best way to do this, or if we should be consistent in what we wrap numpy arrays with (e.g. always series/always index).

Right now I'm leaning towards always wrapping numpy array return values in Series. Thoughts @sinhrks, @TomAugspurger?

Revert to previous behavior, where `x.accessor.foo` returns the same
type as `x` if the pandas version returns a numpy array. If `x` is a
series, the output series will have the same index as `x`. This is to
allow for broadcasting to work properly.
@jcrist
Member Author

jcrist commented Jan 23, 2017

Ok, I reverted to the previous behavior, where x.accessor.foo returns the same type as x if pandas would return a numpy array. However, we now also set the index as equivalent in the case where x is a series. This is needed to support operations where pandas would align indices before computation (e.g. groupby).

After further thought, I think this is for the best. For many operations, pandas treats arrays and indices effectively as equivalent (e.g. not aligning on index beforehand), making them a better match than series. For the one series operation that returns a numpy array (str.to_pydatetime) I figured it made sense to return a series, as it matches the other string accessor methods.
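The index-alignment point above can be shown in two lines (plain pandas, just to illustrate why accessor output keeps the input's index):

```python
import pandas as pd

s = pd.Series([1, 2], index=[10, 20])
t = pd.Series([100, 200], index=[20, 10])

# pandas aligns series on their indices before computing, so an
# accessor result with the same index as its input broadcasts correctly
print((s + t).tolist())  # [201, 102]
```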

@jcrist
Member Author

jcrist commented Jan 23, 2017

All tests passed, I think this is good to go now.

@jreback
Contributor

jreback commented Jan 23, 2017

@jcrist

str.to_pydatetime

you mean .dt.to_pydatetime right? and this returns a ndarray for compat with numpy. To be honest, dask really doesn't have a need for this at all. personally I would remove it (and any non Series/Index routines from accessors).

@jcrist
Member Author

jcrist commented Jan 23, 2017

you mean .dt.to_pydatetime right?

Oop, yes, that's what I meant.

personally I would remove it (and any non Series/Index routines from accessors).

Many of the index methods/properties that match those on the cat/dt accessor return numpy arrays for indices. Since we'd like to support things like df.index.day, this is needed. The maintenance cost here is minimal, and I think the decision to return indices makes sense.
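For context, the kind of field access being forwarded looks like this in plain pandas (dask routes the equivalent call through its accessor machinery so it works on a dask Index too):

```python
import pandas as pd

idx = pd.date_range('2017-01-01', periods=3)

# Datetime fields are available directly on a DatetimeIndex
print(list(idx.day))  # [1, 2, 3]
```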

@jreback
Contributor

jreback commented Jan 23, 2017

@jcrist sure makes sense. I think pandas will soon also simply return Index/Series for these in any event; the reason they do now is compat.

@mrocklin
Member

I just gave this PR a quick test drive and was quite pleased with the experience.

@jcrist
Member Author

jcrist commented Jan 23, 2017

I'm going to merge this now, and add docs in a follow-up PR. I think it'd be good to get this on master now so others can try it out easier. Any changes/additional features can be added as needed later.

@jcrist jcrist merged commit 42986b0 into dask:master Jan 23, 2017
@jcrist jcrist deleted the concat-cat branch January 23, 2017 19:16
@mrocklin
Member

I'm glad to see this in. Thanks for taking on a large task like this @jcrist.

@sinhrks sinhrks added this to the 0.14.0 milestone Mar 4, 2017
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
