clear_known_categories overrides categories.dtype for non-object dtypes #5756

Open
lr4d opened this issue Dec 31, 2019 · 5 comments · May be fixed by #6242
Comments

@lr4d
Contributor

lr4d commented Dec 31, 2019

At x[c].cat.set_categories([UNKNOWN_CATEGORIES], inplace=True), dask doesn't check the dtype of the categories and simply assumes it is object, overwriting the original categorical dtype with one where categories.dtype == object.

Minimal example:

import pandas as pd
date_col = pd.Categorical([], categories=[pd.Timestamp("2019-01-01")])
df = pd.DataFrame({"date": date_col})
# the categorical column keeps its datetime64[ns] categories dtype
df["date"].dtype.categories.dtype == date_col.categories.dtype  # True

from dask.dataframe.categorical import clear_known_categories
# after clearing, the categories dtype has silently become object
clear_known_categories(df)["date"].dtype.categories.dtype == date_col.categories.dtype  # False

This can cause an exception at https://github.com/dask/dask/blob/master/dask/dataframe/methods.py#L460 when using dd.concat, if there is a categorical column with non-object categories (in our case, datetimes) and the concatenation involves both an empty pandas dataframe and a non-empty one.
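For illustration, here is a minimal sketch of the underlying failure (not taken from the issue; it assumes a recent pandas where union_categoricals is exposed in pandas.api.types): union_categoricals refuses to combine categoricals whose categories have different dtypes, which is exactly what happens once the datetime categories have been replaced by the object-dtype sentinel.

import pandas as pd
from pandas.api.types import union_categoricals

# object-dtype categories, like the cleared "unknown" metadata
sentinel_cat = pd.Categorical(["__UNKNOWN_CATEGORIES__"])
# datetime64[ns] categories, like the real data
datetime_cat = pd.Categorical([pd.Timestamp("2019-01-01")])

try:
    union_categoricals([sentinel_cat, datetime_cat])
except TypeError as exc:
    print(exc)  # dtype of categories must be the same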

Original stacktrace when calling dd.concat:

/home/xrzq/venv/lib/python3.6/site-packages/dask/base.py:165: in compute
    (result,) = compute(self, traverse=False, **kwargs)
/home/xrzq/venv/lib/python3.6/site-packages/dask/base.py:436: in compute
    results = schedule(dsk, keys, **kwargs)
/home/xrzq/venv/lib/python3.6/site-packages/dask/threaded.py:81: in get
    **kwargs
/home/xrzq/venv/lib/python3.6/site-packages/dask/local.py:486: in get_async
    raise_exception(exc, tb)
/home/xrzq/venv/lib/python3.6/site-packages/dask/local.py:316: in reraise
    raise exc
/home/xrzq/venv/lib/python3.6/site-packages/dask/local.py:222: in execute_task
    result = _execute_task(task, data)
/home/xrzq/venv/lib/python3.6/site-packages/dask/core.py:119: in _execute_task
    return func(*args2)
/home/xrzq/venv/lib/python3.6/site-packages/dask/dataframe/methods.py:356: in concat
    dfs, axis=axis, join=join, uniform=uniform, filter_warning=filter_warning
/home/xrzq/venv/lib/python3.6/site-packages/dask/dataframe/methods.py:460: in concat_pandas
    out[col] = union_categoricals(parts)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

to_union = [[], Categories (0, object): [], [2019-12-30 16:59:09.534787, 2019-12-31 16:59:09.534787, 2020-01-01 16:59:09.534787, ..., 2019-12-31 16:59:09.534787,
                                 2020-01-01 16:59:09.534787, 2020-01-02 16:59:09.534787]]
sort_categories = False, ignore_order = False
E           TypeError: dtype of categories must be the same

/home/xrzq/venv/lib/python3.6/site-packages/pandas/core/dtypes/concat.py:346: TypeError
@mrocklin
Member

Hrm, yes, I can see how that would cause an issue.

Currently it looks like we use the presence of the special value "__UNKNOWN_CATEGORIES__" as a signal that we don't know the categories. We probably can't do this if we want to have unknown categoricals of different dtypes. We would probably need to come up with some other signal for this case. I don't at the moment have a good idea for what that would be.
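For reference, a small sketch (added for illustration; it assumes the dask categorical accessor's as_unknown() method and known property behave as documented) of how that sentinel shows up in the metadata of a dask Series with unknown categories:

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"x": pd.Categorical(["a", "b"])})
ddf = dd.from_pandas(pdf, npartitions=1)

# dropping the known categories leaves only the sentinel in the metadata
unknown = ddf["x"].cat.as_unknown()
print(unknown.cat.known)             # False
print(unknown._meta.cat.categories)  # Index(['__UNKNOWN_CATEGORIES__'], dtype='object')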

Alternatively, we might make the concatenation process more relaxed in the case where we have a mix of known and unknown categoricals (maybe we always convert everything to unknown in that case). That might be easier to implement. I don't have a strong background here.

cc @TomAugspurger , who I suspect has more background here than I do.

@lr4d
Contributor Author

lr4d commented Jan 2, 2020

> Alternatively, we might make the concatenation process more relaxed in the case where we have a mix of known and unknown categoricals (maybe we always convert everything to unknown in that case). That might be easier to implement. I don't have a strong background here.

FYI, in pandas no exception is raised by pd.concat when this happens, although the resulting column dtype changes from categorical to object.

pd.concat([
    pd.Series(pd.Categorical([], categories=[pd.Timestamp("2019-01-01")])),
    pd.Series(pd.Categorical([])) # categories.dtype == object
    ]
)
Out[5]: Series([], dtype: object)

@TomAugspurger
Member

The only thing I can think of is to use an empty index of the correct sub-type.

In [11]: pd.Categorical(pd.DatetimeIndex([])).categories
Out[11]: DatetimeIndex([], dtype='datetime64[ns]', freq=None)

Not sure if that will fix everything / break other things, but that's where I would start.
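Building on that suggestion, a hypothetical sketch (the _unknown_like helper below is invented for illustration and is not dask's actual fix) of clearing the categories while keeping the original categories dtype. Note this by itself does not carry the "categories are unknown" signal that the __UNKNOWN_CATEGORIES__ sentinel currently provides:

import pandas as pd

def _unknown_like(dtype: pd.CategoricalDtype) -> pd.CategoricalDtype:
    # Keep the categories dtype (e.g. datetime64[ns]) but drop the values,
    # so later concatenation does not see object vs. datetime categories.
    return pd.CategoricalDtype(categories=dtype.categories[:0], ordered=dtype.ordered)

original = pd.CategoricalDtype(categories=pd.DatetimeIndex([pd.Timestamp("2019-01-01")]))
cleared = _unknown_like(original)
print(cleared.categories.dtype)  # datetime64[ns], not object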

@mrocklin
Member

mrocklin commented Jan 2, 2020 via email

@TomAugspurger
Member

I would need to look a bit more closely at how we use the actual value "__UNKNOWN_CATEGORIES__" to say. My hope would be that we could encode the fact that the categories are unknown completely independently of the values in the CategoricalDtype's categories, but I may be wrong.
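Purely as a thought experiment (nothing here is an existing dask or pandas API), encoding unknown-ness outside the categories values could look like carrying a flag next to the dtype rather than inside it:

from dataclasses import dataclass
import pandas as pd

@dataclass
class MetaCategorical:
    dtype: pd.CategoricalDtype  # real categories dtype, e.g. datetime64[ns]
    known: bool                 # whether the categories are actually known

meta = MetaCategorical(dtype=pd.CategoricalDtype(pd.DatetimeIndex([])), known=False)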
