clear_known_categories overrides categories.dtype for non-object dtypes #5756

Open
lr4d opened this issue Dec 31, 2019 · 5 comments · May be fixed by #6242
Comments

@lr4d
Contributor

lr4d commented Dec 31, 2019

At x[c].cat.set_categories([UNKNOWN_CATEGORIES], inplace=True), dask doesn't check the dtype of the categories and simply assumes it is object, overwriting the original categorical dtype with one where categories.dtype == object.

Minimal example:

import pandas as pd
date_col = pd.Categorical([], categories=[pd.Timestamp("2019-01-01")])
df = pd.DataFrame({"date": date_col})
# the categorical column keeps its datetime64[ns] categories dtype
df["date"].dtype.categories.dtype == date_col.categories.dtype  # True

from dask.dataframe.categorical import clear_known_categories
# after clearing, the categories dtype has silently become object
clear_known_categories(df)["date"].dtype.categories.dtype == date_col.categories.dtype  # False

This can cause an exception at https://github.com/dask/dask/blob/master/dask/dataframe/methods.py#L460 when using dd.concat, if there is a categorical column with non-object categories (in our case, datetimes) and the concatenation involves both an empty pandas dataframe and a non-empty one.
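For illustration, here is a minimal sketch of the underlying failure (not taken from the issue; it assumes a recent pandas where union_categoricals is exposed in pandas.api.types): union_categoricals refuses to combine categoricals whose categories have different dtypes, which is exactly what happens once the datetime categories have been replaced by the object-dtype sentinel.

import pandas as pd
from pandas.api.types import union_categoricals

# object-dtype categories, like the cleared "unknown" metadata
sentinel_cat = pd.Categorical(["__UNKNOWN_CATEGORIES__"])
# datetime64[ns] categories, like the real data
datetime_cat = pd.Categorical([pd.Timestamp("2019-01-01")])

try:
    union_categoricals([sentinel_cat, datetime_cat])
except TypeError as exc:
    print(exc)  # dtype of categories must be the same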

Original stacktrace when calling dd.concat:

/home/xrzq/venv/lib/python3.6/site-packages/dask/base.py:165: in compute
    (result,) = compute(self, traverse=False, **kwargs)
/home/xrzq/venv/lib/python3.6/site-packages/dask/base.py:436: in compute
    results = schedule(dsk, keys, **kwargs)
/home/xrzq/venv/lib/python3.6/site-packages/dask/threaded.py:81: in get
    **kwargs
/home/xrzq/venv/lib/python3.6/site-packages/dask/local.py:486: in get_async
    raise_exception(exc, tb)
/home/xrzq/venv/lib/python3.6/site-packages/dask/local.py:316: in reraise
    raise exc
/home/xrzq/venv/lib/python3.6/site-packages/dask/local.py:222: in execute_task
    result = _execute_task(task, data)
/home/xrzq/venv/lib/python3.6/site-packages/dask/core.py:119: in _execute_task
    return func(*args2)
/home/xrzq/venv/lib/python3.6/site-packages/dask/dataframe/methods.py:356: in concat
    dfs, axis=axis, join=join, uniform=uniform, filter_warning=filter_warning
/home/xrzq/venv/lib/python3.6/site-packages/dask/dataframe/methods.py:460: in concat_pandas
    out[col] = union_categoricals(parts)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

to_union = [[], Categories (0, object): [], [2019-12-30 16:59:09.534787, 2019-12-31 16:59:09.534787, 2020-01-01 16:59:09.534787, ..., 2019-12-31 16:59:09.534787,
                                 2020-01-01 16:59:09.534787, 2020-01-02 16:59:09.534787]]
sort_categories = False, ignore_order = False
E           TypeError: dtype of categories must be the same

/home/xrzq/venv/lib/python3.6/site-packages/pandas/core/dtypes/concat.py:346: TypeError
@mrocklin
Member

Hrm, yes, I can see how that would cause an issue.

Currently it looks like we use the presence of the special value "__UNKNOWN_CATEGORIES__" as a signal that we don't know the categories. We probably can't do this if we want to have unknown categoricals of different dtypes. We would probably need to come up with some other signal for this case. I don't at the moment have a good idea for what that would be.
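For reference, a small sketch (added for illustration; it assumes the dask categorical accessor's as_unknown() method and known property behave as documented) of how that sentinel shows up in the metadata of a dask Series with unknown categories:

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"x": pd.Categorical(["a", "b"])})
ddf = dd.from_pandas(pdf, npartitions=1)

# dropping the known categories leaves only the sentinel in the metadata
unknown = ddf["x"].cat.as_unknown()
print(unknown.cat.known)             # False
print(unknown._meta.cat.categories)  # Index(['__UNKNOWN_CATEGORIES__'], dtype='object')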

Alternatively, we might make the concatenation process more relaxed in the case where we have a mix of known and unknown categoricals (maybe we always convert everything to unknown in that case). That might be easier to implement. I don't have a strong background here.

cc @TomAugspurger , who I suspect has more background here than I do.

@lr4d
Contributor Author

lr4d commented Jan 2, 2020

> Alternatively, we might make the concatenation process more relaxed in the case where we have a mix of known and unknown categoricals (maybe we always convert everything to unknown in that case). That might be easier to implement. I don't have a strong background here.

FYI, in pandas no exception is raised by pd.concat when this happens, although the resulting column dtype changes from categorical to object.

pd.concat([
    pd.Series(pd.Categorical([], categories=[pd.Timestamp("2019-01-01")])),
    pd.Series(pd.Categorical([])) # categories.dtype == object
    ]
)
Out[5]: Series([], dtype: object)

@TomAugspurger
Member

The only thing I can think of is to use an empty index of the correct sub-type.

In [11]: pd.Categorical(pd.DatetimeIndex([])).categories
Out[11]: DatetimeIndex([], dtype='datetime64[ns]', freq=None)

Not sure if that will fix everything / break other things, but that's where I would start.
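Building on that suggestion, a hypothetical sketch (the _unknown_like helper below is invented for illustration and is not dask's actual fix) of clearing the categories while keeping the original categories dtype. Note this by itself does not carry the "categories are unknown" signal that the __UNKNOWN_CATEGORIES__ sentinel currently provides:

import pandas as pd

def _unknown_like(dtype: pd.CategoricalDtype) -> pd.CategoricalDtype:
    # Keep the categories dtype (e.g. datetime64[ns]) but drop the values,
    # so later concatenation does not see object vs. datetime categories.
    return pd.CategoricalDtype(categories=dtype.categories[:0], ordered=dtype.ordered)

original = pd.CategoricalDtype(categories=pd.DatetimeIndex([pd.Timestamp("2019-01-01")]))
cleared = _unknown_like(original)
print(cleared.categories.dtype)  # datetime64[ns], not object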

@mrocklin
Member

mrocklin commented Jan 2, 2020 via email

@TomAugspurger
Member

I would need to look a bit more closely at how we use the actual value "__UNKNOWN_CATEGORIES__" to say. My hope would be that we could encode the fact that the categories are unknown completely independently of the values in the CategoricalDtype's categories, but I may be wrong.
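Purely as a thought experiment (nothing here is an existing dask or pandas API), encoding unknown-ness outside the categories values could look like carrying a flag next to the dtype rather than inside it:

from dataclasses import dataclass
import pandas as pd

@dataclass
class MetaCategorical:
    dtype: pd.CategoricalDtype  # real categories dtype, e.g. datetime64[ns]
    known: bool                 # whether the categories are actually known

meta = MetaCategorical(dtype=pd.CategoricalDtype(pd.DatetimeIndex([])), known=False)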
