DataFrame has incorrect dtype after .compute() when group by categorical column #6134

Closed
eugeneh101 opened this issue Apr 24, 2020 · 2 comments · Fixed by #6205

Comments

@eugeneh101 (Contributor)

If you join a Dask DataFrame on a categorical column, the resulting Dask DataFrame reports that the column is still category dtype. However, as soon as you .compute() that DataFrame, the column comes back with the wrong dtype, no longer categorical.

Tested on Dask 2.14.0 and Pandas 1.0.3
Here is an example where the category values look like integers, so after .compute() the dtype is int64.

import dask.dataframe as dd
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "GROUPBY_COL" : [0, 0, 1, 1],
    "TO_AGG_COL": [0, 0, 10, 10],
})

ddf = dd.from_pandas(df, npartitions=2)
ddf["GROUPBY_COL"] = ddf["GROUPBY_COL"].astype("category")
ddf2 = ddf.groupby("GROUPBY_COL").agg("sum").rename(columns={"TO_AGG_COL": "AGGED_COL"})

print(ddf.join(ddf2, on="GROUPBY_COL", how="left").dtypes) # prints category
print(ddf.join(ddf2, on="GROUPBY_COL", how="left").compute().dtypes) # prints int64

If the categorical column's values look like floats, then the Dask DataFrame reports "category" dtype, but upon .compute() the pandas DataFrame reports float64.

import dask.dataframe as dd
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "GROUPBY_COL" : [0.0, 0.0, 1.0, 1.0],
    "TO_AGG_COL": [0, 0, 10, 10],
})

ddf = dd.from_pandas(df, npartitions=2)
ddf["GROUPBY_COL"] = ddf["GROUPBY_COL"].astype("category")
ddf2 = ddf.groupby("GROUPBY_COL").agg("sum").rename(columns={"TO_AGG_COL": "AGGED_COL"})

print(ddf.join(ddf2, on="GROUPBY_COL", how="left").dtypes) # prints category
print(ddf.join(ddf2, on="GROUPBY_COL", how="left").compute().dtypes) # prints float64

Similarly, if the categorical column contains strings that do not look like numbers, then the Dask DataFrame reports "category" dtype, but upon .compute() the pandas DataFrame reports object dtype.

import dask.dataframe as dd
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "GROUPBY_COL" : ["a", "a", "b", "b"],
    "TO_AGG_COL": [0, 0, 10, 10],
})

ddf = dd.from_pandas(df, npartitions=2)
ddf["GROUPBY_COL"] = ddf["GROUPBY_COL"].astype("category")
ddf2 = ddf.groupby("GROUPBY_COL").agg("sum").rename(columns={"TO_AGG_COL": "AGGED_COL"})

print(ddf.join(ddf2, on="GROUPBY_COL", how="left").dtypes) # prints category
print(ddf.join(ddf2, on="GROUPBY_COL", how="left").compute().dtypes) # prints object
@TomAugspurger (Member)

@eugeneh101 thanks for the report. It looks like the issue is somewhere in merge_chunk:

def merge_chunk(lhs, *args, **kwargs):
    empty_index_dtype = kwargs.pop("empty_index_dtype", None)
    out = lhs.merge(*args, **kwargs)
    ...

With unknown categories (like you have), the pandas DataFrames that reach merge_chunk can have different concrete CategoricalDtypes, since the observed values differ in each partition. When pandas sees multiple categorical dtypes in a concat, it upcasts to object dtype. Ideally we would union the dtypes for all of the categoricals involved in the merge (and maybe the other columns as well?).
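For context, here is a minimal, standalone pandas sketch (not Dask code) of the upcasting behavior described above: two Series with different concrete CategoricalDtypes lose the categorical dtype when concatenated.

import pandas as pd

left = pd.Series(["a", "a"], dtype="category")   # concrete categories: ["a"]
right = pd.Series(["b", "b"], dtype="category")  # concrete categories: ["b"]

print(left.dtype == right.dtype)       # False: the concrete dtypes differ
print(pd.concat([left, right]).dtype)  # object: the categorical dtype is lost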

Are you interested in working on this? Pandas has pd.api.types.union_categoricals, which might be helpful. And the function calling merge_chunk can say exactly which columns (and the index) are categorical and should be unioned.
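As an illustration only (not Dask's actual fix; the helper name align_categories is hypothetical), unioning the categories before concatenation keeps the dtype categorical:

import pandas as pd
from pandas.api.types import union_categoricals

def align_categories(series_list):
    # Recast every categorical Series to the union of all observed categories,
    # so each piece carries the same concrete CategoricalDtype.
    union_dtype = union_categoricals(list(series_list)).dtype
    return [s.astype(union_dtype) for s in series_list]

left = pd.Series(["a", "a"], dtype="category")
right = pd.Series(["b", "b"], dtype="category")

aligned = align_categories([left, right])
print(pd.concat(aligned).dtype)  # prints "category"; the union dtype has categories ['a', 'b']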

@eugeneh101 (Contributor, Author)

Hello Tom!
Thank you for the quick reply and for explaining what goes on behind the scenes. I would like to respectfully decline, as I don't think I know enough to make it work. Nonetheless, thank you for your fantastic work on Dask. I use it every single day!

@TomAugspurger added the dataframe and good first issue labels on Apr 28, 2020
@jsignell self-assigned this on May 13, 2020