DataFrame has incorrect dtype after .compute() when group by categorical column #6134

Closed
eugeneh101 opened this issue Apr 24, 2020 · 2 comments · Fixed by #6205

Comments

@eugeneh101 (Contributor)

If you join a Dask DataFrame on a categorical column, the resulting Dask DataFrame reports that the column is still category dtype. However, as soon as you .compute() that DataFrame, the column comes back with the wrong dtype, no longer categorical.

Tested on Dask 2.14.0 and Pandas 1.0.3
Here is an example where the category values look like integers, so after .compute() the dtype is int64.

import dask.dataframe as dd
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "GROUPBY_COL" : [0, 0, 1, 1],
    "TO_AGG_COL": [0, 0, 10, 10],
})

ddf = dd.from_pandas(df, npartitions=2)
ddf["GROUPBY_COL"] = ddf["GROUPBY_COL"].astype("category")
ddf2 = ddf.groupby("GROUPBY_COL").agg("sum").rename(columns={"TO_AGG_COL": "AGGED_COL"})

print(ddf.join(ddf2, on="GROUPBY_COL", how="left").dtypes) # prints category
print(ddf.join(ddf2, on="GROUPBY_COL", how="left").compute().dtypes) # prints int64

If the categorical column's values look like floats, then the Dask DataFrame reports "category" dtype, but upon .compute() the pandas DataFrame reports float64.

import dask.dataframe as dd
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "GROUPBY_COL" : [0.0, 0.0, 1.0, 1.0],
    "TO_AGG_COL": [0, 0, 10, 10],
})

ddf = dd.from_pandas(df, npartitions=2)
ddf["GROUPBY_COL"] = ddf["GROUPBY_COL"].astype("category")
ddf2 = ddf.groupby("GROUPBY_COL").agg("sum").rename(columns={"TO_AGG_COL": "AGGED_COL"})

print(ddf.join(ddf2, on="GROUPBY_COL", how="left").dtypes) # prints category
print(ddf.join(ddf2, on="GROUPBY_COL", how="left").compute().dtypes) # prints float64

Similarly, if the categorical column contains strings that do not look like numbers, then the Dask DataFrame reports "category" dtype, but upon .compute() the pandas DataFrame reports object dtype.

import dask.dataframe as dd
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "GROUPBY_COL" : ["a", "a", "b", "b"],
    "TO_AGG_COL": [0, 0, 10, 10],
})

ddf = dd.from_pandas(df, npartitions=2)
ddf["GROUPBY_COL"] = ddf["GROUPBY_COL"].astype("category")
ddf2 = ddf.groupby("GROUPBY_COL").agg("sum").rename(columns={"TO_AGG_COL": "AGGED_COL"})

print(ddf.join(ddf2, on="GROUPBY_COL", how="left").dtypes) # prints category
print(ddf.join(ddf2, on="GROUPBY_COL", how="left").compute().dtypes) # prints object
@TomAugspurger (Member)

@eugeneh101 thanks for the report. It looks like the issue is somewhere in merge_chunk:

def merge_chunk(lhs, *args, **kwargs):
    empty_index_dtype = kwargs.pop("empty_index_dtype", None)
    out = lhs.merge(*args, **kwargs)
    ...

With unknown categories (like you have), the pandas DataFrames that reach merge_chunk can have different concrete CategoricalDtypes, since the observed values differ in each partition. When pandas sees multiple categorical dtypes in a concat, it upcasts to object dtype. Ideally we would union the dtypes for all of the categoricals involved in the merge (and maybe the other columns as well?).
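For context, here is a minimal, standalone pandas sketch (not Dask code) of the upcasting behavior described above: two Series with different concrete CategoricalDtypes lose the categorical dtype when concatenated.

import pandas as pd

left = pd.Series(["a", "a"], dtype="category")   # concrete categories: ["a"]
right = pd.Series(["b", "b"], dtype="category")  # concrete categories: ["b"]

print(left.dtype == right.dtype)       # False: the concrete dtypes differ
print(pd.concat([left, right]).dtype)  # object: the categorical dtype is lost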

Are you interested in working on this? Pandas has pd.api.types.union_categoricals, which might be helpful. And the function calling merge_chunk can say exactly which columns (and the index) are categorical and should be unioned.
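As an illustration only (not Dask's actual fix; the helper name align_categories is hypothetical), unioning the categories before concatenation keeps the dtype categorical:

import pandas as pd
from pandas.api.types import union_categoricals

def align_categories(series_list):
    # Recast every categorical Series to the union of all observed categories,
    # so each piece carries the same concrete CategoricalDtype.
    union_dtype = union_categoricals(list(series_list)).dtype
    return [s.astype(union_dtype) for s in series_list]

left = pd.Series(["a", "a"], dtype="category")
right = pd.Series(["b", "b"], dtype="category")

aligned = align_categories([left, right])
print(pd.concat(aligned).dtype)  # prints "category"; the union dtype has categories ['a', 'b']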

@eugeneh101 (Contributor, Author)

Hello Tom!
Thank you for the quick reply and for explaining what goes on behind the scenes. I would like to respectfully decline, as I don't think I know enough to make it work. Nonetheless, thank you for your fantastic work on Dask. I use it every single day!

@TomAugspurger added the dataframe and good first issue labels on Apr 28, 2020
@jsignell self-assigned this on May 13, 2020