Categorical column turned to NaN after P2P Shuffle #8186

Closed
NickSchouten opened this issue Sep 14, 2023 · 2 comments · Fixed by #8332
Labels: bug (Something is broken), shuffle


NickSchouten commented Sep 14, 2023

Describe the issue:
If you P2P shuffle a dataframe that has a categorical column, every value in that column is turned into NaN.

Minimal Complete Verifiable Example:

from dask.distributed import Client
from dask.distributed import LocalCluster
import dask.dataframe as dd

cluster = LocalCluster()
client = Client(cluster)
client.cluster.scale(1)


dask_df = dd.from_dict(
    {
        "b": ["a", "shuffle", "column"],
        "test": ["apple", "pear", "citrus"],
    },
    npartitions=2,
)
dask_df.test = dask_df.test.astype('category')  
print(list(dask_df.test.unique().compute())) # ['apple', 'pear', 'citrus']
dask_df = dask_df.shuffle(on="b").persist() 
print(list(dask_df.test.unique().compute())) # [nan]

Anything else we need to know?:
This is probably specific to P2P shuffling. The task-based shuffle,
dask_df = dask_df.shuffle(on="b", shuffle="tasks").persist()
does not have the issue.

If the column is not categorical it also does not happen.

Might be related to #8183 and #8165.

Environment:

Dask version: 2023.9.1
Python version: 3.10
Operating System: Linux
Install method (conda, pip, source): conda

jrbourbeau added the bug and shuffle labels and removed the needs triage label on Sep 14, 2023
jrbourbeau (Member) commented

Thanks for the issue @NickSchouten. I'm able to reproduce.

cc @hendrikmakait @fjetter for visibility

fjetter (Member) commented Sep 14, 2023

Bisect blames #7879, i.e. the first affected version was already 2023.6.0.
