Unique Operation fails on dataframe repartitioned using set index after resetting the index #11073

mscanlon-exos · 2024-04-25T16:25:55Z

Describe the issue:

Minimal Complete Verifiable Example:

import pandas as pd
import dask.dataframe as dd

data = {
    'Column1': range(30),
    'Column2': range(30, 60)
}

pdf = pd.DataFrame(data)

# Convert the pandas DataFrame to a Dask DataFrame with a single partition
ddf = dd.from_pandas(pdf, npartitions=1)
# Set the index with explicit divisions, which repartitions the data into 3 partitions
ddf = ddf.set_index('Column1', sort=True, divisions=[0,10,20,29], shuffle='tasks')

print(ddf.npartitions)

ddf = ddf.reset_index()

unique = ddf['Column1'].unique().compute()
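
For comparison, here is a minimal sketch of the result this call would be expected to return, computed with pandas alone. It assumes Dask's `unique` should agree with pandas up to ordering; the name `expected` is illustrative only.

```python
import pandas as pd

# Same data as in the example above
pdf = pd.DataFrame({"Column1": range(30), "Column2": range(30, 60)})

# pandas reference: the distinct values of Column1
expected = pdf["Column1"].unique()

# Sort before comparing, since a distributed result may not preserve order
print(sorted(expected.tolist()))
```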

Anything else we need to know?:

Environment:

Dask version: 2024.4.2
Python version: 3.10
Operating System: macOS
Install method (conda, pip, source): dask[dataframe]

phofl · 2024-04-25T17:49:04Z

Thanks for the report. I pushed out a release with a fix. It's already on PyPI and will be on conda-forge soon.
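
To see which releases an environment currently has before retrying the example, the installed versions can be printed. The thread does not state the version number that contains the fix, so this sketch only reports what is installed; `installed_versions` is a hypothetical helper, not part of Dask.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("dask", "dask-expr")):
    """Return a dict mapping each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # Package is not installed in this environment
            found[name] = None
    return found


print(installed_versions())
```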

github-actions bot added the "needs triage" label (Needs a response from a contributor) Apr 25, 2024

phofl mentioned this issue Apr 25, 2024

Fix shuffle after set_index from 1 partition df dask/dask-expr#1040

Merged

phofl closed this as completed in dask/dask-expr#1040 Apr 25, 2024
