Unique Operation fails on dataframe repartitioned using set index after resetting the index #11073

Closed
mscanlon-exos opened this issue Apr 25, 2024 · 1 comment · Fixed by dask/dask-expr#1040
Labels
needs triage Needs a response from a contributor

Comments

@mscanlon-exos

Describe the issue:

Minimal Complete Verifiable Example:

import pandas as pd
import dask.dataframe as dd

data = {
    'Column1': range(30),
    'Column2': range(30, 60)
}

pdf = pd.DataFrame(data)

# Convert the pandas DataFrame to a Dask DataFrame with a single partition;
# the set_index call below then splits it into 3 partitions via explicit divisions
ddf = dd.from_pandas(pdf, npartitions=1)
ddf = ddf.set_index('Column1', sort=True, divisions=[0,10,20,29], shuffle='tasks')

print(ddf.npartitions)

ddf = ddf.reset_index()

unique = ddf['Column1'].unique().compute()
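
For comparison, here is a minimal sketch of the same operation in plain pandas, without Dask. It assumes that the Dask call is expected to return the same set of values as pandas does; the variable name `expected` is only illustrative and is not part of the original report.

```python
import pandas as pd

# Same input data as in the example above
pdf = pd.DataFrame({
    'Column1': range(30),
    'Column2': range(30, 60),
})

# Mirror the Dask steps: set the index, then reset it again
pdf = pdf.set_index('Column1').reset_index()

# pandas returns the distinct values of the column as a NumPy array
expected = pdf['Column1'].unique()
print(len(expected))
```

Column1 holds 30 distinct values, so a working Dask `unique()` on the reset column should presumably yield the same 30 values, though not necessarily in the same order.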

Anything else we need to know?:

Environment:

  • Dask version: 2024.4.2
  • Python version: 3.10
  • Operating System: macOS
  • Install method (conda, pip, source): dask[dataframe]
@phofl
Collaborator

phofl commented Apr 25, 2024

Thanks for the report. I pushed out a release with a fix. It's already on PyPI and will be on conda-forge soon.
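
To check whether an environment has picked up a newer release, one option is to print the installed package versions. This is a minimal sketch using only the standard library; it assumes the fix ships in the `dask-expr` distribution (the issue was closed by a dask-expr pull request), and the helper name `installed_version` is hypothetical. It does not say which version contains the fix, since the thread does not state it.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


# Print the versions of the two distributions relevant to this issue
for pkg in ("dask", "dask-expr"):
    print(pkg, installed_version(pkg))
```

After upgrading, rerunning the example from the report is the most direct way to confirm the behavior.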
