Shuffle not raising exception when on does not exist #11174

Closed
fjetter opened this issue Jun 12, 2024 · 1 comment · Fixed by dask/dask-expr#1091
Labels
bug Something is broken

Comments

Member

fjetter commented Jun 12, 2024

Upstream issue #11160

from dask.datasets import timeseries
ddf = (
    timeseries()
    .shuffle(on='does_not_exist')
    .repartition(partition_size='100MB')
    .compute()
)

This should raise an exception at construction time, but instead it computes the dataframe and appears to return the full result.
Combined with other expressions, this can cause very confusing behavior. For example:

import dask.dataframe as dd
from dask.datasets import timeseries

# write the example dataset to disk first
timeseries().to_parquet("data_test.parquet")

# reading back again
ddf = (
    dd.read_parquet("data_test.parquet")
    .shuffle(on='does_not_exist')
    .repartition(partition_size='100MB')
    .compute()
)

This triggers the issue reported upstream, but only when repartition and read_parquet are used in combination with the erroneous shuffle.
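
For illustration, here is a rough sketch of the kind of fail-fast check that "raise at construction time" implies. It is not the actual dask or dask-expr fix: `check_shuffle_columns` is a hypothetical helper, the column list is just example data, and the sketch ignores cases where `on` refers to the index or is passed as a Series.

```python
from typing import Iterable, List, Sequence, Union


def check_shuffle_columns(
    columns: Iterable[str], on: Union[str, Sequence[str]]
) -> List[str]:
    """Hypothetical helper: fail fast if any shuffle key is not a known column."""
    # Normalize a single column name to a list of keys.
    keys = [on] if isinstance(on, str) else list(on)
    available = set(columns)
    missing = [key for key in keys if key not in available]
    if missing:
        raise KeyError(
            f"Cannot shuffle on {missing}: not found in columns {sorted(available)}"
        )
    return keys


# Example column names (assumed for illustration only).
columns = ["name", "id", "x", "y"]

check_shuffle_columns(columns, "id")  # passes, returns ["id"]

try:
    check_shuffle_columns(columns, "does_not_exist")
except KeyError as exc:
    print(f"Rejected before any computation: {exc}")
```

Doing the check against the frame's metadata means the error surfaces when `.shuffle(...)` is called, before `repartition` or `compute` can mask it.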

@fjetter fjetter added the bug Something is broken label Jun 12, 2024
@fjetter fjetter transferred this issue from dask/dask-expr Jun 12, 2024
Member Author

fjetter commented Jun 12, 2024

This is also faulty without dask-expr enabled. We get different exceptions depending on whether there is a parquet reader or not:

ddf = (
    # dd.read_parquet("data_test.parquet")
    timeseries()
    .shuffle(on='does_not_exist')
    .repartition(partition_size='100MB')
    .compute()
)
