Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make repartition a no-op when divisions match #9924

Merged
merged 1 commit into from Feb 9, 2023

Conversation

jrbourbeau
Copy link
Member

No need to actually repartition anything if the input divisions are already equal to the existing divisions

@j-bennet
Copy link
Contributor

j-bennet commented Feb 6, 2023

Does this close #9922?

@jrbourbeau
Copy link
Member Author

No, this is separate from that issue. Though that issue was what made me think about this use case.

@jrbourbeau
Copy link
Member Author

Will plan to merge this in a few hours if not further comment. I don't think the changes here are particularly controversial.

Copy link
Contributor

@j-bennet j-bennet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 🚀

@jrbourbeau jrbourbeau merged commit 0058050 into dask:main Feb 9, 2023
@jrbourbeau jrbourbeau deleted the repartition-fastpath branch February 9, 2023 23:10
@epizut
Copy link
Contributor

epizut commented Feb 10, 2023

No, this is separate from that issue. Though that issue was what made me think about this use case.

I am the author of #9922, my complaint was to avoid adding unnecessary repartition nodes when old and new divisions are equal. So to me, this PR solves my issue.

I mentioned force in the title, but it was not related to the force parameter of repartition()

@@ -7696,6 +7696,10 @@ def repartition(df, divisions=None, force=False):
>>> ddf = dd.repartition(df, [0, 5, 10, 20]) # doctest: +SKIP
"""

# no-op fastpath for when we already have matching divisions
if is_dask_collection(df) and df.divisions == divisions:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is failing if divisions is a list, it only works with tuples

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's an issue because if we concat a few Dask DataFrames with identical divisions then it will still repartition every dds:
dask.dataframe.multi.concat() calls align_partitions() that calls df.repartition() with a list divisions

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall I open a separate issue?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separate issue would be good, thanks @epizut

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants