Add optimized code paths for drop_duplicates #10542

Merged
rjzamora merged 6 commits into dask:main from shuffle-drop_duplicates on Oct 16, 2023

Conversation

@rjzamora (Member) commented on Oct 2, 2023

While working on a RAPIDS workflow requiring a high-cardinality drop_duplicates operation, I discovered that _Frame.drop_duplicates only supports the ACA (apply-concat-apply) algorithm, which is performant only when split_out == 1.

This PR introduces a shuffle argument for drop_duplicates. It also takes advantage of known divisions for Index.drop_duplicates: in the special case where we are calling Index.drop_duplicates and Index.known_divisions is True, no data needs to be shuffled at all.

NOTE: This PR does not address #10374, because such an algorithm would not be more performant for the particular workflow I have in mind. However, I do think there are real use cases that would benefit from that optimization as well.
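For reference, a minimal sketch of how the new keyword could be used (the example data, column names, and the "tasks" shuffle method are illustrative choices, not part of this PR):

```python
import pandas as pd
import dask.dataframe as dd

# Toy frame; in practice this would be a large, high-cardinality dataset.
pdf = pd.DataFrame({"key": [1, 2, 2, 3, 3, 3], "value": range(6)})
ddf = dd.from_pandas(pdf, npartitions=3)

# Existing path: ACA/tree reduction (performant mainly when split_out == 1).
deduped_aca = ddf.drop_duplicates(subset=["key"])

# New path added here: shuffle-based deduplication that keeps multiple
# output partitions when split_out > 1.
deduped_shuffle = ddf.drop_duplicates(subset=["key"], split_out=3, shuffle="tasks")

print(deduped_shuffle.compute())
```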

@mrocklin (Member)

In principle this approach makes sense to me, and I suspect that it will be more robust in general. If we can make shuffles perform robustly with small partition sizes, I'd be OK making this the default.

@rjzamora (Member, Author)

I'm planning to merge this PR later today if there are no more comments.

@rjzamora rjzamora merged commit b4bd120 into dask:main Oct 16, 2023
24 checks passed
@rjzamora rjzamora deleted the shuffle-drop_duplicates branch October 16, 2023 20:38
@mrocklin (Member)

@rjzamora can I ask you to say a bit more in this PR about what the defaults should be? Should we set the default to this and remove the repartition step? What experience would you recommend for users who don't know enough to find this algorithm?

@mrocklin (Member)

Or, asking for a bit more, if you think that this is the right solution, can I ask you to add it to dask-expr, not as another optional algorithm (I'm currently feeling kinda done with those) but as the full solution?

@rjzamora (Member, Author)

Good question @mrocklin. My current take is that it probably doesn't make sense to reduce down to a single partition by default. Therefore, it probably does make sense to use a shuffle-based approach with a split_out=True default (meaning we preserve the partition count).

Possible reasons to hesitate:

  • I only have anecdotal evidence to support the assumption that drop_duplicates rarely drops a large fraction of the data
  • The result of a shuffle-based approach will feel less consistent with the behavior of pandas, because the original ordering of rows will not be preserved. I doubt this matters to most large-data users, but the result of ddf.drop_duplicates().compute() may look surprising to some people (see the sketch below).
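A rough illustration of that ordering caveat (the exact output order depends on the hash partitioning, so this is only indicative):

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"key": [3, 1, 2, 1, 3], "value": [10, 11, 12, 13, 14]})

# pandas keeps the first occurrence of each key and preserves row order.
print(pdf.drop_duplicates(subset=["key"]))

# A shuffle-based split_out > 1 path groups rows by hash of the key, so the
# computed result is generally not in the original row order.
ddf = dd.from_pandas(pdf, npartitions=2)
print(ddf.drop_duplicates(subset=["key"], split_out=2).compute())
```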

> can I ask you to add it to dask-expr

Should work right now with ddf.drop_duplicates(split_out=True).compute() (dask/dask-expr#351).

> not as another optional algorithm (I'm currently feeling kinda done with those)

Yeah, dask-expr simplifies things a bit by just using a shuffle for split_out>1 and a tree reduction for split_out=1 for all "reductions".
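As a sketch, the dispatch rule described above looks something like this (illustrative only, not the actual dask-expr code):

```python
# Hypothetical helper mirroring the rule above: shuffle when split_out keeps
# multiple partitions, tree reduction when collapsing to a single partition.
def choose_reduction_algorithm(split_out):
    if split_out is True or split_out > 1:
        return "shuffle"
    return "tree-reduction"

assert choose_reduction_algorithm(1) == "tree-reduction"
assert choose_reduction_algorithm(8) == "shuffle"
assert choose_reduction_algorithm(True) == "shuffle"  # split_out=True preserves partition count
```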

@mrocklin (Member)

> I only have anecdotal evidence to support the assumption that drop_duplicates rarely drops a large fraction of the data

I know that @fjetter has been keen to get folks thinking about using benchmarks and A/B tests to help make these decisions. Maybe this is something you'd be interested in learning more about? Anyone on our team can help walk you through this process.

> Should work right now with ddf.drop_duplicates(split_out=True).compute() (dask/dask-expr#351).

Oh awesome, so I guess my question is really: should we swap the default?
