Add optimized code paths for drop_duplicates #10542

Merged
rjzamora merged 6 commits into dask:main from shuffle-drop_duplicates on Oct 16, 2023

Conversation

@rjzamora (Member) commented on Oct 2, 2023

While working on a RAPIDS workflow requiring a high-cardinality drop_duplicates operation, I discovered that _Frame.drop_duplicates only supports the ACA (apply-concat-apply) algorithm, which is performant only when split_out == 1.

This PR introduces a shuffle argument for drop_duplicates. It also takes advantage of known divisions for Index.drop_duplicates: in the special case where we are calling Index.drop_duplicates and Index.known_divisions is True, no data needs to be shuffled at all.

NOTE: This PR does not address #10374, because such an algorithm would not be more performant for the particular workflow I have in mind. However, I do think there are real use cases that would benefit from that optimization as well.
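For reference, a minimal sketch of how the new keyword could be used (the example data, column names, and the "tasks" shuffle method are illustrative choices, not part of this PR):

```python
import pandas as pd
import dask.dataframe as dd

# Toy frame; in practice this would be a large, high-cardinality dataset.
pdf = pd.DataFrame({"key": [1, 2, 2, 3, 3, 3], "value": range(6)})
ddf = dd.from_pandas(pdf, npartitions=3)

# Existing path: ACA/tree reduction (performant mainly when split_out == 1).
deduped_aca = ddf.drop_duplicates(subset=["key"])

# New path added here: shuffle-based deduplication that keeps multiple
# output partitions when split_out > 1.
deduped_shuffle = ddf.drop_duplicates(subset=["key"], split_out=3, shuffle="tasks")

print(deduped_shuffle.compute())
```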

@mrocklin (Member)

In principle this approach makes sense to me, and I suspect that it will be more robust in general. If we can make shuffles perform robustly with small partition sizes, I'd be OK making this the default.

@rjzamora (Member, Author)

I'm planning to merge this PR later today if there are no more comments.

@rjzamora rjzamora merged commit b4bd120 into dask:main Oct 16, 2023
24 checks passed
@rjzamora rjzamora deleted the shuffle-drop_duplicates branch October 16, 2023 20:38
@mrocklin (Member)

@rjzamora can I ask you to say a bit more in this PR about what the defaults should be? Should we set the default to this and remove the repartition step? What experience would you recommend for users who don't know enough to find this algorithm?

@mrocklin (Member)

Or, asking for a bit more, if you think that this is the right solution, can I ask you to add it to dask-expr, not as another optional algorithm (I'm currently feeling kinda done with those) but as the full solution?

@rjzamora (Member, Author)

Good question @mrocklin. My current take is that it probably doesn't make sense to reduce down to a single partition by default. Therefore, it probably does make sense to use a shuffle-based approach with a split_out=True default (meaning we preserve the partition count).

Possible reasons to hesitate:

  • I only have anecdotal evidence to support the assumption that drop_duplicates rarely drops a large fraction of the data
  • The result of a shuffle-based approach will feel less consistent with the behavior of pandas, because the original ordering of rows will not be preserved. I doubt this matters to most large-data users, but the result of ddf.drop_duplicates().compute() may look surprising to some people (see the sketch below).
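A rough illustration of that ordering caveat (the exact output order depends on the hash partitioning, so this is only indicative):

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"key": [3, 1, 2, 1, 3], "value": [10, 11, 12, 13, 14]})

# pandas keeps the first occurrence of each key and preserves row order.
print(pdf.drop_duplicates(subset=["key"]))

# A shuffle-based split_out > 1 path groups rows by hash of the key, so the
# computed result is generally not in the original row order.
ddf = dd.from_pandas(pdf, npartitions=2)
print(ddf.drop_duplicates(subset=["key"], split_out=2).compute())
```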

> can I ask you to add it to dask-expr

Should work right now with ddf.drop_duplicates(split_out=True).compute() (dask/dask-expr#351).

> not as another optional algorithm (I'm currently feeling kinda done with those)

Yeah, dask-expr simplifies things a bit by just using a shuffle for split_out>1 and a tree reduction for split_out=1 for all "reductions".
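As a sketch, the dispatch rule described above looks something like this (illustrative only, not the actual dask-expr code):

```python
# Hypothetical helper mirroring the rule above: shuffle when split_out keeps
# multiple partitions, tree reduction when collapsing to a single partition.
def choose_reduction_algorithm(split_out):
    if split_out is True or split_out > 1:
        return "shuffle"
    return "tree-reduction"

assert choose_reduction_algorithm(1) == "tree-reduction"
assert choose_reduction_algorithm(8) == "shuffle"
assert choose_reduction_algorithm(True) == "shuffle"  # split_out=True preserves partition count
```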

@mrocklin (Member)

> I only have anecdotal evidence to support the assumption that drop_duplicates rarely drops a large fraction of the data

I know that @fjetter has been keen to get folks thinking about using benchmarks and A/B tests to help make these decisions. Maybe this is something you'd be interested in learning more about? Anyone on our team can help walk you through this process.

> Should work right now with ddf.drop_duplicates(split_out=True).compute() (dask/dask-expr#351).

Oh awesome, so I guess my question is really: should we swap the default?
