Add optimized code paths for drop_duplicates
#10542
Conversation
In principle this approach makes sense to me, and I suspect that it will be generally more robust. If we can make shuffles more robustly performant for small partition sizes, I'd be OK making this the default.
I'm planning to merge this PR later today if there are no more comments.
@rjzamora can I ask you to say more in this PR about what the defaults should be? Should we set the default to this and remove the repartition step? What would you recommend users experience if they don't know enough to find this algorithm?
Or, asking for a bit more, if you think that this is the right solution, can I ask you to add it to dask-expr, not as another optional algorithm (I'm currently feeling kinda done with those) but as the full solution?
Good question @mrocklin - My current take is that it probably doesn't make sense to reduce down to a single partition by default. Therefore, it probably does make sense to use a shuffle-based approach. Possible reasons to hesitate:
Should work right now with
Yeah, dask-expr simplifies things a bit by just using a shuffle for
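For readers following the discussion, here is a minimal, hypothetical sketch of what a shuffle-based drop_duplicates looks like in general. This is not the implementation from this PR or from dask-expr; the helper name and toy data are made up. The idea is to shuffle rows on the deduplication key so that all duplicates land in the same partition, then deduplicate each partition locally.

```python
import pandas as pd
import dask.dataframe as dd


def shuffle_drop_duplicates(ddf, subset, npartitions):
    # Shuffle on the dedup key(s) so all rows sharing a key value
    # end up in the same output partition.
    shuffled = ddf.shuffle(on=subset, npartitions=npartitions)
    # Duplicates are now partition-local, so a per-partition
    # drop_duplicates yields a globally deduplicated result that
    # stays spread across `npartitions` partitions.
    return shuffled.map_partitions(
        lambda df: df.drop_duplicates(subset=subset)
    )


# Illustrative usage on toy data.
ddf = dd.from_pandas(
    pd.DataFrame({"key": [1, 2, 2, 3, 3, 3], "value": range(6)}),
    npartitions=3,
)
result = shuffle_drop_duplicates(ddf, subset=["key"], npartitions=3)
```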
I know that @fjetter has been keen to get folks thinking about using benchmarks and AB tests to help make these decisions. Maybe this is something you'd be interested in learning more about? Anyone on our team can help walk you through this process.
Oh awesome, so I guess my question is really "should we swap the default?"
While working on a RAPIDS workflow requiring a high-cardinality `drop_duplicates` operation, I discovered that `_Frame.drop_duplicates` only supports the ACA algorithm (which is only performant when `split_out == 1`).

This PR introduces the `shuffle` argument to `drop_duplicates`. It also takes advantage of known divisions for `Index.drop_duplicates`, and avoids shuffling any data at all in the special case that we are calling `Index.drop_duplicates` and `Index.known_divisions is True`.

NOTE: This PR does not address #10374, because such an algorithm would not be more performant for the particular workflow I have in mind. However, I do think there are real use cases that would benefit from that optimization as well.
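For concreteness, here is a hedged sketch of how the new code path might be exercised. The keyword values shown (`split_out=8`, `shuffle="tasks"`) are illustrative assumptions, not a statement of the final API; only the existence of the `shuffle` argument comes from this PR.

```python
import pandas as pd
import dask.dataframe as dd

# Toy stand-in for a high-cardinality deduplication key.
ddf = dd.from_pandas(
    pd.DataFrame({"key": range(1_000_000), "payload": 1}),
    npartitions=8,
)

# Existing default: ACA / tree reduction, which is only performant when
# the result collapses to a single partition (split_out == 1).
deduped_aca = ddf.drop_duplicates(subset=["key"])

# Shuffle-based path introduced by this PR (argument values are
# illustrative): keep the high-cardinality result spread across
# multiple partitions instead of reducing it onto one.
deduped_shuffled = ddf.drop_duplicates(
    subset=["key"],
    split_out=8,
    shuffle="tasks",
)
```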