Data duplication in P2P shuffling #7324
This seems reasonable to me. Personally, I probably wouldn't implement it until I thought it might actually be a problem (can this occur without much larger issues occurring?), but it generally seems sensible.
(In reply to Florian Jetter's issue text of Thu, Nov 17, 2022.)
The current P2P implementation is vulnerable to data duplication if the transfer tasks are ever executed twice.
We do not have any exactly-once execution guarantees and are not eager to implement any, see also #6378 for a brief discussion.
Particularly in the case of worker failures it may be possible for a transfer task to be executed (even successfully) without the scheduler even learning of it.
This specific case could easily be dealt with by restarting or failing an ongoing shuffle whenever input workers leave while it is running (right now, we only require a failure if output workers are leaving; alternatively, we could require input == output workers).
While I believe there are no other cases in which a task would be executed twice, I am not certain we can guarantee this forever.
Instead, I propose to deal with this by implementing data deduplication on the receiving side. Every shard can be uniquely identified by the tuple (input_partition_id, output_partition_id). The current implementation would allow easy deduplication by this key by keeping a record of the already received shards.
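The proposed receiving-side deduplication could be sketched roughly as follows. This is a minimal illustration, not distributed's actual shuffle code; the names (`ShardReceiver`, `receive`) and the shard payload type are assumptions for the example.

```python
# Hypothetical sketch of receiving-side shard deduplication.
# Each shard is identified by (input_partition_id, output_partition_id);
# a duplicate delivery (e.g. from a re-executed transfer task) is dropped.

class ShardReceiver:
    def __init__(self) -> None:
        self._seen: set[tuple[int, int]] = set()
        self.shards: dict[int, list[bytes]] = {}

    def receive(self, input_id: int, output_id: int, data: bytes) -> bool:
        """Store a shard; return False if it was already received."""
        key = (input_id, output_id)
        if key in self._seen:
            # Transfer task ran twice; discard the duplicate shard.
            return False
        self._seen.add(key)
        self.shards.setdefault(output_id, []).append(data)
        return True
```

The per-shard set lookup is the source of the transfer overhead mentioned below: every incoming shard pays one membership check, and batching schemes would need to carry the identifying tuple per shard rather than per batch.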
This might have a slightly negative impact on transfer performance and might limit our flexibility moving forward in terms of batching. I do not consider these limitations blockers.
cc @mrocklin, @hendrikmakait