Data duplication in P2P shuffling #7324
This seems reasonable to me. Personally, I probably wouldn't implement it until I thought it might actually be a problem (can this occur without much larger issues occurring?), but it generally seems sensible.
(In reply to Florian Jetter's issue text of Thu, Nov 17, 2022.)
The current P2P implementation is vulnerable to data duplication if the transfer tasks are ever executed twice.
We do not have any exactly-once execution guarantees and are not eager to implement any, see also #6378 for a brief discussion.
Particularly in the case of worker failures it may be possible for a transfer task to be executed (even successfully) without the scheduler even learning of it.
This specific case could easily be dealt with by restarting or failing an ongoing shuffle whenever input workers leave while it is running (right now, we only require a failure if output workers are leaving; alternatively, we could require input == output workers).
While I believe there are no other cases in which a task would be executed twice, I am not certain we can guarantee this forever.
Instead, I propose to deal with this by implementing data deduplication on the receiving side. Every shard can be uniquely identified by the tuple (input_partition_id, output_partition_id). The current implementation would allow easy deduplication by this key by keeping a record of the already received shards.
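The proposed receiving-side deduplication could be sketched roughly as follows. This is a minimal illustration, not distributed's actual shuffle code; the names (`ShardReceiver`, `receive`) and the shard payload type are assumptions for the example.

```python
# Hypothetical sketch of receiving-side shard deduplication.
# Each shard is identified by (input_partition_id, output_partition_id);
# a duplicate delivery (e.g. from a re-executed transfer task) is dropped.

class ShardReceiver:
    def __init__(self) -> None:
        self._seen: set[tuple[int, int]] = set()
        self.shards: dict[int, list[bytes]] = {}

    def receive(self, input_id: int, output_id: int, data: bytes) -> bool:
        """Store a shard; return False if it was already received."""
        key = (input_id, output_id)
        if key in self._seen:
            # Transfer task ran twice; discard the duplicate shard.
            return False
        self._seen.add(key)
        self.shards.setdefault(output_id, []).append(data)
        return True
```

The per-shard set lookup is the source of the transfer overhead mentioned below: every incoming shard pays one membership check, and batching schemes would need to carry the identifying tuple per shard rather than per batch.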
This might have a slightly negative impact on transfer performance and might limit our flexibility moving forward in terms of batching. I do not consider these limitations blockers.
cc @mrocklin, @hendrikmakait