
Data duplication in P2P shuffling #7324

Closed
fjetter opened this issue Nov 17, 2022 · 1 comment · Fixed by #7486
fjetter commented Nov 17, 2022

The current P2P implementation is vulnerable to data duplication if a transfer task is ever executed twice.

We do not have any exactly-once execution guarantees and are not eager to implement any; see #6378 for a brief discussion.

Particularly in the case of worker failures, it is possible for a transfer task to be executed (even successfully) without the scheduler ever learning of it.

This specific case could be handled easily by restarting or failing an ongoing shuffle whenever input workers leave while it is running (right now we only require a failure if output workers leave; alternatively, we could require that the input workers be the same as the output workers).

While I believe there are currently no other cases where a task would be executed twice, I am not certain we can guarantee this forever.

Instead, I propose to deal with this by implementing data deduplication on the receiving side. Every shard can be uniquely identified by the tuple (input_partition_id, output_partition_id), and the current implementation would allow easy deduplication by this key: the receiver simply keeps a record of the shards it has already received.
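As a rough illustration only (this is not the actual distributed.shuffle code; the class and method names below are made up), receiver-side deduplication keyed on that tuple could look something like this:

```python
from typing import Hashable


class DeduplicatingReceiver:
    """Sketch of a receive buffer that drops already-seen shards.

    Hypothetical stand-in for the real P2P receive path; it only shows
    the bookkeeping needed to make shard delivery idempotent.
    """

    def __init__(self) -> None:
        # Keys of shards we have already accepted
        self._received: set[tuple[Hashable, Hashable]] = set()
        # Accepted shards, keyed so duplicates are never stored twice
        self._shards: list[tuple[tuple[Hashable, Hashable], bytes]] = []

    def receive(
        self,
        input_partition_id: Hashable,
        output_partition_id: Hashable,
        payload: bytes,
    ) -> bool:
        """Store the shard; return False if it was a duplicate and was dropped."""
        key = (input_partition_id, output_partition_id)
        if key in self._received:
            # A transfer task was re-executed; ignore the second copy.
            return False
        self._received.add(key)
        self._shards.append((key, payload))
        return True


# Re-running the same transfer task sends the same shard twice,
# but only the first copy is kept.
receiver = DeduplicatingReceiver()
assert receiver.receive(0, 3, b"data") is True
assert receiver.receive(0, 3, b"data") is False
```

The extra cost is one set lookup and insertion per shard plus the memory for the key set, which is the source of the performance concern below.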

This might have a slightly negative impact on transfer performance and might limit our flexibility around batching going forward. I do not consider these limitations blockers.

cc @mrocklin, @hendrikmakait

mrocklin commented Nov 17, 2022 via email
