Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adaptive downscaling causes P2P to restart #8579

Closed
hendrikmakait opened this issue Mar 13, 2024 · 2 comments · Fixed by #8610
Closed

Adaptive downscaling causes P2P to restart #8579

hendrikmakait opened this issue Mar 13, 2024 · 2 comments · Fixed by #8610
Labels
adaptive All things relating to adaptive scaling enhancement Improve existing functionality or make things work better shuffle

Comments

@hendrikmakait
Copy link
Member

When adaptive scaling decides to retire a worker that currently participates in a P2P shuffle, it causes the entire shuffle to get restarted. As reported in https://dask.discourse.group/t/shuffle-p2p-unstable-with-adaptive-k8s-operator/2600, this isn't a great UX.

While I'm hesitant to suggest adding even more complexity to P2P, it might make sense to think about a mechanism for P2P to "block" workers from being retired. I'm not sure if a hard block is generally desirable let alone if we can find a mechanism that allows for loose coupling, so this issue is mostly a discussion starter for now.

@hendrikmakait hendrikmakait added enhancement Improve existing functionality or make things work better adaptive All things relating to adaptive scaling shuffle labels Mar 13, 2024
@fjetter
Copy link
Member

fjetter commented Mar 13, 2024

The mechanism for adaptive is to ask Scheduler.workers_to_close which looks at all kinds of metrics, mostly idleness. I would consider it reasonable to not flag workers idle if they are participating in a shuffle.
How to do this is a little tricky but if necessary, we'll just add a certain attribute that we set/reset with the extension. I think the complexity involved is OK. Especially since improperly setting/resetting this would no cause a catastrophic failure.

@hendrikmakait
Copy link
Member Author

The mechanism for adaptive is to ask Scheduler.workers_to_close which looks at all kinds of metrics, mostly idleness. I would consider it reasonable to not flag workers idle if they are participating in a shuffle.

Thanks, I hadn't dug too deeply into this yet. Yeah, I could see see something like Scheduler.pin(identifier, workers) and Scheduler.unpin(identifier (, workers)) that generally allows extensions to flag workers as pinned/required/whatever you want to call it such that adaptive will ignore pinned workers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
adaptive All things relating to adaptive scaling enhancement Improve existing functionality or make things work better shuffle
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants