Adaptive downscaling causes P2P to restart #8579

hendrikmakait · 2024-03-13T16:02:10Z

When adaptive scaling decides to retire a worker that currently participates in a P2P shuffle, it causes the entire shuffle to get restarted. As reported in https://dask.discourse.group/t/shuffle-p2p-unstable-with-adaptive-k8s-operator/2600, this isn't a great UX.

While I'm hesitant to suggest adding even more complexity to P2P, it might make sense to think about a mechanism for P2P to "block" workers from being retired. I'm not sure whether a hard block is generally desirable, let alone whether we can find a mechanism that allows for loose coupling, so this issue is mostly a discussion starter for now.

fjetter · 2024-03-13T16:41:14Z

The mechanism for adaptive is to ask Scheduler.workers_to_close which looks at all kinds of metrics, mostly idleness. I would consider it reasonable to not flag workers idle if they are participating in a shuffle.
How to do this is a little tricky, but if necessary, we'll just add a certain attribute that we set/reset with the extension. I think the complexity involved is OK, especially since improperly setting/resetting this would not cause a catastrophic failure.
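
To make the idea concrete, here is a minimal sketch of an idleness check that skips flagged workers. It is a toy model, not the actual `Scheduler.workers_to_close` logic: `ToyWorker`, the `in_shuffle` attribute, and `toy_workers_to_close` are hypothetical names invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class ToyWorker:
    # Hypothetical stand-in for the scheduler's per-worker state.
    address: str
    processing: int = 0       # number of tasks currently assigned
    in_shuffle: bool = False  # hypothetical flag set/reset by the shuffle extension


def toy_workers_to_close(workers, n):
    """Pick up to ``n`` idle workers, never one that is part of a shuffle."""
    candidates = [w for w in workers if w.processing == 0 and not w.in_shuffle]
    return [w.address for w in candidates[:n]]


workers = [
    ToyWorker("tcp://a", processing=0, in_shuffle=True),  # idle but shuffling
    ToyWorker("tcp://b", processing=0),                   # idle, safe to retire
    ToyWorker("tcp://c", processing=3),                   # busy
]
closable = toy_workers_to_close(workers, n=2)
print(closable)
```

If the flag is left set by mistake, the worst case in this model is that an idle worker lingers longer than needed, which matches the point that a bookkeeping error here would not be catastrophic.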

hendrikmakait · 2024-03-13T17:05:15Z

> The mechanism for adaptive is to ask Scheduler.workers_to_close which looks at all kinds of metrics, mostly idleness. I would consider it reasonable to not flag workers idle if they are participating in a shuffle.

Thanks, I hadn't dug too deeply into this yet. Yeah, I could see something like Scheduler.pin(identifier, workers) and Scheduler.unpin(identifier (, workers)) that generally allows extensions to flag workers as pinned/required/whatever you want to call it, such that adaptive will ignore pinned workers.
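
A rough sketch of what such bookkeeping could look like, kept separate from the scheduler on purpose. Nothing here exists in distributed: `PinRegistry`, `pinned`, and `filter_closable` are hypothetical, and only the `pin`/`unpin` shapes follow the signatures floated above.

```python
from collections import defaultdict


class PinRegistry:
    """Hypothetical bookkeeping for pin/unpin; not part of distributed."""

    def __init__(self):
        # identifier (e.g. a shuffle id) -> set of pinned worker addresses
        self._pins = defaultdict(set)

    def pin(self, identifier, workers):
        self._pins[identifier].update(workers)

    def unpin(self, identifier, workers=None):
        # Without ``workers``, drop every pin held under this identifier.
        if workers is None:
            self._pins.pop(identifier, None)
            return
        remaining = self._pins.get(identifier)
        if remaining is None:
            return
        remaining.difference_update(workers)
        if not remaining:
            del self._pins[identifier]

    def pinned(self):
        # A worker stays pinned as long as any identifier still holds it.
        result = set()
        for held in self._pins.values():
            result |= held
        return result

    def filter_closable(self, candidates):
        pinned = self.pinned()
        return [w for w in candidates if w not in pinned]


registry = PinRegistry()
registry.pin("shuffle-1", ["tcp://a", "tcp://b"])
registry.pin("shuffle-2", ["tcp://b"])
registry.unpin("shuffle-1")
closable = registry.filter_closable(["tcp://a", "tcp://b", "tcp://c"])
print(closable)
```

Keying pins by identifier means two extensions (or two shuffles) can pin the same worker independently, and releasing one pin does not free a worker that another still needs.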

hendrikmakait added the enhancement, adaptive, and shuffle labels Mar 13, 2024

hendrikmakait mentioned this issue Mar 29, 2024

Add a new, more careful option for downscaling dask/dask-kubernetes#877

Closed

fjetter mentioned this issue Apr 9, 2024

Ensure workers are not downscaled when participating in p2p #8610

Merged

fjetter closed this as completed in #8610 Apr 12, 2024
