AMM: Increase data transfer priorities for graceful worker retirement #7183

crusaderky opened this issue Oct 25, 2022 · 4 comments

@crusaderky (Collaborator)

Active Memory Manager (AMM) data transfers run at a hardcoded priority of 1:

# Transfer this data after all dependency tasks of computations with
# default or explicitly high (>0) user priority and before all
# computations with low priority (<0). Note that the priority= parameter
# of compute() is multiplied by -1 before it reaches TaskState.priority.
priority=(1,),

This means that if a network-heavy workload is running at the default priority 0 and saturates the network, AMM will yield and slow down so as not to hamper the workload.
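To make the ordering concrete, here is a minimal illustration; the one-element tuples below are simplified stand-ins for the real TaskState.priority tuples, which carry more components. Lower tuples sort first and are served first:

# Illustration only: simplified one-element stand-ins for TaskState.priority.
# Lower tuples sort first, i.e. they are scheduled/transferred earlier.
high_user_prio = (-5,)  # compute(..., priority=5), multiplied by -1 internally
default_task = (0,)     # compute() with the default priority=0
amm_transfer = (1,)     # hardcoded AMM replica transfer priority
low_user_prio = (5,)    # compute(..., priority=-5)

assert sorted([amm_transfer, default_task, low_user_prio, high_user_prio]) == [
    high_user_prio,  # served first
    default_task,
    amm_transfer,    # AMM yields to everything at default or higher user priority
    low_user_prio,   # only explicitly low-priority work queues behind AMM
]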

This is generally desirable for general-purpose rebalancing and replication. There are, however, two use cases where it risks being a poor idea:

  1. Graceful worker retirement (AMM RetireWorker) happens chiefly in two cases:
  • whenever a watchdog has intel that the worker is going to die soon. For example, on AWS you get a 2-minute warning when an instance is about to be forcefully shut down by Amazon. In this case, graceful retirement is time sensitive and should be prioritised over computations.
  • on an adaptive cluster, when the workload has dwindled to the point where it can't saturate the cluster anymore. In this case we should expect only modest data transfers from the computation, so it shouldn't hurt to raise AMM priority anyway.
  2. Graceful worker retirement tries to push all the unique data out of a worker at once, and it will hang indefinitely as soon as there's no more capacity anywhere else on the cluster, e.g. when the retirement pushes all surviving workers beyond 80% memory and they get paused. If a hard shutdown arrives after a certain time, this means losing any remaining data and having to recompute it. However, not all data on a worker is equal: task outputs can be recomputed somewhere else; scattered data can't, and losing it will cause all computations that rely on it to fall over (see the example below).
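For context, a minimal example of the difference between the two kinds of data, using only the public Client API (illustrative only):

import numpy as np
from distributed import Client

client = Client()  # assumes a local or already-running cluster

# Task output: the scheduler knows the recipe, so if the worker holding the
# result dies, the key can be recomputed somewhere else.
task_output = client.submit(np.ones, 1_000_000)

# Scattered data: the cluster has no recipe for it. If the last replica is
# lost during retirement, every computation depending on it falls over.
[scattered] = client.scatter([np.ones(1_000_000)])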

Proposed design

  • In the AMM framework, offer policies a hook to specify a priority for their replicate suggestions, defaulting to 1 if omitted
  • Add a {key: priority} attribute to AcquireReplicasEvent
  • AMM RetireWorker should replicate with priority -2 for scattered data and -1 for all other data (both higher priority than default compute() calls); see the sketch below
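A minimal sketch of what the hook could look like; the method and attribute names below are illustrative stand-ins, not the actual distributed.active_memory_manager API:

# Hypothetical sketch of the proposed hook; names are illustrative.
class ActiveMemoryManagerPolicy:
    def replica_priority(self, key: str) -> int:
        # Default: today's behaviour, i.e. yield to default-priority compute tasks
        return 1

class RetireWorker(ActiveMemoryManagerPolicy):
    def __init__(self, scattered_keys: set):
        self.scattered_keys = scattered_keys

    def replica_priority(self, key: str) -> int:
        # Scattered data can't be recomputed elsewhere: move it first (-2).
        # Everything else still jumps ahead of default compute() tasks (-1).
        return -2 if key in self.scattered_keys else -1

# The AMM would then attach a {key: priority} mapping to its acquire-replicas
# request, so that AcquireReplicasEvent can schedule fetches accordingly.

Keeping the default at 1 preserves the current behaviour for general-purpose policies such as ReduceReplicas, so only retirement-critical transfers jump the queue.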
@crusaderky self-assigned this on Oct 25, 2022
@crusaderky changed the title from "Increase data transfer priorities for graceful worker retirement" to "AMM: Increase data transfer priorities for graceful worker retirement" on Oct 25, 2022
@crusaderky (Collaborator, Author)

The easiest way to make priority matter due to "network saturation" (quotes are in order) is to retire a worker in the middle of a spill-heavy computation, since the time the peer workers spend spilling data is time during which the transfers are still in flight. This would be mitigated by #4424, but not solved, because that same change would cause peer workers to hit the pause threshold much faster.

@fjetter (Member) commented Oct 25, 2022

AMM RetireWorker should replicate with priority -2 for scattered data and -1 for all other data (both are higher than default compute() calls)

+1

The easiest way to cause priority to matter due to "network saturation" (quotes are in order) is to retire a worker in the middle of a spill-heavy computation

Why would it not matter in other circumstances? A very network-heavy workload (e.g. a shuffle) would also block all network traffic even without spilling, wouldn't it?

@crusaderky (Collaborator, Author)

The easiest way to cause priority to matter due to "network saturation" (quotes are in order) is to retire a worker in the middle of a spill-heavy computation

Why would it not matter in other circumstances? A very network heavy workload (e.g. a shuffle) would also block all network even w/out spilling, wouldn't it?

It does matter in other circumstances too; with spill/unspill it's just easier to build a fetch queue that is several minutes long.

@crusaderky (Collaborator, Author)

Worth noting that we just merged a PR (#7167) that exports the length of the fetch queue to Prometheus. We should start monitoring it: whenever we observe the fetch queue becoming very large, that's a situation where, today, AMM RetireWorker would lag behind.
