[Idea] Could workers sometimes know when to release keys on their own? #5114

gjoseph92 opened this issue Jul 24, 2021 · 6 comments

@gjoseph92
Collaborator

In #5083 (comment) I wrote up a theory for how high scheduler load can lead to workers running out of memory: the scheduler is slow to send them free-keys messages, so otherwise-releasable data piles up. Is there a way to take the scheduler out of the critical path for workers releasing memory? (This idea probably overlaps a lot with / is a subset of #4982 and #3974. Also bear in mind that this theory is completely unproven and just something I made up.)

Could we somehow mark tasks as "safe to release", so workers know that when they've completed all the dependents of a task locally, they can release that task, since no other worker (or client) will need the data?

We can't say this at submission time, since nothing has been scheduled yet. (Though tasks with only 1 dependent we could probably eagerly mark as releasable.) But maybe when we assign a task to a worker, we could also look through its immediate dependencies: any of those that are already assigned to that worker, whose dependents are all scheduled on that same worker (none on other workers or still unscheduled), and which no client has requested, could be marked as releasable.
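To make that check concrete, here is a rough, purely illustrative sketch of the scheduler-side logic; the `TaskInfo` structure and the `releasable_dependencies` helper are made-up stand-ins, not actual distributed internals:

```python
# Hypothetical stand-ins for the scheduler's per-task bookkeeping; none of
# these names are real distributed APIs.
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class TaskInfo:
    key: str
    worker: Optional[str] = None        # worker this task is assigned to, if any
    dependents: Set[str] = field(default_factory=set)
    wanted_by_client: bool = False      # a client has asked for this key directly


def releasable_dependencies(
    dependencies: List[TaskInfo], tasks: Dict[str, TaskInfo], worker: str
) -> List[str]:
    """Dependencies whose data ``worker`` may free once their dependents finish there.

    ``tasks`` is assumed to hold an entry for every known key.
    """
    releasable = []
    for dep in dependencies:
        if dep.wanted_by_client:
            continue  # a client still needs this data
        if dep.worker != worker:
            continue  # the data lives (or will live) on a different worker
        # Every dependent must already be scheduled on this same worker; an
        # unscheduled dependent (worker is None) might later land elsewhere.
        if all(tasks[d].worker == worker for d in dep.dependents):
            releasable.append(dep.key)
    return releasable
```

The resulting keys could then piggyback on the compute-task message the scheduler is already sending to that worker, so no extra round trip would be needed.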

This could have a nice balanced-budget property, where in many cases the scheduler couldn't hand out new tasks to workers without also giving them some tasks to release (in the future).

cc @fjetter @crusaderky @mrocklin

@mrocklin
Member

In general I like this idea, though it might be hard from a consistency perspective. It's usually nicer if the scheduler marks a task as gone before the task actually goes away, rather than the other way around; we would want the scheduler to know that this task is no longer reliably present. This might add a bit of complexity to the scheduler's state machine.

Alternatively we go ahead and speculatively delete data. In the common case it's fine and in the uncommon case we accept that we'll have to recompute.

@fjetter
Member

fjetter commented Jul 26, 2021

Similar behaviour to what you are describing was one of the causes of the deadlocks in recent months. Doing this consistently is very difficult.

I think AMM (#4982) will already remove most of the problems motivating this, since AMM could remove replicas from most workers while only a few are still using them. The delayed deletion of data on those few workers should not destabilize an entire cluster.

FWIW, I believe we could implement something like this on the worker side for the few instances where the worker has complete information (e.g. it has all dependents of a task in memory), but I'm not sure whether this is a very common case.

I would suggest holding off until AMM is somewhat operational and then trying to estimate whether this still looks like a problem.

@gjoseph92
Collaborator Author

Would it help to address consistency by making it so once the scheduler has set this worker_releasable flag to True (or maybe this is even a transition state?), it treats the key as though it doesn't exist any more, and all responsibility for that key's lifecycle is handed off to the worker? So rather than trying to maintain consistency between the scheduler and worker (is this key gone yet?), we basically eliminate the state that needs to be made consistent. The only reason we can do that is because we're very strict about when keys can enter this worker_releasable state: we can guarantee no other worker or client will need them in the future, so why should the scheduler care anymore?

the few instances where the worker has the complete information (e.g. it has all dependents of a task in memory)

I think this is exactly the case I'm talking about. If we expand it a tiny bit from "has all dependents of a task in memory" to "has—or is about to have with this message—all dependents of a task in memory", then I don't think it's that uncommon. Our goal with the root task colocation is basically to make this happen as often as possible—we hope that all the dependents of a task get scheduled on the same worker where that task already lives.

@fjetter
Member

fjetter commented Jul 27, 2021

Would it help to address consistency by making it so once the scheduler has set this worker_releasable flag to True (or maybe this is even a transition state?),

Not necessarily. The problem is that the worker state machine is currently rather fragile, and minor unexpected disturbances can throw it off balance. For instance, one of the deadlocks I fixed recently happened because the scheduler thought the worker had already released a key and issued a compute-task assignment to the same worker (rescheduling a task). The worker still had residual state for that task, and the compute-task signal corrupted the state because the transition was not handled properly. I know my argument is based on "our code is not stable / not tested enough / etc." but this is unfortunately the reality. I am working hard to make the world a better place :)

I think this is exactly the case I'm talking about. If we expand it a tiny bit from "has all dependents of a task in memory"

How about we start with this case and iterate from there? I can see this being implemented rather easily without much potential for inconsistencies. My proposal would be to not introduce any additional signals from the scheduler, but rather allow the worker to short-circuit a decision if it is 100% sure that the scheduler would make the same decision. The only way I see this happening is if the worker has all the dependents in memory itself. From there we can iterate and see if an additional signal would even be worth the complexity. If we save 1% of memory I might be inclined to say it's not worth the trouble; if it's 50%, that's a no-brainer. We're likely somewhere in between, but I don't have a good feeling for how much impact this actually would have.
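A minimal sketch of that worker-side short circuit, using made-up names rather than the actual Worker state machine; note it also assumes the worker somehow knows its list of dependents for the key is complete, which is the hard part:

```python
from typing import Dict, Set


def can_release_locally(
    key: str,
    dependents: Dict[str, Set[str]],  # key -> dependent keys this worker knows about
    in_memory: Set[str],              # keys this worker currently holds in memory
    wanted_by_client: Set[str],       # keys a client has requested directly
) -> bool:
    """Return True if the worker can forget ``key`` without waiting for the scheduler."""
    if key in wanted_by_client:
        return False
    deps = dependents.get(key, set())
    # Only safe when every known dependent has already finished on this worker;
    # anything else defers to the scheduler's usual free-keys message.
    return bool(deps) and deps <= in_memory
```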

I would propose holding off until I have finished #5046. Then this should become easier, and more importantly, if something goes wrong we should have an easier time debugging than we currently do.

@abergou

abergou commented Jul 27, 2021

@fjetter wouldn't there be a potential race condition even in your example? Imagine that a worker is in the state where it "has all dependents of a task in memory" and thinks it can erase the task, but additional tasks that are dependents of said task are then submitted to the scheduler. This could cause a potential problem, no?

@fjetter
Member

fjetter commented Jul 27, 2021

Imagine that a worker is in the state where it "has all dependents of a task in memory" and thinks it can erase the task, but additional tasks that are dependents of said task are then submitted to the scheduler. This could cause a potential problem, no?

Well, some race condition is unavoidable, but the big question is whether or not we end up in a corrupt state. The worker would only be allowed to forget a key if it also tells the scheduler, so that the key can be rescheduled. Even if this information weren't sent to the scheduler, it would trigger a "missing-key" event chain and we'd self-heal. Avoiding this kind of rescheduling is only possible if we do not allow the worker to make any decision (as is the case right now). The question here is which scenario is more common and how big the impact of this "optimistic release" would be. Either way, before baking something like this in, we'd need a few good benchmarks. If the numbers are not convincing, I'm inclined not to merge something like this, in favour of reduced complexity, as discussed above.
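For illustration only, a hypothetical sketch of that flow; the op name and both helpers are invented and do not reflect the real worker/scheduler protocol:

```python
from typing import Callable, Dict, Set


def worker_optimistic_release(
    key: str, data: Dict[str, object], send_to_scheduler: Callable[[dict], None]
) -> None:
    """Worker side: drop the local copy and tell the scheduler it is gone."""
    data.pop(key, None)
    send_to_scheduler({"op": "key-optimistically-released", "key": key})


def scheduler_on_optimistic_release(
    key: str, worker: str, who_has: Dict[str, Set[str]]
) -> None:
    """Scheduler side: stop tracking this replica.  If a later task needs the
    key and no worker holds it, the normal missing-data path recomputes it."""
    who_has.get(key, set()).discard(worker)
```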
