[Idea] Could workers sometimes know when to release keys on their own? #5114

gjoseph92 opened this issue Jul 24, 2021 · 6 comments

@gjoseph92
Collaborator

In #5083 (comment) I wrote up a theory for how high scheduler load can lead to workers running out of memory: the scheduler is slow to send them free-keys messages, so otherwise-releasable data piles up. Is there a way to take the scheduler out of the critical path for workers releasing memory? (This idea probably overlaps a lot with / is a subset of #4982 and #3974. Also bear in mind that this theory is completely unproven and just something I made up.)

Could we somehow mark tasks as "safe to release", so workers know that when they've completed all the dependents of a task locally, they can release that task, since no other worker (or client) will need the data?

We can't say this at submission time, since nothing has been scheduled yet. (Though tasks with only 1 dependent we could probably eagerly mark as releasable.) But maybe when we assign a task to a worker, we could also look through its immediate dependencies: any of those that are already assigned to that worker, whose dependents are all scheduled on that same worker (none on other workers or still unscheduled), and which no client has requested, could be marked as releasable.
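To make that check concrete, here is a rough, purely illustrative sketch of the scheduler-side logic; the `TaskInfo` structure and the `releasable_dependencies` helper are made-up stand-ins, not actual distributed internals:

```python
# Hypothetical stand-ins for the scheduler's per-task bookkeeping; none of
# these names are real distributed APIs.
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class TaskInfo:
    key: str
    worker: Optional[str] = None        # worker this task is assigned to, if any
    dependents: Set[str] = field(default_factory=set)
    wanted_by_client: bool = False      # a client has asked for this key directly


def releasable_dependencies(
    dependencies: List[TaskInfo], tasks: Dict[str, TaskInfo], worker: str
) -> List[str]:
    """Dependencies whose data ``worker`` may free once their dependents finish there.

    ``tasks`` is assumed to hold an entry for every known key.
    """
    releasable = []
    for dep in dependencies:
        if dep.wanted_by_client:
            continue  # a client still needs this data
        if dep.worker != worker:
            continue  # the data lives (or will live) on a different worker
        # Every dependent must already be scheduled on this same worker; an
        # unscheduled dependent (worker is None) might later land elsewhere.
        if all(tasks[d].worker == worker for d in dep.dependents):
            releasable.append(dep.key)
    return releasable
```

The resulting keys could then piggyback on the compute-task message the scheduler is already sending to that worker, so no extra round trip would be needed.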

This could have a nice balanced-budget property, where in many cases the scheduler couldn't hand out new tasks to workers without also giving them some tasks to release (in the future).

cc @fjetter @crusaderky @mrocklin

@mrocklin
Member

In general I like this idea, though it might be hard from a consistency perspective. It's usually nicer if the scheduler marks a task as gone before the task actually goes away, rather than the other way around; we would want the scheduler to know that this task is no longer reliably present. This might add a bit of complexity to the scheduler's state machine.

Alternatively we go ahead and speculatively delete data. In the common case it's fine and in the uncommon case we accept that we'll have to recompute.

@fjetter
Member

fjetter commented Jul 26, 2021

Similar behaviour to what you are describing was one of the causes of the deadlocks in recent months. Doing this consistently is very difficult.

I think AMM (#4982) will already remove most of the problems motivating this, since AMM could remove replicas from most workers while only a few are still using them. The delayed deletion of data on those few workers should not destabilize an entire cluster.

FWIW, I believe we could implement something like this on the worker side for the few instances where the worker has complete information (e.g. it has all dependents of a task in memory), but I'm not sure whether this is a very common case.

I would suggest holding off until AMM is somewhat operational and then trying to estimate whether this still looks like a problem.

@gjoseph92
Collaborator Author

Would it help to address consistency by making it so once the scheduler has set this worker_releasable flag to True (or maybe this is even a transition state?), it treats the key as though it doesn't exist any more, and all responsibility for that key's lifecycle is handed off to the worker? So rather than trying to maintain consistency between the scheduler and worker (is this key gone yet?), we basically eliminate the state that needs to be made consistent. The only reason we can do that is because we're very strict about when keys can enter this worker_releasable state: we can guarantee no other worker or client will need them in the future, so why should the scheduler care anymore?

the few instances where the worker has the complete information (e.g. it has all dependents of a task in memory)

I think this is exactly the case I'm talking about. If we expand it a tiny bit from "has all dependents of a task in memory" to "has—or is about to have with this message—all dependents of a task in memory", then I don't think it's that uncommon. Our goal with the root task colocation is basically to make this happen as often as possible—we hope that all the dependents of a task get scheduled on the same worker where that task already lives.

@fjetter
Member

fjetter commented Jul 27, 2021

Would it help to address consistency by making it so once the scheduler has set this worker_releasable flag to True (or maybe this is even a transition state?),

Not necessarily. The problem is that the worker state machine is currently rather fragile, and minor unexpected disturbances can throw it off balance. For instance, one of the deadlocks I fixed recently happened because the scheduler thought the worker had already released a key and issued a compute-task assignment to the same worker (rescheduling a task). The worker still had residual state for that task, and the compute-task signal corrupted the state because the transition was not handled properly. I know my argument is based on "our code is not stable / not tested enough / etc." but this is unfortunately the reality. I am working hard to make the world a better place :)

I think this is exactly the case I'm talking about. If we expand it a tiny bit from "has all dependents of a task in memory"

How about we start with this case and iterate from there? I can see this being implemented rather easily without much potential for inconsistencies. My proposal would be to not introduce any additional signals from the scheduler, but rather allow the worker to short-circuit a decision if it is 100% sure that the scheduler would make the same decision. The only way I see this happening is if the worker has all the dependents in memory itself. From there we can iterate and see if an additional signal would even be worth the complexity. If we save 1% of memory I might be inclined to say it's not worth the trouble; if it's 50%, that's a no-brainer. We're likely somewhere in between, but I don't have a good feeling for how much impact this actually would have.
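A minimal sketch of that worker-side short circuit, using made-up names rather than the actual Worker state machine; note it also assumes the worker somehow knows its list of dependents for the key is complete, which is the hard part:

```python
from typing import Dict, Set


def can_release_locally(
    key: str,
    dependents: Dict[str, Set[str]],  # key -> dependent keys this worker knows about
    in_memory: Set[str],              # keys this worker currently holds in memory
    wanted_by_client: Set[str],       # keys a client has requested directly
) -> bool:
    """Return True if the worker can forget ``key`` without waiting for the scheduler."""
    if key in wanted_by_client:
        return False
    deps = dependents.get(key, set())
    # Only safe when every known dependent has already finished on this worker;
    # anything else defers to the scheduler's usual free-keys message.
    return bool(deps) and deps <= in_memory
```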

I would propose holding off until I have finished #5046. Then this should become easier, and more importantly, if something goes wrong we should have an easier time debugging than we currently do.

@abergou

abergou commented Jul 27, 2021

@fjetter wouldn't there be a potential race condition even in your example? Imagine that a worker is in the state where it "has all dependents of a task in memory" and thinks it can erase the task, but additional tasks that are dependents of said task are then submitted to the scheduler. This could cause a potential problem, no?

@fjetter
Member

fjetter commented Jul 27, 2021

Imagine that a worker is in the state where it "has all dependents of a task in memory" and thinks it can erase the task, but additional tasks that are dependents of said task are then submitted to the scheduler. This could cause a potential problem, no?

Well, some race condition is unavoidable, but the big question is whether or not we end up in a corrupt state. The worker would only be allowed to forget a key if it also tells the scheduler, so that the key can be rescheduled. Even if this information weren't sent to the scheduler, it would trigger a "missing-key" event chain and we'd self-heal. Avoiding this kind of rescheduling is only possible if we do not allow the worker to make any decision (as is the case right now). The question here is which scenario is more common and how big the impact of this "optimistic release" would be. Either way, before baking something like this in, we'd need a few good benchmarks. If the numbers are not convincing, I'm inclined not to merge something like this, in favour of reduced complexity, as discussed above.
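For illustration only, a hypothetical sketch of that flow; the op name and both helpers are invented and do not reflect the real worker/scheduler protocol:

```python
from typing import Callable, Dict, Set


def worker_optimistic_release(
    key: str, data: Dict[str, object], send_to_scheduler: Callable[[dict], None]
) -> None:
    """Worker side: drop the local copy and tell the scheduler it is gone."""
    data.pop(key, None)
    send_to_scheduler({"op": "key-optimistically-released", "key": key})


def scheduler_on_optimistic_release(
    key: str, worker: str, who_has: Dict[str, Set[str]]
) -> None:
    """Scheduler side: stop tracking this replica.  If a later task needs the
    key and no worker holds it, the normal missing-data path recomputes it."""
    who_has.get(key, set()).discard(worker)
```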
