
Add lock to scheduler for sensitive operations #3259

Merged
merged 7 commits into dask:master from scheduler-retire-lock on Dec 9, 2019

Conversation

mrocklin
Member

Some operations like retiring workers or rebalancing data shouldn't
happen concurrently. Here we add an asynchronous lock around these
operations in order to protect them from each other.

This doesn't yet have any tests.

If only we had reentrant locks.
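As a rough sketch of the idea (not the actual dask/distributed code; the method bodies are placeholders), a single `asyncio.Lock` on the scheduler serializes these coroutines. Since `asyncio.Lock` is not reentrant (hence the remark above), the guarded methods must not call each other while holding it:

```python
import asyncio

class Scheduler:
    """Minimal sketch: one asyncio.Lock serializes sensitive operations."""

    def __init__(self):
        self._lock = asyncio.Lock()  # not reentrant

    async def retire_workers(self):
        async with self._lock:
            await asyncio.sleep(0.1)  # stand-in for moving data off workers

    async def rebalance(self):
        async with self._lock:
            await asyncio.sleep(0.1)  # stand-in for shuffling data around

async def main():
    s = Scheduler()
    # Both coroutines start concurrently, but the lock runs them one at a time.
    await asyncio.gather(s.retire_workers(), s.rebalance())

asyncio.run(main())
```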
@mrocklin
Member Author

cc @jcrist if you have a moment

Contributor

@StephanErb left a comment


This PR looks helpful. Thanks for bringing it forward. I have three remarks and a small code question:

  1. Looking at it, I am wondering if there is potential for deadlocks: multiple scheduler methods share the same lock and then make calls to worker instances. If one of the workers decides to call back into the scheduler, we could be in trouble (e.g., because it decides to trigger a graceful shutdown; see the sketch below for an illustration).
  2. Have you considered how long the lock would be held? I fear that if substantial data is transferred between workers, then the now-sequential shutdown of workers may stall, causing follow-on issues (e.g. timeouts in container/cluster schedulers, lost data due to stalled transfers).
  3. Do you see a risk from data that is still bouncing between workers that will soon be shut down? I am not sure how much of a problem that will be in practice.

With an approach such as #3248 this would not be necessary, but that introduces additional async/background behaviour that might also be complex to reason about in practice (e.g., how does a worker know that its data has finally been moved elsewhere?).
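A minimal standalone sketch of the deadlock scenario from point 1 (not Dask code; `retire_worker` and `worker_callback` are hypothetical names): because `asyncio.Lock` is not reentrant, a lock-holding operation that awaits a worker which calls back into another lock-protected method hangs.

```python
import asyncio

async def worker_callback(lock):
    # The worker calls back into another lock-protected scheduler method...
    async with lock:  # ...which blocks forever: the lock is already held
        pass

async def retire_worker(lock):
    async with lock:                 # the scheduler operation holds the lock
        await worker_callback(lock)  # and awaits the worker while holding it

async def main():
    lock = asyncio.Lock()
    try:
        await asyncio.wait_for(retire_worker(lock), timeout=1)
    except asyncio.TimeoutError:
        print("deadlocked: asyncio.Lock is not reentrant")

asyncio.run(main())
```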

```python
else:
    workers = set(self.workers.values())
workers_by_task = {ts: ts.who_has for ts in tasks}
async with self._lock:
```
Contributor

I am new to async so please excuse my ignorance: the documentation (https://docs.python.org/3.6/library/asyncio-sync.html) proposes using this snippet instead:

```python
lock = Lock()
...
with (yield from lock):
    ...
```

Is there a functional difference that made you prefer one over the other?

Member Author

`async with` is newer syntax that wasn't available until Python 3.5, I think. It is usable within `async def` functions, while the `yield from` syntax is usable within `@asyncio.coroutine`-decorated functions. In general, the Dask codebase switched from coroutines to `async def` functions a while ago, so we tend to prefer that syntax.
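For reference, a minimal sketch of both spellings (the generator-based form from the 3.6 docs was deprecated in Python 3.7 and later removed, so only the `async with` form works on current Python):

```python
import asyncio

async def main():
    lock = asyncio.Lock()

    # Modern form (Python 3.5+): only valid inside `async def` functions.
    async with lock:
        ...  # critical section

    # Older generator-based form from the 3.6 docs, used inside
    # @asyncio.coroutine functions; deprecated in 3.7 and later removed:
    #
    # @asyncio.coroutine
    # def guarded_old_style():
    #     with (yield from lock):
    #         ...  # critical section

asyncio.run(main())
```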

@mrocklin
Member Author

Looking at it, I am wondering if there is potential for deadlocks: multiple scheduler methods share the same lock and then make calls to worker instances. If one of the workers decides to call back into the scheduler, we could be in trouble (e.g., because it decides to trigger a graceful shutdown).

Yes, in fact this is the reason for the second commit, which fixes some deadlocks that were exposed by our tests. I hope that the tests caught everything, but I don't know for certain.

Have you considered how long the lock would be held? I fear that if substantial data is transferred between workers, then the now-sequential shutdown of workers may stall, causing follow-on issues (e.g. timeouts in container/cluster schedulers, lost data due to stalled transfers).

Potentially a while, since some of these operations are long-running. However, these operations aren't really designed to work well with each other, so I think serializing them is sensible. If we want to improve speed here then I think we need to re-engineer things more fundamentally.

With an approach such as #3248 this would not be necessary, but that introduces additional async/background behaviour that might also be complex to reason about in practice (e.g., how does a worker know that its data has finally been moved elsewhere?).

That, and #3248 also seems narrower in what it solves. The complexity-to-solution ratio there seems higher than I'd like, personally.

@StephanErb
Contributor

I have given this branch a trial run in a scenario where we use graceful downscaling. I observed neither crashes nor deadlocks. Of course this doesn't prove anything, but it is at least some evidence.

@mrocklin mrocklin merged commit e591f32 into dask:master Dec 9, 2019
@mrocklin mrocklin deleted the scheduler-retire-lock branch December 9, 2019 16:25