
Idempotent semaphore acquire with retries #3690

Merged
fjetter merged 1 commit into dask:master on Apr 16, 2020

Conversation

@fjetter (Member) commented Apr 9, 2020

This adds a lot of logging information to the semaphore but most importantly refactors the internal structures to make the acquire call idempotent.

This removes the need to use the Client itself as a lifetime anchor for the leases and replaces that scheme with an explicit refresh of the leases and lease-specific timeouts.

This gives the following benefits:

  • Each acquire/lease gets a unique ID, which can be used to trace the behaviour in the logs when debugging
  • The acquire requests themselves are idempotent, i.e. they can simply be retried after connection failures
  • The requests are simpler since we no longer need to handle the client ID, which was somewhat redundant in the original implementation anyway
  • If any weird "proxy did not get ACK" issues arise (e.g. the scheduler registered the lease but the OK never reached the client), the system will eventually self-heal instead of deadlocking
  • The internal structure is subjectively simpler since we have only one dict to maintain, which now holds timestamps

at the cost of additional complexity:

  • The client will spawn a new periodic callback to refresh the leases. This replaces the implicit coupling to the client heartbeat. I think this is complexity worth taking on since it decouples the implementation a bit from assumptions about the lifetime management of the client (a rough sketch of such a refresh loop is shown after this list).
  • Another configuration option (I would argue this makes it more transparent to the users, though)
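
A rough sketch of such a dedicated refresh loop, assuming tornado's PeriodicCallback and using hypothetical method and RPC names (this is not the actual semaphore implementation):

```python
# Illustrative sketch only -- `_refresh_leases` and `semaphore_refresh_leases`
# are hypothetical names, not the distributed.semaphore API.
from tornado.ioloop import PeriodicCallback


class _LeaseRefresher:
    def __init__(self, scheduler_rpc, name, refresh_interval_ms=2000):
        self.scheduler = scheduler_rpc   # comm handle to the scheduler
        self.name = name                 # semaphore name
        self.lease_ids = set()           # leases currently held locally
        # Periodically tell the scheduler that our leases are still alive,
        # instead of piggybacking on the client heartbeat.
        self._pc = PeriodicCallback(self._refresh_leases, refresh_interval_ms)
        self._pc.start()

    async def _refresh_leases(self):
        if self.lease_ids:
            await self.scheduler.semaphore_refresh_leases(
                name=self.name, lease_ids=list(self.lease_ids)
            )
```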

I sneaked in another change regarding the configurability of the retry options. I will open another PR for it, but I needed it here for the tests to succeed.
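
To make the idempotency idea concrete, here is an illustrative sketch of the scheduler-side bookkeeping described above; the class and method names are invented for the example and do not mirror the actual distributed.semaphore code:

```python
import time
import uuid


class _SemaphoreState:
    """Illustrative only: one dict per semaphore, keyed by lease ID."""

    def __init__(self, max_leases):
        self.max_leases = max_leases
        self.leases = {}  # lease_id -> timestamp of the last refresh

    def acquire(self, lease_id):
        """Idempotent: retrying with the same lease_id never double-books."""
        now = time.time()
        if lease_id in self.leases:
            # Retry of a request whose ACK was lost: just bump the timestamp.
            self.leases[lease_id] = now
            return True
        if len(self.leases) < self.max_leases:
            self.leases[lease_id] = now
            return True
        return False

    def release(self, lease_id):
        # Releasing twice is harmless as well.
        self.leases.pop(lease_id, None)


# Client side: one unique ID per acquire, safe to resend after comm failures.
lease_id = uuid.uuid4().hex
```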

@martindurant (Member) commented

@fjetter , thanks for submitting this!

@quasiben @jakirkham , any chance for a review here?

@quasiben (Member) commented

Looks like we need that PR (or to add the changes here) for the retry_count config option

@marco-neumann-by (Contributor) commented

So if I understand this correctly, the lease will now be periodically refreshed by the holding worker. How does this interact with:

  • long running user payloads? (at which point do they NOT share the same thread anymore)
  • GIL blocking?

@fjetter (Member, Author) commented Apr 14, 2020

So if I understand this correctly, the lease will now be periodically refreshed by the holding worker. How does this interact with:

long-running user payloads? (at which point do they NOT share the same thread anymore)
GIL blocking?

We will face the same issues as with the old implementation. In the old implementation everything was coupled to the heartbeat of the client. Instead, we'll now have a dedicated semaphore heartbeat/refresh.

long running user payloads

User payloads are always in another thread and don't impact the event loop of the worker unless...

GIL blocking?

If the GIL is held, we're out of luck and need to counteract this with longer timeouts.
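
Lengthening the timeouts is a configuration change on the user's side. A minimal sketch, assuming the lease timeout and validation interval are exposed under the distributed.scheduler.locks config namespace (the exact key names may differ):

```python
import dask

# Assumed key names -- check distributed.yaml for the exact spelling.
dask.config.set({
    "distributed.scheduler.locks.lease-timeout": "60s",
    "distributed.scheduler.locks.lease-validation-interval": "20s",
})
```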

@marco-neumann-by (Contributor) commented

So if the user schedules a task that is guarded by a semaphore but is prone to GIL blocking, it might run into refresh timeouts and systematically overbook the semaphore?

@fjetter (Member, Author) commented Apr 14, 2020

For the retry configurations, see #3705
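
A hedged sketch of how such retry options might be set through the dask config; the key names below are assumptions based on #3705 and may differ from what was actually merged:

```python
import dask

# Assumed key names from #3705 -- verify against the merged config schema.
dask.config.set({
    "distributed.comm.retry.count": 3,          # retries for failed comm calls
    "distributed.comm.retry.delay.min": "1s",   # backoff lower bound
    "distributed.comm.retry.delay.max": "20s",  # backoff upper bound
})
```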

@fjetter force-pushed the semaphore/idempotent_acquire branch from b640f5f to baab3e1 on April 14, 2020 07:57
@lr4d (Contributor) left a comment
First pass. Looks good; the log messages could be clearer.

[Four review threads on distributed/semaphore.py, since resolved]
@fjetter (Member, Author) commented Apr 14, 2020

So if the user schedules a task that is guarded by a semaphore but is prone to GIL blocking, it might run into refresh timeouts and systematically overbook the semaphore?

Currently, the timeout would trigger a log message at warning level, and once the GIL is released on the worker again, this would trigger an exception at scheduler level (caught and logged as an error). The tasks/computations would not fail, however, and the semaphore would be overbooked, correct.

@fjetter (Member, Author) commented Apr 14, 2020

@marco-neumann-jdas I believe we can either protect ourselves from resource starvation and deadlocks (the intention of this implementation) or from overbooking. I don't think we can pull off both within this library. If I'm wrong about that, I'd be happy to be educated :)

What I could suggest is that a lease refresh does not simply warn/log/raise on the overbooking but actually registers the lease again, i.e. we'd briefly be in a state where more leases are registered than allowed, which would then block new leases until we're back in a normal state.

For applications where the timeout is ill-configured and every task breaches it, this would at least stop an unlimited avalanche.
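
A rough sketch of that idea, building on the illustrative `_SemaphoreState` above (the names are hypothetical and this is not the merged implementation):

```python
import time


class _OverbookingState(_SemaphoreState):
    """Illustrative extension of the sketch above (hypothetical names)."""

    def refresh(self, lease_id):
        # Re-register a lease that already timed out on the scheduler (e.g. a
        # GIL-blocked worker that missed its refresh window).  This may push
        # us above max_leases for a while ...
        self.leases[lease_id] = time.time()

    def acquire(self, lease_id):
        # ... and while we are oversubscribed, no new leases are handed out.
        if lease_id not in self.leases and len(self.leases) >= self.max_leases:
            return False
        return super().acquire(lease_id)
```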

@fjetter force-pushed the semaphore/idempotent_acquire branch 3 times, most recently from 6e9904f to f94a1ac, on April 14, 2020 12:33
@fjetter (Member, Author) commented Apr 14, 2020

@marco-neumann-jdas I added the treatment for oversubscription. If it is detected, the lease is registered again and further acquisitions are blocked until we fall back to a normal state. See the test test_oversubscribing_leases (I documented it as well as I could, since the logic is not exactly trivial).
This should at least protect scenarios where the GIL makes up only a fraction of a task's runtime. If the entire runtime is one big GIL block, we're completely out of luck, but I would argue that in those cases the users should be able to adjust the configuration, since we log plenty of warnings pointing them to the proper options.

Once we equip the semaphore with some Prometheus metrics, this should also be clearly visible.

@fjetter force-pushed the semaphore/idempotent_acquire branch from 2f5d094 to 3545383 on April 14, 2020 15:36
[Two review threads on distributed/semaphore.py, since resolved]
@fjetter force-pushed the semaphore/idempotent_acquire branch from 2500cde to 679e1da on April 16, 2020 13:52
@fjetter (Member, Author) commented Apr 16, 2020

@martindurant @quasiben any feedback? If not, I'd like to merge this.

@martindurant (Member) commented

I'll defer to @quasiben here, if he has time

@quasiben (Member) commented

This looks great and I definitely appreciate the comments around the tests. @fjetter do you want the honors of hitting the green button?

@fjetter merged commit ee8cff4 into dask:master on Apr 16, 2020
@fjetter deleted the semaphore/idempotent_acquire branch on April 16, 2020 16:05
@mrocklin (Member) commented

Hi folks, this introduced a test failure in master: #3717

Is there any chance that people here can help to resolve this?
