Idempotent semaphore acquire with retries #3690
Conversation
@fjetter, thanks for submitting this! @quasiben @jakirkham, any chance for a review here?
Looks like we need the PR (or add the changes here) for the …
So if I understand this correctly, the lease will not be periodically refreshed by the holding worker. How does this interact with: …
We will not face the same issues as with the old implementation. In the old implementation everything was coupled to the heartbeat of the client; instead, we'll now have a dedicated semaphore heartbeat/refresh.
User payloads always run in another thread and don't impact the event loop of the worker unless the GIL is held. If the GIL is held, we're out of luck and need to counteract this with longer timeouts.
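For concreteness, a minimal sketch of what such a dedicated refresh loop could look like; `LeaseRefresher`, `scheduler_rpc`, and `refresh_leases` are illustrative names, not the actual `distributed.Semaphore` internals:

```python
import asyncio


class LeaseRefresher:
    """Hypothetical holder-side refresh loop, decoupled from the client heartbeat."""

    def __init__(self, scheduler_rpc, interval=5.0):
        self.scheduler_rpc = scheduler_rpc  # assumed comm handle to the scheduler
        self.interval = interval
        self.lease_ids = set()  # leases currently held on this side

    async def refresh_loop(self):
        while True:
            if self.lease_ids:
                # One batched call; the scheduler resets the per-lease
                # timeout for every ID it receives.
                await self.scheduler_rpc.refresh_leases(
                    lease_ids=list(self.lease_ids)
                )
            await asyncio.sleep(self.interval)
```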
So if the user schedules a task that is guarded by a semaphore but is prone to GIL blocking, it might run into refresh timeouts and systematically overbook the semaphore?
For the retry configuration, see #3705
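The retry behaviour boils down to plain dask configuration values. A hedged sketch, assuming the keys from #3705 (check your `distributed` version for the exact names):

```python
import dask

dask.config.set({
    "distributed.comm.retry.count": 5,         # how often to retry a failed comm call
    "distributed.comm.retry.delay.min": "1s",  # initial backoff delay
    "distributed.comm.retry.delay.max": "20s", # upper bound for the backoff
})
```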
Force-pushed from b640f5f to baab3e1.
First pass. Looks good; log messages could be clearer.
Currently, the timeout would trigger a log message at warning level and, once the GIL is released on the worker again, an exception at scheduler level (caught and logged as an error). The tasks/computations would not fail, however, and the semaphore would be overbooked, correct.
@marco-neumann-jdas I believe we can protect ourselves either from resource starvation and deadlocks (the intention of this implementation) or from overbooking; I don't think we can pull off both within this library. If I'm wrong about this statement, I'd be happy to be educated :) What I could suggest is that a lease refresh will not simply warn/log/raise on the overbooking but actually register the lease again, i.e. we'd briefly be in a state where more leases are registered than allowed, which would then block new leases until we're back in a normal state. For applications where the timeout is ill-configured and every task breaches it, this would at least stop an unlimited avalanche.
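A scheduler-side sketch of that suggestion, assuming the scheduler keeps a mapping from lease IDs to their last refresh time (all names here are hypothetical):

```python
import time


class SemaphoreState:
    """Illustrative per-semaphore state on the scheduler."""

    def __init__(self, max_leases):
        self.max_leases = max_leases
        self.leases = {}  # lease_id -> timestamp of last refresh

    def refresh(self, lease_id):
        # A refresh for an unknown lease means it already timed out while
        # its holder was still running (e.g. a long GIL hold). Instead of
        # only logging this, re-register the lease: the semaphore is now
        # briefly oversubscribed.
        self.leases[lease_id] = time.time()

    def can_acquire(self):
        # New leases stay blocked while the registered count is at or
        # above the limit, which covers the oversubscribed state and lets
        # the system drain back to normal instead of avalanching.
        return len(self.leases) < self.max_leases
```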
Force-pushed from 6e9904f to f94a1ac.
@marco-neumann-jdas I added the treatment for oversubscription. If it is detected, the lease is registered anyway and further acquisitions are blocked until we fall back to a normal state; see the test. Once we equip the semaphore with some Prometheus metrics, this should also be clearly visible.
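From the user's perspective this is transparent; a minimal usage sketch of the public API (`do_work` is a stand-in for real user code):

```python
from dask.distributed import Client, Semaphore


def do_work(x):
    return x * 2  # stand-in for the actual guarded work


def access_resource(x, sem):
    # The context manager acquires a lease (blocking while all leases,
    # including any re-registered ones, are taken) and releases it on exit.
    with sem:
        return do_work(x)


client = Client()  # assumes a local or remote cluster is available
sem = Semaphore(max_leases=2, name="database")
futures = client.map(access_resource, range(10), sem=sem)
results = client.gather(futures)
```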
Force-pushed from 2f5d094 to 3545383.
Force-pushed from 2500cde to 679e1da.
@martindurant @quasiben any feedback? If not, I'd like to merge this.
I'll defer to @quasiben here, if he has time.
This looks great, and I definitely appreciate the comments around tests. @fjetter, do you want the honors of hitting the green button?
Hi folks, this introduced a test failure on master (#3717). Is there any chance that people here can help resolve this?
This adds a lot of logging information to the semaphore but, most importantly, refactors the internal structures to make the acquire call idempotent.
This relieves us of using the Client itself as a lifetime anchor for the leases and replaces that scheme with an explicit refresh of the leases and lease-specific timeouts.
This gives the following benefits:
at the cost of additional complexity:
I sneaked in another change regarding the configurability of the retry options. I will open another PR for this, but I needed it for the tests to succeed.
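To illustrate the idempotency part, a toy sketch under the assumption that the caller generates a lease ID once and reuses it on every retry (`ToySemaphore` is purely illustrative, not the actual implementation):

```python
import uuid


class ToySemaphore:
    def __init__(self, max_leases):
        self.max_leases = max_leases
        self.leases = set()

    def acquire(self, lease_id):
        if lease_id in self.leases:
            # A retried request for an already-granted lease is a no-op,
            # so a lost acknowledgement cannot consume a second lease.
            return True
        if len(self.leases) < self.max_leases:
            self.leases.add(lease_id)
            return True
        return False


sem = ToySemaphore(max_leases=1)
lease_id = uuid.uuid4().hex  # generated once by the caller
assert sem.acquire(lease_id)
assert sem.acquire(lease_id)  # retry after a dropped reply: still one lease
```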