Ensure `NannyPlugin`s are always installed #8107
Conversation
Unit Test Results: 21 files ±0, 21 suites ±0, 10h 29m 5s ⏱️ (−10m 20s). Results for commit 0e4b536; comparison against base commit e79c0c7. See the test report for an extended history of previous test failures; this is useful for diagnosing flaky tests.
IIUC, this approach could deadlock plugin registration if a nanny dies without deregistering before completing startup. Given that this might happen when using a large cluster with spot instances, I am -0.5 on the approach. Instead, I would consider the following: before registering the worker belonging to a nanny, we add a consolidation step ensuring that all currently known plugins are installed on the nanny (see the sketch below). This might take multiple passes if users register plugins in the meantime, but it should still provide strong consistency guarantees from the time a worker is registered by the scheduler.
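A minimal sketch of that consolidation loop, assuming invented names (`scheduler.nanny_plugins`, `nanny.installed`, `nanny.install`) for the scheduler- and nanny-side bookkeeping; it is illustrative, not the actual implementation:

```python
# Sketch: before the scheduler registers the worker belonging to a nanny,
# loop until a full pass finds no plugin missing on that nanny. Plugins
# registered concurrently simply trigger another pass.
async def consolidate_plugins(scheduler, nanny):
    while True:
        missing = {
            name: plugin
            for name, plugin in scheduler.nanny_plugins.items()
            if name not in nanny.installed
        }
        if not missing:
            return  # nanny is consistent; safe to register its worker now
        for name, plugin in missing.items():
            await nanny.install(plugin)
            nanny.installed.add(name)
```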
How is this different from what we're doing currently?
Papertrail: this has been discussed in an offline conversation, and @fjetter will investigate two alternative approaches.
Force-pushed from 523fbf6 to 6c91268.
Only got around to implementing one of the two approaches. Hendrik's approach may still be lower complexity, but I likely won't find time for it today.
```python
@pytest.mark.parametrize("restart", [True, False])
@gen_cluster(client=True, nthreads=[])
async def test_nanny_plugin_register_nanny_killed(c, s, restart):
```
@hendrikmakait this test is killing the nanny process while it is instantiating the Worker. That's probably the worst case because everything is in a partial state.
It deadlocks with the latest commit but passes just fine now.
I realized that the lock-less approach does not work until we can uniquely identify plugins (e.g., through entity tags); names are not sufficient. The example below illustrates why.
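A hypothetical illustration (the `InstallPackage` plugin below is invented, not part of distributed): two distinct plugin instances can legitimately share a name, so bookkeeping by name alone cannot tell whether a nanny still needs the newer one.

```python
from distributed.diagnostics.plugin import NannyPlugin


class InstallPackage(NannyPlugin):
    """Hypothetical plugin that installs a package next to the nanny."""

    name = "install-package"

    def __init__(self, package):
        self.package = package

    def setup(self, nanny):
        ...  # install self.package on the nanny's host


plugin_v1 = InstallPackage("mylib==1.0")
plugin_v2 = InstallPackage("mylib==2.0")
# Both instances report name == "install-package". A lock-less
# consolidation pass that asks "is a plugin with this name installed?"
# can wrongly skip plugin_v2 because plugin_v1 already matched the name.
```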
There is a race condition that allows the scheduler to ignore and never set up a `NannyPlugin` on a `Nanny` if said `Nanny` is starting while the registration is happening. This `Nanny` would never receive the plugin and we'd end up in an inconsistent state.

Closes #8100
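A hedged sketch of how the race can be hit; the plugin and the orchestration are illustrative (not this PR's test), but `Scheduler`, `Nanny`, and `Client.register_worker_plugin(..., nanny=True)` are the real APIs:

```python
import asyncio

from distributed import Client, Nanny, Scheduler
from distributed.diagnostics.plugin import NannyPlugin


class MarkerPlugin(NannyPlugin):
    """Hypothetical no-op plugin used only to observe setup."""

    name = "marker"

    def setup(self, nanny):
        print(f"plugin set up on {nanny.address}")


async def main():
    async with Scheduler(dashboard_address=":0") as scheduler:
        async with Client(scheduler.address, asynchronous=True) as client:
            nanny = Nanny(scheduler.address)
            starting = asyncio.ensure_future(nanny)  # nanny begins starting
            # This registration races with the nanny startup above. Before
            # this PR, the half-started nanny could be missed entirely and
            # MarkerPlugin.setup() would never run on it.
            await client.register_worker_plugin(MarkerPlugin(), nanny=True)
            await starting
            await nanny.close()


asyncio.run(main())
```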
This implementation guarantees strong consistency such that once `client.register_worker_plugin` is called, all present and future Nannies are guaranteed to have the plugin set up. If a `Nanny` is currently starting while this API is called, the call blocks until all such intermediate-state Nannies are up. Conversely, all Nannies that want to start after `register_worker_plugin` has been called but before it has completed have to wait (see the sketch below).

This strong guarantee requires a mutex, which makes the implementation a little more complex, but I believe this is the right choice for UX. Inconsistencies that arise from these races are very difficult for end users to debug, and I believe this complexity is warranted for such a common situation.
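A minimal sketch of that mutex scheme, with invented names; the actual scheduler code is more involved (and may allow nanny startups to proceed concurrently with each other, more like a readers-writer lock), but the blocking behavior is the same:

```python
import asyncio


class FakeNanny:
    """Hypothetical stand-in recording which plugins were set up."""

    def __init__(self):
        self.installed = {}

    async def install(self, name, plugin):
        await asyncio.sleep(0)  # stands in for the real setup round-trip
        self.installed[name] = plugin


class PluginRegistry:
    """Sketch of scheduler-side bookkeeping guarded by a single mutex."""

    def __init__(self):
        self._lock = asyncio.Lock()
        self.nannies = []
        self.plugins = {}

    async def register_plugin(self, name, plugin):
        # Blocks until nannies in an intermediate startup state are done.
        async with self._lock:
            self.plugins[name] = plugin
            await asyncio.gather(
                *(n.install(name, plugin) for n in self.nannies)
            )

    async def add_nanny(self, nanny):
        # Nannies starting while a registration is in flight wait here, so
        # every nanny leaves startup with all known plugins installed.
        async with self._lock:
            for name, plugin in self.plugins.items():
                await nanny.install(name, plugin)
            self.nannies.append(nanny)
```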
If we imposed only eventual consistency, i.e. allowed a brief period of inconsistency after `register_worker_plugin`, we could instead get away with installing the plugin on the inconsistent `Nanny` after it has properly started and restarting the worker afterwards. I chose not to go for this approach since we would have to keep state about currently starting Nannies as well (`Scheduler.add/remove_nanny`). While the concurrency control would be simpler, the UX would be much worse. For contrast, a sketch of this rejected variant follows.
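This sketch reuses `asyncio` and `FakeNanny` from the sketch above; again, all names are invented. There is no lock, but the scheduler must track mid-startup nannies and reconcile them afterwards, leaving a window in which a nanny visibly lacks a plugin:

```python
class EventualRegistry:
    """Sketch of the rejected eventual-consistency variant."""

    def __init__(self):
        self.nannies = []
        self.plugins = {}

    async def register_plugin(self, name, plugin):
        self.plugins[name] = plugin
        # Nannies still starting are not in self.nannies yet; they miss
        # this broadcast and are temporarily inconsistent.
        await asyncio.gather(
            *(n.install(name, plugin) for n in self.nannies)
        )

    async def nanny_started(self, nanny):
        # Reconcile after startup: install anything missed, then restart
        # the worker so the plugin's effects actually apply. Until this
        # runs, the cluster is in an inconsistent state.
        for name, plugin in self.plugins.items():
            if name not in nanny.installed:
                await nanny.install(name, plugin)
                # (a real implementation would restart the worker here)
        self.nannies.append(nanny)
```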