
Ensure NannyPlugins are always installed #8107

Merged
hendrikmakait merged 6 commits into dask:main from ensure_nanny_plugins_register on Sep 1, 2023

Conversation


@fjetter fjetter commented Aug 15, 2023

There is a race condition that allows a NannyPlugin to be skipped and never set up on a Nanny if said Nanny is starting while the registration is happening. Such a Nanny would never receive the plugin and we'd end up in an inconsistent state.

Closes #8100

This implementation guarantees strong consistency such that once client.register_worker_plugin is called, all present and future Nannies are guaranteed to have the plugin set up. If a Nanny is currently starting while this API is called, the call blocks until all Nannies in an intermediate state are up. Conversely, any Nanny that wants to start after register_worker_plugin has been called but before it has completed has to wait.
This strong guarantee requires a mutex, which makes the implementation a little more complex, but I believe this is the right choice for UX. Inconsistencies that arise from these races are very difficult to debug for end users, and I believe this complexity is warranted for such a common situation.

If we imposed only eventual consistency, i.e. if we allowed for a brief period of inconsistency after register_worker_plugin returns, we could instead get away with installing the plugin on the inconsistent Nanny after it has properly started and restarting the worker afterwards. I chose not to go for this approach since we would also have to keep state about currently starting Nannies (Scheduler.add/remove_nanny). While the concurrency control would be simpler, the UX would be much worse.
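
To make the locking scheme concrete, here is a minimal, hypothetical sketch of the coordination described above. None of these names (`SchedulerSketch`, `_plugin_lock`, `_starting_nannies`, `_broadcast_to_nannies`, `_start_nanny`) are taken from this PR; they only illustrate the idea of one mutex serializing plugin registration against nanny startup:

```python
import asyncio


class SchedulerSketch:
    """Illustrative sketch only; not the implementation in this PR."""

    def __init__(self):
        # One lock guards plugin registration and nanny-startup bookkeeping;
        # the condition lets registration wait for nannies that are in an
        # intermediate (starting) state.
        self._plugin_lock = asyncio.Lock()
        self._starting_cond = asyncio.Condition(self._plugin_lock)
        self._starting_nannies: set[str] = set()
        self.nanny_plugins: dict[str, object] = {}

    async def register_nanny_plugin(self, name, plugin):
        async with self._plugin_lock:
            # Wait until every nanny that was mid-startup is fully up, so
            # the broadcast below reaches a consistent set of nannies.
            while self._starting_nannies:
                await self._starting_cond.wait()
            self.nanny_plugins[name] = plugin
            await self._broadcast_to_nannies(name, plugin)

    async def add_nanny(self, address):
        async with self._plugin_lock:
            # A registration in flight holds the lock, so a nanny starting
            # after register_worker_plugin was called has to wait here.
            self._starting_nannies.add(address)
        try:
            await self._start_nanny(address)  # ships known plugins along
        finally:
            async with self._plugin_lock:
                self._starting_nannies.discard(address)
                self._starting_cond.notify_all()

    async def _broadcast_to_nannies(self, name, plugin):
        """Placeholder: send the plugin to every registered nanny."""

    async def _start_nanny(self, address):
        """Placeholder: run the nanny startup handshake."""
```

In a sketch like this, the `finally` block is what keeps a nanny that fails mid-startup from blocking registration forever, which is the deadlock concern raised in the review below.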


github-actions bot commented Aug 15, 2023

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

21 files ±0   21 suites ±0   10h 29m 5s ⏱️ −10m 20s
3 792 tests +7   3 684 ✔️ +8   107 💤 ±0   1 ❌ −1
36 657 runs +82   34 857 ✔️ +87   1 799 💤 −4   1 ❌ −1

For more details on these failures, see this check.

Results for commit 0e4b536. ± Comparison against base commit e79c0c7.

♻️ This comment has been updated with latest results.

@hendrikmakait hendrikmakait self-requested a review August 15, 2023 16:56

hendrikmakait commented Aug 16, 2023

IIUC, this approach could deadlock plugin registration if a nanny dies without deregistering before completing startup. Given that this might happen when using a large cluster with spot instances, I am -0.5 on the approach.

Instead, I would consider the following: Before registering the worker belonging to a nanny, we add a consolidation step ensuring that all currently known plugins are installed on the nanny. This might take multiple passes if users register plugins in the meantime, but it should still provide strong consistency guarantees from the time a worker is registered by the scheduler.
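
A rough sketch of this consolidation idea, assuming the scheduler keeps its known plugins in a `nanny_plugins` mapping and the nanny exposes a `plugin_add` coroutine (both names are assumptions for illustration):

```python
async def consolidate_plugins(scheduler, nanny) -> None:
    """Install all currently known plugins on `nanny` before its worker
    is registered. Sketch only, not the code in this PR."""
    installed: set[str] = set()
    while True:
        # Snapshot what is known now; users may register more concurrently.
        pending = {
            name: plugin
            for name, plugin in scheduler.nanny_plugins.items()
            if name not in installed
        }
        if not pending:
            return  # stable: nothing new appeared since the last pass
        for name, plugin in pending.items():
            await nanny.plugin_add(plugin=plugin, name=name)
            installed.add(name)
```

Only after this loop terminates would the scheduler register the worker, so every worker the scheduler knows about has all plugins that existed at its registration time.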


fjetter commented Aug 16, 2023

> Instead, I would consider the following: Before registering the worker belonging to a nanny, we add a consolidation step ensuring that all currently known plugins are installed on the nanny. This might take multiple passes if users register plugins in the meantime, but it should still provide strong consistency guarantees from the time a worker is registered by the scheduler.

How is this different to what we're doing currently?

@hendrikmakait hendrikmakait commented

> How is this different to what we're doing currently?

Papertrail: This has been discussed in an offline conversation and @fjetter is investigating two alternative approaches.

@fjetter fjetter force-pushed the ensure_nanny_plugins_register branch from 523fbf6 to 6c91268 on August 16, 2023 13:46

@fjetter fjetter left a comment


I only got around to implementing one of the two approaches. Hendrik's approach may still be lower complexity, but I likely won't find time for it today.

Comment on lines +900 to +902
@pytest.mark.parametrize("restart", [True, False])
@gen_cluster(client=True, nthreads=[])
async def test_nanny_plugin_register_nanny_killed(c, s, restart):
@fjetter fjetter commented

@hendrikmakait this test is killing the nanny process while it is instantiating the Worker. That's probably the worst case because everything is in a partial state.

This deadlocked before the latest commit but passes just fine now.

@hendrikmakait hendrikmakait self-assigned this Aug 17, 2023
@hendrikmakait hendrikmakait commented

> I only got around to implementing one of the two approaches. Hendrik's approach may still be lower complexity, but I likely won't find time for it today.

I realized that the lock-less approach does not work until we can uniquely identify plugins (e.g., through entity tags). Names are not sufficient.
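
As a toy illustration of why names alone cannot identify plugins (the `etag` field below is a made-up stand-in for such an entity tag):

```python
import uuid


class TaggedPlugin:
    """Two instances can share a name, so a name-based check cannot tell
    'already installed' apart from 'replaced by a different plugin'."""

    def __init__(self, name: str):
        self.name = name
        self.etag = uuid.uuid4().hex  # unique per plugin instance


old = TaggedPlugin("my-plugin")
new = TaggedPlugin("my-plugin")
assert old.name == new.name  # names collide ...
assert old.etag != new.etag  # ... but entity tags distinguish them
```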

@hendrikmakait hendrikmakait merged commit 9b15cd5 into dask:main Sep 1, 2023
25 of 28 checks passed