Ensure `NannyPlugin`s are always installed #8107
Conversation
Unit Test Results: 21 files ±0, 21 suites ±0, 10h 29m 5s ⏱️ (−10m 20s). Results for commit 0e4b536; comparison against base commit e79c0c7. See the test report for an extended history of previous test failures; this is useful for diagnosing flaky tests.
IIUC, this approach could deadlock plugin registration if a nanny dies without deregistering before completing startup. Given that this might happen when using a large cluster with spot instances, I am -0.5 on the approach. Instead, I would consider the following: before registering the worker belonging to a nanny, we add a consolidation step ensuring that all currently known plugins are installed on the nanny (see the sketch below). This might take multiple passes if users register plugins in the meantime, but it should still provide strong consistency guarantees from the time a worker is registered by the scheduler.
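A minimal sketch of that consolidation loop, assuming invented names (`scheduler.nanny_plugins`, `nanny.installed`, `nanny.install`) for the scheduler- and nanny-side bookkeeping; it is illustrative, not the actual implementation:

```python
# Sketch: before the scheduler registers the worker belonging to a nanny,
# loop until a full pass finds no plugin missing on that nanny. Plugins
# registered concurrently simply trigger another pass.
async def consolidate_plugins(scheduler, nanny):
    while True:
        missing = {
            name: plugin
            for name, plugin in scheduler.nanny_plugins.items()
            if name not in nanny.installed
        }
        if not missing:
            return  # nanny is consistent; safe to register its worker now
        for name, plugin in missing.items():
            await nanny.install(plugin)
            nanny.installed.add(name)
```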
How is this different from what we're doing currently?
Papertrail: this has been discussed in an offline conversation, and @fjetter will investigate two alternative approaches.
Force-pushed from 523fbf6 to 6c91268.
Only got around to implementing one of the two approaches. Hendrik's approach may still be lower complexity, but I likely won't find time for it today.
```python
@pytest.mark.parametrize("restart", [True, False])
@gen_cluster(client=True, nthreads=[])
async def test_nanny_plugin_register_nanny_killed(c, s, restart):
```
@hendrikmakait this test is killing the nanny process while it is instantiating the Worker. That's probably the worst case because everything is in a partial state.
It deadlocks with the latest commit but passes just fine now.
I realized that the lock-less approach does not work until we can uniquely identify plugins (e.g., through entity tags); names are not sufficient. The example below illustrates why.
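A hypothetical illustration (the `InstallPackage` plugin below is invented, not part of distributed): two distinct plugin instances can legitimately share a name, so bookkeeping by name alone cannot tell whether a nanny still needs the newer one.

```python
from distributed.diagnostics.plugin import NannyPlugin


class InstallPackage(NannyPlugin):
    """Hypothetical plugin that installs a package next to the nanny."""

    name = "install-package"

    def __init__(self, package):
        self.package = package

    def setup(self, nanny):
        ...  # install self.package on the nanny's host


plugin_v1 = InstallPackage("mylib==1.0")
plugin_v2 = InstallPackage("mylib==2.0")
# Both instances report name == "install-package". A lock-less
# consolidation pass that asks "is a plugin with this name installed?"
# can wrongly skip plugin_v2 because plugin_v1 already matched the name.
```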
There is a race condition that allows the scheduler to ignore and never set up a `NannyPlugin` on a `Nanny` if said `Nanny` is starting while the registration is happening. This `Nanny` would never receive the plugin and we'd end up in an inconsistent state.

Closes #8100
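A hedged sketch of how the race can be hit; the plugin and the orchestration are illustrative (not this PR's test), but `Scheduler`, `Nanny`, and `Client.register_worker_plugin(..., nanny=True)` are the real APIs:

```python
import asyncio

from distributed import Client, Nanny, Scheduler
from distributed.diagnostics.plugin import NannyPlugin


class MarkerPlugin(NannyPlugin):
    """Hypothetical no-op plugin used only to observe setup."""

    name = "marker"

    def setup(self, nanny):
        print(f"plugin set up on {nanny.address}")


async def main():
    async with Scheduler(dashboard_address=":0") as scheduler:
        async with Client(scheduler.address, asynchronous=True) as client:
            nanny = Nanny(scheduler.address)
            starting = asyncio.ensure_future(nanny)  # nanny begins starting
            # This registration races with the nanny startup above. Before
            # this PR, the half-started nanny could be missed entirely and
            # MarkerPlugin.setup() would never run on it.
            await client.register_worker_plugin(MarkerPlugin(), nanny=True)
            await starting
            await nanny.close()


asyncio.run(main())
```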
This implementation guarantees strong consistency such that once `client.register_worker_plugin` is called, all present and future Nannies are guaranteed to have the plugin set up. If a `Nanny` is currently starting while this API is called, the call blocks until all such intermediate-state Nannies are up. Conversely, all Nannies that want to start after `register_worker_plugin` has been called but before it has completed have to wait (see the sketch below).

This strong guarantee requires a mutex, which makes the implementation a little more complex, but I believe this is the right choice for UX. Inconsistencies that arise from these races are very difficult for end users to debug, and I believe this complexity is warranted for such a common situation.
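A minimal sketch of that mutex scheme, with invented names; the actual scheduler code is more involved (and may allow nanny startups to proceed concurrently with each other, more like a readers-writer lock), but the blocking behavior is the same:

```python
import asyncio


class FakeNanny:
    """Hypothetical stand-in recording which plugins were set up."""

    def __init__(self):
        self.installed = {}

    async def install(self, name, plugin):
        await asyncio.sleep(0)  # stands in for the real setup round-trip
        self.installed[name] = plugin


class PluginRegistry:
    """Sketch of scheduler-side bookkeeping guarded by a single mutex."""

    def __init__(self):
        self._lock = asyncio.Lock()
        self.nannies = []
        self.plugins = {}

    async def register_plugin(self, name, plugin):
        # Blocks until nannies in an intermediate startup state are done.
        async with self._lock:
            self.plugins[name] = plugin
            await asyncio.gather(
                *(n.install(name, plugin) for n in self.nannies)
            )

    async def add_nanny(self, nanny):
        # Nannies starting while a registration is in flight wait here, so
        # every nanny leaves startup with all known plugins installed.
        async with self._lock:
            for name, plugin in self.plugins.items():
                await nanny.install(name, plugin)
            self.nannies.append(nanny)
```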
If we imposed only eventual consistency, i.e. allowed a brief period of inconsistency after `register_worker_plugin`, we could instead get away with installing the plugin on the inconsistent `Nanny` after it has properly started and restarting the worker afterwards. I chose not to go for this approach since we would have to keep state about currently starting Nannies as well (`Scheduler.add/remove_nanny`). While the concurrency control would be simpler, the UX would be much worse. For contrast, a sketch of this rejected variant follows.
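This sketch reuses `asyncio` and `FakeNanny` from the sketch above; again, all names are invented. There is no lock, but the scheduler must track mid-startup nannies and reconcile them afterwards, leaving a window in which a nanny visibly lacks a plugin:

```python
class EventualRegistry:
    """Sketch of the rejected eventual-consistency variant."""

    def __init__(self):
        self.nannies = []
        self.plugins = {}

    async def register_plugin(self, name, plugin):
        self.plugins[name] = plugin
        # Nannies still starting are not in self.nannies yet; they miss
        # this broadcast and are temporarily inconsistent.
        await asyncio.gather(
            *(n.install(name, plugin) for n in self.nannies)
        )

    async def nanny_started(self, nanny):
        # Reconcile after startup: install anything missed, then restart
        # the worker so the plugin's effects actually apply. Until this
        # runs, the cluster is in an inconsistent state.
        for name, plugin in self.plugins.items():
            if name not in nanny.installed:
                await nanny.install(name, plugin)
                # (a real implementation would restart the worker here)
        self.nannies.append(nanny)
```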