If workers are connecting to the scheduler at the same time that a NannyPlugin is registered, it's possible that some workers will never receive the plugin.
Specifically, if this sequence happens:
- The Nanny starts and calls the scheduler's `add_nanny` RPC to retrieve plugins. There are none.
- A nanny plugin is registered. The new Nanny isn't in `self.workers`, because its worker process hasn't started up and connected yet, so no message is broadcast to it.
- The worker process starts up and connects to the scheduler.

Then that worker will never be informed of the plugin.
This is quite problematic, since nanny plugins are usually used to set up important state on workers (environment variables, installing packages, etc.) before they process tasks. The fact that worker state can be non-deterministically inconsistent (some workers randomly lack the right packages/variables/etc.) will be confusing to debug. Here's a reproducer:
```python
import asyncio
import multiprocessing as mp

from distributed import Nanny, WorkerPlugin
from distributed.utils_test import async_wait_for, gen_cluster


class SlowWorkerStart(WorkerPlugin):
    def __init__(self) -> None:
        self.started = mp.Event()
        self.continue_start = mp.Event()

    async def setup(self, worker):
        self.started.set()
        while not self.continue_start.is_set():
            await asyncio.sleep(0.1)


@gen_cluster(client=True, nthreads=[("", 1)], Worker=Nanny)
async def test_nanny_plugin_race(c, s, a):
    from dask.distributed import Environ

    worker_plugin = SlowWorkerStart()
    # Note `plugins=` is passed to the worker
    n = Nanny(s.address, plugins=[worker_plugin], nthreads=1)
    start_task = None
    try:
        # Can't use `async with Nanny`, because `Nanny.start_unsafe` blocks on
        # the worker having connected to the scheduler.
        start_task = asyncio.create_task(n.start())
        # Wait until the worker process has started (but not yet connected to
        # the scheduler)
        await async_wait_for(lambda: worker_plugin.started.is_set(), 5)
        assert len(s.workers) == 1
        # Now, register a plugin after the Nanny has asked the scheduler about
        # plugins, but before the worker process has connected.
        assert not a.plugins
        assert not n.plugins
        await c.register_worker_plugin(Environ({"ABC": 123}))
        # Let the worker finish connecting
        worker_plugin.continue_start.set()
        try:
            await start_task
        finally:
            start_task = None
        await c.wait_for_workers(2)
        assert a.plugins
        assert n.plugins  # this fails: the new Nanny never received the plugin
    finally:
        if start_task:
            start_task.cancel()
            try:
                await start_task
            finally:
                start_task = None
        await n.close()
```
To solve this, I think we'd have to maintain state on the scheduler tracking all the nannies, separate from `SchedulerState.workers`. Since Nannies and workers have different lifetimes, it's not enough to assume that `.workers` is always equivalent to the set of nannies. Doing this would probably improve cleanliness overall anyway.
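A minimal sketch of that idea, divorced from the real scheduler code (the names `SchedulerSketch`, `_nannies`, `add_nanny`, and `register_nanny_plugin` are hypothetical, not the actual distributed API): the `add_nanny` RPC would both return the plugins registered so far *and* record the nanny in a registry, and plugin registration would broadcast to that registry instead of iterating `self.workers`, so a nanny whose worker hasn't connected yet is still covered:

```python
import asyncio


class SchedulerSketch:
    """Hypothetical sketch: track nannies separately from workers."""

    def __init__(self):
        self.nanny_plugins = {}  # plugin name -> plugin
        self._nannies = {}  # nanny address -> RPC-like handle to that nanny

    def add_nanny(self, address, handle):
        # Called from the (hypothetical) `add_nanny` RPC, before the worker
        # process has connected. Records the nanny so that plugins registered
        # later still reach it, and returns the plugins registered so far.
        self._nannies[address] = handle
        return dict(self.nanny_plugins)

    def remove_nanny(self, address):
        # Nannies and workers have different lifetimes, so removal is driven
        # by the nanny's connection closing, not by worker departure.
        self._nannies.pop(address, None)

    async def register_nanny_plugin(self, name, plugin):
        # Broadcast to every known nanny rather than to `self.workers`:
        # this closes the window between `add_nanny` and worker connection.
        self.nanny_plugins[name] = plugin
        await asyncio.gather(
            *(h.plugin_add(name, plugin) for h in self._nannies.values())
        )
```

With this shape, the race in the reproducer above can't occur: the moment the nanny calls `add_nanny` it becomes visible to plugin registration, regardless of whether its worker process has connected yet.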