
Guarantee worker is restarted if Nanny.kill is called #7323

Open · wants to merge 5 commits into base: main

Conversation

fjetter (Member) commented Nov 17, 2022

This closes #7312, a race condition in which the nanny shuts down a worker even though the worker is supposed to be restarted.

It does feel like a band-aid, since the closing logic is still extremely complicated, but I'd like to fix the behavior first before cleaning anything up.

Closes #6311 (at least one of the problems with the test)

xref #7321, #7320
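The invariant this PR is after can be sketched with a toy model. This is a hypothetical simplification, not the real `distributed.Nanny` API: an instantiate lock serializes teardown and restart, so a concurrent `kill(restart=True)` cannot observe a worker mid-restart and leave it dead.

```python
import asyncio


class ToyNanny:
    """Toy model of the kill/restart interplay (illustrative names only)."""

    def __init__(self):
        self.worker_alive = False
        # Serializes kill and instantiate so a concurrent kill cannot
        # observe (and tear down) a worker that is mid-restart.
        self._instantiate_lock = asyncio.Lock()

    async def instantiate(self):
        async with self._instantiate_lock:
            await asyncio.sleep(0)  # stand-in for subprocess startup
            self.worker_alive = True

    async def kill(self, restart=True):
        async with self._instantiate_lock:
            self.worker_alive = False
        if restart:
            await self.instantiate()


async def main():
    nanny = ToyNanny()
    await nanny.instantiate()
    # Two concurrent kills with restart=True must still end with a live worker.
    await asyncio.gather(nanny.kill(), nanny.kill())
    return nanny.worker_alive


alive = asyncio.run(main())
```

Each `kill` coroutine's last action is its own `instantiate`, so under any interleaving the worker ends up alive — which is the guarantee the PR title promises.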

fjetter self-assigned this Nov 17, 2022
github-actions bot (Contributor) commented Nov 17, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

18 files ±0 · 18 suites ±0 · 8h 10m 28s ⏱️ −14m 26s
3 259 tests +4: 3 170 ✔️ +3, 85 💤 −1, 4 ❌ +2
29 340 runs +36: 28 108 ✔️ +32, 1 227 💤 +1, 5 ❌ +3

For more details on these failures, see this check.

Results for commit 178f57f. ± Comparison against base commit 19deee3.

♻️ This comment has been updated with latest results.

crusaderky (Collaborator)

Many tests are red

Comment on lines -546 to -547
# Avoid excessive spewing. (It's also printed once extra within the subprocess, which is okay.)
assert logs.getvalue().count("ValueError: broken") == 1, logs.getvalue()
fjetter (Member Author) commented Nov 25, 2022

Previously we were logging the exception twice: once in the subprocess and once in the parent process.
Now we log it only once, in the subprocess, and raise it to the caller as a proper exception, which should make for a nicer UX.

logs on main

2022-11-25 10:51:13,721 - distributed.scheduler - INFO - State start
2022-11-25 10:51:13,723 - distributed.scheduler - INFO -   Scheduler at:     tcp://127.0.0.1:64939
2022-11-25 10:51:13,724 - distributed.scheduler - INFO -   dashboard at:           127.0.0.1:64938
2022-11-25 10:51:14,277 - distributed.worker - INFO - Stopping worker. Reason: worker-close
2022-11-25 10:51:14,277 - distributed.worker - INFO - Closed worker has not yet started: Status.init
2022-11-25 10:51:14,278 - distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
  File "/Users/fjetter/workspace/distributed/distributed/core.py", line 482, in start
    await asyncio.wait_for(self.start_unsafe(), timeout=timeout)
  File "/Users/fjetter/mambaforge/envs/dask-distributed-310/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
    return await fut
  File "/Users/fjetter/workspace/distributed/distributed/tests/test_nanny.py", line 527, in start_unsafe
    raise ValueError("broken")
ValueError: broken

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/fjetter/workspace/distributed/distributed/nanny.py", line 921, in run
    async with worker:
  File "/Users/fjetter/workspace/distributed/distributed/core.py", line 495, in __aenter__
    await self
  File "/Users/fjetter/workspace/distributed/distributed/core.py", line 490, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: BrokenWorker failed to start.
2022-11-25 10:51:14,314 - distributed.nanny - ERROR - Failed to start process
Traceback (most recent call last):
  File "/Users/fjetter/workspace/distributed/distributed/core.py", line 482, in start
    await asyncio.wait_for(self.start_unsafe(), timeout=timeout)
  File "/Users/fjetter/mambaforge/envs/dask-distributed-310/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
    return await fut
  File "/Users/fjetter/workspace/distributed/distributed/tests/test_nanny.py", line 527, in start_unsafe
    raise ValueError("broken")
ValueError: broken

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/fjetter/workspace/distributed/distributed/nanny.py", line 442, in instantiate
    result = await self.process.start()
  File "/Users/fjetter/workspace/distributed/distributed/nanny.py", line 714, in start
    msg = await self._wait_until_connected(uid)
  File "/Users/fjetter/workspace/distributed/distributed/nanny.py", line 857, in _wait_until_connected
    raise msg["exception"]
  File "/Users/fjetter/workspace/distributed/distributed/nanny.py", line 921, in run
    async with worker:
  File "/Users/fjetter/workspace/distributed/distributed/core.py", line 495, in __aenter__
    await self
  File "/Users/fjetter/workspace/distributed/distributed/core.py", line 490, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: BrokenWorker failed to start.
2022-11-25 10:51:14,318 - distributed.scheduler - INFO - Scheduler closing...
2022-11-25 10:51:14,319 - distributed.scheduler - INFO - Scheduler closing all comms

logs on this PR

2022-11-25 10:50:08,962 - distributed.scheduler - INFO - State start
2022-11-25 10:50:08,963 - distributed.scheduler - INFO -   Scheduler at:     tcp://127.0.0.1:64916
2022-11-25 10:50:08,963 - distributed.scheduler - INFO -   dashboard at:           127.0.0.1:64915
2022-11-25 10:50:09,388 - distributed.worker - INFO - Stopping worker. Reason: worker-close
2022-11-25 10:50:09,388 - distributed.worker - INFO - Closed worker has not yet started: Status.init
2022-11-25 10:50:09,390 - distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
  File "/Users/fjetter/workspace/distributed/distributed/core.py", line 487, in start
    await asyncio.wait_for(self.start_unsafe(), timeout=timeout)
  File "/Users/fjetter/mambaforge/envs/dask-distributed-310/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
    return await fut
  File "/Users/fjetter/workspace/distributed/distributed/tests/test_nanny.py", line 527, in start_unsafe
    raise ValueError("broken")
ValueError: broken

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/fjetter/workspace/distributed/distributed/nanny.py", line 899, in run
    async with worker:
  File "/Users/fjetter/workspace/distributed/distributed/core.py", line 501, in __aenter__
    await self
  File "/Users/fjetter/workspace/distributed/distributed/core.py", line 497, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: BrokenWorker failed to start.
2022-11-25 10:50:09,439 - distributed.scheduler - INFO - Scheduler closing...
2022-11-25 10:50:09,440 - distributed.scheduler - INFO - Scheduler closing all comms

As you can see, logging only in the subprocess also gives a nicer traceback than logging in both places.
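The single, clean traceback in the logs above is produced by standard Python exception chaining (`raise ... from exc`, PEP 3134). A generic sketch of the pattern, not the actual distributed code:

```python
def start_unsafe():
    # Stand-in for a subclass startup hook that fails.
    raise ValueError("broken")


def start():
    try:
        start_unsafe()
    except Exception as exc:
        # Chain the original failure so the caller sees one coherent traceback
        # ("The above exception was the direct cause of the following exception").
        raise RuntimeError("BrokenWorker failed to start.") from exc


try:
    start()
except RuntimeError as err:
    cause = err.__cause__  # the original ValueError is preserved here
```

Because the cause travels with the `RuntimeError`, the parent process no longer needs its own `logger.exception` call to surface the root failure.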

Comment on lines 424 to 447
if self.death_timeout:
    try:
        result = await asyncio.wait_for(
            self.process.start(), self.death_timeout
        )
    except asyncio.TimeoutError:
        logger.error(
            "Timed out connecting Nanny '%s' to scheduler '%s'",
            self,
            self.scheduler_addr,
        )
        await self.close(
            timeout=self.death_timeout, reason="nanny-instantiate-timeout"
        )
        raise

else:
    try:
        result = await self.process.start()
    except Exception:
        logger.error("Failed to start process", exc_info=True)
        await self.close(reason="nanny-instantiate-failed")
        raise
return result
fjetter (Member Author)

Server.start takes care of closing the server if something goes wrong during startup, so all of this special handling is unnecessary.
The treatment of death_timeout is also handled in Server.start, so this PR introduces a much more streamlined approach to the lifecycle management of the Nanny.
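The centralized lifecycle being described can be sketched like this. This is an illustrative simplification with hypothetical attribute names, not the real `distributed/core.py` implementation:

```python
import asyncio


class ToyServer:
    """Sketch of a Server.start that owns timeout handling and cleanup."""

    death_timeout = 0.5  # seconds; illustrative value

    def __init__(self):
        self.closed = False

    async def start_unsafe(self):
        raise ValueError("broken")  # simulate a failing subclass startup

    async def close(self, reason=""):
        self.closed = True

    async def start(self):
        try:
            # Timeout handling and failure cleanup live here once,
            # instead of being duplicated in every caller (e.g. Nanny).
            await asyncio.wait_for(self.start_unsafe(), timeout=self.death_timeout)
        except Exception as exc:
            await self.close(reason="failed-to-start")
            raise RuntimeError(f"{type(self).__name__} failed to start.") from exc


async def main():
    server = ToyServer()
    try:
        await server.start()
    except RuntimeError:
        pass
    return server.closed


closed = asyncio.run(main())
```

With cleanup concentrated in `start`, callers like `Nanny.instantiate` can simply `await self.process.start()` and trust that a failed startup has already been torn down.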

fjetter (Member Author)

Which reminds me: if there is a startup failure during a restart, we're no longer robust. I'll add an extra test and adjust the behavior.

self.process.start(), self.death_timeout
# The lock is required since there are many possible race conditions due
# to the worker exit callback
async with self._instantiate_lock:
fjetter (Member Author)
We should kill the exit callbacks eventually. I think this would require something like #6616.

fjetter force-pushed the ensure_nanny_restart_not_kill_worker branch from ee44cde to 85562a0 on December 1, 2022
fjetter force-pushed the ensure_nanny_restart_not_kill_worker branch from 85562a0 to 6d675ba on December 12, 2022

Successfully merging this pull request may close these issues.

- Restart can kill a worker
- CI failing with test_submit_after_failed_worker_async