Guarantee worker is restarted if Nanny.kill is called #7323
base: main
Conversation
Unit Test Results
See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.
18 files ±0  18 suites ±0  8h 10m 28s ⏱️ -14m 26s
For more details on these failures, see this check.
Results for commit 178f57f. ± Comparison against base commit 19deee3.
♻️ This comment has been updated with latest results.
Many tests are red.
# Avoid excessive spewing. (It's also printed once extra within the subprocess, which is okay.)
assert logs.getvalue().count("ValueError: broken") == 1, logs.getvalue()
Previously we were logging the exception twice: once in the subprocess, once in the parent process.
Now we log it only once, in the subprocess, and raise it to the caller as a proper exception, which should make for a nicer UX.
logs on main
2022-11-25 10:51:13,721 - distributed.scheduler - INFO - State start
2022-11-25 10:51:13,723 - distributed.scheduler - INFO - Scheduler at: tcp://127.0.0.1:64939
2022-11-25 10:51:13,724 - distributed.scheduler - INFO - dashboard at: 127.0.0.1:64938
2022-11-25 10:51:14,277 - distributed.worker - INFO - Stopping worker. Reason: worker-close
2022-11-25 10:51:14,277 - distributed.worker - INFO - Closed worker has not yet started: Status.init
2022-11-25 10:51:14,278 - distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
File "/Users/fjetter/workspace/distributed/distributed/core.py", line 482, in start
await asyncio.wait_for(self.start_unsafe(), timeout=timeout)
File "/Users/fjetter/mambaforge/envs/dask-distributed-310/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
return await fut
File "/Users/fjetter/workspace/distributed/distributed/tests/test_nanny.py", line 527, in start_unsafe
raise ValueError("broken")
ValueError: broken
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/fjetter/workspace/distributed/distributed/nanny.py", line 921, in run
async with worker:
File "/Users/fjetter/workspace/distributed/distributed/core.py", line 495, in __aenter__
await self
File "/Users/fjetter/workspace/distributed/distributed/core.py", line 490, in start
raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: BrokenWorker failed to start.
2022-11-25 10:51:14,314 - distributed.nanny - ERROR - Failed to start process
Traceback (most recent call last):
File "/Users/fjetter/workspace/distributed/distributed/core.py", line 482, in start
await asyncio.wait_for(self.start_unsafe(), timeout=timeout)
File "/Users/fjetter/mambaforge/envs/dask-distributed-310/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
return await fut
File "/Users/fjetter/workspace/distributed/distributed/tests/test_nanny.py", line 527, in start_unsafe
raise ValueError("broken")
ValueError: broken
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/fjetter/workspace/distributed/distributed/nanny.py", line 442, in instantiate
result = await self.process.start()
File "/Users/fjetter/workspace/distributed/distributed/nanny.py", line 714, in start
msg = await self._wait_until_connected(uid)
File "/Users/fjetter/workspace/distributed/distributed/nanny.py", line 857, in _wait_until_connected
raise msg["exception"]
File "/Users/fjetter/workspace/distributed/distributed/nanny.py", line 921, in run
async with worker:
File "/Users/fjetter/workspace/distributed/distributed/core.py", line 495, in __aenter__
await self
File "/Users/fjetter/workspace/distributed/distributed/core.py", line 490, in start
raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: BrokenWorker failed to start.
2022-11-25 10:51:14,318 - distributed.scheduler - INFO - Scheduler closing...
2022-11-25 10:51:14,319 - distributed.scheduler - INFO - Scheduler closing all comms
logs on this PR
2022-11-25 10:50:08,962 - distributed.scheduler - INFO - State start
2022-11-25 10:50:08,963 - distributed.scheduler - INFO - Scheduler at: tcp://127.0.0.1:64916
2022-11-25 10:50:08,963 - distributed.scheduler - INFO - dashboard at: 127.0.0.1:64915
2022-11-25 10:50:09,388 - distributed.worker - INFO - Stopping worker. Reason: worker-close
2022-11-25 10:50:09,388 - distributed.worker - INFO - Closed worker has not yet started: Status.init
2022-11-25 10:50:09,390 - distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
File "/Users/fjetter/workspace/distributed/distributed/core.py", line 487, in start
await asyncio.wait_for(self.start_unsafe(), timeout=timeout)
File "/Users/fjetter/mambaforge/envs/dask-distributed-310/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
return await fut
File "/Users/fjetter/workspace/distributed/distributed/tests/test_nanny.py", line 527, in start_unsafe
raise ValueError("broken")
ValueError: broken
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/fjetter/workspace/distributed/distributed/nanny.py", line 899, in run
async with worker:
File "/Users/fjetter/workspace/distributed/distributed/core.py", line 501, in __aenter__
await self
File "/Users/fjetter/workspace/distributed/distributed/core.py", line 497, in start
raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: BrokenWorker failed to start.
2022-11-25 10:50:09,439 - distributed.scheduler - INFO - Scheduler closing...
2022-11-25 10:50:09,440 - distributed.scheduler - INFO - Scheduler closing all comms
As you can see, logging only in the subprocess also gives a nicer traceback than logging in both.
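The parent-side pattern described above can be sketched as follows. This is a minimal illustration, not the actual `distributed` code: the message shape and the `handle_startup_result` helper are hypothetical stand-ins for the nanny's IPC handshake, where the subprocess has already logged the traceback once and ships the exception object back to the parent.

```python
def handle_startup_result(msg: dict):
    """Parent-side handling of a subprocess startup message (hypothetical shape).

    The subprocess already logged the failure once, so the parent does not
    log it again; it just re-raises, and the caller gets a real exception
    instead of a second, duplicated log entry.
    """
    exc = msg.get("exception")
    if exc is not None:
        raise exc  # propagate to the caller; no extra logger.error(...) here
    return msg["result"]
```

The key design choice is that exactly one layer owns the logging (the subprocess, closest to the failure), while every other layer only propagates.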
distributed/nanny.py
Outdated
if self.death_timeout:
    try:
        result = await asyncio.wait_for(
            self.process.start(), self.death_timeout
        )
    except asyncio.TimeoutError:
        logger.error(
            "Timed out connecting Nanny '%s' to scheduler '%s'",
            self,
            self.scheduler_addr,
        )
        await self.close(
            timeout=self.death_timeout, reason="nanny-instantiate-timeout"
        )
        raise
else:
    try:
        result = await self.process.start()
    except Exception:
        logger.error("Failed to start process", exc_info=True)
        await self.close(reason="nanny-instantiate-failed")
        raise
return result
Server.start
takes care of closing the server if something happens during startup, so all of this special handling is unnecessary.
The treatment of death_timeout is also handled in Server.start,
so this PR introduces a much more streamlined approach to the lifecycle management of the Nanny.
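A minimal sketch of the centralized lifecycle pattern being described: one `start` wrapper owns both the timeout and the close-on-failure cleanup, so subclasses only implement `start_unsafe`. This is a simplified illustration, not the real `Server.start` in `distributed/core.py`.

```python
import asyncio


class Server:
    """Sketch: subclasses override start_unsafe(); failures and the
    death_timeout are handled in one place instead of per subclass."""

    def __init__(self, death_timeout=None):
        self.death_timeout = death_timeout
        self.closed = False

    async def start_unsafe(self):
        pass  # overridden by subclasses

    async def close(self):
        self.closed = True

    async def start(self):
        try:
            # wait_for with timeout=None simply awaits the coroutine
            await asyncio.wait_for(self.start_unsafe(), timeout=self.death_timeout)
        except Exception as exc:
            await self.close()  # centralized cleanup on any startup failure
            raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
        return self
```

With this shape, the nanny's `instantiate` no longer needs its own try/except/close branches; they become dead weight.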
Which reminds me... if there is a startup failure during restart, we're not robust anymore. I'll add an extra test and adjust the behavior
# The lock is required since there are many possible race conditions due
# to the worker exit callback
async with self._instantiate_lock:
We should kill the exit callbacks eventually. I think this would require something like #6616
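The race the lock guards against can be sketched like this. The class and method names are hypothetical stand-ins, not the actual Nanny implementation: the point is that `instantiate` and the exit callback serialize on the same `asyncio.Lock`, so a concurrent exit callback cannot tear down the process mid-start.

```python
import asyncio


class NannySketch:
    """Sketch: serialize (re)instantiation and the exit callback on one lock."""

    def __init__(self):
        self._instantiate_lock = asyncio.Lock()
        self.process = None

    async def instantiate(self):
        async with self._instantiate_lock:
            if self.process is None:
                await asyncio.sleep(0)  # simulate slow subprocess startup
                self.process = object()  # stand-in for a started subprocess
            return self.process

    async def _on_exit(self):
        # The exit callback takes the same lock, so it cannot interleave
        # with a restart that is still in progress.
        async with self._instantiate_lock:
            self.process = None
```

Without the lock, two concurrent `instantiate` calls (or an exit callback firing mid-restart) could each observe `process is None` and start duplicate subprocesses, or kill one that is supposed to be restarted.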
This closes #7312, a race condition that could kill a nanny when it shuts down a worker that is actually supposed to be restarted.
It does feel like a band-aid since closing is still extremely complicated, but I'd like to fix the behavior first before cleaning anything up.
Closes #6311 (at least one of the problems with the test)
xref #7321, #7320