
Guarantee worker is restarted if Nanny.kill is called #7323

Open · wants to merge 5 commits into base: main

Conversation

fjetter (Member) commented Nov 17, 2022

This closes #7312, a race condition in which the nanny shuts down a worker even though the worker is supposed to be restarted.

It does feel like a band-aid, since the closing logic is still extremely complicated, but I'd like to fix the behavior first before cleaning anything up.

Closes #6311 (at least one of the problems with the test)

xref #7321, #7320
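The invariant this PR is after can be sketched with a toy model. This is a hypothetical simplification, not the real `distributed.Nanny` API: an instantiate lock serializes teardown and restart, so a concurrent `kill(restart=True)` cannot observe a worker mid-restart and leave it dead.

```python
import asyncio


class ToyNanny:
    """Toy model of the kill/restart interplay (illustrative names only)."""

    def __init__(self):
        self.worker_alive = False
        # Serializes kill and instantiate so a concurrent kill cannot
        # observe (and tear down) a worker that is mid-restart.
        self._instantiate_lock = asyncio.Lock()

    async def instantiate(self):
        async with self._instantiate_lock:
            await asyncio.sleep(0)  # stand-in for subprocess startup
            self.worker_alive = True

    async def kill(self, restart=True):
        async with self._instantiate_lock:
            self.worker_alive = False
        if restart:
            await self.instantiate()


async def main():
    nanny = ToyNanny()
    await nanny.instantiate()
    # Two concurrent kills with restart=True must still end with a live worker.
    await asyncio.gather(nanny.kill(), nanny.kill())
    return nanny.worker_alive


alive = asyncio.run(main())
```

Each `kill` coroutine's last action is its own `instantiate`, so under any interleaving the worker ends up alive — which is the guarantee the PR title promises.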

fjetter self-assigned this Nov 17, 2022
github-actions bot (Contributor) commented Nov 17, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

18 files ±0 · 18 suites ±0 · 8h 10m 28s ⏱️ −14m 26s
3 259 tests +4: 3 170 ✔️ +3, 85 💤 −1, 4 ❌ +2
29 340 runs +36: 28 108 ✔️ +32, 1 227 💤 +1, 5 ❌ +3

For more details on these failures, see this check.

Results for commit 178f57f. ± Comparison against base commit 19deee3.

♻️ This comment has been updated with latest results.

crusaderky (Collaborator)

Many tests are red

Comment on lines -546 to -547
# Avoid excessive spewing. (It's also printed once extra within the subprocess, which is okay.)
assert logs.getvalue().count("ValueError: broken") == 1, logs.getvalue()
fjetter (Member Author) commented Nov 25, 2022

Previously we were logging the exception twice: once in the subprocess and once in the parent process.
Now we log it only once, in the subprocess, and raise it to the caller as a proper exception, which should make for a nicer UX.

logs on main

2022-11-25 10:51:13,721 - distributed.scheduler - INFO - State start
2022-11-25 10:51:13,723 - distributed.scheduler - INFO -   Scheduler at:     tcp://127.0.0.1:64939
2022-11-25 10:51:13,724 - distributed.scheduler - INFO -   dashboard at:           127.0.0.1:64938
2022-11-25 10:51:14,277 - distributed.worker - INFO - Stopping worker. Reason: worker-close
2022-11-25 10:51:14,277 - distributed.worker - INFO - Closed worker has not yet started: Status.init
2022-11-25 10:51:14,278 - distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
  File "/Users/fjetter/workspace/distributed/distributed/core.py", line 482, in start
    await asyncio.wait_for(self.start_unsafe(), timeout=timeout)
  File "/Users/fjetter/mambaforge/envs/dask-distributed-310/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
    return await fut
  File "/Users/fjetter/workspace/distributed/distributed/tests/test_nanny.py", line 527, in start_unsafe
    raise ValueError("broken")
ValueError: broken

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/fjetter/workspace/distributed/distributed/nanny.py", line 921, in run
    async with worker:
  File "/Users/fjetter/workspace/distributed/distributed/core.py", line 495, in __aenter__
    await self
  File "/Users/fjetter/workspace/distributed/distributed/core.py", line 490, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: BrokenWorker failed to start.
2022-11-25 10:51:14,314 - distributed.nanny - ERROR - Failed to start process
Traceback (most recent call last):
  File "/Users/fjetter/workspace/distributed/distributed/core.py", line 482, in start
    await asyncio.wait_for(self.start_unsafe(), timeout=timeout)
  File "/Users/fjetter/mambaforge/envs/dask-distributed-310/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
    return await fut
  File "/Users/fjetter/workspace/distributed/distributed/tests/test_nanny.py", line 527, in start_unsafe
    raise ValueError("broken")
ValueError: broken

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/fjetter/workspace/distributed/distributed/nanny.py", line 442, in instantiate
    result = await self.process.start()
  File "/Users/fjetter/workspace/distributed/distributed/nanny.py", line 714, in start
    msg = await self._wait_until_connected(uid)
  File "/Users/fjetter/workspace/distributed/distributed/nanny.py", line 857, in _wait_until_connected
    raise msg["exception"]
  File "/Users/fjetter/workspace/distributed/distributed/nanny.py", line 921, in run
    async with worker:
  File "/Users/fjetter/workspace/distributed/distributed/core.py", line 495, in __aenter__
    await self
  File "/Users/fjetter/workspace/distributed/distributed/core.py", line 490, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: BrokenWorker failed to start.
2022-11-25 10:51:14,318 - distributed.scheduler - INFO - Scheduler closing...
2022-11-25 10:51:14,319 - distributed.scheduler - INFO - Scheduler closing all comms

logs on this PR

2022-11-25 10:50:08,962 - distributed.scheduler - INFO - State start
2022-11-25 10:50:08,963 - distributed.scheduler - INFO -   Scheduler at:     tcp://127.0.0.1:64916
2022-11-25 10:50:08,963 - distributed.scheduler - INFO -   dashboard at:           127.0.0.1:64915
2022-11-25 10:50:09,388 - distributed.worker - INFO - Stopping worker. Reason: worker-close
2022-11-25 10:50:09,388 - distributed.worker - INFO - Closed worker has not yet started: Status.init
2022-11-25 10:50:09,390 - distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
  File "/Users/fjetter/workspace/distributed/distributed/core.py", line 487, in start
    await asyncio.wait_for(self.start_unsafe(), timeout=timeout)
  File "/Users/fjetter/mambaforge/envs/dask-distributed-310/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
    return await fut
  File "/Users/fjetter/workspace/distributed/distributed/tests/test_nanny.py", line 527, in start_unsafe
    raise ValueError("broken")
ValueError: broken

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/fjetter/workspace/distributed/distributed/nanny.py", line 899, in run
    async with worker:
  File "/Users/fjetter/workspace/distributed/distributed/core.py", line 501, in __aenter__
    await self
  File "/Users/fjetter/workspace/distributed/distributed/core.py", line 497, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: BrokenWorker failed to start.
2022-11-25 10:50:09,439 - distributed.scheduler - INFO - Scheduler closing...
2022-11-25 10:50:09,440 - distributed.scheduler - INFO - Scheduler closing all comms

As you can see, logging only in the subprocess also gives a nicer traceback than logging in both places.
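The single, clean traceback in the logs above is produced by standard Python exception chaining (`raise ... from exc`, PEP 3134). A generic sketch of the pattern, not the actual distributed code:

```python
def start_unsafe():
    # Stand-in for a subclass startup hook that fails.
    raise ValueError("broken")


def start():
    try:
        start_unsafe()
    except Exception as exc:
        # Chain the original failure so the caller sees one coherent traceback
        # ("The above exception was the direct cause of the following exception").
        raise RuntimeError("BrokenWorker failed to start.") from exc


try:
    start()
except RuntimeError as err:
    cause = err.__cause__  # the original ValueError is preserved here
```

Because the cause travels with the `RuntimeError`, the parent process no longer needs its own `logger.exception` call to surface the root failure.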

Comment on lines 424 to 447
if self.death_timeout:
    try:
        result = await asyncio.wait_for(
            self.process.start(), self.death_timeout
        )
    except asyncio.TimeoutError:
        logger.error(
            "Timed out connecting Nanny '%s' to scheduler '%s'",
            self,
            self.scheduler_addr,
        )
        await self.close(
            timeout=self.death_timeout, reason="nanny-instantiate-timeout"
        )
        raise

else:
    try:
        result = await self.process.start()
    except Exception:
        logger.error("Failed to start process", exc_info=True)
        await self.close(reason="nanny-instantiate-failed")
        raise
return result
fjetter (Member Author)

Server.start takes care of closing the server if something goes wrong during startup, so all of this special handling is unnecessary.
The treatment of death_timeout is also handled in Server.start, so this PR introduces a much more streamlined approach to the lifecycle management of the Nanny.
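The centralized lifecycle being described can be sketched like this. This is an illustrative simplification with hypothetical attribute names, not the real `distributed/core.py` implementation:

```python
import asyncio


class ToyServer:
    """Sketch of a Server.start that owns timeout handling and cleanup."""

    death_timeout = 0.5  # seconds; illustrative value

    def __init__(self):
        self.closed = False

    async def start_unsafe(self):
        raise ValueError("broken")  # simulate a failing subclass startup

    async def close(self, reason=""):
        self.closed = True

    async def start(self):
        try:
            # Timeout handling and failure cleanup live here once,
            # instead of being duplicated in every caller (e.g. Nanny).
            await asyncio.wait_for(self.start_unsafe(), timeout=self.death_timeout)
        except Exception as exc:
            await self.close(reason="failed-to-start")
            raise RuntimeError(f"{type(self).__name__} failed to start.") from exc


async def main():
    server = ToyServer()
    try:
        await server.start()
    except RuntimeError:
        pass
    return server.closed


closed = asyncio.run(main())
```

With cleanup concentrated in `start`, callers like `Nanny.instantiate` can simply `await self.process.start()` and trust that a failed startup has already been torn down.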

fjetter (Member Author)

Which reminds me: if there is a startup failure during a restart, we're no longer robust. I'll add an extra test and adjust the behavior.

self.process.start(), self.death_timeout
# The lock is required since there are many possible race conditions due
# to the worker exit callback
async with self._instantiate_lock:
fjetter (Member Author)
We should kill the exit callbacks eventually. I think this would require something like #6616.

fjetter force-pushed the ensure_nanny_restart_not_kill_worker branch from ee44cde to 85562a0 on December 1, 2022
fjetter force-pushed the ensure_nanny_restart_not_kill_worker branch from 85562a0 to 6d675ba on December 12, 2022

Successfully merging this pull request may close these issues.

- Restart can kill a worker
- CI failing with test_submit_after_failed_worker_async