CI failing with test_submit_after_failed_worker_async #6311

Closed
jakirkham opened this issue May 10, 2022 · 4 comments · Fixed by #8032 · May be fixed by #7323
Labels
flaky test Intermittent failures on CI.

Comments

@jakirkham
Member

Seeing a CI failure with test_submit_after_failed_worker_async.

jakirkham added the flaky test label on May 10, 2022
@gjoseph92
Collaborator

A more recent traceback from https://github.com/dask/distributed/runs/6621019362?check_suite_focus=true#step:11:1368

____________________ test_submit_after_failed_worker_async _____________________
cls = <class '_pytest.runner.CallInfo'>
func = <function call_runtest_hook.<locals>.<lambda> at 0x13f4bc040>
when = 'call'
reraise = (<class '_pytest.outcomes.Exit'>, <class 'KeyboardInterrupt'>)
    @classmethod
    def from_call(
        cls,
        func: "Callable[[], TResult]",
        when: "Literal['collect', 'setup', 'call', 'teardown']",
        reraise: Optional[
            Union[Type[BaseException], Tuple[Type[BaseException], ...]]
        ] = None,
    ) -> "CallInfo[TResult]":
        """Call func, wrapping the result in a CallInfo.

        :param func:
            The function to call. Called without arguments.
        :param when:
            The phase in which the function is called.
        :param reraise:
            Exception or exceptions that shall propagate if raised by the
            function, instead of being wrapped in the CallInfo.
        """
        excinfo = None
        start = timing.time()
        precise_start = timing.perf_counter()
        try:
>           result: Optional[TResult] = func()
../../../miniconda3/envs/dask-distributed/lib/python3.10/site-packages/_pytest/runner.py:338: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../../miniconda3/envs/dask-distributed/lib/python3.10/site-packages/_pytest/runner.py:259: in <lambda>
    lambda: ihook(item=item, **kwds), when=when, reraise=reraise
../../../miniconda3/envs/dask-distributed/lib/python3.10/site-packages/pluggy/_hooks.py:265: in __call__
    return self._hookexec(self.name, self.get_hookimpls(), kwargs, firstresult)
../../../miniconda3/envs/dask-distributed/lib/python3.10/site-packages/pluggy/_manager.py:80: in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
../../../miniconda3/envs/dask-distributed/lib/python3.10/site-packages/_pytest/unraisableexception.py:88: in pytest_runtest_call
    yield from unraisable_exception_runtest_hook()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    def unraisable_exception_runtest_hook() -> Generator[None, None, None]:
        with catch_unraisable_exception() as cm:
            yield
            if cm.unraisable:
                if cm.unraisable.err_msg is not None:
                    err_msg = cm.unraisable.err_msg
                else:
                    err_msg = "Exception ignored in"
                msg = f"{err_msg}: {cm.unraisable.object!r}\n\n"
                msg += "".join(
                    traceback.format_exception(
                        cm.unraisable.exc_type,
                        cm.unraisable.exc_value,
                        cm.unraisable.exc_traceback,
                    )
                )
>               warnings.warn(pytest.PytestUnraisableExceptionWarning(msg))
E               pytest.PytestUnraisableExceptionWarning: Exception ignored in: <_io.FileIO [closed]>
E               
E               Traceback (most recent call last):
E                 File "/Users/runner/work/distributed/distributed/distributed/utils_test.py", line 1141, in coro
E                   gc.collect()
E               ResourceWarning: unclosed file <_io.BufferedReader name=26>
../../../miniconda3/envs/dask-distributed/lib/python3.10/site-packages/_pytest/unraisableexception.py:78: PytestUnraisableExceptionWarning
----------------------------- Captured stderr call -----------------------------
2022-05-27 06:19:51,868 - distributed.scheduler - INFO - State start
2022-05-27 06:19:51,871 - distributed.scheduler - INFO - Clear task state
2022-05-27 06:19:51,872 - distributed.scheduler - INFO -   Scheduler at:     tcp://127.0.0.1:56527
2022-05-27 06:19:51,872 - distributed.scheduler - INFO -   dashboard at:           127.0.0.1:56526
2022-05-27 06:19:51,884 - distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:56528
2022-05-27 06:19:51,884 - distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:56528
2022-05-27 06:19:51,885 - distributed.worker - INFO -          dashboard at:            127.0.0.1:56529
2022-05-27 06:19:51,885 - distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:56527
2022-05-27 06:19:51,885 - distributed.worker - INFO - -------------------------------------------------
2022-05-27 06:19:51,885 - distributed.worker - INFO -               Threads:                          1
2022-05-27 06:19:51,885 - distributed.worker - INFO -                Memory:                  14.00 GiB
2022-05-27 06:19:51,885 - distributed.worker - INFO -       Local Directory: /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/tmpqs_op4a3/dask-worker-space/worker-qrg8801h
2022-05-27 06:19:51,886 - distributed.worker - INFO - -------------------------------------------------
2022-05-27 06:19:51,887 - distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:56530
2022-05-27 06:19:51,887 - distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:56530
2022-05-27 06:19:51,887 - distributed.worker - INFO -          dashboard at:            127.0.0.1:56531
2022-05-27 06:19:51,887 - distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:56527
2022-05-27 06:19:51,887 - distributed.worker - INFO - -------------------------------------------------
2022-05-27 06:19:51,888 - distributed.worker - INFO -               Threads:                          2
2022-05-27 06:19:51,888 - distributed.worker - INFO -                Memory:                  14.00 GiB
2022-05-27 06:19:51,888 - distributed.worker - INFO -       Local Directory: /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/tmpqs_op4a3/dask-worker-space/worker-4bil5nqv
2022-05-27 06:19:51,888 - distributed.worker - INFO - -------------------------------------------------
2022-05-27 06:19:51,898 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://127.0.0.1:56528', name: 0, status: init, memory: 0, processing: 0>
2022-05-27 06:19:51,899 - distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:56528
2022-05-27 06:19:51,899 - distributed.core - INFO - Starting established connection
2022-05-27 06:19:51,900 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://127.0.0.1:56530', name: 1, status: init, memory: 0, processing: 0>
2022-05-27 06:19:51,901 - distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:56530
2022-05-27 06:19:51,901 - distributed.core - INFO - Starting established connection
2022-05-27 06:19:51,901 - distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:56527
2022-05-27 06:19:51,901 - distributed.worker - INFO - -------------------------------------------------
2022-05-27 06:19:51,902 - distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:56527
2022-05-27 06:19:51,903 - distributed.worker - INFO - -------------------------------------------------
2022-05-27 06:19:51,903 - distributed.core - INFO - Starting established connection
2022-05-27 06:19:51,903 - distributed.core - INFO - Starting established connection
2022-05-27 06:19:51,931 - distributed.scheduler - INFO - Receive client connection: Client-0499aa98-dd85-11ec-9af1-005056aba833
2022-05-27 06:19:51,932 - distributed.core - INFO - Starting established connection
2022-05-27 06:19:51,942 - distributed.nanny - INFO -         Start Nanny at: 'tcp://127.0.0.1:56536'
2022-05-27 06:19:53,926 - distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:56541
2022-05-27 06:19:53,926 - distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:56541
2022-05-27 06:19:53,926 - distributed.worker - INFO -          dashboard at:            127.0.0.1:56542
2022-05-27 06:19:53,926 - distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:56527
2022-05-27 06:19:53,926 - distributed.worker - INFO - -------------------------------------------------
2022-05-27 06:19:53,926 - distributed.worker - INFO -               Threads:                          2
2022-05-27 06:19:53,926 - distributed.worker - INFO -                Memory:                   9.33 GiB
2022-05-27 06:19:53,926 - distributed.worker - INFO -       Local Directory: /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/tmpqs_op4a3/dask-worker-space/worker-828ryjvx
2022-05-27 06:19:53,927 - distributed.worker - INFO - -------------------------------------------------
2022-05-27 06:19:54,756 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://127.0.0.1:56541', status: init, memory: 0, processing: 0>
2022-05-27 06:19:54,757 - distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:56541
2022-05-27 06:19:54,758 - distributed.core - INFO - Starting established connection
2022-05-27 06:19:54,758 - distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:56527
2022-05-27 06:19:54,758 - distributed.worker - INFO - -------------------------------------------------
2022-05-27 06:19:54,760 - distributed.core - INFO - Starting established connection
2022-05-27 06:19:55,242 - distributed.nanny - INFO - Nanny asking worker to close
2022-05-27 06:19:55,244 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:56541
2022-05-27 06:19:55,246 - distributed.worker - INFO - Connection to scheduler broken. Closing without reporting.  Status: Status.closing
2022-05-27 06:19:55,247 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://127.0.0.1:56541', status: closing, memory: 3, processing: 0>
2022-05-27 06:19:55,247 - distributed.core - INFO - Removing comms to tcp://127.0.0.1:56541
2022-05-27 06:19:55,508 - distributed.nanny - WARNING - Restarting worker
2022-05-27 06:19:57,479 - distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:56558
2022-05-27 06:19:57,479 - distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:56558
2022-05-27 06:19:57,479 - distributed.worker - INFO -          dashboard at:            127.0.0.1:56559
2022-05-27 06:19:57,480 - distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:56527
2022-05-27 06:19:57,480 - distributed.worker - INFO - -------------------------------------------------
2022-05-27 06:19:57,480 - distributed.worker - INFO -               Threads:                          2
2022-05-27 06:19:57,480 - distributed.worker - INFO -                Memory:                   9.33 GiB
2022-05-27 06:19:57,480 - distributed.worker - INFO -       Local Directory: /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/tmpqs_op4a3/dask-worker-space/worker-fo9bzbxy
2022-05-27 06:19:57,480 - distributed.worker - INFO - -------------------------------------------------
2022-05-27 06:19:58,265 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://127.0.0.1:56558', status: init, memory: 0, processing: 0>
2022-05-27 06:19:58,267 - distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:56558
2022-05-27 06:19:58,267 - distributed.core - INFO - Starting established connection
2022-05-27 06:19:58,268 - distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:56527
2022-05-27 06:19:58,268 - distributed.worker - INFO - -------------------------------------------------
2022-05-27 06:19:58,270 - distributed.core - INFO - Starting established connection
2022-05-27 06:20:00,251 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:56541
ConnectionRefusedError: [Errno 61] Connection refused
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/Users/runner/work/distributed/distributed/distributed/comm/core.py", line 289, in connect
    comm = await asyncio.wait_for(
  File "/Users/runner/miniconda3/envs/dask-distributed/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "/Users/runner/work/distributed/distributed/distributed/comm/tcp.py", line 451, in connect
    convert_stream_closed_error(self, e)
  File "/Users/runner/work/distributed/distributed/distributed/comm/tcp.py", line 148, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <distributed.comm.tcp.TCPConnector object at 0x13ff10df0>: ConnectionRefusedError: [Errno 61] Connection refused
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/Users/runner/work/distributed/distributed/distributed/worker.py", line 3277, in gather_dep
    response = await get_data_from_worker(
  File "/Users/runner/work/distributed/distributed/distributed/worker.py", line 4593, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/Users/runner/work/distributed/distributed/distributed/utils_comm.py", line 381, in retry_operation
    return await retry(
  File "/Users/runner/work/distributed/distributed/distributed/utils_comm.py", line 366, in retry
    return await coro()
  File "/Users/runner/work/distributed/distributed/distributed/worker.py", line 4570, in _get_data
    comm = await rpc.connect(worker)
  File "/Users/runner/work/distributed/distributed/distributed/core.py", line 1193, in connect
    return await connect_attempt
  File "/Users/runner/work/distributed/distributed/distributed/core.py", line 1129, in _connect
    comm = await connect(
  File "/Users/runner/work/distributed/distributed/distributed/comm/core.py", line 315, in connect
    raise OSError(
OSError: Timed out trying to connect to tcp://127.0.0.1:56541 after 5 s
2022-05-27 06:20:00,280 - distributed.nanny - INFO - Closing Nanny at 'tcp://127.0.0.1:56536'.
2022-05-27 06:20:00,281 - distributed.nanny - INFO - Nanny asking worker to close
2022-05-27 06:20:00,284 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:56558
2022-05-27 06:20:00,287 - distributed.worker - INFO - Connection to scheduler broken. Closing without reporting.  Status: Status.closing
2022-05-27 06:20:00,288 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://127.0.0.1:56558', status: closing, memory: 0, processing: 0>
2022-05-27 06:20:00,288 - distributed.core - INFO - Removing comms to tcp://127.0.0.1:56558
2022-05-27 06:20:00,590 - distributed.scheduler - INFO - Remove client Client-0499aa98-dd85-11ec-9af1-005056aba833
2022-05-27 06:20:00,591 - distributed.scheduler - INFO - Remove client Client-0499aa98-dd85-11ec-9af1-005056aba833
2022-05-27 06:20:00,591 - distributed.scheduler - INFO - Close client connection: Client-0499aa98-dd85-11ec-9af1-005056aba833
2022-05-27 06:20:00,593 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:56528
2022-05-27 06:20:00,594 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:56530
2022-05-27 06:20:00,598 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://127.0.0.1:56528', name: 0, status: closing, memory: 0, processing: 0>
2022-05-27 06:20:00,598 - distributed.core - INFO - Removing comms to tcp://127.0.0.1:56528
2022-05-27 06:20:00,599 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://127.0.0.1:56530', name: 1, status: closing, memory: 0, processing: 0>
2022-05-27 06:20:00,599 - distributed.core - INFO - Removing comms to tcp://127.0.0.1:56530
2022-05-27 06:20:00,599 - distributed.scheduler - INFO - Lost all workers
2022-05-27 06:20:00,599 - distributed.worker - INFO - Connection to scheduler broken. Closing without reporting.  Status: Status.closing
2022-05-27 06:20:00,600 - distributed.worker - INFO - Connection to scheduler broken. Closing without reporting.  Status: Status.closing
2022-05-27 06:20:00,605 - distributed.scheduler - INFO - Scheduler closing...
2022-05-27 06:20:00,606 - distributed.scheduler - INFO - Scheduler closing all comms
2022-05-27 06:20:00,768 - distributed.utils_perf - WARNING - full garbage collections took 31% CPU time recently (threshold: 10%)
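
For context on the PytestUnraisableExceptionWarning above: a minimal sketch, assuming plain CPython and a hypothetical leak_file_in_cycle helper (none of this is distributed's code), of how a leaked file object surfaces as an "unraisable" ResourceWarning only once gc.collect() runs:

import gc
import sys
import warnings


def leak_file_in_cycle(path):
    class Holder:
        pass

    h = Holder()
    h.f = open(path, "rb")  # never closed
    h.self_ref = h          # reference cycle: only the cyclic GC reclaims it


def main():
    warnings.simplefilter("error", ResourceWarning)
    leak_file_in_cycle(sys.executable)  # any readable file works
    # Mimics the gc.collect() call in distributed/utils_test.py: the file's
    # finalizer runs here and emits ResourceWarning ("unclosed file ...");
    # promoted to an error inside a finalizer, it cannot propagate, so
    # CPython reports it via sys.unraisablehook ("Exception ignored in: ...").
    gc.collect()


if __name__ == "__main__":
    main()

Because the warning fires inside a finalizer, pytest's unraisableexception plugin picks it up from sys.unraisablehook and re-raises it as the PytestUnraisableExceptionWarning seen in the traceback.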

@graingert
Member

graingert commented Nov 24, 2022

from #7249

2022-11-24 13:12:20,027 - distributed.nanny - ERROR - Error in Nanny killing Worker subprocess
Traceback (most recent call last):
  File "d:\a\distributed\distributed\distributed\nanny.py", line 603, in close
    await self.kill(timeout=timeout, reason=reason)
  File "d:\a\distributed\distributed\distributed\nanny.py", line 388, in kill
    await self.process.kill(reason=reason, timeout=0.8 * (deadline - time()))
  File "d:\a\distributed\distributed\distributed\nanny.py", line 826, in kill
    await process.join(wait_timeout)
  File "d:\a\distributed\distributed\distributed\process.py", line 311, in join
    assert self._state.pid is not None, "can only join a started process"
AssertionError: can only join a started process
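
The assertion text matches the guard that stdlib multiprocessing places on join(), and distributed's process wrapper performs the same check (distributed/process.py line 311 above). A minimal sketch, using stdlib multiprocessing only (not distributed's AsyncProcess), of how joining a never-started process trips it:

from multiprocessing import Process


def target():
    pass


if __name__ == "__main__":
    p = Process(target=target)
    # p.start() was never called, so the process has no pid yet; stdlib
    # multiprocessing guards join() with the same assertion text.
    try:
        p.join()
    except AssertionError as exc:
        print(f"join() before start() failed as expected: {exc}")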

@fjetter
Member

fjetter commented Dec 1, 2022

85562a0 fixes the problem mentioned in #6311 (comment), which is a race condition between starting and killing the WorkerProcess concurrently.

I believe there are still other issues with this test, but I can't tell for sure. Locally, I can trigger a couple of timeouts due to the 500ms connect timeout that is configured.
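
A minimal sketch of that race, with hypothetical names and asyncio stand-ins for the real subprocess machinery (this is not the actual fix in 85562a0): if kill() can run before start() has recorded a pid, the join path hits the assertion from the previous comment, and having kill() wait for startup to complete is one way to close the window:

import asyncio


class WorkerProcessSketch:
    """Hypothetical stand-in for distributed's WorkerProcess."""

    def __init__(self):
        self.pid = None
        self._started = asyncio.Event()

    async def start(self):
        await asyncio.sleep(0.1)  # stands in for spawning the subprocess
        self.pid = 12345
        self._started.set()

    async def kill(self):
        # Without this wait, a kill() scheduled concurrently with start()
        # can reach the join path while self.pid is still None.
        await self._started.wait()
        assert self.pid is not None, "can only join a started process"
        self.pid = None  # stands in for terminating and joining


async def main():
    proc = WorkerProcessSketch()
    # start() and kill() race, as when the test tears the nanny down
    # while the worker is still coming up.
    await asyncio.gather(proc.start(), proc.kill())


asyncio.run(main())

The connect timeout mentioned above is presumably the distributed.comm.timeouts.connect setting; raising it locally (e.g. dask.config.set({"distributed.comm.timeouts.connect": "5s"})) helps separate those timeouts from the start/kill race.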

@mrocklin
Member

I just ran into this (and it appears to be quite common). The commit @fjetter mentions above seems never to have been merged. Is there a story here?
