TCP timeouts connecting to cluster #5099

@gjoseph92

While running a profiling script that creates ~180 Coiled clusters (max 4 active at once), I'm hitting somewhat frequent TCP connection timeout errors during wait_for_workers, even with DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT=60s. Anecdotally, I've noticed it most commonly happens with my 50- or 100-worker clusters, less so with the 2- or 20-worker ones.
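For reference, the environment variable above should correspond to the Dask config key `distributed.comm.timeouts.connect`. Below is a minimal sketch of that name mapping, assuming Dask's usual convention (drop the `DASK_` prefix, lowercase, turn `__` into `.`). `env_to_config_key` is a hypothetical helper written for illustration, not a Dask function.

```python
def env_to_config_key(name: str) -> str:
    """Approximate how a DASK_* environment variable name maps to a config key.

    Hypothetical helper for illustration only; Dask does this internally.
    """
    prefix = "DASK_"
    if not name.startswith(prefix):
        raise ValueError(f"not a Dask environment variable: {name!r}")
    # Double underscores separate nesting levels; the rest is lowercased.
    return name[len(prefix):].lower().replace("__", ".")


key = env_to_config_key("DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT")
print(key)
```

If that mapping holds, checking `dask.config.get("distributed.comm.timeouts.connect")` in the client process is a quick way to confirm the 60s value was actually picked up.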

I've also tried with #5096 and I'm still seeing the errors, though I can't say statistically if they're any less frequent.

Maybe notable is that I'm using 4 async Clients at once from the same process.
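To make the setup concrete, here is a rough stdlib-only sketch of the concurrency pattern: many trials in one event loop, with at most 4 in flight. It is not the actual `run.py`. `run_trial`, `MAX_ACTIVE`, and the sleep that stands in for cluster creation and `wait_for_workers` are all placeholders.

```python
import asyncio

MAX_ACTIVE = 4  # at most 4 clusters/clients alive at once, as described above


async def run_trial(trial_id: int, limiter: asyncio.Semaphore, active: list) -> int:
    """Placeholder trial; the real one would create a cluster and an async Client."""
    async with limiter:
        active.append(trial_id)
        peak = len(active)          # number of trials in flight right now
        await asyncio.sleep(0.001)  # stands in for wait_for_workers + the workload
        active.remove(trial_id)
        return peak


async def main(n_trials: int = 180) -> int:
    limiter = asyncio.Semaphore(MAX_ACTIVE)
    active: list = []
    peaks = await asyncio.gather(
        *(run_trial(i, limiter, active) for i in range(n_trials))
    )
    return max(peaks)


peak_concurrency = asyncio.run(main())
print(peak_concurrency)
```

All trials share a single event loop here, so anything that blocks the loop in one trial could delay connection handshakes in the others. That may be worth ruling out.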

I’m not sure if this is a Dask problem, or a Coiled problem, or both. Wondering if anyone here has ideas on what to debug next. cc @fjetter @jacobtomlinson

Typical traceback:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/distributed/comm/tcp.py", line 376, in connect
    stream = await self.client.connect(
  File "/usr/local/lib/python3.9/site-packages/tornado/tcpclient.py", line 275, in connect
    af, addr, stream = await connector.start(connect_timeout=timeout)
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 492, in wait_for
    fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/distributed/comm/core.py", line 286, in connect
    comm = await asyncio.wait_for(
  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 494, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "//run.py", line 153, in run_trial
    tasks, runtime, batched_send_stats = await trial(**vars)
  File "//run.py", line 119, in trial
    await client.wait_for_workers(cluster_size)
  File "/usr/local/lib/python3.9/site-packages/distributed/client.py", line 1219, in _wait_for_workers
    info = await self.scheduler.identity()
  File "/usr/local/lib/python3.9/site-packages/distributed/core.py", line 863, in send_recv_from_rpc
    comm = await self.pool.connect(self.addr)
  File "/usr/local/lib/python3.9/site-packages/distributed/core.py", line 1051, in connect
    raise exc
  File "/usr/local/lib/python3.9/site-packages/distributed/core.py", line 1035, in connect
    comm = await fut
  File "/usr/local/lib/python3.9/site-packages/distributed/comm/core.py", line 312, in connect
    raise OSError(
OSError: Timed out trying to connect to tls://54.188.70.54:8786 after 60 s
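The chain in this traceback (`CancelledError`, then `TimeoutError`, then `OSError`) is what a connect call wrapped in `asyncio.wait_for` would be expected to produce when the deadline passes. The following stdlib-only sketch is shaped after the traceback and is not distributed's actual code. `slow_connect` is a stand-in that never finishes in time, and the address is a documentation-range placeholder that is never contacted.

```python
import asyncio


async def slow_connect(addr: str) -> str:
    """Stand-in for a TCP/TLS handshake that does not complete in time."""
    await asyncio.sleep(10)
    return addr


async def connect(addr: str, timeout: float) -> str:
    try:
        # wait_for cancels the inner coroutine once the deadline passes...
        return await asyncio.wait_for(slow_connect(addr), timeout)
    except asyncio.TimeoutError as exc:
        # ...and the timeout is re-raised as OSError, chained to the original.
        raise OSError(
            f"Timed out trying to connect to {addr} after {timeout} s"
        ) from exc


async def main() -> OSError:
    try:
        await connect("tls://192.0.2.1:8786", timeout=0.01)
    except OSError as err:
        return err
    raise AssertionError("expected a timeout")


error = asyncio.run(main())
print(repr(error), "caused by", repr(error.__cause__))
```

Read this way, the `CancelledError` frames are just the timeout firing on a connection attempt that was still pending, not a separate failure.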
