TCP timeouts connecting to cluster

While running a [profiling script](https://github.com/gjoseph92/coiled-parameter-sweep-profiling/blob/main/run.py) that creates ~180 Coiled clusters (max 4 active at once), I'm hitting somewhat frequent TCP connection timeout errors during `wait_for_workers`, even with `DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT=60s`. Anecdotally, I've noticed it most commonly happens with my 50- or 100-worker clusters, less so with the 2- or 20-worker ones.

I've also tried with https://github.com/dask/distributed/pull/5096 and I'm still seeing the errors, though I don't have enough data to say whether they're any less frequent.
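One way to put numbers on "less frequent" would be to tally connect timeouts per cluster size across runs. The following is only a sketch, not code from the profiling script: `tracked`, `failure_rates`, and the `make_trial` callable are hypothetical names, and it assumes the failure surfaces as the `OSError` shown in the traceback below.

```python
import asyncio
from collections import Counter

attempts = Counter()  # cluster_size -> number of trials started
timeouts = Counter()  # cluster_size -> number of trials that raised OSError


async def tracked(cluster_size, make_trial):
    """Await make_trial() and record whether it failed with OSError.

    `make_trial` is a hypothetical zero-argument callable returning a
    coroutine (for example, one that creates a cluster and waits for workers).
    """
    attempts[cluster_size] += 1
    try:
        return await make_trial()
    except OSError:
        # Count the failure, then re-raise so the caller still sees it.
        timeouts[cluster_size] += 1
        raise


def failure_rates():
    """Return {cluster_size: fraction of trials that raised OSError}."""
    return {size: timeouts[size] / attempts[size] for size in attempts}
```

Comparing `failure_rates()` with and without the PR applied, over the same set of cluster sizes, would show whether the difference is more than noise.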

Maybe notable is that I'm using 4 async Clients at once from the same process.
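For context, this is roughly the concurrency pattern I mean: many trials scheduled on one event loop, with at most 4 running at a time. It's a simplified sketch rather than the actual `run.py`; `run_all` and the `trial` argument are stand-in names, and the real trial would create a cluster and an async `Client`.

```python
import asyncio

MAX_ACTIVE = 4  # at most 4 clusters (and Clients) active at once


async def run_all(trial, param_sets, max_active=MAX_ACTIVE):
    """Run trial(**params) for every params dict, max_active at a time.

    `trial` is a stand-in for a coroutine function that would create a
    cluster, connect an async Client, and wait for workers.
    """
    sem = asyncio.Semaphore(max_active)

    async def guarded(params):
        # All trials share one process and one event loop; the semaphore
        # only limits how many are in flight simultaneously.
        async with sem:
            return await trial(**params)

    # return_exceptions=True keeps one failed trial from cancelling the rest.
    return await asyncio.gather(
        *(guarded(p) for p in param_sets), return_exceptions=True
    )
```

Since every trial shares the same event loop, anything that blocks the loop in one trial could delay connection handshakes in the others, which is why the shared-process detail seems worth mentioning.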

I’m not sure whether this is a Dask problem, a Coiled problem, or both. Does anyone here have ideas on what to debug next? cc @fjetter @jacobtomlinson

Typical traceback:
```python
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/distributed/comm/tcp.py", line 376, in connect
    stream = await self.client.connect(
  File "/usr/local/lib/python3.9/site-packages/tornado/tcpclient.py", line 275, in connect
    af, addr, stream = await connector.start(connect_timeout=timeout)
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 492, in wait_for
    fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/distributed/comm/core.py", line 286, in connect
    comm = await asyncio.wait_for(
  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 494, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "//run.py", line 153, in run_trial
    tasks, runtime, batched_send_stats = await trial(**vars)
  File "//run.py", line 119, in trial
    await client.wait_for_workers(cluster_size)
  File "/usr/local/lib/python3.9/site-packages/distributed/client.py", line 1219, in _wait_for_workers
    info = await self.scheduler.identity()
  File "/usr/local/lib/python3.9/site-packages/distributed/core.py", line 863, in send_recv_from_rpc
    comm = await self.pool.connect(self.addr)
  File "/usr/local/lib/python3.9/site-packages/distributed/core.py", line 1051, in connect
    raise exc
  File "/usr/local/lib/python3.9/site-packages/distributed/core.py", line 1035, in connect
    comm = await fut
  File "/usr/local/lib/python3.9/site-packages/distributed/comm/core.py", line 312, in connect
    raise OSError(
OSError: Timed out trying to connect to tls://54.188.70.54:8786 after 60 s
```
