While running a profiling script that creates ~180 Coiled clusters (max 4 active at once), I'm hitting somewhat frequent TCP connection timeout errors during wait_for_workers, even with DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT=60s. Anecdotally, I've noticed it most commonly happens with my 50- or 100-worker clusters, less so with the 2- or 20-worker ones.
I've also tried with #5096 and I'm still seeing the errors, though I don't have enough data to say whether they're any less frequent.
Possibly relevant: I'm using 4 async Clients at once from the same process.
I’m not sure if this is a Dask problem, or a Coiled problem, or both. Wondering if anyone here has ideas on what to debug next. cc @fjetter @jacobtomlinson
Typical traceback:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/distributed/comm/tcp.py", line 376, in connect
    stream = await self.client.connect(
  File "/usr/local/lib/python3.9/site-packages/tornado/tcpclient.py", line 275, in connect
    af, addr, stream = await connector.start(connect_timeout=timeout)
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 492, in wait_for
    fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/distributed/comm/core.py", line 286, in connect
    comm = await asyncio.wait_for(
  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 494, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "//run.py", line 153, in run_trial
    tasks, runtime, batched_send_stats = await trial(**vars)
  File "//run.py", line 119, in trial
    await client.wait_for_workers(cluster_size)
  File "/usr/local/lib/python3.9/site-packages/distributed/client.py", line 1219, in _wait_for_workers
    info = await self.scheduler.identity()
  File "/usr/local/lib/python3.9/site-packages/distributed/core.py", line 863, in send_recv_from_rpc
    comm = await self.pool.connect(self.addr)
  File "/usr/local/lib/python3.9/site-packages/distributed/core.py", line 1051, in connect
    raise exc
  File "/usr/local/lib/python3.9/site-packages/distributed/core.py", line 1035, in connect
    comm = await fut
  File "/usr/local/lib/python3.9/site-packages/distributed/comm/core.py", line 312, in connect
    raise OSError(
OSError: Timed out trying to connect to tls://54.188.70.54:8786 after 60 s
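
One thing that might help narrow this down is wrapping the failing call in a small retry helper that logs how long each attempt took before the OSError surfaced. Below is a minimal sketch using only asyncio. The helper name retry_connect and its parameters (attempts, base_delay, log) are hypothetical and not part of distributed; it only assumes the wrapped call raises OSError on a connect timeout, as in the traceback above.

```python
import asyncio
import time


async def retry_connect(make_call, attempts=3, base_delay=1.0, log=print):
    """Await make_call() up to `attempts` times, retrying on OSError.

    make_call is a zero-argument callable returning a fresh awaitable,
    so that every attempt starts a new connection attempt.
    (Hypothetical helper for debugging; not a distributed API.)
    """
    for attempt in range(1, attempts + 1):
        start = time.monotonic()
        try:
            return await make_call()
        except OSError as exc:
            elapsed = time.monotonic() - start
            # Record how long the attempt ran before failing, to see
            # whether failures always hit the full connect timeout.
            log(f"attempt {attempt}/{attempts} failed after {elapsed:.1f}s: {exc}")
            if attempt == attempts:
                raise
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
```

In the script from the traceback this would be used roughly as `await retry_connect(lambda: client.wait_for_workers(cluster_size))`. If a retry succeeds right after a 60 s timeout, that would suggest a transient connection problem rather than a scheduler that is actually unreachable.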