
Timed out trying to connect ... : connect() didn't finish in time #4080

Open
mnarodovitch opened this issue Aug 27, 2020 · 36 comments

Comments

@mnarodovitch

mnarodovitch commented Aug 27, 2020

While the following is not a bug we can pinpoint, it really bothers us. After weeks of experimentation and investigation we haven't found a real solution.

We run something like

(a_bag
  .repartition(100_000) # outputs small partitions
  .map(very_cpu_heavy_computation) # outputs ~4MB partitions
  .repartition(1000).to_dataframe().to_parquet() # writes ~30MB parquet files from ~400MB in-memory-partitions
)

on 100 workers with 1500 cpus on Kubernetes pods.

When it comes to the last re-partition step, dask starts blocking the event loop and the workers start spamming the following logs

distributed.utils_comm - INFO - Retrying get_data_from_worker after exception in attempt 0/2: Timed out trying to connect to 'tcp://10.2.3.20:38449' after 20 s: Timed out trying to connect to 'tcp://10.2.3.20:38449' after 20 s: connect() didn't finish in time

=> 5000 events

distributed.core - INFO - Event loop was unresponsive in Worker for 4.95s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.

=> 26,094 events

We are working around that by setting

# less connections to avoid potential network saturation and connection timeouts
export DASK_DISTRIBUTED__WORKER__CONNECTIONS__OUTGOING=25
export DASK_DISTRIBUTED__WORKER__CONNECTIONS__INCOMING=5
export DASK_DISTRIBUTED__COMM__SOCKET_BACKLOG=16384

# graceful timeout and retry policy
export DASK_DISTRIBUTED__SCHEDULER__ALLOWED_FAILURES=20
export DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT=20
export DASK_DISTRIBUTED__COMM__RETRY__COUNT=2

so that the job finally finishes, despite the communication issues.
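For reference, the same settings can also be expressed through Dask's config system instead of environment variables; a minimal sketch is below (the dotted keys are the config entries these environment variables map to, and dask.config.set only affects the process it runs in, so for workers this would have to run e.g. in a preload script):

import dask

dask.config.set({
    "distributed.worker.connections.outgoing": 25,
    "distributed.worker.connections.incoming": 5,
    "distributed.comm.socket-backlog": 16384,
    "distributed.scheduler.allowed-failures": 20,
    "distributed.comm.timeouts.connect": "20s",
    "distributed.comm.retry.count": 2,
})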

Anyway, sometimes workers even start hanging and have to be restarted manually. Progress also slows down significantly towards the end of the job.

I'm sharing this with the community in the hope that somebody can give pointers on what to try, or even ideas for a resolution.

Environment:

  • Dask version: 2.23.0
  • Python version: 3.6
  • Operating System: Ubuntu on Kubernetes
  • Install method (conda, pip, source): pip
@mnarodovitch
Author

I suspect this is related to the handshake, which fails when the event loop is blocked:
https://github.com/dask/distributed/blob/master/distributed/comm/core.py#L297

@mrocklin
Member

I suspect this is related to the handshake, which fails when the event loop is blocked

Interesting. This was added recently. We could test this hypothesis relatively easily by walking back to Dask version 2.22 and seeing if the problem persists. Is this easy for you to do?

@mnarodovitch
Author

mnarodovitch commented Aug 27, 2020

Interesting. This was added recently. We could test this hypothesis relatively easily by walking back to Dask version 2.22 and seeing if the problem persists. Is this easy for you to do?

I had a look into our history; we tried 2.20, 2.21, and 2.23, all 'failing' with the timeouts. The runs are somewhat costly and take several hours. We haven't set up a cheap reproducer so far; this rather happens at scale.

@mnarodovitch
Author

Digging out more logs


distributed.worker - ERROR - Worker stream died during communication: tcp://10.2.37.70:44241
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/distributed/comm/core.py", line 237, in connect
    _raise(error)
  File "/usr/local/lib/python3.6/site-packages/distributed/comm/core.py", line 216, in _raise
    raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://10.2.37.70:44241' after 20 s: connect() didn't finish in time

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/distributed/worker.py", line 1983, in gather_dep
    self.rpc, deps, worker, who=self.address
  File "/usr/local/lib/python3.6/site-packages/distributed/worker.py", line 3258, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/usr/local/lib/python3.6/site-packages/distributed/utils_comm.py", line 390, in retry_operation
    operation=operation,
  File "/usr/local/lib/python3.6/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/usr/local/lib/python3.6/site-packages/distributed/worker.py", line 3235, in _get_data
    comm = await rpc.connect(worker)
  File "/usr/local/lib/python3.6/site-packages/distributed/core.py", line 1013, in connect
    **self.connection_args,
  File "/usr/local/lib/python3.6/site-packages/distributed/comm/core.py", line 249, in connect
    _raise(error)
  File "/usr/local/lib/python3.6/site-packages/distributed/comm/core.py", line 216, in _raise
    raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://10.2.37.70:44241' after 20 s: Timed out trying to connect to 'tcp://10.2.37.70:44241' after 20 s: connect() didn't finish in time

along with

distributed.utils_comm - INFO - Retrying get_data_from_worker after exception in attempt 0/2: Timed out trying to connect to 'tcp://10.2.3.20:38449' after 20 s: Timed out trying to connect to 'tcp://10.2.3.20:38449' after 20 s: connect() didn't finish in time

=> ~ 1,364 events

@mrocklin
Member

OK, if this was happening prior to 2.23 then it probably isn't the handshake.

I'm surprised to see the event loop being inactive for 5s. The next thing I would ask is what is slowing down the workers. One way to get this information is to create a performance_report, or to watch the dashboard and look at the /profile-server tab. That might give us a sense of what is getting in the way.
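A minimal sketch of wrapping the suspect step in a performance report is below. The scheduler address is a placeholder, and the toy bag stands in for the real workload; the performance_report context manager writes an HTML report covering whatever is computed inside it.

import dask.bag as bag
from dask.distributed import Client, performance_report

client = Client("tcp://scheduler:8786")  # placeholder scheduler address

# Wrap the suspect section; the toy bag below stands in for the real workload.
with performance_report(filename="repartition-report.html"):
    b = bag.from_sequence(range(1_000), partition_size=1)
    b.repartition(100).sum().compute()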

Separately, if I come up with some time I'll create a small snippet to test how long it takes for every worker to talk to every other worker. My guess is that this will return quickly, and that there is some issue with the combination of comms and whatever work is going on. I'm curious, is there anything suspect in the functions that you're running, or would this happen just as often with normal Pandas functions?

@mrocklin
Member

I would be curious what this returns on your system. On my laptop it's 15ms.

import asyncio
import time
from typing import List

from dask.distributed import Client, get_worker
from dask.utils import format_time


async def ping(addresses: List[str]):
    await asyncio.gather(*[
        get_worker().rpc(addr).identity()
        for addr in addresses
        if addr != get_worker().address
    ])


def all_ping(client: Client):
    workers = list(client.scheduler_info()["workers"])
    start = time.time()
    client.run(ping, workers)
    stop = time.time()
    print(format_time(stop - start))


if __name__ == "__main__":
    with Client() as client:
        all_ping(client)

Also, due to connection pooling, running this before your computation might make this smoother short term.
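For example, a hypothetical way to use that warm-up right before the real job (reusing the all_ping helper above; the toy bag is a stand-in for the actual workload):

import dask.bag as bag
from dask.distributed import Client

if __name__ == "__main__":
    with Client() as client:
        all_ping(client)  # pre-opens worker-to-worker comms via the connection pool
        b = bag.from_sequence(range(1_000), partition_size=1)
        b.repartition(10).sum().compute()  # stand-in for the real repartition job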

@mrocklin mrocklin reopened this Aug 27, 2020
@mnarodovitch
Author

mnarodovitch commented Aug 27, 2020

Thank you so much for looking into this! I'm happy to run all_ping and try the performance report to create better evidence on what's going on.
Will update the issue with more data in the upcoming week.

With respect to the functions I'm running: very_cpu_heavy_computation runs ML code in a child process spawned from a dask worker thread and outputs a list of dicts. From my point of view, this is not very suspect, as the child process should not affect the event loop or the GIL in any way. Dask performs very well on those tasks, without any hiccups.

The problems arise when it starts repartitioning the 100,000 output partitions down to 1,000.

I suspect the same would happen if I repartitioned a synthetic bag of comparable scale from 100,000 to 1,000 partitions. This is also something to try out on our end.

@mnarodovitch
Author

mnarodovitch commented Aug 28, 2020

Thanks again for looking into this. I got hooked on this topic and am happy to share a sort-of reproducer running on the full-scale cluster.

import dask.bag as bag

b = bag.from_sequence([1]*100_000, partition_size=1)
bb = b.map_partitions(lambda *_, **__: [b'1'*2**20]*5).persist()
bc = bb.repartition(1000).persist()

The reproducer produces connection errors and timeouts for the comm between workers and scheduler. This is not exactly the failure mode we have faced; in our case, the workers are losing connections between themselves. I'm sharing the results anyway.

I produced a performance report and uploaded it to https://bl.ocks.org/michaelnarodovitch/19923648bc21002b0a4c39556aea00f9.

I also produced verbose logs and share some highlights here. If somebody is interested, I can also share the logs themselves.

On the scheduler
distributed.core - DEBUG - Lost connection to <WORKER> while reading message: in <closed TCP>: Stream is closed. Last operation: None
=> 141,539 events

On the workers
distributed.utils_comm - INFO - Retrying functools.partial(<function PooledRPCCall.__getattr__.<locals>.send_recv_from_rpc at 0x7f71c27aa268>, keys=("('lambda-fe761bebe828abd999db601e6011be16', 55634)",)) after exception in attempt 0/2: in <closed TCP>: ConnectionResetError: [Errno 104] Connection reset by peer
=> 2,194 events

distributed.utils_comm - INFO - Retrying functools.partial(<function PooledRPCCall.__getattr__.<locals>.send_recv_from_rpc at 0x7ff3d7b1a840>, keys=("('lambda-fe761bebe828abd999db601e6011be16', 71105)",)) after exception in attempt 1/2: in <closed TCP>: TimeoutError: [Errno 110] Connection timed out
=> 11,370 events

OSError: Timed out trying to connect to '<SCHEDULER>' after 20 s: Timed out trying to connect to 'tcp://10.3.91.248:8786' after 20 s: connect() didn't finish in time
=> 39,922 events

Will share more information on our failure mode, when we run very_cpu_heavy_computation next time.

@mrocklin reporting back on all_ping: it executed in 15 ms. The network is very fast (AWS plus some k8s overlay).

@mrocklin
Member

Well, looking at the performance report, nothing comes up unfortunately (we are, entertainingly, spending most of our time in the workers logging). Our profiling isn't entirely sensitive here, though: if we're spending a lot of time in GIL-holding functions we wouldn't see that.

This example is great though in that it's really well isolated, and probably stresses only our networking and serialization stack. I'll take a longer look later today on a cluster.

@mrocklin
Member

@mrocklin reporting back on all_ping: it executed in 15 ms. The network is very fast (AWS plus some k8s overlay).

That's good to hear. I was curious about how things would scale there. Thanks!

@mrocklin
Member

OK, so I just ran this myself on a larger cluster. Here are the results

import coiled
import dask.bag as bag
from dask.distributed import Client

cluster = coiled.Cluster(n_workers=40)
client = Client(cluster)

# This works fine
b = bag.from_sequence([1]*10_000, partition_size=1)
bb = b.map_partitions(lambda *_, **__: [b'1'*2**20]*5)
bc = bb.repartition(1000)
bc = bc.persist()

# A larger size fails
b = bag.from_sequence([1]*40_000, partition_size=1)
bb = b.map_partitions(lambda *_, **__: [b'1'*2**20]*5)
bc = bb.repartition(1000).persist()

# Get logs
logs = cluster.get_logs()
print(list(logs.values())[-1])
distributed.nanny - INFO -         Start Nanny at: 'tls://10.1.11.97:36845'
distributed.worker - INFO -       Start worker at:     tls://10.1.11.97:32879
distributed.worker - INFO -          Listening to:     tls://10.1.11.97:32879
distributed.worker - INFO -          dashboard at:           10.1.11.97:33013
distributed.worker - INFO - Waiting to connect to: tls://ip-10-1-13-173.us-east-2.compute.internal:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                   17.18 GB
distributed.worker - INFO -       Local Directory: /dask-worker-space/dask-worker-space/worker-z4lq642y
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Starting Worker plugin <https://beta.coiled.io/preloads/aws-credentials.p-7c0bb291-0d76-4b28-8fe0-e9ff648b9c53
distributed.worker - INFO -         Registered to: tls://ip-10-1-13-173.us-east-2.compute.internal:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.batched - ERROR - Error in batched write
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/distributed/batched.py", line 93, in _background_send
    payload, serializers=self.serializers, on_error="raise"
  File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/opt/conda/lib/python3.7/site-packages/distributed/comm/tcp.py", line 241, in write
    stream.write(b)
  File "/opt/conda/lib/python3.7/site-packages/tornado/iostream.py", line 553, in write
    self._write_buffer.append(data)
  File "/opt/conda/lib/python3.7/site-packages/tornado/iostream.py", line 177, in append
    b += data  # type: ignore
BufferError: Existing exports of data: object cannot be re-sized

cc @jakirkham: as the new bytes-handling expert, does the error message above mean anything to you?

@mnarodovitch
Author

That's something new to me :) Thanks for trying it out.

I dug a little into the comm code and didn't find anything suspicious there. I can't rule out that the connection timeouts on our end are related to our networking setup. I will update here if I find anything interesting to share.

I suppose the reproducer captures the problem (on our end) quite well.

Maybe some other folks are also facing timeouts and can share their experiences here.

@mrocklin
Member

mrocklin commented Sep 5, 2020

@quasiben is anyone on your team able to help deal with the comms issue here? It seems like it might be related to recent changes.

@jakirkham
Member

jakirkham commented Sep 9, 2020

This part of the traceback above sticks out to me.

  File "/opt/conda/lib/python3.7/site-packages/distributed/comm/tcp.py", line 241, in write
    stream.write(b)

Looking at the code, it appears to be sending bytes (see below). This happens regardless of what was serialized (as it is always turned into a bytes object). That code also appears to be unchanged for ~1.5 years, so I am somewhat doubtful that this is the issue.

if sum(lengths) < 2 ** 17:  # 128kiB
    b = b"".join(length_bytes + frames)  # small enough, send in one go
    stream.write(b)

The error itself relates to how taking views of underlying buffers may result in them being immutable. Here's an example with a BytesIO object below (note this is only meant to be illustrative).

import io

b = io.BytesIO()
b.write(b"abc")

m = b.getbuffer()  # buffer cannot be mutated as it is viewed now

b.write(b"def")    # error trying to mutate buffer that is viewed

The way one typically fixes this is to release the view before trying to mutate the buffer.

import io

b = io.BytesIO()
b.write(b"abc")

m = b.getbuffer()
m.release()        # release view after doing stuff

b.write(b"def")    # now able to mutate buffer without issues

What can we draw from these two things? Well, the error is invariant to serialization. The error happens in Tornado when trying to pack bytes into some underlying stream. Also, Tornado likely fails to do this because it has taken some view of this underlying buffer.

Doing a quick search, I found this SO question. Ben says there that this may be thread-related (like one thread viewing this buffer while another is trying to modify it). That SO question links to an old Tornado PR (tornadoweb/tornado#2008) and then, curiously, to an old Distributed issue (#1704).

So my best guess is that maybe we are doing something unsafe with thread usage around Tornado. Barring that, maybe there is still a Tornado bug lurking. HTH
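To illustrate the thread-safety rule in question (this is only a sketch, not distributed's actual code): Tornado's IOLoop and IOStream objects are not thread-safe, so a write initiated from another thread should be handed off to the loop's thread via IOLoop.add_callback instead of calling stream.write directly.

from tornado.ioloop import IOLoop
from tornado.iostream import BaseIOStream


def write_threadsafe(loop: IOLoop, stream: BaseIOStream, data: bytes) -> None:
    # Calling stream.write(data) directly from a non-loop thread is unsafe;
    # add_callback is the one documented thread-safe entry point, so we
    # schedule the write on the thread running the event loop instead.
    loop.add_callback(stream.write, data)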

@mrocklin
Member

This just came up in a Pangeo issue (see #4128 linked above). @jakirkham @quasiben would you all be able to take ownership over this problem? I think that you all are probably the best positioned to fix it, if you have the time.

@jakirkham
Member

I think I've already provided as much insight as I can above.

@mrocklin
Member

Understood. Someone needs to dig in here, figure out the right way forward, and fix the problem. I'm curious if either of you are willing to take this on.

@mnarodovitch
Author

mnarodovitch commented Sep 26, 2020

Gave it another try with dask/distributed 2.7.0 and dask/distributed 2.0.0; same problem.

Wondering what happens with BatchedSend if the scheduler event loop is blocked, as indicated by the logs:

distributed.core - INFO - Event loop was unresponsive in Scheduler for 6.87s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.

Note: we still cannot exclude that this is due to networking issues on our end.

@quasiben
Member

I can take a look at this early next week

@KrishanBhasin
Contributor

@michaelnarodovitch This sounds very similar to the issues I have been having in #3878 - these logs are extremely similar to what I have been seeing:

On the scheduler
distributed.core - DEBUG - Lost connection to <WORKER> while reading message: in <closed TCP>: Stream is closed. Last operation: None
=> 141,539 events

On the workers
distributed.utils_comm - INFO - Retrying functools.partial(<function PooledRPCCall.__getattr__.<locals>.send_recv_from_rpc at 0x7f71c27aa268>, keys=("('lambda-fe761bebe828abd999db601e6011be16', 55634)",)) after exception in attempt 0/2: in <closed TCP>: ConnectionResetError: [Errno 104] Connection reset by peer
=> 2,194 events

distributed.utils_comm - INFO - Retrying functools.partial(<function PooledRPCCall.__getattr__.<locals>.send_recv_from_rpc at 0x7ff3d7b1a840>, keys=("('lambda-fe761bebe828abd999db601e6011be16', 71105)",)) after exception in attempt 1/2: in <closed TCP>: TimeoutError: [Errno 110] Connection timed out
=> 11,370 events

OSError: Timed out trying to connect to '<SCHEDULER>' after 20 s: Timed out trying to connect to 'tcp://10.3.91.248:8786' after 20 s: connect() didn't finish in time
=> 39,922 events

Unfortunately I do not have anything of value to add other than potentially linking the issues, apologies.

@mnarodovitch
Author

mnarodovitch commented Sep 29, 2020

Thanks for sharing, @KrishanBhasin. Also related:
#4118 - also using a large cluster and experiencing flappy connections (162 workers)
#4133 - BufferError: Existing exports of data: object cannot be re-sized
#4113 - flappy connections, CommClosedError()

@jcrist
Member

jcrist commented Oct 6, 2020

We've found a similar issue when running Prefect with inproc comms. It was working fine on 2.22.0 and fails with non-deterministic timeout errors pretty consistently after 2.23.0. I've tried to reproduce it without Prefect, but am struggling. I'll try for a bit more tomorrow; otherwise I'll post a reproducer with Prefect involved.

@KrishanBhasin
Contributor

On 2.30.0, I am still seeing the below on workers:
distributed.worker - ERROR - Handle missing dep failed, retrying Traceback [....] Timed out trying to connect to 'tcp://127.0.0.1:46171' after 90 s: connect() didn't finish in time

distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:34761 [...] Timed out trying to connect to 'tcp://127.0.0.1:34761' after 90 s: connect() didn't finish in time

and the below on the scheduler:
Timed out trying to connect to 'tcp://127.0.0.1:37947' after 90 s: Timed out trying to connect to 'tcp://127.0.0.1:37947' after 90 s: in <distributed.comm.tcp.TCPConnector object at 0x7f628baa2910>: ConnectionRefusedError: [Errno 111] Connection refused

@mnarodovitch
Author

mnarodovitch commented Oct 8, 2020

What has helped quite a bit so far:

export DASK_DISTRIBUTED__SCHEDULER__ALLOWED_FAILURES=20
export DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT=5
export DASK_DISTRIBUTED__COMM__RETRY__COUNT=10

A long connect timeout effectively makes the workers saturate the scheduler socket. A short timeout plus backoff protects the scheduler listener by spreading the connections out in time. Failed tasks (they still occur) get rescheduled instead of failing the whole job.

if time() < deadline:
    logger.debug("Could not connect, waiting before retrying")
    await asyncio.sleep(backoff)
    backoff *= random.randrange(140, 160) / 100
    retry_timeout_backoff *= random.randrange(140, 160) / 100
    backoff = min(backoff, 1)  # wait at most one second

@KrishanBhasin
Contributor

KrishanBhasin commented Oct 12, 2020

@michaelnarodovitch thanks for sharing that. I have just tried running your suggested config, and am finding that the issue still occurs (but much less).

My code does a lot of merging towards the end of the job, where I am still getting [Errno 111] Connection refused or [Errno 32] Broken pipe in my workers' logs, followed by a huge loss of progress (I assume because the worker gets removed, so all of its keys have to be recomputed from scratch).

This loss of work is very time-expensive for me (~1h each time), so I will try to find time to investigate further.

Additionally, my scheduler logs are getting hammered with the error message below about 35 times per second, which may or may not be related (these appear to relate to the diagnostics dashboard):

2020-10-12 16:18:15,613 - distributed.scheduler - INFO - Plugin failed with exception
Traceback (most recent call last):
  File "/home/kb/distributed/distributed/scheduler.py", line 4757, in transition
    plugin.transition(key, start, finish2, *args, **kwargs)
  File "/home/kb/distributed/distributed/diagnostics/websocket.py", line 64, in transition
    self.socket.send("transition", data)
  File "/home/kb/distributed/distributed/http/scheduler/info.py", line 177, in send
    self.write_message(data)
  File "/home/kb/.conda/envs/dc/lib/python3.7/site-packages/tornado/websocket.py", line 339, in write_message
    raise WebSocketClosedError()
tornado.websocket.WebSocketClosedError

Within the scheduler logs there are also a few of these below, which appear to be related to removing the workers that throw up the Connection refused error:

Remove worker <Worker 'tcp://127.0.0.1:34177', name: 14, memory: 96, processing: 1>

Traceback (most recent call last):
  File "/home/kb/distributed/distributed/scheduler.py", line 2284, in remove_worker
    result = plugin.remove_worker(scheduler=self, worker=address)
  File "/home/kb/distributed/distributed/diagnostics/websocket.py", line 21, in remove_worker
    self.socket.send("remove_worker", {"worker": worker})
  File "/home/kb/distributed/distributed/http/scheduler/info.py", line 177, in send
    self.write_message(data)
  File "/home/kb/.conda/envs/dc/lib/python3.7/site-packages/tornado/websocket.py", line 339, in write_message
    raise WebSocketClosedError()
tornado.websocket.WebSocketClosedError

It may be worth noting that I am exclusively running a LocalCluster with 50-60 workers with a total RAM of ~1TB.

@jcrist
Member

jcrist commented Oct 14, 2020

This might be fixed by #4167, can you try that branch and see if it works for you?

@KrishanBhasin
Contributor

Has anyone here found that distributed==2.30.1 has solved their problems?

My workers continue to time out attempting to communicate with each other, on a 50-node LocalCluster

@fjetter
Member

fjetter commented Nov 5, 2020

@KrishanBhasin can you post a traceback to see where the timeout occurs?

@KrishanBhasin
Contributor

KrishanBhasin commented Nov 9, 2020

Hey @fjetter
First:
Apologies for the tone of my last message; I flinch a little when I reread what I wrote. I was hoping to verify whether I was alone in continuing to have problems (therefore likely a problem on my end), or if there were still problems in the wider community.

Second:
It appears that my continued problems were due to data-specific issues on my end, and that 2.30.1 has indeed helped fix our issues. My team has made some changes to the data that feeds into our process, and we no longer seem to encounter these errors. If they do resurface, I will be sure to share the traceback here.

@fjetter
Member

fjetter commented Nov 9, 2020

Apologies for the tone of my last message; I flinch a little when I reread what I wrote.

Don't worry, I didn't even notice anything. I was just a bit upset that the issue was not resolved :)

that 2.30.1 has indeed helped fix our issues.

Glad to hear!

@chenkonturek

@KrishanBhasin can you post a traceback to see where the timeout occurs?

Hi @fjetter,

Regarding the timeout errors that @KrishanBhasin mentioned above, here is the traceback (obtained while using distributed==2.30.1). Hopefully this is useful.

Heartbeat: tcp://127.0.0.1:36847
Worker stream died during communication: tcp://127.0.0.1:35259
Traceback (most recent call last):
  File "~/.conda/envs/test/lib/python3.7/site-packages/distributed/comm/core.py", line 288, in connect
    timeout=min(intermediate_cap, time_left()),
  File "~/.conda/envs/test/lib/python3.7/asyncio/tasks.py", line 449, in wait_for
    raise futures.TimeoutError()
concurrent.futures._base.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "~/.conda/envs/test/lib/python3.7/site-packages/distributed/worker.py", line 1980, in gather_dep
    self.rpc, deps, worker, who=self.address
  File "~/.conda/envs/test/lib/python3.7/site-packages/distributed/worker.py", line 3255, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "~/.conda/envs/test/lib/python3.7/site-packages/distributed/utils_comm.py", line 390, in retry_operation
    operation=operation,
  File "~/.conda/envs/test/lib/python3.7/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "~/.conda/envs/test/lib/python3.7/site-packages/distributed/worker.py", line 3232, in _get_data
    comm = await rpc.connect(worker)
  File "~/.conda/envs/test/lib/python3.7/site-packages/distributed/core.py", line 1035, in connect
    **self.connection_args,
  File "~/.conda/envs/test/lib/python3.7/site-packages/distributed/comm/core.py", line 310, in connect
    ) from active_exception
OSError: Timed out trying to connect to tcp://127.0.0.1:35259 after 10 s

@fjetter
Member

fjetter commented Nov 18, 2020

This traceback looks "ordinary" and might just be an ordinary timeout. Especially if the cluster is under heavy load, this might just be expected behaviour. There is an ongoing discussion about increasing the default timeout value to something larger, see #4228. You might want to try increasing the value and see if that resolves the problem (try 30s or 60s if your cluster is under extreme stress).

@riedel
Member

riedel commented May 5, 2021

Is there a way to identify deadlock situations that are related to network timeouts? Is this a typical log for such a situation? Our computation suddenly halts and does not seem to move forward or backward. Is there a way to recover from such situations, e.g. by killing a specific worker or, better, by timing something out? When I kill a random worker, the dependent data gets rebuilt and I progress, e.g., from 125102/189776 to 125514/189776 split-shuffle tasks done.

Is this the situation described above, or something different?

Does this look normal (I'm seeing similar timeouts)?

 distributed.worker - INFO - Dependent not found: ('split-shuffle-0-fb3b840129241743e8d1ecca26cce38e', 10, (2, 14, 6)) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying
Traceback (most recent call last):
  File "/smartdata/iu5681/.conda/envs/latentagei4wafer/lib/python3.9/site-packages/distributed/comm/tcp.py", line 379, in connect
    stream = await self.client.connect(
  File "/smartdata/iu5681/.conda/envs/latentagei4wafer/lib/python3.9/site-packages/tornado/tcpclient.py", line 265, in connect
    addrinfo = await self.resolver.resolve(host, port, af)
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/smartdata/iu5681/.conda/envs/latentagei4wafer/lib/python3.9/asyncio/tasks.py", line 492, in wait_for
    fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/smartdata/iu5681/.conda/envs/latentagei4wafer/lib/python3.9/site-packages/distributed/comm/core.py", line 285, in connect
    comm = await asyncio.wait_for(
  File "/smartdata/iu5681/.conda/envs/latentagei4wafer/lib/python3.9/asyncio/tasks.py", line 494, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/smartdata/iu5681/.conda/envs/latentagei4wafer/lib/python3.9/site-packages/distributed/worker.py", line 2335, in handle_missing_dep
    who_has = await retry_operation(
  File "/smartdata/iu5681/.conda/envs/latentagei4wafer/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/smartdata/iu5681/.conda/envs/latentagei4wafer/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/smartdata/iu5681/.conda/envs/latentagei4wafer/lib/python3.9/site-packages/distributed/core.py", line 859, in send_recv_from_rpc
    comm = await self.pool.connect(self.addr)
  File "/smartdata/iu5681/.conda/envs/latentagei4wafer/lib/python3.9/site-packages/distributed/core.py", line 1010, in connect
    comm = await connect(
  File "/smartdata/iu5681/.conda/envs/latentagei4wafer/lib/python3.9/site-packages/distributed/comm/core.py", line 307, in connect
    raise IOError(
OSError: Timed out trying to connect to tcp://141.52.224.191:37350 after 3 s

distributed.worker - INFO - Dependent not found: ('split-shuffle-0-fb3b840129241743e8d1ecca26cce38e', 10, (15, 14, 6)) 0 . Asking scheduler

distributed.worker - INFO - Can't find dependencies for key ('shuffle-0-fb3b840129241743e8d1ecca26cce38e', 1770)

distributed.worker - INFO - Can't find dependencies for key ('shuffle-0-fb3b840129241743e8d1ecca26cce38e', 1770)

distributed.worker - INFO - Can't find dependencies for key ('shuffle-0-fb3b840129241743e8d1ecca26cce38e', 1770) 

and many more like this...

None of the workers seem to have actually failed on the cluster (here 160 CPUs on 40 workers).
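For what it's worth, a hypothetical sketch of doing the "kill a worker" recovery step from the client instead of by hand (Client.retire_workers gracefully removes a worker and lets the scheduler reschedule or rebuild what was lost; whether that actually breaks the deadlock is exactly the open question here, and the scheduler address below is a placeholder):

from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder scheduler address

# Pick one worker address; choosing which worker is stuck is the hard part.
stuck = list(client.scheduler_info()["workers"])[0]
client.retire_workers([stuck])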

@jakirkham
Member

That sounds like something different. @riedel could you please raise a new issue with a minimal reproducer?

@fjetter
Member

fjetter commented May 5, 2021

@riedel there is a stale issue #4133 and some more recent ones about the general issue: #4413 and maybe #4439.

@fjetter
Member

fjetter commented Jun 18, 2021

In case people are still bothered by deadlocks:

We've recently merged an important PR addressing a few error handling edge cases which caused unrecoverable deadlocks.
These deadlocks were associated with failing workers, connection failures, or host-co-located workers. All of these issues could be connected to fetching dependencies, so dense, highly connected task graphs were more likely to be affected. Ultimately, the deadlocks were caused by subtle race conditions, which made them hard to reproduce, and some of them cannot be correlated to any user-facing logs, which is why I cannot say for certain whether your issue is fixed.
I would encourage you to try out the latest changes on main and/or wait for the upcoming release later today. Feedback on whether your issue is resolved is highly appreciated!

Deadlock fix #4784
Upcoming release dask/community#165
