I encountered this issue while running a Django server with gunicorn, serving remote servers whose Docker services occasionally send requests to gunicorn.
Bug description
On some remote servers, the connection always times out once the response reaches a certain size (for example, 1 KB or more). When this happens, the gunicorn worker gets stuck in the sock.sendall(data) loop: until the worker timeout fires for the sync worker, or forever for the other workers.
Here is the stack trace with the sync worker:
[2024-02-16 00:19:18 +0000] [448] [ERROR] Error handling request
Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/gunicorn/workers/sync.py", line 184, in handle_request
resp.write(item)
File "/usr/local/lib/python3.11/site-packages/gunicorn/http/wsgi.py", line 346, in write
util.write(self.sock, arg, self.chunked)
File "/usr/local/lib/python3.11/site-packages/gunicorn/util.py", line 299, in write
sock.sendall(data)
File "/usr/local/lib/python3.11/site-packages/gunicorn/workers/base.py", line 202, in handle_abort
self.cfg.worker_abort(self)
File "/app/config/gunicorn_configs.py", line 58, in worker_abort
raise Exception(f"Gunicorn worker aborted: {worker}")
When I switch to gthread, the thread consistently gets stuck for around 952 seconds with the following error (same stack trace):
TimeoutError: [Errno 110] Connection timed out
The cause
After some research and testing, I can narrow down the cause of this bug to a combination of 2 separate problems:
Docker network problem: I can't pinpoint the exact cause, since I have thousands of remote servers with supposedly identical configurations, and only 5 out of them trigger this bug. My best guess is a combination of VPN and corrupt Docker network.
gunicorn socket handling: there is no timeout on socket operations, which can lead to a deadlock in specific cases like this one, where the client has already timed out but gunicorn keeps sending the response regardless.
I can reproduce this bug consistently by sending requests from the 5 servers I mentioned, but honestly I don't know how to reproduce it on other systems.
Anyway, regardless of the root cause of the first problem, the second problem, the deadlock on the gunicorn socket, is real and should be dealt with. If anyone figures out a way to reproduce it reliably, they can easily block or time out all workers.
Workaround
To deal with this problem on my systems where I use gthread, I decided to monkey patch the util.write method:
It sets the socket timeout temporarily for the duration of the method and resets it to the original value at the end. This way, there is zero impact on the sock object outside of this method.
It works well, and the worker thread always times out correctly instead of getting stuck.
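The patch itself is not shown above. A minimal sketch of the described approach (the helper name and the 30-second default are illustrative, not gunicorn's API) demonstrates the temporary-settimeout pattern on a plain socket; in practice you would apply the same wrapper around gunicorn.util.write:

```python
import socket

def write_with_timeout(sock, data, timeout=30.0):
    """Send data with a temporary timeout, then restore the socket's
    original timeout so nothing outside this call is affected."""
    old_timeout = sock.gettimeout()
    sock.settimeout(timeout)
    try:
        sock.sendall(data)
    finally:
        sock.settimeout(old_timeout)

# demonstration on a local socket pair
a, b = socket.socketpair()
write_with_timeout(a, b"hello", timeout=5.0)
assert b.recv(5) == b"hello"
assert a.gettimeout() is None  # original blocking mode restored
a.close()
b.close()
```

The try/finally guarantees the original timeout is restored even when sendall raises, which is what keeps the patch invisible to the rest of the worker.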
Proposed long term solution
For all workers other than sync, I suggest imposing the timeout setting on the socket object with settimeout. This makes sense, since socket operations should never exceed the worker timeout anyway.
If that is not feasible due to other constraints, we could instead allow the util.write method to be called with a timeout as an additional parameter.
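To illustrate why settimeout prevents the hang: once a timeout is set, sendall raises instead of blocking forever when the peer stops draining the connection. A small self-contained demonstration (the 0.1 s timeout and 64 KB chunk size are arbitrary):

```python
import socket

a, b = socket.socketpair()
a.settimeout(0.1)  # bound every blocking operation on this socket
# b never reads, simulating a client that has silently gone away

timed_out = False
try:
    # keep writing until the kernel buffers fill; without the timeout
    # this loop would block inside sendall indefinitely
    while True:
        a.sendall(b"x" * 65536)
except socket.timeout:
    timed_out = True

assert timed_out
a.close()
b.close()
```

With no timeout set (the current behavior), the same loop blocks in sendall until TCP itself gives up, which matches the ~952-second hang observed with gthread.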
I can make a PR to address this issue. What do you think?
I found the root cause and how to reproduce the issue reliably.
My application is connected to a WireGuard VPN tunnel and receives requests from remote servers via the VPN. If there is a mismatched MTU between the WireGuard interface and the remote server's network, the response from gunicorn is never received by the client.
However, gunicorn keeps sending data to the socket and is basically stuck, even though the client has already timed out.