Gunicorn gthread deadlock #2917
I was experiencing the same problem with my setup: no reverse proxy + gunicorn + gthread worker + Flask app. The setup could be up in production for ~3-7 days, serving HTTP requests from various untrusted clients, and then gunicorn would suddenly enter a deadlocked state that resulted in no response to any requests, whether made by remote untrusted clients (like an external browser user) or by local trusted clients (like curl invoked by me). Only a gunicorn restart got it out of the deadlock. I tried to run gunicorn with
Output of
Notice that usually after "entered forwarding state" comes "entered disabled state", but the last "entered forwarding state" was not followed by "entered disabled state". Though I am not sure whether this is related to the problem, because I don't know how to interpret
But now it is clear what is going on, and I was able to reproduce the issue. Thank you @JorisOnGithub for finding the reason! Next I will share how to reproduce it.

How to reproduce it

Run gunicorn with the gthread worker, serving the Flask app directly (no reverse proxy in front). Open a terminal window to use for test requests. A minimal Flask app of the kind described could look like this (a hypothetical stand-in, since the original app was not shared):
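# app.py - hypothetical stand-in for the Flask app in this setup.
# Assumed invocation (flags are guesses chosen to match the Go client below):
#   gunicorn --worker-class gthread --bind 127.0.0.1:5000 app:app
from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    return "Hello from gunicorn + gthread\n"

Create this Go program: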
package main

import (
	"bufio"
	"fmt"
	"io"
	"net"
	"os"
)

// Origin to which the HTTP requests are made.
const origin = "127.0.0.1:5000"

// How many TCP connections to open at the same time.
const connections = 10

// If true, an HTTP request will be written to each opened TCP connection.
// If false, the TCP connections will be opened but nothing will be written
// to them.
const writeInConnection = true

func main() {
	for i := 0; i < connections; i++ {
		go dial()
	}
	// Block until Enter is pressed so the goroutines (and their open
	// connections) stay alive.
	input := bufio.NewScanner(os.Stdin)
	input.Scan()
}

func dial() {
	conn, err := net.Dial("tcp", origin)
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	fmt.Println("Connection opened to remote:")
	fmt.Println(conn.RemoteAddr().String())
	fmt.Println()
	if writeInConnection {
		data := fmt.Sprintf("GET / HTTP/1.1\r\nHost: %v\r\nUser-Agent: test/0.0\r\nAccept: */*\r\n\r\n", origin)
		fmt.Println("Request:")
		fmt.Println(data)
		fmt.Println()
		_, err = conn.Write([]byte(data))
		if err != nil {
			panic(err)
		}
	}
	// Read the first line of the response; this blocks until data arrives
	// or the server closes the connection (io.EOF).
	response, err := bufio.NewReader(conn).ReadString('\n')
	if err != nil && err != io.EOF {
		panic(err)
	}
	fmt.Println("Response:")
	fmt.Println(response)
	fmt.Println("End dial")
}

This program opens TCP connections and optionally writes data into each opened connection. If the origin server returns a response and closes the connection, dial ends and the client connection closes. Modify the constants to achieve the desired behavior. Run it with
So I launched
Now I changed to
Sometimes, even if there are 9 opened connections,
Now let's try to make requests to
And
But after some time they get closed. By checking nginx's
This indicates that buffering is working correctly. Even with
Is it expected behavior from gunicorn?

Looks like yes. The problem with such TCP connections was already discussed in #2652. The need for buffering in front of gunicorn is mentioned in #2334. Similar issues were mentioned in #2876 and #2914. So I think the authors of gunicorn expect that developers will always put some reverse proxy in front of gunicorn; for example, it is mentioned here: https://docs.gunicorn.org/en/stable/deploy.html#deploying-gunicorn. Because of this, there is probably no need to try to fix the issue on the gunicorn side. It is all about the deployment setup.

Solution to the issue

Always put some reverse proxy (like nginx) in front of gunicorn.
I'm glad this issue provided some clarity on the problems you were facing. Nice example of how to reproduce!
I believe I encountered this issue when attempting to use gunicorn + gthread in combination with AWS Classic Load Balancers (ELB).
gthread: only read sockets when they are readable (#2917)
Fixed in master. Thanks for the patch.
Hi @benoitc, I'm running into this issue too, with gunicorn behind AWS load balancers. Just wondered when a new release would be made that included this patch. |
new release should land tomorrow. |
I had to implement a monitoring solution to restart gunicorn when it freezes up... it would be great if this patch could be put into a release soon. The issue was closed after the latest release... as of today.
We ran into the same issue. In the nginx ingress controller, proxy-buffering is OFF by default. We would see gunicorn with gthread freezing a lot and failing health checks, resulting in pod restarts. But the real issue is that this happened with
I ended up switching to
Using gthread I ran into the same issue. Is there a fix or a workaround? Kicking it back to sync makes it work just fine.
Problem
When using gunicorn with the gthread worker, and using multiple WebView2 browsers, our gunicorn instance ends up in a deadlock.
The WebView2 browser (and maybe other browsers as well, but I have not tested those) can make speculative TCP connections: TCP connections that the browser thinks it will need as it expects to be making multiple HTTP requests to the server in quick succession. Those speculative TCP connections can be created without any data being sent on them.
The gunicorn gthread worker currently assumes that every TCP connection opened by a client will carry data. When a new connection is accepted, the gunicorn worker thread does a blocking recv call on the new socket, which blocks the thread. In our case this recv call stayed blocked indefinitely, as no data was ever written to the speculative TCP connection.
In general, all worker threads can become blocked at the same time when many users connect to a gunicorn server simultaneously with a browser that uses speculative TCP connections (or a browser that creates TCP connections without sending data for other reasons). This causes any new requests to be stuck in the gthread worker's queue, never to be processed.
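To make the mechanism concrete, here is a minimal sketch of the failure mode in plain Python. This is illustrative only, not gunicorn's actual code; the pool size of 2 and all names are invented:

# failure_sketch.py - illustrative only, NOT gunicorn's implementation.
# Each accepted connection is handed to a thread that immediately does a
# blocking recv(). A client that connects but never sends data pins that
# thread indefinitely; once every thread is pinned, queued connections
# are never served.
import socket
from concurrent.futures import ThreadPoolExecutor

def handle(conn):
    data = conn.recv(1024)  # blocks until data arrives or the peer closes
    if data:
        conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
    conn.close()

def serve():
    pool = ThreadPoolExecutor(max_workers=2)  # like --threads 2
    srv = socket.create_server(("127.0.0.1", 8000))
    while True:
        conn, _addr = srv.accept()
        # Two silent connections exhaust the pool; every later connection
        # just sits in the executor's queue.
        pool.submit(handle, conn)

if __name__ == "__main__":
    serve()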
How to reproduce
Real world
The way we initially reproduced the issue was by refreshing N WebView2 browser instances pointing at a gunicorn gthread instance at the same time, where N > the number of threads configured for gunicorn. The more windows, the more likely gunicorn is to enter a deadlock. In our testing the blocked worker threads can stay blocked for hours, since in our case the 'empty' TCP connections were never closed from the browser side. Reproducing the deadlock is not deterministic; refreshing more windows at the same time makes it more likely to hit the deadlock. In our testing, we were able to consistently reproduce it with 10 threads and roughly 12 or more windows.
When refreshing some of the windows (which closes the speculative TCP connections they created), the threads that were blocked on those connections unblocked (because the sockets they were blocked on closed) and continued serving the application without issues.
Minimal example
An easier way to reproduce this for maintainers is to simply run the example test app with

gunicorn --worker-class gthread --threads 2 test:app

Then open the example app in Edge at localhost:8000, and try to open localhost:8000 in an incognito Edge tab or in another browser (I tried Firefox). The app won't load in those other tabs. This might depend on your system/Edge settings (I reproduced it with all defaults). Doing

ss -tn | grep 8000

on the machine running gunicorn shows the empty open TCP connections.

The same behaviour can be seen by running with 3 threads instead of 2: the initial Edge load will work, the next load (e.g. Edge incognito) will work, but the 3rd load (e.g. Firefox) will hang. In my testing only Edge is sending those TCP connections without data; running multiple Firefox instances does not create a deadlock for me.
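For reference, test:app is the example app shipped with the gunicorn repository; any trivial WSGI app saved as test.py should behave the same. A minimal stand-in could be (a sketch, not the repository's exact file):

# test.py - minimal WSGI app standing in for gunicorn's bundled example.
def app(environ, start_response):
    body = b"Hello, world!\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]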
Proposed solution
I am about to open a PR that proposes a solution to this issue. Instead of immediately doing a blocking recv call on every TCP connection the gthread worker receives, it always registers new TCP connections with the poller first, and only handles a connection once we are sure it is readable.
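In rough outline, that pattern could look like the sketch below, written against Python's standard selectors module. This is only an illustration of the idea, not the actual patch:

# readable_sketch.py - illustration of the proposed pattern, NOT the patch.
import selectors
import socket

sel = selectors.DefaultSelector()

def handle(conn):
    conn.setblocking(True)  # hand-off: block only for the short exchange below
    data = conn.recv(1024)  # the poller said readable: returns data or b"" (EOF)
    if data:
        conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
    conn.close()

def serve():
    srv = socket.create_server(("127.0.0.1", 8000))
    srv.setblocking(False)
    sel.register(srv, selectors.EVENT_READ, data="accept")
    while True:
        for key, _events in sel.select(timeout=1.0):
            if key.data == "accept":
                conn, _addr = key.fileobj.accept()
                conn.setblocking(False)
                # Do NOT recv() here. A silent speculative connection now
                # costs one selector entry instead of one blocked thread.
                sel.register(conn, selectors.EVENT_READ, data="request")
            else:
                sel.unregister(key.fileobj)
                handle(key.fileobj)  # safe: the socket is known to be readable

if __name__ == "__main__":
    serve()

The key difference from the failure sketch earlier is that recv is only ever called on a socket the poller has already reported readable, so a connection that never sends data occupies a selector entry rather than a worker thread.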
Also, could this possibly be related to #2914?