
Gunicorn gthread deadlock #2917

Closed
JorisOnGithub opened this issue Jan 16, 2023 · 9 comments

@JorisOnGithub
Contributor

Problem

When using gunicorn with the gthread worker and multiple WebView2 browsers, our gunicorn instance ends up in a deadlock.
The WebView2 browser (and possibly other browsers, though I have not tested them) can make speculative TCP connections: connections the browser opens because it expects to make several HTTP requests to the server in quick succession. These speculative TCP connections can be created without any data ever being sent on them.

The gunicorn gthread worker currently assumes that every TCP connection opened by a client will carry data. When a new connection is accepted, a gunicorn worker thread issues a blocking recv call on the new socket, which blocks that thread. In our case the recv call stayed blocked indefinitely, because no data was ever written to that particular speculative TCP connection.

In general, all worker threads can end up blocked at the same time when many users connect to a gunicorn server simultaneously with a browser that uses speculative TCP connections (or browsers that open TCP connections without sending data for other reasons). Any new requests then sit in the gthread worker's queue and are never processed.
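
To make the failure mode concrete, here is a minimal sketch of a toy server (my own illustration, not gunicorn's actual code) that, like the gthread worker, hands each accepted connection to a fixed-size thread pool and immediately issues a blocking recv:

import socket
from concurrent.futures import ThreadPoolExecutor

def handle(conn):
    # Blocks until the client sends data or closes the connection.
    # A speculative connection does neither, so the worker thread hangs here.
    data = conn.recv(8192)
    if data:
        conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n")
    conn.close()

pool = ThreadPoolExecutor(max_workers=2)  # stands in for --threads 2
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("127.0.0.1", 8000))
server.listen()

while True:
    conn, _ = server.accept()
    # Two data-less connections are enough to occupy both threads;
    # every later connection sits in the pool's queue indefinitely.
    pool.submit(handle, conn)

Opening two connections to port 8000 without sending anything (for example with nc localhost 8000) is enough to make a subsequent curl localhost:8000 hang against this toy server.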

How to reproduce

Real world

The way we initially reproduced the issue was by refreshing N WebView2 browser instances pointing at a gunicorn gthread instance at the same time, where N > the number of threads configured for gunicorn. The more windows, the more likely gunicorn is to enter a deadlock. In our testing the blocked worker threads can stay blocked for hours, because in our case the 'empty' TCP connections were never closed from the browser side. Reproducing the deadlock is not deterministic; refreshing more windows at the same time makes it more likely. In our testing we could reproduce it consistently with 10 threads and roughly 12 or more windows.
When some of the windows are refreshed again (which closes the speculative TCP connections they created), the threads that were blocked on those connections unblock (because the sockets they were blocked on are closed) and continue serving the application without issues.

Minimal example

An easier way for maintainers to reproduce this is to simply run the example test app with gunicorn --worker-class gthread --threads 2 test:app. Then open the example app in Edge at localhost:8000. Then try to open localhost:8000 in an incognito Edge tab, or in another browser (I tried Firefox). The app won't load in those other tabs. This may depend on your system/Edge settings (I reproduced it with all defaults).
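
For reference, any trivial WSGI app works as the test app here; something like this hypothetical test.py is enough:

# test.py - any trivial WSGI app is enough to reproduce the behaviour.
def app(environ, start_response):
    body = b"Hello, World!\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]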

Running ss -tn | grep 8000 on the machine running gunicorn shows the empty open TCP connections.
The same behaviour can be seen with 3 threads instead of 2: the initial Edge load works, the next load (e.g. Edge incognito) works, but the third load (e.g. Firefox) hangs. In my testing only Edge creates these TCP connections without data; running multiple Firefox instances does not produce a deadlock for me.

Proposed solution

I am about to open a PR that proposes a solution to this issue. Instead of immediately issuing a blocking recv call on every TCP connection the gthread worker accepts, the worker always registers new TCP connections with the poller and only handles a connection once we know it is readable.
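
Roughly, the idea looks like the following sketch built on the standard selectors module (this is an illustration of the approach, not the actual PR diff; handle_request is a hypothetical stand-in for the existing request handling):

import selectors
import socket

sel = selectors.DefaultSelector()

def on_readable(conn):
    # Only called once the poller reports data (or EOF), so the recv
    # below cannot hang on an empty connection.
    data = conn.recv(8192)
    if not data:
        sel.unregister(conn)
        conn.close()
        return
    handle_request(conn, data)  # hypothetical request handler

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 8000))
listener.listen()
listener.setblocking(False)
sel.register(listener, selectors.EVENT_READ, data="accept")

while True:
    for key, _ in sel.select():
        if key.data == "accept":
            conn, _ = key.fileobj.accept()
            conn.setblocking(False)
            # Register the new connection instead of reading it immediately.
            sel.register(conn, selectors.EVENT_READ, data="client")
        else:
            on_readable(key.fileobj)

A connection that never sends data simply stays registered with the poller and never occupies a worker thread.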

Also, could this possibly be related to #2914?

@Amaimersion

I was experiencing the same problem with my setup: no reverse proxy + gunicorn + gthread worker + Flask app. The setup could run in production for ~3-7 days, serving HTTP requests from various untrusted clients, and then gunicorn would suddenly enter a deadlock state, which resulted in no response to any requests, whether made by remote untrusted clients (like external browser users) or local trusted clients (like curl called by me). Only restarting gunicorn got it out of the deadlock.

I tried running gunicorn with loglevel = DEBUG, but nothing useful was printed in the logs - only the last successful request and then silence.

Output of dmesg:

[9703769.381208] docker0: port 1(vethc2aa562) entered blocking state
[9703769.381211] docker0: port 1(vethc2aa562) entered forwarding state
[9926865.422836] docker0: port 1(vethc2aa562) entered disabled state
[9926865.423496] veth3166d2c: renamed from eth0
[9926865.569054] docker0: port 1(vethc2aa562) entered disabled state
[9926865.569578] device vethc2aa562 left promiscuous mode
[9926865.569582] docker0: port 1(vethc2aa562) entered disabled state
[9926866.581028] docker0: port 1(veth3fb0a4d) entered blocking state
[9926866.581034] docker0: port 1(veth3fb0a4d) entered disabled state
[9926866.581185] device veth3fb0a4d entered promiscuous mode
[9926866.583759] docker0: port 1(veth3fb0a4d) entered blocking state
[9926866.583763] docker0: port 1(veth3fb0a4d) entered forwarding state
[9926866.583815] docker0: port 1(veth3fb0a4d) entered disabled state
[9926866.833822] eth0: renamed from vethf833e1c
[9926866.855096] IPv6: ADDRCONF(NETDEV_CHANGE): veth3fb0a4d: link becomes ready
[9926866.855132] docker0: port 1(veth3fb0a4d) entered blocking state
[9926866.855135] docker0: port 1(veth3fb0a4d) entered forwarding state

Notice that "entered forwarding state" is usually followed by "entered disabled state", but the last "entered forwarding state" was not followed by "entered disabled state". I'm not sure whether this is related to the problem, though, because I don't know how to interpret dmesg output.

top shows that all workers consume a small amount of CPU and are all in the sleeping state:

589145 ubuntu    20   0  601812 108780      0 S   0.7  11.3  12:10.69 gunicorn                                                                                              
589148 ubuntu    20   0  601800  98708      0 S   0.7  10.2  15:37.97 gunicorn                                                                                              
589146 ubuntu    20   0  602216 108600      0 S   0.3  11.3  12:24.35 gunicorn                                                                                              
589147 ubuntu    20   0  602048 110672      0 S   0.3  11.5  13:07.25 gunicorn                                                                                              
615237 ubuntu    20   0   11016   3760   3180 R   0.3   0.4   0:00.04 top                                                                                                   
589115 ubuntu    20   0    2388     68      0 S   0.0   0.0   0:00.02 sh                                                                                                    
589144 ubuntu    20   0   35856  21236      0 S   0.0   2.2   0:51.38 gunicorn                                                                                              
614792 ubuntu    20   0   19044   5284   3772 S   0.0   0.5   0:00.05 systemd                                                                                               
614794 ubuntu    20   0  170640   4680      0 S   0.0   0.5   0:00.00 (sd-pam)                                                                                              
614898 ubuntu    20   0   13784   3900   2240 S   0.0   0.4   0:00.41 sshd                                                                                                  
614903 ubuntu    20   0   10048   4336   2636 S   0.0   0.4   0:00.09 bash

But now it is clear what is going on, and I was able to reproduce the issue. Thank you @JorisOnGithub for finding the cause! Next I will share how to reproduce it.

How to reproduce it

Run gunicorn with the gthread worker. The app can be empty; the only thing that matters is the number of workers and threads, so make them configurable. In my test I launched gunicorn at 127.0.0.1:5000.
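
For example, a hypothetical gunicorn.conf.py along these lines keeps the worker and thread counts configurable (the WORKERS/THREADS environment variables are my own convention, not from the original report):

# gunicorn.conf.py - only the worker/thread counts matter for this test.
import os

bind = "127.0.0.1:5000"
worker_class = "gthread"
workers = int(os.environ.get("WORKERS", 3))
threads = int(os.environ.get("THREADS", 3))

Run it with gunicorn -c gunicorn.conf.py app:app, where app.py contains any empty WSGI app (like the test.py shown earlier).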

Run nginx acting as a reverse proxy. You can use the default configuration. In my test I launched nginx at 127.0.0.1:8888, forwarding all requests to http://127.0.0.1:5000.

Open a terminal window to use curl to check whether gunicorn has entered the deadlock state.

Create this Go program:

package main

import (
	"bufio"
	"fmt"
	"io"
	"net"
	"os"
)

// To which origin make HTTP requests.
const origin = "127.0.0.1:5000"

// How many TCP connections to open at the same time.
const connections = 10

// If true, request data will be written into each opened TCP connection.
// If false, TCP connections will be opened but nothing will be written into them.
const writeInConnection = true

func main() {
	for i := 0; i < connections; i++ {
		go dial()
	}

	input := bufio.NewScanner(os.Stdin)
	input.Scan()
}

func dial() {
	conn, err := net.Dial("tcp", origin)

	if err != nil {
		panic(err)
	}

	defer conn.Close()

	fmt.Println("Connection opened to remote:")
	fmt.Println(conn.RemoteAddr().String())
	fmt.Println()

	if writeInConnection {
		data := fmt.Sprintf("GET / HTTP/1.1\r\nHost: %v\r\nUser-Agent: test/0.0\r\nAccept: */*\r\n\r\n", origin)

		fmt.Println("Request:")
		fmt.Println(data)
		fmt.Println()

		_, err = conn.Write([]byte(data))

		if err != nil {
			panic(err)
		}
	}

	response, err := bufio.NewReader(conn).ReadString('\n')

	if err != nil && err != io.EOF {
		panic(err)
	}

	fmt.Println("Response:")
	fmt.Println(response)

	fmt.Println("End dial")
}

This program opens TCP connections and optionally writes data into each opened connection. If the origin server returns a response and closes the connection, dial ends and the client connection closes. Modify the constants to get the desired behavior. Run it with go run . and exit it with Ctrl+C.
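
For anyone who would rather not install Go, a roughly equivalent client in Python (my own sketch, not part of the original report) looks like this:

import socket
import threading

ORIGIN = ("127.0.0.1", 5000)   # to which origin to make HTTP requests
CONNECTIONS = 10               # how many TCP connections to open at once
WRITE_IN_CONNECTION = False    # False = open the socket but write nothing

def dial():
    conn = socket.create_connection(ORIGIN)
    print("Connection opened to", conn.getpeername())
    if WRITE_IN_CONNECTION:
        request = (
            f"GET / HTTP/1.1\r\nHost: {ORIGIN[0]}:{ORIGIN[1]}\r\n"
            "User-Agent: test/0.0\r\nAccept: */*\r\n\r\n"
        )
        conn.sendall(request.encode())
        print("Response status line:", conn.makefile().readline().strip())
        conn.close()
        return
    # No data written: hold the connection open until the program exits.
    threading.Event().wait()

for _ in range(CONNECTIONS):
    threading.Thread(target=dial, daemon=True).start()

input("Press Enter to close all connections and exit.\n")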

So I launched gunicorn with workers = 3 and threads = 3. In theory it can handle up to 9 parallel requests. I set origin = 127.0.0.1:5000, connections = 8, writeInConnection = false and launched the program. The gunicorn log prints nothing.

ss -tn | grep 127.0.0.1:5000 outputs:

ESTAB      0      0          127.0.0.1:5000        127.0.0.1:49598       
ESTAB      0      0          127.0.0.1:5000        127.0.0.1:49558       
CLOSE-WAIT 1      0          127.0.0.1:54056       127.0.0.1:5000        
ESTAB      0      0          127.0.0.1:49544       127.0.0.1:5000        
ESTAB      0      0          127.0.0.1:49574       127.0.0.1:5000        
ESTAB      0      0          127.0.0.1:5000        127.0.0.1:49528       
ESTAB      0      0          127.0.0.1:5000        127.0.0.1:49586       
ESTAB      0      0          127.0.0.1:5000        127.0.0.1:49574       
ESTAB      0      0          127.0.0.1:49542       127.0.0.1:5000        
ESTAB      0      0          127.0.0.1:49586       127.0.0.1:5000        
ESTAB      0      0          127.0.0.1:5000        127.0.0.1:49584       
ESTAB      0      0          127.0.0.1:5000        127.0.0.1:49542       
FIN-WAIT-2 0      0          127.0.0.1:5000        127.0.0.1:54056       
ESTAB      0      0          127.0.0.1:49528       127.0.0.1:5000        
ESTAB      0      0          127.0.0.1:49558       127.0.0.1:5000        
ESTAB      0      0          127.0.0.1:49584       127.0.0.1:5000        
ESTAB      0      0          127.0.0.1:5000        127.0.0.1:49544       
ESTAB      0      0          127.0.0.1:49598       127.0.0.1:5000 

The ESTAB entries are the pending TCP connections made by the Go program. I assume the CLOSE-WAIT and FIN-WAIT entries are related to the listening server (gunicorn in our case).

curl --verbose 127.0.0.1:5000 results in a response from gunicorn, which is reflected in the gunicorn logs.

Now I changed to connections = 9. There are 9 open TCP connections, and curl --verbose 127.0.0.1:5000 results in nothing - just a pending request that waits forever.

Sometimes, even with 9 open connections, the curl call gets a response, but if you increase connections to a bigger value, for example 18, you definitely will not get any response with curl.

Now let's make requests to nginx, which buffers requests and then forwards them to gunicorn. I changed to origin = 127.0.0.1:8888, connections = 18, writeInConnection = false and launched the program.

ss -tn | grep 127.0.0.1:5000 shows no established TCP connections on port 5000:

CLOSE-WAIT 1      0          127.0.0.1:54056      127.0.0.1:5000        
FIN-WAIT-2 0      0          127.0.0.1:5000       127.0.0.1:54056 

And curl --verbose 127.0.0.1:5000 results in successful response from gunicorn. Same for curl --verbose 127.0.0.1:8888.

ss -tn | grep 127.0.0.1:8888 displays opened TCP connections:

ESTAB      0      0          127.0.0.1:34086      127.0.0.1:8888        
ESTAB      0      0          127.0.0.1:8888       127.0.0.1:34076       
ESTAB      0      0          127.0.0.1:8888       127.0.0.1:34154       
ESTAB      0      0          127.0.0.1:34062      127.0.0.1:8888        
ESTAB      0      0          127.0.0.1:8888       127.0.0.1:34168       
ESTAB      0      0          127.0.0.1:34046      127.0.0.1:8888        
ESTAB      0      0          127.0.0.1:34140      127.0.0.1:8888        
ESTAB      0      0          127.0.0.1:34154      127.0.0.1:8888        
ESTAB      0      0          127.0.0.1:8888       127.0.0.1:34014       
ESTAB      0      0          127.0.0.1:8888       127.0.0.1:34178       
ESTAB      0      0          127.0.0.1:8888       127.0.0.1:34020       
ESTAB      0      0          127.0.0.1:34122      127.0.0.1:8888        
ESTAB      0      0          127.0.0.1:8888       127.0.0.1:34032       
ESTAB      0      0          127.0.0.1:34020      127.0.0.1:8888        
ESTAB      0      0          127.0.0.1:8888       127.0.0.1:34180       
ESTAB      0      0          127.0.0.1:34058      127.0.0.1:8888        
ESTAB      0      0          127.0.0.1:8888       127.0.0.1:34108       
ESTAB      0      0          127.0.0.1:34080      127.0.0.1:8888        
ESTAB      0      0          127.0.0.1:34032      127.0.0.1:8888        
ESTAB      0      0          127.0.0.1:34178      127.0.0.1:8888        
ESTAB      0      0          127.0.0.1:8888       127.0.0.1:34046       
ESTAB      0      0          127.0.0.1:8888       127.0.0.1:34080       
ESTAB      0      0          127.0.0.1:34136      127.0.0.1:8888        
ESTAB      0      0          127.0.0.1:8888       127.0.0.1:34140       
ESTAB      0      0          127.0.0.1:34180      127.0.0.1:8888        
ESTAB      0      0          127.0.0.1:8888       127.0.0.1:34058       
ESTAB      0      0          127.0.0.1:8888       127.0.0.1:34136       
ESTAB      0      0          127.0.0.1:8888       127.0.0.1:34062       
ESTAB      0      0          127.0.0.1:34108      127.0.0.1:8888        
ESTAB      0      0          127.0.0.1:34014      127.0.0.1:8888        
ESTAB      0      0          127.0.0.1:8888       127.0.0.1:34086       
ESTAB      0      0          127.0.0.1:34168      127.0.0.1:8888        
ESTAB      0      0          127.0.0.1:8888       127.0.0.1:34096       
ESTAB      0      0          127.0.0.1:34096      127.0.0.1:8888        
ESTAB      0      0          127.0.0.1:8888       127.0.0.1:34122       
ESTAB      0      0          127.0.0.1:34076      127.0.0.1:8888  

But after some time they get closed. Checking nginx's error.log, I see:

2023/01/29 14:13:27 [info] 378054#378054: *837 client timed out (110: Connection timed out) while waiting for request, client: 127.0.0.1, server: 0.0.0.0:8888
2023/01/29 14:13:27 [info] 378054#378054: *839 client timed out (110: Connection timed out) while waiting for request, client: 127.0.0.1, server: 0.0.0.0:8888
2023/01/29 14:13:27 [info] 378054#378054: *842 client timed out (110: Connection timed out) while waiting for request, client: 127.0.0.1, server: 0.0.0.0:8888
2023/01/29 14:13:27 [info] 378049#378049: *841 client timed out (110: Connection timed out) while waiting for request, client: 127.0.0.1, server: 0.0.0.0:8888
2023/01/29 14:13:27 [info] 378050#378050: *838 client timed out (110: Connection timed out) while waiting for request, client: 127.0.0.1, server: 0.0.0.0:8888
2023/01/29 14:13:27 [info] 378055#378055: *844 client timed out (110: Connection timed out) while waiting for request, client: 127.0.0.1, server: 0.0.0.0:8888
2023/01/29 14:13:27 [info] 378052#378052: *829 client timed out (110: Connection timed out) while waiting for request, client: 127.0.0.1, server: 0.0.0.0:8888
2023/01/29 14:13:27 [info] 378051#378051: *830 client timed out (110: Connection timed out) while waiting for request, client: 127.0.0.1, server: 0.0.0.0:8888
2023/01/29 14:13:27 [info] 378050#378050: *840 client timed out (110: Connection timed out) while waiting for request, client: 127.0.0.1, server: 0.0.0.0:8888
2023/01/29 14:13:27 [info] 378052#378052: *831 client timed out (110: Connection timed out) while waiting for request, client: 127.0.0.1, server: 0.0.0.0:8888
2023/01/29 14:13:27 [info] 378051#378051: *836 client timed out (110: Connection timed out) while waiting for request, client: 127.0.0.1, server: 0.0.0.0:8888
2023/01/29 14:13:27 [info] 378050#378050: *843 client timed out (110: Connection timed out) while waiting for request, client: 127.0.0.1, server: 0.0.0.0:8888
2023/01/29 14:13:27 [info] 378052#378052: *832 client timed out (110: Connection timed out) while waiting for request, client: 127.0.0.1, server: 0.0.0.0:8888
2023/01/29 14:13:27 [info] 378051#378051: *845 client timed out (110: Connection timed out) while waiting for request, client: 127.0.0.1, server: 0.0.0.0:8888
2023/01/29 14:13:27 [info] 378052#378052: *833 client timed out (110: Connection timed out) while waiting for request, client: 127.0.0.1, server: 0.0.0.0:8888
2023/01/29 14:13:27 [info] 378051#378051: *846 client timed out (110: Connection timed out) while waiting for request, client: 127.0.0.1, server: 0.0.0.0:8888
2023/01/29 14:13:27 [info] 378052#378052: *834 client timed out (110: Connection timed out) while waiting for request, client: 127.0.0.1, server: 0.0.0.0:8888
2023/01/29 14:13:27 [info] 378052#378052: *835 client timed out (110: Connection timed out) while waiting for request, client: 127.0.0.1, server: 0.0.0.0:8888

This indicates that buffering is working correctly. Even with connections = 512 NGINX correctly handles buffering and gunicorn works like a charm.

writeInConnection = true always results in correct behavior, both with gunicorn alone and with nginx + gunicorn. So the issue is TCP connections opened by the client with no data written to them.

Is this expected behavior from gunicorn?

It looks like yes. The problem with such TCP connections was already discussed in #2652. The need for buffering in front of gunicorn is mentioned in #2334. Similar issues were mentioned in #2876 and #2914.

So I think the gunicorn authors expect developers to always put a reverse proxy in front of gunicorn. For example, it is mentioned here: https://docs.gunicorn.org/en/stable/deploy.html#deploying-gunicorn

Because of this, there is probably no need to try to fix the issue on the gunicorn side. It is all about the deployment setup.

Solution to the issue

Always put a buffering reverse proxy (like nginx) in front of gunicorn.

@JorisOnGithub
Contributor Author

I'm glad this issue provided some clarity on the problems you were facing. Nice example of how to reproduce it!
Using a reverse proxy like nginx in front would indeed fix this, but PR #2918 also fixes the "initial empty TCP connection" issue on the gunicorn side. It is a very small change with little downside, and it allows the gunicorn worker to keep running without issues when it receives TCP connections that carry no data.
I hope PR #2918 gets reviewed/considered, as it would fix my issue without having to put nginx in front of gunicorn.

@TBoshoven

I believe I encountered this issue when attempting to use gunicorn + gthread in combination with AWS Classic Load Balancers (ELB).
These load balancers pre-open connections in order to eliminate the initial TCP handshake.

@tilgovi tilgovi self-assigned this Mar 25, 2023
benoitc added a commit that referenced this issue May 7, 2023
gthread: only read sockets when they are readable (#2917)
@benoitc
Owner

benoitc commented May 7, 2023

fixed in master. Thanks for the patch.

@benoitc benoitc closed this as completed May 7, 2023
@JamesMilneFilmlight

JamesMilneFilmlight commented May 29, 2023

Hi @benoitc,

I'm running into this issue too, with gunicorn behind AWS load balancers.

Just wondered when a new release would be made that included this patch.

@benoitc
Owner

benoitc commented May 29, 2023

new release should land tomorrow.

@dcshapiro

I had to implement a monitoring solution to restart gunicorn when it freezes up... it would be great if this patch could be put into a release soon. The issue was closed after the latest release... as of today.

@Ashish-Bansal

We ran into the same issue.

In the nginx ingress controller, proxy buffering is OFF by default.

We would see gunicorn with gthread freezing a lot and failing health checks, resulting in pod restarts.

But the real issue is that this happened with gunicorn 21.2.0 as well.

I ended up switching to the gevent worker type.

@jessielw

jessielw commented Jan 3, 2024

Using gthread I ran into the same issue. Is there a fix or a workaround?

Kicking it back to the sync worker makes it work just fine.
