
Gunicorn inside docker intermittently sending RST packet to the client causing connection reset with no logs #2714

Closed
leonidumanskiy opened this issue Jan 7, 2022 · 8 comments

leonidumanskiy commented Jan 7, 2022

I am running gunicorn inside a docker container, with a pretty basic configuration:

gunicorn --timeout 90 -k gevent --worker-tmp-dir /dev/shm --bind 0.0.0.0:80 "app.wcgi:application" --capture-output --enable-stdio-inheritance --log-level debug --access-logfile - --error-logfile -

There is no nginx in front of it, and no significant load besides this one client (this is a dev/staging environment).

Sometimes, the connection on the client side gets reset.

Capturing with Wireshark (on the client machine, on the host machine, and with tcpdump inside the docker network), it looks like gunicorn sends RST packet(s) right after the HTTP request packets arrive:

1	0.000000	99.23.241.219	172.20.0.3	TCP	66	58624→80 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 WS=256 SACK_PERM=1
2	0.000031	172.20.0.3	99.23.241.219	TCP	66	80→58624 [SYN, ACK] Seq=0 Ack=1 Win=29200 Len=0 MSS=1460 SACK_PERM=1 WS=128
3	0.147891	99.23.241.219	172.20.0.3	TCP	54	58624→80 [ACK] Seq=1 Ack=1 Win=131328 Len=0
4	0.155410	99.23.241.219	172.20.0.3	HTTP	299	GET //info/client_version?platform=android&version=0.1.205 HTTP/1.1 
5	0.155439	172.20.0.3	99.23.241.219	TCP	54	80→58624 [ACK] Seq=1 Ack=246 Win=30336 Len=0
6	0.160088	172.20.0.3	99.23.241.219	TCP	204	[TCP segment of a reassembled PDU]
7	0.160167	172.20.0.3	99.23.241.219	HTTP	72	HTTP/1.1 200 OK  (application/json)
8	0.310643	99.23.241.219	172.20.0.3	TCP	54	58624→80 [ACK] Seq=246 Ack=169 Win=131072 Len=0
9	0.339900	99.23.241.219	172.20.0.3	HTTP	271	GET /info/data_model_checksums HTTP/1.1 
10	0.342461	172.20.0.3	99.23.241.219	TCP	206	[TCP segment of a reassembled PDU]
11	0.343415	172.20.0.3	99.23.241.219	HTTP	1378	HTTP/1.1 200 OK  (application/json)
12	0.494407	99.23.241.219	172.20.0.3	TCP	54	58624→80 [ACK] Seq=463 Ack=1645 Win=131328 Len=0
13	1.186363	99.23.241.219	172.20.0.9	TCP	66	58626→80 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 WS=256 SACK_PERM=1
14	2.345004	172.20.0.3	99.23.241.219	TCP	54	80→58624 [FIN, ACK] Seq=1645 Ack=463 Win=31360 Len=0
15	2.472402	99.23.241.219	172.20.0.3	TCP	641	[TCP segment of a reassembled PDU]
16	2.472433	172.20.0.3	99.23.241.219	TCP	54	80→58624 [RST] Seq=1645 Win=0 Len=0
17	2.472405	99.23.241.219	172.20.0.3	HTTP	70	POST //user/game_state HTTP/1.1  (application/x-www-form-urlencoded)
18	2.472442	172.20.0.3	99.23.241.219	TCP	54	80→58624 [RST] Seq=1645 Win=0 Len=0
19	2.492618	99.23.241.219	172.20.0.3	TCP	54	58624→80 [ACK] Seq=1066 Ack=1646 Win=131328 Len=0
20	2.492638	172.20.0.3	99.23.241.219	TCP	54	80→58624 [RST] Seq=1646 Win=0 Len=0
21	2.493181	99.23.241.219	172.20.0.3	TCP	54	58624→80 [FIN, ACK] Seq=1066 Ack=1646 Win=131328 Len=0
22	2.493196	172.20.0.3	99.23.241.219	TCP	54	80→58624 [RST] Seq=1646 Win=0 Len=0

Despite the debug log level, there are no meaningful logs for this request. The last log I see in gunicorn is "GET /info/data_model_checksums" (the HTTP request right before the failed one). From gunicorn's perspective it looks like the failed request was never received. Using the pre_request hook also did not bear any fruit. I am unsure what else I should do to troubleshoot this issue.

What makes this difficult is that the problem seems to depend on the client sending multiple requests one after another, and on the timing between those requests. Changing random pieces of the setup makes the error (temporarily?) go away, and testing with curl does not reproduce the issue.

Retrying and sending the same request again always succeeds.

If I put nginx in front of gunicorn in another container, the problem seems to go away, but I don't feel comfortable with that "solution" and would like to understand the root cause.

jamadden (Collaborator) commented Jan 7, 2022

This reminds me of #2698 where the user was running into socket backlog issues. Adding more workers to keep up with the number of requests or tuning TCP settings to allow for a bigger backlog solved the problem.
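
For illustration, both of those knobs can be set in a gunicorn config file loaded with -c. A minimal sketch with purely illustrative values (not taken from this issue):

# gunicorn.conf.py -- illustrative values only
bind = "0.0.0.0:80"
worker_class = "gevent"
timeout = 90

# More workers help drain the accept queue faster under load.
workers = 4

# Size of the listen() backlog gunicorn requests (gunicorn's default is 2048).
# The kernel still caps it at net.core.somaxconn, so the host/container limit
# has to be at least this large for the setting to matter.
backlog = 2048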

leonidumanskiy (Author) commented Jan 7, 2022

@jamadden thanks for your reply. The issue you are referring to mentions multiple concurrent requests. In my case there are no concurrent requests, or any other requests being sent, other than what's present in the tcpdump log. The three HTTP requests you see here are sent by the client sequentially, and there is no other traffic coming to this server.

What can I do to troubleshoot this further? Is there any way to turn on more logging for TCP events in gunicorn, to see where the disconnect is coming from?
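
For context, this is roughly the kind of hook-based logging I tried (a gunicorn.conf.py loaded with -c; the post_request and worker_abort hooks are added here for completeness). These hooks only fire once gunicorn has parsed a request, so they give no TCP-level visibility, which is probably why pre_request never saw the connection that gets reset.

# gunicorn.conf.py -- a sketch of per-request hook logging, not a TCP-level trace
def pre_request(worker, req):
    # Called once the request line and headers have been parsed.
    worker.log.info("pre_request: %s %s", req.method, req.path)

def post_request(worker, req, environ, resp):
    # Called after the response has been sent.
    worker.log.info("post_request: %s %s -> %s", req.method, req.path, resp.status)

def worker_abort(worker):
    # Called when a worker is aborted, e.g. on timeout.
    worker.log.warning("worker aborted (pid: %s)", worker.pid)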

@NicolasHoferer

We have a similar issue, did you find a solution? See #2844.

@leonidumanskiy (Author)

@NicolasHoferer No, I have not. I ended up switching to gevent.pywsgi and have been using it in both development and production environments with zero issues.

I know this isn't what the community wants to hear, but until this is fixed, or there are tools and documentation for troubleshooting issues like this, I would not consider gunicorn for production. If I had the time/resources to troubleshoot further, I would look into adding a trace log level to gunicorn that prints every TCP/socket packet and operation, and try to find what's causing the abrupt connection losses.
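
In case it helps anyone landing here, the replacement setup is essentially the stock gevent.pywsgi server; a minimal sketch (the app.wcgi import path is the one from the gunicorn command above, adjust to your own project):

# serve.py -- minimal gevent.pywsgi sketch of the setup described above
from gevent import monkey
monkey.patch_all()  # patch the stdlib for cooperative I/O before importing the app

from gevent.pywsgi import WSGIServer
from app.wcgi import application  # import path taken from the gunicorn command above

# Listen on the same address/port the gunicorn command used.
WSGIServer(("0.0.0.0", 80), application).serve_forever()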

benoitc (Owner) commented Oct 14, 2022

Seems like the backlog is too short. This is an old issue; let's close it for now. If it's still reproducible, please open a new ticket with details to reproduce it.

benoitc closed this as completed Oct 14, 2022
@NicolasHoferer

How do you increase the backlog on an OpenShift cluster? I could not find a way to do this.

leonidumanskiy (Author) commented Oct 15, 2022

@benoitc Perhaps I should have explained this better, but there are no concurrent connections in my original issue, only one client making two sequential HTTP requests. I don't see how this could be related to the backlog size unless the default backlog is 1.

To me it looks like a race condition with sequential HTTP requests sent one right after another, but since I am no longer using gunicorn I cannot prove that.

sverdoux commented Oct 17, 2022

Sorry for the spam; a new issue was requested, so I moved my comments there.

This issue was closed.