
Gunicorn inside docker intermittently sending RST packet to the client causing connection reset with no logs #2714

Closed
leonidumanskiy opened this issue Jan 7, 2022 · 8 comments

leonidumanskiy commented Jan 7, 2022

I am running gunicorn inside a docker container, with a pretty basic configuration:

gunicorn --timeout 90 -k gevent --worker-tmp-dir /dev/shm --bind 0.0.0.0:80 "app.wcgi:application" --capture-output --enable-stdio-inheritance --log-level debug --access-logfile - --error-logfile -

There is no nginx in front of it, and no significant load besides this one client (this is a dev/staging environment).

Sometimes, the connection on the client side gets reset.

Capturing with Wireshark (on the client machine, on the host machine, and with tcpdump inside the docker network), it looks like gunicorn sends RST packet(s) right after the HTTP request packets arrive:

1	0.000000	99.23.241.219	172.20.0.3	TCP	66	58624→80 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 WS=256 SACK_PERM=1
2	0.000031	172.20.0.3	99.23.241.219	TCP	66	80→58624 [SYN, ACK] Seq=0 Ack=1 Win=29200 Len=0 MSS=1460 SACK_PERM=1 WS=128
3	0.147891	99.23.241.219	172.20.0.3	TCP	54	58624→80 [ACK] Seq=1 Ack=1 Win=131328 Len=0
4	0.155410	99.23.241.219	172.20.0.3	HTTP	299	GET //info/client_version?platform=android&version=0.1.205 HTTP/1.1 
5	0.155439	172.20.0.3	99.23.241.219	TCP	54	80→58624 [ACK] Seq=1 Ack=246 Win=30336 Len=0
6	0.160088	172.20.0.3	99.23.241.219	TCP	204	[TCP segment of a reassembled PDU]
7	0.160167	172.20.0.3	99.23.241.219	HTTP	72	HTTP/1.1 200 OK  (application/json)
8	0.310643	99.23.241.219	172.20.0.3	TCP	54	58624→80 [ACK] Seq=246 Ack=169 Win=131072 Len=0
9	0.339900	99.23.241.219	172.20.0.3	HTTP	271	GET /info/data_model_checksums HTTP/1.1 
10	0.342461	172.20.0.3	99.23.241.219	TCP	206	[TCP segment of a reassembled PDU]
11	0.343415	172.20.0.3	99.23.241.219	HTTP	1378	HTTP/1.1 200 OK  (application/json)
12	0.494407	99.23.241.219	172.20.0.3	TCP	54	58624→80 [ACK] Seq=463 Ack=1645 Win=131328 Len=0
13	1.186363	99.23.241.219	172.20.0.9	TCP	66	58626→80 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 WS=256 SACK_PERM=1
14	2.345004	172.20.0.3	99.23.241.219	TCP	54	80→58624 [FIN, ACK] Seq=1645 Ack=463 Win=31360 Len=0
15	2.472402	99.23.241.219	172.20.0.3	TCP	641	[TCP segment of a reassembled PDU]
16	2.472433	172.20.0.3	99.23.241.219	TCP	54	80→58624 [RST] Seq=1645 Win=0 Len=0
17	2.472405	99.23.241.219	172.20.0.3	HTTP	70	POST //user/game_state HTTP/1.1  (application/x-www-form-urlencoded)
18	2.472442	172.20.0.3	99.23.241.219	TCP	54	80→58624 [RST] Seq=1645 Win=0 Len=0
19	2.492618	99.23.241.219	172.20.0.3	TCP	54	58624→80 [ACK] Seq=1066 Ack=1646 Win=131328 Len=0
20	2.492638	172.20.0.3	99.23.241.219	TCP	54	80→58624 [RST] Seq=1646 Win=0 Len=0
21	2.493181	99.23.241.219	172.20.0.3	TCP	54	58624→80 [FIN, ACK] Seq=1066 Ack=1646 Win=131328 Len=0
22	2.493196	172.20.0.3	99.23.241.219	TCP	54	80→58624 [RST] Seq=1646 Win=0 Len=0

Despite the debug log level, there are no meaningful logs for this request. The last log I see in gunicorn is "GET /info/data_model_checksums" (the HTTP request right before the failed one). From gunicorn's perspective it looks like the failed request was never received. Using the pre_request hook also did not bear any fruit. I am unsure what else I should do to troubleshoot this issue.

What makes this difficult is that the problem seems to depend on the client sending multiple requests one after another, and on the timing between those requests. Changing random pieces of the setup makes the error (temporarily?) go away, and testing with curl does not reproduce the issue.

Retrying and sending the same request again always succeeds.

If I put nginx in front of gunicorn in another container, the problem seems to go away, but I don't feel comfortable with that "solution" and would like to understand the root cause.

jamadden (Collaborator) commented Jan 7, 2022

This reminds me of #2698 where the user was running into socket backlog issues. Adding more workers to keep up with the number of requests or tuning TCP settings to allow for a bigger backlog solved the problem.
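
For illustration, both of those knobs can be set in a gunicorn config file loaded with -c. A minimal sketch with purely illustrative values (not taken from this issue):

# gunicorn.conf.py -- illustrative values only
bind = "0.0.0.0:80"
worker_class = "gevent"
timeout = 90

# More workers help drain the accept queue faster under load.
workers = 4

# Size of the listen() backlog gunicorn requests (gunicorn's default is 2048).
# The kernel still caps it at net.core.somaxconn, so the host/container limit
# has to be at least this large for the setting to matter.
backlog = 2048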

leonidumanskiy (Author) commented Jan 7, 2022

@jamadden thanks for your reply. The issue you are referring to mentions multiple concurrent requests. In my case there are no concurrent requests, or any other requests being sent, other than what's present in the tcpdump log. The three HTTP requests you see here are sent by the client sequentially, and there is no other traffic coming to this server.

What can I do to troubleshoot this further? Is there any way to turn on more logging for TCP events in gunicorn, to see where the disconnect is coming from?
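
For context, this is roughly the kind of hook-based logging I tried (a gunicorn.conf.py loaded with -c; the post_request and worker_abort hooks are added here for completeness). These hooks only fire once gunicorn has parsed a request, so they give no TCP-level visibility, which is probably why pre_request never saw the connection that gets reset.

# gunicorn.conf.py -- a sketch of per-request hook logging, not a TCP-level trace
def pre_request(worker, req):
    # Called once the request line and headers have been parsed.
    worker.log.info("pre_request: %s %s", req.method, req.path)

def post_request(worker, req, environ, resp):
    # Called after the response has been sent.
    worker.log.info("post_request: %s %s -> %s", req.method, req.path, resp.status)

def worker_abort(worker):
    # Called when a worker is aborted, e.g. on timeout.
    worker.log.warning("worker aborted (pid: %s)", worker.pid)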

@NicolasHoferer

We have a similar issue, did you find a solution? See #2844.

@leonidumanskiy (Author)

@NicolasHoferer No, I have not. I ended up switching to gevent.pywsgi and have been using it in both development and production environments with zero issues.

I know this isn't what the community wants to hear, but until this is fixed, or there are tools and documentation for troubleshooting issues like this, I would not consider gunicorn for production. If I had the time/resources to troubleshoot further, I would look into adding a trace log level to gunicorn that prints every TCP/socket packet and operation, and try to find what's causing the abrupt connection losses.
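
In case it helps anyone landing here, the replacement setup is essentially the stock gevent.pywsgi server; a minimal sketch (the app.wcgi import path is the one from the gunicorn command above, adjust to your own project):

# serve.py -- minimal gevent.pywsgi sketch of the setup described above
from gevent import monkey
monkey.patch_all()  # patch the stdlib for cooperative I/O before importing the app

from gevent.pywsgi import WSGIServer
from app.wcgi import application  # import path taken from the gunicorn command above

# Listen on the same address/port the gunicorn command used.
WSGIServer(("0.0.0.0", 80), application).serve_forever()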

benoitc (Owner) commented Oct 14, 2022

Seems like the backlog is too short. This is an old issue; let's close it for now. If it's still reproducible, please open a new ticket with details to reproduce it.

benoitc closed this as completed Oct 14, 2022
@NicolasHoferer

How do you increase the backlog on an OpenShift cluster? I could not find a way to do this.

leonidumanskiy (Author) commented Oct 15, 2022

@benoitc Perhaps I should have explained this better, but there are no concurrent connections in my original issue, only one client making two sequential HTTP requests. I don't see how this could be related to the backlog size unless the default backlog is 1.

To me it looks like a race condition with sequential HTTP requests sent one right after another, but since I am no longer using gunicorn I cannot prove that.

sverdoux commented Oct 17, 2022

Sorry for the spam; a new issue was requested, so I moved my comments there.

This issue was closed.