
fix thundering herd #792

Open
benoitc opened this Issue Jun 14, 2014 · 20 comments

Comments

9 participants
@benoitc
Owner

benoitc commented Jun 14, 2014

Currently all workers accept in parallel with no coordination, which sometimes makes several of them try to accept the same connection, triggering EAGAIN and increasing CPU usage for nothing.

While modern OSes have mostly fixed that, it can still happen in Gunicorn since we can listen on multiple interfaces.

Solution

The solution I see is to introduce some communication between the arbiter and the workers. The accept will still be executed directly in the calling worker when the socket accept returns OK. Otherwise the listening socket is "selected" in the arbiter using an event loop, and an input-ready callback will run the socket accept from a worker when the event is triggered.

Implementation details:

While this can change in the future by adding more methods, like sharing memory between the arbiter and the workers, we will take the simple path for now:

  • 1 pipe will be maintained between the arbiter and each worker. This pipe will be used for signaling.
  • The arbiter will put all listener sockets in an event loop. Once a read event is triggered it will notify one of the available workers to accept.
  • For the event loop it will use the selectors module from Python 3. It will be backported for Python 2.x.

Usual garbage collection will take care of closing the pipe when needed.

*Note*: possibly, the pipe will also let the workers notify the arbiter that they are alive.
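To make the flow concrete, here is a minimal sketch of the arbiter-side loop described above. The names are hypothetical: `listeners` is the list of bound listening sockets and `worker_pipes` holds the write-end file descriptors of the per-worker pipes. The round-robin choice is just a placeholder for whatever "available worker" bookkeeping ends up being used; this is not gunicorn code.

```python
import os
import selectors
from itertools import cycle

def arbiter_loop(listeners, worker_pipes):
    sel = selectors.DefaultSelector()
    for sock in listeners:
        sel.register(sock, selectors.EVENT_READ)
    workers = cycle(worker_pipes)  # naive round-robin choice of worker
    while True:
        for key, _ in sel.select():
            # A listener is readable: wake exactly one worker, which will
            # run accept() on its side instead of every worker racing.
            os.write(next(workers), b"\x01")
```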

Problems to solve

Right now each async worker accepts using its own method without much consideration for Gunicorn. For example, the gevent worker uses the gevent Server object; tornado and eventlet use similar systems. We should find a way to adapt them to use the new socket signaling system (roughly the loop sketched below).
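For illustration, a worker under this scheme would wait on the pipe instead of selecting on the listeners directly. A minimal sketch, assuming hypothetical `signal_pipe` (the read-end fd of the arbiter pipe), `listeners`, and `handle`; each worker type would need to fold something like this into its own event loop:

```python
import os
import selectors

def worker_wait_and_accept(signal_pipe, listeners, handle):
    # Wait on the arbiter's pipe, not on the listening sockets themselves.
    sel = selectors.DefaultSelector()
    sel.register(signal_pipe, selectors.EVENT_READ)
    while True:
        sel.select()              # block until the arbiter signals this worker
        os.read(signal_pipe, 1)   # consume the signal byte
        for sock in listeners:
            try:
                conn, addr = sock.accept()
            except BlockingIOError:
                continue          # nothing pending on this listener; try next
            handle(conn, addr)
            break
```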

Thoughts? Any other suggestions?

@diwu1989

diwu1989 commented Jun 15, 2014

Glad to see that there's going to be a thunder-lock in gunicorn.

@methane
Contributor

methane commented Jun 15, 2014

I think a lock-based approach is better than a signaling-based one.
The arbiter doesn't know which worker is busy or how many connections are coming to the socket.

@benoitc
Owner

benoitc commented Jun 15, 2014

@methane not sure I follow; using IPC is about adding a lock system somehow (a semaphore or the like is just that ;) ).

The arbiter will know whether a worker is busy because the worker will notify the arbiter about it (releasing the lock it put on accept).

@diwu1989

diwu1989 commented Jun 20, 2014

Asking as an outsider, is this something that is feasible to do for the next minor version release or is this a giant feature?

@davisp
Collaborator

davisp commented Jun 20, 2014

Have there been reports about this being an issue? Seems awfully complex. Reading the link from @methane I'd probably vote for the signaling approach as well, but as you point out that means we have to alter each worker so that they aren't selecting on the TCP socket and instead wait for the signal on the pipe. Seems reasonable I guess, just complicated.

@methane
Contributor

methane commented Jun 20, 2014

The following compares the flows for accepting a new connection.

Arbiter solution

  1. New connection comes in
  2. Arbiter wakes up from epoll
  3. Arbiter selects a worker and sends a signal through the pipe
  4. Worker wakes up from epoll
  5. Worker tries accept

Lock solution

  -2. Worker wakes up and takes the lock
  -1. Worker starts epoll
   0. New connection comes in
   1. Worker wakes up from epoll
   2. Worker accepts the connection and releases the lock

(Steps -2 to 0 happen before the connection arrives; only steps 1 and 2 are on the accept path. A sketch follows this comment.)

My thoughts

The lock solution has fewer context switches.

The lock solution is also better for concurrency. When massive numbers of new connections come in, the arbiter may become a bottleneck and workers can't work while many cores sit idle.

So I prefer the lock solution.
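To make the comparison concrete, here is a minimal sketch of the lock-based flow methane describes, using multiprocessing.Lock as a stand-in for whatever cross-process lock would actually be used; `listeners` and `handle` are hypothetical:

```python
import selectors

def worker_accept_loop(accept_lock, listeners, handle):
    sel = selectors.DefaultSelector()
    for sock in listeners:
        sock.setblocking(False)
        sel.register(sock, selectors.EVENT_READ)
    while True:
        with accept_lock:           # steps -2/-1: take the lock, then poll
            events = sel.select()   # steps 0/1: wake when a connection arrives
            pairs = []
            for key, _ in events:
                try:
                    pairs.append(key.fileobj.accept())
                except BlockingIOError:
                    pass            # spurious wakeup; nothing to accept
        # step 2: lock released; serve connections outside the critical section
        for conn, addr in pairs:
            handle(conn, addr)
```

Only the lock holder sits in the poll, so the kernel wakes exactly one process per connection; the contention davisp raises next is that every idle worker queues on `accept_lock`.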

@davisp
Collaborator

davisp commented Jun 20, 2014

@methane The downside of the lock is that it's a single point of contention. With the signaling approach there's room for optimizations, like running multiple accepts that don't require the synchronization under load. Not to mention the sheer complexity of attempting to write and support a cross-platform IPC locking scheme. Given the caveats in the article you linked to earlier, I'm not really keen on attempting such a thing.

Contemplating the uwsgi article that @methane linked to earlier, I'm still not convinced that this is even an issue we should be attempting to "fix", seeing as it's really not an issue for modern kernels. I'd vote to tell people that actually experience this that they just need to upgrade their deployment targets. Then again, I'm fairly averse to introducing complexity.

@tilgovi
Collaborator

tilgovi commented Jun 20, 2014

@davisp if we were simply blocking on accept() in our workers that would be one thing, but, partly because we allow multiple listening sockets, our workers generally select on them, which means the kernel will wake them all.

@davisp
Collaborator

davisp commented Jun 20, 2014

Oh right.

@pypeng

pypeng commented Jul 17, 2014

According to the uwsgi article: "Note: Apache is really smart about that, when it only needs to wait on a single file descriptor, it only calls accept(), taking advantage of modern kernels' anti-thundering herd policies."

How about we fix this common case, where we only have one listening socket?
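That special case would be small: with a single listener, each worker can block in a plain accept() and rely on the kernel's wake-one behavior. A minimal sketch of the idea, with hypothetical `sock` and `handle`:

```python
import socket

def worker_single_listener(sock: socket.socket, handle):
    # With exactly one listening socket there is no need to select():
    # blocking in accept() lets the kernel wake just one waiting worker.
    sock.setblocking(True)
    while True:
        conn, addr = sock.accept()
        handle(conn, addr)
```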

@tilgovi
Collaborator

tilgovi commented Jul 17, 2014

+1

@benoitc
Owner

benoitc commented Sep 12, 2014

@diwu1989 I forgot to answer, but this feature will appear in 20.0 in October.

@benoitc benoitc added this to the R20.0 milestone Sep 22, 2014

@benoitc benoitc removed this from the R20.0 milestone Dec 6, 2015

@RyPeck
Contributor

RyPeck commented Mar 25, 2016

@benoitc was this fixed? You may want to update the documentation here if so - http://docs.gunicorn.org/en/stable/faq.html#does-gunicorn-suffer-from-the-thundering-herd-problem

@methane
Contributor

methane commented Mar 25, 2016

@themanifold

themanifold commented Apr 13, 2016

So this was added to the R20.0 milestone, then removed. Have we decided not to work on this anymore, then?

@tilgovi
Collaborator

tilgovi commented Apr 13, 2016

I made the 20 milestone and provisionally added things without discussion or input from others. It was aspirational.

As far as I know we don't have a consensus work plan for the milestone. We should probably discuss soon :-)

@tilgovi
Collaborator

tilgovi commented Apr 13, 2016

Ah, I see Benoit added this one, then removed it. I would guess similar thoughts to mine.

@benoitc benoitc added this to Acknowledged in Mailing List Feb 26, 2017

@tilgovi
Collaborator

tilgovi commented Apr 28, 2018

Python has select.EPOLLEXCLUSIVE now. If someone wants to implement that, I would gladly review the PR.
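For reference, a minimal sketch of what an EPOLLEXCLUSIVE-based worker loop might look like (requires Linux >= 4.5 and Python >= 3.6). The names `listeners` and `handle` are hypothetical, and this is not the implementation being requested, just an outline of the approach:

```python
import select

def worker_epoll_exclusive(listeners, handle):
    ep = select.epoll()
    by_fd = {}
    for sock in listeners:
        sock.setblocking(False)
        # EPOLLEXCLUSIVE makes the kernel wake only one of the processes
        # waiting on this fd, avoiding the herd without arbiter signaling.
        ep.register(sock.fileno(), select.EPOLLIN | select.EPOLLEXCLUSIVE)
        by_fd[sock.fileno()] = sock
    while True:
        for fd, _event in ep.poll():
            try:
                conn, addr = by_fd[fd].accept()
            except BlockingIOError:
                continue  # another worker already accepted this connection
            handle(conn, addr)
```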
