Set max_accept on gevent worker-class to 1 when workers > 1 #2266

Merged 1 commit into benoitc:master on Mar 3, 2020

Conversation

@rowillia (Contributor)

We've had really terrible tail latencies with gevent and gunicorn under load.

Inspecting our services with strace we see the following:

```
23:11:01.651529 accept4(5, {sa_family=AF_UNIX}, [110->2], SOCK_CLOEXEC) = 223 <0.000015>
..{18 successful calls to accept4}...
23:11:01.652590 accept4(5, {sa_family=AF_UNIX}, [110->2], SOCK_CLOEXEC) = 249 <0.000010>
23:11:01.652647 accept4(5, 0x7ffcd46c09d0, [110], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable) <0.000012>
23:11:01.657622 getsockname(5, {sa_family=AF_UNIX, sun_path="/run/gunicorn/gunicorn.sock"}, [110->30]) = 0 <0.000009>
23:11:01.657682 recvfrom(223, "XXX"..., 8192, 0, NULL, NULL) = 511 <0.000011>
..{16 calls to recvfrom}...
23:11:01.740726 recvfrom(243, "XXX"..., 8192, 0, NULL, NULL) = 511 <0.000012>
23:11:01.746074 getsockname(5, {sa_family=AF_UNIX, sun_path="/run/gunicorn/gunicorn.sock"}, [110->30]) = 0 <0.000013>
23:11:01.746153 recvfrom(246, "XXX"..., 8192, 0, NULL, NULL) = 511 <0.000014>
23:11:01.751540 getsockname(5, {sa_family=AF_UNIX, sun_path="/run/gunicorn/gunicorn.sock"}, [110->30]) = 0 <0.000010>
23:11:01.751599 recvfrom(249, "XXX"..., 8192, 0, NULL, NULL) = 511 <0.000013>
```

Notice we see a flurry of 20 `accept4` calls followed by 20 calls to `recvfrom`. Each `recvfrom` happens about 5ms after the previous one, so the last `recvfrom` is issued roughly 100ms after the `accept4` for that fd.

gevent suggests setting `max_accept` to a lower value when there are multiple worker processes accepting on the same listening socket: https://github.com/gevent/gevent/blob/785b7b5546fcd0a184ea954f5d358539c530d95f/src/gevent/baseserver.py#L89-L102
gevent sets `max_accept` to `1` when `wsgi.multiprocess` is True: https://github.com/gevent/gevent/blob/9d27d269ed01a7e752966caa7a6f85d773780a1a/src/gevent/pywsgi.py#L1470-L1472
gunicorn does in fact set this when the number of workers is > 1 (https://github.com/benoitc/gunicorn/blob/e4e20f273e95f505277a8dadf390bbdd162cfff4/gunicorn/http/wsgi.py#L73):

```python
"wsgi.multiprocess": (cfg.workers > 1),
```

and this gets passed to `gevent.pywsgi.WSGIServer` (https://github.com/benoitc/gunicorn/blob/e4e20f273e95f505277a8dadf390bbdd162cfff4/gunicorn/workers/ggevent.py#L67-L75):

```python
environ = base_environ(self.cfg)
environ.update({
    "wsgi.multithread": True,
    "SERVER_SOFTWARE": VERSION,
})
server = self.server_class(
    s, application=self.wsgi, spawn=pool, log=self.log,
    handler_class=self.wsgi_handler, environ=environ,
    **ssl_args)
```

However, when `worker-class` is `gevent` we directly create a `gevent.server.StreamServer` (https://github.com/benoitc/gunicorn/blob/e4e20f273e95f505277a8dadf390bbdd162cfff4/gunicorn/workers/ggevent.py#L77-L78):

```python
hfun = partial(self.handle, s)
server = StreamServer(s, handle=hfun, spawn=pool, **ssl_args)
```
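For reference, a minimal sketch of the kind of change this PR makes, mirroring what gevent's `pywsgi` handler does when `wsgi.multiprocess` is truthy (the exact placement inside the worker's `run()` is illustrative, not the literal diff):

```python
hfun = partial(self.handle, s)
server = StreamServer(s, handle=hfun, spawn=pool, **ssl_args)
if self.cfg.workers > 1:
    # Accept at most one connection per loop iteration so the kernel can hand
    # the next pending connection to a different worker process instead of
    # letting this worker drain the whole burst by itself.
    server.max_accept = 1
```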

Fixing this dropped the p50 response time on an especially problematic benchmark from 250ms to 115ms.

@rowillia (Contributor, Author)

@jamadden mind taking a look at this as well?

@jamadden (Collaborator) left a comment


Makes sense to me.

@benoitc (Owner) commented Feb 13, 2020

> Makes sense to me.

mmm? why? Servers should be accepting as much as they want; it's then limited by the pool, just like the thread worker. What could actually explain the latency if the socket is in a select loop?

@benoitc self-requested a review on February 13, 2020 at 10:33
@jamadden (Collaborator)

Quoting the referenced libuv documentation:

> Having simultaneous accepts can significantly improve the rate of accepting connections (which is why it is enabled by default) but may lead to uneven load distribution in multi-process setups.

With values above 1, a single worker may accept many connections, preventing them from being accepted by other processes. But because the worker is single-threaded and non-preemptive, it may then only service the connections in a fairly sequential manner. If they'd been accepted by many processes much more parallelism could be exploited.
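A rough back-of-the-envelope illustration of that effect, using the ~5ms per-request gap visible in the strace above (the four-worker split is a hypothetical comparison, not a measurement from this issue):

```python
# Hypothetical arithmetic: head-of-line delay when one cooperative (greenlet)
# worker accepts a whole burst of connections before servicing any of them.
PER_REQUEST_MS = 5   # approximate gap between recvfrom calls in the strace above
BURST = 20           # connections accepted in one accept4 flurry

# max_accept > 1: a single worker grabs all 20, so the last connection waits
# behind the 19 accepted before it.
last_wait_one_worker = (BURST - 1) * PER_REQUEST_MS          # 95 ms

# max_accept = 1 with, say, 4 workers sharing the socket: each worker ends up
# with ~5 connections, so the worst-case wait shrinks proportionally.
last_wait_four_workers = (BURST // 4 - 1) * PER_REQUEST_MS   # 20 ms

print(last_wait_one_worker, last_wait_four_workers)
```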

@benoitc (Owner) commented Feb 13, 2020

> Quoting the referenced libuv documentation:
>
> > Having simultaneous accepts can significantly improve the rate of accepting connections (which is why it is enabled by default) but may lead to uneven load distribution in multi-process setups.
>
> With values above 1, a single worker may accept many connections, preventing them from being accepted by other processes. But because the worker is single-threaded and non-preemptive, it may then only service the connections in a fairly sequential manner. If they'd been accepted by many processes much more parallelism could be exploited.

Yes, though the system should normally take care of load-balancing connections across system threads first. What's surprising is that this is only appearing on the radar today. Was there any recent change that could have introduced this? Maybe at the OS level?

@jamadden (Collaborator)

> Yes, though the system should normally take care of load-balancing connections across system threads first.

Maybe. On Linux, it's highly dependent on `SO_REUSEPORT`, and gunicorn's default value for `reuse_port` has flip-flopped a few times. For a while it was true, but as of 19.8, the default is false.

Without `SO_REUSEPORT`, connections tend to go to the busiest worker process on the machine, whereas with it turned on, connections are routed fairly.
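If fair distribution across workers matters, opting back in is a one-line config change; a minimal `gunicorn.conf.py` sketch (the bind address, worker count, and connection limit here are placeholder values):

```python
# gunicorn.conf.py -- illustrative; the setting of interest is reuse_port.
bind = "0.0.0.0:8000"
workers = 4
worker_class = "gevent"
worker_connections = 1000

# Sets SO_REUSEPORT on the listening socket (default False since 19.8, per the
# comment above); the discussion above suggests this changes how the kernel
# distributes incoming connections across worker processes.
reuse_port = True
```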

@rowillia (Contributor, Author)

> Yes, though the system should normally take care of load-balancing connections across system threads first. What's surprising is that this is only appearing on the radar today.

We've had this problem for years and generally ran with really poor CPU utilization in response. We've only now tracked down part of the reason.

@jamadden (Collaborator)

I'd like to merge this if there are no objections. For heavily loaded systems this is an important fix; for lightly loaded systems I think it's unlikely to have an observable effect.

@jamadden (Collaborator)

It'd also be good to backport to 19.x.

@JordanP (Contributor) commented Feb 25, 2020

We are running gunicorn v20 with gevent workers on servers that handle around 250 RPS. We have adequate monitoring to assess whether this has an impact on p50 or p90. It's worth a shot.

@jamadden self-assigned this on Mar 1, 2020
@jamadden merged commit 46b2cff into benoitc:master on Mar 3, 2020
@jamadden added a commit that referenced this pull request on Mar 3, 2020