pywsgi.py with SSE produces extraneous TCP packets #1233
Using various versions of SSE (Server-Sent Events), for example from here: http://flask.pocoo.org/snippets/116/ with the pywsgi embedded server, running Wireshark shows multiple packets per server push. This is caused by:
which itself is caused by never running https://github.com/gevent/gevent/blob/master/src/gevent/pywsgi.py#L698 .
A browser will ignore the extra packets, but they are sent and processed anyway. I could not find an easy workaround by changing settings or headers. Using gunicorn to serve, rather than the built-in WSGIServer, does not produce these extra packets.
@jamadden would https://docs.python.org/3/library/socket.html#socket.socket.sendmsg help in this case?
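A sketch of what that could look like, assuming a connected stream socket. `sendmsg` hands the kernel several buffers in one call (scatter/gather I/O), so the three pieces of a chunked-encoding frame go out without being copied into a single Python buffer first. Note it is POSIX-only and, like `send`, may write fewer bytes than requested, so real code would have to handle short writes:

```python
import socket

def send_chunk(sock, data):
    # One chunked-encoding frame: hex length, CRLF, payload, CRLF.
    header = b"%x\r\n" % len(data)
    # sendmsg() takes an iterable of buffers and submits them to the
    # kernel in a single system call.
    sock.sendmsg([header, data, b"\r\n"])

# Demonstration over a local socket pair:
a, b = socket.socketpair()
send_chunk(a, b"data: This is some data\n\n")
print(b.recv(1024))  # b'19\r\ndata: This is some data\n\n\r\n'
```

The payload here is 25 bytes, hence the hex length prefix `19`.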
I don't think watching for
gunicorn handles this case normally, e.g., using chunked encoding. It's true, gunicorn makes one socket call to send a chunk. It does this by copying all the data into a single buffer. That has a different set of tradeoffs, notably more memory usage and higher CPU time on the server, but potentially fewer packets in flight.
The creation of a single buffer is something that happens in Python with the GIL held and without having the opportunity to switch greenlets, so it potentially hurts concurrency.
As the data chunks get larger (and copying costs go up) the chances of using fewer packets go down as fragmentation starts to happen. Or to put it another way, the extra packets cost less and less, relatively speaking.
Given TCP windows, all three of gevent's packets can probably be sent at the same time without waiting for individual ACKs from the client, so overall I wouldn't expect this to add much latency. Of course, on poor quality networks (some mobile connections) things may be different.
Let's see if a benchmark will tell us anything.
Here's an app that will write 10,000 19-byte chunks:
```python
# server.py
def app(environ, start_response):
    def gen():
        for _ in range(10000):
            yield b'data: This is some data\n\n'
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return gen()
```
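As a side note on why this exercises the write path so heavily: a WSGI server makes one write per item the app iterable yields, so this app produces 10,000 separate small writes (each of which pywsgi currently turns into three socket sends). That can be seen by driving the iterable directly, with `start_response` stubbed out and no server involved:

```python
def app(environ, start_response):
    def gen():
        for _ in range(10000):
            yield b'data: This is some data\n\n'
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return gen()

# Iterate the body the way a WSGI server would; each yielded item
# becomes one _write() call in pywsgi.
body = app({}, lambda status, headers: None)
chunks = list(body)
print(len(chunks), len(chunks[0]))  # 10000 25
```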
Under Python 3.6 with
If I change gevent's chunked writing to collect into a bytearray, I get 5.17 requests per second:
```python
def _write(self, data):
    if self.response_use_chunked:
        ## Write the chunked encoding
        header = ("%x\r\n" % len(data)).encode('ascii')
        towrite = bytearray()
        towrite.extend(header)
        towrite.extend(data)
        towrite.extend(b'\r\n')
        self._sendall(towrite)
```
Directly concatenating and sending the strings (
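For reference, a minimal standalone sketch of the concatenation approach, using a hypothetical `frame_chunk` helper (not gevent's actual code): the frame is built as one bytes object so a single `sendall` can move it.

```python
def frame_chunk(data):
    # Build one chunked-encoding frame as a single bytes object:
    # hex length, CRLF, payload, CRLF.
    return ("%x\r\n" % len(data)).encode('ascii') + data + b'\r\n'

print(frame_chunk(b'data: hi\n\n'))  # b'a\r\ndata: hi\n\n\r\n'
```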
What if we have 5,000-byte chunks? The three-packet version yields 6.05 requests per second
and concatenation yields 5.94 RPS
Let's simulate a poor wireless network. Sticking with our 5,000-byte chunks, with a 100ms delay and 1% packet drops, the three-packet (current) version drops to 5.92 RPS, 97.8% of the perfect performance:
Under these conditions, the concatenation approach comes to 5.74 RPS, or 96.6% of the perfect performance:
(There's some variance here, I've seen concatenation as low as 5.53 and as high as 5.9; I've also seen three-packet as high as 5.99; I chose representative numbers.)
For the 19-byte chunks, three-packet gives us anywhere from 5.85 to 6.20 RPS (91.5% to 97.0%), while concatenation under this network condition gives us 6.08 to 6.16 RPS (96.5% to 97.7%). This roughly matches our intuition: when there are fewer packets overall, the effect depends heavily on which 1% get lost.
So it basically looks to me like the additional packets don't hurt us, compared to the overhead of creating a new data buffer. This is especially true on good networks and/or larger chunk sizes. The effect holds for larger chunk sizes on lossy networks, and for small chunk sizes on lossy networks it's a wash.
I don't think it's quite that simple
More seriously, without using
This effect of simply turning off
```
$ http http://localhost:8080
HTTP/1.1 200 OK
Content-Type: text/plain
Date: Fri, 15 Jun 2018 21:14:20 GMT

http: error: ConnectionError: HTTPConnectionPool(host='localhost', port=8080): Read timed out.
```
curl is architected differently and parses each incoming
Now it's true that one specific (non-IANA-standard) MIME type (text/event-stream) mandates newline in message framing, but there's no way to know if clients for that MIME type are able to hook low-enough into the HTTP stream to be able to pull them out without actually using chunks. While the HTML living standard does recommend that browsers use an appropriate line-based buffering mode for that event type when constructing an
That gets us back to the reason why this was requested in the first place, which was to reduce the number of
Other datapoints: gevent does not currently disable Nagle's algorithm for its pywsgi sockets (call
gunicorn does disable Nagle's algorithm, I believe.
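For reference, disabling Nagle's algorithm is a single `setsockopt` call. This is just the raw socket API on a fresh socket; in pywsgi it would have to be applied to each accepted connection:

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# TCP_NODELAY=1 disables Nagle's algorithm: small writes are sent
# immediately instead of being held back waiting for ACKs.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
# Confirm the option took effect (getsockopt returns non-zero).
print(sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY))
sock.close()
```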
There's also the platform-specific TCP_NOPUSH/TCP_CORK option, which explicitly allows multiple writes to be combined. It did indeed dramatically reduce the number of packets transmitted, by 3 orders of magnitude, at the cost of significant latency. It seems (on macOS anyway) once it is turned on it can't fully be turned off (at least for short lived streams).
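A hedged sketch of the cork pattern: Linux exposes `TCP_CORK`, while BSD/macOS have `TCP_NOPUSH`; neither exists everywhere, hence the `hasattr` guards.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
if hasattr(socket, 'TCP_CORK'):        # Linux
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 1)
    # ... several small sends get coalesced by the kernel here ...
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 0)  # uncork/flush
elif hasattr(socket, 'TCP_NOPUSH'):    # BSD / macOS
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NOPUSH, 1)
sock.close()
```

Uncorking on Linux flushes any pending partial frame; the macOS behavior described above (the option not fully turning off) is what makes this awkward for short-lived streams.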
Finally, it looks like my first round of synthetic benchmarks may be largely invalid. I have to do some more checking, but I may not have been testing what I thought I was testing. If so, my apologies for putting everyone through all that. I will double check and if necessary run numbers again.
OK, above are more accurate numbers. The X axis is RPS on Python 3.6 with 10 clients and 200 requests. This does indeed show a small but constant overhead for a perfect network.
Likewise for a lossy network, there's a constant overhead. The TCP_NODELAY value is all over the place, though, with it apparently helping for non-chunked transfers on lossy networks but hurting for chunked, and the opposite on perfect networks. I didn't run any statistical regressions but I'd probably consider the difference in the margin of error.
So with this new data, I think it's clear that the overhead is measurable in these synthetic benchmarks (whether it's significant in a practical context is undetermined). The overhead does begin to vanish as chunk sizes increase; the lines meet at around 75,000 bytes. I suspect most chunks are smaller than that. Based on this, it seems reasonable to me to attempt a combined write, as it's not that hard and there may be benefits (I say attempt because
Thanks for this conversation! Once again I apologise for drawing incorrect conclusions based on faulty data above.