
pywsgi.py with SSE produces extraneous TCP packets #1233

Closed
tomberek opened this Issue Jun 2, 2018 · 9 comments

@tomberek

tomberek commented Jun 2, 2018

  • gevent version: 1.2.2
  • Python version: Python 2.7.14
  • Operating System: Ubuntu x86_64

Description:

Using various implementations of SSE (Server-Sent Events), for example the one from http://flask.pocoo.org/snippets/116/, with the embedded pywsgi server, Wireshark shows multiple TCP packets per server push. This is caused by:

https://github.com/gevent/gevent/blob/master/src/gevent/pywsgi.py#L724-L726

which itself is caused by never running https://github.com/gevent/gevent/blob/master/src/gevent/pywsgi.py#L698 .

A browser will ignore the extra packets, but they are still sent and processed. I could not find an easy workaround by changing settings or headers. Serving with gunicorn, rather than the built-in WSGIServer, does not produce these extra packets.
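For context, the chunked-write path referenced above boils down to three separate socket sends per application chunk, roughly like this (a paraphrase of the linked lines, not the verbatim pywsgi source):

# Each application chunk is written as three separate sends,
# hence several TCP packets per server push.
self._sendall(("%x\r\n" % len(data)).encode('ascii'))  # chunk-size line
self._sendall(data)                                     # chunk payload
self._sendall(b'\r\n')                                  # chunk trailer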

@jamadden

Member

jamadden commented Jun 2, 2018

Packets at the TCP level shouldn't matter; they are independent of the stream of data (i.e., what the browser sees is above the level of TCP). Other than slightly more overhead, is there some problem this causes?

@tomberek

tomberek commented Jun 4, 2018

No. But having three round trips per push negates much of the reason to use SSE; it's quite a bit of overhead. A single conditional on the content type could set the boolean so that chunked encoding is not used.
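Something along these lines, presumably (a sketch of the idea only; the helper name is hypothetical and this is not gevent's actual decision logic):

# Hypothetical illustration of the proposed conditional; not gevent code.
def should_use_chunked(response_headers):
    content_type = dict(response_headers).get('Content-Type', '')
    # Skip chunked framing (and the extra sends) for event streams.
    return not content_type.startswith('text/event-stream')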

@arcivanov

Contributor

arcivanov commented Jun 4, 2018

@jamadden

Member

jamadden commented Jun 7, 2018

I don't think socket.sendmsg does much for us here:

  • It's only guaranteed on Unix;
  • It's only available in Python 3.3+ (and the OP is on Python 2.7);
  • Prior to Python 3.5, in the event of an InterruptedError it may be possible to end up in an unknown state with regard to what data was sent.

I don't think watching for text/event-stream as a returned content type and turning off chunking is the right thing to do:

  • I would hope we can find something more general;
  • That would actually violate the HTTP spec, at least as commonly implemented. Clients either want a Content-Length header or they want Transfer-Encoding: chunked to know how much to read. (That may not strictly be required, but I'm pretty sure some clients would break if one or the other of those wasn't there.)
  • Lastly, text/event-stream is not yet an official IANA MIME type, AFAICS.

gunicorn handles this case in the normal way, i.e., using chunked encoding. It's true that gunicorn makes one socket call to send a chunk; it does this by copying all the data into a single buffer. That has a different set of tradeoffs, notably more memory usage and higher CPU time on the server, but potentially fewer packets in flight.
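A simplified illustration of that single-buffer approach (an illustration only, not gunicorn's actual code):

# Assemble the chunk-size line, payload, and trailer into one buffer
# and hand it to the kernel in a single send call.
def write_chunk(sock, data):
    towrite = b''.join([
        ("%x\r\n" % len(data)).encode('ascii'),
        data,
        b'\r\n',
    ])
    sock.sendall(towrite)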

The creation of a single buffer is something that happens in Python with the GIL held and without having the opportunity to switch greenlets, so it potentially hurts concurrency.

As the data chunks get larger (and copying costs go up) the chances of using fewer packets go down as fragmentation starts to happen. Or to put it another way, the extra packets cost less and less, relatively speaking.

Given TCP windows, all three of gevent's packets can probably be sent at the same time without waiting for individual ACKs from the client, so overall I wouldn't expect this to add much latency. Of course, on poor quality networks (some mobile connections) things may be different.

Let's see if a benchmark will tell us anything.

Here's an app that will write 10,000 25-byte chunks:

# server.py
def app(environ, start_response):
    def gen():
        for _ in range(10000):
            yield b'data: This is some data\n\n'

    start_response('200 OK', [('Content-Type', 'text/plain')])
    return gen()

Under Python 3.6 with python -m gevent.pywsgi server:app in one window and ab -n 200 -c 5 http://127.0.0.1:8080/ in the other, I get 6.39 requests per second:

Concurrency Level:      5
Time taken for tests:   31.288 seconds
Complete requests:      200
Failed requests:        0
Total transferred:      50020200 bytes
HTML transferred:       50000000 bytes
Requests per second:    6.39 [#/sec] (mean)
Time per request:       782.205 [ms] (mean)
Time per request:       156.441 [ms] (mean, across all concurrent requests)
Transfer rate:          1561.22 [Kbytes/sec] received

If I change gevent's chunked writing to collect into a bytearray, I get 5.17 requests per second:

def _write(self, data):
    if self.response_use_chunked:
        ## Write the chunked encoding
        header = ("%x\r\n" % len(data)).encode('ascii')
        towrite = bytearray()
        towrite.extend(header)
        towrite.extend(data)
        towrite.extend(b'\r\n')
        self._sendall(towrite)

Concurrency Level:      5
Time taken for tests:   38.706 seconds
Complete requests:      200
Failed requests:        0
Total transferred:      50020200 bytes
HTML transferred:       50000000 bytes
Requests per second:    5.17 [#/sec] (mean)
Time per request:       967.646 [ms] (mean)
Time per request:       193.529 [ms] (mean, across all concurrent requests)
Transfer rate:          1262.03 [Kbytes/sec] received

Directly concatenating and sending the strings (self._sendall(header + data + b'\r\n')) was faster, but still not as fast as sending three packets. (Incidentally, gunicorn's b''.join() approach was in the middle; in the rest of this post we'll just use concatenation, even though it's not actually a valid approach because it doesn't handle byte buffers.)

Concurrency Level:      5
Time taken for tests:   31.731 seconds
Complete requests:      200
Failed requests:        0
Total transferred:      50020200 bytes
HTML transferred:       50000000 bytes
Requests per second:    6.30 [#/sec] (mean)
Time per request:       793.285 [ms] (mean)
Time per request:       158.657 [ms] (mean, across all concurrent requests)
Transfer rate:          1539.42 [Kbytes/sec] received

What if we have 5,000-byte chunks? The three-packet approach yields 6.05 requests per second:

Concurrency Level:      5
Time taken for tests:   33.032 seconds
Complete requests:      200
Failed requests:        0
Total transferred:      10000020200 bytes
HTML transferred:       10000000000 bytes
Requests per second:    6.05 [#/sec] (mean)
Time per request:       825.795 [ms] (mean)
Time per request:       165.159 [ms] (mean, across all concurrent requests)
Transfer rate:          295643.64 [Kbytes/sec] received

and concatenation yields 5.94 RPS:

Concurrency Level:      5
Time taken for tests:   33.646 seconds
Complete requests:      200
Failed requests:        0
Total transferred:      10000020200 bytes
HTML transferred:       10000000000 bytes
Requests per second:    5.94 [#/sec] (mean)
Time per request:       841.162 [ms] (mean)
Time per request:       168.232 [ms] (mean, across all concurrent requests)
Transfer rate:          290242.58 [Kbytes/sec] received

Let's simulate a poor wireless network. Sticking with our 5,000-byte chunks, and with a 100 ms delay and 1% packet loss, the three-packet (current) version drops to 5.92 RPS, 97.8% of the perfect performance:

Concurrency Level:      5
Time taken for tests:   33.803 seconds
Complete requests:      200
Failed requests:        0
Total transferred:      10000020200 bytes
HTML transferred:       10000000000 bytes
Requests per second:    5.92 [#/sec] (mean)
Time per request:       845.074 [ms] (mean)
Time per request:       169.015 [ms] (mean, across all concurrent requests)
Transfer rate:          288899.24 [Kbytes/sec] received

Under these conditions, the concatenation approach comes to 5.74 RPS, or 96.6% of the perfect performance:

Concurrency Level:      5
Time taken for tests:   34.863 seconds
Complete requests:      200
Failed requests:        0
Total transferred:      10000020200 bytes
HTML transferred:       10000000000 bytes
Requests per second:    5.74 [#/sec] (mean)
Time per request:       871.573 [ms] (mean)
Time per request:       174.315 [ms] (mean, across all concurrent requests)
Transfer rate:          280115.48 [Kbytes/sec] received

(There's some variance here; I've seen concatenation as low as 5.53 and as high as 5.9, and three-packet as high as 5.99; I chose representative numbers.)

For the 25-byte chunks, three-packet gives us anywhere from 5.85 to 6.20 RPS (91.5% to 97.0% of the perfect-network figure), while concatenation under this network condition gives us 6.08 to 6.16 RPS (96.5% to 97.7%). This roughly matches our intuition: when there are fewer packets overall, the effect depends heavily on which 1% get lost.

So it basically looks to me like the additional packets don't hurt us, compared to the overhead of creating a new data buffer. This is especially true on good networks and/or with larger chunk sizes. The effect holds for larger chunk sizes on lossy networks, and for small chunk sizes on lossy networks it's a wash.

@jamadden

Member

jamadden commented Jun 15, 2018

I'll wait a few more days for additional discussion and feedback on this. If I don't hear anything my intent at this time is to close this as "known set of tradeoffs."

@tomberek

tomberek commented Jun 15, 2018

Simply exposing response_use_chunked (or a similar boolean) in some manner that allows taking the other branch of that conditional would let event-stream users trim out the extra packets.

@jamadden

Member

jamadden commented Jun 15, 2018

I don't think it's quite that simple 😄. For one thing, without a Content-Length header or a Transfer-Encoding: chunked header, the only way for the client to know the response is complete is for the connection to close. To prevent it from trying to pipeline requests, the response should include the Connection: close header (and of course we should actually close the connection). Currently we don't have any way to detect and set that.

More seriously, without using Transfer-Encoding: chunked, the client doesn't necessarily have any way to know when it should present a buffer for reading to higher levels. TCP doesn't tell it. Strictly speaking neither does HTTP, of course, but exposing chunks seems to be a common and relied-upon implementation strategy. Depending on fragmentation, socket buffers, intervening proxies, etc., a single message might wind up visible in one recv call or spread across several, or multiple messages might get bundled into a single recv call.
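For reference, a single message from the earlier app, framed with Transfer-Encoding: chunked, looks like this on the wire (shown here as Python byte literals; the hex prefix is the payload length):

payload = b'data: This is some data\n\n'   # 25 bytes
framed = b'19\r\n' + payload + b'\r\n'     # "19" is hex for 25
# The response body ends with a terminating zero-length chunk: b'0\r\n\r\n'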

The effect of simply turning off response_use_chunked is easy to see with a client like httpie if you use our earlier app and add a sleep between yields to slow it down. Because httpie waits for the connection to close to know the message is done, you get:

$ http http://localhost:8080
HTTP/1.1 200 OK
Content-Type: text/plain
Date: Fri, 15 Jun 2018 21:14:20 GMT


http: error: ConnectionError: HTTPConnectionPool(host='localhost', port=8080): Read timed out.

curl is architected differently and parses each incoming recv looking for newlines to display, but if you take those out of the example, it too hangs without displaying any output. (And without Connection: close, we wait for a second request to come, even after we've exhausted the iterable, and curl still just sits there.)
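A sketch of the slowed-down app used above (a reconstruction; the exact delay doesn't matter, and gevent.sleep is just one way to add it):

# server_slow.py -- the earlier app with a pause between yields
import gevent

def app(environ, start_response):
    def gen():
        for _ in range(10000):
            yield b'data: This is some data\n\n'
            gevent.sleep(1)  # slow the stream so client buffering behaviour is visible

    start_response('200 OK', [('Content-Type', 'text/plain')])
    return gen()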

Now it's true that one specific (non-IANA-standard) MIME type (text/event-stream) mandates newline-based message framing, but there's no way to know if clients for that MIME type can hook low enough into the HTTP stream to pull messages out without actually using chunks. While the HTML living standard does recommend that browsers use an appropriate line-based buffering mode for that content type when constructing an EventSource object, there are definitely polyfills for otherwise-unsupported browsers that have no access to that level and can't.

  • I don't think gevent should provide any specific support for text/event-stream (neither does gunicorn).
  • Even so, it's somewhat more complicated than it looks to disable chunking because of the need to set the Connection: close header and actually close the connection. We can't rely on the WSGI app to do this.
  • For the best buffering and proxy experience across all sorts of clients, I think that WSGI iterables should be sent using chunked encoding whenever possible.

That gets us back to the reason this was requested in the first place, which was to reduce the number of send calls and the number of packets due to "quite a bit of overhead." My synthetic benchmarks didn't detect significant overhead from the three calls, and actually showed that to be more efficient than building a single buffer (which is the only remaining option). Thus the question that needs to be answered is: can this overhead be shown to be problematic in other benchmarks or real-world uses, to the extent that it justifies whatever complexity is required to implement and test a different approach?

@jamadden

Member

jamadden commented Jun 15, 2018

Other data points: gevent does not currently disable Nagle's algorithm for its pywsgi sockets (i.e., it does not call sock.setsockopt(socket.SOL_TCP, socket.TCP_NODELAY, 1)). This means we would typically expect small writes to accumulate while we are awaiting remote ACKs, thus gaining some buffering benefit at the kernel level. On localhost this probably makes very little difference, but across a WAN it might do more to reduce the actual number of packets. (In local benchmarks I observed a 3-8% drop in performance from disabling Nagle. Since we're bottlenecked in Python, this makes sense. Disabling it also increased the number of packets sent by 3%; again, that makes sense.)

gunicorn does disable Nagle's algorithm, I believe.
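A minimal sketch of what disabling Nagle on an accepted socket looks like (using IPPROTO_TCP for portability; where and whether pywsgi should apply this is exactly the open question):

import socket

def disable_nagle(sock):
    # Tell the kernel to send small writes immediately instead of
    # coalescing them while waiting for outstanding ACKs.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)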

There's also the platform-specific TCP_NOPUSH/TCP_CORK option, which explicitly allows multiple writes to be combined. It did indeed dramatically reduce the number of packets transmitted, by three orders of magnitude, at the cost of significant latency. It seems (on macOS anyway) that once it is turned on it can't fully be turned off (at least for short-lived streams).
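A sketch of how corking could wrap a series of writes (Linux's TCP_CORK shown; macOS uses TCP_NOPUSH instead, and as noted above the uncorking semantics differ across platforms):

import socket

def send_corked(sock, parts):
    # While the cork is set, the kernel combines partial writes into full segments.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 1)
    try:
        for part in parts:
            sock.sendall(part)
    finally:
        # Clearing the cork flushes whatever is still buffered in the kernel.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 0)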

Finally, it looks like my first round of synthetic benchmarks may be largely invalid. I have to do some more checking, but I may not have been testing what I thought I was testing. If so, my apologies for putting everyone through all that. I will double check and if necessary run numbers again.

@jamadden

Member

jamadden commented Jun 16, 2018

[benchmark chart: perfect network]

OK, above are more accurate numbers. The X axis is RPS on Python 3.6 with 10 clients and 200 requests. This does indeed show a small but constant overhead for a perfect network.

[benchmark chart: lossy network]

Likewise for a lossy network, there's a constant overhead. The TCP_NODELAY value is all over the place, though, apparently helping for non-chunked transfers on lossy networks but hurting for chunked, and the opposite on perfect networks. I didn't run any statistical tests, but I'd consider that difference to be within the margin of error.

So with this new data, I think it's clear that the overhead is measurable in these synthetic benchmarks (whether it's significant in a practical context is undetermined). The overhead does begin to vanish as chunk sizes increase; the lines meet at around 75000 bytes. I suspect most chunks are smaller than that. Based on this, it seems reasonable to me to attempt a combined write, as it's not that hard and there may be benefits (I say attempt because socket.sendall, especially in gevent, calls send repeatedly, so a lot depends on the state of the buffers, congestion, window sizes, etc.).

Thanks for this conversation! Once again I apologise for drawing incorrect conclusions based on faulty data above.

jamadden added a commit that referenced this issue Jun 16, 2018

Use a buffer to generate HTTP chunks.
The intent is to reduce network overhead for small chunk sizes without adding much overhead for larger sizes.

Fixes #1233.

jamadden added a commit that referenced this issue Jun 18, 2018

Use bytearray += instead of bytearray.extend
This compiles to fewer bytecodes and uses built-in slots instead of
method dispatch, so it should be faster.

Benchmarks show a tiny improvement (using the same benchmarks as
issue #1233, so not exactly designed to test this):

RPS

+-------+--------+------+
| Size  | master | +=   |
+-------+--------+------+
| 1     | 5.60   | 5.63 |
| 10    | 5.49   | 5.53 |
| 100   | 4.99   | 5.00 |
| 1000  | 3.53   | 3.57 |
+-------+--------+------+
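Putting the two commits together, the chunked write ends up looking roughly like this (a simplified sketch, not the exact merged code):

def _write(self, data):
    if self.response_use_chunked:
        ## Write the chunked encoding: build the whole chunk in one buffer
        ## so it goes out in a single sendall call.
        towrite = bytearray(("%x\r\n" % len(data)).encode('ascii'))
        towrite += data
        towrite += b'\r\n'
        self._sendall(towrite)
    else:
        self._sendall(data)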