Performance improvements for HTTP #141

Closed
normanmaurer opened this Issue Mar 13, 2018 · 29 comments


normanmaurer commented Mar 13, 2018

Currently the HTTP performance is not as good as we had hoped. This may be caused by multiple things:

  • cross-module calls (no optimisations for this in Swift yet :( )
  • HttpHeaders creates Strings in a non-lazy fashion
  • others...

We should look into what we can do to make it faster.

@normanmaurer normanmaurer self-assigned this Mar 13, 2018

normanmaurer commented Mar 13, 2018

As suggested by @Lukasa, we may be able to lazily parse and create Strings, and make HttpHeaders operate directly on ByteBuffer. This way we would only need to create Strings if the user asks for a specific header etc.
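A minimal sketch of that lazy idea, with a plain `[UInt8]` standing in for a `ByteBuffer` slice (the `LazyHeaders` and `asciiLower` names below are hypothetical, not SwiftNIO API):

```swift
// Helper: ASCII-lowercase a single byte (header names are case-insensitive).
func asciiLower(_ byte: UInt8) -> UInt8 {
    return (65...90).contains(byte) ? byte + 32 : byte
}

// Sketch: headers kept as raw bytes; a String is only materialized when
// the user asks for a specific header.
struct LazyHeaders {
    let bytes: [UInt8]  // raw "Name: value\r\n" lines as received

    // Case-insensitive lookup; the returned value is the only String allocated.
    subscript(name: String) -> String? {
        let needle = Array(name.utf8).map(asciiLower)
        for line in bytes.split(separator: UInt8(ascii: "\n")) {
            var field = Array(line)
            if field.last == UInt8(ascii: "\r") { field.removeLast() }
            guard let colon = field.firstIndex(of: UInt8(ascii: ":")) else { continue }
            if field[..<colon].map(asciiLower) == needle {
                var value = Array(field[(colon + 1)...])
                while value.first == UInt8(ascii: " ") { value.removeFirst() }
                return String(decoding: value, as: UTF8.self)
            }
        }
        return nil
    }
}
```

With something like this, a request whose headers are never inspected costs no String allocations at all.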

tanner0101 commented Mar 13, 2018

@normanmaurer pre-NIO, Vapor had implemented lazy headers and that gave us a huge performance gain. Definitely the biggest gain over any previous optimizations we had done (besides non-blocking :P).

Some notes about how we did it:

  • HTTPHeaders struct has a COW HTTPHeadersStorage class.
  • HTTPHeadersStorage is backed by an UnsafeMutablePointer<UInt8>.
  • Reading headers means scanning over the bytes and extracting the desired headers.
  • Mutating headers means first doing COW check (and possibly copying the buffer) then editing the bytes in memory, re-allocating the buffer if needed.
  • The important piece was that the HTTPHeadersStorage buffer was always valid HTTP/1 headers. This allowed us to simply write the header storage buffer directly to the output stream when serializing HTTP messages.
  • We were also able to optimize some common base-cases, like:
Content-Length: 0

These base-case HTTPHeaders structs were statically available and, if never mutated (which is common for simple plaintext benchmarks), the performance was great.

link: HTTPHeaderStorage.swift
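For readers unfamiliar with the pattern, the struct-plus-class COW arrangement in the bullets above might be sketched like this (`HTTPHeadersSketch` is illustrative, not Vapor's actual HTTPHeaderStorage, and `[UInt8]` stands in for the `UnsafeMutablePointer<UInt8>` buffer):

```swift
// Rough sketch of the COW arrangement described above: a value-type
// facade over reference-typed storage holding serialized header bytes.
struct HTTPHeadersSketch {
    private final class Storage {
        var bytes: [UInt8]  // invariant: always valid HTTP/1 header bytes
        init(bytes: [UInt8]) { self.bytes = bytes }
    }
    private var storage: Storage

    init(serialized: [UInt8]) { self.storage = Storage(bytes: serialized) }

    // Mutation does the COW check first: copy the buffer only if it is
    // shared with another HTTPHeadersSketch value.
    mutating func append(name: String, value: String) {
        if !isKnownUniquelyReferenced(&storage) {
            storage = Storage(bytes: storage.bytes)
        }
        storage.bytes.append(contentsOf: Array("\(name): \(value)\r\n".utf8))
    }

    // Serializing an HTTP message can write these bytes out directly.
    var serialized: [UInt8] { return storage.bytes }
}
```

Because the storage is always valid header bytes, serialization is just handing the buffer to the output stream, which is the property the comment highlights as the important piece.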

I was hoping at some point (before NIO happened) to further expand this HTTPHeaders concept to the entire HTTP message header (what NIO calls "head part"). I think everything there can benefit from being parsed lazily (besides perhaps the version, but still no harm done if it's parsed lazily). For example, Vapor has the concept of an "anything" router segment that matches all URIs. This, combined with lazy URI parsing in an HTTP request, could mean the URI never needs to get copied.

For reference, here is a rough idea of the performance I was seeing:

  • vapor/engine: ~130kRPS (highly optimized HTTP echo)
  • go/fast-http: ~120kRPS (also just HTTP echo)
  • vapor/vapor (pre-nio): ~105kRPS (vapor app returning "hello, world")
  • go/gin: ~90kRPS (gin app returning "hello, world")
  • vapor/vapor (post-nio): ~80kRPS (vapor app returning "hello, world")
helje5 commented Mar 13, 2018

https://lists.swift.org/pipermail/swift-server-dev/Week-of-Mon-20171211/000735.html

(As mentioned, I would prefer a lower level HTTPMessage object, which directly peeks into and reuses the receive buffer. But again, all this has been discussed quite actively, and people insisted that String is what we want to have here, and that it’ll be fast enough - so I guess we just stick to that.)

¯\_(ツ)_/¯

Or in other words: I absolutely agree. We had this discussion w/ @weissi in the HTTP list and IIRC he said that Strings are cool ;-)

weissi commented Mar 13, 2018

@helje5 Well, I still think the API should be Strings or at least some String-like type and not [UInt8], ByteBuffer or Unsafe[Raw]BufferPointer directly. So today I think we should make the String creation lazy and hope that that's good enough. Swift 5 (or maybe even 4.2) should improve the performance quite a bit because AFAIK it'll have a native UTF-8 backing and also the short string optimisation (Strings shorter than some number of characters can be stored inside of the pointer value, no allocations necessary).
If that turns out not to be fast enough still, we could always think about our own String-like type that is backed by a ByteBuffer without any further allocations. Possibly a future version of Swift would even allow backing the actual String type with additional representations.

Now I have to admit that I would have expected Swift to deliver a String type that is backed by memory provided externally by this point in time. In other words I previously had hoped that just naively going with String will be good enough as I hoped that the language would develop its String capabilities pretty soon. Unfortunately, that hasn't happened to this day but there's still hope. So far I think leaving the API as is but making the String creation lazy should be good enough for now.
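A toy version of such a String-like type, under the assumption that byte-wise equality covers most header-handling needs (`ByteString` is a hypothetical name, not a proposed API):

```swift
// Hypothetical sketch of the "own String-like type" idea: a value that
// points into an existing byte buffer and only materializes a real
// String when explicitly asked.
struct ByteString: Equatable, CustomStringConvertible {
    let bytes: ArraySlice<UInt8>  // stand-in for a ByteBuffer slice; no copy made

    // Equality compares bytes directly; no String is ever created.
    static func == (lhs: ByteString, rhs: ByteString) -> Bool {
        return lhs.bytes.elementsEqual(rhs.bytes)
    }

    // The (allocating) escape hatch back into the String world.
    var description: String {
        return String(decoding: bytes, as: UTF8.self)
    }
}
```

Such a type keeps the pleasant String-shaped API surface while deferring the allocation to the moment the user genuinely needs a String.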

helje5 commented Mar 13, 2018

cross-module calls (no optimisations for this in swift yet :( )

@jckarter may provide input on this, but my feeling is that this won't actually change. Or in other words, it may make sense to have one big module (I made that mistake in Noze.io, and used a lot of tiny modules, which is pretty bad for perf, especially if generics are involved).

The other thing I noticed from a performance perspective is that you create quite a few helper objects and make a lot of indirections. To such a degree that I wonder whether this is going to hurt performance (in general, not just HTTP) just by malloc and function call (+ARC - a thread safe op!) overhead. No idea.

helje5 commented Mar 13, 2018

@helje5 Well, I still think the API should be Strings or at least some String-like type and not [UInt8], ByteBuffer or Unsafe[Raw]BufferPointer directly.

Yes, I never suggested that ;-)

There is a lot of optimisation potential in here. It may also make sense to provide static strings as the subscript input (something like header[NIO.Headers.ContentLength]). And for some values (User-Agent etc. come to mind), it may make sense to maintain an intern table.

jckarter commented Mar 13, 2018

Cross-module inlining is planned to become publicly available using the @inlinable attribute from https://forums.swift.org/t/se-0193-cross-module-inlining-and-specialization/7310 once that's finalized; the design is unlikely to diverge substantially from the existing @_inlineable (note the second e) internal attribute that's already present in 4.x except in spelling. Longer term, the build system ought to support cross-module optimization as a matter of course.

tanner0101 commented Mar 13, 2018

Well, I still think the API should be Strings or at least some String-like type

Totally agree.

So today I think we should make the String creation lazy and hope that that's good enough.

Just lazy String creation will be a huge win by itself. I think this makes sense as a first step before needing to delve into optimizing the actual String type.

It may also make sense to provide static strings as the subscript input

An overload for HTTPHeaders.subscript(name:) that accepts StaticString would be great.
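A sketch of what that overload could look like; `HeadersWithStaticLookup` is a hypothetical stand-in for HTTPHeaders, and the byte-wise comparison is one possible way to avoid allocating a String for the name:

```swift
// A StaticString's UTF-8 bytes live in the binary, so the header name
// needs no String allocation at the call site.
struct HeadersWithStaticLookup {
    var storage: [(name: String, value: String)] = []

    subscript(name: String) -> String? {
        return storage.first { $0.name.lowercased() == name.lowercased() }?.value
    }

    // StaticString overload: compare on raw bytes, no name String created.
    subscript(name: StaticString) -> String? {
        return name.withUTF8Buffer { buffer -> String? in
            let needle = buffer.map { $0 >= 65 && $0 <= 90 ? $0 + 32 : $0 }
            return storage.first { entry in
                let candidate = entry.name.utf8.map { $0 >= 65 && $0 <= 90 ? $0 + 32 : $0 }
                return candidate == needle
            }?.value
        }
    }
}
```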


Ideally, parsing an HTTP/1 message head (start line and headers) would require zero copies. After the OS reads the data into a ByteBuffer, that buffer would then be used to produce all needed data lazily. The HTTP scanner just needs to iterate over the bytes until it finds \r\n\r\n, then it's done: at this point it has been fully parsed into an HTTP{Request|Response}Head. These would all be computed properties that access the private COW storage.

The HTTP decoder (pipeline handler) will of course need to peek at some things like the Connection, Transfer-Encoding, or Content-Length headers and potentially the HTTP version to do its job. But those should be the only copies required as a part of the pipeline, and they should happen only when required.

In terms of HTTP/2, I'm not too well versed with the spec yet, but as I understand it the same method should work just as well with HTTP/2 frames. Probably with even more room for optimization, since we don't need to convert text-encoded integers to discover content length or search for CRLFs.


That being said, the above change would require a lot of work. It would replace much, if not all, of what http-parser provides. A simpler alternative (and perhaps a better starting place) would be to apply this lazy parsing to the HTTP headers struct only. The pointer info we get from the http-parser callbacks (on_header_field and on_header_value) can be stored to facilitate reads and writes on the underlying storage (see HTTPHeaderStorage.indexes property).
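The single-pass scan described above might be sketched like this, assuming a plain `[UInt8]` in place of `ByteBuffer` (illustrative only, not the proposed implementation):

```swift
// Find where the HTTP/1 message head ends (\r\n\r\n). Start line and
// header slices would then be computed lazily from the same buffer.
func indexOfHeadEnd(_ bytes: [UInt8]) -> Int? {
    guard bytes.count >= 4 else { return nil }
    let crlfcrlf: [UInt8] = [13, 10, 13, 10]  // \r\n\r\n
    for i in 0...(bytes.count - 4) {
        if Array(bytes[i..<i + 4]) == crlfcrlf {
            return i + 4  // index just past the head; any body starts here
        }
    }
    return nil
}
```

Everything before the returned index is the head, and no bytes have been copied to discover that.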

helje5 commented Mar 14, 2018

HTTP/2 does something similar to what I suggested above:

it may make sense to maintain an intern table

The disadvantage of preserving the HTTP header as-is is that the header block can actually become sizeable (1K or more) and may require multiple reallocs. Another way to approach this is, as mentioned, string tables and maybe structured values.

The vast majority of header names and values are very static. Only a few headers actually change in real life (Date, Content-Length). Since you have to look at every byte anyway (to find CRLF etc.), you can just as well match headers during that process.

Alternatively, instead of storing one big blob, you could store them in a structured way, either as actual Strings (with CoW values) or as indices into a string table (the latter could save you a lot of ARC). So that

Content-Type: text/json
Content-Length: 0
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.2.149.27 Safari/525.13,

becomes:
(-1, 1)
(-2, 2)
(-3, 3)
aka ~6 bytes instead of 175.

This should play especially well with the "performance tests" ;-)

FWIW: This is also what Netty seems to be doing at least for common header names:

public final class HttpHeaderNames {
    public static final AsciiString ACCEPT = AsciiString.cached("accept");
}
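A possible Swift analogue of the Netty pattern above, assuming a hypothetical `HTTPHeaderName` type (not a SwiftNIO API):

```swift
// Common header names as static constants so they are allocated once
// and compared cheaply, echoing Netty's AsciiString.cached.
struct HTTPHeaderName: Equatable {
    let lowercased: String
    init(_ name: String) { self.lowercased = name.lowercased() }

    static let accept = HTTPHeaderName("accept")
    static let contentLength = HTTPHeaderName("content-length")
    static let userAgent = HTTPHeaderName("user-agent")
}
```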

But in the end I'm not the poor soul who has to implement that, so I guess I'll stop making suggestions 😬

normanmaurer commented Mar 14, 2018

I will work on this next week... Or @tanner0101, do you want to try porting your stuff from Vapor?

tanner0101 commented Mar 21, 2018

@normanmaurer honestly very tempting offer... but I don't have enough free time right now to do the job right. I am more than happy to help out with code review and testing though!

normanmaurer commented Mar 22, 2018

@tanner0101 Ok, will hopefully start on Monday.

helje5 commented Mar 28, 2018

JFYI: https://github.com/Geal/parser_benchmarks

Benchmarks of parsing HTTP requests with nom (a parser combinator):
• The classical way: 355 Mb/s,
• Optimized (with SIMD): 1930 Mb/s.

normanmaurer commented Apr 3, 2018

@tanner0101 can you share how you ran the benchmark? Like what command you used on the client side, what response you expected, etc.

tanner0101 commented Apr 3, 2018

I always run this same command to keep things consistent:

wrk -t 4 -c 128 -d <usually between 1s - 60s> http://localhost:8080/ping

And it hits a route that returns:

HTTP/1.1 200 OK
Content-Length: 4
Date: ...
Content-Type: text/plain

pong

note: the Date header is required for valid HTTP, which is annoying because it's quite slow to generate the date. Vapor's DateMiddleware optimizes this by caching the result and only recalculating when it changes.
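The caching approach described here could be sketched as follows; `CachedDate` is illustrative (using Foundation's `DateFormatter`), not Vapor's actual DateMiddleware code:

```swift
import Foundation

// The RFC 1123 date string only changes once per second, so reformat at
// most once per second and reuse the result.
final class CachedDate {
    private var cachedSecond: Int = -1
    private var cachedValue: String = ""
    private let formatter: DateFormatter = {
        let f = DateFormatter()
        f.dateFormat = "EEE, dd MMM yyyy HH:mm:ss zzz"
        f.locale = Locale(identifier: "en_US_POSIX")
        f.timeZone = TimeZone(identifier: "GMT")
        return f
    }()

    // Returns the Date header value, reformatting only on a new second.
    func value(now: Date = Date()) -> String {
        let second = Int(now.timeIntervalSince1970)
        if second != cachedSecond {
            cachedSecond = second
            cachedValue = formatter.string(from: now)
        }
        return cachedValue
    }
}
```

Under steady benchmark load this turns thousands of expensive formatter calls per second into one.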

In Vapor, that looks like:

router.get("ping") { req -> StaticString in
    return "pong" 
}

Since the "pong" string doesn't change between requests, we can use StaticString which gives a nice boost.

normanmaurer commented Apr 4, 2018

@tanner0101 thanks. Could you also provide me with an example that I can run using vapor-core only ?

twof commented Apr 4, 2018

@tanner0101 Do you have any stats on memory usage during those benchmarks?

tanner0101 commented Apr 5, 2018

@twof 6.3MB at boot, 7.1MB during peak load.

@normanmaurer here is a base Vapor 3 (release candidate) project with the ping route.

nio-http.zip

Use this to run it:

swift run -c release Run --env prod

Beware that (at the time of writing) Vapor hasn't had the official 3.0.0 tag placed yet and thus hasn't undergone any in-depth performance or memory analysis. There are likely to be a few performance issues and possibly some memory leaks. (If you find any, please let me know haha.) But it should hopefully help you nonetheless. You can check back on https://github.com/vapor/vapor/releases/tag/3.0.0 for the official release.

Also note that DateMiddleware is disabled by default. You can uncomment in configure.swift to enable it.

fadi-botros commented May 1, 2018

Very good issue. Actually I tried to benchmark SwiftNIO against Vert.x (tried to make a benchmark under https://www.techempower.com/benchmarks/), and the results were extremely disappointing to me.

I think we need a whole new string class, and also high-performance JSON encoding/decoding. (I know SwiftNIO is low-level and may not support JSON out of the box, but I think a set of performant converters that don't use Swift String would give a big performance gain for this framework; each converter could live in its own module, so that one only uses what one needs.)

Swift is very performant in a lot of things, but there is a big question mark over Swift String performance.

We need something like a C string, plus a lot of operations on it. In HTTP one hardly uses UTF-8 or other encodings; those are needed for databases and other things (out of scope for SwiftNIO), but HTTP itself doesn't need anything beyond C strings. One could try making a StringBuffer implementation that doesn't use Swift String, for example.

mcdappdev commented May 1, 2018

@fadi-botros Can you please share the results from the benchmarks you ran?

helje5 commented May 1, 2018

@fadi-botros It would be very helpful if you would provide a lot more info on this; as it stands this is just a pretty useless claim. You did JSON encoding/decoding? How? Using Codable? Maybe even Foundation.JSONDecoder? You'll get very disappointing performance with those for sure. What Swift version did you use? etc.

Wrt the Swift String performance, I think the claim is that a Swift 4.1 UTF-8 String will be pretty much the same as the buffer. There is probably no need for a new class for cases which actually warrant a String (e.g. IMO this isn't the case in the HTTP stack; also check out PR #355 on this).

I did a pretty detailed performance analysis for Redi/S, and NIO performance is already (surprisingly) close to the C implementation. You can find it over here: RedisServer/Performance.

twof commented May 1, 2018

Can you also post the code for your tests? There are some very confusing results coming out of Vapor too. No idea if anyone's looked into those yet.

fadi-botros commented May 1, 2018

Also, might using low-level threading primitives be better than GCD?
@helje5 says in the performance analysis of his tool Redi/S that using one thread was faster than using GCD or EventLoop (EventLoop code is also built on GCD, I guess).

@twof For the code, here it is:
https://gist.github.com/fadi-botros/b8276bb5e2b83a63045877aeb7415f34

mcdappdev commented May 1, 2018

@fadi-botros Can you share the actual results of your benchmark? Including what hardware and software you used to run it (wrk, etc.). It's hard to contextualize your concerns without seeing the numbers, and I'm curious to know what you used to perform the tests.

fadi-botros commented May 1, 2018

@mcdappdev These are the results. Sorry, they're in the raw form produced by the TechEmpower framework benchmark scripts.
Hardware: MacBook Pro 13'', Late 2011, with a 256 GB SSD and 8 GB of RAM.
Software: a Linux (Ubuntu Trusty) VM, using Vagrant (which is built on VirtualBox); the host OS is macOS High Sierra.
The "framework" itself is https://github.com/TechEmpower/FrameworkBenchmarks

https://gist.github.com/fadi-botros/fc0aeb0bbd27a986a1f2a4c3056a5e3d

Also note, my attempt at the "plaintext" test has problems and I don't know where.

This is my code (the link was also in a previous reply):
https://gist.github.com/fadi-botros/b8276bb5e2b83a63045877aeb7415f34

helje5 commented May 1, 2018

Upfront: I actually don't think this belongs here; maybe there should be a NIO mailing list (personally I wouldn't take part in a forum, but maybe that too). This is more of a general NIO programming question than an actual HTTP performance issue, I think. E.g. even basic stuff like the fact that NIO doesn't use GCD is missing here ...

Also, might using low-level threading primitives be better than GCD?

First of all: NIO does not use GCD. In a way it uses a simplified version of it, which has advantages.

What threading primitives work best for you, depends on the use case a lot. E.g. for Redi/S a plain Readers/Writer lock turned out to be best. But then the wrapped operations operate just in memory and are very fast in the first place.

Summary: it depends a lot.

"Almost" nothing to do w/ NIO. It has a thread safe setup, which is a little more expensive than what for example Node does, but if you avoid TS promises and such, it doesn't show up in my tests as an issue.

@helje5 says in the performance analysis of his tool Redi/S that using one thread was faster than using GCD or EventLoop (EventLoop code is also built on GCD, I guess).

So you didn't even read what I wrote. :-> I actually found that for Redi/S the sweet spot was two threads.
But yes, Node.js style separate processes w/o locking overhead may actually perform better than a threaded server (w/ the necessary in process locking). But all that again really depends on what you are actually doing (how much shared state etc).

@twof For the code, here it is:
https://gist.github.com/fadi-botros/b8276bb5e2b83a63045877aeb7415f34

This has quite a few basic NIO programming errors (flush, flush, close? why is it a ChannelOutboundHandler?); it uses Codable, worse JSONEncoder, a DateFormatter, etc. I'm not surprised this is slow.
But none of that has to do w/ NIO.

That you come forward and claim that we need a:

WHOLE NEW string class

seems a little premature and not backed by actual analysis ;-)

fadi-botros commented May 3, 2018

@helje5 Sorry, but could you suggest improvements to my code?
I tried removing the date formatter and using the C time library (time, gmtime):

let weekdays = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]
// note: HTTP dates (RFC 7231 IMF-fixdate) use abbreviated month names
let months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
let comma = ", "
let gmt = " GMT"
let colon = ":"
let space = " "

...

var t: time_t = 0
time(&t)
let a = gmtime(&t).pointee
let weekday = weekdays[Int(a.tm_wday)]
let month = months[Int(a.tm_mon)]
I even tried converting them to utf8CString and then appending them; it was a pain, and the performance was worse.

I also changed the writeAndFlush, writeAndFlush, close sequence
to write, write, writeAndFlush, close.

Still no significant performance improvement.

It seems the only problem is with JSONEncoder, but I also think the String class may need some C-level enhancements (not sure about that).

weissi commented Nov 27, 2018

The suggested items have been addressed, and everything changes anyway with Swift 5.

@weissi weissi closed this Nov 27, 2018
