Using RackBody does many many many write() syscalls #78
Can we use https://github.com/WeTransfer/zip_tricks/blob/master/lib/zip_tricks/write_buffer.rb for this? I would also like to make this configurable, since one of the biggest gripes we had with things like Net::HTTP was all the implicit buffering going on, so I would love to give the user the possibility to bypass buffering if they so desire.
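A minimal sketch of what that could look like, assuming a hypothetical BufferedWriter class rather than the actual zip_tricks WriteBuffer API: wrap the io-like destination, accumulate small writes, and let the caller opt out of buffering entirely.

```ruby
# Hypothetical sketch, not the zip_tricks WriteBuffer API: wraps any object
# responding to << and forwards accumulated bytes once a threshold is reached.
class BufferedWriter
  def initialize(io, buffer_size: 16 * 1024)
    @io = io
    @buffer_size = buffer_size
    @buffer = String.new(capacity: buffer_size)
  end

  def <<(bytes)
    @buffer << bytes
    flush if @buffer.bytesize >= @buffer_size
    self
  end

  def flush
    @io << @buffer unless @buffer.empty?
    @buffer.clear
    self
  end
end

# Opting out of buffering could then be as simple as:
# io = buffer_size.zero? ? raw_io : BufferedWriter.new(raw_io, buffer_size: buffer_size)
```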
@felixbuenemann what are your thoughts on this?
What I mean is that if we plug the WriteBuffer into the
That should work, though some care is needed around
I am doing a similar thing in xlsxtream to reduce CRC32 overhead: https://github.com/felixbuenemann/xlsxtream/blob/master/lib/xlsxtream/io/zip_tricks.rb

Great idea with the initial capacity; I'm not doing that yet in xlsxtream. Some benchmarking should be done to verify a good choice. If we don't buffer more than the kernel, it shouldn't really increase latency.
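As a rough illustration of the approach described here (accumulating bytes and updating the running CRC32 over larger chunks instead of once per tiny write), assuming a hypothetical BufferedCRC32 helper rather than the actual xlsxtream or zip_tricks classes:

```ruby
require 'zlib'

# Hypothetical helper: collects small writes and only updates the running
# CRC32 once a larger chunk (default 4 KiB) has accumulated.
class BufferedCRC32
  def initialize(chunk_size: 4 * 1024)
    @chunk_size = chunk_size
    @pending = String.new(capacity: chunk_size)
    @crc = Zlib.crc32 # initial CRC (zero)
  end

  def <<(bytes)
    @pending << bytes
    drain if @pending.bytesize >= @chunk_size
    self
  end

  def to_i
    drain
    @crc
  end

  private

  def drain
    return if @pending.empty?
    @crc = Zlib.crc32(@pending, @crc) # one CRC call over the whole chunk
    @pending.clear
  end
end
```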
@WJWH Is it unsafe to do I would assume that could avoid the need to allocate a new string, but I haven't verified. I'm also curious why you are tracking
That will work if we insert the buffer below the byte counter of the output (so at the lowest level). I think we could do some IPS benchmarking with an archive which has a lot of items?
@julik I haven't read the code too closely, but if
Put together a POC branch. @felixbuenemann, if you could take this one I would really appreciate it; I am a bit swamped atm 😇 And indeed we could pass the buffer size everywhere appropriate. Feel free to push on top of the branch if you want.
I'm working on a ton of things myself, but this one shouldn't be too hard. @WJWH Would you like to help with benchmarking different buffer sizes? The best thing would be a real-world use case, e.g. through the full HTTP stack with Puma. If we only test in isolation we might end up with buffers that are too large.
@julik Any reason that zip_tricks is doing
@julik Just swapping the two usages above makes CRC32 about 40x faster for smaller block sizes. I also benchmarked CRC32 performance, and since everything between 4K and 512K is roughly the same speed, I'd go with 4K as the default buffer size: it is the page size on pretty much all operating systems, so you can't allocate less than that anyway.
Here's the CRC benchmark:

    # frozen_string_literal: true
    require 'benchmark/ips'
    require 'zlib'

    Benchmark.ips do |x|
      x.time = 10
      x.warmup = 2

      osize = 1<<20 # 1 MiB
      expected = Zlib.crc32("\0"*osize)

      (0..20).each do |bits|
        isize = 1<<bits
        blocks = osize/isize
        size = bits < 10 ? "#{isize}B" : "#{isize>>10}K"

        x.report("crc32 #{blocks} * #{size}") do
          ibuf = String.new("\0"*isize, encoding: Encoding::BINARY)
          obuf = String.new(capacity: osize, encoding: Encoding::BINARY)
          crc = Zlib.crc32()
          for i in 0...blocks
            crc = Zlib.crc32(ibuf, crc)
            obuf << ibuf
          end
          fail if crc != expected
        end

        # x.report("crc32cm #{blocks} * #{size}") do
        #   ibuf = String.new("\0"*isize, encoding: Encoding::BINARY)
        #   obuf = String.new(capacity: osize, encoding: Encoding::BINARY)
        #   crc = Zlib.crc32()
        #   for i in 0...blocks
        #     crc = Zlib.crc32_combine(crc, Zlib.crc32(ibuf), ibuf.bytesize)
        #     obuf << ibuf
        #   end
        #   fail if crc != expected
        # end
      end

      x.compare!
    end
The slow speed of CRC32 calculations (aside from buffer size) is fixed by #80.
Btw. I just fixed a spec in #74 and noticed that, just for streaming a ZIP with a file called "hello.txt" and content "ßHello from Rails", it produced a body enumerator with 46 parts…
Yeap :-)
POC looking into WeTransfer#78
The RackBody class spins up a Streamer which in turn spins up a ZipWriter, passing along an io-like object. The ZipWriter will << many extremely small strings into this io object, often only one or two bytes for local header values. Puma handles this rather poorly, since it essentially does body.each { |chunk| socket.syswrite(chunk) }. This leads to dozens/hundreds of very small write() syscalls, and all the context switches to the kernel generate quite a bit of overhead.

We also optimize the write buffer a little bit so that it won't be buffering writes which are larger than the desired buffer size, but will pass them through untouched. We ensure the write buffer does not create copies of the string it hosts, to conserve memory, and we explain why we do this in the documentation.

On webserver use, the buffering OutputEnumerator makes serving roughly 15 to 20 percent faster as opposed to not using buffering at all. For all standard cases we are going to continue using unbuffered writes, as the performance actually worsens when buffering writes to the local filesystem (which has some buffering of its own already).

Co-authored-by: Felix Bünemann <felix.buenemann@gmail.com>

Closes #78
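The pass-through behaviour described above could look roughly like this in the earlier hypothetical BufferedWriter sketch (again, not the actual zip_tricks implementation):

```ruby
# Writes at least as large as the buffer are forwarded directly instead of
# being copied into the buffer first; smaller writes keep accumulating.
def <<(bytes)
  if bytes.bytesize >= @buffer_size
    flush          # keep previously buffered bytes in order
    @io << bytes   # pass the large chunk through untouched, no extra copy
  else
    @buffer << bytes
    flush if @buffer.bytesize >= @buffer_size
  end
  self
end
```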
The RackBody class spins up a Streamer which in turn spins up a ZipWriter, passing along an io-like object. The ZipWriter will << many extremely small strings into this io object, often only one or two bytes for local header values. Puma handles this rather poorly, since it essentially does body.each { |chunk| socket.syswrite(chunk) }. This leads to dozens/hundreds of very small write() syscalls, and all the context switches to the kernel generate quite a bit of overhead. This can be verified by starting the script below with strace -f puma syscall_ziptricks_example.ru and observing the syscalls generated.

Reproduction script:

Run the script with puma syscall_ziptricks_example.ru and then test using wrk with wrk -c 2 -d 30. Using the non-chunked body, I get about 600-650 requests per second. The most straightforward way to reduce syscall overhead is to do more work per syscall, so accumulating the smaller strings into a buffer first has a big effect. Changing the example to use the chunked variant, throughput goes up to about 2600-2750 req/s.
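The original reproduction script is not included above. As a rough, hypothetical approximation of the setup being described (file name and content are made up, and the RackBody block API is assumed from the zip_tricks README of that era), a minimal rackup file might look like this:

```ruby
# syscall_example.ru: a hypothetical approximation, not the original script.
# Streams a tiny ZIP on every request; without buffering, each small piece of
# the ZIP (local headers, descriptors, file data) becomes its own write() syscall.
require 'zip_tricks'

run ->(env) do
  body = ZipTricks::RackBody.new do |zip|
    zip.write_stored_file('hello.txt') do |sink|
      sink << 'Hello from Rails'
    end
  end
  [200, { 'Content-Type' => 'application/zip' }, body]
end
```

Serving this with puma, watching it under strace -f, and benchmarking it with wrk with and without write buffering is the kind of comparison that produced the request-rate numbers quoted above.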