This repository has been archived by the owner on Mar 28, 2024. It is now read-only.

Using RackBody does many many many write() syscalls #78

Closed
WJWH opened this issue Nov 5, 2020 · 18 comments

Comments

@WJWH

WJWH commented Nov 5, 2020

The RackBody class spins up a Streamer, which in turn spins up a ZipWriter, passing along an IO-like object. The ZipWriter will << many extremely small strings into this IO object, often only one or two bytes for local header values. Puma handles this rather poorly, since it essentially does body.each { |chunk| socket.syswrite(chunk) }. This leads to dozens or hundreds of very small write() syscalls, and all the context switches into the kernel generate quite a bit of overhead. This can be verified by starting the script below with strace -f puma syscall_ziptricks_example.ru and observing the syscalls generated.

Reproduction script:

# syscall_ziptricks_example.ru
require 'zip_tricks'

class ChonkyBoi
  CHONK_SIZE = 16384 # default-ish Linux socket write buffer size

  def initialize(body)
    @body = body
  end

  def each
    buffer = String.new(capacity: CHONK_SIZE).force_encoding(Encoding::BINARY)
    current_buffer_size = 0
    @body.each do |chonk|
      chonk_size = chonk.bytesize
      if current_buffer_size + chonk_size > CHONK_SIZE
        yield buffer
        buffer = String.new(capacity: CHONK_SIZE).force_encoding(Encoding::BINARY)
        current_buffer_size = 0
      end
      buffer << chonk
      current_buffer_size += chonk_size
    end
    yield buffer
  end
end

app = ->(env) {
  body = ZipTricks::RackBody.new do | zip |
    zip.write_stored_file('syscall_ziptricks_example.ru') do |sink|
      File.open('syscall_ziptricks_example.ru', 'rb') { |source| IO.copy_stream(source, sink) }
    end
  end
  # [200, {}, ChonkyBoi.new(body)]
  [200, {}, body]
}

run app

Run the script with puma syscall_ziptricks_example.ru and then test with wrk, e.g. wrk -c 2 -d 30. Using the non-chunked body, I get about 600-650 requests per second. The most straightforward way to reduce syscall overhead is to do more work per syscall, so accumulating the smaller strings into a buffer first has a big effect: switching to the chunked body (the commented-out ChonkyBoi line), throughput goes up to about 2600-2750 req/s.

@julik
Contributor

julik commented Nov 5, 2020

Can we use https://github.com/WeTransfer/zip_tricks/blob/master/lib/zip_tricks/write_buffer.rb for this? I would also like to make this configurable, since one of the biggest gripes we had with things like Net::HTTP was all the implicit buffering going on, so I would love to give the user the possibility to bypass buffering if they so desire.

@julik
Contributor

julik commented Nov 5, 2020

@felixbuenemann what are your thoughts on this?

@WJWH
Author

WJWH commented Nov 5, 2020

WriteBuffer is similar, yes. I used a plain String because it lets me set the initial capacity, preventing the code from having to grow-and-copy the underlying buffer.

@julik
Contributor

julik commented Nov 5, 2020

What I mean is that if we plug the WriteBuffer into the << calls the Streamer makes then we can buffer all the Streamer output, which will fix the issue for all cases - not only for the RackBody (which is just a different name for the output enumerator)
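
Something along these lines, maybe. Rough sketch only: the BufferedZipBody name and its buffer_size option are made up for illustration, and I'm assuming WriteBuffer.new(writable, buffer_size), #flush! and BlockWrite keep their current shapes from the gem's source.

require 'zip_tricks'

# Wraps everything the Streamer writes in a WriteBuffer before it reaches
# the chunks yielded to the webserver. A buffer_size of 0 bypasses buffering.
class BufferedZipBody
  def initialize(buffer_size: 16 * 1024, &archiving_block)
    @buffer_size = buffer_size
    @archiving_block = archiving_block
  end

  def each
    return enum_for(:each) unless block_given?

    block_write = ZipTricks::BlockWrite.new { |chunk| yield(chunk) }
    if @buffer_size > 1
      buffered = ZipTricks::WriteBuffer.new(block_write, @buffer_size)
      ZipTricks::Streamer.open(buffered, &@archiving_block)
      buffered.flush! # hand out whatever is still sitting in the buffer
    else
      # buffering bypassed: every << from the Streamer becomes a yielded chunk
      ZipTricks::Streamer.open(block_write, &@archiving_block)
    end
  end
end

A Rack app could then return [200, {}, BufferedZipBody.new { |zip| ... }] and the webserver would see at most buffer-sized chunks.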

@WJWH
Author

WJWH commented Nov 5, 2020

That should work, though some care is needed around simulate_write and friends.

@felixbuenemann
Contributor

felixbuenemann commented Nov 5, 2020

I am doing a similar thing in xlsxtream to reduce CRC32 overhead:

https://github.com/felixbuenemann/xlsxtream/blob/master/lib/xlsxtream/io/zip_tricks.rb

Great idea with the initial capacity, I'm not doing that yet in xlsxtream.

Since ZipTricks::RackBody has a predefined use case, I think it makes sense to choose a suitable buffer size for HTTP usage. For example, on Linux the TCP socket write buffer is 16 KiB by default; it can be as small as 4 KiB and as big as 64/128 KiB (a lowmem system has 64 KiB, see the tcp_wmem docs).

Some benchmarking should be done to verify a good choice. If we don't buffer more than the kernel, it shouldn't really increase latency.

@felixbuenemann
Contributor

felixbuenemann commented Nov 5, 2020

@WJWH Is it unsafe to do buffer.clear in your example after the yield?

I would assume that could avoid the need to allocate a new string, but I haven't verified.

I'm also curious why you are tracking current_buffer_size instead of using buffer.bytesize.

@julik
Contributor

julik commented Nov 5, 2020

> That should work, though some care is needed around simulate_write and friends.

That will work if we insert the buffer below the byte counter of the output (so at the lowest level). I think we could do some IPS benchmarking with an archive which has a lot of items?
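
Something like this could be a starting point; the NullSink, the entry count, and the WriteBuffer.new(writable, buffer_size) / #flush! signatures are assumptions for illustration:

# frozen_string_literal: true
require 'benchmark/ips'
require 'zip_tricks'

# Discards everything written to it, so we only measure Streamer + buffer overhead
class NullSink
  def <<(_bytes)
    self
  end
end

payload = 'a' * 1024

Benchmark.ips do |x|
  [nil, 4 * 1024, 16 * 1024, 64 * 1024].each do |buffer_size|
    x.report("1000 stored entries, buffer=#{buffer_size.inspect}") do
      sink = NullSink.new
      out = buffer_size ? ZipTricks::WriteBuffer.new(sink, buffer_size) : sink
      ZipTricks::Streamer.open(out) do |zip|
        1000.times do |i|
          zip.write_stored_file("file_#{i}.txt") { |io| io << payload }
        end
      end
      out.flush! if out.respond_to?(:flush!)
    end
  end
  x.compare!
end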

@felixbuenemann
Contributor

@julik I haven't read the code too closely, but if ZipTricks::Streamer takes a buffer size as an option, then ZipTricks::RackBody could override initialize from ZipTricks::OutputEnumerator to set a good default and call super to pass it through to the streamer.
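
Roughly like this (hypothetical sketch: the :write_buffer_size option name and the 16 KiB default are placeholders, not the gem's actual API):

# Placeholder sketch, assuming OutputEnumerator#initialize accepts keyword
# options and forwards them to the Streamer.
class ZipTricks::RackBody
  DEFAULT_HTTP_WRITE_BUFFER_SIZE = 16 * 1024 # common Linux tcp_wmem default

  def initialize(**streamer_options, &blk)
    streamer_options[:write_buffer_size] ||= DEFAULT_HTTP_WRITE_BUFFER_SIZE
    super(**streamer_options, &blk)
  end
end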

julik added a commit that referenced this issue Nov 5, 2020
@julik
Contributor

julik commented Nov 5, 2020

Put together a POC branch - @felixbuenemann, if you could take this one I would really appreciate it, I am a bit swamped atm 😇 And indeed we could pass the buffer size everywhere appropriate. Feel free to push on top of the branch if you want.

@felixbuenemann
Contributor

I'm working on a ton of things myself, but this one shouldn't be too hard.

@WJWH Would you like to help with benchmarking different buffer sizes? The best thing would be a real-world use case, e.g. through the full HTTP stack with Puma. If we only test in isolation we might end up with buffers that are too large.

@felixbuenemann
Contributor

@julik Any reason that zip_tricks is doing @crc = Zlib.crc32_combine(@crc, Zlib.crc32(blob), blob.bytesize) instead of just @crc = Zlib.crc32(blob, @crc)?
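
For illustration, both forms arrive at the same checksum; crc32_combine is only needed when the two partial CRCs were computed independently of each other, whereas Zlib.crc32 accepts a running CRC as its second argument:

require 'zlib'

a = 'hello '
b = 'world'

whole       = Zlib.crc32(a + b)
incremental = Zlib.crc32(b, Zlib.crc32(a))                                 # feed the running CRC back in
combined    = Zlib.crc32_combine(Zlib.crc32(a), Zlib.crc32(b), b.bytesize) # stitch two independent CRCs together

whole == incremental # => true
whole == combined    # => true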

@felixbuenemann
Contributor

@julik Just swapping the two usages above makes CRC32 about 40x faster for smaller block sizes.

I also benchmarked CRC32 performance, and since everything between 4K and 512K is roughly the same speed, I'd go with 4K as the default buffer size: it is the page size on pretty much all operating systems, so you can't allocate less than that anyway.

@felixbuenemann
Contributor

Here's the CRC benchmark:

# frozen_string_literal: true

require 'benchmark/ips'
require 'zlib'

Benchmark.ips do |x|
  x.time = 10
  x.warmup = 2

  osize = 1<<20 # 1 MiB
  expected = Zlib.crc32("\0"*osize)
  (0..20).each do |bits|
    isize = 1<<bits
    blocks = osize/isize
    size = bits < 10 ? "#{isize}B" : "#{isize>>10}K"
    x.report("crc32 #{blocks} * #{size}") do
      ibuf = String.new("\0"*isize, encoding: Encoding::BINARY)
      obuf = String.new(capacity: osize, encoding: Encoding::BINARY)
      crc = Zlib.crc32()
      for i in 0...blocks
        crc = Zlib.crc32(ibuf, crc)
        obuf << ibuf
      end
      fail if crc != expected
    end
    # x.report("crc32cm #{blocks} * #{size}") do
    #   ibuf = String.new("\0"*isize, encoding: Encoding::BINARY)
    #   obuf = String.new(capacity: osize, encoding: Encoding::BINARY)
    #   crc = Zlib.crc32()
    #   for i in 0...blocks
    #     crc = Zlib.crc32_combine(crc, Zlib.crc32(ibuf), ibuf.bytesize)
    #     obuf << ibuf
    #   end
    #   fail if crc != expected
    # end
  end

  x.compare!
end

@felixbuenemann
Contributor

The slow CRC32 calculation (separate from the buffer size question) is fixed by #80.

@WJWH
Author

WJWH commented Nov 5, 2020

@felixbuenemann:

  • Manually tracking the buffer size is mostly a holdover from another issue in the Puma repo; String#bytesize would work just as well here.
  • I tried looking through the Ruby source, but AFAICT buffer.clear would also reset the capacity of the underlying buffer (a quick check is sketched below). Maybe a separate issue on MRI is in order to add a "clear the string but keep the underlying buffer in place" method to String.
  • The buffer size of 16384 was chosen because that was the socket write buffer size on my dev laptop. I think it's a reasonable default, but there's no good reason not to make it configurable. One thing to take into account is that (for Puma at least, and other servers probably have equivalents) Nagle's algorithm (and TCP_CORK if available) is enabled for every connection unless you specifically turn on low-latency mode in the config, so the many small writes don't actually result in many small packets being sent out; it's just the syscall overhead that causes the slowdown.
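
A quick way to check the clear behaviour (exact numbers depend on the Ruby version and platform):

require 'objspace'

buffer = String.new(capacity: 16_384).force_encoding(Encoding::BINARY)
ObjectSpace.memsize_of(buffer) # => roughly 16K plus per-object overhead

buffer.clear
ObjectSpace.memsize_of(buffer) # => back down to the small embedded-string size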

@felixbuenemann
Contributor

felixbuenemann commented Nov 5, 2020

Btw. I just fixed a spec in #74 and noticed that merely streaming a ZIP with a single file called "hello.txt" and the content "ßHello from Rails" produced a body enumerator with 46 parts…

@julik
Contributor

julik commented Nov 5, 2020

Yeap :-)

julik added a commit that referenced this issue Nov 9, 2020
felixbuenemann pushed a commit to felixbuenemann/zip_tricks that referenced this issue Nov 13, 2020
julik added a commit that referenced this issue Nov 23, 2020
julik closed this as completed in a9d762f on Nov 23, 2020
julik referenced this issue in julik/zip_kit Feb 29, 2024
The RackBody class spins up a Streamer, which in turn spins up a ZipWriter, passing along an IO-like object. The ZipWriter will << many extremely small strings into this IO object, often only one or two bytes for local header values. Puma handles this rather poorly, since it essentially does body.each { |chunk| socket.syswrite(chunk) }. This leads to dozens or hundreds of very small write() syscalls, and all the context switches into the kernel generate quite a bit of overhead.

We also optimize the write buffer a little so that it won't buffer writes which are larger than the desired buffer size, but will pass them through untouched. We also ensure the write buffer does not create copies of the string it hosts, to conserve memory, and explain why in the documentation.
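
In rough terms the pass-through behaves like the sketch below (illustrative only, not the shipped WriteBuffer code):

# Illustrative sketch of the pass-through idea, not the shipped WriteBuffer.
class PassthroughWriteBuffer
  def initialize(destination, buffer_size)
    @destination = destination
    @buffer_size = buffer_size
    @buf = String.new(encoding: Encoding::BINARY)
  end

  def <<(bytes)
    if bytes.bytesize >= @buffer_size
      # Large chunks skip buffering: flush what is pending so ordering is kept,
      # then hand the big chunk to the destination without copying it.
      flush
      @destination << bytes
    else
      @buf << bytes
      flush if @buf.bytesize >= @buffer_size
    end
    self
  end

  def flush
    @destination << @buf unless @buf.empty?
    @buf.clear
    self
  end
end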

When serving through a webserver, the buffering OutputEnumerator makes serving roughly 15 to 20 percent faster compared to not buffering at all.

For all standard cases we are going to keep using unbuffered writes, since performance actually worsens when buffering writes to the local filesystem (which already does some buffering of its own).

Co-authored-by: Felix Bünemann <felix.buenemann@gmail.com>

Closes #78