core: Extremely poor IO multiplexer throughput for large messages #139

Closed
vandlol opened this Issue Mar 13, 2018 · 20 comments

vandlol commented Mar 13, 2018

I'm deploying compiled Berkeley DB files - which are just big binary blobs - do you have any plans to speed this up? This is what slows down my deploys the most. I know this is quite outside your focus right now - just asking.

dw (Owner) commented Mar 13, 2018

How many machines are you targeting, and are those machines reached over a high-latency / low-bandwidth connection? Also, which Ansible module are you using today to push them? I've got plans to make file transfer more efficient by reducing bandwidth usage on slow links (topology-aware file distribution): an initial hop in the remote network can cache and redistribute as desired. But I suspect you already know about that.

As for bending the rules of networking, sadly there's not much that can be done :) Mitogen's goal is mostly to reduce "control plane" inefficiency; there is not much that can be done about big file transfers aside from compression, except in situations where multiple targets share a high-bandwidth fabric and the smart distribution tricks from above can be used.

vandlol (Author) commented Mar 13, 2018

One target, high bandwidth, low latency - my whole setup currently aims at deployment within a "local" datacenter, so one machine in my datacenter will be my dedicated Ansible machine, with a high-bandwidth connection to all target machines.

Currently I'm using copy, controlled by a with_filetree loop:

- name: Copy each Policy file to given target
  copy:
    src: "{{ item.src }}"
    dest: /dest/path
  with_filetree:
    - "/this/is/my/dir"
  when: item.state == 'file'

Where does Mitogen store copied files? It looks like the deploy is stuck for quite a while before continuing normally. I'm deploying a few hundred files here - maybe too many? Are there any restrictions?

dw added a commit that referenced this issue Mar 13, 2018

dw (Owner) commented Mar 13, 2018

I had never seen nor used with_filetree before. Keeping this open and treating it as a bug until I can reproduce locally. You might be tickling a known issue at present -- the copy module sucks the entire file into RAM, and triggers quadratic behaviour in the buffer management code on both the controller and target.

You can find it under the 'high risk' section of the docs. Have you tried timing just the upload step with plain Ansible vs. with Mitogen? I have a feeling that if you watch the /usr/bin/time output, you may find controller CPU usage in the Mitogen version spiking like crazy compared to the Ansible version.

Assuming that's the issue, don't worry, I very much have a solid plan for fixing it, but you're the first user who may have actually hit it :)

Basically, if you have a 1GB file, Mitogen loads it into a single big string, then feeds that string to the OS in chunks of at most 16KB, pruning the string prefix according to how much the OS reports it was able to buffer. In both the controller and the target, this leads to up to 16 terabytes of string copies. As I say, it's easy to fix, but I only have one pair of hands!
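
For illustration, a back-of-the-envelope sketch of why that prefix-pruning pattern is quadratic (the figures are illustrative only - the real totals depend on how much the OS accepts per write, not on this toy code):

def copied_bytes(message_size, chunk=16 * 1024):
    # Count the bytes copied when a string is repeatedly sliced to drop a
    # 16 KiB prefix: every slice copies everything that remains.
    remaining, total = message_size, 0
    while remaining > 0:
        remaining -= min(chunk, remaining)
        total += remaining
    return total

for mb in (1, 10, 100, 1024):
    print('%5d MiB message -> ~%d MiB of string copies' % (
        mb, copied_bytes(mb * 1024 ** 2) // 1024 ** 2))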

dw added the bug and ansible labels Mar 13, 2018

vandlol (Author) commented Mar 13, 2018

Timings are still in progress - Mitogen really takes its time on this one.
I will look into ways other than copy to deploy the files. Any suggestions?

dw (Owner) commented Mar 13, 2018

The synchronize module won't be impacted, since it doesn't use the relevant Ansible APIs that Mitogen is sucking at just now -- but please assume this will be fixed in the next week or two regardless :)

dw added the NeedsTest label Mar 13, 2018

vandlol (Author) commented Mar 13, 2018

I will post timings as soon as possible - I'll use synchronize until this is fixed.

dw (Owner) commented Mar 13, 2018

Just one final question – what size are your DB files (smallest and largest)? I have this GB-sized number in my head, but in reality I have never actually profiled this under any scenario.

vandlol (Author) commented Mar 13, 2018

Smallest files are 12 KB, biggest ~330 MB.
BTW: I had to stop the timings on the Mitogen copy - it took more than 3h for less than 1 GB of data.

dw (Owner) commented Mar 13, 2018

Holy crap, that's bad! Thanks for reporting, will get on to this promptly.

dw (Owner) commented Mar 13, 2018

Equivalent reproduction:

n = 330 * 1048576
s = ' ' * n
cs = 65536
while s:
    s = s[cs:]
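
For contrast, a sketch of the same loop with the quadratic copies removed, using a zero-copy memoryview instead of slicing the string itself (an illustration only, not the actual fix that landed):

n = 330 * 1048576
s = b' ' * n
cs = 65536
mv = memoryview(s)
while mv:
    chunk = mv[:cs]   # zero-copy view of the next chunk to hand to the OS
    mv = mv[cs:]      # zero-copy: drops the prefix without rebuilding the string
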
vandlol (Author) commented Mar 13, 2018

It's about 3 GB in total - I should've mentioned that.

dw (Owner) commented Mar 13, 2018

Mitogen-equivalent reproduction:

import mitogen

def count_bytes_in_string(s):
    return len(s)

@mitogen.main()
def main(router):
    local = router.local()

    s = ' ' * (330 * 1048576)
    print 'bytes in string...'
    print local.call(count_bytes_in_string, s)

On OS X, writes to the pipe proceed at a maximum rate of 8192 bytes per syscall, even after fixing the quadratic buffer management. This looks super broken: my laptop tops out at 1.4MB/sec between two local processes. Will investigate tonight.
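
A quick way to see how much a single write to a fresh pipe accepts on a given platform (a generic sketch, not Mitogen's code; it only shows the kernel-side pipe buffering that contributes to the figure above):

import fcntl
import os

r, w = os.pipe()
flags = fcntl.fcntl(w, fcntl.F_GETFL)
fcntl.fcntl(w, fcntl.F_SETFL, flags | os.O_NONBLOCK)
# Offer 1 MiB; the return value is how much one write() call accepted.
print(os.write(w, b'x' * (1 << 20)))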

dw added the core label Mar 13, 2018

dw changed the title from "any plans for speeding up deployment of bigger binary blobs?" to "core: Extremely poor IO multiplexer throughput for large messages" Mar 13, 2018

dw (Owner) commented Mar 13, 2018

The obvious workaround for this is to just use scp and a real SSH connection just like Ansible does, but regardless this should not be a difficult problem to fix in Mitogen, and even if we fix Ansible by using SSH directly, the problem will still need to be fixed in the core library.

dw (Owner) commented Mar 14, 2018

I've got a local prototype that pushes it from 1.4MB/sec to about 300MB/sec :D But it's hairy as all hell; this may take a few more days to get right.

Turns out the expensive part is not in the master, but in building up all the partial writes into one cohesive string on the target.

Also increased the IO chunk size from 16k to 128k to reduce IO loop iterations, and bumped the socket send and receive buffers to match. (This basically wastes some RAM in exchange for having those fast producer/consumer loops handled by the kernel rather than some crappy slow Python.)
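
For illustration, bumping a socket's kernel buffers with the standard socket API looks roughly like this (a generic sketch, not Mitogen's code; the 128 KiB figure just mirrors the chunk size mentioned above):

import socket

BUF_SIZE = 128 * 1024   # match the 128 KiB IO chunk size

a, b = socket.socketpair()
for s in (a, b):
    s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BUF_SIZE)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BUF_SIZE)
# The kernel may round or double the value it actually applies.
print(a.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))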

dw added a commit that referenced this issue Mar 15, 2018

issue #139: core: bump CHUNK_SIZE from 16kb to 128Kb
Reduces the number of IO loop iterations required to receive large
messages at a small cost to RAM usage.

Note that when calling read() with a large buffer value like this,
Python must zero-allocate that much RAM. In other words, for even a
single byte received, 128kb of RAM might need to be written.
Consequently CHUNK_SIZE is quite a sensitive value and this might need
further tuning.
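
As a sketch of the trade-off described above (generic code, not Mitogen's actual receive path): each IO loop iteration asks the kernel for up to CHUNK_SIZE bytes, and the buffer backing that call is sized for the worst case even when only a few bytes have arrived.

import os

CHUNK_SIZE = 128 * 1024   # fewer loop iterations, but more RAM per read() call

def read_chunk(fd):
    # May return anywhere from 1 byte up to CHUNK_SIZE bytes (or b'' on EOF),
    # yet a full-sized buffer is allocated for every call.
    return os.read(fd, CHUNK_SIZE)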

dw added a commit that referenced this issue Mar 15, 2018

issue #139: core: Side.write(): let the OS write as much as possible.
There is no penalty for passing as much data to the OS as possible: it is
not copied, and for a non-blocking socket the OS will just buffer as much
as it can and tell us how much that was.

Also avoids a rather pointless string slice.
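
A minimal sketch of that write pattern (assumed names, not Mitogen's actual Side.write()): hand the OS everything still pending and discard only what it reports it accepted, with no copying slice beforehand.

import os

def write_pending(fd, pending):
    # pending: a bytes-like object of everything not yet sent. memoryview
    # slicing is zero-copy, so neither the write() argument nor the leftover
    # allocates a new string. EAGAIN handling is omitted for brevity.
    view = memoryview(pending)
    written = os.write(fd, view)
    return view[written:]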

dw added a commit that referenced this issue Mar 15, 2018

issue #139: eliminate quadratic behaviour on input path
Rather than slowly build up a Python string over time, we just store a
deque of chunks (which, in a later commit, will now be around 128KB
each), and track the total buffer size in a separate integer.

The tricky loop is there to ensure the header does not need to be sliced
off the full message (which may be huge, causing yet another spike and
copy), but rather only off the much smaller first 128kb-sized chunk
received.

There is one more problem with this code: the ''.join() causes RAM usage
to temporarily double, but that was true of the old solution too. Shall
wait for bug reports before fixing this, as it gets very ugly very fast.
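
A rough sketch of the deque-of-chunks idea (simplified, with made-up names; the real code also has to peel the message header off the first chunk, as described above):

from collections import deque

class InputBuffer(object):
    """Accumulate received chunks without quadratic string rebuilding."""

    def __init__(self):
        self.chunks = deque()
        self.size = 0                 # total buffered bytes, tracked separately

    def feed(self, data):
        # O(1): remember the chunk; no concatenation happens here.
        self.chunks.append(data)
        self.size += len(data)

    def take(self, n):
        # Join only once n bytes are available: a single copy, which briefly
        # doubles RAM usage, exactly as the commit message notes.
        assert self.size >= n
        buf = b''.join(self.chunks)
        self.chunks.clear()
        if n < len(buf):
            self.chunks.append(buf[n:])   # keep trailing bytes for later
        self.size = len(buf) - n
        return buf[:n]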

dw added a commit that referenced this issue Mar 15, 2018

issue #139: eliminate quadratic behaviour from transmit path
Implication: the entire message remains buffered until its last byte is
transmitted. Not wasting time on it, as there are pieces of work like
issue #6 that might invalidate these problems on the transmit path
entirely.

dw added a commit that referenced this issue Mar 15, 2018

issue #139: bump kernel socket buffer size to 128kb
This allows us to write 128kb at a time towards SSH, but it doesn't help
with sudo, where the ancient tty layer is always used.
dw (Owner) commented Mar 15, 2018

You should find that performance of the copy module to a /non sudo account/ has vastly improved. Unfortunately buffer sizes in the TTY layer are fixed at 4kb, so there are still going to be many loop iterations of the two Python interpreters on the target machine if you are sending that file across a become: invocation.

There are several ways to fix that, ranging from making sudo TTY usage a configurable parameter (gross! Configurables suck), to teaching sudo to connect back to the parent process over a (controllable buffer) socket.

dw (Owner) commented Mar 15, 2018

Oh damn. Nope :( This still requires more work. We're using the TTY layer for SSH too. I completely forgot.

vandlol (Author) commented Mar 15, 2018

I'm thinking about writing some kind of wdt module - though of course that would only help me, not you.
But I'd love to provide some help if I can.

dw (Owner) commented Mar 15, 2018

Cackles! There is a third option: while we need a controlling TTY to capture e.g. SSH and sudo password prompts (and satisfy requiretty), that does /not/ mean SSH and sudo's standard in/standard out need to be connected to that controlling TTY :)

This will require a ton of restructuring of the code, but it should let us use sockets rather than TTYs in both places.
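
A heavily simplified sketch of that idea (hedged: names and details are made up, and the eventual Mitogen implementation may differ): give the child a pty as its controlling terminal so password prompts and requiretty still work, but wire its stdin/stdout to a socketpair whose buffers the parent controls.

import os
import socket

def spawn_with_ctty_and_socket(argv):
    parent_sock, child_sock = socket.socketpair()
    master_fd, slave_fd = os.openpty()       # pty pair: prompts go here

    pid = os.fork()
    if pid == 0:                             # child
        parent_sock.close()
        os.close(master_fd)
        os.setsid()                          # new session, no controlling TTY yet
        # Re-opening the pty slave after setsid() acquires it as the
        # controlling TTY on most Unixes (some need an explicit TIOCSCTTY).
        tty_fd = os.open(os.ttyname(slave_fd), os.O_RDWR)
        os.dup2(child_sock.fileno(), 0)      # stdin  <- socket (bulk data)
        os.dup2(child_sock.fileno(), 1)      # stdout -> socket (bulk data)
        os.dup2(tty_fd, 2)                   # stderr -> TTY, where prompts appear
        os.execvp(argv[0], argv)

    os.close(slave_fd)
    child_sock.close()
    # Parent: bulk IO over parent_sock, password prompts read from master_fd.
    return pid, parent_sock, master_fd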

dw (Owner) commented Mar 15, 2018

Hi, I am closing this issue and opening new ones (#148 and #149) to track further solutions to the above problems. For the time being, there is nothing more I can do to improve throughput over TTYs -- on OS X it seems to top out around 3MB/sec.

With the (unfortunately development-heavy) solution in #148, throughput should be pushed towards 200MB/sec and would constitute the real fix to the root cause of your problem above. Meanwhile I will think over the next few days about poking holes or adding configurables in Mitogen so SSH continues to be used for file transfer.

Thanks for the report, and sorry this one is longer term than you were probably hoping for!

dw closed this Mar 15, 2018

dw added a commit that referenced this issue Mar 15, 2018

dw added a commit that referenced this issue Mar 19, 2018

dw added a commit that referenced this issue Mar 19, 2018

issue #139: core: bump CHUNK_SIZE from 16kb to 128Kb

dw added a commit that referenced this issue Mar 19, 2018

issue #139: core: Side.write(): let the OS write as much as possible.

dw added a commit that referenced this issue Mar 19, 2018

issue #139: eliminate quadratic behaviour on input path

dw added a commit that referenced this issue Mar 19, 2018

issue #139: eliminate quadratic behaviour from transmit path

dw added a commit that referenced this issue Mar 19, 2018

issue #139: bump kernel socket buffer size to 128kb
dw (Owner) commented Apr 22, 2018

Hi @vandlol

I've implemented the last piece of the file transfer puzzle and the code is now on master. You should see at least 14MB/sec using the built-in copy module, and ideally much better than that given a beefy machine.

Please let me know if you re-test or still bump into issues. Thanks :) David
