
core: startup hang due to reaping first stage before draining core_src pipe #179

Closed
kneufeld opened this Issue Mar 31, 2018 · 34 comments

@kneufeld commented Mar 31, 2018

I'm looking to speed up my too-slow Ansible scripts, so Mitogen looks like just the ticket. However, it doesn't connect to any of my machines. It did connect once, though: I was running against a group, and it connected to one machine and was crazy fast.

I'm connecting to Ubuntu 16.04 from OSX but have tried connecting from Ubuntu 16.04 as well.

log.txt

@dw (Owner) commented Mar 31, 2018

Some guy on Reddit reported this stack trace a few weeks ago, but I forgot all about it. Is the machine heavily loaded by any chance? Presumably you've got good networking to the target.

Basically 'ec0_received()' means the bootstrap reported successful start, but 'EC1' never got written, which means Python failed to start up (within the timeout) after we sent core.py.

This is most likely some kind of problem with the target -- perhaps some modules that core.py needs are missing, or there's a bug in the core.py bootstrap that your config is tickling.

(Random guesswork) do you have e.g. a custom bashrc or similar on the target box that might be fiddling with the Python environment in some way?

@dw (Owner) commented Mar 31, 2018

Is Python 3 the default on the remote machine? That could also explain this, since the bootstrap is already 3.x-compatible but core.py very much is not.

@kneufeld (Author) commented Mar 31, 2018

I did think of .bashrc and .profile and moved them out of the way; same issue.

Python 3.6 is on the remote machine, but running python gives you 2.7.

Ansible without the Mitogen strategy, and SSH in general, work fine.

@kneufeld (Author) commented Mar 31, 2018

I tried moving my ~/.ssh/config out of the way, and it's still timing out on EC1.

@dw (Owner) commented Mar 31, 2018

I flew long haul earlier and I'm wrecked today; I'll look at improving logging for this case at the very least. One trick is moving python on the box to e.g. python2.7.real, and replacing python2.7 with a shell script that does "exec strace -o /tmp/python-strace -ff /path/to/python.real "$@"", then seeing if there is anything fun in the logs.

Actually if you do that please attach the logs here :)

@kneufeld (Author) commented Mar 31, 2018

I shouldn't be working on Saturday either. ;) Go have a nap or something, I'll try the strace thing next week.

@kneufeld (Author) commented Apr 2, 2018

OK, I tried the strace thing remotely, but since Ansible/Mitogen never logs in, the strace wrapper for python is never run. Makes sense. If that's not what you meant, I also did the same thing with my local python, but it spawned 38 different processes, so it's a bit of a needle in a haystack.

So, fun fact: I then made a new VM and Mitogen works fine with it -- and holy crap is Mitogen fast. I thought for sure Mitogen would start failing after single sign-on with FreeIPA was installed, but nope, it kept working for many subsequent test runs.

So I have no idea what's going on, but I guess it's not Mitogen's fault and there is something freaky-deaky about my setup.

@dw (Owner) commented Apr 3, 2018

If Mitogen breaks where Ansible succeeds, then it's Mitogen's problem :)

Yep, I meant doing the strace thing on the remote box. I'm really dumb -- I said to switch out "python2.7", but by default Ansible actually runs "python". Can you give it another try? So sorry! What I'd expect to be looking for is lines like "write(1, "Traceback (most recent call last)...." in the output, but I really have no clue whether it's an exception or some other problem.

As for the script, the only strace juju that may be useful is increasing the logged string length, like:

#!/bin/bash
exec strace -s 200 -o /tmp/python-strace -ff /path/to/python.real "$@"

@kneufeld (Author) commented Apr 3, 2018

I did replace python, not python2.7.

However, since I can't successfully ssh/connect, the remote python never gets run.

@dw (Owner) commented Apr 5, 2018

Hi Kurt,

Thought of another way to try this -- just do "ps auxfw | grep ssh" and attach "strace -ff -o /tmp/... -p [sshd pid]", where the pid is the oldest sshd process on the system.

The traceback reading "ec0_received" definitely indicates login completed and at least one Python interpreter process existed on the target host, so there is some issue with the /usr/bin/python wrapping approach above.

The strace will produce a ton of logs (at least it did here). Sort by last-modified time and they should be in the last 5... grep "execve.*python" should find them. There are two -- the first is the one created immediately following connect, the second is forked (and should have a PID one higher than the previous) to implement the bootstrap.

If you like (and if it doesn't touch confidential data), feel free to just ZIP everything up and mail it to dw@botanicus.net

Also be sure to temporarily disable SSH connection multiplexing if that's in use... otherwise it won't be the oldest sshd that forks.

dw added the "bug" and "ansible" labels Apr 9, 2018

@dw (Owner) commented Apr 16, 2018

Hey Kurt, any luck with this one? :) Still can't reproduce on this end

@kneufeld (Author) commented Apr 16, 2018

Sorry, I've been slammed at work. I'll try some more stracing and see what I can figure out. I keep doing a git pull every few days to see if this problem gets magically fixed but so far no luck. :(

@kneufeld (Author) commented Apr 16, 2018

Holy crap... so I moved all my .bash* and .profile* out of /root and mitogen worked! Cool but I had tried that before... So then I moved the files back one at a time to try and find the one that broke things. And... I got all the files placed back and mitogen still works.

So on another machine I tried the same experiment by moving all the same files out of the way and mitogen still didn't work.

The strace trick doesn't/can't work because the ssh login process doesn't finish and no command on the remote host is run.

So I'm pretty much out of ideas. I'd love to be able to use Mitogen, as the few times I've seen it run it's crazy fast and uses way less CPU. But here we are. So unless you've got something brilliant to try next, I think I'm done. :(

@dw (Owner) commented Apr 18, 2018

Hi Kurt!

Thanks for all your efforts :) Regarding 'the login process doesn't finish', that cannot be true, as shown in the original stack trace:

File "/Users/kneufeld/company/ansible/mitogen/mitogen/ssh.py", line 133, in _connect_bootstrap
    self._ec0_received()
  File "/Users/kneufeld/company/ansible/mitogen/mitogen/parent.py", line 499, in _ec0_received
    discard_until(self.receive_side.fd, 'EC1\n', time.time() + 10.0)

The _ec0_received() function cannot run until the remote end prints "EC0", which can only happen if login succeeded and the initial "python -c ....." command started, since that command is what prints EC0.
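
For anyone curious what that wait looks like, here is a minimal sketch of a marker wait with a deadline, in the spirit of the discard_until() call in the traceback above (illustrative only, not Mitogen's actual implementation; the function name is made up):

```python
import os
import select
import time


def wait_for_marker(fd, marker, deadline):
    # Read and discard output from `fd` until `marker` (e.g. b'EC1\n') appears,
    # raising if `deadline` (an absolute time.time() value) passes first.
    buf = b''
    while marker not in buf:
        remaining = deadline - time.time()
        if remaining <= 0:
            raise RuntimeError('timed out waiting for %r' % (marker,))
        rfds, _, _ = select.select([fd], [], [], remaining)
        if not rfds:
            continue  # select() timed out; the loop re-checks the deadline
        chunk = os.read(fd, 4096)
        if not chunk:
            raise IOError('EOF before %r was seen' % (marker,))
        buf += chunk


# The hard-coded bootstrap wait was roughly equivalent to:
#   wait_for_marker(ssh_stdout_fd, b'EC1\n', time.time() + 10.0)
```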

Keeping this bug open in case anyone hits the same issue. Thanks again :)

@sharph commented May 11, 2018

We're encountering this issue too. We get ec0_received, and then the timeout is hit and Mitogen cleans up the connection. It doesn't happen on all targets, just a subset, but they tend to be the same hosts. It also doesn't seem to happen on all the affected hosts all the time. The hosts I checked out were definitely under load, but they started regular ssh sessions and python (2.7) interpreters quickly when I checked by hand.

@dw (Owner) commented May 11, 2018

Hi @sharph, thanks for commenting :) Have you tried simply bumping timeout= under [defaults] in ansible.cfg up to something like 120 or similar? I'd love it if this were simply an issue of too short a timeout :)
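
(For reference, the setting lives under [defaults] in ansible.cfg; the 120 below is just the example value from above:)

```ini
[defaults]
# Connection timeout in seconds (Ansible's default is 10).
timeout = 120
```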

@dw (Owner) commented May 11, 2018

Ah whups. The relevant timeout is (erroneously) hard-coded. It is the "+ 10" value in mitogen/parent.py::_ec0_received. If you, say, increase that number 10x, does the problem abate?
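
Concretely, that is the 10.0 in this line from the traceback quoted earlier, so testing is just a matter of editing the constant:

```python
# mitogen/parent.py, inside _ec0_received(), as quoted in the traceback above:
discard_until(self.receive_side.fd, 'EC1\n', time.time() + 10.0)
# e.g. change 10.0 to 100.0 to test a 10x longer wait
```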

dw added a commit that referenced this issue May 11, 2018

issue #179: ec0_receive() and connect_bootstrap() should use deadline.
Now there is a single global deadline derived from ansible.cfg timeout=
value.

dw added a commit that referenced this issue May 11, 2018

Merge pull request #246 from dw/issue179
issue #179: ec0_receive() and connect_bootstrap() should use deadline.

@dw (Owner) commented May 11, 2018

Okay, now the value is derived from ansible.cfg as it should be.

@sharph commented May 11, 2018

If I change it to 100 seconds in the source code, I get the same result.

I wrote this small script to help me test. It hangs about half the time and succeeds quickly the other half. If I ssh into the target while it is hanging, I can see the Mitogen python processes just sitting there. This doesn't seem to be Ansible-related.

import mitogen.master
import os

broker = mitogen.master.Broker()
router = mitogen.master.Router(broker)

try:
    router.enable_debug()
    host = router.ssh(
        hostname='my.host',
        check_host_keys='ignore'
    )

    host.call(os.system, 'echo hello')
finally:
    broker.shutdown()

It leaves the following log (the timeout was still 10s at this point):

Parent PID: 1
Created by:

  File "test.py", line 12, in <module>
    check_host_keys='ignore'
  File "/fabric/ansible/mitogen-master/mitogen/parent.py", line 1073, in ssh
    return self.connect('ssh', **kwargs)
  File "/fabric/ansible/mitogen-master/mitogen/parent.py", line 1031, in connect
    return self._connect(klass, name=name, **kwargs)
  File "/fabric/ansible/mitogen-master/mitogen/parent.py", line 1013, in _connect
    stream.connect()
  File "/fabric/ansible/mitogen-master/mitogen/ssh.py", line 143, in connect
    super(Stream, self).connect()
  File "/fabric/ansible/mitogen-master/mitogen/parent.py", line 742, in connect
    LOG.debug('%r.connect()', self)
  File "/usr/lib/python2.7/logging/__init__.py", line 1155, in debug
    self._log(DEBUG, msg, args, **kwargs)
  File "/usr/lib/python2.7/logging/__init__.py", line 1286, in _log
    self.handle(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 1296, in handle
    self.callHandlers(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 1336, in callHandlers
    hdlr.handle(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 759, in handle
    self.emit(record)
  File "/fabric/ansible/mitogen-master/mitogen/core.py", line 265, in emit
    self._reopen()
  File "/fabric/ansible/mitogen-master/mitogen/core.py", line 257, in _reopen
    ''.join(traceback.format_stack()),

17:35:24 D mitogen: mitogen.ssh.Stream('default').connect()
17:35:25 D mitogen: hybrid_tty_create_child() pid=319 stdio=7, tty=6, cmd: "ssh" "-o" "Compression yes" "-o" "ServerAliveInterval 15" "-o" "ServerAliveCountMax 3" "-o" "StrictHostKeyChecking no" "-o" "UserKnownHostsFile /dev/null" "-o" "GlobalKnownHostsFile /dev/null" "[REDACTED]" "python2.7" "-c" "'import codecs,os,sys;_=codecs.decode;exec(_(_(\"eNpd0FFLwzAQB/Bn+yl8u4SFkVvFrYWCUsT3PtgHFWnXVMK6JKSdUT+9Nzow3Vt+97/kuFSiLuy4dtopxhMvQiTd3xJ66w+M58kNnbuT2zApUEr+70rE8pTi7P1gR8WqGD5GHSPMUOZLe2te4bF6fpHwXow/VPxW+9PUtIOae4gDWwZiyRWwo57spzK5t3Z62Pabpt82uzZL0zzFew48oZeC15NiKOCplG+GamXxwc47d9YpQ5uCb4GvvWo6hojpjgv41W6+fGmqBQSgL7g8VsZZWGZwh1mW0aBVeT0ez+P/AOySeEU=\".encode(),\"base64\"),\"zip\"))'"
17:35:25 D mitogen: mitogen.ssh.Stream('local.319').connect(): child process stdin/stdout=7
17:35:25 D mitogen.io: iter_read(6) -> "Warning: Permanently added '[REDACTED]' (ECDSA) to the list of known hosts.\r\n"
17:35:25 D mitogen: mitogen.ssh.Stream('local.319'): received "Warning: Permanently added '[REDACTED]' (ECDSA) to the list of known hosts.\r\n"
17:35:26 D mitogen.io: iter_read(7) -> 'EC0\n'
17:35:26 D mitogen: mitogen.ssh.Stream('local.319'): received 'EC0\n'
17:35:26 D mitogen: mitogen.ssh.Stream('local.319')._ec0_received()
17:35:36 D mitogen: mitogen.ssh.Stream('local.319'): child process still alive, sending SIGTERM
17:35:36 D mitogen: Waker(Broker(0x7faa42caf210) rfd=3, wfd=4).on_disconnect()
17:35:36 D mitogen.io: Broker(0x7faa42caf210).stop_receive(Waker(Broker(0x7faa42caf210) rfd=3, wfd=4))
17:35:36 D mitogen.io: Broker(0x7faa42caf210)._stop_transmit(Waker(Broker(0x7faa42caf210) rfd=3, wfd=4))

@dw (Owner) commented May 11, 2018

Can you try the aforementioned trick to capture an strace of the Pythons involved? Depending on the point this hang occurs at, there will be up to two Python processes.

The first, almost certainly with the lowest PID, will have either "/usr/bin/python" or "/usr/bin/python(mitogen:foo@bar:1234)" as its command line, depending on whether exec has happened yet. I am expecting exec will already have happened.

The second almost certainly has a PID one higher than the previous process; it is the first stage. I suspect this process will be hanging on a read() or write() system call for unknown reasons. The relevant code is mitogen/parent.py::_first_stage(). This is the process responsible for writing EC0, then using Python to fully read and fully write the source of core.py, before finally writing EC1.
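
In sketch form, those responsibilities amount to something like the following (heavily simplified and not the real _first_stage() code; the function name and fd argument are made up, and it assumes the source arrives on stdin):

```python
import os
import sys


def first_stage_sketch(core_src_fd):
    # 1. Announce that the bootstrap one-liner started successfully.
    sys.stdout.write('EC0\n')
    sys.stdout.flush()

    # 2. Fully read the core.py source sent by the parent, then fully write
    #    it into the pipe the next stage reads from. The write blocks
    #    whenever the pipe buffer fills up.
    data = sys.stdin.read()
    if isinstance(data, str):
        data = data.encode()
    while data:
        written = os.write(core_src_fd, data)
        data = data[written:]

    # 3. Announce that delivery completed.
    sys.stdout.write('EC1\n')
    sys.stdout.flush()
```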

The strace trick is something like:

cat > /tmp/strace-python.sh <<'EOF'
#!/bin/sh
strace -ff -o /tmp/strace-python /usr/bin/python "$@"
EOF

chmod +x /tmp/strace-python.sh

Then pass python_path="/tmp/strace-python.sh" as kwarg to .ssh().
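
With the standalone test script from earlier, that would look something like this (the hostname is a placeholder):

```python
host = router.ssh(
    hostname='my.host',
    check_host_keys='ignore',
    python_path='/tmp/strace-python.sh',
)
```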

Thanks for your help! I bet this one is blindingly simple, but I just can't see it.

@dw (Owner) commented May 11, 2018

Can you also capture "ls -l /proc/pid/fd" output for both the first stage and the parent process during the hang? That would also be super useful.

@sharph commented May 11, 2018

It produced 3 files. I killed the mitogen processes that were remaining on the server.
strace-python.15552.txt
strace-python.15553.txt

15552 is as I retrieved it from the server -- it doesn't look like it is completely written.

strace-python.15551.txt does not appear to contain any confidential info, but there are some binary strings in there I haven't decoded so out of an abundance of caution I am going to send it to you via email.

@sharph commented May 11, 2018

I don't seem to have /proc/pid/fd on my system.

@dw (Owner) commented May 11, 2018

You're fabulous, I think I may have spotted it. It's not to do with CPU load, but with memory pressure. There is an ordering bug during startup -- we try to reap the first stage (PID 15551 is hung wait()ing for it to exit) before trying to read the copy of the core.py source that the first stage is still blocked in the process of writing.

What I don't understand is why this only shows up some of the time; Linux is documented as having fixed-size pipe buffers.

The ordering fix is simple, but I'd like to confirm the reason the pipe buffer is full isn't that something else is writing to it, so let me reproduce the issue here first.
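
For anyone following along, the essence of the bug can be shown with plain pipes. This self-contained sketch (not Mitogen code) drains the pipe before reaping the writer; the comments note where the reversed order deadlocks:

```python
import os

PAYLOAD = b'x' * (200 * 1024)  # comfortably larger than a 4 KiB or 64 KiB pipe buffer

r, w = os.pipe()
pid = os.fork()
if pid == 0:
    # Child, playing the role of the first stage: write everything, then exit.
    os.close(r)
    view = memoryview(PAYLOAD)
    while view:
        view = view[os.write(w, view):]  # blocks whenever the pipe buffer is full
    os.close(w)
    os._exit(0)

# Parent: drain the pipe completely first...
os.close(w)
received = b''
while True:
    chunk = os.read(r, 65536)
    if not chunk:
        break
    received += chunk
os.close(r)

# ...and only then reap the child. Calling waitpid() before the reads
# deadlocks as soon as the payload exceeds the pipe buffer: the child
# blocks in write() while the parent blocks in waitpid().
os.waitpid(pid, 0)
assert received == PAYLOAD
```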

Thanks so much for this! It's the oldest user bug :)

@dw (Owner) commented May 11, 2018

Can you please dump the output of uname -a, sudo sysctl -a | grep pipe, and free -m from that machine? Thanks :)

dw added a commit that referenced this issue May 11, 2018

issue #179: don't reap first stage until core_src_fd is drained.
Bootstrap would hang if (as of writing) a pipe sufficient to hold 42,006
bytes was not handed out by the kernel to the first stage. It was luck
that this didn't manifest before, as first stage could write the full
source and exit completely before reading begun.

It is not clear under which circumstances this could previously occur,
but at least since Linux 4.5, it can be triggered if
/proc/sys/fs/pipe-max-size is reduced from the default of 1MiB, which
can have the effect of capping the default pipe buffer size of 64KiB to
something lower.

Suspicion is that buffer pipe size could also be reduced under memory
pressure, as reference to busy machines appeared a few times in the bug
report.

dw added a commit that referenced this issue May 11, 2018

Merge pull request #247 from dw/issue179
issue #179: don't reap first stage until core_src_fd is drained.

@dw (Owner) commented May 11, 2018

Managed to reproduce it locally -- please drop a comment if you're still seeing this, otherwise I'm pretty sure we finally found it. :) Thanks again so much, it was a massive head-scratcher without debug logs!

dw closed this May 11, 2018

dw changed the title from "ansible: can't connect and/or hangs when conecting to remote machines" to "core: startup hang due to reaping first stage before draining core_src pipe" May 11, 2018

@kneufeld (Author) commented May 11, 2018

I had to bump up the timeout, but holy crappers! Totally fast now! Thanks, guys.

@dw (Owner) commented May 11, 2018

@kneufeld I'd still be interested in the output of these from your affected box :)

uname -a, sudo sysctl -a | grep pipe, and free -m

If there is a piece of software restricting the pipe buffer size via sysctl, this has performance ramifications for file transfer too, so I'd still love to know how your boxes end up configured like this. Maybe it's just low RAM.

@dw (Owner) commented May 12, 2018

Had a read through fs/pipe.c and it turns out the likely root cause was right there in the pipe(7) man page:

/proc/sys/fs/pipe-user-pages-soft (since Linux 4.5)
      The soft limit on the total size (in pages) of all pipes created or  set
      by   a   single   unprivileged   user   (i.e.,   one  with  neither  the
      CAP_SYS_RESOURCE nor the CAP_SYS_ADMIN  capability).   So  long  as  the
      total number of pages allocated to pipe buffers for this user is at this
      limit, individual pipes created by a user will be limited to  one  page,
      and attempts to increase a pipe's capacity will be denied.

      When  the  value  of  this limit is zero, no soft limit is applied.  The
      default value for this file is 16384, which permits creating up to  1024
      pipes with the default capacity.

It would seem that in both cases, the user Mitogen is connecting as has already opened (assuming kernel defaults) enough pipes to exhaust 64 MiB of kernel memory, which (assuming the default 64 KiB pipe buffer) would be 1024 pipes. Once that limit is hit, Linux allocates 4 KiB buffers rather than 64 KiB buffers, which triggered the deadlock.
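
If anyone wants to check what capacity the kernel actually handed out, Linux exposes it through fcntl; a quick check looks like this (the F_GETPIPE_SZ constant is spelled out because older Pythons' fcntl module doesn't define it):

```python
import fcntl
import os

F_GETPIPE_SZ = getattr(fcntl, 'F_GETPIPE_SZ', 1032)  # Linux-specific fcntl command

r, w = os.pipe()
print('pipe capacity: %d bytes' % fcntl.fcntl(r, F_GETPIPE_SZ))
# Normally prints 65536; drops to 4096 once the per-user soft page limit
# described above has been exhausted.
os.close(r)
os.close(w)
```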

I guess the scenario above could describe a great many server apps, so it's surprising it was hit by so few people. Thanks again!

@kneufeld (Author) commented May 12, 2018

# uname -a
Linux www1 4.13.4-1-pve #1 SMP PVE 4.13.4-26 (Mon, 6 Nov 2017 11:23:55 +0100) x86_64 x86_64 x86_64 GNU/Linux

# sysctl -a | grep pipe
fs.pipe-max-size = 1048576
fs.pipe-user-pages-hard = 0
fs.pipe-user-pages-soft = 16384
kernel.core_pipe_limit = 0
sysctl: reading key "kernel.unprivileged_userns_apparmor_policy"
sysctl: reading key "net.ipv6.conf.all.stable_secret"
sysctl: reading key "net.ipv6.conf.default.stable_secret"
sysctl: reading key "net.ipv6.conf.eth0.stable_secret"
sysctl: reading key "net.ipv6.conf.lo.stable_secret"

# free -m
              total        used        free      shared  buff/cache   available
Mem:            512         441          65        1416           4          65
Swap:           512          99         412

@kneufeld (Author) commented May 12, 2018

For the record, I was connecting as root, so I'm not sure whether those pipe limits apply?

@dw (Owner) commented May 12, 2018

The plot thickens! Indeed they're not supposed to, assuming a 'real' root user. Was it perhaps running inside some kind of container, e.g. Docker/LXC/LXD, that has its root user stripped of typical privileges?

edit: I see from the uname output that you're likely running Proxmox; I think that probably qualifies :)

@kneufeld (Author) commented May 12, 2018

It's most definitely running in an LXC container (Proxmox for the win), but this would be the first time I've come across anything like this. There's a first time for everything, of course.

@dw (Owner) commented May 12, 2018

I have no clue how Proxmox handles things, but you can check the output of getpcaps $$ from a shell to see whether cap_sys_resource or cap_sys_admin are present. IIRC the former is a global override for things like ulimits (and you'd assume it would be missing in a container), and the latter is a catch-all for things like configuring network interfaces; I imagine it should also be gone. Without either of those, whatever your UID 0 maps to will receive the soft limit.

Just checked in Docker and they're both definitely missing, even as UID 0.
