
Unsafe use of fork() after join()-ing threads causes ansible to hang under certain conditions. #59642

Closed
dbevacqua opened this issue Jul 26, 2019 · 14 comments · Fixed by #72412
Labels
affects_2.7, bug, has_pr, python3, support:core, traceback

Comments

@dbevacqua

SUMMARY

Under certain conditions (see "STEPS TO REPRODUCE"), an ansible playbook run with local connection will hang when gathering facts.

We have traced the problem to unsafe use of fork() (via creation of WorkerProcess [1]) after result threads have been join()ed [2]. The Python docs [3] warn against this:

Note that safely forking a multithreaded process is problematic

This bug report [4] makes the point somewhat more emphatically:

Do realize that while working on this it is fundamentally impossible per POSIX for os.fork() to be safely used at the Python level in a process also using pthreads. That this ever appeared to work is a pure accident of implementations of underlying libc, malloc, system libraries, and kernel behaviors. POSIX considers it undefined behavior. Nothing done in CPython can avoid that. Any "fix" for these kinds of issues is merely working around the inevitable which will re-occur.

We have experimented with using forkserver instead of fork for starting subprocesses in ansible, and, barring some minor workaround for serialization issues that arise as a result, this seems to fix the problem. We would therefore propose either a configuration option that lets users pick a start method or, perhaps more correctly/definitively, making forkserver the default where available (though we realise there could be some consequences here). We would appreciate some opinions/guidance on whether this is a sane approach before we dive in and try to make the change.

[1] https://github.com/ansible/ansible/blob/v2.7.0/lib/ansible/plugins/strategy/__init__.py#L320
[2] https://github.com/ansible/ansible/blob/v2.7.0/lib/ansible/plugins/strategy/__init__.py#L228
[3] https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods
[4] https://bugs.python.org/issue35866#msg340442
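
For concreteness, here is a minimal sketch of the kind of change we have in mind (illustrative only; the helper name and placement are assumptions, not actual ansible code). The idea is to pick the start method once, early at startup, before any threads exist:

import multiprocessing

def get_mp_context():
    # Prefer forkserver where available (Unix): the forkserver process is
    # single-threaded, so forking from it cannot inherit locks held by
    # threads in the main process.
    try:
        return multiprocessing.get_context('forkserver')
    except ValueError:
        # forkserver not supported on this platform; fall back to the default
        return multiprocessing.get_context()

MP_CONTEXT = get_mp_context()

# WorkerProcess creation would then go through this context,
# i.e. MP_CONTEXT.Process(...) instead of multiprocessing.Process(...).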

ISSUE TYPE
  • Bug Report
COMPONENT NAME

Core

ANSIBLE VERSION
ansible 2.7.0
  config file = None
  configured module search path = ['/home/tw-admin/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/lib/python3.6/dist-packages/ansible
  executable location = /usr/local/bin/ansible
  python version = 3.6.8 (default, Jan 14 2019, 11:02:34) [GCC 8.0.1 20180414 (experimental) [trunk revision 259383]]
CONFIGURATION
OS / ENVIRONMENT
$ lsb_release -d
Description:	Ubuntu 18.04.2 LTS

$ uname -a
Linux amp 4.15.0-1044-aws #46-Ubuntu SMP Thu Jul 4 13:38:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

$ dpkg -l | grep libc6
ii  libc6:amd64                    2.27-3ubuntu1                     amd64        GNU C Library: Shared libraries
ii  libc6-dbg:amd64                2.27-3ubuntu1                     amd64        GNU C Library: detached debugging symbols

$ dpkg -l | grep libgcc
ii  libgcc1:amd64                  1:8.3.0-6ubuntu1~18.04.1          amd64        GCC support library
STEPS TO REPRODUCE

It's complicated. The issue arises when we invoke systemd or command to both enable and start the filebeat service.

e.g.

---
  - command: systemctl enable filebeat.service
  - command: systemctl start filebeat.service

and then try to run any other playbook.

EXPECTED RESULTS

The second playbook completes.

ACTUAL RESULTS

The second playbook hangs while gathering facts.

TL;DR: all of the results below are expected (or maybe "allowed") behaviour from CPython's point of view. Usually we get away with it in ansible because stack unwinding is very quick, but something about our setup makes this slower than usual.

Backtrace from gdb shows that the worker process is hung around here:

  File "/usr/local/lib/python3.6/dist-packages/ansible/playbook/play_context.py", line 433, in set_task_and_variable_override
    new_info.remote_user = pwd.getpwuid(os.getuid()).pw_name
  File "/usr/local/lib/python3.6/dist-packages/ansible/executor/task_executor.py", line 465, in _execute
    self._play_context = self._play_context.set_task_and_variable_override(task=self._task, variables=variables, templar=templar)

The call to getpwuid requires opening a shared library, libnss_compat.so.2, which requires a lock that appears to be held by another process...

#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f56aa26c098 in __GI___pthread_mutex_lock (mutex=0x7f56aaa9a990 <_rtld_global+2352>) at ../nptl/pthread_mutex_lock.c:113
#2  0x00007f56aa87e01e in _dl_add_to_namespace_list (new=0x1fd7f30, nsid=0) at dl-object.c:33
#3  0x00007f56aa8793c7 in _dl_map_object_from_fd (name=name@entry=0x7ffc3e1e3f90 "libnss_compat.so.2", origname=origname@entry=0x0, fd=-1, fbp=fbp@entry=0x7ffc3e1e37b0, realname=<optimised out>, 
    loader=loader@entry=0x0, l_type=<optimised out>, mode=<optimised out>, stack_endp=<optimised out>, nsid=<optimised out>) at dl-load.c:1322
#4  0x00007f56aa87affc in _dl_map_object (loader=0x0, loader@entry=0x7f56aaa90000, name=name@entry=0x7ffc3e1e3f90 "libnss_compat.so.2", type=type@entry=2, trace_mode=trace_mode@entry=0, 
    mode=mode@entry=-1879048191, nsid=<optimised out>) at dl-load.c:2389
#5  0x00007f56aa886ee4 in dl_open_worker (a=a@entry=0x7ffc3e1e3d20) at dl-open.c:235
#6  0x00007f56aa5e82df in __GI__dl_catch_exception (exception=exception@entry=0x7ffc3e1e3d00, operate=operate@entry=0x7f56aa886dc0 <dl_open_worker>, args=args@entry=0x7ffc3e1e3d20) at dl-error-skeleton.c:196
#7  0x00007f56aa8867ca in _dl_open (file=0x7ffc3e1e3f90 "libnss_compat.so.2", mode=-2147483647, caller_dlopen=0x7f56aa5c9886 <nss_load_library+294>, nsid=<optimised out>, argc=8, argv=<optimised out>, 
    env=0x7ffc3e1e6e00) at dl-open.c:605
#8  0x00007f56aa5e73ad in do_dlopen (ptr=ptr@entry=0x7ffc3e1e3f50) at dl-libc.c:96

However, that process is no longer running by this point. Tracing it back by catching clone() system calls, we determined that the holder of the lock was actually an ansible results thread. At the point of the fork, the backtrace of the results thread (or rather what's left of the underlying LWP - by this point join() has returned and it is Python's C code that is cleaning up the pthread) looks like this:

#0  _Unwind_IteratePhdrCallback (info=<optimised out>, size=<optimised out>, ptr=0x7fffefb0e970) at ../../../src/libgcc/unwind-dw2-fde-dip.c:406
#1  0x00007ffff7b49f21 in __GI___dl_iterate_phdr (callback=callback@entry=0x7fffee907890 <_Unwind_IteratePhdrCallback>, data=data@entry=0x7fffefb0e970) at dl-iteratephdr.c:75
#2  0x00007fffee908b31 in _Unwind_Find_FDE (pc=0x6383c4 <PyThread_exit_thread+20>, bases=bases@entry=0x7fffefb0edc8) at ../../../src/libgcc/unwind-dw2-fde-dip.c:469
#3  0x00007fffee905223 in uw_frame_state_for (context=context@entry=0x7fffefb0ed20, fs=fs@entry=0x7fffefb0ea60) at ../../../src/libgcc/unwind-dw2.c:1257
#4  0x00007fffee90672f in _Unwind_ForcedUnwind_Phase2 (exc=exc@entry=0x7fffefb0fd70, context=context@entry=0x7fffefb0ed20, frames_p=frames_p@entry=0x7fffefb0ec28) at ../../../src/libgcc/unwind.inc:193
#5  0x00007fffee906cfc in _Unwind_ForcedUnwind (exc=0x7fffefb0fd70, stop=stop@entry=0x7ffff77d5d80 <unwind_stop>, stop_argument=<optimised out>) at ../../../src/libgcc/unwind.inc:217
#6  0x00007ffff77d5f10 in __GI___pthread_unwind (buf=<optimised out>) at unwind.c:121
#7  0x00007ffff77cdae5 in __do_cancel () at pthreadP.h:297
#8  __pthread_exit (value=<optimised out>) at pthread_exit.c:28
#9  0x00007ffff7b14504 in __pthread_exit (retval=<optimised out>) at forward.c:173
#10 0x00000000006383c5 in PyThread_exit_thread () at ../Python/thread_pthread.h:300
#11 0x00000000005e5f0f in t_bootstrap () at ../Modules/_threadmodule.c:1030
#12 0x0000000000638084 in pythread_wrapper (arg=<optimised out>) at ../Python/thread_pthread.h:205
#13 0x00007ffff77cc6db in start_thread (arg=0x7fffefb0f700) at pthread_create.c:463
#14 0x00007ffff7b0588f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

The call from _dl_iterate_phdr requires the same lock that _dl_open is waiting on in the first backtrace. From the worker process's point of view, that lock is never released.

Something about our particular setup means that the stack unwinding takes an unusually long time. We only see the problem when there is a cold page cache, which seems to make sense when you look at what _Unwind_IteratePhdrCallback does (tl;dr: it requires a lot of IO). We also only see it on certain EC2 instance types - it seems to come down to the balance between IO and compute.

@dbevacqua dbevacqua changed the title Unsafe use of fork() after threads join() causes ansible to hang under certain conditions. Unsafe use of fork() after join()-ing threads causes ansible to hang under certain conditions. Jul 26, 2019
@ansibot
Contributor

ansibot commented Jul 26, 2019

Files identified in the description:
None

If these files are inaccurate, please update the component name section of the description or use the !component bot command.

click here for bot help

@ansibot ansibot added the affects_2.7, bug, needs_triage, python3, support:core and traceback labels Jul 26, 2019
@samdoran
Contributor

cc @jimi-c

@samdoran samdoran removed the needs_triage label Jul 30, 2019
@jimi-c
Copy link
Member

jimi-c commented Aug 1, 2019

@dbevacqua so it seems to me the easiest way to fix this would be to

  1. Prevent workers from starting if we've started cleaning up
  2. Spin before calling join() on the results thread until we're sure no more workers are starting

I haven't tried to reproduce this yet, but based on your description it may not be easy to do. Would you be available via IRC or email to test out any patches I may come up with?
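
A rough sketch of that gating idea in generic terms (purely illustrative; the flag and counter names are made up and this is not the actual strategy code):

import threading
import time

cleanup_started = threading.Event()  # hypothetical: set when cleanup begins
starting_workers = 0                 # hypothetical: workers currently being started
counter_lock = threading.Lock()

def start_worker(launch):
    global starting_workers
    # 1. refuse to start new workers once cleanup has begun
    if cleanup_started.is_set():
        return False
    with counter_lock:
        starting_workers += 1
    try:
        launch()                     # fork/start the worker here
    finally:
        with counter_lock:
            starting_workers -= 1
    return True

def shut_down(results_thread):
    cleanup_started.set()
    # 2. spin until no worker is mid-start before joining the results thread
    while True:
        with counter_lock:
            if starting_workers == 0:
                break
        time.sleep(0.01)
    results_thread.join()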

@dbevacqua
Author

Hi @jimi-c

Unfortunately I'm not sure that will fix the issue. I think it might be worth clarifying what the sequence of events is and where things are happening.

  1. ansible creates results thread (backed by a kernel task, say T1) and calls join()
  2. join() returns, but (on Linux with glibc at least) the underlying LWP T1 continues to run in the background. I think this behaviour is allowed/expected in CPython - see [1]
  3. T1 does some cleanup work on itself, which involves taking some locks relating to shared libraries
  4. ansible creates a worker process (backed by another kernel task, say T2) for the next playbook (using fork() on Linux). At this point, in my specific conditions, T1 still holds a lock, which the forked worker inherits in a locked state and will never see released. I've been very unlucky, but this could happen to anyone.
  5. worker process tries to load a shared library but can't because it thinks T1 still has the lock

The point is that the behaviour of the task after the results thread has join()ed is completely out of Python's control and could go on for an arbitrary amount of time, with no way of knowing (from Python) whether it's finished or not.
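
To make the ordering concrete, the pattern boils down to something like this (plain-Python illustration only; it will almost never hang on its own, since the window between join() returning and the LWP finishing its unwind is normally tiny):

import os
import pwd
import threading

def results_loop():
    pass  # drains the results queue, then returns

t = threading.Thread(target=results_loop)
t.start()
t.join()        # returns once the Python-level thread state is torn down...

pid = os.fork() # ...but the underlying LWP may still be unwinding and holding
                # dynamic-loader locks, which the child inherits
if pid == 0:
    # child: getpwuid() dlopen()s libnss_*, which needs the loader lock;
    # if that lock was snapshotted in a locked state, this never returns
    print(pwd.getpwuid(os.getuid()).pw_name)
    os._exit(0)
else:
    os.waitpid(pid, 0)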

My proposed fix is to use forkserver start method for creating subprocesses. This seems to be the recommended way to avoid this kind of problem and, if I understand correctly, would work as follows:

  1. early on, we tell ansible to use the forkserver start method globally. A process is forked whose sole purpose is to fork other processes. This fork is safe because no threads have been created yet. <- this is the new bit
  2. ansible creates results thread
  3. join() returns, LWP continues
  4. T1 unwinds
  5. ansible creates worker process, T2, but this is forked from the forkserver process, which is "clean" (i.e. has no locks)
  6. worker process tries to load shared library and succeeds!

I believe that this could also solve #49207 and related issues.

I should probably stress that I am neither a CPython expert nor a systems programmer, so I am quite prepared to be proved wrong on this.

Very happy to collaborate by testing fixes, but I'd be even happier to contribute more substantively towards one as I've invested considerably more time on this than I'd care to admit and would love to see it through!

[1] https://stackoverflow.com/a/57053954/527997 (disclaimer: Q&A are both mine)

@pbrezina
Contributor

Is there any progress on this ticket?

@mrstanwell

This comment has been minimized.

@bcoca

This comment has been minimized.

bcoca added a commit to bcoca/ansible that referenced this issue Oct 30, 2020
  local action plugin already does and this also should fix
  fork/thread issue by removing use of pwd library

  fixes ansible#59642
@bcoca
Member

bcoca commented Oct 30, 2020

@dbevacqua can you confirm that PR above fixes your issue?

@ansibot ansibot added the has_pr label Oct 30, 2020
@mrstanwell

@bcoca I wish I could, but I haven't been able to reproduce it under controlled conditions (the playbook is run by my CI/CD system). I'll go weeks without seeing it. I've run the playbook in a loop for hours without being able to trigger it. I have multiple playbooks, but only the one that starts by gathering facts hangs. The hang occurs almost immediately after the gather facts task, regardless of what other tasks come next. I've tried rearranging tasks and always see the hang at the same spot regardless of the task. It's hung on tasks being skipped. It's hung on an include. I'd like to get a stack trace of a hang in progress, but the CI system kills it after a timeout. I can't even leave it running with DEBUG logging on, because that would expose secrets in the CI log files.

I realize there's not much to go on here, which is why I haven't opened an issue.

@dbevacqua
Author

@bcoca inasmuch as it removes the line on which the hang was occurring, then yes, it would probably fix the particular issue I reported, but I doubt it will fix all problems of this class. I'll give it a go, but I personally wouldn't consider the issue fixed as a result of this change.

@bcoca
Member

bcoca commented Oct 30, 2020

In general we avoid mixing threads and forks, but yes, we depend on included libraries not doing such things either.

Aside from pwd, I don't believe we have a remaining library that does this in core code. As for 3rd party plugins, we cannot really control or stop them; we can only document and advise.

I cannot think of a way to resolve this universally (aside from not forking nor threading at all, which is not an option), other than ensuring that all code added adheres to good behaviour. So I'm going to consider this specific issue resolved once I merge the PR.

@dbevacqua
Author

I believe using the forkserver start method as I suggest above would resolve this universally, though it would probably be a bit slower.

See the Python docs:

fork
The parent process uses os.fork() to fork the Python interpreter. The child process, when it begins, is effectively identical to the parent process. All resources of the parent are inherited by the child process. Note that safely forking a multithreaded process is problematic.

Available on Unix only. The default on Unix.

forkserver
When the program starts and selects the forkserver start method, a server process is started. From then on, whenever a new process is needed, the parent process connects to the server and requests that it fork a new process. The fork server process is single threaded so it is safe for it to use os.fork(). No unnecessary resources are inherited.

@bcoca
Member

bcoca commented Oct 30, 2020

We've tried using forkserver, but it won't work in many contexts since most of the code is not compatible with start methods like spawn, as we rely on fork inheritance.

@pbrezina
Contributor

pbrezina commented Nov 4, 2020

As mentioned in #61701, this makes Ansible very unreliable when used under terminal multiplexers. In my case, if it hangs, it always hangs at this point:

- name: Remove hostname from /etc/hosts
  become: True
  lineinfile:
    path: /etc/hosts
    regexp: '^(127\.0\.0\.1|::1)[ \t]+{{ ipa.hostname }}.*$'
    state: absent

It would be good to see this issue prioritized.

mattclay pushed a commit that referenced this issue Nov 5, 2020
  local action plugin already does and this also should fix
  fork/thread issue by removing use of pwd library

  fixes #59642
bcoca added a commit to bcoca/ansible that referenced this issue Nov 5, 2020
  local action plugin already does and this also should fix
  fork/thread issue by removing use of pwd library

  fixes ansible#59642

(cherry picked from commit 488b9d6)
bcoca added a commit to bcoca/ansible that referenced this issue Nov 5, 2020
  local action plugin already does and this also should fix
  fork/thread issue by removing use of pwd library

  fixes ansible#59642

(cherry picked from commit 488b9d6)
@ansible ansible locked and limited conversation to collaborators Dec 3, 2020
relrod pushed a commit that referenced this issue Dec 7, 2020
* remove redundant remote_user for local setting

  local action plugin already does and this also should fix
  fork/thread issue by removing use of pwd library

  fixes #59642

(cherry picked from commit 488b9d6)

* ensure local exposes correct user (#72543)

* ensure local exposes correct user

  avoid corner case in which delegation relied on
  playcontext fallback which was removed

  fixes #72541

(cherry picked from commit aa4d53c)
relrod pushed a commit that referenced this issue Dec 8, 2020
* remove redundant remote_user for local setting

  local action plugin already does and this also should fix
  fork/thread issue by removing use of pwd library

  fixes #59642

(cherry picked from commit 488b9d6)

* ensure local exposes correct user (#72543)

* ensure local exposes correct user

  avoid corner case in which delegation relied on
  playcontext fallback which was removed

  fixes #72541

(cherry picked from commit aa4d53c)