
Unsafe use of fork() after join()-ing threads causes ansible to hang under certain conditions. #59642

Closed
dbevacqua opened this issue Jul 26, 2019 · 14 comments · Fixed by #72412
Labels
affects_2.7, bug, has_pr, python3, support:core, traceback

Comments

@dbevacqua

SUMMARY

Under certain conditions (see "STEPS TO REPRODUCE"), an ansible playbook run with local connection will hang when gathering facts.

We have traced the problem to unsafe use of fork() (via creation of WorkerProcess [1]) after result threads have been join()ed [2]. The Python docs [3] warn against this:

Note that safely forking a multithreaded process is problematic

This bug report [4] makes the point somewhat more emphatically:

Do realize that while working on this it is fundamentally impossible per POSIX for os.fork() to be safely used at the Python level in a process also using pthreads. That this ever appeared to work is a pure accident of implementations of underlying libc, malloc, system libraries, and kernel behaviors. POSIX considers it undefined behavior. Nothing done in CPython can avoid that. Any "fix" for these kinds of issues is merely working around the inevitable which will re-occur.

We have experimented with using forkserver instead of fork for starting subprocesses in ansible, and, barring some minor workaround for serialization issues that arise as a result, this seems to fix the problem. We would therefore propose either a configuration option that lets users pick a start method or, perhaps more correctly/definitively, making forkserver the default where available (though we realise there could be some consequences here). We would appreciate some opinions/guidance on whether this is a sane approach before we dive in and try to make the change.

[1] https://github.com/ansible/ansible/blob/v2.7.0/lib/ansible/plugins/strategy/__init__.py#L320
[2] https://github.com/ansible/ansible/blob/v2.7.0/lib/ansible/plugins/strategy/__init__.py#L228
[3] https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods
[4] https://bugs.python.org/issue35866#msg340442
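
For concreteness, here is a minimal sketch of the kind of change we have in mind (illustrative only; the helper name and placement are assumptions, not actual ansible code). The idea is to pick the start method once, early at startup, before any threads exist:

import multiprocessing

def get_mp_context():
    # Prefer forkserver where available (Unix): the forkserver process is
    # single-threaded, so forking from it cannot inherit locks held by
    # threads in the main process.
    try:
        return multiprocessing.get_context('forkserver')
    except ValueError:
        # forkserver not supported on this platform; fall back to the default
        return multiprocessing.get_context()

MP_CONTEXT = get_mp_context()

# WorkerProcess creation would then go through this context,
# i.e. MP_CONTEXT.Process(...) instead of multiprocessing.Process(...).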

ISSUE TYPE
  • Bug Report
COMPONENT NAME

Core

ANSIBLE VERSION
ansible 2.7.0
  config file = None
  configured module search path = ['/home/tw-admin/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/lib/python3.6/dist-packages/ansible
  executable location = /usr/local/bin/ansible
  python version = 3.6.8 (default, Jan 14 2019, 11:02:34) [GCC 8.0.1 20180414 (experimental) [trunk revision 259383]]
CONFIGURATION
OS / ENVIRONMENT
$ lsb_release -d
Description:	Ubuntu 18.04.2 LTS

$ uname -a
Linux amp 4.15.0-1044-aws #46-Ubuntu SMP Thu Jul 4 13:38:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

$ dpkg -l | grep libc6
ii  libc6:amd64                    2.27-3ubuntu1                     amd64        GNU C Library: Shared libraries
ii  libc6-dbg:amd64                2.27-3ubuntu1                     amd64        GNU C Library: detached debugging symbols

$ dpkg -l | grep libgcc
ii  libgcc1:amd64                  1:8.3.0-6ubuntu1~18.04.1          amd64        GCC support library
STEPS TO REPRODUCE

It's complicated. The issue arises when we invoke systemd or command to both enable and start the filebeat service.

e.g.

---
  - command: systemctl enable filebeat.service
  - command: systemctl start filebeat.service

and then try to run any other playbook.

EXPECTED RESULTS

The second playbook completes.

ACTUAL RESULTS

The second playbook hangs while gathering facts.

TL;DR: all of the results below are expected (or maybe "allowed") behaviour from CPython's point of view. Usually we get away with it in ansible because stack unwinding is very quick, but something about our setup makes this slower than usual.

Backtrace from gdb shows that the worker process is hung around here:

  File "/usr/local/lib/python3.6/dist-packages/ansible/playbook/play_context.py", line 433, in set_task_and_variable_override
    new_info.remote_user = pwd.getpwuid(os.getuid()).pw_name
  File "/usr/local/lib/python3.6/dist-packages/ansible/executor/task_executor.py", line 465, in _execute
    self._play_context = self._play_context.set_task_and_variable_override(task=self._task, variables=variables, templar=templar)

The call to getpwuid requires opening a shared library, libnss_compat.so.2, which requires a lock that appears to be held by another process...

#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f56aa26c098 in __GI___pthread_mutex_lock (mutex=0x7f56aaa9a990 <_rtld_global+2352>) at ../nptl/pthread_mutex_lock.c:113
#2  0x00007f56aa87e01e in _dl_add_to_namespace_list (new=0x1fd7f30, nsid=0) at dl-object.c:33
#3  0x00007f56aa8793c7 in _dl_map_object_from_fd (name=name@entry=0x7ffc3e1e3f90 "libnss_compat.so.2", origname=origname@entry=0x0, fd=-1, fbp=fbp@entry=0x7ffc3e1e37b0, realname=<optimised out>, 
    loader=loader@entry=0x0, l_type=<optimised out>, mode=<optimised out>, stack_endp=<optimised out>, nsid=<optimised out>) at dl-load.c:1322
#4  0x00007f56aa87affc in _dl_map_object (loader=0x0, loader@entry=0x7f56aaa90000, name=name@entry=0x7ffc3e1e3f90 "libnss_compat.so.2", type=type@entry=2, trace_mode=trace_mode@entry=0, 
    mode=mode@entry=-1879048191, nsid=<optimised out>) at dl-load.c:2389
#5  0x00007f56aa886ee4 in dl_open_worker (a=a@entry=0x7ffc3e1e3d20) at dl-open.c:235
#6  0x00007f56aa5e82df in __GI__dl_catch_exception (exception=exception@entry=0x7ffc3e1e3d00, operate=operate@entry=0x7f56aa886dc0 <dl_open_worker>, args=args@entry=0x7ffc3e1e3d20) at dl-error-skeleton.c:196
#7  0x00007f56aa8867ca in _dl_open (file=0x7ffc3e1e3f90 "libnss_compat.so.2", mode=-2147483647, caller_dlopen=0x7f56aa5c9886 <nss_load_library+294>, nsid=<optimised out>, argc=8, argv=<optimised out>, 
    env=0x7ffc3e1e6e00) at dl-open.c:605
#8  0x00007f56aa5e73ad in do_dlopen (ptr=ptr@entry=0x7ffc3e1e3f50) at dl-libc.c:96

However, that process is no longer running by this point. Tracing it back by catching clone() system calls, we determined that the holder of the lock was actually an ansible results thread. At the point of the fork, the backtrace of the results thread (or rather what's left of the underlying LWP - by this point join() has returned and it is Python's C code that is cleaning up the pthread) looks like this:

#0  _Unwind_IteratePhdrCallback (info=<optimised out>, size=<optimised out>, ptr=0x7fffefb0e970) at ../../../src/libgcc/unwind-dw2-fde-dip.c:406
#1  0x00007ffff7b49f21 in __GI___dl_iterate_phdr (callback=callback@entry=0x7fffee907890 <_Unwind_IteratePhdrCallback>, data=data@entry=0x7fffefb0e970) at dl-iteratephdr.c:75
#2  0x00007fffee908b31 in _Unwind_Find_FDE (pc=0x6383c4 <PyThread_exit_thread+20>, bases=bases@entry=0x7fffefb0edc8) at ../../../src/libgcc/unwind-dw2-fde-dip.c:469
#3  0x00007fffee905223 in uw_frame_state_for (context=context@entry=0x7fffefb0ed20, fs=fs@entry=0x7fffefb0ea60) at ../../../src/libgcc/unwind-dw2.c:1257
#4  0x00007fffee90672f in _Unwind_ForcedUnwind_Phase2 (exc=exc@entry=0x7fffefb0fd70, context=context@entry=0x7fffefb0ed20, frames_p=frames_p@entry=0x7fffefb0ec28) at ../../../src/libgcc/unwind.inc:193
#5  0x00007fffee906cfc in _Unwind_ForcedUnwind (exc=0x7fffefb0fd70, stop=stop@entry=0x7ffff77d5d80 <unwind_stop>, stop_argument=<optimised out>) at ../../../src/libgcc/unwind.inc:217
#6  0x00007ffff77d5f10 in __GI___pthread_unwind (buf=<optimised out>) at unwind.c:121
#7  0x00007ffff77cdae5 in __do_cancel () at pthreadP.h:297
#8  __pthread_exit (value=<optimised out>) at pthread_exit.c:28
#9  0x00007ffff7b14504 in __pthread_exit (retval=<optimised out>) at forward.c:173
#10 0x00000000006383c5 in PyThread_exit_thread () at ../Python/thread_pthread.h:300
#11 0x00000000005e5f0f in t_bootstrap () at ../Modules/_threadmodule.c:1030
#12 0x0000000000638084 in pythread_wrapper (arg=<optimised out>) at ../Python/thread_pthread.h:205
#13 0x00007ffff77cc6db in start_thread (arg=0x7fffefb0f700) at pthread_create.c:463
#14 0x00007ffff7b0588f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

The call from _dl_iterate_phdr requires the same lock that _dl_open is waiting on in the first backtrace. From the worker process's point of view, that lock is never released.

Something about our particular setup means that the stack unwinding takes an unusually long time. We only see the problem when there is a cold page cache, which seems to make sense when you look at what _Unwind_IteratePhdrCallback does (tl;dr: it requires a lot of IO). We also only see it on certain EC2 instance types - it seems to come down to the balance between IO and compute.

@dbevacqua dbevacqua changed the title Unsafe use of fork() after threads join() causes ansible to hang under certain conditions. Unsafe use of fork() after join()-ing threads causes ansible to hang under certain conditions. Jul 26, 2019
@ansibot
Contributor

ansibot commented Jul 26, 2019

Files identified in the description:
None

If these files are inaccurate, please update the component name section of the description or use the !component bot command.

click here for bot help

@ansibot ansibot added the affects_2.7, bug, needs_triage, python3, support:core and traceback labels Jul 26, 2019
@samdoran
Contributor

cc @jimi-c

@samdoran samdoran removed the needs_triage label Jul 30, 2019
@jimi-c
Copy link
Member

jimi-c commented Aug 1, 2019

@dbevacqua so it seems to me the easiest way to fix this would be to

  1. Prevent workers from starting if we've started cleaning up
  2. Spin before calling join() on the results thread until we're sure no more workers are starting

I haven't tried to reproduce this yet, but based on your description it may not be easy to do. Would you be available via IRC or email to test out any patches I may come up with?
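
A rough sketch of that gating idea in generic terms (purely illustrative; the flag and counter names are made up and this is not the actual strategy code):

import threading
import time

cleanup_started = threading.Event()  # hypothetical: set when cleanup begins
starting_workers = 0                 # hypothetical: workers currently being started
counter_lock = threading.Lock()

def start_worker(launch):
    global starting_workers
    # 1. refuse to start new workers once cleanup has begun
    if cleanup_started.is_set():
        return False
    with counter_lock:
        starting_workers += 1
    try:
        launch()                     # fork/start the worker here
    finally:
        with counter_lock:
            starting_workers -= 1
    return True

def shut_down(results_thread):
    cleanup_started.set()
    # 2. spin until no worker is mid-start before joining the results thread
    while True:
        with counter_lock:
            if starting_workers == 0:
                break
        time.sleep(0.01)
    results_thread.join()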

@dbevacqua
Author

Hi @jimi-c

Unfortunately I'm not sure that will fix the issue. I think it might be worth clarifying what the sequence of events is and where things are happening.

  1. ansible creates results thread (backed by a kernel task, say T1) and calls join()
  2. join() returns, but (on Linux with glibc at least) the underlying LWP T1 continues to run in the background. I think this behaviour is allowed/expected in CPython - see [1]
  3. T1 does some cleanup work on itself, which involves taking some locks relating to shared libraries
  4. ansible creates a worker process (backed by another kernel task, say T2) for the next playbook (using fork() on Linux). At this point, in my specific conditions, T1 still holds a lock, which the forked worker inherits in a locked state and will never see released. I've been very unlucky, but this could happen to anyone.
  5. worker process tries to load a shared library but can't because it thinks T1 still has the lock

The point is that the behaviour of the task after the results thread has join()ed is completely out of Python's control and could go on for an arbitrary amount of time, with no way of knowing (from Python) whether it's finished or not.
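
To make the ordering concrete, the pattern boils down to something like this (plain-Python illustration only; it will almost never hang on its own, since the window between join() returning and the LWP finishing its unwind is normally tiny):

import os
import pwd
import threading

def results_loop():
    pass  # drains the results queue, then returns

t = threading.Thread(target=results_loop)
t.start()
t.join()        # returns once the Python-level thread state is torn down...

pid = os.fork() # ...but the underlying LWP may still be unwinding and holding
                # dynamic-loader locks, which the child inherits
if pid == 0:
    # child: getpwuid() dlopen()s libnss_*, which needs the loader lock;
    # if that lock was snapshotted in a locked state, this never returns
    print(pwd.getpwuid(os.getuid()).pw_name)
    os._exit(0)
else:
    os.waitpid(pid, 0)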

My proposed fix is to use forkserver start method for creating subprocesses. This seems to be the recommended way to avoid this kind of problem and, if I understand correctly, would work as follows:

  1. early on, we tell ansible to use the forkserver start method globally. A process is forked whose sole purpose is to fork other processes. This fork is safe because no threads have been created yet. <- this is the new bit
  2. ansible creates results thread
  3. join() returns, LWP continues
  4. T1 unwinds
  5. ansible creates worker process, T2, but this is forked from the forkserver process, which is "clean" (i.e. has no locks)
  6. worker process tries to load shared library and succeeds!

I believe that this could also solve #49207 and related issues.

I should probably stress that I am neither a CPython expert nor a systems programmer, so I am quite prepared to be proved wrong on this.

Very happy to collaborate by testing fixes, but I'd be even happier to contribute more substantively towards one as I've invested considerably more time on this than I'd care to admit and would love to see it through!

[1] https://stackoverflow.com/a/57053954/527997 (disclaimer: Q&A are both mine)

@pbrezina
Contributor

Is there any progress on this ticket?

@mrstanwell

This comment has been minimized.

@bcoca

This comment has been minimized.

bcoca added a commit to bcoca/ansible that referenced this issue Oct 30, 2020
  local action plugin already does and this also should fix
  fork/thread issue by removing use of pwd library

  fixes ansible#59642
@bcoca
Member

bcoca commented Oct 30, 2020

@dbevacqua can you confirm that PR above fixes your issue?

@ansibot ansibot added the has_pr label Oct 30, 2020
@mrstanwell

@bcoca I wish I could, but I haven't been able to reproduce it under controlled conditions (the playbook is run by my CI/CD system). I'll go weeks without seeing it. I've run the playbook in a loop for hours without being able to trigger it. I have multiple playbooks, but only the one that starts by gathering facts hangs. The hang occurs almost immediately after the gather facts task, regardless of what other tasks come next. I've tried rearranging tasks and always see the hang at the same spot regardless of the task. It's hung on tasks being skipped. It's hung on an include. I'd like to get a stack trace of a hang in progress, but the CI system kills it after a timeout. I can't even leave it running with DEBUG logging on, because that would expose secrets in the CI log files.

I realize there's not much to go on here, which is why I haven't opened an issue.

@dbevacqua
Author

@bcoca inasmuch as it removes the line on which the hang was occurring, then yes, it would probably fix the particular issue I reported, but I doubt it will fix all problems of this class. I'll give it a go, but I personally wouldn't consider the issue fixed as a result of this change.

@bcoca
Member

bcoca commented Oct 30, 2020

In general we avoid mixing threads and forks, but yes, we depend on included libraries not doing such things either.

Aside from pwd, I don't believe we have a remaining library that does this in core code. As for 3rd party plugins, we cannot really control or stop them; we can only document and advise.

I cannot think of a way to resolve this universally (aside from not forking nor threading at all, which is not an option), other than ensuring that all code added adheres to good behaviour. So I'm going to consider this specific issue resolved once I merge the PR.

@dbevacqua
Author

I believe using the forkserver start method as I suggest above would resolve this universally, though it would probably be a bit slower.

See the Python docs:

fork
The parent process uses os.fork() to fork the Python interpreter. The child process, when it begins, is effectively identical to the parent process. All resources of the parent are inherited by the child process. Note that safely forking a multithreaded process is problematic.

Available on Unix only. The default on Unix.

forkserver
When the program starts and selects the forkserver start method, a server process is started. From then on, whenever a new process is needed, the parent process connects to the server and requests that it fork a new process. The fork server process is single threaded so it is safe for it to use os.fork(). No unnecessary resources are inherited.

@bcoca
Member

bcoca commented Oct 30, 2020

We've tried using forkserver, but it won't work in many contexts since most of the code is not compatible with start methods like spawn, as we rely on fork inheritance.

@pbrezina
Contributor

pbrezina commented Nov 4, 2020

As mentioned in #61701, this makes Ansible very unreliable when used under terminal multiplexers. In my case, if it hangs, it always hangs at this point:

- name: Remove hostname from /etc/hosts
  become: True
  lineinfile:
    path: /etc/hosts
    regexp: '^(127\.0\.0\.1|::1)[ \t]+{{ ipa.hostname }}.*$'
    state: absent

It would be good to see this issue prioritized.

mattclay pushed a commit that referenced this issue Nov 5, 2020
  local action plugin already does and this also should fix
  fork/thread issue by removing use of pwd library

  fixes #59642
bcoca added a commit to bcoca/ansible that referenced this issue Nov 5, 2020
  local action plugin already does and this also should fix
  fork/thread issue by removing use of pwd library

  fixes ansible#59642

(cherry picked from commit 488b9d6)
bcoca added a commit to bcoca/ansible that referenced this issue Nov 5, 2020
  local action plugin already does and this also should fix
  fork/thread issue by removing use of pwd library

  fixes ansible#59642

(cherry picked from commit 488b9d6)
@ansible ansible locked and limited conversation to collaborators Dec 3, 2020
relrod pushed a commit that referenced this issue Dec 7, 2020
* remove redundant remote_user for local setting

  local action plugin already does and this also should fix
  fork/thread issue by removing use of pwd library

  fixes #59642

(cherry picked from commit 488b9d6)

* ensure local exposes correct user (#72543)

* ensure local exposes correct user

  avoid corner case in which delegation relied on
  playcontext fallback which was removed

  fixes #72541

(cherry picked from commit aa4d53c)
relrod pushed a commit that referenced this issue Dec 8, 2020
* remove redundant remote_user for local setting

  local action plugin already does and this also should fix
  fork/thread issue by removing use of pwd library

  fixes #59642

(cherry picked from commit 488b9d6)

* ensure local exposes correct user (#72543)

* ensure local exposes correct user

  avoid corner case in which delegation relied on
  playcontext fallback which was removed

  fixes #72541

(cherry picked from commit aa4d53c)