Unsafe use of fork() after join()-ing threads causes ansible to hang under certain conditions. #59642
Comments
Files identified in the description. If these files are inaccurate, please update the component list in the description.
cc @jimi-c
@dbevacqua so it seems to me the easiest way to fix this would be to
I haven't tried to reproduce this yet, but based on your description it may not be easy to do. Would you be available via IRC or email to test out any patches I may come up with?
Hi @jimi-c. Unfortunately I'm not sure that will fix the issue. I think it might be worth clarifying what the sequence of events is and where things are happening.

The point is that the behaviour of a `fork()` issued after the results thread has been `join()`ed is the problem. My proposed fix is to use the `forkserver` start method [1], so that workers are never forked directly from the multithreaded main process.
I believe that this could also solve #49207 and related issues.

I should probably stress that I am neither a CPython expert nor a systems programmer, so I am quite prepared to be proved wrong on this. Very happy to collaborate by testing fixes, but I'd be even happier to contribute more substantively towards one, as I've invested considerably more time on this than I'd care to admit and would love to see it through!

[1] https://stackoverflow.com/a/57053954/527997 (disclaimer: Q&A are both mine)
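A minimal sketch (ours, not Ansible's actual code) of what the proposed fix means in `multiprocessing` terms: with `forkserver`, new workers are forked from a separate, single-threaded server process, so they cannot inherit locks held by threads in the parent.

```python
# Minimal sketch, not Ansible code: selecting the "forkserver" start
# method with the standard multiprocessing API (Unix only).
import multiprocessing

def work():
    print("child is running")

if __name__ == "__main__":
    # "fork" (the Unix default) clones the whole parent process, including
    # any locks that happen to be held at fork time; "forkserver" forks
    # children from a single-threaded helper process instead.
    ctx = multiprocessing.get_context("forkserver")
    p = ctx.Process(target=work)
    p.start()
    p.join()
```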
Is there any progress on this ticket?
local action plugin already does and this also should fix fork/thread issue by removing use of pwd library fixes ansible#59642
@dbevacqua can you confirm that PR above fixes your issue?
@bcoca I wish I could, but I haven't been able to reproduce it under controlled conditions (the playbook is run by my CI/CD system). I'll go weeks without seeing it. I've run the playbook in a loop for hours without being able to trigger it.

I have multiple playbooks, but only the one that starts with a gather-facts task hangs. The hang occurs almost immediately after the gather-facts task, regardless of what other tasks come next. I've tried rearranging tasks, and always see the hang at the same spot regardless of task. It's hung on tasks being skipped. It's hung on an include.

I'd like to get a stack trace of a hang in progress, but the CI system kills it after a timeout. I can't even leave it running with DEBUG logging on, because that would expose secrets in the CI log files. I realize there's not much to go on here, which is why I haven't opened an issue.
@bcoca inasmuch as it removes the line on which the hang was occurring, then yes, it would probably fix the particular issue I reported, but I doubt it will fix all problems of this class. I'll give it a go, but I personally wouldn't consider the issue fixed as a result of this change.
In general we avoid mixing threads and forks, but yes, we are dependent on included libraries not doing such things either. Aside from `pwd`, I don't believe we have a remaining library that does this in core code. As for 3rd-party plugins, we cannot really control or stop them, just document and advise. I cannot think of a possible way to resolve this universally (aside from not forking nor threading at all, which is not an option), other than ensuring that all code added adheres to the good behaviour. So I'm going to consider this specific issue resolved once I merge the PR.
I believe using the `forkserver` start method would avoid this. See the Python docs on start methods:

> `fork`: Available on Unix only. The default on Unix.
>
> `forkserver`: ... The fork server process is single threaded so it is safe for it to use `os.fork()`.
We've tried using `forkserver`, but it won't work in many contexts, since most of the code is not compatible with things like the pickling it requires.
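To illustrate the kind of incompatibility being referred to (assuming it is the serialization that `forkserver` requires, as mentioned in the issue summary): `forkserver`, like `spawn`, must pickle the `Process` target and its arguments, whereas plain `fork` simply inherits them, so for example a lambda that works under `fork` fails under `forkserver`.

```python
# Sketch of the pickling constraint: this runs fine with the "fork"
# context but fails with "forkserver", because lambdas cannot be pickled.
import multiprocessing

if __name__ == "__main__":
    ctx = multiprocessing.get_context("forkserver")
    p = ctx.Process(target=lambda: print("hi"))
    p.start()  # raises a pickling error under "forkserver"
    p.join()
```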
As mentioned in #61701, this makes Ansible very unreliable when used under terminal multiplexers. In my case, if it hangs, it always hangs at this point:

```yaml
- name: Remove hostname from /etc/hosts
  become: True
  lineinfile:
    path: /etc/hosts
    regexp: '^(127\.0\.0\.1|::1)[ \t]+{{ ipa.hostname }}.*$'
    state: absent
```

It would be good to see this issue prioritized.
* remove redundant remote_user for local setting: local action plugin already does and this also should fix fork/thread issue by removing use of pwd library (fixes #59642; cherry picked from commit 488b9d6)
* ensure local exposes correct user (#72543): avoid corner case in which delegation relied on playcontext fallback which was removed (fixes #72541; cherry picked from commit aa4d53c)
SUMMARY
Under certain conditions (see "STEPS TO REPRODUCE" below), an Ansible playbook run with a local connection will hang when gathering facts.

We have traced the problem to unsafe use of `fork()` (via creation of `WorkerProcess` [1]) after result `Thread`s have been `join()`ed [2]. The Python docs [3] warn against forking a multithreaded process, and this CPython bug report [4] makes the point somewhat more emphatically.
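To make the hazard concrete, here is a minimal self-contained sketch (ours, not taken from Ansible) of this class of bug: a child forked while another thread holds a lock inherits that lock in the locked state, and since the holding thread does not exist in the child, nothing can ever release it. In the real issue the lock is glibc's dynamic-loader lock rather than a Python `threading.Lock`, but the mechanism is the same.

```python
# Minimal repro of the fork-vs-threads hazard (Unix only).
# WARNING: this script hangs on purpose.
import os
import threading
import time

lock = threading.Lock()

def holder():
    with lock:
        time.sleep(2)  # hold the lock while the main thread forks

t = threading.Thread(target=holder)
t.start()
time.sleep(0.5)  # make sure the holder thread owns the lock

pid = os.fork()
if pid == 0:
    # Child: only the forking thread survives fork(), but the lock was
    # copied in the locked state, so this acquire() blocks forever.
    lock.acquire()
    print("never reached")
    os._exit(0)
else:
    t.join()
    os.waitpid(pid, 0)  # waits forever on the deadlocked child
```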
We have experimented with using `forkserver` instead of `fork` for starting subprocesses in Ansible and, barring some minor workarounds for serialization issues that arise as a result, this seems to fix the problem. We would therefore propose either a configuration option that allows users to pick a start method or, perhaps more correctly/definitively, making this the default where available (though we realise that there could be some consequences here). We would appreciate some opinions/guidance on whether this is a sane approach before we dive in and try to make the change.

[1] https://github.com/ansible/ansible/blob/v2.7.0/lib/ansible/plugins/strategy/__init__.py#L320
[2] https://github.com/ansible/ansible/blob/v2.7.0/lib/ansible/plugins/strategy/__init__.py#L228
[3] https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods
[4] https://bugs.python.org/issue35866#msg340442
ISSUE TYPE
COMPONENT NAME
Core
ANSIBLE VERSION
CONFIGURATION
OS / ENVIRONMENT
STEPS TO REPRODUCE
It's complicated. The issue arises when we invoke `systemd` or `command` to both enable and start the filebeat service, e.g.
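A hypothetical sketch of the kind of task described (the reporter's actual example block is not captured here):

```yaml
# Hypothetical example, not the reporter's playbook: enable and start
# the filebeat service in a single task.
- name: Enable and start filebeat
  become: True
  systemd:
    name: filebeat
    enabled: yes
    state: started
```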
and then try and run any other playbook.
EXPECTED RESULTS
The second playbook completes.
ACTUAL RESULTS
The second playbook hangs while gathering facts.
TL;DR: all of the results below are expected (or maybe "allowed") behaviour from CPython's point of view. Usually we get away with it in Ansible because stack unwinding is very quick, but something about our setup makes it slower than usual.
Backtrace from `gdb` shows that the worker process is hung around here:

The call to `getpwuid` requires opening a shared library, `libnss_compat.so.2`, which requires a lock, which claims to be held by another process... however, that process is no longer running by this point.

Tracing it back by catching `clone()` system calls, we determined that the holding process was actually an Ansible results thread. At the point of the fork, the backtrace of the results thread (or rather what's left of the underlying LWP: by this point `join()` has returned and it's Python's C code that is cleaning up the pthread) looks like this:

The call from `_dl_iterate_phdr` requires the same lock that `_dlopen` is waiting for in the first backtrace. From the worker process's point of view, that lock is never released.

Something about our particular setup means that the stack unwinding takes an unusually long time. We only see the problem when there is a cold page cache, which seems to make sense when you look at what `_Unwind_IteratePhdrCallback` does (tl;dr: it requires a lot of IO). We also only see it on certain EC2 instance types; it seems to be about the balance between IO and compute.
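For anyone trying to capture a similar trace, one way to grab a backtrace from a hung worker (assuming `gdb` is installed; `<pid>` stands for the hung process id):

```sh
gdb -p <pid> -batch -ex "thread apply all bt"
```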