Ansible hangs indefinitely due to critical design flaw when 3rd party library in 3rd party plugin uses threads #75408
Comments
Files identified in the description: If these files are incorrect, please update the |
@sivel Is there anything done about this bug other than changing my title? In our project we had to apply dumb w/a with early import to get rid of the manifestation, but we would like to see a real fix. |
I suppose we are waiting for the bot to close the bug after inactivity period, so put a new comment here to make sure it does not happen. |
Any update on this issue? |
This happens under heavy load. Is there any update on this topic? |
Any news in this topic? |
I also noticed Ansible hanging forever. It is not easily reproducible, and I could not figure out why it hung. |
The bot auto-closes issues with a label requesting more info and takes 60 days (and leaves a reminder comment at day 30). |
Still hoping for some solution before bot closes this issue. Anybody? |
@gsikorski As was already mentioned by @s-hertel above, the bot will auto-close the issue only if there is a label requesting more info. Having said that, there is no such label applied to this issue at the moment, and as such the bot will NOT auto-close this issue with the current set of labels applied. |
We may have hit this issue as well. We moved from Python 2 with Ansible 2.10.4 to Python 3.6 with Ansible 2.11.1. Since then, a long-running playbook that patches Windows servers has begun hanging near the end of the play. The free strategy is selected for the play. If we check with top, using the c and V options to get the full command and process tree information, we can see that the hung ansible-playbook has a defunct child process and some sleeping child processes, while the main process never returns. It keeps running at 75% CPU usage but does nothing anymore. |
@gsikorski as stated in #59642, this is not a problem in Ansible as shipped (we have had those and fixed them in the past); it is only an issue if you use a library that itself creates the undesired behavior. So we don't have any immediate plans to redo this, as it requires a complete redesign of the core Ansible engine. We might address this as part of other plans for revamping said engine, but that still won't happen anytime soon. |
@bcoca Of course it is a problem in Ansible. As explained, the problem is always present whenever a new library is loaded (using standard |
@gsikorski my changing of the title (mostly adding to) was just to be more specific for when i or other maintainers review our lists and don't think this is a more generic issue. It is not a de-prioritization, titles are not how we prioritize, the P1-P3 labels are, none are assigned by default. As to 'no explanation' ... there is my own comment above done at the same time, so it really puzzles me that you state these things. So I'm going to 'ADD' information to the title, again, so we can more easily recognize the issue, you can change it again as you just did, that won't change the priority of it. |
At least with the |
Summary
Ansible hangs indefinitely due to a known critical design flaw that is thoroughly described and analysed in #59642. Unfortunately, like many other bugs, that issue was closed without the ability to comment and left unresolved, so it keeps hitting other people later.
In my case the issue manifests with the Python `crypt` module loaded by the `when` clause.
The issue seems to be fixed by a foolish pre-loading of the `crypt` module in `MT` (the `/usr/lib/python3.6/site-packages/ansible/plugins/strategy/__init__.py` file). This "patch", like the original https://github.com/ansible/ansible/pull/72412/files, looks like shooting blind to me and does not really address the original problem baked deep into the way Ansible runs playbooks.
Detailed analysis:
1. `MT` (the main thread) starts the result thread `RT` at the end of a playbook.
2. `RT` unlinks shared libraries and takes (locks) a mutex to update the array of linked libs. This normally happens during the thread cleanup process, and control may be returned to `MT` while the mutex is still held.
3. `MT` forks a process `FP` to execute the next playbook. The mutex state is copied from `MT` to the new process, including the information that the lock is held by `RT`.
4. `RT` finishes its cleanup handler and the thread is removed.
5. `FP` loads a new file (in our case the `crypt` Python module required by Ansible's `when` clause, but it may be any file, in many other places of the Ansible code or plugins).
6. `FP` tries to update the array of linked libs, but cannot, because it finds the mutex held by `RT`. The mutex is never released in `FP`, because it was copied from `MT` in the locked state, just before it was released there. As the information in the new process `FP` is a mere copy, it is never updated and the process hangs forever.
Issue Type
Bug Report
Component Name
core
Ansible Version
Configuration
Any
OS / Environment
Any (RHEL8 in my case)
Steps to Reproduce
Run many Ansible playbooks under heavy load. In our case the hang is tricky to reproduce, but as analysed in the Summary, it can happen on any run.
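Because the real race is hard to trigger, the mechanism from the analysis can be illustrated in isolation. The following is only a minimal sketch, not Ansible code: a plain `threading.Lock` stands in for the loader's mutex, a helper thread stands in for `RT`, and `os.fork()` stands in for `MT` creating `FP`. The function name `child_sees_lock_held` and its parameter are made up for this illustration. The child uses a timed acquire so the demo terminates instead of hanging.

```python
import os
import threading
import time


def child_sees_lock_held(hold_seconds: float = 0.5) -> bool:
    """Fork while a helper thread holds a lock.

    Returns True if the forked child could NOT acquire the copied lock,
    i.e. the lock stayed "held" by a thread that does not exist in the child.
    """
    lock = threading.Lock()          # stand-in for the loader's mutex (assumption)
    holding = threading.Event()

    def helper() -> None:
        # Stand-in for RT: take the lock, signal, hold it briefly, release.
        with lock:
            holding.set()
            time.sleep(hold_seconds)

    rt = threading.Thread(target=helper)
    rt.start()
    holding.wait()                   # the helper thread now holds the lock

    pid = os.fork()                  # stand-in for MT forking FP
    if pid == 0:
        # Child: only the forking thread was copied, so nothing here can
        # ever release the lock. A blocking acquire would hang forever;
        # the timeout is used purely so that this demo finishes.
        acquired = lock.acquire(timeout=hold_seconds * 4)
        os._exit(0 if acquired else 1)

    rt.join()                        # in the parent the helper releases normally
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) == 1
```

In the parent the lock is released as soon as the helper finishes, yet the child times out on its own copy. Replacing the timed acquire with a plain `lock.acquire()` gives the same kind of indefinite hang described in this report.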
Expected Results
No hang :)
Actual Results
Playbook hangs indefinitely on a mutex.
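The pre-loading workaround mentioned in the Summary amounts to importing the affected modules in the parent before any worker is forked, so that the forked process finds them already loaded and does not need to go through the loader. A rough sketch of that idea follows. The helper name `preload` and the module list are hypothetical, this is not the actual change made to the strategy plugin, and it only avoids the hang for the modules that are listed.

```python
import importlib
from typing import Iterable, List


def preload(names: Iterable[str]) -> List[str]:
    """Import each named module in the current (parent) process.

    Returns the names that were imported successfully. Modules imported
    here are already present in any process forked afterwards, so the
    child does not have to load them itself.
    """
    loaded = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            # Missing modules are skipped; the workaround is best effort.
            continue
        loaded.append(name)
    return loaded
```

A caller would run something like `preload(["crypt"])` once, early in the parent, before forking. As the report argues, this only sidesteps one manifestation: any other library loaded for the first time inside a forked process can still hit the same mutex.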
Code of Conduct