Ansible hangs indefinitely due to critical design flaw when 3rd party library in 3rd party plugin uses threads #75408

Open
1 task
gsikorski opened this issue Aug 5, 2021 · 17 comments
Labels
affects_2.16 affects_2.17 bug This issue/PR relates to a bug. support:core This issue/PR relates to code supported by the Ansible Engineering Team.

Comments

@gsikorski

gsikorski commented Aug 5, 2021

Summary

Ansible hangs indefinitely due to a known critical design flaw, described and analysed in detail in #59642. Unfortunately, like many other bugs, that issue was closed without the ability to comment and left unresolved, so it keeps hitting other people.

In my case the issue manifests when the Python crypt module is loaded by a when clause:

#93 Frame 0x560ee4d1b888, for file /usr/lib/python3.6/site-packages/ansible/plugins/loader.py, line 525, in _load_module_source (self=<Jinja2Loader(class_name='FilterModule', base_class=None, package='ansible.plugins.filter', subdir='filter_plugins', aliases={}, config=['/opt/app-root/src/.ansible/plugins/filter', '/usr/share/ansible/plugins/filter'], _extra_dirs=['/usr/share/my-playbooks/ansible/filter_plugins'], _module_cache={'/usr/share/my-playbooks/ansible/filter_plugins/address.py': <module at remote 0x7fa725ba64a8>}, _paths=['/usr/share/my-playbooks/ansible/filter_plugins', '/opt/app-root/src/.ansible/plugins/filter', '/usr/share/ansible/plugins/filter', '/usr/lib/python3.6/site-packages/ansible/plugins/filter', '/usr/lib/python3.6/site-packages/ansible/plugins/filter/__pycache__'], _plugin_path_cache={}, _searched_paths=set(), package_path='/usr/lib/python3.6/site-packages/ansible/plugins/filter') at remote 0x7fa72a6a6198>, name='8763610137784027770_core', path='/usr/lib/python3.6/site-packages/ansible/plugi...(truncated)
#97 Frame 0x560ee48b0ac8, for file /usr/lib/python3.6/site-packages/ansible/plugins/loader.py, line 670, in all (self=<Jinja2Loader(class_name='FilterModule', base_class=None, package='ansible.plugins.filter', subdir='filter_plugins', aliases={}, config=['/opt/app-root/src/.ansible/plugins/filter', '/usr/share/ansible/plugins/filter'], _extra_dirs=['/usr/share/my-playbooks/ansible/filter_plugins'], _module_cache={'/usr/share/my-playbooks/ansible/filter_plugins/address.py': <module at remote 0x7fa725ba64a8>}, _paths=['/usr/share/my-playbooks/ansible/filter_plugins', '/opt/app-root/src/.ansible/plugins/filter', '/usr/share/ansible/plugins/filter', '/usr/lib/python3.6/site-packages/ansible/plugins/filter', '/usr/lib/python3.6/site-packages/ansible/plugins/filter/__pycache__'], _plugin_path_cache={}, _searched_paths=set(), package_path='/usr/lib/python3.6/site-packages/ansible/plugins/filter') at remote 0x7fa72a6a6198>, args=(), kwargs={}, dedupe=False, path_only=False, class_only=False, all_matches=['/usr/share/softi-a...(truncated)
#100 Frame 0x7fa725b9be48, for file /usr/lib/python3.6/site-packages/ansible/plugins/loader.py, line 754, in <listcomp> (.0=<generator at remote 0x7fa725ba7048>, p=<FilterModule(_original_path='/usr/share/my-playbooks/ansible/filter_plugins/address.py', _load_name='address') at remote 0x7fa725c049b0>)

The issue seems to be fixed by a foolish pre-loading of the crypt module in the main thread (MT), in the /usr/lib/python3.6/site-packages/ansible/plugins/strategy/__init__.py file. This "patch", like the original https://github.com/ansible/ansible/pull/72412/files, looks like a shot in the dark to me and does not really address the original problem, which is baked deep into the way Ansible runs playbooks.
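For readers unfamiliar with the workaround, the idea of pre-loading can be sketched roughly as follows. This is not the actual patch: `preload` is a hypothetical helper, and the module list passed to it is an assumption. It only shows the principle that a module imported in the parent before forking is already present in the child, so the child does not need to load it later.

```python
import importlib
import sys


def preload(module_names):
    """Import each module in the current (parent) process.

    Hypothetical helper, not part of Ansible. Modules that cannot be
    imported are skipped, so the list may name optional modules.
    Returns the names that were imported successfully.
    """
    loaded = []
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError:
            # Module is not available on this interpreter; ignore it.
            continue
        loaded.append(name)
    return loaded


if __name__ == "__main__":
    # "crypt" is the module from this report; it may be missing on
    # newer Python versions, which is why import errors are tolerated.
    print(preload(["crypt", "json"]))
    print("json" in sys.modules)
```

As noted above, this only hides one manifestation: any module that is not in the list can still be loaded for the first time in a forked child.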

Detailed analysis:

  1. The Main Thread (MT) starts a Result Thread (RT) at the end of a playbook.
  2. RT unlinks some shared libraries and acquires a mutex to update the array of linked libraries. This normally happens during the thread cleanup process, and control may be returned to MT while the mutex is held.
  3. While the mutex is held, MT forks a process (FP) to execute the next playbook. The mutex's state is copied from MT into the new process, including the information that the lock is held by RT.
  4. The mutex is released by RT's cleanup handler and the thread is removed.
  5. FP loads a new file (in our case the Python crypt module required by Ansible's when clause, but it may be any module, loaded in many other places in the Ansible code or in plugins).
  6. FP tries to update the array of linked libraries but cannot, because it finds the mutex still held by RT. The mutex is never released in FP: it was copied from MT in the locked state, just before RT released it. Because the state in FP is a mere copy, it is never updated and the process hangs forever.
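The sequence above can be reproduced in miniature with plain Python. The sketch below is an illustration only: it uses a `threading.Lock` as a stand-in for the loader's internal mutex (an assumption, the real lock lives in C code), and the function name `fork_while_lock_held` is made up. The child uses a timeout so the demo terminates instead of hanging.

```python
import os
import threading


def fork_while_lock_held(timeout=1.0):
    """Fork while another thread holds a lock.

    Returns (parent_got_lock, child_got_lock). POSIX only.
    """
    lock = threading.Lock()        # stand-in for the loader mutex
    held = threading.Event()       # set once the worker owns the lock
    release = threading.Event()    # tells the worker to let go

    def worker():
        # Plays the role of RT: takes the lock and keeps it for a while.
        with lock:
            held.set()
            release.wait()

    thread = threading.Thread(target=worker)
    thread.start()
    held.wait()

    # Plays the role of MT forking FP while RT still holds the lock.
    pid = os.fork()
    if pid == 0:
        # Child: the lock was copied in the locked state and the worker
        # thread does not exist here, so nobody will ever release it.
        got = lock.acquire(timeout=timeout)
        os._exit(0 if got else 1)

    # Parent: the worker releases the lock, which works as expected here.
    release.set()
    thread.join()
    parent_got = lock.acquire(timeout=timeout)
    _, status = os.waitpid(pid, 0)
    child_got = os.WEXITSTATUS(status) == 0
    return parent_got, child_got


if __name__ == "__main__":
    print(fork_while_lock_held())
```

In the parent the lock becomes available as soon as the worker releases it, while the child's copy stays locked until the timeout expires. Without the timeout the child would block forever, which matches step 6.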

Issue Type

Bug Report

Component Name

core

Ansible Version

$ ansible --version
ansible 2.9.14
  config file = /etc/ansible/ansible.cfg
  configured module search path = ['/opt/app-root/src/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python3.6/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 3.6.8 (default, Dec  5 2019, 15:45:45) [GCC 8.3.1 20191121 (Red Hat 8.3.1-5)]

Configuration

Any

OS / Environment

Any (RHEL8 in my case)

Steps to Reproduce

Run many Ansible playbooks under heavy load. In our case the hang is tricky and difficult to reproduce, but as analysed in the bug description, it can happen at any time.

Expected Results

No hang :)

Actual Results

Playbook hangs indefinitely on a mutex.

Code of Conduct

  • I agree to follow the Ansible Code of Conduct
@ansibot
Contributor

ansibot commented Aug 5, 2021

Files identified in the description:
None

If these files are incorrect, please update the component name section of the description or use the !component bot command.


@ansibot ansibot added affects_2.9 This issue/PR affects Ansible v2.9 bug This issue/PR relates to a bug. needs_triage Needs a first human triage before being processed. support:core This issue/PR relates to code supported by the Ansible Engineering Team. labels Aug 5, 2021
@jborean93 jborean93 removed the needs_triage Needs a first human triage before being processed. label Aug 5, 2021
@sivel sivel changed the title from "Ansible hangs indefinitely due to critical design flaw" to "Ansible hangs indefinitely due forking and threads" Aug 5, 2021
@gsikorski
Author

@sivel Has anything been done about this bug other than changing my title? In our project we had to apply a dumb workaround with an early import to get rid of the manifestation, but we would like to see a real fix.

@gsikorski
Author

I suppose we are waiting for the bot to close the bug after a period of inactivity, so I am putting a new comment here to make sure that does not happen.

@gsikorski
Author

Any update on this issue?

@kbalos

kbalos commented Sep 14, 2021

This happens under heavy load. Is there any update on this topic?

@gsikorski
Author

Any news on this topic?

@msiczek

msiczek commented Sep 21, 2021

I also noticed Ansible hanging forever. It is not easily reproducible, and I could not figure out why it hung.

@s-hertel
Contributor

s-hertel commented Sep 21, 2021

The bot only auto-closes issues that carry a label requesting more info, and it takes 60 days to do so (leaving a reminder comment at day 30).

@ansibot ansibot added the needs_info This issue requires further information. Please answer any outstanding questions. label Sep 21, 2021
@s-hertel s-hertel removed the needs_info This issue requires further information. Please answer any outstanding questions. label Sep 21, 2021
@gsikorski
Author

Still hoping for some solution before the bot closes this issue. Anybody?

@mkrizek
Contributor

mkrizek commented Oct 4, 2021

@gsikorski As @s-hertel already mentioned above, the bot will auto-close the issue only if the needs_info label is applied. The bot would also warn before closing the issue.

Having said that, there is no such label applied on this issue at the moment and as such the bot will NOT auto close this issue with the current set of labels applied.

@gsikorski
Author

Thanks @mkrizek. @s-hertel has anything been done about this bug? Any plans?

@PulsatingQuasar

We may have hit this issue as well. We have moved from Python 2 with Ansible 2.10.4 to Python 3.6 and Ansible 2.11.1. We have a long-running playbook that patches Windows servers, and since the move it has begun hanging near the end of the play. The free strategy is selected for the play.

If we check with top, using the c and V options to get the full command line and the process tree, we can see that the hung ansible-playbook has a defunct child process and some sleeping child processes, while the main process never returns. It keeps running at 75% CPU usage but does nothing.

@gsikorski
Author

@mkrizek @s-hertel @sivel Any hope for new year's issue resolution?

@bcoca bcoca changed the title from "Ansible hangs indefinitely due forking and threads" to "Ansible hangs indefinitely due forking and threads and third party library" Jan 14, 2022
@bcoca
Member

bcoca commented Jan 14, 2022

@gsikorski as stated in #59642, this is not a problem in Ansible as shipped (we have had such problems and fixed them in the past); it is only an issue if you use a library that itself creates the undesired behavior. So we don't have any immediate plans to redo this, as it requires a complete redesign of the core Ansible engine. We might address this as part of other plans to revamp said engine, but that still won't happen anytime soon.

@gsikorski
Copy link
Author

@bcoca Of course it is a problem in Ansible. As explained, the problem is always present whenever a new library is loaded (using the standard import keyword) in any plugin. The bug is not a problem of a bad plugin but of the core of Ansible, as it forks the subprocess carelessly. I also do not like the way the issue is being deprioritised by changing the title with no explanation. Could you provide one?
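To make the point about careless forking concrete, here is a minimal sketch of the general at-fork pattern, under the assumption that the lock in question is reachable from Python. It is not Ansible code and it cannot reach the loader's internal mutex; `make_fork_safe` and `fork_and_try_lock` are hypothetical names. The pattern takes the lock just before `fork()` and releases it in both processes afterwards, so the child never inherits a lock owned by a thread that does not exist there.

```python
import os
import threading


def make_fork_safe(lock):
    """Register fork hooks so `lock` is never copied while another
    thread holds it (requires Python 3.7+ for os.register_at_fork).

    Note: the hooks stay registered for the life of the process.
    """
    os.register_at_fork(
        before=lock.acquire,           # wait until the lock is free
        after_in_parent=lock.release,  # parent continues unlocked
        after_in_child=lock.release,   # child starts unlocked
    )


def fork_and_try_lock(lock, timeout=1.0):
    """Fork and report whether the child could take the lock."""
    pid = os.fork()
    if pid == 0:
        got = lock.acquire(timeout=timeout)
        os._exit(0 if got else 1)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) == 0


if __name__ == "__main__":
    demo_lock = threading.Lock()
    make_fork_safe(demo_lock)
    print(fork_and_try_lock(demo_lock))
```

The cost is that `fork()` now waits for whoever holds the lock, and the approach only covers locks the program knows about.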

@gsikorski gsikorski changed the title from "Ansible hangs indefinitely due forking and threads and third party library" to "Ansible hangs indefinitely due to critical design flaw" Feb 3, 2022
@bcoca
Member

bcoca commented Feb 3, 2022

@gsikorski my change to the title (mostly an addition) was just to be more specific, so that when I or other maintainers review our lists we don't mistake this for a more generic issue. It is not a de-prioritization; titles are not how we prioritize, the P1-P3 labels are, and none are assigned by default. As to 'no explanation' ... there is my own comment above, made at the same time, so it really puzzles me that you state these things.

So I'm going to 'ADD' information to the title, again, so we can more easily recognize the issue. You can change it again as you just did; that won't change its priority.

@bcoca bcoca changed the title from "Ansible hangs indefinitely due to critical design flaw" to "Ansible hangs indefinitely due to critical design flaw when 3rd party library in 3rd party plugin uses threads" Feb 3, 2022
@jborean93 jborean93 added affects_2.14 and removed affects_2.9 This issue/PR affects Ansible v2.9 labels Jul 27, 2022
@jborean93
Contributor

At least with the crypt example, this will be going away, as its usage was deprecated in Python 3.11 and we are removing its use in Ansible in newer releases. It won't fix the underlying problem, but it removes a usage of a problematic plugin in Ansible.

9 participants