Ansible controller exponential memory usage when using handler listeners in collection #83392

Closed
1 task done
ShawnHardwick opened this issue Jun 6, 2024 · 5 comments · Fixed by #83400
Labels
affects_2.18 bug This issue/PR relates to a bug. has_pr This issue has an associated PR.

Comments

ShawnHardwick commented Jun 6, 2024

Summary

In Ansible 2.17 and 2.18, when handlers in an Ansible collection use the listen parameter, memory usage on the Ansible controller grows exponentially with the number of handlers in the playbook. Depending on the playbook tasks, the controller can consume all memory on the host until the operating system kills the process.

I believe the issue was introduced as part of this commit, which I describe in more detail in the reproduction steps:
#82854

Image of ansible-playbook consuming all memory right before the kernel sends a SIGKILL:
[screenshot: 20240603_171554_RemoteDesktopManager]

The only workaround I have at the moment is to use Ansible 2.16 or to remove all use of the listen parameter in handlers.

Issue Type

Bug Report

Component Name

core

Ansible Version

ansible [core 2.18.0.dev0]
  config file = /home/shawn.hardwick/.ansible.cfg
  configured module search path = ['/home/shawn.hardwick/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /home/shawn.hardwick/code/venv/ansible-latest/lib/python3.10/site-packages/ansible
  ansible collection location = /home/shawn.hardwick/code/roles:/home/shawn.hardwick/.ansible/roles:/usr/share/ansible/roles:/etc/ansible/roles:/home/shawn.hardwick/code:/home/shawn.hardwick/code/ansible_collections
  executable location = /home/shawn.hardwick/code/venv/ansible-latest/bin/ansible
  python version = 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (/home/shawn.hardwick/code/venv/ansible-latest/bin/python)
  jinja version = 3.1.3
  libyaml = True

Configuration

CONFIG_FILE() = /home/shawn.hardwick/.ansible.cfg

OS / Environment

Ansible controller:
PRETTY_NAME="Ubuntu 22.04.2 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.2 LTS (Jammy Jellyfish)"

Steps to Reproduce

Create a collection; let's name it foo.test.

In this collection, create the following files:
roles/test_role/tasks/main.yml

---
- name: Task 1
  ansible.builtin.debug:
    msg: Task 1 executed
  changed_when: true
  notify: handler_2_listen

- name: Task 2
  ansible.builtin.debug:
    msg: Task 2 executed
  changed_when: true
  notify: handler_2_listen

- name: Task 3
  ansible.builtin.debug:
    msg: Task 3 executed
  changed_when: true
  notify: Handler 3

roles/test_role/handlers/main.yml

---
- name: Handler 1
  ansible.builtin.debug:
    msg: Handler 1 executed

- name: Handler 2
  listen:
    - handler_2_listen
  ansible.builtin.debug:
    msg: Handler 2 executed

- name: Handler 3
  ansible.builtin.debug:
    msg: Handler 3 executed

Create a playbook; let's call it playbook.yml.
Notes:

  • The behavior is reproducible against both localhost and a remote host; we use localhost here to keep the reproduction simple.
  • This behavior is not unique to running the same role repeatedly; we do so here to avoid creating multiple unique roles. It is important to call the role multiple times, though, so that the memory usage balloons and is observable.

- name: Test Handlers
  hosts: localhost
  become: false
  gather_facts: false
  tasks:
    - name: Import a role
      ansible.builtin.import_role:
        name: foo.test.test_role
    - name: Import a role
      ansible.builtin.import_role:
        name: foo.test.test_role
    - name: Import a role
      ansible.builtin.import_role:
        name: foo.test.test_role
    - name: Import a role
      ansible.builtin.import_role:
        name: foo.test.test_role
    - name: Import a role
      ansible.builtin.import_role:
        name: foo.test.test_role
    - name: Import a role
      ansible.builtin.import_role:
        name: foo.test.test_role
    - name: Import a role
      ansible.builtin.import_role:
        name: foo.test.test_role
    - name: Import a role
      ansible.builtin.import_role:
        name: foo.test.test_role

Execute the playbook with the below command:
ansible-playbook ./playbook.yml

In your process manager of preference, observe the memory usage of the ansible-playbook process.
I use htop.
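
As an alternative to watching a process manager, the peak memory of the run can be recorded from a small wrapper script. This is only a sketch, assuming a Unix-like controller (the standard-library resource module is not available on Windows); the function name peak_child_rss and the timeout values are illustrative choices, not part of the original report.

```python
import resource
import subprocess
import sys


def peak_child_rss(cmd, timeout=60):
    """Run cmd and return the peak resident set size of waited-for children.

    The unit of ru_maxrss is platform dependent: kilobytes on Linux,
    bytes on macOS.
    """
    try:
        subprocess.run(cmd, timeout=timeout, check=False)
    except subprocess.TimeoutExpired:
        # subprocess.run() has already killed and reaped the child here,
        # so a runaway process does not keep growing after the timeout.
        pass
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss


# Harmless demonstration with a trivial child process:
print(peak_child_rss([sys.executable, "-c", "pass"]))

# For the reproduction, the call would look like this (assumed invocation):
# peak_child_rss(["ansible-playbook", "./playbook.yml"], timeout=120)
```

The timeout keeps the measurement from exhausting memory on the machine used for the reproduction.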

For debugging context, I added the display.display lines below (marked with +) to lib/ansible/plugins/strategy/__init__.py to show what happens to the list that causes the memory growth:

        for handler in handlers:
+           display.display(f"Parsing handler: {handler.name}")
            if listeners := handler.listen:
+               display.display(f"Listeners array length: {len(listeners)}")
                listeners = handler.get_validated_value(
                    'listen',
                    handler.fattributes.get('listen'),
                    listeners,
                    templar,
                )
+               display.display(f"Listeners array length after validated value: {len(listeners)}")
                if handler._role is not None:
                    for listener in listeners.copy():
+                       display.display(f"Parsing listener {listener} in listeners array length {len(listeners)}")
                        listeners.extend([
                            handler._role.get_name(include_role_fqcn=True) + ' : ' + listener,
                            handler._role.get_name(include_role_fqcn=False) + ' : ' + listener
                        ])
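
The effect of this loop can be reproduced outside Ansible. The following is a minimal sketch, assuming that the validated value is the same list object as handler.listen, so that extend mutates the list stored on the handler. The function name and the stored_listen variable are illustrative stand-ins, not Ansible API.

```python
def expand_listeners_in_place(listeners, role_fqcn, role_short):
    """Mimic the loop above: extend the very list that is being expanded."""
    for listener in listeners.copy():  # iterate a snapshot, mutate the original
        listeners.extend([
            f"{role_fqcn} : {listener}",
            f"{role_short} : {listener}",
        ])
    return listeners


# Stand-in for handler.listen; it survives between evaluations of the handler.
stored_listen = ["handler_2_listen"]

lengths = []
for _ in range(5):  # five evaluations of the same handler
    expand_listeners_in_place(stored_listen, "foo.test.test_role", "test_role")
    lengths.append(len(stored_listen))

print(lengths)  # [3, 9, 27, 81, 243]
```

Each evaluation adds two entries per existing entry, so the stored list triples every time the handler is evaluated.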

Expected Results

Memory usage should remain small.

Actual Results

Memory is consumed until either the listeners are fully resolved (which depends on the handler list) or the machine runs out of memory. On my machine, the process consumes half of my total CPU, and RAM usage increases by 2 GB every minute.

With the display.display statements from the reproduction steps, the output might look like this:

[truncated for brevity]
Parsing listener test_role : foo.test.test_role : foo.test.test_role : foo.test.test_role : foo.test.test_role : foo.test.test_role : test_role : test_role : handler_2_listen in listeners array length 2956403
Parsing listener foo.test.test_role : test_role : foo.test.test_role : foo.test.test_role : foo.test.test_role : foo.test.test_role : test_role : test_role : handler_2_listen in listeners array length 2956405
Parsing listener test_role : test_role : foo.test.test_role : foo.test.test_role : foo.test.test_role : foo.test.test_role : test_role : test_role : handler_2_listen in listeners array length 2956407
Parsing listener foo.test.test_role : foo.test.test_role : test_role : foo.test.test_role : foo.test.test_role : foo.test.test_role : test_role : test_role : handler_2_listen in listeners array length 2956409
[truncated for brevity]
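
The lengths in this output are consistent with a list that triples on every evaluation. As a rough worked check, assuming the list started with a single listener and that each completed pass multiplies its length by three, the helper below (a hypothetical name, written only for this calculation) estimates which pass produced the first length shown.

```python
def current_pass(observed_length, initial_length=1):
    """Return (pass number, list length at the start of that pass).

    Assumes every completed pass triples the list, so the pass in progress
    is the first one whose end length would exceed observed_length.
    """
    completed = 0
    length = initial_length
    while length * 3 <= observed_length:
        length *= 3
        completed += 1
    return completed + 1, length


pass_number, length_at_pass_start = current_pass(2956403)
# Each processed listener appends two entries to the list.
already_processed = (2956403 - length_at_pass_start) // 2

print(pass_number, length_at_pass_start, already_processed)
# 14 1594323 681040
```

Under these assumptions the log excerpt was taken during the 14th pass, which would end with 4,782,969 entries if it completed.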

Code of Conduct

  • I agree to follow the Ansible Code of Conduct
@ansibot ansibot added bug This issue/PR relates to a bug. needs_triage Needs a first human triage before being processed. affects_2.18 labels Jun 6, 2024
@ansibot
Contributor

ansibot commented Jun 6, 2024

Files identified in the description:

None

If these files are incorrect, please update the component name section of the description or use the component bot command.

@sivel
Member

sivel commented Jun 6, 2024

Looks like we just need to ensure we are operating on a copy of the handler.listen list before extending it for evaluation:

diff --git a/lib/ansible/plugins/strategy/__init__.py b/lib/ansible/plugins/strategy/__init__.py
index efd69efe9b..9cba974d07 100644
--- a/lib/ansible/plugins/strategy/__init__.py
+++ b/lib/ansible/plugins/strategy/__init__.py
@@ -558,7 +558,7 @@ class StrategyBase:
                     handler.fattributes.get('listen'),
                     listeners,
                     templar,
-                )
+                ).copy()
                 if handler._role is not None:
                     for listener in listeners.copy():
                         listeners.extend([
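
The idea behind this change can be shown in isolation. This is a minimal sketch rather than the Ansible code itself: stored_listen stands in for handler.listen, and the function name is illustrative.

```python
def expand_listeners_on_copy(stored_listen, role_fqcn, role_short):
    """Expand role-prefixed listener names without touching the stored list."""
    listeners = stored_listen.copy()  # work on a copy, as the diff above does
    for listener in listeners.copy():
        listeners.extend([
            f"{role_fqcn} : {listener}",
            f"{role_short} : {listener}",
        ])
    return listeners


stored_listen = ["handler_2_listen"]
results = [
    expand_listeners_on_copy(stored_listen, "foo.test.test_role", "test_role")
    for _ in range(5)
]

print(len(stored_listen), [len(r) for r in results])  # 1 [3, 3, 3, 3, 3]
```

Because the stored list is never mutated, every evaluation starts from the same single entry and the expanded list stays at three entries.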

@ShawnHardwick
Author

Seems like that suggested fix works.
playbook_output.txt

@ansibot ansibot added the has_pr This issue has an associated PR. label Jun 6, 2024
@mattclay mattclay removed the needs_triage Needs a first human triage before being processed. label Jun 6, 2024
@mkrizek
Contributor

mkrizek commented Jun 7, 2024

> Looks like we just need to ensure we are operating on a copy of the handler.listen list before extending it for evaluation:

I was wondering if we could/should take it a step further and do the work of validating and extending listen only once, at task validation, since listen is static (#83400)?
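
One way to picture the "expand once" idea is sketched below. The HandlerSketch class and its attributes are hypothetical and are not taken from #83400 or from Ansible's Handler class; the sketch only assumes that listen does not change after the handler is loaded.

```python
class HandlerSketch:
    """Hypothetical stand-in for a handler; not Ansible's Handler class."""

    def __init__(self, name, listen, role_fqcn=None, role_short=None):
        self.name = name
        # Expand the listener names once, when the handler is created.
        expanded = list(listen)
        if role_fqcn is not None and role_short is not None:
            for topic in listen:
                expanded.append(f"{role_fqcn} : {topic}")
                expanded.append(f"{role_short} : {topic}")
        # A tuple is immutable, so later lookups cannot grow it by accident.
        self.listen = tuple(expanded)

    def listens_to(self, notification):
        """Return True if this handler should react to the notification."""
        return notification in self.listen


handler = HandlerSketch(
    "Handler 2", ["handler_2_listen"], "foo.test.test_role", "test_role"
)
matches = [handler.listens_to("test_role : handler_2_listen") for _ in range(100)]

print(all(matches), len(handler.listen))  # True 3
```

With the expansion done up front, repeated notification lookups are read-only, so the number of lookups has no effect on memory use.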

@briantist
Contributor

Since this problem affects 2.17 as well as 2.18, it would be great if whichever PR is chosen is a candidate for backporting. I can't really tell whether @mkrizek's #83400 or @ShawnHardwick's #83393 is the better candidate for that; I'm mentioning it because it's important to us and is affecting all of our internal collection testing.

mkrizek added a commit to mkrizek/ansible that referenced this issue Jun 10, 2024
mkrizek added a commit to mkrizek/ansible that referenced this issue Jun 10, 2024
sivel pushed a commit that referenced this issue Jun 10, 2024
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this issue Jun 18, 2024
v2.17.1
=======

Minor Changes
-------------
- ansible-test - Update ``pypi-test-container`` to version 3.1.0.

Bugfixes
--------
- Fix rapid memory usage growth when notifying handlers using the ``listen`` keyword (ansible/ansible#83392)
- Fix the task attribute ``resolved_action`` to show the FQCN instead of ``None`` when ``action`` or ``local_action`` is used in the playbook.
- Fix using ``module_defaults`` with ``local_action``/``action`` (ansible/ansible#81905).
- fixed unit test test_borken_cowsay to address mock not been properly applied when existing unix system already have cowsay installed.
- powershell - Implement more robust deletion mechanism for C# code compilation temporary files. This should avoid scenarios where the underlying temporary directory may be temporarily locked by antivirus tools or other IO problems. A failure to delete one of these temporary directories will result in a warning rather than an outright failure.
- shell plugin - properly quote all needed components of shell commands (ansible/ansible#82535)
@ansible ansible locked and limited conversation to collaborators Jun 24, 2024