Traceback in random location near SSH timeout with many target hosts #77325

bluikko · 2022-03-21T09:58:57Z

Summary

A traceback appears at a seemingly random point in the run, often immediately or shortly after some hosts become unreachable due to SSH timeout. It can be reproduced consistently with about 20 managed machines, using a site.yml that works reliably with a small number of hosts.

Checking available RAM at 1-second resolution shows a minimum of 600 MB available. So unless ansible-playbook consumes all of that in less than 1 second, there should be enough RAM. There are no log messages anywhere about oom-killer or other memory issues.
In some runs the crash happens after just a few tasks: only fact gathering plus simple checks with fail, assert and debug msg tasks were executed before the crash. That led me to the test case below, which is a simplified version of the beginning of site.yml.

The test case below causes the crash sometimes, but not always. Minimum free RAM during this run was about 1 GB.
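
For anyone wanting to rule out memory pressure in the same way, 1-second sampling of available memory could look roughly like the sketch below. It assumes Linux with a MemAvailable field in /proc/meminfo; the function names are made up for illustration, and this is not necessarily the exact method used for the figures in this report.

```python
import time


def parse_mem_available_kb(meminfo_text):
    """Extract the MemAvailable value (in kB) from /proc/meminfo contents."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            # Line looks like: "MemAvailable:   614400 kB"
            return int(line.split()[1])
    raise ValueError("MemAvailable not found")


def sample_min_available_kb(samples, interval=1.0, path="/proc/meminfo"):
    """Read MemAvailable `samples` times, `interval` seconds apart.

    Returns the smallest value seen, in kB. Hypothetical helper, assuming
    `path` points at a Linux-style meminfo file.
    """
    minimum = None
    for _ in range(samples):
        with open(path) as handle:
            value = parse_mem_available_kb(handle.read())
        minimum = value if minimum is None else min(minimum, value)
        time.sleep(interval)
    return minimum
```

Running sample_min_available_kb(600) in a second terminal while ansible-playbook runs would cover a 10-minute window. A spike shorter than the sampling interval would still be missed.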

It seems important that some hosts are unreachable over SSH (in the test case: 3 timeouts, 1 invalid host key).

Issue Type

Bug Report

Component Name

ansible

Ansible Version

$ ansible --version
ansible [core 2.11.9]
  config file = /etc/ansible/ansible.cfg
  configured module search path = ['/x/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /y/lib64/python3.6/site-packages/ansible
  ansible collection location = /x/.ansible/collections:/usr/share/ansible/collections
  executable location = /y/bin/ansible
  python version = 3.6.8 (default, Nov 16 2020, 16:55:22) [GCC 4.8.5 20150623 (Red Hat 4.8.5-44)]
  jinja version = 3.0.2
  libyaml = True

Configuration

$ ansible-config dump --only-changed
CACHE_PLUGIN(/etc/ansible/ansible.cfg) = community.general.yaml
CACHE_PLUGIN_CONNECTION(/etc/ansible/ansible.cfg) = /tmp/.ansible-fact.$USER
CACHE_PLUGIN_TIMEOUT(/etc/ansible/ansible.cfg) = 5184000
CALLBACKS_ENABLED(/etc/ansible/ansible.cfg) = ['ansible.posix.timer']
CONTROLLER_PYTHON_WARNING(/etc/ansible/ansible.cfg) = False
DEFAULT_FORKS(/etc/ansible/ansible.cfg) = 10
DEFAULT_LOCAL_TMP(/etc/ansible/ansible.cfg) = /tmp/.ansible.x/ansible-local-6785jea8d1zb
DEFAULT_LOG_PATH(/etc/ansible/ansible.cfg) = /var/log/ansible.log
DEFAULT_STDOUT_CALLBACK(/etc/ansible/ansible.cfg) = community.general.yaml
DEFAULT_STRATEGY(/etc/ansible/ansible.cfg) = ansible.builtin.free
DEFAULT_TIMEOUT(/etc/ansible/ansible.cfg) = 50
DEFAULT_VAULT_IDENTITY_LIST(env: ANSIBLE_VAULT_IDENTITY_LIST) = ['x@v1', 'y@v2']
DEFAULT_VAULT_ID_MATCH(/etc/ansible/ansible.cfg) = True
INJECT_FACTS_AS_VARS(/etc/ansible/ansible.cfg) = False
RETRY_FILES_SAVE_PATH(/etc/ansible/ansible.cfg) = /tmp/.ansible-retry.x

OS / Environment

EL7

strategy=free

fact caching enabled

netbox inventory plugin

Steps to Reproduce

Include the extra variable xxx with a value > 100, for example: ansible-playbook -e 'xxx=900' [...]

---
- hosts: all
  gather_facts: true
  tags: always
  vars:
    ansible_hostlist:
      - ansible.example.com
  tasks:
    - name: Confirm variable is defined
      ansible.builtin.fail:
        msg: Variable xxx is not defined while required
      when: (xxx is not defined) and (required | default(true) | bool)
    - name: Check variable xxx
      ansible.builtin.fail:
        msg: The xxx variable is not proper
      when: xxx | int < 100

    - name: Test other variables
      ansible.builtin.assert:
        that: ansible_facts.eth0 is defined

    - name: Status report
      ansible.builtin.debug:
        msg: "Running for {{ inventory_hostname }}/{{ ansible_host }}"

    - name: Ensure target is right host
      ansible.builtin.assert:
        that: inventory_hostname == ansible_facts.fqdn
        fail_msg: hostname not match

    - name: Setup for controller machines
      ansible.builtin.setup:
        gather_subset: interfaces
      become: false
      delegate_facts: true
      delegate_to: "{{ item }}"
      ignore_errors: true
      ignore_unreachable: true
      loop: "{{ ansible_hostlist }}"
      throttle: 1
      when: hostvars[item].ansible_facts.gather_subset | default([]) | intersect(['all', 'interfaces', 'network']) | length == 0

    - name: Check controller machine addresses exist
      ansible.builtin.assert:
        that:
          - (item | select('defined') | length) == (item | length)
          - (item | length) == (ansible_hostlist | length)
        fail_msg: Ansible hosts IP address facts not complete
      loop:
        - "{{ ansible_hostlist | map('extract', hostvars, ['ansible_facts', 'eth0', 'ipv4', 'address']) }}"
        - "{{ ansible_hostlist | map('extract', hostvars, ['ansible_facts', 'eth0', 'ipv6', 0, 'address']) }}"

Expected Results

The playbook should run the same with 5 or 20 managed hosts, regardless of whether some of the hosts could not be reached due to SSH timeout.

Actual Results

ERROR! Unexpected Exception, this is probably a bug: list index out of range
the full traceback was:

Traceback (most recent call last):
  File "/x/bin/ansible-playbook", line 135, in <module>
    exit_code = cli.run()
  File "/x/lib64/python3.6/site-packages/ansible/cli/playbook.py", line 137, in run
    results = pbex.run()
  File "/x/lib64/python3.6/site-packages/ansible/executor/playbook_executor.py", line 189, in run
    result = self._tqm.run(play=play)
  File "/x/lib64/python3.6/site-packages/ansible/executor/task_queue_manager.py", line 315, in run
    play_return = strategy.run(iterator, play_context)
  File "/x/lib64/python3.6/site-packages/ansible/plugins/strategy/free.py", line 114, in run
    host = hosts_left[last_host]
IndexError: list index out of range
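
The failing statement indexes hosts_left with a saved position, last_host. One plausible way for that to go out of range is that the position was saved while more hosts were reachable and the list has since shrunk. The sketch below only illustrates that general failure mode. It is not the actual code in ansible/plugins/strategy/free.py, the helper names are hypothetical, and whether this is the real root cause has not been verified.

```python
# Illustrative sketch only: NOT the code from ansible/plugins/strategy/free.py.
# pick_next_host, pick_next_host_guarded and simulate are hypothetical names.


def pick_next_host(hosts_left, last_host):
    """Unguarded lookup of the host at a previously saved position."""
    return hosts_left[last_host]  # raises IndexError if the list shrank


def pick_next_host_guarded(hosts_left, last_host):
    """Same lookup, but the saved position is wrapped into the current list."""
    if not hosts_left:
        return None
    return hosts_left[last_host % len(hosts_left)]


def simulate():
    hosts = ["host%02d" % i for i in range(20)]
    last_host = 19  # position saved while all 20 hosts were still reachable

    # Four hosts drop out, as in the report (3 SSH timeouts, 1 bad host key).
    unreachable = {"host03", "host07", "host11", "host19"}
    hosts_left = [h for h in hosts if h not in unreachable]

    try:
        pick_next_host(hosts_left, last_host)
        unguarded = "ok"
    except IndexError as exc:
        unguarded = "IndexError: %s" % exc

    return unguarded, pick_next_host_guarded(hosts_left, last_host)


if __name__ == "__main__":
    print(simulate())
```

With 16 hosts left and a saved position of 19, the unguarded lookup raises the same "list index out of range" error as in the traceback, while the guarded variant wraps around to a valid host.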

Code of Conduct

I agree to follow the Ansible Code of Conduct


ansibot · 2022-03-21T10:01:42Z

Files identified in the description:
None

If these files are incorrect, please update the component name section of the description or use the !component bot command.


jacindagreen · 2022-04-17T15:49:47Z

Hello,
I was going to try to look into this issue more but I am a bit new to the repository. I was wondering if anyone has some more in-depth advice on recreating this error on my local machine. Thanks!

ansibot added labels affects_2.11, bug, needs_triage, support:core, traceback · Mar 21, 2022

mkrizek added labels needs_verified, P3 and removed label needs_triage · Mar 22, 2022
