Traceback in random location near SSH timeout with many target hosts #77325

Open
1 task done
bluikko opened this issue Mar 21, 2022 · 2 comments
Labels
affects_2.11
bug (This issue/PR relates to a bug.)
needs_verified (This issue needs to be verified/reproduced by maintainer)
P3 (Priority 3 - Approved, No Time Limitation)
support:core (This issue/PR relates to code supported by the Ansible Engineering Team.)
traceback (This issue/PR includes a traceback.)

Comments

@bluikko
Contributor

bluikko commented Mar 21, 2022

Summary

A traceback appears at a seemingly random point in the run, but often immediately or shortly after some hosts became unreachable due to an SSH timeout. It can be reproduced consistently with about 20 managed machines, using a site.yml that works reliably with a small number of hosts.

Checking available RAM at 1-second resolution shows a minimum of 600 MB available. So unless ansible-playbook uses all of that in less than 1 second, there should be enough RAM. There are no log messages anywhere about the oom-killer or other memory issues.
In some runs the crash happens after just a few tasks: only fact gathering, simple checks with fail and assert, and debug msg tasks were executed before the crash. That led me to the test case below, which is a simplified version of the beginning of site.yml.
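
For anyone wanting to repeat the memory check, here is a minimal sketch of sampling available RAM once per second. It is an assumption about how such a check could be done, not the method used for the numbers above: it reads the MemAvailable field from /proc/meminfo, and the function names are made up for illustration.

```python
import time


def parse_mem_available_kib(meminfo_text):
    """Return the MemAvailable value (KiB) from /proc/meminfo-style text, or None."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            # Line format: "MemAvailable:   614400 kB"
            return int(line.split()[1])
    return None


def sample_min_available(samples, interval=1.0, path="/proc/meminfo"):
    """Poll MemAvailable `samples` times and return the lowest value seen (KiB)."""
    lowest = None
    for _ in range(samples):
        with open(path) as f:
            value = parse_mem_available_kib(f.read())
        if value is not None and (lowest is None or value < lowest):
            lowest = value
        time.sleep(interval)
    return lowest
```

Running `sample_min_available(samples=600)` in a second terminal while the playbook executes would report the lowest value seen over roughly ten minutes. A spike shorter than the sampling interval can still be missed.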

The test case below causes the crash sometimes but not always. Minimum free RAM during this run is about 1 GB.

It seems important that some hosts are unreachable over SSH (in the test case: 3 timeouts, 1 invalid host key).

Issue Type

Bug Report

Component Name

ansible

Ansible Version

$ ansible --version
ansible [core 2.11.9]
  config file = /etc/ansible/ansible.cfg
  configured module search path = ['/x/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /y/lib64/python3.6/site-packages/ansible
  ansible collection location = /x/.ansible/collections:/usr/share/ansible/collections
  executable location = /y/bin/ansible
  python version = 3.6.8 (default, Nov 16 2020, 16:55:22) [GCC 4.8.5 20150623 (Red Hat 4.8.5-44)]
  jinja version = 3.0.2
  libyaml = True

Configuration

$ ansible-config dump --only-changed
CACHE_PLUGIN(/etc/ansible/ansible.cfg) = community.general.yaml
CACHE_PLUGIN_CONNECTION(/etc/ansible/ansible.cfg) = /tmp/.ansible-fact.$USER
CACHE_PLUGIN_TIMEOUT(/etc/ansible/ansible.cfg) = 5184000
CALLBACKS_ENABLED(/etc/ansible/ansible.cfg) = ['ansible.posix.timer']
CONTROLLER_PYTHON_WARNING(/etc/ansible/ansible.cfg) = False
DEFAULT_FORKS(/etc/ansible/ansible.cfg) = 10
DEFAULT_LOCAL_TMP(/etc/ansible/ansible.cfg) = /tmp/.ansible.x/ansible-local-6785jea8d1zb
DEFAULT_LOG_PATH(/etc/ansible/ansible.cfg) = /var/log/ansible.log
DEFAULT_STDOUT_CALLBACK(/etc/ansible/ansible.cfg) = community.general.yaml
DEFAULT_STRATEGY(/etc/ansible/ansible.cfg) = ansible.builtin.free
DEFAULT_TIMEOUT(/etc/ansible/ansible.cfg) = 50
DEFAULT_VAULT_IDENTITY_LIST(env: ANSIBLE_VAULT_IDENTITY_LIST) = ['x@v1', 'y@v2']
DEFAULT_VAULT_ID_MATCH(/etc/ansible/ansible.cfg) = True
INJECT_FACTS_AS_VARS(/etc/ansible/ansible.cfg) = False
RETRY_FILES_SAVE_PATH(/etc/ansible/ansible.cfg) = /tmp/.ansible-retry.x

OS / Environment

EL7

strategy=free

fact caching enabled

netbox inventory plugin

Steps to Reproduce

Include the extra variable xxx with a value > 100, for example: ansible-playbook -e 'xxx=900' [...]

---
- hosts: all
  gather_facts: true
  tags: always
  vars:
    ansible_hostlist:
      - ansible.example.com
  tasks:
    - name: Confirm variable is defined
      ansible.builtin.fail:
        msg: Variable xxx is not defined while required
      when: (xxx is not defined) and (required | default(true) | bool)
    - name: Check variable xxx
      ansible.builtin.fail:
        msg: The xxx variable is not proper
      when: xxx | int < 100

    - name: Test other variables
      ansible.builtin.assert:
        that: ansible_facts.eth0 is defined

    - name: Status report
      ansible.builtin.debug:
        msg: "Running for {{ inventory_hostname }}/{{ ansible_host }}"

    - name: Ensure target is right host
      ansible.builtin.assert:
        that: inventory_hostname == ansible_facts.fqdn
        fail_msg: hostname not match

    - name: Setup for controller machines
      ansible.builtin.setup:
        gather_subset: interfaces
      become: false
      delegate_facts: true
      delegate_to: "{{ item }}"
      ignore_errors: true
      ignore_unreachable: true
      loop: "{{ ansible_hostlist }}"
      throttle: 1
      when: hostvars[item].ansible_facts.gather_subset | default([]) | intersect(['all', 'interfaces', 'network']) | length == 0

    - name: Check controller machine addresses exist
      ansible.builtin.assert:
        that:
          - (item | select('defined') | length) == (item | length)
          - (item | length) == (ansible_hostlist | length)
        fail_msg: Ansible hosts IP address facts not complete
      loop:
        - "{{ ansible_hostlist | map('extract', hostvars, ['ansible_facts', 'eth0', 'ipv4', 'address']) }}"
        - "{{ ansible_hostlist | map('extract', hostvars, ['ansible_facts', 'eth0', 'ipv6', 0, 'address']) }}"
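
Since the playbook above only crashes on some runs, a small wrapper that repeats the run until the error shows up may help when trying to reproduce it. This is a rough sketch, not part of the original report: the helper names are hypothetical, and the caller supplies the command list (playbook filename, inventory options and so on).

```python
import subprocess

# Text that ansible-playbook prints when it hits an unhandled exception,
# taken from the error output in this report.
CRASH_MARKER = "Unexpected Exception"


def looks_like_crash(output):
    """Return True if the captured output contains the crash marker."""
    return CRASH_MARKER in output


def run_until_crash(cmd, max_runs=20):
    """Run `cmd` up to `max_runs` times; return (attempt, output) for the first crash.

    Returns (None, "") if no run produced the crash marker.
    """
    for attempt in range(1, max_runs + 1):
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,   # merge stderr so the marker is not missed
            universal_newlines=True,    # text mode; also works on Python 3.6
        )
        if looks_like_crash(proc.stdout):
            return attempt, proc.stdout
    return None, ""
```

For example, `run_until_crash(["ansible-playbook", "-e", "xxx=900", "repro.yml"])`, where repro.yml is a placeholder name for the playbook above. The inventory still needs to contain around 20 hosts with a few of them unreachable over SSH.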

Expected Results

Expect the playbook to run the same with 5 or 20 managed hosts, regardless of whether some of the hosts could not be reached due to SSH timeout.

Actual Results

ERROR! Unexpected Exception, this is probably a bug: list index out of range
the full traceback was:

Traceback (most recent call last):
  File "/x/bin/ansible-playbook", line 135, in <module>
    exit_code = cli.run()
  File "/x/lib64/python3.6/site-packages/ansible/cli/playbook.py", line 137, in run
    results = pbex.run()
  File "/x/lib64/python3.6/site-packages/ansible/executor/playbook_executor.py", line 189, in run
    result = self._tqm.run(play=play)
  File "/x/lib64/python3.6/site-packages/ansible/executor/task_queue_manager.py", line 315, in run
    play_return = strategy.run(iterator, play_context)
  File "/x/lib64/python3.6/site-packages/ansible/plugins/strategy/free.py", line 114, in run
    host = hosts_left[last_host]
IndexError: list index out of range
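
The traceback ends at an index lookup into hosts_left. One way this kind of IndexError can arise is a position saved from an earlier pass being reused after the list has shrunk, for instance because hosts were dropped as unreachable. That is an assumption, not a confirmed diagnosis of free.py. The sketch below is a simplified model of that situation, not the actual strategy code; the function names and host names are hypothetical.

```python
def next_host(hosts_left, last_host):
    """Unguarded lookup: raises IndexError if hosts_left has shrunk past last_host."""
    return hosts_left[last_host]


def next_host_guarded(hosts_left, last_host):
    """Wrap the saved index back into range before using it."""
    if not hosts_left:
        return None
    return hosts_left[last_host % len(hosts_left)]


all_hosts = ["host%d" % i for i in range(5)]
last_host = 4  # position saved while all five hosts were still in the list

# Assume four hosts drop out between passes (e.g. timeouts or a bad host key).
unreachable = {"host1", "host2", "host3", "host4"}
hosts_left = [h for h in all_hosts if h not in unreachable]

try:
    next_host(hosts_left, last_host)
    outcome = "ok"
except IndexError as exc:
    outcome = "IndexError: %s" % exc

print(outcome)                                   # the stale index is out of range
print(next_host_guarded(hosts_left, last_host))  # the guarded lookup still returns a host
```

If something like this is what happens, it would also fit the crash depending on how many hosts become unreachable and when, but that needs to be verified against the real code.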

Code of Conduct

  • I agree to follow the Ansible Code of Conduct
@ansibot
Contributor

ansibot commented Mar 21, 2022

Files identified in the description:
None

If these files are incorrect, please update the component name section of the description or use the !component bot command.


ansibot added the affects_2.11, bug, needs_triage, support:core and traceback labels on Mar 21, 2022
mkrizek added the needs_verified and P3 labels and removed the needs_triage label on Mar 22, 2022
@jacindagreen

jacindagreen commented Apr 17, 2022

Hello,
I was going to try to look into this issue more but I am a bit new to the repository. I was wondering if anyone has some more in-depth advice on recreating this error on my local machine. Thanks!

4 participants