failure in run_once task while using strategy free leads to all other hosts finishing immediately #80737

Open
1 task done
GlysVenture opened this issue May 8, 2023 · 8 comments
Labels
affects_2.14 bug This issue/PR relates to a bug. P3 Priority 3 - Approved, No Time Limitation

Comments

@GlysVenture

Summary

When I use the run_once keyword with strategy: free and the task carrying the keyword fails, the keyword does not simply have no effect on the play. Instead, all other running hosts finish their current task and the play terminates.

This is especially problematic when including roles written by others: run_once may be set somewhere you are not aware of, and it can break your plays if you are using the free strategy.

Issue Type

Bug Report

Component Name

run_once

Ansible Version

$ ansible --version
ansible [core 2.14.5]
  config file = /etc/ansible/ansible.cfg
  configured module search path = ['/home/user/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python3.11/site-packages/ansible
  ansible collection location = /home/user/.ansible/collections:/usr/share/ansible/collections
  executable location = /home/user/.local/bin/ansible
  python version = 3.11.3 (main, Apr  5 2023, 15:52:25) [GCC 12.2.1 20230201] (/usr/bin/python)
  jinja version = 3.1.2
  libyaml = True

Configuration

# if using a version older than ansible-core 2.12 you should omit the '-t all'
$ ansible-config dump --only-changed -t all
CONFIG_FILE() = /etc/ansible/ansible.cfg

OS / Environment

Archcraft x86_64 - kernel 6.3.1-arch1-1 (Arch)

But I also experienced the same issue using
Debian 5.10.103-1

Steps to Reproduce

playbook.yml

- name: Test
  hosts: all
  gather_facts: false
  strategy: free
  tasks:
    - name: fail if success_run_once is false
      fail:
      run_once: true
      when: success_run_once == false

    - name: wait some time
      ansible.builtin.wait_for:
        timeout: 10
      delegate_to: localhost

    - name: simulate say hi
      debug:
        msg: "hi from {{ inventory_hostname }}"

    - name: simulate do fail
      fail:
      when: inventory_hostname == 'host1'

    - name: say hi
      debug:
        msg: "{{ inventory_hostname }} just finished its tasks"

inventory.yml

all:
  hosts:
    host1:
      ansible_hostname: 127.0.0.1
      success_run_once: true
    host2:
      ansible_hostname: 127.0.0.1
      success_run_once: true
    host3:
      ansible_hostname: 127.0.0.1
      success_run_once: false

Run ansible-playbook:
ansible-playbook -i inventory.yml playbook.yml

Expected Results

I expected run_once to not have any effect on the play and tasks, as if it wasn't there.

Sample expected output obtained by removing the run_once keyword:

PLAY [Test] *************************************************************************************************************************************************************************************************

TASK [fail if success_run_once is false] ********************************************************************************************************************************************************************
skipping: [host1]
skipping: [host2]
fatal: [host3]: FAILED! => {"changed": false, "msg": "Failed as requested from task"}

TASK [wait some time] ***************************************************************************************************************************************************************************************
ok: [host1 -> localhost]
ok: [host2 -> localhost]

TASK [simulate say hi] **************************************************************************************************************************************************************************************
ok: [host1] => {
    "msg": "hi from host1"
}
ok: [host2] => {
    "msg": "hi from host2"
}

TASK [simulate do fail] *************************************************************************************************************************************************************************************
fatal: [host1]: FAILED! => {"changed": false, "msg": "Failed as requested from task"}
skipping: [host2]

TASK [say hi] ***********************************************************************************************************************************************************************************************
ok: [host2] => {
    "msg": "host2 just finished its tasks"
}

PLAY RECAP **************************************************************************************************************************************************************************************************
host1                      : ok=2    changed=0    unreachable=0    failed=1    skipped=1    rescued=0    ignored=0   
host2                      : ok=3    changed=0    unreachable=0    failed=0    skipped=2    rescued=0    ignored=0   
host3                      : ok=0    changed=0    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0

Actual Results

After one of the hosts fails in the run_once task, all the other hosts finish their current task and the play ends by reporting stats, with no indication that something went wrong for the other hosts (apart from the lower total of tasks done).


PLAY [Test] *************************************************************************************************************************************************************************************************
[WARNING]: Using run_once with the free strategy is not currently supported. This task will still be executed for every host in the inventory list.

TASK [fail if success_run_once is false] ********************************************************************************************************************************************************************
skipping: [host1]
skipping: [host2]
fatal: [host3]: FAILED! => {"changed": false, "msg": "Failed as requested from task"}

TASK [wait some time] ***************************************************************************************************************************************************************************************
ok: [host2 -> localhost]
ok: [host1 -> localhost]

PLAY RECAP **************************************************************************************************************************************************************************************************
host1                      : ok=1    changed=0    unreachable=0    failed=0    skipped=1    rescued=0    ignored=0   
host2                      : ok=1    changed=0    unreachable=0    failed=0    skipped=1    rescued=0    ignored=0   
host3                      : ok=0    changed=0    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0

Code of Conduct

  • I agree to follow the Ansible Code of Conduct
@ansibot
Contributor

ansibot commented May 8, 2023

Files identified in the description:

If these files are incorrect, please update the component name section of the description or use the !component bot command.


@ansibot ansibot added affects_2.14 bug This issue/PR relates to a bug. needs_triage Needs a first human triage before being processed. labels May 8, 2023
@bcoca
Member

bcoca commented May 8, 2023

run_once implies 'apply result to all hosts', which includes the task failure.

@GlysVenture
Author

OK, I can understand the logic, and it is written in the documentation. But the warning led me to believe that run_once was not implemented for strategy free and that the keyword was being ignored.
Seems wording is important, my bad.
I do think it should be ignored if it's not "supported", to prevent this kind of silent "failure".

@GlysVenture GlysVenture closed this as not planned May 8, 2023
@bcoca bcoca reopened this May 8, 2023
@bcoca
Member

bcoca commented May 8, 2023

Leaving this open to verify the behavior and to reconsider whether 'partially working' is a bug. The part of run_once that is not supported in the strategy is limiting the run to only one host, so it should either create a sync point or not apply the result of the first host to the rest; applying it can give weird results even on success.

@mkrizek
Contributor

mkrizek commented May 9, 2023

Whenever we check for run_once in the base strategy, we need to guard the check so that it only applies to strategies that actually support it. This fixes the issue:

diff --git a/lib/ansible/plugins/strategy/__init__.py b/lib/ansible/plugins/strategy/__init__.py
index edab7aed0b4..8815769867d 100644
--- a/lib/ansible/plugins/strategy/__init__.py
+++ b/lib/ansible/plugins/strategy/__init__.py
@@ -590,7 +590,7 @@ class StrategyBase:
                     # save the current state before failing it for later inspection
                     state_when_failed = iterator.get_state_for_host(original_host.name)
                     display.debug("marking %s as failed" % original_host.name)
-                    if original_task.run_once:
+                    if original_task.run_once and iterator._play.strategy in add_internal_fqcns(('linear',)):
                         # if we're using run_once, we have to fail every host here
                         for h in self._inventory.get_hosts(iterator._play.hosts):
                             if h.name not in self._tqm._unreachable_hosts:

We already do the same in the task debugger:

if task.run_once and iterator._play.strategy in add_internal_fqcns(('linear',)) and result.is_failed():

Although more of a "system fix" would be better; related: #73483.

@bcoca
Member

bcoca commented May 9, 2023

@mkrizek something like adding a '_supports_run_once = False' property to the base class and making it true for linear?
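
As a rough illustration of that suggestion, here is a minimal, self-contained sketch. It is not Ansible's actual code: the class names ToyStrategyBase, ToyLinear and ToyFree and the method hosts_to_fail are hypothetical, and only the _supports_run_once attribute name comes from the comment above. The base class defaults the flag to False, the linear-like strategy opts in, and the failure fan-out checks the flag instead of comparing strategy names.

```python
class ToyStrategyBase:
    # Hypothetical capability flag: strategies that cannot honour
    # run_once leave this at False.
    _supports_run_once = False

    def hosts_to_fail(self, failed_host, play_hosts, run_once):
        """Return the hosts to mark as failed after one task failure."""
        if run_once and self._supports_run_once:
            # run_once is honoured: the result applies to every host.
            return list(play_hosts)
        # run_once is unsupported or unset: only the failing host is affected.
        return [failed_host]


class ToyLinear(ToyStrategyBase):
    # A linear-like strategy opts in to run_once semantics.
    _supports_run_once = True


class ToyFree(ToyStrategyBase):
    # A free-like strategy keeps the default and ignores run_once on failure.
    pass


hosts = ["host1", "host2", "host3"]
print(ToyLinear().hosts_to_fail("host3", hosts, run_once=True))
print(ToyFree().hosts_to_fail("host3", hosts, run_once=True))
```

With a flag like this, third-party strategy plugins would presumably get the safe default without the base class having to know their names.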

@s-hertel s-hertel removed the needs_triage Needs a first human triage before being processed. label May 9, 2023
@jborean93
Contributor

I think we need to decide on what the desired behaviour is for this particular scenario. If this is the desired behaviour today it should be at least documented.

@jborean93 jborean93 added the P3 Priority 3 - Approved, No Time Limitation label Jan 10, 2024
@bcoca
Member

bcoca commented Jan 10, 2024

i see a few options for run_once:

  • it should be 'totally' ignored when 'unsupported', with a warning or error emitted
  • it becomes a sync point and functions 'normally'
  • mostly as is, allow to set facts for all hosts, but not status
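
To make the difference between these three options concrete, here is a toy model, not Ansible code. The names RunOncePolicy and apply_run_once_result are hypothetical, and the model only tracks how the result of the one host that ran the task spreads to the others. It assumes "functions normally" means both facts and status are shared.

```python
from enum import Enum


class RunOncePolicy(Enum):
    IGNORE = "ignore"          # option 1: result stays on the host that ran
    SYNC_POINT = "sync_point"  # option 2: facts and status go to every host
    FACTS_ONLY = "facts_only"  # option 3: facts go to every host, status does not


def apply_run_once_result(policy, runner, hosts, facts, failed):
    """Return {host: {"facts": dict, "failed": bool}} after one run_once result."""
    share_facts = policy in (RunOncePolicy.SYNC_POINT, RunOncePolicy.FACTS_ONLY)
    share_status = policy is RunOncePolicy.SYNC_POINT
    state = {}
    for host in hosts:
        is_runner = host == runner
        state[host] = {
            # Facts reach the runner always, and other hosts only if shared.
            "facts": dict(facts) if (is_runner or share_facts) else {},
            # Failure status reaches other hosts only under the sync-point option.
            "failed": failed if (is_runner or share_status) else False,
        }
    return state


hosts = ["host1", "host2", "host3"]
for policy in RunOncePolicy:
    print(policy.name, apply_run_once_result(policy, "host3", hosts, {"x": 1}, True))
```

In this model, the behavior reported in this issue is closest to SYNC_POINT for the failure status, but without the actual synchronization that the second option would add.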
