max_fail_percentage incorrectly aborts playbook execution #32255

Closed
mtnbikenc opened this issue Oct 27, 2017 · 3 comments · Fixed by #32362
Labels
affects_2.3, affects_2.4, affects_2.5, bug, c:executor/playbook_executor, c:plugins/strategy, support:core

Comments

mtnbikenc (Contributor) commented Oct 27, 2017

ISSUE TYPE
  • Bug Report
COMPONENT NAME

max_fail_percentage

ANSIBLE VERSION
ansible 2.5.0 (devel 7553c42e09) last updated 2017/10/27 08:35:00 (GMT -400)
  config file = /home/rteague/git/clusters/aws-c1/ansible.cfg
  configured module search path = [u'/home/rteague/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /home/rteague/git/clusters/aws-c1/ansible/lib/ansible
  executable location = /home/rteague/git/clusters/aws-c1/ansible/bin/ansible
  python version = 2.7.13 (default, Sep  5 2017, 08:53:59) [GCC 7.1.1 20170622 (Red Hat 7.1.1-3)]

NOTE: This issue exists as far back as v2.1.0.0-1.
I tested against older versions and found:
v2.0.2.0-1 Good
v2.1.0.0-1 Bad

CONFIGURATION

N/A

OS / ENVIRONMENT

N/A

SUMMARY

max_fail_percentage incorrectly aborts playbook execution after a single host failure in a batch

STEPS TO REPRODUCE
# test-fail.yml
- hosts: nodes
  gather_facts: no
  serial: "{{ nodes_serial | default(1) }}"
  max_fail_percentage: "{{ nodes_max_fail_percentage | default(0) }}"

  tasks:
  - debug:
      var: inventory_hostname
    failed_when: inventory_hostname == 'host2'

  - debug:
      msg: "Next task"

  - debug:
      msg: "Last task"
# test-inv
[nodes]
host[1:40] ansible_host=127.0.0.1

# Command
ansible-playbook -i test-inv test-fail.yml -e nodes_serial=4 -e nodes_max_fail_percentage=30
EXPECTED RESULTS

The expectation is that the playbook will continue running all batches unless a single batch has more than one failure (with serial=4, one failure = 25% and two failures = 50%, compared against max_fail_percentage=30).
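As a worked illustration, here is a minimal sketch of the expected check (not Ansible's actual strategy code; the function name is hypothetical), assuming the failure percentage is taken over the full batch size set by serial:

# expected check - illustrative sketch only
def batch_should_abort(failed_hosts, batch_size, max_fail_percentage):
    # Abort only when the failure percentage of the full batch exceeds
    # max_fail_percentage.
    percentage = failed_hosts / float(batch_size) * 100
    return percentage > max_fail_percentage

# With serial=4 and max_fail_percentage=30:
print(batch_should_abort(1, 4, 30))  # False: 25% <= 30%, keep running batches
print(batch_should_abort(2, 4, 30))  # True:  50% >  30%, abort the play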

ACTUAL RESULTS

The playbook exited after the "Next task" debug task that followed the failure and did not process any further batches.

If max_fail_percentage (mfp) is set to 24, the playbook exits after the first failed task in the first batch. Correct.
If mfp is set to 33, the playbook exits after the next task (run on only three hosts) in the first batch. Incorrect.
If mfp is set to 34, the playbook exits after all tasks in that batch. Incorrect.
(Note that there appears to be some impact on task execution, because changing mfp from 33 to 34 makes a difference even though four hosts run in this batch. Only the 24/25 boundary should affect a single host failure in a four-host batch.)
If mfp is set to 99, the playbook exits after all tasks in that batch. Incorrect.
If mfp is set to 100, the playbook continues through all tasks for all batches. Correct.

$ ansible-playbook -i test-inv test-fail.yml -e nodes_serial=4 -e nodes_max_fail_percentage=30 

PLAY [nodes] *********************************************************************************************************************************

TASK [debug] *********************************************************************************************************************************
ok: [host1] => {
    "failed_when_result": false, 
    "inventory_hostname": "host1"
}
ok: [host3] => {
    "failed_when_result": false, 
    "inventory_hostname": "host3"
}
fatal: [host2]: FAILED! => {
    "changed": false, 
    "failed_when_result": true, 
    "inventory_hostname": "host2"
}
ok: [host4] => {
    "failed_when_result": false, 
    "inventory_hostname": "host4"
}

TASK [debug] *********************************************************************************************************************************
ok: [host1] => {
    "msg": "Next task"
}
ok: [host3] => {
    "msg": "Next task"
}
ok: [host4] => {
    "msg": "Next task"
}

NO MORE HOSTS LEFT ***************************************************************************************************************************

NO MORE HOSTS LEFT ***************************************************************************************************************************
	to retry, use: --limit @/home/rteague/git/clusters/aws-c1/test-fail.retry

PLAY RECAP ***********************************************************************************************************************************
host1                      : ok=2    changed=0    unreachable=0    failed=0   
host2                      : ok=0    changed=0    unreachable=0    failed=1   
host3                      : ok=2    changed=0    unreachable=0    failed=0   
host4                      : ok=2    changed=0    unreachable=0    failed=0   

ansibot added the affects_2.5, bug_report, needs_triage, and support:core labels on Oct 27, 2017
mtnbikenc (Contributor, Author) commented Oct 27, 2017

Running git bisect found 9602e43:

9602e439525e5e14b1e062e18bbdee3b47af8248 is the first bad commit
commit 9602e439525e5e14b1e062e18bbdee3b47af8248
Author: Vincent Roy <vincentroy8@gmail.com>
Date:   Tue Apr 12 22:36:08 2016 -0300

    Don't stop executing plays after failure.
    
    https://github.com/ansible/ansible/pull/13750/files

:040000 040000 e788d7b583c0b8c77af417c649ffb2d011560c4a 6d18e95e5da6c429cd8022a7a808be9f4c4561d5 M	lib
bisect run success

This may only be part of the overall problem.

s-hertel added the c:plugins/strategy and c:executor/playbook_executor labels and removed the needs_triage label on Oct 27, 2017
jctanner (Contributor) commented:

#32362

bcoca added a commit to bcoca/ansible that referenced this issue Oct 30, 2017
currently it is doing only from the 'active' hosts in the batch which means
the percentage goes up as hosts fail instead of staying the same.
added debug info for max fail

fixes ansible#32255
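
Based on the commit message above, the strategy computed the failure percentage from the remaining 'active' hosts rather than the full batch, so the percentage climbs as hosts fail. A minimal sketch of the two calculations (illustrative only; the function names are hypothetical, not the actual Ansible code):

# max_fail_percentage calculation - illustrative sketch only
def percent_failed_from_active(failed, batch_size):
    # buggy behaviour per the commit message: percentage taken over the
    # hosts still active in the batch, so it rises as hosts fail
    active = batch_size - failed
    return failed / float(active) * 100

def percent_failed_from_batch(failed, batch_size):
    # fixed behaviour: percentage taken over the full batch size
    return failed / float(batch_size) * 100

# One failure in a batch of four (serial=4):
print(percent_failed_from_active(1, 4))  # ~33.3 -> consistent with the 33/34 boundary reported above
print(percent_failed_from_batch(1, 4))   # 25.0  -> matches the expected 24/25 boundary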
bcoca added the affects_2.3 and affects_2.4 labels on Oct 30, 2017
abadger pushed a commit that referenced this issue Nov 1, 2017
currently it is doing only from the 'active' hosts in the batch which means
the percentage goes up as hosts fail instead of staying the same.
added debug info for max fail

fixes #32255

(cherry picked from commit 4fb9e54)
abadger pushed a commit that referenced this issue Nov 1, 2017
currently it is doing only from the 'active' hosts in the batch which means
the percentage goes up as hosts fail instead of staying the same.
added debug info for max fail

fixes #32255
ansibot added the bug label and removed the bug_report label on Mar 7, 2018
ansible locked and limited conversation to collaborators on Apr 26, 2019