max_fail_percentage incorrectly aborts playbook execution #32255

Closed
mtnbikenc opened this issue Oct 27, 2017 · 3 comments · Fixed by #32362
Labels
affects_2.3, affects_2.4, affects_2.5, bug, c:executor/playbook_executor, c:plugins/strategy, support:core

Comments

mtnbikenc (Contributor) commented Oct 27, 2017

ISSUE TYPE
  • Bug Report
COMPONENT NAME

max_fail_percentage

ANSIBLE VERSION
ansible 2.5.0 (devel 7553c42e09) last updated 2017/10/27 08:35:00 (GMT -400)
  config file = /home/rteague/git/clusters/aws-c1/ansible.cfg
  configured module search path = [u'/home/rteague/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /home/rteague/git/clusters/aws-c1/ansible/lib/ansible
  executable location = /home/rteague/git/clusters/aws-c1/ansible/bin/ansible
  python version = 2.7.13 (default, Sep  5 2017, 08:53:59) [GCC 7.1.1 20170622 (Red Hat 7.1.1-3)]

NOTE: This issue exists as far back as v2.1.0.0-1.
I tested against older versions and found:
v2.0.2.0-1 Good
v2.1.0.0-1 Bad

CONFIGURATION

N/A

OS / ENVIRONMENT

N/A

SUMMARY

max_fail_percentage incorrectly aborts playbook execution after a single host failure in a batch

STEPS TO REPRODUCE
# test-fail.yml
- hosts: nodes
  gather_facts: no
  serial: "{{ nodes_serial | default(1) }}"
  max_fail_percentage: "{{ nodes_max_fail_percentage | default(0) }}"

  tasks:
  - debug:
      var: inventory_hostname
    failed_when: inventory_hostname == 'host2'

  - debug:
      msg: "Next task"

  - debug:
      msg: "Last task"
# test-inv
[nodes]
host[1:40] ansible_host=127.0.0.1

# Command
ansible-playbook -i test-inv test-fail.yml -e nodes_serial=4 -e nodes_max_fail_percentage=30
EXPECTED RESULTS

The expectation is that the playbook will continue running all batches unless a single batch has more than one failure (with serial=4, one failure = 25% and two failures = 50%, compared against max_fail_percentage=30).
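As a worked illustration, here is a minimal sketch of the expected check (not Ansible's actual strategy code; the function name is hypothetical), assuming the failure percentage is taken over the full batch size set by serial:

# expected check - illustrative sketch only
def batch_should_abort(failed_hosts, batch_size, max_fail_percentage):
    # Abort only when the failure percentage of the full batch exceeds
    # max_fail_percentage.
    percentage = failed_hosts / float(batch_size) * 100
    return percentage > max_fail_percentage

# With serial=4 and max_fail_percentage=30:
print(batch_should_abort(1, 4, 30))  # False: 25% <= 30%, keep running batches
print(batch_should_abort(2, 4, 30))  # True:  50% >  30%, abort the play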

ACTUAL RESULTS

The playbook exited after the "Next task" debug task that followed the failure and did not process any further batches.

If max_fail_percentage (mfp) is set to 24, the playbook exits after the first failed task in the first batch. Correct.
If mfp is set to 33, the playbook exits after the next task (run on only three hosts) in the first batch. Incorrect.
If mfp is set to 34, the playbook exits after all tasks in that batch. Incorrect.
(Note that there appears to be some impact on task execution, because changing mfp from 33 to 34 makes a difference even though four hosts run in this batch. Only the 24/25 boundary should affect a single host failure in a four-host batch.)
If mfp is set to 99, the playbook exits after all tasks in that batch. Incorrect.
If mfp is set to 100, the playbook continues through all tasks for all batches. Correct.

$ ansible-playbook -i test-inv test-fail.yml -e nodes_serial=4 -e nodes_max_fail_percentage=30 

PLAY [nodes] *********************************************************************************************************************************

TASK [debug] *********************************************************************************************************************************
ok: [host1] => {
    "failed_when_result": false, 
    "inventory_hostname": "host1"
}
ok: [host3] => {
    "failed_when_result": false, 
    "inventory_hostname": "host3"
}
fatal: [host2]: FAILED! => {
    "changed": false, 
    "failed_when_result": true, 
    "inventory_hostname": "host2"
}
ok: [host4] => {
    "failed_when_result": false, 
    "inventory_hostname": "host4"
}

TASK [debug] *********************************************************************************************************************************
ok: [host1] => {
    "msg": "Next task"
}
ok: [host3] => {
    "msg": "Next task"
}
ok: [host4] => {
    "msg": "Next task"
}

NO MORE HOSTS LEFT ***************************************************************************************************************************

NO MORE HOSTS LEFT ***************************************************************************************************************************
	to retry, use: --limit @/home/rteague/git/clusters/aws-c1/test-fail.retry

PLAY RECAP ***********************************************************************************************************************************
host1                      : ok=2    changed=0    unreachable=0    failed=0   
host2                      : ok=0    changed=0    unreachable=0    failed=1   
host3                      : ok=2    changed=0    unreachable=0    failed=0   
host4                      : ok=2    changed=0    unreachable=0    failed=0   

ansibot added the affects_2.5, bug_report, needs_triage, and support:core labels on Oct 27, 2017
mtnbikenc (Contributor, Author) commented Oct 27, 2017

Running git bisect found 9602e43:

9602e439525e5e14b1e062e18bbdee3b47af8248 is the first bad commit
commit 9602e439525e5e14b1e062e18bbdee3b47af8248
Author: Vincent Roy <vincentroy8@gmail.com>
Date:   Tue Apr 12 22:36:08 2016 -0300

    Don't stop executing plays after failure.
    
    https://github.com/ansible/ansible/pull/13750/files

:040000 040000 e788d7b583c0b8c77af417c649ffb2d011560c4a 6d18e95e5da6c429cd8022a7a808be9f4c4561d5 M	lib
bisect run success

This may only be part of the overall problem.

s-hertel added the c:plugins/strategy and c:executor/playbook_executor labels and removed the needs_triage label on Oct 27, 2017
jctanner (Contributor) commented:

#32362

bcoca added a commit to bcoca/ansible that referenced this issue Oct 30, 2017
currently it is doing only from the 'active' hosts in the batch which means
the percentage goes up as hosts fail instead of staying the same.
added debug info for max fail

fixes ansible#32255
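
Based on the commit message above, the strategy computed the failure percentage from the remaining 'active' hosts rather than the full batch, so the percentage climbs as hosts fail. A minimal sketch of the two calculations (illustrative only; the function names are hypothetical, not the actual Ansible code):

# max_fail_percentage calculation - illustrative sketch only
def percent_failed_from_active(failed, batch_size):
    # buggy behaviour per the commit message: percentage taken over the
    # hosts still active in the batch, so it rises as hosts fail
    active = batch_size - failed
    return failed / float(active) * 100

def percent_failed_from_batch(failed, batch_size):
    # fixed behaviour: percentage taken over the full batch size
    return failed / float(batch_size) * 100

# One failure in a batch of four (serial=4):
print(percent_failed_from_active(1, 4))  # ~33.3 -> consistent with the 33/34 boundary reported above
print(percent_failed_from_batch(1, 4))   # 25.0  -> matches the expected 24/25 boundary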
bcoca added the affects_2.3 and affects_2.4 labels on Oct 30, 2017
abadger pushed a commit that referenced this issue Nov 1, 2017
currently it is doing only from the 'active' hosts in the batch which means
the percentage goes up as hosts fail instead of staying the same.
added debug info for max fail

fixes #32255

(cherry picked from commit 4fb9e54)
abadger pushed a commit that referenced this issue Nov 1, 2017
currently it is doing only from the 'active' hosts in the batch which means
the percentage goes up as hosts fail instead of staying the same.
added debug info for max fail

fixes #32255
ansibot added the bug label and removed the bug_report label on Mar 7, 2018
ansible locked and limited conversation to collaborators on Apr 26, 2019