"max_fail_percentage" does not allow continuation after failed hosts #5073

Closed
candlerb opened this Issue · 6 comments


@candlerb

[ansible 1.4]

According to the documentation, if I set "serial" together with "max_fail_percentage: 100", I would expect all hosts to be tried. However, the play aborts as soon as the first host fails.

Demonstration:

$ cat test.inv
localhost ansible_connection=local
[vms]
vm1 vm_server=localhost
vm2 vm_server=localhost
vm3 vm_server=localhost

$ cat test.yml
- hosts: vms
  gather_facts: no
  serial: 1
  max_fail_percentage: 100
  tasks:
    - shell: /bin/false
      delegate_to: "{{vm_server}}"

$ ansible-playbook test.yml -i test.inv

PLAY [vms] ********************************************************************

TASK: [shell /bin/false] ******************************************************
failed: [vm3] => {"changed": true, "cmd": "/bin/false ", "delta": "0:00:00.002221", "end": "2013-11-27 12:04:08.994577", "rc": 1, "start": "2013-11-27 12:04:08.992356"}

FATAL: all hosts have already failed -- aborting

PLAY RECAP ********************************************************************
           to retry, use: --limit @/home/nsrc/test.retry

vm3                        : ok=0    changed=0    unreachable=0    failed=1

Furthermore, the retry file only lists this one host.

$ cat /home/nsrc/test.retry
vm3
$ 
@mpdehaan

Because only one host failed, it is correct for the retry file to list just that one host.

@mpdehaan

The max fail percentage logic appears to have a simple comparison problem: the ">" needs to be ">=".

                            if (hosts_count - len(host_list)) > int((play.max_fail_pct)/100.0 * hosts_count):
@candlerb

There are two lines like that in playbook/__init__.py. I have changed both, and it makes no difference.

It also makes no difference if I set max_fail_percentage: 150 or max_fail_percentage: 50.

If I patch the code thus:

--- /usr/share/pyshared/ansible/playbook/__init__.py.orig   2013-11-27 13:03:13.698902535 +0000
+++ /usr/share/pyshared/ansible/playbook/__init__.py    2013-11-27 13:11:02.924281924 +0000
@@ -23,6 +23,7 @@
 from ansible import errors
 import ansible.callbacks
 import os
+import sys
 import shlex
 import collections
 from play import Play
@@ -553,10 +554,13 @@
                                 fired_names[handler_name] = 1

                                 host_list = self._list_available_hosts(play.hosts)
+                                sys.stderr.write("(A) play.max_fail_pct=%d\n" % play.max_fail_pct)
                                 if handler.any_errors_fatal and len(host_list) < hosts_count:
                                     play.max_fail_pct = 0
-                                if (hosts_count - len(host_list)) > int((play.max_fail_pct)/100.0 * hosts_count):
+                                    sys.stderr.write("(A) set play.max_fail_pct=0\n")
+                                if (hosts_count - len(host_list)) >= int((play.max_fail_pct)/100.0 * hosts_count):
                                     host_list = None
+                                    sys.stderr.write("(A) set host_list = None")
                                 if not host_list:
                                     self.callbacks.on_no_hosts_remaining()
                                     return False
@@ -598,12 +602,16 @@
                 host_list = self._list_available_hosts(play.hosts)

                 # Set max_fail_pct to 0, So if any hosts fails, bail out
+                sys.stderr.write("(B) play.max_fail_pct=%d\n" % play.max_fail_pct)
+                sys.stderr.write("(B) len(host_list)=%d, hosts_count=%d\n" % (len(host_list),hosts_count))
                 if task.any_errors_fatal and len(host_list) < hosts_count:
                     play.max_fail_pct = 0
+                    sys.stderr.write("(B) set play.max_fail_pct = 0")

                 # If threshold for max nodes failed is exceeded , bail out.
-                if (hosts_count - len(host_list)) > int((play.max_fail_pct)/100.0 * hosts_count):
+                if (hosts_count - len(host_list)) >= int((play.max_fail_pct)/100.0 * hosts_count):
                     host_list = None
+                    sys.stderr.write("(B) set host_list = None")

                 # if no hosts remain, drop out
                 if not host_list:

I get the following:

(B) play.max_fail_pct=100
(B) len(host_list)=0, hosts_count=1
(B) set host_list = None

And if I set max_fail_percentage: 300 then I just get:

(B) play.max_fail_pct=300
(B) len(host_list)=0, hosts_count=1
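
The debug output above can be reproduced with a small standalone model of the comparison. This is only a sketch: the function names `threshold_exceeded` and `play_continues` are hypothetical and are not part of Ansible; only the arithmetic and the "no hosts remain" check are taken from the code quoted in this thread.

```python
def threshold_exceeded(hosts_count, available, max_fail_pct, strict=True):
    """Mirror the comparison quoted in the thread.

    strict=True uses the original '>' operator,
    strict=False uses the patched '>=' operator.
    """
    failed = hosts_count - available
    limit = int(max_fail_pct / 100.0 * hosts_count)
    return failed > limit if strict else failed >= limit


def play_continues(hosts_count, available, max_fail_pct, strict=True):
    """Hypothetical stand-in for the two checks applied after a task."""
    host_list = ["host%d" % i for i in range(available)]
    if threshold_exceeded(hosts_count, available, max_fail_pct, strict):
        host_list = None
    # "if no hosts remain, drop out": an empty list is falsy too
    return bool(host_list)


# Values from the debug output: hosts_count=1, len(host_list)=0
for pct in (100, 300):
    for strict in (True, False):
        print(pct, ">" if strict else ">=",
              threshold_exceeded(1, 0, pct, strict),
              play_continues(1, 0, pct, strict))
```

With one host in the batch and none left available, the play stops for every combination: the threshold check only decides whether host_list is set to None, and an empty host_list is treated the same way.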
@candlerb

Workaround: instead of serial: 1 use ansible-playbook -f 1. This seems to do the job.

@lechat

The maximum failure percentage is calculated per batch, not across all hosts. You've set the batch size to one host, so Ansible fails as soon as the first batch reaches 100% failure.

In your workaround you prevent Ansible from creating multiple batches, so it is forced to put all hosts into one batch, and the playbook then fails when the threshold is reached.

I don't think that this is a bug.
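
The per-batch behaviour described above can be illustrated with a toy model. This is a sketch under stated assumptions, not Ansible's implementation: `run_play` and its `fails` callback are hypothetical, batches are taken in inventory order, and the only rule modelled is "stop when every host in the current batch has failed".

```python
def run_play(hosts, serial=0, fails=lambda host: True):
    """Return the hosts that were attempted before the play stopped.

    serial=0 means no batching: all hosts form a single batch.
    """
    size = serial if serial > 0 else max(len(hosts), 1)
    batches = [hosts[i:i + size] for i in range(0, len(hosts), size)]
    attempted = []
    for batch in batches:
        attempted.extend(batch)
        survivors = [h for h in batch if not fails(h)]
        if not survivors:
            break  # no hosts remain in this batch, so the play aborts
    return attempted


vms = ["vm1", "vm2", "vm3"]
print(run_play(vms, serial=1))  # only the first batch is tried
print(run_play(vms))            # one batch, so every host is tried
```

In this model, serial: 1 with a task that always fails stops after the first host, while a single batch containing all three hosts lets each of them run the task before the play stops.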

@jimi-c
Owner

This actually has nothing to do with the max failure percentage. Ansible will never continue if 100% of the hosts in a batch fail, even if you set max_fail_percentage above 100. The relevant section of code being hit is here (in lib/ansible/playbook/__init__.py):

                # if no hosts remain, drop out
                if not host_list:
                    self.callbacks.on_no_hosts_remaining()
                    return False

I agree with @lechat that this is not a bug, so we're going to go ahead and close this. If you have any further questions, please let us know. Thanks!

@jimi-c jimi-c closed this