ansible · zopar · Mar 19, 2018 · Jun 5, 2019 · jimi-c · Mar 19, 2018
diff --git a/docs/docsite/rst/reference_appendices/glossary.rst b/docs/docsite/rst/reference_appendices/glossary.rst
@@ -398,7 +398,8 @@ when a term comes up on the mailing list.
         default is to address the batch size all at once, so this is something
         that you must opt-in to.  OS configuration (such as making sure config
         files are correct) does not typically have to use the rolling update
-        model, but can do so if desired.
+        model, but can do so if desired. This option also lets you ignore
+        unreachable nodes in a batch of machines.
 
     Serial
         .. seealso::

diff --git a/docs/docsite/rst/reference_appendices/test_strategies.rst b/docs/docsite/rst/reference_appendices/test_strategies.rst
@@ -216,6 +216,23 @@ the pool.
 In the event of a problem, fix the few servers that fail using Ansible's automatically generated 
 retry file to repeat the deploy on just those servers.
 
+You can also ignore unreachable nodes so that your job goes ahead. Suppose, for example, that you have 6 nodes and
+want to divide them into groups of 2. The ``serial`` option could then be written as::
+
+    ---
+
+    - hosts: webservers
+      serial: [[2, 1], 2, 2]
+
+
+In this case, the first group, represented by the list [2, 1], has a value of 1 (True) for the question
+"Do you want to ignore unreachable hosts?". For the other groups the implicit answer is 0 (False).
+If one or both machines in the first group of 2 are unreachable, Ansible goes ahead with the
+second group and does not stop the job. If one machine in the second group is unreachable, Ansible does not
+stop (this is also the default behavior in older versions of Ansible). If both machines in the second group
+are unreachable, Ansible stops the job.
+
+
 Achieving Continuous Deployment
 ```````````````````````````````
 

diff --git a/docs/docsite/rst/user_guide/guide_rolling_upgrade.rst b/docs/docsite/rst/user_guide/guide_rolling_upgrade.rst
@@ -237,6 +237,23 @@ Here is the next part of the update play::
    - The ``serial`` keyword forces the play to be executed in 'batches'. Each batch counts as a full play with a subselection of hosts.
      This has some consequences on play behavior. For example, if all hosts in a batch fails, the play fails, which in turn fails the entire run. You should consider this when combining with ``max_fail_percentage``.
 
+To prevent unreachable hosts from being counted as failed and stopping the play, you can ignore unreachable nodes by giving ``serial`` a list of lists. Suppose, for example, that you have 6 nodes and
+want to divide them into groups of 2. The ``serial`` option could then be written as::
+
+    ---
+
+    - hosts: webservers
+      serial: [[2, 1], 2, 2]
+
+
+In this case, the first group, represented by the list [2, 1], has a value of 1 (True) for the question
+"Do you want to ignore unreachable hosts?". For the other groups the implicit answer is 0 (False).
+If one or both machines in the first group of 2 are unreachable, Ansible goes ahead with the
+second group and does not stop the play. If one machine in the second group is unreachable, Ansible does not
+stop (this is also the default behavior in older versions of Ansible). If both machines in the second group
+are unreachable, Ansible stops the play. The only valid values are 0 and 1 (or False and True); any other value is
+ignored and 0 is used.
+
 The ``pre_tasks`` keyword just lets you list tasks to run before the roles are called. This will make more sense in a minute. If you look at the names of these tasks, you can see that we are disabling Nagios alerts and then removing the webserver that we are currently updating from the HAProxy load balancing pool.
 
 The ``delegate_to`` and ``loop`` arguments, used together, cause Ansible to loop over each monitoring server and load balancer, and perform that operation (delegate that operation) on the monitoring or load balancing server, "on behalf" of the webserver. In programming terms, the outer loop is the list of web servers, and the inner loop is the list of monitoring servers.

diff --git a/docs/docsite/rst/user_guide/playbooks_delegation.rst b/docs/docsite/rst/user_guide/playbooks_delegation.rst
@@ -111,6 +111,32 @@ You can also mix and match the values::
 .. note::
      No matter how small the percentage, the number of hosts per pass will always be 1 or greater.
 
+You can also ignore unreachable nodes so that the play goes ahead. Suppose, for example, that you have 6 nodes and
+want to divide them into groups of 2. The ``serial`` option could then be written as::
+
+    ---
+
+    - hosts: webservers
+      serial: [[2, 1], 2, 2]
+
+
+In this case, the first group, represented by the list [2, 1], has a value of 1 (True) for the question
+"Do you want to ignore unreachable hosts?". For the other groups the implicit answer is 0 (False).
+If one or both machines in the first group of 2 are unreachable, Ansible goes ahead with the
+second group and does not stop the play. If one machine in the second group is unreachable, Ansible does not
+stop (this is also the default behavior in older versions of Ansible). If both machines in the second group
+are unreachable, Ansible stops the play.
+
+.. note::
+   The only valid values are 0 and 1 (or False and True); any other value is ignored and 0 is used. Example::
+
+       ---
+
+       - hosts: webservers
+         serial: [[2, 1], [2, 0], 2, [4, False], [3, True], [2, "False"]]
+
+
+The last value is treated as [2, 0] because "False" in this case is a string and not a boolean.
 
 .. _maximum_failure_percentage:
 
@@ -132,6 +158,7 @@ In the above example, if more than 3 of the 10 servers in the group were to fail
 
      The percentage set must be exceeded, not equaled. For example, if serial were set to 4 and you wanted the task to abort 
      when 2 of the systems failed, the percentage should be set at 49 rather than 50.
+     Unreachable machines are never counted as failed when you use ``max_fail_percentage``.
 
 .. _delegation:
 

diff --git a/lib/ansible/executor/playbook_executor.py b/lib/ansible/executor/playbook_executor.py
@@ -158,11 +158,11 @@ def run(self):
 
                         break_play = False
                         # we are actually running plays
-                        batches = self._get_serialized_batches(play)
+                        batches, ignores = self._get_serialized_batches(play)
                         if len(batches) == 0:
                             self._tqm.send_callback('v2_playbook_on_play_start', play)
                             self._tqm.send_callback('v2_playbook_on_no_hosts_matched')
-                        for batch in batches:
+                        for batch, ignore in zip(batches, ignores):
                             # restrict the inventory to the hosts in the serialized batch
                             self._inventory.restrict_to_hosts(batch)
                             # and run it...
@@ -176,9 +176,13 @@ def run(self):
                             # check the number of failures here, to see if they're above the maximum
                             # failure percentage allowed, or if any errors are fatal. If either of those
                             # conditions are met, we break out, otherwise we only break out if the entire
-                            # batch failed
-                            failed_hosts_count = len(self._tqm._failed_hosts) + len(self._tqm._unreachable_hosts) - \
-                                (previously_failed + previously_unreachable)
+                            # batch failed. If the ignore value is 1, we do not count unreachable hosts as failed.
+                            # There is one ignore value for each batch of hosts.
+                            if ignore == 1:
+                                failed_hosts_count = len(self._tqm._failed_hosts) - previously_failed
+                            else:
+                                failed_hosts_count = len(self._tqm._failed_hosts) + len(self._tqm._unreachable_hosts) - \
+                                                        (previously_failed + previously_unreachable)
 
                             if len(batch) == failed_hosts_count:
                                 break_play = True
@@ -249,22 +253,29 @@ def run(self):
 
     def _get_serialized_batches(self, play):
         '''
-        Returns a list of hosts, subdivided into batches based on
-        the serial size specified in the play.
+        Returns a list of hosts subdivided into batches based on the serial size specified in the play,
+        and a parallel list of 0/1 values indicating whether unreachable hosts are ignored for each batch.
         '''
 
         # make sure we have a unique list of hosts
         all_hosts = self._inventory.get_hosts(play.hosts, order=play.order)
         all_hosts_len = len(all_hosts)
 
+        # Extract serial batch list
+        serial_batch_list = [i[0] if isinstance(i, list) else i for i in play.serial]
+
+        # ignore_unreachable_list contains 0/1 values: if 0, unreachable hosts are counted as failed; otherwise
+        # they are not counted as failed. If a value is not 0 or 1, we fall back to 0
+        ignore_unreachable_list = [i[1] if isinstance(i, list) and i[1] == 1 else 0 for i in play.serial]
+
         # the serial value can be listed as a scalar or a list of
         # scalars, so we make sure it's a list here
-        serial_batch_list = play.serial
         if len(serial_batch_list) == 0:
             serial_batch_list = [-1]
 
         cur_item = 0
         serialized_batches = []
+        ignore_unreachable = []
 
         while len(all_hosts) > 0:
             # get the serial value from current item in the list
@@ -275,6 +286,7 @@ def _get_serialized_batches(self, play):
             # to the current serial item size
             if serial <= 0:
                 serialized_batches.append(all_hosts)
+                ignore_unreachable.append(0)
                 break
             else:
                 play_hosts = []
@@ -283,6 +295,7 @@ def _get_serialized_batches(self, play):
                         play_hosts.append(all_hosts.pop(0))
 
                 serialized_batches.append(play_hosts)
+                ignore_unreachable.append(ignore_unreachable_list[cur_item])
 
             # increment the current batch list item number, and if we've hit
             # the end keep using the last element until we've consumed all of
@@ -291,7 +304,7 @@ def _get_serialized_batches(self, play):
             if cur_item > len(serial_batch_list) - 1:
                 cur_item = len(serial_batch_list) - 1
 
-        return serialized_batches
+        return serialized_batches, ignore_unreachable
 
     def _generate_retry_inventory(self, retry_path, replay_hosts):
         '''
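
For readers following the change to `_get_serialized_batches` and the failure check in `run`, here is a minimal standalone sketch of the same idea. It is not the code from this diff: `split_serial` and `batch_fully_failed` are hypothetical names, only integer batch sizes are handled (no percentages), the failure check takes per-batch lists instead of the running counters used in `run`, and a one-element list such as `[2]` is treated as "do not ignore" instead of raising an error.

```python
def split_serial(hosts, serial):
    """Split hosts into batches and return (batches, ignore_flags).

    Hypothetical helper for illustration only. Each entry of `serial`
    is either an int batch size or a [size, flag] list.
    """
    remaining = list(hosts)
    sizes = [s[0] if isinstance(s, list) else s for s in serial]
    # Only a flag equal to 1 (True == 1 in Python) enables ignoring;
    # anything else, including the string "False", falls back to 0.
    flags = [1 if isinstance(s, list) and len(s) > 1 and s[1] == 1 else 0
             for s in serial]
    if not sizes:
        # No serial given: one batch with every host, nothing ignored.
        sizes, flags = [-1], [0]

    batches, ignores = [], []
    cur = 0
    while remaining:
        size = sizes[cur]
        if size <= 0:
            # A non-positive size means "all remaining hosts at once".
            batches.append(remaining)
            ignores.append(0)
            break
        batches.append(remaining[:size])
        ignores.append(flags[cur])
        remaining = remaining[size:]
        # Keep reusing the last entry once the serial list is exhausted.
        cur = min(cur + 1, len(sizes) - 1)
    return batches, ignores


def batch_fully_failed(batch, failed, unreachable, ignore):
    """Return True if every host in the batch counts as failed."""
    if ignore == 1:
        counted = len(failed)
    else:
        counted = len(failed) + len(unreachable)
    return counted == len(batch)
```

With six hosts and `serial: [[2, 1], 2, 2]`, the sketch yields three batches of two hosts with flags `[1, 0, 0]`, so a first batch in which both hosts are unreachable does not stop the run, while the same situation in a later batch does.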