Skip to content

Conversation

@s-hertel
Copy link
Contributor

@s-hertel s-hertel commented Jun 17, 2022

SUMMARY

Fixes #35086
Fixes #86208

On devel, unreachable hosts pick up where they left off.

On this branch, unreachable hosts pick up at the next play.

I'm not sure if this is the right behavior, because I can imagine content relying on the behavior of devel (it's the only mechanism to recover an unreachable host, and it could be intuitive using the free strategy), and people using the linear strategy could expect the host should start at the task following the next applicable task following the clear_host_errors task instead of the next play.

I'd also be open to just documenting the current behavior or making it configurable.

ISSUE TYPE
  • Bugfix Pull Request

@ansibot ansibot added WIP This issue/PR is a work in progress. Nevertheless it was shared for getting input from peers. affects_2.14 bug This issue/PR relates to a bug. needs_triage Needs a first human triage before being processed. support:community This issue/PR relates to code supported by the Ansible community. support:core This issue/PR relates to code supported by the Ansible Engineering Team. labels Jun 17, 2022
@mkrizek
Copy link
Contributor

mkrizek commented Jun 17, 2022

As a note I looked at fixing 31543 in the past and had this in git stash:

diff --git a/lib/ansible/plugins/strategy/linear.py b/lib/ansible/plugins/strategy/linear.py
index d90d347d3e3..01c2bbfeb01 100644
--- a/lib/ansible/plugins/strategy/linear.py
+++ b/lib/ansible/plugins/strategy/linear.py
@@ -411,23 +411,20 @@ class StrategyModule(StrategyBase):
                 for res in results:
                     # execute_meta() does not set 'failed' in the TaskResult
                     # so we skip checking it with the meta tasks and look just at the iterator
-                    if (res.is_failed() or res._task.action in C._ACTION_META) and iterator.is_failed(res._host):
+                    #if (res.is_failed() or res._task.action in C._ACTION_META) and iterator.is_failed(res._host):
+                    if res.is_failed():
                         failed_hosts.append(res._host.name)
                     elif res.is_unreachable():
                         unreachable_hosts.append(res._host.name)

                 # if any_errors_fatal and we had an error, mark all hosts as failed
                 if any_errors_fatal and (len(failed_hosts) > 0 or len(unreachable_hosts) > 0):
-                    dont_fail_states = frozenset([IteratingStates.RESCUE, IteratingStates.ALWAYS])
+                    # would probably need a new flag to abort the play (and for PlayIterator to revert the flag when rescue)
+                    iterator.end_play = True
+                    # FIXME self._tqm.RUN_FAILED_BREAK_PLAY
                     for host in hosts_left:
-                        (s, _) = iterator.get_next_task_for_host(host, peek=True)
-                        # the state may actually be in a child state, use the get_active_state()
-                        # method in the iterator to figure out the true active state
-                        s = iterator.get_active_state(s)
-                        if s.run_state not in dont_fail_states or \
-                           s.run_state == IteratingStates.RESCUE and s.fail_state & FailedStates.RESCUE != 0:
-                            self._tqm._failed_hosts[host.name] = True
-                            result |= self._tqm.RUN_FAILED_BREAK_PLAY
+                        if host.name not in failed_hosts:
+                            #self._tqm._failed_hosts[host.name] = True
+                            iterator.mark_host_failed(host)
                 display.debug("done checking for any_errors_fatal")

                 display.debug("checking for max_fail_percentage")

Basically the idea was to utilize iterator.mark_host_errors() to fail other hosts and for PlayIterator to handle the case where rescue section is present to undo the failure(s). You can see that I commented out some lines, it probably broke something but I am not sure what it was. One thing I know is that with the diff the return code of ansible-playbook changes since strategy does not return self._tqm.RUN_FAILED_BREAK_PLAY anymore.

It would also fix #73246 it seems.

I feel like there are too many fixes in the current (devel) implemetation (dont_fail_states, using iterator.get_active_state) at this point.

There is probably a reason why this wasn't done like this when it was originally written though? 🧐

Anyway just thought I'd share this in case you wanted to experiment with that idea as well.

@s-hertel s-hertel removed the needs_triage Needs a first human triage before being processed. label Jun 21, 2022
@ansibot ansibot added the stale_ci This PR has been tested by CI more than one week ago. Close and re-open this PR to get it retested. label Jun 29, 2022
@ansibot ansibot added the needs_rebase https://docs.ansible.com/ansible/devel/dev_guide/developing_rebasing.html label Aug 24, 2022
@s-hertel s-hertel force-pushed the fix_linear_any_errors_fatal branch from 3bee862 to d2c8520 Compare March 28, 2024 22:46
@s-hertel s-hertel changed the title fix task execution/lockstep issues with any_errors_fatal and clear_host_errors fix task execution/lockstep issues with clear_host_errors Mar 28, 2024
@ansibot ansibot removed needs_rebase https://docs.ansible.com/ansible/devel/dev_guide/developing_rebasing.html stale_ci This PR has been tested by CI more than one week ago. Close and re-open this PR to get it retested. labels Mar 28, 2024
@ansibot ansibot added the stale_ci This PR has been tested by CI more than one week ago. Close and re-open this PR to get it retested. label Apr 11, 2024
@ansibot ansibot added the stale_pr This PR has not been pushed to for more than one year. label Apr 1, 2025
@s-hertel s-hertel force-pushed the fix_linear_any_errors_fatal branch from d2c8520 to 2b0e9fb Compare July 14, 2025 17:02
@ansibot ansibot added needs_revision This PR fails CI tests or a maintainer has requested a review/revision of the PR. and removed stale_pr This PR has not been pushed to for more than one year. stale_ci This PR has been tested by CI more than one week ago. Close and re-open this PR to get it retested. labels Jul 14, 2025
@s-hertel s-hertel force-pushed the fix_linear_any_errors_fatal branch 2 times, most recently from a746289 to 872d5a1 Compare July 14, 2025 18:57
@ansibot

This comment was marked as outdated.

@s-hertel s-hertel force-pushed the fix_linear_any_errors_fatal branch from 872d5a1 to 16d95ad Compare July 17, 2025 14:41
@ansibot ansibot added the module This issue/PR relates to a module. label Jul 17, 2025
@s-hertel s-hertel force-pushed the fix_linear_any_errors_fatal branch 2 times, most recently from f1c97eb to cedc6fe Compare July 17, 2025 16:54
@s-hertel s-hertel changed the title fix task execution/lockstep issues with clear_host_errors clear_host_errors - fix unreachable host task lockstep Jul 17, 2025
@s-hertel s-hertel added the ci_verified Changes made in this PR are causing tests to fail. label Jul 17, 2025
@s-hertel s-hertel force-pushed the fix_linear_any_errors_fatal branch from cedc6fe to 02728dd Compare July 17, 2025 18:30
@ansibot ansibot removed the ci_verified Changes made in this PR are causing tests to fail. label Jul 17, 2025
@s-hertel s-hertel added the ci_verified Changes made in this PR are causing tests to fail. label Jul 17, 2025
@ansibot ansibot added the stale_ci This PR has been tested by CI more than one week ago. Close and re-open this PR to get it retested. label Jul 28, 2025
…bsequent play

capture nuance in the documentation

ci_complete
@s-hertel s-hertel force-pushed the fix_linear_any_errors_fatal branch from 02728dd to 9a7a78b Compare November 17, 2025 14:51
@ansibot ansibot removed ci_verified Changes made in this PR are causing tests to fail. stale_ci This PR has been tested by CI more than one week ago. Close and re-open this PR to get it retested. labels Nov 17, 2025
@s-hertel s-hertel added the ci_verified Changes made in this PR are causing tests to fail. label Nov 17, 2025
@ansibot ansibot removed the ci_verified Changes made in this PR are causing tests to fail. label Nov 18, 2025
@s-hertel s-hertel added the ci_verified Changes made in this PR are causing tests to fail. label Nov 18, 2025
@ansibot ansibot removed the ci_verified Changes made in this PR are causing tests to fail. label Nov 18, 2025
@s-hertel s-hertel changed the title clear_host_errors - fix unreachable host task lockstep clear_host_errors - fix handling unreachable errors Nov 18, 2025
@s-hertel s-hertel force-pushed the fix_linear_any_errors_fatal branch 3 times, most recently from 971ae7d to 04f5877 Compare November 18, 2025 23:26
@s-hertel s-hertel marked this pull request as ready for review November 18, 2025 23:49
@ansibot ansibot removed the needs_revision This PR fails CI tests or a maintainer has requested a review/revision of the PR. label Nov 19, 2025
@s-hertel s-hertel force-pushed the fix_linear_any_errors_fatal branch from 04f5877 to 8317893 Compare November 19, 2025 00:38
@ansibot ansibot added the needs_revision This PR fails CI tests or a maintainer has requested a review/revision of the PR. label Nov 19, 2025
@s-hertel s-hertel force-pushed the fix_linear_any_errors_fatal branch from 8317893 to fbf0563 Compare November 19, 2025 02:05
@ansibot ansibot removed the needs_revision This PR fails CI tests or a maintainer has requested a review/revision of the PR. label Nov 19, 2025
@ansibot ansibot added the needs_rebase https://docs.ansible.com/ansible/devel/dev_guide/developing_rebasing.html label Nov 26, 2025
@mkrizek
Copy link
Contributor

mkrizek commented Nov 26, 2025

I'd also be open to just documenting the current behavior or making it configurable.

I suppose another option would be to document the current behavior and add a new meta task meta: clear_unreachable_hosts (likely a different name) that would implement the same behavior as in this PR?

@s-hertel
Copy link
Contributor Author

Do you think the current behavior of clear_host_errors on unreachable hosts is potentially useful? I wanted to be cautious about changing the current behavior since this is one of only two mechanisms for handling unreachable errors, but I think it's the best option. #86208 (comment) changed my mind that it's useful for the free strategy.

On a tangential note, I'm questioning how useful clear_host_errors is in general. I searched Github for use cases, but largely found it used in strange places, like at the beginning of plays, in roles, in rescue... Since it affects all play hosts, and interacts with block/rescue/always, it seems brittle (even with linear, if dynamic, conditional includes are involved). To keep content portable, I think it's best to handle unreachable errors using ignore_unreachable + task conditionals, and non-unreachable errors using rescue (or ignore_errors + task conditionals).

@ansibot ansibot added the stale_ci This PR has been tested by CI more than one week ago. Close and re-open this PR to get it retested. label Dec 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

affects_2.14 bug This issue/PR relates to a bug. has_issue module This issue/PR relates to a module. needs_rebase https://docs.ansible.com/ansible/devel/dev_guide/developing_rebasing.html stale_ci This PR has been tested by CI more than one week ago. Close and re-open this PR to get it retested. support:community This issue/PR relates to code supported by the Ansible community. support:core This issue/PR relates to code supported by the Ansible Engineering Team. WIP This issue/PR is a work in progress. Nevertheless it was shared for getting input from peers.

Projects

None yet

3 participants