fix task execution/lockstep issues with clear_host_errors #78075

Draft
wants to merge 1 commit into
base: devel

Conversation

s-hertel
Contributor

@s-hertel s-hertel commented Jun 17, 2022

SUMMARY

Fixes #35086

Updating the failed hosts' states also helps fix a clear_host_errors issue where hosts are not brought back into play in lockstep.

Also fixes clearing host errors across plays.

ISSUE TYPE
  • Bugfix Pull Request

@ansibot ansibot added the WIP, affects_2.14, bug, needs_triage, support:community, and support:core labels Jun 17, 2022
@mkrizek
Contributor

mkrizek commented Jun 17, 2022

As a note, I looked at fixing 31543 in the past and had this in git stash:

diff --git a/lib/ansible/plugins/strategy/linear.py b/lib/ansible/plugins/strategy/linear.py
index d90d347d3e3..01c2bbfeb01 100644
--- a/lib/ansible/plugins/strategy/linear.py
+++ b/lib/ansible/plugins/strategy/linear.py
@@ -411,23 +411,20 @@ class StrategyModule(StrategyBase):
                 for res in results:
                     # execute_meta() does not set 'failed' in the TaskResult
                     # so we skip checking it with the meta tasks and look just at the iterator
-                    if (res.is_failed() or res._task.action in C._ACTION_META) and iterator.is_failed(res._host):
+                    #if (res.is_failed() or res._task.action in C._ACTION_META) and iterator.is_failed(res._host):
+                    if res.is_failed():
                         failed_hosts.append(res._host.name)
                     elif res.is_unreachable():
                         unreachable_hosts.append(res._host.name)

                 # if any_errors_fatal and we had an error, mark all hosts as failed
                 if any_errors_fatal and (len(failed_hosts) > 0 or len(unreachable_hosts) > 0):
-                    dont_fail_states = frozenset([IteratingStates.RESCUE, IteratingStates.ALWAYS])
+                    # would probably need a new flag to abort the play (and for PlayIterator to revert the flag when rescue)
+                    iterator.end_play = True
+                    # FIXME self._tqm.RUN_FAILED_BREAK_PLAY
                     for host in hosts_left:
-                        (s, _) = iterator.get_next_task_for_host(host, peek=True)
-                        # the state may actually be in a child state, use the get_active_state()
-                        # method in the iterator to figure out the true active state
-                        s = iterator.get_active_state(s)
-                        if s.run_state not in dont_fail_states or \
-                           s.run_state == IteratingStates.RESCUE and s.fail_state & FailedStates.RESCUE != 0:
-                            self._tqm._failed_hosts[host.name] = True
-                            result |= self._tqm.RUN_FAILED_BREAK_PLAY
+                        if host.name not in failed_hosts:
+                            #self._tqm._failed_hosts[host.name] = True
+                            iterator.mark_host_failed(host)
                 display.debug("done checking for any_errors_fatal")

                 display.debug("checking for max_fail_percentage")

Basically, the idea was to use iterator.mark_host_failed() to fail the other hosts and to let PlayIterator handle the case where a rescue section is present by undoing the failure(s). As you can see, I commented out some lines; that probably broke something, but I am not sure what. One thing I do know is that with this diff the return code of ansible-playbook changes, since the strategy no longer returns self._tqm.RUN_FAILED_BREAK_PLAY.
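
To make the idea concrete, here is a minimal toy model of it. This is a sketch under stated assumptions, not Ansible's actual PlayIterator or strategy code: `ToyIterator`, `HostState`, `RunState`, `apply_any_errors_fatal`, and `run` are all hypothetical names, and each host is assumed to be inside a single block that either has a rescue section or does not. Only the method name `mark_host_failed` is borrowed from the diff above.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class RunState(Enum):
    TASKS = auto()
    RESCUE = auto()
    COMPLETE = auto()


@dataclass
class HostState:
    run_state: RunState = RunState.TASKS
    failed: bool = False


@dataclass
class ToyIterator:
    """Hypothetical stand-in for PlayIterator; models one block per host."""
    has_rescue: bool
    states: dict = field(default_factory=dict)

    def add_host(self, name):
        self.states[name] = HostState()

    def mark_host_failed(self, name):
        state = self.states[name]
        if state.run_state is RunState.TASKS and self.has_rescue:
            # A rescue section exists, so the host moves into it
            # instead of being recorded as failed.
            state.run_state = RunState.RESCUE
        else:
            state.failed = True
            state.run_state = RunState.COMPLETE

    def is_failed(self, name):
        return self.states[name].failed


def apply_any_errors_fatal(iterator, hosts_left, failed_hosts):
    """Fail every remaining host through the iterator if any host failed."""
    if not failed_hosts:
        return
    for name in hosts_left:
        if name not in failed_hosts:
            iterator.mark_host_failed(name)


def run(has_rescue):
    hosts = ["web1", "web2", "web3"]
    iterator = ToyIterator(has_rescue=has_rescue)
    for name in hosts:
        iterator.add_host(name)
    # Assume the task failed on web1 only.
    iterator.mark_host_failed("web1")
    apply_any_errors_fatal(iterator, hosts, ["web1"])
    return {name: iterator.is_failed(name) for name in hosts}
```

In this toy model, `run(False)` leaves every host failed, while `run(True)` sends every host into the rescue state without failing any of them. Because the failure is routed through the iterator rather than written straight into a failed-hosts dictionary, the rescue handling lives in one place; whether the real PlayIterator can absorb this without other side effects is exactly the open question raised above.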

It would also fix #73246 it seems.

I feel like there are too many fixes in the current (devel) implementation (dont_fail_states, using iterator.get_active_state) at this point.

There is probably a reason why this wasn't done like this when it was originally written though? 🧐

Anyway just thought I'd share this in case you wanted to experiment with that idea as well.

@s-hertel s-hertel removed the needs_triage label Jun 21, 2022
@ansibot ansibot added the stale_ci label Jun 29, 2022
@ansibot ansibot added the needs_rebase label Aug 24, 2022
@s-hertel s-hertel force-pushed the fix_linear_any_errors_fatal branch from 3bee862 to d2c8520 March 28, 2024 22:46
@s-hertel s-hertel changed the title from "fix task execution/lockstep issues with any_errors_fatal and clear_host_errors" to "fix task execution/lockstep issues with clear_host_errors" Mar 28, 2024
@ansibot ansibot removed the needs_rebase and stale_ci labels Mar 28, 2024
@ansibot ansibot added the stale_ci label Apr 11, 2024
Labels
affects_2.14, bug, has_issue, stale_ci, support:community, support:core, WIP
Projects
None yet
Development

Successfully merging this pull request may close these issues.

clear_host_errors causes unexpected behaviour with unreachable hosts
3 participants