Conversation

@s-hertel
Contributor

@s-hertel s-hertel commented Jun 6, 2024

SUMMARY

Reattempt for #83363

Fixes the issue (#83292) where, in a playbook with any_errors_fatal: true, the rescue section only executes on the host where a failing task with run_once: true is triggered, rather than affecting the other hosts as expected.
Fixes #83292

ISSUE TYPE
  • Bugfix Pull Request

@ansibot ansibot added the bug, needs_triage, has_issue, and needs_revision labels Jun 6, 2024
@s-hertel s-hertel marked this pull request as ready for review June 6, 2024 16:14
@s-hertel
Contributor Author

s-hertel commented Jun 6, 2024

Fixing unrelated test failures in #83391.

@mattclay mattclay removed the needs_triage Needs a first human triage before being processed. label Jun 6, 2024
@webknjaz webknjaz added the ci_verified Changes made in this PR are causing tests to fail. label Jun 6, 2024
@s-hertel
Contributor Author

s-hertel commented Jun 6, 2024

@webknjaz I believe all the errors are due to #83362 and unrelated to my change. Do you see something I broke since you added the ci_verified label?

@s-hertel s-hertel removed the ci_verified Changes made in this PR are causing tests to fail. label Jun 6, 2024
@s-hertel
Contributor Author

s-hertel commented Jun 6, 2024

Rebased to pull in bdc1cdf and 68638f4.

@ansibot ansibot removed the needs_revision This PR fails CI tests or a maintainer has requested a review/revision of the PR. label Jun 6, 2024
@webknjaz
Member

webknjaz commented Jun 7, 2024

I must've misinterpreted them, you're right.

@s-hertel s-hertel requested a review from mkrizek June 7, 2024 13:06
  if task.args.get('_raw_params', None) not in ('noop', 'reset_connection', 'end_host', 'role_complete', 'flush_handlers'):
      run_once = True
- if (task.any_errors_fatal or run_once) and not task.ignore_errors:
+ if task.any_errors_fatal and not task.ignore_errors:
Contributor

I am not sure this is correct. There are several things at play here. Meta tasks' results do not go through the normal results processing in StrategyBase._process_pending_results() (possibly until #83252 is merged, anyway), which contains the code that handles the run_once functionality (if the task fails and ignore_errors is not set, fail all other hosts). That means run_once is not implemented for meta tasks. I think that is why the local variable any_errors_fatal is also set for run_once: the code that handles any_errors_fatal is used to "implement" run_once for meta tasks, because effectively both features just fail all other hosts.

But even if the meta tasks' results went through the normal results processing, it still would not work. There we check original_task.run_once to trigger the run_once functionality, and that value does not reflect what we want, because above we only set run_once as a local variable for a hardcoded set of meta tasks.
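
To make the mismatch described above concrete, here is a minimal toy model. It is not Ansible's actual code: the Task dataclass and both function names are hypothetical stand-ins. The only behavior taken from the discussion is that the strategy loop keeps run_once in a local variable, while results processing reads the run_once attribute of the original task.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical stand-in for a task object; not the real Ansible class."""
    action: str
    run_once: bool = False
    any_errors_fatal: bool = False
    ignore_errors: bool = False


def strategy_local_run_once(task: Task, forced_by_strategy: bool) -> bool:
    """Model the strategy loop: run_once is a local that may be forced to True."""
    return task.run_once or forced_by_strategy


def results_check_fails_all_hosts(original_task: Task, failed: bool) -> bool:
    """Model the results-side check: it only sees the task attribute."""
    return failed and original_task.run_once and not original_task.ignore_errors


# A meta task whose run_once attribute was never set on the task itself.
meta = Task(action="meta")
local_run_once = strategy_local_run_once(meta, forced_by_strategy=True)
fails_all = results_check_fails_all_hosts(meta, failed=True)
print(local_run_once, fails_all)
```

In this model the local flag is True while the results-side check stays False, which is the gap the comment points at: a value that lives only in the strategy loop never reaches a check that reads the task attribute.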

Contributor

On the other hand, StrategyBase._execute_meta() does not appear to ever create a failing TaskResult. That would mean the any_errors_fatal code is never triggered for meta tasks, because it requires a failing task result. So run_once on a meta task only means that the task runs on a single host, and the second part of the feature (failing the other hosts when the task fails) is not needed? So this change is fine in the end??

Contributor

I would skip this particular change for this PR (that we want to backport).

  run_once = templar.template(task.run_once) or action and getattr(action, 'BYPASS_HOST_LOOP', False)

- if (task.any_errors_fatal or run_once) and not task.ignore_errors:
+ if task.any_errors_fatal and not task.ignore_errors:
Contributor

I wonder if the second paragraph from my first comment above applies here in some way too, because here we declare that run_once is equivalent to templar.template(task.run_once) or action and getattr(action, 'BYPASS_HOST_LOOP', False), but in StrategyBase._process_pending_results() we check only against original_task.run_once.

Both of my comments, however, make me wonder in what scenarios the code that handles run_once in StrategyBase._process_pending_results() is even needed, since it currently seems to be handled by the any_errors_fatal code.

Note that the code paths for run_once and any_errors_fatal are similar but not the same; they appear to differ for unreachable hosts (#82852).

Contributor

So I think this is the correct change, but as I mentioned above, it would break the following playbook:

- hosts: h1,h2
  gather_facts: false
  tasks:
    - name: missing task args, this fails
      add_host:

    - debug:
        msg: add_host is BYPASS_HOST_LOOP action and implies run_once, this should not happen

This is because we don't check for BYPASS_HOST_LOOP in StrategyBase to fail all other hosts; we only check run_once. So I think a BYPASS_HOST_LOOP check needs to be added there somehow.
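
One way the missing check could look is sketched below. This is only an illustration under stated assumptions, not the StrategyBase code: AddHostLike, DebugLike, effective_run_once, and should_fail_other_hosts are hypothetical names. The one piece taken from the diff is the expression that combines task.run_once with the action's BYPASS_HOST_LOOP attribute.

```python
class AddHostLike:
    """Hypothetical action plugin stand-in that bypasses the host loop."""
    BYPASS_HOST_LOOP = True


class DebugLike:
    """Hypothetical action plugin stand-in without the attribute."""


def effective_run_once(task_run_once, action):
    """Same shape as the expression in the diff, with explicit parentheses."""
    return bool(task_run_once or (action and getattr(action, "BYPASS_HOST_LOOP", False)))


def should_fail_other_hosts(task_run_once, action, failed, ignore_errors):
    """Hypothetical results-side check that also honors BYPASS_HOST_LOOP."""
    return failed and not ignore_errors and effective_run_once(task_run_once, action)


# run_once is not set on either task; only the action type differs.
print(should_fail_other_hosts(False, AddHostLike(), failed=True, ignore_errors=False))
print(should_fail_other_hosts(False, DebugLike(), failed=True, ignore_errors=False))
```

With a check of this shape, a failing add_host-style task would fail the other hosts even though run_once was never set on the task, while an ordinary failing task would not. How the action object would be made available at that point in the real code is left open here.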

@ansibot ansibot added the stale_ci This PR has been tested by CI more than one week ago. Close and re-open this PR to get it retested. label Jun 18, 2024
Contributor

@mkrizek mkrizek left a comment

Got back to thinking about this and found a couple of issues that need to be addressed before merging it.

@@ -0,0 +1,19 @@
+- hosts: testhost,testhost2
+  gather_facts: false
+  any_errors_fatal: "{{ any_errors_fatal | default(omit) }}"
Contributor

+100 for testing multiple combinations of run_once and any_errors_fatal. Unfortunately, due to #61025, we need to structure the tests differently (using the any_errors_fatal env var?), because as currently written, any_errors_fatal: "{{ any_errors_fatal | default(omit) }}" always evaluates to a truthy value (until data tagging, probably).
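
A rough sketch of what env-var-driven test combinations could look like follows. It assumes the env var meant above is ANSIBLE_ANY_ERRORS_FATAL, and the two playbook file names are hypothetical placeholders for one variant with run_once and one without. The sketch only builds the environment and command line for each combination and does not run anything.

```python
import itertools
import os


def build_runs(playbooks=("run_once_true.yml", "run_once_false.yml")):
    """Return (env, argv) pairs; the playbook names are hypothetical placeholders."""
    runs = []
    for fatal, playbook in itertools.product(("true", "false"), playbooks):
        env = dict(os.environ)
        # Assumed knob: the environment variable behind the any_errors_fatal setting.
        env["ANSIBLE_ANY_ERRORS_FATAL"] = fatal
        runs.append((env, ["ansible-playbook", playbook]))
    return runs


for env, argv in build_runs():
    print(env["ANSIBLE_ANY_ERRORS_FATAL"], " ".join(argv))
```

Setting the value outside the playbook avoids templating the play-level keyword altogether, which is the part that #61025 makes unreliable according to the comment above.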

@ansibot ansibot added the needs_revision This PR fails CI tests or a maintainer has requested a review/revision of the PR. label Aug 30, 2024
@ansibot ansibot added the stale_pr This PR has not been pushed to for more than one year. label Jun 10, 2025

Labels

  • bug: This issue/PR relates to a bug.
  • has_issue
  • needs_revision: This PR fails CI tests or a maintainer has requested a review/revision of the PR.
  • stale_ci: This PR has been tested by CI more than one week ago. Close and re-open this PR to get it retested.
  • stale_pr: This PR has not been pushed to for more than one year.

Development

Successfully merging this pull request may close these issues.

Ansible 2.17 returns code 2 when some hosts are failing and rescued

6 participants