
any_errors_fatal should mark all nodes failed in the summary #57138

Open
joelgriffiths opened this issue May 29, 2019 · 17 comments


@joelgriffiths

commented May 29, 2019

SUMMARY

any_errors_fatal = True stops the play on all hosts, but the summary doesn't show 'failed=1'

Even though the ansible-playbook output includes "NO MORE HOSTS LEFT", the RECAP is misleading: it says 'failed=0' for all of the nodes that did not fail, even though the play failed and any_errors_fatal is True. I would expect all of the hosts to be marked failed in this case. At the very least, I think it would be helpful to have more informative output.

ISSUE TYPE
  • Bug Report
  • Documentation Report
COMPONENT NAME
ANSIBLE VERSION
ansible 2.8.0
  config file = /Users/joelg/.ansible.cfg
  configured module search path = [u'/Users/joelg/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /Users/joelg/.pyenv/versions/2.7.16/lib/python2.7/site-packages/ansible
  executable location = /Users/joelg/.pyenv/versions/2.7.16/bin/ansible
  python version = 2.7.16 (default, May 29 2019, 11:07:40) [GCC 4.2.1 Compatible Apple LLVM 10.0.1 (clang-1001.0.46.4)]
CONFIGURATION
$ ansible-config dump --only-changed
ANSIBLE_SSH_RETRIES(/Users/joelg/.ansible.cfg) = 3
DEFAULT_FORKS(/Users/joelg/.ansible.cfg) = 30
DISPLAY_SKIPPED_HOSTS(/Users/joelg/.ansible.cfg) = False
HOST_KEY_CHECKING(/Users/joelg/.ansible.cfg) = False
OS / ENVIRONMENT
STEPS TO REPRODUCE
- hosts: all
  gather_facts: no
  any_errors_fatal: True   # abort the play on every host as soon as any host fails

  tasks:

    - name: Grab the datasource URL
      shell: "grep '^connector.url'  /opt/app/app-base/.config"
      register: data_source
      become: yes

    - name: Check for escape characters
      fail: msg="URL is escaped"
      when: "'\\\\' in data_source.stdout"

    - name: All tests have passed
      debug: msg="All tests have passed"
EXPECTED RESULTS

I would expect all the servers to say 'failed=1' in this case. At the very least, the "NO MORE HOSTS LEFT" banner should be displayed in red, or some other obvious indicator should be provided so that it's not overlooked.

ACTUAL RESULTS
$ ansible-playbook -i inventories/production/hosts --limit server* -k check-escape.yml
SSH password:

PLAY [all] ***********************************************************************************************************************************************

TASK [Grab the datasource URL] ***************************************************************************************************************************
changed: [server11.testdomain.com]
changed: [server14.testdomain.com]
changed: [server15.testdomain.com]
fatal: [server17.testdomain.com]: FAILED! => {"changed": true, "cmd": "grep '^connector.url' /opt/app/app-base/.config", "delta": "0:00:00.033841", "end": "2019-05-29 10:18:03.618201", "msg": "non-zero return code", "rc": 2, "start": "2019-05-29 10:18:03.584360", "stderr": "grep: /opt/app/app-base/.config: No such file or directory", "stderr_lines": ["grep: /opt/app/app-base/.config: No such file or directory"], "stdout": "", "stdout_lines": []}
changed: [server13.testdomain.com]
fatal: [server18.testdomain.com]: FAILED! => {"changed": true, "cmd": "grep '^connector.url' /opt/app/app-base/.config", "delta": "0:00:00.029421", "end": "2019-05-29 10:18:03.844642", "msg": "non-zero return code", "rc": 2, "start": "2019-05-29 10:18:03.815221", "stderr": "grep: /opt/app/app-base/.config: No such file or directory", "stderr_lines": ["grep: /opt/app/app-base/.config: No such file or directory"], "stdout": "", "stdout_lines": []}
fatal: [server-app1.novalocal]: FAILED! => {"changed": true, "cmd": "grep '^connector.url' /opt/app/app-base/.config", "delta": "0:00:00.035457", "end": "2019-05-29 10:18:04.912428", "msg": "non-zero return code", "rc": 2, "start": "2019-05-29 10:18:04.876971", "stderr": "grep: /opt/app/app-base/.config: No such file or directory", "stderr_lines": ["grep: /opt/app/app-base/.config: No such file or directory"], "stdout": "", "stdout_lines": []}
fatal: [server-app2.novalocal]: FAILED! => {"changed": true, "cmd": "grep '^connector.url' /opt/app/app-base/.config", "delta": "0:00:00.055259", "end": "2019-05-29 10:18:05.305324", "msg": "non-zero return code", "rc": 2, "start": "2019-05-29 10:18:05.250065", "stderr": "grep: /opt/app/app-base/.config: No such file or directory", "stderr_lines": ["grep: /opt/app/app-base/.config: No such file or directory"], "stdout": "", "stdout_lines": []}
fatal: [server-app3.novalocal]: FAILED! => {"changed": true, "cmd": "grep '^connector.url' /opt/app/app-base/.config", "delta": "0:00:00.037922", "end": "2019-05-29 10:18:05.352416", "msg": "non-zero return code", "rc": 2, "start": "2019-05-29 10:18:05.314494", "stderr": "grep: /opt/app/app-base/.config: No such file or directory", "stderr_lines": ["grep: /opt/app/app-base/.config: No such file or directory"], "stdout": "", "stdout_lines": []}
changed: [server10.testdomain.com]

NO MORE HOSTS LEFT ***************************************************************************************************************************************

NO MORE HOSTS LEFT ***************************************************************************************************************************************
    to retry, use: --limit @/Users/joelg/PROD/scripts/playbooks/check-escape.retry

PLAY RECAP ***********************************************************************************************************************************************
server-app1.novalocal : ok=0    changed=0    unreachable=0    failed=1
server-app2.novalocal : ok=0    changed=0    unreachable=0    failed=1
server-app3.novalocal : ok=0    changed=0    unreachable=0    failed=1
server10.testdomain.com : ok=1    changed=1    unreachable=0    failed=0
server11.testdomain.com : ok=1    changed=1    unreachable=0    failed=0
server13.testdomain.com : ok=1    changed=1    unreachable=0    failed=0
server14.testdomain.com : ok=1    changed=1    unreachable=0    failed=0
server15.testdomain.com : ok=1    changed=1    unreachable=0    failed=0
server17.testdomain.com : ok=0    changed=0    unreachable=0    failed=1
server18.testdomain.com : ok=0    changed=0    unreachable=0    failed=1
@mkrizek

Contributor

commented May 30, 2019

I don't think we can set failed=1 on all hosts since failed means failed tasks, not failed hosts.

This could be solved as part of #14367.

@goneri goneri removed the needs_triage label May 30, 2019
@sivel

Member

commented May 30, 2019

Question About This One

Hi!

Thanks very much for your submission to Ansible. It sincerely means a lot to us that you've taken time to contribute.

Unfortunately, we're not sure if we want this feature in the program, and I don't want this to seem confrontational. Our reasons for this are:

  • any_errors_fatal is meant to end the play when any task results in an error, stopping further execution of the play. The design intention is not to mark every host as failed when one host fails. A custom callback plugin may be able to suit your needs.

However, we're absolutely always up for discussion. Since this is a really busy project, we don't always see comments on closed tickets, but want to encourage open dialog. You can stop by the development list, and we'd be glad to talk about it - and we might even be persuaded otherwise!

In the future, sometimes starting a discussion on the development list prior to implementing a feature can make getting things included a little easier, but it's not always necessary.

Thank you once again for this and your interest in Ansible!

@sivel sivel closed this May 30, 2019
@joelgriffiths

Author

commented May 30, 2019

Your documentation says this:

The any_errors_fatal play option will mark all hosts as failed if any fails, causing an immediate abort:

- hosts: somehosts
  any_errors_fatal: true
  roles:
    - myrole
@sivel

Member

commented May 30, 2019

I will re-open as a documentation issue. The consensus amongst the core devs is that the documentation is incorrect.

@sivel sivel reopened this May 30, 2019
@sivel sivel added docs and removed bug labels May 30, 2019
@joelgriffiths

Author

commented May 30, 2019

Thank you for the response. I've started a conversation in your group. My biggest concern is that the RECAP provides no indication that the play did not finish running; it looks like the play completed properly when it didn't complete at all. This has bitten some admins at my company.

@bcoca

Member

commented May 30, 2019

It does mark hosts as failed, but not any tasks, and what you end up seeing in the summary is 'tasks per host'.

@joelgriffiths

Author

commented May 30, 2019

Maybe I'm missing something, but the "PLAY RECAP" in the "Actual Results" (above) doesn't indicate that the playbook did not finish. Surely that should be indicated in the "PLAY RECAP" somewhere?

@joelgriffiths

Author

commented Jun 7, 2019

It does mark hosts as failed, but not any tasks, and what you end up seeing in the summary is 'tasks per host'.

I'm trying to find a way to work around this. If it marks the hosts as failed, where can I access that information in the playbook so I can generate a failure for the remaining nodes? Is there any way to do this? I could possibly set hostvars if something failed, but I would have to check for failure and set it for every single task in the playbook. In addition, I would need to override the failures with failed_when for every task, including the imported roles (I think; I could be wrong about this).

I obviously disagree with the way it behaves, but I still need to have a solution regardless of how it behaves.
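
For example, something along these lines might work (an untested sketch, assuming the documented magic variables behave as described: ansible_play_hosts lists the hosts still active in the play after failed/unreachable hosts are removed, while ansible_play_hosts_all lists every host the play targeted; any_errors_fatal has to stay off so the surviving hosts keep running long enough to check):

- name: Fail the surviving hosts if any host has already failed
  # ansible_play_hosts shrinks as hosts fail, so any difference from
  # ansible_play_hosts_all means somebody dropped out of the play
  fail:
    msg: "Failed earlier in the play: {{ ansible_play_hosts_all | difference(ansible_play_hosts) | join(', ') }}"
  when: ansible_play_hosts | length < ansible_play_hosts_all | length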

@samccann

Contributor

commented Jun 10, 2019

Documentation note. Under the following section in the documentation:
https://docs.ansible.com/ansible/devel/user_guide/playbooks_error_handling.html#aborting-the-play

Remove the reference to marking all hosts as failed (as per the discussion above). Change it to something like:
The any_errors_fatal play option ends the play as soon as any task results in an error, stopping execution for all hosts.

@joelgriffiths

Author

commented Jun 11, 2019

So the fact that there is no easy way to communicate to the user that none of the plays completed is okay? I'm going to go find a hard wall and bang my head for a while. That is unconscionable.

@joelgriffiths

Author

commented Jun 11, 2019

Here is a workaround for anybody following this post:

- hosts: all
  gather_facts: no
  become: False

  tasks:
  - name: Use block to catch errors
    block:
      - name: Generate a random failure
        shell: 'echo $(( $RANDOM % 10 ))'
        register: random
        failed_when: random.stdout|int == 2

      - debug:
          msg: 'This node PASSED'

    rescue:
      - debug:
          msg: 'This node FAILED'

      # add_host runs only once per play, creating an in-memory host
      # whose presence acts as a play-wide failure flag
      - name: "Create a dummy host to hold a failure flag"
        add_host:
          name: "BAD_HOST"

  # every host that survived the block checks the flag and fails
  - name: Force EVERYTHING to fail
    fail: msg="One host failed"
    when: hostvars['BAD_HOST'] is defined
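
(If I understand it correctly, this works because add_host bypasses the host loop and runs only once per play, and the in-memory BAD_HOST entry it creates is visible to every host through hostvars, so the final fail task fires on all of the surviving hosts and the recap shows them all as failed.)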
@bcoca

Member

commented Jun 11, 2019

@joelgriffiths ansible-playbook returns a non-zero rc when a play fails, but that does not mean it flags all tasks as failed. It is not obvious from reading the stdout, but you should also notice that you are missing all the tasks that didn't get executed because of the interruption.

@joelgriffiths

Author

commented Jun 11, 2019

"... you should also notice you are missing all the tasks that didn't get executed by the interruption."

How? I really hate to be contrarian about this, but I have people running playbooks they did not write. They are not going to notice that some of the tasks were missing from the end of the play, because they don't know how many tasks were in the play. They could open the playbook and figure it out, yes, but it doesn't make sense to do that every time you run a different playbook. When something doesn't finish, it should be obvious without having to examine the code for dozens (hundreds?) of different playbooks.

As for the rc != 0 thing: the return code would be the same whether one task failed or all the hosts failed. Even if it provided useful information in this context, nobody is going to check the return code to determine whether a command just failed. That's like checking whether 'ls -la' failed by running 'echo $?' every time you ran it. Most people rely on the output of a command to determine what it did, not the return code.

I really appreciate you putting so much work into developing Ansible, so I don't want to be "that guy." Unfortunately, the approach seems so counter-intuitive that I don't understand why we even need to have this discussion. I've explained my concerns in as detailed a fashion as possible, and I'm completely baffled by the response: "We don't need to indicate that hosts did not complete the play."

When I'm this confused by an argument, it often means I'm missing part of the puzzle. If somebody wants to point out what I'm missing, that would be great.

Again, I am very sorry for pushing so hard on this. I have a workaround that works. Unless there is a better solution, I will just put a block around any playbook that needs to fail all nodes when any node fails.

@bcoca

Member

commented Jun 11, 2019

Don't worry about pushing it; it is thanks to feedback like this that the tool gets better.

For most people used to dealing with command-line utilities, the rc is the normal way to figure out success/failure. I do see your point about having a more visual representation of this in the output, but failing the rest of the tasks does not seem (to me) the correct way of handling it, as it is misleading (the hosts didn't really fail those tasks; they were never attempted). You should already have some failed task/host combination to get to this point, and the 'retry file' should have the hosts that didn't complete the play.

That said, I can see you still won't feel that this is enough, so at this point I'm thinking of adding a new event to callbacks that would print something like PLAYBOOK FAILED (in red) at the end of the summary, as an additional indication (beyond the rc, or needing to know the play well enough to read the 'fine print' of the stats).

@samccann samccann removed the easyfix label Jun 12, 2019
@samccann

Contributor

commented Jun 12, 2019

Removed the easyfix label for now as this may morph into something more than a doc fix.

@joelgriffiths

Author

commented Jun 12, 2019

I was thinking of something similar to this change:

PLAY RECAP ***********************************************************************************************************************************************
server10.testdomain.com : ok=1    changed=1    unreachable=0    failed=0
server11.testdomain.com : ok=1    changed=1    unreachable=0    failed=0
server12.testdomain.com : ok=1    changed=1    unreachable=0    failed=1
server13.testdomain.com : ok=1    changed=1    unreachable=0    failed=0

Change to something like this:

PLAY RECAP ***********************************************************************************************************************************************
server10.testdomain.com : ok=1    changed=1    unreachable=0    failed=0    incomplete=1
server11.testdomain.com : ok=1    changed=1    unreachable=0    failed=0    incomplete=1
server12.testdomain.com : ok=1    changed=1    unreachable=0    failed=1    incomplete=0
server13.testdomain.com : ok=1    changed=1    unreachable=0    failed=0    incomplete=1

@bcoca

Member

commented Jun 12, 2019

Not sure that is the way to go, since 'incomplete' can be interpreted many ways and the count can be 'unstable', especially with include_tasks/roles further down the play.
