Play still fails for "UNREACHABLE" hosts even with "ignore_errors: true" set on tasks #18075

Closed
karlmdavis opened this Issue Oct 17, 2016 · 19 comments

karlmdavis commented Oct 17, 2016

ISSUE TYPE
  • Bug Report
COMPONENT NAME

(core functionality)

ANSIBLE VERSION
(from my requirements.txt, which I use to set up a clean venv every time)
ansible==2.1.1.0
CONFIGURATION

I don't think I've modified any relevant config properties, but here's my ansible.cfg: https://github.com/HHSIDEAlab/bluebutton-data-pipeline/blob/master/bluebutton-data-pipeline-benchmarks/src/test/ansible/ansible.cfg

OS / ENVIRONMENT

Management host is Ubuntu 14.04, and the systems being managed are RHEL 7.0.

SUMMARY

I've got the following play, which still fails when the host is unreachable, even with ignore_errors: true set. The playbook it's in is a "teardown" script that needs to keep (trying to) march on, no matter what, to ensure that my AWS resources are always removed (so they aren't wasting money):

- name: Collect FHIR Server Log
  hosts: fhir
  user: "{{ ssh_user }}"
  gather_facts: false
  vars:
    ansible_ssh_pipelining: false

  tasks:

    - shell: "journalctl --unit=bbonfhir-server-app.service --no-pager &> /usr/local/bbonfhir-server-app/bbonfhir-server-app-{{ iteration_index }}.log"
      args:
        executable: /bin/bash
      become: true
      ignore_errors: true

When the "fhir" host failed to setup correctly, I observed the following error:

Running Ansible playbook...
 [WARNING]: Optional dependency 'cryptography' raised an exception, falling
back to 'Crypto'

PLAY [Collect FHIR Server Log] *************************************************

TASK [command] *****************************************************************
Monday 17 October 2016  15:37:40 -0400 (0:00:00.027)       0:00:00.027 ******** 
fatal: [fhir]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh.", "unreachable": true}

PLAY RECAP *********************************************************************
fhir                       : ok=0    changed=0    unreachable=1    failed=0   

Monday 17 October 2016  15:37:41 -0400 (0:00:00.134)       0:00:00.161 ******** 
=============================================================================== 
command ----------------------------------------------------------------- 0.13s
STEPS TO REPRODUCE

Run the play above, or one like it, against a host in your inventory that is turned off or otherwise unreachable.

EXPECTED RESULTS

I expect the playbook run to continue on to the next task.

ACTUAL RESULTS

It went boom too early.

USE CASE / JUSTIFICATION

I've got two playbooks I use here as part of a benchmark setup: the first playbook sets up a bunch of systems in EC2, starts my application running on them, and waits for it to finish the processing that I'm trying to benchmark. The second playbook scrapes the log files off of those systems and then terminates the hosts, to keep from spending unnecessary money.

This generally works okay, except that in one benchmark run just now, an iteration failed in the first playbook because EC2 didn't report the instance as ready within the expected amount of time. It turns out the instance did eventually finish setting up; EC2 is just unexpectedly slow sometimes. But because of that error, when my second playbook ran to grab the results (which wouldn't have been there: no big deal) and then remove the EC2 instances, it went boom, and AWS continued to burn money.

For a use case like this, I really need to trust that the teardown playbook will at least attempt to run every single task in it.

You can see my whole benchmarking project here: bluebutton-data-pipeline-benchmarks/src/test/ansible.

alikins (Contributor) commented Oct 19, 2016

Would it be possible to try to reproduce this with a 2.2 release candidate (https://github.com/ansible/ansible/releases/tag/v2.2.0.0-0.2.rc2 is the current rc)? There have been some recent fixes to partial-failure handling.

But I think ignore_errors only comes into play if a task was able to connect, run the module, and get back a 'failed' result. In this case, it looks like the task fails to connect to the host, so ignore_errors won't apply.
I haven't reproduced this myself, but the docs (http://docs.ansible.com/ansible/playbooks_error_handling.html#ignoring-failed-commands) say:

Note that the above system only governs the return value of failure of the particular task, so if you have an undefined variable used, it will still raise an error that users will need to address. Neither will this prevent failures on connection nor execution issues, the task must be able to run and return a value of ‘failed’.
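
For illustration, a minimal two-play sketch of that distinction (the host names are hypothetical):

- name: Task failure, which ignore_errors can handle
  hosts: reachable_host
  gather_facts: false
  tasks:
    # The connection succeeds, the module runs and returns "failed",
    # and ignore_errors lets the play continue.
    - command: /bin/false
      ignore_errors: true

- name: Connection failure, which ignore_errors cannot handle
  hosts: unreachable_host
  gather_facts: false
  tasks:
    # The SSH connection fails before any module runs, so the host
    # is marked UNREACHABLE and ignore_errors never applies.
    - command: /bin/true
      ignore_errors: true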

bcoca (Member) commented Oct 20, 2016

Possible Misunderstanding

Hi!

Thanks very much for your submission to Ansible. It sincerely means a lot to us.

We believe the ticket you have filed is being somewhat misunderstood, as one thing works a little differently than stated.

As @alikins states above, the ignore_errors feature is designed to capture task errors; connection errors are not affected by it, as the task never had a chance to execute.

If you need to clear unreachable errors, you can use a meta: clear_host_errors task to reinstate all unreachable hosts into the play.
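
For illustration, a minimal sketch of that suggestion applied to the reporter's teardown play:

- name: Collect FHIR Server Log
  hosts: fhir
  gather_facts: false
  tasks:
    # Reinstate any hosts marked failed or unreachable earlier in
    # the run before attempting the remaining cleanup tasks.
    - meta: clear_host_errors

    - shell: journalctl --unit=bbonfhir-server-app.service --no-pager
      become: true
      ignore_errors: true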

In the future, this might be a topic better suited for the user list; you can also post there if you'd like some more help with the above.

Thank you once again for this and your interest in Ansible!

karlmdavis (Author) commented Oct 20, 2016

@bcoca, some thoughts:

  1. As a user of... 9 months, I hadn't realized that Ansible was supposed to handle errors due to "unreachable hosts" differently from "errors within tasks on reachable hosts." It'd be helpful if those error classes, how they're handled by default, and how to adjust that handling when needed were explained on http://docs.ansible.com/ansible/playbooks_error_handling.html. As things stand now, that page only covers the "errors on reachable hosts" case (aside from the excerpt that @alikins mentioned, which doesn't explain how to handle the other error classes).
  2. It looks like my problem is also related to some other issues: #14665 and #13750. I'll bump to v2.2.0.0-0.2.rc2 and see if that helps at all.

Anywho, I love Ansible as it's by far the most reliable CM tool I've used (out of Puppet, Chef, and Terraform), so thanks very kindly for your work on it.

bcoca (Member) commented Oct 20, 2016

@karlmdavis the meta option is pretty recent; I just documented it in the module docs. I'll make a note to add it to the error handling page.

karlmdavis (Author) commented Oct 21, 2016

👍

mojo-iconrad commented Oct 24, 2016

With a fresh pull of the devel branch of Ansible, the meta module's clear_host_errors is not restoring hosts back into the play.
Example playbook:

- name: Create git-backed wiki page for each host in environment.
  hosts: "{{ myhosts | default('default') }}"
  become: true
  become_method: sudo
  become_user: root
  gather_facts: false


  tasks:
    - name: Define local host/path for ease of use.
      set_fact:
        base_path: "~/work/ansible-playbooks.wiki/{{ inventory_hostname }}"
        today: "{{ ansible_date_time.date }}"
        anchors: "\n * [Running Processes](#running-processes)  \n * [Installed Packages](#installed-packages)  \n * [Upgradable Packages](#upgradable-packages)  \n * [Diskspace Usage](#diskspace-usage)  \n * [Ansible Facts](#ansible-facts)  \n"

    - name: Obtain setup information for printout.
      setup:
        filter: '*'
      register: facts

    - name: Set "was_accessible" fact to 'up' for hosts that the last step succeeded on.
      set_fact:
        was_accessible: "up"

    - meta: clear_host_errors

    - name: Set "was_accessible" fact to 'down' for hosts that do not currently have it set to 'true'.
      set_fact:
        was_accessible: "down"
      when: " '{{ was_accessible | default('NA') }}' == 'NA'"

    - name: Obtain running processes (with inheritance).
      shell: "pstree 2>/dev/null || ps auxf"
      when: " '{{ was_accessible}}' == 'up'"
      register: pstree

    - name: Clean out old yum cache
      command: "yum clean all"
      when: " '{{ was_accessible }}' == 'up'"

    - name: Obtain currently-installed package list.
      shell: "rpm -qa | sort"
      when: " '{{ was_accessible }}' == 'up'"
      register: rpms

    - name: Obtain currently-available package update list
      command: "yum upgrade --assumeno"
      when: " '{{ was_accessible }}' == 'up'"
      register: yum_upgrade
      ignore_errors: true
      failed_when: "'Loaded plugins' not in '{{ yum_upgrade.stdout }}'"


    - name: Obtain diskspace utilization
      shell: "lsblk; df -h"
      register: df
      async: 15
      poll: 5
      when: " '{{ was_accessible }}' == 'up'"

    - debug: "msg='{{ ansible_play_hosts }} ; {{ ansible_play_batch }}'"
    - meta: clear_host_errors
    - debug: "msg='{{ ansible_play_hosts }} ; {{ ansible_play_batch }}'"

    - name: Ensure necessary directory exists.
      file:
        name: "{{ base_path }}"
        state: directory
        mode: "0750"
      delegate_to: localhost
      connection: local
      become: false


    - name: Write outputs to files
      copy: "content={{ item.content }} dest={{ base_path }}/{{ item.variable_name }}.part"
      delegate_to: localhost
      connection: local
      become: false
      no_log: true
      with_items:
        - { content: "# {{ inventory_hostname }}  as of {{ today }}  \n",           variable_name: "01_H1_header"       }
        - { content: " {{ anchors }} ",                                             variable_name: "02_anchors"         }
        - { content: "### Running Processes:  \n",                                  variable_name: "03_H3_header"       }
        - { content: "```\n{{ pstree.stdout | default('NA') }}\n```  \n",           variable_name: "05_pstree"          }
        - { content: "\n --- \n",                                                   variable_name: "09_rule"            }
        - { content: "### Installed packages  \n",                                  variable_name: "13_H3_header"       }
        - { content: "```\n{{ rpms.stdout | default('NA') }}\n```  \n",             variable_name: "15_rpms"            }
        - { content: "\n --- \n",                                                   variable_name: "19_rule"            }
        - { content: "### Upgradable packages  \n",                                 variable_name: "23_H3_header"       }
        - { content: "```\n{{ yum_upgrade.stdout | default('NA') }}\n```  \n",      variable_name: "25_yum_upgrade"     }
        - { content: "\n --- \n",                                                   variable_name: "29_rule"            }
        - { content: "### Diskspace usage  \n",                                     variable_name: "33_H3_header"       }
        - { content: "```\n{{df.stdout | default('NA') }}\n```  \n",                variable_name: "35_df"              }
        - { content: "\n --- \n",                                                   variable_name: "39_rule"            }
        - { content: "### ansible facts  \n",                                       variable_name: "43_H3_header"       }
        - { content: "```json\n{{ facts | to_nice_json | default('NA') }}\n```",    variable_name: "45_facts"           }

    - name: Ensure host entry is present in Sidebar.
      lineinfile:
        dest: "~/work/ansible-playbooks.wiki/_Sidebar.md"
        line: "[{{ inventory_hostname }}]({{ inventory_hostname }})  \n"
        regexp: "\\[{{ inventory_hostname }}\\]\\({{ inventory_hostname }}\\)"
        state: present
      delegate_to: localhost
      connection: local
      become: false

    - name: Concatenate created page files.
      assemble:
        src: "{{ base_path }}"
        dest: "{{ base_path }}.md"
      delegate_to: localhost
      connection: local
      become: false

    - name: Purge .part files
      file:
        name: "{{ base_path }}"
        state: absent
      delegate_to: localhost
      connection: local
      become: false

kcd83 (Contributor) commented Mar 30, 2017

I agree this could be clearer. I have a use case where a new host has port 22 open before SSH is ready (it is still awaiting cloud-init), and it would be great if the following worked.

  - name: Wait for ssh port
    wait_for: host={{ ansible_host }} port=22 timeout=360
    delegate_to: localhost
  - name: Test ssh access
    ping:
    register: result
    ignore_errors: yes
    until: result is defined and result.ping is defined and result.ping == 'pong'
    retries: 20
    delay: 5

And block/rescue does not work either:

  - block:
    - name: Test ssh access
      ping:
      register: result
    - debug: var=result
    rescue:
    - debug: msg='clear error never runs'
    - meta: clear_host_errors
    always:
    - debug: msg='this only runs on successful hosts and does not clear failed hosts'
    - meta: clear_host_errors

dryan1963 commented May 8, 2017

I am running into a similar problem with multiple plays in a playbook. meta: clear_host_errors is not honored: once a host UNREACHABLE error is thrown by an upstream play, all remaining plays are skipped. The structure looks like this (spelled out as a playbook sketch after the outline):

Using ansible 2.2.1.0

[localhost]

  • do some tasks

[remote_hosts]

  • meta: clear_host_errors
  • do some tasks, but they fail because remote_hosts is UNREACHABLE

[localhost again]

  • do some tasks whether [remote_hosts] fails or not. This is not executed because of the upstream UNREACHABLE host error.
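
Spelled out as a minimal playbook sketch (the group name and task bodies are hypothetical):

- hosts: localhost
  gather_facts: false
  tasks:
    - debug: msg="do some tasks"

- hosts: remote_hosts
  gather_facts: false
  tasks:
    - meta: clear_host_errors
    - ping: # fails; remote_hosts is UNREACHABLE

- hosts: localhost
  gather_facts: false
  tasks:
    # Expected to run whether remote_hosts failed or not, but
    # observed to be skipped once the UNREACHABLE error is thrown.
    - debug: msg="do some tasks regardless"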

shadycuz commented May 31, 2017

I have the same problem as @kcd83. Maybe I'm doing it wrong, but I use Ansible to check application health. I have fact gathering off, and I use block/rescue: if a health-check script fails, the rescue section attempts to bring the application back into a healthy state. I also use it to send a Slack webhook with the uri module. When a server is unreachable, I still want that Slack hook to fire, but rescue is skipped.
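
For reference, a minimal sketch of that pattern (the script paths and the slack_webhook_url variable are hypothetical):

- hosts: app_servers
  gather_facts: false
  tasks:
    - block:
        - name: Check application health
          command: /usr/local/bin/health_check.sh
      rescue:
        - name: Try to bring the application back to a healthy state
          command: /usr/local/bin/restart_app.sh

        - name: Send a Slack webhook about the unhealthy host
          uri:
            url: "{{ slack_webhook_url }}"
            method: POST
            body_format: json
            body:
              text: "{{ inventory_hostname }} is unhealthy"
          delegate_to: localhost
      # Reported problem: when the host is UNREACHABLE, the rescue
      # section (and with it the Slack notification) never runs.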

shadycuz commented Jun 9, 2017

@bcoca, any input on this behavior? In my search I have found a couple of similar issues opened by other people.

Thanks,
Levi

ajaybhatnagar commented Jun 13, 2017

The ability to skip unreachable nodes would be very useful in clustered application deployments. At times a node in the cluster may be down, and if some tasks are to be performed on the running nodes, the play currently aborts if any one node is unreachable (for example, when tasks are executed serially, one node at a time, to apply security patches and reboot servers so that the cluster remains operational). One would like to keep track of the nodes that were unreachable in the last run and be able to sync them when they come back online. At minimum, a play should have an option to skip a node that is not reachable.

abroglesc commented Jun 26, 2017

I agree with @ajaybhatnagar: it would be great to have an option in the playbook to skip unreachable hosts, or to be able to pass that in via a command-line argument. So +1 to that idea.

mihai-satmarean commented Jun 30, 2017

We found ourselves needing this functionality: another company merged with us, and they have a lot of machines. Our task now is to bring them under our Ansible configuration. We would like to use the Ansible ping module and, on failure, insert our keys and try again. How can we achieve this as things stand today?
Thank you!

vijaynirm commented Jul 14, 2017

Yes, this requirement is a must. Ansible should be able to skip unreachable hosts and move forward.

OscarLopezEns commented Feb 8, 2018

We need this functionality so that a playbook does not fail when a host is unreachable.

chopraaa (Contributor) commented May 19, 2018

Is there a solution for this?

zsolt-erl commented May 31, 2018

You could check whether the host is reachable on port 22 with the wait_for module (delegated to localhost, with errors ignored) and run the rest of the play based on the result.
Not a nice solution, but it should work.
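
For illustration, a minimal sketch of that workaround (the registered variable name is hypothetical):

  - name: Check that port 22 answers, from the control machine
    wait_for:
      host: "{{ ansible_host }}"
      port: 22
      timeout: 10
    delegate_to: localhost
    ignore_errors: true
    register: ssh_port_check

  - name: Run the remaining tasks only where the port check succeeded
    command: /bin/true
    when: ssh_port_check is not failed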


chopraaa (Contributor) commented May 31, 2018

@zsolt-erl This will return true, but the host is still unreachable.

dikeert commented Jul 3, 2018

Folks, I'm having a hard time understanding what the problem is.

I want to perform initial configuration on hosts that I dynamically add to a cluster. After this config is applied, initial_user and initial_password are no longer valid, so Ansible would fail with "unreachable".

This is how you overcome that:

---
- hosts: all
  vars:
    ansible_ssh_user: "{{ initial_user }}"
    ansible_ssh_pass: "{{ initial_pass }}"
  # wait_for_connection below must test the real SSH connection,
  # so no "connection: local" is set on this play
  gather_facts: no
  tasks:
    - block:
        - name: check hosts
          wait_for_connection:
            timeout: 10
        - name: clear hosts
          group_by:
            key: "working_hosts"
      rescue:
        - debug: msg="cannot connect to"

- hosts: working_hosts
  vars:
    ansible_ssh_user: "{{ initial_user }}"
    ansible_ssh_pass: "{{ initial_pass }}"
  vars_files:
    - vaults/staging.vault
  roles:
    - jumpbox

See this blogpost

I have no relation to the author of the blogpost btw =).
