Ansible hosts are randomly unreachable #18188

benapetr · 2016-10-25T13:48:20Z

ISSUE TYPE

Bug Report

COMPONENT NAME

SSH connectivity

ANSIBLE VERSION

ansible 2.0.2.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = Default w/o overrides

CONFIGURATION

[defaults]
host_key_checking = False
timeout = 10
retry_files_save_path = /home/ansible/retry/
# Do not put more jobs here, or ssh will fail
forks = 10
remote_user = root
log_path=/var/log/ansible.log

[ssh_connection]
pipelining=True

OS / ENVIRONMENT

RedHat 7.2

SUMMARY

We have about 2000 hosts managed by Ansible, and every time I run any playbook or command on all of them, about 3% are reported as "UNREACHABLE". When I restart the task, a different random set of servers is UNREACHABLE. The hosts are not actually unreachable, and there is no network outage or anything like that.

If I run a bash for loop that opens an SSH connection to each of these 2000 servers, it works without trouble, so there is clearly no issue with SSH or network connectivity itself.

I suspect this is a problem with timeouts and the way Ansible determines that a host is unreachable.

STEPS TO REPRODUCE

ansible all -a 'echo test'

EXPECTED RESULTS

I expect the command to execute on 100% of hosts

ACTUAL RESULTS

It gets executed only on some hosts, and random hosts are considered unreachable.

The output is too long to include; it looks something like
server_name | SUCCESS | rc=0 | (stdout) ok
with about 1900 lines of SUCCESS and 100 UNREACHABLE, even though the servers are perfectly reachable.

Note that similar bug reports concerning Amazon EC2 can be found on other forums, for example http://stackoverflow.com/questions/39973103/ansible-ec2-random-ssh-connection-failures-after-provision

ryansb · 2016-10-25T17:12:25Z

Can you add some output with -vvvv to show the SSH commands being run to get out to the remote hosts? My first instinct would be to raise the timeout in your ansible config, because it may not be possible to handle that many connections with 10 workers on that timeout.

ryansb · 2016-10-25T17:12:30Z

needs_info

benapetr · 2016-11-03T18:24:18Z

This really isn't so easy. I might try to do what you ask, but keep in mind that it would produce hundreds of thousands of debug lines. We manage about 2000 servers with Ansible, and this can be easily reproduced only when I run it on all of them.

On the other hand, using the "retries" option in the config file and setting it to a high value fixed this as a workaround. With -vv I am now getting a lot of

ssh_retry: attempt: 1, ssh return code is 255. cmd (/bin/sh -c 'LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 /usr/bin/python'...),

hackermd · 2017-06-24T20:17:32Z

I observed the same problem with a large number of hosts. Reducing the number of forks (to the default value) solved the problem for me.

dachshund-digital · 2017-09-19T15:03:03Z

I see this same behavior with just 3 hosts, using Ansible 1.7.2 on Debian 9 (stretch) on 2 VMware VMs, and Debian 8 (jessie) on a Raspberry Pi. It is completely random when it happens.

rishibamba · 2017-10-04T06:43:07Z

Facing random "UNREACHABLE!" error on hosts.
Return Message: Failed to connect to the host via ssh: Couldn't read packet: Connection reset by peer

Ansible version: 2.3.1.0

arigesher · 2018-02-21T19:25:42Z

I think this can be solved with the retries parameter in ansible.cfg.

Resetting Unreachable Hosts
New in version 2.2.

Connection failures set hosts as ‘UNREACHABLE’, which will remove them from the list of active hosts for the run. To recover from these issues you can use meta: clear_host_errors to have all currently flagged hosts reactivated, so subsequent tasks can try to use them again.

https://docs.ansible.com/ansible/latest/intro_configuration.html#retries

Add this to your ansible.cfg:

[ssh_connection]
retries=10

sivel · 2019-03-01T16:26:55Z

This problem comes down to local considerations. Many solutions have been proposed here that can help reduce failures:

- Reduce the fork count
- Configure SSH retries
- Use a controller that is closer to the hosts being managed
- Adjust SSH settings such as ServerAliveInterval and TCPKeepAlive
- Adjust timeout settings
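
The adjustments listed above could be combined in ansible.cfg roughly as follows. This is a minimal sketch rather than a recommended configuration: every value is an illustrative assumption to be tuned for the environment, and the ssh_args line repeats Ansible's usual ControlMaster/ControlPersist options on the assumption that setting ssh_args replaces the defaults.

```ini
[defaults]
# Fewer parallel workers; 5 is Ansible's default (the reporter used 10)
forks = 5
# Connection timeout in seconds; 30 is an assumed value (the reporter used 10)
timeout = 30

[ssh_connection]
pipelining = True
# Retry failed SSH connections; 10 is the value suggested earlier in this thread
retries = 10
# Keepalive options are passed through to ssh; the interval is an assumed value
ssh_args = -C -o ControlMaster=auto -o ControlPersist=60s -o ServerAliveInterval=30 -o TCPKeepAlive=yes
```

Changing one setting at a time and re-running `ansible all -a 'echo test'` should make it easier to see which adjustment actually reduces the UNREACHABLE count.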

If you have further questions please stop by IRC or the mailing list:

IRC: #ansible on irc.freenode.net
mailing list: https://groups.google.com/forum/#!forum/ansible-project

ansibot added bug_report affects_2.0 This issue/PR affects Ansible v2.0 labels Oct 25, 2016

ansibot added the needs_info This issue requires further information. Please answer any outstanding questions. label Oct 25, 2016

ansibot removed the needs_info This issue requires further information. Please answer any outstanding questions. label Nov 3, 2016

ansibot added the support:core This issue/PR relates to code supported by the Ansible Engineering Team. label Jun 29, 2017

ansibot added bug This issue/PR relates to a bug. and removed bug_report labels Mar 1, 2018

piyushkv1 mentioned this issue Jan 19, 2019

Running prerequisites.yml one node lost connection. add retries to fix. openshift/openshift-ansible#11024

Closed

sivel closed this as completed Mar 1, 2019

ansible locked and limited conversation to collaborators Jul 25, 2019
