Ansible hosts are randomly unreachable #18188

Closed
benapetr opened this issue Oct 25, 2016 · 8 comments
Labels
affects_2.0 This issue/PR affects Ansible v2.0 bug This issue/PR relates to a bug. support:core This issue/PR relates to code supported by the Ansible Engineering Team.

Comments

@benapetr

ISSUE TYPE
  • Bug Report
COMPONENT NAME

SSH connectivity

ANSIBLE VERSION
ansible 2.0.2.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = Default w/o overrides
CONFIGURATION
[defaults]
host_key_checking = False
timeout = 10
retry_files_save_path = /home/ansible/retry/
# Do not put more jobs here, or ssh will fail
forks = 10
remote_user = root
log_path=/var/log/ansible.log

[ssh_connection]
pipelining=True
OS / ENVIRONMENT

RedHat 7.2

SUMMARY

We have about 2000 hosts managed by Ansible, and every time I run any playbook or command against all of them, about 3% come back as "UNREACHABLE". When I rerun the task, a different random set of servers is UNREACHABLE. The hosts are in fact reachable, and there is no network outage or anything like that.

If I run a loop (a bash for loop) that opens an SSH connection to each of these 2000 servers, it works without trouble, so there is clearly no issue with SSH or network connectivity itself.

I suspect this is a problem with timeouts and with how Ansible decides that a host is unreachable.

STEPS TO REPRODUCE
ansible all -a 'echo test'
EXPECTED RESULTS

I expect the command to execute on 100% of hosts

ACTUAL RESULTS

It gets executed only on some hosts and random hosts are considered unreachable

The output is too long to paste, but it looks pretty much like this:
server_name | SUCCESS | rc=0 | (stdout) ok
That is about 1900 lines of SUCCESS and 100 of UNREACHABLE, even though the servers are perfectly reachable.

Note that similar bug reports can be found on other forums regarding Amazon EC2: http://stackoverflow.com/questions/39973103/ansible-ec2-random-ssh-connection-failures-after-provision

@ansibot ansibot added bug_report affects_2.0 This issue/PR affects Ansible v2.0 labels Oct 25, 2016
@ryansb
Contributor

ryansb commented Oct 25, 2016

Can you add some output with -vvvv to show the SSH commands being run to get out to the remote hosts? My first instinct would be to raise the timeout in your ansible config, because it may not be possible to handle that many connections with 10 workers on that timeout.

@ryansb
Contributor

ryansb commented Oct 25, 2016

needs_info

@ansibot ansibot added the needs_info This issue requires further information. Please answer any outstanding questions. label Oct 25, 2016
@benapetr
Author

benapetr commented Nov 3, 2016

This really isn't so easy. I might try to do what you want, but keep in mind that it would produce hundreds of thousands of debug lines. We are managing about 2000 servers with Ansible, and this can be easily reproduced only when I run against all of them.

On the other hand, setting the "retries" option in the config file to a high value fixed this as a workaround. With -vv I am now getting a lot of:

ssh_retry: attempt: 1, ssh return code is 255. cmd (/bin/sh -c 'LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 /usr/bin/python'...),

@ansibot ansibot removed the needs_info This issue requires further information. Please answer any outstanding questions. label Nov 3, 2016
@hackermd

I observed the same problem with a large number of hosts. Reducing the number of forks (to the default value) solved the problem for me.

@ansibot ansibot added the support:core This issue/PR relates to code supported by the Ansible Engineering Team. label Jun 29, 2017
@dachshund-digital

I see this same behavior with just 3 hosts, using ansible 1.7.2 on debian 9 (stretch) on 2 VMware VMs, and debian 8 (jessie) on a raspberry pi. Completely random when it happens.

@rishibamba

Facing random "UNREACHABLE!" error on hosts.
Return Message: Failed to connect to the host via ssh: Couldn't read packet: Connection reset by peer

Ansible version: 2.3.1.0

@arigesher

I think this can be solved with the retries parameter in ansible.cfg.

Resetting Unreachable Hosts
New in version 2.2.

Connection failures set hosts as ‘UNREACHABLE’, which will remove them from the list of active hosts for the run. To recover from these issues you can use meta: clear_host_errors to have all currently flagged hosts reactivated, so subsequent tasks can try to use them again.

https://docs.ansible.com/ansible/latest/intro_configuration.html#retries

Add this to your ansible.cfg:

[ssh_connection]
retries=10

@sivel
Member

sivel commented Mar 1, 2019

This problem comes down to local considerations. Several solutions have been proposed here that can help reduce failures:

  1. Reduce fork count
  2. Configure ssh retries
  3. Use a controller that is closer to the hosts being managed
  4. Adjust ssh settings such as ServerAliveInterval, TCPKeepAlive
  5. Adjust timeout settings
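
As a rough sketch, items 1, 2, 4 and 5 could be combined in ansible.cfg along these lines. The values are illustrative assumptions, not tested recommendations: forks = 5 is Ansible's default, retries = 10 matches the value suggested earlier in this thread, and the timeout and keepalive numbers are arbitrary starting points to tune for your own network. Note that setting ssh_args replaces Ansible's default value, so the ControlMaster and ControlPersist options are restated here.

```ini
[defaults]
# Fewer parallel workers means fewer simultaneous SSH connections
# from the controller (5 is the Ansible default).
forks = 5
# Connection timeout in seconds; 30 is an assumed value, tune as needed.
timeout = 30

[ssh_connection]
# Retry a failed SSH connection attempt before marking the host UNREACHABLE.
retries = 10
# Keepalive options are standard OpenSSH client options; the intervals
# below are assumptions. ControlMaster/ControlPersist are restated because
# ssh_args overrides the default argument list.
ssh_args = -o ControlMaster=auto -o ControlPersist=60s -o ServerAliveInterval=30 -o ServerAliveCountMax=3 -o TCPKeepAlive=yes
```

These settings only reduce the chance of a transient connection failure being reported as UNREACHABLE; item 3, moving the controller closer to the managed hosts, is not something a config file can address.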

If you have further questions please stop by IRC or the mailing list:

@sivel sivel closed this as completed Mar 1, 2019
@ansible ansible locked and limited conversation to collaborators Jul 25, 2019