Task fails when ControlPersist is about to expire #16731

Closed
kustodian opened this issue Jul 15, 2016 · 11 comments
Labels
affects_2.1, bug, c:plugins/connection/ssh

Comments

@kustodian
Contributor

kustodian commented Jul 15, 2016

ISSUE TYPE

Bug Report

COMPONENT NAME

ssh controlpersist

ANSIBLE VERSION
ansible 2.1.1.0
Tried it on stable-2.1 and some other branches.
CONFIGURATION

forks = 200
ssh_args = -o ControlMaster=auto -o ControlPersist=60s
The ssh_args value above is the default, but I'm including it here since it's important for this bug.

OS / ENVIRONMENT

CentOS 6/Ubuntu 16.04

SUMMARY

If you run a play against a lot of hosts (ours has 130) and a task happens to execute just as the ControlPersist timeout is about to expire, random hosts can fail with the following SSH error:

Failed to connect to the host via ssh. (255, '', 'OpenSSH_5.3p1, OpenSSL 1.0.1e-fips 11 Feb 2013
debug1: Reading configuration data /home/user/.ssh/config
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Applying options for *
debug1: auto-mux: Trying existing master
debug2: fd 3 setting O_NONBLOCK
mux_client_hello_exchange: write packet: Broken pipe)

My guess is that Ansible is just about to reuse the SSH socket to connect to the remote host when, at that exact moment, ControlMaster closes the socket to the server. The playbook below, which reproduces this, should make it clearer what I mean.
Also, a big thanks to @jctanner for providing a patch that let us see this SSH error at all, since Ansible normally hides these errors; I reported that separately in #16732.

STEPS TO REPRODUCE

To reproduce this bug you will need as many hosts in the playbook as you can come up with, all with different IP addresses, so that SSH's ControlMaster (CM) creates a new socket for each host. We have 130 hosts in our play and it manages to fail every time; the more hosts you have in a play, the easier it is to catch this bug. It should fail with fewer hosts too, but it will be harder to capture.

---
- hosts: large_group_of_hosts
  gather_facts: no
  tasks:
    - ping:
    - ping:
    - pause: seconds=55
    - ping:
      with_items: [ 1, 2, 3]

The idea behind this play is to run the ping module twice in a row: the first run connects to all the hosts, and the second run is fast because CM already holds an SSH connection to each host, so all hosts execute ping quickly and within a similar time frame. We then pause for 55 seconds so that, when ping runs after the pause, the ControlPersist=60s timeout is expiring on all hosts within the remaining ~5 seconds while ping is being executed, and Ansible reports random hosts as unreachable with the SSH error above.
I used with_items so that ping is executed on each host three times, because that made the error even easier to trigger. I suspect lowering the number of forks below the number of hosts (e.g. 50 in this situation) would also help, since for me it failed with forks = 50 as well.
You might need to tweak the pause duration depending on how fast ping executes on your hosts, but you should be able to reproduce this issue if you have enough hosts in the play.
If we disable ControlMaster, or if we set ControlPersist to a higher value (e.g. 120s), all hosts finish without a problem.
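For illustration, either of the following ansible.cfg settings (values are only examples) makes the play above finish cleanly:

[ssh_connection]
# raise the timeout well past the longest gap between tasks
ssh_args = -o ControlMaster=auto -o ControlPersist=120s
# or disable multiplexing entirely (slower, but there is no shared socket to expire)
#ssh_args = -o ControlMaster=no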

EXPECTED RESULTS

Playbook should finish execution.

ACTUAL RESULTS

Random hosts return as unreachable.

@kustodian
Contributor Author

I guess this is an SSH client bug, but it's bad that Ansible fails on a host because of it. I think the best solution would be for Ansible to detect this error while connecting to a host and, when it happens, retry the connection.

@alikins
Contributor

alikins commented Jul 15, 2016

I'd be curious whether 9b7d782 affects this. I don't think it would fix it; it seems like the two could interact poorly.

@bcoca
Member

bcoca commented Jul 15, 2016

There is a 'retries' configuration for ssh_connection that should mitigate this issue.

@kustodian
Contributor Author

@bcoca setting retries = 1 helps, of course, but it's more of a workaround.
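For reference, that workaround is a one-line setting in ansible.cfg:

[ssh_connection]
# retry the whole connection once on any SSH failure
retries = 1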

@abadger
Contributor

abadger commented Mar 30, 2017

This feels like a bug/missing feature in the ssh client. I don't think we can satisfactorily solve it once it's gotten to the Ansible code because we don't know for sure whether any part of the command was sent to the remote side and executed (it could be dangerous for us to automatically retry if some part of the command was already executed remotely).

Maybe we should document that ControlPersist should be set higher than the expected maximum length of time spent on any one play (e.g. with 1000 hosts, 20 forks, and a 5-minute task, the hosts run in 1000/20 = 50 batches of 5 minutes each, so ControlPersist needs to be greater than 50 × 5 = 250 minutes). Along with that, it would probably be good to add a config switch to have Ansible try to clean up the sockets when it is done. That way the user can set a high ControlPersist and still choose not to have the sockets lying around on the filesystem (protected by Unix file permissions, but still...).
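Concretely, such a setting could look like this in ansible.cfg (illustrative; ControlPersist accepts sshd_config-style time units):

[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=300m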

@bcoca
Member

bcoca commented Mar 31, 2017

FYI, meta: reset_connection will remove those sockets, so you can currently end the play with it to avoid lingering connections.
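For example (a sketch; meta: reset_connection is only available in Ansible 2.3 and later):

- hosts: large_group_of_hosts
  tasks:
    - ping:
    # ... the rest of the play ...
    - meta: reset_connection  # drops the ControlPersist socket for each host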

@kustodian
Contributor Author

I just ran the test playbook again, but because Ansible has become a lot faster, I had to increase the pause to 58s. The same bug is still there. I tested on OpenSSH 5.3p1 and 7.4p1 and it happens on both.

@bcoca that is again a workaround, and not a convenient one, because you have to keep adding that meta task in several places.

@abadger I don't understand why it is a problem even if some part of the command was executed: Ansible modules are written to perform critical operations atomically, and they are also written to be idempotent, so retrying them is not a big issue. But let's ignore all that: how would this retry, triggered by an SSH bug, be any different from setting retries = 1?

Also, I checked how SSH ControlMaster works: mux_client_hello_exchange happens while establishing the connection to the host when ControlPersist is enabled, and this fix only retries if SSH fails with exactly the mux_client_hello_exchange: write packet: Broken pipe error. That means no part of the command can have been executed when this error is thrown, which makes the retry safe. I would even go as far as to say this is safer than setting retries = 1, because retries will always retry a connection, no matter the reason for the disconnect.
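In code terms, the proposed fix (see #16787) boils down to something like the following; this is a minimal standalone Python sketch with illustrative names, not Ansible's actual plugin code:

import subprocess

# stderr marker showing the mux master died during the hello exchange,
# i.e. before any remote command could have run (see this issue)
BROKEN_PIPE_MARKER = b'mux_client_hello_exchange: write packet: Broken pipe'

def run_ssh(cmd, in_data=None):
    # Run one ssh invocation and capture (rc, stdout, stderr).
    p = subprocess.Popen(cmd, stdin=subprocess.PIPE,
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = p.communicate(in_data)
    return p.returncode, stdout, stderr

def run_ssh_with_mux_retry(cmd, in_data=None):
    rc, stdout, stderr = run_ssh(cmd, in_data)
    # OpenSSH exits 255 on connection errors. If stderr also shows the mux
    # hello failing, nothing reached the remote host, so a single retry is
    # safe: the second attempt simply starts a fresh ControlMaster.
    if rc == 255 and BROKEN_PIPE_MARKER in stderr:
        rc, stdout, stderr = run_ssh(cmd, in_data)
    return rc, stdout, stderr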

@abadger
Contributor

abadger commented Apr 6, 2017

@nitzmahone ^ What do you think of kustodian's explanation?

@nitzmahone
Member

(commented on review for #16787)

kustodian added a commit to kustodian/ansible that referenced this issue Apr 7, 2017
Ansible will now automatically retry a connection if SSH returns an error:

mux_client_hello_exchange: write packet: Broken pipe

This is probably a bug in SSH, but because it's safe to retry this
connection there is no need for Ansible to fail because of it.
@abadger abadger closed this as completed in 9f0be5a Apr 7, 2017
abadger pushed a commit that referenced this issue Apr 7, 2017 (cherry picked from commit 9f0be5a; same commit message as above)
abadger pushed two more commits that referenced this issue Apr 13, 2017 (both cherry picked from commit 9f0be5a, with the same commit message)
@karniemi

This bug ticket is closed, so I assume the fix is already in some release? Which one? I tried to find out by following all those commit and merge links, but that turned out to be pretty difficult and I couldn't work it out.

@kustodian
Contributor Author

kustodian commented Jun 15, 2017 via email

@ansible ansible locked and limited conversation to collaborators Apr 25, 2019