Task fails when ControlPersist is about to expire #16731
Comments
|
I guess this is an SSH client bug, but it's bad that Ansible fails on a host because of it. I think the best solution would be for Ansible to detect this error while it's connecting to a host and, if it happens, retry the connection. |
|
Would be curious if 9b7d782 affects this. I don't think it would fix it; seems like it could possibly interact poorly. |
|
There is a 'retries' configuration for ssh_connection that should mitigate this issue. |
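For reference, that option lives in the `[ssh_connection]` section of `ansible.cfg`; a minimal sketch (the value 3 is an arbitrary example, not a recommendation from this thread):

```ini
[ssh_connection]
# Number of times Ansible retries establishing the SSH connection
retries = 3
```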
|
@bcoca setting |
|
This feels like a bug/missing feature in the ssh client. I don't think we can satisfactorily solve it once it's gotten to the Ansible code, because we don't know for sure whether any part of the command was sent to the remote side and executed (it could be dangerous for us to automatically retry if some part of the command was already executed remotely).

Maybe we should document that ControlPersist should be set higher than the expected maximum length of time spent on any one task (i.e. 1000 hosts, 20 forks, 5 minute task... ControlPersist needs to be greater than 250 minutes).

Along with that, it would probably be good to add a config switch to have Ansible try to clean up the sockets when it is done. That way the user can set a high ControlPersist and still choose not to have the sockets lying around on the filesystem (protected by Unix file permissions, but still...) |
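A quick back-of-the-envelope check of the sizing rule in that comment, using its 1000-host example:

```python
# Worst case: the oldest idle ControlMaster socket must survive until the
# last batch of forks has finished its task.
hosts = 1000
forks = 20
task_minutes = 5

batches = hosts / forks                          # 50 batches of hosts
minimum_controlpersist = batches * task_minutes  # minutes the socket must idle
print(minimum_controlpersist)                    # -> 250.0
```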
|
FYI, |
|
I just ran the test playbook again, but because Ansible became a lot faster, I had to increase the pause to

@bcoca that is again a workaround, and not a convenient one, because you have to put that

@abadger I don't understand why it is a problem even if some part of the command was executed: Ansible modules were written to do critical operations atomically, but they are also written to be idempotent, so it's not a big issue if you retry them. But let's ignore all that; how would this retry because of an SSH bug be different from setting

Also I checked how SSH ControlMaster works and |
|
@nitzmahone ^ What do you think of kustodian's explanation? |
|
(commented on review for #16787) |
Ansible will now automatically retry a connection if SSH returns an error: mux_client_hello_exchange: write packet: Broken pipe This is probably a bug in SSH, but because it's safe to retry this connection there is no need for Ansible to fail because of it.
Ansible will now automatically retry a connection if SSH returns an error: mux_client_hello_exchange: write packet: Broken pipe This is probably a bug in SSH, but because it's safe to retry this connection there is no need for Ansible to fail because of it. (cherry picked from commit 9f0be5a)
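As an illustration of why that particular error is safe to retry, here is a hedged sketch (not Ansible's actual implementation) of detecting the mux handshake failure on stderr and re-running the command; since the failure happens before anything is sent to the remote side, reconnecting cannot repeat remote work:

```python
# Sketch only: retry a command when ssh's ControlMaster socket dies
# during the mux hello exchange, before any remote execution starts.
import subprocess

RETRYABLE = b"mux_client_hello_exchange: write packet: Broken pipe"

def run_with_retry(cmd, attempts=3):
    """Run cmd, retrying only on the known-safe mux handshake error."""
    for _ in range(attempts):
        proc = subprocess.run(cmd, capture_output=True)
        if proc.returncode == 0 or RETRYABLE not in proc.stderr:
            return proc
        # Handshake failed before anything ran remotely; safe to retry.
    return proc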
|
This bug ticket is closed, so I assume that the fix is already in some release? Which one? I tried to find out by following all those commit & merge links, but that seems to be pretty difficult and I could not figure it out. |
|
It will be released with Ansible 2.4. But you've got a good point that it's not obvious from this issue in which version it was fixed.
|
ISSUE TYPE
Bug Report
COMPONENT NAME
ssh controlpersist
ANSIBLE VERSION
CONFIGURATION
forks = 200
ssh_args = -o ControlMaster=auto -o ControlPersist=60s
The ssh_args above are the default, but I put them here since it's important for this bug.
OS / ENVIRONMENT
CentOS 6/Ubuntu 16.04
SUMMARY
If you run a play with a lot of hosts (our play has 130 hosts) and there is a task which is executed just when the `ControlPersist` timeout is about to expire, random hosts can fail with the following SSH error:

mux_client_hello_exchange: write packet: Broken pipe

My guess is that Ansible is just about to reuse the SSH socket to connect to the remote host, but at that moment ControlMaster closes the socket to the server. The playbook which can reproduce this will make it clearer what I mean here.
Also a big thanks to @jctanner for providing a patch so that we could see this SSH error, since Ansible hides them. I reported a bug about it in #16732.
STEPS TO REPRODUCE
To reproduce this bug you will need as many hosts in the playbook as you can come up with, and they all have to have different IP addresses, so that SSH's ControlMaster (CM) creates a new socket for each of the hosts. We have 130 hosts in our play and this play manages to fail every time; the more hosts you have in a play, the easier it is to catch this bug. It should fail with fewer, but it will be harder to capture.
The idea behind this play is that we run the `ping` module two times in a row, so that the first time `ping` connects to all the hosts, while the second time it runs faster since CM will already hold an SSH connection to each host, which means all hosts will execute `ping` very fast and in a similar time frame. Then we pause for 55 seconds, so that when we run ping after that, the `ControlPersist=60s` timeout will be expiring on all hosts within those 5 seconds while ping is being executed, and Ansible will report random hosts as Unreachable because of the SSH error above.

I used `with_items` so that ping is executed on each host 3 times, because this made it even easier for the error to occur. I guess lowering the number of forks to less than the number of hosts (e.g. 50 in this situation) would also help, since for me it failed with `forks = 50` as well. You might need to tweak the number of seconds of pause depending on how fast ping executes on the hosts, but you should be able to reproduce this issue if you have enough hosts in the play.
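The original test playbook is not shown above, but based on the description a minimal sketch might look like this (the 55-second pause and the 3 `with_items` iterations come from the text; everything else is an assumption):

```yaml
- hosts: all
  gather_facts: no
  tasks:
    - ping:        # first run: ControlMaster opens a socket to every host
    - ping:        # second run: fast, every host reuses its CM socket
    - pause:
        seconds: 55
    - ping:        # runs while ControlPersist=60s is expiring on all hosts
      with_items: [1, 2, 3]
```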
If we disable `ControlMaster`, or if we set `ControlPersist` to a higher number (e.g. `120s`), all hosts finish without a problem.
EXPECTED RESULTS
Playbook should finish execution.
ACTUAL RESULTS
Random hosts return as unreachable.