Task failures with rc -13 (SIGPIPE) related to SSH ControlPersist expiry #81777
Comments
As an aside, I believe the codepath in https://github.com/ansible/ansible/blob/v2.15.3/lib/ansible/plugins/connection/ssh.py#L1186-L1193, as originally committed in #16787 as a fix for #16731, doesn't really apply anymore in any recent versions of OpenSSH. The openssh Or exit with |
Files identified in the description: If these files are incorrect, please update the |
So are you suggesting something like:
diff --git a/lib/ansible/plugins/connection/ssh.py b/lib/ansible/plugins/connection/ssh.py
index 49b2ed22fc..3fb7bb1e80 100644
--- a/lib/ansible/plugins/connection/ssh.py
+++ b/lib/ansible/plugins/connection/ssh.py
@@ -1216,6 +1216,9 @@ class Connection(ConnectionBase):
             elif in_data and checkrc:
                 raise AnsibleConnectionFailure('Data could not be sent to remote host "%s". Make sure this host can be reached over ssh: %s'
                                                % (self.host, additional))
+        elif p.returncode == -13:
+            # handle openssh/python race condition on control persist
+            raise AnsibleControlPersistBrokenPipeError('Data could not be sent because of ControlPersist broken pipe: %s' % to_native(stderr))
         return (p.returncode, b_stdout, b_stderr)
?
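For reference, `subprocess.Popen.returncode` is an integer, and a child killed by a signal is reported as the negated signal number, so a check like the one proposed above would compare against `-signal.SIGPIPE` rather than a string. The following is a minimal sketch, not part of the Ansible code base: the helper name `died_from_sigpipe` is hypothetical, and the demonstration assumes a POSIX `sh` is available.

```python
import signal
import subprocess


def died_from_sigpipe(returncode: int) -> bool:
    """Return True if a Popen-style return code means the child was killed by SIGPIPE."""
    # Popen reports death by signal N as the integer -N.
    return returncode == -signal.SIGPIPE


# Demonstration only: a shell that sends SIGPIPE to itself.
# subprocess restores the default SIGPIPE handler in the child (restore_signals=True),
# so the shell is terminated by the signal.
proc = subprocess.run(["sh", "-c", "kill -s PIPE $$"])

print("returncode:", proc.returncode)
print("killed by SIGPIPE:", died_from_sigpipe(proc.returncode))
print("equal to the string '-13':", proc.returncode == "-13")
```

Using the `signal` constant instead of a literal keeps the check independent of the platform's signal numbering.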
I think that would fix this case (with TBH this is really an openssh bug fixed in OpenSSH 9.1, but it's going to be a while before that's in widespread deployment (RHEL 9 is on OpenSSH 8.7?), so ideally Ansible could ship some kind of workaround, particularly since the problematic One way to look at this is that based on my understanding of python/cpython#45993, Ansible running on versions of Python prior to 3.2 will have exec'd the EDIT: With Python 3.2+ and the |
Considering that connection plugins run on controller and Ansible never supported running on Python 3.0-3.2, I think it is safe to ignore that part. |
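The `restore_signals` behavior discussed in this thread can be observed with a small sketch. It assumes a POSIX `sh` on the controller and is only an illustration of `subprocess` semantics, not of what the ssh connection plugin does. The child shell sends itself `SIGPIPE` and prints a marker if it survives.

```python
import signal
import subprocess

# CPython ignores SIGPIPE at interpreter startup; set it explicitly so the
# sketch does not depend on how this process was launched.
signal.signal(signal.SIGPIPE, signal.SIG_IGN)

# The child sends SIGPIPE to itself, then prints a marker if it is still alive.
CHILD = ["sh", "-c", "kill -s PIPE $$; echo survived"]

# restore_signals=False: the child inherits the ignored SIGPIPE disposition.
inherited = subprocess.run(CHILD, restore_signals=False, capture_output=True, text=True)

# restore_signals=True (the default since Python 3.2): SIGPIPE is reset to
# SIG_DFL before exec, so the signal terminates the child.
restored = subprocess.run(CHILD, restore_signals=True, capture_output=True, text=True)

print("inherited:", inherited.returncode, repr(inherited.stdout))
print("restored: ", restored.returncode, repr(restored.stdout))
```

With the default setting, any child that does not install its own `SIGPIPE` handling runs with the default action, which is the situation described for the `ssh` client in this issue.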
There are various bugs mentioning rc=-13, but I'd like to link this one which has a nice analysis to this other one which provides a workaround of increasing ControlPersist timeout. |
Summary
After upgrading from Ansible 4.10 -> Ansible 8.2.0 (ansible-core 2.15.3), our Jenkins pipelines started encountering occasional module failures with several different ansible-playbooks in various `command`, `file` or "Gathering facts" -> `setup` tasks, all with `"rc": -13` (i.e. `SIGPIPE`).
Investigating the issue over a longer period of time, it seems to occur when the interval between task executions on a given host coincides with the Ansible ssh plugin's `ControlPersist=60s` idle timeout.
I believe the `ssh` client process on the Ansible controller is racing the `mux_client_hello_exchange` UNIX socket connect -> write with the auto-mux server closing the control socket. This seems to lead to the SSH client process dying with a SIGPIPE signal from the kernel, and the Ansible ssh connection plugin sees a `-13` process exit code.
This issue is probably related to #77450 (same root cause) and #16731 (earlier fix for the same issue).
The upstream OpenSSH bz#3454 and the openssh/openssh-portable@96faa0d fix to `ssh` SIGPIPE handling for this exact case, as released in OpenSSH 9.1, are also related.
The Python subprocess issue (python/cpython#45993) mentioned in #77450, and the related changes to `subprocess.Popen(restore_signals=True)` behavior in Python 3.2 as regards the effective SIGPIPE handler in the `ssh` child process forked by Ansible, are probably also related.
The end result appears to be that for specific combinations of Python and OpenSSH, the `ssh` client forked by the Ansible ssh connection plugin executes the `mux_client_hello_exchange` socket write with the default `SIGPIPE` handler, leading to an occasional race condition with a -13 exit status code and the resulting sporadic Ansible task failures.
Issue Type
Bug Report
Component Name
ssh
Ansible Version
Configuration
OS / Environment
CentOS Stream 8, from the Docker `quay.io/centos/centos:stream8` image @sha256:e4e81a5e6be8f8f7eb511a8df3afcd4e7123e68c56bc03efc40fbd0ab5b2e4fd
OpenSSH_8.0p1
Steps to Reproduce
This is a timing-related race condition, and thus difficult to reproduce reliably. The following playbook is an attempt to trigger the ssh control-master race on the "Test 2" task with a variable sleep, but I was not able to repro the actual `SIGPIPE` -> rc `-13` on the second `command` task. What did work was to run the same playbook several times in a row in our Jenkins pipeline, and the failure would sometimes occur during the "Gathering Facts" -> `setup` task on a subsequent playbook run, where the previous playbook run completed for that host approximately 10 seconds earlier.
Expected Results
I expect the playbook tasks to run reliably, falling back to a direct SSH connection as necessary.
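The difference between the fallback and the failure comes down to the `SIGPIPE` disposition at the time of the write. The sketch below is an analogy only, not OpenSSH code: a `socketpair()` stands in for the control socket, closing one end stands in for the mux master going away, and the `"default"`/`"ignore"` mode names are made up for the illustration.

```python
import signal
import subprocess
import sys

# Child program: write to a UNIX stream socket whose peer has already closed.
CHILD_SRC = r"""
import signal, socket, sys
mode = sys.argv[1]
signal.signal(signal.SIGPIPE, signal.SIG_DFL if mode == "default" else signal.SIG_IGN)
client, server = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
server.close()  # stands in for the mux master closing the control socket
try:
    client.send(b"hello")
except BrokenPipeError:
    # With SIGPIPE ignored, the write fails with EPIPE and can be handled.
    print("EPIPE: could fall back to a direct connection")
"""


def run_child(mode: str) -> subprocess.CompletedProcess:
    """Run the child program with the given SIGPIPE mode ("default" or "ignore")."""
    return subprocess.run(
        [sys.executable, "-c", CHILD_SRC, mode], capture_output=True, text=True
    )


default_run = run_child("default")
ignored_run = run_child("ignore")

print("default handler:", default_run.returncode, repr(default_run.stdout))
print("ignored:        ", ignored_run.returncode, repr(ignored_run.stdout))
print("default run killed by SIGPIPE:", default_run.returncode == -signal.SIGPIPE)
```

With the default handler the process is killed before it can report an error, which matches the abrupt end of the debug log described in this report. With the signal ignored, the write returns an error that the caller can act on.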
Based on the `ansible-playbook -vvvvv` -> `ssh -v` debug logs, the repro case is sufficient to exercise all of the following `ssh` codepaths:
In addition, I would expect to also see a fourth case with `debug: mux_client_hello_exchange: write packet: Broken pipe` -> `muxclient: master hello exchange failed` -> `debug2: ssh_connect_direct`, but this case seems to result in the `ssh` client dying with `SIGPIPE` instead. The `ssh` debug logging in the failure case ends at `debug1: auto-mux: Trying existing master` -> `debug2: fd 3 setting O_NONBLOCK`, so I would assume the `SIGPIPE` is occurring on the `mux_client_write_packet` in between the `muxclient` `connect(...)` -> `ENOENT` `Control socket \"%.100s\" does not exist` and `mux_client_read_packet` -> `read packet failed` calls.
Actual Results
Code of Conduct