Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Task failures with rc -13 (SIGPIPE) related to SSH ControlPersist expiry #81777

Open
1 task done
SpComb opened this issue Sep 26, 2023 · 6 comments
Open
1 task done
Labels
affects_2.15 bug This issue/PR relates to a bug.

Comments

@SpComb
Copy link

SpComb commented Sep 26, 2023

Summary

After upgrading from Ansible 4.10 -> Ansible 8.2.0 (ansible-core 2.15.3), our Jenkins pipelines started encountering occasional module failures with several different ansible-playbooks in various command, file or "Gathering facts" -> setup tasks, all with "rc": -13 (i.e. SIGPIPE):

23:02:04  PLAY [x] *********************************************************************
23:02:04  
23:02:04  TASK [Gathering Facts] *********************************************************
23:02:07  fatal: [x]: FAILED! => {
23:02:07      "ansible_facts": {},
23:02:07      "changed": false,
23:02:07      "failed_modules": {
23:02:07          "ansible.legacy.setup": {
23:02:07              "failed": true,
23:02:07              "module_stderr": "",
23:02:07              "module_stdout": "",
23:02:07              "msg": "MODULE FAILURE\nSee stdout/stderr for the exact error",
23:02:07              "rc": -13
23:02:07          }
23:02:07      }
23:02:07  }
23:02:07  
23:02:07  MSG:
23:02:07  
23:02:07  The following modules failed to execute: ansible.legacy.setup
10:29:49  TASK [x : Config directories] ******************************************
10:29:49  failed: [x] (item=/opt/x) => {
10:29:49      "ansible_loop_var": "item",
10:29:49      "changed": false,
10:29:49      "item": "/opt/x",
10:29:49      "rc": -13
10:29:49  }
10:29:49  
10:29:49  MSG:
10:29:49  
10:29:49  MODULE FAILURE
10:29:49  See stdout/stderr for the exact error 

Investigating the issues over a longer period of time, it seems like the issue was occurning when the interval between task executions on a given host coincided with the Ansible ssh plugin's ControlPersist=60s idle timeout:

[2023-09-22T19:07:42.331Z] TASK [supercronic : Deploy supercronic container] ***********
[2023-09-22T19:07:44.850Z] �[0;31m--- before�[0m
[2023-09-22T19:07:44.850Z] �[0;31m�[0m�[0;32m+++ after�[0m
[2023-09-22T19:07:44.850Z] �[0;32m�[0m�[0;36m@@ -1,3 +1,3 @@�[0m
[2023-09-22T19:07:44.850Z] �[0;36m�[0m {
[2023-09-22T19:07:44.850Z] �[0;31m-    "running": false�[0m
[2023-09-22T19:07:44.850Z] �[0;31m�[0m�[0;32m+    "running": true�[0m
[2023-09-22T19:07:44.850Z] �[0;32m�[0m }
[2023-09-22T19:07:44.850Z] 
[2023-09-22T19:07:44.850Z] �[0;33mchanged: [xxx.example.com]�[0m`
syys 22 19:08:42 xxx.example.com sshd[29728]: Received disconnect from 10.52.8.41 port 39688:11: disconnected by user
syys 22 19:08:42 xxx.example.com sshd[29728]: Disconnected from 10.52.8.41 port 39688
syys 22 19:08:42 xxx.example.com sshd[29728]: pam_unix(sshd:session): session closed for user root
[2023-09-22T19:08:43.079Z] TASK [Gathering Facts] *********************************************************
[2023-09-22T19:08:44.524Z] �[0;31mfatal: [xxx.example.com]: FAILED! => {�[0m
[2023-09-22T19:08:44.524Z] �[0;31m    "ansible_facts": {},�[0m
[2023-09-22T19:08:44.524Z] �[0;31m    "changed": false,�[0m
[2023-09-22T19:08:44.524Z] �[0;31m    "failed_modules": {�[0m
[2023-09-22T19:08:44.524Z] �[0;31m        "ansible.legacy.setup": {�[0m
[2023-09-22T19:08:44.524Z] �[0;31m            "failed": true,�[0m
[2023-09-22T19:08:44.524Z] �[0;31m            "module_stderr": "",�[0m
[2023-09-22T19:08:44.524Z] �[0;31m            "module_stdout": "",�[0m
[2023-09-22T19:08:44.524Z] �[0;31m            "msg": "MODULE FAILURE\nSee stdout/stderr for the exact error",�[0m
[2023-09-22T19:08:44.524Z] �[0;31m            "rc": -13�[0m
[2023-09-22T19:08:44.524Z] �[0;31m        }�[0m
[2023-09-22T19:08:44.524Z] �[0;31m    }�[0m
[2023-09-22T19:08:44.524Z] �[0;31m}�[0m
[2023-09-22T19:08:44.524Z] �[0;31m�[0m
[2023-09-22T19:08:44.524Z] �[0;31mMSG:�[0m
[2023-09-22T19:08:44.524Z] �[0;31m�[0m
[2023-09-22T19:08:44.524Z] �[0;31mThe following modules failed to execute: ansible.legacy.setup�[0m
[2023-09-22T19:08:44.524Z] �[0;31m�[0m

I believe the ssh client process on the Ansible controller is racing the mux_client_hello_exchange UNIX socket connect -> write with the auto-mux server closing the control socket. This seems to lead to the SSH client process dying with a SIGPIPE signal from the kernel, and the Ansible ssh connection plugin sees a -13 process exit code.

This issue is probably related to #77450 (same root cause) and #16731 (earlier fix for the same issue).

The upstream OpenSSH bz#3454 and openssh/openssh-portable@96faa0d fix to ssh SIGPIPE handling for this exact case as released in OpenSSH 9.1 are also related.

The Python subprocess issue (python/cpython#45993) mentioned in #77450 and related changes to subprocess.Popen(restore_signals=True) behavior in Python 3.2 as regards the effective SIGPIPE handler in the ssh child process forked by Ansible is probably also related.

The end result appears to be that for specific combinations of Python and OpenSSH, the ssh client forked by the Ansible ssh connection plugins executes the mux_client_hello_exchange socket write with the default SIGPIPE handler, leading to an occasional race condition with a -13 exit status code and resulting sporadic Ansible task failures.

Issue Type

Bug Report

Component Name

ssh

Ansible Version

$ ansible --version
ansible [core 2.15.4]
  config file = /build/provisioning/ansible.cfg
  configured module search path = ['/home/build/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/lib/python3.9/site-packages/ansible
  ansible collection location = /home/build/.ansible/collections:/usr/share/ansible/collections
  executable location = /usr/local/bin/ansible
  python version = 3.9.17 (main, Aug 10 2023, 15:11:31) [GCC 8.5.0 20210514 (Red Hat 8.5.0-20)] (/usr/bin/python3.9)
  jinja version = 3.1.2
  libyaml = True

Configuration

# if using a version older than ansible-core 2.12 you should omit the '-t all'
$ ansible-config dump --only-changed -t all
CONFIG_FILE() = /build/provisioning/ansible.cfg
DEFAULT_FORKS(/build/provisioning/ansible.cfg) = 20
DEFAULT_STDOUT_CALLBACK(/build/provisioning/ansible.cfg) = debug
DEFAULT_VAULT_PASSWORD_FILE(/build/provisioning/ansible.cfg) = /build/provisioning/vault-password.secret
INVENTORY_ANY_UNPARSED_IS_FAILED(/build/provisioning/ansible.cfg) = True
INVENTORY_CACHE_PLUGIN(/build/provisioning/ansible.cfg) = jsonfile
INVENTORY_CACHE_PLUGIN_CONNECTION(/build/provisioning/ansible.cfg) = ~/.ansible/inventory-cache
INVENTORY_CACHE_TIMEOUT(/build/provisioning/ansible.cfg) = 3600
INVENTORY_ENABLED(/build/provisioning/ansible.cfg) = ['yaml', 'funidata.infra.aws_ec2', 'ini']
TRANSFORM_INVALID_GROUP_CHARS(/build/provisioning/ansible.cfg) = ignore

CONNECTION:
==========

ssh:
___
pipelining(/build/provisioning/ansible.cfg) = True

OS / Environment

CentOS Stream 8, from the Docker quay.io/centos/centos:stream8 image @ sha256:e4e81a5e6be8f8f7eb511a8df3afcd4e7123e68c56bc03efc40fbd0ab5b2e4fd

OpenSSH_8.0p1

Steps to Reproduce

This is a timing-related race-condition, and thus difficult to reproduce reliably. The following playbook is an attempt to trigger the ssh control-master race on the "Test 2" task with a variable sleep, but I was not able to repro the actual SIGPIPE -> rc -13 on the second command task - what did work was to run the same playbook several times in a row in our Jenkins pipeline, and the failure would sometimes occur during the "Gathering Facts" -> setup task on a subsequent playbook run, where the previous playbook run completed for that host approximately 10 seconds earlier.

- hosts: all
  strategy: free
  vars:
    ansible_ssh_args: "-o ControlMaster=auto -o ControlPersist=10s -o StrictHostKeyChecking=accept-new -o UserKnownHostsFile={{ playbook_dir }}/ssh/known_hosts.d/{{ inventory_hostname }} -o HashKnownHosts=no -o ServerAliveInterval=5 -o ServerAliveCountMax=1"
  tasks:
    - name: Test 1
      command: date
    
    - name: Pause
      command: sleep "{{ 8.0 + (1000 | random / 1000.0) * 3.0 }}"
      connection: local

    - name: Test 2
      command: date

Expected Results

I expect the playbook tasks to run reliably, falling back to a direct SSH connection as necessary.

Based on the ansible-playbook -vvvvv -> ssh -v debug logs, the repro case is sufficient to exercise all of the following ssh codepaths:

debug1: auto-mux: Trying existing master\r\ndebug1: Control socket "/home/build/.ansible/cp/4bf82cb2a5" does not exist\r\ndebug2: ssh_connect_direct
ndebug1: auto-mux: Trying existing master\r\ndebug2: fd 3 setting O_NONBLOCK\r\ndebug1: mux_client_hello_exchange: read packet failed\r\nmuxclient: master hello exchange failed\r\ndebug2: ssh_connect_direct
debug1: auto-mux: Trying existing master\r\ndebug2: fd 3 setting O_NONBLOCK\r\ndebug2: mux_client_hello_exchange: master version 4\r\n\debug3: mux_client_forwards: request forwardings: 0 local, 0 remote\r\ndebug3: mux_client_request_session: entering

In addition, I would expect to also see a fourth case with debug: mux_client_hello_exchange: write packet: Broken pipe -> muxclient: master hello exchange failed -> debug2: ssh_connect_direct, but this case seems to result in the ssh client dying with SIGPIPE instead. The ssh debug logging in the failure case ends at debug1: auto-mux: Trying existing master -> debug2: fd 3 setting O_NONBLOCK, so I would assume the SIGPIPE is occuring on the mux_client_write_packet in between the muxclient connect(...) -> ENOENT Control socket \"%.100s\" does not exist and mux_client_read_packet -> read packet failed calls.

Actual Results

14:20:18  <10.52.9.192> SSH: EXEC ssh -vvv -o ControlMaster=auto -o ControlPersist=10s -o StrictHostKeyChecking=accept-new -o UserKnownHostsFile=/srv/jenkins-agent/workspace/xxx-dev/provisioning/ssh/known_hosts.d/xxx -o HashKnownHosts=no -o ServerAliveInterval=5 -o ServerAliveCountMax=1 -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="root"' -o ConnectTimeout=10 -o 'ControlPath="/home/build/.ansible/cp/74b21c1c30"' 10.52.9.192 '/bin/sh -c '"'"'/usr/bin/python3 && sleep 0'"'"''
...
14:20:19  <10.52.9.192> (0, b'\n{"changed": true, "stdout": "Tue Sep 26 11:20:18 UTC 2023", "stderr": "", "rc": 0, "cmd": ["date"], "start": "2023-09-26 11:20:18.233335", "end": "2023-09-26 11:20:18.236518", "delta": "0:00:00.003183", "msg": "", "invocation": {"module_args": {"_raw_params": "date", "_uses_shell": false, "stdin_add_newline": true, "strip_empty_ends": true, "argv": null, "chdir": null, "executable": null, "creates": null, "removes": null, "stdin": null}}}\n', b'OpenSSH_8.0p1, OpenSSL 1.1.1k  FIPS 25 Mar 2021\r\ndebug1: Reading configuration data /etc/ssh/ssh_config\r\ndebug3: /etc/ssh/ssh_config line 52: Including file /etc/ssh/ssh_config.d/05-redhat.conf depth 0\r\ndebug1: Reading configuration data /etc/ssh/ssh_config.d/05-redhat.conf\r\ndebug2: checking match for \'final all\' host 10.52.9.192 originally 10.52.9.192\r\ndebug3: /etc/ssh/ssh_config.d/05-redhat.conf line 3: not matched \'final\'\r\ndebug2: match not found\r\ndebug3: /etc/ssh/ssh_config.d/05-redhat.conf line 5: Including file /etc/crypto-policies/back-ends/openssh.config depth 1 (parse only)\r\ndebug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config\r\ndebug3: gss kex names ok: [gss-curve25519-sha256-,gss-nistp256-sha256-,gss-group14-sha256-,gss-group16-sha512-,gss-gex-sha1-,gss-group14-sha1-]\r\ndebug3: kex names ok: [curve25519-sha256,curve25519-sha256@libssh.org,ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group-exchange-sha256,diffie-hellman-group14-sha256,diffie-hellman-group16-sha512,diffie-hellman-group18-sha512,diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha1]\r\ndebug1: configuration requests final Match pass\r\ndebug2: resolve_canonicalize: hostname 10.52.9.192 is address\r\ndebug1: re-parsing configuration\r\ndebug1: Reading configuration data /etc/ssh/ssh_config\r\ndebug3: /etc/ssh/ssh_config line 52: Including file /etc/ssh/ssh_config.d/05-redhat.conf depth 0\r\ndebug1: Reading configuration data /etc/ssh/ssh_config.d/05-redhat.conf\r\ndebug2: checking match for \'final all\' host 10.52.9.192 originally 10.52.9.192\r\ndebug3: /etc/ssh/ssh_config.d/05-redhat.conf line 3: matched \'final\'\r\ndebug2: match found\r\ndebug3: /etc/ssh/ssh_config.d/05-redhat.conf line 5: Including file /etc/crypto-policies/back-ends/openssh.config depth 1\r\ndebug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config\r\ndebug3: gss kex names ok: [gss-curve25519-sha256-,gss-nistp256-sha256-,gss-group14-sha256-,gss-group16-sha512-,gss-gex-sha1-,gss-group14-sha1-]\r\ndebug3: kex names ok: [curve25519-sha256,curve25519-sha256@libssh.org,ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group-exchange-sha256,diffie-hellman-group14-sha256,diffie-hellman-group16-sha512,diffie-hellman-group18-sha512,diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha1]\r\ndebug1: auto-mux: Trying existing master\r\ndebug2: fd 3 setting O_NONBLOCK\r\ndebug1: mux_client_hello_exchange: read packet failed\r\nmuxclient: master hello exchange failed\r\ndebug2: ssh_connect_direct\r\ndebug1: Connecting to 10.52.9.192 [10.52.9.192] port 22.\r\ndebug2: fd 3 setting O_NONBLOCK\r\ndebug1: fd 3 clearing O_NONBLOCK\r\ndebug1: Connection established.\r\ndebug3: timeout: 9966 ms remain after connect\r\ndebug1: identity file /home/build/.ssh/id_rsa type -1\r\ndebug1: identity file /home/build/.ssh/id_rsa-cert type -1\r\ndebug1: identity file /home/build/.ssh/id_dsa type -1\r\ndebug1: identity file /home/build/.ssh/id_dsa-cert type -1\r\ndebug1: identity file /home/build/.ssh/id_ecdsa type -1\r\ndebug1: identity file /home/build/.ssh/id_ecdsa-cert type -1\r\ndebug1: identity file /home/build/.ssh/id_ed25519 type -1\r\ndebug1: identity file /home/build/.ssh/id_ed25519-cert type -1\r\ndebug1: identity file /home/build/.ssh/id_xmss type -1\r\ndebug1: identity file /home/build/.ssh/id_xmss-cert type -1\r\ndebug1: Local version string SSH-2.0-OpenSSH_8.0\r\ndebug1: Remote protocol version 2.0, remote software version OpenSSH_7.4\r\ndebug1: match: OpenSSH_7.4 pat OpenSSH_7.0*,OpenSSH_7.1*,OpenSSH_7.2*,OpenSSH_7.3*,OpenSSH_7.4*,OpenSSH_7.5*,OpenSSH_7.6*,OpenSSH_7.7* compat 0x04000002\r\ndebug2: fd 3 setting O_NONBLOCK\r\ndebug1: Authenticating to 10.52.9.192:22 as \'root\'\r\ndebug3: hostkeys_foreach: reading file "/srv/jenkins-agent/workspace/xxx-dev/provisioning/ssh/known_hosts.d/xxx"\r\ndebug3: record_hostkey: found key type ECDSA in file /srv/jenkins-agent/workspace/xxx-dev/provisioning/ssh/known_hosts.d/xxx:1\r\ndebug3: load_hostkeys: loaded 1 keys from 10.52.9.192\r\ndebug3: order_hostkeyalgs: have matching best-preference key type ecdsa-sha2-nistp256-cert-v01@openssh.com, using HostkeyAlgorithms verbatim\r\ndebug3: send packet: type 20\r\ndebug1: SSH2_MSG_KEXINIT sent\r\ndebug3: receive packet: type 20\r\ndebug1: SSH2_MSG_KEXINIT received\r\ndebug2: local client KEXINIT proposal\r\ndebug2: KEX algorithms: curve25519-sha256,curve25519-sha256@libssh.org,ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group-exchange-sha256,diffie-hellman-group14-sha256,diffie-hellman-group16-sha512,diffie-hellman-group18-sha512,diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha1,ext-info-c\r\ndebug2: host key algorithms: ecdsa-sha2-nistp256-cert-v01@openssh.com,ecdsa-sha2-nistp384-cert-v01@openssh.com,ecdsa-sha2-nistp521-cert-v01@openssh.com,ssh-ed25519-cert-v01@openssh.com,rsa-sha2-512-cert-v01@openssh.com,rsa-sha2-256-cert-v01@openssh.com,ssh-rsa-cert-v01@openssh.com,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521,ssh-ed25519,rsa-sha2-512,rsa-sha2-256,ssh-rsa\r\ndebug2: ciphers ctos: aes256-gcm@openssh.com,chacha20-poly1305@openssh.com,aes256-ctr,aes256-cbc,aes128-gcm@openssh.com,aes128-ctr,aes128-cbc\r\ndebug2: ciphers stoc: aes256-gcm@openssh.com,chacha20-poly1305@openssh.com,aes256-ctr,aes256-cbc,aes128-gcm@openssh.com,aes128-ctr,aes128-cbc\r\ndebug2: MACs ctos: hmac-sha2-256-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha2-256,hmac-sha1,umac-128@openssh.com,hmac-sha2-512\r\ndebug2: MACs stoc: hmac-sha2-256-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha2-256,hmac-sha1,umac-128@openssh.com,hmac-sha2-512\r\ndebug2: compression ctos: none,zlib@openssh.com,zlib\r\ndebug2: compression stoc: none,zlib@openssh.com,zlib\r\ndebug2: languages ctos: \r\ndebug2: languages stoc: \r\ndebug2: first_kex_follows 0 \r\ndebug2: reserved 0 \r\ndebug2: peer server KEXINIT proposal\r\ndebug2: KEX algorithms: curve25519-sha256,curve25519-sha256@libssh.org,ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group-exchange-sha256,diffie-hellman-group16-sha512,diffie-hellman-group18-sha512,diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha256,diffie-hellman-group14-sha1,diffie-hellman-group1-sha1\r\ndebug2: host key algorithms: ssh-rsa,rsa-sha2-512,rsa-sha2-256,ecdsa-sha2-nistp256,ssh-ed25519\r\ndebug2: ciphers ctos: aes128-ctr,aes192-ctr,aes256-ctr\r\ndebug2: ciphers stoc: aes128-ctr,aes192-ctr,aes256-ctr\r\ndebug2: MACs ctos: umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-sha1\r\ndebug2: MACs stoc: umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-sha1\r\ndebug2: compression ctos: none,zlib@openssh.com\r\ndebug2: compression stoc: none,zlib@openssh.com\r\ndebug2: languages ctos: \r\ndebug2: languages stoc: \r\ndebug2: first_kex_follows 0 \r\ndebug2: reserved 0 \r\ndebug1: kex: algorithm: curve25519-sha256\r\ndebug1: kex: host key algorithm: ecdsa-sha2-nistp256\r\ndebug1: kex: server->client cipher: aes256-ctr MAC: hmac-sha2-256-etm@openssh.com compression: none\r\ndebug1: kex: client->server cipher: aes256-ctr MAC: hmac-sha2-256-etm@openssh.com compression: none\r\ndebug1: kex: curve25519-sha256 need=32 dh_need=32\r\ndebug1: kex: curve25519-sha256 need=32 dh_need=32\r\ndebug3: send packet: type 30\r\ndebug1: expecting SSH2_MSG_KEX_ECDH_REPLY\r\ndebug3: receive packet: type 31\r\ndebug1: Server host key: ecdsa-sha2-nistp256 SHA256:L/ealGhXrXp8SxG407o0VuYoTryRpycepLtoF2HcNZk\r\ndebug3: hostkeys_foreach: reading file "/srv/jenkins-agent/workspace/xxx-dev/provisioning/ssh/known_hosts.d/xxx"\r\ndebug3: record_hostkey: found key type ECDSA in file /srv/jenkins-agent/workspace/xxx-dev/provisioning/ssh/known_hosts.d/xxx:1\r\ndebug3: load_hostkeys: loaded 1 keys from 10.52.9.192\r\ndebug1: Host \'10.52.9.192\' is known and matches the ECDSA host key.\r\ndebug1: Found key in /srv/jenkins-agent/workspace/xxx-dev/provisioning/ssh/known_hosts.d/xxx:1\r\ndebug3: send packet: type 21\r\ndebug2: set_newkeys: mode 1\r\ndebug1: rekey out after 4294967296 blocks\r\ndebug1: SSH2_MSG_NEWKEYS sent\r\ndebug1: expecting SSH2_MSG_NEWKEYS\r\ndebug3: receive packet: type 21\r\ndebug1: SSH2_MSG_NEWKEYS received\r\ndebug2: set_newkeys: mode 0\r\ndebug1: rekey in after 4294967296 blocks\r\ndebug1: Will attempt key: jenkins@xxx-jenkins-dev ED25519 SHA256:mc+kKsOmURu4dJUep5ZfRHx9o7LRHZHMd7RPx2m9S2g agent\r\ndebug1: Will attempt key: /home/build/.ssh/id_rsa \r\ndebug1: Will attempt key: /home/build/.ssh/id_dsa \r\ndebug1: Will attempt key: /home/build/.ssh/id_ecdsa \r\ndebug1: Will attempt key: /home/build/.ssh/id_ed25519 \r\ndebug1: Will attempt key: /home/build/.ssh/id_xmss \r\ndebug2: pubkey_prepare: done\r\ndebug3: send packet: type 5\r\ndebug3: receive packet: type 7\r\ndebug1: SSH2_MSG_EXT_INFO received\r\ndebug1: kex_input_ext_info: server-sig-algs=<rsa-sha2-256,rsa-sha2-512>\r\ndebug3: receive packet: type 6\r\ndebug2: service_accept: ssh-userauth\r\ndebug1: SSH2_MSG_SERVICE_ACCEPT received\r\ndebug3: send packet: type 50\r\ndebug3: receive packet: type 53\r\ndebug3: input_userauth_banner\r\nAuthorized uses only. All activity may be monitored and reported.\ndebug3: receive packet: type 51\r\ndebug1: Authentications that can continue: publickey\r\ndebug3: start over, passed a different list publickey\r\ndebug3: preferred gssapi-with-mic,gssapi-keyex,hostbased,publickey\r\ndebug3: authmethod_lookup publickey\r\ndebug3: remaining preferred: ,gssapi-keyex,hostbased,publickey\r\ndebug3: authmethod_is_enabled publickey\r\ndebug1: Next authentication method: publickey\r\ndebug1: Offering public key: jenkins@xxx-jenkins-dev ED25519 SHA256:mc+kKsOmURu4dJUep5ZfRHx9o7LRHZHMd7RPx2m9S2g agent\r\ndebug3: send packet: type 50\r\ndebug2: we sent a publickey packet, wait for reply\r\ndebug3: receive packet: type 60\r\ndebug1: Server accepts key: jenkins@xxx-jenkins-dev ED25519 SHA256:mc+kKsOmURu4dJUep5ZfRHx9o7LRHZHMd7RPx2m9S2g agent\r\ndebug3: sign_and_send_pubkey: ED25519 SHA256:mc+kKsOmURu4dJUep5ZfRHx9o7LRHZHMd7RPx2m9S2g\r\ndebug3: sign_and_send_pubkey: signing using ssh-ed25519\r\ndebug3: send packet: type 50\r\ndebug3: receive packet: type 52\r\ndebug1: Authentication succeeded (publickey).\r\nAuthenticated to 10.52.9.192 ([10.52.9.192]:22).\r\ndebug1: setting up multiplex master socket\r\ndebug3: muxserver_listen: temporary control path /home/build/.ansible/cp/74b21c1c30.cEtnsDBIAFIxv4xs\r\ndebug2: fd 4 setting O_NONBLOCK\r\ndebug3: fd 4 is O_NONBLOCK\r\ndebug3: fd 4 is O_NONBLOCK\r\ndebug1: channel 0: new [/home/build/.ansible/cp/74b21c1c30]\r\ndebug3: muxserver_listen: mux listener channel 0 fd 4\r\ndebug2: fd 3 setting TCP_NODELAY\r\ndebug3: ssh_packet_set_tos: set IP_TOS 0x20\r\ndebug1: control_persist_detach: backgrounding master process\r\ndebug2: control_persist_detach: background process is 1313\r\ndebug2: fd 4 setting O_NONBLOCK\r\ndebug1: forking to background\r\ndebug1: Entering interactive session.\r\ndebug1: pledge: id\r\ndebug2: set_control_persist_exit_time: schedule exit in 10 seconds\r\ndebug1: multiplexing control connection\r\ndebug2: fd 5 setting O_NONBLOCK\r\ndebug3: fd 5 is O_NONBLOCK\r\ndebug1: channel 1: new [mux-control]\r\ndebug3: channel_post_mux_listener: new mux channel 1 fd 5\r\ndebug3: mux_master_read_cb: channel 1: hello sent\r\ndebug2: set_control_persist_exit_time: cancel scheduled exit\r\ndebug3: mux_master_read_cb: channel 1 packet type 0x00000001 len 4\r\ndebug2: mux_master_process_hello: channel 1 slave version 4\r\ndebug2: mux_client_hello_exchange: master version 4\r\ndebug3: mux_client_forwards: request forwardings: 0 local, 0 remote\r\ndebug3: mux_client_request_session: entering\r\ndebug3: mux_client_request_alive: entering\r\ndebug3: mux_master_read_cb: channel 1 packet type 0x10000004 len 4\r\ndebug2: mux_master_process_alive_check: channel 1: alive check\r\ndebug3: mux_client_request_alive: done pid = 1315\r\ndebug3: mux_client_request_session: session request sent\r\ndebug3: mux_master_read_cb: channel 1 packet type 0x10000002 len 115\r\ndebug2: mux_master_process_new_session: channel 1: request tty 0, X 0, agent 0, subsys 0, term "xterm", cmd "/bin/sh -c \'/usr/bin/python3 && sleep 0\'", env 2\r\ndebug3: mux_master_process_new_session: got fds stdin 6, stdout 7, stderr 8\r\ndebug2: fd 6 setting O_NONBLOCK\r\ndebug1: channel 2: new [client-session]\r\ndebug2: mux_master_process_new_session: channel_new: 2 linked to control channel 1\r\ndebug2: channel 2: send open\r\ndebug3: send packet: type 90\r\ndebug3: receive packet: type 80\r\ndebug1: client_input_global_request: rtype hostkeys-00@openssh.com want_reply 0\r\ndebug3: receive packet: type 91\r\ndebug2: channel_input_open_confirmation: channel 2: callback start\r\ndebug2: client_session2_setup: id 2\r\ndebug1: Sending environment.\r\ndebug1: Sending env LANG = C\r\ndebug2: channel 2: request env confirm 0\r\ndebug3: send packet: type 98\r\ndebug1: Sending env LC_CTYPE = en_US.UTF-8\r\ndebug2: channel 2: request env confirm 0\r\ndebug3: send packet: type 98\r\ndebug1: Sending command: /bin/sh -c \'/usr/bin/python3 && sleep 0\'\r\ndebug2: channel 2: request exec confirm 1\r\ndebug3: send packet: type 98\r\ndebug3: mux_session_confirm: sending success reply\r\ndebug2: channel_input_open_confirmation: channel 2: callback done\r\ndebug2: channel 2: open confirm rwindow 0 rmax 32768\r\ndebug2: channel 2: rcvd adjust 2097152\r\ndebug3: receive packet: type 99\r\ndebug2: channel_input_status_confirm: type 99 id 2\r\ndebug2: exec request accepted on channel 2\r\ndebug2: channel 2: read<=0 rfd 6 len 0\r\ndebug2: channel 2: read failed\r\ndebug2: channel 2: chan_shutdown_read (i0 o0 sock -1 wfd 6 efd 8 [write])\r\ndebug2: channel 2: input open -> drain\r\ndebug2: channel 2: ibuf empty\r\ndebug2: channel 2: send eof\r\ndebug3: send packet: type 96\r\ndebug2: channel 2: input drain -> closed\r\ndebug2: channel 2: rcvd adjust 65536\r\ndebug3: receive packet: type 98\r\ndebug1: client_input_channel_req: channel 2 rtype exit-status reply 0\r\ndebug3: mux_exit_message: channel 2: exit message, exitval 0\r\ndebug3: receive packet: type 96\r\ndebug2: channel 2: rcvd eof\r\ndebug2: channel 2: output open -> drain\r\ndebug2: channel 2: obuf empty\r\ndebug2: channel 2: chan_shutdown_write (i3 o1 sock -1 wfd 7 efd 8 [write])\r\ndebug2: channel 2: output drain -> closed\r\ndebug3: receive packet: type 97\r\ndebug2: channel 2: rcvd close\r\ndebug3: channel 2: will not send data after close\r\ndebug2: channel 2: send close\r\ndebug3: send packet: type 97\r\ndebug2: channel 2: is dead\r\ndebug2: channel 2: gc: notify user\r\ndebug3: mux_master_session_cleanup_cb: entering for channel 2\r\ndebug2: channel 1: rcvd close\r\ndebug2: channel 1: output open -> drain\r\ndebug2: channel 1: chan_shutdown_read (i0 o1 sock 5 wfd 5 efd -1 [closed])\r\ndebug2: channel 1: input open -> closed\r\ndebug2: channel 2: gc: user detached\r\ndebug2: channel 2: is dead\r\ndebug2: channel 2: garbage collecting\r\ndebug1: channel 2: free: client-session, nchannels 3\r\ndebug3: channel 2: status: The following connections are open:\r\n  #1 mux-control (t16 nr0 i3/0 o1/16 e[closed]/0 fd 5/5/-1 sock 5 cc -1)\r\n  #2 client-session (t4 r0 i3/0 o3/0 e[write]/0 fd -1/-1/8 sock -1 cc -1)\r\n\r\ndebug2: channel 1: obuf empty\r\ndebug2: channel 1: chan_shutdown_write (i3 o1 sock 5 wfd 5 efd -1 [closed])\r\ndebug2: channel 1: output drain -> closed\r\ndebug2: channel 1: is dead (local)\r\ndebug2: channel 1: gc: notify user\r\ndebug3: mux_master_control_cleanup_cb: entering for channel 1\r\ndebug2: channel 1: gc: user detached\r\ndebug2: channel 1: is dead (local)\r\ndebug2: channel 1: garbage collecting\r\ndebug1: channel 1: free: mux-control, nchannels 2\r\ndebug3: channel 1: status: The following connections are open:\r\n  #1 mux-control (t16 nr0 i3/0 o3/0 e[closed]/0 fd 5/5/-1 sock 5 cc -1)\r\n\r\ndebug2: set_control_persist_exit_time: schedule exit in 10 seconds\r\ndebug3: mux_client_read_packet: read header failed: Broken pipe\r\ndebug2: Received exit status from master 0\r\n')
...
14:20:19  TASK [Test 2] ******************************************************************
14:20:19  task path: /srv/jenkins-agent/workspace/xxx-dev/provisioning/ansible-sigpipe.yml:13
14:20:19  changed: [xxx] => {
14:20:19 ...
14:20:19  }
...
14:20:21  PLAY RECAP *********************************************************************
14:20:21  xxx : ok=4    changed=3    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
...

14:20:27  PLAYBOOK: ansible-sigpipe.yml **************************************************
...
14:20:27  Using module file /usr/local/lib/python3.9/site-packages/ansible/modules/setup.py
14:20:27  Pipelining is enabled.
14:20:27  <10.52.9.192> ESTABLISH SSH CONNECTION FOR USER: root
14:20:27  <10.52.9.192> SSH: ansible.cfg set ssh_args: (-o)(ControlMaster=auto)(-o)(ControlPersist=10s)(-o)(StrictHostKeyChecking=accept-new)(-o)(UserKnownHostsFile=/srv/jenkins-agent/workspace/xxx-dev/provisioning/ssh/known_hosts.d/xxx)(-o)(HashKnownHosts=no)(-o)(ServerAliveInterval=5)(-o)(ServerAliveCountMax=1)
14:20:27  <10.52.9.192> SSH: ansible_password/ansible_ssh_password not set: (-o)(KbdInteractiveAuthentication=no)(-o)(PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey)(-o)(PasswordAuthentication=no)
14:20:27  <10.52.9.192> SSH: ANSIBLE_REMOTE_USER/remote_user/ansible_user/user/-u set: (-o)(User="root")
14:20:27  <10.52.9.192> SSH: ANSIBLE_TIMEOUT/timeout set: (-o)(ConnectTimeout=10)
14:20:27  <10.52.9.192> SSH: Set ssh_common_args: ()
14:20:27  <10.52.9.192> SSH: Set ssh_extra_args: ()
14:20:27  <10.52.9.192> SSH: found only ControlPersist; added ControlPath: (-o)(ControlPath="/home/build/.ansible/cp/74b21c1c30")
14:20:27  <10.52.9.192> SSH: EXEC ssh -vvv -o ControlMaster=auto -o ControlPersist=10s -o StrictHostKeyChecking=accept-new -o UserKnownHostsFile=/srv/jenkins-agent/workspace/xxx-dev/provisioning/ssh/known_hosts.d/xxx -o HashKnownHosts=no -o ServerAliveInterval=5 -o ServerAliveCountMax=1 -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="root"' -o ConnectTimeout=10 -o 'ControlPath="/home/build/.ansible/cp/74b21c1c30"' 10.52.9.192 '/bin/sh -c '"'"'/usr/bin/python3 && sleep 0'"'"''
14:20:28  TASK [Gathering Facts] *********************************************************
14:20:28  task path: /srv/jenkins-agent/workspace/xxx-dev/provisioning/ansible-sigpipe.yml:1
14:20:28  fatal: [xxx]: FAILED! => {
14:20:28      "ansible_facts": {},
14:20:28      "changed": false,
14:20:28      "failed_modules": {
14:20:28          "ansible.legacy.setup": {
14:20:28              "failed": true,
14:20:28              "module_stderr": "OpenSSH_8.0p1, OpenSSL 1.1.1k  FIPS 25 Mar 2021\r\ndebug1: Reading configuration data /etc/ssh/ssh_config\r\ndebug3: /etc/ssh/ssh_config line 52: Including file /etc/ssh/ssh_config.d/05-redhat.conf depth 0\r\ndebug1: Reading configuration data /etc/ssh/ssh_config.d/05-redhat.conf\r\ndebug2: checking match for 'final all' host 10.52.9.192 originally 10.52.9.192\r\ndebug3: /etc/ssh/ssh_config.d/05-redhat.conf line 3: not matched 'final'\r\ndebug2: match not found\r\ndebug3: /etc/ssh/ssh_config.d/05-redhat.conf line 5: Including file /etc/crypto-policies/back-ends/openssh.config depth 1 (parse only)\r\ndebug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config\r\ndebug3: gss kex names ok: [gss-curve25519-sha256-,gss-nistp256-sha256-,gss-group14-sha256-,gss-group16-sha512-,gss-gex-sha1-,gss-group14-sha1-]\r\ndebug3: kex names ok: [curve25519-sha256,curve25519-sha256@libssh.org,ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group-exchange-sha256,diffie-hellman-group14-sha256,diffie-hellman-group16-sha512,diffie-hellman-group18-sha512,diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha1]\r\ndebug1: configuration requests final Match pass\r\ndebug2: resolve_canonicalize: hostname 10.52.9.192 is address\r\ndebug1: re-parsing configuration\r\ndebug1: Reading configuration data /etc/ssh/ssh_config\r\ndebug3: /etc/ssh/ssh_config line 52: Including file /etc/ssh/ssh_config.d/05-redhat.conf depth 0\r\ndebug1: Reading configuration data /etc/ssh/ssh_config.d/05-redhat.conf\r\ndebug2: checking match for 'final all' host 10.52.9.192 originally 10.52.9.192\r\ndebug3: /etc/ssh/ssh_config.d/05-redhat.conf line 3: matched 'final'\r\ndebug2: match found\r\ndebug3: /etc/ssh/ssh_config.d/05-redhat.conf line 5: Including file /etc/crypto-policies/back-ends/openssh.config depth 1\r\ndebug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config\r\ndebug3: gss kex names ok: [gss-curve25519-sha256-,gss-nistp256-sha256-,gss-group14-sha256-,gss-group16-sha512-,gss-gex-sha1-,gss-group14-sha1-]\r\ndebug3: kex names ok: [curve25519-sha256,curve25519-sha256@libssh.org,ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group-exchange-sha256,diffie-hellman-group14-sha256,diffie-hellman-group16-sha512,diffie-hellman-group18-sha512,diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha1]\r\ndebug1: auto-mux: Trying existing master\r\ndebug2: fd 3 setting O_NONBLOCK\r\n",
14:20:28              "module_stdout": "",
14:20:28              "msg": "MODULE FAILURE\nSee stdout/stderr for the exact error",
14:20:28              "rc": -13
14:20:28          }
14:20:28      }
14:20:28  }
14:20:28  
14:20:28  MSG:
14:20:28  
14:20:28  The following modules failed to execute: ansible.legacy.setup

Code of Conduct

  • I agree to follow the Ansible Code of Conduct
@SpComb
Copy link
Author

SpComb commented Sep 26, 2023

As an aside, I believe the codepath in https://github.com/ansible/ansible/blob/v2.15.3/lib/ansible/plugins/connection/ssh.py#L1186-L1193 as originally commited in #16787 as a fix for #16731 doesn't really apply anymore in any recent versions of OpenSSH.

The openssh mux_client_hello_exchange logging was changed upstream in openssh/openssh-portable@5b2f34a (OpenSSH release 7.6?). Instead of logging a fatal mux_client_hello_exchange: write packet: Broken pipe error, it should now continue with a fallback to a direct SSH connection.

Or exit with SIGPIPE, as reported on this issue, and fixed in upstream OpenSSH 9.1 -> openssh/openssh-portable@96faa0d (bz#3454).

@ansibot ansibot added bug This issue/PR relates to a bug. needs_triage Needs a first human triage before being processed. affects_2.15 labels Sep 26, 2023
@ansibot
Copy link
Contributor

ansibot commented Sep 26, 2023

Files identified in the description:

If these files are incorrect, please update the component name section of the description or use the component bot command.

@bcoca bcoca removed the needs_triage Needs a first human triage before being processed. label Sep 26, 2023
@bcoca
Copy link
Member

bcoca commented Sep 26, 2023

so are you suggesting something like:

diff --git a/lib/ansible/plugins/connection/ssh.py b/lib/ansible/plugins/connection/ssh.py
index 49b2ed22fc..3fb7bb1e80 100644
--- a/lib/ansible/plugins/connection/ssh.py
+++ b/lib/ansible/plugins/connection/ssh.py
@@ -1216,6 +1216,9 @@ class Connection(ConnectionBase):
             elif in_data and checkrc:
                 raise AnsibleConnectionFailure('Data could not be sent to remote host "%s". Make sure this host can be reached over ssh: %s'
                                                % (self.host, additional))
+        elif p.returncode == '-13':
+            # handle openssh/python race condition on control persist
+            raise AnsibleControlPersistBrokenPipeError('Data could not be sent because of ControlPersist broken pipe: %s' % to_native(stderr))
 
         return (p.returncode, b_stdout, b_stderr)

?
This should still trigger the retry code we have under ssh and stays compatible with previous versions that capture under rc == 255.

@SpComb
Copy link
Author

SpComb commented Sep 27, 2023

I think that would fix this case (with -13, not '-13'), but there could also be other cases that trigger the same SIGPIPE besides just mux_client_hello_exchange, and I don't think there's really any way to tell those apart 🤔 It looks like there's a limit on the number of retries for the AnsibleControlPersistBrokenPipeError handling, so I can't see how retrying on a p.returncode < 0 signal exit code could hurt.

TBH this is really an openssh bug fixed in OpenSSH 9.1, but it's going to be a while before that's in widespread deployment (RHEL 9 is on OpenSSH 8.7?), so ideally Ansible could ship some kind of workaround, particularly since the problematic -o ControlMaster=auto -o ControlPersist=60s is used in the default ssh connection plugin config.

One way to look at this is that based on my understanding of python/cpython#45993, Ansible running on versions of Python prior to 3.2 will have exec'd the ssh process with the SIGPIPE handler set to SIG_IGN. Explicitly setting that signal handler for the ssh subprocess could also be one kind of compatibility solution? I haven't tested that to verify the earlier behavior, or how that might affect ssh - I suppose with openssh/openssh-portable@96faa0d it will set the same handler itself fairly early on in main.

EDIT: With Python 3.2+ and the ssh process SIGPIPE handler set to SIG_DFL, I don't think the existing rc == 255 + mux_client_hello_exchange: write packet: Broken pipe condition will for older versions of OpenSSH - I'm guessing they will also die with SIGPIPE -> -13 rather than output that error message?

@bcoca
Copy link
Member

bcoca commented Sep 29, 2023

Considering that connection plugins run on controller and Ansible never supported running on Python 3.0-3.2, I think it is safe to ignore that part.

@markgoddard
Copy link
Contributor

There are various bugs mentioning rc=-13, but I'd like to link this one which has a nice analysis to this other one which provides a workaround of increasing ControlPersist timeout.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects_2.15 bug This issue/PR relates to a bug.
Projects
None yet
Development

No branches or pull requests

4 participants