From f528358768f3ff1f5ad2b727099c085417984cfc Mon Sep 17 00:00:00 2001
From: Chris Dunlap
Date: Tue, 27 Feb 2024 21:41:10 -0800
Subject: [PATCH] libmunge: Fix connect failure retry for full socket queue

libmunge retries transient errors when connecting to munged.  This
should handle errors arising from the listening socket's queue being
full.  However, PR #139 uncovered a bug that did not handle EAGAIN
which is returned on Linux when a nonblocking UNIX domain socket
connection cannot be completed immediately.

This commit fixes the while-loop in _m_msg_client_connect() that
retries connect() so both EAGAIN (Linux) and ECONNREFUSED (BSD) are
handled as transient errors that should be retried.

This was tested by setting "--listen-backlog=1", running munged with
the default 2 work threads, and running remunge with 64 threads.

First, the while-loop was altered so connect() errors would not be
retried.  remunge could reproduce EAGAIN on Linux, and this behavior
was dramatically more reproducible if vcpu > 1.  remunge could
reproduce ECONNREFUSED on NetBSD 9.3 with vcpu=1, and on FreeBSD 14.0
with vcpu=2.

Adding back the retry logic for EAGAIN and ECONNREFUSED made this
connect() failure difficult to reproduce even with
"--listen-backlog=1".

Tested:
- AlmaLinux 9.3, 8.9
- Arch Linux
- CentOS Linux Stream 9, Stream 8, 7.9.2009, 6.10
- Debian sid, 12.5, 11.9, 10.13, 9.13, 8.11, 7.11, 6.0.10, 5.0.10, 4.0
- Fedora 39, 38, 37
- FreeBSD 14.0, 13.3, 13.2
- NetBSD 9.3
- OpenBSD 7.4, 7.3
- openSUSE 15.5, 15.4
- Ubuntu 23.10, 22.04.4, 20.04.6, 18.04.6, 16.04.7, 14.04.6
---
 src/libmunge/m_msg_client.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/src/libmunge/m_msg_client.c b/src/libmunge/m_msg_client.c
index 164261b7..6feb07c3 100644
--- a/src/libmunge/m_msg_client.c
+++ b/src/libmunge/m_msg_client.c
@@ -214,10 +214,13 @@ _m_msg_client_connect (m_msg_t m, char *path)
     i = 1;
     while (1) {
         /*
-         * If a call to connect() for a Unix domain stream socket finds that
-         * the listening socket's queue is full, ECONNREFUSED is returned
-         * immediately. (cf, Stevens UNPv1, s14.4, p378)
-         * If ECONNREFUSED, try again up to MUNGE_SOCKET_CONNECT_ATTEMPTS.
+         * Retry transient errors up to MUNGE_SOCKET_CONNECT_ATTEMPTS after a
+         * linearly increasing backoff.
+         * Linux: connect() returns EAGAIN for nonblocking UNIX domain sockets
+         * if the connection cannot be completed immediately.
+         * [Linux man-pages 6.05.01]
+         * BSD: connect() returns ECONNREFUSED for UNIX domain stream sockets
+         * when the listening socket's queue is full. [Stevens UNPv1]
          */
         n = connect (sd, (struct sockaddr *) &addr, sizeof (addr));
 
@@ -227,7 +230,7 @@ _m_msg_client_connect (m_msg_t m, char *path)
         if (errno == EINTR) {
             continue;
         }
-        if (errno != ECONNREFUSED) {
+        if ((errno != EAGAIN) && (errno != ECONNREFUSED)) {
             break;
         }
         if (i >= MUNGE_SOCKET_CONNECT_ATTEMPTS) {
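
Illustrative note (not part of the patch): the retry policy described in
the commit message can be sketched as a standalone function. This is a
minimal sketch under stated assumptions, not the libmunge implementation.
The names is_transient_connect_error(), sleep_msecs(), and
connect_with_retry(), along with the max_attempts and delay_msecs
parameters, are hypothetical. The sketch also leaves the socket in
blocking mode for brevity, whereas the commit message describes the
EAGAIN case for a nonblocking socket.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/un.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: nonzero if a connect() errno value should be
 * retried.  EAGAIN covers Linux; ECONNREFUSED covers the BSDs. */
static int
is_transient_connect_error (int err)
{
    return (err == EAGAIN) || (err == ECONNREFUSED);
}

/* Hypothetical helper: sleep for msecs, resuming if interrupted. */
static void
sleep_msecs (unsigned long msecs)
{
    struct timespec ts;

    ts.tv_sec = (time_t) (msecs / 1000);
    ts.tv_nsec = (long) (msecs % 1000) * 1000000L;
    while ((nanosleep (&ts, &ts) < 0) && (errno == EINTR)) {
        ; /* ts now holds the remaining time; sleep again */
    }
}

/* Hypothetical sketch: connect to the UNIX domain stream socket at path,
 * retrying transient errors up to max_attempts with a linearly
 * increasing backoff.  Returns the socket descriptor, or -1 with errno
 * set to the last error. */
static int
connect_with_retry (const char *path, int max_attempts,
                    unsigned long delay_msecs)
{
    struct sockaddr_un addr;
    int sd, i, saved_errno;

    assert (path != NULL);
    if (strlen (path) >= sizeof (addr.sun_path)) {
        errno = ENAMETOOLONG;
        return -1;
    }
    sd = socket (AF_UNIX, SOCK_STREAM, 0);
    if (sd < 0) {
        return -1;
    }
    memset (&addr, 0, sizeof (addr));
    addr.sun_family = AF_UNIX;
    strcpy (addr.sun_path, path);   /* length checked above */

    i = 1;
    while (1) {
        if (connect (sd, (struct sockaddr *) &addr, sizeof (addr)) == 0) {
            return sd;
        }
        if (errno == EINTR) {
            continue;               /* interrupted: retry without counting */
        }
        if (!is_transient_connect_error (errno)) {
            break;                  /* permanent error: give up */
        }
        if (i >= max_attempts) {
            break;                  /* transient, but out of attempts */
        }
        sleep_msecs (delay_msecs * (unsigned long) i);
        i++;
    }
    saved_errno = errno;
    close (sd);
    errno = saved_errno;            /* report the connect() error */
    return -1;
}
```

A caller might invoke connect_with_retry (path, 10, 50) to make up to ten
attempts, sleeping 50 ms, 100 ms, 150 ms, and so on between them. Keeping
the errno classification in its own function makes the Linux and BSD
cases easy to check without a listening socket.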