libmunge: Fix connect failure retry for full socket queue
libmunge retries transient errors when connecting to munged, which
should cover errors arising from the listening socket's queue being
full.  However, PR #139 uncovered a bug: the retry logic did not
handle EAGAIN, which Linux returns when a nonblocking UNIX domain
socket connection cannot be completed immediately.

This commit fixes the while-loop in _m_msg_client_connect() that
retries connect() so both EAGAIN (Linux) and ECONNREFUSED (BSD)
are handled as transient errors that should be retried.
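
The retry policy described above can be sketched roughly as follows.
This is a minimal illustration, not the libmunge implementation: the
sketch_* names, the attempt limit, and the backoff interval are
hypothetical, and poll() with no descriptors is used here only as a
portable millisecond sleep.

```c
#include <errno.h>
#include <poll.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/socket.h>

/* Hypothetical limits for this sketch; libmunge defines its own
 * MUNGE_SOCKET_CONNECT_ATTEMPTS and retry interval elsewhere. */
#define SKETCH_CONNECT_ATTEMPTS  5
#define SKETCH_RETRY_MSECS       10

/* Return true if a failed connect() should be retried:
 * EAGAIN (Linux) or ECONNREFUSED (BSD) per the commit message. */
static bool
sketch_is_transient (int err)
{
    return (err == EAGAIN) || (err == ECONNREFUSED);
}

/* Try to connect [sd] to [addr], retrying transient errors after a
 * linearly increasing backoff.  Returns 0 on success, or -1 on error
 * with errno set by the last connect() call. */
static int
sketch_connect_retry (int sd, const struct sockaddr *addr, socklen_t len)
{
    int i = 1;

    while (1) {
        if (connect (sd, addr, len) == 0) {
            return 0;
        }
        if (errno == EINTR) {
            continue;               /* interrupted: retry immediately */
        }
        if (!sketch_is_transient (errno)) {
            break;                  /* permanent error: give up */
        }
        if (i >= SKETCH_CONNECT_ATTEMPTS) {
            break;                  /* out of attempts; errno is kept */
        }
        /* Sleep i * SKETCH_RETRY_MSECS ms; the result is ignored. */
        (void) poll (NULL, 0, i * SKETCH_RETRY_MSECS);
        i++;
    }
    return -1;
}
```

Treating both error codes as one "transient" class keeps the loop
identical across platforms, at the cost of also retrying an
ECONNREFUSED that means no server is listening at all.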

This was tested by setting "--listen-backlog=1", running munged
with the default 2 work threads, and running remunge with 64
threads.  First, the while-loop was altered so connect() errors
would not be retried.  remunge could reproduce EAGAIN on Linux,
and this behavior was dramatically more reproducible if vcpu > 1.
remunge could reproduce ECONNREFUSED on NetBSD 9.3 with vcpu=1, and
on FreeBSD 14.0 with vcpu=2.  Adding back the retry logic for EAGAIN
and ECONNREFUSED made this connect() failure difficult to reproduce
even with "--listen-backlog=1".

Tested:
- AlmaLinux 9.3, 8.9
- Arch Linux
- CentOS Linux Stream 9, Stream 8, 7.9.2009, 6.10
- Debian sid, 12.5, 11.9, 10.13, 9.13, 8.11, 7.11, 6.0.10, 5.0.10, 4.0
- Fedora 39, 38, 37
- FreeBSD 14.0, 13.3, 13.2
- NetBSD 9.3
- OpenBSD 7.4, 7.3
- openSUSE 15.5, 15.4
- Ubuntu 23.10, 22.04.4, 20.04.6, 18.04.6, 16.04.7, 14.04.6
dun committed Mar 11, 2024
1 parent 36e17dc commit f528358
Showing 1 changed file with 8 additions and 5 deletions.
13 changes: 8 additions & 5 deletions src/libmunge/m_msg_client.c
@@ -214,10 +214,13 @@ _m_msg_client_connect (m_msg_t m, char *path)
     i = 1;
     while (1) {
         /*
-         * If a call to connect() for a Unix domain stream socket finds that
-         * the listening socket's queue is full, ECONNREFUSED is returned
-         * immediately. (cf, Stevens UNPv1, s14.4, p378)
-         * If ECONNREFUSED, try again up to MUNGE_SOCKET_CONNECT_ATTEMPTS.
+         * Retry transient errors up to MUNGE_SOCKET_CONNECT_ATTEMPTS after a
+         * linearly increasing backoff.
+         * Linux: connect() returns EAGAIN for nonblocking UNIX domain sockets
+         * if the connection cannot be completed immediately.
+         * [Linux man-pages 6.05.01]
+         * BSD: connect() returns ECONNREFUSED for UNIX domain stream sockets
+         * when the listening socket's queue is full. [Stevens UNPv1]
          */
         n = connect (sd, (struct sockaddr *) &addr, sizeof (addr));

@@ -227,7 +230,7 @@ _m_msg_client_connect (m_msg_t m, char *path)
         if (errno == EINTR) {
             continue;
         }
-        if (errno != ECONNREFUSED) {
+        if ((errno != EAGAIN) && (errno != ECONNREFUSED)) {
             break;
         }
         if (i >= MUNGE_SOCKET_CONNECT_ATTEMPTS) {