
broker stuck at exit in zmq_ctx_term() #5484

Open
grondo opened this issue Oct 3, 2023 · 7 comments

@grondo
Contributor

grondo commented Oct 3, 2023

@jameshcorbett reported that a flux-coral2 CI test is failing with the most recent flux-core and flux-sched, with the following log:

  ok 1 - job-manager: load dws-jobtap plugin
  PASS: t1000-dws-dependencies.t 1 - job-manager: load dws-jobtap plugin
  ok 2 - job-manager: dws jobtap plugin works when coral2_dws is absent
  PASS: t1000-dws-dependencies.t 2 - job-manager: dws jobtap plugin works when coral2_dws is absent
  ok 3 - job-manager: dws jobtap plugin works when RPCs succeed
  PASS: t1000-dws-dependencies.t 3 - job-manager: dws jobtap plugin works when RPCs succeed
  ok 4 - job-manager: dws jobtap plugin works when creation RPC fails
  PASS: t1000-dws-dependencies.t 4 - job-manager: dws jobtap plugin works when creation RPC fails
  ok 5 - job-manager: dws jobtap plugin works when setup RPC fails
  PASS: t1000-dws-dependencies.t 5 - job-manager: dws jobtap plugin works when setup RPC fails
  ok 6 - job-manager: dws jobtap plugin works when post_run RPC fails
  PASS: t1000-dws-dependencies.t 6 - job-manager: dws jobtap plugin works when post_run RPC fails
  ok 7 - job-manager: dws jobtap plugin works when job hits exception during prolog
  PASS: t1000-dws-dependencies.t 7 - job-manager: dws jobtap plugin works when job hits exception during prolog
  # passed all 7 test(s)
  flux-start: 0 (pid 8494) Killed
  1..7

Note that all tests pass, but the broker is force-terminated with SIGKILL.

We ran the individual test in a Docker container and attached gdb. This is where the broker is stuck:

(gdb) where
#0  0x00007f9ef95390e1 in poll () from /lib64/libc.so.6
#1  0x00007f9efa92861e in zmq::signaler_t::wait(int) const ()
   from /lib64/libzmq.so.5
#2  0x00007f9efa904cc4 in zmq::mailbox_t::recv(zmq::command_t*, int) ()
   from /lib64/libzmq.so.5
#3  0x00007f9efa8f09c4 in zmq::ctx_t::terminate() () from /lib64/libzmq.so.5
#4  0x00007f9efa94982e in zmq_ctx_term () from /lib64/libzmq.so.5
#5  0x000000000040caa1 in main (argc=<optimized out>, argv=<optimized out>)
    at broker.c:548

Not sure what in this test is triggering this condition, but I'm wondering if anyone has hints on where to look. This didn't seem to be a problem until recently.

@grondo
Contributor Author

grondo commented Oct 3, 2023

The zmq_ctx_term documentation states:

After interrupting all blocking calls, zmq_ctx_term() shall block until the following conditions are satisfied:

  • All sockets open within context have been closed with zmq_close().
  • For each socket within context, all messages sent by the application with zmq_send() have either been physically transferred to a network peer, or the socket's linger period set with the ZMQ_LINGER socket option has expired.

So perhaps there is an outstanding socket still open or a message stuck?
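
For reference, libzmq's expected shutdown sequence is sketched below (a minimal illustration; flux's real socket handling is more involved):

#include <zmq.h>

/* Sketch of the shutdown sequence zmq_ctx_term() expects.  If any
 * socket created from the context is never closed, or holds unsent
 * messages with the default infinite linger, zmq_ctx_term() blocks
 * in mailbox_t::recv() exactly as in the backtrace above.
 */
static void clean_shutdown (void *ctx, void *sock)
{
    int linger = 0;  /* 0 = drop pending messages instead of waiting */

    zmq_setsockopt (sock, ZMQ_LINGER, &linger, sizeof (linger));
    zmq_close (sock);    /* every socket must be closed ... */
    zmq_ctx_term (ctx);  /* ... or this call never returns */
}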

@grondo
Contributor Author

grondo commented Oct 4, 2023

If I "comment out" the last test in t1000-dws-dependencies.t, the hang does not occur. In fact, the following test reproduces the issue by itself:

#!/bin/sh

test_description='Test dws-jobtap plugin with fake coral2_dws.py'

. $(dirname $0)/sharness.sh

test_under_flux 2 job

flux setattr log-stderr-level 1

PLUGINPATH=${FLUX_BUILD_DIR}/src/job-manager/plugins/.libs
DWS_SCRIPT=${SHARNESS_TEST_SRCDIR}/dws-dependencies/coral2_dws.py
DEPENDENCY_NAME="dws-create"
PROLOG_NAME="dws-setup"
EPILOG_NAME="dws-epilog"

test_expect_success 'job-manager: load dws-jobtap plugin' '
	flux jobtap load ${PLUGINPATH}/dws-jobtap.so
'

test_expect_success 'job-manager: dws jobtap plugin works when job hits exception during prolog' '
	create_jobid=$(flux mini submit -t 8 --output=dws4.out --error=dws4.out \
		flux python ${DWS_SCRIPT} --setup-hang) &&
	flux job wait-event -vt 15 -p guest.exec.eventlog ${create_jobid} shell.start &&
	jobid=$(flux mini submit --setattr=system.dw="foo" hostname) &&
	flux job wait-event -vt 5 -m description=${PROLOG_NAME} \
		${jobid} prolog-start &&
	flux job cancel $jobid &&
	flux job wait-event -vt 1 ${jobid} exception &&
	flux job wait-event -vt 5 -m description=${PROLOG_NAME} -m status=1 \
		${jobid} prolog-finish &&
	flux job wait-event -vt 5 -m description=${EPILOG_NAME} \
		${jobid} epilog-start &&
	flux job wait-event -vt 5 -m description=${EPILOG_NAME} \
		${jobid} epilog-finish &&
	flux job wait-event -vt 5 ${jobid} clean &&
	flux job wait-event -vt 5 ${create_jobid} clean
'
test_done

Note that this test seems to be simulating a "hang" in the dws server if I'm reading it right. That is, the jobtap plugin sends an RPC to the dws service, but in this test the dws service doesn't respond. I wonder if that has something to do with the later hang in zmq_ctx_term().

@grondo
Contributor Author

grondo commented Oct 4, 2023

The cause here seems to be a leaked future in the dws-jobtap plugin. When the service fails to respond, the callback in which the future is destroyed is never invoked, so the future is leaked.
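
The pattern boils down to something like this (a simplified sketch, not the plugin's actual code):

#include <flux/core.h>

/* Simplified sketch of the leak.  flux_rpc() creates a future, and
 * the continuation is the only place it is destroyed.  If the
 * service never answers, the continuation never runs, so the future
 * -- and the reference it holds on the flux_t handle -- is never
 * released.
 */
static void rpc_cb (flux_future_t *f, void *arg)
{
    /* ... handle the response ... */
    flux_future_destroy (f);  /* only reached if a response arrives */
}

static int send_rpc (flux_t *h)
{
    flux_future_t *f;

    if (!(f = flux_rpc (h, "dws.setup", NULL, FLUX_NODEID_ANY, 0))
        || flux_future_then (f, -1., rpc_cb, NULL) < 0) {
        flux_future_destroy (f);
        return -1;
    }
    return 0;  /* if dws.setup is never answered, f leaks here */
}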

If the future is set to be destroyed along with the job, then this hang goes away:

diff --git a/src/job-manager/plugins/dws-jobtap.c b/src/job-manager/plugins/dws-jobtap.c
index fe186a8..2837a52 100644
--- a/src/job-manager/plugins/dws-jobtap.c
+++ b/src/job-manager/plugins/dws-jobtap.c
@@ -290,13 +290,13 @@ static void setup_rpc_cb (flux_future_t *f, void *arg)
                                prolog_active);
         }
     }
-
 done:
-    flux_future_destroy (f);
+    return;
 }
 
 static void fetch_R_callback (flux_future_t *f, void *arg)
 {
+    flux_plugin_t *p = arg;
     json_t *R;
     flux_t *h = flux_future_get_flux (f);
     struct create_arg_t *args = flux_future_aux_get (f, "flux::fetch_R");
@@ -355,7 +355,13 @@ static void fetch_R_callback (flux_future_t *f, void *arg)
                            0,
                            "Failed to send dws.setup RPC",
                            prolog_active);
+        goto done;
     }
+    flux_jobtap_job_aux_set (p,
+                             args->id,
+                             NULL,
+                             setup_rpc_fut,
+                             (flux_free_f) flux_future_destroy);
 
 done:
     flux_future_destroy (f);
@@ -412,7 +418,7 @@ static int run_cb (flux_plugin_t *p,
                                     prolog_active,
                                     NULL)
                    < 0
-            || flux_future_then (fetch_R_future, -1., fetch_R_callback, NULL) < 0) {
+            || flux_future_then (fetch_R_future, -1., fetch_R_callback, p) < 0) {
             flux_future_destroy (fetch_R_future);
             flux_log_error (h,
                             "dws-jobtap: "

I'll open an issue in flux-coral2 on the leaks. However, this hang behavior does seem to be new...

@jameshcorbett
Member

A leaked future causes a hang? I never would have been able to figure that out... thanks for digging into this @grondo!

@garlick
Member

garlick commented Oct 4, 2023

Great job running this down. Hmm, I believe the future takes a reference on the flux_t handle, which for a broker module contains one end of a zeromq socket for the shmem:// connection. So when the module unloads, the zeromq socket might not be getting destroyed due to the future's reference. The broker just checks whether all the modules were cleanly shut down, on the presumption that if they went through their teardown, the flux_t refcounts would be zero.
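
Roughly, the suspected chain (a sketch based on my assumptions about the handle refcount semantics, not verified against the broker source):

#include <flux/core.h>

/* leaked flux_future_t
 *   -> holds a reference on the module's flux_t handle
 *        -> handle owns one end of the shmem:// zeromq socket pair
 *
 * flux_close() only tears the handle down once the last reference
 * is dropped, so the zeromq socket stays open and the broker's
 * zmq_ctx_term() waits on it forever.
 */
static void demo (void)
{
    flux_t *h = flux_open (NULL, 0);

    flux_incref (h);  /* what a future effectively does at creation */
    flux_close (h);   /* refcount 2 -> 1: socket NOT destroyed yet */
    flux_decref (h);  /* refcount 1 -> 0: socket finally closed */
}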

I'll open an issue in flux-core.

Edit: oops! This is flux-core and this is the issue. Got it!

@garlick
Member

garlick commented Oct 4, 2023

Well, I guess a test failed and that uncovered a bug in the module, so that's a good outcome compared to before. But ideally we would log something, avoid the hang, and exit with a nonzero exit code. Maybe we could manage:

job-manager: module leaked a socket reference - was a future leaked?
broker: skipping 0MQ shutdown due to presumed module socket leak
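
In the broker's shutdown path that might look something like the following (a rough sketch; modules_leaked() and its bookkeeping are hypothetical, nothing like it exists yet):

/* Hypothetical check in the broker before the zmq_ctx_term() call.
 * modules_leaked() would report any module whose flux_t handle
 * refcount was nonzero at unload.
 */
if (modules_leaked ()) {
    log_msg ("job-manager: module leaked a socket reference"
             " - was a future leaked?");
    log_msg ("broker: skipping 0MQ shutdown due to presumed"
             " module socket leak");
    exit_rc = 1;         /* exit nonzero as suggested above */
}
else
    zmq_ctx_term (ctx);  /* safe: all module sockets are closed */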

@garlick
Member

garlick commented Oct 24, 2023

Just happened to notice the commit that added the reference on the handle: 01b383b

Thought I'd x-ref it here in case someone (like me) is tempted to muck around with this refcounting.
