broker stuck at exit in zmq_ctx_term()
#5484
The `zmq_ctx_term` documentation states that context termination blocks until all sockets open within the context have been closed. So perhaps there is an outstanding socket still open or a message stuck?
If I "comment out" the last test in the script below, the hang goes away:

```sh
#!/bin/sh

test_description='Test dws-jobtap plugin with fake coral2_dws.py'

. $(dirname $0)/sharness.sh

test_under_flux 2 job

flux setattr log-stderr-level 1

PLUGINPATH=${FLUX_BUILD_DIR}/src/job-manager/plugins/.libs
DWS_SCRIPT=${SHARNESS_TEST_SRCDIR}/dws-dependencies/coral2_dws.py
DEPENDENCY_NAME="dws-create"
PROLOG_NAME="dws-setup"
EPILOG_NAME="dws-epilog"

test_expect_success 'job-manager: load dws-jobtap plugin' '
	flux jobtap load ${PLUGINPATH}/dws-jobtap.so
'

test_expect_success 'job-manager: dws jobtap plugin works when job hits exception during prolog' '
	create_jobid=$(flux mini submit -t 8 --output=dws4.out --error=dws4.out \
		flux python ${DWS_SCRIPT} --setup-hang) &&
	flux job wait-event -vt 15 -p guest.exec.eventlog ${create_jobid} shell.start &&
	jobid=$(flux mini submit --setattr=system.dw="foo" hostname) &&
	flux job wait-event -vt 5 -m description=${PROLOG_NAME} \
		${jobid} prolog-start &&
	flux job cancel $jobid &&
	flux job wait-event -vt 1 ${jobid} exception &&
	flux job wait-event -vt 5 -m description=${PROLOG_NAME} -m status=1 \
		${jobid} prolog-finish &&
	flux job wait-event -vt 5 -m description=${EPILOG_NAME} \
		${jobid} epilog-start &&
	flux job wait-event -vt 5 -m description=${EPILOG_NAME} \
		${jobid} epilog-finish &&
	flux job wait-event -vt 5 ${jobid} clean &&
	flux job wait-event -vt 5 ${create_jobid} clean
'

test_done
```

Note that this test seems to be simulating a "hang" in the dws server, if I'm reading it right. That is, the jobtap plugin sends an RPC to the dws service, but in this test the dws service doesn't respond. I wonder if that has something to do with the later hang in `zmq_ctx_term()`.
The cause here seems to be a leaked future in the dws-jobtap plugin. When the service fails to respond, the callback in which the future is destroyed is never invoked, so the future is leaked. If the future is instead set to be destroyed along with the job, this hang goes away:

```diff
diff --git a/src/job-manager/plugins/dws-jobtap.c b/src/job-manager/plugins/dws-jobtap.c
index fe186a8..2837a52 100644
--- a/src/job-manager/plugins/dws-jobtap.c
+++ b/src/job-manager/plugins/dws-jobtap.c
@@ -290,13 +290,13 @@ static void setup_rpc_cb (flux_future_t *f, void *arg)
                                   prolog_active);
         }
     }
-
 done:
-    flux_future_destroy (f);
+    return;
 }
 
 static void fetch_R_callback (flux_future_t *f, void *arg)
 {
+    flux_plugin_t *p = arg;
     json_t *R;
     flux_t *h = flux_future_get_flux (f);
     struct create_arg_t *args = flux_future_aux_get (f, "flux::fetch_R");
@@ -355,7 +355,13 @@ static void fetch_R_callback (flux_future_t *f, void *arg)
                                   0,
                                   "Failed to send dws.setup RPC",
                                   prolog_active);
+        goto done;
     }
+    flux_jobtap_job_aux_set (p,
+                             args->id,
+                             NULL,
+                             setup_rpc_fut,
+                             (flux_free_f) flux_future_destroy);
 done:
     flux_future_destroy (f);
@@ -412,7 +418,7 @@ static int run_cb (flux_plugin_t *p,
                                        prolog_active,
                                        NULL)
            < 0
-        || flux_future_then (fetch_R_future, -1., fetch_R_callback, NULL) < 0) {
+        || flux_future_then (fetch_R_future, -1., fetch_R_callback, p) < 0) {
         flux_future_destroy (fetch_R_future);
         flux_log_error (h,
                         "dws-jobtap: "
```
I'll open an issue in flux-coral2 on the leaks. However, this hang behavior does seem to be new...
A leaked future causes a hang? I never would have been able to figure that out... thanks for digging into this @grondo !
Great job running this down. Hmm, I believe the future takes a reference on the flux_t handle, which for a broker module contains one end of a zeromq socket for the shmem:// connection. So when the module unloads, the zeromq socket might not be getting destroyed due to the future's reference. The broker just checks that all the modules were cleanly shut down, on the presumption that if they went through their teardown, the flux_t refcounts would be zero. I'll open an issue in flux-core.

Edit: oops! This is flux-core, and this is the issue. Got it!
Well, I guess a test failed and that uncovered a bug in the module, so that's a good outcome compared to before. But ideally we would log something, avoid the hang, and exit with a nonzero exit code. Maybe we could manage that.
Just happened to notice the commit that added the reference on the handle: 01b383b. Thought I'd x-ref it here in case someone (like me) is tempted to muck around with this refcounting.
@jameshcorbett reported that a flux-coral2 CI test is failing with the most recent flux-core + flux-sched with the following log:

Note all tests pass, but the broker is force-terminated with SIGKILL.

We ran the individual test in a docker container and attached with `gdb`. This is where the broker is stuck:

Not sure what in this test is triggering this condition, but wondering if anyone has any hints on where to look. This didn't seem to be a problem until recently.