broker crash when sched-simple fails to unregister #5108
You can count on me to come up with strange and mysterious errors! 😆 I was wondering if, given that some customization is needed (and that I have saved an rc file separately from the repository), there might be a way to separate that logic, so I always get the latest rc shipped with flux but can still add my customization (and don't run into an issue like this, which results from saving and using an older version). Thanks for posting this!
The way to add to the rcX files is to drop a file into However, I think you are just adding the [resource]
noverify = true |
Reproduced locally and this fixes the crash.
diff --git a/src/modules/sched-simple/sched.c b/src/modules/sched-simple/sched.c
index 970d74314..752a5e200 100644
--- a/src/modules/sched-simple/sched.c
+++ b/src/modules/sched-simple/sched.c
@@ -141,10 +141,17 @@ err:
 static void simple_sched_destroy (flux_t *h, struct simple_sched *ss)
 {
-    struct jobreq *job = zlistx_first (ss->queue);
-    while (job) {
-        flux_respond_error (h, job->msg, ENOSYS, "simple sched exiting");
-        job = zlistx_next (ss->queue);
+    struct jobreq *job;
+
+    if (ss == NULL)
+        return;
+
+    if (ss->queue) {
+        job = zlistx_first (ss->queue);
+        while (job) {
+            flux_respond_error (h, job->msg, ENOSYS, "simple sched exiting");
+            job = zlistx_next (ss->queue);
+        }
     }
     flux_future_destroy (ss->acquire_f);
     zlistx_destroy (&ss->queue);
There's still a strange set of errors here though:
Edit: Fixed the first diff; it was pasted from the wrong buffer, I guess.
Testing adding Update: works!
This hugely simplifies the setup - thanks @grondo ! |
Problem: there is no test coverage for attempting to load multiple schedulers at once. This reproduces issue flux-framework#5108
Problem: when sched-simple fails to register the "sched" service name (via libschedutil), it triggers a zlistx assertion failure in simple_sched_destroy(). Check ss->queue for NULL before calling zlistx_first() on it. Fixes flux-framework#5108
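The pattern behind this fix can be sketched in isolation: a destroy function that may run on a partially initialized context should check each member before handing it to an API that does not accept NULL. The following is a minimal sketch only, not the sched-simple code. The names `struct queue`, `struct sched_ctx`, `queue_drain`, `sched_ctx_create`, and `sched_ctx_destroy` are hypothetical stand-ins invented for illustration, and the early-failure flag merely imitates a setup step (such as service registration) failing before the queue exists.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a job queue (not the czmq zlistx type). */
struct queue {
    int pending;                /* number of queued requests */
};

/* Hypothetical stand-in for the scheduler context. */
struct sched_ctx {
    struct queue *queue;        /* NULL until setup gets this far */
    int *acquire;               /* stands in for a later-created resource */
};

/* Fail every pending request and return how many were handled.
 * Like the list API in the issue, this requires a non-NULL queue. */
static int queue_drain (struct queue *q)
{
    int n;

    assert (q != NULL);
    n = q->pending;
    q->pending = 0;
    return n;
}

/* Safe to call with NULL or with a partially initialized context.
 * Returns the number of pending requests that were failed. */
int sched_ctx_destroy (struct sched_ctx *ctx)
{
    int drained = 0;

    if (ctx == NULL)
        return 0;
    if (ctx->queue)             /* guard: the queue may never have been created */
        drained = queue_drain (ctx->queue);
    free (ctx->acquire);        /* free (NULL) is a no-op */
    free (ctx->queue);
    free (ctx);
    return drained;
}

/* Create a context. If fail_early is nonzero, return before the queue
 * is allocated, imitating a setup step that failed part way through. */
struct sched_ctx *sched_ctx_create (int fail_early, int pending)
{
    struct sched_ctx *ctx = calloc (1, sizeof (*ctx));

    if (!ctx)
        return NULL;
    if (fail_early)
        return ctx;             /* queue and acquire are still NULL */
    if (!(ctx->queue = calloc (1, sizeof (*ctx->queue)))
        || !(ctx->acquire = calloc (1, sizeof (*ctx->acquire)))) {
        sched_ctx_destroy (ctx);
        return NULL;
    }
    ctx->queue->pending = pending;
    return ctx;
}
```

With this shape, the error path in the create function can call the destroy function unconditionally, which is roughly the situation the patch above makes safe.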
Problem: there is no test coverage for attempting to load a scheduler when one is already loaded. This reproduces issue flux-framework#5108
Not sure what happened here, but @vsoch was able to trigger a broker crash by using an older rc1 with a newer flux:

The older rc1 explains why sched-simple was loaded (flux module list output changed), but will have to inspect code to see how we got to the service_unregister errors and then a crash. (Simply attempting to load sched-simple with fluxion modules loaded doesn't reproduce the issue.)