broker crash when sched-simple fails to unregister #5108

Closed
grondo opened this issue Apr 22, 2023 · 4 comments

@grondo
Contributor

grondo commented Apr 22, 2023

Not sure what happened here, but @vsoch was able to trigger a broker crash by using an older rc1 with a newer flux:

node-1  | broker.info[0]: start: none->join 1.35219ms
node-1  | broker.info[0]: parent-none: join->init 0.017271ms
node-1  | cron.info[0]: synchronizing cron tasks to event heartbeat.pulse
node-1  | job-manager.info[0]: restart: 0 jobs
node-1  | job-manager.info[0]: restart: 0 running jobs
node-1  | job-manager.info[0]: restart: checkpoint.job-manager not found
node-1  | broker.info[0]: rc1.0: running /etc/flux/rc1.d/01-sched-fluxion
node-1  | sched-fluxion-resource.info[0]: version 0.27.0-11-g214aa271
node-1  | sched-fluxion-resource.warning[0]: create_reader: allowlist unsupported
node-1  | sched-fluxion-resource.info[0]: populate_resource_db: loaded resources from core's resource.acquire
node-1  | sched-fluxion-qmanager.info[0]: version 0.27.0-11-g214aa271
node-1  | broker.info[0]: rc1.0: running /etc/flux/rc1.d/02-cron
node-1  | sched-simple.err[0]: service_unregister: Invalid argument
node-1  | sched-simple.err[0]: service_unregister: Invalid argument
node-1  | flux-broker: zlistx.c:234: fzlistx_first: Assertion `self' failed.
node-1  | /home/fluxuser/entrypoint.sh: line 64:    20 Aborted                 (core dumped) FLUX_FAKE_HOSTNAME=$thisHost flux start -o --config /etc/flux/config ${brokerOptions} sleep inf

The older rc1 explains why sched-simple was loaded (the flux module list output changed), but I will have to inspect the code to see how we got to the service_unregister errors and then the crash. (Simply attempting to load sched-simple with the fluxion modules already loaded doesn't reproduce the issue.)

@vsoch
Member

vsoch commented Apr 22, 2023

You can count on me to come up with strange and mysterious errors! 😆

I was wondering whether, given that I need some customization (and have been saving an rc file separately from the repository to get it), there might be a way to separate that logic, so that I always get the latest rc shipped with flux but can still add my customization, and don't run into an issue like this one from saving and using an older version.

Thanks for posting this!

@grondo
Contributor Author

grondo commented Apr 22, 2023

The way to add to the rcX files is to drop a file into /etc/flux/rcX.d. E.g. for rc1 add a new file /etc/flux/rc1.d/10-compose (or whatever you'd like to name it). rc scripts are run serially in order, so the 10 prefix causes this one to run after the fluxion rc script in the same directory.

However, I think you are just adding the noverify option to the resource module here. The best way to do that is in your config file, e.g. add:

[resource]
noverify = true

@grondo
Contributor Author

grondo commented Apr 22, 2023

Reproduced locally and this fixes the crash.

diff --git a/src/modules/sched-simple/sched.c b/src/modules/sched-simple/sched.c
index 970d74314..752a5e200 100644
--- a/src/modules/sched-simple/sched.c
+++ b/src/modules/sched-simple/sched.c
@@ -141,10 +141,17 @@ err:
 
 static void simple_sched_destroy (flux_t *h, struct simple_sched *ss)
 {
-    struct jobreq *job = zlistx_first (ss->queue);
-    while (job) {
-        flux_respond_error (h, job->msg, ENOSYS, "simple sched exiting");
-        job = zlistx_next (ss->queue);
+    struct jobreq *job;
+
+    if (ss == NULL)
+        return;
+
+    if (ss->queue) {
+        job = zlistx_first (ss->queue);
+        while (job) {
+            flux_respond_error (h, job->msg, ENOSYS, "simple sched exiting");
+            job = zlistx_next (ss->queue);
+        }
     }
     flux_future_destroy (ss->acquire_f);
     zlistx_destroy (&ss->queue);

There's still a strange set of errors here though:

2023-04-22T13:55:57.230346Z sched-simple.err[0]: service_unregister: Invalid argument
2023-04-22T13:55:57.230869Z sched-simple.err[0]: service_unregister: Invalid argument
2023-04-22T13:55:57.230944Z sched-simple.err[0]: schedutil_create: Invalid argument
2023-04-22T13:55:57.230959Z sched-simple.crit[0]: module exiting abnormally
2023-04-22T13:55:57.232986Z broker.err[0]: rc1.0: flux-module: load sched-simple: Invalid argument
2023-04-22T13:55:57.234314Z broker.err[0]: rc1.0: /home/grondo/git/flux-core.git/etc/rc1 Exited (rc=1) 1.9s
2023-04-22T13:55:57.518716Z sched-fluxion-resource.err[0]: update_resource: exiting due to resource.acquire failure: Operation canceled
2023-04-22T13:55:57.520563Z sched-fluxion-qmanager.err[0]: update_on_resource_response: exiting due to sched-fluxion-resource.notify failure: Operation canceled

Edit: Fixed the first diff, paste from the wrong buffer I guess
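
The diff above follows a common C pattern: a destroy function that may be called from an error path has to tolerate both a NULL object and a partially constructed one. Below is a minimal, self-contained sketch of that pattern. It does not use the real flux-core or czmq APIs; `struct list`, `list_first()`, `struct sched`, and `sched_destroy()` are hypothetical stand-ins, with `list_first()` asserting on a NULL list the way the `fzlistx_first: Assertion 'self' failed` message in the log suggests.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a cursor-based list such as zlistx_t. */
struct node {
    struct node *next;
    int id;
};

struct list {
    struct node *head;
    struct node *cursor;
};

struct list *list_create (void)
{
    return calloc (1, sizeof (struct list));
}

/* Append an item at the tail; returns 0 on success, -1 on allocation failure. */
int list_append (struct list *self, int id)
{
    struct node *n, **tail;

    assert (self);
    if (!(n = calloc (1, sizeof (*n))))
        return -1;
    n->id = id;
    for (tail = &self->head; *tail; tail = &(*tail)->next)
        ;
    *tail = n;
    return 0;
}

/* Asserts on a NULL list, which is what aborts the process if the
 * caller passes a queue that was never created. */
struct node *list_first (struct list *self)
{
    assert (self);
    self->cursor = self->head;
    return self->cursor;
}

struct node *list_next (struct list *self)
{
    assert (self);
    if (self->cursor)
        self->cursor = self->cursor->next;
    return self->cursor;
}

/* Tolerates *selfp == NULL, so it is safe to call unconditionally. */
void list_destroy (struct list **selfp)
{
    if (selfp && *selfp) {
        struct node *n = (*selfp)->head;
        while (n) {
            struct node *next = n->next;
            free (n);
            n = next;
        }
        free (*selfp);
        *selfp = NULL;
    }
}

/* Hypothetical scheduler context; queue stays NULL if setup fails early. */
struct sched {
    struct list *queue;
};

/* Safe on NULL and on a context whose queue was never created.
 * Returns the number of pending requests that were drained. */
int sched_destroy (struct sched *ss)
{
    int drained = 0;

    if (ss == NULL)
        return 0;
    if (ss->queue) {
        struct node *job = list_first (ss->queue);
        while (job) {
            drained++; /* the real module responds with an error here */
            job = list_next (ss->queue);
        }
    }
    list_destroy (&ss->queue);
    free (ss);
    return drained;
}
```

In this sketch, calling `sched_destroy()` on a context allocated with `calloc()` but never given a queue returns 0 instead of tripping the assertion in `list_first()`, which mirrors the `ss->queue` check added in the diff.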

@vsoch
Member

vsoch commented Apr 22, 2023

Testing adding noverify to the resource section now - that would remove an entire file I need to save and bind (would be great!)

Update: works!

[root@node-1 fluxuser]# flux run hostname
node-3

This hugely simplifies the setup - thanks @grondo!

garlick added a commit to garlick/flux-core that referenced this issue Apr 22, 2023
Problem: there is no test coverage for attempting to load multiple
schedulers at once.

This reproduces issue flux-framework#5108
garlick added a commit to garlick/flux-core that referenced this issue Apr 22, 2023
Problem: when sched-simple fails to register the "sched"
service name (via libschedutil), it triggers a zlistx assertion
failure in simple_sched_destroy().

Check ss->queue for NULL before calling zlistx_first() on it.

Fixes flux-framework#5108
garlick added a commit to garlick/flux-core that referenced this issue Apr 22, 2023
Problem: there is no test coverage for attempting to load a scheduler
when one is already loaded.

This reproduces issue flux-framework#5108
@mergify mergify bot closed this as completed in b94ca57 Apr 22, 2023