
job-manager: improve handling of offline ranks in job prolog #5910

Merged
merged 5 commits into flux-framework:master from the prolog-and-offline branch on Apr 23, 2024

Conversation

@grondo (Contributor) commented Apr 22, 2024

Recently, jobs were consistently failing with the exception "job prolog failed with exit code=1" and no other details in the logs or job eventlog. It turned out that the scheduler was handing out offline nodes, and the prolog fails in this case, but due to the use of an incorrect log level, the log message for this condition was suppressed. Users were also confused by the lack of detail in the job exception.

This PR improves prolog handling of offline ranks by

  • using the correct log level for the warning "prolog: rank[s] XX offline. Skipping."
  • raising a non-fatal job exception with this same detail (that some broker ranks were offline) so that a user is notified in the job eventlog of the reason the prolog will fail.

flux perilog-run can't raise the fatal exception itself, since that would cancel the prolog, which may still be running on other ranks. The nonzero exit status from the prolog will raise the fatal exception, so a job eventlog failing in this manner might look like this:

[Apr22 22:39] submit userid=6885 urgency=16 flags=0 version=1
[  +0.011808] validate
[  +0.022818] depend
[  +0.022857] priority priority=16
[  +0.023847] alloc annotations={"sched":{"resource_summary":"rank[0-3]/core0"}}
[  +0.024318] prolog-start description="job-manager.prolog"
[  +0.253649] exception type="prolog" severity=1 note="rank 3 offline" userid=6885
[  +0.384952] exception type="prolog" severity=0 note="prolog exited with exit code=1" userid=6885
[  +0.385053] prolog-finish description="job-manager.prolog" status=256
[  +0.385087] free
[  +0.385109] clean

which is a bit better than completely silent failure.
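To illustrate why the nonfatal exception helps, here is a short sketch that picks the offline-rank detail out of the human-readable eventlog text shown above. The regex and helper name are illustrative only, not part of Flux:

```python
import re

# Hypothetical helper: scan `flux job eventlog` text output for nonfatal
# (severity >= 1) prolog exceptions like the one in the sample above.
EXC_RE = re.compile(
    r'exception type="(?P<type>[^"]+)" severity=(?P<sev>\d+) note="(?P<note>[^"]*)"'
)

def nonfatal_prolog_exceptions(eventlog_text):
    """Return the notes of nonfatal prolog exceptions in eventlog text."""
    notes = []
    for line in eventlog_text.splitlines():
        m = EXC_RE.search(line)
        if m and m.group("type") == "prolog" and int(m.group("sev")) > 0:
            notes.append(m.group("note"))
    return notes

sample = (
    '[  +0.253649] exception type="prolog" severity=1 note="rank 3 offline" userid=6885\n'
    '[  +0.384952] exception type="prolog" severity=0 note="prolog exited with exit code=1" userid=6885'
)
print(nonfatal_prolog_exceptions(sample))  # -> ['rank 3 offline']
```

With only the severity=0 entry present (the pre-PR behavior), this scan comes back empty, which is exactly the "silent failure" the nonfatal exception fixes.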

Problem: flux-perilog-run.py fails if there are offline ranks, but
the INFO log message is suppressed by default.

Promote the message to a WARNING so it is logged.

Additionally fix use of plural 'ranks' when there is only one rank
offline.
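A minimal sketch of both fixes using Python's stdlib logging (names are illustrative, not taken from flux-perilog-run.py): at the default WARNING level an info() message is suppressed while warning() is emitted, and "rank" is pluralized only when more than one rank is offline:

```python
import logging

# The default effective level is WARNING, so INFO messages (like the
# original offline-ranks message) are suppressed unless promoted.
logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("perilog-demo")

def warn_offline(ranks):
    """Log (and return) the offline-ranks message with correct pluralization."""
    plural = "s" if len(ranks) > 1 else ""
    msg = f"prolog: rank{plural} {','.join(map(str, ranks))} offline. Skipping."
    log.warning(msg)  # emitted at the default level
    log.info(msg)     # would be suppressed at the default level
    return msg

warn_offline([3])     # -> "prolog: rank 3 offline. Skipping."
warn_offline([1, 3])  # -> "prolog: ranks 1,3 offline. Skipping."
```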

Problem: When a prolog fails due to offline ranks, a generic exception
like "prolog exited with exit code=1" is raised after the prolog
exits, but this gives the user and admins no details about why the
prolog failed.

Add a nonfatal exception from the job prolog when there are offline
ranks detected so that a specific reason is associated with the
job eventlog.

Problem: t2274-manager-perilog.t runs a test instance with the default
topology, but offline-rank testing is easier to manage if the TBON
is flat.

Update test_under_flux() to run a flat TBON for this test.

Problem: The handling of offline ranks from the job-manager prolog
needs to be tested, but there is no easy way to force this condition
(a rank going offline between the scheduler allocating resources and
the prolog execution).

Add a jobtap plugin which can be loaded before the perilog plugin
that will set rank 3 offline first thing in RUN state for a job.
@garlick (Member) left a comment:
LGTM! I just spotted one nit.

test_must_fail flux job attach $id &&
flux jobtap remove offline.so
'
test_expect_success 'perilog: offline ranks are logeed by prolog' '
A reviewer (Member) commented:
s/logeed/logged/

@grondo (Contributor, Author) replied:

Fixed that and I'll set MWP. Thanks

Problem: In t2274-manager-perilog.t, handling of offline ranks by
flux-perilog-run.py is only tested after a job has run by manually
invoking the command. However, this doesn't ensure that the script
raises a nonfatal exception on the job instead of silently failing.

Use the new offline.so jobtap plugin to force rank 3 offline instead
of disconnecting the rank outside of any job. Since the offline.so
plugin is loaded before the perilog.so plugin, this simulates a rank
going offline between the scheduler assigning resources and the prolog.

Ensure that a nonfatal exception is raised on the job.
codecov bot commented Apr 23, 2024

Codecov Report

Merging #5910 (a557e58) into master (63953d3) will decrease coverage by 0.03%.
The diff coverage is 92.30%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5910      +/-   ##
==========================================
- Coverage   83.34%   83.31%   -0.03%     
==========================================
  Files         514      514              
  Lines       82901    82913      +12     
==========================================
- Hits        69092    69082      -10     
- Misses      13809    13831      +22     
Files                       | Coverage Δ
src/cmd/flux-perilog-run.py | 94.02% <92.30%> (-0.16%) ⬇️

... and 17 files with indirect coverage changes

@mergify mergify bot merged commit d8b86fd into flux-framework:master Apr 23, 2024
35 checks passed
@grondo grondo deleted the prolog-and-offline branch April 23, 2024 16:33