
job-manager: improve handling of offline ranks in job prolog #5910

Merged
merged 5 commits into flux-framework:master from the prolog-and-offline branch on Apr 23, 2024

Conversation

@grondo (Contributor) commented Apr 22, 2024

Recently, jobs were consistently failing with the exception "job prolog failed with exit code=1" and no other details in the logs or job eventlog. It turned out that the scheduler was handing out offline nodes, and the prolog fails in this case, but due to the use of an incorrect log level, the log message for this condition was suppressed. Users were also confused by the lack of detail in the job exception.

This PR improves prolog handling of offline ranks by

  • using the correct log level for the warning "prolog: rank[s] XX offline. Skipping."
  • raising a non-fatal job exception with this same detail (that some broker ranks were offline) so that a user is notified in the job eventlog of the reason the prolog will fail.

flux perilog-run can't raise the fatal exception itself, since that would cancel the prolog, which may still be running on other ranks. The nonzero exit status from the prolog will raise the fatal exception, so a job eventlog failing in this manner might look like this:

[Apr22 22:39] submit userid=6885 urgency=16 flags=0 version=1
[  +0.011808] validate
[  +0.022818] depend
[  +0.022857] priority priority=16
[  +0.023847] alloc annotations={"sched":{"resource_summary":"rank[0-3]/core0"}}
[  +0.024318] prolog-start description="job-manager.prolog"
[  +0.253649] exception type="prolog" severity=1 note="rank 3 offline" userid=6885
[  +0.384952] exception type="prolog" severity=0 note="prolog exited with exit code=1" userid=6885
[  +0.385053] prolog-finish description="job-manager.prolog" status=256
[  +0.385087] free
[  +0.385109] clean

which is a bit better than completely silent failure.
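To illustrate why the nonfatal exception helps, here is a short sketch that picks the offline-rank detail out of the human-readable eventlog text shown above. The regex and helper name are illustrative only, not part of Flux:

```python
import re

# Hypothetical helper: scan `flux job eventlog` text output for nonfatal
# (severity >= 1) prolog exceptions like the one in the sample above.
EXC_RE = re.compile(
    r'exception type="(?P<type>[^"]+)" severity=(?P<sev>\d+) note="(?P<note>[^"]*)"'
)

def nonfatal_prolog_exceptions(eventlog_text):
    """Return the notes of nonfatal prolog exceptions in eventlog text."""
    notes = []
    for line in eventlog_text.splitlines():
        m = EXC_RE.search(line)
        if m and m.group("type") == "prolog" and int(m.group("sev")) > 0:
            notes.append(m.group("note"))
    return notes

sample = (
    '[  +0.253649] exception type="prolog" severity=1 note="rank 3 offline" userid=6885\n'
    '[  +0.384952] exception type="prolog" severity=0 note="prolog exited with exit code=1" userid=6885'
)
print(nonfatal_prolog_exceptions(sample))  # -> ['rank 3 offline']
```

With only the severity=0 entry present (the pre-PR behavior), this scan comes back empty, which is exactly the "silent failure" the nonfatal exception fixes.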

Problem: flux-perilog-run.py fails if there are offline ranks, but
the INFO log message is suppressed by default.

Promote the message to a WARNING so it is logged.

Additionally fix use of plural 'ranks' when there is only one rank
offline.
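A minimal sketch of both fixes using Python's stdlib logging (names are illustrative, not taken from flux-perilog-run.py): at the default WARNING level an info() message is suppressed while warning() is emitted, and "rank" is pluralized only when more than one rank is offline:

```python
import logging

# The default effective level is WARNING, so INFO messages (like the
# original offline-ranks message) are suppressed unless promoted.
logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("perilog-demo")

def warn_offline(ranks):
    """Log (and return) the offline-ranks message with correct pluralization."""
    plural = "s" if len(ranks) > 1 else ""
    msg = f"prolog: rank{plural} {','.join(map(str, ranks))} offline. Skipping."
    log.warning(msg)  # emitted at the default level
    log.info(msg)     # would be suppressed at the default level
    return msg

warn_offline([3])     # -> "prolog: rank 3 offline. Skipping."
warn_offline([1, 3])  # -> "prolog: ranks 1,3 offline. Skipping."
```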

Problem: When a prolog fails due to offline ranks, a generic exception
like "prolog exited with exit code=1" is raised after the prolog
exits, but this gives the user and admins no details about why the
prolog failed.

Add a nonfatal exception from the job prolog when there are offline
ranks detected so that a specific reason is associated with the
job eventlog.

Problem: t2274-manager-perilog.t runs a test instance with the default
topology, but offline-rank testing is easier to manage if the TBON
is flat.

Update test_under_flux() to run a flat TBON for this test.

Problem: The handling of offline ranks from the job-manager prolog
needs to be tested, but there is no easy way to force this condition
(a rank going offline between the scheduler allocating resources and
the prolog execution).

Add a jobtap plugin which can be loaded before the perilog plugin
that will set rank 3 offline first thing in RUN state for a job.
@garlick (Member) left a comment:
LGTM! I just spotted one nit.

test_must_fail flux job attach $id &&
flux jobtap remove offline.so
'
test_expect_success 'perilog: offline ranks are logeed by prolog' '
A reviewer (Member) commented:
s/logeed/logged/

@grondo (Contributor, Author) replied:

Fixed that and I'll set MWP. Thanks

Problem: In t2274-manager-perilog.t, handling of offline ranks by
flux-perilog-run.py is only tested after a job has run by manually
invoking the command. However, this doesn't ensure that the script
raises a nonfatal exception on the job instead of silently failing.

Use the new offline.so jobtap plugin to force rank 3 offline instead
of disconnecting the rank outside of any job. Since the offline.so
plugin is loaded before the perilog.so plugin, this simulates a rank
going offline between the scheduler assigning resources and the prolog.

Ensure that a nonfatal exception is raised on the job.
codecov bot commented Apr 23, 2024

Codecov Report

Merging #5910 (a557e58) into master (63953d3) will decrease coverage by 0.03%.
The diff coverage is 92.30%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5910      +/-   ##
==========================================
- Coverage   83.34%   83.31%   -0.03%     
==========================================
  Files         514      514              
  Lines       82901    82913      +12     
==========================================
- Hits        69092    69082      -10     
- Misses      13809    13831      +22     
Files                       | Coverage Δ
src/cmd/flux-perilog-run.py | 94.02% <92.30%> (-0.16%) ⬇️

... and 17 files with indirect coverage changes

@mergify mergify bot merged commit d8b86fd into flux-framework:master Apr 23, 2024
35 checks passed
@grondo grondo deleted the prolog-and-offline branch April 23, 2024 16:33