job-manager: improve handling of offline ranks in job prolog #5910
Conversation
Problem: flux-perilog-run.py fails if there are offline ranks, but the INFO log message is suppressed by default. Promote the message to a WARNING so it is logged. Additionally fix use of plural 'ranks' when there is only one rank offline.
Problem: When a prolog fails due to offline ranks, a generic exception like "prolog exited with exit code=1" is raised after the prolog exits, but this gives users and admins no details about why the prolog failed. Raise a nonfatal exception from the job prolog when offline ranks are detected so that a specific reason is recorded in the job eventlog (a rough illustration follows these commit messages).
Problem: t2274-manager-perilog.t runs a test instance with the default topology, but offline rank testing will be easier to manage if the TBON is flat. Update test_under_flux() to run a flat TBON for this test.
Problem: The handling of offline ranks from the job-manager prolog needs to be tested, but there is no easy way to force this condition (a rank going offline between the scheduler allocating resources and the prolog execution). Add a jobtap plugin which can be loaded before the perilog plugin and which sets rank 3 offline as soon as a job enters RUN state.
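As a rough illustration of the nonfatal exception mentioned in the second commit above (this is not the code added by the PR; the jobid variable and the note text are placeholders), an exception with nonzero severity annotates the job eventlog without cancelling the job:

    # Hypothetical example: severity > 0 makes the exception nonfatal, so it is
    # recorded in the job eventlog but does not terminate the job.
    flux job raise --severity=1 --type=prolog $jobid "prolog: rank 3 offline"

The actual mechanism used by flux-perilog-run.py may differ; the point is only that a nonzero-severity exception attaches a human-readable note to the job eventlog.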
LGTM! I just spotted one nit.
t/t2274-manager-perilog.t (outdated)

	test_must_fail flux job attach $id &&
	flux jobtap remove offline.so
'
test_expect_success 'perilog: offline ranks are logeed by prolog' '
s/logeed/logged/
Fixed that and I'll set MWP. Thanks
Problem: In t2274-manager-perilog.t, handling of offline ranks by flux-perilog-run.py is only tested after a job has run, by manually invoking the command. However, this doesn't ensure that the script raises a nonfatal exception on the job instead of silently failing. Use the new offline.so jobtap plugin to force rank 3 offline instead of disconnecting the rank outside of any job. Since the offline.so plugin is loaded before the perilog.so plugin, this simulates a rank going offline between the scheduler assigning resources and the prolog. Ensure that a nonfatal exception is raised on the job.
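A hedged sketch of what such a test could look like, building on the snippet quoted earlier (this is not the test added by the PR; the test name, job size, and grep pattern are assumptions):

    test_expect_success 'perilog: offline rank raises nonfatal exception on job' '
    	# Force rank 3 offline at RUN state, before the perilog prolog runs
    	flux jobtap load offline.so &&
    	id=$(flux submit -N4 hostname) &&
    	# The failed prolog raises a fatal exception, so the job fails
    	test_must_fail flux job attach $id &&
    	# The prolog should also have raised a nonfatal (severity=1) exception
    	flux job eventlog $id | grep "severity=1" &&
    	flux jobtap remove offline.so
    '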
Force-pushed from 4503eac to a557e58.
Codecov Report

Additional details and impacted files:

@@            Coverage Diff             @@
##           master    #5910      +/-   ##
==========================================
- Coverage   83.34%   83.31%   -0.03%
==========================================
  Files         514      514
  Lines       82901    82913      +12
==========================================
- Hits        69092    69082      -10
- Misses      13809    13831      +22
Recently, jobs were consistently failing with an exception
job prolog failed with exit code=1
without any other details in the logs or the job eventlog. It turned out that the scheduler was handing out offline nodes, and the prolog fails in this case, but due to an incorrect log level, the log message for this condition was suppressed. Users were also confused by the lack of detail in the job exception.

This PR improves prolog handling of offline ranks by logging the offline-ranks condition at WARNING instead of INFO, and by raising a nonfatal exception on the job so that a specific reason is recorded in the job eventlog.
flux perilog-run can't raise the fatal exception itself, since that would cancel the prolog, which may still be running on other ranks. Instead, the nonzero exit status from the prolog raises the fatal exception, so the eventlog of a job that fails this way now records the nonfatal offline-ranks exception alongside the fatal one, which is a bit better than a completely silent failure.
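For instance, the exception events for an affected job can be listed as follows (jobid is a placeholder; the exact event text from the PR is not reproduced here):

    # List exception events in the job eventlog: with this change there should
    # be a nonfatal (severity=1) exception naming the offline ranks, plus the
    # fatal (severity=0) exception raised via the prolog's nonzero exit status.
    flux job eventlog $jobid | grep exception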