-
Notifications
You must be signed in to change notification settings - Fork 49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
flux-mini: fix --progress counters with job exceptions #3514
flux-mini: fix --progress counters with job exceptions #3514
Conversation
Problem: If a job gets an exception before it has started (unlikely), then the job will still finish with status == 0 since the job releases resources and finishes without reaping any job shells which would adjust the job->wait_status reported in the finish event. Tools which use the finish event status to determine if a job failed will incorrectly assume these types of jobs have succeeded. Check for a job exception when posting finish event back to the job-manager, and ensure the finish status will be reported as nonzero.
Problem: The job "finish" event is always posted when there is an "alloc" event, but the code for progress_update() in flux-mini submit assumes that the "finish" event is instead associated with the "start" event. This leads to a counting error when a job gets an exception after alloc but before start, since the job is not yet marked as running. Set the internal state of a job to 'running' and increment the running jobs count at the "alloc" event instead of the "start" event. This fixes the counting error since every finish event can decrement the running job counter.
Problem: When processing the "alloc" event in flux-mini submit --wait, a test for "args.watch" was inadvertently required before updating the job state, thereby leaving the job state as "submit", which results in incorrect accounting in the --progress display. Remove the test for args.watch when processing the "alloc" event in event_watch_cb().
Also, during development of the fixes here it was discovered that there was a case where a job could get an exception, but the That bug is fixed here as well. |
Problem: There are no tests for flux mini submit --progress --wait with early job exceptions, i.e. during job startup. Add a test to t2703-mini-bulksubmit.t which ensures the accounting of jobs displayed in --progress is as expected with job startup exceptions.
Problem: The flux-mini bulksubmit --progress tests in t2703-mini-bulksubmit.t assume that all jobs will be pending before any job is running or complete. This is not true since jobs could be started and even complete before all jobs have been fully submitted (the time at which they are marked pending). Fix the test to ensure that a *known* initial condition for the progress display is emitted, namely that a single job is pending while all other counters are 0.
5b04f24
to
c347f33
Compare
Codecov Report
@@ Coverage Diff @@
## master #3514 +/- ##
==========================================
- Coverage 82.45% 82.45% -0.01%
==========================================
Files 318 318
Lines 48736 48739 +3
==========================================
+ Hits 40186 40188 +2
- Misses 8550 8551 +1
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Thanks, setting MWP |
This is a set of fixes for
flux mini submit --progress --wait
which were discovered while developing the test for #3505.Some incorrect assumptions and bugs in the counters displayed in the progress bar caused duplicate or negative counts when jobs had exceptions during intialization.
Also included is a fix for many of the tests in
t2703-mini-bulksubmit.t
which we've seen fail recently. (Bad assumption that all jobs would get submitted before the first job ran)