Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

flux-mini: fix --progress counters with job exceptions #3514

Merged
merged 5 commits into from
Feb 13, 2021

Conversation

grondo
Copy link
Contributor

@grondo grondo commented Feb 13, 2021

This is a set of fixes for flux mini submit --progress --wait which were discovered while developing the test for #3505.

Some incorrect assumptions and bugs in the counters displayed in the progress bar caused duplicate or negative counts when jobs had exceptions during intialization.

Also included is a fix for many of the tests in t2703-mini-bulksubmit.t which we've seen fail recently. (Bad assumption that all jobs would get submitted before the first job ran)

Problem: If a job gets an exception before it has started (unlikely),
then the job will still finish with status == 0 since the job releases
resources and finishes without reaping any job shells which would
adjust the job->wait_status reported in the finish event. Tools
which use the finish event status to determine if a job failed will
incorrectly assume these types of jobs have succeeded.

Check for a job exception when posting finish event back to the
job-manager, and ensure the finish status will be reported as nonzero.
Problem: The job "finish" event is always posted when there is an
"alloc" event, but the code for progress_update() in flux-mini submit
assumes that the "finish" event is instead  associated with  the
"start" event. This leads to a counting error when a job gets an
exception after alloc but before start, since the job is not yet
marked as running.

Set the internal state of a job to 'running' and increment the running
jobs count at the "alloc" event instead of the "start" event. This
fixes the counting error since every finish event can decrement the
running job counter.
Problem: When processing the "alloc" event in flux-mini submit --wait,
a test for "args.watch" was inadvertently required before updating the
job state, thereby leaving the job state as "submit", which results in
incorrect accounting in the --progress display.

Remove the test for args.watch when processing the "alloc" event in
event_watch_cb().
@grondo
Copy link
Contributor Author

grondo commented Feb 13, 2021

Also, during development of the fixes here it was discovered that there was a case where a job could get an exception, but the finish event would have status=0. This was a rare condition since the exception would have to occur before any job shells have started, but it is serious since many utilities would see the job as successful instead of failing (including, perhaps flux_job_wait() though I did not test that)

That bug is fixed here as well.

@grondo grondo changed the title flux-mini: fixes for --progress counters for exception conditions flux-mini: fix --progress counters with job exceptions Feb 13, 2021
Problem: There are no tests for flux mini submit --progress --wait
with early job exceptions, i.e. during job startup.

Add a test to t2703-mini-bulksubmit.t which ensures the accounting of
jobs displayed in --progress is as expected with job startup exceptions.
Problem: The flux-mini bulksubmit --progress tests in
t2703-mini-bulksubmit.t assume that all jobs will be pending before
any job is running or complete. This is not true since jobs could be
started and even complete before all jobs have been fully submitted
(the time at which they are marked pending).

Fix the test to ensure that a *known* initial condition for the
progress display is emitted, namely that a single job is pending
while all other counters are 0.
@codecov
Copy link

codecov bot commented Feb 13, 2021

Codecov Report

Merging #3514 (8249669) into master (8da07e1) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #3514      +/-   ##
==========================================
- Coverage   82.45%   82.45%   -0.01%     
==========================================
  Files         318      318              
  Lines       48736    48739       +3     
==========================================
+ Hits        40186    40188       +2     
- Misses       8550     8551       +1     
Impacted Files Coverage Δ
src/cmd/flux-mini.py 91.20% <100.00%> (+0.01%) ⬆️
src/modules/job-exec/job-exec.c 75.82% <100.00%> (+0.10%) ⬆️
src/modules/job-archive/job-archive.c 59.03% <0.00%> (-0.81%) ⬇️
src/broker/state_machine.c 82.18% <0.00%> (-0.54%) ⬇️
src/common/libsubprocess/local.c 79.65% <0.00%> (+0.34%) ⬆️
src/modules/job-info/guest_watch.c 76.72% <0.00%> (+0.57%) ⬆️

Copy link
Member

@chu11 chu11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@grondo
Copy link
Contributor Author

grondo commented Feb 13, 2021

Thanks, setting MWP

@mergify mergify bot merged commit 44933b7 into flux-framework:master Feb 13, 2021
@grondo grondo deleted the flux-mini-progress-fixes branch February 13, 2021 23:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants