Fix flux-job status for jobs with exceptions before start #2784

Merged
merged 2 commits into from Feb 29, 2020

grondo
Contributor

@grondo grondo commented Feb 28, 2020

Fix `flux job status` for jobs that get an exception before the start event (e.g. an unsatisfiable request, or a canceled job). `flux job status` only waits for the finish event, so it currently reports any job without a finish event as exit code 0 (success). (As seen in flux-framework/flux-sched#602)

@chu11
Member

chu11 commented Feb 28, 2020

LGTM, need to add "Fixes #2782" to commit message though.

stat->status = 256;
stat->exit_code = 1;
stat->exception = true;
strncpy (stat->ex_type, type, sizeof(stat->ex_type) - 1);
Member

(Not a request for a change, but a question for my own understanding and edification)

If the length of type is > sizeof(stat->ex_type) - 1 and stat wasn't zeroed when allocated, then stat->ex_type will not be NUL-terminated, right? Tracing back to stat's allocation, it looks like it was calloc'd, so I don't think it is ultimately an issue here. For my own understanding though, is it best practice to NUL-terminate the string "just to be safe"? Or is it fine to rely on structs and char[]s having been calloc'd so that you don't have to worry about this edge case?

Member

I think it's often fine to rely on zero-initialized buffers when it's obvious that's the case.
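
For readers following the question above, here is a minimal sketch of the "just to be safe" variant, where the copy terminates the buffer itself instead of relying on a zeroed allocation. The struct is a hypothetical stand-in (the real definition is not shown in this thread), and the 16-byte buffer size is an assumption chosen only for illustration.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the status struct discussed above.
 * The 16-byte buffer size is an assumption, not the real value. */
struct job_status {
    int exception;
    char ex_type[16];
};

/* Copy an exception type into st->ex_type and always NUL-terminate,
 * whether or not the struct was zeroed when it was allocated. */
static void set_exception_type (struct job_status *st, const char *type)
{
    assert (st != NULL && type != NULL);
    st->exception = 1;
    /* strncpy does not write a terminator when type fills the buffer */
    strncpy (st->ex_type, type, sizeof (st->ex_type) - 1);
    st->ex_type[sizeof (st->ex_type) - 1] = '\0';
}
```

With a calloc'd struct the final assignment is redundant, which is the point made in the reply above; the explicit terminator only matters if the allocation strategy changes later.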

@grondo
Contributor Author

grondo commented Feb 28, 2020 via email

@dongahn
Member

dongahn commented Feb 28, 2020

This looks good to me. I initially thought it may be better to return a higher exit code than 1 so that this kind of error can bubble up better in a tool like flux-tree. But then, that might just be too arbitrary. I'm good as is.

@grondo
Contributor Author

grondo commented Feb 28, 2020 via email

@dongahn
Member

dongahn commented Feb 28, 2020

Maybe an option to set the exit code for exceptions?

If easy enough... yeah. I can definitely make use of this within flux-tree. This is an internal error that is more serious than the exit codes from user's script or jobs. So, I would give 255 using such an option.

@grondo grondo force-pushed the issue#2782 branch 2 times, most recently from 9985b62 to 7f19833 Compare February 29, 2020 06:16
@grondo
Contributor Author

grondo commented Feb 29, 2020

Ok, I've added a fixup commit that adds -e, --exception-exit-code=N to choose the default exit code set for job exceptions. If it's okay, I'll later squash everything together.
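
As a rough illustration of validating the N given to such an option, the sketch below parses a decimal exit code with `strtol`. It is not taken from flux-job: the function name `parse_exit_code` is hypothetical, and the 0..255 range is an assumption based on exit statuses being reported as 8-bit values.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: parse a decimal exit code in the range 0..255.
 * Returns 0 and stores the value in *code on success; returns -1 and
 * leaves *code untouched on malformed or out-of-range input. */
static int parse_exit_code (const char *arg, int *code)
{
    char *end;
    long val;

    assert (arg != NULL && code != NULL);
    errno = 0;
    val = strtol (arg, &end, 10);
    /* reject overflow, empty input, trailing junk, and out-of-range values */
    if (errno != 0 || end == arg || *end != '\0' || val < 0 || val > 255)
        return -1;
    *code = (int)val;
    return 0;
}
```

A caller would keep a default of 1 and overwrite it only when parsing succeeds, so a value such as 255 (as suggested above for flux-tree) can be selected on the command line.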

@dongahn
Member

dongahn commented Feb 29, 2020

LGTM!

@codecov-io

Codecov Report

Merging #2784 into master will increase coverage by <.01%.
The diff coverage is 94.73%.

@@            Coverage Diff             @@
##           master    #2784      +/-   ##
==========================================
+ Coverage   81.05%   81.05%   +<.01%     
==========================================
  Files         250      250              
  Lines       39411    39428      +17     
==========================================
+ Hits        31944    31958      +14     
- Misses       7467     7470       +3

| Impacted Files | Coverage Δ |
| --- | --- |
| src/cmd/flux-job.c | 86.3% <94.73%> (+0.12%) ⬆️ |
| src/modules/job-info/guest_watch.c | 76.14% <0%> (-0.58%) ⬇️ |
| src/common/libsubprocess/local.c | 80.06% <0%> (-0.35%) ⬇️ |
| src/broker/broker.c | 73.4% <0%> (+0.09%) ⬆️ |

Problem: flux-job status works by waiting for the job finish event
and grabbing the status key from its context. However, jobs that
get an exception before start will not have a finish event. In these
cases `flux job status` erroneously reports an exit status of 0.

Additionally process job exceptions in flux-job status, and set the
exit status to 1 or the value of a new option `--exception-exit-code`
in these cases. Capture the exception type to report with
`flux job status -v` if used.

Fixes flux-framework#2782
Enhance t2501-job-status.t with jobs that get exceptions.
@grondo
Contributor Author

grondo commented Feb 29, 2020

Ok, squashed the incremental development and force-pushed.

Member

@garlick garlick left a comment

LGTM!

@grondo
Contributor Author

grondo commented Feb 29, 2020

Had to restart asan builder. 🙁

@mergify mergify bot merged commit 7bac877 into flux-framework:master Feb 29, 2020
@grondo grondo deleted the issue#2782 branch February 29, 2020 19:30
@grondo
Contributor Author

grondo commented Feb 29, 2020

Thanks!
