Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

job-info & job-manager& flux-jobs: Support job annotations (annotations part 3) #3065

Merged
merged 12 commits into from Jul 28, 2020

Conversation

chu11
Copy link
Member

@chu11 chu11 commented Jul 22, 2020

Built on top of PR #3062, this PR completes phase 1 job annotation support in flux-core.

  • job-manager - support a new annotations event, which will publish annotations

  • job-info - read in annotations events from the above for querying by job-list services

  • job-manager - support a new user annotation service, allowing users to annotate their jobs

    • add command support in flux job annotate
    • is the user interface ok? I supported a simple flux job annotate id key value syntax, and value can be - for stdin.
    • by default, everything is put under the user object in the annotations.
  • flux-jobs - support listing any annotations

    • the annotation fields listed by the user can be arbitrary. flux-jobs has a rule that any annotation that doesn't exist results in an empty string. So this is even if the user inputs bogus annotations. The reason we have to do this is b/c there is no way to know what annotations are legal or illegal, given any scheduler or user can come up with any annotations they want.
    • do we want to support a few special case annotations such as sched and/or user? So the user doesn't have to type "annotations." for each of them?
    • the output headers ok? I chose to upper case everything after "annotations." as the header.

An example output for:

jobid=`flux mini submit sleep 1000`                                                                                                         
sleep 5                                                                                                                                     
flux job annotate $jobid foo 1234                                                                                                           
flux jobs -n --format="{id} --- {annotations}"                                                                                              
flux jobs -n --format="{id} --- {annotations.sched}"                                                                                        
flux jobs -n --format="{id} --- {annotations.user}"                                                                                         
flux jobs -n --format="{id} --- {annotations.sched.resource_summary}"                                                                       
flux jobs -n --format="{id} --- {annotations.sched.foobar}"                                                                                 
flux jobs -n --format="{id} --- {annotations.user.foo}"                                                                                     
flux jobs -n --format="{id} --- {annotations.user.a.illegal.key}"  
>src/cmd/flux start ./submit_annotate.sh
15804137472 --- {"sched": {"resource_summary": "rank0/core0"}, "user": {"foo": 1234}}
15804137472 --- {"resource_summary": "rank0/core0"}
15804137472 --- {"foo": 1234}
15804137472 --- rank0/core0
15804137472 --- 
15804137472 --- 1234
15804137472 --- 

example w/ output headers (sorta ugly for this example)

JOBID --- ANNOTATIONS
15854469120 --- {"sched": {"resource_summary": "rank0/core0"}, "user": {"foo": 1234}}
JOBID --- SCHED
15854469120 --- {"resource_summary": "rank0/core0"}
JOBID --- USER
15854469120 --- {"foo": 1234}
JOBID --- SCHED.RESOURCE_SUMMARY
15854469120 --- rank0/core0
JOBID --- SCHED.FOOBAR
15854469120 --- 
JOBID --- USER.FOO
15854469120 --- 1234
JOBID --- USER.A.ILLEGAL.KEY
15854469120 --- 

@chu11 chu11 changed the title job-info & job-manager& flux-jobs: Support job annotations (part 3) job-info & job-manager& flux-jobs: Support job annotations (annotations part 3) Jul 22, 2020
@grondo
Copy link
Contributor

grondo commented Jul 22, 2020

do we want to support a few special case annotations such as sched and/or user? So the user doesn't have to type "annotations." for each of them?

That might be a good idea. We could also have meaningful default headings for those cases.

@chu11
Copy link
Member Author

chu11 commented Jul 22, 2020

We could also have meaningful default headings for those cases.

Have anything in mind? I could see a couple of special cases, such as T_ESTIMATE, "REASON PENDING" and perhaps "RESOURCE SUMMARY", per RFC27 pre-defined fields (i would probably just drop the sched. and remove the underscore). But those are the only ones that come to mind.

@chu11
Copy link
Member Author

chu11 commented Jul 23, 2020

rebased on master & fixed up some conflicts given the unit tests added in PR #3062

@grondo
Copy link
Contributor

grondo commented Jul 23, 2020

Have anything in mind? I could see a couple of special cases, such as T_ESTIMATE, "REASON PENDING" and perhaps "RESOURCE SUMMARY", per RFC27 pre-defined fields (i would probably just drop the sched. and remove the underscore).

Yeah, T_ESTIMATE and REASON were the two I had in mind. REASON will probably end up in default output of flux jobs I'd think.

@chu11 chu11 force-pushed the issue2608_cleanup_part2_part3 branch from 95b2338 to 8f358f9 Compare July 23, 2020 16:19
@chu11
Copy link
Member Author

chu11 commented Jul 23, 2020

  • Re-pushed with a few changes

    • added some more tests to t2205-job-manager-annotate to handle a user annotating a job with a sub-dict.

    • removed some tests in t2205-job-manager-annotate that were excessive

    • added the field names "sched" and "user" to be short hand for annotations.sched and annotations.user.

    • per discussion above, made the following RFC27 annotations have specific header outputs

+        # The following are special pre-defined cases per RFC27
+        "annotations.sched.t_estimate": "T_ESTIMATE",
+        "annotations.sched.reason_pending": "REASON",
+        "annotations.sched.resource_summary": "RESOURCES",
+        "sched": "SCHED",
+        "sched.t_estimate": "T_ESTIMATE",
+        "sched.reason_pending": "REASON",
+        "sched.resource_summary": "RESOURCES",
+        "user": "USER",
  • fixed some typos and English documentation errors I found

Went ahead and squashed since my tweaks were a bit jumbled up, I hope thats ok :P

Edit: simple output example

>flux jobs --format="{id}, {sched.resource_summary} {user.foo}"
JOBID, RESOURCES USER.FOO
123782299648, rank0/core0 42

@garlick
Copy link
Member

garlick commented Jul 23, 2020

What am I doing wrong here?

$ flux mini run hostname
sun1
$ flux jobs list -a
             JOBID USER     NAME       STATUS NTASKS NNODES  RUNTIME RANKS
       70514638848 garlick  hostname       CD      1      1   0.079s 0
$ flux job annotate 70514638848 fubar xyz
flux-job: invalid value: invalid token near 'xyz': Success
$ flux job annotate 70514638848 user.fubar xyz
flux-job: invalid value: invalid token near 'xyz': Success
$ flux job annotate 70514638848 user.fubar 42
flux-job: flux_rpc_get: No such file or directory

Error messages are not helping me figure this out :-)

@chu11
Copy link
Member Author

chu11 commented Jul 23, 2020

The first two error messages are b/c "xyz" isn't legal json. Should be \"xyz\". But I should probably just assume its a string if its not legal json?

The latter is b/c you're trying to annotate a completed job. I should definitely put in a better error message in that case.

@garlick
Copy link
Member

garlick commented Jul 23, 2020

Thanks, sorry, first naive experience! Note that jansson functions don't set errno, so the No error is because log_err_exit() is being used to log errors. Better messages would be good.

Edit: looks like the service sends back useful string error messages, but flux jobs annotate doesn't display them. Maybe use future_strerror(f, errno) macro there?

@chu11
Copy link
Member Author

chu11 commented Jul 24, 2020

Thanks, sorry, first naive experience!

My similarly naive first implementation :-) I implemented the flux job annotate half quick to try things out, forgot to go back and make the error messages better :P

Edit: looks like the service sends back useful string error messages, but flux jobs annotate doesn't display them. Maybe use future_strerror(f, errno) macro there?

Realized it too :-)

Pushed a fixup for some better error messages.

>flux job annotate 81872814080 foo bar
flux-job: invalid json value: invalid token near 'bar'
>flux job annotate 81872814080 foo 1234
flux-job: annotate: unknown job id

Copy link
Member

@garlick garlick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, here's a first round of suggestions for a bit of cleanup.

src/modules/job-manager/event.c Outdated Show resolved Hide resolved
src/modules/job-manager/annotate.c Outdated Show resolved Hide resolved
src/modules/job-manager/annotate.c Outdated Show resolved Hide resolved
src/modules/job-manager/annotate.c Show resolved Hide resolved
src/modules/job-manager/annotate.c Show resolved Hide resolved
src/modules/job-manager/annotate.c Outdated Show resolved Hide resolved
src/modules/job-info/job_state.c Outdated Show resolved Hide resolved
src/modules/job-info/job_state.c Outdated Show resolved Hide resolved
src/modules/job-manager/event.c Outdated Show resolved Hide resolved
src/cmd/flux-job.c Show resolved Hide resolved
@chu11 chu11 force-pushed the issue2608_cleanup_part2_part3 branch from 9bd4db0 to 044bb75 Compare July 24, 2020 22:17
@chu11
Copy link
Member Author

chu11 commented Jul 24, 2020

rebased on master & re-pushed addressing the stuff @garlick mentioend above. There's a lot of fixups and a new cleanup commit that I should re-order in the commit series. Would it be easier to review by just squashing everything?

@garlick
Copy link
Member

garlick commented Jul 24, 2020

Thanks! Squashing would probably be preferable.

@chu11 chu11 force-pushed the issue2608_cleanup_part2_part3 branch 2 times, most recently from c6aeb11 to d5f9f57 Compare July 25, 2020 00:15
@chu11
Copy link
Member Author

chu11 commented Jul 25, 2020

Thanks! Squashing would probably be preferable.

Squashed and fixed a few nits along the way. Rebased while I was at it.

@garlick
Copy link
Member

garlick commented Jul 25, 2020

How about one more rebase to resolve the conflict and then I'll do another review pass?

@chu11
Copy link
Member Author

chu11 commented Jul 25, 2020

How about one more rebase to resolve the conflict and then I'll do another review pass?

Already working on it :-) but hit #3071

@garlick garlick added this to the flux-core v0.18.0 milestone Jul 25, 2020
@chu11 chu11 force-pushed the issue2608_cleanup_part2_part3 branch from d5f9f57 to c209ef9 Compare July 25, 2020 20:20
@chu11
Copy link
Member Author

chu11 commented Jul 25, 2020

rebased and re-pushed, added a fix for #3074 on top. It reverts a testsuite performance enhancement that @trws did. Since its only performance related, should be fine to revert.

@chu11 chu11 force-pushed the issue2608_cleanup_part2_part3 branch from c209ef9 to 88e7c59 Compare July 27, 2020 13:57
@chu11
Copy link
Member Author

chu11 commented Jul 27, 2020

rebased and repushed. I removed the #3074 fix since that won't affect travis. I'll put that fix in a separate PR.

Edit: I also added a Closes #2608 to one of the commits. I think with this PR merged, we can declare that issue done.

@chu11 chu11 force-pushed the issue2608_cleanup_part2_part3 branch from 88e7c59 to 687b620 Compare July 27, 2020 14:04
Copy link
Member

@garlick garlick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the annotations added with flux job annotate are not sticky across a restart. Is that intended? Should there be an annotate event in the job eventlog that is recovered by the job manager?

Possibly that could just be opened as an issue for now and we could circle back on that later if it's the desired behavior (I'm not super clear if it is?)

if (update_annotation_recursive (job,
job->annotations,
annotations) < 0) {
flux_log_error (h, "update_annotation_recursive");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since both places annotate_update() is called log a message on error, this log message is redundant.

Comment on lines 151 to 154
if (annotations_update (ctx->h, job, annotations) < 0) {
flux_log_error (h, "%s: annotations_update", __FUNCTION__);
goto error;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe logging could be skipped here since the user is going to get a failure response?

json_t *value;

if (!json_is_array (annotations)) {
flux_log (ctx->h, LOG_ERR, "%s: annotations EPROTO", __FUNCTION__);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Improve error message? Maybe "annotations event is not an array" and skip the FUNCTION?

flux_jobid_t id;
json_t *aValue;

if (parse_annotation (value, &id, &aValue) < 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about "job %ju annotation parse error"

Copy link
Member Author

@chu11 chu11 Jul 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but I won't know the ID :-) (its parsed in this line), i'll adjust to "annotation parse error", to differentiate from the error above it

if (!(h = flux_open (NULL, 0)))
log_err_exit ("flux_open");

id = parse_arg_unsigned (argv[optindex], "id");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

job ID's should now be parsed with parse_jobid() to support F58 and other representations.

@chu11
Copy link
Member Author

chu11 commented Jul 27, 2020

I'll open an issue on that. It seems like we might want to take a step back and re-evaluate how annotations are made persistent. (If they are now a general concept, do they belong in the alloc eventlog entry, for example?)

Sounds good. We'll have to update RFC27 as well. The key points in there:

The job manager posts an alloc event in response to the successful allocation of resources. A snapshot of job's annotation dictionary, after the above update, is included in the alloc event context per RFC 21, thus preserving it in job record when the allocation is successful.
Annotations SHALL be considered volatile until a SUCCESS response is received to the sched.alloc request, as described in Alloc Success above. Annotations SHALL be discarded by the job manager if the allocation fails.

@chu11
Copy link
Member Author

chu11 commented Jul 27, 2020

No need to hold up this PR on that though, IMHO.

Shall I squash?

@garlick
Copy link
Member

garlick commented Jul 27, 2020

sure!

@chu11 chu11 force-pushed the issue2608_cleanup_part2_part3 branch from e473fd6 to b5688d0 Compare July 27, 2020 19:08
@chu11
Copy link
Member Author

chu11 commented Jul 27, 2020

squashed and re-pushed

@grondo
Copy link
Contributor

grondo commented Jul 27, 2020

I get a reproducible failure in t2205-job-manager-annotate.t when the test is run with FLUX_TEST_VALGRIND set:

ok 26 - job-manager: cancel all jobs
not ok 27 - job-manager: no annotations (IIII)
#	
#	        jmgr_check_no_annotations $(cat job1.id) &&
#	        jmgr_check_no_annotations $(cat job2.id) &&
#	        jmgr_check_no_annotations $(cat job3.id) &&
#	        jmgr_check_no_annotations $(cat job4.id) &&
#	        jmgr_check_no_annotations $(cat job5.id)
#	
ok 28 - job-manager: no annotations in job-info (IIII)
ok 29 - job-manager: no annotations in job-info (IIII)
ok 30 - job-manager: remove sched-dummy
ok 31 - job-manager: remove job-manager, job-ingest
# failed 1 among 31 test(s)
1..31

Maybe some synchronization needed between the cancelation of jobs and the next test?

Comment on lines 198 to 201
test_expect_success 'job-manager: cancel all jobs' '
flux job cancel $(cat job1.id) &&
flux job cancel $(cat job4.id) &&
flux job cancel $(cat job3.id) &&
flux job cancel $(cat job4.id)
flux job cancel $(cat job1.id)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a suggestion, I think flux job cancelall takes a --states argument, so this test could be shortened and made self documenting with something like (untested):

        flux job cancelall --states=SCHED -f &&
        flux job cancelall -f

@chu11
Copy link
Member Author

chu11 commented Jul 27, 2020

Maybe some synchronization needed between the cancelation of jobs and the next test?

I think you're right. I removed the "job states" tests b/c I thought them superfluous, but they are probably necessary for synchronization.

Copy link
Member

@garlick garlick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to be offline for a bit so adding my approval. I'm OK if this goes in after @grondo's issues are resolved. Thanks for the long slog @chu11!

@grondo
Copy link
Contributor

grondo commented Jul 27, 2020

I'm going to push @chu11's latest work here. All I did was run git rebase -i --autosquash upstream/master.

@grondo grondo force-pushed the issue2608_cleanup_part2_part3 branch from b5688d0 to f6e7e56 Compare July 27, 2020 23:03
@garlick
Copy link
Member

garlick commented Jul 27, 2020

Setting MWP. Thanks!

chu11 added 12 commits July 28, 2020 03:32
jansson functions do not reliable set errno, so call flux_log()
instead of flux_log_error().  In addition, to reduce excessive calls
to flux_log(), place parsing of job state transitions into its own
function.
Support a new job-annotations event so other modules, such as job-info,
can be aware of changes to job annotations.
Listen to the newly created job-annotations event and make
job annotations available to be returned by job listing services.
Add flux job list annotation tests to t2203-job-manager-dummysched-single
and t2204-job-manager-dummysched-unlimited.

Add list attribute tests to t/t2230-job-info-list.
Add job-manager.annotate RPC target, allowing users to be able to
annotate their own jobs.

Move several convenience functions from alloc.c to annotate.c,
adjusting interfaces.

Update unit tests accordingly.
Support new annotate command to annotate a specific job id.
Add new tests for job-manager.annotate service.
Add support to be able to handle arbitrary annotations setup by
the scheduler / user by checking for the special prefix
"annotations." in a formatting field.

Ensure that any arbitrary missing field will return an empty string
through a special AnnotationsInfo class created based on the
annotations dictionary returned from job-info.

Closes flux-framework#2608
Add flux jobs annotation tests to

t2203-job-manager-dummysched-single
t2204-job-manager-dummysched-unlimited
t2205-job-manager-annotate

Add output header and corner case tests to t2800-job-cmds.
For a few specially defined annotations in RFC27, output special
case headers for those cases.

Update header tests in t/t2800-jobs-cmd.t accordingly.
As a convenience to users, support the "sched" field as the same
as "annotations.sched" and the "user" field as the same as
"annotations.user".

Update tests and documentation accordingly.
@SteVwonder SteVwonder force-pushed the issue2608_cleanup_part2_part3 branch from f6e7e56 to 7da96d2 Compare July 28, 2020 03:32
@codecov-commenter
Copy link

Codecov Report

Merging #3065 into master will increase coverage by 0.01%.
The diff coverage is 78.99%.

@@            Coverage Diff             @@
##           master    #3065      +/-   ##
==========================================
+ Coverage   81.18%   81.19%   +0.01%     
==========================================
  Files         285      286       +1     
  Lines       44233    44398     +165     
==========================================
+ Hits        35911    36051     +140     
- Misses       8322     8347      +25     
Impacted Files Coverage Δ
src/modules/job-info/job-info.c 78.20% <ø> (ø)
src/modules/job-info/list.c 68.31% <ø> (ø)
src/modules/job-manager/job-manager.c 54.28% <50.00%> (-0.26%) ⬇️
src/modules/job-manager/alloc.c 78.89% <65.21%> (+0.17%) ⬆️
src/cmd/flux-job.c 84.88% <74.07%> (-0.25%) ⬇️
src/modules/job-manager/event.c 77.37% <76.92%> (+0.80%) ⬆️
src/modules/job-info/job_state.c 71.13% <78.18%> (+1.09%) ⬆️
src/modules/job-manager/annotate.c 78.26% <78.26%> (ø)
src/cmd/flux-jobs.py 93.15% <100.00%> (+0.64%) ⬆️
src/modules/job-info/job_util.c 93.45% <100.00%> (+0.25%) ⬆️
... and 7 more

@mergify mergify bot merged commit 28fd322 into flux-framework:master Jul 28, 2020
@chu11 chu11 deleted the issue2608_cleanup_part2_part3 branch June 5, 2021 18:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants