Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

modules/job-info: New module #2071

Merged
merged 7 commits into from Mar 14, 2019

Conversation

@chu11
Copy link
Contributor

commented Mar 11, 2019

After several WIPs (#2066, #2065), I think this PR is a candidate for merger.

At this moment, the module only supports getting/watching eventlog entries by jobid. Included are two new job subcommands, flux job eventlog and flux job wait-event. They do about what you think they would do.

Adjusted tests that previously used flux kvs eventlog get to use new subcommands, removing the need to do flux job id commands to get paths. flux job wait-event conveniently removes a big function in t2300-sched-simple.t that emulated the same exact thing (Edit: after rebase, also in t2202-job-manager.t`).

I had begun to add a proxy to ensure guest access to the eventlog was possible, but when I tried to add tests, it wasn't clear to me how one submits a job as a guest user, or if I have to fake it in some way right now. In addition, a number of the functions in job-manager/util.[ch] should probably be exported for convenience first. So I decided perhaps that could be in a follow up PR.

@chu11 chu11 force-pushed the chu11:issue2041_part1-devel11 branch from e77dbef to fad6af9 Mar 11, 2019
Copy link
Member

left a comment

Nicely done.

A few minor suggestion and one (less minor) question about API (with caveat that I'm just not sure what the APIs should look like here, so posting mainly to open a dialog).

#include <jansson.h>
#include <flux/core.h>

#include "src/common/libjob/job.h"

This comment has been minimized.

Copy link
@garlick

garlick Mar 11, 2019

Member

Included by <flux/core.h>, so redundant.

int saved_errno = errno;
flux_msg_handler_delvec (ctx->handlers);
if (ctx->lookups)
zlist_destroy (&ctx->lookups);

This comment has been minimized.

Copy link
@garlick

garlick Mar 11, 2019

Member

Any in progress lookups should probably be canceled here (and removed from list and destroyed).

This comment has been minimized.

Copy link
@chu11

chu11 Mar 12, 2019

Author Contributor

I've set a free function on the zlist entries so they are freed via that mechanism. But you're right, should cancel any watches.

* SPDX-License-Identifier: LGPL-3.0
\************************************************************/

#ifndef _FLUX_CORE_JOB_EVENTLOG_H

This comment has been minimized.

Copy link
@garlick

garlick Mar 11, 2019

Member

I forgot where we left the discussion, but I think @grondo earlier proposed just adding this to job.h rather than creating job_info.h, job_types.h, etc., just to keep the number of installed headers under control.

This comment has been minimized.

Copy link
@chu11

chu11 Mar 11, 2019

Author Contributor

That was discussion on job_util.h. I felt job_info.h warranted a new file given future additions? But perhaps we are thinking too far ahead given your comment below on what the API should look like. I'll just squash it all into job.h for the time being.

#endif

flux_future_t *flux_job_eventlog_lookup (flux_t *h, int flags, flux_jobid_t id);
int flux_job_eventlog_lookup_get (flux_future_t *f, const char **event);

This comment has been minimized.

Copy link
@garlick

garlick Mar 11, 2019

Member

I wonder if we could actually drop the flags and rename to flux_job_event_watch()? I was plannig to add a flux_job_state_watch() that would return one state per response, so this could be kind of paired with that?

We'll need a function for fetching static job info in bulk (R, jobspec, J, etc). Maybe that one could be used to fetch the eventlog in its entirety?

I'm not at all sure what the right interfaces are here, so one possibility is to go with what you have and tweak later.

This comment has been minimized.

Copy link
@grondo

grondo Mar 11, 2019

Contributor

Agreed, I think exported interfaces in libjob related to job "events" could hide the fact from (most) users that the events are in an event log. I like the parity with flux_job_state_watch.

I'm also fine starting with what you have though. Probably more important are the cmdline utilities and Python interface...

This comment has been minimized.

Copy link
@chu11

chu11 Mar 12, 2019

Author Contributor

So are you thinking flux_job_event_watch(), flux_job_event_watch_get(), and flux_job_event_watch_cancel()? Sounds like a good idea if there will be other similar ones.

We'll need a function for fetching static job info in bulk (R, jobspec, J, etc). Maybe that one could be used to fetch the eventlog in its entirety?

Yeah, once that API / interface becomes more clear, I think that could easily be used for the full eventlog retrieval.

@chu11

This comment has been minimized.

Copy link
Contributor Author

commented Mar 12, 2019

hmmm, one builder failed

not ok 23 - job-manager: priority was updated in KVS
FAIL: t2202-job-manager.t 23 - job-manager: priority was updated in KVS
#	
#		jobid=$(tail -1 <list10_ids.out) &&
#		flux job eventlog $jobid \
#			| cut -d" " -f2- | grep ^priority >pri.out &&
#		grep -q priority=31 pri.out
#	

A little concerning. I'm not sure if there's something racy that's possible?

@grondo

This comment has been minimized.

Copy link
Contributor

commented Mar 12, 2019

I'm not sure if there's something racy that's possible?

Maybe a flux job wait-event --timeout=5 priority is required before dumping the eventlog?

@garlick

This comment has been minimized.

Copy link
Member

commented Mar 12, 2019

Yup that sounds right. To be clear, that race was not introduced in this PR.

@chu11 chu11 force-pushed the chu11:issue2041_part1-devel11 branch from fad6af9 to 71796b1 Mar 12, 2019
@chu11

This comment has been minimized.

Copy link
Contributor Author

commented Mar 12, 2019

Just re-pushed with changes noted above. Added new flux_job_event_watch(), flux_job_event_watch_get(), flux_job_event_watch_cancel() functions.

flux_job_eventlog_lookup/lookup_get() is only for retrieving the full eventlog now. In the future, there could be a more generic "get R, get J, etc." bulk API functions but decided to leave this here for now.

@chu11 chu11 force-pushed the chu11:issue2041_part1-devel11 branch from 71796b1 to a94fc59 Mar 12, 2019
@garlick

This comment has been minimized.

Copy link
Member

commented Mar 12, 2019

one builder failed the same way as reported above

expecting success: 
	jobid=$(tail -1 <list10_ids.out) &&
	flux job eventlog $jobid \
		| cut -d" " -f2- | grep ^priority >pri.out &&
	grep -q priority=31 pri.out
not ok 23 - job-manager: priority was updated in KVS

I think @grondo's suggested fix ought to solve this e.g.

diff --git a/t/t2202-job-manager.t b/t/t2202-job-manager.t
index b33b2fac4..76b24efd0 100755
--- a/t/t2202-job-manager.t
+++ b/t/t2202-job-manager.t
@@ -153,6 +153,7 @@ test_expect_success 'job-manager: flux job priority sets last job priority=31' '
 
 test_expect_success 'job-manager: priority was updated in KVS' '
        jobid=$(tail -1 <list10_ids.out) &&
+       flux job wait-event --timeout=5.0 ${jobid} priority >pri.out &&
        flux job eventlog $jobid \
                | cut -d" " -f2- | grep ^priority >pri.out &&
        grep -q priority=31 pri.out
@chu11

This comment has been minimized.

Copy link
Contributor Author

commented Mar 12, 2019

was going to fix that in a different PR, but I'll add it in this one.

@chu11

This comment has been minimized.

Copy link
Contributor Author

commented Mar 12, 2019

missed a couple paths on diff coverage, let me add more tests

@chu11

This comment has been minimized.

Copy link
Contributor Author

commented Mar 12, 2019

@garlick I just realized, I didn't add one feature you requested. That after a single event in 'inactive' the watcher should exit. I'll add that too.

Copy link
Contributor

left a comment

I ran through this PR branch and everything is working great!

I just commented inline with very minor things I found during testing.

test_expect_success NO_CHAIN_LINT 'flux job wait-event hangs on no event' '
jobid=$(flux jobspec --format json srun -N1 hostname | flux job submit)
! run_timeout 3 flux job wait-event $jobid foobar
'

This comment has been minimized.

Copy link
@grondo

grondo Mar 12, 2019

Contributor

The 3s timeout here seems excessive, maybe just left over debug value?

In fact shouldn't .5 or even .2 seconds be enough? (run_timeout supports < 1s timeouts) then the test overall will run 3x faster.

This comment has been minimized.

Copy link
@chu11

chu11 Mar 12, 2019

Author Contributor

I simply wanted a number that felt "timeout like", but you're right, it need not be larger.

This comment has been minimized.

Copy link
@grondo

grondo Mar 12, 2019

Contributor

Yeah, figured it was something like that. The drawback is that we always pause for the timeout there.. so we want it to be as long as necessary and no longer ;-)

#

test_expect_success 'flux job eventlog works (active)' '
jobid=$(flux jobspec --format json srun -N1 hostname | flux job submit) &&

This comment has been minimized.

Copy link
@grondo

grondo Mar 12, 2019

Contributor

Suggestion: run flux jobspec once and save the output to a file for use in subsequent tests (see e.g. t2202-job-manager.t). This will slightly speed up the test.

goto done;
}
else
flux_log_error (ctx->h, "%s: flux_kvs_lookup_get", __FUNCTION__);

This comment has been minimized.

Copy link
@grondo

grondo Mar 12, 2019

Contributor

Maybe this error should be supressed (at least for ENOENT), since it is likely a result of looking up an invalid jobid? (No reason to spam the logs, since the error is returned to the user anyway)

}

if (flux_job_event_watch_get (f, &s) < 0)
log_err_exit ("flux_job_event_watch_get");

This comment has been minimized.

Copy link
@grondo

grondo Mar 12, 2019

Contributor

Suggestion, on at least ENOENT, return a more readable error to the user, e.g. "job %lu not found" instead of flux_job_event_watch_get: No such file or directory

}
id = strtoull (argv[optindex++], &endptr, 10);
if (*endptr != '\0')
log_err_exit ("id requires an unsigned integer argument");

This comment has been minimized.

Copy link
@grondo

grondo Mar 12, 2019

Contributor

Switch to log_msg_exit() here to suppress a possibly confusing strerror (errno) appended to the end of the error, i.e. I'm getting:

flux-job: id requires an unsigned integer argument: No such file or directory
if (flux_rpc_get (f, NULL) < 0) {
if (errno == ETIMEDOUT) {
fprintf (stderr, "wait-event timeout on event '%s'\n",
ctx->wait_event);

This comment has been minimized.

Copy link
@grondo

grondo Mar 12, 2019

Contributor

should this use log_msg() to get command prefix as in other error messages?

}
id = strtoull (argv[optindex++], &endptr, 10);
if (*endptr != '\0')
log_err_exit ("id requires an unsigned integer argument");

This comment has been minimized.

Copy link
@grondo

grondo Mar 12, 2019

Contributor

s/log_err_exit/log_msg_exit

@chu11 chu11 force-pushed the chu11:issue2041_part1-devel11 branch from 654939f to f267834 Mar 13, 2019
@chu11

This comment has been minimized.

Copy link
Contributor Author

commented Mar 13, 2019

  • re-pushed, fixed up things commented on by review by @grondo
  • added more tests for hopefully more coverage
  • if a lookup reaches the 'inactive' directory of eventlog, assume no more events will occur and end watching.

May be worth noting that due to the latter point above, some of the tests can be racy with flux kvs move <activedir> <inactivedir>, so I worked around it by doing:

        activekvsdir=$(flux job id --to=kvs-active $jobid) &&
        inactivekvsdir=$(echo $activekvsdir | sed 's/active/inactive/') &&
        flux kvs copy ${activekvsdir}.eventlog ${inactivekvsdir}.eventlog &&
        flux kvs eventlog append ${inactivekvsdir}.eventlog foobar &&
        flux kvs unlink ${activekvsdir}.eventlog &&
@chu11 chu11 force-pushed the chu11:issue2041_part1-devel11 branch from f267834 to 2d553b1 Mar 13, 2019
@chu11

This comment has been minimized.

Copy link
Contributor Author

commented Mar 13, 2019

was debating if this code is correct when the module is being unloaded

            while ((l = zlist_pop (ctx->lookups))) {
                if ((l->flags & FLUX_JOB_INFO_WATCH)) {
                    if (flux_kvs_lookup_cancel (l->f) < 0)
                        flux_log_error (ctx->h, "%s: flux_kvs_lookup_cancel",
                                        __FUNCTION__);

                    if (flux_respond_error (ctx->h, l->msg, ENODATA, NULL) < 0)
                        flux_log_error (ctx->h, "%s: flux_respond_error",
                                        __FUNCTION__);
                }
                lookup_ctx_destroy (l);
            }

i.e. cancel the kvs-watcher watcher and send ENODATA to the current eventlog watcher on teardown.

Should the latter return ENODATA? It's true no more watch data is coming, but it's not returning ENODATA under a "good" or "normal" scenario, it's ending b/c of a "bad" situation.

It doesn't appear we send ENODATA on the equivalent situation in kvs-watch.

@garlick

This comment has been minimized.

Copy link
Member

commented Mar 13, 2019

@garlick

This comment has been minimized.

Copy link
Member

commented Mar 13, 2019

Couple other nits (sorry missed earlier):

  • In flux-job.c there is a function parse_arg_unsigned() which should be called to parse job id's rather than calling strtoull directly in the new subcommands.
  • Similarly, there is an iso_timestr() function (msec precision) that could be used in place of eventlog_timestr() (usec precision). I don't think usec precision is really needed in the eventlog output.
  • It might be useful for wait-event to have an option to display the event (or just make that the default?) In the tests anyway, we often wait for an event, then fetch the eventlog to extract the triggering event.
@grondo

This comment has been minimized.

Copy link
Contributor

commented Mar 13, 2019

Similarly, there is an iso_timestr() function (msec precision) that could be used in place of eventlog_timestr() (usec precision). I don't think usec precision is really needed in the eventlog output.

I was actually going to suggest that the timestamps be left alone in seconds with usec precision. To me the ISO timestamp in UTC is not all that useful for viewing the eventlog for jobs -- e.g. you can't subtract two entries to get their difference, and events especially in debug mode may actually be ~1ms or less apart.

I think all the eventlog commands should:

  • display the unmodified timestamp by default (e.g. like Linux dmesg(1))
  • have an option to display the ISO timestr in current timezone (e.g. like Linux dmesg --time-format=iso)
  • have an option to display the offset time (e.g. first entry is 0.0000, other entries are offsets)
@garlick

This comment has been minimized.

Copy link
Member

commented Mar 13, 2019

That sounds good to me @grondo. What do you think @chu11?

@grondo

This comment has been minimized.

Copy link
Contributor

commented Mar 13, 2019

Of course, adding TZ support to iso_timestr and the support for relative timestamps do not need to land in this PR. I would prefer this PR pass the timestamps unmodified in flux job eventlog (i.e. make that the default, but just my opinion), then the other stuff could come in a future PR.

I'll open an issue.

It might be useful for wait-event to have an option to display the event (or just make that the default?) In the tests anyway, we often wait for an event, then fetch the eventlog to extract the triggering event.

BTW, I agree this would be awesome! flux job wait-event --quiet could be used to wait without printing the entry from the matching event.. (and maybe --verbose could print even non-matching entries?)

Call log_msg_exit instead of log_err_exit on input error.
@chu11

This comment has been minimized.

Copy link
Contributor Author

commented Mar 13, 2019

Hey guys, I simply yoinked the timestr output from flux kvs without thinking about it. I think all of the time output functions are a good idea, but agree it should be in a future PR.

@chu11

This comment has been minimized.

Copy link
Contributor Author

commented Mar 13, 2019

diff coverage isn't what I'd like it to be, the main missed part is the cleanup done when a module is unloaded AND watchers still exist. I'll write up something special to cover that to get atleast above 75%.

chu11 added 2 commits Mar 5, 2019
Module is a proxy between caller and KVS taking job id as input
and ensuring caller has access to respective job information.

Initially, support reading eventlog entries.
Add flux_job_eventlog_lookup() and flux_job_eventlog_lookup_get()
to retrieve an entire eventlog.

Add flux_job_event_watch(), flux_job_event_watch_get(), and
flux_job_event_watch_cancel() functions to stream an eventlog
one entry at a time.
@chu11 chu11 force-pushed the chu11:issue2041_part1-devel11 branch from 2d553b1 to 80224b6 Mar 13, 2019
@chu11

This comment has been minimized.

Copy link
Contributor Author

commented Mar 13, 2019

just re-pushed with fixes

  • use parse_arg_unsigned appropriately
  • wait-event outputs the event by default
  • --quiet & --verbose options added for wait-event
  • module unload corner case, returns ENOSYS instead of ENODATA. I felt this was the right one, indicating the module is no longer loaded and the watch is no longer available.
  • eventlog & wait-event output raw timestamp
    • (I didn't do this in flux-kvs, should I?)
  • update log_err to log_msg in flux kvs given prior comments
@garlick

This comment has been minimized.

Copy link
Member

commented Mar 13, 2019

Nice!

Did notice one bug - flux job wait-event --verbose seems to dump events that occur after the triggering event when it's documented to (and probably should) dump only those leading up to and including it. Example:

$ flux job wait-event -v 40617639936  submit 
1552515114.064075 submit userid=5588 priority=16 flags=0
1552515152.285945 alloc rank0/core[0-1]
@grondo

This comment has been minimized.

Copy link
Contributor

commented Mar 13, 2019

I gave the recent stuff a spin and it works great!

@chu11

This comment has been minimized.

Copy link
Contributor Author

commented Mar 13, 2019

occur after the triggering event when it's documented to (and probably should) dump only those leading up to and including it.

Hmmm. I thought I handled that, but clearly I didn't do it correctly.

@garlick

This comment has been minimized.

Copy link
Member

commented Mar 13, 2019

I think this is probably ready to merge once we get that little bug fixed.

chu11 added 2 commits Mar 6, 2019
Support new subcommands 'eventlog' to get a job eventlog and
'wait-event' to wait for an event to occur.
@chu11 chu11 force-pushed the chu11:issue2041_part1-devel11 branch from 80224b6 to 6bcc7e8 Mar 13, 2019
@chu11

This comment has been minimized.

Copy link
Contributor Author

commented Mar 13, 2019

Hmmm. I thought I handled that, but clearly I didn't do it correctly.

Doh! Had a corner case. And I realize how a lot of the tests can be made simpler. Re-pushed with fix and a bunch of test cleanups.

@garlick

This comment has been minimized.

Copy link
Member

commented Mar 13, 2019

Maybe a chainlint failure? I've gotta run but I'll check in later and merge if this comes together. Thanks!

@grondo
grondo approved these changes Mar 14, 2019
chu11 added 2 commits Mar 11, 2019
Replace use of flux kvs eventlog get with flux job eventlog
appropriately.

Use flux job wait-event when appropriate.
@grondo

This comment has been minimized.

Copy link
Contributor

commented Mar 14, 2019

Maybe a chainlint failure?

In case you don't find it quickly, from travis logs:

PASS: t2204-job-info.t 7 - flux job eventlog fails on bad id
error: bug in the test script: broken &&-chain: 
        jobid=$(flux job submit test.json)
        flux job wait-event $jobid submit > wait_event1.out &&
        grep submit wait_event1.out
@chu11

This comment has been minimized.

Copy link
Contributor Author

commented Mar 14, 2019

yeah, multiple chain lint issues. Fixing up.

@chu11 chu11 force-pushed the chu11:issue2041_part1-devel11 branch from 6bcc7e8 to a3eaee0 Mar 14, 2019
@codecov-io

This comment has been minimized.

Copy link

commented Mar 14, 2019

Codecov Report

Merging #2071 into master will decrease coverage by 0.06%.
The diff coverage is 73.89%.

@@            Coverage Diff             @@
##           master    #2071      +/-   ##
==========================================
- Coverage   80.48%   80.41%   -0.07%     
==========================================
  Files         192      193       +1     
  Lines       30246    30540     +294     
==========================================
+ Hits        24342    24560     +218     
- Misses       5904     5980      +76
Impacted Files Coverage Δ
src/cmd/flux-kvs.c 82.28% <0%> (-0.12%) ⬇️
src/common/libjob/job.c 74.78% <100%> (+8.88%) ⬆️
src/modules/job-info/job-info.c 68% <68%> (ø)
src/cmd/flux-job.c 85.04% <77.27%> (-2.94%) ⬇️
src/common/libflux/message.c 81.39% <0%> (-0.25%) ⬇️
src/cmd/flux-module.c 83.72% <0%> (-0.24%) ⬇️
src/modules/kvs-watch/kvs-watch.c 78.99% <0%> (+0.21%) ⬆️
src/modules/barrier/barrier.c 78.62% <0%> (+2.06%) ⬆️
@grondo

This comment has been minimized.

Copy link
Contributor

commented Mar 14, 2019

Great! Looks like @garlick already approved and this is super useful, so merging!

@grondo grondo merged commit aedccca into flux-framework:master Mar 14, 2019
3 of 4 checks passed
3 of 4 checks passed
codecov/patch 73.89% of diff hit (target 80.48%)
Details
Mergify — Summary 1 potential rule
Details
codecov/project 80.41% (-0.07%) compared to c59f78a
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.