Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

job-manager: improve internal eventlog support #2025

Merged
merged 9 commits into from Feb 19, 2019

Conversation

@garlick
Copy link
Member

commented Feb 18, 2019

This PR is definitely a WIP, but I wanted to get it out there for some early review if that's OK.

The job eventlog is a sort of fine grained synchronization facility for single jobs, and also a record of what has happened so far to a job sitting in the queue. In addition, the job manager implements a "restart" capability where active jobs are pulled in from the KVS and their job manager state recreated by replaying the job's eventlog.

The first thing this PR does is ensure that full in-memory job state can be recreated from the eventlog, as the number of events increases. The "restart" code is refactored so that replay occurs in a new constructor for a struct job. Some util functions for parsing key=value attributes out of the event context were added to make this job easier. The submit event, which is added by the ingest module, was modified to include the userid and initial job priority, so that the job manager only needs to access the eventlog, not individual keys, when it restarts (RFC 16 should be updated to reflect this).

The second part of this PR adds a proxy service in the job manager to allow guests to access their jobs' eventlogs by job id. There's a matching job.h api function and a flux job eventlog subcommand. In this first cut, the service only supports fetching a snapshot of the eventlog. The subcommand dumps events in raw form.

$ flux job eventlog 53418655744
1550463786.443906 submit userid=5588 priority=16
1550463800.251430 exception type=cancel severity=0 userid=5588
1550477469.741397 exception type=cancel severity=0 userid=5588

The API function:

/* Lookup eventlog for job, using job-manager as proxy to the KVS.
 * Response contains raw RFC 18 KVS eventlog string, which may be decoded
 * with flux_kvs_eventlog_decode().
 */
flux_future_t *flux_job_lookup_eventlog (flux_t *h, flux_jobid_t id, int flags);

Obviously a watch flag is the big missing piece here. I paused here partly because I felt we might want to do better with the way events are returned by that API call. It currently retrieves an eventlog snapshot in RFC 18 format, what you get from a kvs_lookup response, which you can parse and iterate over with functions in kvs_eventlog.h. When we add WATCH, if the kvs_lookup semantics are preserved, each change causes a progressively larger block of events to be returned, rather than only the new ones. kvs_eventlog.h handles this well, but it seems like exposing this weirdness at the job API level might be a bad idea, for example if a python script wants to synchronize on events. Maybe having each response return a single event in JSON form would be better?

TODO:

  • Handle a WATCH flag to stream events as they arrive
  • Transparently handle the eventlog moving from active to inactive
  • Consider translating odd KVS lookup semantics to a stream of JSON objects
  • Better event formatting + options for the command
Problem: key=value attributes are not removed from
the optional 'note' field of the raise RPC request.
Depending on parser details, this risks allowing
a user to override exception userid field.

Reject raise requests with '=' in the note field.

Verify this change in the job-manager sharness test.
garlick added 8 commits Feb 17, 2019
Problem: userid and priority are written to separate
keys in the KVS job schema.  job-manager has to read
these to recreate job state when it restarts.

Add userid and priority to job eventlog "submit"
event, so that the job manager can get info when it
replays the eventlog and won't need to access the
other keys.
Problem: job-manager unit tests are growing new dependencies
on queue.o, job.o, and util.o.  The current method of listing
each for each test does not scale.

Add those objects to the $(test_ldadd) macro.
Problem: code is growing little bits of one-off
event context parsing code, which if we keep going
that way will result in code duplication.

Add some more generic util functions for parsing
key=int, key=string attributes, and free form text
following attributres.

Add unit tests.
Problem: the code for recreating the queue from
job data in the KVS is quickly becoming unreadable.

Add job_create_from_eventlog(), which creates a
a struct job by replaying a snapshot of the job's
KVS eventlog.

RFC 21 documents that the job state can be
recreated by replaying the eventlog. This function
implies that this property extends to the whole
in-memory job-manager state for a job.

This depends on the "submit" event including the
userid and initial priority, just added in an
earlier commit to job-ingest.  If those are missing,
creation fails.
Change the restart logic to use the newly added
flux_job_create_from_eventlog() constructor to
recreate in-memory job state (soley) from the job's
eventlog in the KVS.

A side benefit fo this cleanup is dropping two synchronous
KVS lookups per job to fetch priority and userid, which
which should speed up restart.

It also allows the one-off context parsing code in
restart.c to be dropped.
Problem: util KVS helper functions only need the job id
to construct a key, not the whole struct job.  This
limits the context in which these functions are usable
to active jobs only.

Change the function footprints and update tests and users.
Add a helper function for doing a KVS lookup on
a key within the active or inactive directory of
a job.
Problem: restart has its own KVS helper for looking
up a job attribute, but now we have a generalized
function for that.

Use util_attr_lookup() instead and drop the local one.
@garlick garlick force-pushed the garlick:job_eventlog2 branch from 8af9afc to 6ffb967 Feb 19, 2019
@garlick

This comment has been minimized.

Copy link
Member Author

commented Feb 19, 2019

Just forced a push which drops what was described above as the second part of this PR, e.g. flux_job_eventlog(), flux job eventlog, and the "proxy" service in the job manager. We have a discussion going on in #2026 about those interfaces, and so that part of this PR will need a rethink (and another PR).

The "first" part is good cleanup though, even if we end up replacing the job-manager util functions that parse the event contexts with functions in the public API, so here it is by itself.

@garlick garlick changed the title WIP: job-manager: improve eventlog support, add eventlog proxy access job-manager: improve internal eventlog support Feb 19, 2019
@codecov-io

This comment has been minimized.

Copy link

commented Feb 19, 2019

Codecov Report

Merging #2025 into master will decrease coverage by 0.03%.
The diff coverage is 78.54%.

@@            Coverage Diff             @@
##           master    #2025      +/-   ##
==========================================
- Coverage   80.58%   80.54%   -0.04%     
==========================================
  Files         180      181       +1     
  Lines       28914    29093     +179     
==========================================
+ Hits        23300    23434     +134     
- Misses       5614     5659      +45
Impacted Files Coverage Δ
src/modules/job-manager/restart.c 86.79% <100%> (+0.12%) ⬆️
src/modules/job-manager/priority.c 82.08% <100%> (ø) ⬆️
src/modules/job-manager/raise.c 70% <100%> (+0.88%) ⬆️
src/modules/job-manager/job-manager.c 67.69% <50%> (-1.17%) ⬇️
src/modules/job-ingest/job-ingest.c 72.95% <60%> (-0.34%) ⬇️
src/common/libjob/job.c 62.79% <66.66%> (+0.29%) ⬆️
src/modules/job-manager/proxy.c 72.41% <72.41%> (ø)
src/cmd/flux-job.c 86.04% <73.33%> (-0.8%) ⬇️
src/modules/job-manager/job.c 89.28% <84.21%> (-10.72%) ⬇️
src/modules/job-manager/util.c 79.12% <85.48%> (+10.37%) ⬆️
... and 4 more
@grondo

This comment has been minimized.

Copy link
Contributor

commented Feb 19, 2019

Can't find any issues with this PR, though 1bdce58 makes me wonder if we need to adjust eventlog to make designing entry formats with extra data safer, or more robust against potential issues like this?

However, for now the approach taken seems fine, so I think this PR is ready to merge?

@garlick

This comment has been minimized.

Copy link
Member Author

commented Feb 19, 2019

Yes please.

(Your point about 1bdce58 is well taken - let's keep that in mind)

@grondo grondo merged commit 2bf6370 into flux-framework:master Feb 19, 2019
2 checks passed
2 checks passed
Mergify — Summary 1 potential rule
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@garlick

This comment has been minimized.

Copy link
Member Author

commented Feb 19, 2019

Thanks!

@garlick garlick deleted the garlick:job_eventlog2 branch Feb 22, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.