Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

job-manager: add scheduler interface #2031

Merged
merged 15 commits into from Feb 26, 2019

Conversation

@garlick
Copy link
Member

commented Feb 20, 2019

This is pretty rough, but it adds a scheduler interface to the job manager, a sched-dummy test scheduler, and a basic sharness test.

Jobs transition immediately to SCHED state upon being submitted. If a scheduler is loaded, the job manager sends alloc requests to the scheduler for each job. (The maximum number of concurrent outstanding alloc requests is regulated by the negotiated scheduler interface mode - current choices are single and unlimited)

Once an alloc request is fulfilled, an alloc event is logged to the job eventlog and the job transitions to RUN state. The only way out of this currently is cancellation, which triggers a free request to be sent to the scheduler.

When the scheduler is loaded, it executes a two-phase startup. It first sends an (empty) hello request to the job manager. The job manager responds with an array of job id's that are currently holding resources. The scheduler must mark these as allocated in its internal bookkeeping system. The scheduler then sends a ready request, requesting an interface mode. The job manager responds with the number of jobs that are queued up to send alloc requests.

After this handshake, the scheduler can expect to receive alloc and free requests for resources from the job manager. These RPCs have one oddity - to improve throughput, they do not carry unique matchtags, but instead include the jobid in the response so that the job manager can match them to requests. This means a normal error response cannot be mapped to a specific job, so an alloc failure is encoded within a "success" response. If a regular RPC error is returned for any reason, this will cause the job manager to assume something is really wrong and abort its "session" with the scheduler. It works out that if the scheduler is not loaded (e.g. if it was unloaded on a live system), then the ENOSYS response returned automatically to sched requests has the effect of aborting the scheduler session.

The job manager does not pass a jobpec to the scheduler in an alloc request. Instead it passes a job id, and the scheduler fetches the jobspec from the expected KVS location per RFC 18. Also, the scheduler does not pass R to the job manager in an alloc response. Instead it commits R to the KVS, and returns the jobid to the job manager. Thus far the job manager does not directly consume either of those.

@grondo

This comment has been minimized.

Copy link
Contributor

commented Feb 20, 2019

This looks great!

I have some very simple code I can use to start testing this interface. Though I've kind of made a mess of it for the moment -- I'll hopefully get to this early tomorrow.

Thanks!

@grondo

This comment has been minimized.

Copy link
Contributor

commented Feb 21, 2019

Here's a couple quick comments (more questions really) after a first pass at this PR:

  • Could a scheduler send a service name along with the job-manager.sched-hello or job-manager.sched-ready requests? This would remove the need for the scheduler to dynamically register the sched service name (and the associated state flag). The scheduler implementation would just have to guarantee it supports the job manager interface at that service name (e.g. name.alloc, name.free)

  • This shouldn't matter to me, but it does. Why t/jobmanager not t/job-manager to match the module name (and its directory under src/modules)?

@garlick

This comment has been minimized.

Copy link
Member Author

commented Feb 21, 2019

Could a scheduler send a service name along with the job-manager.sched-hello or job-manager.sched-ready requests?

Could work, but might be awkward later if we want the scheduler to offer services to other parts of the system (I can't think what those would be unless things other than jobs might want to allocate resources without waiting in the queue). Or maybe some resource status tool?

More generally, is the pattern of registering a generic service name the way we want to abstract interface vs implementation?

Why t/jobmanager not t/job-manager to match the module name

I'll fix that. (Conservation of underbars in Makefile.am or some other indefensible brain fart).

@grondo

This comment has been minimized.

Copy link
Contributor

commented Feb 21, 2019

Could work, but might be awkward later if we want the scheduler to offer services to other parts of the system (I can't think what those would be unless things other than jobs might want to allocate resources without waiting in the queue). Or maybe some resource status tool?

I was thinking that could be handled at the time those other service endpoints are known, and they might have different names anyway. For example a resource status service might be found under the resource.* namespace, it doesn't seem to belong to sched. (in flux-sched it may belong the resource service)

More generally, is the pattern of registering a generic service name the way we want to abstract interface vs implementation?

For cases where the implementation is registering a "hello" (here I am) handshake with an existing service, I was thinking the generic service name is a bit overkill, and has setup/teardown requirements that seemed a bit fussy (what if you go down without de-registering the generic service name?)

I don't feel too strongly about it, though.

@garlick garlick force-pushed the garlick:sched_interface4 branch 2 times, most recently from 8fc9285 to 3388dd5 Feb 21, 2019
@garlick

This comment has been minimized.

Copy link
Member Author

commented Feb 21, 2019

Forced a push with the test directory name change suggested by @grondo, some cleanup of the internal job manager flags (switch to bitfields), and a rebase to current master.

@garlick garlick force-pushed the garlick:sched_interface4 branch from 3388dd5 to 8554171 Feb 22, 2019
@garlick

This comment has been minimized.

Copy link
Member Author

commented Feb 22, 2019

Forced another push with some scheduler internals factored out to libschedutil, just to enable some code reuse between the sched-dummy and other test schedulers in the short term, rather than cut and paste.

@garlick garlick force-pushed the garlick:sched_interface4 branch from 8554171 to 8e8811b Feb 22, 2019
@garlick

This comment has been minimized.

Copy link
Member Author

commented Feb 22, 2019

sched-dummy now is using mode=single (only) to reduce its complexity.

@grondo

This comment has been minimized.

Copy link
Contributor

commented Feb 23, 2019

Cool. I was able to tie my simple allocator to your current implementation.
Very simple example:

grondo@asp:~/git/flux-core.git$ src/cmd/flux start -s 4
flux-start: warning: setting --bootstrap=selfpmi due to --size option
grondo@asp:~/git/flux-core.git$ t/jobspec/summarize-minimal-jobspec.py<t/j.json
{"total_num_cores": 2, "total_num_nodes": 0, "total_num_tasks": 1}
grondo@asp:~/git/flux-core.git$ t/ingest/submitbench -r 8 t/j.json
549856477184
549940363264
550007472128
550091358208
550175244288
550259130368
550343016448
550426902528
grondo@asp:~/git/flux-core.git$ flux module load sched-simple
grondo@asp:~/git/flux-core.git$ flux job list
JOBID           STATE   USERID  PRI     T_SUBMIT
549856477184    R       1000    16      2019-02-23T18:37:15Z
549940363264    R       1000    16      2019-02-23T18:37:15Z
550007472128    R       1000    16      2019-02-23T18:37:15Z
550091358208    R       1000    16      2019-02-23T18:37:15Z
550175244288    R       1000    16      2019-02-23T18:37:15Z
550259130368    R       1000    16      2019-02-23T18:37:15Z
550343016448    R       1000    16      2019-02-23T18:37:15Z
550426902528    R       1000    16      2019-02-23T18:37:15Z
grondo@asp:~/git/flux-core.git$ flux dmesg | grep sched-simple
2019-02-23T18:37:43.264549Z broker.debug[0]: insmod sched-simple
2019-02-23T18:37:43.265239Z sched-simple.debug[0]: service_register
2019-02-23T18:37:43.266735Z sched-simple.debug[0]: ready: rank[0-3]/core[0-3]
2019-02-23T18:37:43.267880Z sched-simple.debug[0]: req: id=549856477184 spec={0,1,2}
2019-02-23T18:37:43.267914Z sched-simple.debug[0]: alloc: 549856477184: rank0/core[0-1]
2019-02-23T18:37:43.272850Z sched-simple.debug[0]: req: id=549940363264 spec={0,1,2}
2019-02-23T18:37:43.272895Z sched-simple.debug[0]: alloc: 549940363264: rank1/core[0-1]
2019-02-23T18:37:43.277619Z sched-simple.debug[0]: req: id=550007472128 spec={0,1,2}
2019-02-23T18:37:43.277659Z sched-simple.debug[0]: alloc: 550007472128: rank2/core[0-1]
2019-02-23T18:37:43.282570Z sched-simple.debug[0]: req: id=550091358208 spec={0,1,2}
2019-02-23T18:37:43.282616Z sched-simple.debug[0]: alloc: 550091358208: rank3/core[0-1]
2019-02-23T18:37:43.287540Z sched-simple.debug[0]: req: id=550175244288 spec={0,1,2}
2019-02-23T18:37:43.287587Z sched-simple.debug[0]: alloc: 550175244288: rank0/core[2-3]
2019-02-23T18:37:43.292858Z sched-simple.debug[0]: req: id=550259130368 spec={0,1,2}
2019-02-23T18:37:43.292885Z sched-simple.debug[0]: alloc: 550259130368: rank1/core[2-3]
2019-02-23T18:37:43.297448Z sched-simple.debug[0]: req: id=550343016448 spec={0,1,2}
2019-02-23T18:37:43.297484Z sched-simple.debug[0]: alloc: 550343016448: rank2/core[2-3]
2019-02-23T18:37:43.302245Z sched-simple.debug[0]: req: id=550426902528 spec={0,1,2}
2019-02-23T18:37:43.302282Z sched-simple.debug[0]: alloc: 550426902528: rank3/core[2-3]

submission of the next job "blocks":

$ flux jobspec srun hostname | flux job submit
4818030559232
$ flux job list | tail -1
4818030559232   S       1000    16      2019-02-23T18:41:30Z

I've got some other cleanup (and a couple bugs to fix), but then I could push a branch or draft pr up to play with.

@grondo

This comment has been minimized.

Copy link
Contributor

commented Feb 23, 2019

Ok, found the main bug where cancellation of any job resulted in cancellation of the "pending" job. Now we can show that cancel of a running job frees the resource which triggers allocation to the next job:

grondo@asp:~/git/flux-core.git$ flux job list
JOBID           STATE   USERID  PRI     T_SUBMIT
216023433216    R       1000    16      2019-02-23T18:49:05Z
216107319296    R       1000    16      2019-02-23T18:49:05Z
216191205376    R       1000    16      2019-02-23T18:49:05Z
216275091456    R       1000    16      2019-02-23T18:49:05Z
216358977536    R       1000    16      2019-02-23T18:49:05Z
216426086400    R       1000    16      2019-02-23T18:49:05Z
216509972480    R       1000    16      2019-02-23T18:49:05Z
216593858560    R       1000    16      2019-02-23T18:49:05Z
960579502080    S       1000    16      2019-02-23T18:49:49Z
grondo@asp:~/git/flux-core.git$ flux job cancel 216023433216
grondo@asp:~/git/flux-core.git$ flux job list
JOBID           STATE   USERID  PRI     T_SUBMIT
216023433216    C       1000    16      2019-02-23T18:49:05Z
216107319296    R       1000    16      2019-02-23T18:49:05Z
216191205376    R       1000    16      2019-02-23T18:49:05Z
216275091456    R       1000    16      2019-02-23T18:49:05Z
216358977536    R       1000    16      2019-02-23T18:49:05Z
216426086400    R       1000    16      2019-02-23T18:49:05Z
216509972480    R       1000    16      2019-02-23T18:49:05Z
216593858560    R       1000    16      2019-02-23T18:49:05Z
960579502080    R       1000    16      2019-02-23T18:49:49Z

grondo@asp:~/git/flux-core.git$ flux dmesg | grep sched-simple | tail
2019-02-23T18:50:03.121922Z sched-simple.debug[0]: free: rank0/core[0-1]
2019-02-23T18:50:03.121963Z sched-simple.debug[0]: try_alloc: id=960579502080
2019-02-23T18:50:03.121977Z sched-simple.debug[0]: alloc: 960579502080: rank0/core0
@grondo

This comment has been minimized.

Copy link
Contributor

commented Feb 23, 2019

Found a minor memory leak I believe (almost accidentally) in the current job-manager code. Not a big deal as it would have been found once we tie a scheduler for the job-manager to talk to into the valgrind workload.

diff --git a/src/modules/job-manager/scheduler.c b/src/modules/job-manager/sched
uler.c
index ba731ab6..89c87491 100644
--- a/src/modules/job-manager/scheduler.c
+++ b/src/modules/job-manager/scheduler.c
@@ -378,6 +378,7 @@ static void hello_cb (flux_t *h, flux_msg_handler_t *mh,
         }
         job = queue_next (sc->queue);
     }
+    json_decref (o);
     return;
 nomem:
     errno = ENOMEM;
``
@garlick

This comment has been minimized.

Copy link
Member Author

commented Feb 23, 2019

Sounds great @grondo! I'll go ahead and push a commit here to address the memleak you found (thanks!)

If it sounds like the scheduler is coming together based on this interface, maybe I should try to get the coverage up on this PR as a next step, unless you have found other things here that need addressing?

I could push a branch or draft pr up to play with.

Great! A draft PR is good IMHO.

@garlick

This comment has been minimized.

Copy link
Member Author

commented Feb 23, 2019

I added a commit for the memleak, and also a couple to make the trivial job schema changes just merged in RFC 16.

@codecov-io

This comment has been minimized.

Copy link

commented Feb 23, 2019

Codecov Report

Merging #2031 into master will decrease coverage by 0.29%.
The diff coverage is 66.29%.

@@           Coverage Diff            @@
##           master   #2031     +/-   ##
========================================
- Coverage   80.59%   80.3%   -0.3%     
========================================
  Files         180     187      +7     
  Lines       28966   29433    +467     
========================================
+ Hits        23346   23637    +291     
- Misses       5620    5796    +176
Impacted Files Coverage Δ
src/modules/job-ingest/job-ingest.c 72.56% <100%> (-0.4%) ⬇️
src/modules/job-manager/priority.c 85.71% <100%> (+3.62%) ⬆️
src/common/libschedutil/jobkey.c 100% <100%> (ø)
src/common/libschedutil/free.c 100% <100%> (ø)
src/modules/job-manager/job.c 81.96% <16.66%> (-7.32%) ⬇️
src/modules/job-manager/job-manager.c 59.13% <51.02%> (-9.72%) ⬇️
src/common/libschedutil/ops.c 58.92% <58.92%> (ø)
src/common/libschedutil/hello.c 59.25% <59.25%> (ø)
src/common/libschedutil/alloc.c 60.71% <60.71%> (ø)
src/common/libschedutil/ready.c 61.53% <61.53%> (ø)
... and 14 more
@garlick

This comment has been minimized.

Copy link
Member Author

commented Feb 25, 2019

Add a commit to log the 'note' as part of the alloc exception. This was noted missing in #2038.

Then added debug instrumentation to sched-dummy to simulate an alloc failure, and sharness coverage for the alloc exception.

@garlick garlick force-pushed the garlick:sched_interface4 branch from 2d33dd6 to c7fd474 Feb 26, 2019
@garlick

This comment has been minimized.

Copy link
Member Author

commented Feb 26, 2019

Just forced a push which renames the job manager's scheduler.c to alloc.c per coffee time discussion. It also renames some variables and functions to reflect the overall name change of that component, e.g. struct scheduler to struct alloc_ctx. This change and other incremental change squashed.

I'm OK to have this considered for a merge at this point.

@garlick garlick changed the title WIP: job-manager: add scheduler interface job-manager: add scheduler interface Feb 26, 2019
@garlick garlick force-pushed the garlick:sched_interface4 branch from c7fd474 to 53902e7 Feb 26, 2019
@garlick

This comment has been minimized.

Copy link
Member Author

commented Feb 26, 2019

Rebased on current master and added a unit test for the internal job-manager struct job code.

@grondo

This comment has been minimized.

Copy link
Contributor

commented Feb 26, 2019

I've rebased my simple scheduler branch on top of the most recent version of this PR and everything is working fine. Is this close to ready to merge?

My only comments are that some of the commits could use a descriptive commit body (e.g. job-manager: support multiple queues and similar one-liner commits might be helped by a short description in the body (for future multiverse selves browsing history).

If you are already going to reword some commits, then it would be nice to be consistent with the "job-manager" prefix, either modules/job-manager: or job-manager: (I prefer the latter since it is probably extraneous verbosity to declare the job-manager is under modules), unless there was a sensible reason for the difference I missed.

Obviously the above comments are trivial and not required to be addressed before merge.

@garlick garlick force-pushed the garlick:sched_interface4 branch from 53902e7 to b7dcbdb Feb 26, 2019
@garlick

This comment has been minimized.

Copy link
Member Author

commented Feb 26, 2019

You are quite right that commit comments were too sparse. I've gone through it and updated them, and I'm now OK with merging, with the understanding that we'll improve test coverage over time, especially once eventlog and state synchronization are implemented (#2041 and #1665 respectively).

garlick added 3 commits Feb 14, 2019
Generalize 'struct queue' to prepare the way for
other queues to exist in the job manager.

The 'struct queue' lookup hash is now optional.
Since all active jobs will be in the main queue with
the hash to speed up lookup by id, other queues
containing a subset of active jobs can leverage the
main queue's lookup hash and thus don't need one.

The "list handle" is exposed in the API so that
users can assign to the appropriate struct job
member (or ignore it).  Add job->aux_queue_handle,
so that a job can be be in another queue besides
the main one.  Note the handle is returned by insert()
and is used as input to delete() or reorder().

Change the queue implementation so that the job
reference count is incremented when inserted into
the list and decremented when removed from the list.
If the optional hash is enabled, it tracks the list
(not taking a reference on jobs) not the other way
around like before.

Update users and tests.
Problem: manual bit manipulation of internal job
flags is error prone and verbose.

Switch to bitfields.

Drop the flags argument from internal job_create() function
and update users.
Problem: jobs are submitted in batches to the job manager.
If one job fails to be added, all the jobs fail.  However,
when there is a failure, some jobs may be left in the queue
after their users were told they failed.

Keep a list of newly added jobs.  On failure, walk the
list and dequeue the jobs.
garlick added 12 commits Jan 18, 2019
Add sched-dummy, an test schduler that doesn't
properly schedule anything, but demonstrates and
exercises the libschedutil code and the job manager
scheduler interface.

The scheduler is started with some number of imaginary
cores to allocate. Change with --cores option.

Each job is presumed to be requesting one core.
Parsing jobspec is beyond the capability of this scheduler.

When a core is available, a fake R is provided which
contains the string "1core".

There is a "module debug" flag (0x01) which causes allocations
to fail, so that this code can be exercised in the job-manager.
Add the scheduler interface, defining an initial
handshake so that the scheduler can be unloaded
and reloaded on a live system while still accounting
for allocated resources.

Implement two scheduler modes:

single: only one alloc request is allowed to be
outstanding to the scheduler at one time.  This is
appropriate for the FCFS scheduler, and requires
that the job manager implement a queue of jobs
that are ready to allocate resources.

unlimited: an alloc request is immediatley sent
when a job becomes ready to be allocated.  The
queue is actually still useful in the event that
the scheduler is not loaded when jobs become
ready.

Define alloc and free RPCs that do not use matchtags
or futures.  One side effect is that a "failure"
returned in the usual RFC 6 way cannot be associated
wtih a job, so we have to encode the expected forms
of failure in a "success" response payload, and
reserve the normal failure for disasters that should
stop the scheduler interface.

The code to advance the job state and log eventlog
messages leaves a bit to be desired here, but it's
a start that can be improved upon later.
When the ingest module adds jobs to the queue,
advance the state to SCHED (skipping DEPEND
because we don't have that yet) and insert the
job in the "alloc_ctx" queue.  Once the scheduler
is ready, a sched.alloc request will be sent
on behalf of the job.

When the job-manager is restarting from the KVS,
do the same thing for newly read-in jobs in the
SCHED state.
Now that the job-manager groks the SCHED state,
jobs will immediately advance there from NEW.

Change t2202-job-manager.t to expect jobs to
be in S instead of N state after submission.
In t2201-job-cmd.t and t2202-job-manager.t,
use a simple invocation of flux jobspec to generate
test job input, rather than converting YAML test
input to JSON, now that we have the flux-jobspec
command.

Also, drop test prereq settings that were not used
in these tests.
In job_create_from_eventlog(), the eventlog is
replayed to set the job state, and internal flags.

Incorporate alloc and free events into the eventlog
replay logic, so that a job that has an alloc event
is moved to SCHED state with the has_resources flag
set, and a job that has a free event is moved to
CLEANUP with has_resources clear.
When handling an exception of severity 0,
notify the scheduler machinery so that it can
clean up any allocated resources.
Add a simple sharness test that exercises the job-manager
scheduler interface using sched-dummy.
We will initially have sched-dummy for testing,
and, shortly, sched-simple for our first functional
in-core scheduler.

There is significant boiler plate code needed
to handle the sched - job-manager interaction
that would be duplicated in both schedulers, so
rather than cut & paste, put it in an internal
library, "libschedutil".

After we've had some experience with it, hopefully
it can be improved to the point that it can be
exported as part of the API for external schedulers
like the flux-sched project.
Problem: J-signed was renamed to J in RFC 16.

Change the job-ingest code that commits J to
the KVS to use the new name.
Problem: priority and userid were dropped as
individual job keys in RFC 16.  The are now
in the context of the submit event.

Change the job-ingest code that commits priority
and userid to the KVS to omit these keys.

Update job-ingest sharness test.
Add a unit test for 'struct job', with some basic
checks, and also some tests to verify that various
eventlogs replay to expected state.
@garlick garlick force-pushed the garlick:sched_interface4 branch from b7dcbdb to 366e91d Feb 26, 2019
@garlick

This comment has been minimized.

Copy link
Member Author

commented Feb 26, 2019

Repushed with one typo fix and changed bitfield defintion from int to uint8_t. Doh!

@grondo

This comment has been minimized.

Copy link
Contributor

commented Feb 26, 2019

Very nice! Thanks! As a sanity check I'll rebase my PR here, but after Travis turns green this looks good to merge.

@codecov-io

This comment has been minimized.

Copy link

commented Feb 26, 2019

Codecov Report

Merging #2031 into master will decrease coverage by 0.2%.
The diff coverage is 68.84%.

@@            Coverage Diff             @@
##           master    #2031      +/-   ##
==========================================
- Coverage    80.6%   80.39%   -0.21%     
==========================================
  Files         180      187       +7     
  Lines       28966    29435     +469     
==========================================
+ Hits        23348    23665     +317     
- Misses       5618     5770     +152
Impacted Files Coverage Δ
src/modules/job-ingest/job-ingest.c 72.56% <100%> (-0.4%) ⬇️
src/modules/job-manager/priority.c 85.71% <100%> (+3.62%) ⬆️
src/modules/job-manager/job.c 98.38% <100%> (+9.1%) ⬆️
src/common/libschedutil/jobkey.c 100% <100%> (ø)
src/common/libschedutil/free.c 100% <100%> (ø)
src/modules/job-manager/job-manager.c 59.13% <51.02%> (-9.72%) ⬇️
src/common/libschedutil/ops.c 58.92% <58.92%> (ø)
src/common/libschedutil/hello.c 59.25% <59.25%> (ø)
src/common/libschedutil/ready.c 61.53% <61.53%> (ø)
src/common/libschedutil/alloc.c 69.64% <69.64%> (ø)
... and 14 more
@grondo grondo merged commit 8104c17 into flux-framework:master Feb 26, 2019
2 of 4 checks passed
2 of 4 checks passed
codecov/patch 68.84% of diff hit (target 80.6%)
Details
codecov/project 80.39% (-0.21%) compared to 880e050
Details
Mergify — Summary 1 potential rule
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@garlick garlick deleted the garlick:sched_interface4 branch Feb 26, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.