
add simple fcfs node and core scheduler #2038

Merged
merged 11 commits into flux-framework:master from grondo:sched-simple Feb 27, 2019

Conversation

@grondo
Contributor

commented Feb 24, 2019

Putting this draft PR up mainly to synchronize with #2031 and proffer a simple (embarrassingly so) scheduler for flux-core, to be used for experimentation and testing of the job-manager/scheduler interface, as well as an in-tree testing target and possible vehicle for simulated execution and other upcoming sprints.

This scheduler (really a node/core allocator) just maintains a list of nodes and free cores, generated from the resource.hwloc.by_rank object, and attempts to allocate to the current job based on simple strategies (allocate from the "most available" node by default).
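
To illustrate the default "most available" (worst-fit) strategy, here is a minimal sketch; the struct and function names below are illustrative only, not the actual rlist/rnode code in this PR, which builds its node list from resource.hwloc.by_rank:

#include <stdio.h>
#include <stddef.h>

/* Hypothetical per-rank record: total cores and cores still free. */
struct node {
    int rank;
    int ncores;
    int nfree;
};

/* Worst-fit: pick the node with the most free cores, then carve the
 * requested number of cores out of it.  Returns the chosen rank, or -1
 * if no single node can satisfy the request right now.
 */
static int alloc_cores_worst_fit (struct node *nodes, size_t n, int ncores)
{
    struct node *best = NULL;
    for (size_t i = 0; i < n; i++) {
        if (nodes[i].nfree >= ncores
            && (!best || nodes[i].nfree > best->nfree))
            best = &nodes[i];
    }
    if (!best)
        return -1;          /* caller treats this as ENOSPC: job must wait */
    best->nfree -= ncores;
    return best->rank;
}

int main (void)
{
    struct node nodes[] = { { 0, 4, 1 }, { 1, 4, 4 }, { 2, 4, 2 } };
    /* A 2-core request lands on rank 1, the "most available" node. */
    printf ("allocated on rank %d\n", alloc_cores_worst_fit (nodes, 3, 2));
    return 0;
}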

I can put more detail up a bit later, but wanted to get the initial PR up in case @garlick (or anyone else) wanted to give early feedback.

In case I have to defend myself, we did say we wanted a simple allocator initially for our first cut at an fcfs scheduler in flux-core.

@grondo grondo force-pushed the grondo:sched-simple branch from 5cb15b4 to d8acc50 Feb 24, 2019
@garlick

Member

commented Feb 24, 2019

I poked at this a bit and it seems to work fine on my small system at home at least! It is pretty exciting to have this piece of the system working.

I did get a segfault running this, which I realize is unfair input at this stage, and that we'll want to restrict jobspec with validation at ingest, but thought I'd report it anyway in case it's useful (I reran the broker under gdb to get a backtrace):

$ t/jobspec/y2j.py <t/jobspec/valid/basic.yaml | flux job submit
Thread 15 "flux-broker-0" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffc37fe700 (LWP 15374)]
rlist_mrlist (rl=<optimized out>) at rlist.c:324
324	    n = zlistx_first (rl->nodes);
(gdb) bt full
#0  rlist_mrlist (rl=<optimized out>) at rlist.c:324
        n = 0x0
        mrn = 0x0
        l = 0x7fffb0008570
#1  0x00007ffff24a576a in rlist_compressed (rl=0x0) at rlist.c:350
        mrn = 0x0
        o = 0x7fffb0008540
        l = 0x0
        mrn = <optimized out>
        o = <optimized out>
        l = <optimized out>
        entry = <optimized out>
#2  rlist_to_R (rl=rl@entry=0x0) at rlist.c:437
        R = 0x0
        R_lite = <optimized out>
#3  0x00007ffff24a3afe in Rstring_create (l=0x0) at sched.c:98
        s = 0x0
        R = <optimized out>
        s = <optimized out>
        R = <optimized out>
#4  try_alloc (h=h@entry=0x7fffb0001720, ss=ss@entry=0x7fffb0012820)
    at sched.c:121
        alloc = 0x0
        jj = <optimized out>
        R = 0x0
#5  0x00007ffff24a3d75 in alloc_cb (h=0x7fffb0001720, msg=0x7fffb0006fc0,
    jobspec=0x7fffb00193e0 "{\"attributes\":null,\"tasks\":[{\"slot\":\"foo\",\"count\":{\"per_slot\":1},\"command\":\"app\"}],\"version\":1,\"resources\":[{\"count\":1,\"type\":\"slot\",\"with\":[{\"count\":1,\"type\":\"node\"}],\"label\":\"foo\"}]}", arg=0x7fffb0012820) at sched.c:205
        ss = 0x7fffb0012820
#6  0x00007ffff24a6abd in alloc_continuation (f=0x7fffb0018af0,
    arg=0x7fffb0018f90) at ops.c:40
        ctx = 0x7fffb0018f90
        msg = 0x7fffb0006fc0
        jobspec = 0x7fffb00193e0 "{\"attributes\":null,\"tasks\":[{\"slot\":\"foo\",\"count\":{\"per_slot\":1},\"command\":\"app\"}],\"version\":1,\"resources\":[{\"count\":1,\"type\":\"slot\",\"with\":[{\"count\":1,\"type\":\"node\"}],\"label\":\"foo\"}]}"
@grondo

Contributor Author

commented Feb 24, 2019

Ugh! Sorry about that. I should have mentioned that the current code can only deal with the minimal (v1) jobspec (flux-framework/rfc#149). However, the segfault here is just a stupid failure to check for a NULL result on ENOSPC. I'll push a fix shortly.
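
For context, the class of bug involved is roughly the following sketch: an allocation helper returns NULL both when the request simply cannot be satisfied right now (errno set to ENOSPC) and on real errors, and the caller must check for NULL before serializing a response. The names here (rlist_alloc_R, try_alloc) are made up for illustration, not the actual sched.c:

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Stub standing in for the real allocator: returns a resource-set string,
 * or NULL with errno == ENOSPC when nothing is free right now. */
static char *rlist_alloc_R (int nslots, int slot_size)
{
    (void)nslots; (void)slot_size;
    errno = ENOSPC;
    return NULL;
}

/* The missing check: never serialize a NULL result, and treat ENOSPC
 * ("wait for resources") differently from a hard error ("deny"). */
static int try_alloc (int nslots, int slot_size)
{
    char *R = rlist_alloc_R (nslots, slot_size);
    if (!R) {
        if (errno == ENOSPC)
            return 0;                       /* job stays queued */
        fprintf (stderr, "alloc: %s\n", strerror (errno));
        return -1;                          /* deny the alloc request */
    }
    printf ("allocated: %s\n", R);
    free (R);
    return 1;
}

int main (void)
{
    return try_alloc (1, 1) < 0 ? 1 : 0;
}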

@grondo grondo force-pushed the grondo:sched-simple branch from 9d91ca1 to 3c23bdf Feb 24, 2019
@grondo

Contributor Author

commented Feb 24, 2019

Ok, fixed that segfault, sorry @garlick! The scheduler now responds using schedutil_alloc_respond_denied on failure to decode the jobspec (or an invalid minimal jobspec that sneaks through) or other early errors.

I did inadvertently call flux_respond_error() at first, even though you had warned of this "oddity" in the alloc "request". Maybe we can find a way to hide the flux_msg_t from the scheduler interface so that a sched developer won't be tempted to respond with an error? That is something we could tackle later.

Also, I don't think I see the "note" appear in the eventlog for an alloc exception like this:

$ t/jobspec/y2j.py < t/jobspec/valid/basic.yaml | flux job submit
211241926656
$ flux job list
JOBID		STATE	USERID	PRI	T_SUBMIT
211241926656	C	1000	16	2019-02-24T22:16:43Z
$ flux kvs get $(flux job id --to=kvs-active 211241926656).eventlog
1551046603.764767 submit userid=1000 priority=16
1551046603.778242 exception type=alloc severity=0 userid=4294967295
@grondo

Contributor Author

commented Feb 25, 2019

I've added an experimental change for flux-hwloc reload to add a cpuset member to the resource.hwloc.by_rank JSON object, which encodes the actual logical core ids to which the rank has access. This allows sched-simple to work off a real core list instead of just a core count on each rank (important, for example, when core sibling threads are masked off or we are running more than one broker per physical node).

This does make the object a bit more verbose, especially if running more than one broker rank per node:

$ srun -N4 -n16 src/cmd/flux start flux kvs get resource.hwloc.by_rank | jq -S .
{
  "[0,4,8,12]": {
    "Core": 1,
    "PU": 1,
    "Package": 1,
    "cpuset": "0"
  },
  "[1,5,9,13]": {
    "Core": 1,
    "PU": 1,
    "Package": 1,
    "cpuset": "1"
  },
  "[2,6,10,14]": {
    "Core": 1,
    "PU": 1,
    "Package": 1,
    "cpuset": "2"
  },
  "[3,7,11,15]": {
    "Core": 1,
    "PU": 1,
    "Package": 1,
    "cpuset": "3"
  }
}
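
A consumer like sched-simple can turn each cpuset string into a set of logical core ids with flux-core's libidset. A minimal sketch (error handling trimmed; the "0-3" cpuset value is just an example, not taken from the output above):

#include <stdio.h>
#include <flux/idset.h>

int main (void)
{
    /* e.g. the "cpuset" member for one rank in resource.hwloc.by_rank */
    const char *cpuset = "0-3";
    struct idset *cores = idset_decode (cpuset);
    if (!cores)
        return 1;
    printf ("%zu cores:", idset_count (cores));
    unsigned int id = idset_first (cores);
    while (id != IDSET_INVALID_ID) {
        printf (" core%u", id);
        id = idset_next (cores, id);
    }
    printf ("\n");
    idset_destroy (cores);
    return 0;
}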
@grondo grondo force-pushed the grondo:sched-simple branch 2 times, most recently from a0e9ecf to 3b82c0d Feb 26, 2019
@grondo

Contributor Author

commented Feb 26, 2019

I've force-pushed with various updates and fixes:

  • Fixed several quoting issues in t0022-jj-reader.t which were causing false successes.
  • The allocator can now distinguish requests that are unsatisfiable from those that can block waiting for resources (see the sketch after this list). Unsatisfiable requests are accepted by submit, but the scheduler denies the alloc request (once we have feasibility checks in addition to validation, this behavior could be promoted).
  • A t2300-sched-simple.t sharness test now tests much of the simple scheduler behavior. An rlist-query test utility is built to query the currently available resources.
  • Some other whitespace fixes, cleanup, etc.
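
Sketch of the unsatisfiable-vs-blocked distinction referenced above, using illustrative names rather than the actual rlist code: a request is unsatisfiable if it exceeds the instance's total resources, while a blocked request fits in the total but not in what is currently free.

#include <stdbool.h>
#include <stdio.h>

/* Totals for the whole instance vs. what is free right now. */
struct pool {
    int total_nodes, total_cores;
    int free_nodes,  free_cores;
};

/* Unsatisfiable: can never be met, so deny the alloc request outright. */
static bool unsatisfiable (const struct pool *p, int nnodes, int ncores)
{
    return nnodes > p->total_nodes || ncores > p->total_cores;
}

/* Blocked: fits in the instance, just not right now, so leave it queued. */
static bool must_wait (const struct pool *p, int nnodes, int ncores)
{
    return nnodes > p->free_nodes || ncores > p->free_cores;
}

int main (void)
{
    struct pool p = { .total_nodes = 4, .total_cores = 16,
                      .free_nodes = 1,  .free_cores = 4 };
    printf ("8 nodes: %s\n", unsatisfiable (&p, 8, 8) ? "deny" : "accept");
    printf ("2 nodes: %s\n", must_wait (&p, 2, 2) ? "wait" : "alloc now");
    return 0;
}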

Most of this can be squashed down once we're close to ready to merge.

@grondo grondo force-pushed the grondo:sched-simple branch from 38b6d14 to 921c89f Feb 26, 2019
@grondo

Contributor Author

commented Feb 27, 2019

Ok, rebased on current master and squashed the intermediate work.

I'm moving this PR out of draft stage as I think it is fairly complete -- open questions or work that may remain include:

  • For now, should sched-simple be added by default to rc1? I don't see much reason to do that since we can't actually run anything yet; however, the job-ingest and job-manager modules are there already, and it is just as easily removed later...
  • There are some idset functions added to rlist.c; I should see if any of these should be promoted into libidset (though that could always be done at a later time). A sketch of one such helper follows this list.
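
Sketch of the kind of idset helper that might be promoted, built only from existing public libidset calls; this is illustrative and not necessarily one of the helpers actually added to rlist.c:

#include <flux/idset.h>

/* Hypothetical helper: add every id in 'b' to 'a' (i.e. a |= b).
 * Returns 0 on success, -1 on error. */
int idset_union_into (struct idset *a, const struct idset *b)
{
    unsigned int id = idset_first (b);
    while (id != IDSET_INVALID_ID) {
        if (idset_set (a, id) < 0)
            return -1;
        id = idset_next (b, id);
    }
    return 0;
}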
@grondo grondo marked this pull request as ready for review Feb 27, 2019
@grondo grondo requested review from garlick and SteVwonder Feb 27, 2019
@garlick

Member

commented Feb 27, 2019

My vote would be to load sched-dummy in rc1 and unload in rc3. One nice effect is that it will then get a bit of valgrind coverage.

I'm fine with factoring idset functions out of rlist.c into libidset in the future (perhaps open a bug?)

@grondo

Contributor Author

commented Feb 27, 2019

One nice effect is that it will then get a bit of valgrind coverage.

Yeah, I had considered that. If so, to get better coverage I may extend the "job" workload to cancel some jobs as well as submit them.

@grondo

Contributor Author

commented Feb 27, 2019

I may extend the "job" workload to cancel some jobs as well as submit them.

Eh, that will be much easier once flux job wait is implemented. I don't want to propagate kludgy synchronization across more than one test at this time...

@garlick

Member

commented Feb 27, 2019

I'm for merging this as soon as you feel you're ready.

If so, to get better coverage I may extend the "job" workload to cancel some jobs as well as submit them.

Up to you but not required as far as I'm concerned.

Member

left a comment

Looks great to me, nice work!

@grondo

Contributor Author

commented Feb 27, 2019

Ok, added sched-simple to default rc scripts.

@codecov-io


commented Feb 27, 2019

Codecov Report

Merging #2038 into master will increase coverage by <.01%.
The diff coverage is 80.59%.

@@            Coverage Diff             @@
##           master    #2038      +/-   ##
==========================================
+ Coverage   80.39%   80.39%   +<.01%     
==========================================
  Files         187      191       +4     
  Lines       29435    30239     +804     
==========================================
+ Hits        23665    24312     +647     
- Misses       5770     5927     +157
Impacted Files Coverage Δ
src/cmd/builtin/hwloc.c 81.42% <100%> (+0.79%) ⬆️
src/modules/sched-simple/sched.c 72.54% <72.54%> (ø)
src/modules/sched-simple/rlist.c 81.71% <81.71%> (ø)
src/modules/sched-simple/rnode.c 85.71% <85.71%> (ø)
src/modules/sched-simple/libjj.c 86.2% <86.2%> (ø)
src/common/libflux/ev_flux.c 97.56% <0%> (-2.44%) ⬇️
src/common/libflux/msg_handler.c 89.06% <0%> (-1.18%) ⬇️
src/connectors/local/local.c 88.57% <0%> (-0.72%) ⬇️
src/common/libflux/reactor.c 90.95% <0%> (-0.23%) ⬇️
src/common/libflux/message.c 81.39% <0%> (-0.13%) ⬇️
... and 7 more
@dongahn

Contributor

commented Feb 27, 2019

load sched-dummy in rc1 and unload in rc3

Sorry I'm coming at this a bit late. But does it mean that flux-sched will have to unload sched-dummy within its rc1?

@grondo grondo force-pushed the grondo:sched-simple branch from 3f3b7cd to 836acc6 Feb 27, 2019
@grondo

Contributor Author

commented Feb 27, 2019

Sorry I'm coming at this a bit late. But does it mean that flux-sched will have to unload sched-dummy within its rc1?

Yes, but flux-sched will have to be adapted to the new job-manager/scheduler interface anyway, and that API is not settled quite yet (thus the need for a dummy scheduler for testing). Once that interface is finalized, we'll probably have a better way to load flux-sched's full scheduler by default when present, instead of the core-only simple sched.

@grondo

Contributor Author

commented Feb 27, 2019

I forced another push. I forgot to add FLUX_KVS_WAITCREATE to the kvs_lookup for the resource.hwloc.by_rank key, and several tests were failing due to the resulting race (flux hwloc reload is performed in the background).
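
For reference, FLUX_KVS_WAITCREATE makes the lookup wait for the key to exist rather than failing with ENOENT, which closes the race with the backgrounded flux hwloc reload. A hedged sketch, assuming the flux_kvs_lookup(h, flags, key) signature from flux-core of this era (later releases insert a namespace argument):

#include <string.h>
#include <flux/core.h>

/* Look up resource.hwloc.by_rank, waiting for the key to be created if
 * flux-hwloc reload has not populated it yet.  Synchronous for brevity;
 * returns a copy the caller must free(), or NULL on error. */
static char *lookup_by_rank (flux_t *h)
{
    flux_future_t *f;
    const char *json_str;
    char *copy = NULL;

    if (!(f = flux_kvs_lookup (h, FLUX_KVS_WAITCREATE,
                               "resource.hwloc.by_rank")))
        return NULL;
    if (flux_kvs_lookup_get (f, &json_str) == 0)
        copy = strdup (json_str);
    flux_future_destroy (f);
    return copy;
}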

@grondo

Contributor Author

commented Feb 27, 2019

Hit a valgrind failure in one of the builders. Could the job manager have been unloaded before the kvs_commit response was received, allowing the flux_future_t to leak? There is more going on in the valgrind job workload now that a scheduler is loaded, but there is no synchronization to ensure the workload doesn't exit until the system is quiescent... I'll have to add some ad-hoc synchronization after all...

==28801== 
==28801== HEAP SUMMARY:
==28801==     in use at exit: 63,724 bytes in 106 blocks
==28801==   total heap usage: 158,642 allocs, 158,536 frees, 57,110,519 bytes allocated
==28801== 
==28801== 5,243 (128 direct, 5,115 indirect) bytes in 1 blocks are definitely lost in loss record 42 of 45
==28801==    at 0x4C31B25: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==28801==    by 0x4E60A18: flux_future_create (future.c:312)
==28801==    by 0x4E5D6A7: flux_rpc_message_nocopy (rpc.c:227)
==28801==    by 0x4E5DCE9: flux_rpc_vpack (rpc.c:368)
==28801==    by 0x4E5DCE9: flux_rpc_pack (rpc.c:381)
==28801==    by 0x4E643C3: flux_kvs_commit (kvs_commit.c:119)
==28801==    by 0x12F2834F: eventlog_append.isra.1 (alloc.c:131)
==28801==    by 0x12F287B7: alloc_response_cb (alloc.c:289)
==28801==    by 0x4E587C6: call_handler (msg_handler.c:215)
==28801==    by 0x4E58CD5: dispatch_message (msg_handler.c:242)
==28801==    by 0x4E58CD5: handle_cb (msg_handler.c:308)
==28801==    by 0x4E83952: ev_invoke_pending (ev.c:3322)
==28801==    by 0x4E86F48: ev_run (ev.c:3726)
==28801==    by 0x4E57972: flux_reactor_run (reactor.c:126)
==28801==    by 0x12F2666F: mod_main (job-manager.c:225)
==28801==    by 0x11422B: module_thread (module.c:144)
==28801==    by 0x55B16DA: start_thread (pthread_create.c:463)
==28801==    by 0x635888E: clone (clone.S:95)
==28801== 
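
For reference, the usual pattern that avoids this kind of leak is to hang a continuation on the fire-and-forget future and destroy it there (and to destroy any still-pending futures at module unload). A hedged sketch using generic flux future calls, not the job-manager's actual eventlog code:

#include <flux/core.h>

/* Continuation for a fire-and-forget KVS commit: log failure and, most
 * importantly, destroy the future so it cannot leak at module unload. */
static void commit_continuation (flux_future_t *f, void *arg)
{
    flux_t *h = arg;

    if (flux_future_get (f, NULL) < 0)
        flux_log_error (h, "eventlog commit failed");
    flux_future_destroy (f);
}

/* After creating the commit future 'f': */
static int watch_commit (flux_t *h, flux_future_t *f)
{
    if (flux_future_then (f, -1., commit_continuation, h) < 0) {
        flux_future_destroy (f);
        return -1;
    }
    return 0;
}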
@garlick

Member

commented Feb 27, 2019

Suggestion: maybe just drop the rc1/rc3 addition (and hence addition to the valgrind workload) until we have #2044 sorted out?

Add a trivial "library" for reading simple (v1) json jobspec.
Reads in a jobspec JSON object and returns the request counts
of nodes and slots as well as the requested slot size.
grondo added 7 commits Feb 23, 2019
Add a jj-reader helper utility under t/sched-simple/jj-reader.c
which can be used to drive unit testing for the simple json
jobspec libjj_get_counts() API call.
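
For illustration, a jj-reader-style driver might use libjj_get_counts() like this; the counts struct and exact signature shown here are assumptions for illustration, not the PR's actual definitions:

#include <stdio.h>

/* Assumed shape of the result and entry point -- illustrative only. */
struct jj_counts {
    int nnodes;     /* requested node count (0 = unspecified) */
    int nslots;     /* requested number of slots */
    int slot_size;  /* cores per slot */
};
extern int libjj_get_counts (const char *jobspec, struct jj_counts *jj);

int print_counts (const char *jobspec)
{
    struct jj_counts jj;

    if (libjj_get_counts (jobspec, &jj) < 0) {
        fprintf (stderr, "invalid minimal jobspec\n");
        return -1;
    }
    printf ("nnodes=%d nslots=%d slot_size=%d\n",
            jj.nnodes, jj.nslots, jj.slot_size);
    return 0;
}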
Add a very simple `rlist` interface for managing allocation requests
against a list of nodes and cores. Satisfies requests for resources
including an optional number of nodes, a number of "slots" and a
slot size (in number of cores).

Supports selecting slots from the least full nodes first (worst-fit)
or the least available nodes first (best-fit) or strict rank order
(first-fit).
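
For illustration, the three orderings differ only in how two candidate nodes are compared; a small sketch with hypothetical names, not the rlist code itself:

/* One candidate node: broker rank plus number of currently free cores. */
struct cand {
    int rank;
    int nfree;
};

enum policy { WORST_FIT, BEST_FIT, FIRST_FIT };

/* Return nonzero if 'a' should be preferred over 'b' under policy 'p':
 *  - worst-fit: most free cores first (least full node)
 *  - best-fit:  fewest free cores first (least available node)
 *  - first-fit: strictly ascending rank order
 */
static int prefer (enum policy p, const struct cand *a, const struct cand *b)
{
    switch (p) {
        case WORST_FIT: return a->nfree > b->nfree;
        case BEST_FIT:  return a->nfree < b->nfree;
        case FIRST_FIT: return a->rank  < b->rank;
    }
    return 0;
}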
Add a set of unit tests for the rlist and rnode interfaces.
Problem: the resource.hwloc.by_rank key encodes the number of cores
available on each rank, but does not encode the logical core ids, and
not all available core lists will be strictly core0-core(n-1).

Add a new `cpuset` key to the by_rank JSON object which encodes the
list of available cores for the indicated rank, so that this list
may be used by readers of this object.
Update the flux-hwloc tests to include the new "cpuset" member
of resource.hwloc.by_rank.
Add a simplistic fcfs node and core allocator.

This allocator initializes from the resource.hwloc.by_rank key
and maintains a list of free and allocated cores across flux-broker
ranks as denoted in the by_rank key.

The allocator uses libjj to support v1 minimal jobspec requests of
the form {nnodes, nslots, slot_size}, and allocates these from an
internal struct rlist (rank/core list).

The fcfs "queue" is implemented by registering with the job-manager
in "single" mode. The job-manager protocol is managed using the
schedutil library provided by flux-core.

Closes #1333
For debug and testing purposes, add an `rlist-query` utility that
sends a `sched-simple.status` RPC and formats the returned R JSON
object into a one-line short-form resource list.
@grondo grondo force-pushed the grondo:sched-simple branch from 836acc6 to 05656a2 Feb 27, 2019
grondo and others added 2 commits Feb 25, 2019
Add a small sanity testsuite for the sched-simple module.
Problem: If an alloc request has been sent to, but not yet received by,
a scheduler when an exception event is processed, then the scheduler may
still send an alloc response to the job-manager, causing the job to
transition from CLEANUP->RUN.

Avoid the invalid state transition by checking the job state in
alloc_response_cb(), and only transition to RUN if the current state
is SCHED. For jobs in CLEANUP state, a "free" request is issued
immediately.

Fixes #2051
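
A sketch of the state check this commit describes, with hypothetical names for the job structure and states (the real job-manager code differs in detail):

#include <stdbool.h>

/* Hypothetical job-manager states relevant to the race described above. */
enum job_state { STATE_SCHED, STATE_RUN, STATE_CLEANUP };

struct job {
    enum job_state state;
    bool has_resources;
};

/* On receiving an alloc response: only jobs still in SCHED may enter RUN.
 * A job already in CLEANUP (an exception raced the response) instead has
 * its just-granted resources released immediately. */
static void alloc_response (struct job *job,
                            void (*free_request) (struct job *))
{
    job->has_resources = true;
    if (job->state == STATE_SCHED)
        job->state = STATE_RUN;
    else
        free_request (job);         /* e.g. job in CLEANUP: free right away */
}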
@grondo grondo force-pushed the grondo:sched-simple branch from 05656a2 to 3e02c26 Feb 27, 2019
@grondo

Contributor Author

commented Feb 27, 2019

Suggestion: maybe just drop the rc1/rc3 addition (and hence addition to the valgrind workload) until we have #2044 sorted out?

Ok, done and forced a push.

I also improved the tests in t2300-sched-simple.t to add a timeout to the job_wait_event() shell function.

Finally, added your fix for #2051.

Problem: upgrading pylint with every travis-ci run may result in
unexpected errors unrelated to the PR or master being tested.
expected errors unrelated to the PR or master being tested.

Instead of running `pip install --upgrade pylint`, pin the pylint
version to the last-known working version: 2.2.2.

Partial fix for #2052.
@garlick garlick merged commit fece4aa into flux-framework:master Feb 27, 2019
4 checks passed

Mergify — Summary: 1 potential rule
codecov/patch — 80.54% of diff hit (target 80.4%)
codecov/project — 80.41% (+0.01%) compared to 66b980e
continuous-integration/travis-ci/pr — The Travis CI build passed
@grondo

Contributor Author

commented Feb 27, 2019

Thanks!

@grondo grondo deleted the grondo:sched-simple branch Feb 27, 2019
@SteVwonder

Member

commented Mar 1, 2019

Sorry I'm late to the party. This looks great @grondo!
