
implement exec protocol in job-manager, add prototype job-exec service #2077

Merged
merged 12 commits into flux-framework:master from grondo:job-exec on Mar 21, 2019

Conversation

@grondo
Contributor

commented Mar 14, 2019

This is an early draft PR that implements the exec RPC protocol in the job-manager (by @garlick), along with a prototype job-exec module which implements that protocol and simulates execution of jobs by sleeping for the duration specified in the jobspec (or a default duration).

I will let @garlick follow up with a detailed description of the implemented exec protocol.

Other things of note included in this PR:

  • flux jobspec srun now converts any given Slurm time limit into a duration string (floating-point seconds with an optional s, m, h, or d suffix); a rough sketch of the conversion appears after this list.
  • t5000-valgrind.t now runs with sched-simple and job-exec loaded, and includes synchronization at the end using flux job wait-event.
  • I've cleaned up the simple sched-bench.sh test script and added it under src/test/; it can be used to benchmark running a large set of jobs (with or without execution) through Flux.
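
For illustration only, a rough sketch of the Slurm time-limit conversion idea (this is not the actual flux jobspec srun code; the accepted formats and the choice of an "s" suffix here are assumptions):

```python
#!/usr/bin/env python
# Sketch: convert a Slurm-style time limit (e.g. "30", "12:30", "1:02:30",
# "2-12") into a duration string of floating-point seconds with an "s" suffix.

def slurm_timelimit_to_duration(limit):
    days, rest = 0, limit
    if "-" in limit:
        d, rest = limit.split("-", 1)
        days = int(d)
    fields = [int(x) for x in rest.split(":")] if rest else [0]
    if days:
        # with a days field, the remainder is [H], [H,M], or [H,M,S]
        hours, minutes, seconds = (fields + [0, 0])[:3]
    else:
        # without days, the fields are [M], [M,S], or [H,M,S]
        if len(fields) == 1:
            hours, minutes, seconds = 0, fields[0], 0
        elif len(fields) == 2:
            hours, minutes, seconds = 0, fields[0], fields[1]
        else:
            hours, minutes, seconds = fields
    total = ((days * 24 + hours) * 60 + minutes) * 60 + seconds
    return "%.1fs" % total

print(slurm_timelimit_to_duration("30"))       # 1800.0s
print(slurm_timelimit_to_duration("1:02:30"))  # 3750.0s
print(slurm_timelimit_to_duration("2-12"))     # 216000.0s
```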
@grondo

Contributor Author

commented Mar 14, 2019

Caveats:

  • The current job-exec module is a prototype and will likely need a large refactor when support for actually running job shells is added.
  • Support still needs to be added for submitting test jobspecs that exercise failure modes of job execution, in order to get good coverage, e.g. see #2072.
@grondo force-pushed the grondo:job-exec branch 2 times, most recently from c13e64b to 497b877 Mar 14, 2019
@grondo

Contributor Author

commented Mar 14, 2019

And just for general interest, here are a couple of examples of the added sched-bench script:

$ src/cmd/flux start src/test/sched-bench.sh -n 128  -c 32 -j 4096
sched-bench.sh: On branch job-exec: v0.11.0-450-gc13e64b
sched-bench.sh: starting test with 4096 jobs across 128 with 32 cores/node.
sched-bench.sh: broker.pid=28625
sched-bench.sh: ingested 4096 jobs in 16.666s (245.77s job/s)
sched-bench.sh: allocated 4096 jobs in 32.403s (126.41s job/s)
sched-bench.sh: ran 4096 jobs in 32.410s (126.38s job/s)
sched-bench.sh: total walltime for 4096 jobs in 32.696s (125.27s job/s)
$ src/cmd/flux start src/test/sched-bench.sh -n 128  -c 32 -j 4096 --noexec
sched-bench.sh: On branch job-exec: v0.11.0-450-g497b877
sched-bench.sh: starting test with 4096 jobs across 128 with 32 cores/node.
sched-bench.sh: broker.pid=31154
2019-03-14T19:00:33.303403Z job-manager.err[0]: start: service teardown due to Function not implemented
sched-bench.sh: ingested 4096 jobs in 16.127s (253.98s job/s)
sched-bench.sh: allocated 4096 jobs in 17.519s (233.81s job/s)
sched-bench.sh: total walltime for 4096 jobs in 17.797s (230.15s job/s)

(though there is currently an issue with that second mode of the script)

Not exactly sure why adding running jobs halves the throughput (the jobs are only "running" for milliseconds)

@garlick

Member

commented Mar 14, 2019

Protocol summary

The exec service loads after job-manager, and dynamically registers its service name. It is possible for another instance of the service to be registered after that one to override it, so long as there are no start requests in flight at that point. (Use case: simulator initial program overrides "normal" exec service.)

STARTUP

Exec service sends job-manager.exec-hello request with its service name, {"service":s}. Job-manager responds with success/failure.

Active jobs are scanned, and the hello request fails if any job has an outstanding start request.
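
For illustration, the hello payload might look like the following sketch (the service name "job-exec" is an assumption about what the prototype module registers):

```python
import json

# Hypothetical payload of the job-manager.exec-hello request.
hello = {"service": "job-exec"}
print(json.dumps(hello))
```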

OPERATION

Job manager makes a <exec_service>.start request once resources are allocated. The request is made without a matchtag, so the job id must be present in all response payloads.

A response looks like this: {"id":I "type":s "data":o} where type determines the content of data:

start - indicates job shells have started. data contains: {}

release - release R fragment to job-manager. data contains: {"ranks":s "final":b}

exception - raise an exception (severity 0 is fatal). data contains: {"severity":i "type":s "note"?:s}

finish - job shells have finished. data contains: {} (TBD: should carry global exit status in future)

Responses stream back until a release response is received with "final":true, which means all resources allocated to the job are no longer in use by the exec system. (Currently the job manager ignores partial release and always releases resources together after this message).
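
As a sketch, the response stream for one normally completing job might look like this (the job id is made up; payloads follow the shapes described above, with finish data still empty pending the TBD exit status):

```python
import json

jobid = 337104601088  # made-up example id
# Each dict is one streaming response to the <exec_service>.start request.
responses = [
    {"id": jobid, "type": "start", "data": {}},
    {"id": jobid, "type": "finish", "data": {}},  # global exit status TBD
    {"id": jobid, "type": "release", "data": {"ranks": "all", "final": True}},
]
for r in responses:
    print(json.dumps(r))
```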

TEARDOWN

If an ENOSYS (or other "normal" RPC error) response is returned to a start request, it is assumed that the current service is unloading or a fatal error has occurred. Start requests are paused waiting for another hello. No attempt is made to restart the interface with a previously overridden exec service.

(This design was discussed in #2040)

@grondo

Contributor Author

commented Mar 15, 2019

release - release R fragment to job-manager. data contains: {"idset":s "final":b}

What the current job-exec module uses is {"ranks":s "final":b}; the special case "ranks":"all" releases the entire R. Let me know if that should change.

@grondo

Contributor Author

commented Mar 15, 2019

Are jobs stuck in RUN state due to unloaded exec service supposed to be re-started at the next job-manager.exec-hello RPC?

Jobs submitted after the reload of the exec system do get a start request, though.

grondo@asp:~/git/flux-core.git$ src/cmd/flux start
grondo@asp:~/git/flux-core.git$ flux module remove job-exec
grondo@asp:~/git/flux-core.git$ flux job submit job.json
337104601088
grondo@asp:~/git/flux-core.git$ 2019-03-15T02:17:18.146620Z job-manager.err[0]: start: service teardown due to Function not implemented

grondo@asp:~/git/flux-core.git$ flux job list
JOBID		STATE	USERID	PRI	T_SUBMIT
337104601088	R	1000	16	2019-03-15T02:17:18Z
grondo@asp:~/git/flux-core.git$ flux module load job-exec
grondo@asp:~/git/flux-core.git$ flux job list
JOBID		STATE	USERID	PRI	T_SUBMIT
337104601088	R	1000	16	2019-03-15T02:17:18Z
grondo@asp:~/git/flux-core.git$ flux job submit job.json
3078132596736
grondo@asp:~/git/flux-core.git$ flux job list
JOBID		STATE	USERID	PRI	T_SUBMIT
337104601088	R	1000	16	2019-03-15T02:17:18Z
3078132596736	C	1000	16	2019-03-15T02:20:01Z
@garlick

Member

commented Mar 15, 2019

What the current job-exec module uses is {"ranks":s "final":b}, special case of "ranks": "all" releases the entire R. Let me know if that should change.

Typo in my doc! I'll edit above and submit a patch to the start.c inline docs. The code is fine.

@garlick

Member

commented Mar 15, 2019

Are jobs stuck in RUN state due to unloaded exec service supposed to be re-started at the next job-manager.exec-hello RPC?

I think I was assuming the exec service would generate responses (exception followed by release) for all jobs with outstanding start requests when it unloads, so none would be left with the start_pending flag set. Possibly that's not adequate though - needs some pondering.

@grondo

Contributor Author

commented Mar 15, 2019

Note there were no active jobs when the job-exec module was unloaded; the job stuck in RUN is the one that generated ENOSYS. Oh, I think I see the problem here... there is no way to know which job never made it to the exec system.

@garlick

Member

commented Mar 15, 2019

There is no way to know which job never made it to the exec system

Yeah, I think maybe the answer is: upon receiving ENOSYS, clear all the start_pending bits on all active jobs, on the assumption that at that point, the exec service has done all the cleanup it intends to do (generating exceptions, releasing resources), and what's left are jobs with start requests that were never received or at least never acted upon.

If you look at interface_teardown() in alloc.c, this is what the job manager already does for the sched interface, more or less.
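
A minimal sketch of that idea, assuming a simple in-memory table of active jobs (this is not the actual alloc.c/start.c code):

```python
# On exec service teardown (e.g. ENOSYS), forget outstanding start requests
# so those jobs get a fresh start request after the next job-manager.exec-hello.
def start_interface_teardown(active_jobs):
    for job in active_jobs.values():
        job["start_pending"] = False

jobs = {337104601088: {"state": "RUN", "start_pending": True}}  # made-up job
start_interface_teardown(jobs)
print(jobs)
```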

@garlick

Member

commented Mar 15, 2019

I went ahead and implemented that, and I'll push to this branch. As a test, I unloaded the exec module, submitted a job, then reloaded. The job's eventlog says

1552624060.426626 submit userid=5588 priority=16 flags=2
1552624060.440126 debug.alloc-request
1552624060.444883 alloc rank0/core[0-1]
1552624060.444906 debug.start-request
1552624060.445024 debug.start-lost start response error
1552624096.999135 debug.start-request
1552624097.003232 start
1552624097.004260 finish
1552624097.008307 release ranks=all final=1
1552624097.008332 debug.free-request
1552624097.008850 free
@grondo marked this pull request as ready for review Mar 15, 2019
@grondo changed the title from "implement exec protocol in job-manager, add prototype job-exec service" to "WIP: implement exec protocol in job-manager, add prototype job-exec service" Mar 15, 2019
@grondo

Contributor Author

commented Mar 15, 2019

Removed draft status from this PR (and added a "WIP:" prefix) so we can easily get Travis CI runs.

@garlick

Member

commented Mar 15, 2019

Just added the CLEANUP->INACTIVE state transition. There is a new clean event that is generated once all cleanup activities have completed. This triggers the state transition, and will always be the last event in the job eventlog (potentially useful as an EOF marker).

Once jobs enter INACTIVE, they are removed from the queue but as yet they are not archived in job.inactive.

@grondo

Contributor Author

commented Mar 18, 2019

Just pushed a rather large change to the job-exec module, including:

  • The guest.exec.eventlog is now created, and some dummy events are generated for testing purposes.
  • Ensure that the release final=1 event isn't sent to the job-manager until the move of the guest namespace is complete.
  • Ensure a final event (currently simply "done") is emitted to the exec.eventlog before the namespace move is initiated.

The module still needs some work, especially around handling exceptional conditions

@grondo

Contributor Author

commented Mar 18, 2019

Pushed another incremental step forward -- the job-exec module now handles exceptions due to canceled jobs. Exceptions during initialization should work (including proper cleanup and terminating event in exec.eventlog), but that will require some code to mock those exceptions during testing.

A job canceled after start shows the following in the eventlog (hm, the double exception is kind of weird, but currently the job-exec module doesn't know the difference between internally vs. externally generated exceptions, so it replies to the job manager with an error in both cases; will try to fix):

1552952239.742362 submit userid=1000 priority=16 flags=0
1552952248.132879 alloc rank0/core1
1552952248.137564 start
1552952255.763984 exception type=cancel severity=0 userid=1000
1552952255.764512 exception type=run severity=0 userid=4294967295 aborted due to exception type=cancel
1552952255.772382 release ranks=all final=1
1552952255.773120 free
1552952255.773131 clean

The exec.eventlog (at this point) looks like:

1552952248.134873 init
1552952248.137225 starting
1552952248.137270 running timer=10.000000s 
1552952255.764255 exception aborted due to exception type=cancel
1552952255.764302 cleanup.start
1552952255.767446 cleanup.finish
1552952255.768513 done
@grondo grondo force-pushed the grondo:job-exec branch from eaa5da9 to f753664 Mar 19, 2019
@grondo

Contributor Author

commented Mar 19, 2019

Rebased on current master

@garlick

Member

commented Mar 19, 2019

This is coming along nicely!

Maybe we should add some kind of "global exit status" result to the finish response that can be the basis for a flux_job_wait() response, as discussed in #1665. What should that look like?

Think it might be OK to just return success/failure, and assume in the failure case that the flux_job_wait() client implementation can fetch the first fatal exception, or a full event trace from the primary eventlog?

@grondo

Contributor Author

commented Mar 19, 2019

Maybe we should add some kind of "global exit status" result to the finish response that can be the basis for a flux_job_wait() response, as discussed in #1665. What should that look like?

I was wondering the same thing. We probably cannot reliably summarize the exit status of all tasks in large jobs in the 256 characters allotted to the event context, so I assume the exit status will be summarized by a single integer? We could do the obvious thing and report the maximum task exit status simply as: {s:i}, "status", max_status.

Think it might be OK to just return success/failure, and assume in the failure case that the flux_job_wait() client implementation can fetch the first fatal exception, or a full event trace from the primary eventlog?

Ah, sorry, I misread this statement at first; however, I decided to include my text above anyway as another possibility.

I guess I'm still unclear whether you consider the "global exit status" to include some information about the exit status of the tasks in the job, or whether this status would indicate success even if one or more tasks exited with a nonzero exit code but the job ran to completion without exception?

If we want to summarize the real task exit status, we could require that conforming job shells themselves exit (or otherwise report back) with the maximum exit status of all local tasks. The exec system could then report the highest exit status of all job shells as the "global exit status".

@grondo

Contributor Author

commented Mar 19, 2019

with the maximum exit status of all local tasks. The exec system could then report the highest exit status of all job shells as the "global exit status".

From a recent conversation: of course the shells will not be able to exit with WTERMSIG or WCOREDUMP set if any tasks do, so we'd either have to report the exit status through another channel (e.g. a protocol over stdout, or a custom eventlog entry), or we would have to use special exit codes such as 128+signal, as the shell does.
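
For illustration, a sketch of the 128+signal convention applied to per-task wait statuses (this is not anything implemented in this PR; the statuses are made up):

```python
import os

def task_exit_code(wait_status):
    # Collapse a wait(2)-style status into a single exit code, using the
    # common shell convention of 128+signal for signaled tasks.
    if os.WIFSIGNALED(wait_status):
        return 128 + os.WTERMSIG(wait_status)
    return os.WEXITSTATUS(wait_status)

# Two clean exits and one task killed by SIGKILL (signal 9).
statuses = [0, 0, 9]
print(max(task_exit_code(s) for s in statuses))  # 137
```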

@grondo

Contributor Author

commented Mar 19, 2019

Just added a commit that adds

"s:{s:i}", "data", "status", exit_status

to the finish event response from the exec system to the job-manager.

A custom exit status can be configured by adding an exit_status entry to an attributes.system.exec.test object in the jobspec.

@codecov-io


commented Mar 19, 2019

Codecov Report

Merging #2077 into master will decrease coverage by 0.36%.
The diff coverage is 63.66%.

@@            Coverage Diff             @@
##           master    #2077      +/-   ##
==========================================
- Coverage   80.44%   80.07%   -0.37%     
==========================================
  Files         193      196       +3     
  Lines       30540    31233     +693     
==========================================
+ Hits        24569    25011     +442     
- Misses       5971     6222     +251
Impacted Files Coverage Δ
src/modules/job-manager/alloc.c 74.35% <ø> (-0.14%) ⬇️
src/modules/job-manager/event.c 76.57% <100%> (+3.02%) ⬆️
src/bindings/lua/flux-lua.c 87.34% <100%> (-0.05%) ⬇️
src/modules/job-manager/start.c 46.78% <46.78%> (ø)
src/modules/job-manager/job-manager.c 68.88% <60%> (-1.85%) ⬇️
src/modules/job-exec/job-exec.c 62.47% <62.47%> (ø)
src/modules/job-exec/rset.c 90.32% <90.32%> (ø)
src/common/libutil/aux.c 90.74% <0%> (-3.71%) ⬇️
src/common/libflux/mrpc.c 87.74% <0%> (-1.19%) ⬇️
src/modules/connector-local/local.c 73.62% <0%> (-1.04%) ⬇️
... and 6 more
@grondo grondo force-pushed the grondo:job-exec branch from 2f7882f to 94e7631 Mar 19, 2019
@grondo

Contributor Author

commented Mar 19, 2019

Force pushed with a couple more incremental changes:

  • For clarity, changed use of exit_status to wait_status in both the code and the name of the test key attributes.system.exec.test.wait_status
  • Support for generating internal mock exceptions from job-exec during both initialization and "running" phase via the exec.test object by setting the key mock_exception to either "run" or "init".
  • Fixes along the way for exception handling -- e.g. the exec system no longer generates a second exception when a job is canceled.

I think this is enough infrastructure to get to work on a sharness test for the job-exec module.
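
For reference, a jobspec fragment exercising these test knobs might look like the sketch below; only the key names wait_status and mock_exception come from this PR, while the duration key and all values are illustrative assumptions:

```python
import json

# Illustrative only: test attributes consumed by the prototype job-exec module.
attributes = {
    "system": {
        "duration": "10s",              # assumed simulated run time
        "exec": {
            "test": {
                "wait_status": 256,        # e.g. simulate tasks exiting with 1
                "mock_exception": "run",   # or "init": raise a mock exception
            },
        },
    },
}
print(json.dumps({"attributes": attributes}, indent=2))
```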

@garlick

Member

commented Mar 20, 2019

I like the framework for mock exceptions - nice work there!

@grondo

Contributor Author

commented Mar 20, 2019

Just pushed a commit with a basic set of sharness tests to drive the current job-exec module.

The attributes.system.exec.test object is used in the tests to throw exceptions during initialization and while the job is running, and the tests ensure that subsequent job events are as expected in the job and exec eventlogs.

Unfortunately, there is not a way to mock an exception in early initialization (since job-exec hasn't read the jobspec yet), and it isn't quite clear what should happen if there is an exception during "cleanup" (running all cleanup tasks, including any epilog). There is a test to ensure that a cancel posted after "finish" but before "release" still releases resources and allows the execution cleanup to finish its work. (This test is known to be racy, though, and may need to be removed.)

The tests nominally cover most of the current functionality; the only missing piece is a test for alternate execution service "takeover", and that this takeover fails when there are jobs still managed by the active exec service. (This will go in the job-manager tests, I assume.) Once that is done, I'd say we don't spend a lot of time covering all corner cases with tests yet, since things may change significantly once we have a "real" execution service.

@grondo force-pushed the grondo:job-exec branch from 42151c5 to 870b206 Mar 21, 2019
@grondo

Contributor Author

commented Mar 21, 2019

Minor: hello is misspelled in the commit message for 264f39a

Haha, you mean it isn't the hellow service?

Force-pushed with this update, some further squashing, and the missing exec-service.lua added to the dist.

I further updated the valgrind test to test a canceled job, plus limited the instance to a single core so that jobs run in order (this can be removed when we have the flux job drain utility), and the test now waits for the clean event instead of free.

I still need to make a pass on the job-exec module and add a comment to the top of the file; meanwhile, I'll hand this off to @garlick to squash the "fixup: job-manager" commits appropriately.

@garlick

Member

commented Mar 21, 2019

There are some hello test failures in travis:

ERROR: t2401-job-exec-hello
===========================
flux-broker: module 'sched-simple' was not cleanly shutdown
flux-start: 0 (pid 436) exited with rc=143
ok 1 - exec hello: load sched-simple module
PASS: t2401-job-exec-hello.t 1 - exec hello: load sched-simple module
ERROR: t2401-job-exec-hello.t - missing test plan
ERROR: t2401-job-exec-hello.t - exited with status 143 (terminated by signal 15?)

It looks like it must be getting stuck in the second test, "exec hello: start exec server-in-a-script".

The job is submitted concurrently with the lua service starting in the background - is there a race there somewhere?

@grondo

Contributor Author

commented Mar 21, 2019

The job is submitted concurrently with the lua service starting in the background - is there a race there somewhere?

Yes, probably. Sorry, I had to get the kids off to school; I'll fix it when I get in.

Oh, the errors you quoted above were because I forgot to dist the exec-server.lua script. That was already fixed. The current error is due to chain-lint. Will quickly push a fix.

@grondo force-pushed the grondo:job-exec branch 2 times, most recently from c3c3102 to a7eb4b1 Mar 21, 2019
@grondo

Contributor Author

commented Mar 21, 2019

The job is submitted concurrently with the lua service starting in the background - is there a race there somewhere?

There is not a race when the first "service" is launched: thanks to the nice design of the job-manager, it will hold the start request until the script completes the exec-hello protocol.

However, there was a race, as you intuited, when the 2nd and 3rd services are started, because the job could be submitted before the expected service had taken over. As discussed, to synchronize, the script now puts a key into the KVS after the exec-hello protocol has completed, and the sharness test waits until the key has appeared before submitting subsequent jobs.

garlick added 2 commits Mar 11, 2019
Implement exec service protocol in start.c (see
documentation of protocol at the top) as discussed
in #2040.  The protocol strongly resembles the
sched protocol, eschewing matchtags and futures,
and requiring jobids to be carried in all responses.

The design allows the exec service to be loaded after
the job-manager, and to be overridden to support
the simulator initial program that provides a simulator-
specific exec service.

The exec service may only be (re-)registered on an idle
system, that is, one with no outstanding start requests.

Wire the exec service events into the job state machine.

Change job-manager job unit test to insert a severity=0
exception in submit+alloc+free test input, since state
machine now requires free to be issued from CLEANUP state.
Generate a 'clean' event once activities in CLEANUP
have concluded.  The 'clean' event then triggers the
transition to INACTIVE state.  It can also serve as an
EOF marker in the job eventlog.

Add an action for the INACTIVE state: removal from the queue.
It is left for a future PR to move INACTIVE jobs from the KVS
job.active -> job.inactive directory.

Update job-manager-dummysched sharness test:  if the job is
not in the queue listing, but has the 'clean' event in its
eventlog, report state as I.
@garlick force-pushed the grondo:job-exec branch from a7eb4b1 to 58a64ca Mar 21, 2019
@garlick

Member

commented Mar 21, 2019

OK, I squashed some incremental job-manager commits, including the ones you added during the rebase from hell, and force pushed.

@codecov-io


commented Mar 21, 2019

Codecov Report

Merging #2077 into master will decrease coverage by 0.07%.
The diff coverage is 75.03%.

@@            Coverage Diff             @@
##           master    #2077      +/-   ##
==========================================
- Coverage   80.37%   80.29%   -0.08%     
==========================================
  Files         192      195       +3     
  Lines       30563    31294     +731     
==========================================
+ Hits        24564    25129     +565     
- Misses       5999     6165     +166
Impacted Files Coverage Δ
src/modules/job-manager/alloc.c 74.48% <ø> (-0.13%) ⬇️
src/bindings/lua/flux-lua.c 87.34% <100%> (-0.05%) ⬇️
src/modules/job-manager/job-manager.c 68.88% <60%> (-1.85%) ⬇️
src/modules/job-manager/start.c 69.02% <69.02%> (ø)
src/modules/job-exec/job-exec.c 73.85% <73.85%> (ø)
src/modules/job-manager/event.c 73.05% <87.5%> (+2.31%) ⬆️
src/modules/job-exec/rset.c 90.32% <90.32%> (ø)
src/common/libflux/message.c 81.39% <0%> (-0.13%) ⬇️
... and 4 more
grondo added 8 commits Mar 9, 2019
Convert any specified timelimit from Slurm timelimit format to
a "duration" string in floating-point with suffix 's' for seconds,
'm' for minutes, 'h' for hours and 'd' for days.
Add sched-simple and job-exec modules to rc1, rc3 to allow
scheduling and dummy execution of submitted jobs.
Allow multiple responses for a single message. There is no
destroy-on-send semantic for Flux anymore.
Now that a scheduler and dummy execution module are loaded
during the valgrind `job` workload, ensure the test waits for
all jobs to run by adding a call to `flux job wait-event` to
the end of the script.

Ensure jobs run in order for now by limiting resources to a
single core.

Add test for cancelation of a job, to ensure no leaks there.

Finally, simplify the script by using `flux jobspec` jobspec
generator tool, instead of hard-coded jobspec string.
Add do-nothing job-exec service that runs through exec system protocol
by sleeping for the duration specified in attributes.system.duration
of the jobspec. If a duration is not specified, the job "finishes"
almost immediately (10us).

An object in jobspec, attributes.system.exec.test, can be used to
set other test parameters on a per-job basis (see documentation
in the code for more information), including forcing the module
to throw an exception during initialization or run of the job.
Add a set of rudimentary functionality tests for the exec system
in simulated execution mode (i.e. running job shells simulated by
a timer). Uses the mock exception field of exec.test system attributes
in the jobspec to fake exceptions in a couple of spots, ensuring
proper operation.
Add a sharness test exercising the job manager's exec-hello service,
ensuring that:

 - a new exec service can register, overriding the existing service.
 - registration of a new service fails if there are jobs running
 - no jobs run until at least one exec service is registered.
Add src/test/sched-bench.sh test script that can be used to
exercise/benchmark scheduling of jobs in a flux instance.
@grondo force-pushed the grondo:job-exec branch from 58a64ca to 50a5075 Mar 21, 2019
@grondo changed the title from "WIP: implement exec protocol in job-manager, add prototype job-exec service" to "implement exec protocol in job-manager, add prototype job-exec service" Mar 21, 2019
@grondo

Contributor Author

commented Mar 21, 2019

Ok, one last force-push from me. I added a comment to the top of job-exec.c to detail its limitations and operation, including a small fix or two.

I also fixed the sched-bench.sh script to wait for the final clean event instead of free.

Removed the "WIP" prefix because I think this is ready for a merge.

@grondo

Contributor Author

commented Mar 21, 2019

Just as a checkpoint, here are the current benchmark results with simulated execution included:

$ src/cmd/flux start -o,--shutdown-grace=15 src/test/sched-bench.sh -n 128 -c 32 
sched-bench.sh: On branch job-exec: v0.11.0-474-g50a5075
sched-bench.sh: starting with 4096 jobs across 128 nodes with 32 cores/node.
sched-bench.sh: broker.pid=31520
sched-bench.sh: ingested 4096 jobs in 16.733s (244.79 job/s)
sched-bench.sh: allocated 4096 jobs in 41.797s (98.00 job/s)
sched-bench.sh: ran 4096 jobs in 41.816s (97.95 job/s)
sched-bench.sh: total walltime for 4096 jobs in 47.633s (85.99 job/s)

and without exec:

$ src/cmd/flux start -o,--shutdown-grace=15 src/test/sched-bench.sh -n 128 -c 32 --noexec
sched-bench.sh: On branch job-exec: v0.11.0-474-g50a5075
sched-bench.sh: starting with 4096 jobs across 128 nodes with 32 cores/node.
sched-bench.sh: broker.pid=313
sched-bench.sh: ingested 4096 jobs in 16.172s (253.28 job/s)
sched-bench.sh: allocated 4096 jobs in 18.285s (224.01 job/s)
sched-bench.sh: total walltime for 4096 jobs in 18.686s (219.20 job/s)
@codecov-io


commented Mar 21, 2019

Codecov Report

Merging #2077 into master will decrease coverage by 0.09%.
The diff coverage is 75%.

@@            Coverage Diff            @@
##           master    #2077     +/-   ##
=========================================
- Coverage   80.37%   80.27%   -0.1%     
=========================================
  Files         192      195      +3     
  Lines       30563    31297    +734     
=========================================
+ Hits        24564    25125    +561     
- Misses       5999     6172    +173
Impacted Files Coverage Δ
src/modules/job-manager/alloc.c 74.48% <ø> (-0.13%) ⬇️
src/bindings/lua/flux-lua.c 87.34% <100%> (-0.05%) ⬇️
src/modules/job-manager/job-manager.c 68.88% <60%> (-1.85%) ⬇️
src/modules/job-manager/start.c 69.02% <69.02%> (ø)
src/modules/job-exec/job-exec.c 73.81% <73.81%> (ø)
src/modules/job-manager/event.c 73.05% <87.5%> (+2.31%) ⬆️
src/modules/job-exec/rset.c 90.32% <90.32%> (ø)
src/common/libflux/message.c 81.15% <0%> (-0.37%) ⬇️
... and 4 more
@garlick

Member

commented Mar 21, 2019

Whee! Nice to get this done!

@garlick merged commit 11d31c1 into flux-framework:master Mar 21, 2019
3 of 4 checks passed
codecov/patch 75% of diff hit (target 80.37%)
Mergify — Summary 1 potential rule
codecov/project 80.27% (-0.1%) compared to 909d9cc
continuous-integration/travis-ci/pr The Travis CI build passed
@grondo deleted the grondo:job-exec branch Mar 22, 2019