Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

experimental: add -o nokz option for wreck jobs #1875

Merged
merged 27 commits into from Dec 23, 2018

Conversation

Projects
None yet
4 participants
@grondo
Copy link
Contributor

grondo commented Dec 17, 2018

This PR adds an option -o nokz for wreck jobs which results in the use of direct messages for stdout and stderr instead of using kz streams to store the output in the kvs.

When the "nokz" option is enabled, the io multiplexer will register a service for either stdout or stderr, or both, instead of watching kz streams for each task, and wrexecd will forward message-based i/o to this service, addressed to the service name and rank encoded in a new ioservice kvs key associated with the job.

If --output or --error is also supplied then the corresponding services will be registered by the first wrexecd for the job (i.e. the same output.lua plugin that handles output files now, but instead of watching kz streams, the plugin will handle incoming messages)

Unfortunately, this isn't yet a complete solution, as nokz option for jobs submitted with flux submit or with wreckrun using the -d option will end up dropping all output. For now, flux submit -o nokz is an error without also supplying an output file. We could also make -o nokz for "batch" jobs imply -o job-{{id}}.out or similar. We could also make nokz not applicable to flux submit jobs (ignore the option) and instead default them to stdio-delay-commit.

I also did not make nokz the default yet until above problems are solved. The option can be set globally for an instance with flux wreck setopt nokz. To enable it "by default" then we could add this command to the default rc1.

It may be that the current approach needs a rethink, but I'm not sure what we should do/have time for before the next release. I'm able to work a little bit, but slow typing for now...

@codecov-io

This comment has been minimized.

Copy link

codecov-io commented Dec 17, 2018

Codecov Report

Merging #1875 into master will decrease coverage by <.01%.
The diff coverage is 90.9%.

@@            Coverage Diff             @@
##           master    #1875      +/-   ##
==========================================
- Coverage   80.13%   80.12%   -0.01%     
==========================================
  Files         196      196              
  Lines       35002    35060      +58     
==========================================
+ Hits        28048    28091      +43     
- Misses       6954     6969      +15
Impacted Files Coverage Δ
src/bindings/lua/flux-lua.c 82.32% <100%> (+0.03%) ⬆️
src/modules/wreck/wrexecd.c 76.74% <90.14%> (+0.63%) ⬆️
src/common/libflux/handle.c 83.99% <0%> (-2.09%) ⬇️
src/broker/modservice.c 78.84% <0%> (-0.97%) ⬇️
src/modules/kvs/kvs.c 66.56% <0%> (-0.15%) ⬇️
src/common/libflux/message.c 81.64% <0%> (+0.24%) ⬆️
@garlick

This comment has been minimized.

Copy link
Member

garlick commented Dec 18, 2018

I'll poke at this today.

@trws, if you have a moment, could you comment on the caveats @grondo mentions above, and give us some thoughts on what you would like us to steer towards for the release?

@trws

This comment has been minimized.

Copy link
Member

trws commented Dec 18, 2018

Thanks a ton for working on this @grondo, I have some questions but this sounds like exactly what I was hoping for. I hope you’re recovering well by the way, feel better!

If I’m understanding correctly:

  1. Interactive jobs with -o nokz work fine
  2. Jobs submitted with submit or wreckrun -d need somewhere to dump their output (makes sense) so either a file needs to be specified or the output is dropped
  3. -o nokz could reasonably replace the kz-based -o option for file output given a file

If that’s all right, and I’ll do some quick tests to see how things behave with the target workload shortly, I’d request that we make nokz the default, and for a sync jobs output to something like ‘flux-.out/err’ unless the user specifies a file or provides an option to use KZ. That would give us behavior much like slurm provides IIRC. It may also be worth considering an option that uses nokz to gather output, then commits the full output into a kz stream at the end of the job for users that want to be able to use attach to inspect what happened after the fact, but what’s already here sounds like it should completely solve the splash issues. Great stuff!

@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented Dec 18, 2018

@trws, yes, all correct except:

-o nokz could reasonably replace the kz-based -o option for file output given a file

In the current PR, -o nokz -O outfile works and bypasses kz. (Now that I reread your statement, though, perhaps that is what you meant.

My other worry, of course, is that since this is a big change and will become the default, there may be some race conditions or corner cases that aren't handled.

For example, I can't remember what the output plugin does for output files that already exist. With Slurm there is one global set of monotonically increasing jobids, so slurm-$id.out works well. For Flux, each instance restarts at 1, so there will potentially be a lot of overwritten output files, or errors...

I can look into these things once we figure out what behavior should be.

@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented Dec 18, 2018

BTW, another approach for "submit" jobs and nokz would be to have the output plugin hold output in memory and then dump the output file as one key into the kvs, instead of by default writing to a file. This probably would not be too much work, though a bit of a kludge.

@trws

This comment has been minimized.

Copy link
Member

trws commented Dec 18, 2018

@trws

This comment has been minimized.

Copy link
Member

trws commented Dec 18, 2018

Wow, the performance difference is just crazy.

/dev/null interactive:

~/flux/flux-core wreck-output-nokz* scogland@gent 10s
� time ./src/cmd/flux start flux wreckrun  seq 1 100000 > /dev/null
sh: ldconfig: command not found
./src/cmd/flux start flux wreckrun seq 1 100000 > /dev/null  306.94s user 11.60s system 149% cpu 3:32.75 total

~/flux/flux-core wreck-output-nokz* scogland@gent 3m 33s
� time ./src/cmd/flux start flux wreckrun -o nokz seq 1 100000 > /dev/null
sh: ldconfig: command not found
./src/cmd/flux start flux wreckrun -o nokz seq 1 100000 > /dev/null  0.81s user 0.39s system 71% cpu 1.669 total

Dropping that by half, using kz uses 12m of database, using nokz leaves it as the default 40k SQLite base database. It’s so incredibly improved I’m kinda floored, it’s even visible to the eye how much faster it is watching attached output.

One thing I noticed, that I don’t think is a problem but is a difference, is that using -O/-E with nokz outputs only to the files, but without nokz it also outputs to the terminal. I was actually surprised by the behavior without the nokz option, but either way.

@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented Dec 18, 2018

One thing I noticed, that I don’t think is a problem but is a difference, is that using -O/-E with nokz outputs only to the files, but without nokz it also outputs to the terminal. I was actually surprised by the behavior without the nokz option, but either way.

Yeah, that is due to the initial design of "wreck" job prototype, for which all job "state" is managed through the kvs, and wreckrun is only passively involved (after the initial job launch). Similarly, the "output" plugin is just passively monitoring output and copying it to where the user requested.

The message based output is a big change from the initial design, but given the way most people run their jobs, hopefully the difference in design vs new implementation won't cause any onerous issues (and given that wreck is going to be replaced soon, we can probably live with any caveats)

@grondo grondo force-pushed the grondo:wreck-output-nokz branch 2 times, most recently from badf1b9 to 63a1483 Dec 19, 2018

@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented Dec 19, 2018

Ok, I've made a small, somewhat hacky, change here to allow -O, --output and -E, --error output file templates to be prefixed with kvs:// to indicate a kvs key as the output "file". The current lwj.x.y.z directory for the job is always prefixed to the key following kvs://.

For a kvs output file, the "file" is only written to the kvs after it is closed, so there is only a single kvs_put/commit for each kvs output file. Thus, all data is held in memory until the output is closed.

flux-submit then defaults to -O kvs://output-file when -o nokz is used with -O, --output, so that data is no longer dropped (in theory)

No tests have been added yet.

@trws

This comment has been minimized.

Copy link
Member

trws commented Dec 20, 2018

That sounds really good @grondo. I like the solution, should make it pretty easy to get after-the-fact attach working for these which would be cool. Thanks for putting this together so fast!

By the way, I mentioned this to @garlick in person, but found another cool feature of this while doing some testing yesterday. It turns out it pretty significantly improves the scalability of launching multiple tasks per broker (not completely sure why, some serialization in watcher setup in the kz version maybe?):

Some numbers for 5 brokers, all in seconds to run that many instances of /bin/true, so this is with no output (well, "\nEOF" I suppose):

ranks/broker regular nokz
40 2.81 0.91
80 8.28 1.79
160 30.63 3.84

@grondo grondo force-pushed the grondo:wreck-output-nokz branch from 63a1483 to 65892a8 Dec 21, 2018

@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented Dec 21, 2018

Ok, I added some sanity checks for --output/--error=kvs://foo and also updated the flux-wreckrun and flux-submit manpages.

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Dec 21, 2018

Thanks @grondo! What's left to do here prior to merging?

@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented Dec 21, 2018

@grondo grondo force-pushed the grondo:wreck-output-nokz branch 2 times, most recently from f4ee718 to 40f7c75 Dec 21, 2018

@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented Dec 21, 2018

Just rebased on current master, and made a change so that flux-submit writes stdout/err to files.stdout and files.stderr in the kvs instead of a single output-file by default for nokz jobs.

Also made a small change to flux wreck attach so that it dumps stdout and stderr output from nokz jobs if they are found in the default files.stdout and files.stderr keys.

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Dec 21, 2018

I have this branch (including your most recent push) and flux-sched master installed on a test system. When I start an instance (sched loaded) and run:

$ flux submit -onokz hostname
submit: Submitted jobid 1
$ flux wreck attach 1
$

I don't see any output. KVS contains:

$ flux kvs dir -R
lwj.0.0.1.0.stdin -> lwj.0.0.1.input.files.stdin
lwj.0.0.1.R_lite = [{"children": {"core": "0"}, "node": "jimbo", "rank": 0}]
lwj.0.0.1.cmdline = ["hostname"]
lwj.0.0.1.complete-timelux wreck attach 1 = 1545419358.548033
lwj.0.0.1.completing-time = 1545419358.545980
lwj.0.0.1.create-time = 1545419358.5No output files02223
lwj.0.0.1.cwd = /home/garlick/grondo_tmp
lwj.0.0.1.environ = {"CCACHE_CPP2": "1", "DBUS_SESSION_BUS_ADDRESS": "unix:pa...
lwj.0.0.1.exit_status = {"count": 1, "entries": {"0": 0}, "max": 0, "min": 0,...
lwj.0.0.1.input.config = [{"dst": "*", "src": "stdin"}]
lwj.0.0.1.ioservice = {"stderr": {"name": "5cb4007c27e06fa3:err", "rank": 0},...
lwj.0.0.1.log.0.
lwj.0.0.1.ncores = 1
lwj.0.0.1.ngpus = 0
lwj.0.0.1.nnodes = 0
lwj.0.0.1.ntasks = 1
lwj.0.0.1.options = {"nokz": 1}
lwj.0.0.1.opts.tasks-per-node = 1
lwj.0.0.1.pmi.PMI_process_mapping = (vector,(0,1,1))
lwj.0.0.1.running-time = 1545419358.540197
lwj.0.0.1.starting-time = 1545419358.521515
lwj.0.0.1.state = complete
lwj.0.0.1.walltime = 0
lwj-complete.4.1 -> lwj.0.0.1
resource.hosts = jimbo
resource.hwloc.by_rank.0.Core = 12
resource.hwloc.by_rank.0.HostName = jimbo
resource.hwloc.by_rank.0.Machine = 1
resource.hwloc.by_rank.0.PU = 12
resource.hwloc.by_rank.0.Package = 2
resource.hwloc.loaded.0 = 1
resource.hwloc.xml.0 = <?xml version="1.0" encoding="UTF-8"?>...
@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented Dec 21, 2018

@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented Dec 21, 2018

If you do have the latest, I'll take a look at lunch. Sorry!

Oh, oops, I know what is wrong. I'll push a fix and hopefully somehow add a test.

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Dec 21, 2018

[Edit: I tried again and submit didn't produce the output. Now I'm thinking maybe I omitted -o nokz that time. Sorry!]

So I just rebuilt the same code (I think - it's v0.10.0-783-g40f7c756e) and submit produced:

lwj.0.0.2.0.stderr.000000 = {"data": "", "eof": true}
lwj.0.0.2.0.stdin -> lwj.0.0.2.input.files.stdin
lwj.0.0.2.0.stdout.000000 = {"data": "amltYm8K", "eof": false}
lwj.0.0.2.0.stdout.000001 = {"data": "", "eof": true}

and flux wreck attach worked.

The same run of hostname with flux wreckrun did not put output in the KVS.

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Dec 21, 2018

@grondo just FYI I think our tag is likely going to be in the latter part of next week, so please don't feel obliged to fix this now.

grondo added some commits Dec 3, 2018

flux-wreckrun: abstract stdio attach into function
Move code to open output from all tasks and start stdin writer
to a new function in wreckrun: wreck_attach_stdio() for slightly
better code clarity.
wreck: io.lua: abstract output handler function
Move the output handler function to a locally defined function
instead of an anonymous function defined within the msghandler call.
This enhances readability and reusability.
wreck: clean up init of kz streams in wrexecd
Clean up initialization of kz streams for wreck jobs in wrexeced.

Rename io_cb() to io_cb_kz() to indicate this is a kz-only method.

Separate the initialization of input and output kz streams
instead of bundling them together, and set IO callbacks for zio
output streams only after destination kz streams have been
"openend" for clearer program flow.
wreck: wrexecd: abstract integer kvs lookup
Move kvs lookup of integer into a function for future reuse.
wreck: wrexecd: save rcalc_t after parsing R_lite
Save the generated rcalc_t object after parsing R_lite from kvs
for future use during wrexecd operation, instead of using the
object only temporarily in prog_ctx_read_R_lite().
bindings/lua: export FLUX_NODEID_* constants to Lua
Add FLUX_NODEID_ANY and FLUX_NODEID_UPSTREAM to constants exported
via the Lua "flux" module, available as flux.NODEID_ANY and
flux.NODEID_UPSTREAM.

grondo added some commits Dec 6, 2018

wreck: wreck.lua: Add support for nokz job option
Add support for `-o nokz` option for wreck jobs, which will use
message based service for stdout and stderr instead of kz streams
in the KVS.

The service name is randomly generated, so that the service may be
started before a jobid is allocated.
wreckrun: pass nokz option to wreck.ioattach
Pass the "nokz" option and generated `wreck.ioservices` table to
wreck.ioattach() so that message-based stdout/err are utilized
when requested.
wreck: support "nokz" option in output.lua plugin
Support the nokz option in the wrexecd output.lua plugin. This
allows file output to use the message based service instead of
funnelling through kz streams in the KVS.
flux-submit: store output for nokz in kvs by default
Problem: jobs submitted with -o nokz and no output redirection by
default (i.e. --output) will drop all output.

If a job is submitted with flux-submit using -o nokz, but no -O
option is provided, default to `-O kvs://files.stdout` and
`-E kvs://files.stderr` to dump the final output from all tasks
into kvs key at the end of the job.
wreck: io.lua: allow output redirect to kvs
Allow output redirection to kvs key instead of output file or stream.

A kvs destination may be specified as an output destination by prefixing
a path with "kvs://" followed by the destination key. The kvs_put will
occur only after all data has been written and the desination stream
"closed". (This means all data held in memory until the output "file"
is closed)
wreck: wrexecd: prepend kvs:// output keys with lwj dir
Ensure kvs keys used as output for jobs are relative to the current
lwj directory in the kvs.
doc: use include files for wreckrun/submit options
Move the common options to flux-wreckrun and flux-submit into
adoc files included by both source manpages instead of trying to
keep them both in sync.
doc: document 'nokz' option for submit and wreckrun
Add a short entry for the new 'nokz' option to flux-submit and
flux-wreckrun.
t2000-wreck-nokz: add test for kvs://key output/error
Add tests that sanity check use of --output and --error with
kvs:// prefix to push output into kvs at end of job.
test: doc: update dictionary
Add "kz" to spelling dictionary for docs.
flux-wreck: support nokz jobs with attach
Problem: jobs using the 'nokz' option will cause `flux wreck attach`
to hang since attach is trying to read kz streams that will never be
created in the kvs.

Check for the `nokz` option during attach, and skip the taskio attach
if this option is true, instead attempting to dump kvs keys `files.stdout`
and `files.stderr` if they exist (this is the default used by flux-submit)
wreck: wreck.lua: abstract output table generation
Abstract the generation of the output file table from the --output
and --error so it can be used from outside of parse_cmdline()
@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented Dec 21, 2018

Yeah, I made a dumb mistake and already have a "fix" (all of this is a bit kludgy).

I'll test with flux-sched first though...

@grondo grondo force-pushed the grondo:wreck-output-nokz branch from 40f7c75 to 4018441 Dec 21, 2018

@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented Dec 21, 2018

Ok, fixed flux-submit with -o nokz. Attach will now nominally work in this scenario:

grondo@peep:/tmp/f.git$ flux submit -o nokz -n4 hostname 
submit: Submitted jobid 2
grondo@peep:/tmp/f.git$ flux wreck attach 2
peep
peep
peep
peep
grondo@peep:/tmp/f.git$ flux kvs dir -R $(flux wreck last-jobid -p)
lwj.0.0.2.0.stdin -> lwj.0.0.2.input.files.stdin
lwj.0.0.2.1.stdin -> lwj.0.0.2.input.files.stdin
lwj.0.0.2.2.stdin -> lwj.0.0.2.input.files.stdin
lwj.0.0.2.3.stdin -> lwj.0.0.2.input.files.stdin
lwj.0.0.2.R_lite = [{"children": {"core": "0-3"}, "node": "peep", "rank": 0}]
lwj.0.0.2.cmdline = ["hostname"]
lwj.0.0.2.complete-time = 1545432170.124154
lwj.0.0.2.completing-time = 1545432170.121751
lwj.0.0.2.create-time = 1545432170.028134
lwj.0.0.2.cwd = /tmp/f.git
lwj.0.0.2.environ = {"BIN_SH": "xpg4", "DBUS_SESSION_BUS_ADDRESS": "unix:path...
lwj.0.0.2.exit_status = {"count": 4, "entries": {"[0-3]": 0}, "max": 0, "min"...
lwj.0.0.2.files.stderr = 
lwj.0.0.2.files.stdout = peep...
lwj.0.0.2.input.config = [{"dst": "*", "src": "stdin"}]
lwj.0.0.2.ioservice = {"stderr": {"name": "6f38200793fcd8fa:err", "rank": 429...
lwj.0.0.2.log.0.
lwj.0.0.2.ncores = 4
lwj.0.0.2.ngpus = 0
lwj.0.0.2.nnodes = 0
lwj.0.0.2.ntasks = 4
lwj.0.0.2.options = {"nokz": 1}
lwj.0.0.2.opts.ntasks = 4
lwj.0.0.2.opts.tasks-per-node = 1
lwj.0.0.2.output = {"files": {"stderr": "kvs://files.stderr", "stdout": "kvs:...
lwj.0.0.2.pmi.PMI_process_mapping = (vector,(0,1,4))
lwj.0.0.2.running-time = 1545432170.111131
lwj.0.0.2.starting-time = 1545432170.080101
lwj.0.0.2.state = complete
lwj.0.0.2.walltime = 0
@garlick

This comment has been minimized.

Copy link
Member

garlick commented Dec 21, 2018

LMK when you feel OK merging. Maybe we can get this in now with the understanding that we will revisit before the release for discussion about the default I/O mode? Then at least you won't need to keep rebasing this as we merge the remaining PR's.

@trws

This comment has been minimized.

Copy link
Member

trws commented Dec 21, 2018

How does it wireup in the attach case? Is it pulling the result at the end, and always get the full output of the job, or something like tail -f if the job isn't done yet or..?

@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented Dec 21, 2018

How does it wireup in the attach case? Is it pulling the result at the end, and always get the full output of the job, or something like tail -f if the job isn't done yet or..?

You can't have multiple readers in the nokz case, so it doesn't wireup per se. It just pulls the result at the end (which I had thought was the main use case). I didn't test nokz/attach for a running job, it might not work as expected...

@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented Dec 21, 2018

Yeah, attaching to running 'nokz' just exits immediately with no output since the files.{stderr,stdout} keys do not exist. To fix I'll have to augment the attach function to wait on the state key. I can work on that a bit later.

@trws

This comment has been minimized.

Copy link
Member

trws commented Dec 21, 2018

I wouldn't let it hold up this PR. We'll want some way to watch intermediate values at some point, but for our current users the majority of use by far is after the fact, so that sounds great. Thanks again @grondo!

flux-wreck: better handling for attach with nokz jobs
Wait for jobs with "nokz" option set to complete before attempting to
dump files.stderr and files.stdout during `flux wreck attach`.
@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented Dec 22, 2018

I pushed a fix for flux wreck attach which delays exit for nokz jobs until the job state is complete, then dumps files.stderr and files.stdout. Still not perfect, but better than completely broken...

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Dec 23, 2018

Just sanity checked this with sched and seems to work as advertised. Nice!

Any further work pending in this PR @grondo, or is it OK to merge now?

@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented Dec 23, 2018

Yeah, we can merge as is and figure out default behavior later

@garlick garlick merged commit aa972be into flux-framework:master Dec 23, 2018

3 checks passed

codecov/patch 90.9% of diff hit (target 80.13%)
Details
codecov/project Absolute coverage decreased by -<.01% but relative coverage increased by +10.77% compared to 56825e3
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented Dec 24, 2018

Thanks!

@grondo grondo deleted the grondo:wreck-output-nokz branch Feb 8, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.