Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Collection of small fixes for splash project #1450

Merged
merged 31 commits into from Apr 13, 2018

Conversation

Projects
None yet
5 participants
@grondo
Copy link
Contributor

grondo commented Apr 12, 2018

Here's a collection of small fixes related to the splash use case for consideration.

This PR includes:

  • General reduction in logging including removal of unnecessary log messages, reduction of some "info" level log messaages to debug, and fix for #1439 to log only from rank 0 instead of all ranks
  • Fix for crash in flux-wreck status when exit_status is not in kvs yet (#1437)
  • Increase NOFILE_LIMIT in wrexecd (#1441)
  • Error checking during task setup in wrexecd (#1441)
  • Add line buffering to output files with flux submit/wreckrun -O file
  • Check for errors from kz stream in output.lua and log errors (#1427)
  • Terminate io "flush" loop early on encountering any error from f:reactor ("once")
  • New command flux-wreck dumplog ID to dump the "logstream" error log for a job
  • Add new KZ_FLAGS_NOFOLLOW flag as suggested by @garlick (#1420)
  • Add -n, --no-follow option to flux-wreck attach to use KZ_FLAGS_NOFOLLOW (#1420)

Might need to add more testing (e.g. of flux wreck attach -n) but since there is lots of little things in here I wanted to get an early review.

@coveralls

This comment has been minimized.

Copy link

coveralls commented Apr 12, 2018

Coverage Status

Coverage decreased (-0.06%) to 79.045% when pulling 333ac81 on grondo:splash-fixes into 37499f8 on flux-framework:master.

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Apr 12, 2018

Restarted one builder that had stopped due to #1077

scripts/waitfile works after 1s

expecting success: 
	flux start ${ARGS} -s2 --bootstrap=selfpmi bash -c "nohup flux event sub hb &"

lt-flux-broker: module 'job' was not cleanly shutdown
lt-flux-broker: module 'barrier' was not cleanly shutdown
lt-flux-broker: module 'resource-hwloc' was not cleanly shutdown
lt-flux-broker: module 'aggregator' was not cleanly shutdown
lt-flux-broker: module 'kvs' was not cleanly shutdown
flux-start: 0 (pid 11626) Killed
flux-event: flux_event_subscribe: Success
not ok 53 - instance can stop cleanly with subscribers (#1025)
@garlick
Copy link
Member

garlick left a comment

This is really excellent cleanup! It all looks good to me.

@@ -393,6 +393,15 @@ int kz_close (kz_t *kz)
return -1;

This comment has been minimized.

@garlick

garlick Apr 12, 2018

Member

Commit message was truncated - add "watcher." ?

io.stderr:write (r.."\n")
end
}
local iow

This comment has been minimized.

@garlick

garlick Apr 12, 2018

Member

The problem with using the reactor is that log streams aren't properly
terminated with EOF at job end, so reactor based read of the kz
files doesn't know when it is done...

This is fine, but doesn't KZ_FLAGS_NOFOLLOW allow this to work with the reactor?

This comment has been minimized.

@grondo

grondo Apr 12, 2018

Author Contributor

Yeah, this commit was done before the KZ_FLAGS_NOFOLLOW work, so I should probably take a second look. I'd like to try and see if the reactor would indeed exit normally with a set of kz streams open with KZ_FLAGS_NOFOLLOW. The way it is done now might be more predictable, but at the very least the commit message is misleading, sorry.

@grondo grondo force-pushed the grondo:splash-fixes branch 3 times, most recently from 6d2a7c8 to 334eac0 Apr 12, 2018

@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented Apr 12, 2018

Ok, I switched the logstream "oneshot" to use KZ_FLAGS_NOFOLLOW instead of the one-off use of kz_open() for that case.

I also added @trws's fix for -t, --tasks-per-node along with a new test for that functionality.

Finally, a fix for #1453, along with a test.

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Apr 12, 2018

Nice - will merge once travis finishes.

grondo added some commits Apr 9, 2018

wreck: reduce logging from job module
Reduce the amount of redundant logging from the job module. For errors
that will be the same for every rank (e.g. missing or unparseable R_lite),
log only from rank 0. Change the log level for "no rank.N dir for this
rank" to debug, since it is an expected condition for a job that does
not target rank N.

This should greatly reduce amount of chatter from the job module to the
flux log.

Fixes #1439
aggregator: demote log messages to LOG_DEBUG
The aggregator module is fairly chatty at LOG_INFO level with info
that is only useful for debugging. Demote these message to LOG_DEBUG.
wreck: fix crash in flux-wreck ls when exit_status missing
Ensure `exit_status` key for complete job is valid before trying to
use it to form the exit state string for `flux-wreck ls`.

Fixes #1437
wreck: wrexecd: check for errors in task setup
Problem: wrexecd does not check for errors from zio creation
functions, and when these functions fail the daemon terminates
with segv or other fatal error without logging the error.

Add proper error checks to all zio creation functions.
wreck: increase RLIMIT_NOFILE to max in wrexecd
Bump the max number of open file descriptors to the hard limit in
wrexecd. This avoids runtime errors when running many tasks per
rank with low file descriptor limit.

Of course, there is no way to completely avoid running out of fds
since at least 2 pipes need to be opened for every task that will be
launched. However, in the time frame where we are still using the
wreck prototype, this change should avoid many preventable errors.

Fixes #1441
wreck: wrexecd: don't log every time PMI FD is closed
Problem: wrexecd issues the somewhat useless message about
"client close PMI_FD" for every task in a job, and even though it is
at LOG_DEBUG level, this message clutters the logfile.

Stop printing a debug message for this event.
kz: remove debug message
Remove per-kz-file debug message in lookup_next

 lookup_next: watch lwj.x.y.z.0.stdout...

for every file in a job. Reduces the size of debug logs. If debug
is needed in the future, a debug flag could be added to libkz.
wreck/io.lua: set line buffering on output files
It is expected that output files are written line-buffered, but
the default for non-tty files in stdio is fully buffered. Since this
might confuse users (and developers!), automatically set output file
to line bufferend in the ioplex code.
wreck/io.lua: capture and log errors in iowatcher callback
Capture the error string parameter if passed to the iowatcher callback
in the wreck/io.lua implementation. If an error is passed to the handler,
then log it and disable that iowatcher (i.e. consider it closed).

The error is logged to stderr unless a new "log_err" function is passed
through the constructor, in which case errors are logged via this
function.

Fixes #1427
wreck: pass error logging function in output.lua
When creating the io multiplexer for output handling for
wreck jobs, pass in an error logging function so that any
io errors are logged to the job error log.
wreck: terminate io flush loop early on error
Check for errors from f:reactor("once") during rexecd_exit()
handler for the output plugin, and exit the loop on any error
returned via the reactor.

This should help avoid infinite looping here.
wreck/lua: logstream: fix bad error message when job doesn't exist
Fix an invalid error message when a job id doesn't exist in the kvs.
Existence of arg.nnodes was checked before the error message from
grabbing info about the job from the kvs, so the error was incorrect
in that case.
wreck/lua: unify logstream line format
Add a single function logstream_print_line() to handle printing
lines in the job log stream, instead of using a different format
for normal output vs log:dump().

@grondo grondo force-pushed the grondo:splash-fixes branch from a4c1c93 to 7138b4a Apr 12, 2018

@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented Apr 12, 2018

Broke the pmi tests with the -t, --tasks-per-node fix. Now that wreck.tasks_per_node was always initialized that broke the "fake" R_lite generation when wreckrun is run without a scheduler (however, only with specific combinations of arguments).

Added a fix and rebased on current master, we'll see if this passes Travis.

@trws

This comment has been minimized.

Copy link
Member

trws commented Apr 12, 2018

@codecov-io

This comment has been minimized.

Copy link

codecov-io commented Apr 12, 2018

Codecov Report

Merging #1450 into master will decrease coverage by 0.05%.
The diff coverage is 80.32%.

@@            Coverage Diff             @@
##           master    #1450      +/-   ##
==========================================
- Coverage    78.8%   78.74%   -0.06%     
==========================================
  Files         163      163              
  Lines       30220    30256      +36     
==========================================
+ Hits        23814    23825      +11     
- Misses       6406     6431      +25
Impacted Files Coverage Δ
src/modules/aggregator/aggregator.c 79.23% <100%> (ø) ⬆️
src/common/libkz/kz.c 83.26% <100%> (+0.71%) ⬆️
src/modules/wreck/wrexecd.c 75.33% <56.25%> (-0.38%) ⬇️
src/modules/wreck/job.c 72.47% <66.66%> (ø) ⬆️
src/bindings/lua/flux-lua.c 81.58% <82.35%> (+0.51%) ⬆️
src/modules/connector-local/local.c 72.95% <0%> (-2.87%) ⬇️
src/broker/module.c 83.79% <0%> (-1.4%) ⬇️
src/broker/content-cache.c 73.43% <0%> (-1.3%) ⬇️
src/common/libflux/rpc.c 93.38% <0%> (-0.83%) ⬇️
src/common/libflux/handle.c 83.66% <0%> (-0.25%) ⬇️
... and 4 more
@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented Apr 12, 2018

Oops forgot to add a test for the KZ_FLAGS_NOFOLLOW usage in attach/dumplog

grondo added some commits Apr 11, 2018

flux-wreck: add "dumplog" subcommand
Add a flux-wreck dumplog command to dump the job error log from
a given jobid. Uses the oneshot logstream mode to dump whatever is
present and exit.
flux-wreck: also open logstream in flux-wreck attach
In flux-wreck attach, also open and read data from the job error
log stream, which otherwise might go ignored.
kz: add KZ_FLAGS_NOFOLLOW
Add a flag KZ_FLAGS_NOFOLLOW which will cause a kz reader to return
EOF once all known blocks have been read. This can be used to read
all existing blocks for a kz stream without need to install a KVS
watcher.
wreck/lua: add a 'oneshot' flag to logstream
If arg.oneshot is set on logstream then pass KZ_FLAGS_NOFOLLOW
to the underlying kz objects so that logstream kz objects never
sleep in the reactor.

grondo and others added some commits Apr 12, 2018

bindings/lua: accept kz_flags on iowatcher
For kz based iowatcher, accept an optional field "kz_flags", which is
a table of named flags to pass to the resulting kz_open(). For now
only "nofollow" is supported.
wreck: io.lua: support nofollow argument for ioplex
Add support for a "nofollow" argument to ioplex.create, which will
pass the "nofollow" flag to the created iowatcher, resulting in
KZ_FLAGS_NOFOLLOW being set on the kz reader.
wreck: add -n, --no-follow option to flux-wreck attach
Add -n, --no-follow option to flux-wreck attach to set the "nofollow"
option on all kz io streams. This will result in attach exiting after
all existing output blocks have been read, so that attach will not
block, even if some or all kz streams do not yet have EOF.

Fixes #1420
kz: support KZ_FLAGS_NOFOLLOW in blocking mode
KZ_FLAGS_NOFOLLOW wasn't active in blocking mode, since lookup_next
is never called when user calls kz_get() directly. However, NOFOLLOW
can still be supported by checking for EAGAIN from getnext(), so add
support for this uncommon case.
t/kz/kzcopy: add support for KZ_FLAGS_NOFOLLOW
Add support in the kzcopy kz test utility for reading kz streams
with the KZ_FLAGS_NOFOLLOW flag. Additionally, when writing a kz
stream and --no-follow flag is used, avoid writing an EOF to the
stream for testing purposes.
t/t1104-kz: Add tests for KZ_FLAGS_NOFOLLOW
Add tests for KZ_FLAGS_NOFOLLOW in both blocking and non-blocking
read mode.
t/t2000-wreck: change 4 space indent to tabs
In the flux-wreck include,exclude tests, switch 4 space indent
to tabs to stop the shell complaining about no EOF in here document.
t/t2000-wreck.t: fix chain-lint errors in tests
Add '&&' to suppress errors from chain-lint in the include,exclude
tests in t2000-wreck.t.
wreck: fix -t, --tasks-per-node option to submit/wreckrun
While fixing something else, support for the `-t, --tasks-per-node` option
was apparently broken.

Fix.
t/t2000-wreck.t: test --tasks-per-node=2
Add a check for --tasks-per-node=2 to *actually* check that this
options is properly propagated by wreckrun (and therefore submit)
wreck: ignore missing jobs in flux-wreck ls
Do not treat missing job entries as fatal error in flux-wreck ls.

Fixes #1453
flux-wreckrun: fix crash with -N
After fix for -t, --tasks-per-node, flux-wreckrun without scheduler
started crashing when creating the fake_resource_array because
tasks_per_node was *always* initialized to 1.

Since wreck.ntasks is initialized by tasks-per-node already, just
set ppn to `wreck.ntasks/nnodes` instead of using tasks-per-node
when set.
t/t2000-wreck: test for flux-wreck ls RANGE with missing jobs
Add a test for `flux-wreck ls RANGE` where at least one job
in RANGE is missing from the kvs.
t/t2000-wreck: add tests for dumplog and attach -n
Add tests for `flux wreck dumplog` and `flux-wreck attach --no-follow`

@grondo grondo force-pushed the grondo:splash-fixes branch from 7138b4a to 333ac81 Apr 12, 2018

@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented Apr 12, 2018

Ok, sorry. Added tests for flux wreck dumplog and flux wreck attach --no-follow, and fixed a small issue in the logstream Lua code.

@garlick garlick merged commit 6293cac into flux-framework:master Apr 13, 2018

4 checks passed

codecov/patch 80.32% of diff hit (target 78.8%)
Details
codecov/project Absolute coverage decreased by -0.05% but relative coverage increased by +1.52% compared to 37499f8
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
coverage/coveralls Coverage decreased (-0.06%) to 79.045%
Details
@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented Apr 13, 2018

Thanks!

@grondo grondo deleted the grondo:splash-fixes branch Apr 26, 2018

@grondo grondo referenced this pull request May 10, 2018

Closed

0.9.0 Release #1479

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.