Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

flux-shell: add output handling #2246

Merged
merged 18 commits into from Jul 22, 2019

Conversation

@garlick
Copy link
Member

commented Jul 19, 2019

This PR adds the most basic output handling to the shell as proposed in #2190.

Task I/O is sent to a service on the leader shell (shell rank 0). Each line of I/O is represented as a JSON object. The lines are accumulated in a JSON array which is committed to the job's kvs guest namespace upon completion.
.
Flux attach then outputs the lines on its stdout and stderr.

Thus

$ flux srun -N4 -n8 printenv FLUX_TASK_RANK
0
1
2
4
5
3
6
7

or

flux job attach --label $(flux jobspec srun -n4 hostname | flux job submit)
2: morbo
1: morbo
0: morbo
3: morbo

A reusable I/O forwarding service library has been proposed in #2208. This PRwas worked on in parallel with that, so I will look at converting the quick and dirty "service" in here over to that (this one, for example, doesn't base64-encode output data, and doesn't provide a way to read the data back remotely).

@garlick garlick force-pushed the garlick:shell_stdio2 branch from 211b3b7 to 8509eeb Jul 19, 2019
@garlick

This comment has been minimized.

Copy link
Member Author

commented Jul 19, 2019

Hmm, I did have one spurious failure in t2602-job-shell.t where it looks like the task got a SIGABRT (status 134 == 128 + 6), though I can't find evidence of an assertion anywhere. Restarting.

expecting success: 
        id=$(flux jobspec srun -N4 -n8 bash -c \
            "flux kvs put test2.\$FLUX_TASK_RANK=\$FLUX_TASK_LOCAL_ID" \
            | flux job submit) &&
        flux job attach --show-events $id &&
        kvsdir=$(flux job id --to=kvs $id) &&
	sort >test2.exp <<-EOT &&
	${kvsdir}.guest.test2.0 = 0
	${kvsdir}.guest.test2.1 = 1
	${kvsdir}.guest.test2.2 = 0
	${kvsdir}.guest.test2.3 = 1
	${kvsdir}.guest.test2.4 = 0
	${kvsdir}.guest.test2.5 = 1
	${kvsdir}.guest.test2.6 = 0
	${kvsdir}.guest.test2.7 = 1
	EOT
	flux kvs dir ${kvsdir}.guest.test2 | sort >test2.out &&
	test_cmp test2.exp test2.out
job-event: submit {"userid":2000,"priority":16,"flags":0}
job-event: depend
job-event: alloc {"note":"rank[0-3]/core[0-1]"}
job-event: start
job-event: finish {"status":134}
job-event: release {"ranks":"all","final":true}
job-event: free
job-event: clean
not ok 3 - job-shell: execute 2 tasks per rank
@garlick

This comment has been minimized.

Copy link
Member Author

commented Jul 20, 2019

Pushed some fixups to

  • generalize shell_svc_*() to allow any shell rank to register a service
  • exchange some asprintf()s for snprintf()s to reduce error path complexity (hopefully increasing coverage)
@grondo

This comment has been minimized.

Copy link
Contributor

commented Jul 20, 2019

This looks great! I will try to find time to do a review this weekend.

It would actually be nice to get this in before #2244 because there are some tests that check stdout/err of tasks in that PR. It would be nice not to use flux-dmesg for that..

@garlick

This comment has been minimized.

Copy link
Member Author

commented Jul 20, 2019

Thanks! I can squash the fixups now if you prefer.

@grondo

This comment has been minimized.

Copy link
Contributor

commented Jul 20, 2019

@garlick

This comment has been minimized.

Copy link
Member Author

commented Jul 21, 2019

I added some tests to exercise output capture a bit more and began seeing a frequent (but not always) failure of t2602-job-shell.t where the shell hits an assertion in the malloc code (corrupted doubly linked list). Here's a stack trace, although I think this one is likely just a free() call that trips on the previously corrupted malloc pool:

Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51	../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f5e4459d801 in __GI_abort () at abort.c:79
#2  0x00007f5e445e6897 in __libc_message (action=action@entry=do_abort, 
    fmt=fmt@entry=0x7f5e44713b9a "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3  0x00007f5e445ed90a in malloc_printerr (
    str=str@entry=0x7f5e44711cba "corrupted double-linked list")
    at malloc.c:5350
#4  0x00007f5e445f513a in _int_free (have_lock=0, p=<optimized out>, 
    av=0x7f5e44948c40 <main_arena>) at malloc.c:4295
#5  __GI___libc_free (mem=<optimized out>) at malloc.c:3124
#6  0x00007f5e45038260 in zhashx_destroy ()
   from /usr/lib/x86_64-linux-gnu/libczmq.so.4
#7  0x00007f5e454a92cc in dispatch_usecount_decr (d=0x55d9d7ec6e80)
    at msg_handler.c:92
#8  0x00007f5e454c5c6b in aux_item_destroy (aux=0x55d9d7ec6780) at aux.c:35
#9  0x00007f5e454c5e74 in aux_destroy (head=0x55d9d7ebb648) at aux.c:157
#10 0x00007f5e454a67e6 in flux_handle_destroy (h=0x55d9d7ebb640)
    at handle.c:377
#11 0x00007f5e454a6909 in flux_close (h=0x55d9d7ebb640) at handle.c:315
#12 0x000055d9d68efb4e in shell_finalize (shell=0x7ffea64afd20) at shell.c:167
#13 main (argc=<optimized out>, argv=<optimized out>) at shell.c:348

Also: flux attach doesn't report anything on its stderr when this happens; it just exits with a nozero code. It definitely seems like it should be printing some error in that case, although I'm not sure it should always print something when it gets back a nonzero finish status (flux srun /bin/false seems like it should be silent...?)

I'm going to push my test which will likely make travis sad, and probably work on it tomorrow.

@grondo

This comment has been minimized.

Copy link
Contributor

commented Jul 21, 2019

Here's a stack trace, although I think this one is likely just a free() call that trips on the previously corrupted malloc pool:

Sorry if I introduced this in the last PR!

I've been able to run job-shell under valgrind with a wrapper script, e.g. (at the top-level of my flux-core working copy)

$ cat valgrind-shell.sh 
#!/bin/sh
libtool e valgrind $(dirname $0)/src/shell/flux-shell "$@"
$ flux setattr job-exec.job-shell=$(pwd)/valgrind-shell.sh
$ flux srun hostname
# Because valgrind's output is on shell's stderr, we have to get it from dmesg:
$ flux dmesg | sed -n 's/.*STDERR://p'
 valgrind: /home/grondo/git/f.git/src/flux-shell: No such file or directory
 ==26092== Memcheck, a memory error detector
 ==26092== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
 ==26092== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
 ==26092== Command: /home/grondo/git/f.git/src/shell/.libs/flux-shell 1080687591424
 ==26092== 
 ==26158== Warning: invalid file descriptor 1024 in syscall close()
 ==26158== Warning: invalid file descriptor 1025 in syscall close()
 ==26158== Warning: invalid file descriptor 1026 in syscall close()
 ==26158== Warning: invalid file descriptor 1027 in syscall close()
 ==26158==    Use --log-fd=<number> to select an alternative log fd.
 ==26158== Warning: invalid file descriptor 1028 in syscall close()
 ==26158== Warning: invalid file descriptor 1029 in syscall close()
 ==26092== 
 ==26092== HEAP SUMMARY:
 ==26092==     in use at exit: 4,640 bytes in 4 blocks
 ==26092==   total heap usage: 1,185 allocs, 1,181 frees, 11,208,571 bytes allocated
 ==26092== 
 ==26092== LEAK SUMMARY:
 ==26092==    definitely lost: 0 bytes in 0 blocks
 ==26092==    indirectly lost: 0 bytes in 0 blocks
 ==26092==      possibly lost: 512 bytes in 1 blocks
 ==26092==    still reachable: 4,128 bytes in 3 blocks
 ==26092==         suppressed: 0 bytes in 0 blocks
 ==26092== Rerun with --leak-check=full to see details of leaked memory
 ==26092== 
 ==26092== For counts of detected and suppressed errors, rerun with: -v
 ==26092== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

@grondo

This comment has been minimized.

Copy link
Contributor

commented Jul 21, 2019

flux attach doesn't report anything on its stderr when this happens; it just exits with a nozero code. It definitely seems like it should be printing some error in that case, although I'm not sure it should always print something when it gets back a nonzero finish status (flux srun /bin/false seems like it should be silent...?)

For flux srun we should probably just copy srun, which I think summarizes non-zero task exit by default, e.g.

$ srun -N2 -n8 /bin/false
srun: error: fluke43: tasks 0-3: Exited with exit code 1
srun: error: fluke44: tasks 4-7: Exited with exit code 1

(We'll need the shell to aggregate individual task exit status before we can give this detail though. New issue?)

For our other run interfaces I agree it might not make sense to be verbose about non-zero exit codes by default, though I imagine users find the error messages useful in batch script output.

@garlick

This comment has been minimized.

Copy link
Member Author

commented Jul 21, 2019

Thanks! I was trying to figure out how to get valgrind running on the shell in that mode and got a little stuck. Your method works and I immediately found some invalid memory accesses in the io module (so likely my bug) More soon...

@garlick

This comment has been minimized.

Copy link
Member Author

commented Jul 22, 2019

It was a json_pack/unpack big-o/little-o thing in the IO forwarding code. Sheesh.

Latest push ought to work.

I didn't address the flux-job attach behavior.

garlick and others added 14 commits Jul 17, 2019
I/O handling will need shell completion references,
so pass flux_shell_t into the shell_io constructor.
Add shell_task_io_at_eof() for testing if a task's
stream has reached EOF, to be called after
shell_task_io_readline() returns len == 0.

Modify shell_task_io_readline() so that it tests for
EOF rather than subprocess exited to return final
data, which may not be terminated as a line.

To avoid signaling EOF twice, don't call task->io_cb()
in shell_task_destroy().
Problem: flux_rpc_vpack() is not exported, but it
is useful for creating a wrapped RPC function
in the shell.

Define in the public rpc.h header and drop 'static'.
Add helpers for registering services on the leader
shell, and for making RPCs to them from any shell
without needing to build topic strings or calculate
leader rank.

This will be useful for getting some simple distributed
services implemented, such as logging the first task
exit or aggregating task stdio on the leader shell.
Problem: services of the leader shell cannot be
used reliably until leader initialization has
completed.

To ensure init has completed across all shells,
add a synchronous barrier between init and task
start.  This is a no-op if --standalone or the
job consists of only one shell.
Problem: if jobspec and R are provided on command line,
and the shell is NOT in standalone mode, the shell sends
empty lookup request to job-info.

Change shell_init_jobinfo() logic so that the look occurs
only if R or jobspec are missing.

Update descriptive comment block over function.
Problem: if flux-shell attempts to use loop:// connector
in standalone mode, it needs a valid FLUX_CONNECTOR_PATH.

Set this manually to point to the source tree build dir
in the t2601-job-shell-standalone.t test.
Problem: leader services don't work in standalone mode.

It's convenient to be able to send messages to ones
self, e.g. so there don't need to be two code paths
for stdio aggregation.  Change standalone mode to use
a valid loop:// broker handle so it is still possible
to send messages to self.
Send each line of I/O from tasks to the leader rank,
accumulating them as objects in a json array.

Once EOF has been received from all tasks, release
a completion reference on the leader.

When I/O object is destroyed on leader, commit json
array to job's guest KVS namespace under "output" key.
If running in standalone mode, simply emit output
on shell's stdio as before, with labels.
After job is finished, display output.
If -l,--label option is provided, prepend
the task id to each line.
Update tests to find output data in the kvs via
flux job attach, rather than looking for it in
the broker log, where it was originally going.
Problem: valgrind job cancel test relies on prototype
job-exec behavior of sleeping for duration (from before
we had a job shell).

Run sleep 60 instead of /bin/true to extend the window
for cancellation.

To tidy things up, split up the 'job' script:
- split out info test to job-info script.
- split out cancellation test to job-cancel script.
- in job script, attach to jobs to make sure they are
  finished, don't drain as this makes other scripts unable
  to submit jobs.
Fix a cut-and-paste error in the job-exec/exec.c kill callback
which increased the reference count of a signaled jobinfo object
instead of removing the reference.
Bring back t3000-mpi-basic.t from the v0.11 series
with the following modifications:
- TEST_MPI=t not required to run; just check for mpi/hello prog
- use default rc "personality" (not wreck)
- use flux srun not flux wreckrun
- eliminate OPTS loop since we don't have any options yet
- drop "oversubscribe" test

Fixes #2192
garlick added 4 commits Jul 22, 2019
Problem: flux attach will look for output key in KVS
and fail if it doesn't find it, which is incompatible with
the flux-exec test mode that doesn't produce output.
The t2500-job-attach.t combines flux attach and test mode,
and thus will fail.

Alter the test to use the real job shell, which was
unavailable when the test was written.

Drop jq dependency.
Add "ripple test" program like lptest(1) from the
lpr package, for testing shell stdio transport.

It's so trivial, it's probably better to write one
and include it as a test program than add a dependency
on the 'lpr' package, complicated by the fact that
cups provides 'lpr' too, but not 'lptest'.
Run 'lptest' and ensure stdout and stderr streams are
accurately captured.
Problem: if something fails, flux job attach only reflects
it in the exit code.

If a task is signaled or exits with nonzero status,
print a message on stderr.
@garlick garlick force-pushed the garlick:shell_stdio2 branch from 7929f71 to d775794 Jul 22, 2019
@garlick

This comment has been minimized.

Copy link
Member Author

commented Jul 22, 2019

Rebased on master after merging @chu11's PR (thanks!). I also squashed the fixups, tweaked a few commit messages, and dropped the TODO test oversubsribe MPI test that it turns out could pass given the right hwloc setup.

I think this is probably ready to consider for merging so I'll drop the WIP prefix.

@garlick garlick changed the title WIP: add output handling to flux-shell flux-shell: add output handling Jul 22, 2019
@garlick

This comment has been minimized.

Copy link
Member Author

commented Jul 22, 2019

Restarted centos builder that failed here:

ERROR: t3000-mpi-basic
======================
# ./t3000-mpi-basic.t: flux session size will be 4
ok 1 - mpi hello singleton
PASS: t3000-mpi-basic.t 1 - mpi hello singleton
ok 2 - mpi hello all ranks
PASS: t3000-mpi-basic.t 2 - mpi hello all ranks
# passed all 2 test(s)
flux-start: 0 (pid 6761) Killed
1..2
ERROR: t3000-mpi-basic.t - exited with status 137 (terminated by signal 9?)
@grondo

This comment has been minimized.

Copy link
Contributor

commented Jul 22, 2019

This looks good to me, and I think ready for a merge so I can rebase #2244 on top.

I did try a run with an insane amount of output (600K lines), and got this surprising error in a loop:

2019-07-22T18:03:53.104269Z flux-shell.err[0]: tagpool-normal temporarily out of tags

which made me wonder if there was a matchtag leak lurking here (or maybe just messages being sent too quickly to process -- I imagine json_array_append() starts to get very slow with lots of lines)

Since the current code is a stopgap I vote we merge anyway, perhaps opening an issue for the observation above?

@grondo grondo merged commit 0cde26a into flux-framework:master Jul 22, 2019
2 checks passed
2 checks passed
Summary 1 potential rule
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@garlick

This comment has been minimized.

Copy link
Member Author

commented Jul 22, 2019

Thanks!

@grondo

This comment has been minimized.

Copy link
Contributor

commented Jul 22, 2019

Merged! Handling large amounts of output will need a more nuanced solution in the long run anyway.

@codecov-io

This comment has been minimized.

Copy link

commented Jul 22, 2019

Codecov Report

Merging #2246 into master will decrease coverage by 0.01%.
The diff coverage is 78.01%.

@@            Coverage Diff             @@
##           master    #2246      +/-   ##
==========================================
- Coverage   80.75%   80.73%   -0.02%     
==========================================
  Files         209      210       +1     
  Lines       33056    33223     +167     
==========================================
+ Hits        26693    26822     +129     
- Misses       6363     6401      +38
Impacted Files Coverage Δ
src/shell/task.c 89.65% <100%> (-0.12%) ⬇️
src/common/libflux/rpc.c 93.64% <100%> (ø) ⬆️
src/modules/job-exec/exec.c 74.44% <100%> (ø) ⬆️
src/shell/info.c 80.85% <66.66%> (ø) ⬆️
src/shell/shell.c 82.05% <70%> (-2.24%) ⬇️
src/shell/io.c 78.4% <76.54%> (-21.6%) ⬇️
src/cmd/flux-job.c 85% <79.16%> (-0.31%) ⬇️
src/shell/svc.c 81.03% <81.03%> (ø)
src/common/libutil/aux.c 90.74% <0%> (-3.71%) ⬇️
src/cmd/flux-module.c 83.72% <0%> (-0.24%) ⬇️
... and 4 more
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.