Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

shell: add basic job shell with subprocess, environment, and PMI support #2211

Merged
merged 20 commits into from Jul 15, 2019

Conversation

@garlick
Copy link
Member

commented Jun 27, 2019

(Edit: updated 4 July 2019)
(Edited to reflect the current state of this PR 3 July 2019.)

Here is a redo on #2196, a skeletal flux-shell.

Usage: flux-shell [OPTIONS] JOBID
  -R, --resources=FILE   Get R from FILE, not job-info service
  -h, --help             Display this message.
  -j, --jobspec=FILE     Get jobspec from FILE, not job-info service
  -r, --broker-rank=RANK Set broker rank, rather than asking broker
  -s, --standalone       Run local program without Flux instance
  -v, --verbose          Log actions to stderr

This closes #2186 (basic), #2187 (subproc, environ, exit codes), and #2189 (PMI).

The algorithm for mapping tasks to ranks is copied from wreck. Specifically the rcalc class and its tests were brought back from the abyss and modified to accept the new, compact and versioned R_lite version 1.

The jobspec parser is patterned after libjj in the simple-sched. It is limited to one task per slot, and resource hierarchies of node->slot->core or slot->core. Sorry I didn't improve upon existing technology for R and jobspec. Perhaps that can be follow-on work.

Stdio is conspicuously not implemented, so stdio from tasks is mixed with that of the shell, which job-exec logs to the flux instance.

Tasks are not bound to cores.

PMI works but has not been extensively tested. When only one shell is involved, the KVS is not used. When multiple shells are involved, the guest KVS namespace is used, e.g. kvs namespace looks like this for a 4 node MPICH hello:

job.0000.02cc.ff00.0000.J = dmVyc2lvbgBpMQB1c2VyaWQAaTU1ODgAbWVjaABzbm9uZQA=....
job.0000.02cc.ff00.0000.R = {"execution": {"R_lite": [{"children": {"core": "...
job.0000.02cc.ff00.0000.eventlog = {"timestamp":1562189451.0888035,"name":"su...
job.0000.02cc.ff00.0000.guest.exec.eventlog = {"timestamp":1562189451.1050961...
job.0000.02cc.ff00.0000.guest.pmi.P0-businesscard = description#morbo$port#44...
job.0000.02cc.ff00.0000.guest.pmi.P1-businesscard = description#morbo$port#38...
job.0000.02cc.ff00.0000.guest.pmi.P2-businesscard = description#morbo$port#44...
job.0000.02cc.ff00.0000.guest.pmi.P3-businesscard = description#morbo$port#56...
job.0000.02cc.ff00.0000.guest.pmi.hostname[0] = morbo
job.0000.02cc.ff00.0000.guest.pmi.hostname[1] = morbo
job.0000.02cc.ff00.0000.guest.pmi.hostname[2] = morbo
job.0000.02cc.ff00.0000.guest.pmi.hostname[3] = morbo
job.0000.02cc.ff00.0000.guest.pmi.sharedFilename[0] = /dev/shm/mpich_shar_tmp...
job.0000.02cc.ff00.0000.jobspec = {"attributes": {"system": {"cwd": "/home/ga...

(So like before, except private namespace). I note that our jobid's (uint64_t) don't map extremely well to the (integer) PMI appnum. I'm not sure if that's a big deal.

The job-exec module was modified to exec this shell instead of /bin/true if another one is not specified by broker attribute or jobspec. It was also modified to set the FLUX_KVS_NAMESPACE environment variable for the shell (and the job), pointing at the guest namespace.

Upon changing the defualt job shell early on, the valgrind sharness test started failing in what looked like a module (couldn't tell which one). I had worked around this by setting the shell in the sharness valgrind test back to /bin/true. I just dropped the workaround to come back to it, and isn't there anymore on my development system. We'll see about travis. [Edit: it's back in travis. I opened #2217 and added the workaround back to this PR]

Apologies in advance for the rush job - as usual things took longer than I thought, and to get this merged on our two-week cadence, there is inevitably some yuck in here that shouldn't be.

@codecov-io

This comment has been minimized.

Copy link

commented Jun 27, 2019

Codecov Report

Merging #2211 into master will decrease coverage by 0.14%.
The diff coverage is 70.45%.

@@            Coverage Diff             @@
##           master    #2211      +/-   ##
==========================================
- Coverage   80.69%   80.55%   -0.15%     
==========================================
  Files         202      206       +4     
  Lines       32250    32686     +436     
==========================================
+ Hits        26025    26329     +304     
- Misses       6225     6357     +132
Impacted Files Coverage Δ
src/common/libflux/conf.c 100% <ø> (ø) ⬆️
src/shell/info.c 54.05% <54.05%> (ø)
src/shell/jobspec.c 59.01% <59.01%> (ø)
src/shell/shell.c 65.21% <65.21%> (ø)
src/modules/job-exec/exec.c 74.44% <71.42%> (-1.42%) ⬇️
src/shell/rcalc.c 84.15% <84.15%> (ø)
src/common/libflux/mrpc.c 87.79% <0%> (-1.19%) ⬇️
src/common/libflux/message.c 80.5% <0%> (-0.13%) ⬇️
... and 3 more
Copy link
Contributor

left a comment

two minor things i noticed after a quick skim

struct shell_info {
flux_jobid_t jobid;
int shell_rank;
int shell_size;

This comment has been minimized.

Copy link
@chu11

chu11 Jun 27, 2019

Contributor

shell_size made me think "size of the shell", which makes no sense. Should it really be job_node_count or something more like that?

This comment has been minimized.

Copy link
@garlick

garlick Jun 27, 2019

Author Member

I struggled with that one! I was trying to make it clear that the shells are themselves a parallel program of sorts with a rank and size. Hmm, let me ponder some more.

This comment has been minimized.

Copy link
@garlick

garlick Jul 3, 2019

Author Member

Er this was left as is in the current PR for now. I'm open to changing later.

src/modules/job-exec/exec.c Outdated Show resolved Hide resolved
@garlick garlick force-pushed the garlick:shell_skel3 branch from 641eff2 to 8e187bc Jun 27, 2019
src/shell/jobspec.c Show resolved Hide resolved
@garlick garlick force-pushed the garlick:shell_skel3 branch from 1563054 to e23c25d Jul 2, 2019
@garlick

This comment has been minimized.

Copy link
Member Author

commented Jul 2, 2019

This is lauching tasks now, albeit without PMI or stdio support. I hope to get some tests written tomorrow.

I still haven't run down the valgrind failure (currently working around it by adding -o,--setattr=job-exec.job-shell=/bin/true to the valgrind broker command line.)

@garlick

This comment has been minimized.

Copy link
Member Author

commented Jul 2, 2019

Restarted builder that failed here (appears unrelated)

ERROR: t2005-hwloc-basic.t - exited with status 137 (terminated by signal 9?)
@garlick garlick force-pushed the garlick:shell_skel3 branch 5 times, most recently from c388294 to ad2170e Jul 2, 2019
@garlick garlick changed the title WIP: add skeletal job shell add skeletal job shell Jul 3, 2019
@garlick garlick requested a review from SteVwonder Jul 3, 2019
@garlick garlick force-pushed the garlick:shell_skel3 branch from 1208e82 to 6d7d1e1 Jul 3, 2019
@garlick

This comment has been minimized.

Copy link
Member Author

commented Jul 3, 2019

OK, forced a push with a couple more sharness tests that bumped up coverage to an acceptable level. All done poking at it for today.

I'm leaving town on Fri the 5th to return on the 15th, but I'll keep one eye on github and address any egregious issues I've created with this. I'm for merging ASAP and getting on with things though.

@garlick

This comment has been minimized.

Copy link
Member Author

commented Jul 4, 2019

OK, I couldn't leave it alone. Pushing the following changes (squashed):

  • drop gratuitous barrier
  • add big comment block in pmi.c
  • stub out stdio, stderr handling in io.c
@garlick garlick force-pushed the garlick:shell_skel3 branch 2 times, most recently from bee89cf to 82c35b0 Jul 4, 2019
@garlick

This comment has been minimized.

Copy link
Member Author

commented Jul 4, 2019

Argh, seems as though I'm randomly losing stdout from one rank in a two rank job in the standalone shell test in travis, after dropping the subprocess FLUX_SUBPROCESS_FLAGS_STDIO_FALLTHROUGH flag and adding a callback for stdout, stderr. I must have some racy synchronization or not being using libsubprocess right.

Going to start packing up soon to leave on vacation, so I'll back that change out if I can't fix it, so I can leave this PR in a passing state.

@garlick garlick force-pushed the garlick:shell_skel3 branch from 2023143 to 08bee1a Jul 5, 2019
@garlick

This comment has been minimized.

Copy link
Member Author

commented Jul 5, 2019

Ok, apparently the reactor can exit before all subprocess I/O has been consumed. I added a "flush" of sorts when tasks are destroyed and now it seems all the output is showing up. I squashed the fix and forced a push.

Possibly some review from @chu11 on the usage of the subprocess API would be useful here. Note: at this early stage I didn't implement special handling of early task exit, etc., thus the subprocess state callback is currently not registered, but I expect it will need to be. We should talk over what events we would like in the guest eventlog (all tasks started, first task exited, etc).

@garlick

This comment has been minimized.

Copy link
Member Author

commented Jul 5, 2019

Right, strike that, apparently I just got lucky on one travis run. The tests failed after squashing the changes. I don't have time to work on this right now so I just pushed a change which disables the I/O capture for the time being. It didn't really accomplish anything anyway - just laying groundwork for #2190. Perhaps it should be looked at comprehensively after this goes in.

Copy link
Contributor

left a comment

still in the middle of reviewing, but wanted to drop off these comments before I leave for the day

src/shell/task.h Show resolved Hide resolved
src/shell/task.h Show resolved Hide resolved
src/shell/task.c Show resolved Hide resolved
src/cmd/flux-job.c Show resolved Hide resolved
src/shell/pmi.c Outdated Show resolved Hide resolved
src/shell/pmi.c Outdated Show resolved Hide resolved
t/t2601-job-shell-standalone.t Outdated Show resolved Hide resolved
@garlick

This comment has been minimized.

Copy link
Member Author

commented Jul 9, 2019

Thanks for the input @chu11! I will try to get a time slice to work on this on Wed or Thurs.

Welcome back @grondo - if you have a moment to review I'd appreciate any feedback/ideas for improvement to get this on the right footing for the next steps.

The two major outstanding problems are 1) why I lose stdio lines when capturing task I/O, and 2) what memory leak is triggered when this shell is the default for the valgrind sharness test.

@grondo

This comment has been minimized.

Copy link
Contributor

commented Jul 9, 2019

if you have a moment to review I'd appreciate any feedback/ideas for improvement to get this on the right footing for the next steps.

Yes, was just starting to look at this PR along with #2208, since I didn't have any time to follow this the past 2 weeks it might take more than a moment though...

@garlick

This comment has been minimized.

Copy link
Member Author

commented Jul 9, 2019

That is fine - for the most part I won't be able to dig into this again until next week.

grondo and others added 20 commits Jun 20, 2019
Problem: R_lite evolved a bit since its introduction
in wreck, so the rcalc class needs an update.

Change rcalc to parse the R version 1 outer object,
and change the "rank" type from intgeger to string-
encoded idset.  Note: to get this working quickly,
the new compact array form is simply expanded to
the old.

Update test input.
Add code (initially patterned after libjj in
sched-simple) that parses jobspec and extracts
information important to the job shell.
Add initial job shell, with source located in
src/shell, installed to ${fluxlibexec}/flux-shell.

Usage: flux-shell [OPTIONS] JOBID
  -R, --resources=FILE   Get R from FILE, not job-info service
  -h, --help             Display this message.
  -j, --jobspec=FILE     Get jobspec from FILE, not job-info service
  -r, --broker-rank=RANK Set broker rank, rather than asking broker
  -s, --standalone       Run local program without Flux instance
  -v, --verbose          Log actions to stderr

R, jobspec, and the local broker rank are fetched
from the job-info service, unless provided on the
command line.  A 'struct shell_info' created to hold
R, jobspec, and other parameters for use by shell
components.

Create flux handle and reactor capable of handling
SIGCHLD, and start tasks as subprocesses and capture
exit codes.  When the reactor finishes, all tasks have
exit.

Add PMI support using internal libpmi_server library.

Capture stdio of tasks and emit with labels on shell's
stdout, stderr, to be replaced with proper stdio
handling later.

Fixes #2186 flux-shell: implement skeleton job shell
Fixes #2187 flux-shell: add subproc, environ, exit codes
Fixes #2189 flux-shell: add PMI support
Add flux_conf_get() attribute named "shell_path" which
can be used to retrieve the default job shell.
job-exec used /bin/true as a placeholder for the default
job shell.  Now that flux-shell is coming together,
run it as the default, and don't log the fallback since
it's probably the common case.

Drop sharness test for logging of /bin/true fallback.
Problem: the job shell may not be running as the
instance owner in the future, thus should redirect
all KVS operations to the guest namespace.

Set FLUX_KVS_NAMESPACE in the shell's environment
so all KVS accesses by the shell and the job will be
rooted in the guest namespace.

Update dummy job shell to verify that FLUX_KVS_NAMESPACE
is set, and skip setting KVSDIR and NAMESPACE.

Update one sharness test that depended on the job shell
writing to the primary namespace so that it now gathers
output from the guest namespace.
In the shell, it is useful to construct KVS paths relative to
the job's guest directory, or root depending on whether
FLUX_KVS_NAMESPACE is set.

Implement a variant of flux_job_kvs_key() that does this.

Add unit tests.
Problem: if FLUX_KVS_NAMESPACE is set for the job shell,
this may leak into an instance launched by the job shell.

In the broker, unset FLUX_KVS_NAMESPACE when we unset FLUX_URI.
Problem: flux job attach exits the wait status,
not the exit status.

If the job exited, exit with WEXITSTATUS(status).
If the job was signaled, exit with WTERMSIG(status).
Any other failure, exit with 1.

Update t2500-job-attach.t sharness test to use
shell truth rather than the test_must_fail
macro to check for cancellation, since 128 + sig
is considered abnormal by test_must_fail.
Problem: in-tree python wrapper can be generated
that recurses infinitely.

If flux is configured in the environment of a running
instance, a python wrapper is generated that contains a
reference to itself.  Add a check to the Makefile to abort
the build if this situation is detected.

Fixes #2207
Problem: the inline comment describing the PMI-1
simple_server barrier callback is inaccurate.

Rework the description to reflect that the barrier
count is over the number of "shells" (or whatever),
not the number of tasks.
Problem: the pminfo test program is for displaying
PMI info, so it should have two i's not one.

Rename test progaram to pmi_info.

Update hydra sharness test that calls it.
Problem: flux job attach -E shows job eventlog entries
on stderr as they are added to the eventlog, but does not
show the event context (if any).  This makes the option
less useful for debugging than it could be.

Add the json-format context to the event output.
Problem: job manager "list" priority test always
succeeds because its logic is absorbed into a
runaway here-is document.

Move && to top of here-is doc, then fix masked problem
where test improperly prepared its flux job list
output to include only the priority column.
Problem: dmesg logger blank line test always
succeeds because its logic is absorbed into a
runaway here-is document.

Change pipeline to file redirect, and chain
next command with &&, then fix masked problem where
logger improperly set --appname.
@garlick garlick force-pushed the garlick:shell_skel3 branch from bbdeb2b to b9a4885 Jul 15, 2019
@garlick

This comment has been minimized.

Copy link
Member Author

commented Jul 15, 2019

Squashed and rebased, and annotated commits to close relevant issues upon merge.

@grondo
grondo approved these changes Jul 15, 2019
Copy link
Contributor

left a comment

Approved!

@codecov-io

This comment has been minimized.

Copy link

commented Jul 15, 2019

Codecov Report

Merging #2211 into master will increase coverage by 0.01%.
The diff coverage is 82.6%.

@@            Coverage Diff             @@
##           master    #2211      +/-   ##
==========================================
+ Coverage   80.74%   80.76%   +0.01%     
==========================================
  Files         202      209       +7     
  Lines       32288    32988     +700     
==========================================
+ Hits        26071    26642     +571     
- Misses       6217     6346     +129
Impacted Files Coverage Δ
src/common/libpmi/simple_server.c 88.12% <ø> (ø) ⬆️
src/common/libflux/conf.c 100% <ø> (ø) ⬆️
src/shell/io.c 100% <100%> (ø)
src/broker/broker.c 73.52% <100%> (+0.02%) ⬆️
src/common/libsubprocess/local.c 76.61% <100%> (ø) ⬆️
src/common/libjob/job.c 75.65% <63.63%> (-0.94%) ⬇️
src/shell/info.c 71.25% <71.25%> (ø)
src/modules/job-exec/exec.c 74.44% <71.42%> (-1.42%) ⬇️
src/shell/pmi.c 74.26% <74.26%> (ø)
src/shell/rcalc.c 85.24% <85.24%> (ø)
... and 15 more
@mergify mergify bot merged commit 0a0d85b into flux-framework:master Jul 15, 2019
2 checks passed
2 checks passed
Summary 1 rule matches
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@garlick

This comment has been minimized.

Copy link
Member Author

commented Jul 15, 2019

Thanks! That takes a load off.

@garlick garlick deleted the garlick:shell_skel3 branch Jul 15, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.