Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add flux-srun #2179

Merged
merged 8 commits into from Jun 11, 2019

Conversation

@garlick
Copy link
Member

commented Jun 8, 2019

This is probably wrong.

Recall the idea for flux srun is to provide a simple run command that can be backported to the 0.11 series and provide a stable porting target across 0.11 -> 0.12, and take a little pressure off of having a complete porcelain run command right off in 0.12.

I added a --wait option to flux job submit so that the command will pause until the job reaches INACTIVE state, exiting with the highest exit code of the job shells, or 1 if the job was terminated by an exception.

Then I added the worlds simplest shell script that just called flux jobspec srun $* | flux job submit --wait.

I kind of think it would be better to do this as a python script that generates the jobspec (replicating some of what's in flux jobspec and then starts flux job submit as a subprocess. The shell script just handing all its arguments to flux jobspec srun feels lame - it's not possible to print a reasonable usage message for example, nor to pass options to flux job submit without interpreting options. And bash's getopts is not really longopt friendly.

Is the --wait option the proper default for this command?

I haven't written any tests b/c I wanted to get some early feedback on the direction.

@grondo

This comment has been minimized.

Copy link
Contributor

commented Jun 8, 2019

I think this is a great start!

Eventually the script would want to invoke flux run or the equivalent so that I/O, signals, etc could be handled as expected even for a "simpler run" command, so no point in doing a lot of work now that would have to be undone later, imo.

Maybe as part of development of run/submit commands, the bits that parse a "run" commandline and generate jobspec and submit a job could be wrapped up in a nice python module making one-off wrapper commands like flux srun as trivial as the script here.

I usually find /usr/bin/getopt a good replacement for shell getopts builtin, however that's probably not too portable. (I agree best long term solution is to use python)

@garlick

This comment has been minimized.

Copy link
Member Author

commented Jun 8, 2019

Eventually the script would want to invoke flux run or the equivalent so that I/O, signals, etc could be handled as expected even for a "simpler run" command, so no point in doing a lot of work now that would have to be undone later, imo.

Maybe we should add a plumbing flux job run command with the "wait" behavior instead of an option to flux job submit? It could be structured as a reactor loop so adding those other things in as they become available would be easy. (And why not trap ctrl-C and send a cancel exception now, for a start)?

@garlick

This comment has been minimized.

Copy link
Member Author

commented Jun 8, 2019

Actually, maybe a flux job attach would be a better factored plumbing command. (so submit + attach would be a stripped down run).

@grondo

This comment has been minimized.

Copy link
Contributor

commented Jun 8, 2019

@garlick garlick force-pushed the garlick:flux_srun branch from 186e05c to d337857 Jun 8, 2019
@garlick

This comment has been minimized.

Copy link
Member Author

commented Jun 8, 2019

Just pushed a version with a flux job attach subcommand. In this iteration, if a SIGINT is received, it cancels the job.

Having said that out loud, now I'm wondering if that was a bad choice. There is no way to interactively detach from a running job without canceling it. Hmm.

@grondo

This comment has been minimized.

Copy link
Contributor

commented Jun 8, 2019

@garlick garlick force-pushed the garlick:flux_srun branch from d337857 to 14201e9 Jun 8, 2019
@garlick

This comment has been minimized.

Copy link
Member Author

commented Jun 8, 2019

OK, the version just force-pushed behaves as you described. That was a good idea, nice improvement IMHO.

@dongahn

This comment has been minimized.

Copy link
Contributor

commented Jun 8, 2019

Nice work @garlick! @SteVwonder, @trws and I plan to talk to SNL next Monday. This should be one of the tools we will introduce to them as well.

@garlick

This comment has been minimized.

Copy link
Member Author

commented Jun 8, 2019

Thanks, hopefully we can make this into a more polished command as we go along. Please make sure to stress that this is not intended to be a replica of SLURM's srun, as that would be a much bigger project :-)

@garlick garlick force-pushed the garlick:flux_srun branch from 14201e9 to 5267bba Jun 8, 2019
@SteVwonder

This comment has been minimized.

Copy link
Member

commented Jun 9, 2019

Maybe as part of development of run/submit commands, the bits that parse a "run" commandline and generate jobspec and submit a job could be wrapped up in a nice python module making one-off wrapper commands like flux srun as trivial as the script here.

100% agree.

There is some basic submission functionality in the python bindings currently, but it is really just a thin wrapper around the C submission interface. Off the top of my head, I'm not sure what kind of support there is for "block until this job hits this state" and "wire-up to receive the stdout/stderr of a job".

The get_slurm_common_parser from flux-jobspec can probably be pulled off into the utils (or similar) library. I'm running into the same need for easy jobspec creation when dealing with the simulator. If we can get a level of abstraction that easily supports the common SLURM arguments as well as the simluator's needs, that's probably a good starting point.

@garlick garlick force-pushed the garlick:flux_srun branch 2 times, most recently from 8ca694c to 05aa4e3 Jun 9, 2019
garlick added 3 commits Jun 8, 2019
Problem: there is no convenient command to block until
a job has completed and then obtain its exit status.

Implement a new plumbing subcommand:

Usage: flux job attach ID

This command watches the job's eventlog, handling the
following events:
- exception: log any exceptions
- finish: capture job shell exit status
- clean: terminate reactor loop

If the 'finish' event did not occur because of a fatal
exception, exit with rc=1;  otherwise, exit with the
job shells' highest exit code.

If user presses ctrl-C while attached to a job, then
- another ctrl-C within 2s, cancel the job
- a ctrl-Z within 2s, detach
Problem: flux job submit on the end of a pipeline
emits a JSON parsing error if stdin is empty,
which is kind of an obtuse error.

The error comes from the ingest validator.  Detect
empty jobspec in the submit sub-command instead, and
say "required jobspec is empty".
Problem: in order to write tests of interactive cancel
and detach, it is necessary to know when flux job attach
has started the reactor and begun handling events.

Add a --show-events option that will allow such tests to
be written, and may also be useful to users.
@garlick garlick force-pushed the garlick:flux_srun branch from 36cc08f to 06f6fc2 Jun 11, 2019
@garlick

This comment has been minimized.

Copy link
Member Author

commented Jun 11, 2019

Whew, finally fixed that troublesome test with a nice hint from @grondo on his way out the door.

I think I'm done here.

@grondo

This comment has been minimized.

Copy link
Contributor

commented Jun 11, 2019

Great!

It would be nice if flux srun --help also displayed the usage message and not:

flux-job: jobspec: invalid JSON: '[' or '{' expected near 'usage'

Something like this might be sufficient for now:

diff --git a/src/cmd/flux-srun b/src/cmd/flux-srun
index af19d04..28de366 100755
--- a/src/cmd/flux-srun
+++ b/src/cmd/flux-srun
@@ -1,11 +1,22 @@
 #!/bin/bash
 
+got_help() {
+    while [[ $1 == -* ]]; do
+        case "$1" in
+            --)          return 1 ;;
+            -h|--help)   return 0 ;;
+        esac
+        shift
+    done
+    return 1
+}
+
 submit_job() {
     set -o pipefail
     flux jobspec srun "$@" | flux job submit
 }
 
-if test $# -eq 0; then
+if test $# -eq 0 || got_help "$@"; then
     echo "Usage: flux srun OPTIONS command ..." >&2
     echo "(see flux-jobspec srun for available OPTIONS)" >&2
     exit 1
$ flux srun --help
Usage: flux srun OPTIONS command ...
(see flux-jobspec srun for available OPTIONS)
garlick added 5 commits Jun 7, 2019
Add a simple run comand that wires together:
- flux jobspec srun
- flux job submit
- flux job attach

The idea is that this simple interface can be
back-ported to 0.11 so that users can change their
workflow scripts now with some guarantee that they
won't break when we finalize our more sophisticated
porcelain run command.

Despite the name, the plan is not to be a faithful
replica of the SLURM srun command, just something
simple and familiar that leverages the work Stephen
already invested in the flux jobspec command.
Thus, only the options supported by flux jobspec srun
are supported here.
Problem: misspelled words in one comment, and
one sharness test output message.

Fix typos.
Problem: several sharness test scripts contained
lines with trailing white space.

Chop the white space.
@garlick garlick force-pushed the garlick:flux_srun branch from 06f6fc2 to 18d95ad Jun 11, 2019
@garlick

This comment has been minimized.

Copy link
Member Author

commented Jun 11, 2019

Ok I added that, and a sharness subtest to make sure Usage: is displayed for no args, and for -h. (Squashed)

@codecov-io

This comment has been minimized.

Copy link

commented Jun 11, 2019

Codecov Report

Merging #2179 into master will decrease coverage by 0.02%.
The diff coverage is 79.76%.

@@            Coverage Diff             @@
##           master    #2179      +/-   ##
==========================================
- Coverage   80.87%   80.85%   -0.03%     
==========================================
  Files         199      199              
  Lines       31631    31715      +84     
==========================================
+ Hits        25582    25643      +61     
- Misses       6049     6072      +23
Impacted Files Coverage Δ
src/cmd/flux-job.c 85.27% <79.76%> (-1%) ⬇️
src/modules/barrier/barrier.c 78.52% <0%> (-2.02%) ⬇️
src/common/libflux/mrpc.c 87.79% <0%> (-1.19%) ⬇️
src/broker/modservice.c 79.8% <0%> (-0.97%) ⬇️
src/modules/sched-simple/rlist.c 81.73% <0%> (+0.23%) ⬆️
@garlick

This comment has been minimized.

Copy link
Member Author

commented Jun 11, 2019

diff coverage dropped below the green threshold, but that must be due to random variation as nothing I just changed should have affected coverage. Should be OK to go in.

@grondo

This comment has been minimized.

Copy link
Contributor

commented Jun 11, 2019

Great!

@grondo grondo merged commit edfcd12 into flux-framework:master Jun 11, 2019
3 of 4 checks passed
3 of 4 checks passed
codecov/patch 79.76% of diff hit (target 80.87%)
Details
Summary 1 potential rule
Details
codecov/project 80.85% (-0.03%) compared to 62b96ef
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.