Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/flux-mini: add mini run/submit #2390

Merged
merged 10 commits into from Sep 28, 2019

Conversation

@garlick
Copy link
Member

commented Sep 24, 2019

As discussed in #2379, here's a first cut on a flux mini command (with run and submit subcommands), for early review.

It borrows big hunks shamelessly from @SteVwonder's flux-jobspec.py.

Copy link
Member

left a comment

LGTM so far! I'll hold off on approving until it's no longer a WIP.

src/cmd/flux-mini.py Outdated Show resolved Hide resolved
src/cmd/flux-mini.py Outdated Show resolved Hide resolved
@garlick

This comment has been minimized.

Copy link
Member Author

commented Sep 24, 2019

Do we need some way to run more tasks than cores like slurms --oversubscribe?

@dongahn

This comment has been minimized.

Copy link
Contributor

commented Sep 24, 2019

This wouldn't have implication to the schedulers, would it?

@grondo

This comment has been minimized.

Copy link
Contributor

commented Sep 24, 2019

Do we need some way to run more tasks than cores like slurms --oversubscribe?

Maybe eventually but not now?

In general, maybe we shouldn't take direct inspiration from Slurm commands. For instance, maybe we don't need --oversubscribe when we support tasks_per_slot?

@garlick garlick force-pushed the garlick:flux_mini_run branch from e2b13d8 to c84d5af Sep 24, 2019
@garlick

This comment has been minimized.

Copy link
Member Author

commented Sep 24, 2019

Just squashed and force-pushed an update that

  • uses Flux standard duration for --time
  • drops --setenv
  • replaces --show-jobid, --show-events --show-exec with a --verbose option that can be specified multiple times (easier to maintain interface stability this way)

Also: added --diff option to scripts/check-format to find out why travis check was failing, then added a fixup to address whitespace issue.

@garlick

This comment has been minimized.

Copy link
Member Author

commented Sep 24, 2019

I was wondering if it might improve clarity a bit if --ntasks must be specified if greater than 1, rather than allowing --nnodes to set the task count? E.g. require

flux mini run --nnodes=2 --ntasks=2 cmd ...

rather than just

flux mini run --nnodes=2 cmd ...

Let that fail with

flux-mini: node count must not exceed task count
@trws

This comment has been minimized.

Copy link
Member

commented Sep 24, 2019

It would certainly make it more explicit, I wouldn't mind that change personally.

@garlick garlick force-pushed the garlick:flux_mini_run branch from a8b3be1 to 2dc7ea6 Sep 24, 2019
@garlick

This comment has been minimized.

Copy link
Member Author

commented Sep 24, 2019

Just rebased and squashed some fixups with minor cleanup.

I was trying to add the ability to set shell options but didn't quite get that done today.

@garlick

This comment has been minimized.

Copy link
Member Author

commented Sep 25, 2019

OK, latest fixup is a refactoring into a JobSpec class, SubmitCmd class, and RunCmd class. Sorry, that change touches everything, but I think it's a good cleanup. I hope the JobSpec class is starting to look more like something that could be factored out to a reusable module.

I did change the behavior of --nodes and --ntasks as described above (tentatively, easy to change back)

I'm pretty tentative about this change so would appreciate any feedback from people with more python experience. I can easily drop it if it's not desirable.

Otherwise I'll continue and try to add the shell options today.

@grondo

This comment has been minimized.

Copy link
Contributor

commented Sep 25, 2019

I did change the behavior of --nodes and --ntasks as described above (tentatively, easy to change back)

I agree that seems to be a good change.

It is probably just me, but the usage and help output from python argparse based stuff is just ugly:

usage: flux-mini.py [-h] {run,submit} ...

positional arguments:
  {run,submit}
    run         run subcommand
    submit      submit subcommand

optional arguments:
  -h, --help    show this help message and exit

I guess this is probably for a future PR, but could we make the python commands behave and output help like our other commands (which are modeled after git, the gold standard)

It would also be nice to add a bit more info in the default help output than just "run command" and "submit command", and also flux mini help doesn't work.

Finally, before we tag we may want to add a short manpage for flux-mini(1) since then it will appear in the listing of available commands in flux help.

Copy link
Contributor

left a comment

No major comments, just some suggestions for slightly tweaking the help output etc, since that is all I'm qualified for.

I do like how the refactoring into classes is coming along.

src/cmd/flux-mini.py Outdated Show resolved Hide resolved

@util.CLIMain(logger)
def main():
parser = argparse.ArgumentParser()

This comment has been minimized.

Copy link
@grondo

grondo Sep 25, 2019

Contributor

maybe set prog="flux mini" for argparse so that help doesn't reference flux-mini.py which breaks the 4th wall of the flux(1) command driver ;-)

This comment has been minimized.

Copy link
@garlick

garlick Sep 25, 2019

Author Member

Ok, applying the fix to the util class discussed in #2234, and setting the name to "flux-mini".

This comment has been minimized.

Copy link
@grondo

grondo Sep 26, 2019

Contributor

Was this pushed, still seeing:

$ flux mini 
usage: flux-mini.py [-h] {run,submit} ...
flux-mini.py: error: too few arguments

This is minor enough that it could wait for a future pr.

This comment has been minimized.

Copy link
@garlick

garlick Sep 26, 2019

Author Member

Sorry, my my comment was about the logging prefix. I had not set the ArgumentParser(prog="flux-mini"). I can also override the usage string to print something more like our standard boiler plate for commands with subcommands here.

Fixup coming.

Edit: setting ArgumentParser(usage=usages_string) affects subcommands too so not doing that. (I had tried that before, duh)

This comment has been minimized.

Copy link
@grondo

grondo Sep 26, 2019

Contributor

Ah, sorry, my misinterpretation. I see now that it is obvious what you meant.

src/cmd/flux-mini.py Outdated Show resolved Hide resolved
"""
if type(opts) is not dict:
raise ValueError("shell options must be a dictionary")
self.jobspec["attributes"]["system"]["shell"]["options"] = opts

This comment has been minimized.

Copy link
@grondo

grondo Sep 25, 2019

Contributor

This function isn't used yet, but my feeling is that we want to be able to specify -o multiple times on the command line with something like -o mpi=spectrum -o cpu-affinity=off (or equivalently a comma separated list). Does this function completely reassign all options? We might want to add a way to incrementally add them instead.

This comment has been minimized.

Copy link
@garlick

garlick Sep 25, 2019

Author Member

Thanks! Good feedback as I start working on that.

@garlick

This comment has been minimized.

Copy link
Member Author

commented Sep 25, 2019

Pushed some fixups on the review comments so far (minus prettifying help output which I struggled with a bit, and man page which I agree needs to be added).

Copy link
Member

left a comment

Looking good. Comments below:

scripts/check-format Show resolved Hide resolved
Constructor builds the minimum legal v1 jobspec.
Use setters to assign additional properties.
"""
if type(command) is not list:

This comment has been minimized.

Copy link
@SteVwonder

SteVwonder Sep 25, 2019

Member

I think this is close to what you want, but not exactly. Python's duck typing means you'll trim out a bunch of things that act like lists but aren't explicitly lists (e.g., tuple, deque). (relevant stackoverflow answer). In _create_resource, there is an example of checking for that it is collectionsAbc.Sequence and not a str. (Although now that I look at that again, it only works for python 3 and not python 2....grr....maybe we should wrap that functionality up into a utility function.....feel free to not include that in this PR, we can always open an issue and loop back around to it).

This comment has been minimized.

Copy link
@SteVwonder

SteVwonder Sep 25, 2019

Member

One other thing, you'll want to switch to using isintance rather than type to account for inheritance (relevant stackoverflow post).

This comment has been minimized.

Copy link
@garlick

garlick Sep 25, 2019

Author Member

OK, I'll use isinstance and try to be a bit more inclusive but hold off on the collectionsAbc stuff.

This comment has been minimized.

Copy link
@SteVwonder

SteVwonder Sep 25, 2019

Member

@trws : any insight on the "right way" to test for a thing that maps to a list in YAML?

I think the "Sequence not str" check above also allows buffers and other unintended types in.

This comment has been minimized.

Copy link
@trws

trws Sep 27, 2019

Member

I see this is now assertions etc. While that’s fine, we could be using typing comments and mypy if we are doing type-checking.

As to actually telling if it will become a list, the only way I know to be sure is to actually turn the thing into yaml and back and check, since types can register their own serializers. Just checking that it’s a sequence and not a string seems reasonable.

src/cmd/flux-mini.py Outdated Show resolved Hide resolved
src/cmd/flux-mini.py Outdated Show resolved Hide resolved
src/cmd/flux-mini.py Outdated Show resolved Hide resolved
@garlick

This comment has been minimized.

Copy link
Member Author

commented Sep 25, 2019

Pushed a fixup with the more relaxed type checking.

@garlick

This comment has been minimized.

Copy link
Member Author

commented Sep 26, 2019

Added couple more changes

  • address stephen's feedback
  • fix for logging prefix
  • improve help descriptions of subcommands when one isn't specified
  • add --setopt KEY=VAL option for setting shell options (where val can be JSON or a string)
  • add --dryrun to print jobspec and exit

Still need to add option for setting output modes.
Edit: and man page and tests.

@grondo

This comment has been minimized.

Copy link
Contributor

commented Sep 26, 2019

add --setopt KEY=VAL option for setting shell options (where val can be JSON or a string)

Nicely done! I think options will be common enough that -o should also be allowed to set options as --setopt option=value --setopt option2=value2 starts to get a bit cumbersome.

Very minor nit on --dryrun: I think --dry-run seems to be more standard practice, (e.g. git commands like e.g.git-rm(1))

BTW, really good idea to add --dry-run option. This is all turning out nice IMO.

@grondo

This comment has been minimized.

Copy link
Contributor

commented Sep 26, 2019

Restarting a builder that failed with the well known valgrind test failure.

@garlick garlick force-pushed the garlick:flux_mini_run branch from 8dee12c to 909a400 Sep 26, 2019
@garlick

This comment has been minimized.

Copy link
Member Author

commented Sep 26, 2019

Squashed the fixups and rebased on current master.

Still need -O,--output and -E,--error options, manpage, and sharness test.

@garlick

This comment has been minimized.

Copy link
Member Author

commented Sep 26, 2019

Just pushed a man page and fixups for

  • --setattr=KEY=VAL option to set any jobspec attr, as suggested by @grondo
  • fix to -o, --setopt=KEY=VAL to allow periods in keys (as path separators)
  • add --output and --error options, as discussed in #2241

On to add some tests and deprecate flux-srun.

Copy link
Member

left a comment

LGTM! Two nitpicks about wording and naming, but overall this looks great. Thanks @garlick for putting this together!

*-N, --nodes=N*::
Set the number of nodes to assign to the job. Tasks will be distributed
evenly across the allocated nodes. It is an error to request more nodes
than there are tasks. If unspecified, the minimum number of nodes

This comment has been minimized.

Copy link
@SteVwonder

SteVwonder Sep 27, 2019

Member

Maybe I'm misinterpreting, but I'm interpreting this statement to mean that the front-end tool will request the minimal number of nodes required to satisfy the user's overall request. IIUC, if nodes is unspecified, the front-end tool omits nodes from the jobspec, and the scheduler will make it's decision without constraining the number of nodes in any way (you could get N nodes where N is the number of tasks, or 1 node, just depends on the scheduler).

This comment has been minimized.

Copy link
@garlick

garlick Sep 27, 2019

Author Member

Right, that statement was not only misinformed, but is an inappropriate level of detail for a behavior that is configurable elsewhere.

VAL may be valid JSON (bare values, objects, or arrays), otherwise VAL
is interpreted as a string.

*--dry-run*::

This comment has been minimized.

Copy link
@SteVwonder

SteVwonder Sep 27, 2019

Member

With this option now supported, is the plan to deprecate/drop the flux jobspec command? (Fine by me, just wondering)

This comment has been minimized.

Copy link
@garlick

garlick Sep 27, 2019

Author Member

Since flux mini has deviated from the original slurm compatibility (FSD for time limit, --nodes doesn't alter --ntasks) flux jobspec srun may still be useful? There is the potential for providing jobspec generation from other convenient idioms in the future. My vote would be to keep it and possibly refactor to use JobSpec() class if appropriate down the road.

metavar="FILENAME",
)
parser.add_argument(
"--label",

This comment has been minimized.

Copy link
@SteVwonder

SteVwonder Sep 27, 2019

Member

Nit: I kinda prefer --label-io only because --label seems ambiguous as to what is being labelled. Is the job getting a label? Not a hill I will die on though, so please feel free to disregard.

This comment has been minimized.

Copy link
@grondo

grondo Sep 27, 2019

Contributor

I vote --label-io as well

This comment has been minimized.

Copy link
@garlick

garlick Sep 27, 2019

Author Member

On the subject of option names, would it be more descriptive to rename

  • --cpus-per-task to --cores-per-task?
  • --time to --time-limit or --max-duration?

This comment has been minimized.

Copy link
@grondo

grondo Sep 27, 2019

Contributor

I think both of those are a good ideas. (for --time, --time-limit is preferable just because the terminology will be more familiar to users)

@garlick garlick force-pushed the garlick:flux_mini_run branch from 7151f54 to 310502a Sep 27, 2019
@trws

This comment has been minimized.

Copy link
Member

commented Sep 27, 2019

@grondo

This comment has been minimized.

Copy link
Contributor

commented Sep 27, 2019

The printing is completely configurable, or replaceable, if you’re sufficiently determined. What do you find particularly ugly about it?

Yeah, I think @garlick has fixed most of my minor complaints already. Just as examples, some things that cluttered the help output included:

  • By default, argparse picks its "metavar" as underscored and capitalized version of the long argument, and applies it to both the short and long option. e.g.
  -c CORES_PER_TASK, ---cores-per-task CORES_PER_TASK

vs

  -c, --cores-per-task=N
  • By default sub-commands are listed as "positional arguments" in curly braces which clutters output and most users probably don't even know what that means, e.g.
usage: flux-mini [-h] {run, submit} ...
positional arguments:
 {run,submit}
  run     run a parallel job
  submit  submit a parallel job

vs the more preferable

usage: flux-mini [-h] COMMAND ...

Supported commands:
   run      run a parallel job
   submit  submit a parallel job

As you said, it seems most of this is configurable, but personally I just find the defaults a bit off putting. (I agree just personal preference though)

Edit: I do feel it would be nice if C, python, and utilities in other languages all shared a common usage, help and error behavior though.

@trws

This comment has been minimized.

Copy link
Member

commented Sep 27, 2019

@garlick

This comment has been minimized.

Copy link
Member Author

commented Sep 27, 2019

Thanks @trws, trying out

diff --git a/src/cmd/flux-mini.py b/src/cmd/flux-mini.py
index 84999057d..ad0003054 100755
--- a/src/cmd/flux-mini.py
+++ b/src/cmd/flux-mini.py
@@ -368,7 +368,7 @@ logger = logging.getLogger("flux-mini")
 @util.CLIMain(logger)
 def main():
     parser = argparse.ArgumentParser(prog="flux-mini")
-    subparsers = parser.add_subparsers()
+    subparsers = parser.add_subparsers(title="supported subcommands", description="")
 
     # run
     run = RunCmd()

yields

$ flux mini -h
usage: flux-mini [-h] {run,submit} ...

optional arguments:
  -h, --help    show this help message and exit

supported subcommands:

  {run,submit}
    run         run a job interactively
    submit      enqueue a job

an improvement I think.

@garlick garlick force-pushed the garlick:flux_mini_run branch from 82eb413 to a716edd Sep 27, 2019
@garlick garlick changed the title [WIP] cmd/flux-mini: add mini run/submit cmd/flux-mini: add mini run/submit Sep 27, 2019
@grondo

This comment has been minimized.

Copy link
Contributor

commented Sep 27, 2019

Close to ready?

@trws

This comment has been minimized.

Copy link
Member

commented Sep 27, 2019

Looking good @garlick. I forgot the brace-list option thing, if you set... I think it's meta_var or metavar or something, on the subcommands to something and all the same it should replace that brace-list with that, so metavar="" or something might squash that off too if we want rid of it.

@garlick garlick force-pushed the garlick:flux_mini_run branch from 6b0a2af to 19d1550 Sep 27, 2019
@garlick

This comment has been minimized.

Copy link
Member Author

commented Sep 27, 2019

Oh sheesh, having a bit of a time with travis. I'll check in later to see what's gone wrong this time.

@grondo

This comment has been minimized.

Copy link
Contributor

commented Sep 27, 2019

Looks like more 3.6 issues.. error in travis logs is:

flux-mini: ERROR: '>' not supported between instances of 'NoneType' and 'int'

The TEST_INSTALL builder had a different error:

flux-mini: ERROR: [Errno 2] No such file or directory
@garlick

This comment has been minimized.

Copy link
Member Author

commented Sep 28, 2019

Pushed one fix: it turns out print(flush=True) is only a python3 thing, so replaced that with os.stderr.flush() which should be portable.

@grondo

This comment has been minimized.

Copy link
Contributor

commented Sep 28, 2019

That's annoying that python chooses to make stderr buffered. posix says:

When opened, the standard error stream is not fully buffered; the standard input and standard output streams are fully buffered if and only if the stream can be determined not to refer to an interactive device.

Oh well.

Copy link
Contributor

left a comment

I was playing around on my Ubuntu 18.04 system and hit the following error:

$ flux mini run -vvv -n 2 --label-io hostname
jobid: 6700752961536
flux-mini: ERROR: 'module' object has no attribute 'stderr'
$ python -V
Python 2.7.15+
-----------
flux-mini(1) submits jobs to run under Flux. The job consists of
'N' copies of COMMAND launched together as a parallel job.
If '--nprocs' is unspecified, a value of 'N=1' is assumed.

This comment has been minimized.

Copy link
@grondo

grondo Sep 28, 2019

Contributor

--nprocs is used here and in SYNOPSIS above instead of ntasks.

Use the same option name in flux job attach and
flux mini submit|run.
@garlick

This comment has been minimized.

Copy link
Member Author

commented Sep 28, 2019

I was playing around on my Ubuntu 18.04 system and hit the following error:

Oh duh, in my haste to fix that python3ism, I tacked on a flush of os.stderr, not sys.stderr. Sorry about that and thanks for poking at it!

garlick added 9 commits Sep 23, 2019
Usage: flux mini run|submit [OPTIONS] cmd ...

Both subcommands use options and logic borrowed from flux
jobspec srun to generate jobspec, which they feed the to
the python binding for flux_job_submit().

The submit subcommand prints the jobid on stdout and exits.

The run subcommand calls execs flux job attach to consume
stdout and stderr in real time.  run blocks until the job
has run to completion and all I/O has been consumed.

Unlike Slurm which allows either -n,--nprocs or -N,--nodes
to determine the task count, the -n,--nprocs option here
always defaults to 1 unless the option is present.

A -v,--verbose option that can be specified multiple times
was added in lieu of options that directly map to flux job attach
options like --show-events, because it reduces the interface
footprint facing users, which we want to stablize.
Problem: when the black format check fails in travis,
there is no output to tell you what went wrong.

Add the --diff option to display the required
reformatting.
Problen: The logger always prefixes messages with "util"
rather than the argument to logging.getLogger().

Use %(name) not %(module).

Fixes #2234
All of flux-srun's functionality is now available
in flux-mini run, so drop it.
Problem: sometimes errors are hard to track down with
the CLIMain default exception handler, which only prints
one line with no indication of what part of the code is
failing.

Since stack traces are still logged at DEBUG level, and
the default log level in CLIMain is INFO, add support for
an optional environment variable which can alter the debug
level.

Set FLUX_PYCLI_LOGLEVEL=10 to change the logging level to DEBUG.
@garlick garlick force-pushed the garlick:flux_mini_run branch from f81e00c to 0216aef Sep 28, 2019
@garlick

This comment has been minimized.

Copy link
Member Author

commented Sep 28, 2019

Setting merge-when-passing since @SteVwonder was going to do it yesterday afternoon but I wanted to squash the fixups first. Now they're squashed and all comments addressed I think.

@mergify mergify bot merged commit da2469e into flux-framework:master Sep 28, 2019
2 checks passed
2 checks passed
Summary 1 rule matches
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@codecov-io

This comment has been minimized.

Copy link

commented Sep 28, 2019

Codecov Report

Merging #2390 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #2390      +/-   ##
==========================================
+ Coverage   81.07%   81.08%   +<.01%     
==========================================
  Files         224      224              
  Lines       35770    35770              
==========================================
+ Hits        29002    29005       +3     
+ Misses       6768     6765       -3
Impacted Files Coverage Δ
src/cmd/flux-job.c 85.32% <100%> (ø) ⬆️
src/shell/shell.c 85.71% <0%> (-0.27%) ⬇️
src/common/libflux/message.c 80.36% <0%> (+0.13%) ⬆️
src/common/libflux/handle.c 86.48% <0%> (+0.67%) ⬆️
@garlick garlick deleted the garlick:flux_mini_run branch Sep 28, 2019
@garlick garlick referenced this pull request Sep 30, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants
You can’t perform that action at this time.