Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

flux-jobspec command #1964

Merged
merged 7 commits into from Feb 15, 2019

Conversation

@SteVwonder
Copy link
Member

commented Jan 26, 2019

As promised today over coffee, here is what I have so far. It currently supports several flags that are common amongst sbatch and srun. Including -N (nodes), -c (cores), -n (tasks), --time, and --output. I don't think there is anything else we need for Corona other than -g (gpus) and --error.

Oh, it also needs to actually do something when someone supplies --output (whoops, forgot that one).

Related #1929 (still needs oar support before closing it)
Closes #2003

@SteVwonder

This comment has been minimized.

Copy link
Member Author

commented Jan 26, 2019

Currently when a user provides --time, it gets validated against the SLURM walltime format and then the string the user provided at the command line is added straight into the jobspec without any modification. If we have a specific format for the duration in the jobspec that we want to use, it would be really easy to convert the SLURM format into something else. We just need to make a decision on that.

@garlick

This comment has been minimized.

Copy link
Member

commented Jan 26, 2019

This is great - both instructive and useful for testing.

Maybe no big deal if it's already done, but

  • Given this is plumbing, jq could perform any prettification on the pipeline, and the --pretty and --compact options could be dropped.
  • I would vote for srun only, leave off sbatch since srun seems to be a superset?
  • no need to go nuts with the distributions IMHO since we can't do much with them in the new system yet

I'd like to see this merged sooner rather than later. Nice work :-)

@grondo

This comment has been minimized.

Copy link
Contributor

commented Jan 28, 2019

This doesn't have to be done before this PR is merged (the current version is quite useful as it is), however it would be nice if eventually the flux-jobspec command had an interface similar to other flux subcommands (as much as possible), e.g. treat srun or sbatch or whatever as sub-subcommands, not a "choice".

Also flux jobspec help should print the help for the jobspec command, including list of subcommands, and flux jobspec srun --help should print help for just the flux jobspec srun subcommand.

E.g. from flux content:

$ flux content help
flux content: Unknown subcommand: help
Usage: flux content dropcache [OPTIONS]
   or: flux content flush [OPTIONS]
   or: flux content load [OPTIONS] BLOBREF
   or: flux content spam N [M]
   or: flux content store [OPTIONS]
Access content store
  -h, --help             Display this message
$ flux content spam --help
Usage: flux content spam N [M]
Store N random entries, keeping M requests in flight (default 1)
  -h, --help             Display this message.

It probably seems minor, but having all the flux blah commands behave similarly will reduce user frustration in the long run I think.

I wonder if we could subclass the python argparse library to make this happen easily for this and future python commands?

On --yaml vs --json, would it be simpler and more extensible to just have a --output-format, --of or similar option? I'm actually not sure it matters either way, but it seems a more tractable way to add other output formats in the future (if that would even be needed)

@dongahn

This comment has been minimized.

Copy link
Contributor

commented Jan 28, 2019

It probably seems minor, but having all the flux blah commands behave similarly will reduce user frustration in the long run I think.

I wonder if we could subclass the python argparse library to make this happen easily for this and future python commands?

@grondo and @SteVwonder: flux resource would be an example which can benefit from this.

https://github.com/flux-framework/flux-sched/blob/master/t/scripts/flux-resource

But given that Python did a fair good job of abstracting this out. Do we want to abstract this on our own, though.

@grondo

This comment has been minimized.

Copy link
Contributor

commented Jan 28, 2019

But given that Python did a fair good job of abstracting this out. Do we want to abstract this on our own, though.

I don't think we should rewrite argparse, but I was hoping the module was flexible enough to at least customize the default help and other output to match existing flux commands.

@dongahn

This comment has been minimized.

Copy link
Contributor

commented Jan 28, 2019

I don't think we should rewrite argparse, but I was hoping the module was flexible enough to at least customize the default help and other output to match existing flux commands.

I see. Maybe flux-resource can be another data point then.

I'm not a python expert but I think it was flexible enough when I used it. I was at least trying to micmic flux commands/subcommands. Do you think it matches with what you had in mind or lacks some important features?

@SteVwonder

This comment has been minimized.

Copy link
Member Author

commented Feb 5, 2019

Given this is plumbing, jq could perform any prettification on the pipeline, and the --pretty and --compact options could be dropped.

Done (push coming shortly)

I would vote for srun only, leave off sbatch since srun seems to be a superset?

Done

no need to go nuts with the distributions IMHO since we can't do much with them in the new system yet

Ok. Should I pull that out as well, or just leave it in as-is?

Also flux jobspec help should print the help for the jobspec command, including list of subcommands, and flux jobspec srun --help should print help for just the flux jobspec srun subcommand.
I wonder if we could subclass the python argparse library to make this happen easily for this and future python commands?
I don't think we should rewrite argparse, but I was hoping the module was flexible enough to at least customize the default help and other output to match existing flux commands.

I played around with this for a few hours last night. Unfortunately, argparse does not make it easy to customize the output of the usage/help. I opened a ticket to track that discussion/issue further: #1993

@grondo

This comment has been minimized.

Copy link
Contributor

commented Feb 6, 2019

I would vote for srun only, leave off sbatch since srun seems to be a superset?

Done

I didn't have a chance to look at the initial implementation, but isn't the key difference between flux jobspec srun and sbatch that sbatch should insert a flux start or flux broker before the provided COMMAND (which is assumed to be a batch script), and ensure that the tasks run once per node? (or something similar

@SteVwonder

This comment has been minimized.

Copy link
Member Author

commented Feb 6, 2019

isn't the key difference between flux jobspec srun and sbatch that sbatch should insert a flux start or flux broker before the provided COMMAND (which is assumed to be a batch script), and ensure that the tasks run once per node

My implementation is/was not that advanced 😆 They were just aliases, basically. That is a really good idea though. To make that last piece about one per node work, I think --nodes would have to be a required argument. If it is optional and the user only specifies --cores, then the tool will have to make a determination/assumption about the number of cores per node......actually, now that I think about this, is it even possible to specify that kind of request in the canonical jobspec? The request of thinking of is along the lines of: "I want N cores. I don't care about the number of nodes, but I want 1 process per node."

@grondo

This comment has been minimized.

Copy link
Contributor

commented Feb 6, 2019

The request of thinking of is along the lines of: "I want N cores. I don't care about the number of nodes, but I want 1 process per node.

Good question! This was much easier to reason about when the slot: could appear along with the task intead of the resource section. 😑

@grondo

This comment has been minimized.

Copy link
Contributor

commented Feb 6, 2019

It does feel like we got ourselves into a bind by requiring that the slot shape be defined as part of the resource request. There are certain things that are now not possible or very difficult to express (though it could just be my misunderstanding of current jobspec definition). In general, I don't see how you can define a slot at a parent of the resource type you want to request without affect the exact count (and exclusivity) of the parent resource type (if that makes any sense, specific example is what we're discussing here where you want N cores with a slot of a node, but don't care how many nodes. In fact, the exact slot shape can only be determined from a concrete resource set)

A couple ideas on how to handle this come to mind:

  • add a concept of predefined slots that match resource names, e.g. node, socket, core that allow the slot to be left out of the jobspec resources section, e.g. in the case of flux jobspec sbatch you could emit a resources section with no slot and a task: with slot: label node:

  • add a special value for count: like any or 0 that indicates no preference for the number of this type, and any slot that contains an any resource as a direct child would by definition apply to exactly 1 of that resource.

I'll be embarrassed if we've already discussed this somewhere and I missed it...

@SteVwonder SteVwonder force-pushed the SteVwonder:flux-jobspec branch from dde404a to 5ba92b8 Feb 6, 2019
@SteVwonder

This comment has been minimized.

Copy link
Member Author

commented Feb 6, 2019

I'll be embarrassed if we've already discussed this somewhere and I missed it...

I'm not seeing it anywhere, so I'll open a new issue and copy over the conversation so far.

EDIT: New issue: flux-framework/rfc#150

@SteVwonder SteVwonder force-pushed the SteVwonder:flux-jobspec branch 2 times, most recently from 7b2b344 to 761e92e Feb 10, 2019
@SteVwonder

This comment has been minimized.

Copy link
Member Author

commented Feb 10, 2019

This PR is ready for another review. Thanks @grondo and @trws for whiteboarding the different situations with me on Friday. That made writing tests super simple.

I opted to remove the --distribution flag. It didn't seem necessary after our conversation on Friday. I also switched from --yaml and --json to --format yaml and --format json as @grondo suggested above.

@SteVwonder SteVwonder changed the title [WIP] flux-jobspec command flux-jobspec command Feb 10, 2019
@garlick

This comment has been minimized.

Copy link
Member

commented Feb 11, 2019

Awesome!

One minor thing is that because this script lacks the .py suffix, it's not subject to the black format check (and it finds some small nits). I'm not sure what we should do about that - have the check-format script also look for scripts containing a python shebang? I'll open a separate issue on that but the formatting should probably be adjusted here.

@SteVwonder SteVwonder force-pushed the SteVwonder:flux-jobspec branch 2 times, most recently from 1692571 to 7894a8c Feb 11, 2019
@SteVwonder

This comment has been minimized.

Copy link
Member Author

commented Feb 11, 2019

Great catch @garlick!

Rebased, formatter applied to src/cmd/flux-jobspec at relevant commits, and added a fix to scripts/{check-,}format so that they run on src/cmd/flux-jobspec (#2003). I can pull that last addition out pretty easily if we want to go a different route.

@SteVwonder SteVwonder force-pushed the SteVwonder:flux-jobspec branch from 7894a8c to bb96a5f Feb 11, 2019
@grondo

This comment has been minimized.

Copy link
Contributor

commented Feb 12, 2019

Really nice!

Just a couple very minor things while playing around with this really nice tool:

Errors should probably be prefixed with flux-jobspec: so that in a pipeline or scripted invocation, the user knows from whence the error came. Actually I just noticed some errors do have flux-jobspec prefix, others do not:

$ flux jobspec srun -n0
usage: flux-jobspec [-h] [--format {json,yaml}] {srun} ...
flux-jobspec: error: command is required
$ flux jobspec srun -n0 hostname
Unknown error: resource count must be > 0
$ flux jobspec srun -n1 -N2 hostname
Input Error: Number of nodes greater than the number of tasks

I'm probably confused from our prior discussion, but this one doesn't seem right. I want to run 2 tasks with 1 core per task (implied) on a single node (is this just a need to special case nnodes == 1?)

$ flux jobspec srun -n2 -N1 hostname | jq -S .
Number of tasks is not an integer multiple of the number of nodes. More resources than required will be requested to satisfy your job
{
  "attributes": {
    "system": {}
  },
  "resources": [
    {
      "count": 1,
      "type": "node",
      "with": [
        {
          "count": 2,
          "label": "task",
          "type": "slot",
          "with": [
            {
              "count": 1,
              "type": "core"
            }
          ]
        }
      ]
    }
  ],
  "tasks": [
    {
      "attributes": {},
      "command": [
        "hostname"
      ],
      "count": {
        "total": 2
      },
      "slot": "task"
    }
  ],
  "version": 1
}

Actually it does that for any case where ntasks < nnodes?

$ flux jobspec srun -n8 -N2 hostname >/dev/null
Number of tasks is not an integer multiple of the number of nodes. More resources than required will be requested to satisfy your job
@SteVwonder SteVwonder force-pushed the SteVwonder:flux-jobspec branch 3 times, most recently from d970f1e to 9a81d91 Feb 13, 2019
@SteVwonder

This comment has been minimized.

Copy link
Member Author

commented Feb 13, 2019

Errors should probably be prefixed with flux-jobspec: so that in a pipeline or scripted invocation, the user knows from whence the error came.

Good catch! I switched from using manual print statements to using the python logging module. How does this look?

$ ./flux jobspec srun -n0
usage: flux-jobspec [-h] [--format {json,yaml}] {srun} ...
flux-jobspec: error: command is required
$ ./flux jobspec srun -n0 hostname
flux-jobspec: ERROR: resource count must be > 0
$ ./flux jobspec srun -n1 -N2 hostname
flux-jobspec: ERROR: Number of nodes greater than the number of tasks
$ ./flux jobspec srun -n3 -N2 hostname > /dev/null
flux-jobspec: WARNING: Number of tasks is not an integer multiple of the number of nodes. More resources than required will be requested to satisfy your job.

Any errors from the argparse parser will have a lowercase error and print out the usage, while exceptions thrown during validation or program execution will have the all caps ERROR. I vote to mess around with the exact formatting of the argparse errors in #1993.

I'm probably confused from our prior discussion, but this one doesn't seem right.

Great catch! I had my modulo test backwards. I was testing num_nodes % num_tasks == 0 when I should have been testing num_tasks % num_nodes == 0. I flipped it and added a test to check that:

  • no warning is printed on a correct invocation (i.e., -N8 -n16)
  • num_tasks that are integer multiples of num_nodes produce a jobspec that uses .tasks[0].count.per_slot rather than .tasks[0].count.total.

End result:

$ ./flux jobspec srun -n2 -N1 hostname | jq -S .    
{
  "attributes": {
    "system": {}
  },
  "resources": [
    {
      "count": 1,
      "type": "node",
      "with": [
        {
          "count": 2,
          "label": "task",
          "type": "slot",
          "with": [
            {
              "count": 1,
              "type": "core"
            }
          ]
        }
      ]
    }
  ],
  "tasks": [
    {
      "attributes": {},
      "command": [
        "hostname"
      ],
      "count": {
        "per_slot": 1
      },
      "slot": "task"
    }
  ],
  "version": 1
}
@garlick

This comment has been minimized.

Copy link
Member

commented Feb 14, 2019

Sooo...is this one good to go after a rebase?

@grondo
grondo approved these changes Feb 14, 2019
Copy link
Contributor

left a comment

LGTM! Thanks for remembering to update README.md with the jq requirement for make check!

SteVwonder added 3 commits Feb 6, 2019
This should have been apart of #1920. Whoops!
Problem: src/cmd/flux-jobspec does not have a .py extension, so `format`
was not running the formatter on it

Solution: if the file does not end in .py, run awk to check for a python
shebang at the top of the file
@SteVwonder SteVwonder force-pushed the SteVwonder:flux-jobspec branch from 9a81d91 to 09d534f Feb 15, 2019
SteVwonder added 4 commits Jan 21, 2019
e.g., `flux jobspec srun -N4 -n8 -c2 -t 00:30:00 hostname`

Add support for the `--distribution` and `--output` SLURM flags and the
`--time` string format.

Expose SLURM-style argument parser via the `get_slurm_common_parser`
function.  This could potentially be moved to the `src/python`
directory, if useful enough.
specifically, verify that the total number nodes, tasks, and cores
emitted in the jobspec are correct w.r.t SLURM semantics

also verify that a warning is printed when the semantics differ from
SLURM's
the former is now an error; the latter results in a warning that more
resources are being requested than will be used

also adds a test to ensure that correct command executions do not
produce a warning
@SteVwonder SteVwonder force-pushed the SteVwonder:flux-jobspec branch from 09d534f to f21cfb0 Feb 15, 2019
@SteVwonder

This comment has been minimized.

Copy link
Member Author

commented Feb 15, 2019

Rebased and did a little squashing. I think this is good to go once Travis is green.

@codecov-io

This comment has been minimized.

Copy link

commented Feb 15, 2019

Codecov Report

Merging #1964 into master will decrease coverage by 0.36%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #1964      +/-   ##
==========================================
- Coverage   80.53%   80.17%   -0.37%     
==========================================
  Files         178      195      +17     
  Lines       28727    35191    +6464     
==========================================
+ Hits        23136    28214    +5078     
- Misses       5591     6977    +1386
Impacted Files Coverage Δ
src/bindings/lua/lutil.c 78.12% <0%> (-11.17%) ⬇️
src/common/libaggregate/aggregate.c 80.89% <0%> (-5.92%) ⬇️
src/bindings/lua/flux-lua.c 82.32% <0%> (-5.08%) ⬇️
src/modules/cron/cron.c 78.73% <0%> (-1.18%) ⬇️
src/broker/modservice.c 78.84% <0%> (-0.97%) ⬇️
src/common/libflux/future.c 87.54% <0%> (-0.33%) ⬇️
src/cmd/builtin/hwloc.c 80.53% <0%> (-0.1%) ⬇️
src/broker/broker.c 77.41% <0%> (-0.07%) ⬇️
src/common/libflux/conf.c 100% <0%> (ø) ⬆️
src/cmd/flux-aggregate.c
... and 32 more
@garlick

This comment has been minimized.

Copy link
Member

commented Feb 15, 2019

Thanks - this is a high value one!

@garlick garlick merged commit 7af29ea into flux-framework:master Feb 15, 2019
2 checks passed
2 checks passed
Mergify — Summary 1 potential rule
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@SteVwonder SteVwonder deleted the SteVwonder:flux-jobspec branch Feb 15, 2019
@garlick garlick referenced this pull request Feb 16, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.