Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wreck: track active jobs #1481

Merged
merged 10 commits into from Apr 26, 2018

Conversation

Projects
None yet
4 participants
@garlick
Copy link
Member

garlick commented Apr 20, 2018

This PR adds a bit of code to the job module to track active jobs.

It listens for all job state transitions, enters jobs into a hash when they first appear, and removes them when the terminal "complete" or "failed" states are reached. The hash exists on all ranks.

There's a simple job.list request that returns a sorted array of jobs including id, kvs_path, and state, e.g.

[
  {"jobid":1, "kvs_path":"lwj.0.0.1", "state":"running"},
  {"jobid":2, "kvs_path":"lwj.0.0.2", "state":"starting"},
  {"jobid":3, "kvs_path":"lwj.0.0.3", "state":"submitted"}
]

Next step is to add an option to flux wreck ls to show only active jobs. The addition of that command would address #1456.

@garlick garlick force-pushed the garlick:active_jobs branch from 99c14c2 to 690c9bf Apr 20, 2018

@coveralls

This comment has been minimized.

Copy link

coveralls commented Apr 20, 2018

Coverage Status

Coverage increased (+0.06%) to 79.055% when pulling 3932537 on garlick:active_jobs into 866f4c7 on flux-framework:master.

@codecov-io

This comment has been minimized.

Copy link

codecov-io commented Apr 20, 2018

Codecov Report

Merging #1481 into master will increase coverage by 0.06%.
The diff coverage is 86.66%.

@@            Coverage Diff             @@
##           master    #1481      +/-   ##
==========================================
+ Coverage   78.69%   78.75%   +0.06%     
==========================================
  Files         164      164              
  Lines       30475    30585     +110     
==========================================
+ Hits        23982    24087     +105     
- Misses       6493     6498       +5
Impacted Files Coverage Δ
src/modules/wreck/job.c 72.89% <68.42%> (-0.3%) ⬇️
src/modules/wreck/wreck_job.c 94.73% <95.12%> (+4.49%) ⬆️
src/common/libflux/response.c 83.73% <0%> (-0.82%) ⬇️
src/modules/kvs/kvs.c 65.87% <0%> (-0.17%) ⬇️
src/common/libflux/message.c 81.36% <0%> (ø) ⬆️
src/common/libflux/future.c 89.25% <0%> (+0.46%) ⬆️
src/common/libflux/mrpc.c 86.66% <0%> (+1.17%) ⬆️
src/modules/connector-local/local.c 74.38% <0%> (+1.43%) ⬆️
@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Apr 20, 2018

I'm assuming the example job.list output above is just a bad example since it includes a job that is complete, or did I misunderstand the job.list API?

This looks super useful, I'll push a commit here with some users job.list so it can be tested.

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Apr 20, 2018

Sorry, bad example yes.

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Apr 20, 2018

Oops, I forgot that JSON payloads are required to be objects (not arrays) per RFC 3. Fix coming, I'll go with

{ "jobs":[
  {"jobid":1, "kvs_dir":"lwj.0.0.1", "state":"running"},
  {"jobid":2, "kvs_dir":"lwj.0.0.2", "state":"starting"},
  {"jobid":3, "kvs_dir":"lwj.0.0.3", "state":"submitted"}
]}
@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Apr 20, 2018

Also, I don't have a preference on this, but other parts of the code use kvspath (e.g. job.kvspath service name and FLUX_JOB_KVSPATH) for the lwj.x.y.z directory path and here it is kvs_dir. We should pick one to avoid confusion.

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Apr 20, 2018

Let's go with kvspath. I'll fix.

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Apr 20, 2018

Sorry! My comments above and in the header were inaccurate.

In the code I used kvs_path which matches what is used for other job.* messages. I'll just correct the header and edit the description above to avoid confusion.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Apr 20, 2018

In the code I used kvs_path which matches what is used for other job.* messages. I'll just correct the header and edit the description above to avoid confusion.

Ah, thanks for catching that!

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Apr 20, 2018

This is working in my test code. I'll have a set of commits to push here in a bit that implements desired behavior for flux wreck ls. Here's an example of the current approach. First, the list of active jobs is gathered and if that doesn't exceed the value of --max, then the traditional kvs walk is appended to this result list up to max, skipping the paths already gathered from job.list:

$ flux wreck ls
    ID NTASKS STATE                    START      RUNTIME    RANKS COMMAND
     1      2 running    2018-04-20T14:50:55       4.739m    [0-1] sleep
     2      2 running    2018-04-20T14:50:56       4.719m    [0-1] sleep
     8      2 running    2018-04-20T14:51:25       4.232m    [0-1] sleep
    13      2 running    2018-04-20T14:55:37       2.197s    [0-1] sleep
     3      1 exited     2018-04-20T14:51:00       0.041s        0 hostname
     4      1 exited     2018-04-20T14:51:01       0.037s        0 hostname
     5      1 exited     2018-04-20T14:51:01       0.043s        0 hostname
     6      1 exited     2018-04-20T14:51:02       0.041s        0 hostname
     7      1 exited     2018-04-20T14:51:02       0.041s        0 hostname
     9      1 exited     2018-04-20T14:55:24       0.045s        0 hostname
    10      1 exited     2018-04-20T14:55:25       0.040s        0 hostname
    11      1 exited     2018-04-20T14:55:25       0.038s        0 hostname
    12      1 exited     2018-04-20T14:55:26       0.039s        0 hostname

I'll also add a way to filter by job state.

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Apr 20, 2018

Nice!

Thinking out loud really, but would it be slightly better to only add on the inactive jobs with an --inactive flag?

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Apr 20, 2018

Thinking out loud really, but would it be slightly better to only add on the inactive jobs with an --inactive flag?

It was a big enough change I didn't think it was necessary at this point. Also it seemed weird to print nothing when there were no "active" jobs (on a quiescent system). For principle of least surprise I thought that this small change which basically moves the "active" jobs to the top was actually a nice solution.

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Apr 20, 2018

Ah, makes sense.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Apr 20, 2018

Oh, sorry, meant to add that I'd be fine changing the behavior if consensus was that flux wreck ls shouldn't list complete jobs by default. I was just explaining why I started out with the current approach.

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Apr 20, 2018

I'm fine with it either way - and we can always change it later if it needs to be changed.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Apr 21, 2018

Ok, pushed some work to my own active_jobs branch (refrained from pushing here just yet) if you want to cherry pick these commits. Main improvements:

  • job state filtering is now enabled by the wreck.joblist function
  • Only job state "running" is reported by flux wreck uri by default
  • New -x, --exclude-complete option to flux wreck ls only lists "active" jobs

Commits may need some cleanup and there are no tests yet.

@garlick garlick force-pushed the garlick:active_jobs branch 2 times, most recently from b78901f to ca78504 Apr 22, 2018

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Apr 22, 2018

Pulled in @grondo's work, squash my incremental development, and rebased.

@garlick garlick force-pushed the garlick:active_jobs branch 3 times, most recently from e24b2ca to d0e53c8 Apr 22, 2018

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Apr 22, 2018

I added a couple tests to my branch and found a problem with purge. Fixing that and I'll update my branch. Sorry for the premature request to cherry-pick my commits.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Apr 23, 2018

Ok, I fixed some issues and made some improvements on my active_jobs branch, which is rebased on your commits from the latest version of this pr.

The main change here is a fix in the t2000-wreck.t ls RANGE test which was confusing flux-wreck purge test later on. Addition of a flux wreck ls test to ensure "active" jobs are listed first, and an improvement to flux wreck purge to only consider non-active jobs.

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Apr 23, 2018

Thanks! I'll update this branch

@garlick garlick force-pushed the garlick:active_jobs branch from d0e53c8 to 7360b38 Apr 23, 2018

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Apr 23, 2018

Thanks!

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Apr 23, 2018

Thanks for all that work! I'll look at the coverage tomorrow and see if I can come up with anything to help increase it.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Apr 23, 2018

Not sure there is much that could be done about coverage, most of the missing coverage is in the error path handling of wreck_job_list. (There is a few duplicated cleanup lines under the error: label, but it seemed like trying to move those into the standard return path wouldn't have a large payoff)

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Apr 24, 2018

Ok, done. This is a bit more palatable.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Apr 24, 2018

Perhaps @chu11 could give this a spin and review and push the button if all statuses return green.

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Apr 25, 2018

I'll go ahead and rebase.

@garlick garlick force-pushed the garlick:active_jobs branch from ea59050 to 620d1d3 Apr 25, 2018

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Apr 25, 2018

@chu11, want to merge this if it looks basically OK?

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Apr 26, 2018

@garlick, it is probably ok for one of us to merge this one rather than letting it languish and need multiple more rebases. I'm trying to work on top of this branch and it would be more convenient if it were merged to master.

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Apr 26, 2018

Ok, let me rebase and then you can press the button OK?

@garlick garlick force-pushed the garlick:active_jobs branch from 620d1d3 to c99b78b Apr 26, 2018

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Apr 26, 2018

I thought it might be easier to merge #1488 first, but up to you. (sorry should have been more patient)

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Apr 26, 2018

Oops sorry, I misunderstood. OK, let me go rebase that one.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Apr 26, 2018

Oops sorry, I misunderstood. OK, let me go rebase that one.

No that was my fault.

garlick and others added some commits Apr 20, 2018

wreck: track/list state of active jobs
Track the state of active jobs in a hash in the job
module.  A job is no longer active when it reaches
the "complete" or "failed" states.

Add a 'job.list' RPC which returns a list of active
jobs in the form returned by wreck_jobs_list(),
passing max, include, and exclude parameters in
the request to the function.
wreck: add wreck_job_list() to wreck_job class
Add a function to produce an object containing
a JSON array that is a sorted list of jobs
from a zhash_t of struct wreck_job entries.

Each time the state of a job is updated with
wreck_job_set_state(), set "mtime" of the job
to the current time.  Then sort the jobs in
the array in mtime order.

Limit the number of jobs returned in array
to 'max' if > 0.

If non-NULL, 'include_states' and 'exclude_states'
are parsed as comma-separated lists of job states
to include or exclude from the list, with
blacklist/whitelist semantics.
wreck/lua: multiple improvements for wreck.joblist
Improve wreck.joblist with the following additions:

 * retrieve a list of "active" jobs first using the new `job.list`
   rpc available from the wreck/job module. Inactive jobs are
   appended to this active list only if the returned list of
   active jobs does not meet or exceed arg.max.
   Fixes #1456

 * Allow filtering jobs by state in wreck.joblist with a states
   table with two allowable members

    - include:  include *only* states where include[state] == true
    - exclude:  exclude states in this table where exclude[state] == true

   These states are passed to `job.list` rpc for the active job list
   and directly filtered on kvs job state for the jobs retrieved from
   the kvs.

 * Add an active_only flag to wreck.joblist which returns immediately
   after retrieving the active job list from the job module. This effectively
   skips kvs traversal and all complete and failed jobs.

 * Add a kvs_only flag to wreck.joblist which skips the retrieval of active
   jobs from the `job.list` rpc. This avoids an unnecessary rpc when it is
   known that no active jobs are required to be returned from the function.
   (Should be used in combination with exclude/include to restrict job
   states returned from kvs)
wreck: flux-wreck: filter job states in wreck commands
Use state filter and kvs_only, active_only flags to wreck.joblist
to make some changes to behavior of flux-wreck commands:

 * Include only running jobs in flux-wreck uri
 * Skip complete jobs in flux-wreck `ls -x` and `uri`
 * Do not consider active jobs in flux-wreck purge
wreck: flux-wreck: pass new supported args to wreck.joblist
For extensibility, change joblist_from_args from taking a parameter
list to a table with parameters set by well-known keys. Include
keys supported by the updated `wreck.joblist` function, including:

 - states dictionary for including, excluding states
 - kvs_only and active_only flags for skipping `job.list` or kvs
   walk respectively
t2000-wreck: don't unlink kvsdir in wreck ls RANGE test
The unlink of the last job's kvs directory in the flux-wreck ls RANGE
test confuses flux-wreck purge test later on, since the job completion
link under lwj-complete is not similarly removed.

Avoid the unlink by instead moving the kvsdir to a temporary location
then moving it back again after the `ls` output is captured.
t2000-wreck: add tests to exercise new job.list rpc
Add a test to ensure that active jobs show up first in flux-wreck ls
output, even when the active job is not the most recent, and ensure
that flux-wreck purge doesn't consider "active" (i.e. running) jobs.
wreck: flux-wreck: ensure jobid is a number in kvs_path()
Ensure jobid is converted to a number before passing to
wreck.id_to_path().
wreck: flux-wreck ls: allow --max=0 to disable max
Allow --max=0 to set maximum number of jobs to return to unlimited.

@garlick garlick force-pushed the garlick:active_jobs branch from c99b78b to 3932537 Apr 26, 2018

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Apr 26, 2018

OK this is rebased and ready to go when travis completes.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Apr 26, 2018

Thanks for rebasing so many times @garlick!

@grondo grondo merged commit 33b5871 into flux-framework:master Apr 26, 2018

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
coverage/coveralls Coverage increased (+0.06%) to 79.055%
Details
@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Apr 26, 2018

Thanks!

@garlick garlick deleted the garlick:active_jobs branch Apr 26, 2018

@grondo grondo referenced this pull request May 10, 2018

Closed

0.9.0 Release #1479

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.