
flux-wreck: address scalability issue in 'ls' and 'timing' subcommands #1372

Merged
merged 7 commits into from Mar 19, 2018

Conversation

@grondo
Contributor

grondo commented Mar 19, 2018

The flux-wreck ls and timing commands currently attempt to list information for every job stored in the KVS. For large sessions that have run many jobs, this makes the commands very slow, and they can fail outright once the job list grows beyond the maximum size of a Lua table.

This PR addresses this problem with the following changes to these commands:

  • By default, flux-wreck ls and flux-wreck timing now list only the last (by jobid number) 25 jobs in the KVS. This can be adjusted with a new -n, --max=COUNT option.
  • These commands can now accept a list of JOBIDs as arguments (after any option arguments). For convenience, the JOBIDs may be given as hostlist-style ranges or as multiple arguments on the command line, e.g. `flux wreck ls 1 2 3 4 5 10`, `flux wreck ls 1-5,10`, or `flux wreck ls 1-5 10`.
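For readers unfamiliar with the range syntax, the expansion can be sketched as follows (a Python illustration of the idea only; the actual parser in flux-wreck is written in Lua):

```python
def parse_jobids(args):
    """Expand jobid arguments such as ["1-5,10"] or ["1-5", "10"]
    into a sorted list of integer jobids.

    Illustrative sketch only -- not the real flux-wreck parser.
    """
    ids = set()
    for arg in args:
        # Each argument may contain comma-separated entries.
        for part in arg.split(","):
            if "-" in part:
                # A range entry like "1-5" expands to 1, 2, 3, 4, 5.
                lo, hi = part.split("-", 1)
                ids.update(range(int(lo), int(hi) + 1))
            else:
                # A plain entry is a single jobid.
                ids.add(int(part))
    return sorted(ids)
```

All three command lines shown above would expand to the same set of jobids under this scheme.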

A similar scalability issue may remain in the purge command, but for now purge genuinely has to list all jobs so that it knows how many it can remove. If that proves to be a problem, it will be addressed separately under its own issue.

Fixes issue #1366

@grondo grondo force-pushed the grondo:issue#1366 branch from 0efcc6f to 3b2fb4f Mar 19, 2018

grondo added some commits Mar 17, 2018

wreck/lua: improve wreck:joblist() method
In wreck:joblist(), iterate job directories in the KVS in
reverse order, and bail out early if a new "max" argument to
the function is set by the caller. This allows the function
to avoid becoming very slow or broken in instances with many
thousands of jobs.
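The idea in this commit can be sketched as follows (a Python illustration; the real implementation is Lua code iterating KVS job directories):

```python
def joblist(all_jobids, max_jobs=None):
    """Return jobids newest-first (highest jobid first), stopping
    early once max_jobs entries have been collected.

    Sketch of the wreck:joblist() "max" behavior -- iterating in
    reverse and bailing out early avoids touching every job in a
    session with many thousands of jobs.
    """
    result = []
    for jobid in sorted(all_jobids, reverse=True):
        if max_jobs is not None and len(result) >= max_jobs:
            break  # early bail-out: no need to visit older jobs
        result.append(jobid)
    return result
```

With `max_jobs=25`, a session holding thousands of jobs only ever yields the 25 most recent jobids.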

cmd/flux-wreck: add -n, --max option to some cmds
Add a -n, --max=COUNT argument to the `ls` and `timing` commands
(default = 25); these commands pass the max argument to wreck:joblist()
to avoid fetching huge numbers of jobs from the KVS.

wreck/lua: add jobids_to_kvspath function
Add a function to convert multiple jobids into kvs paths, instead
of handling only one jobid at a time.
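A rough sketch of that batched conversion (Python; the `lwj.<id>` path layout shown here is an assumption for illustration and may not match the instance's actual KVS layout):

```python
def jobids_to_kvspath(jobids):
    """Map a list of jobids to their KVS directory paths in one call,
    rather than converting one jobid at a time.

    The "lwj.<id>" naming is assumed for illustration only.
    """
    return ["lwj.%d" % jobid for jobid in jobids]
```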

flux-wreck: support jobid arg to ls and timing cmds
Enhance the flux-wreck ls and flux-wreck timing commands by allowing
them to take an optional list of one or more jobids on the command
line.

doc: update flux-wreck(1) manpage
Update the flux-wreck manpage with information about new
options and arguments to the ls and timing subcommands.

t/t2000-wreck.t: test extra args to flux-wreck ls
Sanity check that flux-wreck ls -n, --max=COUNT option works
and that JOBID argument can be provided to the command.

@grondo grondo force-pushed the grondo:issue#1366 branch from 3b2fb4f to 609a9bd Mar 19, 2018

@grondo

Contributor Author

grondo commented Mar 19, 2018

rebased on current master

@coveralls


coveralls commented Mar 19, 2018

Coverage Status

Coverage decreased (-0.01%) to 78.786% when pulling 609a9bd on grondo:issue#1366 into 3fea0a0 on flux-framework:master.

@garlick

Member

garlick commented Mar 19, 2018

Just kicked the tires a little bit with 1000 jobs in the history, and this seems solid, and faster too, e.g.

```
$ time flux wreck ls 1-1000 >/dev/null

real	0m4.436s
user	0m1.260s
sys	0m0.336s

$ time flux wreck ls >/dev/null

real	0m0.156s
user	0m0.044s
sys	0m0.020s
```

@garlick garlick merged commit 537f4fc into flux-framework:master Mar 19, 2018

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
coverage/coveralls Coverage decreased (-0.01%) to 78.786%
Details
@codecov-io


codecov-io commented Mar 19, 2018

Codecov Report

Merging #1372 into master will decrease coverage by 0.01%.
The diff coverage is n/a.

```diff
@@            Coverage Diff             @@
##           master    #1372      +/-   ##
==========================================
- Coverage   78.48%   78.47%   -0.02%
==========================================
  Files         162      162
  Lines       29741    29741
==========================================
- Hits        23342    23339       -3
- Misses       6399     6402       +3
```
| Impacted Files | Coverage Δ |
|---|---|
| src/common/libflux/request.c | 87.17% <0%> (-1.29%) ⬇️ |
| src/common/libutil/dirwalk.c | 93.57% <0%> (-0.72%) ⬇️ |
| src/common/libutil/base64.c | 95.07% <0%> (-0.71%) ⬇️ |
| src/common/libflux/message.c | 81.25% <0%> (-0.24%) ⬇️ |
| src/common/libflux/response.c | 84.55% <0%> (+0.81%) ⬆️ |
| src/common/libflux/rpc.c | 94.21% <0%> (+0.82%) ⬆️ |
@grondo

Contributor Author

grondo commented Mar 20, 2018

> Just kicked the tires a little bit with 1000 jobs in the history, and this seems solid, and faster too, e.g.
>
> ```
> $ time flux wreck ls 1-1000 >/dev/null
>
> real	0m4.436s
> user	0m1.260s
> sys	0m0.336s
>
> $ time flux wreck ls >/dev/null
>
> real	0m0.156s
> user	0m0.044s
> sys	0m0.020s
> ```

Thanks! Full disclosure, in case others stumble across this in the future: the speedup here is because the second example (flux wreck ls with no arguments) lists only the last 25 jobs in the KVS, rather than all 1000.

@grondo grondo deleted the grondo:issue#1366 branch Mar 20, 2018

@grondo grondo referenced this pull request May 10, 2018

Closed

0.9.0 Release #1479
