
Add hierarchical lwj directory support in kvs #811

Merged
merged 40 commits into flux-framework:master from grondo:lwj-hierarchy on Oct 19, 2016

Conversation

@grondo (Contributor) commented Sep 18, 2016

This PR is an experiment in changing just the lwj-active. kvs directory for wreck jobs into a tree of directories, with parameters as suggested by @dongahn in #798.

This implementation uses a configurable number of prefix directory "levels" (parent dirs) and an adjustable number of bits per parent directory (i.e. an adjustable number of entries per dir, in powers of 2).
These numbers can be adjusted only at flux start time, via the new broker attributes:

wreck.lwj-bits-per-dir                  7
wreck.lwj-dir-levels                    2

The default is 2 levels and 7 bits per directory, as determined in @dongahn's experiments. To get the old behavior, set wreck.lwj-dir-levels to 0, e.g.

flux start -o,-Swreck.lwj-dir-levels=0 ...

The top-level lwj-active. directory always uses all remaining high-order bits (i.e. bits 64 - (levels+1)*bits_per_dir). This means the top-level directory can grow to allow the maximum lwj id to be stored; however, it doesn't necessarily adhere to wreck.lwj-bits-per-dir.
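
To make the layout concrete, here is a minimal sketch (not the PR's actual code) of how a path could be derived from a job id under this scheme. The function name, the output format, and the exact split of bits between components are illustrative assumptions; the real implementation's accounting may differ.

#include <stdio.h>
#include <stdint.h>

/* Sketch only: derive a hierarchical kvs path for a job id, assuming the
 * top-level component takes every bit above the lower levels. */
static void lwj_path (char *buf, size_t len, uint64_t id,
                      int levels, int bits_per_dir)
{
    uint64_t mask = ((uint64_t)1 << bits_per_dir) - 1;
    size_t n;

    if (levels == 0) {                       /* old flat behavior */
        snprintf (buf, len, "lwj-active.%ju", (uintmax_t)id);
        return;
    }
    n = snprintf (buf, len, "lwj-active.%ju",
                  (uintmax_t)(id >> (levels * bits_per_dir)));
    for (int i = levels - 1; i > 0; i--)     /* intermediate parent dirs */
        n += snprintf (buf + n, len - n, ".%ju",
                       (uintmax_t)((id >> (i * bits_per_dir)) & mask));
    snprintf (buf + n, len - n, ".%ju", (uintmax_t)id);  /* the job id leaf */
}

int main (void)
{
    char path[128];
    lwj_path (path, sizeof (path), 12345, 2, 7);
    printf ("%s\n", path);   /* prints lwj-active.0.96.12345 */
    return 0;
}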

Because the lwj-active. directory is now stored as a hierarchy by default, the "archive" of jobs seemed no longer necessary. Instead, lwj.<id> is kept as a link to lwj-active.x.y.<id>, and lwj-complete.<hb>.<id> is also linked back into lwj-active. (i.e. we no longer use kvs_move()). Perhaps that was a gratuitous change at this point, and it could easily be backed out.
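
For reference, with the default two levels the resulting key layout looks roughly like the following (the x and y component values are illustrative):

lwj-active.x.y.<id>          (job data lives here)
lwj.<id>                     (link to lwj-active.x.y.<id>)
lwj-complete.<hb>.<id>       (link back into lwj-active.x.y.<id>)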

Actually, lwj-active is a misnomer now and should probably be renamed to lwj-store. or similar, but since this PR is just for testing at this stage I haven't made that change yet.

I didn't see a real improvement in job throughput with this change. Possibly it will only affect the performance of pure flux submit. I tried the soak test and unfortunately I'm seeing a very high standard deviation in job runtimes on both this branch and current master, so I may have botched the tests.

@dongahn, hopefully you'll be able to give early feedback on this change and your job throughput testing.

@grondo added the review label Sep 18, 2016

@coveralls commented Sep 18, 2016

Coverage decreased (-0.01%) to 75.234% when pulling fcefed5 on grondo:lwj-hierarchy into b532e12 on flux-framework:master.

@grondo force-pushed the grondo:lwj-hierarchy branch from fcefed5 to 96f3c5a Sep 18, 2016

@codecov-io commented Sep 18, 2016

Current coverage is 71.54% (diff: 75.00%)

Merging #811 into master will increase coverage by 0.02%

@@             master       #811   diff @@
==========================================
  Files           157        156     -1   
  Lines         26683      26783   +100   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          19082      19161    +79   
- Misses         7601       7622    +21   
  Partials          0          0          
Diff Coverage  File Path
62%            src/modules/wreck/wrexecd.c
74%            src/modules/wreck/job.c
80%            src/modules/libjsc/jstatctl.c

Powered by Codecov. Last update 402ee49...f3c33ef

@coveralls commented Sep 18, 2016

Coverage increased (+0.03%) to 75.277% when pulling 96f3c5a on grondo:lwj-hierarchy into b532e12 on flux-framework:master.

@dongahn (Contributor) commented Sep 19, 2016

Thanks @grondo! I will cherry pick this and experiment with it soon.

@dongahn (Contributor) commented Sep 20, 2016

@grondo: as you may recall, I have been using the purge --keep change for my testing, and this PR has conflicts with that cherry-picked change. Could you please take a quick look at it and suggest the right merge for me?

<<<<<<< HEAD
                    table.insert (r, id)
                    if self.opt.R then
                        local t, target = f:kvs_type ("lwj."..id)
                        if t == "symlink" then
                            f:kvs_unlink (target)
                        end
                        f:kvs_unlink ("lwj."..id)
                        f:kvs_unlink ("lwj-complete."..hb.."."..id)
=======
                    if keep (id) then
                        hb_unlink = false
                    else
                        table.insert (r, id)
                        if self.opt.R then
                            f:kvs_unlink ("lwj."..id)
                            f:kvs_unlink ("lwj-complete."..hb.."."..id)
                        end
                        n = n - 1
                        if n == 0 then return end
>>>>>>> ed7f0ad... cmd/flux-wreck: add --keep option to purge
                    end
                end
                if self.opt.R and hb_unlink then
                    f:kvs_unlink ("lwj-complete."..hb)
                end
            end
        end)()

        -- gather ids to remove in hostlist for condensed output:
        --
        local hl = require 'flux.hostlist' .new (table.concat (r, ",")):uniq ()
<<<<<<< HEAD
        local idstring = tostring (hl):match ("%[?([^]]+)%]?")
=======
>>>>>>> ed7f0ad... cmd/flux-wreck: add --keep option to purge


@grondo (Contributor, Author) commented Sep 20, 2016

Ah, sorry @dongahn, I forgot about that one; let me see if I can merge that for you!

@dongahn (Contributor) commented Sep 20, 2016

@grondo: thanks! I could have tried it myself, but I know you're the perfect person for this :-)

@grondo (Contributor, Author) commented Sep 20, 2016

I didn't have time to test just now, but I pushed a fixed-up commit on top here. It may not work, but I'll check back in after dinner.

@coveralls commented Sep 20, 2016

Coverage decreased (-0.01%) to 75.238% when pulling 8266a44 on grondo:lwj-hierarchy into b532e12 on flux-framework:master.

@dongahn (Contributor) commented Sep 20, 2016

My tests seem to run better with this. I ran the 2CN case twice, and there is some degradation towards the end (with all of the scheduler optimizations in it), but not like before, where I saw steep degradation at 3000 jobs and afterwards.

First run:

cat MOAB.SLURM.00000002.00002Nx01B.half.usleep.FCFS.QD16.DELAYtrue.00010000.10800.in.stats.old
job-group-id,used-nodes,job-count,JEPS
0,1,1000,23.33541049600628450607
1,1,1000,18.87337644138335054899
2,1,1000,17.20563164411596217204
3,1,1000,16.29070648428112989473
4,1,1000,15.08969700688164123029
5,1,1000,14.74500170983039827193
6,1,1000,14.00075354855749046125
7,1,1000,13.03554329895838839492
8,1,1000,4.52034726284068150888
9,1,1000,3.98137040345909741902

Second run:

cat MOAB.SLURM.00000002.00002Nx01B.half.usleep.FCFS.QD16.DELAYtrue.00010000.10800.in.stats
job-group-id,used-nodes,job-count,JEPS
0,1,1000,23.46239928794433649003
1,1,1000,19.21635622451022408154
2,1,1000,17.40877003697204945371
3,1,1000,15.97285713220096531323
4,1,1000,15.23304401872927134973
5,1,1000,14.57901979622408344144
6,1,1000,14.25902668397180169104
7,1,1000,13.40933134449706653444
8,1,1000,2.70363351257125777594
9,1,1000,9.81802853555012897845

Also, this version of flux-core seems generally a bit slower than the version I used for my tests before. I can re-establish the baseline for flux-core performance by using the version right before this PR, to make sure the hierarchical lwj directory doesn't hurt general performance much.

@dongahn (Contributor) commented Sep 20, 2016

I didn't see a real improvement in job throughput with this change

Did you use wreckrun as well as purge along with the change? If so, I'm wondering whether the active lwj directory simply doesn't grow much because wreckrun has blocking logic that provides natural throttling. I'm also wondering what happens if we don't use purge.

@grondo (Contributor, Author) commented Sep 20, 2016

Did you use wreckrun as well as purge along with the change?

I didn't use purge. I was expecting "job insertion" to go a bit faster for anything over 128 jobs; perhaps that expectation was a bit premature.

If so, I'm wondering whether the active lwj directory simply doesn't grow much because wreckrun has blocking logic that provides natural throttling.

I was running with -d, so wreckrun shouldn't be blocking. However, this runs jobs overlapped, which could slow down the machine itself, so perhaps a better test on my end would use flux submit.

As a side note, on this branch the lwj-active directory is a misnomer, as kvs_move is no longer used at job completion to move the lwj directory from lwj-active.<id> back to lwj.<id>; lwj.<id> remains a link (until purge, of course).

@dongahn (Contributor) commented Sep 20, 2016

I didn't use purge. I was expecting "job insertion" to go a bit faster for anything over 128 jobs; perhaps that expectation was a bit premature.

Interesting. Let's continue to peel this back. There is still the one-time overhead of creating a link into lwj, which could eclipse other benefits... I also have to understand the degradation after 8000 jobs as well...

I was running with -d, so wreckrun shouldn't be blocking. However, this runs jobs overlapped, which could slow down the machine itself, so perhaps a better test on my end would use flux submit.

Yeah. I once tested mine with wreckrun -d and I noticed it naturally throttled more than flux submit.

@grondo force-pushed the grondo:lwj-hierarchy branch from 8266a44 to 88b456c Oct 12, 2016

@grondo (Contributor, Author) commented Oct 12, 2016

Ok, I've switched this PR over to a scheme that removes the lwj-active directory completely and adds the lwj hierarchy directly under lwj.. This required removing all instances where lwj.<id> was assumed as the kvs path for a job id, passing the kvs path to wrexecd, etc. I haven't tackled libjsc yet, so this branch will not pass Travis; however, all other wreck interfaces should pretty much work.

I ran an 8K job soak test real quick, and got the following (with no purge):

(soak test plot: soak-45980)

@grondo (Contributor, Author) commented Oct 12, 2016

Also, in the current branch the wrexec module was removed and the equivalent code moved into the job module (which already runs on every rank). The fork/exec of wrexecd was also greatly simplified: the double fork was removed (I don't think it served a purpose), along with the "parent-fd" shenanigans.
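
For context, a rough, generic sketch of what a single fork/exec per job amounts to is below. This is not the PR's code; the wrexecd install path and kvs path shown are assumptions, and only the --kvs-path option is taken from the commit messages.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Generic single fork/exec sketch: the parent forks once and the child
 * execs wrexecd directly, with no intermediate fork or parent-fd handshake. */
static pid_t spawn_wrexecd (const char *wrexecd_path, const char *kvspath)
{
    pid_t pid = fork ();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execl (wrexecd_path, "wrexecd", "--kvs-path", kvspath, (char *)NULL);
        perror ("execl");   /* only reached if exec fails */
        _exit (127);
    }
    return pid;             /* parent: reap the child later, e.g. on SIGCHLD */
}

int main (void)
{
    /* Hypothetical install path and kvs path, for illustration only */
    pid_t pid = spawn_wrexecd ("/usr/libexec/flux/wrexecd", "lwj.0.0.1");
    if (pid > 0)
        waitpid (pid, NULL, 0);
    return pid > 0 ? 0 : 1;
}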

@grondo (Contributor, Author) commented Oct 12, 2016

Ok, here's another soak test up to 10K jobs, after a couple of leak fixes, also with the y axes scaled down a bit. It still looks like I may have a slow leak to find.

(soak test plot: soak-23817)

@dongahn (Contributor) commented Oct 12, 2016

Great results, @grondo! This is major, particularly the flat runtime overheads in the absence of purge! Feel free to delegate the JSC work to me. If nothing else, this will yield a good understanding of scheduler throughput as well!

@grondo (Contributor, Author) commented Oct 12, 2016

@dongahn, I felt bad about pushing a PR that couldn't pass tests, so I took a stab at updating jsc in these most recent commits.

It isn't perfect, but the approach taken here is to add a new hash to the jsc ctx to store a mapping between lwj id and kvs path. The hash is updated whenever a new lookup via the job.kvspath service is performed, or from the new kvs_path member of wrexec state events. Instead of direct xasprintf to construct kvs keys assuming lwj.%d, an lwj_key() function is used to construct keys based on the real base path for the job id.

One problem is that the hash only grows; it doesn't purge old entries, so if libjsc is used in long-running processes this will need to be addressed. I consider the current approach an interim solution so you can move on to your testing more quickly.
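
To illustrate the idea, here is a minimal, self-contained sketch of the memoization. A fixed-size table stands in for the real ctx hash, lookup_kvs_path() is a hypothetical stand-in for the job.kvspath RPC, and lwj_key_sketch() is only named after the approach described above; none of this is libjsc's actual code.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <inttypes.h>

#define MAX_JOBS 1024

struct path_cache {
    int64_t ids[MAX_JOBS];
    char *paths[MAX_JOBS];
    int n;
};

/* Hypothetical stand-in for an RPC to the job.kvspath service; here it just
 * fabricates a path using the default 2-level, 7-bit layout. */
static char *lookup_kvs_path (int64_t id)
{
    char buf[64];
    snprintf (buf, sizeof (buf), "lwj.%"PRId64".%"PRId64".%"PRId64,
              id >> 14, (id >> 7) & 0x7f, id);
    return strdup (buf);
}

/* Return the memoized base path for a jobid, querying only on a cache miss */
static const char *jobid_to_path (struct path_cache *c, int64_t id)
{
    for (int i = 0; i < c->n; i++)
        if (c->ids[i] == id)
            return c->paths[i];          /* cache hit: no lookup needed */
    if (c->n == MAX_JOBS)
        return NULL;                     /* cache full (interim limitation) */
    c->ids[c->n] = id;
    c->paths[c->n] = lookup_kvs_path (id);
    return c->paths[c->n++];
}

/* Build a key under the job's real base path instead of assuming lwj.%d */
static char *lwj_key_sketch (struct path_cache *c, int64_t id, const char *subkey)
{
    const char *base = jobid_to_path (c, id);
    if (!base)
        return NULL;
    size_t len = strlen (base) + strlen (subkey) + 2;
    char *key = malloc (len);
    if (key)
        snprintf (key, len, "%s.%s", base, subkey);
    return key;
}

int main (void)
{
    struct path_cache cache = { .n = 0 };
    char *key = lwj_key_sketch (&cache, 130, "state");
    printf ("%s\n", key ? key : "(error)");   /* prints lwj.0.1.130.state */
    free (key);
    return 0;
}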

There are also some other minor test and leak fixes here -- we'll see if travis tests now pass.

@garlick (Member) commented Oct 12, 2016

Here's a 10K job soak test on master, for comparison (not scaled like @grondo's, though):

(soak test plot: soak-8312)

This was on my desktop so the spikes may be due to other things going on.

@coveralls commented Oct 12, 2016

Coverage decreased (-0.06%) to 75.082% when pulling 725c09f on grondo:lwj-hierarchy into 286b6e7 on flux-framework:master.

@dongahn (Contributor) commented Oct 13, 2016

Thanks @garlick. It seems the over-4x improvement in sqlite db memory overhead says a lot about the cause of the large kvs directory performance issue...

@grondo force-pushed the grondo:lwj-hierarchy branch from f1cc8db to f96f401 Oct 14, 2016

@coveralls commented Oct 14, 2016

Coverage increased (+0.04%) to 75.148% when pulling f96f401 on grondo:lwj-hierarchy into 1d3081d on flux-framework:master.

@garlick (Member) commented Oct 14, 2016

Is this ready to go in? Seems like it helps quite a bit!

@grondo (Contributor, Author) commented Oct 14, 2016

I just rebased on current master, cleaned up some transitional code, and added tests for the new broker attrs used to construct the lwj.x.y.z paths in kvs.

I'd like to run some valgrind tests before merging, then I think it is ready if the general approach is ok.

@grondo (Contributor, Author) commented Oct 14, 2016

Oh, wait, I spoke too soon. I haven't yet tested what changes will be necessary in flux-sched to support this, and this may break capacitor as well. I think for now this has to stay an experimental branch.

@dongahn (Contributor) commented Oct 14, 2016

@grondo: I can quickly run this experimental branch through PerfExplore to check sched compatibility and the performance improvements. Is this a good time to do that?

grondo added some commits Oct 11, 2016

cmd/flux-wreck: add -p, --kvs-path option to last-jobid

Add -p, --kvs-path option to flux-wreck last-jobid in order to return
the path in kvs to the last jobid instead of just the jobid itself.

t/t2000-wreck.t: Use last jobid path where necessary

Do not assume the last-jobid path in kvs is at lwj.<id>. Instead use
the new flux-wreck last-jobid -p option to get the last jobid path
in kvs for tests.

wreck: timeout.lua: use path relative to kvsdir

Do not assume lwj info is in lwj.<id>; instead build the path to the "state"
key relative to wreck.kvsdir, the actual path to the lwj info.

wreck: job: Remove lwj-active kvs dir

Place jobs directly in lwj.* -- do not use the lwj-active
directory for active jobs.  The "active" directory is no longer
needed now that job entries are stored hierarchically.

cmd/flux-wreck: add kvs-path subcommand

Add a `kvs-path` subcommand to print the correct kvs path / key for
any jobid using the job.kvspath service.

modules/libjsc: update for hierarchical lwj directory

Remove all places where lwj.%d is assumed as the path/key to a job id
in kvs.

When the lwj path in kvs needs to be determined, look it up via the
job.kvspath service and memoize the result in a new ctx->job_kvs_paths
hash.

In job_state_cb, check for existence of the "kvs_path" key in the json
object, and if it exists, cache the kvs path for this jobid,
thus avoiding a job.kvspath query.

libjsc: add error string if kvs_get_dir fails

In jobid_exists(), add the error string to the debug-level error
message from libjsc when kvs_get_dir() returns non-zero.

t/t2001-jsc.t: do not assume jobs are in lwj.ID

Use `flux wreck kvs-path` to get the base path/key for jobids in kvs,
when necessary.

t/t2002-pmi.t: fixup path to job in kvs

Do not assume the path to the LWJ is lwj.<id>; use `flux wreck kvs-path`
to build the correct base path for the jobid.

cmd/flux-wreck: no output from "ls" and "timings" when no jobs

When there are no jobs, preserve the old behavior of the "ls" and "timings"
subcommands and print no output, not even the header line.

wreck: wrexecd: remove unneeded code

Remove code to call out to the job.kvspath service when the --kvs-path option
is missing. The --kvs-path option is always used.

t/t2000-wreck.t: add tests for lwj hierarchy setup

Add tests to ensure lwj kvs directory hierarchy parameters can be
set via broker attributes. Also, ensure a job can be run in the old
style flat hierarchy (kvs-dir-levels = 0).

@grondo force-pushed the grondo:lwj-hierarchy branch from b596bac to f3c33ef Oct 19, 2016

@coveralls commented Oct 19, 2016

Coverage increased (+0.02%) to 75.156% when pulling f3c33ef on grondo:lwj-hierarchy into 402ee49 on flux-framework:master.

@grondo (Contributor, Author) commented Oct 19, 2016

Rebased on current master

@garlick (Member) commented Oct 19, 2016

I'll go ahead and merge - great work here @grondo.

@garlick merged commit 18ae697 into flux-framework:master Oct 19, 2016

4 checks passed

codecov/patch: 75.00% of diff hit (target 71.51%)
codecov/project: 71.54% (+0.02%) compared to 402ee49
continuous-integration/travis-ci/pr: The Travis CI build passed
coverage/coveralls: Coverage increased (+0.02%) to 75.156%

@garlick removed the review label Oct 19, 2016

@grondo deleted the grondo:lwj-hierarchy branch Oct 22, 2016

@garlick referenced this pull request Oct 26, 2016

Closed

0.5.0 release notes #879
