sqlite db grows unreasonably large when running many small jobs #609

Closed
garlick opened this Issue Mar 17, 2016 · 20 comments

garlick commented Mar 17, 2016

In a soak test run, the 7741st 8-way job encountered this error:

[1458183667.824245] soak.info[0]: 7741 4.17 1603712
[1458183667.867864] broker.debug[0]: content purge: 15 entries
[1458183667.855751] broker.debug[2]: content purge: 11 entries
[1458183667.862400] broker.debug[1]: content purge: 12 entries
[1458183667.848563] broker.debug[6]: content purge: 11 entries
[1458183667.843773] broker.debug[4]: content purge: 11 entries
[1458183667.851527] broker.debug[5]: content purge: 11 entries
[1458183667.860135] broker.debug[3]: content purge: 11 entries
[1458183667.854959] broker.debug[7]: content purge: 10 entries
[1458183667.883523] job.info[0]: got request job.create
[1458183667.884140] job.info[0]: Setting job 7742 to reserved
[1458183668.217504] lwj.7742.debug[0]: initializing from CMB: rank=0
[1458183668.218107] lwj.7742.debug[0]: lwj.7742: node0: basis=0
[1458183668.272222] lwj.7742.info[0]: lwj 7742: node0: nprocs=1, nnodes=8, cmdline=[ "\/bin\/true" ]
[1458183668.272249] lwj.7742.debug[0]: updating job state to starting
[1458183668.313432] lwj.7742.debug[0]: reading lua files from /home/garlick/proj/flux-core/src/modules/wreck/lua.d/*.lua
[1458183668.904889] lwj.7742.debug[6]: initializing from CMB: rank=6
[1458183668.907340] lwj.7742.debug[6]: lwj.7742: node6: basis=6
[1458183668.914922] lwj.7742.debug[2]: initializing from CMB: rank=2
[1458183668.916117] lwj.7742.debug[2]: lwj.7742: node2: basis=2
[1458183668.927386] lwj.7742.debug[7]: initializing from CMB: rank=7
[1458183668.929815] lwj.7742.debug[7]: lwj.7742: node7: basis=7
[1458183668.936263] lwj.7742.debug[1]: initializing from CMB: rank=1
[1458183668.937195] lwj.7742.debug[1]: lwj.7742: node1: basis=1
[1458183668.937351] lwj.7742.debug[3]: initializing from CMB: rank=3
[1458183668.937381] lwj.7742.debug[5]: initializing from CMB: rank=5
[1458183668.938404] lwj.7742.debug[3]: lwj.7742: node3: basis=3
[1458183668.938642] lwj.7742.debug[5]: lwj.7742: node5: basis=5
[1458183668.986396] kvs.debug[0]: commit_timeout_handler: coalesced 5 commits
[1458183669.631723] lwj.7742.info[5]: lwj 7742: node5: nprocs=1, nnodes=8, cmdline=[ "\/bin\/true" ]
[1458183669.632129] lwj.7742.debug[5]: reading lua files from /home/garlick/proj/flux-core/src/modules/wreck/lua.d/*.lua
[1458183669.663131] lwj.7742.debug[4]: initializing from CMB: rank=4
[1458183669.665177] lwj.7742.debug[4]: lwj.7742: node4: basis=4
[1458183669.683631] lwj.7742.info[1]: lwj 7742: node1: nprocs=1, nnodes=8, cmdline=[ "\/bin\/true" ]
[1458183669.684011] lwj.7742.debug[1]: reading lua files from /home/garlick/proj/flux-core/src/modules/wreck/lua.d/*.lua
[1458183669.685311] lwj.7742.info[2]: lwj 7742: node2: nprocs=1, nnodes=8, cmdline=[ "\/bin\/true" ]
[1458183669.685630] lwj.7742.debug[2]: reading lua files from /home/garlick/proj/flux-core/src/modules/wreck/lua.d/*.lua
[1458183669.692297] lwj.7742.info[3]: lwj 7742: node3: nprocs=1, nnodes=8, cmdline=[ "\/bin\/true" ]
[1458183669.692646] lwj.7742.debug[3]: reading lua files from /home/garlick/proj/flux-core/src/modules/wreck/lua.d/*.lua
[1458183669.693038] lwj.7742.info[7]: lwj 7742: node7: nprocs=1, nnodes=8, cmdline=[ "\/bin\/true" ]
[1458183669.693416] lwj.7742.debug[7]: reading lua files from /home/garlick/proj/flux-core/src/modules/wreck/lua.d/*.lua
[1458183669.700922] lwj.7742.info[6]: lwj 7742: node6: nprocs=1, nnodes=8, cmdline=[ "\/bin\/true" ]
[1458183669.701263] lwj.7742.debug[6]: reading lua files from /home/garlick/proj/flux-core/src/modules/wreck/lua.d/*.lua
[1458183669.852251] broker.debug[0]: content purge: 35 entries
[1458183669.856172] broker.debug[2]: content purge: 17 entries
[1458183669.856847] broker.debug[6]: content purge: 13 entries
[1458183669.857786] broker.debug[5]: content purge: 13 entries
[1458183669.861135] broker.debug[1]: content purge: 19 entries
[1458183669.850368] broker.debug[4]: content purge: 13 entries
[1458183669.865314] broker.debug[3]: content purge: 15 entries
[1458183669.863885] broker.debug[7]: content purge: 13 entries
[1458183670.136682] lwj.7742.info[4]: lwj 7742: node4: nprocs=1, nnodes=8, cmdline=[ "\/bin\/true" ]
[1458183670.137088] lwj.7742.debug[4]: reading lua files from /home/garlick/proj/flux-core/src/modules/wreck/lua.d/*.lua
[1458183670.251404] lwj.7742.debug[0]: task0: pid 49850 (/bin/true): started
[1458183670.558618] lwj.7742.debug[2]: task0: pid 49851 (/bin/true): started
[1458183670.577426] lwj.7742.debug[5]: task0: pid 49852 (/bin/true): started
[1458183670.612219] lwj.7742.debug[1]: task0: pid 49853 (/bin/true): started
[1458183670.615133] lwj.7742.debug[3]: task0: pid 49854 (/bin/true): started
[1458183670.629030] lwj.7742.debug[4]: task0: pid 49855 (/bin/true): started
[1458183670.637325] lwj.7742.debug[6]: task0: pid 49856 (/bin/true): started
[1458183670.640369] lwj.7742.debug[7]: task0: pid 49857 (/bin/true): started
[1458183670.677648] lwj.7742.debug[0]: updating job state to running
[1458183670.726333] lwj.7742.debug[0]: task0: pid 49850 (/bin/true) exited with status 0x0000
[1458183670.728001] content-sqlite.err[0]: store: executing stmt: disk I/O error
[1458183670.728298] broker.err[0]: content store: Invalid argument
[1458183670.728320] broker.debug[0]: content flush begin
[1458183670.744677] broker.debug[0]: content flush +1 (dirty=2 pending=2)
[1458183670.728523] content-sqlite.err[0]: store: executing stmt: disk I/O error
[1458183670.744907] broker.err[0]: content store: Invalid argument
[1458183670.744923] broker.debug[0]: content flush begin
[1458183670.760687] broker.debug[0]: content flush +1 (dirty=2 pending=2)
[1458183670.751379] content-sqlite.err[0]: store: executing stmt: database disk image is malformed
[1458183670.760953] broker.err[0]: content store: Invalid argument
[1458183670.760964] broker.debug[0]: content flush begin
[1458183670.776835] broker.debug[0]: content flush +1 (dirty=1 pending=1)
[1458183670.767418] content-sqlite.err[0]: store: executing stmt: database disk image is malformed
[1458183670.777031] broker.err[0]: content store: Invalid argument

These messages then repeat. I terminated the instance before thinking to check the sqlite db file.

garlick added a commit to garlick/flux-core that referenced this issue Mar 17, 2016

modules/content-sqlite: improve error logging
Log the sqlite3 extended error code along with the regular
error string.

Translate some key sqlite errors to errno values so a better
errnum gets returned to the requestor.

This might help debug flux-framework#609

garlick commented Mar 18, 2016

A repeat run of the same test failed at job 7741:

[1458270878.926880] content-sqlite.err[0]: store: executing stmt: disk I/O error(778)
[1458270878.927330] broker.err[0]: content store: Input/output error

778 is SQLITE_IOERR_WRITE.
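(As a sanity check on that decoding: SQLite extended result codes pack the primary error code in the low byte and the sub-code above it, so 778 breaks down as shown below.)

$ echo $(( 778 & 0xff ))   # primary code 10 = SQLITE_IOERR
10
$ echo $(( 778 >> 8 ))     # sub-code 3 = the IOERR_WRITE variant
3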

The sqlite data file is about 30G:

$ cd /tmp/flux-61277-gxe9wj/content
$ ls -l
total 32016992
-rw------- 1 garlick garlick 32785383424 Mar 17 20:14 sqlite

and the file system is full:

$ df -h .
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg_jimbo-lv_root
                       50G   47G   16M 100% /

garlick added a commit to garlick/flux-core that referenced this issue Mar 18, 2016

modules/content-sqlite: compress large blobs
Blobs > 256 bytes will be compressed before being stored in the
database using LZO compression.  A new 'size' integer column is added
to store the uncompressed size of the blob so that the right size
buffer can be allocated for decompression.  If size is -1, this
indicates that the blob is not compressed.

This might help a bit with issue flux-framework#609
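For reference, here is a rough sketch of the resulting table layout (the table and column names are illustrative assumptions, not copied from the module source):

$ sqlite3 content.sqlite '
  CREATE TABLE IF NOT EXISTS objects (
      hash   BLOB PRIMARY KEY,  -- content address of the blob
      size   INT,               -- uncompressed size; -1 means stored uncompressed
      object BLOB               -- LZO-compressed when the original exceeds 256 bytes
  );'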

garlick commented Mar 19, 2016

A little better with compression for the on-disk objects, but still using a disturbing amount of disk:

[attached plot: soak]

garlick commented Mar 21, 2016

I was able to run 19033 jobs over the weekend until my hard drive filled up; my root disk then apparently decided to give up. I did manage to capture the last data point:

  • jobid 19033
  • job runtime 9.85s
  • rank 0 broker rss 12,686,012K
  • content/sqlite size 106,290,640K

garlick commented Mar 22, 2016

I recovered the data from that run:

[attached plot: soak]

At least the broker RSS looks relatively stable (sub-gigabyte) prior to the disk filling up; then it begins to spike because dirty entries can't be written out to sqlite.

garlick commented Mar 22, 2016

Here's a trace of directory objects updated during flux wreckrun --ntasks 8 /bin/true (object size in parens), for the 1st and the 1000th job:

[1458681203.151762] kvs.debug[0]: stored lwj.1 (39)
[1458681203.151794] kvs.debug[0]: stored lwj (62)
[1458681203.154288] kvs.debug[0]: stored lwj.1.input (64)
[1458681203.154330] kvs.debug[0]: stored lwj.1 (452)
[1458681203.154351] kvs.debug[0]: stored lwj (475)
[1458681203.159085] kvs.debug[0]: stored lwj.1.rank.1 (30)
[1458681203.159106] kvs.debug[0]: stored lwj.1.rank.2 (30)
[1458681203.159121] kvs.debug[0]: stored lwj.1.rank.3 (30)
[1458681203.159136] kvs.debug[0]: stored lwj.1.rank.4 (30)
[1458681203.159151] kvs.debug[0]: stored lwj.1.rank.5 (30)
[1458681203.159165] kvs.debug[0]: stored lwj.1.rank.6 (30)
[1458681203.159179] kvs.debug[0]: stored lwj.1.rank.7 (30)
[1458681203.159194] kvs.debug[0]: stored lwj.1.rank.0 (30)
[1458681203.159230] kvs.debug[0]: stored lwj.1.rank (403)
[1458681203.159271] kvs.debug[0]: stored lwj.1 (862)
[1458681203.159292] kvs.debug[0]: stored lwj (885)
[1458681203.161217] kvs.debug[0]: stored lwj.1 (507)
[1458681203.161242] kvs.debug[0]: stored lwj (530)
[1458681203.173393] kvs.debug[0]: stored lwj.1.input.files (75)
[1458681203.173422] kvs.debug[0]: stored lwj.1.input (163)
[1458681203.173462] kvs.debug[0]: stored lwj.1 (622)
[1458681203.173483] kvs.debug[0]: stored lwj (645)
[1458681203.184762] kvs.debug[0]: stored lwj.1.log (71)
[1458681203.184809] kvs.debug[0]: stored lwj.1 (600)
[1458681203.184829] kvs.debug[0]: stored lwj (623)
[1458681203.189521] kvs.debug[0]: stored lwj.1.log (139)
[1458681203.189568] kvs.debug[0]: stored lwj.1 (668)
[1458681203.189589] kvs.debug[0]: stored lwj (691)
[1458681203.192126] kvs.debug[0]: stored lwj.1.log (479)
[1458681203.192155] kvs.debug[0]: stored lwj.1.0 (149)
[1458681203.192202] kvs.debug[0]: stored lwj.1 (1230)
[1458681203.192223] kvs.debug[0]: stored lwj (1253)
[1458681203.193946] kvs.debug[0]: stored lwj.1.log (547)
[1458681203.193993] kvs.debug[0]: stored lwj.1 (1197)
[1458681203.194013] kvs.debug[0]: stored lwj (1220)
[1458681203.202331] kvs.debug[0]: stored lwj.1.0 (200)
[1458681203.202354] kvs.debug[0]: stored lwj.1.3 (200)
[1458681203.202374] kvs.debug[0]: stored lwj.1.6 (200)
[1458681203.202393] kvs.debug[0]: stored lwj.1.4 (200)
[1458681203.202413] kvs.debug[0]: stored lwj.1.1 (200)
[1458681203.202432] kvs.debug[0]: stored lwj.1.2 (200)
[1458681203.202460] kvs.debug[0]: stored lwj.1.5 (200)
[1458681203.202479] kvs.debug[0]: stored lwj.1.7 (200)
[1458681203.202538] kvs.debug[0]: stored lwj.1 (2390)
[1458681203.202559] kvs.debug[0]: stored lwj (2413)
[1458681203.207666] kvs.debug[0]: stored lwj.1.0 (276)
[1458681203.207701] kvs.debug[0]: stored lwj.1.3 (276)
[1458681203.207733] kvs.debug[0]: stored lwj.1.6 (276)
[1458681203.207765] kvs.debug[0]: stored lwj.1.4 (276)
[1458681203.207796] kvs.debug[0]: stored lwj.1.1 (276)
[1458681203.207829] kvs.debug[0]: stored lwj.1.2 (276)
[1458681203.207862] kvs.debug[0]: stored lwj.1.5 (276)
[1458681203.207893] kvs.debug[0]: stored lwj.1.7 (276)
[1458681203.207952] kvs.debug[0]: stored lwj.1 (2998)
[1458681203.207974] kvs.debug[0]: stored lwj (3021)
[1458681203.209688] kvs.debug[0]: stored lwj.1 (1225)
[1458681203.209711] kvs.debug[0]: stored lwj (1248)
[1458681203.214820] kvs.debug[0]: stored lwj.1.0.stdout (45)
[1458681203.214838] kvs.debug[0]: stored lwj.1.0.stderr (45)
[1458681203.214868] kvs.debug[0]: stored lwj.1.0 (334)
[1458681203.214898] kvs.debug[0]: stored lwj.1.3.stdout (45)
[1458681203.214914] kvs.debug[0]: stored lwj.1.3.stderr (45)
[1458681203.214943] kvs.debug[0]: stored lwj.1.3 (334)
[1458681203.214968] kvs.debug[0]: stored lwj.1.6.stdout (45)
[1458681203.214983] kvs.debug[0]: stored lwj.1.6.stderr (45)
[1458681203.215012] kvs.debug[0]: stored lwj.1.6 (334)
[1458681203.215035] kvs.debug[0]: stored lwj.1.4.stdout (45)
[1458681203.215051] kvs.debug[0]: stored lwj.1.4.stderr (45)
[1458681203.215081] kvs.debug[0]: stored lwj.1.4 (334)
[1458681203.215104] kvs.debug[0]: stored lwj.1.1.stdout (45)
[1458681203.215119] kvs.debug[0]: stored lwj.1.1.stderr (45)
[1458681203.215148] kvs.debug[0]: stored lwj.1.1 (334)
[1458681203.215172] kvs.debug[0]: stored lwj.1.2.stdout (45)
[1458681203.215187] kvs.debug[0]: stored lwj.1.2.stderr (45)
[1458681203.215217] kvs.debug[0]: stored lwj.1.2 (334)
[1458681203.215241] kvs.debug[0]: stored lwj.1.5.stdout (45)
[1458681203.215255] kvs.debug[0]: stored lwj.1.5.stderr (45)
[1458681203.215284] kvs.debug[0]: stored lwj.1.5 (334)
[1458681203.215307] kvs.debug[0]: stored lwj.1.7.stdout (45)
[1458681203.215323] kvs.debug[0]: stored lwj.1.7.stderr (45)
[1458681203.215352] kvs.debug[0]: stored lwj.1.7 (334)
[1458681203.215415] kvs.debug[0]: stored lwj.1 (3513)
[1458681203.215436] kvs.debug[0]: stored lwj (3536)
[1458681203.217692] kvs.debug[0]: stored lwj.1 (1279)
[1458681203.217716] kvs.debug[0]: stored lwj (1302)

Note that only directory updates were logged. Since values smaller than a blobref are stored directly in the directory object, I think most values will be small for this workload. Here is the 1000th job; observe the lwj directory size increase:

[1458681549.922252] kvs.debug[0]: stored lwj.1000 (39)
[1458681549.924579] kvs.debug[0]: stored lwj (69887)
[1458681549.928801] kvs.debug[0]: stored lwj.1000.input (64)
[1458681549.928857] kvs.debug[0]: stored lwj.1000 (452)
[1458681549.931757] kvs.debug[0]: stored lwj (70300)
[1458681549.940968] kvs.debug[0]: stored lwj.1000.rank.1 (30)
[1458681549.941002] kvs.debug[0]: stored lwj.1000.rank.2 (30)
[1458681549.941019] kvs.debug[0]: stored lwj.1000.rank.3 (30)
[1458681549.941033] kvs.debug[0]: stored lwj.1000.rank.4 (30)
[1458681549.941048] kvs.debug[0]: stored lwj.1000.rank.5 (30)
[1458681549.941063] kvs.debug[0]: stored lwj.1000.rank.6 (30)
[1458681549.941080] kvs.debug[0]: stored lwj.1000.rank.7 (30)
[1458681549.941094] kvs.debug[0]: stored lwj.1000.rank.0 (30)
[1458681549.941120] kvs.debug[0]: stored lwj.1000.rank (403)
[1458681549.941163] kvs.debug[0]: stored lwj.1000 (862)
[1458681549.943659] kvs.debug[0]: stored lwj (70710)
[1458681549.948774] kvs.debug[0]: stored lwj.1000 (507)
[1458681549.950033] kvs.debug[0]: stored lwj (70355)
[1458681549.966247] kvs.debug[0]: stored lwj.1000.input.files (75)
[1458681549.966285] kvs.debug[0]: stored lwj.1000.input (163)
[1458681549.966330] kvs.debug[0]: stored lwj.1000 (622)
[1458681549.968816] kvs.debug[0]: stored lwj (70470)
[1458681549.984797] kvs.debug[0]: stored lwj.1000.log (71)
[1458681549.984860] kvs.debug[0]: stored lwj.1000 (600)
[1458681549.987167] kvs.debug[0]: stored lwj (70448)
[1458681549.994344] kvs.debug[0]: stored lwj.1000.0 (149)
[1458681549.994410] kvs.debug[0]: stored lwj.1000 (799)
[1458681549.996765] kvs.debug[0]: stored lwj (70647)
[1458681550.134885] kvs.debug[0]: stored lwj.1000.log (139)
[1458681550.134978] kvs.debug[0]: stored lwj.1000 (789)
[1458681550.137469] kvs.debug[0]: stored lwj (70637)
[1458681550.209648] kvs.debug[0]: stored lwj.1000.log (207)
[1458681550.209711] kvs.debug[0]: stored lwj.1000 (857)
[1458681550.211995] kvs.debug[0]: stored lwj (70705)
[1458681550.215526] kvs.debug[0]: stored lwj.1000.log (547)
[1458681550.215576] kvs.debug[0]: stored lwj.1000 (1197)
[1458681550.217928] kvs.debug[0]: stored lwj (71045)
[1458681550.304243] kvs.debug[0]: stored lwj.1000.0 (203)
[1458681550.304273] kvs.debug[0]: stored lwj.1000.4 (203)
[1458681550.304300] kvs.debug[0]: stored lwj.1000.2 (203)
[1458681550.304325] kvs.debug[0]: stored lwj.1000.7 (203)
[1458681550.304349] kvs.debug[0]: stored lwj.1000.3 (203)
[1458681550.304374] kvs.debug[0]: stored lwj.1000.6 (203)
[1458681550.304397] kvs.debug[0]: stored lwj.1000.5 (203)
[1458681550.304432] kvs.debug[0]: stored lwj.1000.1 (203)
[1458681550.304499] kvs.debug[0]: stored lwj.1000 (2414)
[1458681550.306733] kvs.debug[0]: stored lwj (72262)
[1458681550.349654] kvs.debug[0]: stored lwj.1000.0 (279)
[1458681550.349695] kvs.debug[0]: stored lwj.1000.4 (279)
[1458681550.349730] kvs.debug[0]: stored lwj.1000.2 (279)
[1458681550.349765] kvs.debug[0]: stored lwj.1000.7 (279)
[1458681550.349800] kvs.debug[0]: stored lwj.1000.3 (279)
[1458681550.349834] kvs.debug[0]: stored lwj.1000.6 (279)
[1458681550.349870] kvs.debug[0]: stored lwj.1000.5 (279)
[1458681550.349906] kvs.debug[0]: stored lwj.1000.1 (279)
[1458681550.349969] kvs.debug[0]: stored lwj.1000 (3022)
[1458681550.352218] kvs.debug[0]: stored lwj (72870)
[1458681550.356577] kvs.debug[0]: stored lwj.1000 (1225)
[1458681550.358913] kvs.debug[0]: stored lwj (71073)
[1458681550.428086] kvs.debug[0]: stored lwj.1000.0.stdout (45)
[1458681550.428121] kvs.debug[0]: stored lwj.1000.0.stderr (45)
[1458681550.428156] kvs.debug[0]: stored lwj.1000.0 (337)
[1458681550.428183] kvs.debug[0]: stored lwj.1000.4.stdout (45)
[1458681550.428200] kvs.debug[0]: stored lwj.1000.4.stderr (45)
[1458681550.428232] kvs.debug[0]: stored lwj.1000.4 (337)
[1458681550.428260] kvs.debug[0]: stored lwj.1000.2.stdout (45)
[1458681550.428277] kvs.debug[0]: stored lwj.1000.2.stderr (45)
[1458681550.428309] kvs.debug[0]: stored lwj.1000.2 (337)
[1458681550.428335] kvs.debug[0]: stored lwj.1000.7.stdout (45)
[1458681550.428352] kvs.debug[0]: stored lwj.1000.7.stderr (45)
[1458681550.428385] kvs.debug[0]: stored lwj.1000.7 (337)
[1458681550.428412] kvs.debug[0]: stored lwj.1000.3.stdout (45)
[1458681550.428429] kvs.debug[0]: stored lwj.1000.3.stderr (45)
[1458681550.428463] kvs.debug[0]: stored lwj.1000.3 (337)
[1458681550.428490] kvs.debug[0]: stored lwj.1000.6.stdout (45)
[1458681550.428506] kvs.debug[0]: stored lwj.1000.6.stderr (45)
[1458681550.428538] kvs.debug[0]: stored lwj.1000.6 (337)
[1458681550.428565] kvs.debug[0]: stored lwj.1000.5.stdout (45)
[1458681550.428582] kvs.debug[0]: stored lwj.1000.5.stderr (45)
[1458681550.428615] kvs.debug[0]: stored lwj.1000.5 (337)
[1458681550.428643] kvs.debug[0]: stored lwj.1000.1.stdout (45)
[1458681550.428660] kvs.debug[0]: stored lwj.1000.1.stderr (45)
[1458681550.428692] kvs.debug[0]: stored lwj.1000.1 (337)
[1458681550.428759] kvs.debug[0]: stored lwj.1000 (3537)
[1458681550.431032] kvs.debug[0]: stored lwj (73385)
[1458681550.441590] kvs.debug[0]: stored lwj.1000 (1279)
[1458681550.444348] kvs.debug[0]: stored lwj (71127)

garlick commented Mar 22, 2016

Running 100 8-node jobs without a backing store results in the following content-cache stats (size in bytes):

content-acct-entries                    7960
content-acct-size                       8019656

grondo commented Mar 22, 2016

Is that bytes or kbytes?
Are we tracking down content store usage in this issue, or investigating the error in the issue title? (Wondering if this is the correct place to muse on possible improvements to the lwj. directory contents.)

garlick commented Mar 22, 2016

Bytes.

I'll fix the topic of this issue. Let's keep the issue focused on the general problem of sqlite growth during a workload of many small jobs, and we can open a new issue if we identify a problem in the kvs or wreck that needs fixing.

garlick changed the title from "content-sqlite.err[0]: store: executing stmt: disk I/O error" to "sqlite db grows unreasonably large when running many small jobs" on Mar 22, 2016

grondo commented Mar 22, 2016

OK, thanks. I wasn't sure if you were pointing to lwj. directory size growth as a possible wreck issue.

garlick commented Mar 22, 2016

I did some brute-force tracing over 1000 8-rank jobs: 16,100 (±99) versions of the lwj directory object were written, totaling 545MB uncompressed. The size of the sqlite db was 373MB (compression enabled) or 568MB (compression disabled).

Since the lwj directory gets copied each time something is updated in the job's subdir (in this case about 16 times per job), and over time what's being copied grows proportionally to the number of past jobs, maybe we could cut the growth by simply copying completed jobs to a different directory?
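A rough back-of-envelope consistent with those numbers (the per-job figures are read off the traces above; the quadratic model is just my approximation):

# the lwj directory grows by ~70 bytes per job (it was ~70K at job 1000 above)
# and is rewritten ~16 times per job, so the lwj directory copies written to
# the content store after N jobs total roughly 16 * 70 * N^2 / 2 bytes
$ echo $(( 16 * 70 * 1000 * 1000 / 2 ))
560000000

That is ~560MB at N=1000, in the same ballpark as the 545MB measured above.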

I'd try a test but I'm not too sure who makes the final update to the job state or what changing the representation in the KVS would do to other entities monitoring the job state.

garlick commented Mar 22, 2016

Just tried a simple test.

diff --git a/src/test/soak-workload.sh b/src/test/soak-workload.sh
index a244f60..9485af3 100755
--- a/src/test/soak-workload.sh
+++ b/src/test/soak-workload.sh
@@ -14,4 +14,6 @@ for i in `seq 1 $njobs`; do
     broker_rss=$(flux comms-stats --rusage cmb --parse maxrss)
     content_k=$(du -sk $contentdir | cut -f1)
     flux logger --priority soak.info $i $elapsed $broker_rss $content_k
+
+    flux kvs unlink lwj.$(($i - 5)) || :
 done

After 1000 jobs the sqlite db was 35MB (compression enabled), so that's a nice order-of-magnitude improvement. It would be a bit less of an improvement with a "completed" directory growing during the test.

grondo commented Mar 23, 2016

Nice work! This makes sense now.

In the current form, the first wrexecd of the job sort of owns write access to the lwj state after the runrequest event is sent and the daemons are started.

However, you are correct that other read-only entities like flux-wreckrun or JSC are watching the lwj.<id>.state file specifically and thus we can't archive jobids right after their state reaches complete. A change in the interface for monitoring job state, IO, and other job information would be required to make the archiving seamless.

I'm going to ask a dumb question now, but would there be a way for us to mark directories in the kvs as read-only or "archived" (e.g. after job completion) and have them virtually archived, perhaps not even appearing in the directory listing but still accessible from the same path? (I know that is a dumb idea, but perhaps it's the start of a merely lame one?)

garlick commented Mar 23, 2016

One idea I had: if we could split large directories over several objects, then, provided we add to the directories in lexicographical order, I think json-c would serialize the earlier chunks the same each time and allow them to be squashed when the directory is updated. I already have code for splitting objects, though I think I was splitting at a megabyte or so, so there would still be considerable overhead (we only broke 70K on the lwj directory at 1000 jobs).

Another thought was to implement "union directories" where a large archive version of lwj and the smaller active version could be merged in a virtual lwj directory that would look like it does now.

Still another approach could be to allow attributes to be set on the namespace indicating "read-only"/archived (job completed; it can no longer be altered) or "volatile", and activate some behind-the-scenes optimizations based on that.

I'm sort of reaching here. Likely none of these fit the short-term goal of stabilizing the current implementation.

garlick commented Mar 23, 2016

Here's data from a longer overnight run with lwj data for completed jobs being removed:

[attached plot: soak-62762]

grondo commented Mar 23, 2016

Union dirs and read-only/archived entries are exactly what I was thinking of above.

In the short term, perhaps we could write a simple "job archive" module that watches the lwj directory and doesn't let it grow over a configurable size? It could also wait a configurable interval before archiving or deleting "old" jobids to give a buffer period for watchers to capture the last state of the job instead of getting an ENOENT.
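A polling approximation of that idea, just to make it concrete (the names, thresholds, and the assumption that flux kvs dir lwj lists one entry per line prefixed by its full key are all mine; a real module would use kvs_watch rather than a shell loop):

#!/bin/sh
# Prune lwj.<id> entries beyond a configurable count, oldest id first.
MAX_JOBS=1000      # let lwj hold at most this many job directories
INTERVAL=60        # sweep interval; doubles as a grace period for watchers
while sleep $INTERVAL; do
    flux kvs dir lwj 2>/dev/null \
        | sed -n 's/^lwj\.\([0-9][0-9]*\).*/\1/p' \
        | sort -n | head -n -$MAX_JOBS \
        | while read id; do flux kvs unlink lwj.$id || :; done
done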

Hm, that makes me wonder: if we replace an lwj.<id> directory with a soft link to its archived location, will the directory revisions still grow unbounded?

garlick commented Mar 23, 2016

The symlink idea seems like a good one. Since we want the directory containing the running jobs to be the small one (each job seems to be updated about 16 times, and that's with empty stdio), the symlinks would need to point from the lwj directory to a "running" directory while the job is running. Once the job completes, the final job state could replace the symlink and be removed from the running directory.

I think this would work transparently for those kvs_watching the lwj.jobid.state or stdio keys; watch caches the key and looks it up each time the root ref changes, sending a response to the watcher whenever the value changes. If there are symlinks in the path that change, the root ref will change and the paths will be walked from the root on each change.

In summary (a rough command-level sketch follows this list):

  • To create a job, create running.jobid and symlink lwj.jobid to running.jobid
  • Wrexecd updates state in running.jobid
  • Watchers watch lwj.jobid as before
  • On completion, wrexecd for local rank 0 recursively copies running.jobid over lwj.jobid and removes running.jobid.
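A rough command-level sketch of that lifecycle using the flux kvs CLI (the jobid, the state values, and the 'link TARGET LINKNAME' argument order are illustrative assumptions; in practice wrexecd and the job module would go through the KVS API):

# job creation: data lives under running.1234; lwj.1234 is just a symlink to it
$ flux kvs put running.1234.state=reserved
$ flux kvs link running.1234 lwj.1234

# while the job runs, only the small running.1234 directory is rewritten;
# watchers keep watching lwj.1234.state and the stdio keys through the symlink
$ flux kvs put running.1234.state=running

# on completion, replace the symlink with the final job data and drop the
# running copy (a recursive copy is needed here; see kvs_copy() below)
$ flux kvs put running.1234.state=complete
$ flux kvs unlink lwj.1234
$ flux kvs copy running.1234 lwj.1234
$ flux kvs unlink running.1234

Each per-job update then rewrites the small running directory instead of the ever-growing lwj directory.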

grondo commented Mar 23, 2016

Nice solution!

garlick commented Mar 23, 2016

I implemented kvs_copy(), a recursive KVS copy function, and added copy and move subcommands to flux-kvs here. That might make the above a bit easier to implement, if you don't mind having a go at it @grondo.
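For example (jobid illustrative), the archive step from the scheme above could presumably collapse to something like:

$ flux kvs move running.1234 lwj.1234    # recursive copy, then unlink the source

or flux kvs copy followed by flux kvs unlink where more control is needed.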

grondo commented Mar 23, 2016

Thanks for that! I'll experiment with converting the wreck stuff over to this scheme and see how it goes.

garlick commented Mar 30, 2016

I think this is more or less solved by #614, which was merged this morning.
