Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvs: provide mechanism to load KVS root from completed instance in subdir of another instance #1800

Merged
merged 11 commits into from Nov 6, 2018

Conversation

Projects
None yet
3 participants
@garlick
Copy link
Member

garlick commented Nov 4, 2018

This PR builds on @grondo's experiment in #777, which was further discussed recently in #1754:

  • If the persist-directory is set, save the final kvs root reference to kvsroot.final in that directory
  • Fix a bug in content-sqlite that was unconditionally unlinking the sqlite db file on exit, despite obvious intentions not to elsewhere in the code
  • Misc cleanup in content-sqlite (errno, exit on error, etc)
  • Add flux kvs put --treeobj key=- <value capability
  • Modify @grondo's sharness test to create previous root as a snapshot in current root using flux kvs put --treeobj

One caveat is that the shutdown timer could interrupt rc3 before the content cache is fully flushed to disk, in which case it's possible the snapshot will have dangling references. I would prefer to treat that as a separate issue if/when it comes up though.

@garlick garlick force-pushed the garlick:preserve_kvsroot branch from ce32ebd to f6c75af Nov 6, 2018

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Nov 6, 2018

Rebased on current master.

@grondo grondo referenced this pull request Nov 6, 2018

Merged

kvs: misc fixes & cleanup #1805

grondo and others added some commits Aug 20, 2016

content-sqlite: allow open of existing content file
Add "if not exists" to CREATE TABLE command when opening content
store to allow the open of an existing content store sqlite
file.
content-sqlite: do not error on existing persist dir
Do not issue a fatal error on mkdir() if persist directory
already exists.

This allows opening an existing persist directory.
testsuite: add test for KVS snapshot/content recovery
Ensure that sqlite file persists when persist-directory
attribute is set, that there is a kvsroot.final file,
and that a new instance can reuse the sqlite file
and restore the previous instance's final snapshot.

Co-authored-by: Mark Grondona <mark.grondona@gmail.com>
rc3: conditionally save kvsroot on shutdown
Problem: although the content backing store can be
made to persist after a Flux instance terminates
by setting 'persist-directory' or 'persist-filesystem'
broker attributes, there is currently no mechanism
to access KVS data in it because the KVS root is
not saved.

Add code to rc3 to save the root in a file called
'kvsroot.final' in the persist-directory, if set.

As discussed in #1754, one can start a flux instance
and make the old instance's final KVS snapshot available
in the new instance's KVS with a command like:

flux kvs put --treeobj \
    oldroot-snapshot=$(cat kvsroot.final)
cmd/flux-kvs: allow put --treeobj key=- for stdin
Problem: JSON RFC 11 objects create quoting challenges
as command line arguments, since they contain shell meta-
characters.  flux kvs put --treeobj only accepts values to
on the command line.

Add code to allow flux kvs put --treeobj to interpret a
value of "-" on the command line to mean "fetch value from
standard input".  flux kvs put --raw already supports this.
modules/content-sqlite: remove extra database unlink
Problem: even with persist-directory set, the sqlite
database file goes missing after instance exit.

There is an extra unlink(2) of the database file
in the freectx() destructor.  This is above and beyond
the "cleanup" atexit registration, which is conditional
on persist-directory.

Drop the extra unlink.
modules/sqlite: fix sqlite errno handling in initialization
Problem: sqlite errors are logged, but errno is set
to EINVAL when the getctx() fails and an errno is
returned to the system.

Rework the getctx() code so that sqlite errors
are translated.
modules/content-sqlite: cleanup log_sqlite_error
Problem: log_sqlite_error() calls xvasprintf to
contruct buffer.

This not only exits on ENOMEM, but is an unnecessary
malloc in an error critical path.

Rework to use a static buffer and vsnprintf().
modules/content-sqlite: avoid exit on ENOMEM
Problem: module uses xzmalloc(), xasprintf() which exit
when malloc fails.

Use calloc() and asprintf(), and handle the errors.

@garlick garlick force-pushed the garlick:preserve_kvsroot branch from f6c75af to 98710ca Nov 6, 2018

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Nov 6, 2018

Rebased on master.

@codecov-io

This comment has been minimized.

Copy link

codecov-io commented Nov 6, 2018

Codecov Report

Merging #1800 into master will increase coverage by 0.03%.
The diff coverage is 51.72%.

@@            Coverage Diff             @@
##           master    #1800      +/-   ##
==========================================
+ Coverage   79.62%   79.66%   +0.03%     
==========================================
  Files         186      186              
  Lines       34554    34550       -4     
==========================================
+ Hits        27515    27523       +8     
+ Misses       7039     7027      -12
Impacted Files Coverage Δ
src/modules/content-sqlite/content-sqlite.c 58.66% <43.47%> (+1.56%) ⬆️
src/cmd/flux-kvs.c 81.81% <83.33%> (+0.01%) ⬆️
src/common/libflux/mrpc.c 85.55% <0%> (-1.15%) ⬇️
src/cmd/flux-module.c 85.28% <0%> (-0.31%) ⬇️
src/bindings/lua/flux-lua.c 82.82% <0%> (+0.69%) ⬆️
@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Nov 6, 2018

Looks like diff coverage is low because log_sqlite_error() is never called (also, error handling code). As long as that is expected, it is probably fine to merge this PR now (increases overall coverage)

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Nov 6, 2018

Hmm, cleaning up functions in error paths with no test coverage. Maybe I should see if I can add a quick test.

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Nov 6, 2018

This is pretty annoying to test actually. I did manually verify that log_sqlite_error() does work after my changes, so if it's OK with you I'd say press the button and I'll queue up the next one.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Nov 6, 2018

if it's OK with you I'd say press the button and I'll queue up the next one

Yes, that is what I assumed. Sorry!

@grondo grondo merged commit 17cd32e into flux-framework:master Nov 6, 2018

2 of 3 checks passed

codecov/patch 51.72% of diff hit (target 79.62%)
Details
codecov/project 79.66% (+0.03%) compared to 1956120
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.