Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvs: support removal of a KVS namespace #1299

Merged
merged 11 commits into from Dec 20, 2017

Conversation

Projects
None yet
4 participants
@chu11
Copy link
Contributor

chu11 commented Dec 7, 2017

This PR adds the ability to remove a KVS namespace. The primary trickiness was determining how the removal of a namespace could affect asynchronous rpcs. e.g. what to do if a lookup began when a namespace was legal, but the namespace is removed before the lookup can complete.

Ultimately, I decided to return ENOTSUP in this situation. It is the same error code that would be returned if the user had specified an illegal namespace. So to rpc client, they can view this as a simple error that a namespace was illegal. The user wouldn't know (and it shouldn't matter for them to know) if when the namespace became illegal.

@chu11

This comment has been minimized.

Copy link
Contributor Author

chu11 commented Dec 7, 2017

hit a #731, restart builder

@chu11

This comment has been minimized.

Copy link
Contributor Author

chu11 commented Dec 7, 2017

Hmmm, haven't seen this build failure before

PASS: t0015-cron.t 17 - flux-cron event --min-interval works

No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.

Presumably the test flux-cron can set timeout on tasks is the one that failed. It has a short 0.1s timeout, perhaps that was hit?

@coveralls

This comment has been minimized.

Copy link

coveralls commented Dec 7, 2017

Coverage Status

Coverage increased (+0.008%) to 78.554% when pulling f517a51 on chu11:issue1197-part2 into 57e24fa on flux-framework:master.

@codecov-io

This comment has been minimized.

Copy link

codecov-io commented Dec 7, 2017

Codecov Report

❗️ No coverage uploaded for pull request base (master@e704a7f). Click here to learn what that means.
The diff coverage is 77.38%.

@@            Coverage Diff            @@
##             master    #1299   +/-   ##
=========================================
  Coverage          ?   78.27%           
=========================================
  Files             ?      154           
  Lines             ?    28100           
  Branches          ?        0           
=========================================
  Hits              ?    21996           
  Misses            ?     6104           
  Partials          ?        0
Impacted Files Coverage Δ
src/modules/kvs/commit.c 78.52% <100%> (ø)
src/common/libkvs/kvs.c 100% <100%> (ø)
src/modules/kvs/lookup.c 79.8% <100%> (ø)
src/modules/kvs/kvs.c 65.51% <72.93%> (ø)
src/cmd/flux-kvs.c 81.03% <84.61%> (ø)
@chu11

This comment has been minimized.

Copy link
Contributor Author

chu11 commented Dec 7, 2017

diff coverage is unsurprisingly not that great. Lots of error paths that aren't coverable.

I considered writing a stress test to try and catch several of the "the namespace disappeared on me" cases. But it's just be a ton of "create-namespace, do something that you know isn't cached, remove a namespace" over and over again trying to hopefully hit the race. But since there is no guarantee of hitting it in any particular run, I elected not to.

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Dec 7, 2017

Just a quick general comment on the semantics of remove:

It's hard to make any guarantees about when a namespace is "gone" in our design. In this PR, removal from ranks > 0 is based on an open loop event. Down the road we had talked about namespaces being kept in a temporal cache on ranks > 0. In other words, even after the remove RPC has finished, one may still be able to perform operations on the namespace for a brief time.

IMHO this is OK, and we just need to document it. For the use case we are planning for (jobs have their own namespace), we just need to ensure that namespace removal doesn't start until there is no possibility of the job continuing to use it.

If we accept that removal is "eventually consistent" then there is no need to ensure that operations (other than commit) racing with remove get an error.

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Dec 7, 2017

A watch on a namespace that is removed should get exactly one error response.

I didn't see offhand that responses were being generated for all the watchers in the watchlist upon namespace removal? That should happen, but then we should make sure that a stalled watch request cannot send another response after the first one has been sent...

@chu11

This comment has been minimized.

Copy link
Contributor Author

chu11 commented Dec 8, 2017

Doh! I realized there's a huge mistake in my PR. If a kvsroot is removed, the kvsroot object is freed. So we can't use that pointer anymore, wherever it may be stored.

Not too hard to fix/deal with on lookups. But the commit code is a little trickier. B/c on remove the commit manager for the namespace is also destroyed. Several ways to deal with this, but gotta figure out best way. This is trickier than I had thought it would be.

@chu11

This comment has been minimized.

Copy link
Contributor Author

chu11 commented Dec 8, 2017

And I also need to handle sending errors to commits that have been submitted, but have not yet begun processing b/c its in the commit queue. Oh boy ... I didn't think this all through

chu11 added some commits Dec 8, 2017

modules/kvs: Refactor lookup_create()
Pass in the name of the namespace for the lookup.  While not
used internally, it is critical to know the namespace a lookup is
attached to for other purposes, most notably on replays.

Adjust callers appropriately.
modules/kvs: Add namespace to internal commit API
Refactor commit_mgr_create() to pass in the name of the namespace
for the commit mgr.  While not used internally, it is critical to
know the namespace a commit is attached to for other purposes, most
notably on replays.

Add get function commit_get_namespace() to get a namespace.

Add appropriate unit tests.

Adjust callers appropriately.
modules/kvs: Refactor kvsroot with remove flag
In preparation for kvs remove namespace support, place remove flag
in kvsroot to indicate if kvsroot is marked for removal.

Add new function lookup_root_safe() which looks up root and checks
if a root is marked for removal.  For the time being, all code
assumes a root cannot be marked for removal.
modules/kvs: Add commit_mgr_iter_fences()
Add function to iterate through current fences stored in the
commit mgr.

Add new unit tests appropriately.

@chu11 chu11 force-pushed the chu11:issue1197-part2 branch from f517a51 to 3d9af86 Dec 16, 2017

@chu11

This comment has been minimized.

Copy link
Contributor Author

chu11 commented Dec 16, 2017

Just re-pushed. This is the new implementation of namespace remove. Instead of removing a namespace after a user has requested a remove, a namespace is instead marked for removal. In the background, cleanup occurs and the namespace is fully garbage collected in the heartbeat callback once the cleanup has finished.

The big patch is 445579c. There's some code I wish to make better, but I think this is good for the time being. For example, you can't call zhash_delete() while iterating through a hash. I think this could be abstracted away in some ways later on. I use a temporary "removelist" in several places to deal with this, which is sort of ugly.

A number of new tests in t1004-kvs-namespace.t are somewhat nuanced/detailed. Hopefully the comments makes sense. I know that I have a bajillion helper functions in t1004. That should be cleaned up in some way in the future (should share code with t1002-kvs-watch.t too).

@coveralls

This comment has been minimized.

Copy link

coveralls commented Dec 16, 2017

Coverage Status

Coverage decreased (-0.04%) to 78.524% when pulling 3d9af86 on chu11:issue1197-part2 into e704a7f on flux-framework:master.

@chu11

This comment has been minimized.

Copy link
Contributor Author

chu11 commented Dec 16, 2017

re-push, fixed a mem-leak and a logic bug in cleanup path travis found

@chu11

This comment has been minimized.

Copy link
Contributor Author

chu11 commented Dec 16, 2017

hmmm, one stalled build. Unsure. Restarting. Also restarting a build that hit #1025.

PASS: t0015-cron.t 21 - module load with sync

No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
@coveralls

This comment has been minimized.

Copy link

coveralls commented Dec 16, 2017

Coverage Status

Coverage increased (+0.07%) to 78.639% when pulling 72a2588 on chu11:issue1197-part2 into e704a7f on flux-framework:master.

@coveralls

This comment has been minimized.

Copy link

coveralls commented Dec 18, 2017

Coverage Status

Coverage increased (+0.04%) to 78.604% when pulling 80bc9d0 on chu11:issue1197-part2 into e704a7f on flux-framework:master.

@chu11

This comment has been minimized.

Copy link
Contributor Author

chu11 commented Dec 18, 2017

hmm, one build hung after PASS: t4000-issues-test-driver.t 4 - t0900-wreck-invalid-cores.sh. It's been seen before.

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Dec 19, 2017

I think this is OK to go in. Do you concur @chu11?

@chu11

This comment has been minimized.

Copy link
Contributor Author

chu11 commented Dec 20, 2017

yup, just let me squash.

chu11 added some commits Nov 28, 2017

modules/kvs: Support namespace removal
Support rpc to remove a namespace, marking the namespace for
removal, beginning cleanup process, and sending an event to all
other ranks to do the same.

The cleanup process includes notifying all watchers that the
namespace is being removed, all incomplete fences that the fences
will never complete, and all ready commits that are processing.
Final namespace remove occurs within the heartbeat callback when it
recognizes that all cleanup has completed in a namespace.

Catch and return errors in which a namespace has been removed after
a rpc began but before it finished (kvs.get, kvs.watch, kvs.fence).

Handle special cases where a namespace is in the process of being
removed (kvs.namespace.create, kvs.setroot events, kvs.error events).

Adjust some error messages to handle roots marked for removal.

Minor refactor along the way, commit mgr aux data reverted to kvs_ctx_t
from struct kvsroot.
common/libkvs/: Support flux_kvs_namespace_remove
In addition, add unit tests and manpage.
modules/kvs: Minor refactor kvsroot
Remove kvs_ctx_t from kvsroot struct, as it is no longer used.

chu11 added some commits Dec 14, 2017

t/kvs: Add new testfile fence_namespace_remove
File will specifically setup a fence with nprocs = 2, but
only call flux_kvs_fence() once.  That way the fence hangs
until an error occurs.

@chu11 chu11 force-pushed the chu11:issue1197-part2 branch from 80bc9d0 to 00c5fcc Dec 20, 2017

@chu11

This comment has been minimized.

Copy link
Contributor Author

chu11 commented Dec 20, 2017

hit a #731, restart build

@coveralls

This comment has been minimized.

Copy link

coveralls commented Dec 20, 2017

Coverage Status

Coverage increased (+0.02%) to 78.586% when pulling 00c5fcc on chu11:issue1197-part2 into e704a7f on flux-framework:master.

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Dec 20, 2017

Nice work @chu11 - merging.

@garlick garlick merged commit d29b0cd into flux-framework:master Dec 20, 2017

3 checks passed

buildbot/core_standard Build done.
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
coverage/coveralls Coverage increased (+0.02%) to 78.586%
Details

@grondo grondo referenced this pull request May 10, 2018

Closed

0.9.0 Release #1479

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.