Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

restructure commit handling code for correctness #788

Merged
merged 21 commits into from Sep 4, 2016

Conversation

Projects
None yet
4 participants
@garlick
Copy link
Member

garlick commented Aug 31, 2016

Work in progress here.

This PR should address issue 776 in which commits were allowed to complete before the objects they reference had landed in the content store. This repairs "causal consistency" and "read your writes".

The commit handling code was restructured to be "restartable" so that all RPCs to the content cache in the course of applying a commit are asynchronous, and the KVS can remain responsive while a commit is being applied.

As a side effect, all commits (not just fences) are now terminated with the setroot event not a separate response. I'm curious if this helps performance for use cases like stdio where lots of commits come in at once. The event was going out anyway - this just piggybacks the commit "name" on it.

Commit coalescing is disabled at the moment.

Needs cleanup, tests, and performance exploration at scale.

garlick added some commits Aug 24, 2016

modules/kvs: restore commit consistency
Problem: KVS "causal consistency" and potentially "read your writes"
consistency are no longer guaranteed since kvs_commit() can return
before committed data has reached the content store.

Make all content store operations from the KVS rank 0 module
synchronous.

This is going to increase commit latency and render the KVS
unresponsive to other operations while a commit is in progress.
Address performance in a later commit.

Fixes #776
modules/libkvs: use fence to implement commit
Implement kvs_commit() as kvs_fence (name, 1), where name
is a uuid generated for the single commit.

This allows commits to complete in the same efficient way
as fences;  that is, piggybacked on the setroot event
resulting from the commit.

Add async interfaces: kvs_commit_begin() and kvs_commit_finish()
modules/kvs: drop unnecessary commit handling code
Drop separate handling for commit requests now that commit
is implemented in terms of fence.
modules/kvs: special case "handler restart" wait_t
Implement a new, generic wait_create() that takes a callback
and arg.  When the wait_t usecount reaches zero, the callback
is called.

The "message handler restart" use case, formerly the only
supported use case, is now a special case.  To underscore this,
the interfaces were renamed:

  wait_compare_f       => wait_test_msg_f
  wait_create()        => wait_create_msg_handler ()
  wait_destroy_match() => wait_destroy_msg ()

Also, drop the time argument from wait_destroy().  This was
only used to time kvs_get(), which is not all that useful.
Drop "get" and other unused counters from KVS statistics.
modules/kvs: stall commit completion on store
Make the commit process "stallable" so that the KVS module
can remain responsive when content store requests are in progress.
Try to get all content store requests for a commit in flight
at once, asynchronously, to avoid boxcar effect on RPCs.

When a fence/commit becomes ready, it is added to the ctx->ready queue.
Only one commit can be active at a time.  When one finishes, or when
one is enqueued and afterwards, the queue contains exactly one entry,
begin processing the commit at the head of the queue.

Make commit_apply_fence() idempotent, and then arrange for it
to be stalled/restarted when
 * the rootdir or any directories required to walk the (changing)
   name space are not in local cache
 * any new objects are still dirty with respect to the content cache

Commit coalescing is still not (re-)implemented.

@garlick garlick added the review label Aug 31, 2016

@codecov-io

This comment has been minimized.

Copy link

codecov-io commented Aug 31, 2016

Current coverage is 74.69% (diff: 83.17%)

Merging #788 into master will decrease coverage by 0.07%

@@             master       #788   diff @@
==========================================
  Files           145        145          
  Lines         25040      24960    -80   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
- Hits          18723      18645    -78   
+ Misses         6317       6315     -2   
  Partials          0          0          
Diff Coverage File Path
••••••• 73% src/modules/kvs/proto.c
•••••••• 81% src/modules/kvs/kvs.c
•••••••• 86% src/modules/kvs/libkvs.c
•••••••• 87% src/modules/kvs/waitqueue.c
•••••••••• 100% src/modules/kvs/cache.c

Powered by Codecov. Last update 1e403d9...89b6b89

@coveralls

This comment has been minimized.

Copy link

coveralls commented Aug 31, 2016

Coverage Status

Coverage decreased (-0.04%) to 75.032% when pulling e1a5cbf on garlick:kvs_issue_776 into 1e403d9 on flux-framework:master.

@garlick garlick changed the title restructure restructure commit handling code for correctness Aug 31, 2016

garlick added some commits Aug 31, 2016

modules/kvs: waitqueue docs, cleanup, testing
Add better explanation to waitqueue.h of how wait_t and
waitqueue_t interact.

Avoid assertion when NULL is passed to wait_destroy() or
wait_queue_destroy().

Allow a NULL msg argument to be passed to
wait_create_msg_handler().

Improve waitqueue.c test coverage.
modules/kvs: apply commit in check watcher
Instead of processing commits in the context of the commit request,
enqueue them and handle them after all other pending events have
been processed, using a prepare/check/idle watcher combination.

The prepare watcher runs at the end of the reactor loop, before
sleeping.  If commits are runnable, it starts the idle watcher.
If the idle watcher (no-op) can run, the reactor will not sleep.
The check watcher runs just after the reactor wakes up.
It stops the idle watcher and if commits are runable, starts one.
modules/kvs: send array of names in setroot event
Now that commits are uniquely named and completed via the setroot
event, instead of only one optional name in the setroot event,
send an array of them and require there to be at least one.

This prepares the way to reimplement commit aggregation.

Update proto test, and add diag output so encoded messages can be
scrutinized when the test is run manually.
@coveralls

This comment has been minimized.

Copy link

coveralls commented Aug 31, 2016

Coverage Status

Coverage decreased (-0.08%) to 74.994% when pulling 0026607 on garlick:kvs_issue_776 into 1e403d9 on flux-framework:master.

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Aug 31, 2016

I've re-added commit merging on rank 0 using a different algorithm.

Before: commits were timestamped, and a new commit was deferred if the previous commit occurred more recently than a minimum window. A timer was set, and queued commits were merged and applied as one. Commit requests were not cracked open until they were to be synchronously applied.

Now: commits are added to a queue. A prepare/check/idle watcher pattern is used to execute the commit code once per event loop iteration, allowing other work to progress perhaps more than before, including the work of enqueuing more commits. When the commit code runs, it gathers up all commits that are ready and commits them as one.

Now commit handling can stall reading/writing the content store. When it stalls, the event loop can run, allowing other work to progress. Further, if it stalls while reading objects needed to walk the namespace that is undergoing change, it can still have operations added to it. In other words it remains eligible for merging part way through what used to be an atomic operation. Only when it advances to the writing out state does it need to proceed atomically, and even then only because no heroic efforts were made to break it down further.

I think with this algorithm, there is actually more opportunity for merging under load of many simultaneous commits than with the timed window. This is a function of the decreased priority of commit handling, the increased concurrency, and the ability to continue merging half way through the commit processing. This potential improvement, along with the use of events to finalize commits instead of response messages, is balanced against the correctness fix (issue #776), where commits now have to wait until the dirty cache pages are flushed to the content store.

I do see significant merging in the logs when running a 512 node session on my desktop, running wreckrun hostname across all nodes, etc.. I reran the scale testing up to 49136 tasks (on 2048 nodes of jade) and saw minor improvements in rc1 times. Mainly I was just happy not to have introduced a performance regression at scale there.

This branch might be interesting to try with some of the testing that was going on last week where the merge window tuning seemed to have hurt performance. In general, commit storms can now look a bit more like orderly fences.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Sep 1, 2016

I reran the test from #784 and you seem to be doing at least as good, perhaps slightly better, than before!

$ time src/cmd/flux start -s 4 'for i in $(seq 1 100); do flux wreckrun -n 4 sleep 0; done'

real    0m9.726s
user    0m11.076s
sys 0m7.185s

So I think we can mark #784 fixed by this PR once merged.
Use of the check watcher to fire off commits is genius. Nice work!

It would be interesting to run even the simple throughput test with commit-merge=0 and with/without stdio from the job, but I think you've got a major improvement here...

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Sep 1, 2016

Similar test with some I/O:

Current master 1e403d9

$ time src/cmd/flux  start -s 4 'for i in $(seq 1 100); do flux wreckrun -n 4 -w complete cat /etc/passwd; done'

real    0m46.148s
user    0m49.946s
sys 0m28.978s

This PR

$ time src/cmd/flux  start -s 4 'for i in $(seq 1 100); do flux wreckrun -n 4 -w complete cat /etc/passwd; done'

real    0m21.343s
user    0m44.325s
sys 0m24.855s

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Sep 1, 2016

Thanks for running those tests @grondo, that's encouraging.

I'll try to get this cleaned up today to be considered for merge. I probably should run under valgrind memcheck to make sure I've not screwed up the json refcounting again.

garlick added some commits Aug 31, 2016

modules/kvs: reimplement commit aggregation
Merge commits when multiple commits are in the ready queue
when the "check" watcher runs, and the current commit has not
progressed to the point of unrolling the rootcpy.

When a merged commit completes, the setroot event name field,
changed to an array in a previous commit, contains >1 name.
Log this when it happens on rank 0 at LOG_DEBUG level.

Add a module parameter "commit-merge=0" used to disable commit
merging for debug/performance testing.
test/soak: fix fallout from logging changes
Problem: Changes in logging output format and
flux-logger command line options broke the soak test.

Allow for colons in the date string when converting
log entries to GNU plot data file.

Update the flux-logger command line option from
--priority name.sev to --severity=sev --appname=name
modules/kvs: avoid storing root object redundantly
Problem: all ranks of the KVS, in response to a setroot event,
would synchronously store the root object to the content store
as a side effect of placing it in the local cache.

Instead of calling the generic store() function, manipulate
the local cache directly, inserting the root object without
triggering a flush to the content store.
modules/kvs: add codec for error event
An error event is needed so that errors can be propagated to callers
when commit/fence fails.

Create codec for message type and add a unit test.

@garlick garlick force-pushed the garlick:kvs_issue_776 branch from 0026607 to c1447a7 Sep 1, 2016

@coveralls

This comment has been minimized.

Copy link

coveralls commented Sep 1, 2016

Coverage Status

Coverage decreased (-0.2%) to 74.92% when pulling c1447a7 on garlick:kvs_issue_776 into 1e403d9 on flux-framework:master.

modules/kvs: add commit/fence error path
Register a handler for the new kvs.error event, so that
fence requests can be finalized with an error code in the same
way that they were finalized with a success response by the
setroot event.

If the 'errnum' variable is set nonzero at any point during commit
handling on rank 0, the kvs.error event will be generated instead
of the kvs.setroot event, and the root blobref will not be updated.

Commits thus become "all or nothing".  If a commit/fence fails,
none of the operations contained within it were committed to
the namespace.

@garlick garlick force-pushed the garlick:kvs_issue_776 branch from c1447a7 to 4f533be Sep 1, 2016

@coveralls

This comment has been minimized.

Copy link

coveralls commented Sep 2, 2016

Coverage Status

Coverage decreased (-0.2%) to 74.879% when pulling 4f533be on garlick:kvs_issue_776 into 1e403d9 on flux-framework:master.

garlick added some commits Sep 1, 2016

modules/kvs: add error handling to store path
Propagate errors from store(), the function that updates the local
cache and flushes to the content store, back to the commit handling
code, so errors can be returned to the caller of kvs_commit(), via
the kvs.error event.

Note: errors that occur in the content store continuation,
if invoked asynchronously, are *not* propagated.  This case
still needs to be handled.  The error reported in issue #776,
where the content store request failed with EBUSY because the
handle ran out of matchtags, would be handled, since matchtags
are allocated in flux_rpc() call.

@garlick garlick force-pushed the garlick:kvs_issue_776 branch from 4f533be to 0cbeacc Sep 2, 2016

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Sep 2, 2016

This is one of those PR's that could go on and on with the cleanup, but since I think I've addressed the main problems identified in #766, I think it might be best to call it done and as for some review for merging.

@coveralls

This comment has been minimized.

Copy link

coveralls commented Sep 2, 2016

Coverage Status

Coverage decreased (-0.2%) to 74.895% when pulling 0cbeacc on garlick:kvs_issue_776 into 1e403d9 on flux-framework:master.

modules/kvs: don't allow kvs put of key .
Catch "." as a special case during commit processing.
commit_link_dirent() now returns 0 on success, -1 on failure.
If this case fails, it will generate a kvs.error event
and the caller of kvs_commit() or kvs_fence() will get
an EINVAL error.

Add a test case to sharness t1000-kvs-basic.t for this.
@coveralls

This comment has been minimized.

Copy link

coveralls commented Sep 2, 2016

Coverage Status

Coverage decreased (-0.06%) to 75.013% when pulling 89b6b89 on garlick:kvs_issue_776 into 1e403d9 on flux-framework:master.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Sep 2, 2016

Looks great. I'm going to poke the tires as bit, but otherwise no comments yet. Great improvement.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Sep 4, 2016

Ok, I did try out this PR for awhile and I have no further comments.
Merging!

@grondo grondo merged commit 90fb199 into flux-framework:master Sep 4, 2016

4 checks passed

codecov/patch 83.17% of diff hit (target 74.77%)
Details
codecov/project Absolute coverage decreased by -0.07% but relative coverage increased by +8.40% compared to 1e403d9
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
coverage/coveralls Coverage decreased (-0.06%) to 75.013%
Details

@grondo grondo removed the review label Sep 4, 2016

@garlick garlick deleted the garlick:kvs_issue_776 branch Sep 6, 2016

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Sep 6, 2016

Thanks!

garlick added a commit to garlick/flux-core that referenced this pull request Sep 13, 2016

modules/kvs: fix kvs.unwach regression
Problem: kvs.unwatch is not matching kvs.watch messages,
causing kvs "watch-unwatch" sharness test failure when
non-group tagpool was used in kvs.watch.

In the recently merged pr flux-framework#788, the kvs.watch protocol
changed.  The kvs.unwatch handler was partially decoding
queued kvs.watch messages in order to find them for removal,
and was not using decode functions in proto.c that were updated.

Solution: use the proto decode function.

@garlick garlick referenced this pull request Oct 26, 2016

Closed

0.5.0 release notes #879

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.