restructure commit handling code for correctness #788
Work in progress here.
This PR should address issue #776, in which commits were allowed to complete before the objects they reference had landed in the content store. The fix restores "causal consistency" and "read your writes".
The commit handling code was restructured to be "restartable" so that all RPCs to the content cache in the course of applying a commit are asynchronous, and the KVS can remain responsive while a commit is being applied.
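To illustrate what "restartable" could mean here, below is a minimal Python sketch, not the project's actual code. The `ContentStore`, `Commit`, and `Status` names are hypothetical and the store is a toy that only completes stores when `deliver()` is called. The point is that `process()` never blocks: it starts the asynchronous stores, reports a stall, and can simply be called again later.

```python
from enum import Enum, auto


class Status(Enum):
    STALLED = auto()   # waiting on the content store; call process() again later
    DONE = auto()      # every referenced object has landed


class ContentStore:
    """Toy asynchronous store (hypothetical): stores land only on deliver()."""

    def __init__(self):
        self.landed = set()
        self.in_flight = set()

    def store_async(self, ref):
        # Issue a store request without waiting for the response.
        if ref not in self.landed:
            self.in_flight.add(ref)

    def deliver(self):
        # Simulate the store responses arriving on a later loop iteration.
        self.landed |= self.in_flight
        self.in_flight.clear()


class Commit:
    def __init__(self, name, dirty_refs):
        self.name = name
        self.dirty_refs = set(dirty_refs)
        self.flush_started = False

    def process(self, store):
        """Restartable step: never blocks, safe to call repeatedly."""
        if not self.flush_started:
            for ref in self.dirty_refs:
                store.store_async(ref)
            self.flush_started = True
        # The commit may not complete until all of its objects have landed.
        if not self.dirty_refs <= store.landed:
            return Status.STALLED
        return Status.DONE
```

Between a `STALLED` return and the next call, the caller is free to service other requests, which is what keeps the KVS responsive in this model.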
As a side effect, all commits (not just fences) are now terminated with the setroot event rather than a separate response. I'm curious whether this helps performance for use cases like stdio, where lots of commits come in at once. The event was going out anyway; this just piggybacks the commit "name" on it.
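A rough Python sketch of finishing commits from the event instead of a per-commit response follows. The event layout (`rootref`, `names`) and the `PendingCommits` class are assumptions made for illustration, not the actual setroot payload or the KVS module's data structures.

```python
def make_setroot_event(rootref, commit_names):
    # Hypothetical payload: the new root reference plus the names of the
    # commits that were folded into it.
    return {"rootref": rootref, "names": list(commit_names)}


class PendingCommits:
    """Tracks commit requests waiting to be finalized by a setroot event."""

    def __init__(self):
        self.waiting = {}  # commit name -> completion callback

    def add(self, name, on_done):
        self.waiting[name] = on_done

    def handle_setroot(self, event):
        # One event can finalize many commits at once.
        finished = []
        for name in event["names"]:
            callback = self.waiting.pop(name, None)
            if callback is not None:
                callback(event["rootref"])
                finished.append(name)
        return finished
```

With many commits arriving at once, a single event finalizes all of them in this model, which is the possible saving the paragraph above is curious about.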
Commit coalescing is disabled at the moment.
Needs cleanup, tests, and performance exploration at scale.
Current coverage is 74.69% (diff: 83.17%)
@@             master    #788   diff @@
==========================================
  Files           145     145
  Lines         25040   24960    -80
  Methods           0       0
  Messages          0       0
  Branches          0       0
==========================================
- Hits          18723   18645    -78
+ Misses         6317    6315     -2
  Partials          0       0
I've re-added commit merging on rank 0 using a different algorithm.
Before: commits were timestamped, and a new commit was deferred if the previous commit occurred more recently than a minimum window. A timer was set, and queued commits were merged and applied as one. Commit requests were not cracked open until they were to be synchronously applied.
Now: commits are added to a queue. A prepare/check/idle watcher pattern is used to execute the commit code once per event loop iteration, allowing other work to progress perhaps more than before, including the work of enqueuing more commits. When the commit code runs, it gathers up all commits that are ready and commits them as one.
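The queue-and-merge idea can be sketched in Python as below. This is a simplified model under stated assumptions: `CommitQueue` and `run_once` are hypothetical names, every queued commit is treated as ready, and `run_once()` stands in for the callback that the prepare/check/idle watchers would trigger once per event loop iteration.

```python
from collections import deque


class CommitQueue:
    def __init__(self):
        self.pending = deque()  # (name, ops) in arrival order
        self.applied = []       # one (names, ops) entry per merged commit

    def enqueue(self, name, ops):
        # Cheap: runs whenever a commit request arrives.
        self.pending.append((name, list(ops)))

    def run_once(self):
        """Run once per loop iteration: merge everything queued so far."""
        if not self.pending:
            return None
        names, ops = [], []
        while self.pending:
            name, commit_ops = self.pending.popleft()
            names.append(name)
            ops.extend(commit_ops)  # keep arrival order of operations
        self.applied.append((names, ops))
        return names


# Three commits arrive during one loop iteration and are applied as one.
queue = CommitQueue()
queue.enqueue("c1", ["put a=1"])
queue.enqueue("c2", ["put b=2"])
queue.enqueue("c3", ["put a=3"])
merged = queue.run_once()
```

Unlike a timed window, nothing here waits on a clock: the amount of merging depends only on how many commits arrive between two runs of the commit code.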
Commit handling can now stall while reading from or writing to the content store. When it stalls, the event loop can run, allowing other work to progress. Further, if it stalls while reading objects needed to walk the namespace that is undergoing change, it can still have operations added to it. In other words, it remains eligible for merging partway through what used to be an atomic operation. Only when it advances to the writing-out state does it need to proceed atomically, and even then only because no heroic efforts were made to break it down further.
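The merge eligibility rule described above can be modeled as a small state machine. This Python sketch uses hypothetical names (`MergeableCommit`, the `"reading"`/`"writing"`/`"done"` states) and reduces the content store to a boolean; it only shows that merging is accepted while the commit is stalled in the reading phase and refused once writing has begun.

```python
class MergeableCommit:
    def __init__(self, name, ops):
        self.names = [name]
        self.ops = list(ops)
        self.state = "reading"  # "reading" -> "writing" -> "done"

    def merge(self, name, ops):
        # Operations can still be added while walking the namespace.
        if self.state != "reading":
            return False
        self.names.append(name)
        self.ops.extend(ops)
        return True

    def advance(self, objects_loaded):
        """One restartable step; returns the state after the step."""
        if self.state == "reading":
            if not objects_loaded:
                return self.state  # stalled; the event loop runs other work
            self.state = "writing"
        elif self.state == "writing":
            self.state = "done"    # modeled as a single atomic step
        return self.state
```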
I think with this algorithm, there is actually more opportunity for merging under load of many simultaneous commits than with the timed window. This is a function of the decreased priority of commit handling, the increased concurrency, and the ability to continue merging halfway through the commit processing. This potential improvement, along with the use of events to finalize commits instead of response messages, is balanced against the cost of the correctness fix (issue #776), where commits now have to wait until the dirty cache pages are flushed to the content store.
I do see significant merging in the logs when running a 512-node session on my desktop, running wreckrun hostname across all nodes, etc. I reran the scale testing up to 49136 tasks (on 2048 nodes of jade) and saw minor improvements in rc1 times. Mainly I was just happy not to have introduced a performance regression at scale there.
This branch might be interesting to try with some of the testing that was going on last week where the merge window tuning seemed to have hurt performance. In general, commit storms can now look a bit more like orderly fences.
I reran the test from #784 and this branch seems to be doing at least as well as before, perhaps slightly better!
So I think we can mark #784 fixed by this PR once merged.
It would be interesting to run even the simple throughput test with
Similar test with some I/O:
Current master 1e403d9