Couchdb 3298 optimize writing btree nodes#512
|
Wrote this script: https://gist.github.com/nickva/2304bc2f738769c15fac116b8453ec04 to check view sizes for various combinations of parameters. The script was run like this: That roughly means "Write 1000 documents. Check emit sizes: 10 and 10k. Query after every insert and query at the end ( The script was run on master code and on the PR branch. These are the results:
This PR shows a dramatic (20-55%) improvement in disk usage without any detected regressions. Great work @davisp ! +1 |
|
I ran a separate test comparing view disk size between master and this PR. My setup is a single view index emitting a pseudo-random value of fixed size 32k with pre-generated sequential and (on a second run) random keys. I inserted 10000 documents one by one, triggering view reindexing after each insert to maximize the number of changes to the view's btree. Index sizes were taken after every 10 inserts as reported by The results are pretty impressive. I see a 129% disk size improvement for sequential keys and a 51% disk size improvement for random keys. Obviously this is an extreme setup designed to maximize "garbage" and emphasize the effect of this patch; in real life the numbers are going to be more modest, but this is still a great improvement. Here is the complete Jupyter notebook with results Out of curiosity I ran a btree analysis on the final view index for each case, to count the kp and kv nodes and the max and min length of the nodes per branch/leaf. Here are the results: Sequential keys: View's btree after the patch: Random keys: View's btree after the patch: I second Nick, this is great work! +1 |
|
For reference, cfcheck is a tool we wrote to analyze trees in databases and views, and I've just realized it's not open source, which is silly since it's just bits of couchdb extracted so we could run it while a server was online. Am now trying to figure out how many lawyers it'll take for me to open that up. The numbers have their obvious meaning. If this IBM process turns into a thing I'll make a gist that does the same counting in a remsh if anyone is curious. |
|
While I don't pretend to understand the details of the algorithm, I found the thorough description very helpful, especially the aside about how a btree update works. All eunit tests are passing for me. My only complaint is that I find |
|
Managed to not have to sacrifice a teddy bear to the legal department. https://github.com/cloudant-labs/cfcheck It's a fairly simple and obvious tool but it's there for anyone that wants to play with it. |
69d4821 to
fa38bd8
Compare
|
Rebased and amended the commit subject that was out of date. |
This reverts commit 8556adb.
As it turns out, the original change in COUCHDB-3298 ends up hurting
disk usage when a view emits large amounts of data (i.e., more than
half of the btree chunk size). The cause is that instead of writing
single-element nodes it prefers to write kv nodes with three
elements. While we might normally prefer this in memory, it turns out
that with our append-only storage this causes significantly more
garbage on disk.
We can show this with a few trivial examples. Imagine we write KVs a
through f. The two following patterns show the nodes as we write each
new kv. A trailing ' marks a node that has been written a second time.
Before 3298:
[]
[a]
[a, b]
[a, b]', [c]
[a, b]', [c, d]
[a, b]', [c, d]', [e]
[a, b]', [c, d]', [e, f]
After 3298:
[]
[a]
[a, b]
[a, b, c]
[a, b]', [c, d]
[a, b]', [c, d, e]
[a, b]', [c, d]', [e, f]
The thing to realize here is which of these nodes end up as garbage. In
the first example we end up with [a], [a, b], [c], [c, d], and [e] as
nodes that have been orphaned, whereas in the second case we end up
with [a], [a, b], [a, b, c], [c, d], and [c, d, e] as nodes that have
been orphaned. A quick aside: the reason that [a, b] and [c, d] are
orphaned is due to how a btree update works. For instance, when adding
c, we read [a, b] into memory, append c, and then during our node write
we call chunkify, which gives us back [a, b], [c]. That leads us to
write [a, b] a second time.
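The two patterns above can be reproduced with a toy model. This is only a sketch under stated assumptions: it counts KVs instead of bytes, handles only appends to the last leaf, and the names `chunkify_greedy`, `chunkify_balanced`, and `simulate` as well as the `cap` values are hypothetical illustrations, not the couch_btree code.

```python
def chunkify_greedy(kvs, cap=2):
    # Assumed stand-in for the "before 3298" behaviour: fill nodes to `cap` KVs.
    return [kvs[i:i + cap] for i in range(0, len(kvs), cap)]


def chunkify_balanced(kvs, cap=3):
    # Assumed stand-in for the "after 3298" behaviour: allow up to `cap` KVs,
    # then split the list into two halves.
    if len(kvs) <= cap:
        return [kvs]
    mid = len(kvs) // 2
    return [kvs[:mid], kvs[mid:]]


def simulate(keys, chunkify):
    """Append keys in order; return (live nodes, orphaned nodes)."""
    live, orphaned = [], []
    for key in keys:
        # Append-only storage: the last leaf is read, modified and rewritten,
        # so the copy already on disk becomes garbage.
        old = live.pop() if live else []
        if old:
            orphaned.append(old)
        live.extend(chunkify(old + [key]))
    return live, orphaned


if __name__ == "__main__":
    for name, fn in [("before 3298", chunkify_greedy),
                     ("after 3298", chunkify_balanced)]:
        live, orphaned = simulate(list("abcdef"), fn)
        garbage_kvs = sum(len(node) for node in orphaned)
        print(name, "orphaned:", orphaned, "garbage KVs:", garbage_kvs)
```

In this model both variants orphan five nodes, but the second one orphans larger nodes, which is where the extra garbage comes from.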
The main benefit of this patch is to realize when it's possible to reuse
a node that already exists on disk. It achieves this by looking at the
list of key/values when writing new nodes and comparing it to the old
list of key/values for the node read from disk. If the old list exists
unchanged in the new list, we can just reuse the old node. Node reuse
is limited to cases where the old node is larger than 50% of the chunk
threshold, to maintain the B+Tree properties.
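The reuse check described above might look roughly like the following. It is a minimal sketch, not the actual patch: `plan_write` and `chunkify` are hypothetical names, node size is measured in KV count rather than bytes, and `cap` stands in for the chunk threshold.

```python
def chunkify(kvs, cap):
    """Greedy split into chunks of at most `cap` KVs (toy stand-in for byte sizes)."""
    return [kvs[i:i + cap] for i in range(0, len(kvs), cap)]


def plan_write(old_kvs, new_kvs, cap=2):
    """Return a list of ("reuse" | "write", kvs) steps for one node update."""
    n = len(old_kvs)
    if n > cap / 2:  # only reuse nodes that are more than half full
        # Look for the old KV list as an unchanged contiguous run in the new list.
        for start in range(len(new_kvs) - n + 1):
            if new_kvs[start:start + n] == old_kvs:
                before, after = new_kvs[:start], new_kvs[start + n:]
                steps = [("write", c) for c in chunkify(before, cap)]
                steps.append(("reuse", old_kvs))  # keep the node already on disk
                steps += [("write", c) for c in chunkify(after, cap)]
                return steps
    # No reusable run found: write every chunk as a new node.
    return [("write", c) for c in chunkify(new_kvs, cap)]


if __name__ == "__main__":
    print(plan_write(["a", "b"], ["a", "b", "c"]))  # append after a full node
    print(plan_write(["a", "c"], ["a", "b", "c"]))  # insert into the middle
```

In this toy, appending c to a full [a, b] node reuses it and writes only [c], while inserting b into the middle of [a, c] changes the old run and forces a full rewrite, which mirrors why random keys benefit less.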
The disk usage improvements this gives can be quite dramatic. In the
case above, with ordered keys and large values (> 50% of the btree
chunk size), we find upwards of 50% less disk usage. Random keys also
benefit, though to a lesser extent depending on disk size (as they
will often land in the middle of an existing node, which prevents our
optimization).
COUCHDB-3298
fa38bd8 to
4296ec0
Compare

Testing recommendations
$ make check
Also, see the testing script that @nickva has written. I'll make him add details to this PR once it's opened.
JIRA issue number
COUCHDB-3298
Checklist
Note for documentation: this isn't user-visible behavior, though we will want to call out in our release notes that we'll use less disk space.