Couchdb 3298 optimize writing btree nodes #512

Merged: davisp merged 2 commits into master from COUCHDB-3298-optimize-writing-btree-nodes on Jun 8, 2017

@davisp
Member

@davisp davisp commented May 9, 2017

Overview

As it turns out, the original change in COUCHDB-3298 ends up hurting
disk usage when a view emits large amounts of data (i.e., more than
half of the btree chunk size). The cause for this is that instead of
writing single element nodes it would instead prefer to write kv nodes
with three elements. While normally we might prefer this in memory, it
turns out that with our append-only storage this causes a significantly
larger amount of trash on disk.

We can show this with a few trivial examples. Imagine we write KVs a
through f. The two following patterns show the nodes as we write each
new kv.

Before 3298:

[]
[a]
[a, b]
[a, b]', [c]
[a, b]', [c, d]
[a, b]', [c, d]', [e]
[a, b]', [c, d]', [e, f]

After 3298:

[]
[a]
[a, b]
[a, b, c]
[a, b]', [c, d]
[a, b]', [c, d, e]
[a, b]', [c, d]', [e, f]

The thing to realize here is which of these nodes end up as garbage. In
the first example we end up with [a], [a, b], [c], [c, d], and [e] nodes
that have been orphaned, whereas in the second case we end up with
[a], [a, b], [a, b, c], [c, d], and [c, d, e] as nodes that have been
orphaned. A quick aside: the reason that [a, b] and [c, d] are orphaned
is due to how a btree update works. For instance, when adding c, we read
[a, b] into memory, append c, and then during our node write we call
chunkify, which gives us back [a, b], [c], which leads us to writing
[a, b] a second time.
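
The orphan lists above can be rechecked mechanically. The following is a minimal Python sketch, not CouchDB code: it copies the two traces as lists of node labels (a trailing ' marks a node that was written again) and assumes that any node that is new in a state was appended to the file at that step. The names `BEFORE`, `AFTER`, and `orphaned` are made up for this illustration.

```python
# Leaf-node states copied from the two traces; a trailing ' marks a rewrite.
BEFORE = [
    [], ["[a]"], ["[a, b]"],
    ["[a, b]'", "[c]"],
    ["[a, b]'", "[c, d]"],
    ["[a, b]'", "[c, d]'", "[e]"],
    ["[a, b]'", "[c, d]'", "[e, f]"],
]
AFTER = [
    [], ["[a]"], ["[a, b]"], ["[a, b, c]"],
    ["[a, b]'", "[c, d]"],
    ["[a, b]'", "[c, d, e]"],
    ["[a, b]'", "[c, d]'", "[e, f]"],
]


def orphaned(states):
    """Return every node that was written but is not in the final tree."""
    written = []
    for prev, curr in zip(states, states[1:]):
        # Assumption: a node that is new in this state was appended to disk.
        written.extend(node for node in curr if node not in prev)
    live = states[-1]
    return [node for node in written if node not in live]


print(orphaned(BEFORE))
print(orphaned(AFTER))
```

Both traces orphan five nodes, but the second one orphans the three-element nodes [a, b, c] and [c, d, e] where the first orphans the single-element nodes [c] and [e], which is where the extra garbage comes from when values are large.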

The main benefit of this patch is to realize when it's possible to reuse
a node that already exists on disk. It achieves this by looking at the
list of key/values when writing new nodes and comparing it to the old
list of key/values for the node read from disk. By checking to see if
the old list exists unchanged in the new list we can just reuse the old
node. Node reuse is limited to when the old node is larger than 50% of
the chunk threshold to maintain the B+Tree properties.
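
To illustrate the reuse idea, here is a toy Python simulation. It is a sketch under simplifying assumptions and not the couch_btree implementation: there is a single level of leaf nodes, keys always arrive in order, the chunk threshold is a count of keys instead of a byte size, and `simulate`, `capacity`, and `reuse` are hypothetical names.

```python
def simulate(keys, capacity=2, reuse=False):
    """Append ordered keys to the rightmost leaf of an append-only file.

    Returns (live, orphaned): the nodes in the final tree and the nodes
    that were written but are no longer referenced.
    """
    written = []  # every node ever appended to the file, in write order
    live = []     # indexes into `written` for nodes currently in the tree
    for key in keys:
        # Read the rightmost leaf (if any) and add the new key in memory.
        old_idx = live.pop() if live else None
        old = written[old_idx] if old_idx is not None else ()
        kvs = old + (key,)
        # Stand-in for chunkify: split into chunks of at most `capacity`.
        chunks = [kvs[i:i + capacity] for i in range(0, len(kvs), capacity)]
        for chunk in chunks:
            # Reuse the on-disk node when a chunk is exactly the old node
            # and the old node is more than half of the threshold.
            if reuse and chunk == old and 2 * len(old) > capacity:
                live.append(old_idx)
            else:
                written.append(chunk)
                live.append(len(written) - 1)
    orphaned = [n for i, n in enumerate(written) if i not in live]
    return [written[i] for i in live], orphaned


for flag in (False, True):
    live, orphaned = simulate("abcdef", reuse=flag)
    print("reuse" if flag else "no reuse", "live:", live, "orphaned:", orphaned)
```

In this toy model, writing a through f without reuse orphans [a], [a, b], [c], [c, d], and [e], while with reuse only the single-element nodes [a], [c], and [e] are orphaned, because the full nodes are never written a second time.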

The disk usage improvements this gives can also be quite dramatic. In
the case above, when we have ordered keys with large values (> 50% of the
btree chunk size), we find upwards of 50% less disk usage. Random keys
benefit as well, though to a lesser extent depending on disk size
(as they will often be in the middle of an existing node, which prevents
our optimization).

Testing recommendations

$ make check

Also, see the script @nickva has written for testing. I'll make him add details to this PR once it's opened.

JIRA issue number

COUCHDB-3298

Checklist

  • Code is written and works correctly;
  • Changes are covered by tests;
  • Documentation reflects the changes;

Note for documentation: this isn't user-visible behavior, though we will want to call out in our release notes that we'll use less disk space.

@davisp davisp mentioned this pull request May 9, 2017
3 tasks
@nickva
Contributor

nickva commented May 9, 2017

Wrote this script:

https://gist.github.com/nickva/2304bc2f738769c15fac116b8453ec04

To check view sizes for various combinations of parameters. The script was run like this:

 ./viewsize.py -n 1000 -s 10 -s 10000 -q true -q false -x true -x false -m 'function(d){emit(1,1);}' -m 'function(d){emit("K"+d._id,d.v);}'

That roughly means "Write 1000 documents. Check emit sizes: 10 and 10k. Query after every insert and query at the end (-q). Use random and then sequential keys (-x). Emit the document ids and generated value, and also try a trivial example where key = 1 and value = 1 (-m)". The script will generate all combinations of those parameters automatically.
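
For a sense of how many runs that command line implies, the snippet below enumerates the combinations with `itertools.product`. It is only an illustration using the flag values from the command above; the variable names are made up, and it does not claim to show how viewsize.py builds its runs internally.

```python
from itertools import product

# Values copied from the command line above; variable names are hypothetical.
sizes = [10, 10000]        # -s
query_flags = ["true", "false"]   # -q
key_flags = ["true", "false"]     # -x
map_funs = [                      # -m
    'function(d){emit(1,1);}',
    'function(d){emit("K"+d._id,d.v);}',
]

combos = list(product(sizes, query_flags, key_flags, map_funs))
print(len(combos))  # 2 * 2 * 2 * 2 = 16 parameter combinations
for size, q, x, map_fun in combos:
    print(size, q, x, map_fun)
```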

Script was run on master code and on the PR branch. These are the results:

[image: pr_vs_master]

The PR and MASTER columns show file sizes in bytes as reported by the views' _info endpoint. Regressions, if there were any, would have been shown in red in the percentage column.

This PR shows a dramatic (20-55%) improvement in disk usage without any detected regressions.

Great work @davisp !

+1

@eiri
Member

eiri commented May 10, 2017

I ran a separate test comparing view disk size between master and this PR. My setup is a single view index emitting a pseudo-random value of fixed size 32k with pre-generated sequential and (on a second run) random keys. I inserted 10000 documents one by one, triggering view reindexing after each insert to maximize the number of changes to the view's btree. Index sizes were taken after every 10 inserts as reported by the _info endpoint.

The results are pretty impressive. I see a 129% disk size improvement for sequential keys and a 51% disk size improvement for random keys. Obviously this is an extreme setup meant to maximize "garbage" and emphasize the effect of this patch; in real life the numbers are going to be more modest, but this is still a great improvement. Here is the complete Jupyter notebook with results.

Out of curiosity I ran a btree analysis on the final view index for each case to calculate the number of kp and kv nodes and the max and min length of the nodes per branch/leaf. Here are the results:

Sequential keys:
View's btree before:
{ "depth": 4, "kp_nodes": { "count": 209, "min": 8, "max": 25 }, "kv_nodes": { "count": 5000, "min": 2, "max": 25 } }

View's btree after the patch:
{ "depth": 5, "kp_nodes": { "count": 419, "min": 3, "max": 13 }, "kv_nodes": { "count": 5001, "min": 1, "max": 13 } }

Random keys:
View's btree before:
{ "depth": 6, "kp_nodes": { "count": 865, "min": 2, "max": 18 }, "kv_nodes": { "count": 4286, "min": 2, "max": 18 } }

View's btree after the patch:
{ "depth": 5, "kp_nodes": { "count": 833, "min": 1, "max": 16 }, "kv_nodes": { "count": 5480, "min": 1, "max": 16 } }

I second Nick, this is great work!

+1

@davisp
Member Author

davisp commented May 10, 2017

For reference, cfcheck is a tool we wrote to analyze trees in databases and views, and I've just realized it's not open source, which is silly since it's just bits of couchdb extracted so we could run it while a server was online. Am now trying to figure out how many lawyers it'll take for me to open that up.

The numbers have their obvious meaning. If this IBM process turns into a thing I'll make a gist that does the same counting in a remsh if anyone is curious.

@jaydoane
Contributor

While I don't pretend to understand the details of the algorithm, I found the thorough description very helpful, especially the aside about how a btree update works. All eunit tests are passing for me. My only complaint is that I find if statements to be less easy to read than case statements.

@davisp
Member Author

davisp commented May 11, 2017

Managed to not have to sacrifice a teddy bear to the legal department.

https://github.com/cloudant-labs/cfcheck

It's a fairly simple and obvious tool, but it's there for anyone that wants to play with it.

@davisp davisp force-pushed the COUCHDB-3298-optimize-writing-btree-nodes branch from 69d4821 to fa38bd8 Compare May 12, 2017 16:05
@davisp
Member Author

davisp commented May 12, 2017

Rebased and amended the commit subject that was out of date.

davisp added 2 commits June 8, 2017 12:44
@davisp davisp force-pushed the COUCHDB-3298-optimize-writing-btree-nodes branch from fa38bd8 to 4296ec0 Compare June 8, 2017 17:45
@davisp davisp merged commit 07fa508 into master Jun 8, 2017
@davisp davisp deleted the COUCHDB-3298-optimize-writing-btree-nodes branch June 9, 2017 13:59
nickva pushed a commit to nickva/couchdb that referenced this pull request Sep 7, 2022
* bump version to 3.0.1

* Disable 'Edit on Github' links, fixes apache#512