Couchdb 3298 optimize writing btree nodes by davisp · Pull Request #512 · apache/couchdb

davisp · 2017-05-09T21:43:07Z

Overview

As it turns out, the original change in COUCHDB-3298 ends up hurting
disk usage when a view emits large amounts of data (i.e., more than
half of the btree chunk size). The cause is that instead of writing
single-element nodes it prefers to write kv nodes with three elements.
While normally we might prefer this in memory, it turns out that with
our append-only storage this causes significantly more trash on
disk.

We can show this with a few trivial examples. Imagine we write KVs a
through f. The two following patterns show the nodes as we write each
new kv. A trailing ' marks a node that was written a second time.

Before 3298:

[]
[a]
[a, b]
[a, b]', [c]
[a, b]', [c, d]
[a, b]', [c, d]', [e]
[a, b]', [c, d]', [e, f]

After 3298:

[]
[a]
[a, b]
[a, b, c]
[a, b]', [c, d]
[a, b]', [c, d, e]
[a, b]', [c, d]', [e, f]

The thing to realize here is which of these nodes end up as garbage. In
the first example we end up with [a], [a, b], [c], [c, d], and [e] nodes
that have been orphaned, whereas in the second case we end up with
[a], [a, b], [a, b, c], [c, d], and [c, d, e] as orphaned nodes. A quick
aside: the reason that [a, b] and [c, d] are orphaned is due to how a
btree update works. For instance, when adding c, we read [a, b] into
memory, append c, and then during our node write we call chunkify,
which gives us back [a, b], [c], which leads us to writing [a, b] a
second time.
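
To make the garbage accounting concrete, here is a small Python sketch (not code from the patch) that replays the two patterns above and lists the nodes that were written but are absent from the final tree. The helper names `orphaned` and `kv_count` are made up for illustration, and node labels are plain strings copied from the patterns.

```python
def orphaned(states):
    """Nodes written at some step that are not part of the final tree."""
    written = []
    previous = set()
    for state in states:
        for node in state:
            # A label that was not in the previous state is a fresh write.
            if node not in previous and node not in written:
                written.append(node)
        previous = set(state)
    live = set(states[-1])
    return [node for node in written if node not in live]


def kv_count(node):
    """Number of key/values in a node label such as "[a, b]'"."""
    return len(node.strip("[]'").split(", "))


BEFORE_3298 = [
    [],
    ["[a]"],
    ["[a, b]"],
    ["[a, b]'", "[c]"],
    ["[a, b]'", "[c, d]"],
    ["[a, b]'", "[c, d]'", "[e]"],
    ["[a, b]'", "[c, d]'", "[e, f]"],
]

AFTER_3298 = [
    [],
    ["[a]"],
    ["[a, b]"],
    ["[a, b, c]"],
    ["[a, b]'", "[c, d]"],
    ["[a, b]'", "[c, d, e]"],
    ["[a, b]'", "[c, d]'", "[e, f]"],
]

for name, states in (("before", BEFORE_3298), ("after", AFTER_3298)):
    garbage = orphaned(states)
    print(name, garbage, sum(kv_count(n) for n in garbage))
```

Both patterns orphan five nodes, but the orphans after 3298 hold 11 key/values in total versus 7 before, which is where the extra trash comes from when each value is large.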

The main benefit of this patch is to realize when it's possible to reuse
a node that already exists on disk. It achieves this by looking at the
list of key/values when writing new nodes and comparing it to the old
list of key/values for the node read from disk. By checking to see if
the old list exists unchanged in the new list we can just reuse the old
node. Node reuse is limited to when the old node is larger than 50% of
the chunk threshold to maintain the B+Tree properties.
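
The reuse check described above can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the Erlang code in the patch: the function name `find_reusable_node`, the `old_size` and `chunk_threshold` parameters, and the threshold value 1000 are all hypothetical.

```python
def find_reusable_node(old_kvs, new_kvs, old_size, chunk_threshold):
    """Return (prefix, suffix) around the old node if it can be reused.

    The old node is reusable when its key/value list appears unchanged
    (same items, same order, contiguous) inside the new list and the old
    node is larger than half of the chunk threshold. Returns None when
    the node has to be rewritten.
    """
    if not old_kvs or old_size <= chunk_threshold / 2:
        return None
    n = len(old_kvs)
    for start in range(len(new_kvs) - n + 1):
        if new_kvs[start:start + n] == old_kvs:
            # Only the KVs before and after the old node need new nodes.
            return new_kvs[:start], new_kvs[start + n:]
    return None


CHUNK_THRESHOLD = 1000  # hypothetical value, for illustration only

# Appending c to [a, b]: the old node survives, only [c] is new.
print(find_reusable_node(["a", "b"], ["a", "b", "c"], 900, CHUNK_THRESHOLD))

# Inserting b between a and c: the old list is no longer contiguous.
print(find_reusable_node(["a", "c"], ["a", "b", "c"], 900, CHUNK_THRESHOLD))

# A small old node is rewritten even though its KVs are unchanged.
print(find_reusable_node(["a"], ["a", "b"], 100, CHUNK_THRESHOLD))
```

The second call shows why random keys benefit less: an insert that lands in the middle of an existing node changes its list, so the node cannot be reused.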

The disk usage improvements this gives can also be quite dramatic. In
the case above when we have ordered keys with large values (> 50% of the
btree chunk size) we find upwards of 50% less disk usage. Random keys
also benefit, though to a lesser extent depending on disk size
(as they will often be in the middle of an existing node which prevents
our optimization).

Testing recommendations

$ make check

Also, see the testing script that @nickva has written. I'll make him add details to this PR once it's opened.

JIRA issue number

COUCHDB-3298

Checklist

Code is written and works correctly;
Changes are covered by tests;
Documentation reflects the changes;

Note for documentation: this isn't user-visible behavior, though we will want to call out in our release notes that we'll use less disk space.

nickva · 2017-05-09T22:30:47Z

Wrote this script:

https://gist.github.com/nickva/2304bc2f738769c15fac116b8453ec04

To check view sizes for various combinations of parameters. The script was run like this:

 ./viewsize.py -n 1000 -s 10 -s 10000 -q true -q false -x true -x false -m 'function(d){emit(1,1);}' -m 'function(d){emit("K"+d._id,d.v);}'

That roughly means "Write 1000 documents. Check emit sizes: 10 and 10k. Query after every insert and query at the end (-q). Use random and then sequential keys (-x). Emit the document ids and generated value, and also try a trivial example where key = 1 and value = 1 (-m)". The script will generate all combinations of those parameters automatically.
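
For a sense of how many runs that command line expands to, here is a rough Python sketch of the combination step. It is not the gist's actual code; the variable names are made up and the flag values are copied from the invocation above.

```python
from itertools import product

# Values passed on the command line above; each flag was given twice.
sizes = [10, 10000]                      # -s
query_modes = ["true", "false"]          # -q
key_modes = ["true", "false"]            # -x
map_functions = [                        # -m
    'function(d){emit(1,1);}',
    'function(d){emit("K"+d._id,d.v);}',
]

# Cartesian product of all parameter values: one tuple per test run.
combinations = list(product(sizes, query_modes, key_modes, map_functions))

for size, q, x, map_fun in combinations:
    print(size, q, x, map_fun)

print(len(combinations), "combinations")
```

With two values for each of the four flags this gives 16 runs per branch.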

Script was run on master code and on the PR branch. These are the results:

The PR and MASTER columns show file sizes in bytes as reported by the views' _info endpoint. Regressions, if there were any, would have been shown in red in the percentage column.

This PR shows a dramatic (20-55%) improvement in disk usage without any detected regressions.

Great work @davisp !

+1

eiri · 2017-05-10T16:22:34Z

I ran a separate test comparing view disk size between master and this PR. My setup is a single view index emitting a pseudo-random value of fixed size 32k with pre-generated sequential and (on a second run) random keys. I inserted 10000 documents one by one, triggering view reindexing after each insert to maximize the number of changes to the view's btree. Index sizes were taken after every 10 inserts, as reported by the _info endpoint.

The results are pretty impressive. I see a 129% disk size improvement for sequential keys and a 51% disk size improvement for random keys. Obviously this is an extreme setup designed to maximize "garbage" and emphasize the effect of this patch; in real life the numbers are going to be more modest, but this is still a great improvement. Here is the complete Jupyter notebook with results.

Out of curiosity I ran a btree analysis on the final view index for each case to count the kp and kv nodes and find the max and min node lengths per branch/leaf. Here are the results:

Sequential keys:
View's btree before:
{ "depth": 4, "kp_nodes": { "count": 209, "min": 8, "max": 25 }, "kv_nodes": { "count": 5000, "min": 2, "max": 25 } }

View's btree after the patch:
{ "depth": 5, "kp_nodes": { "count": 419, "min": 3, "max": 13 }, "kv_nodes": { "count": 5001, "min": 1, "max": 13 } }

Random keys:
View's btree before:
{ "depth": 6, "kp_nodes": { "count": 865, "min": 2, "max": 18 }, "kv_nodes": { "count": 4286, "min": 2, "max": 18 } }

View's btree after the patch:
{ "depth": 5, "kp_nodes": { "count": 833, "min": 1, "max": 16 }, "kv_nodes": { "count": 5480, "min": 1, "max": 16 } }
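
For anyone curious how numbers of this shape can be produced, below is a minimal Python sketch that walks an in-memory tree and collects the same fields. It is an assumption-laden illustration, not how cfcheck works: the `("kp", children)` / `("kv", items)` tuple layout and the `tree_stats` name are invented for this example, and depth is counted as the number of levels from root to leaf.

```python
def tree_stats(node, depth=1, stats=None):
    """Collect depth plus count/min/max node lengths for kp and kv nodes.

    A node is a tuple: ("kp", [child_node, ...]) for a branch or
    ("kv", [(key, value), ...]) for a leaf.
    """
    if stats is None:
        stats = {
            "depth": 0,
            "kp_nodes": {"count": 0, "min": None, "max": None},
            "kv_nodes": {"count": 0, "min": None, "max": None},
        }
    kind, items = node
    bucket = stats[kind + "_nodes"]
    length = len(items)
    bucket["count"] += 1
    bucket["min"] = length if bucket["min"] is None else min(bucket["min"], length)
    bucket["max"] = length if bucket["max"] is None else max(bucket["max"], length)
    stats["depth"] = max(stats["depth"], depth)
    if kind == "kp":
        # Recurse into every child of a branch node.
        for child in items:
            tree_stats(child, depth + 1, stats)
    return stats


# Tiny example: one branch node pointing at two leaves.
example = ("kp", [
    ("kv", [("a", 1), ("b", 2)]),
    ("kv", [("c", 3)]),
])
print(tree_stats(example))
```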

I second Nick, this is great work!

+1

davisp · 2017-05-10T16:38:53Z

For reference, cfcheck is a tool we wrote to analyze trees in databases and views, and I've just realized it's not open source, which is silly since it's just bits of couchdb extracted so we could run it while a server was online. Am now trying to figure out how many lawyers it'll take for me to open that up.

The numbers have their obvious meaning. If this IBM process turns into a thing, I'll make a gist that does the same counting in a remsh if anyone is curious.

jaydoane · 2017-05-11T06:23:22Z

While I don't pretend to understand the details of the algorithm, I found the thorough description very helpful, especially the aside about how a btree update works. All eunit tests are passing for me. My only complaint is that I find if statements to be less easy to read than case statements.

davisp · 2017-05-11T15:03:06Z

Managed to not have to sacrifice a teddy bear to the legal department.

https://github.com/cloudant-labs/cfcheck

It's a fairly simple and obvious tool, but it's there for anyone who wants to play with it.

davisp · 2017-05-12T16:06:50Z

Rebased and amended the commit subject that was out of date.


davisp mentioned this pull request May 9, 2017

Opimize writing KV node append writes #504

Closed

3 tasks

davisp force-pushed the COUCHDB-3298-optimize-writing-btree-nodes branch from 69d4821 to fa38bd8 Compare May 12, 2017 16:05

wohali added dbcore enhancement labels May 16, 2017

davisp added 2 commits June 8, 2017 12:44

Revert "Make couch_btree:chunkify/1 prefer fewer chunks"

4d192ec

This reverts commit 8556adb.

davisp force-pushed the COUCHDB-3298-optimize-writing-btree-nodes branch from fa38bd8 to 4296ec0 Compare June 8, 2017 17:45

davisp merged commit 07fa508 into master Jun 8, 2017

davisp deleted the COUCHDB-3298-optimize-writing-btree-nodes branch June 9, 2017 13:59

nickva pushed a commit to nickva/couchdb that referenced this pull request Sep 7, 2022

Bump ver (apache#527)

042bbd4

* bump version to 3.0.1 * Disable 'Edit on Github' links, fixes apache#512
