Tree Structure

Luwak stores its file data as an immutable hash tree. Therefore writing to this tree causes new subtrees to be generated rather than mutating the existing tree. This document describes how reading and writing from these trees will work.

This document assumes some properties will hold true for luwak trees:

Luwak tree nodes are several orders of magnitude smaller than tree leaves (data blocks) in terms of memory consumption
The probability of a hash conflict is so small that the occurrence of one must point to the existence of a malevolent personal deity.

Tree Writing

The interface for luwak allows for writing to the tree with an arbitrary offset and length, like a traditional file API. This algorithm tries to minimize the copying of existing data within the underlying key-value store. The reason we attempt to minimize copying is the assumption that higher fragmentation is preferable to the higher bandwidth consumption that would occur during large copies.

When luwak accepts a write it will generally come in one of two forms: either a write will be aligned on a block boundary or it will not be aligned on a block boundary. Luwak is optimized for the case when writes occur on block boundaries. On block aligned writes, a completely new block will be created. If the ending block of a write stream is not aligned on a block boundary, then the ending block of the stream is looked up and mutated, then this is stored as the ending block.

[  B1  ] [  B2  ] [  B3  ] [  B4  ]
    [       write 1    ]
fetch B1
write B1'
write W2
fetch B3
write B3'
[  B1' ] [  W2  ] [  B3' ] [  B4  ]

figure 1 - Start and end unaligned write

[  B1  ] [  B2  ] [  B3  ] [  B4  ]
         [    write 1    ]
write W1
write W2
[  B1  ] [  W1  ] [  W2  ] [  B4  ]

figure 2 - Start and end aligned write

[  B1  ] [  B2  ]
                  [    write 1    ]
write W1
write W2
[  B1  ] [  B2  ] [  W1  ] [  W2  ]

figure 3 - Appending write

After the blocks are written, the tree for the file is updated. The tree update at this point is handled in a similar manner to a normal b+ tree:

do a search to determine what bucket the new record should go in
if the bucket is not full, add the record.
otherwise, split the bucket.
allocate new leaf and move half the bucket's elements to the new bucket
insert the new leaf's smallest key and address into the parent.
if the parent is full, split it also
now add the middle key to the parent node
repeat until a parent is found that need not split
if the root splits, create a new root which has one key and two pointers

Tree Nodes

[ ...
{links, [{"node1", 50}, {"node2", 100}]}
...]

figure 4 - Example of a node

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

these_trees.mdown

these_trees.mdown

Tree Structure

Tree Writing

Tree Nodes

Files

these_trees.mdown

Latest commit

History

these_trees.mdown

File metadata and controls

Tree Structure

Tree Writing

Tree Nodes