Skip to content

Latest commit

 

History

History
66 lines (47 loc) · 2.81 KB

these_trees.mdown

File metadata and controls

66 lines (47 loc) · 2.81 KB

Tree Structure

Luwak stores its file data as an immutable hash tree. Therefore writing to this tree causes new subtrees to be generated rather than mutating the existing tree. This document describes how reading and writing from these trees will work.

This document assumes some properties will hold true for luwak trees:

  1. Luwak tree nodes are several orders of magnitude smaller than tree leaves (data blocks) in terms of memory consumption
  2. The probability of a hash conflict is so small that the occurrence of one must point to the existence of a malevolent personal deity.

Tree Writing

The interface for luwak allows for writing to the tree with an arbitrary offset and length, like a traditional file API. This algorithm tries to minimize the copying of existing data within the underlying key-value store. The reason we attempt to minimize copying is the assumption that higher fragmentation is preferable to the higher bandwidth consumption that would occur during large copies.

When luwak accepts a write it will generally come in one of two forms: either a write will be aligned on a block boundary or it will not be aligned on a block boundary. Luwak is optimized for the case when writes occur on block boundaries. On block aligned writes, a completely new block will be created. If the ending block of a write stream is not aligned on a block boundary, then the ending block of the stream is looked up and mutated, then this is stored as the ending block.

[  B1  ] [  B2  ] [  B3  ] [  B4  ]
    [       write 1    ]
fetch B1
write B1'
write W2
fetch B3
write B3'
[  B1' ] [  W2  ] [  B3' ] [  B4  ]
  • figure 1 - Start and end unaligned write
  • [  B1  ] [  B2  ] [  B3  ] [  B4  ]
             [    write 1    ]
    write W1
    write W2
    [  B1  ] [  W1  ] [  W2  ] [  B4  ]
    
  • figure 2 - Start and end aligned write
  • [  B1  ] [  B2  ]
                      [    write 1    ]
    write W1
    write W2
    [  B1  ] [  B2  ] [  W1  ] [  W2  ]
    
  • figure 3 - Appending write
  • After the blocks are written, the tree for the file is updated. The tree update at this point is handled in a similar manner to a normal b+ tree:

    • do a search to determine what bucket the new record should go in
    • if the bucket is not full, add the record.
    • otherwise, split the bucket.
    • allocate new leaf and move half the bucket's elements to the new bucket
    • insert the new leaf's smallest key and address into the parent.
    • if the parent is full, split it also
    • now add the middle key to the parent node
    • repeat until a parent is found that need not split
    • if the root splits, create a new root which has one key and two pointers

    Tree Nodes

    [ ...
    {links, [{"node1", 50}, {"node2", 100}]}
    ...]
    
  • figure 4 - Example of a node