
mv overview


Summary

This wiki entry contains notes concerning internal structures and functions within Google's leveldb. The notes are not authoritative. They are based upon impressions and research.

Unless stated otherwise, all notes refer to a single database. A leveldb database is equivalent to a Riak vnode. Riak often has 8 to 128 leveldb databases open simultaneously.

Key space

leveldb is by definition a key/value store. It has to track every key in a manner that allows any key to be retrieved at any time in the future. leveldb accesses a requested key via a logical search tree. This search tree has four layers.

The first layer is the "manifest". A database may have zero to tens of thousands of table files. Every table file has an entry in the manifest. The manifest entry tracks the first and last key contained in each table file. The manifest keeps the table file entries in one of seven sorted arrays. Each of the seven arrays represents one "level" of table files. A user request for a key causes leveldb to check each table file whose key range overlaps the target key. leveldb searches the candidate table files, level by level, until it finds the first one that yields an exact match for the requested key.

The second layer of the search tree is a table file's block index of keys. The block index is part of the table file's metadata which is located at the end of its physical data file. The index contains one entry for each logical data block within the table file. The entry contains the last key in the block and the offset of the block within the table file. leveldb performs a binary search of the block index to locate a candidate data block. It reads the candidate data block from the table file.

The third layer is the restart index that is part of each data block. The restart index is located at the end of the data block. The restart index contains a subset of keys contained in the block. Each key in the subset includes an offset for where it is located within the data block. leveldb performs a binary search of the restart index to locate the nearest key to the user's requested key.

The fourth layer is simply a list of key/values in sorted order within a data block. leveldb now performs a linear search of the key/values, starting at the offset given by the restart index, to see if the user's requested key exists.
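The following sketch mirrors those four layers using plain C++ containers. It is purely conceptual: the structure and function names are invented for this page, and it ignores details such as prefix compression within blocks and the newest-first ordering of overlapping level 0 files. It is not leveldb source code.

```cpp
// Conceptual sketch of the four search layers described above. All names
// (BlockSketch, TableSketch, GetFromLevels, ...) are invented for illustration.
#include <algorithm>
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Layers 3 and 4: a data block holds key/values in sorted order plus a
// "restart index" of offsets into that sorted list.
struct BlockSketch {
  std::vector<std::pair<std::string, std::string>> entries;  // sorted by key
  std::vector<size_t> restarts;  // offsets of the restart keys within entries
};

// Layer 2: a table file holds data blocks plus a block index that records
// the last key of each block. Layer 1 needs the first and last key overall.
struct TableSketch {
  std::vector<std::string> last_keys;   // block index: last key per block
  std::vector<BlockSketch> blocks;
  std::string smallest, largest;        // what the manifest tracks per file
};

std::optional<std::string> GetFromTable(const TableSketch& t, const std::string& key) {
  // Layer 2: binary search the block index for the first block whose
  // last key is >= the requested key.
  auto it = std::lower_bound(t.last_keys.begin(), t.last_keys.end(), key);
  if (it == t.last_keys.end()) return std::nullopt;
  const BlockSketch& b = t.blocks[it - t.last_keys.begin()];

  // Layer 3: binary search the restart index for the nearest restart key
  // that is <= the requested key.
  size_t start = 0;
  size_t lo = 0, hi = b.restarts.size();
  while (lo < hi) {
    size_t mid = (lo + hi) / 2;
    if (b.entries[b.restarts[mid]].first <= key) { start = b.restarts[mid]; lo = mid + 1; }
    else hi = mid;
  }

  // Layer 4: linear scan forward from the restart point.
  for (size_t i = start; i < b.entries.size() && b.entries[i].first <= key; ++i) {
    if (b.entries[i].first == key) return b.entries[i].second;
  }
  return std::nullopt;
}

// Layer 1: check each level's tables whose [smallest, largest] range overlaps
// the requested key, stopping at the first exact match.
std::optional<std::string> GetFromLevels(
    const std::vector<std::vector<TableSketch>>& levels, const std::string& key) {
  for (const auto& level : levels) {
    for (const auto& table : level) {
      if (key < table.smallest || key > table.largest) continue;
      if (auto v = GetFromTable(table, key)) return v;
    }
  }
  return std::nullopt;
}
```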

leveldb contains several optimizations to enhance its key retrieval performance. There is a table cache (also called the file cache in some places). All leveldb access to the second layer of the search tree goes through the table cache. The table cache attempts to keep the most popular table files open and their block index available. Similarly, there is a block cache. The block cache keeps recently read blocks in memory. Finally, each table file also contains a bloom filter in its metadata. The bloom filter is kept in the table cache along with the block index. The bloom filter is quickly checked before the second layer search to see if there is any possibility of the key actually being present within the table file.
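For reference, Google's public API exposes these optimizations through the Options structure: max_open_files bounds the table cache, block_cache supplies the block cache, and filter_policy enables the bloom filter. The sizes below are arbitrary examples, and Basho's fork manages several of these settings differently.

```cpp
// Example against Google's public leveldb API (include/leveldb/*.h).
#include <cassert>
#include "leveldb/cache.h"
#include "leveldb/db.h"
#include "leveldb/filter_policy.h"
#include "leveldb/options.h"

int main() {
  leveldb::Options options;
  options.create_if_missing = true;
  options.max_open_files = 1000;                            // sizes the table (file) cache
  options.block_cache = leveldb::NewLRUCache(8 * 1048576);  // 8 MB block cache
  options.filter_policy = leveldb::NewBloomFilterPolicy(10);  // ~1% false positive rate

  leveldb::DB* db = nullptr;
  leveldb::Status s = leveldb::DB::Open(options, "/tmp/example_db", &db);
  assert(s.ok());

  // ... reads and writes ...

  delete db;
  delete options.block_cache;
  delete options.filter_policy;
  return 0;
}
```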

The bloom filter is very good at eliminating unnecessary reads of blocks from a table file. Google's original bloom filter implementation had a 1% false positive rate. Basho's bloom filter has a 0.04% false positive rate. A false positive is when the bloom filter suggests that the key might be present in the table file, but a complete search finds that it is not present.
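The quoted rates line up with the standard bloom filter estimate of roughly 0.6185^b false positives for b bits per key, assuming an optimal number of hash functions. Google's default is 10 bits per key; the 16 bits per key used below for the Basho case is only an assumption chosen to reproduce the 0.04% figure.

```cpp
// Rough check of the quoted false positive rates using the standard bloom
// filter estimate: with b bits per key and k = b * ln 2 hash functions, the
// false positive rate is about 0.6185^b. The 16 bits/key figure is an
// assumption for illustration, not a documented Basho setting.
#include <cmath>
#include <cstdio>

int main() {
  for (double bits_per_key : {10.0, 16.0}) {
    double rate = std::pow(0.6185, bits_per_key);
    std::printf("%4.1f bits/key -> ~%.3f%% false positives\n",
                bits_per_key, rate * 100.0);
  }
  return 0;  // prints roughly 0.82% for 10 bits/key and 0.046% for 16 bits/key
}
```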

Limitation

Note that the entire manifest structure must fit into memory. leveldb will fail if the manifest's entries require more memory than is available.

Updating the key space

leveldb makes changes to the key space in batches. Seldom will a user's single write of a key/value make a direct change to the key space. A user's individual write first posts to a write buffer. Google's write buffer has two components. The first component is a "skip list". A skip list is an alternative to a binary tree structure. It keeps key/values in sorted order within memory. The second component is a "recovery log" file. The recovery log contains a copy of the user's written key/value. leveldb uses the recovery log to repopulate a write buffer if the program terminates for any reason before the write buffer is converted to a table file.

skip lists paper
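Through the public API, the write buffer and recovery log are hidden behind Put() and Write(); WriteOptions::sync controls whether the recovery log append is flushed to stable storage before the call returns. The comments below restate the internal behavior described above; they are descriptive, not a guarantee of the exact implementation.

```cpp
// Writes reach the write buffer and recovery log through the public API.
#include "leveldb/db.h"
#include "leveldb/write_batch.h"

void PutExample(leveldb::DB* db) {
  leveldb::WriteOptions write_options;
  write_options.sync = true;  // force the recovery log append to stable storage

  // Each Put is appended to the recovery log and inserted into the in-memory
  // skip list (the write buffer); no table file is touched yet.
  leveldb::Status s = db->Put(write_options, "some_key", "some_value");

  // Batches are applied atomically: one log record, then each entry is
  // inserted into the write buffer.
  leveldb::WriteBatch batch;
  batch.Put("k1", "v1");
  batch.Delete("k2");
  s = db->Write(write_options, &batch);
}
```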

leveldb continues to add new user key/values to the write buffer until the total size of the key/values exceeds the write_buffer_size parameter. The user gives write_buffer_size in the Options structure as part of a database Open operation. Once the total size exceeds write_buffer_size, leveldb performs a quick operation to shift the current write buffer to a secondary slot and creates a new write buffer. The secondary slot for the old write buffer is called "imm" in the leveldb code (imm stands for immutable memory). leveldb creates a background task to convert the imm to a table file in level 0.
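A minimal example of supplying write_buffer_size at Open time follows. The 60 MB value is only an example; Riak chooses its own per-vnode settings.

```cpp
// write_buffer_size is an Options field read at Open time (Google public API).
#include <cassert>
#include "leveldb/db.h"
#include "leveldb/options.h"

int main() {
  leveldb::Options options;
  options.create_if_missing = true;
  options.write_buffer_size = 60 * 1048576;  // 60 MB before the buffer shifts to imm

  leveldb::DB* db = nullptr;
  leveldb::Status s = leveldb::DB::Open(options, "/tmp/example_db", &db);
  assert(s.ok());
  delete db;
  return 0;
}
```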

Note: leveldb will stall if it needs to shift the current write buffer to the imm, but the imm is still populated with the previous write buffer. The stall only ends once the background task successfully converts the previous imm to a level 0 table file, making the imm slot available for the current write buffer.

leveldb now posts the new level 0 table file to the manifest via a "VersionEdit" object. The post operation adds the new level 0 table file to the key space.
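Conceptually, posting the VersionEdit produces the next version of the in-memory manifest: the new table file entry is appended to the chosen level's array. The sketch below uses invented types, not leveldb's actual VersionEdit/VersionSet classes, and models only this flush case.

```cpp
// Invented types for illustration only; not leveldb's VersionEdit/VersionSet.
#include <cstdint>
#include <string>
#include <vector>

struct FileEntrySketch {
  uint64_t number;                 // table file number
  std::string smallest, largest;   // first and last key, as tracked by the manifest
};

// One array of file entries per level, i.e. the in-memory manifest.
using ManifestSketch = std::vector<std::vector<FileEntrySketch>>;

// A flush-style edit: one new table file destined for a given level.
struct VersionEditSketch {
  int level;                 // 0 for a write buffer flush
  FileEntrySketch new_file;
};

// Applying the edit yields the next manifest version; older versions can be
// kept alive for snapshots and in-flight operations.
ManifestSketch Apply(ManifestSketch current, const VersionEditSketch& edit) {
  current[edit.level].push_back(edit.new_file);
  return current;
}
```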

The leveldb manifest is therefore subject to change over time. leveldb has the concept of manifest "versions" (i.e. key space versions). A given user operation may start while one manifest version exists and conclude after one or more newer manifest versions have been created. Each user operation must process completely against a single version of the manifest. Therefore leveldb maintains multiple manifest versions as necessary to give each user operation a stable view of the database. leveldb can therefore easily support "snapshots" so that the user can perform multiple operations against a single version of the dataset. leveldb's "VersionSet" object manages access to the current manifest version and tracks any older versions necessary for snapshots and pending user operations.
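That version machinery is also what backs the public snapshot API: a snapshot simply pins one version of the key space. A minimal usage example against Google's public API:

```cpp
// A snapshot pins one version of the key space for the duration of the reads.
#include <string>
#include "leveldb/db.h"

void SnapshotExample(leveldb::DB* db) {
  const leveldb::Snapshot* snap = db->GetSnapshot();

  leveldb::ReadOptions read_options;
  read_options.snapshot = snap;   // all reads see the key space as of GetSnapshot()

  std::string value;
  leveldb::Status s = db->Get(read_options, "some_key", &value);

  // Writes and compactions that happen after GetSnapshot() do not affect the
  // values returned through read_options until the snapshot is released.
  db->ReleaseSnapshot(snap);
}
```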

Compactions / Key space grooming

Compactions are leveldb's method of grooming the table files that comprise the key space. The first thing to note is that leveldb never updates a table file. leveldb writes the table file once, adds it to the key space, then only reads from it. leveldb writes any user updates to a particular key to new level 0 table files. New writes of a particular key hide all older writes. Only the most recent write of a key/value is available to a leveldb user. (Riak's concept of "siblings" is managed outside of leveldb.) leveldb will remove older versions of a key, and any deleted keys, when two or more table files containing the same key(s) participate in the same compaction.

A compaction is the merging of one or more table files in a given level with zero or more table files in the next higher level. Google's leveldb and Basho's leveldb differ heavily in when compactions initiate and how compactions select table files for merging. Basho's differences increase leveldb's throughput within the Riak environment, which involves multiple databases needing simultaneous compactions, mostly random keys from the user, simultaneous compactions on multiple levels, and a need for aggressive grooming of deleted keys. The logic for compaction selection is within the VersionSet::PickCompaction() function in db/version_set.cc.

Once leveldb selects the table files to participate in a compaction, it executes a simple key merge across all table file inputs. Where the same key exists in multiple inputs, only the most recent is passed to the merge's output. The process only removes deleted keys if there are no higher level table files that might hold older versions of the key. Removing the deleted key from the table files too early would allow any older values to be resurrected and delivered to the user.
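A conceptual sketch of that merge rule follows. It uses invented types, and a single boolean stands in for leveldb's per-key check (Compaction::IsBaseLevelForKey() in db/version_set.cc); the real compaction streams entries through iterators rather than materializing vectors.

```cpp
// Conceptual sketch of the compaction merge step; invented types, not leveldb source.
#include <cstdint>
#include <string>
#include <vector>

struct EntrySketch {
  std::string key;
  uint64_t sequence;   // higher sequence = more recent write
  bool is_delete;      // tombstone written by Delete()
  std::string value;
};

// `inputs` holds every entry from every input table file, sorted by key and,
// within a key, newest first. `higher_levels_may_hold_key` stands in for the
// per-key check that leveldb performs against higher levels.
std::vector<EntrySketch> MergeSketch(const std::vector<EntrySketch>& inputs,
                                     bool higher_levels_may_hold_key) {
  std::vector<EntrySketch> output;
  size_t i = 0;
  while (i < inputs.size()) {
    const EntrySketch newest = inputs[i];   // first entry for this key is the most recent

    // A tombstone can be dropped only if no higher level might still hold an
    // older value that the tombstone is hiding.
    bool keep = !(newest.is_delete && !higher_levels_may_hold_key);
    if (keep) output.push_back(newest);

    // Skip all older entries for the same key; they are hidden and discarded.
    while (i < inputs.size() && inputs[i].key == newest.key) ++i;
  }
  return output;
}
```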

leveldb creates a VersionEdit object at the completion of the compaction to post to the manifest. The post operation removes all table files that were inputs to the compaction from the manifest, and adds all the output table files.

Manifest repair

Grooming versus heavy load compactions

compaction pacing changes
