mv overview


Summary

This wiki entry contains notes concerning internal structures and functions within Google's leveldb. The notes are not authoritative. They are based upon impressions and research.

Unless stated otherwise, all notes refer to a single database. A leveldb database is equivalent to a Riak vnode. Riak often has 8 to 128 leveldb databases open simultaneously.

A quick video presentation of leveldb and a few Basho modifications is found here:

Key space

leveldb is by definition a key/value store. It has to track every key in a manner that allows any key to be retrieved at any time in the future. leveldb accesses a requested key via a logical search tree. This search tree has four layers.

The first layer is the "manifest". A database may have zero to tens of thousands of table files. Every table file has an entry in the manifest. The manifest entry tracks the first and last key contained in each table file. The manifest keeps the table file entries in one of seven sorted arrays. Each of the seven arrays represents one "level" of table files. A user request for a key causes leveldb to check each table file that overlaps the target key. leveldb searches each potential table file, level by level, until finding the first that yields an exact match for the requested key.
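
As a rough illustration, the manifest lookup could be modeled like the sketch below. The structures and the CandidateFiles() helper are invented placeholders; leveldb's real manifest logic lives in the Version and VersionSet classes of db/version_set.cc.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-ins for the manifest entries described above; the
// real structures (FileMetaData, Version) live in db/version_edit.h and
// db/version_set.cc.
struct TableEntry {
  std::string smallest;    // first key in the table file
  std::string largest;     // last key in the table file
  uint64_t file_number;
};

struct ManifestSketch {
  std::vector<TableEntry> levels[7];   // one sorted array per level
};

// Collect every table file whose key range overlaps 'key', walking the
// levels in order. leveldb stops at the first file that yields an exact
// match; this sketch just enumerates the candidates.
std::vector<uint64_t> CandidateFiles(const ManifestSketch& m, const std::string& key) {
  std::vector<uint64_t> candidates;
  for (int level = 0; level < 7; ++level) {
    for (const TableEntry& t : m.levels[level]) {
      if (key >= t.smallest && key <= t.largest) {
        candidates.push_back(t.file_number);
      }
    }
  }
  return candidates;
}
```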

The second layer of the search tree is a table file's block index of keys. The block index is part of the table file's metadata which is located at the end of its physical data file. The index contains one entry for each logical data block within the table file. The entry contains the last key in the block and the offset of the block within the table file. leveldb performs a binary search of the block index to locate a candidate data block. It reads the candidate data block from the table file.

The third layer is the restart index that is part of each data block. The restart index is located at the end of the data block. The restart index contains a subset of the keys in the block. Each key in the subset includes an offset for where it is located within the data block. leveldb performs a binary search of the restart index to locate the nearest key to the user's requested key.

The fourth layer is simply a list of key/values in sorted order within a data block. leveldb now performs a linear search of the key/values, starting at the offset given by the restart index, to see if the user's requested key exists.
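
A minimal sketch of the third and fourth layers follows. The DataBlockSketch structure and BlockGet() helper are invented for illustration; real leveldb blocks store byte offsets and prefix-compressed keys rather than a plain vector of entries.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Simplified in-memory picture of one data block: key/values in sorted
// order plus a restart index holding a subset of keys (here, indices into
// the entry list).
struct DataBlockSketch {
  std::vector<std::pair<std::string, std::string>> entries;  // sorted by key
  std::vector<size_t> restarts;                              // restart positions
};

// Layer three: binary search of the restart index. Layer four: linear scan
// from the chosen restart point.
bool BlockGet(const DataBlockSketch& b, const std::string& key, std::string* value) {
  if (b.restarts.empty()) return false;
  size_t lo = 0, hi = b.restarts.size();
  while (lo + 1 < hi) {                        // find the last restart key <= key
    size_t mid = (lo + hi) / 2;
    if (b.entries[b.restarts[mid]].first <= key) lo = mid; else hi = mid;
  }
  for (size_t i = b.restarts[lo]; i < b.entries.size(); ++i) {
    if (b.entries[i].first == key) { *value = b.entries[i].second; return true; }
    if (b.entries[i].first > key) break;       // scanned past the slot; key absent
  }
  return false;
}
```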

leveldb contains several optimizations to enhance its key retrieval performance. There is a table cache (also called the file cache in some places). All leveldb access to the second layer of the search tree goes through the table cache. The table cache attempts to keep the most popular table files open and their block indexes available. Similarly, there is a block cache. The block cache keeps recently read blocks in memory. Finally, each table file also contains a bloom filter in its metadata. The bloom filter is kept in the table cache along with the block index. The bloom filter is quickly checked before the second layer search to see if there is any possibility of the key actually being present within the table file.
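
The lookup order within a single table file might be sketched as below. The stub types (BloomFilterStub, BlockIndexStub, BlockCacheStub) and the TableGet() helper are assumptions for illustration only, not the actual TableCache or Cache interfaces.

```cpp
#include <map>
#include <string>

// Placeholder interfaces standing in for the cached per-table metadata.
struct BloomFilterStub { bool KeyMayMatch(const std::string&) const { return true; } };
struct BlockIndexStub  { long FindBlockOffset(const std::string&) const { return 0; } };
struct BlockCacheStub {
  bool Lookup(long offset, std::string* block) const {
    auto it = blocks.find(offset);
    if (it == blocks.end()) return false;
    *block = it->second;
    return true;
  }
  void Insert(long offset, const std::string& block) { blocks[offset] = block; }
  std::map<long, std::string> blocks;
};

std::string ReadBlockFromDisk(long /*offset*/) { return std::string(); }  // assumed disk read

// Lookup order within one table file: bloom filter first (a negative answer
// skips the file with no I/O), then the block index, then the block cache,
// and only then a disk read. The caller still searches the returned block.
bool TableGet(const BloomFilterStub& bloom, const BlockIndexStub& index,
              BlockCacheStub& cache, const std::string& key, std::string* block_out) {
  if (!bloom.KeyMayMatch(key)) return false;
  long offset = index.FindBlockOffset(key);
  if (offset < 0) return false;
  if (!cache.Lookup(offset, block_out)) {
    *block_out = ReadBlockFromDisk(offset);
    cache.Insert(offset, *block_out);
  }
  return true;
}
```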

The bloom filter is very good at eliminating unnecessary reads of blocks from a table file. Google's original bloom filter implementation had a 1% false positive rate. Basho's bloom filter has a 0.04% false positive rate. A false positive is when the bloom filter suggests that the key might be present in the table file, but a complete search finds that it is not present.

Limitation

Note that the entire manifest structure must fit into memory. leveldb will fail if the manifest grows larger than available memory.

Updating the key space

leveldb makes changes to the key space in batches. Seldom will a user's single write of a key/value make a direct change to the key space. A user's individual write first posts to a write buffer. Google's write buffer has two components. The first component is a "skip list". A skip list is an alternative to a binary tree structure. It keeps key/values in sorted order within memory. The second component is a "recovery log" file. The recovery log contains a copy of the user's written key/value. leveldb uses the recovery log to repopulate a write buffer if the program terminates for any reason before the write buffer is converted to a table file.

skip lists paper
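
A minimal sketch of the write-buffer idea, assuming a std::map as a stand-in for the skip list and a plain text file as the recovery log (leveldb's real recovery log uses a binary record format):

```cpp
#include <cstddef>
#include <fstream>
#include <map>
#include <string>

// Toy write buffer: a sorted in-memory map standing in for the skip list,
// paired with an append-only recovery log used to repopulate the buffer
// after a crash.
class WriteBufferSketch {
 public:
  explicit WriteBufferSketch(const std::string& log_path) : log_(log_path, std::ios::app) {}

  void Put(const std::string& key, const std::string& value) {
    log_ << key << '\t' << value << '\n';  // 1. persist to the recovery log
    log_.flush();
    mem_[key] = value;                     // 2. insert into sorted memory
  }

  size_t ApproximateSize() const {
    size_t total = 0;
    for (const auto& kv : mem_) total += kv.first.size() + kv.second.size();
    return total;                          // compared against write_buffer_size
  }

 private:
  std::ofstream log_;
  std::map<std::string, std::string> mem_;
};
```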

leveldb continues to add new user key/values to the write buffer until the total size of the key/values exceeds the write_buffer_size parameter. The user gives the write_buffer_size in the Options structure as part of a database Open operation. Once the total size exceeds write_buffer_size, leveldb performs a quick operation to shift the current write buffer to a secondary slot and creates a new write buffer. The secondary slot for the old write buffer is called "imm" in the leveldb code (imm stands for immutable memory). leveldb creates a background task to convert the imm to a table file in level 0.
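
The rotation into the imm slot might be sketched as below. The BufferSlots structure and MaybeRotate() helper are invented names; in leveldb the equivalent logic runs under a mutex inside the write path, and the stall described in the note that follows happens when the imm slot is still occupied.

```cpp
#include <cstddef>
#include <memory>

struct MemTable {};  // placeholder for the skip-list write buffer

// Simplified picture of the buffer rotation described above.
struct BufferSlots {
  std::unique_ptr<MemTable> mem;   // active write buffer
  std::unique_ptr<MemTable> imm;   // immutable buffer awaiting conversion
};

bool MaybeRotate(BufferSlots& slots, size_t current_size, size_t write_buffer_size) {
  if (current_size <= write_buffer_size) return false;  // still room in the buffer
  if (slots.imm) return false;   // previous imm not yet flushed: caller must stall
  slots.imm = std::move(slots.mem);           // shift the current buffer to the imm slot
  slots.mem = std::make_unique<MemTable>();   // start a fresh write buffer
  // A background task would now convert 'imm' into a level 0 table file.
  return true;
}
```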

Note: leveldb will stall if it needs to shift the current write buffer to the imm slot while the imm is still populated with the previous write buffer. The stall only ends once the background task successfully converts the previous imm to a level 0 table file, making the imm slot available for the current write buffer.

leveldb now posts the new level 0 table file to the manifest via a "VersionEdit" object. The post operation adds the new level 0 table file to the key space.
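
The posted record could be pictured roughly as follows. These structures are simplified stand-ins; the real VersionEdit in db/version_edit.h carries additional fields and is serialized into the MANIFEST file. The same record shape also covers the compaction postings described later, which both add and delete table files.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Rough shape of the record posted to the manifest when table files are
// added to (or removed from) the key space.
struct NewFileSketch {
  int level;               // 0 for a freshly converted write buffer
  uint64_t file_number;
  uint64_t file_size;
  std::string smallest;    // first key in the table file
  std::string largest;     // last key in the table file
};

struct VersionEditSketch {
  std::vector<NewFileSketch> new_files;                 // table files added to the key space
  std::vector<std::pair<int, uint64_t>> deleted_files;  // (level, file number) pairs removed
};
```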

The leveldb manifest is therefore subject to change over time. leveldb has the concept of manifest "versions" (i.e. key space versions). A given user operation may start while one manifest version exists and conclude after one or more newer manifest versions have been created. Each user operation must process completely against a single version of the manifest. Therefore leveldb maintains multiple manifest versions as necessary to give each user operation a stable view of the database. leveldb can therefore easily support "snapshots" so that the user can perform multiple operations against a single version of the dataset. leveldb's "VersionSet" object manages access to the current manifest version and tracks any older versions necessary for snapshots and pending user operations.
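
One way to picture the version management is the sketch below. It uses shared_ptr reference counting to convey the idea of pinning a version for the life of an operation; leveldb's real Version objects use explicit Ref()/Unref() counting, and the class below is an invented simplification.

```cpp
#include <memory>
#include <mutex>
#include <utility>

struct Version { /* sorted table-file lists for each level */ };

class VersionSetSketch {
 public:
  // Readers, iterators, and snapshots grab the current version and hold it
  // for the duration of the operation, giving them a stable view.
  std::shared_ptr<const Version> Current() const {
    std::lock_guard<std::mutex> lock(mu_);
    return current_;
  }
  // Write-buffer conversions and compactions install new versions; older
  // versions stay alive for as long as anyone still references them.
  void Install(std::shared_ptr<const Version> v) {
    std::lock_guard<std::mutex> lock(mu_);
    current_ = std::move(v);
  }
 private:
  mutable std::mutex mu_;
  std::shared_ptr<const Version> current_ = std::make_shared<Version>();
};
```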

Compactions / Key space grooming

Compactions are leveldb's method of grooming the table files that comprise the key space. The first thing to note is that leveldb never updates a table file. leveldb writes the table file once, adds it to the key space, then only reads from it. leveldb writes any user updates to a particular key to new level 0 table files. New writes of a particular key hide all older writes. Only the most recent write of a key/value is available to a leveldb user. (Riak's concept of "siblings" is managed outside of leveldb.) leveldb will remove older versions of a key and any deleted keys when two or more table files containing the same key(s) participate in the same compaction.

A compaction is the merging of one or more table files in a given level with zero or more table files in the next higher level. Google's leveldb and Basho's leveldb differ heavily in when compactions initiate and how compactions select table files for merging. Basho's differences increase leveldb's throughput within the Riak environment: multiple databases needing simultaneous compactions, mostly random keys from the user, simultaneous compactions on multiple levels, and a need for aggressive grooming of deleted keys. The logic for compaction selection is within the VersionSet::PickCompaction() function of db/version_set.cc.

Once leveldb selects the table files to participate in a compaction, it executes a simple key merge across all table file inputs. Where the same key exists in multiple inputs, only the most recent is passed to the merge's output. The process only removes deleted keys if there are no higher level table files that might hold older versions of the key. Removing the deleted key from the table files too early would allow any older values to be resurrected and delivered to the user.
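
The merge policy can be sketched as follows, assuming each input entry carries a sequence number (newer writes have larger sequence numbers) and a deletion flag. The Entry structure and MergeInputs() helper are illustrative only; the real compaction is a streaming merge over sorted inputs, not an in-memory map.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Simplified view of one key/value as it appears inside a table file.
struct Entry {
  std::string user_key;
  uint64_t sequence;     // newer writes carry larger sequence numbers
  bool is_deletion;      // true for a delete marker
  std::string value;
};

// Keep only the newest entry per key; drop deletion markers only when no
// higher level could still hold an older value for that key.
std::vector<Entry> MergeInputs(const std::vector<Entry>& inputs,
                               bool higher_levels_may_hold_key) {
  std::map<std::string, Entry> newest;
  for (const Entry& e : inputs) {
    auto it = newest.find(e.user_key);
    if (it == newest.end() || e.sequence > it->second.sequence) newest[e.user_key] = e;
  }
  std::vector<Entry> out;
  for (const auto& kv : newest) {
    if (kv.second.is_deletion && !higher_levels_may_hold_key) continue;  // safe to drop
    out.push_back(kv.second);
  }
  return out;
}
```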

leveldb creates a VersionEdit object at the completion of the compaction to post to the manifest. The post operation removes all table files that were inputs to the compaction from the manifest, and adds all the output table files.

Manifest repair

leveldb depends upon the manifest being correct. Every table file in the manifest must exist on disk. Any error (hardware, operator, or programmer generated) that removes from disk a table file still listed in the manifest can put leveldb into an infinite loop. This is a highly "not fault tolerant" situation. Currently the only cure is for the operator to close the database and execute a leveldb repair.

Google's repair algorithm is perfect. The algorithm handles all edge cases seen and envisioned to date. It works by logically shifting all table files to level 0 and performing a single merge across all of them. Ok, Google's repair algorithm is not perfect if you consider performance. The following link details how a 490G database was going to take 6 weeks to repair:

Basho's repair algorithm works differently (as mentioned in the above link). Basho's algorithm is not perfect. It contains some edge cases where it cannot guarantee the correct fix. But Basho's repair completes in minutes instead of weeks.

Basho's repair algorithm relies upon a subtle change to leveldb's table file organization. Google's leveldb keeps all table files, regardless of logical level, within a single, flat directory. Basho's leveldb keeps table files in directories by their logical level: sst_0, sst_1, … sst_6. This change gives Basho's repair algorithm a huge, extra data attribute about a given table file during a manifest rebuild. Google's algorithm has no knowledge of how table files previously related to each other. A detailed discussion of the coding changes is here:

Basho's tiered storage feature is a side effect of repair's need to have logical levels stored within directories.

Overlapped versus Sorted levels

leveldb sorts all keys within a table file. How one table file's key space compares to another table file's key space depends upon the level. Google's leveldb allows the key spaces of level 0 table files to overlap. The search for a given key within level 0 may require reads from none, some, or all of the level 0 table files. Google's leveldb changes to a sorted key space for levels 1 and above. A sorted key space means that there is no overlap of key space between any two table files on the same level. The search for a given key within level 1 or higher will at most require reads from one table file per level.
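
The difference in search cost between the two level types could be sketched as below; the FileRange structure and the two helpers are invented for illustration.

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct FileRange { std::string smallest, largest; };

// Overlapped level (level 0): every file whose range covers the key is a
// candidate, so none, some, or all files may need to be read.
std::vector<size_t> OverlappedCandidates(const std::vector<FileRange>& files,
                                         const std::string& key) {
  std::vector<size_t> out;
  for (size_t i = 0; i < files.size(); ++i)
    if (key >= files[i].smallest && key <= files[i].largest) out.push_back(i);
  return out;
}

// Sorted level: files are disjoint and ordered, so a binary search finds the
// single file (if any) that can contain the key.
int SortedCandidate(const std::vector<FileRange>& files, const std::string& key) {
  auto it = std::lower_bound(files.begin(), files.end(), key,
      [](const FileRange& f, const std::string& k) { return f.largest < k; });
  if (it == files.end() || key < it->smallest) return -1;   // no file covers the key
  return static_cast<int>(it - files.begin());
}
```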

The act of compacting keys from an overlapped level of table files to a sorted level of table files has an interesting problem. Any one of the level 0 table files may have keys that overlap the entire key space of all table files in level 1. Therefore the compaction must encompass all the level 1 files. Google's implementation limits the size of a table file to 2 megabytes and the combined size of all level 1 table files to 10 megabytes. The resulting compaction of one level 0 table file with all level 1 table files would, in the worst case, total 12 megabytes, not a large amount of data (see Google's MaxBytesForLevel() and MaxFileSizeForLevel() in db/version_set.cc). The downside to this structure is that it leads to high write amplification (the number of times the same key/value is rewritten due to compactions) because the small files are rewritten often and/or overflow quickly to higher levels.

Basho's leveldb radically changes the size of individual table files and the total size of each level. level 0 files range from 30 to 60 megabytes (based upon write buffer size when imm compacts to a level 0 file). MaxFileSizeForLevel() for the remaining levels progressively increases to 734 megabytes by level 6. These size changes were necessary to contend with the extreme number of files in terabyte database situations and to reduce write amplification.

Google's original overlapped level to sorted level conversion strategy had horrible performance with these new size changes. Basho's leveldb now uses both level 0 and level 1 as overlapped levels. A compaction of table files from level 0 simply merges all the level 0 files into a new level 1 table file without merging any existing table files at level 1. The overlapped table files of level 1 then compact with the sorted table files of level 2. The maximum combined file size for all level 2 table files is 3 gigabytes. The total size for level 1 could approach 1 gigabyte assuming a 60 megabyte write buffer and no compression. This creates a potential merge of 4 gigabytes, which sounds far worse than Google's 12 megabytes.

Basho also changed the logic related to level 2 to reduce the 4 gigabyte compaction potential. level 2 is now the "landing level". Basho's leveldb constantly grooms level 2 down to 200 megabytes. So a typical compaction from level 1 to level 2 is 1.2 gigabytes, not 4 gigabytes. And this compaction is focused entirely upon breaking up the large overlapped files of level 1 into small sorted table files. The grooming quickly compacts the smaller, sorted table files with individual table files of level 3. While the 1.2 gigabyte compaction is still larger than a 12 megabyte Google compaction from level 0 to level 1, Basho's big compaction happens very infrequently. The write amplification with Basho's compaction algorithm is 3.6 compared to Google's 11 for large amounts of random key data. The reduction in write amplification is mathematically the same as saying Basho's leveldb reduces the total number of bytes written to disk by 67% for large data sets of random keys.

Grooming versus mandatory compactions

Google's leveldb used one background thread to perform compactions as needed. Basho's leveldb has eleven threads. Five threads specialize in converting an imm write buffer to a level 0 table file. Three threads specialize in compactions of level 0 to level 1 files. Three more threads handle compactions between all other levels. All of these threads are necessary when leveldb is under a heavy write load against 8 to 128 open databases with Riak. On the other hand, having so many compactions running simultaneously greatly impacts read latencies. Overall throughput is better with fewer compactions running, but only if a compaction backlog is not building.
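
The division of labor described above might be summarized in code as follows; the enum, struct, and pool sizes simply restate the text and are not Basho's actual class layout.

```cpp
// Illustrative routing of background work to the three Basho thread pools.
enum class WorkType { kImmToLevel0, kLevel0ToLevel1, kOtherLevels };

struct ThreadPool { int threads; const char* purpose; };

ThreadPool PoolFor(WorkType w) {
  switch (w) {
    case WorkType::kImmToLevel0:    return {5, "imm write buffer -> level 0 table file"};
    case WorkType::kLevel0ToLevel1: return {3, "level 0 -> level 1 compactions"};
    default:                        return {3, "compactions between all other levels"};
  }
}
```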

Basho's leveldb therefore categorizes compaction work into "grooming" and "mandatory" scenarios. The function VersionSet::Finalize() in db/version_set.cc selects a compaction and categorizes it as grooming or not. The function compares each potential compaction's attributes against two different thresholds. The lower threshold denotes a grooming compaction. The higher threshold denotes a mandatory compaction. A mandatory compaction is always placed upon the work queue for the thread pool. A grooming compaction never goes on the work queue. It either starts immediately or does not run. Only one thread per queue may take grooming compactions.

The lower and upper threshold criteria are different for overlapped versus sorted levels. Overlapped levels trigger a grooming compaction when 4 table files exist at that level, and a mandatory compaction when 6 or more table files exist. Sorted levels trigger a grooming compaction when the total size of all table files on a level exceeds gLevelTraits[].m_DesiredBytesForLevel (see the table at the beginning of db/version_set.cc), and a mandatory compaction when the total size exceeds gLevelTraits[].m_MaxBytesForLevel. Sorted levels will also trigger a grooming compaction if the number of deleted objects in a table file exceeds Options::delete_threshold.
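
A rough sketch of the categorization, assuming invented LevelState and Classify() names in place of the real VersionSet::Finalize() logic (the deleted-object threshold for sorted levels is omitted):

```cpp
#include <cstdint>

enum class CompactionClass { kNone, kGrooming, kMandatory };

struct LevelState {
  bool overlapped;            // levels 0 and 1 in Basho's leveldb
  int file_count;             // table files at this level
  uint64_t total_bytes;       // combined size of the level's table files
  uint64_t desired_bytes;     // gLevelTraits[].m_DesiredBytesForLevel
  uint64_t max_bytes;         // gLevelTraits[].m_MaxBytesForLevel
};

CompactionClass Classify(const LevelState& s) {
  if (s.overlapped) {
    if (s.file_count >= 6) return CompactionClass::kMandatory;
    if (s.file_count >= 4) return CompactionClass::kGrooming;
  } else {
    if (s.total_bytes > s.max_bytes)     return CompactionClass::kMandatory;
    if (s.total_bytes > s.desired_bytes) return CompactionClass::kGrooming;
  }
  return CompactionClass::kNone;
}
```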

not yet written

env_posix.cc's massive changes

  • should be called env_riak.cc by now
  • files: async write & close
  • why pread over mem-mapped read
  • threading: hot threads
  • fadvise

resource management

  • file handles: unaddressed issue
  • file size, write buffer size
  • block cache
  • page cache (notes about fadvise?)
  • allocator overhead concern (malloc, and worse libtcmalloc)
  • flexcache (maybe independent section): file close at 10 days, internal vs. user databases autosizing caches

Other optimizations
