Pebble vs RocksDB: Implementation Differences

RocksDB is a key-value store implemented using a Log-Structured Merge-Tree (LSM). This document is not a primer on LSMs. There exist some decent introductions on the web, or try chapter 3 of Designing Data-Intensive Applications.

Pebble inherits the RocksDB file formats, has a similar API, and shares many implementation details, but it also has many differences that improve performance, reduce implementation complexity, or extend functionality. This document highlights some of the more important differences.

Internal Keys

The external RocksDB API accepts keys and values. Due to the LSM structure, keys are never updated in place, but overwritten with new versions. Inside RocksDB, these versioned keys are known as Internal Keys. An Internal Key is composed of the user-specified key, a sequence number, and a kind. On disk, sstables always store Internal Keys.

  +-------------+------------+----------+
  | UserKey (N) | SeqNum (7) | Kind (1) |
  +-------------+------------+----------+

The Kind field indicates the type of key: set, merge, delete, etc.
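To make the layout concrete, here is a small Go sketch of how the trailing 8 bytes (a 56-bit SeqNum plus a 1-byte Kind) can be packed and unpacked. The constant names are illustrative, not Pebble's actual identifiers:

```go
package main

import "fmt"

// Illustrative kind values; the real constants live in Pebble's
// internal packages.
const (
	kindDelete uint8 = 0
	kindSet    uint8 = 1
)

// makeTrailer packs a 56-bit sequence number and a 1-byte kind into the
// 8-byte trailer that follows the user key.
func makeTrailer(seqNum uint64, kind uint8) uint64 {
	return seqNum<<8 | uint64(kind)
}

// splitTrailer recovers the two components.
func splitTrailer(trailer uint64) (seqNum uint64, kind uint8) {
	return trailer >> 8, uint8(trailer)
}

func main() {
	t := makeTrailer(42, kindSet)
	seqNum, kind := splitTrailer(t)
	fmt.Println(seqNum, kind) // 42 1
}
```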

While Pebble inherits the Internal Key encoding for format compatibility, it diverges from RocksDB in how it manages Internal Keys in its implementation. In RocksDB, Internal Keys are represented either in encoded form (as a string) or as a ParsedInternalKey. The latter is a struct with the components of the Internal Key as three separate fields.

struct ParsedInternalKey {
  Slice  user_key;
  uint64 seqnum;
  uint8  kind;
};

The component format is convenient: changing the SeqNum or Kind is a field assignment; extracting the UserKey is a field reference. However, RocksDB tends to only use ParsedInternalKey locally. The major internal APIs, such as InternalIterator, operate using encoded internal keys (i.e. strings) for parameters and return values.

To give a concrete example of the overhead this causes, consider Iterator::Seek(user_key). The external Iterator is implemented on top of an InternalIterator. Iterator::Seek ends up calling InternalIterator::Seek. Both Seek methods take a key, but InternalIterator::Seek expects an encoded Internal Key. This is both error-prone and expensive: the key passed to Iterator::Seek needs to be copied into a temporary string in order to append the SeqNum and Kind. In Pebble, Internal Keys are represented in memory using an InternalKey struct that is the analog of ParsedInternalKey. All internal APIs use InternalKeys, with the exception of the lowest level routines for decoding data from sstables. Since the Pebble interfaces all take and return the InternalKey struct, there is no need to allocate in order to construct an Internal Key from a User Key, whereas RocksDB sometimes needs to allocate and encode (i.e. make a copy). The use of the encoded form also causes RocksDB to pass encoded keys to the comparator routines, sometimes decoding the keys multiple times during the course of processing.
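A rough Go sketch of the shape of such a struct (hedged: Pebble's actual type lives in an internal package and its method set differs; the point is that construction is allocation-free):

```go
package main

import "fmt"

// InternalKey sketch: the user key and the encoded trailer
// (seqNum<<8 | kind) are kept as separate struct fields, so building an
// internal key from a user key is just a struct literal -- no copy of
// the key into an encoded string is required.
type InternalKey struct {
	UserKey []byte
	Trailer uint64
}

// makeInternalKey builds an internal key without copying the user key.
func makeInternalKey(userKey []byte, seqNum uint64, kind uint8) InternalKey {
	return InternalKey{UserKey: userKey, Trailer: seqNum<<8 | uint64(kind)}
}

func (k InternalKey) SeqNum() uint64 { return k.Trailer >> 8 }
func (k InternalKey) Kind() uint8    { return uint8(k.Trailer) }

func main() {
	k := makeInternalKey([]byte("foo"), 9, 1)
	fmt.Println(string(k.UserKey), k.SeqNum(), k.Kind()) // foo 9 1
}
```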

Indexed Batches

In RocksDB, a batch is the unit for all write operations. Even writing a single key is transformed internally to a batch. The batch internal representation is a contiguous byte buffer with a fixed 12-byte header, followed by a series of records.

  +------------+-----------+--- ... ---+
  | SeqNum (8) | Count (4) |  Entries  |
  +------------+-----------+--- ... ---+

Each record has a 1-byte kind tag prefix, followed by 1 or 2 length prefixed strings (varstring):

  +----------+-----------------+-------------------+
  | Kind (1) | Key (varstring) | Value (varstring) |
  +----------+-----------------+-------------------+

(The Kind indicates if there are 1 or 2 varstrings. Set, Merge, and DeleteRange have 2 varstrings, while Delete has 1.)
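The record encoding above can be sketched in Go. The header layout and varstring framing follow the description in this section; the function names are illustrative:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	kindDelete byte = 0x00
	kindSet    byte = 0x01
)

// appendVarstring appends a uvarint length prefix followed by the bytes.
func appendVarstring(buf, s []byte) []byte {
	buf = binary.AppendUvarint(buf, uint64(len(s)))
	return append(buf, s...)
}

// batchSet appends a Set record (1-byte kind, then key and value
// varstrings) to a batch whose 12-byte header has been reserved.
func batchSet(batch, key, value []byte) []byte {
	batch = append(batch, kindSet)
	batch = appendVarstring(batch, key)
	return appendVarstring(batch, value)
}

func main() {
	batch := make([]byte, 12) // SeqNum (8) + Count (4), filled at commit
	batch = batchSet(batch, []byte("foo"), []byte("bar"))
	fmt.Println(len(batch)) // 12 + 1 + (1+3) + (1+3) = 21
}
```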

Adding a mutation to a batch involves appending a new record to the buffer. This format is extremely fast for writes, but the lack of indexing makes it untenable to use directly for reads. In order to support iteration, a separate indexing structure is created. Both RocksDB and Pebble use a skiplist for the indexing structure, but with a clever twist. Rather than the skiplist storing a copy of the key, it simply stores the offset of the record within the mutation buffer. The result is that the skiplist acts as a multi-map (i.e. a map that can have duplicate entries for a given key). The iteration order for this map is constructed so that records sort on key, and for equal keys they sort on descending offset: newer records for the same key appear before older records.
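The offset-based comparison that produces this ordering can be sketched as follows (illustrative names; the real skiplist comparator is internal to both implementations):

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// decodeKey reads the user key of the record starting at off: skip the
// 1-byte kind tag, then read the uvarint length prefix.
func decodeKey(data []byte, off uint32) []byte {
	p := data[off+1:]
	n, m := binary.Uvarint(p)
	return p[m : m+int(n)]
}

// batchCompare is the ordering used by the batch index skiplist, which
// stores record offsets rather than key copies: ascending by user key,
// and for equal keys descending by offset, so newer records sort first.
func batchCompare(data []byte, aOff, bOff uint32) int {
	if c := bytes.Compare(decodeKey(data, aOff), decodeKey(data, bOff)); c != 0 {
		return c
	}
	switch {
	case aOff > bOff: // newer record for the same key sorts earlier
		return -1
	case aOff < bOff:
		return 1
	}
	return 0
}

func main() {
	// A batch with two Set records for the same key "a":
	// offset 12: set "a" -> "x", offset 17: set "a" -> "y".
	data := make([]byte, 12) // header
	data = append(data, 1, 1, 'a', 1, 'x')
	data = append(data, 1, 1, 'a', 1, 'y')
	fmt.Println(batchCompare(data, 17, 12)) // -1: the newer record sorts first
}
```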

While the indexing structure for batches is nearly identical between RocksDB and Pebble, how the index structure is used is completely different. In RocksDB, a batch is indexed using the WriteBatchWithIndex class. The WriteBatchWithIndex class provides a NewIteratorWithBase method that allows iteration over the merged view of the batch contents and an underlying "base" iterator created from the database. BaseDeltaIterator contains logic to iterate over the batch entries and the base iterator in parallel which allows us to perform reads on a snapshot of the database as though the batch had been applied to it. On the surface this sounds reasonable, yet the implementation is incomplete. Merge and DeleteRange operations are not supported. The reason they are not supported is because handling them is complex and requires duplicating logic that already exists inside RocksDB for normal iterator processing.

Pebble takes a different approach to iterating over a merged view of a batch's contents and the underlying database: it treats the batch as another level in the LSM. Recall that an LSM is composed of zero or more memtable layers and zero or more sstable layers. Internally, both RocksDB and Pebble contain a MergingIterator that knows how to merge the operations from different levels, including processing overwritten keys, merge operations, and delete range operations. The challenge with treating the batch as another level to be used by a MergingIterator is that the records in a batch do not have a sequence number. The sequence number in the batch header is not assigned until the batch is committed. The solution is to give the batch records temporary sequence numbers. We need these temporary sequence numbers to be larger than any other sequence number in the database so that the records in the batch are considered newer than any committed record. This is accomplished by reserving the high-bit in the 56-bit sequence number for use as a marker for batch sequence numbers. The sequence number for a record in an uncommitted batch is:

  RecordOffset | (1<<55)

Newer records in a given batch will have a larger sequence number than older records in the batch. And all of the records in a batch will have larger sequence numbers than any committed record in the database.
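This scheme is small enough to state directly in code. A sketch (constant and function names are illustrative):

```go
package main

import "fmt"

// The high bit of the 56-bit sequence number space is reserved to mark
// records in uncommitted batches.
const batchSeqNumBit = uint64(1) << 55

// batchSeqNum assigns a temporary sequence number to an uncommitted
// batch record: its offset within the batch, with the marker bit set.
func batchSeqNum(recordOffset uint32) uint64 {
	return uint64(recordOffset) | batchSeqNumBit
}

// isUncommitted reports whether a sequence number belongs to a record
// in an uncommitted batch rather than a committed record.
func isUncommitted(seqNum uint64) bool {
	return seqNum&batchSeqNumBit != 0
}

func main() {
	fmt.Println(isUncommitted(batchSeqNum(12)))    // true
	fmt.Println(batchSeqNum(20) > batchSeqNum(12)) // true: newer record, larger seqnum
	fmt.Println(isUncommitted(100))                // false: a committed seqnum
}
```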

The end result is that Pebble's batch iterators support all of the functionality of regular database iterators with minimal additional code.

Large Batches

The size of a batch is limited only by available memory, yet the required memory is not just the batch representation. When a batch is committed, the commit operation iterates over the records in the batch from oldest to newest and inserts them into the current memtable. The memtable is an in-memory structure that buffers mutations that have been committed (written to the Write Ahead Log), but not yet written to an sstable. Internally, a memtable uses a skiplist to index records. Each skiplist entry has overhead for the index links and other metadata that is a dozen bytes at minimum. A large batch composed of many small records can require twice as much memory when inserted into a memtable as it required in the batch itself. And note that this causes a temporary increase in memory requirements, because the batch memory is not freed until it is completely committed.

A non-obvious implementation restriction present in both RocksDB and Pebble is that there is a one-to-one correspondence between WAL files and memtables. That is, a given WAL file has a single memtable associated with it and vice-versa. While this restriction could be removed, doing so is onerous and intricate. It should also be noted that committing a batch involves writing it to a single WAL file. The combination of restrictions results in a batch needing to be written entirely to a single memtable.

What happens if a batch is too large to fit in a memtable? Memtables are generally considered to have a fixed size, yet this is not actually true in RocksDB. In RocksDB, the memtable skiplist is implemented on top of an arena structure. An arena is composed of a list of fixed size chunks, with no upper limit set for the number of chunks that can be associated with an arena. So RocksDB handles large batches by allowing a memtable to grow beyond its configured size. Concretely, while RocksDB may be configured with a 64MB memtable size, a 1GB batch will cause the memtable to grow to accommodate it. Functionally, this is good, though there is a practical problem: a large batch is first written to the WAL, and then added to the memtable. Adding the large batch to the memtable may consume so much memory that the system runs out of memory and is killed by the kernel. This can result in a death loop, because upon restart the batch is read from the WAL and applied to the memtable again.

In Pebble, the memtable is also implemented using a skiplist on top of an arena. Significantly, the Pebble arena is a fixed size. While the RocksDB skiplist uses pointers, the Pebble skiplist uses offsets from the start of the arena. The fixed size arena means that the Pebble memtable cannot expand arbitrarily. A batch that is too large to fit in the memtable causes the current mutable memtable to be marked as immutable and the batch is wrapped in a flushableBatch structure and added to the list of immutable memtables. Because the flushableBatch is readable as another layer in the LSM, the batch commit can return as soon as the flushableBatch has been added to the immutable memtable list.

Internally, a flushableBatch provides iterator support by sorting the batch contents (the batch is sorted once, when it is added to the memtable list). Sorting the batch contents and insertion of the contents into a memtable have the same big-O time, but the constant factor dominates here. Sorting is significantly faster and uses significantly less memory due to not having to copy the batch records.

Note that an effect of this large batch support is that Pebble can be configured as an efficient on-disk sorter: specify a small memtable size, disable the WAL, and set a large L0 compaction threshold. In order to sort a large amount of data, create batches that are larger than the memtable size and commit them. When committed these batches will not be inserted into a memtable, but instead sorted and then written out to L0. The fully sorted data can later be read and the normal merging process will take care of the final ordering.

Commit Pipeline

The commit pipeline is the component which manages the steps in committing write batches, such as writing the batch to the WAL and applying its contents to the memtable. While simple conceptually, the commit pipeline is crucial for high performance. In the absence of concurrency, commit performance is limited by how fast a batch can be written (and synced) to the WAL and then added to the memtable, both of which are outside of the purview of the commit pipeline.

To understand the challenge here, it is useful to have a conception of the WAL (write-ahead log). The WAL contains a record of all of the batches that have been committed to the database. As a record is written to the WAL it is added to the memtable. Each record is assigned a sequence number which is used to distinguish newer updates from older ones. Conceptually the WAL looks like:

+--------------------------------------+
| Batch(SeqNum=1,Count=9,Records=...)  |
+--------------------------------------+
| Batch(SeqNum=10,Count=5,Records=...) |
+--------------------------------------+
| Batch(SeqNum=15,Count=7,Records...)  |
+--------------------------------------+
| ...                                  |
+--------------------------------------+

Note that each WAL entry is precisely the batch representation described earlier in the Indexed Batches section. The monotonically increasing sequence numbers are a critical component in allowing RocksDB and Pebble to provide fast snapshot views of the database for reads.

If concurrent performance were not a concern, the commit pipeline could simply be a mutex which serialized writes to the WAL and application of the batch records to the memtable. Concurrent performance is a concern, though.

The primary challenge in concurrent performance in the commit pipeline is maintaining two invariants:

  1. Batches need to be written to the WAL in sequence number order.
  2. Batches need to be made visible for reads in sequence number order. This invariant arises from the use of a single sequence number which indicates which mutations are visible.

The second invariant deserves explanation. RocksDB and Pebble both keep track of a visible sequence number. This is the sequence number for which records in the database are visible during reads. The visible sequence number exists because committing a batch is an atomic operation, yet adding records to the memtable is done without an exclusive lock (the skiplists used by both Pebble and RocksDB are lock-free). When the records from a batch are being added to the memtable, a concurrent read operation may see those records, but will skip over them because they are newer than the visible sequence number. Once all of the records in the batch have been added to the memtable, the visible sequence number is atomically incremented.
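The read-side half of this invariant can be sketched as follows: a read captures the visible sequence number once at its start and skips any record newer than that snapshot, even if a concurrent commit is midway through applying to the memtable. All names here are illustrative:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// visibleSeqNum is ratcheted atomically only after a batch has been
// fully applied to the memtable.
var visibleSeqNum atomic.Uint64

type record struct {
	key    string
	seqNum uint64
}

// visibleRecords filters a scan down to the snapshot captured when the
// read began; records from a half-applied batch have larger sequence
// numbers and are skipped.
func visibleRecords(all []record) []record {
	snapshot := visibleSeqNum.Load()
	var out []record
	for _, r := range all {
		if r.seqNum <= snapshot {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	visibleSeqNum.Store(5)
	recs := []record{{"a", 3}, {"b", 5}, {"c", 7}} // "c" is mid-commit
	fmt.Println(len(visibleRecords(recs))) // 2
}
```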

So we have four steps in committing a write batch:

  1. Write the batch to the WAL
  2. Apply the mutations in the batch to the memtable
  3. Bump the visible sequence number
  4. (Optionally) sync the WAL

Writing the batch to the WAL is actually very fast as it is just a memory copy. Applying the mutations in the batch to the memtable is by far the most CPU intensive part of the commit pipeline. Syncing the WAL is the most expensive from a wall clock perspective.

With that background out of the way, let's examine how RocksDB commits batches. This description is of the traditional commit pipeline in RocksDB (i.e. the one used by CockroachDB).

RocksDB achieves concurrency in the commit pipeline by grouping concurrently committed batches into a batch group. Each group is assigned a "leader" which is the first batch to be added to the group. The batch group is written atomically to the WAL by the leader thread, and then the individual batches making up the group are concurrently applied to the memtable. Lastly, the visible sequence number is bumped such that all of the batches in the group become visible in a single atomic step. While a batch group is being applied, other concurrent commits are added to a waiting list. When the group commit finishes, the waiting commits form the next group.

There are two criticisms of the batch grouping approach. The first is that forming a batch group involves copying batch contents. RocksDB partially alleviates this for large batches by placing a limit on the total size of a group. A large batch will end up in its own group and not be copied, but the criticism still applies for small batches. Note that there are actually two copies here. The batch contents are concatenated together to form the group, and then the group contents are written into an in-memory buffer for the WAL before being written to disk.

The second criticism is about the thread synchronization points. Let's consider what happens to a commit which becomes the leader:

  1. Lock commit mutex
  2. Wait to become leader
  3. Form (concatenate) batch group and write to the WAL
  4. Notify followers to apply their batch to the memtable
  5. Apply own batch to memtable
  6. Wait for followers to finish
  7. Bump visible sequence number
  8. Unlock commit mutex
  9. Notify followers that the commit is complete

The follower's set of operations looks like:

  1. Lock commit mutex
  2. Wait to become follower
  3. Wait to be notified that it is time to apply batch
  4. Unlock commit mutex
  5. Apply batch to memtable
  6. Wait to be notified that commit is complete

The thread synchronization points (all of the waits and notifies) are overhead. Reducing that overhead can improve performance.

The Pebble commit pipeline addresses both criticisms. The main innovation is a commit queue that mirrors the commit order. The Pebble commit pipeline looks like:

  1. Lock commit mutex
     • Add batch to commit queue
     • Assign batch sequence number
     • Write batch to the WAL
  2. Unlock commit mutex
  3. Apply batch to memtable (concurrently)
  4. Publish batch sequence number

Pebble does not use the concept of a batch group. Each batch is individually written to the WAL, but note that the WAL write is just a memory copy into an internal buffer in the WAL.

The final step, publishing the batch sequence number, deserves further scrutiny as it is where the invariant on the visible batch sequence number is maintained. Publishing the batch sequence number cannot simply bump the visible sequence number, because batches with earlier sequence numbers may still be applying to the memtable. If we were to ratchet the visible sequence number without waiting for those applies to finish, a concurrent reader could see partial batch contents. Note that RocksDB has experimented with allowing these semantics with its unordered writes option.

We want to retain the atomic visibility of batch commits. The publish batch sequence number step needs to ensure that we don't ratchet the visible sequence number until all batches with earlier sequence numbers have been applied. Enter the commit queue: a lock-free single-producer, multi-consumer queue. Batches are added to the commit queue with the commit mutex held, ensuring the same order as the sequence number assignment. After a batch finishes applying to the memtable, the committer atomically marks the batch as applied. It then removes the prefix of applied batches from the commit queue, bumping the visible sequence number and marking each removed batch as committed (via a sync.WaitGroup). If the first batch in the commit queue has not been applied, we wait for our batch to be committed, relying on another concurrent committer to perform the visible sequence number ratcheting for our batch. We know a concurrent commit is taking place because if there were only one batch committing it would be at the head of the commit queue.

There are two possibilities when publishing a sequence number. The first is that there is an unapplied batch at the head of the queue. Consider the following scenario where we're trying to publish the sequence number for batch B.

  +---------------+-------------+---------------+-----+
  | A (unapplied) | B (applied) | C (unapplied) | ... |
  +---------------+-------------+---------------+-----+

The publish routine will see that A is unapplied and then simply wait for B's done sync.WaitGroup to be signalled. This is safe because A must still be committing. And if A has concurrently been marked as applied, the goroutine publishing A will then publish B. What happens when A publishes its sequence number? The commit queue state becomes:

  +-------------+-------------+---------------+-----+
  | A (applied) | B (applied) | C (unapplied) | ... |
  +-------------+-------------+---------------+-----+

The publish routine pops A from the queue, ratchets the sequence number, then pops B and ratchets the sequence number again, and then finds C and stops. An important detail to notice is that the committer for batch B didn't have to do any more work. An alternative approach would be to have B wake up and ratchet its own sequence number, but that would serialize the remainder of the commit queue behind that goroutine waking up.

The commit queue reduces the number of thread synchronization operations required to commit a batch. There is no leader to notify, or followers to wait for. A commit either publishes its own sequence number, or performs one synchronization operation to wait for a concurrent committer to publish its sequence number.
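The publish step described above can be sketched in Go. Pebble's actual commit queue is a lock-free single-producer, multi-consumer ring; for clarity this sketch uses a mutex-protected slice, and all names are illustrative:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type batch struct {
	seqNum  uint64
	count   uint64
	applied atomic.Bool
	done    sync.WaitGroup // signalled once the batch is visible
}

type pipeline struct {
	mu      sync.Mutex // stands in for the lock-free queue
	queue   []*batch
	visible atomic.Uint64
}

// publish is called after b has been applied to the memtable. It marks
// b as applied, then pops the prefix of applied batches from the queue,
// ratcheting the visible sequence number for each. If an earlier batch
// is still unapplied, we wait: its committer will publish b for us.
func (p *pipeline) publish(b *batch) {
	b.applied.Store(true)
	p.mu.Lock()
	for len(p.queue) > 0 && p.queue[0].applied.Load() {
		head := p.queue[0]
		p.queue = p.queue[1:]
		p.visible.Store(head.seqNum + head.count)
		head.done.Done()
	}
	p.mu.Unlock()
	b.done.Wait()
}

func main() {
	p := &pipeline{}
	a := &batch{seqNum: 1, count: 1}
	b := &batch{seqNum: 2, count: 1}
	a.done.Add(1)
	b.done.Add(1)
	p.queue = []*batch{a, b} // commit order: a, then b
	go p.publish(b)          // b may finish applying first...
	p.publish(a)             // ...but whichever publisher finds an applied prefix ratchets it
	b.done.Wait()
	fmt.Println(p.visible.Load()) // 3
}
```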

Range Deletions

Deletion of an individual key in RocksDB and Pebble is accomplished by writing a deletion tombstone. A deletion tombstone shadows an existing value for a key, causing reads to treat the key as not present. The deletion tombstone mechanism works well for deleting small sets of keys, but what happens if you want to delete all of the keys within a range that might number in the thousands or millions? A range deletion is an operation which deletes an entire range of keys with a single record. In contrast to a point deletion tombstone which specifies a single key, a range deletion tombstone (a.k.a. range tombstone) specifies a start key (inclusive) and an end key (exclusive). This single record is much faster to write than thousands or millions of point deletion tombstones, and can be done blindly -- without iterating over the keys that need to be deleted. The downside to range tombstones is that they require additional processing during reads. How the processing of range tombstones is done significantly affects both the complexity of the implementation, and the efficiency of read operations in the presence of range tombstones.

A range tombstone is composed of a start key, end key, and sequence number. Any key that falls within the range is considered deleted if the key's sequence number is less than the range tombstone's sequence number. RocksDB stores range tombstones segregated from point operations in a special range deletion block within each sstable. Conceptually, the range tombstones stored within an sstable are truncated to the boundaries of the sstable, though there are complexities that cause this to not actually be physically true.

In RocksDB, the main structure implementing range tombstone processing is the RangeDelAggregator. Each read operation and iterator has its own RangeDelAggregator configured for the sequence number the read is taking place at. The initial implementation of RangeDelAggregator built up a "skyline" for the range tombstones visible at the read sequence number.

10   +---+
 9   |   |
 8   |   |
 7   |   +----+
 6   |        |
 5 +-+        |  +----+
 4 |          |  |    |
 3 |          |  |    +---+
 2 |          |  |        |
 1 |          |  |        |
 0 |          |  |        |
  abcdefghijklmnopqrstuvwxyz

The above diagram shows the skyline created for the range tombstones [b,j)#5, [d,h)#10, [f,m)#7, [p,u)#5, and [t,y)#3. The skyline is queried for each key read to see if the key should be considered deleted or not. The skyline structure is stored in a binary tree, making the queries an O(log n) operation in the number of tombstones, though there is an optimization to make this O(1) for next/prev iteration. Note that the skyline representation loses information about the individual range tombstones. This requires the structure to be rebuilt on every read, which has a significant performance impact.

The initial skyline range tombstone implementation has since been replaced with a more efficient lookup structure. See the DeleteRange blog post for a good description of both the original implementation and the new (v2) implementation. The key change in the new implementation is to "fragment" the range tombstones that are stored in an sstable. The fragmented range tombstones provide the same benefit as the skyline representation: the ability to binary search the fragments in order to find the tombstone covering a key. But unlike the skyline approach, the fragmented tombstones can be cached on a per-sstable basis. In the v2 approach, RangeDelAggregator keeps track of the fragmented range tombstones for each sstable encountered during a read or iterator, and logically merges them together.

Fragmenting range tombstones involves splitting range tombstones at overlap points. Let's consider the tombstones in the skyline example above:

10:   d---h
 7:     f------m
 5: b-------j     p----u
 3:                   t----y

Fragmenting the range tombstones at the overlap points creates a larger number of range tombstones:

10:   d-f-h
 7:     f-h-j--m
 5: b-d-f-h-j     p---tu
 3:                   tu---y

While the number of tombstones is larger there is a significant advantage: we can order the tombstones by their start key and then binary search to find the set of tombstones overlapping a particular point. This is possible because, due to the fragmenting, any two tombstones that overlap at all have identical start and end keys. The v2 RangeDelAggregator and associated classes perform fragmentation of the range tombstones stored in each sstable, and those fragmented tombstones are then cached.
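The fragmentation step can be sketched directly: collect every start and end boundary, split each tombstone at the interior boundaries, and sort fragments by start key (and descending sequence number for ties). Applied to the five tombstones above, this reproduces the fragments in the diagram. The types and names are illustrative:

```go
package main

import (
	"fmt"
	"sort"
)

type tombstone struct {
	start, end string
	seqNum     int
}

// fragment splits tombstones at every start/end boundary so that any
// two output fragments either share identical [start,end) bounds or do
// not overlap at all.
func fragment(ts []tombstone) []tombstone {
	// Collect every boundary key.
	set := map[string]bool{}
	for _, t := range ts {
		set[t.start] = true
		set[t.end] = true
	}
	bounds := make([]string, 0, len(set))
	for k := range set {
		bounds = append(bounds, k)
	}
	sort.Strings(bounds)

	// Split each tombstone at the boundaries interior to it.
	var out []tombstone
	for _, t := range ts {
		start := t.start
		for _, b := range bounds {
			if b > start && b < t.end {
				out = append(out, tombstone{start, b, t.seqNum})
				start = b
			}
		}
		out = append(out, tombstone{start, t.end, t.seqNum})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].start != out[j].start {
			return out[i].start < out[j].start
		}
		return out[i].seqNum > out[j].seqNum
	})
	return out
}

func main() {
	frags := fragment([]tombstone{
		{"d", "h", 10}, {"f", "m", 7}, {"b", "j", 5}, {"p", "u", 5}, {"t", "y", 3},
	})
	fmt.Println(len(frags))  // 13
	fmt.Println(frags[0])    // {b d 5}
}
```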

In summary, in RocksDB RangeDelAggregator acts as an oracle for answering whether a key is deleted at a particular sequence number. Due to caching of fragmented tombstones, the v2 implementation of RangeDelAggregator is significantly faster to populate than v1, yet the overall approach to processing range tombstones remains similar.

Pebble takes a different approach: it integrates range tombstone processing directly into the mergingIter structure. mergingIter is the internal structure which provides a merged view of the levels in an LSM. RocksDB has a similar class named MergingIterator. Internally, mergingIter maintains a heap over the levels in the LSM (note that each memtable and L0 table is a separate "level" in mergingIter). In RocksDB, MergingIterator knows nothing about range tombstones, and it is thus up to higher-level code to process range tombstones using RangeDelAggregator.

While the separation of MergingIterator and range tombstones seems reasonable at first glance, there is an optimization that RocksDB does not perform which is awkward with the RangeDelAggregator approach: skipping swaths of deleted keys. A range tombstone often shadows more than one key. Rather than iterating over the deleted keys, it is much quicker to seek to the end point of the range tombstone. The challenge in implementing this optimization is that a key might be newer than the range tombstone and thus shouldn't be skipped. The key insight is that the level structure itself provides sufficient information: a range tombstone at Ln is guaranteed to be newer than any key it overlaps in Ln+1.

Pebble utilizes the insight above to integrate range deletion processing with mergingIter. A mergingIter maintains a point iterator and a range deletion iterator per level in the LSM. In this context, every L0 table is a separate level, as is every memtable. Within a level, when a range deletion contains a point operation the sequence numbers must be checked to determine if the point operation is newer or older than the range deletion tombstone. The mergingIter maintains the invariant that the range deletion iterators for all levels newer than the current iteration key are positioned at the next (or previous during reverse iteration) range deletion tombstone. We know those levels don't contain a range deletion tombstone that covers the current key because if they did the current key would be deleted. The range deletion iterator for the current key's level is positioned at a range tombstone covering or past the current key. The position of all of the other range deletion iterators is unspecified. Whenever a key from those levels becomes the current key, their range deletion iterators need to be positioned. This lazy positioning avoids seeking the range deletion iterators for keys that are never considered.

For a full example, consider the following setup:

  p0:               o
  r0:             m---q

  p1:              n p
  r1:       g---k

  p2:  b d    i
  r2: a---e           q----v

  p3:     e
  r3:

The diagram above shows 4 levels, with pX indicating the point operations in a level and rX indicating the range tombstones.

If we start iterating from the beginning, the first key we encounter is b in p2. When the mergingIter is pointing at a valid entry, the range deletion iterators for all of the levels less than the current key's level (i.e. the newer levels) are positioned at the next range tombstone past the current key. So r0 will point at [m,q) and r1 at [g,k). When the key b is encountered, we check to see if the current tombstone for r0 or r1 contains it, and whether the tombstone for r2, [a,e), contains and is newer than b.

Advancing the iterator finds the next key at d. This is in the same level as the previous key b so we don't have to reposition any of the range deletion iterators, but merely check whether d is now contained by any of the range tombstones at higher levels or has stepped past the range tombstone in its own level. In this case, there is nothing to be done.

Advancing the iterator again finds e. Since e comes from p3, we have to position the r3 range deletion iterator, which is empty. e is past the r2 tombstone of [a,e) so we need to advance the r2 range deletion iterator to [q,v).

The next key is i. Because this key is in p2, a level above e, we don't have to reposition any range deletion iterators and instead see that i is covered by the range tombstone [g,k). The iterator is immediately advanced to n which is covered by the range tombstone [m,q) causing the iterator to advance to o which is visible.

Flush and Compaction Pacing

Flushes and compactions in LSM trees are problematic because they contend with foreground traffic, resulting in write and read latency spikes. Without throttling the rate of flushes and compactions, they occur "as fast as possible" (which is not entirely true, since we have a bytes_per_sync option). This instantaneous usage of CPU and disk IO results in potentially huge latency spikes for writes and reads which occur in parallel to the flushes and compactions.

RocksDB attempts to solve this issue by offering an option to limit the speed of flushes and compactions. A maximum bytes/sec can be specified through the options, and background IO usage will be limited to the specified amount. Flushes are given priority over compactions, but they still use the same rate limiter. Though simple to implement and understand, this option is fragile for various reasons.

  1. If the rate limit is configured too low, the DB will stall and write throughput will be affected.
  2. If the rate limit is configured too high, the write and read latency spikes will persist.
  3. A different configuration is needed per system depending on the speed of the storage device.
  4. Write rates typically do not stay the same throughout the lifetime of the DB (higher throughput during certain times of the day, etc) but the rate limit cannot be configured during runtime.

RocksDB also offers an "auto-tuned" rate limiter which uses a simple multiplicative-increase, multiplicative-decrease algorithm to dynamically adjust the background IO rate limit depending on how much of the rate limiter has been exhausted in an interval. This solves the problem of having a static rate limit, but Pebble attempts to improve on this with a different pacing mechanism.

Pebble's pacing mechanism uses separate rate limiters for flushes and compactions. Both the flush and compaction pacing mechanisms work by attempting to flush and compact only as fast as needed and no faster. This is achieved differently for flushes versus compactions.

For flush pacing, Pebble matches the rate at which the memtable is flushed to the rate of user writes. This ensures that disk IO used by flushes remains steady. When a mutable memtable becomes full and is marked immutable, it would typically be flushed as fast as possible. Instead, Pebble looks at the total number of bytes in all the memtables (mutable + queue of immutables) and subtracts the number of bytes that have been flushed in the current flush. This number gives us the total number of bytes which remain to be flushed. If we keep this number steady at a constant level, we have the invariant that the flush rate is equal to the write rate.

When the number of bytes remaining to be flushed falls below our target level, we slow down flushing. We maintain a minimum flush rate so that flushes proceed even if writes have stopped. When the number of bytes remaining to be flushed rises above our target level, we let the flush proceed as fast as possible, without any rate limiting. Note, however, that this second case indicates that writes are arriving faster than the memtable can be flushed, which is unsustainable: the LSM would soon hit the memtable count stall condition and writes would be completely stopped.
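The flush pacing decision above can be sketched as follows. The function and its knobs (the target level and the minimum rate) are hypothetical illustrations of the idea, not Pebble's actual internals.

```go
package main

import "fmt"

// flushRate returns a bytes/sec budget for the next chunk of a flush.
// memtableBytes is the total across the mutable memtable and the queue
// of immutable memtables; flushedBytes is what the current flush has
// already written. A return value of -1 means "unthrottled". All names
// and knobs here are hypothetical.
func flushRate(memtableBytes, flushedBytes, targetBytes, writeRate, minRate int64) int64 {
	remaining := memtableBytes - flushedBytes
	if remaining > targetBytes {
		// More bytes remain than the target level: writes are
		// outpacing the flush, so run unthrottled.
		return -1
	}
	// Pace the flush at the user write rate, but never below a
	// minimum so the flush finishes even if writes stop.
	if writeRate < minRate {
		return minRate
	}
	return writeRate
}

func main() {
	// 16 MB remain, below a 32 MB target: pace at the 10 MB/s write rate.
	fmt.Println(flushRate(64<<20, 48<<20, 32<<20, 10<<20, 1<<20))
	// 56 MB remain, above the target: unthrottled.
	fmt.Println(flushRate(64<<20, 8<<20, 32<<20, 10<<20, 1<<20))
}
```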

For compaction pacing, Pebble uses an estimate of compaction debt: the number of bytes that must be compacted before no further compactions are needed. The estimate is computed by adding the bytes written by the current flush to the size of the level 0 sstables and determining how many bytes exceed the target size for level 0. That excess is multiplied by the level size ratio and added to the compaction debt estimate, and the excess cascades down to the next level, where the process repeats. Iterating down to the final level yields a compaction debt estimate for the entire LSM tree.
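The cascade described above can be transcribed as a sketch. The level sizes, targets, and ratio below are made-up illustration values, and the function is a literal rendering of the description rather than Pebble's actual code.

```go
package main

import "fmt"

// compactionDebt estimates the bytes that must be compacted before no
// further compactions are needed: excess bytes in each level are
// multiplied by the level size ratio, added to the debt, and carried
// down to the next level. Illustrative transcription, not Pebble code.
func compactionDebt(flushingBytes int64, levelBytes, levelTarget []int64, ratio int64) int64 {
	var debt int64
	carry := flushingBytes // bytes the current flush adds to L0
	for i := range levelBytes {
		excess := levelBytes[i] + carry - levelTarget[i]
		if excess <= 0 {
			break // this level fits; nothing cascades further
		}
		debt += excess * ratio
		carry = excess // the excess spills into the next level
	}
	return debt
}

func main() {
	// L0 holds 50 MB (target 40 MB) plus a 10 MB flush in progress:
	// 20 MB of excess times a level ratio of 10 gives 200 MB of debt.
	// L1 then absorbs the 20 MB without exceeding its 120 MB target.
	fmt.Println(compactionDebt(10, []int64{50, 100}, []int64{40, 120}, 10))
}
```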

As with flush pacing, we want to keep the compaction debt at a constant level. This ensures that compactions occur only as fast as needed and no faster. If the compaction debt estimate falls below our target level, we slow down compactions; we maintain a minimum compaction rate so that compactions proceed even if flushes have stopped. If the compaction debt rises above our target level, we let compactions proceed as fast as possible without any rate limiting. As before, this indicates that writes are occurring faster than background compactions can keep up with, which is unsustainable: the LSM's read amplification would grow and the L0 file count stall condition would be hit.
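The compaction pacing decision mirrors flush pacing, with the debt estimate playing the role of bytes remaining to be flushed. A sketch, with hypothetical names and thresholds (the pacing rate used when below target is also an assumption):

```go
package main

import "fmt"

// compactionRate returns a bytes/sec budget for compactions given the
// current compaction debt estimate. A return value of -1 means
// "unthrottled". The names and thresholds are hypothetical.
func compactionRate(debtBytes, targetDebt, pacedRate, minRate int64) int64 {
	if debtBytes > targetDebt {
		// Debt is accumulating faster than compactions retire it:
		// run compactions unthrottled.
		return -1
	}
	// Otherwise pace compactions, but keep a floor so they proceed
	// even when flushes have stopped.
	if pacedRate < minRate {
		return minRate
	}
	return pacedRate
}

func main() {
	fmt.Println(compactionRate(100<<20, 256<<20, 20<<20, 4<<20)) // below target: paced
	fmt.Println(compactionRate(512<<20, 256<<20, 20<<20, 4<<20)) // above target: unthrottled
}
```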

With the combined flush and compaction pacing mechanisms, flushes and compactions only occur as fast as needed and no faster, which reduces latency spikes for user read and write operations.

Write throttling

RocksDB adds artificial delays to user writes when certain thresholds are met, such as level0_slowdown_writes_trigger. These delays kick in when the system is close to stalling, lessening the write pressure so that flushing and compactions can catch up. On the surface this seems attractive: abrupt write stalls are seemingly eliminated and replaced with gradual slowdowns, and closed loop write latency benchmarks would show the stalls disappear.

However, this does nothing to improve latencies under an open loop model, which is the model more likely to resemble real world use cases. Artificial delays increase write latencies without a clear benefit. Write stalls in an open loop system indicate that writes are being generated faster than the system can possibly handle, and adding artificial delays doesn't change that.

For this reason, Pebble doesn't add artificial delays to user writes; writes are served as quickly as possible.

Other Differences

  • internalIterator API which minimizes indirect (virtual) function calls
  • Previous pointers in the memtable and indexed batch skiplists
  • Elision of per-key lower/upper bound checks in long range scans
  • Improved Iterator API
    • SeekPrefixGE for prefix iteration
    • SetBounds for adjusting the bounds on an existing Iterator
  • Simpler Get implementation