Compaction

Compaction algorithms constrain the shape of the LSM tree. They determine which sorted runs can be merged and which sorted runs need to be accessed for a read operation. You can read more on RocksDB compactions here: Multi-threaded compactions.

LSM terminology and metaphors

Let us first establish the different, sometimes mixed, metaphors and terminology used in describing LSM levels and structure.

A level is above another level if its number is lower. For example, L1 is above L2.
The lowest-numbered level, L0, can be called the top level or first level.
- A version of a key in L0 must be newer than versions of that same key in all levels below L0.
- Thus, L0 is sometimes loosely referred to as the level containing the newest data.
A level is below another level if its number is higher. For example, L2 is below L1.
The highest-numbered level, Lmax, can be called the bottom-most or last level.
- A version of a key in Lmax must be older than versions of that same key in all levels above Lmax.
- Thus, Lmax is sometimes loosely referred to as the level containing the oldest data.
When talking about a particular key or key-range, a level is considered bottom-most when that level contains data for that key or key-range and no level below it contains data for it.
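The terminology above can be illustrated with a toy model. This is only a sketch: each level is represented as a plain dictionary (a hypothetical simplification, since real levels are sets of SST files and L0 files may overlap), and the helper names are invented for illustration.

```python
# Hypothetical model: levels[0] is L0 (top) and levels[-1] is Lmax (bottom).
# Each level maps key -> value; this is an assumption made for illustration.

def newest_version(levels, key):
    """Search top-down and return (level_number, value) for the first hit."""
    for number, level in enumerate(levels):
        if key in level:
            return number, level[key]
    return None

def bottommost_level_for(levels, key):
    """Return the highest-numbered level that holds the key, or None."""
    for number in range(len(levels) - 1, -1, -1):
        if key in levels[number]:
            return number
    return None

levels = [
    {"a": "a3"},             # L0: newest version of "a"
    {"a": "a2", "b": "b1"},  # L1: older version of "a"
    {"c": "c0"},             # L2 (Lmax)
]

print(newest_version(levels, "a"))        # found in L0 first
print(bottommost_level_for(levels, "a"))  # L1 is bottom-most for "a"
```

In this model L1 is bottom-most for key "a" even though L2 is the last level, because no level below L1 contains data for "a".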

Overview of Compaction algorithms

Source: https://smalldatum.blogspot.com/2018/08/name-that-compaction-algorithm.html

Here we present a taxonomy of compaction algorithms: Classic Leveled, Tiered, Tiered+Leveled, Leveled-N, FIFO. Of these, RocksDB implements Tiered+Leveled (termed Level Compaction in the code), Tiered (termed Universal in the code), and FIFO.

Classic Leveled

Classic Leveled compaction, introduced in the LSM-tree paper by O'Neil et al., minimizes space amplification at the cost of read and write amplification.

The LSM tree is a sequence of levels. Each level is one sorted run that can be range partitioned into many files. Each level is many times larger than the previous level. The size ratio of adjacent levels is sometimes called the fanout and write amplification is minimized when the same fanout is used between all levels. Compaction into level N (Ln) merges data from Ln-1 into Ln. Compaction into Ln rewrites data that was previously merged into Ln. The per-level write amplification is equal to the fanout in the worst case, but it tends to be less than the fanout in practice as explained in this paper by Hyeontaek Lim et al. Compaction in the original LSM paper was all-to-all -- all data from Ln-1 is merged with all data from Ln. It is some-to-some for LevelDB and RocksDB -- some data from Ln-1 is merged with some (the overlapping) data in Ln.
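As a rough worked example of the worst case described above, the sketch below computes level sizes for a fixed fanout and sums the per-level worst-case write amplification. The function names and the 256MB starting size are assumptions chosen for illustration, and the model ignores the WAL, memtable flushes and L0.

```python
def leveled_level_sizes(first_level_bytes, fanout, num_levels):
    """Target sizes when each level is `fanout` times larger than the previous one."""
    return [first_level_bytes * fanout ** i for i in range(num_levels)]

def leveled_worst_case_write_amp(fanout, num_levels):
    """Worst case: per-level write amplification equals the fanout at every level."""
    return fanout * num_levels

MB = 1024 * 1024
sizes = leveled_level_sizes(first_level_bytes=256 * MB, fanout=10, num_levels=4)
worst_case = leveled_worst_case_write_amp(fanout=10, num_levels=4)

for number, size in enumerate(sizes, start=1):
    print(f"L{number}: {size // MB} MB")
print("worst-case write amplification:", worst_case)
```

As the text notes, the write amplification observed in practice tends to be lower than this bound.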

While write amplification is usually worse with leveled than with tiered, there are a few cases where leveled is competitive. The first is key-order inserts, where a RocksDB optimization greatly reduces write amplification. The second is skewed writes, where only a small fraction of the keys are likely to be updated. With the right value for compaction priority in RocksDB, compaction should stop at the smallest level that is large enough to capture the write working set -- it won't go all the way to the max level. When leveled compaction is some-to-some, compaction is only done for the slices of the LSM tree that overlap the written keys, which can generate less write amplification than all-to-all compaction.
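The some-to-some idea can be sketched as a key-range overlap test: only files in Ln whose key range intersects the input file from Ln-1 take part in the merge. The `SstFile` type and file names below are hypothetical and are not RocksDB structures.

```python
from typing import List, NamedTuple

class SstFile(NamedTuple):
    name: str
    smallest: str  # smallest key in the file (inclusive)
    largest: str   # largest key in the file (inclusive)

def overlapping_files(input_file: SstFile, next_level: List[SstFile]) -> List[SstFile]:
    """Return the files in the next level whose key range overlaps the input file."""
    return [
        f for f in next_level
        if f.smallest <= input_file.largest and input_file.smallest <= f.largest
    ]

next_level = [
    SstFile("10.sst", "a", "f"),
    SstFile("11.sst", "g", "m"),
    SstFile("12.sst", "n", "z"),
]
picked = overlapping_files(SstFile("20.sst", "e", "h"), next_level)
print([f.name for f in picked])
```

Here the file covering "n" to "z" is left untouched, which is why skewed writes can generate less write amplification than all-to-all compaction.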

Leveled-N

Leveled-N compaction is like leveled compaction but with less write amplification and more read amplification. It allows more than one sorted run per level. Compaction merges all sorted runs from Ln-1 into one sorted run from Ln, which is the leveled part. The "-N" is added to the name to indicate that there can be N sorted runs per level. The Dostoevsky paper defined a compaction algorithm named Fluid LSM in which the max level has 1 sorted run but the non-max levels can have more than 1 sorted run. Leveled compaction is done into the max level.

Tiered

Tiered compaction minimizes write amplification at the cost of read and space amplification.

The LSM tree can still be viewed as a sequence of levels as explained in the Dostoevsky paper by Niv Dayan and Stratos Idreos. Each level has N sorted runs. Each sorted run in Ln is ~N times larger than a sorted run in Ln-1. Compaction merges all sorted runs in one level to create a new sorted run in the next level. N in this case is similar to fanout for leveled compaction. Compaction does not read/rewrite sorted runs in Ln when merging into Ln. The per-level write amplification is 1 which is much less than for leveled where it was fanout.
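A small simulation can make the per-level write amplification of 1 concrete. This is a sketch under simplifying assumptions: every flush produces a run of the same size, keys are never overwritten (so a merged run is exactly the sum of its inputs), and a level is merged as soon as it holds N runs. The function name is invented for illustration.

```python
def simulate_tiered(num_flushes, runs_per_level, flush_bytes=1):
    """Return (levels, bytes_written); levels[i] lists the sorted-run sizes in level i."""
    levels = [[]]
    bytes_written = 0
    for _ in range(num_flushes):
        levels[0].append(flush_bytes)
        bytes_written += flush_bytes  # the flush itself writes the data once
        i = 0
        # Merge all runs of a full level into one new run in the next level.
        # Existing runs in the next level are not read or rewritten.
        while len(levels[i]) == runs_per_level:
            merged = sum(levels[i])
            levels[i] = []
            if i + 1 == len(levels):
                levels.append([])
            levels[i + 1].append(merged)
            bytes_written += merged
            i += 1
    return levels, bytes_written

levels, written = simulate_tiered(num_flushes=16, runs_per_level=4)
print(levels)
print("write amplification:", written / 16)
```

With 16 flushes and N = 4, the data is written once by the flush and once for each of the two merges it passes through, so the total write amplification in this model is 3.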

A common approach for tiered is to merge sorted runs of similar size, without having the notion of levels (which imply a target for the number of sorted runs of specific sizes). Most implementations include some notion of major compaction, which includes the largest sorted run, along with conditions that trigger major and non-major compaction. Too many files and too many bytes are typical conditions.

There are a few challenges with tiered compaction:

Transient space amplification is large when compaction includes a sorted run from the max level.
The block index and bloom filter for large sorted runs will be large. Splitting them into smaller parts is a good idea.
Compaction for large sorted runs takes a long time. Multi-threading would help.
Compaction is all-to-all. When writes are skewed and most of the keys don't get updates, large sorted runs might still be rewritten in full. In a traditional tiered algorithm there is no way to rewrite only a subset of a large sorted run.

For tiered compaction, the notion of levels is usually a concept for reasoning about the shape of the LSM tree and estimating write amplification. With RocksDB they are also an implementation detail. The levels of the LSM tree beyond L0 can be used to store the larger sorted runs. The benefit is that large sorted runs are partitioned into smaller SSTs. This reduces the size of the largest bloom filter and block index chunks -- which is friendlier to the block cache -- and was a big deal before partitioned index/filter was supported. With subcompactions this enables multi-threaded compaction of the largest sorted runs. Note that RocksDB uses the name universal rather than tiered.

Tiered compaction in the RocksDB code base is termed Universal Compaction.

Tiered+Leveled

Tiered+Leveled has less write amplification than leveled and less space amplification than tiered.

The tiered+leveled approach is a hybrid that uses tiered for the smaller levels and leveled for the larger levels. It is flexible about the level at which the LSM tree switches from tiered to leveled. For now I assume that if Ln is leveled then all levels that follow (Ln+1, Ln+2, ...) must be leveled.

SlimDB from VLDB 2018 is an example of tiered+leveled although it might allow Lk to be tiered when Ln is leveled for k > n. Fluid LSM is described as tiered+leveled but I think it is leveled-N.

Leveled compaction in RocksDB is also tiered+leveled. There can be N sorted runs at the memtable level courtesy of the max_write_buffer_number option -- only one is active for writes, the rest are read-only and waiting to be flushed. A memtable flush is similar to tiered compaction -- the memtable output creates a new sorted run in L0 and doesn't read/rewrite existing sorted runs in L0. There can be N sorted runs in level 0 (L0) courtesy of level0_file_num_compaction_trigger. So the L0 is tiered. Compaction isn't done into the memtable level so it doesn't have to be labeled as tiered or leveled. Subcompactions in the RocksDB L0 make this even more interesting, but that is a topic for another post.
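One way to reason about this shape is to count the sorted runs that exist at a given moment. The sketch below assumes each memtable and each L0 file is its own sorted run, while every non-empty level beyond L0 is a single sorted run. The function name and the example numbers are hypothetical.

```python
def count_sorted_runs(memtables, files_per_level):
    """files_per_level[0] is the L0 file count; later entries are L1, L2, ..."""
    l0_runs = files_per_level[0]  # tiered part: one run per L0 file
    leveled_runs = sum(1 for n in files_per_level[1:] if n > 0)  # one run per level
    return memtables + l0_runs + leveled_runs

# 2 memtables, 4 files in L0, files in L1 and L2, nothing in L3
runs = count_sorted_runs(memtables=2, files_per_level=[4, 10, 100, 0])
print(runs)
```

The count is an upper bound on the sorted runs a point lookup might consult in this model. It says nothing about bloom filters or other read-path optimizations.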

FIFO

FIFO-style compaction drops the oldest files once they become obsolete and can be used for cache-like data.
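A minimal sketch of the idea follows, assuming that files become obsolete when the total size exceeds a configured limit (RocksDB exposes such a limit as max_table_files_size in CompactionOptionsFIFO). The function and file names are invented for illustration and this is not the RocksDB implementation.

```python
from collections import deque

def fifo_compact(files, max_total_bytes):
    """files: deque of (name, size), oldest first. Drops oldest files until the total fits.

    Returns the names of the dropped files; `files` is modified in place.
    """
    total = sum(size for _, size in files)
    dropped = []
    while files and total > max_total_bytes:
        name, size = files.popleft()  # the oldest file is dropped, not merged
        total -= size
        dropped.append(name)
    return dropped

files = deque([("000001.sst", 40), ("000002.sst", 40), ("000003.sst", 40)])
dropped = fifo_compact(files, max_total_bytes=100)
print(dropped, list(files))
```

No data is rewritten in this scheme, which is why it suits cache-like data where losing the oldest entries is acceptable.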

Options

Here we give an overview of the options that affect the behavior of compactions:

AdvancedColumnFamilyOptions::compaction_style - RocksDB currently supports four compaction styles: kCompactionStyleLevel (default), kCompactionStyleUniversal, kCompactionStyleFIFO and kCompactionStyleNone. If kCompactionStyleNone is selected, compaction has to be triggered manually by calling CompactRange() or CompactFiles(). Level compaction options are available under AdvancedColumnFamilyOptions. Universal compaction options are available in AdvancedColumnFamilyOptions::compaction_options_universal, and FIFO compaction options are available in AdvancedColumnFamilyOptions::compaction_options_fifo.
ColumnFamilyOptions::disable_auto_compactions - This dynamically changeable setting can be used by the application to disable automatic compactions. Manual compactions can still be issued on this database.
ColumnFamilyOptions::compaction_filter - Allows an application to modify or delete a key-value pair during background compaction, using a single filter instance. The client must provide compaction_filter_factory if it requires a new compaction filter to be used for different compaction processes. The client should specify only one of filter or factory.
ColumnFamilyOptions::compaction_filter_factory - A factory that provides compaction filter objects, which allow an application to modify or delete a key-value pair during background compaction. A new filter will be created for each compaction run.
DBOptions::max_subcompactions (Default: 1) - Specify the max number of subcompactions each compaction is allowed to be split into.
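To illustrate what a compaction filter does conceptually, the sketch below merges sorted runs, keeps only the newest version of each key, and asks a callback whether each surviving entry should be removed. It is a toy model in Python, not the RocksDB C++ interface: the names are hypothetical, and tombstones, snapshots and merge operands are ignored.

```python
def compact(runs, should_remove):
    """Merge runs (newest first) into one sorted run, applying a filter callback.

    runs: list of dicts mapping key -> value, ordered newest to oldest.
    should_remove: callable(key, value) -> bool; True drops the entry.
    """
    merged = {}
    for run in runs:
        for key, value in run.items():
            merged.setdefault(key, value)  # the newest version wins
    return {k: v for k, v in sorted(merged.items()) if not should_remove(k, v)}

runs = [
    {"a": "new", "tmp:1": "x"},  # newer run
    {"a": "old", "b": "1"},      # older run
]
result = compact(runs, should_remove=lambda key, value: key.startswith("tmp:"))
print(result)
```

The callback here plays the role of the filter object. A factory would correspond to constructing a fresh callback for each call to `compact`.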

Other options impacting performance of compactions and when they get triggered are:

DBOptions::access_hint_on_compaction_start (Default: NORMAL) - Specify the file access pattern once a compaction is started. It will be applied to all input files of a compaction. The other AccessHint settings are NONE, SEQUENTIAL and WILLNEED.
ColumnFamilyOptions::level0_file_num_compaction_trigger (Default: 4) - Number of files to trigger level-0 compaction. A negative value means that level-0 compaction will not be triggered by number of files at all.
AdvancedColumnFamilyOptions::target_file_size_base and AdvancedColumnFamilyOptions::target_file_size_multiplier - Target file size for compaction. target_file_size_base is the per-file size for level-1. The target file size for level L can be calculated as target_file_size_base * (target_file_size_multiplier ^ (L-1)). For example, if target_file_size_base is 2MB and target_file_size_multiplier is 10, then each file on level-1 will be 2MB, each file on level-2 will be 20MB, and each file on level-3 will be 200MB. The default target_file_size_base is 64MB and the default target_file_size_multiplier is 1.
AdvancedColumnFamilyOptions::max_compaction_bytes (Default: target_file_size_base * 25) - Maximum number of bytes in all compacted files. We avoid expanding the lower level file set of a compaction if it would make the total compaction cover more than this amount.
DBOptions::max_background_jobs (Default: 2) - Maximum number of concurrent background jobs (compactions and flushes)
DBOptions::compaction_readahead_size - If non-zero, we perform bigger reads when doing compaction. If you're running RocksDB on spinning disks, you should set this to at least 2MB. If you use direct I/O and don't set this option, we enforce a value of 2MB.
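The target file size formula and the default for max_compaction_bytes given above can be checked with a few lines of arithmetic. The function names are made up for this sketch. The defaults are the ones stated in the list.

```python
MB = 1024 * 1024

def target_file_size(level, base=64 * MB, multiplier=1):
    """target_file_size_base * (target_file_size_multiplier ^ (level - 1))."""
    if level < 1:
        raise ValueError("the formula is defined for level-1 and higher-numbered levels")
    return base * multiplier ** (level - 1)

def default_max_compaction_bytes(base=64 * MB):
    """Default stated above: target_file_size_base * 25."""
    return base * 25

# The worked example from the text: base 2MB, multiplier 10.
for level in (1, 2, 3):
    print(f"level-{level}: {target_file_size(level, 2 * MB, 10) // MB} MB")
print("default max_compaction_bytes:", default_max_compaction_bytes() // MB, "MB")
```

With the default multiplier of 1, every level has the same target file size of 64MB.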

Compaction can also be manually triggered. See Manual Compaction.

See include/rocksdb/options.h and include/rocksdb/advanced_options.h for a detailed explanation of these options.

Leveled style compaction

See Leveled Compaction.

Universal style compaction

For a description of universal style compaction, see Universal compaction style.

If you're using Universal style compaction, there is an object CompactionOptionsUniversal that holds all the different options for that compaction. The exact definition is in rocksdb/universal_compaction.h and you can set it in Options::compaction_options_universal. Here we give a short overview of options in CompactionOptionsUniversal:

CompactionOptionsUniversal::size_ratio - Percentage flexibility while comparing file size. If the candidate file(s) size is 1% smaller than the next file's size, then include next file into this candidate set. Default: 1
CompactionOptionsUniversal::min_merge_width - The minimum number of files in a single compaction run. Default: 2
CompactionOptionsUniversal::max_merge_width - The maximum number of files in a single compaction run. Default: UINT_MAX
CompactionOptionsUniversal::max_size_amplification_percent - The size amplification is defined as the amount (in percentage) of additional storage needed to store a single byte of data in the database. For example, a size amplification of 2% means that a database that contains 100 bytes of user data may occupy up to 102 bytes of physical storage. By this definition, a fully compacted database has a size amplification of 0%. RocksDB uses the following heuristic to calculate size amplification: it assumes that all files excluding the earliest file contribute to the size amplification. Default: 200, which means that a 100 byte database could require up to 300 bytes of storage.
CompactionOptionsUniversal::compression_size_percent - If this option is set to -1 (the default value), all output files follow the specified compression type. If this option is not negative, we try to make sure the compressed size is just above this value. In normal cases, at least this percentage of data will be compressed. When compacting to a new file, the criterion for whether it needs to be compressed is as follows. Assume the list of files sorted by generation time is [ A1...An B1...Bm C1...Ct ], where A1 is the newest and Ct is the oldest, and we are going to compact B1...Bm. We calculate the total size of all the files as total_size and the total size of C1...Ct as total_C. The compaction output file is compressed iff total_C / total_size < this percentage.
CompactionOptionsUniversal::stop_style - The algorithm used to stop picking files into a single compaction run. Can be kCompactionStopStyleSimilarSize (pick files of similar size) or kCompactionStopStyleTotalSize (total size of picked files > next file). Default: kCompactionStopStyleTotalSize
CompactionOptionsUniversal::allow_trivial_move - Option to optimize the universal multi level compaction by enabling trivial move for non overlapping files. Default: false.
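Two of the rules above lend themselves to a short worked example: the size amplification heuristic and the compression_size_percent criterion. The sketch follows the wording of the descriptions only. The function names are hypothetical and the actual RocksDB code may differ in details.

```python
def size_amplification_percent(file_sizes):
    """file_sizes is ordered newest first; the last entry is the earliest file.

    Heuristic from the text: every file except the earliest one counts as
    additional storage relative to the earliest file.
    """
    *newer, earliest = file_sizes
    return 100 * sum(newer) / earliest

def should_compress_output(total_c, total_size, compression_size_percent):
    """Criterion from the text: compress iff total_C / total_size < the percentage."""
    if compression_size_percent < 0:
        return True  # -1: output follows the configured compression type
    return 100 * total_c / total_size < compression_size_percent

amp = size_amplification_percent([10, 20, 100])
print(f"estimated size amplification: {amp}%")
print(should_compress_output(total_c=60, total_size=100, compression_size_percent=70))
print(should_compress_output(total_c=80, total_size=100, compression_size_percent=70))
```

In the first call 30 bytes of newer files sit on top of a 100 byte earliest file, so the estimate is 30%, well under the default limit of 200.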

FIFO Compaction Style

See FIFO compaction style

Thread pools

Compactions are executed in thread pools. See Thread Pool.
