-
Notifications
You must be signed in to change notification settings - Fork 6.3k
Setup Options and Basic Tuning
Besides writing code using Basic Operations on RocksDB, you may also be interested in how to tune RocksDB to achieve desired performance. In this page, we introduce how to get an initial set-up, which should work well enough for many use cases.
RocksDB has many configuration options, but most of them can be safely ignored by many users, as the majority of them are for influencing the performance of very specific workloads. For general use, most RocksDB options can be left at their defaults, however, we suggest some options below that every user might like to experiment with for general workloads.
First, you need to think about the options relating to resource limits (see also: Basic Operations):
This can be set either per Database and/or per Column Family.
This is the maximum write buffer size used for the Column Family.
It represents the amount of data to build up in memory (backed by an unsorted log on disk) before converting to a sorted on-disk file. The default is 64 MB.
You need to budget for 2 x your worst case memory use. If you don't have enough memory for this, you should reduce this value. Otherwise, it is not recommended to change this option. For example:
cf_options.write_buffer_size = 64 << 20;
See below for sharing memory across Column Families.
This is the maximum size of all Write Buffers across all Column Families in the database. It represents the amount of data to build up in memtables across all column families before writing to disk.
By default this feature is disabled (by being set to 0
). You should not need to change it. However, for reference, if you do need to change it to 64 GB for example:
db_options.db_write_buffer_size = 64 << 30;
You can create a Block cache of your chosen the size for caching uncompressed data.
We recommend that this should be about 1/3 of your total memory budget. The remaining free memory can be left for the OS (Operating System) page cache. Leaving a large chunk of memory for OS page cache has the benefit of avoiding tight memory budgeting (see also: Memory Usage in RocksDB).
Setting the block cache size requires that we also set table related options, for example if you want an LRU Cache of 128 MB
:
auto cache = NewLRUCache(128 << 20);
BlockBasedTableOptions table_options;
table_options.block_cache = cache;
auto table_factory = new BlockBasedTableFactory(table_options);
cf_options.table_factory.reset(table_factory);
NOTE: You should set the same Cache object on all the table_options
for all the Column Families of all DB's managed by the process. An alternative to achieve this, is to pass the same table_factory
or table_options
to all Column Families of all DB's. To learn more about the Block Cache, see: Block Cache.
You can only choose compression types which are supported on your host system. Using compression is a trade-off between CPU, I/O and storage space.
-
cf_options.compression
controls the compression type used for the firstn-1
levels. We recommend to use LZ4 (kLZ4Compression
), or if not available, to use Snappy (kSnappyCompression
). -
cf_options.bottommost_compression
controls the compression type used for thenth
level. We recommend to use ZStandard (kZSTD
), or if not available, to use Zlib (kZlibCompression
).
To learn more about compression, See Compression.
You should only enable this if it suits your Query patterns; If you have many point lookup operations (i.e. Get()
), then a Bloom Filter can help speed up those operations, conversely if most of your operations are range scans (e.g. Iterator()
) then the Bloom Filter will not help.
The Bloom Filter uses a number of bits for each key, a good value is 10
, which yields a filter with ~1% false positive rate. Refer to Bloom Filter Wiki page for false positive rates with lower number of bits per key.
If Get()
is a common operation for your queries, you can configure the Bloom Filter, for example with 10 bits per key:
table_options.filter_policy.reset(NewBloomFilterPolicy(10, false));
We also recommend
table_options.optimize_filters_for_memory = true;
for some memory saving.
To learn more about Bloom Filters, see: Bloom Filter.
It can be a good idea to limit the rate of compactions and flushes to smooth I/O operations, one reason for doing this is to avoid the read latency outliers. This can be done by means of the db_options.rate_limiter
option. Rate limiting is a complex topic, and is covered in Rate Limiter.
NOTE: Make sure to pass the same rate_limiter
object to all the DB's in your process.
If you are using flash storage, we recommend users to mount the file system with the discard
flag in order to improve write amplification.
If you are using flash storage and the discard
flag, trimming will be employed. Trimming can cause long I/O latencies temporarily if the trim size is very large. The SST File Manager can cap the file deletion speed, so that each trim's size is controlled.
The SST File Manager can be enabled, by setting the db_options.sst_file_manager
option. Details of the SST File Manager can be seen here: sst_file_manager_impl.h.
Below are a number of options, where we feel the values set achieve reasonable out-of-box performance for general workloads. We didn't change these options because of the concern of incompatibility or regression when users upgrade their existing RocksDB instance to a newer version. We suggest that users start their new DB projects with these settings:
cf_options.level_compaction_dynamic_level_bytes = true;
opts.max_background_jobs = 6;
options.bytes_per_sync = 1048576;
options.compaction_pri = kMinOverlappingRatio;
table_options.block_size = 16 * 1024;
table_options.cache_index_and_filter_blocks = true;
table_options.pin_l0_filter_and_index_blocks_in_cache = true;
table_options.format_version = <the latest version>;
Default format version usually lags behind the recommended version for compatibility reason. For new use cases, it's not a concern.
Don't feel sad if you have existing services running with the defaults instead of these options. Whilst we believe that these are better than the default options, none of them is likely to bring significant improvements.
Now you are ready to test your application and see how your initial RocksDB performance looks. Hopefully it will be good enough!
If the performance of RocksDB within your application after the basic set-up described above, is good enough for you, we don't recommend that you tune it further. As it is common for a workload to change over time, if you expend unnecessary resources upfront to tune RocksDB to be highly performant for your current workload, some modest change in future to that workload may push the performance off a cliff.
On the other hand, if the performance is not good enough for you, you can further tune RocksDB by following the more detailed Tuning Guide.
Contents
- RocksDB Wiki
- Overview
- RocksDB FAQ
- Terminology
- Requirements
- Contributors' Guide
- Release Methodology
- RocksDB Users and Use Cases
- RocksDB Public Communication and Information Channels
-
Basic Operations
- Iterator
- Prefix seek
- SeekForPrev
- Tailing Iterator
- Compaction Filter
- Multi Column Family Iterator (Experimental)
- Read-Modify-Write (Merge) Operator
- Column Families
- Creating and Ingesting SST files
- Single Delete
- Low Priority Write
- Time to Live (TTL) Support
- Transactions
- Snapshot
- DeleteRange
- Atomic flush
- Read-only and Secondary instances
- Approximate Size
- User-defined Timestamp
- Wide Columns
- BlobDB
- Online Verification
- Options
- MemTable
- Journal
- Cache
- Write Buffer Manager
- Compaction
- SST File Formats
- IO
- Compression
- Full File Checksum and Checksum Handoff
- Background Error Handling
- Huge Page TLB Support
- Tiered Storage (Experimental)
- Logging and Monitoring
- Known Issues
- Troubleshooting Guide
- Tests
- Tools / Utilities
-
Implementation Details
- Delete Stale Files
- Partitioned Index/Filters
- WritePrepared-Transactions
- WriteUnprepared-Transactions
- How we keep track of live SST files
- How we index SST
- Merge Operator Implementation
- RocksDB Repairer
- Write Batch With Index
- Two Phase Commit
- Iterator's Implementation
- Simulation Cache
- [To Be Deprecated] Persistent Read Cache
- DeleteRange Implementation
- unordered_write
- Extending RocksDB
- RocksJava
- Lua
- Performance
- Projects Being Developed
- Misc