Rocksdb Architecture Guide
1. Introduction
The rocksdb project started at Facebook as an experiment to develop efficient database software that can realize the full potential of storing data on flash drives. It is a C++ library that stores keys and values, where both keys and values are arbitrary-size byte streams. It supports atomic reads and atomic writes but not general-purpose transactions. Its settings are highly configurable and can be tuned for a variety of production environments: it can be configured to run on data in pure memory, on flash, on hard disks, or on HDFS. It supports various compression algorithms and comes with good tools for production support and debugging.
Some portions of the code have been inherited from the open source leveldb project. If you have an application that uses leveldb, you should be able to use rocksdb by changing a few lines of code in your application.
2. Assumptions and Goals
## Performance
The primary design point for rocksdb is that it should perform well on fast storage. It should be able to exploit the full potential of the high read/write rates offered by flash or RAM subsystems. It should support efficient point lookups as well as range scans. It should be configurable to support high random-read workloads, high update workloads, or a combination of both.
## Production support
Rocksdb should be designed with built-in support for tools and utilities that help deployment and debugging in production environments. Most major parameters should be fully tunable so that it can be used by different applications on different hardware.
## Backward Compatibility
Newer versions of this software should be backward compatible, so that existing applications do not need to change when upgrading to newer releases of rocksdb.
- Puts
  - batch puts
  - disable WAL
  - async puts, fsync and fdatasync
- Iterators and snapshots
- batch commit to transaction log and manifest updates
- binary search for overlapping files for every level
- checksums for reads (default false)
- hardware assists
- bloom filters
- shared block cache
- ReadOnly mode
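The "binary search for overlapping files for every level" item can be pictured with a small sketch. It is an illustration only, not rocksdb's actual code: it assumes that the files of a level are kept sorted by key and do not overlap each other, and the names `FileRange` and `FindFile` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the key range covered by one sorted data file.
struct FileRange {
    std::string smallest;  // smallest key stored in the file
    std::string largest;   // largest key stored in the file
};

// Returns the index of the only file in `level` whose range could contain
// `key`, or -1 if no file covers it.
// Assumption: `level` is sorted by key and its files do not overlap.
int FindFile(const std::vector<FileRange>& level, const std::string& key) {
    // Binary search for the first file whose largest key is >= key.
    auto it = std::lower_bound(
        level.begin(), level.end(), key,
        [](const FileRange& f, const std::string& k) { return f.largest < k; });
    if (it == level.end()) return -1;        // key is past the last file
    assert(it->smallest <= it->largest);     // sanity check on the metadata
    if (key < it->smallest) return -1;       // key falls in a gap between files
    return static_cast<int>(it - level.begin());
}
```

With files covering `a..c`, `e..g` and `k..p`, a lookup for `e` selects the second file, while a lookup for `h` finds no candidate in this level, so no file has to be opened for it.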
# Disk Format
- .sst files for data
- .log files for transactions
- manifest_file for database versions
- LOG* for server information logs
# Compactions
- multi-threaded
- thread pool per environment
- priority queues for merge sort
- user-defined hook for implementing TTL, sanity checks, etc.
- avoid compression for two levels, snappy, bzip, zlib
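The "priority queues for merge sort" item refers to the general technique of merging several sorted inputs through a min-heap. The sketch below shows that technique under stated assumptions and is not rocksdb's compaction code: each input run is sorted with unique keys, `runs[0]` is treated as the newest input so that it wins when a key appears more than once, and the names `Entry`, `Run` and `MergeRuns` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>
#include <utility>
#include <vector>

using Entry = std::pair<std::string, std::string>;  // (key, value)
using Run = std::vector<Entry>;                     // one sorted input

// Merges sorted runs into a single sorted run.
// Assumption: runs[0] is the newest input; for duplicate keys the entry
// from the lowest run index is kept and older versions are dropped.
Run MergeRuns(const std::vector<Run>& runs) {
    struct Cursor {
        std::size_t run;  // which input run
        std::size_t pos;  // position inside that run
    };
    // Heap order: smallest key on top; ties are resolved in favor of the newest run.
    auto later = [&runs](const Cursor& a, const Cursor& b) {
        const std::string& ka = runs[a.run][a.pos].first;
        const std::string& kb = runs[b.run][b.pos].first;
        if (ka != kb) return ka > kb;
        return a.run > b.run;
    };
    std::priority_queue<Cursor, std::vector<Cursor>, decltype(later)> heap(later);
    for (std::size_t r = 0; r < runs.size(); ++r) {
        if (!runs[r].empty()) heap.push(Cursor{r, 0});
    }
    Run out;
    while (!heap.empty()) {
        Cursor c = heap.top();
        heap.pop();
        const Entry& e = runs[c.run][c.pos];
        // Emit a key only once; later pops of the same key are older versions.
        if (out.empty() || out.back().first != e.first) out.push_back(e);
        if (c.pos + 1 < runs[c.run].size()) {
            assert(runs[c.run][c.pos + 1].first > e.first);  // each run must be sorted
            heap.push(Cursor{c.run, c.pos + 1});
        }
    }
    return out;
}
```

The heap holds at most one cursor per input, so each step costs O(log k) for k inputs regardless of how large the inputs are.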
# Incremental Backups
- GetLiveFiles
- GetUpdatesSince
- WALs are archived
# Environments
- posix (production)
- hdfs environment (prototype)
# Tools and Tests
- sst_dump
- manifest_dump
- compact database, change number of levels
- stress test
# Java API