-
Notifications
You must be signed in to change notification settings - Fork 6.8k
RocksDB Bloom Filter
A bloom filter is a bit array that could tell whether an arbitrary key may exist in a set of keys.
Say that there is a set of keys, an algorithm could be applied to create a bit array called bloom filter. Then, given an arbitrary key, this bit array could tell whether a key “may exist” or “definitely not exist” in the original key sets.
Here is a detailed explanation for how bloom filter works.
In RocksDB, every SST file contains a bloom filter. It is used to decide if we need to go into the file to find the exact key.
In RocksDB, each SST file has a bloom filter, it is created when writing SST file to storage. This bloom filter is stored as a part of SST file. There is no difference when building bloom filter for files in different levels.
There is no operation to combine bloom filters. Bloom filters can only be created from a set of keys. When we combine two SST files, new bloom filter is created from keys of new file.
When we open a SST file, the bloom filter is opened and loaded in memory.
When the SST file is closed, the bloom filter is removed from memory. But if
BlockBasedTableOptions::cache_index_and_filter_blocks=true,
the bloom filter may be cached in block cache.
The bloom filter needs all keys to fit in memory to build it. It seems at first sight that building bloom filter for large SST file is impossible. But in block based table, there is no need to worry because bloom filter is created for each data block that just contains a small range of keys.
Details for format of block based table’s filter could be found here.
The block size is defined in
Options::block_size
The default value is 4K. When building a SST file, key-value pairs are added in a sequence, when there is enough k-v pairs ( < 4K), RocksDB would create bloom filter for them and write it to file. So only a very small fraction of keys is stored in memory to build the bloom filter.
We are also working on a new bloom filter for block based table that contains a filter for all keys in SST file. It could improve RocksDB read performance but requires storing hashes of all the keys when building it. Details could be found in next section.
Original bloom filter creates filter for each "data block", thus requires little memory when building it.
We are working on a new bloom filter(called full filter) that contains a filter for all keys in the SST file. This could improve reading performance because it avoids traveling in complicated SST format. But it takes more memory when building because all keys in SST file is required.
Users could specify which kind of bloom filter to use in
tableOptions::FilterBlockType
It use the original bloom filter by default.
The full filter block is formatted as follows:
[filter for all keys in SST file]
(1) The filter is a big bits array that could be used to check for all keys in SST file.
(2) num probes is the number of hash functions used to create bloom filter. In original bloom filter format, it is attached at the end of each filter. (Actually it is the same with full filter)
####Optimization for New Filter Format (1) Store hashes of keys in memory to reduce memory consumption. But users still needs to think twice when using it for large SST files.
(2) Write/Read multiple probes (bits generated by different hash functions) in one CPU cache line. Thus write/read on filter is faster.
Contents
- RocksDB Wiki
- Overview
- RocksDB FAQ
- Terminology
- Requirements
- Contributors' Guide
- Release Methodology
- RocksDB Users and Use Cases
- RocksDB Public Communication and Information Channels
-
Basic Operations
- Iterator
- Prefix seek
- SeekForPrev
- Tailing Iterator
- Compaction Filter
- Multi Column Family Iterator
- Read-Modify-Write (Merge) Operator
- Column Families
- Creating and Ingesting SST files
- Single Delete
- SST Partitioner
- Low Priority Write
- Time to Live (TTL) Support
- Transactions
- Snapshot
- DeleteRange
- Atomic flush
- Read-only and Secondary instances
- Approximate Size
- User-defined Timestamp
- Wide Columns
- BlobDB
- Online Verification
- Options
- MemTable
- Journal
- Cache
- Write Buffer Manager
- Compaction
- SST File Formats
- IO
- Compression
- Full File Checksum and Checksum Handoff
- Background Error Handling
- Huge Page TLB Support
- Tiered Storage (Experimental)
- Logging and Monitoring
- Known Issues
- Troubleshooting Guide
- Tests
- Tools / Utilities
-
Implementation Details
- Delete Stale Files
- Partitioned Index/Filters
- WritePrepared-Transactions
- WriteUnprepared-Transactions
- How we keep track of live SST files
- How we index SST
- Merge Operator Implementation
- RocksDB Repairer
- Write Batch With Index
- Two Phase Commit
- Iterator's Implementation
- Simulation Cache
- [To Be Deprecated] Persistent Read Cache
- DeleteRange Implementation
- unordered_write
- Extending RocksDB
- RocksJava
- Performance
- Projects Being Developed
- Misc