Add memtable MultiGet finger search optimization#14428
Conversation
|
@anand1976 has imported this pull request. If you are a Meta employee, you can view this in D95813786. |
✅ clang-tidy: No findings on changed linesCompleted in 361.0s. |
|
@anand1976 has imported this pull request. If you are a Meta employee, you can view this in D95813786. |
xingbowang
left a comment
There was a problem hiding this comment.
I don't see test coverage on the skiplist height growing handling. I guess it would involve with concurrent read/write. Could we some add test to validate read after write consistency with concurrent read/write operation with some randomly generated keys?
Some other Findings from AI:
- Finger not updated after callback loop walk-forward
After the callback loop walks forward through level-0 nodes, finger->next_[0] still points to the search result rather than the last visited node. On the next FindGreaterOrEqualWithFinger call, the walk-up starts from this stale position, ascending more levels than necessary. For the typical case (callback returns false after 1 entry), this is negligible. For merge operations consuming multiple entries per key, the next lookup degrades toward O(log N).
- key_validation_callback not invoked during callback loop
In InlineSkipList::MultiGet, the callback loop at level 0 (walking forward through matching entries) only checks key ordering but does not invoke key_validation_callback:
inlineskiplist.h
Lines 1330-1337
for (; node != nullptr && callback_func(callback_args[i], node->Key());) {
Node* prev_node = node;
node = node->Next(0);
if (detect_key_out_of_order && prev_node != head_ && node != nullptr &&
compare_(prev_node->Key(), node->Key()) >= 0) {
return Corruption(prev_node, node, allow_data_in_errors);
}
}
During the search phase, FindSpliceForLevelValidated calls key_validation_callback on every traversed node. But nodes visited by the callback loop escape checksum/corruption validation. If paranoid_memory_checks is enabled, this is a gap compared to the per-key path (which validates every node via GetAndValidate → SeekAndValidate + NextAndValidate).
|
I would also like to see benchmark with smaller batch size to make sure there is no regression on smaller batch size. |
xingbowang
left a comment
There was a problem hiding this comment.
The improvement is mostly from reducing cache miss by reusing the same key from last search. With high batch, I felt we could do parallel lookup with multiple prefetch stages to further hide the latency. I claude coded it for insert and seen up to 4X throughput improvement. I guess it would work well for read as well. We can do that later. Feel free to explore it, if you have passion pursuing that direction.
main...xingbowang:rocksdb:experiment_skiplist_perf_up
b053913 to
e9904e4
Compare
| } | ||
| } | ||
|
|
||
| TEST_F(InlineSkipTest, ConcurrentMultiGet) { |
There was a problem hiding this comment.
This test runs one writer on the main thread and four readers on background threads, all operating on the same skip list simultaneously. This is good improvement. However, this is slightly off my goal. My goal is to validate read after write consistency. I would suggest to
- generate a long sequence of unique keys for insertion.
- split the sequence into different chunks.
- spin up different threads to insert different chunks of keys into the skiplist, and then immediately query a batch of random key including the key just inserted. Validate the key just inserted is visible.
Since each key is unique, we don't need worry about key override, the key read from which it just inserted should be easy to validate its value to validate read after write consistency.
To push this further, maybe there is some way to also have it read recent keys inserted by other threads through some other synchronization mechanism, although it is a bit more complicate. Maybe AI could help.
My inclination would be to binary search + partition the (sub)batch at each height of the skip list into sub-batches to send further down the skip list. This approach would "visit" each skip list node only once (per batch). Nodes pushed on the BFS queue to visit could be prefetched to be hot in cache. (A DFS stack might not hide latency as well.) |
pdillinger
left a comment
There was a problem hiding this comment.
Thanks Xingbo for first round reviews. Overall looks good to me. I trust the crash test to find bugs here if there are any.
Implement finger search on skip list for memtable MultiGet. After each key lookup, the search path (splice) is cached and reused for the next key, reducing per-key cost from O(log N) to O(log d) where d is the distance between consecutive keys. Changes: - Add FindGreaterOrEqualWithFinger() and MultiGet() to InlineSkipList - Add virtual MultiGet() to MemTableRep, override in SkipListRep - Add memtable_multi_get_finger_search CF option (default: false) - Integrate finger search into MemTable::MultiGet when enabled - Add db_bench, db_stress, and crash test support - Add unit tests (InlineSkipList level) and integration tests (DB level) Benchmark results (2M keys in memtable, batch_size=64): - Baseline: ~265,700 ops/sec - Finger search: ~283,800 ops/sec - Improvement: ~7%
Optimize MultiGet by using a finger search on the memtable skip list. After finding key[i], the search path (Splice) is retained as a "finger" for key[i+1]. The next search walks up the finger until the forward pointer overshoots, then descends — costing O(log d) where d is the distance between consecutive sorted keys, rather than O(log n) from the head each time. Key changes: - Add InlineSkipList::FindGreaterOrEqualWithFinger() and MultiGet() with optional paranoid validation parameters (default off) - Add MemTableRep::MultiGet() virtual with validation support - Add SkipListRep::MultiGet() override delegating to skip list - Add MemTable::MultiGet() batched lookup phase that sorts keys, performs finger search, then unscrambles results - Rename option from memtable_multi_get_finger_search to memtable_batch_lookup_optimization - Add db_bench, db_stress, and crash test support - Add comprehensive unit tests including paranoid validation tests The optimization provides +3.6% to +10.1% throughput improvement on MultiGet benchmarks across various configurations.
Fix ASSERT_STATUS_CHECKED failure: wrap all list.MultiGet() calls in inlineskiplist_test.cc with ASSERT_OK() to check the returned Status, which is now non-void after the function deduplication.
- Value-initialize std::array declarations in memtable.cc to fix cppcoreguidelines-pro-type-member-init warnings - Add reserve() before push_back loop in inlineskiplist_test.cc to fix performance-inefficient-vector-operation warning - Add braces around single-statement for loop in db_basic_test.cc to fix readability-braces-around-statements warning
Two issues in InlineSkipList::MultiGet's callback walk-forward loop: 1. After the callback loop walks forward through level-0 nodes (merge operations consuming multiple entries), finger.next_[0] still pointed to the original search result. The next key's search would start from this stale position, ascending more levels than needed. Fix: update finger.prev_[0] and finger.next_[0] after each step. 2. The callback loop checked detect_key_out_of_order but did not invoke key_validation_callback on nodes traversed via Next(0). This meant nodes visited during merge walk-forward escaped checksum validation when paranoid_memory_checks was enabled. Fix: call the validation callback on each new node before it is passed to SaveValue.
Add ConcurrentMultiGet test that exercises skip list height growth handling in FindGreaterOrEqualWithFinger. Multiple reader threads perform concurrent MultiGet lookups while a writer thread inserts keys via InsertConcurrently, causing the skip list height to grow during ongoing finger searches. The test validates that each MultiGet result is >= its query key (read-after-write consistency). Also fix two issues in the MultiGet callback loop: - Update finger.prev_[0]/next_[0] after each walk-forward step so the next key's search starts from the current position rather than the stale original search result (performance fix for merge ops) - Invoke key_validation_callback on each node reached via Next(0) in the callback loop, closing a validation gap when paranoid memory checks are enabled
Rewrite the ConcurrentMultiGet test to validate read-after-write consistency rather than just basic concurrent safety: - Generate 20,000 unique keys, shuffle, and split into 4 thread chunks - Each thread inserts its keys one at a time via InsertConcurrently - After each insert, immediately does a MultiGet batch that includes the just-inserted key plus random keys from a shared ring buffer - Validates the just-inserted key is always visible (read-after-write) - Shared ring buffer allows cross-thread visibility testing: each thread publishes inserted keys and samples recently inserted keys from other threads for its MultiGet queries - Use StartThread instead of Schedule for proper WaitForJoin semantics This exercises skip list height growth during active finger searches while providing strong read-after-write consistency guarantees.
e9904e4 to
6d64f87
Compare
|
@anand1976 has imported this pull request. If you are a Meta employee, you can view this in D95813786. |
|
Thanks Xingbo and Peter for the review.
That's a nice idea. To Xingbo's point, hiding cache latency would give a significant boost. I'll explore this when I have some time. The BFS approach could also be applied to batched index lookups in SSTs (though impact may be less due to smaller batches). |
|
@anand1976 merged this pull request in c66c142. |
…facebook#14428) Summary: The memtable MultiGet finger search optimization (facebook#14428) added a walk-forward loop that updates finger.prev_[0] to the last node processed by the callback. When a MultiGet batch contains duplicate user keys, the walk-forward for the first occurrence processes all entries for that user key, leaving finger.prev_[0] at the last entry (lowest sequence number). For the duplicate key, the lookup key's snapshot sequence number is higher, causing it to sort BEFORE finger.prev_[0] in InternalKeyComparator order, violating FindSpliceForLevel's assertion "before == head_ || KeyIsAfterNode(key, before)". Fixed by not updating finger.prev_[0] during the walk-forward. Only finger.next_[0] is updated to track the walk-forward position. The finger.prev_[0] set by FindGreaterOrEqualWithFinger is always a safe starting point for subsequent lookups. Test Plan: - Added unit test MultiGetDuplicateKeys that reproduces the assertion failure with duplicate keys in the batch - Verified test fails without the fix, passes with the fix - All InlineSkipList tests pass (no regressions) - All DB-level BatchLookup tests pass (no regressions)
Summary: Add memtable batch lookup optimization with finger search Optimize memtable MultiGet by using a finger search on the skip list. After finding key[i], the search path (Splice) is retained as a "finger" for key[i+1]. The next search walks up the finger until the forward pointer overshoots, then descends -- costing O(log d) where d is the distance between consecutive sorted keys, rather than O(log N) from the head each time. Controlled by the new `memtable_batch_lookup_optimization` column family option (default: false). ## Changes - Add `FindGreaterOrEqualWithFinger()` and `MultiGet()` to `InlineSkipList` with optional paranoid validation support (key ordering checks and per-key checksum verification via default parameters) - Add virtual `MultiGet()` to `MemTableRep`, override in `SkipListRep` - Add `memtable_batch_lookup_optimization` CF option - Integrate finger search into `MemTable::MultiGet` -- sorts keys, performs batched finger search, then unscrambles results - Add `db_bench`, `db_stress`, and crash test support - Add unit tests for `InlineSkipList::MultiGet` (5 tests) and integration tests at the DB level (25 `BatchLookup` tests), including paranoid validation tests ## Benchmark Results Setup: 2M keys in memtable, release build (`DEBUG_LEVEL=0`). ### Single-threaded `multireadrandom` | Batch Size | Baseline | BatchOpt | vs Base | |:---:|---:|---:|---:| | 2 | 363,593 | 376,562 | **+3.6%** | | 8 | 385,204 | 386,082 | **+0.2%** | | 32 | 360,339 | 375,105 | **+4.1%** | | 64 | 352,696 | 378,497 | **+7.3%** | ### Multithreaded `multireadrandom` (batch_size=64) | Threads | Baseline | BatchOpt | vs Base | |:---:|---:|---:|---:| | 4 | 171,356 | 185,633 | **+8.3%** | | 8 | 161,478 | 171,392 | **+6.1%** | ### `multireadwhilewriting` (batch_size=64) | Threads | Baseline | BatchOpt | vs Base | |:---:|---:|---:|---:| | 1 | 163,938 | 175,068 | **+6.8%** | | 4 | 109,698 | 120,790 | **+10.1%** | | 8 | 116,623 | 123,658 | **+6.0%** | No regression at any batch size or thread count. The finger is stack-allocated per `MultiGet` call, so there is no shared state or cache-line contention between threads. Concurrent writes don't invalidate the finger's bracket since it only reads via acquire-load `Next()` pointers. Small Batch Size Regression Check: memtable_batch_lookup_optimization ===================================================================== All values are ops/sec. Measured with db_bench (release build). 2M keys in memtable, value_size=100, duration=30s per run. Multithreaded multireadrandom — small batch sizes -------------------------------------------------- Threads | Batch Size | Baseline | BatchOpt | Change 4 | 2 | 532,302 | 540,682 | +1.6% 8 | 2 | 1,044,046 | 1,046,920 | +0.3% 4 | 8 | 633,733 | 630,039 | -0.6% 8 | 8 | 1,260,818 | 1,241,792 | -1.5% multireadwhilewriting — small batch sizes ----------------------------------------- Threads | Batch Size | Baseline | BatchOpt | Change 1 | 2 | 107,284 | 105,217 | -1.9% 4 | 2 | 428,508 | 423,747 | -1.1% 8 | 2 | 854,654 | 864,923 | +1.2% 1 | 8 | 114,788 | 118,496 | +3.2% 4 | 8 | 467,995 | 473,437 | +1.2% 8 | 8 | 935,636 | 960,791 | +2.7% No regression at small batch sizes. Variations at batch_size=2 are within noise (~1-2%). At batch_size=8, modest positive trend (+1-3% in read-while-writing). Pull Request resolved: facebook#14428 Test Plan: - `make check` -- all tests pass - `ASSERT_STATUS_CHECKED=1 make check` -- all tests pass - 2-hour `db_crashtest.py blackbox` with `--memtable_batch_lookup_optimization=1` -- passed - 2-hour `db_crashtest.py blackbox` with `--memtable_batch_lookup_optimization=1 --paranoid_memory_checks=1` -- passed Reviewed By: pdillinger Differential Revision: D95813786 Pulled By: anand1976 fbshipit-source-id: b182a023e1026021f1b9682d1cc024c7729326c7
Add memtable batch lookup optimization with finger search
Summary
Optimize memtable MultiGet by using a finger search on the skip list. After finding key[i], the search path (Splice) is retained as a "finger" for key[i+1]. The next search walks up the finger until the forward pointer overshoots, then descends -- costing O(log d) where d is the distance between consecutive sorted keys, rather than O(log N) from the head each time.
Controlled by the new
memtable_batch_lookup_optimizationcolumn family option (default: false).Changes
FindGreaterOrEqualWithFinger()andMultiGet()toInlineSkipListwith optional paranoid validation support (key ordering checks and per-key checksum verification via default parameters)MultiGet()toMemTableRep, override inSkipListRepmemtable_batch_lookup_optimizationCF optionMemTable::MultiGet-- sorts keys, performs batched finger search, then unscrambles resultsdb_bench,db_stress, and crash test supportInlineSkipList::MultiGet(5 tests) and integration tests at the DB level (25BatchLookuptests), including paranoid validation testsBenchmark Results
Setup: 2M keys in memtable, release build (
DEBUG_LEVEL=0).Single-threaded
multireadrandomMultithreaded
multireadrandom(batch_size=64)multireadwhilewriting(batch_size=64)No regression at any batch size or thread count. The finger is stack-allocated per
MultiGetcall, so there is no shared state or cache-line contention between threads. Concurrent writes don't invalidate the finger's bracket since it only reads via acquire-loadNext()pointers.Small Batch Size Regression Check: memtable_batch_lookup_optimization
All values are ops/sec. Measured with db_bench (release build).
2M keys in memtable, value_size=100, duration=30s per run.
Multithreaded multireadrandom — small batch sizes
Threads | Batch Size | Baseline | BatchOpt | Change
4 | 2 | 532,302 | 540,682 | +1.6%
8 | 2 | 1,044,046 | 1,046,920 | +0.3%
4 | 8 | 633,733 | 630,039 | -0.6%
8 | 8 | 1,260,818 | 1,241,792 | -1.5%
multireadwhilewriting — small batch sizes
Threads | Batch Size | Baseline | BatchOpt | Change
1 | 2 | 107,284 | 105,217 | -1.9%
4 | 2 | 428,508 | 423,747 | -1.1%
8 | 2 | 854,654 | 864,923 | +1.2%
1 | 8 | 114,788 | 118,496 | +3.2%
4 | 8 | 467,995 | 473,437 | +1.2%
8 | 8 | 935,636 | 960,791 | +2.7%
No regression at small batch sizes. Variations at batch_size=2 are
within noise (~1-2%). At batch_size=8, modest positive trend (+1-3%
in read-while-writing).
Test Plan
make check-- all tests passASSERT_STATUS_CHECKED=1 make check-- all tests passdb_crashtest.py blackboxwith--memtable_batch_lookup_optimization=1-- passeddb_crashtest.py blackboxwith--memtable_batch_lookup_optimization=1 --paranoid_memory_checks=1-- passed