Add memtable MultiGet finger search optimization by anand1976 · Pull Request #14428 · facebook/rocksdb

anand1976 · 2026-03-05T06:02:36Z

Add memtable batch lookup optimization with finger search

Summary

Optimize memtable MultiGet by using a finger search on the skip list. After finding key[i], the search path (Splice) is retained as a "finger" for key[i+1]. The next search walks up the finger until the forward pointer overshoots, then descends -- costing O(log d) where d is the distance between consecutive sorted keys, rather than O(log N) from the head each time.

Controlled by the new memtable_batch_lookup_optimization column family option (default: false).

Changes

Add FindGreaterOrEqualWithFinger() and MultiGet() to InlineSkipList with optional paranoid validation support (key ordering checks and per-key checksum verification via default parameters)
Add virtual MultiGet() to MemTableRep, override in SkipListRep
Add memtable_batch_lookup_optimization CF option
Integrate finger search into MemTable::MultiGet -- sorts keys, performs batched finger search, then unscrambles results
Add db_bench, db_stress, and crash test support
Add unit tests for InlineSkipList::MultiGet (5 tests) and integration tests at the DB level (25 BatchLookup tests), including paranoid validation tests

Benchmark Results

Setup: 2M keys in memtable, release build (DEBUG_LEVEL=0).

Single-threaded `multireadrandom`

Batch Size	Baseline	BatchOpt	vs Base
2	363,593	376,562	+3.6%
8	385,204	386,082	+0.2%
32	360,339	375,105	+4.1%
64	352,696	378,497	+7.3%

Multithreaded `multireadrandom` (batch_size=64)

Threads	Baseline	BatchOpt	vs Base
4	171,356	185,633	+8.3%
8	161,478	171,392	+6.1%

`multireadwhilewriting` (batch_size=64)

Threads	Baseline	BatchOpt	vs Base
1	163,938	175,068	+6.8%
4	109,698	120,790	+10.1%
8	116,623	123,658	+6.0%

No regression at any batch size or thread count. The finger is stack-allocated per MultiGet call, so there is no shared state or cache-line contention between threads. Concurrent writes don't invalidate the finger's bracket since it only reads via acquire-load Next() pointers.

Small Batch Size Regression Check: memtable_batch_lookup_optimization

All values are ops/sec. Measured with db_bench (release build).
2M keys in memtable, value_size=100, duration=30s per run.

Multithreaded multireadrandom — small batch sizes

Threads | Batch Size | Baseline | BatchOpt | Change
4 | 2 | 532,302 | 540,682 | +1.6%
8 | 2 | 1,044,046 | 1,046,920 | +0.3%
4 | 8 | 633,733 | 630,039 | -0.6%
8 | 8 | 1,260,818 | 1,241,792 | -1.5%

multireadwhilewriting — small batch sizes

Threads | Batch Size | Baseline | BatchOpt | Change
1 | 2 | 107,284 | 105,217 | -1.9%
4 | 2 | 428,508 | 423,747 | -1.1%
8 | 2 | 854,654 | 864,923 | +1.2%
1 | 8 | 114,788 | 118,496 | +3.2%
4 | 8 | 467,995 | 473,437 | +1.2%
8 | 8 | 935,636 | 960,791 | +2.7%

No regression at small batch sizes. Variations at batch_size=2 are
within noise (~1-2%). At batch_size=8, modest positive trend (+1-3%
in read-while-writing).

Test Plan

make check -- all tests pass
ASSERT_STATUS_CHECKED=1 make check -- all tests pass
2-hour db_crashtest.py blackbox with --memtable_batch_lookup_optimization=1 -- passed
2-hour db_crashtest.py blackbox with --memtable_batch_lookup_optimization=1 --paranoid_memory_checks=1 -- passed

meta-codesync · 2026-03-09T16:57:38Z

@anand1976 has imported this pull request. If you are a Meta employee, you can view this in D95813786.

github-actions · 2026-03-09T18:44:25Z

✅ clang-tidy: No findings on changed lines

Completed in 361.0s.

meta-codesync · 2026-03-10T04:31:21Z

@anand1976 has imported this pull request. If you are a Meta employee, you can view this in D95813786.

xingbowang

I don't see test coverage on the skiplist height growing handling. I guess it would involve with concurrent read/write. Could we some add test to validate read after write consistency with concurrent read/write operation with some randomly generated keys?

Some other Findings from AI:

Finger not updated after callback loop walk-forward

After the callback loop walks forward through level-0 nodes, finger->next_[0] still points to the search result rather than the last visited node. On the next FindGreaterOrEqualWithFinger call, the walk-up starts from this stale position, ascending more levels than necessary. For the typical case (callback returns false after 1 entry), this is negligible. For merge operations consuming multiple entries per key, the next lookup degrades toward O(log N).

key_validation_callback not invoked during callback loop

In InlineSkipList::MultiGet, the callback loop at level 0 (walking forward through matching entries) only checks key ordering but does not invoke key_validation_callback:

inlineskiplist.h
Lines 1330-1337
for (; node != nullptr && callback_func(callback_args[i], node->Key());) {
Node* prev_node = node;
node = node->Next(0);
if (detect_key_out_of_order && prev_node != head_ && node != nullptr &&
compare_(prev_node->Key(), node->Key()) >= 0) {
return Corruption(prev_node, node, allow_data_in_errors);
}
}
During the search phase, FindSpliceForLevelValidated calls key_validation_callback on every traversed node. But nodes visited by the callback loop escape checksum/corruption validation. If paranoid_memory_checks is enabled, this is a gap compared to the per-key path (which validates every node via GetAndValidate → SeekAndValidate + NextAndValidate).

xingbowang · 2026-03-11T20:09:09Z

I would also like to see benchmark with smaller batch size to make sure there is no regression on smaller batch size.

xingbowang

The improvement is mostly from reducing cache miss by reusing the same key from last search. With high batch, I felt we could do parallel lookup with multiple prefetch stages to further hide the latency. I claude coded it for insert and seen up to 4X throughput improvement. I guess it would work well for read as well. We can do that later. Feel free to explore it, if you have passion pursuing that direction.
main...xingbowang:rocksdb:experiment_skiplist_perf_up

xingbowang · 2026-03-12T20:56:21Z

+  }
+}
+
+TEST_F(InlineSkipTest, ConcurrentMultiGet) {


This test runs one writer on the main thread and four readers on background threads, all operating on the same skip list simultaneously. This is good improvement. However, this is slightly off my goal. My goal is to validate read after write consistency. I would suggest to

generate a long sequence of unique keys for insertion.

split the sequence into different chunks.

spin up different threads to insert different chunks of keys into the skiplist, and then immediately query a batch of random key including the key just inserted. Validate the key just inserted is visible.
Since each key is unique, we don't need worry about key override, the key read from which it just inserted should be easy to validate its value to validate read after write consistency.
To push this further, maybe there is some way to also have it read recent keys inserted by other threads through some other synchronization mechanism, although it is a bit more complicate. Maybe AI could help.

pdillinger · 2026-03-13T17:10:57Z

I felt we could do parallel lookup with multiple prefetch stages to further hide the latency.

My inclination would be to binary search + partition the (sub)batch at each height of the skip list into sub-batches to send further down the skip list. This approach would "visit" each skip list node only once (per batch). Nodes pushed on the BFS queue to visit could be prefetched to be hot in cache. (A DFS stack might not hide latency as well.)

pdillinger

Thanks Xingbo for first round reviews. Overall looks good to me. I trust the crash test to find bugs here if there are any.

Implement finger search on skip list for memtable MultiGet. After each key lookup, the search path (splice) is cached and reused for the next key, reducing per-key cost from O(log N) to O(log d) where d is the distance between consecutive keys. Changes: - Add FindGreaterOrEqualWithFinger() and MultiGet() to InlineSkipList - Add virtual MultiGet() to MemTableRep, override in SkipListRep - Add memtable_multi_get_finger_search CF option (default: false) - Integrate finger search into MemTable::MultiGet when enabled - Add db_bench, db_stress, and crash test support - Add unit tests (InlineSkipList level) and integration tests (DB level) Benchmark results (2M keys in memtable, batch_size=64): - Baseline: ~265,700 ops/sec - Finger search: ~283,800 ops/sec - Improvement: ~7%

Optimize MultiGet by using a finger search on the memtable skip list. After finding key[i], the search path (Splice) is retained as a "finger" for key[i+1]. The next search walks up the finger until the forward pointer overshoots, then descends — costing O(log d) where d is the distance between consecutive sorted keys, rather than O(log n) from the head each time. Key changes: - Add InlineSkipList::FindGreaterOrEqualWithFinger() and MultiGet() with optional paranoid validation parameters (default off) - Add MemTableRep::MultiGet() virtual with validation support - Add SkipListRep::MultiGet() override delegating to skip list - Add MemTable::MultiGet() batched lookup phase that sorts keys, performs finger search, then unscrambles results - Rename option from memtable_multi_get_finger_search to memtable_batch_lookup_optimization - Add db_bench, db_stress, and crash test support - Add comprehensive unit tests including paranoid validation tests The optimization provides +3.6% to +10.1% throughput improvement on MultiGet benchmarks across various configurations.

Fix ASSERT_STATUS_CHECKED failure: wrap all list.MultiGet() calls in inlineskiplist_test.cc with ASSERT_OK() to check the returned Status, which is now non-void after the function deduplication.

- Value-initialize std::array declarations in memtable.cc to fix cppcoreguidelines-pro-type-member-init warnings - Add reserve() before push_back loop in inlineskiplist_test.cc to fix performance-inefficient-vector-operation warning - Add braces around single-statement for loop in db_basic_test.cc to fix readability-braces-around-statements warning

Two issues in InlineSkipList::MultiGet's callback walk-forward loop: 1. After the callback loop walks forward through level-0 nodes (merge operations consuming multiple entries), finger.next_[0] still pointed to the original search result. The next key's search would start from this stale position, ascending more levels than needed. Fix: update finger.prev_[0] and finger.next_[0] after each step. 2. The callback loop checked detect_key_out_of_order but did not invoke key_validation_callback on nodes traversed via Next(0). This meant nodes visited during merge walk-forward escaped checksum validation when paranoid_memory_checks was enabled. Fix: call the validation callback on each new node before it is passed to SaveValue.

Add ConcurrentMultiGet test that exercises skip list height growth handling in FindGreaterOrEqualWithFinger. Multiple reader threads perform concurrent MultiGet lookups while a writer thread inserts keys via InsertConcurrently, causing the skip list height to grow during ongoing finger searches. The test validates that each MultiGet result is >= its query key (read-after-write consistency). Also fix two issues in the MultiGet callback loop: - Update finger.prev_[0]/next_[0] after each walk-forward step so the next key's search starts from the current position rather than the stale original search result (performance fix for merge ops) - Invoke key_validation_callback on each node reached via Next(0) in the callback loop, closing a validation gap when paranoid memory checks are enabled

Rewrite the ConcurrentMultiGet test to validate read-after-write consistency rather than just basic concurrent safety: - Generate 20,000 unique keys, shuffle, and split into 4 thread chunks - Each thread inserts its keys one at a time via InsertConcurrently - After each insert, immediately does a MultiGet batch that includes the just-inserted key plus random keys from a shared ring buffer - Validates the just-inserted key is always visible (read-after-write) - Shared ring buffer allows cross-thread visibility testing: each thread publishes inserted keys and samples recently inserted keys from other threads for its MultiGet queries - Use StartThread instead of Schedule for proper WaitForJoin semantics This exercises skip list height growth during active finger searches while providing strong read-after-write consistency guarantees.

meta-codesync · 2026-03-16T00:47:31Z

@anand1976 has imported this pull request. If you are a Meta employee, you can view this in D95813786.

anand1976 · 2026-03-16T17:23:01Z

Thanks Xingbo and Peter for the review.

My inclination would be to binary search + partition the (sub)batch at each height of the skip list into sub-batches to send further down the skip list. This approach would "visit" each skip list node only once (per batch). Nodes pushed on the BFS queue to visit could be prefetched to be hot in cache. (A DFS stack might not hide latency as well.)

That's a nice idea. To Xingbo's point, hiding cache latency would give a significant boost. I'll explore this when I have some time. The BFS approach could also be applied to batched index lookups in SSTs (though impact may be less due to smaller batches).

meta-codesync · 2026-03-16T17:48:35Z

@anand1976 merged this pull request in c66c142.

…facebook#14428) Summary: The memtable MultiGet finger search optimization (facebook#14428) added a walk-forward loop that updates finger.prev_[0] to the last node processed by the callback. When a MultiGet batch contains duplicate user keys, the walk-forward for the first occurrence processes all entries for that user key, leaving finger.prev_[0] at the last entry (lowest sequence number). For the duplicate key, the lookup key's snapshot sequence number is higher, causing it to sort BEFORE finger.prev_[0] in InternalKeyComparator order, violating FindSpliceForLevel's assertion "before == head_ || KeyIsAfterNode(key, before)". Fixed by not updating finger.prev_[0] during the walk-forward. Only finger.next_[0] is updated to track the walk-forward position. The finger.prev_[0] set by FindGreaterOrEqualWithFinger is always a safe starting point for subsequent lookups. Test Plan: - Added unit test MultiGetDuplicateKeys that reproduces the assertion failure with duplicate keys in the batch - Verified test fails without the fix, passes with the fix - All InlineSkipList tests pass (no regressions) - All DB-level BatchLookup tests pass (no regressions)

Summary: Add memtable batch lookup optimization with finger search Optimize memtable MultiGet by using a finger search on the skip list. After finding key[i], the search path (Splice) is retained as a "finger" for key[i+1]. The next search walks up the finger until the forward pointer overshoots, then descends -- costing O(log d) where d is the distance between consecutive sorted keys, rather than O(log N) from the head each time. Controlled by the new `memtable_batch_lookup_optimization` column family option (default: false). ## Changes - Add `FindGreaterOrEqualWithFinger()` and `MultiGet()` to `InlineSkipList` with optional paranoid validation support (key ordering checks and per-key checksum verification via default parameters) - Add virtual `MultiGet()` to `MemTableRep`, override in `SkipListRep` - Add `memtable_batch_lookup_optimization` CF option - Integrate finger search into `MemTable::MultiGet` -- sorts keys, performs batched finger search, then unscrambles results - Add `db_bench`, `db_stress`, and crash test support - Add unit tests for `InlineSkipList::MultiGet` (5 tests) and integration tests at the DB level (25 `BatchLookup` tests), including paranoid validation tests ## Benchmark Results Setup: 2M keys in memtable, release build (`DEBUG_LEVEL=0`). ### Single-threaded `multireadrandom` | Batch Size | Baseline | BatchOpt | vs Base | |:---:|---:|---:|---:| | 2 | 363,593 | 376,562 | **+3.6%** | | 8 | 385,204 | 386,082 | **+0.2%** | | 32 | 360,339 | 375,105 | **+4.1%** | | 64 | 352,696 | 378,497 | **+7.3%** | ### Multithreaded `multireadrandom` (batch_size=64) | Threads | Baseline | BatchOpt | vs Base | |:---:|---:|---:|---:| | 4 | 171,356 | 185,633 | **+8.3%** | | 8 | 161,478 | 171,392 | **+6.1%** | ### `multireadwhilewriting` (batch_size=64) | Threads | Baseline | BatchOpt | vs Base | |:---:|---:|---:|---:| | 1 | 163,938 | 175,068 | **+6.8%** | | 4 | 109,698 | 120,790 | **+10.1%** | | 8 | 116,623 | 123,658 | **+6.0%** | No regression at any batch size or thread count. The finger is stack-allocated per `MultiGet` call, so there is no shared state or cache-line contention between threads. Concurrent writes don't invalidate the finger's bracket since it only reads via acquire-load `Next()` pointers. Small Batch Size Regression Check: memtable_batch_lookup_optimization ===================================================================== All values are ops/sec. Measured with db_bench (release build). 2M keys in memtable, value_size=100, duration=30s per run. Multithreaded multireadrandom — small batch sizes -------------------------------------------------- Threads | Batch Size | Baseline | BatchOpt | Change 4 | 2 | 532,302 | 540,682 | +1.6% 8 | 2 | 1,044,046 | 1,046,920 | +0.3% 4 | 8 | 633,733 | 630,039 | -0.6% 8 | 8 | 1,260,818 | 1,241,792 | -1.5% multireadwhilewriting — small batch sizes ----------------------------------------- Threads | Batch Size | Baseline | BatchOpt | Change 1 | 2 | 107,284 | 105,217 | -1.9% 4 | 2 | 428,508 | 423,747 | -1.1% 8 | 2 | 854,654 | 864,923 | +1.2% 1 | 8 | 114,788 | 118,496 | +3.2% 4 | 8 | 467,995 | 473,437 | +1.2% 8 | 8 | 935,636 | 960,791 | +2.7% No regression at small batch sizes. Variations at batch_size=2 are within noise (~1-2%). At batch_size=8, modest positive trend (+1-3% in read-while-writing). Pull Request resolved: facebook#14428 Test Plan: - `make check` -- all tests pass - `ASSERT_STATUS_CHECKED=1 make check` -- all tests pass - 2-hour `db_crashtest.py blackbox` with `--memtable_batch_lookup_optimization=1` -- passed - 2-hour `db_crashtest.py blackbox` with `--memtable_batch_lookup_optimization=1 --paranoid_memory_checks=1` -- passed Reviewed By: pdillinger Differential Revision: D95813786 Pulled By: anand1976 fbshipit-source-id: b182a023e1026021f1b9682d1cc024c7729326c7

anand1976 added the WIP Work in progress label Mar 5, 2026

meta-cla Bot added the CLA Signed label Mar 5, 2026

facebook deleted a comment from github-actions Bot Mar 9, 2026

anand1976 requested a review from pdillinger March 9, 2026 16:47

anand1976 removed the WIP Work in progress label Mar 9, 2026

anand1976 requested a review from xingbowang March 9, 2026 17:38

xingbowang reviewed Mar 11, 2026

View reviewed changes

anand1976 force-pushed the memtable_mget branch from b053913 to e9904e4 Compare March 12, 2026 18:00

xingbowang reviewed Mar 12, 2026

View reviewed changes

pdillinger approved these changes Mar 13, 2026

View reviewed changes

anand1976 added 7 commits March 15, 2026 11:19

Check Status returned by InlineSkipList::MultiGet in tests

8999c3c

Fix ASSERT_STATUS_CHECKED failure: wrap all list.MultiGet() calls in inlineskiplist_test.cc with ASSERT_OK() to check the returned Status, which is now non-void after the function deduplication.

anand1976 force-pushed the memtable_mget branch from e9904e4 to 6d64f87 Compare March 16, 2026 00:16

meta-codesync Bot closed this in c66c142 Mar 16, 2026

facebook-github-tools Bot added the Merged label Mar 16, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add memtable MultiGet finger search optimization#14428

Add memtable MultiGet finger search optimization#14428
anand1976 wants to merge 7 commits into
facebook:mainfrom
anand1976:memtable_mget

anand1976 commented Mar 5, 2026 •

edited

Loading

Uh oh!

meta-codesync Bot commented Mar 9, 2026

Uh oh!

github-actions Bot commented Mar 9, 2026 •

edited

Loading

Uh oh!

meta-codesync Bot commented Mar 10, 2026

Uh oh!

xingbowang left a comment

Uh oh!

xingbowang commented Mar 11, 2026

Uh oh!

xingbowang left a comment

Uh oh!

xingbowang Mar 12, 2026

Uh oh!

pdillinger commented Mar 13, 2026

Uh oh!

pdillinger left a comment

Uh oh!

meta-codesync Bot commented Mar 16, 2026

Uh oh!

anand1976 commented Mar 16, 2026

Uh oh!

meta-codesync Bot commented Mar 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

anand1976 commented Mar 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Benchmark Results

Single-threaded multireadrandom

Multithreaded multireadrandom (batch_size=64)

multireadwhilewriting (batch_size=64)

Small Batch Size Regression Check: memtable_batch_lookup_optimization

Multithreaded multireadrandom — small batch sizes

multireadwhilewriting — small batch sizes

Test Plan

Uh oh!

meta-codesync Bot commented Mar 9, 2026

Uh oh!

github-actions Bot commented Mar 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ clang-tidy: No findings on changed lines

Uh oh!

meta-codesync Bot commented Mar 10, 2026

Uh oh!

xingbowang left a comment

Choose a reason for hiding this comment

Uh oh!

xingbowang commented Mar 11, 2026

Uh oh!

xingbowang left a comment

Choose a reason for hiding this comment

Uh oh!

xingbowang Mar 12, 2026

Choose a reason for hiding this comment

Uh oh!

pdillinger commented Mar 13, 2026

Uh oh!

pdillinger left a comment

Choose a reason for hiding this comment

Uh oh!

meta-codesync Bot commented Mar 16, 2026

Uh oh!

anand1976 commented Mar 16, 2026

Uh oh!

meta-codesync Bot commented Mar 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

anand1976 commented Mar 5, 2026 •

edited

Loading

Single-threaded `multireadrandom`

Multithreaded `multireadrandom` (batch_size=64)

`multireadwhilewriting` (batch_size=64)

github-actions Bot commented Mar 9, 2026 •

edited

Loading