Skip to content

Add memtable MultiGet finger search optimization#14428

Closed
anand1976 wants to merge 7 commits into
facebook:mainfrom
anand1976:memtable_mget
Closed

Add memtable MultiGet finger search optimization#14428
anand1976 wants to merge 7 commits into
facebook:mainfrom
anand1976:memtable_mget

Conversation

@anand1976
Copy link
Copy Markdown
Contributor

@anand1976 anand1976 commented Mar 5, 2026

Add memtable batch lookup optimization with finger search

Summary

Optimize memtable MultiGet by using a finger search on the skip list. After finding key[i], the search path (Splice) is retained as a "finger" for key[i+1]. The next search walks up the finger until the forward pointer overshoots, then descends -- costing O(log d) where d is the distance between consecutive sorted keys, rather than O(log N) from the head each time.

Controlled by the new memtable_batch_lookup_optimization column family option (default: false).

Changes

  • Add FindGreaterOrEqualWithFinger() and MultiGet() to InlineSkipList with optional paranoid validation support (key ordering checks and per-key checksum verification via default parameters)
  • Add virtual MultiGet() to MemTableRep, override in SkipListRep
  • Add memtable_batch_lookup_optimization CF option
  • Integrate finger search into MemTable::MultiGet -- sorts keys, performs batched finger search, then unscrambles results
  • Add db_bench, db_stress, and crash test support
  • Add unit tests for InlineSkipList::MultiGet (5 tests) and integration tests at the DB level (25 BatchLookup tests), including paranoid validation tests

Benchmark Results

Setup: 2M keys in memtable, release build (DEBUG_LEVEL=0).

Single-threaded multireadrandom

Batch Size Baseline BatchOpt vs Base
2 363,593 376,562 +3.6%
8 385,204 386,082 +0.2%
32 360,339 375,105 +4.1%
64 352,696 378,497 +7.3%

Multithreaded multireadrandom (batch_size=64)

Threads Baseline BatchOpt vs Base
4 171,356 185,633 +8.3%
8 161,478 171,392 +6.1%

multireadwhilewriting (batch_size=64)

Threads Baseline BatchOpt vs Base
1 163,938 175,068 +6.8%
4 109,698 120,790 +10.1%
8 116,623 123,658 +6.0%

No regression at any batch size or thread count. The finger is stack-allocated per MultiGet call, so there is no shared state or cache-line contention between threads. Concurrent writes don't invalidate the finger's bracket since it only reads via acquire-load Next() pointers.

Small Batch Size Regression Check: memtable_batch_lookup_optimization

All values are ops/sec. Measured with db_bench (release build).
2M keys in memtable, value_size=100, duration=30s per run.

Multithreaded multireadrandom — small batch sizes

Threads | Batch Size | Baseline | BatchOpt | Change
4 | 2 | 532,302 | 540,682 | +1.6%
8 | 2 | 1,044,046 | 1,046,920 | +0.3%
4 | 8 | 633,733 | 630,039 | -0.6%
8 | 8 | 1,260,818 | 1,241,792 | -1.5%

multireadwhilewriting — small batch sizes

Threads | Batch Size | Baseline | BatchOpt | Change
1 | 2 | 107,284 | 105,217 | -1.9%
4 | 2 | 428,508 | 423,747 | -1.1%
8 | 2 | 854,654 | 864,923 | +1.2%
1 | 8 | 114,788 | 118,496 | +3.2%
4 | 8 | 467,995 | 473,437 | +1.2%
8 | 8 | 935,636 | 960,791 | +2.7%

No regression at small batch sizes. Variations at batch_size=2 are
within noise (~1-2%). At batch_size=8, modest positive trend (+1-3%
in read-while-writing).

Test Plan

  • make check -- all tests pass
  • ASSERT_STATUS_CHECKED=1 make check -- all tests pass
  • 2-hour db_crashtest.py blackbox with --memtable_batch_lookup_optimization=1 -- passed
  • 2-hour db_crashtest.py blackbox with --memtable_batch_lookup_optimization=1 --paranoid_memory_checks=1 -- passed

@anand1976 anand1976 added the WIP Work in progress label Mar 5, 2026
@meta-cla meta-cla Bot added the CLA Signed label Mar 5, 2026
@facebook facebook deleted a comment from github-actions Bot Mar 9, 2026
@anand1976 anand1976 requested a review from pdillinger March 9, 2026 16:47
@anand1976 anand1976 removed the WIP Work in progress label Mar 9, 2026
@meta-codesync
Copy link
Copy Markdown

meta-codesync Bot commented Mar 9, 2026

@anand1976 has imported this pull request. If you are a Meta employee, you can view this in D95813786.

@anand1976 anand1976 requested a review from xingbowang March 9, 2026 17:38
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 9, 2026

✅ clang-tidy: No findings on changed lines

Completed in 361.0s.

@meta-codesync
Copy link
Copy Markdown

meta-codesync Bot commented Mar 10, 2026

@anand1976 has imported this pull request. If you are a Meta employee, you can view this in D95813786.

Copy link
Copy Markdown
Contributor

@xingbowang xingbowang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see test coverage on the skiplist height growing handling. I guess it would involve with concurrent read/write. Could we some add test to validate read after write consistency with concurrent read/write operation with some randomly generated keys?

Some other Findings from AI:

  1. Finger not updated after callback loop walk-forward

After the callback loop walks forward through level-0 nodes, finger->next_[0] still points to the search result rather than the last visited node. On the next FindGreaterOrEqualWithFinger call, the walk-up starts from this stale position, ascending more levels than necessary. For the typical case (callback returns false after 1 entry), this is negligible. For merge operations consuming multiple entries per key, the next lookup degrades toward O(log N).

  1. key_validation_callback not invoked during callback loop

In InlineSkipList::MultiGet, the callback loop at level 0 (walking forward through matching entries) only checks key ordering but does not invoke key_validation_callback:

inlineskiplist.h
Lines 1330-1337
for (; node != nullptr && callback_func(callback_args[i], node->Key());) {
Node* prev_node = node;
node = node->Next(0);
if (detect_key_out_of_order && prev_node != head_ && node != nullptr &&
compare_(prev_node->Key(), node->Key()) >= 0) {
return Corruption(prev_node, node, allow_data_in_errors);
}
}
During the search phase, FindSpliceForLevelValidated calls key_validation_callback on every traversed node. But nodes visited by the callback loop escape checksum/corruption validation. If paranoid_memory_checks is enabled, this is a gap compared to the per-key path (which validates every node via GetAndValidate → SeekAndValidate + NextAndValidate).

@xingbowang
Copy link
Copy Markdown
Contributor

I would also like to see benchmark with smaller batch size to make sure there is no regression on smaller batch size.

Copy link
Copy Markdown
Contributor

@xingbowang xingbowang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The improvement is mostly from reducing cache miss by reusing the same key from last search. With high batch, I felt we could do parallel lookup with multiple prefetch stages to further hide the latency. I claude coded it for insert and seen up to 4X throughput improvement. I guess it would work well for read as well. We can do that later. Feel free to explore it, if you have passion pursuing that direction.
main...xingbowang:rocksdb:experiment_skiplist_perf_up

}
}

TEST_F(InlineSkipTest, ConcurrentMultiGet) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test runs one writer on the main thread and four readers on background threads, all operating on the same skip list simultaneously. This is good improvement. However, this is slightly off my goal. My goal is to validate read after write consistency. I would suggest to

  1. generate a long sequence of unique keys for insertion.
  2. split the sequence into different chunks.
  3. spin up different threads to insert different chunks of keys into the skiplist, and then immediately query a batch of random key including the key just inserted. Validate the key just inserted is visible.
    Since each key is unique, we don't need worry about key override, the key read from which it just inserted should be easy to validate its value to validate read after write consistency.
    To push this further, maybe there is some way to also have it read recent keys inserted by other threads through some other synchronization mechanism, although it is a bit more complicate. Maybe AI could help.

@pdillinger
Copy link
Copy Markdown
Contributor

I felt we could do parallel lookup with multiple prefetch stages to further hide the latency.

My inclination would be to binary search + partition the (sub)batch at each height of the skip list into sub-batches to send further down the skip list. This approach would "visit" each skip list node only once (per batch). Nodes pushed on the BFS queue to visit could be prefetched to be hot in cache. (A DFS stack might not hide latency as well.)

Copy link
Copy Markdown
Contributor

@pdillinger pdillinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Xingbo for first round reviews. Overall looks good to me. I trust the crash test to find bugs here if there are any.

Implement finger search on skip list for memtable MultiGet. After each
key lookup, the search path (splice) is cached and reused for the next
key, reducing per-key cost from O(log N) to O(log d) where d is the
distance between consecutive keys.

Changes:
- Add FindGreaterOrEqualWithFinger() and MultiGet() to InlineSkipList
- Add virtual MultiGet() to MemTableRep, override in SkipListRep
- Add memtable_multi_get_finger_search CF option (default: false)
- Integrate finger search into MemTable::MultiGet when enabled
- Add db_bench, db_stress, and crash test support
- Add unit tests (InlineSkipList level) and integration tests (DB level)

Benchmark results (2M keys in memtable, batch_size=64):
- Baseline: ~265,700 ops/sec
- Finger search: ~283,800 ops/sec
- Improvement: ~7%
Optimize MultiGet by using a finger search on the memtable skip list.
After finding key[i], the search path (Splice) is retained as a
"finger" for key[i+1]. The next search walks up the finger until the
forward pointer overshoots, then descends — costing O(log d) where
d is the distance between consecutive sorted keys, rather than
O(log n) from the head each time.

Key changes:
- Add InlineSkipList::FindGreaterOrEqualWithFinger() and MultiGet()
  with optional paranoid validation parameters (default off)
- Add MemTableRep::MultiGet() virtual with validation support
- Add SkipListRep::MultiGet() override delegating to skip list
- Add MemTable::MultiGet() batched lookup phase that sorts keys,
  performs finger search, then unscrambles results
- Rename option from memtable_multi_get_finger_search to
  memtable_batch_lookup_optimization
- Add db_bench, db_stress, and crash test support
- Add comprehensive unit tests including paranoid validation tests

The optimization provides +3.6% to +10.1% throughput improvement
on MultiGet benchmarks across various configurations.
Fix ASSERT_STATUS_CHECKED failure: wrap all list.MultiGet() calls
in inlineskiplist_test.cc with ASSERT_OK() to check the returned
Status, which is now non-void after the function deduplication.
- Value-initialize std::array declarations in memtable.cc to fix
  cppcoreguidelines-pro-type-member-init warnings
- Add reserve() before push_back loop in inlineskiplist_test.cc to
  fix performance-inefficient-vector-operation warning
- Add braces around single-statement for loop in db_basic_test.cc
  to fix readability-braces-around-statements warning
Two issues in InlineSkipList::MultiGet's callback walk-forward loop:

1. After the callback loop walks forward through level-0 nodes (merge
   operations consuming multiple entries), finger.next_[0] still
   pointed to the original search result. The next key's search would
   start from this stale position, ascending more levels than needed.
   Fix: update finger.prev_[0] and finger.next_[0] after each step.

2. The callback loop checked detect_key_out_of_order but did not invoke
   key_validation_callback on nodes traversed via Next(0). This meant
   nodes visited during merge walk-forward escaped checksum validation
   when paranoid_memory_checks was enabled. Fix: call the validation
   callback on each new node before it is passed to SaveValue.
Add ConcurrentMultiGet test that exercises skip list height growth
handling in FindGreaterOrEqualWithFinger. Multiple reader threads
perform concurrent MultiGet lookups while a writer thread inserts
keys via InsertConcurrently, causing the skip list height to grow
during ongoing finger searches. The test validates that each
MultiGet result is >= its query key (read-after-write consistency).

Also fix two issues in the MultiGet callback loop:
- Update finger.prev_[0]/next_[0] after each walk-forward step so
  the next key's search starts from the current position rather than
  the stale original search result (performance fix for merge ops)
- Invoke key_validation_callback on each node reached via Next(0)
  in the callback loop, closing a validation gap when paranoid
  memory checks are enabled
Rewrite the ConcurrentMultiGet test to validate read-after-write
consistency rather than just basic concurrent safety:

- Generate 20,000 unique keys, shuffle, and split into 4 thread chunks
- Each thread inserts its keys one at a time via InsertConcurrently
- After each insert, immediately does a MultiGet batch that includes
  the just-inserted key plus random keys from a shared ring buffer
- Validates the just-inserted key is always visible (read-after-write)
- Shared ring buffer allows cross-thread visibility testing: each
  thread publishes inserted keys and samples recently inserted keys
  from other threads for its MultiGet queries
- Use StartThread instead of Schedule for proper WaitForJoin semantics

This exercises skip list height growth during active finger searches
while providing strong read-after-write consistency guarantees.
@meta-codesync
Copy link
Copy Markdown

meta-codesync Bot commented Mar 16, 2026

@anand1976 has imported this pull request. If you are a Meta employee, you can view this in D95813786.

@anand1976
Copy link
Copy Markdown
Contributor Author

Thanks Xingbo and Peter for the review.

My inclination would be to binary search + partition the (sub)batch at each height of the skip list into sub-batches to send further down the skip list. This approach would "visit" each skip list node only once (per batch). Nodes pushed on the BFS queue to visit could be prefetched to be hot in cache. (A DFS stack might not hide latency as well.)

That's a nice idea. To Xingbo's point, hiding cache latency would give a significant boost. I'll explore this when I have some time. The BFS approach could also be applied to batched index lookups in SSTs (though impact may be less due to smaller batches).

@meta-codesync
Copy link
Copy Markdown

meta-codesync Bot commented Mar 16, 2026

@anand1976 merged this pull request in c66c142.

xingbowang added a commit to xingbowang/rocksdb that referenced this pull request Mar 17, 2026
…facebook#14428)

Summary:
The memtable MultiGet finger search optimization (facebook#14428) added a
walk-forward loop that updates finger.prev_[0] to the last node processed
by the callback. When a MultiGet batch contains duplicate user keys,
the walk-forward for the first occurrence processes all entries for that
user key, leaving finger.prev_[0] at the last entry (lowest sequence
number). For the duplicate key, the lookup key's snapshot sequence number
is higher, causing it to sort BEFORE finger.prev_[0] in
InternalKeyComparator order, violating FindSpliceForLevel's assertion
"before == head_ || KeyIsAfterNode(key, before)".

Fixed by not updating finger.prev_[0] during the walk-forward. Only
finger.next_[0] is updated to track the walk-forward position. The
finger.prev_[0] set by FindGreaterOrEqualWithFinger is always a safe
starting point for subsequent lookups.

Test Plan:
- Added unit test MultiGetDuplicateKeys that reproduces the assertion
  failure with duplicate keys in the batch
- Verified test fails without the fix, passes with the fix
- All InlineSkipList tests pass (no regressions)
- All DB-level BatchLookup tests pass (no regressions)
doxtop pushed a commit to flyingw/rocksdb that referenced this pull request Apr 7, 2026
Summary:
Add memtable batch lookup optimization with finger search

Optimize memtable MultiGet by using a finger search on the skip list. After finding key[i], the search path (Splice) is retained as a "finger" for key[i+1]. The next search walks up the finger until the forward pointer overshoots, then descends -- costing O(log d) where d is the distance between consecutive sorted keys, rather than O(log N) from the head each time.

Controlled by the new `memtable_batch_lookup_optimization` column family option (default: false).

## Changes

- Add `FindGreaterOrEqualWithFinger()` and `MultiGet()` to `InlineSkipList` with optional paranoid validation support (key ordering checks and per-key checksum verification via default parameters)
- Add virtual `MultiGet()` to `MemTableRep`, override in `SkipListRep`
- Add `memtable_batch_lookup_optimization` CF option
- Integrate finger search into `MemTable::MultiGet` -- sorts keys, performs batched finger search, then unscrambles results
- Add `db_bench`, `db_stress`, and crash test support
- Add unit tests for `InlineSkipList::MultiGet` (5 tests) and integration tests at the DB level (25 `BatchLookup` tests), including paranoid validation tests

## Benchmark Results

Setup: 2M keys in memtable, release build (`DEBUG_LEVEL=0`).

### Single-threaded `multireadrandom`

| Batch Size | Baseline | BatchOpt | vs Base |
|:---:|---:|---:|---:|
| 2 | 363,593 | 376,562 | **+3.6%** |
| 8 | 385,204 | 386,082 | **+0.2%** |
| 32 | 360,339 | 375,105 | **+4.1%** |
| 64 | 352,696 | 378,497 | **+7.3%** |

### Multithreaded `multireadrandom` (batch_size=64)

| Threads | Baseline | BatchOpt | vs Base |
|:---:|---:|---:|---:|
| 4 | 171,356 | 185,633 | **+8.3%** |
| 8 | 161,478 | 171,392 | **+6.1%** |

### `multireadwhilewriting` (batch_size=64)

| Threads | Baseline | BatchOpt | vs Base |
|:---:|---:|---:|---:|
| 1 | 163,938 | 175,068 | **+6.8%** |
| 4 | 109,698 | 120,790 | **+10.1%** |
| 8 | 116,623 | 123,658 | **+6.0%** |

No regression at any batch size or thread count. The finger is stack-allocated per `MultiGet` call, so there is no shared state or cache-line contention between threads. Concurrent writes don't invalidate the finger's bracket since it only reads via acquire-load `Next()` pointers.

Small Batch Size Regression Check: memtable_batch_lookup_optimization
=====================================================================

All values are ops/sec. Measured with db_bench (release build).
2M keys in memtable, value_size=100, duration=30s per run.

Multithreaded multireadrandom — small batch sizes
--------------------------------------------------
Threads | Batch Size | Baseline   | BatchOpt   | Change
   4    |     2      |   532,302  |   540,682  |  +1.6%
   8    |     2      | 1,044,046  | 1,046,920  |  +0.3%
   4    |     8      |   633,733  |   630,039  |  -0.6%
   8    |     8      | 1,260,818  | 1,241,792  |  -1.5%

multireadwhilewriting — small batch sizes
-----------------------------------------
Threads | Batch Size | Baseline   | BatchOpt   | Change
   1    |     2      |   107,284  |   105,217  |  -1.9%
   4    |     2      |   428,508  |   423,747  |  -1.1%
   8    |     2      |   854,654  |   864,923  |  +1.2%
   1    |     8      |   114,788  |   118,496  |  +3.2%
   4    |     8      |   467,995  |   473,437  |  +1.2%
   8    |     8      |   935,636  |   960,791  |  +2.7%

No regression at small batch sizes. Variations at batch_size=2 are
within noise (~1-2%). At batch_size=8, modest positive trend (+1-3%
in read-while-writing).

Pull Request resolved: facebook#14428

Test Plan:
- `make check` -- all tests pass
- `ASSERT_STATUS_CHECKED=1 make check` -- all tests pass
- 2-hour `db_crashtest.py blackbox` with `--memtable_batch_lookup_optimization=1` -- passed
- 2-hour `db_crashtest.py blackbox` with `--memtable_batch_lookup_optimization=1 --paranoid_memory_checks=1` -- passed

Reviewed By: pdillinger

Differential Revision: D95813786

Pulled By: anand1976

fbshipit-source-id: b182a023e1026021f1b9682d1cc024c7729326c7
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants