Move the MultiScan seek key check to upper layer by xingbowang · Pull Request #14040 · facebook/rocksdb

xingbowang · 2025-10-09T21:07:08Z

Summary:

The current seek key validation is too strict. This change relaxes it at block iterator level, and add additional check at DB iterator level. The new contract is that when MultiScan is used, after prepared is called, each following seek must seek the start key of the prepared scan range in order. Otherwise, the iterator is set with error status.

Test:

Unit test

Summary: The current seek key validation is too strict, this change relaxes it to allow better handling with query span across multiple files after compaction. Test: Unit test

meta-codesync · 2025-10-09T21:07:57Z

@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D84292297.

anand1976 · 2025-10-09T22:07:02Z

db/db_iter.cc

+    }
+
+    // Set the upper bound with the limit if it is not set yet.
+    if (iterate_upper_bound_ == nullptr) {


If this is non-null, we should validate that it matches the range.limit. If it doesn't, that's a violation of the contract.

Will add it.

I tried to do this, but test failed. The issue is that upper bound is set when iterator is created, the pointer is passed from read_options down to merging iterator, etc. I could add additional set method to all the iterator, but it becomes more complicated. For now, let's just tighten the API constraint to always require upper bound set when limit is set.

anand1976 · 2025-10-09T22:19:08Z

table/block_based/block_based_table_iterator.cc

+  // Case 1: seek before the first prepared ranges, return out of bound
+  // Case 2: seek at the beginning of a prepared range (expected case)
+  // Case 3: seek within a prepared range (unexpected, but supported)
+  // Case 4: seek between 2 of the prepared ranges, return out of bound


Maybe we can assert in a debug build. This should technically not happen.

At block iteration level, this could still happen, right. The reason is due to partial range delete with multiple levels.
E.g. with 3 levels
L1 : 0-10
L2: Delete range 0-5
L3: 0-10
If a range 2-8 was prepared, on L3, the prepared key would be 2, but the seek key would be 5, as the seek key was updated by the largest key of delete range. I believe all of the cases could be possible, when we adjust the ranges in the above examples.

Thanks. Yeah, it could happen with delete range since we'd seek to the end which could fall in the middle of prepared ranges, and we may not do bounds check on the seek target in LevelIterator.

anand1976 · 2025-10-10T00:47:40Z

table/block_based/block_based_table_iterator.cc

+  // The iterator from previous seek may have moved forward a few blocks,
+  // In that case, have block_idx catch up the cur_data_block_idx
+  // Note no need to handle block unpin, as it has been handled during iterating
+  block_idx = std::max(block_idx, multi_scan_->cur_data_block_idx);


Shouldn't block_idx always be >= multi_scan_->cur_data_block_idx (since iterator should never seek backwards from current key)?

This was added before we tighten the contract. I will try to remove it.

anand1976 · 2025-10-10T00:55:21Z

table/block_based/block_based_table_iterator.cc

-    block_iter_.Seek(*target);
-    FindKeyForward();
  }
+  multi_scan_->cur_data_block_idx = block_idx;


This is redundant right?

anand1976 · 2025-10-10T05:00:18Z

table/block_based/block_based_table_iterator.cc

+      // next_scan_idx :|
+      //                V
+      // table:     ____[prepared range 1]_____[prepared range 2]_____
+      // seek :                 3           4                      5


Its not clear how case 5 is handled. When MultiScanSeekTargetFromBlock is called, block_idx == data_block_seperators.size(). And then MultiScanLoadDataBlock will be called if I'm reading this correctly. Should we add a check somewhere that we've gone past the last block?

anand1976

LGTM. I'll download the diff locally and do a couple of db_stress runs.

anand1976 · 2025-10-10T20:57:24Z

db/db_iter.cc

  PERF_CPU_TIMER_GUARD(iter_seek_cpu_nanos, clock_);
  StopWatch sw(clock_, statistics_, DB_SEEK);

+  if (scan_opts_.has_value()) {


Should we also validate the MultiScanArgs in Prepare to ensure the ranges are sorted and non-overlapping? Its currently done in block based iterator, but we could move it here and immediately fail on the next seek rather than sometime in the middle

anand1976 · 2025-10-10T21:04:52Z

db/version_set.cc

    } else {
-      file_iter_.Prepare(scan_opts_);
+      // If file is not prepared with ranges, skip it
+      delete old_iter;


Is this strictly necessary? I'm worried that if FileMetaData is not initialized fully for some reason, we might end up in this situation and we'd want to Prepare with scan_opts_ in that case.

It is not strictly required. Mostly an optimization. Meantime, tighten the contract. If you think there is more risk, I can revert it back

I later on run into another issue. The problem is with delete range only file. The delete range only file is skipped for file to scan ranges mapping during level iterator prepare. However, when seek happens, it triggers file_iter_.SeekToFirst(); in LevelIterator::SkipEmptyFileForward. This happened with partial delete range file was added.

anand1976 · 2025-10-10T21:08:42Z

table/block_based/block_based_table_iterator.cc

+  // Case 1: seek before the first prepared ranges, return out of bound
+  // Case 2: seek at the beginning of a prepared range (expected case)
+  // Case 3: seek within a prepared range (unexpected, but supported)
+  // Case 4: seek between 2 of the prepared ranges, return out of bound


Thanks. Yeah, it could happen with delete range since we'd seek to the end which could fall in the middle of prepared ranges, and we may not do bounds check on the seek target in LevelIterator.

anand1976 · 2025-10-11T00:52:10Z

table/block_based/block_based_table_iterator.cc

+void BlockBasedTableIterator::SeekMultiScan(const Slice* seek_target) {
+  if (SeekMultiScanImpl(seek_target)) {
+    is_out_of_bound_ = true;
+    ResetDataIter();


I'm not sure if calling ResetDataIter() is the right thing to do. The same block might be needed for the next scan, even though this one might be empty.

I have thought about this. But I don't think in any of the cases, the block would be needed for the next scan.

It could happen with UDI, if the index doesn't find the prefix and returns out of bound result, but the next scan falls in the same block. Deferring the call to ResetDataIter() shouldn't cause any problems I think. If the block is not required for the next scan, it'll be called anyway.

Ack. I will remove iterator reset.

anand1976 · 2025-10-11T01:17:36Z

table/block_based/block_based_table_iterator.cc

  } else {
+    if (multi_scan_->next_scan_idx >=
+        multi_scan_->block_index_ranges_per_scan.size()) {
+      // Seeking a range that is out side of prepared ranges.


I'm not sure if we can conclude this. What if there's a seek to start of the last prepared range, and then immediately reseek to the start of the same range for whatever reason (too many internal keys skipped in higher levels, for example)? The reseek will encounter next_scan_idx == block_index_ranges_per_scan.size() and incorrectly return out of bounds.

I see. Good call. A reseek on the same key caused by too many deletion. I will think about how to handle that correctly.

anand1976 · 2025-10-11T01:53:10Z

table/block_based/block_based_table_iterator.cc

+  // linear search the block that contains the seek target, and unpin blocks
+  // that are before it.
+  auto const& data_block_separators = multi_scan_->data_block_separators;
+  while (block_idx < data_block_separators.size() &&


I wonder if it'd be safer to flip the order of the two while loops, i.e first find the right prepared range, then find the block within that prepared range, and then unpin everything between cur_data_block_idx and block_idx.

Could you elaborate more on the safety concern?

I think the concern here is that we are setting block index and scan index separately, so there's a risk them being out of sync.

OK. I will try to merge them into a single loop.

I found the current logic of storing blocks across different range scan to be very tricky to maintain. I am wondering whether we could convert the 1 dimensional vector into 2 dimensions, so that each scan range has its own corresponding vector of blocks pinned for it. It would be much easier to implement this logic and improve the readability. There is a challenge of handling overlap blocks between adjacent ranges. I am wondering whether we could just add duplicate the block cache pinnable object. It is more expensive to maintain? From memory allocation perspective, smaller vector might be easier to be allocated given we are using jemalloc by default.

I am wondering whether we could just add duplicate the block cache pinnable object.

Good idea, I think we can do this. The cost to maintain them should be small compared to the actual block reading etc.

I tried to to switch to 2D array. The challenge is IO coalescing. I need to spend more time to think about this. Meantime, I also tried to add some assertion to make sure they are aligned. However, I found that they are not always guaranteed to be aligned and it is ok. The reason is due to delete range mixing with data in the same file. When a prepared range fall into the delete range, it would get 0 block prepared. Meantime, the next_scan_idx is decided by the start key of the prepared range and the seek target. While the block_idx is decided by actual prepared blocks, so they would not be aligned.

Instead, I rewrite this section. The scan range has higher granularity, so using it to find the right range is more efficient. Once it locate the right scan range, it then scan the blocks one at a time until it find the right block using the separator. This should align them well.

…_test_rebase

meta-codesync · 2025-10-13T19:51:09Z

@xingbowang merged this pull request in 1585f22.

cbi42 · 2025-10-13T17:39:22Z

table/block_based/block_based_table_iterator.cc

+  // Case 2: seek at the beginning of a prepared range (expected case)
+  // Case 3: seek within a prepared range (unexpected, but supported)
+  // Case 4: seek between 2 of the prepared ranges, return out of bound
+  // Case 5: seek after all of the prepared ranges, should move on to next file


If the seek key is outside of this file's range, I think LevelIterator will just init another file iterator, potentially call Prepare() on it, and seek into that file. We may avoid the seek based on the iterator bound check:

rocksdb/db/version_set.cc

Lines 1379 to 1382 in 04c085a

const FdWithKeyRange& cur_file = flevel_->files[file_index_];

if (KeyReachedUpperBound(cur_file.smallest_key)) {

return;

}

, but we will waste the prepare() work, which is done when initializing the file iterator.

Prepared range is different from file range. E.g. a file could contain keys from 0 to 9, but prepared range only span between 2-4 and 6-8. If key 9 is seeked, it would call this code path and trigger case 5.

cbi42 · 2025-10-13T18:40:58Z

table/block_based/block_based_table_iterator.cc

-                index_iter_->user_key(),
-                /*a_has_ts*/ true, *scan_opt.range.limit,
-                /*b_has_ts=*/false) <= 0)) {
+            user_comparator_.CompareWithoutTimestamp(index_iter_->user_key(),


cbi42 · 2025-10-13T18:48:55Z

table/block_based/block_based_table_iterator.cc

+
+  auto compare_next_scan_start_result =
+      user_comparator_.CompareWithoutTimestamp(
+          ExtractUserKey(*seek_target), /*a_has_ts=*/true,


nit: can use multi_scan_->prev_seek_key_ instead of ExtractUserKey(*seek_target)

Good point. done.

cbi42 · 2025-10-13T19:04:47Z

table/block_based/block_based_table_iterator.cc

+    if (compare_next_scan_start_result < 0) {
+      // Needs to handle Cases: 1, 3, 4
+      //
+      // next_scan_idx :                    |


The comment for the cases is a bit confusing with next_scan_idx pointing to different ranges

Sorry, I didn't fully capture the confusion part.

cbi42 · 2025-10-13T19:06:04Z

table/block_based/block_based_table_iterator.cc

+    return false;
+  }
+
+  constexpr auto out_of_bound = true;


nit: kOutOfBound?

cbi42 · 2025-10-13T19:54:04Z

table/block_based/block_based_table_iterator.cc

+                  ->block_index_ranges_per_scan[multi_scan_->next_scan_idx]));
+    }
  } else {
+    if (multi_scan_->next_scan_idx >=


Can this happen given

multi_scan_->next_scan_idx = std::min(multi_scan_->next_scan_idx, multi_scan_->block_index_ranges_per_scan.size() - 1);

?

Good catch. No long possible. Cleaned up.

cbi42 · 2025-10-13T20:04:11Z

table/block_based/block_based_table_iterator.cc

-      is_out_of_bound_ = true;
-      assert(!Valid());
-      return;
+      return out_of_bound;


If this is the last scan within a multiscan, I think we can't set out-of-bound, so that LevelIterator can continue to read the next file.

cc @anand1976 if you are working on the change that supports scans with zero data blocks.

You are probably right. I agree that the is_out_of_bound_ flag is not correctly handled. It works in certain shape of LSM tree. We need fix this.

cbi42 · 2025-10-13T20:27:39Z

table/block_based/block_based_table_iterator.h

+      // TODO: Fix the max_prefetch_size support for multiple files.
+      // The goal is to limit the memory usage, prefetch could be done
+      // incrementally.
+      if (multi_scan_->scan_opts->max_prefetch_size == 0) {


Can this case happen? Given that we init prefetch_max_idx to be the pinned_data_blocks.size() and idx should be less than that.

This could happen due to UDI, when property bag specify the number of data item is small, but the limit of scan range is high. On the last scan range of the file, if caller continue call "Next()" call and does not stop on reaching specified the number of data item, we could run into this situation.

cbi42 · 2025-10-13T21:13:26Z

table/block_based/block_based_table_iterator.cc

+  // linear search the block that contains the seek target, and unpin blocks
+  // that are before it.
+  auto const& data_block_separators = multi_scan_->data_block_separators;
+  while (block_idx < data_block_separators.size() &&


I think the concern here is that we are setting block index and scan index separately, so there's a risk them being out of sync.

cbi42 · 2025-10-13T21:18:22Z

table/block_based/block_based_table_iterator.cc

+    return;
+  }
+
+  // // The iterator from previous seek may have moved forward a few blocks,


outdated comment?

Summary: The current seek key validation is too strict. This change relaxes it at block iterator level, and add additional check at DB iterator level. The new contract is that when MultiScan is used, after prepared is called, each following seek must seek the start key of the prepared scan range in order. Otherwise, the iterator is set with error status. Pull Request resolved: facebook#14040 Test Plan: Unit test Reviewed By: anand1976 Differential Revision: D84292297 Pulled By: xingbowang fbshipit-source-id: 7b31f727e67e7c0bfc53c2f9a6552e0c3d324869

cbi42 · 2025-10-14T00:12:53Z

table/block_based/block_based_table_iterator.cc

+              ExtractUserKey(*seek_target), /*a_has_ts=*/true,
+              data_block_separators[block_idx],
+              /*b_has_ts=*/false) > 0)) {
+    if (!multi_scan_->pinned_data_blocks[block_idx].IsEmpty()) {


I thought unpinning may not be desired for the following case, but now I think it's actually correct:

L1 : 0-10
L2: Delete range 0-5
L3: 0-10

MultiScan on ranges 1-2, 3-4, and 5-6.
When user first do Seek(1), we will internally do Seek(5) (due to DeleteRange) and unpins all blocks until 5. Then the next scan's blocks from 3-4 are unpinned. I thought we should not unpin them since we will need to read 3-4 as the next scan. But Seek(5) implies that these keys are all covered by some range deletion, so the next Seek(3) will also do Seek(5) internally, so the blocks from 3-4 could be safely unpinned.

Yes, you are right. I had similar discussion with Anand. Looks like this is a bit tricky. I will add some comment to explain this.

Summary: Currently in BlockBasedTableIterator's Prepare(), the index lookup for a MultiScan range is expected to return atleast 1 data block (unless UDI is in use). This is because there's an implicit assumption that only ranges intersecting with the keys in the file will be prepared. This assumption, however, doesn't hold if there are range deletions and the smallest and/or largest keys in the file extend beyond the keys in the file. The LevelIterator prunes the MultiScan ranges based on the smallest/largest key, so its possible for a range to only overlap the range deletion portion of the file and not overlap any of the data blocks. Furthermore, the BlockBasedTableIterator is now much more forgiving of Seek to targets outside of prepared ranges after #14040 . Keeping the above in mind, this PR removes the check in BlockBasedTableIterator for non-empty index result. It adds assertions in LevelIterator to verify that ranges are being properly pruned. Another side effect is we can no longer rely solely on a scan range having 0 data blocks (i.e cur_scan_start_idx >= cur_scan_end_idx) to decide if the iterator is out of bound. We can only do so for all but the last range prepared range. Pull Request resolved: #14046 Test Plan: 1. Add unit test in db_iterator_test 2. Run crash test Reviewed By: xingbowang Differential Revision: D84623871 Pulled By: anand1976 fbshipit-source-id: 2418e629f92b1c46c555ddea3761140f700819e4

…book#14046) Summary: Currently in BlockBasedTableIterator's Prepare(), the index lookup for a MultiScan range is expected to return atleast 1 data block (unless UDI is in use). This is because there's an implicit assumption that only ranges intersecting with the keys in the file will be prepared. This assumption, however, doesn't hold if there are range deletions and the smallest and/or largest keys in the file extend beyond the keys in the file. The LevelIterator prunes the MultiScan ranges based on the smallest/largest key, so its possible for a range to only overlap the range deletion portion of the file and not overlap any of the data blocks. Furthermore, the BlockBasedTableIterator is now much more forgiving of Seek to targets outside of prepared ranges after facebook#14040 . Keeping the above in mind, this PR removes the check in BlockBasedTableIterator for non-empty index result. It adds assertions in LevelIterator to verify that ranges are being properly pruned. Another side effect is we can no longer rely solely on a scan range having 0 data blocks (i.e cur_scan_start_idx >= cur_scan_end_idx) to decide if the iterator is out of bound. We can only do so for all but the last range prepared range. Pull Request resolved: facebook#14046 Test Plan: 1. Add unit test in db_iterator_test 2. Run crash test Reviewed By: xingbowang Differential Revision: D84623871 Pulled By: anand1976 fbshipit-source-id: 2418e629f92b1c46c555ddea3761140f700819e4

Summary: Address feedback from facebook#14040 Add additional test for MultiScan Fix a bug when del range and data are in same file for multi-scan Rewrite the cases need to be handled in SeekMultiScan Test Plan: Unit test Unit test Reviewers: Subscribers: Tasks: Tags:

Summary: * Address feedback from #14040 * Add additional test for MultiScan * Fix a bug when del range and data are in same file for multi-scan * Rewrite the cases need to be handled in SeekMultiScan Pull Request resolved: #14055 Test Plan: Unit test Reviewed By: cbi42, anand1976 Differential Revision: D84851788 Pulled By: xingbowang fbshipit-source-id: 0f69632733afb99685f6341badbf239681010c38

Relax the constraint on expected seek key in MultiScan

69e3f97

Summary: The current seek key validation is too strict, this change relaxes it to allow better handling with query span across multiple files after compaction. Test: Unit test

meta-cla bot added the CLA Signed label Oct 9, 2025

xingbowang marked this pull request as ready for review October 9, 2025 21:07

Add release note, fix build

7320592

anand1976 reviewed Oct 10, 2025

View reviewed changes

Xingbo Wang added 2 commits October 10, 2025 06:44

address feedback

114adf2

fix test and skip file if no range is prepared for it

78fe130

anand1976 approved these changes Oct 10, 2025

View reviewed changes

Xingbo Wang added 2 commits October 10, 2025 15:01

address feedback

10235a9

fix tests and move multiscan validation to be at prepare stage only

8149efd

anand1976 reviewed Oct 11, 2025

View reviewed changes

anand1976 requested a review from cbi42 October 11, 2025 00:53

anand1976 reviewed Oct 11, 2025

View reviewed changes

Xingbo Wang added 10 commits October 11, 2025 06:27

Fix old multi-scan test due to tighten multi-scan contract

3468921

fix tests

786ee2c

fix a boundary check bug in collect data block

d0faa52

more test and fix

d6fc5a2

fix build

36465b4

fix build

7e58be5

clean up

4875e0f

update delete range test with compaction

2c104c8

address linter

0026a8a

Merge remote-tracking branch 'upstream/main' into 2025_10_07_udi_rand…

2bc03ce

…_test_rebase

xingbowang force-pushed the 2025_10_07_udi_rand_test_rebase branch from d0dd675 to 2bc03ce Compare October 13, 2025 16:59

meta-codesync bot closed this in 1585f22 Oct 13, 2025

facebook-github-bot added the Merged label Oct 13, 2025

cbi42 reviewed Oct 13, 2025

View reviewed changes

cbi42 reviewed Oct 14, 2025

View reviewed changes

anand1976 mentioned this pull request Oct 14, 2025

Allow empty MultiScan result in BlockBasedTableIterator Prepare #14046

Closed

xingbowang mentioned this pull request Oct 16, 2025

Follow up on MultiScan change in #14040 #14055

Closed

	const FdWithKeyRange& cur_file = flevel_->files[file_index_];
	if (KeyReachedUpperBound(cur_file.smallest_key)) {
	return;
	}

Conversation

xingbowang commented Oct 9, 2025

Uh oh!

meta-codesync bot commented Oct 9, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

anand1976 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

meta-codesync bot commented Oct 13, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cbi42 Oct 13, 2025 •

edited

Loading