Skip to content

Move the MultiScan seek key check to upper layer #14040

Closed
xingbowang wants to merge 16 commits intofacebook:mainfrom
xingbowang:2025_10_07_udi_rand_test_rebase
Closed

Move the MultiScan seek key check to upper layer #14040
xingbowang wants to merge 16 commits intofacebook:mainfrom
xingbowang:2025_10_07_udi_rand_test_rebase

Conversation

@xingbowang
Copy link
Contributor

Summary:

The current seek key validation is too strict. This change relaxes it at block iterator level, and add additional check at DB iterator level. The new contract is that when MultiScan is used, after prepared is called, each following seek must seek the start key of the prepared scan range in order. Otherwise, the iterator is set with error status.

Test:

Unit test

Summary:

The current seek key validation is too strict, this change relaxes it to
allow better handling with query span across multiple files after
compaction.

Test:

Unit test
@meta-cla meta-cla bot added the CLA Signed label Oct 9, 2025
@xingbowang xingbowang marked this pull request as ready for review October 9, 2025 21:07
@meta-codesync
Copy link

meta-codesync bot commented Oct 9, 2025

@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D84292297.

db/db_iter.cc Outdated
}

// Set the upper bound with the limit if it is not set yet.
if (iterate_upper_bound_ == nullptr) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is non-null, we should validate that it matches the range.limit. If it doesn't, that's a violation of the contract.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will add it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to do this, but test failed. The issue is that upper bound is set when iterator is created, the pointer is passed from read_options down to merging iterator, etc. I could add additional set method to all the iterator, but it becomes more complicated. For now, let's just tighten the API constraint to always require upper bound set when limit is set.

// Case 1: seek before the first prepared ranges, return out of bound
// Case 2: seek at the beginning of a prepared range (expected case)
// Case 3: seek within a prepared range (unexpected, but supported)
// Case 4: seek between 2 of the prepared ranges, return out of bound
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can assert in a debug build. This should technically not happen.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At block iteration level, this could still happen, right. The reason is due to partial range delete with multiple levels.
E.g. with 3 levels
L1 : 0-10
L2: Delete range 0-5
L3: 0-10
If a range 2-8 was prepared, on L3, the prepared key would be 2, but the seek key would be 5, as the seek key was updated by the largest key of delete range. I believe all of the cases could be possible, when we adjust the ranges in the above examples.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Yeah, it could happen with delete range since we'd seek to the end which could fall in the middle of prepared ranges, and we may not do bounds check on the seek target in LevelIterator.

// The iterator from previous seek may have moved forward a few blocks,
// In that case, have block_idx catch up the cur_data_block_idx
// Note no need to handle block unpin, as it has been handled during iterating
block_idx = std::max(block_idx, multi_scan_->cur_data_block_idx);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't block_idx always be >= multi_scan_->cur_data_block_idx (since iterator should never seek backwards from current key)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was added before we tighten the contract. I will try to remove it.

block_iter_.Seek(*target);
FindKeyForward();
}
multi_scan_->cur_data_block_idx = block_idx;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is redundant right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deduped

// next_scan_idx :|
// V
// table: ____[prepared range 1]_____[prepared range 2]_____
// seek : 3 4 5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its not clear how case 5 is handled. When MultiScanSeekTargetFromBlock is called, block_idx == data_block_seperators.size(). And then MultiScanLoadDataBlock will be called if I'm reading this correctly. Should we add a check somewhere that we've gone past the last block?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added.

Copy link
Contributor

@anand1976 anand1976 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I'll download the diff locally and do a couple of db_stress runs.

PERF_CPU_TIMER_GUARD(iter_seek_cpu_nanos, clock_);
StopWatch sw(clock_, statistics_, DB_SEEK);

if (scan_opts_.has_value()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also validate the MultiScanArgs in Prepare to ensure the ranges are sorted and non-overlapping? Its currently done in block based iterator, but we could move it here and immediately fail on the next seek rather than sometime in the middle

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call.

} else {
file_iter_.Prepare(scan_opts_);
// If file is not prepared with ranges, skip it
delete old_iter;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this strictly necessary? I'm worried that if FileMetaData is not initialized fully for some reason, we might end up in this situation and we'd want to Prepare with scan_opts_ in that case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not strictly required. Mostly an optimization. Meantime, tighten the contract. If you think there is more risk, I can revert it back

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I later on run into another issue. The problem is with delete range only file. The delete range only file is skipped for file to scan ranges mapping during level iterator prepare. However, when seek happens, it triggers file_iter_.SeekToFirst(); in LevelIterator::SkipEmptyFileForward. This happened with partial delete range file was added.

// Case 1: seek before the first prepared ranges, return out of bound
// Case 2: seek at the beginning of a prepared range (expected case)
// Case 3: seek within a prepared range (unexpected, but supported)
// Case 4: seek between 2 of the prepared ranges, return out of bound
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Yeah, it could happen with delete range since we'd seek to the end which could fall in the middle of prepared ranges, and we may not do bounds check on the seek target in LevelIterator.

void BlockBasedTableIterator::SeekMultiScan(const Slice* seek_target) {
if (SeekMultiScanImpl(seek_target)) {
is_out_of_bound_ = true;
ResetDataIter();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if calling ResetDataIter() is the right thing to do. The same block might be needed for the next scan, even though this one might be empty.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have thought about this. But I don't think in any of the cases, the block would be needed for the next scan.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could happen with UDI, if the index doesn't find the prefix and returns out of bound result, but the next scan falls in the same block. Deferring the call to ResetDataIter() shouldn't cause any problems I think. If the block is not required for the next scan, it'll be called anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack. I will remove iterator reset.

@anand1976 anand1976 requested a review from cbi42 October 11, 2025 00:53
} else {
if (multi_scan_->next_scan_idx >=
multi_scan_->block_index_ranges_per_scan.size()) {
// Seeking a range that is out side of prepared ranges.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if we can conclude this. What if there's a seek to start of the last prepared range, and then immediately reseek to the start of the same range for whatever reason (too many internal keys skipped in higher levels, for example)? The reseek will encounter next_scan_idx == block_index_ranges_per_scan.size() and incorrectly return out of bounds.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Good call. A reseek on the same key caused by too many deletion. I will think about how to handle that correctly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

// linear search the block that contains the seek target, and unpin blocks
// that are before it.
auto const& data_block_separators = multi_scan_->data_block_separators;
while (block_idx < data_block_separators.size() &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it'd be safer to flip the order of the two while loops, i.e first find the right prepared range, then find the block within that prepared range, and then unpin everything between cur_data_block_idx and block_idx.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate more on the safety concern?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the concern here is that we are setting block index and scan index separately, so there's a risk them being out of sync.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I will try to merge them into a single loop.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found the current logic of storing blocks across different range scan to be very tricky to maintain. I am wondering whether we could convert the 1 dimensional vector into 2 dimensions, so that each scan range has its own corresponding vector of blocks pinned for it. It would be much easier to implement this logic and improve the readability. There is a challenge of handling overlap blocks between adjacent ranges. I am wondering whether we could just add duplicate the block cache pinnable object. It is more expensive to maintain? From memory allocation perspective, smaller vector might be easier to be allocated given we are using jemalloc by default.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering whether we could just add duplicate the block cache pinnable object.

Good idea, I think we can do this. The cost to maintain them should be small compared to the actual block reading etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to to switch to 2D array. The challenge is IO coalescing. I need to spend more time to think about this. Meantime, I also tried to add some assertion to make sure they are aligned. However, I found that they are not always guaranteed to be aligned and it is ok. The reason is due to delete range mixing with data in the same file. When a prepared range fall into the delete range, it would get 0 block prepared. Meantime, the next_scan_idx is decided by the start key of the prepared range and the seek target. While the block_idx is decided by actual prepared blocks, so they would not be aligned.

Instead, I rewrite this section. The scan range has higher granularity, so using it to find the right range is more efficient. Once it locate the right scan range, it then scan the blocks one at a time until it find the right block using the separator. This should align them well.

@xingbowang xingbowang force-pushed the 2025_10_07_udi_rand_test_rebase branch from d0dd675 to 2bc03ce Compare October 13, 2025 16:59
@meta-codesync meta-codesync bot closed this in 1585f22 Oct 13, 2025
@meta-codesync
Copy link

meta-codesync bot commented Oct 13, 2025

@xingbowang merged this pull request in 1585f22.

// Case 2: seek at the beginning of a prepared range (expected case)
// Case 3: seek within a prepared range (unexpected, but supported)
// Case 4: seek between 2 of the prepared ranges, return out of bound
// Case 5: seek after all of the prepared ranges, should move on to next file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the seek key is outside of this file's range, I think LevelIterator will just init another file iterator, potentially call Prepare() on it, and seek into that file. We may avoid the seek based on the iterator bound check:

rocksdb/db/version_set.cc

Lines 1379 to 1382 in 04c085a

const FdWithKeyRange& cur_file = flevel_->files[file_index_];
if (KeyReachedUpperBound(cur_file.smallest_key)) {
return;
}
, but we will waste the prepare() work, which is done when initializing the file iterator.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prepared range is different from file range. E.g. a file could contain keys from 0 to 9, but prepared range only span between 2-4 and 6-8. If key 9 is seeked, it would call this code path and trigger case 5.

index_iter_->user_key(),
/*a_has_ts*/ true, *scan_opt.range.limit,
/*b_has_ts=*/false) <= 0)) {
user_comparator_.CompareWithoutTimestamp(index_iter_->user_key(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍


auto compare_next_scan_start_result =
user_comparator_.CompareWithoutTimestamp(
ExtractUserKey(*seek_target), /*a_has_ts=*/true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can use multi_scan_->prev_seek_key_ instead of ExtractUserKey(*seek_target)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. done.

if (compare_next_scan_start_result < 0) {
// Needs to handle Cases: 1, 3, 4
//
// next_scan_idx : |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment for the cases is a bit confusing with next_scan_idx pointing to different ranges

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I didn't fully capture the confusion part.

return false;
}

constexpr auto out_of_bound = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: kOutOfBound?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

->block_index_ranges_per_scan[multi_scan_->next_scan_idx]));
}
} else {
if (multi_scan_->next_scan_idx >=
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this happen given

multi_scan_->next_scan_idx =
      std::min(multi_scan_->next_scan_idx,
               multi_scan_->block_index_ranges_per_scan.size() - 1);

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. No long possible. Cleaned up.

is_out_of_bound_ = true;
assert(!Valid());
return;
return out_of_bound;
Copy link
Contributor

@cbi42 cbi42 Oct 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is the last scan within a multiscan, I think we can't set out-of-bound, so that LevelIterator can continue to read the next file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @anand1976 if you are working on the change that supports scans with zero data blocks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are probably right. I agree that the is_out_of_bound_ flag is not correctly handled. It works in certain shape of LSM tree. We need fix this.

// TODO: Fix the max_prefetch_size support for multiple files.
// The goal is to limit the memory usage, prefetch could be done
// incrementally.
if (multi_scan_->scan_opts->max_prefetch_size == 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this case happen? Given that we init prefetch_max_idx to be the pinned_data_blocks.size() and idx should be less than that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could happen due to UDI, when property bag specify the number of data item is small, but the limit of scan range is high. On the last scan range of the file, if caller continue call "Next()" call and does not stop on reaching specified the number of data item, we could run into this situation.

// linear search the block that contains the seek target, and unpin blocks
// that are before it.
auto const& data_block_separators = multi_scan_->data_block_separators;
while (block_idx < data_block_separators.size() &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the concern here is that we are setting block index and scan index separately, so there's a risk them being out of sync.

return;
}

// // The iterator from previous seek may have moved forward a few blocks,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

outdated comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed.

xingbowang pushed a commit to xingbowang/rocksdb that referenced this pull request Oct 13, 2025
Summary:
The current seek key validation is too strict. This change relaxes it at block iterator level, and add additional check at DB iterator level. The new contract is that when MultiScan is used, after prepared is called, each following seek must seek the start key of the prepared scan range in order. Otherwise, the iterator is set with error status.

Pull Request resolved: facebook#14040

Test Plan: Unit test

Reviewed By: anand1976

Differential Revision: D84292297

Pulled By: xingbowang

fbshipit-source-id: 7b31f727e67e7c0bfc53c2f9a6552e0c3d324869
ExtractUserKey(*seek_target), /*a_has_ts=*/true,
data_block_separators[block_idx],
/*b_has_ts=*/false) > 0)) {
if (!multi_scan_->pinned_data_blocks[block_idx].IsEmpty()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought unpinning may not be desired for the following case, but now I think it's actually correct:

L1 : 0-10
L2: Delete range 0-5
L3: 0-10

MultiScan on ranges 1-2, 3-4, and 5-6.
When user first do Seek(1), we will internally do Seek(5) (due to DeleteRange) and unpins all blocks until 5. Then the next scan's blocks from 3-4 are unpinned. I thought we should not unpin them since we will need to read 3-4 as the next scan. But Seek(5) implies that these keys are all covered by some range deletion, so the next Seek(3) will also do Seek(5) internally, so the blocks from 3-4 could be safely unpinned.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right. I had similar discussion with Anand. Looks like this is a bit tricky. I will add some comment to explain this.

meta-codesync bot pushed a commit that referenced this pull request Oct 14, 2025
Summary:
Currently in BlockBasedTableIterator's Prepare(), the index lookup for a MultiScan range is expected to return atleast 1 data block (unless UDI is in use). This is because there's an implicit assumption that only ranges intersecting with the keys in the file will be prepared. This assumption, however, doesn't hold if there are range deletions and the smallest and/or largest keys in the file extend beyond the keys in the file. The LevelIterator prunes the MultiScan ranges based on the smallest/largest key, so its possible for a range to only overlap the range deletion portion of the file and not overlap any of the data blocks. Furthermore, the BlockBasedTableIterator is now much more forgiving of Seek to targets outside of prepared ranges after #14040 .

Keeping the above in mind, this PR removes the check in BlockBasedTableIterator for non-empty index result. It adds assertions in LevelIterator to verify that ranges are being properly pruned. Another side effect is we can no longer rely solely on a scan range having 0 data blocks (i.e cur_scan_start_idx >= cur_scan_end_idx) to decide if the iterator is out of bound. We can only do so for all but the last range prepared range.

Pull Request resolved: #14046

Test Plan:
1. Add unit test in db_iterator_test
2. Run crash test

Reviewed By: xingbowang

Differential Revision: D84623871

Pulled By: anand1976

fbshipit-source-id: 2418e629f92b1c46c555ddea3761140f700819e4
xingbowang pushed a commit to xingbowang/rocksdb that referenced this pull request Oct 14, 2025
…book#14046)

Summary:
Currently in BlockBasedTableIterator's Prepare(), the index lookup for a MultiScan range is expected to return atleast 1 data block (unless UDI is in use). This is because there's an implicit assumption that only ranges intersecting with the keys in the file will be prepared. This assumption, however, doesn't hold if there are range deletions and the smallest and/or largest keys in the file extend beyond the keys in the file. The LevelIterator prunes the MultiScan ranges based on the smallest/largest key, so its possible for a range to only overlap the range deletion portion of the file and not overlap any of the data blocks. Furthermore, the BlockBasedTableIterator is now much more forgiving of Seek to targets outside of prepared ranges after facebook#14040 .

Keeping the above in mind, this PR removes the check in BlockBasedTableIterator for non-empty index result. It adds assertions in LevelIterator to verify that ranges are being properly pruned. Another side effect is we can no longer rely solely on a scan range having 0 data blocks (i.e cur_scan_start_idx >= cur_scan_end_idx) to decide if the iterator is out of bound. We can only do so for all but the last range prepared range.

Pull Request resolved: facebook#14046

Test Plan:
1. Add unit test in db_iterator_test
2. Run crash test

Reviewed By: xingbowang

Differential Revision: D84623871

Pulled By: anand1976

fbshipit-source-id: 2418e629f92b1c46c555ddea3761140f700819e4
xingbowang pushed a commit to xingbowang/rocksdb that referenced this pull request Oct 16, 2025
Summary:

Address feedback from facebook#14040
Add additional test for MultiScan
Fix a bug when del range and data are in same file for multi-scan
Rewrite the cases need to be handled in SeekMultiScan

Test Plan:

Unit test

Unit test

Reviewers:

Subscribers:

Tasks:

Tags:
meta-codesync bot pushed a commit that referenced this pull request Oct 24, 2025
Summary:
* Address feedback from #14040
* Add additional test for MultiScan
* Fix a bug when del range and data are in same file for multi-scan
* Rewrite the cases need to be handled in SeekMultiScan

Pull Request resolved: #14055

Test Plan: Unit test

Reviewed By: cbi42, anand1976

Differential Revision: D84851788

Pulled By: xingbowang

fbshipit-source-id: 0f69632733afb99685f6341badbf239681010c38
anand1976 pushed a commit that referenced this pull request Oct 27, 2025
Summary:
* Address feedback from #14040
* Add additional test for MultiScan
* Fix a bug when del range and data are in same file for multi-scan
* Rewrite the cases need to be handled in SeekMultiScan

Pull Request resolved: #14055

Test Plan: Unit test

Reviewed By: cbi42, anand1976

Differential Revision: D84851788

Pulled By: xingbowang

fbshipit-source-id: 0f69632733afb99685f6341badbf239681010c38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants