Move the MultiScan seek key check to upper layer #14040
xingbowang wants to merge 16 commits into facebook:main
Conversation
Summary: The current seek key validation is too strict. This change relaxes it to better handle queries that span multiple files after compaction. Test: Unit test
@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D84292297.
db/db_iter.cc
Outdated
}

// Set the upper bound with the limit if it is not set yet.
if (iterate_upper_bound_ == nullptr) {
If this is non-null, we should validate that it matches the range.limit. If it doesn't, that's a violation of the contract.
I tried to do this, but a test failed. The issue is that the upper bound is set when the iterator is created; the pointer is passed from read_options down to the merging iterator, etc. I could add an additional set method to all the iterators, but it becomes more complicated. For now, let's just tighten the API constraint to always require the upper bound to be set when the limit is set.
// Case 1: seek before the first prepared range, return out of bound
// Case 2: seek at the beginning of a prepared range (expected case)
// Case 3: seek within a prepared range (unexpected, but supported)
// Case 4: seek between 2 of the prepared ranges, return out of bound
Maybe we can assert in a debug build. This should technically not happen.
At the block iteration level, this could still happen. The reason is partial range deletes across multiple levels.

E.g. with 3 levels:
L1: 0-10
L2: delete range 0-5
L3: 0-10

If a range 2-8 was prepared, on L3 the prepared key would be 2, but the seek key would be 5, as the seek key was updated to the largest key of the delete range. I believe all of the cases are possible when we adjust the ranges in the above example.
Thanks. Yeah, it could happen with delete range, since we'd seek to the end of the delete range, which could fall in the middle of prepared ranges, and we may not do a bounds check on the seek target in LevelIterator.
// The iterator from previous seek may have moved forward a few blocks.
// In that case, have block_idx catch up to cur_data_block_idx.
// Note: no need to handle block unpin, as it has been handled during iterating.
block_idx = std::max(block_idx, multi_scan_->cur_data_block_idx);
Shouldn't block_idx always be >= multi_scan_->cur_data_block_idx (since iterator should never seek backwards from current key)?
This was added before we tightened the contract. I will try to remove it.
block_iter_.Seek(*target);
FindKeyForward();
}
multi_scan_->cur_data_block_idx = block_idx;
This is redundant, right?
// next_scan_idx :|
//                V
// table: ____[prepared range 1]_____[prepared range 2]_____
// seek :     3                  4                        5
It's not clear how case 5 is handled. When MultiScanSeekTargetFromBlock is called, block_idx == data_block_separators.size(), and then MultiScanLoadDataBlock will be called, if I'm reading this correctly. Should we add a check somewhere that we've gone past the last block?
anand1976 left a comment:
LGTM. I'll download the diff locally and do a couple of db_stress runs.
PERF_CPU_TIMER_GUARD(iter_seek_cpu_nanos, clock_);
StopWatch sw(clock_, statistics_, DB_SEEK);

if (scan_opts_.has_value()) {
Should we also validate the MultiScanArgs in Prepare to ensure the ranges are sorted and non-overlapping? It's currently done in the block based iterator, but we could move it here and immediately fail on the next seek rather than sometime in the middle.
db/version_set.cc
Outdated
} else {
  file_iter_.Prepare(scan_opts_);
// If file is not prepared with ranges, skip it
delete old_iter;
Is this strictly necessary? I'm worried that if FileMetaData is not initialized fully for some reason, we might end up in this situation and we'd want to Prepare with scan_opts_ in that case.
It is not strictly required; mostly an optimization. In the meantime, it tightens the contract. If you think there is more risk, I can revert it.
I later ran into another issue. The problem is with a delete-range-only file. Such a file is skipped in the file-to-scan-ranges mapping during level iterator Prepare. However, when a seek happens, it triggers file_iter_.SeekToFirst() in LevelIterator::SkipEmptyFileForward. This happened when a file with a partial delete range was added.
void BlockBasedTableIterator::SeekMultiScan(const Slice* seek_target) {
  if (SeekMultiScanImpl(seek_target)) {
    is_out_of_bound_ = true;
    ResetDataIter();
I'm not sure if calling ResetDataIter() is the right thing to do. The same block might be needed for the next scan, even though this one might be empty.
I have thought about this, but I don't think the block would be needed for the next scan in any of the cases.
It could happen with UDI, if the index doesn't find the prefix and returns an out-of-bound result, but the next scan falls in the same block. Deferring the call to ResetDataIter() shouldn't cause any problems, I think. If the block is not required for the next scan, it'll be called anyway.
Ack. I will remove iterator reset.
} else {
  if (multi_scan_->next_scan_idx >=
      multi_scan_->block_index_ranges_per_scan.size()) {
    // Seeking a range that is outside of the prepared ranges.
I'm not sure if we can conclude this. What if there's a seek to start of the last prepared range, and then immediately reseek to the start of the same range for whatever reason (too many internal keys skipped in higher levels, for example)? The reseek will encounter next_scan_idx == block_index_ranges_per_scan.size() and incorrectly return out of bounds.
I see. Good call. A reseek on the same key can be caused by too many deletions. I will think about how to handle that correctly.
// Linear search the block that contains the seek target, and unpin blocks
// that are before it.
auto const& data_block_separators = multi_scan_->data_block_separators;
while (block_idx < data_block_separators.size() &&
I wonder if it'd be safer to flip the order of the two while loops, i.e. first find the right prepared range, then find the block within that prepared range, and then unpin everything between cur_data_block_idx and block_idx.
Could you elaborate more on the safety concern?
I think the concern here is that we are setting the block index and scan index separately, so there's a risk of them being out of sync.
OK. I will try to merge them into a single loop.
I found the current logic of storing blocks across different range scans to be very tricky to maintain. I am wondering whether we could convert the one-dimensional vector into two dimensions, so that each scan range has its own corresponding vector of blocks pinned for it. It would be much easier to implement this logic and would improve readability. There is a challenge of handling blocks that overlap between adjacent ranges. I am wondering whether we could just duplicate the block cache pinnable object. Is it more expensive to maintain? From a memory allocation perspective, smaller vectors might be easier to allocate, given we are using jemalloc by default.
> I am wondering whether we could just duplicate the block cache pinnable object.

Good idea, I think we can do this. The cost to maintain them should be small compared to the actual block reading, etc.
I tried to switch to a 2D array. The challenge is IO coalescing; I need to spend more time thinking about this. Meanwhile, I also tried to add some assertions to make sure they are aligned. However, I found that they are not always guaranteed to be aligned, and that is ok. The reason is delete ranges mixing with data in the same file. When a prepared range falls into the delete range, it gets 0 blocks prepared. Meanwhile, next_scan_idx is decided by the start key of the prepared range and the seek target, while block_idx is decided by the actual prepared blocks, so they would not be aligned.

Instead, I rewrote this section. The scan range has higher granularity, so using it to find the right range is more efficient. Once it locates the right scan range, it then scans the blocks one at a time until it finds the right block using the separator. This should align them well.
d0dd675 to 2bc03ce
@xingbowang merged this pull request in 1585f22.
// Case 2: seek at the beginning of a prepared range (expected case)
// Case 3: seek within a prepared range (unexpected, but supported)
// Case 4: seek between 2 of the prepared ranges, return out of bound
// Case 5: seek after all of the prepared ranges, should move on to next file
If the seek key is outside of this file's range, I think LevelIterator will just init another file iterator, potentially call Prepare() on it, and seek into that file. We may avoid the seek based on the iterator bound check:
Lines 1379 to 1382 in 04c085a
The prepared range is different from the file range. E.g. a file could contain keys from 0 to 9, but the prepared ranges only span 2-4 and 6-8. If key 9 is sought, it would hit this code path and trigger case 5.
index_iter_->user_key(),
/*a_has_ts*/ true, *scan_opt.range.limit,
/*b_has_ts=*/false) <= 0)) {
user_comparator_.CompareWithoutTimestamp(index_iter_->user_key(),

auto compare_next_scan_start_result =
    user_comparator_.CompareWithoutTimestamp(
        ExtractUserKey(*seek_target), /*a_has_ts=*/true,
nit: can use multi_scan_->prev_seek_key_ instead of ExtractUserKey(*seek_target)
Good point. Done.
if (compare_next_scan_start_result < 0) {
  // Needs to handle Cases: 1, 3, 4
  //
  // next_scan_idx : |
The comment for the cases is a bit confusing, with next_scan_idx pointing to different ranges.
Sorry, I didn't fully understand which part is confusing.
return false;
}

constexpr auto out_of_bound = true;
    ->block_index_ranges_per_scan[multi_scan_->next_scan_idx]));
  }
} else {
  if (multi_scan_->next_scan_idx >=
Can this happen, given

multi_scan_->next_scan_idx =
    std::min(multi_scan_->next_scan_idx,
             multi_scan_->block_index_ranges_per_scan.size() - 1);

?
Good catch. No longer possible. Cleaned up.
is_out_of_bound_ = true;
assert(!Valid());
return;
return out_of_bound;
If this is the last scan within a multiscan, I think we can't set out-of-bound, so that LevelIterator can continue to read the next file.
cc @anand1976 if you are working on the change that supports scans with zero data blocks.
You are probably right. I agree that the is_out_of_bound_ flag is not correctly handled; it works only for certain shapes of the LSM tree. We need to fix this.
// TODO: Fix the max_prefetch_size support for multiple files.
// The goal is to limit the memory usage; prefetch could be done
// incrementally.
if (multi_scan_->scan_opts->max_prefetch_size == 0) {
Can this case happen? Given that we init prefetch_max_idx to be the pinned_data_blocks.size() and idx should be less than that.
This could happen due to UDI, when the property bag specifies a small number of data items but the limit of the scan range is high. On the last scan range of the file, if the caller continues to call Next() and does not stop on reaching the specified number of data items, we could run into this situation.
return;
}

// // The iterator from previous seek may have moved forward a few blocks,
Summary: The current seek key validation is too strict. This change relaxes it at the block iterator level and adds an additional check at the DB iterator level. The new contract is that when MultiScan is used, after Prepare is called, each following seek must seek to the start key of the prepared scan ranges, in order. Otherwise, the iterator is set with an error status.

Pull Request resolved: facebook#14040

Test Plan: Unit test

Reviewed By: anand1976

Differential Revision: D84292297

Pulled By: xingbowang

fbshipit-source-id: 7b31f727e67e7c0bfc53c2f9a6552e0c3d324869
ExtractUserKey(*seek_target), /*a_has_ts=*/true,
data_block_separators[block_idx],
/*b_has_ts=*/false) > 0)) {
if (!multi_scan_->pinned_data_blocks[block_idx].IsEmpty()) {
I thought unpinning may not be desired for the following case, but now I think it's actually correct:

L1: 0-10
L2: delete range 0-5
L3: 0-10

MultiScan on ranges 1-2, 3-4, and 5-6.

When the user first does Seek(1), we will internally do Seek(5) (due to the DeleteRange) and unpin all blocks until 5, which also unpins the next scan's blocks for 3-4. I thought we should not unpin them, since we will need to read 3-4 as the next scan. But Seek(5) implies that those keys are all covered by some range deletion, so the next Seek(3) will also do Seek(5) internally, and the blocks for 3-4 can be safely unpinned.
Yes, you are right. I had a similar discussion with Anand. Looks like this is a bit tricky. I will add some comments to explain this.
Summary: Currently in BlockBasedTableIterator's Prepare(), the index lookup for a MultiScan range is expected to return at least 1 data block (unless UDI is in use). This is because there's an implicit assumption that only ranges intersecting with the keys in the file will be prepared. This assumption, however, doesn't hold if there are range deletions and the smallest and/or largest keys in the file extend beyond the keys in the file. The LevelIterator prunes the MultiScan ranges based on the smallest/largest key, so it's possible for a range to only overlap the range deletion portion of the file and not overlap any of the data blocks. Furthermore, the BlockBasedTableIterator is now much more forgiving of Seeks to targets outside of prepared ranges after #14040.

Keeping the above in mind, this PR removes the check in BlockBasedTableIterator for a non-empty index result. It adds assertions in LevelIterator to verify that ranges are being properly pruned. Another side effect is that we can no longer rely solely on a scan range having 0 data blocks (i.e. cur_scan_start_idx >= cur_scan_end_idx) to decide if the iterator is out of bound. We can only do so for all but the last prepared range.

Pull Request resolved: #14046

Test Plan:
1. Add unit test in db_iterator_test
2. Run crash test

Reviewed By: xingbowang

Differential Revision: D84623871

Pulled By: anand1976

fbshipit-source-id: 2418e629f92b1c46c555ddea3761140f700819e4
Summary:
* Address feedback from #14040
* Add additional test for MultiScan
* Fix a bug when delete range and data are in the same file for multi-scan
* Rewrite the cases that need to be handled in SeekMultiScan

Pull Request resolved: #14055

Test Plan: Unit test

Reviewed By: cbi42, anand1976

Differential Revision: D84851788

Pulled By: xingbowang

fbshipit-source-id: 0f69632733afb99685f6341badbf239681010c38