Refactor IndexBuilder::AddIndexEntry #12867
Conversation
Summary: Something I am working on is going to expand usage of `BlockBasedTableBuilder::Rep::last_key`, but the existing code contract for `IndexBuilder::AddIndexEntry` makes that difficult, because it modifies its `last_key` parameter to be the separator value recorded in the index, often something between the two boundary keys.

This change primarily changes the contract of that function and related functions to separate function inputs and outputs, without sacrificing efficiency. For efficiency, a reusable scratch string buffer is provided by the caller, which the callee can use (or not) in returning a result Slice. That should yield a performance improvement, as we are reusing a buffer for keys rather than copying into a new one each time in the FindShort* functions, without any additional string copies or conditional branches.

Additional improvements in PartitionedIndexBuilder specifically:

* Reduce string copies by eliminating `sub_index_last_key_` and instead tracking the key for the next partition in a placeholder Entry.
* Simplify code and improve code quality by changing `sub_index_builder_` to a unique_ptr.
* Eliminate an unnecessary NewFlushBlockPolicy call/object.

Test Plan: existing tests, crash test. Will validate performance along with the change this is setting up.
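For illustration, here is a minimal, self-contained C++ sketch of the input/output separation the summary describes: both boundary keys are read-only inputs, and the returned separator may point either into a caller-provided scratch buffer or at one of the inputs, so no per-entry allocation or input mutation is needed. This is not RocksDB's actual implementation: `MakeSeparator` is a hypothetical stand-in for the `AddIndexEntry`/`FindShortestSeparator` logic, and `std::string_view` stands in for `rocksdb::Slice`.

```cpp
#include <algorithm>
#include <cstdio>
#include <string>
#include <string_view>

// Hypothetical stand-in for the new contract: both boundary keys are const
// inputs, and the result may be materialized in the reusable caller-provided
// scratch buffer, or alias an input with no copy at all.
std::string_view MakeSeparator(std::string_view last_key_in_current_block,
                               std::string_view first_key_in_next_block,
                               std::string* separator_scratch) {
  // Find the first byte position where the two boundary keys differ.
  size_t min_len = std::min(last_key_in_current_block.size(),
                            first_key_in_next_block.size());
  size_t i = 0;
  while (i < min_len &&
         last_key_in_current_block[i] == first_key_in_next_block[i]) {
    ++i;
  }
  if (i < min_len &&
      static_cast<unsigned char>(last_key_in_current_block[i]) + 1 <
          static_cast<unsigned char>(first_key_in_next_block[i])) {
    // A shorter key fits strictly between the boundaries: build it in the
    // scratch buffer, which the caller reuses across entries, so no fresh
    // string is allocated per index entry.
    separator_scratch->assign(last_key_in_current_block.data(), i + 1);
    (*separator_scratch)[i]++;
    return *separator_scratch;
  }
  // Otherwise fall back to the boundary key itself. Note the caller's copy
  // of the last key is never clobbered, unlike under the old contract.
  return last_key_in_current_block;
}

int main() {
  std::string scratch;  // reused across many AddIndexEntry-style calls
  std::string_view sep = MakeSeparator("apple", "cherry", &scratch);
  std::printf("%.*s\n", static_cast<int>(sep.size()), sep.data());  // "b"
}
```

The point matching the summary: the caller owns and reuses the scratch string, so the callee can hand back a result Slice without either mutating its `last_key` input or allocating a new string per call.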
@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
table/block_based/index_builder.cc (outdated)

```cpp
      first_key_in_next_block, block_handle);
  auto sep = sub_index_builder_->AddIndexEntry(
      last_key_in_current_block, first_key_in_next_block, block_handle,
      separator_scratch);
  if (!seperator_is_key_plus_seq_ &&
      sub_index_builder_->seperator_is_key_plus_seq_) {
    // then we need to apply it to all sub-index builders and reset
    // flush_policy to point to Block Builder of sub_index_builder_ that store
```
Comment needs to be updated.
```cpp
// To allow further optimization, we provide `last_key_in_current_block` and
// `first_key_in_next_block`, based on which the specific implementation can
// determine the best index key to be used for the index block.
// Called before the OnKeyAdded() call for first_key_in_next_block.
// @last_key_in_current_block: this parameter maybe overridden with the value
// "substitute key".
// @last_key_in_current_block: TODO lifetime details
```
Fix the TODO?
@pdillinger has updated the pull request. You must reimport the pull request before landing.
@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@pdillinger merged this pull request in f456a72.
Summary: This is in part a refactoring / simplification to set up for "decoupled" partitioned filters and in part to fix an intentional regression for a correctness fix in #12872. Basically, we are taking out some complexity of the filter block builders, and pushing part of it (simultaneous de-duplication of prefixes and whole keys) into the filter bits builders, where it is more efficient by operating on hashes (rather than copied keys).

Previously, the FullFilterBlockBuilder had a somewhat fragile and confusing set of conditions under which it would keep a copy of the most recent prefix and most recent whole key, along with some other state that is essentially redundant. Now we just track (always) the previous prefix in the PartitionedFilterBlockBuilder, to deal with the boundary prefix Seek filtering problem. (Btw, the next PR will optimize this away since BlockBasedTableReader already tracks the previous key.) And to deal with the problem of de-duplicating both whole keys and prefixes going into a single filter, we add a new function to FilterBitsBuilder that has that extra de-duplication capability, which is relatively efficient because we only have to cache an extra 64-bit hash, not a copied key or prefix. (The API of this new function is somewhat awkward to avoid a small CPU regression in some cases.)

Also previously, there was awkward logic split between FullFilterBlockBuilder and PartitionedFilterBlockBuilder to deal with some things specific to partitioning. And confusing names like Add vs. AddKey. FullFilterBlockBuilder is much cleaner and simplified now.

The splitting of PartitionedFilterBlockBuilder::MaybeCutAFilterBlock into DecideCutAFilterBlock and CutAFilterBlock is to address what would have been a slight performance regression in some cases. The split allows for more instruction-level parallelism by reducing unnecessary control dependencies.

Pull Request resolved: #12931

Test Plan: existing tests (with some minor updates)

Also manually ported over the pre-broken regression test described in #12870 and ran it (passed).

Performance: Here we validate that an entire series of recent related PRs are a net improvement in aggregate. "Before" is with these PRs reverted: #12872 #12911 #12874 #12867 #12903 #12904. "After" includes this PR (and all of those, with base revision 16c21af). Simultaneous test script designed to maximally depend on SST construction efficiency:

```
for PF in 0 1; do for PS in 0 8; do for WK in 0 1; do [ "$PS" == "$WK" ] || (for I in `seq 1 20`; do TEST_TMPDIR=/dev/shm/rocksdb2 ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -memtablerep=vector -allow_concurrent_memtable_write=0 -bloom_bits=10 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters=$PF -prefix_size=$PS -whole_key_filtering=$WK 2>&1 | grep micros/op; done) | awk '{ t += $5; c++; print } END { print 1.0 * t / c }'; echo "Was -partition_index_and_filters=$PF -prefix_size=$PS -whole_key_filtering=$WK"; done; done; done) | tee results
```

Showing average ops/sec of "after" vs. "before":

```
-partition_index_and_filters=0 -prefix_size=0 -whole_key_filtering=1
935586 vs. 928176 (+0.79%)
-partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=0
930171 vs. 926801 (+0.36%)
-partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=1
910727 vs. 894397 (+1.8%)
-partition_index_and_filters=1 -prefix_size=0 -whole_key_filtering=1
929795 vs. 922007 (+0.84%)
-partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=0
921924 vs. 917285 (+0.51%)
-partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=1
903393 vs. 887340 (+1.8%)
```

As one would predict, the most improvement is seen in cases where we have optimized away copying the whole key.

Reviewed By: jowlyzhang

Differential Revision: D61138271

Pulled By: pdillinger

fbshipit-source-id: 427cef0b1465017b45d0a507bfa7720fa20af043
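As a rough illustration of the hash-based de-duplication idea in the summary above (cache a 64-bit hash of the previously seen key/prefix rather than a copied key), here is a small self-contained sketch. It is not the actual FilterBitsBuilder API: the class, its zero-valued sentinels, and the use of std::hash are all simplifications, and the method is only loosely modeled on the "key plus alternate" notion described above.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <string_view>
#include <vector>

// Sketch: de-duplicate consecutive repeats by caching only the previous
// 64-bit hashes, never a copied key or prefix.
class DedupingHashCollector {
 public:
  // One call carries both the whole-key hash and the alternate (prefix)
  // hash; each is de-duplicated against the previously seen value.
  void AddKeyAndAlt(uint64_t key_hash, uint64_t alt_hash) {
    if (key_hash != prev_key_hash_) {
      hashes_.push_back(key_hash);
      prev_key_hash_ = key_hash;
    }
    if (alt_hash != prev_alt_hash_) {
      hashes_.push_back(alt_hash);
      prev_alt_hash_ = alt_hash;
    }
  }
  std::size_t Count() const { return hashes_.size(); }

 private:
  std::vector<uint64_t> hashes_;
  // Zero sentinels are a simplification; real code would track "nothing
  // seen yet" explicitly to avoid mistaking a zero hash for a repeat.
  uint64_t prev_key_hash_ = 0;
  uint64_t prev_alt_hash_ = 0;
};

int main() {
  DedupingHashCollector c;
  std::hash<std::string_view> h;
  // Two keys sharing the prefix "app": the prefix hash is added only once.
  c.AddKeyAndAlt(h("apple"), h("app"));
  c.AddKeyAndAlt(h("apply"), h("app"));
  std::printf("%zu entries\n", c.Count());  // 3, not 4 (barring collisions)
}
```

This shows why the approach is cheap: detecting a repeated prefix costs one 64-bit comparison against a cached hash, instead of storing and comparing a copied key or prefix string.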