Cache fragmented range tombstones in BlockBasedTableReader #4493
Conversation
table/block_based_table_reader.cc (Outdated)

      NewDataBlockIterator<DataBlockIter>(
          rep_, read_options, rep_->range_del_handle));
  }
  auto new_tombstone_fragments = std::make_shared<FragmentedRangeTombstoneList>(
There's a race here. If two threads try to call `NewRangeTombstoneIterator`, they may both create a new `FragmentedRangeTombstoneList` and overwrite `rep_->fragmented_tombstones`. Probably need to serialize access here.
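One way to address this, sketched below under the assumption that the lazy initialization stays in `NewRangeTombstoneIterator`, is to serialize construction with `std::call_once` (or a mutex) so only one thread ever builds and publishes the list. The `Rep` layout and the `once_flag` member here are illustrative stand-ins, not the PR's actual code:

```cpp
#include <memory>
#include <mutex>

// Illustrative stand-ins for the RocksDB types of the same name.
struct FragmentedRangeTombstoneList {};

struct Rep {
  std::shared_ptr<const FragmentedRangeTombstoneList> fragmented_tombstones;
  std::once_flag tombstone_init_flag;  // assumed member, added for this sketch
};

std::shared_ptr<const FragmentedRangeTombstoneList> GetOrBuildTombstones(
    Rep* rep) {
  // Only one thread runs the builder; the others block until the shared_ptr
  // has been published, so nobody overwrites an existing list.
  std::call_once(rep->tombstone_init_flag, [rep] {
    rep->fragmented_tombstones =
        std::make_shared<const FragmentedRangeTombstoneList>(
            /* fragment the unfragmented tombstone block here */);
  });
  return rep->fragmented_tombstones;
}
```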
Force-pushed from 9a5edb5 to 60c6f3a.
Force-pushed from 944c3ee to f1e5c97.
Summary: Previously, range tombstones were accumulated from every level, which was necessary if a range tombstone in a higher level covered a key in a lower level. However, RangeDelAggregator::AddTombstones's complexity is based on the number of tombstones that are currently stored in it, which is wasteful in the Get case, where we only need to know the highest sequence number of range tombstones that cover the key from higher levels, and compute the highest covering sequence number at the current level. This change introduces this optimization, and removes the use of RangeDelAggregator from the Get path.

In the benchmark results, the following command was used to initialize the database:

```
./db_bench -db=/dev/shm/5k-rts -use_existing_db=false -benchmarks=filluniquerandom -write_buffer_size=1048576 -compression_type=lz4 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -value_size=112 -key_size=16 -block_size=4096 -level_compaction_dynamic_level_bytes=true -num=5000000 -max_background_jobs=12 -benchmark_write_rate_limit=20971520 -range_tombstone_width=100 -writes_per_range_tombstone=100 -max_num_range_tombstones=50000 -bloom_bits=8
```

...and the following command was used to measure read throughput:

```
./db_bench -db=/dev/shm/5k-rts/ -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=5000000 -reads=100000 -threads=32
```

The filluniquerandom command was only run once, and the resulting database was used to measure read performance before and after the PR. Both binaries were compiled with `DEBUG_LEVEL=0`.

Readrandom results before PR:

```
readrandom : 4.544 micros/op 220090 ops/sec; 16.9 MB/s (63103 of 100000 found)
```

Readrandom results after PR:

```
readrandom : 11.147 micros/op 89707 ops/sec; 6.9 MB/s (63103 of 100000 found)
```

So it's actually slower right now, but this PR paves the way for future optimizations (see #4493).

----

Pull Request resolved: #4449

Differential Revision: D10370575

Pulled By: abhimadan

fbshipit-source-id: 9a2e152be1ef36969055c0e9eb4beb0d96c11f4d
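To make the optimization described in that summary concrete, here is a minimal sketch of the idea (the type and function names are illustrative stand-ins, not RocksDB's actual API): during Get, each level only contributes the largest sequence number among its tombstones that cover the lookup key, and a value is visible only if its own sequence number is larger than that running maximum.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Illustrative stand-in for RocksDB's range tombstone representation.
struct RangeTombstone {
  std::string start_key;
  std::string end_key;
  uint64_t seq;
};

// Highest sequence number among this level's tombstones that cover `key`.
uint64_t MaxCoveringTombstoneSeq(
    const std::vector<RangeTombstone>& level_tombstones,
    const std::string& key) {
  uint64_t max_seq = 0;
  for (const auto& t : level_tombstones) {
    if (key >= t.start_key && key < t.end_key) {
      max_seq = std::max(max_seq, t.seq);
    }
  }
  return max_seq;
}

// During Get, only the running maximum is carried from level to level: a value
// found with sequence number `value_seq` is visible iff it is newer than every
// covering tombstone seen so far.
bool IsValueVisible(uint64_t value_seq, uint64_t max_covering_tombstone_seq) {
  return value_seq > max_covering_tombstone_seq;
}
```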
Force-pushed from 885f6d6 to 0bb0198.
abhimadan has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
This allows tombstone fragmenting to only be performed when the table is first opened, and cached for subsequent accesses.
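As a rough illustration of that caching scheme (the type and member names below are simplified stand-ins, not the PR's exact code), the fragment list is built once when the table is opened and every iterator created afterwards simply shares it:

```cpp
#include <memory>
#include <utility>

// Simplified stand-ins for the RocksDB types involved.
struct FragmentedRangeTombstoneList {};

class FragmentedRangeTombstoneIterator {
 public:
  explicit FragmentedRangeTombstoneIterator(
      std::shared_ptr<const FragmentedRangeTombstoneList> list)
      : tombstones_(std::move(list)) {}

 private:
  // Shared ownership keeps the cached list alive as long as any iterator
  // still references it.
  std::shared_ptr<const FragmentedRangeTombstoneList> tombstones_;
};

struct TableRep {
  // Populated once at table-open time; immutable afterwards.
  std::shared_ptr<const FragmentedRangeTombstoneList> fragmented_tombstones;
};

// Cheap per-call work: no re-fragmenting, just wrap the cached list.
FragmentedRangeTombstoneIterator* NewRangeTombstoneIterator(TableRep* rep) {
  if (rep->fragmented_tombstones == nullptr) {
    return nullptr;  // the table has no range tombstones
  }
  return new FragmentedRangeTombstoneIterator(rep->fragmented_tombstones);
}
```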
Force-pushed from 0bb0198 to be445f8.
@abhimadan has updated the pull request. Re-import the pull request
abhimadan has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
nice design and code, lgtm
@@ -66,7 +100,7 @@ class FragmentedRangeTombstoneIterator : public InternalIterator {
   };

   void MaybePinKey() const {
-    if (pos_ != tombstones_.end() && pinned_pos_ != pos_) {
+    if (pos_ != tombstones_->end() && pinned_pos_ != pos_) {
       current_start_key_.Set(pos_->start_key_, pos_->seq_, kTypeRangeDeletion);
       pinned_pos_ = pos_;
I forgot to ask about this on the last PR. What does `pinned_pos_` do?
`pinned_pos_` points to the key that's currently pinned. It's mainly used to avoid re-pinning a key that is already pinned.
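As a self-contained sketch of that pattern (the container and key types are simplified stand-ins; only the member names mirror the diff above), `pinned_pos_` lets `key()` skip re-copying the key whenever the iterator has not moved since the last pin:

```cpp
#include <string>
#include <vector>

// Simplified stand-in for the iterator's lazy pinning logic shown above.
class PinningIterator {
 public:
  explicit PinningIterator(const std::vector<std::string>* tombstones)
      : tombstones_(tombstones),
        pos_(tombstones_->begin()),
        pinned_pos_(tombstones_->end()) {}

  void Next() { ++pos_; }  // moving is cheap; pinning happens lazily in key()

  // Returns the current key, pinning it only if needed.
  const std::string& key() const {
    MaybePinKey();
    return current_start_key_;
  }

 private:
  void MaybePinKey() const {
    // Only re-copy the key if the position changed since the last pin;
    // pinned_pos_ remembers which element is currently pinned.
    if (pos_ != tombstones_->end() && pinned_pos_ != pos_) {
      current_start_key_ = *pos_;
      pinned_pos_ = pos_;
    }
  }

  const std::vector<std::string>* tombstones_;
  std::vector<std::string>::const_iterator pos_;
  // Caches are mutable so key() can stay const, mirroring the real iterator.
  mutable std::vector<std::string>::const_iterator pinned_pos_;
  mutable std::string current_start_key_;
};
```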
table/block_based_table_reader.h (Outdated)

@@ -115,6 +116,9 @@ class BlockBasedTable : public TableReader {
   InternalIterator* NewRangeTombstoneIterator(
       const ReadOptions& read_options) override;

+  InternalIterator* NewUnfragmentedRangeTombstoneIterator(
should it be private?
Yeah it should be; thanks for catching that!
@abhimadan has updated the pull request. Re-import the pull request
abhimadan has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
This allows tombstone fragmenting to only be performed when the table is opened, and cached for subsequent accesses.
On the same DB used in #4449, running `readrandom` results in the following:

Now that Get performance in the presence of range tombstones is reasonable, I also compared the performance between a DB with range tombstones, "expanded" range tombstones (several point tombstones that cover the same keys the equivalent range tombstone would cover, a common workaround for DeleteRange), and no range tombstones. The created DBs had 5 million keys each, and DeleteRange was called at regular intervals (depending on the total number of range tombstones being written) after 4.5 million Puts. The table below summarizes the results of a `readwhilewriting` benchmark (in order to provide somewhat more realistic results):

After a large enough number of range tombstones is written, range tombstone Gets can become faster than reading from an equivalent DB with several point tombstones.
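For reference, a readwhilewriting run of the kind described above could be launched roughly like this; the command is a hedged reconstruction assembled from the flags quoted in the #4449 commands earlier, with a placeholder database path, and is not the exact invocation used for the results discussed here:

```
./db_bench -db=/path/to/range-tombstone-db -use_existing_db=true \
    -benchmarks=readwhilewriting -num=5000000 -reads=100000 -threads=32 \
    -benchmark_write_rate_limit=20971520
```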
Test Plan: make check