reduce DeleteRange's negative impact on read by fragmenting range tombstones in the write path #10308
base: main
Conversation
Hi @liyichao! Thank you for your pull request and welcome to our community.

Action Required: In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process: In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with
If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!
Force-pushed from 879571a to 6b5c066.
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!
Force-pushed from 1c862b3 to 1d1a1aa.
- memtable format is updated to v2 to reduce CPU usage on Get.
- FragmentedRangeTombstoneList is adapted to the new memtable format so we can iterate range tombstones as before.
- sstable format is the same as before.
Force-pushed from 1d1a1aa to 6739d35.
@ajkr please help review this. This PR continues the work from #5032 and implements it according to https://docs.google.com/document/d/1NfjELi8Zz27TUZ3HoxujPRLKe6ifx_Jg9cAEF6WtZNg/edit#heading=h.2irdxocmr0eu . One failed test is due to a failure to install clang, and the other three failed tests all seem related to InsertKeyWithHint; I do not know why.
The following unit test will fail with the current implementation:
TEST_F(DBRangeDelTest, NewTest) {
ASSERT_OK(db_->Put(WriteOptions(), "b", "b"));
ASSERT_TRUE(
db_->DeleteRange(WriteOptions(), db_->DefaultColumnFamily(), "a", "d").ok());
auto snapshot = db_->GetSnapshot();
ASSERT_TRUE(db_->DeleteRange(WriteOptions(), db_->DefaultColumnFamily(), "b", "c").ok());
ASSERT_EQ("NOT_FOUND", Get("b", snapshot));
db_->ReleaseSnapshot(snapshot);
}
I'm still a little fuzzy on the details of how memtable v2 should work. Taking this unit test as an example, I think one possible way to handle it is:
Notation: key@fragment seqno : value@tombstone seqno (the seqno in the key is the fragment seqno; the seqno in the value is the tombstone seqno).
DeleteRange('a', 'd'):
insert a@2 : d@2
DeleteRange('b', 'c'):
insert c@3 : d@2
insert b@3 : c@3
insert a@3 : b@2
Then when computing the max covering seqno for 'b' at the snapshot with seqno 2, we create a DB iter on this snapshot and do a SeekForPrev('b'), which will give us a@2:d@2 (all other memtable entries have fragment sequence number > 2 and are skipped by the DB iter).
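To make this concrete, here is a small self-contained toy model of the fragment layout and the SeekForPrev-style lookup sketched above. This is my own illustration, not RocksDB code: the std::map stands in for the memtable skiplist, and the names and the simplified lookup logic are assumptions made for the example, not the PR's implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Value of a fragment entry: end key plus the seqno of the covering tombstone.
struct FragmentValue {
  std::string end_key;       // exclusive end of the fragment
  uint64_t tombstone_seqno;  // what this fragment covers
};

// Key of a fragment entry, ordered like memtable internal keys: start key
// ascending, then fragment seqno descending (newer entries first).
struct FragmentKey {
  std::string start_key;
  uint64_t fragment_seqno;
  bool operator<(const FragmentKey& o) const {
    if (start_key != o.start_key) return start_key < o.start_key;
    return fragment_seqno > o.fragment_seqno;
  }
};

using FragmentMap = std::map<FragmentKey, FragmentValue>;

// Max seqno of a tombstone covering `lookup_key` visible at `snapshot_seqno`.
uint64_t MaxCoveringTombstoneSeqnum(const FragmentMap& frags,
                                    const std::string& lookup_key,
                                    uint64_t snapshot_seqno) {
  // Walk backwards from the first entry past `lookup_key` (a SeekForPrev
  // analog), skipping entries that are not visible at the snapshot; the first
  // visible entry at or before the key decides coverage in this toy model.
  for (auto it = frags.upper_bound({lookup_key, 0}); it != frags.begin();) {
    --it;
    if (it->first.fragment_seqno > snapshot_seqno) {
      continue;  // a DB iter at this snapshot would skip this entry
    }
    if (it->first.start_key <= lookup_key && lookup_key < it->second.end_key) {
      return it->second.tombstone_seqno;  // fragment covers the key
    }
    return 0;  // nearest visible fragment does not cover the key
  }
  return 0;  // no fragment starts at or before the key
}

int main() {
  FragmentMap frags;
  // DeleteRange('a', 'd') at seqno 2:
  frags[{"a", 2}] = {"d", 2};
  // DeleteRange('b', 'c') at seqno 3 re-fragments the overlapping span:
  frags[{"c", 3}] = {"d", 2};
  frags[{"b", 3}] = {"c", 3};
  frags[{"a", 3}] = {"b", 2};

  // At the snapshot taken at seqno 2, the lookup for 'b' lands on a@2 : d@2
  // and returns covering tombstone seqno 2, so Get("b", snapshot) is NOT_FOUND.
  assert(MaxCoveringTombstoneSeqnum(frags, "b", 2) == 2);
  // Reading at seqno 3, b@3 : c@3 covers 'b' with tombstone seqno 3.
  assert(MaxCoveringTombstoneSeqnum(frags, "b", 3) == 3);
  return 0;
}
```

Under the re-fragmentation shown in main(), a single step back from the seek position is enough to decide coverage for this example, which is the property the comment above relies on.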
db/memtable.cc (outdated)
to_insert.emplace(key, value, tombstone_seq);
to_insert.emplace(key, value, s);
These two new fragments will have the same key and sequence number, which our memtable does not support. You can check in the insert(table.get(), h) call below that false is returned for this fragment.
Oh, this is the exact reason why NewTest fails. I expected b@2:c@2 and b@3:c@3 to be inserted, so that when searching, b@2:c@2 would be found. I think your approach to fixing NewTest works for this case, but consider this:
insert a@2 : d@2
insert b@5 : c@3
When finding b@3, b@5:c@3 will be ignored, but it should not be, because we cannot use the tombstone seqno to filter; we can only use the fragment seqno. Correct me if I am wrong.
Let me think about how to handle this.
I have come up with a method: encode multiple sequence numbers into an entry with the same start and end. What do you think about this?
By the way, this case corresponds to the "DeleteRange("d", "f") is called at seqnum 200" case in the doc. But with the method in the doc, if we have to find e@101, we have to iterate over all the previous entries once we have found d@200:f@200, which is costly. What this PR's method ensures is that once you have found the start key d@kMaxSequenceNumber, you only have to iterate forward, and you can stop once you reach a start key bigger than the lookup key.
Sorry, I'm not following where b@5:c@3 comes from. For b@5:c@3 to be inserted, I think there should be some range tombstone inserted previously that covers 'b' at 3.
Btw, I confused the two sequence numbers in the previous comment (now corrected). The fragment seqno is the seqno in the key, and the tombstone seqno is in the value and tells us what the range tombstone covers.
> encode multiple sequence numbers into an entry with the same start and end

Could you elaborate on the new method?

> with the method in the doc, if we have to find e@101, we have to iterate over all the previous entries once we have found d@200:f@200

My method above is also a little different from the method in the doc, and it should only require one seek on the DB iter (although the seek itself might iterate over some entries).

> what this PR's method ensures is that once you have found the start key d@kMaxSequenceNumber, you only have to iterate forward

I thought the test above shows that this does not work; maybe I need to understand the new method first.
The new method: in FormatEntry, when encoding the value, encode val_size, then encode a count, then memcpy the value, then encode count sequence numbers. This way, we avoid the insert failure you mentioned. When searching, just decode all the sequence numbers and iterate over them.
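A rough standalone sketch of that value layout, assuming the description above: [val_size][count][value bytes][count seqnos]. This is illustrative only, not the PR's FormatEntry; fixed-width integers are used here for brevity where the real memtable format uses varint encoding, and the helper names are made up.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

namespace {

void PutFixed32(std::string* dst, uint32_t v) {
  dst->append(reinterpret_cast<const char*>(&v), sizeof(v));
}
void PutFixed64(std::string* dst, uint64_t v) {
  dst->append(reinterpret_cast<const char*>(&v), sizeof(v));
}

// `value` is the fragment's end key; `seqnos` are all tombstones that share
// this [start, end) fragment, newest first.
std::string EncodeTombstoneValue(const std::string& value,
                                 const std::vector<uint64_t>& seqnos) {
  std::string out;
  PutFixed32(&out, static_cast<uint32_t>(value.size()));   // val_size
  PutFixed32(&out, static_cast<uint32_t>(seqnos.size()));  // count
  out.append(value);                                       // value bytes
  for (uint64_t s : seqnos) {
    PutFixed64(&out, s);                                   // count seqnos
  }
  return out;
}

// Assumes `in` was produced by EncodeTombstoneValue in the same process
// (matching endianness), which is all a memtable needs.
void DecodeTombstoneValue(const std::string& in, std::string* value,
                          std::vector<uint64_t>* seqnos) {
  uint32_t val_size = 0, count = 0;
  std::memcpy(&val_size, in.data(), sizeof(val_size));
  std::memcpy(&count, in.data() + sizeof(val_size), sizeof(count));
  const char* p = in.data() + sizeof(val_size) + sizeof(count);
  value->assign(p, val_size);
  seqnos->resize(count);
  if (count > 0) {
    std::memcpy(seqnos->data(), p + val_size, count * sizeof(uint64_t));
  }
}

}  // namespace

int main() {
  // Fragment [b, c) covered by range tombstones at seqnos 3 and 2.
  std::string encoded = EncodeTombstoneValue("c", {3, 2});
  std::string end_key;
  std::vector<uint64_t> seqnos;
  DecodeTombstoneValue(encoded, &end_key, &seqnos);
  assert(end_key == "c" && seqnos.size() == 2 && seqnos[0] == 3);
  return 0;
}
```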
The idea looks good to me! Some thoughts to consider:
- During insertion, for each stream of tombstones with the same starting key, we only need to consider the first one and move on to the next tombstone with a larger starting key. For example, consider this sequence of range deletions: a:f@1, a:e@2, a:b@3. During a:b@3, it only needs to consider a@2:e@[2, 1], can then skip a@1:f@1, and considers e@2:f@1 next.
- In MaxCoveringTombstoneSeqnum, the loop may not be necessary anymore. After two seeks, we should land at the tombstone covering the lookup key, which gives us a full list of sequence numbers (see the sketch after this list).
- I think we essentially have a representation in the memtable that is much closer to FragmentedRangeTombstoneList: each range tombstone has its list of sequence numbers. The construction of FragmentedRangeTombstoneList can potentially be optimized.
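To illustrate the second point, a minimal sketch (my own, with illustrative names; it assumes the covering fragment has already been located and that its seqnos are stored newest-first): picking the newest visible tombstone then becomes a single binary search over that list rather than a loop over memtable entries.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// `seqnos` holds every tombstone covering one fragment, sorted descending
// (newest first). Returns the newest seqno visible at `read_seqno`, or 0.
uint64_t MaxVisibleTombstoneSeqno(const std::vector<uint64_t>& seqnos,
                                  uint64_t read_seqno) {
  // First element <= read_seqno in a descending list.
  auto it = std::lower_bound(
      seqnos.begin(), seqnos.end(), read_seqno,
      [](uint64_t seqno, uint64_t target) { return seqno > target; });
  return it == seqnos.end() ? 0 : *it;
}

int main() {
  std::vector<uint64_t> seqnos = {9, 7, 3};          // tombstones on this fragment
  assert(MaxVisibleTombstoneSeqno(seqnos, 8) == 7);  // snapshot at 8 sees seqno 7
  assert(MaxVisibleTombstoneSeqno(seqnos, 2) == 0);  // nothing visible at 2
  return 0;
}
```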
I agree with all of them. I will do the first two first and maybe leave the third for future optimization.
Sounds good! Btw I have another PR out related to range deletion optimization too #10380. It'd be interesting to run some benchmark to see the perf improvement.
Yeah, I will run the same test as yours for pre-PR / post-PR / your PR / scan-and-del. I have these questions:
- Where can I get the scan-and-del source for testing?
- In your PR, fragmented_range_tombstone_list_ seems not to be protected by a lock; when Get and DeleteRange are concurrent, are there problems?
By the way, can you help me look at the failed InsertWithHint test? It reports a use-after-free, but I have no idea where the problem is.
- I used the --expand_range_tombstones=true flag in db_bench for the scan-and-del case.
- I think you are right: reading and writing the same shared_ptr object might not be thread-safe. I've updated the PR (a minimal illustration follows).
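A minimal sketch of the concern in the second bullet and one conventional fix. The type and class names here are illustrative, not what #10380 actually does: concurrently copying and replacing the same std::shared_ptr object is a data race unless synchronized, e.g. with a mutex as below (std::atomic_load/std::atomic_store on the shared_ptr, or C++20's std::atomic<std::shared_ptr>, are alternatives).

```cpp
#include <memory>
#include <mutex>
#include <vector>

// Stand-in for the cached fragmented range tombstone list.
struct FragmentedTombstoneCache {
  std::vector<int> fragments;
};

class TombstoneCacheHolder {
 public:
  // Readers copy the pointer under the lock, then use the copy freely;
  // the copy keeps the old list alive even if a writer replaces it.
  std::shared_ptr<const FragmentedTombstoneCache> Get() const {
    std::lock_guard<std::mutex> l(mu_);
    return cache_;
  }
  // The DeleteRange path swaps in a rebuilt list under the same lock.
  void Set(std::shared_ptr<const FragmentedTombstoneCache> c) {
    std::lock_guard<std::mutex> l(mu_);
    cache_ = std::move(c);
  }

 private:
  mutable std::mutex mu_;
  std::shared_ptr<const FragmentedTombstoneCache> cache_;
};

int main() {
  TombstoneCacheHolder holder;
  holder.Set(std::make_shared<const FragmentedTombstoneCache>());
  auto snapshot = holder.Get();  // safe even if Set() runs concurrently
  return snapshot ? 0 : 1;
}
```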
Hi @liyichao, thanks for contributing and picking up the work on delete range optimization. I added some initial thoughts and comments as I'm still going through the PR.
Force-pushed from 3ae09d8 to a72ebc4.
Added some benchmarks as mentioned in #10380:
Force-pushed from 49896ea to 3a3d938.
Added some comments, mostly regarding tombstone insertion. Great work on updating to memtable "V3" now :)
     unfragmented_tombstones->Next()) {
  total_tombstone_payload_bytes_ += unfragmented_tombstones->key().size() +
                                    unfragmented_tombstones->value().size();
     unfragmented_tombstones->Next(), num_unfragmented_tombstones_++) {
I think it's okay to use value size as the raw payload bytes here, which keeps the code cleaner and requires fewer changes.
db_flush_test will fail:
Expected equality of these values:
mem_data_bytes
Which is: 8205814
EXPECTED_MEMTABLE_PAYLOAD_BYTES_AT_FLUSH
Which is: 194050
Is the difference here supposed to be how much space the sequence numbers take? That number seems quite large, a lot more than the key + value size.
Yes, it is the sequence number space.
Force-pushed from 46bcd5d to d0b8610.
@cbi42 any update on this?
-  if (!last_start_key.empty() &&
-      user_comparator->Compare(last_start_key, tombstone_start_key) == 0) {
+  if (!last_key.empty() &&
+      user_comparator->Compare(tombstone_start_key, last_key) <= 0) {
Just curious: why is this change from comparing start keys needed?
Because last_key is now set to the last tombstone_end_key, and we only have to look at entries with tombstone_start_key >= last_key.
In our insert algorithm, we have ensured that if we insert an entry B with tombstone_start_key < A.tombstone_end_key, we will insert an entry [B.tombstone_start_key, A.tombstone_end_key) and an entry [A.tombstone_end_key, xxx).
When we have processed the entry [B.tombstone_start_key, A.tombstone_end_key), assuming A.tombstone_start_key > B.tombstone_start_key, we can ignore [A.tombstone_start_key, A.tombstone_end_key) and just process [A.tombstone_end_key, xxx).
@cbi42 updated according to the review.
@cbi42 any update on this?
Hi @liyichao, sorry for the late reply. We plan to go with the caching approach #10547 for now, as it provides a good enough (hopefully) performance improvement and is easier to reason about correctness. I think the approach in this PR is better in terms of concurrent operations and worst-case performance if we consider both read and write operations, so we can revisit it when needed.
Oh, I see. #10547 seems reasonable. It reuses the FragmentedRangeTombstoneList class, which is needed anyway when constructing from an SST, so it is simpler. The only risk I can think of is when the request stream is delete_range_1, read_2, delete_range_2, read_3..., or when multiple reads come in before the first read finishes computing the FragmentedRangeTombstoneList, which seems unlikely. Besides, the user may need to flush the memtable when the number of range deletions reaches a certain threshold, or the first read's latency may be unacceptable, which adds complexity for the user. Maybe more use cases will reveal whether this PR is needed. Thanks anyway for the review.
Summary: Each read from the memtable used to read and fragment all the range tombstones into a `FragmentedRangeTombstoneList`. #10380 improved this inefficiency by caching a `FragmentedRangeTombstoneList` with each immutable memtable. This PR extends the caching to mutable memtables. The fragmented range tombstone list can be constructed in either the read path (this PR) or the write path (#10584). With both implementations, each `DeleteRange()` will invalidate the cache, and the difference is where the cache is re-constructed. `CoreLocalArray` is used to store the cache with each memtable so that multi-threaded reads can be efficient. More specifically, each core will have a shared_ptr to a shared_ptr pointing to the current cache. Each read thread will only update the reference count in its core-local shared_ptr, and this is only needed when reading from mutable memtables.

The choice between write path and read path is not an easy one: both are improvements compared to no caching in the current implementation, but they favor different operations and could cause a regression in the other operation (read vs. write). The write path caching in #10584 leads to a cleaner implementation, but I chose the read path caching here to avoid a significant regression in write performance when there is a considerable number of range tombstones in a single memtable (the numbers from the benchmark below suggest >1000 with concurrent writers). Note that even though the fragmented range tombstone list is only constructed in `DeleteRange()` operations, it could block other writes from proceeding, and hence affects overall write performance.

Pull Request resolved: #10547

Test Plan:
- TestGet() in the stress test is updated in #10553 to compare the Get() result against the expected state: `./db_stress_branch --readpercent=57 --prefixpercent=4 --writepercent=25 -delpercent=5 --iterpercent=5 --delrangepercent=4`
- Perf benchmark: tested read and write performance where a memtable has 0, 1, 10, 100 and 1000 range tombstones.
```
./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=200 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=200000 --reads=100000 --disable_auto_compactions --max_num_range_tombstones=1000
```
Write perf regressed since the cost of constructing the fragmented range tombstone list is shifted from every read to a single write. 6cbe5d8 is included in the last column as a reference to see the performance impact on multi-threaded reads if `CoreLocalArray` is not used.

micros/op averaged over 5 runs; the first 4 columns are for fillrandom, the last 4 columns are for readrandom.
| # range tombstones | fillrandom main | write path caching | read path caching | memtable V3 (#10308) | readrandom main | write path caching | read path caching | memtable V3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 6.35 | 6.15 | 5.82 | 6.12 | 2.24 | 2.26 | 2.03 | 2.07 |
| 1 | 5.99 | 5.88 | 5.77 | 6.28 | 2.65 | 2.27 | 2.24 | 2.5 |
| 10 | 6.15 | 6.02 | 5.92 | 5.95 | 5.15 | 2.61 | 2.31 | 2.53 |
| 100 | 5.95 | 5.78 | 5.88 | 6.23 | 28.31 | 2.34 | 2.45 | 2.94 |
| 100, 25 threads | 52.01 | 45.85 | 46.18 | 47.52 | 35.97 | 3.34 | 3.34 | 3.56 |
| 1000 | 6.0 | 7.07 | 5.98 | 6.08 | 333.18 | 2.86 | 2.7 | 3.6 |
| 1000, 25 threads | 52.6 | 148.86 | 79.06 | 45.52 | 473.49 | 3.66 | 3.48 | 4.38 |

- Benchmark performance of `readwhilewriting` from #10552, 100 range tombstones are written:
```
./db_bench --benchmarks=readwhilewriting --writes_per_range_tombstone=500 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=100000 --reads=500000 --disable_auto_compactions --max_num_range_tombstones=10000 --finish_after_writes
```
readrandom micros/op:

| | main | write path caching | read path caching | memtable V3 |
| --- | --- | --- | --- | --- |
| single thread | 48.28 | 1.55 | 1.52 | 1.96 |
| 25 threads | 64.3 | 2.55 | 2.67 | 2.64 |

Reviewed By: ajkr

Differential Revision: D38895410

Pulled By: cbi42

fbshipit-source-id: 930bfc309dd1b2f4e8e9042f5126785bba577559
@cbi42, in one of our environments the read latency continued to increase after we switched to DeleteRange (we confirmed this with RocksDB's perf context), so we have now reverted back to the scan-and-delete approach.
Hi @liyichao, thanks for reporting and for trying out DeleteRange(). I guess this is the "risk" use case you described in a previous comment. I assume the latency increase is caused by too many range tombstones accumulating in a memtable, so each fragmented range tombstone list reconstruction becomes increasingly expensive. Curious how many range tombstones are accumulated in the memtable? If write perf is not a concern, then …
I cannot find the number of range tombstones now, as we have already reverted. Scan-and-delete introduces a lot of latency in our I/O thread, so our write latency increases. That is bad because DeleteRange is our normal case rather than a rare case, and it may become a big problem as more load comes in. If anyone uses RocksDB as a backend for storing data with a model like Cassandra's (where a row has a partition key and other key columns), then a delete by partition key becomes a range delete, and this problem will appear.