reduce DeleteRange's negative impact on read by fragmenting range tombstones in the write path #10308
base: main
Conversation
Hi @liyichao! Thank you for your pull request and welcome to our community.

Action Required: In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process: In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with
If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!
Force-pushed from 879571a to 6b5c066.
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!
Force-pushed from 1c862b3 to 1d1a1aa.
- memtable format is updated to v2 to reduce CPU usage on Get.
- FragmentedRangeTombstoneList is adapted to the new memtable format so we can iterate range tombstones as before.
- sstable format is the same as before.
Force-pushed from 1d1a1aa to 6739d35.
@ajkr please help review this. This PR continues the work from #5032 and implements it according to https://docs.google.com/document/d/1NfjELi8Zz27TUZ3HoxujPRLKe6ifx_Jg9cAEF6WtZNg/edit#heading=h.2irdxocmr0eu . One failed test is due to a failure to install clang, and the other three failed tests all seem related to InsertKeyWithHint; I do not know why.
The following unit test will fail with the current implementation:
TEST_F(DBRangeDelTest, NewTest) {
ASSERT_OK(db_->Put(WriteOptions(), "b", "b"));
ASSERT_TRUE(
db_->DeleteRange(WriteOptions(), db_->DefaultColumnFamily(), "a", "d").ok());
auto snapshot = db_->GetSnapshot();
ASSERT_TRUE(db_->DeleteRange(WriteOptions(), db_->DefaultColumnFamily(), "b", "c").ok());
ASSERT_EQ("NOT_FOUND", Get("b", snapshot));
db_->ReleaseSnapshot(snapshot);
}
I'm still a little fuzzy on the details of how memtable v2 should work. Taking this unit test as an example, I think one possible way to handle it is:
Notation: key@fragment seqno : value@tombstone seqno (the seqno in the key is the fragment seqno; the seqno in the value is the tombstone seqno).
DeleteRange('a', 'd'):
insert a@2 : d@2
DeleteRange('b', 'c'):
insert c@3 : d@2
insert b@3 : c@3
insert a@3 : b@2
Then when computing the max covering seqno for 'b' at the snapshot with seqno 2, we create a DB iter on this snapshot and do a SeekForPrev('b'), which will give us a@2:d@2 (all other memtable entries have fragment sequence number > 2 and are skipped by the DB iter).
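To make this concrete, here is a small self-contained toy model of the fragment layout and the SeekForPrev-style lookup sketched above. This is my own illustration, not RocksDB code: the std::map stands in for the memtable skiplist, and the names and the simplified lookup logic are assumptions made for the example, not the PR's implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Value of a fragment entry: end key plus the seqno of the covering tombstone.
struct FragmentValue {
  std::string end_key;       // exclusive end of the fragment
  uint64_t tombstone_seqno;  // what this fragment covers
};

// Key of a fragment entry, ordered like memtable internal keys: start key
// ascending, then fragment seqno descending (newer entries first).
struct FragmentKey {
  std::string start_key;
  uint64_t fragment_seqno;
  bool operator<(const FragmentKey& o) const {
    if (start_key != o.start_key) return start_key < o.start_key;
    return fragment_seqno > o.fragment_seqno;
  }
};

using FragmentMap = std::map<FragmentKey, FragmentValue>;

// Max seqno of a tombstone covering `lookup_key` visible at `snapshot_seqno`.
uint64_t MaxCoveringTombstoneSeqnum(const FragmentMap& frags,
                                    const std::string& lookup_key,
                                    uint64_t snapshot_seqno) {
  // Walk backwards from the first entry past `lookup_key` (a SeekForPrev
  // analog), skipping entries that are not visible at the snapshot; the first
  // visible entry at or before the key decides coverage in this toy model.
  for (auto it = frags.upper_bound({lookup_key, 0}); it != frags.begin();) {
    --it;
    if (it->first.fragment_seqno > snapshot_seqno) {
      continue;  // a DB iter at this snapshot would skip this entry
    }
    if (it->first.start_key <= lookup_key && lookup_key < it->second.end_key) {
      return it->second.tombstone_seqno;  // fragment covers the key
    }
    return 0;  // nearest visible fragment does not cover the key
  }
  return 0;  // no fragment starts at or before the key
}

int main() {
  FragmentMap frags;
  // DeleteRange('a', 'd') at seqno 2:
  frags[{"a", 2}] = {"d", 2};
  // DeleteRange('b', 'c') at seqno 3 re-fragments the overlapping span:
  frags[{"c", 3}] = {"d", 2};
  frags[{"b", 3}] = {"c", 3};
  frags[{"a", 3}] = {"b", 2};

  // At the snapshot taken at seqno 2, the lookup for 'b' lands on a@2 : d@2
  // and returns covering tombstone seqno 2, so Get("b", snapshot) is NOT_FOUND.
  assert(MaxCoveringTombstoneSeqnum(frags, "b", 2) == 2);
  // Reading at seqno 3, b@3 : c@3 covers 'b' with tombstone seqno 3.
  assert(MaxCoveringTombstoneSeqnum(frags, "b", 3) == 3);
  return 0;
}
```

Under the re-fragmentation shown in main(), a single step back from the seek position is enough to decide coverage for this example, which is the property the comment above relies on.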
db/memtable.cc (outdated)
to_insert.emplace(key, value, tombstone_seq);
to_insert.emplace(key, value, s);
These two new fragments will have the same key and sequence number, which our memtable does not support. You can check in the insert(table.get(), h) call below that false is returned for this fragment.
Oh, this is the exact reason why NewTest fails. I expected b@2:c@2 and b@3:c@3 to be inserted, so that when searching, b@2:c@2 would be found. I think your approach to fixing NewTest works for this case, but consider this:
insert a@2 : d@2
insert b@5 : c@3
When finding b@3, b@5:c@3 will be ignored, but it should not be, because we cannot use the tombstone seqno to filter; we can only use the fragment seqno. Correct me if I am wrong.
Let me think about how to handle this.
I have come up with a method: encode multiple sequence numbers into an entry with the same start and end. What do you think about this?
By the way, this case corresponds to the "DeleteRange("d", "f") is called at seqnum 200" case in the doc. But with the method in the doc, if we have to find e@101, we have to iterate over all the previous entries once we have found d@200:f@200, which is costly. What this PR's method ensures is that once you have found the start key d@kMaxSequenceNumber, you only have to iterate forward, and you can stop once you reach a start key bigger than the lookup key.
Sorry, I'm not following where b@5:c@3 comes from. For b@5:c@3 to be inserted, I think there should be some range tombstone inserted previously that covers 'b' at 3.
Btw, I confused the two sequence numbers in the previous comment (now corrected). The fragment seqno is the seqno in the key, and the tombstone seqno is in the value and tells us what the range tombstone covers.
> encode multiple sequence numbers into an entry with the same start and end

Could you elaborate on the new method?

> with the method in the doc, if we have to find e@101, we have to iterate over all the previous entries once we have found d@200:f@200

My method above is also a little different from the method in the doc, and it should only require one seek on the DB iter (although the seek itself might iterate over some entries).

> what this PR's method ensures is that once you have found the start key d@kMaxSequenceNumber, you only have to iterate forward

I thought the test above shows that this does not work; maybe I need to understand the new method first.
The new method: in FormatEntry, when encoding the value, encode val_size, then encode a count, then memcpy the value, then encode count sequence numbers. This way, we avoid the insert failure you mentioned. When searching, just decode all the sequence numbers and iterate over them.
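A rough standalone sketch of that value layout, assuming the description above: [val_size][count][value bytes][count seqnos]. This is illustrative only, not the PR's FormatEntry; fixed-width integers are used here for brevity where the real memtable format uses varint encoding, and the helper names are made up.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

namespace {

void PutFixed32(std::string* dst, uint32_t v) {
  dst->append(reinterpret_cast<const char*>(&v), sizeof(v));
}
void PutFixed64(std::string* dst, uint64_t v) {
  dst->append(reinterpret_cast<const char*>(&v), sizeof(v));
}

// `value` is the fragment's end key; `seqnos` are all tombstones that share
// this [start, end) fragment, newest first.
std::string EncodeTombstoneValue(const std::string& value,
                                 const std::vector<uint64_t>& seqnos) {
  std::string out;
  PutFixed32(&out, static_cast<uint32_t>(value.size()));   // val_size
  PutFixed32(&out, static_cast<uint32_t>(seqnos.size()));  // count
  out.append(value);                                       // value bytes
  for (uint64_t s : seqnos) {
    PutFixed64(&out, s);                                   // count seqnos
  }
  return out;
}

// Assumes `in` was produced by EncodeTombstoneValue in the same process
// (matching endianness), which is all a memtable needs.
void DecodeTombstoneValue(const std::string& in, std::string* value,
                          std::vector<uint64_t>* seqnos) {
  uint32_t val_size = 0, count = 0;
  std::memcpy(&val_size, in.data(), sizeof(val_size));
  std::memcpy(&count, in.data() + sizeof(val_size), sizeof(count));
  const char* p = in.data() + sizeof(val_size) + sizeof(count);
  value->assign(p, val_size);
  seqnos->resize(count);
  if (count > 0) {
    std::memcpy(seqnos->data(), p + val_size, count * sizeof(uint64_t));
  }
}

}  // namespace

int main() {
  // Fragment [b, c) covered by range tombstones at seqnos 3 and 2.
  std::string encoded = EncodeTombstoneValue("c", {3, 2});
  std::string end_key;
  std::vector<uint64_t> seqnos;
  DecodeTombstoneValue(encoded, &end_key, &seqnos);
  assert(end_key == "c" && seqnos.size() == 2 && seqnos[0] == 3);
  return 0;
}
```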
The idea looks good to me! Some thoughts to consider:
- During insertion, for each stream of tombstones with the same starting key, we only need to consider the first one and move on to the next tombstone with a larger starting key. For example, consider this sequence of range deletions: a:f@1, a:e@2, a:b@3. During a:b@3, it only needs to consider a@2:e@[2, 1], can then skip a@1:f@1, and considers e@2:f@1 next.
- In MaxCoveringTombstoneSeqnum, the loop may not be necessary anymore. After two seeks, we should land at the tombstone covering the lookup key, which gives us a full list of sequence numbers (see the sketch after this list).
- I think we essentially have a representation in the memtable that is much closer to FragmentedRangeTombstoneList: each range tombstone has its list of sequence numbers. The construction of FragmentedRangeTombstoneList can potentially be optimized.
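To illustrate the second point, a minimal sketch (my own, with illustrative names; it assumes the covering fragment has already been located and that its seqnos are stored newest-first): picking the newest visible tombstone then becomes a single binary search over that list rather than a loop over memtable entries.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// `seqnos` holds every tombstone covering one fragment, sorted descending
// (newest first). Returns the newest seqno visible at `read_seqno`, or 0.
uint64_t MaxVisibleTombstoneSeqno(const std::vector<uint64_t>& seqnos,
                                  uint64_t read_seqno) {
  // First element <= read_seqno in a descending list.
  auto it = std::lower_bound(
      seqnos.begin(), seqnos.end(), read_seqno,
      [](uint64_t seqno, uint64_t target) { return seqno > target; });
  return it == seqnos.end() ? 0 : *it;
}

int main() {
  std::vector<uint64_t> seqnos = {9, 7, 3};          // tombstones on this fragment
  assert(MaxVisibleTombstoneSeqno(seqnos, 8) == 7);  // snapshot at 8 sees seqno 7
  assert(MaxVisibleTombstoneSeqno(seqnos, 2) == 0);  // nothing visible at 2
  return 0;
}
```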
I agree with all of them. I will do the first two first and maybe leave the third for future optimization.
Sounds good! Btw I have another PR out related to range deletion optimization too #10380. It'd be interesting to run some benchmark to see the perf improvement.
Yeah, I will run the same test as yours for pre-PR / post-PR / your PR / scan-and-del. I have these questions:
- Where can I get the scan-and-del source for testing?
- In your PR, fragmented_range_tombstone_list_ seems not to be protected by a lock; when Get and DeleteRange are concurrent, are there problems?
By the way, can you help me look at the failed InsertWithHint test? It reports a use-after-free, but I have no idea where the problem is.
- I used the --expand_range_tombstones=true flag in db_bench for the scan-and-del case.
- I think you are right: reading and writing the same shared_ptr object might not be thread-safe. I've updated the PR (a minimal illustration follows).
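A minimal sketch of the concern in the second bullet and one conventional fix. The type and class names here are illustrative, not what #10380 actually does: concurrently copying and replacing the same std::shared_ptr object is a data race unless synchronized, e.g. with a mutex as below (std::atomic_load/std::atomic_store on the shared_ptr, or C++20's std::atomic<std::shared_ptr>, are alternatives).

```cpp
#include <memory>
#include <mutex>
#include <vector>

// Stand-in for the cached fragmented range tombstone list.
struct FragmentedTombstoneCache {
  std::vector<int> fragments;
};

class TombstoneCacheHolder {
 public:
  // Readers copy the pointer under the lock, then use the copy freely;
  // the copy keeps the old list alive even if a writer replaces it.
  std::shared_ptr<const FragmentedTombstoneCache> Get() const {
    std::lock_guard<std::mutex> l(mu_);
    return cache_;
  }
  // The DeleteRange path swaps in a rebuilt list under the same lock.
  void Set(std::shared_ptr<const FragmentedTombstoneCache> c) {
    std::lock_guard<std::mutex> l(mu_);
    cache_ = std::move(c);
  }

 private:
  mutable std::mutex mu_;
  std::shared_ptr<const FragmentedTombstoneCache> cache_;
};

int main() {
  TombstoneCacheHolder holder;
  holder.Set(std::make_shared<const FragmentedTombstoneCache>());
  auto snapshot = holder.Get();  // safe even if Set() runs concurrently
  return snapshot ? 0 : 1;
}
```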
Hi @liyichao, thanks for contributing and picking up the work on delete range optimization. I added some initial thoughts and comments as I'm still going through the PR.
Force-pushed from 3ae09d8 to a72ebc4.
Added some benchmarks as mentioned in #10380:
Force-pushed from 49896ea to 3a3d938.
Added some comments, mostly regarding tombstone insertion. Great work on updating to memtable "V3" now :)
     unfragmented_tombstones->Next()) {
  total_tombstone_payload_bytes_ += unfragmented_tombstones->key().size() +
                                    unfragmented_tombstones->value().size();
     unfragmented_tombstones->Next(), num_unfragmented_tombstones_++) {
I think it's okay to use value size as the raw payload bytes here, which keeps the code cleaner and requires fewer changes.
db_flush_test will fail:
Expected equality of these values:
mem_data_bytes
Which is: 8205814
EXPECTED_MEMTABLE_PAYLOAD_BYTES_AT_FLUSH
Which is: 194050
Is the difference here supposed to be how much space the sequence numbers take? That number seems quite large, a lot more than the key + value size.
Yes, it is the sequence number space.
Force-pushed from 46bcd5d to d0b8610.
@cbi42 any update on this?
-  if (!last_start_key.empty() &&
-      user_comparator->Compare(last_start_key, tombstone_start_key) == 0) {
+  if (!last_key.empty() &&
+      user_comparator->Compare(tombstone_start_key, last_key) <= 0) {
Just curious: why is this change from comparing start keys needed?
Because last_key is now set to the last tombstone_end_key, and we only have to look at entries with tombstone_start_key >= last_key.
In our insert algorithm, we have ensured that if we insert an entry B with tombstone_start_key < A.tombstone_end_key, we will insert an entry [B.tombstone_start_key, A.tombstone_end_key) and an entry [A.tombstone_end_key, xxx).
When we have processed the entry [B.tombstone_start_key, A.tombstone_end_key), assuming A.tombstone_start_key > B.tombstone_start_key, we can ignore [A.tombstone_start_key, A.tombstone_end_key) and just process [A.tombstone_end_key, xxx).
@cbi42 updated according to the review.
@cbi42 any update on this?
Hi @liyichao, sorry for the late reply. We plan to go with the caching approach #10547 for now, as it provides a good enough (hopefully) performance improvement and is easier to reason about correctness. I think the approach in this PR is better in terms of concurrent operations and worst-case performance if we consider both read and write operations, so we can revisit it when needed.
Oh, I see. #10547 seems reasonable. It reuses the FragmentedRangeTombstoneList class, which is needed anyway when constructing from an SST, so it is simpler. The only risk I can think of is when the request stream is delete_range_1, read_2, delete_range_2, read_3..., or when multiple reads come in before the first read finishes computing the FragmentedRangeTombstoneList, which seems unlikely. Besides, the user may need to flush the memtable when the number of range deletions reaches a certain threshold, or the first read's latency may be unacceptable, which adds complexity for the user. Maybe more use cases will reveal whether this PR is needed. Thanks anyway for the review.
Summary: Each read from the memtable used to read and fragment all the range tombstones into a `FragmentedRangeTombstoneList`. #10380 improved this inefficiency by caching a `FragmentedRangeTombstoneList` with each immutable memtable. This PR extends the caching to mutable memtables. The fragmented range tombstone list can be constructed in either the read path (this PR) or the write path (#10584). With both implementations, each `DeleteRange()` will invalidate the cache, and the difference is where the cache is re-constructed. `CoreLocalArray` is used to store the cache with each memtable so that multi-threaded reads can be efficient. More specifically, each core will have a shared_ptr to a shared_ptr pointing to the current cache. Each read thread will only update the reference count in its core-local shared_ptr, and this is only needed when reading from mutable memtables.

The choice between write path and read path is not an easy one: both are improvements compared to no caching in the current implementation, but they favor different operations and could cause a regression in the other operation (read vs. write). The write path caching in #10584 leads to a cleaner implementation, but I chose the read path caching here to avoid a significant regression in write performance when there is a considerable number of range tombstones in a single memtable (the numbers from the benchmark below suggest >1000 with concurrent writers). Note that even though the fragmented range tombstone list is only constructed in `DeleteRange()` operations, it could block other writes from proceeding, and hence affects overall write performance.

Pull Request resolved: #10547

Test Plan:
- TestGet() in the stress test is updated in #10553 to compare the Get() result against the expected state: `./db_stress_branch --readpercent=57 --prefixpercent=4 --writepercent=25 -delpercent=5 --iterpercent=5 --delrangepercent=4`
- Perf benchmark: tested read and write performance where a memtable has 0, 1, 10, 100 and 1000 range tombstones.
```
./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=200 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=200000 --reads=100000 --disable_auto_compactions --max_num_range_tombstones=1000
```
Write perf regressed since the cost of constructing the fragmented range tombstone list is shifted from every read to a single write. 6cbe5d8 is included in the last column as a reference to see the performance impact on multi-threaded reads if `CoreLocalArray` is not used.

micros/op averaged over 5 runs; the first 4 columns are for fillrandom, the last 4 columns are for readrandom.
| # range tombstones | fillrandom main | write path caching | read path caching | memtable V3 (#10308) | readrandom main | write path caching | read path caching | memtable V3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 6.35 | 6.15 | 5.82 | 6.12 | 2.24 | 2.26 | 2.03 | 2.07 |
| 1 | 5.99 | 5.88 | 5.77 | 6.28 | 2.65 | 2.27 | 2.24 | 2.5 |
| 10 | 6.15 | 6.02 | 5.92 | 5.95 | 5.15 | 2.61 | 2.31 | 2.53 |
| 100 | 5.95 | 5.78 | 5.88 | 6.23 | 28.31 | 2.34 | 2.45 | 2.94 |
| 100, 25 threads | 52.01 | 45.85 | 46.18 | 47.52 | 35.97 | 3.34 | 3.34 | 3.56 |
| 1000 | 6.0 | 7.07 | 5.98 | 6.08 | 333.18 | 2.86 | 2.7 | 3.6 |
| 1000, 25 threads | 52.6 | 148.86 | 79.06 | 45.52 | 473.49 | 3.66 | 3.48 | 4.38 |

- Benchmark performance of `readwhilewriting` from #10552, 100 range tombstones are written:
```
./db_bench --benchmarks=readwhilewriting --writes_per_range_tombstone=500 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=100000 --reads=500000 --disable_auto_compactions --max_num_range_tombstones=10000 --finish_after_writes
```
readrandom micros/op:

| | main | write path caching | read path caching | memtable V3 |
| --- | --- | --- | --- | --- |
| single thread | 48.28 | 1.55 | 1.52 | 1.96 |
| 25 threads | 64.3 | 2.55 | 2.67 | 2.64 |

Reviewed By: ajkr

Differential Revision: D38895410

Pulled By: cbi42

fbshipit-source-id: 930bfc309dd1b2f4e8e9042f5126785bba577559
@cbi42, in one of our environments the read latency continued to increase after we switched to DeleteRange (we confirmed this with RocksDB's perf context), so we have now reverted back to the scan-and-delete approach.
Hi @liyichao, thanks for reporting and for trying out DeleteRange(). I guess this is the "risk" use case you described in a previous comment. I assume the latency increase is caused by too many range tombstones accumulating in a memtable, so each fragmented range tombstone list reconstruction becomes increasingly expensive. Curious how many range tombstones are accumulated in the memtable? If write perf is not a concern, then …
I cannot find the number of range tombstones now, as we have already reverted. Scan-and-delete introduces a lot of latency in our I/O thread, so our write latency increases. That is bad because DeleteRange is our normal case rather than a rare case, and it may become a big problem as more load comes in. If anyone uses RocksDB as a backend for storing data with a model like Cassandra's (where a row has a partition key and other key columns), then a delete by partition key becomes a range delete, and this problem will appear.