
reduce DeleteRange's negative impact on read by fragmenting range tombstones in the write path #10308

Open
wants to merge 12 commits into base: main

Conversation

@liyichao commented Jul 5, 2022

  • The memtable format is updated to v2 to reduce CPU usage on reads (Get).
  • FragmentedRangeTombstoneList is adapted to the new memtable format so
    we can iterate range tombstones as before.
  • The sstable format is the same as before.

@facebook-github-bot (Contributor)

Hi @liyichao!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

@liyichao changed the title from "fragment range tombstones when writing." to "reduce DeleteRange's negative impact on read by fragmenting range tombstones in the write path" on Jul 5, 2022
@facebook-github-bot (Contributor)

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@liyichao changed the title to "WIP: reduce DeleteRange's negative impact on read by fragmenting range tombstones in the write path" on Jul 5, 2022
@liyichao (Author) commented Jul 7, 2022

@ajkr please help review this. This PR continues the work from #5032 and implements the design described in https://docs.google.com/document/d/1NfjELi8Zz27TUZ3HoxujPRLKe6ifx_Jg9cAEF6WtZNg/edit#heading=h.2irdxocmr0eu .

One failed test is due to a failure to install clang; the other three failed tests all seem to be related to InsertKeyWithHint, and I do not know why.

@liyichao changed the title back to "reduce DeleteRange's negative impact on read by fragmenting range tombstones in the write path" on Jul 8, 2022
@cbi42 (Member) left a comment

The following unit test will fail with the current implementation:

TEST_F(DBRangeDelTest, NewTest) {
  ASSERT_OK(db_->Put(WriteOptions(), "b", "b"));
  ASSERT_TRUE(
      db_->DeleteRange(WriteOptions(), db_->DefaultColumnFamily(), "a", "d").ok());
  auto snapshot = db_->GetSnapshot();
  ASSERT_TRUE(
      db_->DeleteRange(WriteOptions(), db_->DefaultColumnFamily(), "b", "c").ok());
  ASSERT_EQ("NOT_FOUND", Get("b", snapshot));
  db_->ReleaseSnapshot(snapshot);
}

I'm still a little fuzzy on the details of how memtable v2 should work. Taking this unit test as an example, I think one possible way to handle it is:

Notation: key@fragment seqno : value@tombstone seqno

DeleteRange('a', 'd'):
insert a@2 : d@2

DeleteRange('b', 'c'):
insert c@3 : d@2
insert b@3 : c@3
insert a@3 : b@2

Then when computing the max covering seqno for 'b' at the snapshot with seqno 2, we create a DB iter on this snapshot. Do a SeekForPrev('b'), which will give us a@2:d@2 (all other memtable entries have fragment sequence number > 2 and are skipped by the DB iter).
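
To make the example above concrete, here is a minimal, self-contained C++ sketch of this kind of write-path fragmentation. It is only an illustration of the scheme sketched in this comment, not the PR's actual memtable code: the memtable is modeled as an append-only vector, and all names (RangeDelMemtable, MaxCoveringSeq, the Fragment struct) are invented for the example.

```cpp
// Toy, append-only "memtable" holding only range tombstones.  A fragment is
// (start key, fragment seqno) -> (end key, tombstone seqno), as in the
// notation above.  Illustration only; not RocksDB code.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <iterator>
#include <map>
#include <set>
#include <string>
#include <vector>

struct Fragment {
  std::string start, end;  // covers [start, end)
  uint64_t fragment_seq;   // seqno carried in the memtable key
  uint64_t tombstone_seq;  // newest deletion covering this range
};

class RangeDelMemtable {
 public:
  // DeleteRange(s, e) at seqno `seq`: re-fragment the latest view against the
  // new range and append every covered piece at fragment_seq = seq.
  void DeleteRange(const std::string& s, const std::string& e, uint64_t seq) {
    std::set<std::string> bounds = {s, e};
    for (const auto& kv : latest_) {
      bounds.insert(kv.second.start);
      bounds.insert(kv.second.end);
    }
    std::map<std::string, Fragment> next;
    for (auto it = bounds.begin(); std::next(it) != bounds.end(); ++it) {
      const std::string& lo = *it;
      const std::string& hi = *std::next(it);
      uint64_t tomb = 0;
      if (s <= lo && hi <= e) tomb = seq;  // piece covered by the new range
      auto prev = latest_.upper_bound(lo);
      if (prev != latest_.begin()) {
        --prev;  // last existing fragment with start <= lo
        if (prev->second.start <= lo && hi <= prev->second.end) {
          tomb = std::max(tomb, prev->second.tombstone_seq);
        }
      }
      if (tomb == 0) continue;  // piece not covered by any tombstone
      Fragment f{lo, hi, seq, tomb};
      memtable_.push_back(f);  // append-only, like a real memtable
      next[lo] = f;
    }
    latest_ = std::move(next);
  }

  // Max tombstone seqno covering `key` that is visible at `snapshot_seq`
  // (a linear scan here; a real memtable would seek instead).
  uint64_t MaxCoveringSeq(const std::string& key, uint64_t snapshot_seq) const {
    uint64_t result = 0;
    for (const Fragment& f : memtable_) {
      if (f.fragment_seq <= snapshot_seq && f.start <= key && key < f.end) {
        result = std::max(result, f.tombstone_seq);
      }
    }
    return result;
  }

 private:
  std::vector<Fragment> memtable_;
  std::map<std::string, Fragment> latest_;  // most recent fragmented view
};

int main() {
  RangeDelMemtable m;
  m.DeleteRange("a", "d", 2);  // appends a@2:d@2
  m.DeleteRange("b", "c", 3);  // appends a@3:b@2, b@3:c@3, c@3:d@2
  std::cout << m.MaxCoveringSeq("b", 2) << "\n";  // 2
  std::cout << m.MaxCoveringSeq("b", 3) << "\n";  // 3
  return 0;
}
```

Running it prints 2 and then 3, matching the covering seqnos for 'b' at snapshots 2 and 3 in the example above.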

db/memtable.cc Outdated
Comment on lines 615 to 616
to_insert.emplace(key, value, tombstone_seq);
to_insert.emplace(key, value, s);
@cbi42 (Member) commented Jul 15, 2022

These two new fragments will have the same key and sequence number, which our memtable does not support. You can check the insert(table.get(), h) call below: false is returned for this fragment.

@liyichao (Author)

Oh, this is the exact reason why NewTest fails. I expected b@2:c@2 and b@3:c@3 to be inserted, so that when searching, b@2:c@2 would be found. I think your approach fixes NewTest for this case, but consider this:

insert a@2:d@2
insert b@5:c@3

If we look up b@3, the entry b@5:c@3 will be ignored, but it should not be, because we cannot use the tombstone seqno to filter; we can only use the fragment seqno. Correct me if I am wrong.

Let me think about how to handle this.

@liyichao (Author) commented Jul 15, 2022

I have come up with a method: encode multiple sequence numbers into the entry with the same start and end keys. What do you think about this?

By the way, this case corresponds to the `DeleteRange("d", "f") is called at seqnum 200` case in the doc. But with the method in the doc, if we have to find e@101, we have to iterate over all previous entries once we find d@200:f@200, which is costly. What this PR's method ensures is that once you find the start key d@kMaxSequenceNumber, you only have to iterate forward, and you can stop once you reach a start key larger than the lookup key.

@cbi42 (Member) commented Jul 15, 2022

Sorry, I'm not following where b@5:c@3 comes from. For b@5:c@3 to be inserted, I think there should be some range tombstone inserted previously that covers 'b' at 3.

Btw, I confused the two sequence numbers in the previous comment (now corrected). The fragment seqno is the seqno in the key, and the tombstone seqno is in the value and tells us what this range tombstone covers.

@cbi42 (Member)

> encoding multiple sequence numbers into the entry with the same start and end

Could you elaborate on the new method?

> with the method in the doc, consider we have to find e@101, we have to iterate all previous entries when we found d@200:f@200

My method above is also a little different from the method in the doc, and it should only require one seek on the DB iter (although the seek itself on the DB iter might iterate over some entries).

> what this pr's method ensures is that once you found the start key d@kMaxSequenceNumber, you only have to iterate forward

I thought the above test shows that this does not work, maybe I need to understand the new method first.

@liyichao (Author)

The new method: in FormatEntry, when encoding the value, encode val_size, then encode a count, then memcpy the value, then encode `count` sequence numbers. This way, we avoid the insert failure you mention. When searching, just decode all the sequence numbers and iterate over them all.
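
A rough sketch of that value layout might look like the following. This is an assumption-laden illustration, not the PR's actual FormatEntry change: fixed-width little-endian fields stand in for RocksDB's varint encoding, and the function names are invented for the example.

```cpp
// Hypothetical value layout: val_size, then a count, then the value bytes
// (the tombstone end key), then `count` sequence numbers.
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

static void AppendFixed64(std::string* dst, uint64_t v) {
  char buf[sizeof(v)];
  std::memcpy(buf, &v, sizeof(v));  // assumes a little-endian host
  dst->append(buf, sizeof(buf));
}

static uint64_t ReadFixed64(const char* p) {
  uint64_t v;
  std::memcpy(&v, p, sizeof(v));
  return v;
}

// Pack the end key of a range-tombstone fragment together with every
// tombstone seqno stored for that [start, end) fragment.
std::string EncodeFragmentValue(const std::string& end_key,
                                const std::vector<uint64_t>& seqnos) {
  std::string out;
  AppendFixed64(&out, end_key.size());  // val_size
  AppendFixed64(&out, seqnos.size());   // count
  out.append(end_key);                  // value bytes
  for (uint64_t s : seqnos) AppendFixed64(&out, s);  // count seqnos
  return out;
}

// Unpack it again; a reader iterates over all decoded seqnos.
void DecodeFragmentValue(const std::string& in, std::string* end_key,
                         std::vector<uint64_t>* seqnos) {
  const char* p = in.data();
  uint64_t val_size = ReadFixed64(p);
  uint64_t count = ReadFixed64(p + 8);
  end_key->assign(p + 16, val_size);
  const char* seq_base = p + 16 + val_size;
  for (uint64_t i = 0; i < count; ++i) {
    seqnos->push_back(ReadFixed64(seq_base + 8 * i));
  }
}
```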

@cbi42 (Member):

The idea looks good to me! Some thoughts to consider:

  • During insertion, for each stream of tombstones with the same starting key, we only need to consider the first one and move on to the next tombstone with a larger starting key. For example, consider this sequence of range deletions: a:f@1, a:e@2, a:b@3. During a:b@3, it only needs to consider a@2:e@[2, 1], and then it can skip a@1:f@1 and consider e@2:f@1 next.
  • In MaxCoveringTombstoneSeqnum, the loop may not be necessary anymore. After two seeks, we should land at the tombstone covering the lookup key, which gives us the full list of sequence numbers.
  • I think we essentially have a representation in the memtable that is much closer to FragmentedRangeTombstoneList: each range tombstone has its list of sequence numbers. The construction of FragmentedRangeTombstoneList can potentially be optimized.

@liyichao (Author):

I agree with all of them. I will do the first two first, and maybe leave the third for future optimization.

@cbi42 (Member)

Sounds good! Btw, I have another PR out related to range deletion optimization too, #10380. It'd be interesting to run some benchmarks to see the perf improvement.

@liyichao (Author) commented Jul 21, 2022

Yeah, I will run the same test for pre-PR/post-PR/your PR/scan-and-delete. I have these questions:

  1. Where can I get the scan-and-delete source for testing?
  2. In your PR, fragmented_range_tombstone_list_ does not seem to be protected by a lock, so when Get and DeleteRange are concurrent, are there problems?

@liyichao (Author) commented Jul 21, 2022

By the way, can you help me look at the failed InsertWithHint test? It reports a use-after-free, but I have no idea where the problem is.

@cbi42 (Member)

  1. I used the --expand_range_tombstones=true flag in db_bench for the scan-and-delete case.
  2. I think you are right: reading and writing to the same shared_ptr object might not be thread-safe. I've updated the PR.

@cbi42 (Member) commented Jul 15, 2022

Hi @liyichao, thanks for contributing and picking up the work on delete range optimization. I added some initial thoughts and comments as I'm still going through the PR.

@liyichao (Author) commented Jul 22, 2022

Added some benchmarks as discussed in #10380:
pr10380 refers to eb178b9fda27f5c7bf99778953040dececaefaad, which has the atomic load/store.

[benchmark results screenshot]

@cbi42 (Member) left a comment

Added some comments, mostly regarding tombstone insertion. Great work on updating to memtable "V3" now :)

db/memtable.cc
unfragmented_tombstones->Next()) {
total_tombstone_payload_bytes_ += unfragmented_tombstones->key().size() +
unfragmented_tombstones->value().size();
unfragmented_tombstones->Next(), num_unfragmented_tombstones_++) {
@cbi42 (Member)

I think it's okay to use the value size as the raw payload bytes here, which keeps the code cleaner and requires less change.

@liyichao (Author)

db_flush_test will fail:

Expected equality of these values:
  mem_data_bytes
    Which is: 8205814
  EXPECTED_MEMTABLE_PAYLOAD_BYTES_AT_FLUSH
    Which is: 194050

@cbi42 (Member)

Is the difference here supposed to be how much space the sequence numbers take? That number seems quite large, a lot more than the key + value size.

@liyichao (Author)

Yes, it is the sequence number space.

@liyichao (Author) commented Aug 5, 2022

@cbi42 any update to this?

@cbi42 (Member) commented Aug 5, 2022

> @cbi42 any update to this?

@liyichao I did not know you were done with the updates, I'll take a look at the PR soon.

@cbi42 self-requested a review on August 5, 2022 04:50
if (!last_start_key.empty() &&
    user_comparator->Compare(last_start_key, tombstone_start_key) == 0) {
if (!last_key.empty() &&
    user_comparator->Compare(tombstone_start_key, last_key) <= 0) {
@cbi42 (Member)

Just curious: why is this change from comparing start keys needed?

@liyichao (Author) commented Aug 15, 2022

Because last_key is now set to the last tombstone_end_key, and we only have to look at entries with tombstone_start_key >= last_key.

In our insert algorithm, we have ensured that if we insert an entry B with B.tombstone_start_key < A.tombstone_end_key, we will insert entries [B.tombstone_start_key, A.tombstone_end_key) and [A.tombstone_end_key, xxx).

When we have processed the entry [B.tombstone_start_key, A.tombstone_end_key), assuming A.tombstone_start_key > B.tombstone_start_key, we can ignore [A.tombstone_start_key, A.tombstone_end_key) and just process [A.tombstone_end_key, xxx).

@liyichao (Author)

@cbi42 updated according to review

@liyichao (Author)

@cbi42 any update to this?

@cbi42 (Member) commented Aug 29, 2022

> @cbi42 any update to this?

Hi @liyichao, sorry for the late reply. We plan to go with the caching approach in #10547 for now, as it provides a good enough (hopefully) performance improvement and it is easier to reason about its correctness. I think the approach in this PR is better in terms of concurrent operations and worst-case performance if we consider both read and write operations, so we can revisit this when it is needed.

@liyichao (Author) commented Aug 30, 2022

Oh, I see. #10547 seems reasonable. It reuses the FragmentedRangeTombstoneList class, which is needed anyway when constructing from an SST file, so it is simpler. The only risk I can think of is when the request stream is delete_range_1, read_2, delete_range_2, read_3, ..., or when multiple reads come in before the first read finishes computing the FragmentedRangeTombstoneList; the probability of that seems low. Besides, the user may need to flush the memtable when the number of range deletions reaches a certain count, or the first read's latency may be unacceptable, which is extra complexity for the user.

Maybe more use cases will reveal whether this PR is needed. Thanks anyway for the review.

facebook-github-bot pushed a commit that referenced this pull request Sep 14, 2022
Summary:
Each read from the memtable used to read and fragment all the range tombstones into a `FragmentedRangeTombstoneList`. #10380 improved the inefficiency here by caching a `FragmentedRangeTombstoneList` with each immutable memtable. This PR extends the caching to mutable memtables. The fragmented range tombstone list can be constructed in either the read path (this PR) or the write path (#10584). With both implementations, each `DeleteRange()` will invalidate the cache, and the difference is where the cache is re-constructed. `CoreLocalArray` is used to store the cache with each memtable so that multi-threaded reads can be efficient. More specifically, each core will have a shared_ptr to a shared_ptr pointing to the current cache. Each read thread will only update the reference count in its core-local shared_ptr, and this is only needed when reading from mutable memtables.

The choice between the write path and the read path is not an easy one: both are improvements compared to no caching in the current implementation, but they favor different operations and could cause a regression in the other operation (read vs. write). The write path caching (#10584) leads to a cleaner implementation, but I chose the read path caching here to avoid a significant regression in write performance when there is a considerable number of range tombstones in a single memtable (the numbers from the benchmark below suggest >1000 with concurrent writers). Note that even though the fragmented range tombstone list is only constructed in `DeleteRange()` operations, it could block other writes from proceeding, and hence affects overall write performance.

Pull Request resolved: #10547

Test Plan:
- TestGet() in the stress test is updated in #10553 to compare the Get() result against the expected state: `./db_stress_branch --readpercent=57 --prefixpercent=4 --writepercent=25 -delpercent=5 --iterpercent=5 --delrangepercent=4`
- Perf benchmark: tested read and write performance where a memtable has 0, 1, 10, 100 and 1000 range tombstones.
```
./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=200 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=200000 --reads=100000 --disable_auto_compactions --max_num_range_tombstones=1000
```
Write perf regressed since the cost of constructing the fragmented range tombstone list is shifted from every read to a single write. 6cbe5d8 is included in the last column as a reference to see the performance impact on multi-threaded reads if `CoreLocalArray` is not used.

micros/op averaged over 5 runs: first 4 columns are for fillrandom, last 4 columns are for readrandom.
| # range tombstones | fillrandom main | write path caching | read path caching | memtable V3 (#10308) | readrandom main | write path caching | read path caching | memtable V3 |
|---|---|---|---|---|---|---|---|---|
| 0 | 6.35 | 6.15 | 5.82 | 6.12 | 2.24 | 2.26 | 2.03 | 2.07 |
| 1 | 5.99 | 5.88 | 5.77 | 6.28 | 2.65 | 2.27 | 2.24 | 2.5 |
| 10 | 6.15 | 6.02 | 5.92 | 5.95 | 5.15 | 2.61 | 2.31 | 2.53 |
| 100 | 5.95 | 5.78 | 5.88 | 6.23 | 28.31 | 2.34 | 2.45 | 2.94 |
| 100, 25 threads | 52.01 | 45.85 | 46.18 | 47.52 | 35.97 | 3.34 | 3.34 | 3.56 |
| 1000 | 6.0 | 7.07 | 5.98 | 6.08 | 333.18 | 2.86 | 2.7 | 3.6 |
| 1000, 25 threads | 52.6 | 148.86 | 79.06 | 45.52 | 473.49 | 3.66 | 3.48 | 4.38 |

  - Benchmark performance of `readwhilewriting` from #10552, with 100 range tombstones written: `./db_bench --benchmarks=readwhilewriting --writes_per_range_tombstone=500 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=100000 --reads=500000 --disable_auto_compactions --max_num_range_tombstones=10000 --finish_after_writes`

readrandom micros/op:
| | main | write path caching | read path caching | memtable V3 |
|---|---|---|---|---|
| single thread | 48.28 | 1.55 | 1.52 | 1.96 |
| 25 threads | 64.3 | 2.55 | 2.67 | 2.64 |

Reviewed By: ajkr

Differential Revision: D38895410

Pulled By: cbi42

fbshipit-source-id: 930bfc309dd1b2f4e8e9042f5126785bba577559
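
As a rough illustration of the core-local caching idea described in the commit message above, here is a heavily simplified, hedged sketch. It is not the actual RocksDB CoreLocalArray code: it drops the extra shared_ptr-to-shared_ptr indirection, indexes slots explicitly instead of by CPU core, and rebuilds a fresh stub list per slot on demand. All names are invented for the example, and the pre-C++20 atomic shared_ptr free functions from <memory> are used.

```cpp
// Simplified sketch of per-"core" caching of a fragmented range tombstone
// list: readers only touch their own slot's shared_ptr, and DeleteRange()
// invalidates every slot.
#include <cstddef>
#include <memory>
#include <vector>

struct FragmentedListStub {
  // The fragmented range tombstones would live here.
};

class PerCoreTombstoneCache {
 public:
  explicit PerCoreTombstoneCache(size_t num_slots) : slots_(num_slots) {}

  // Reader path: copy this slot's pointer (refcount traffic stays mostly
  // local to the slot), rebuilding the cache on demand if it was invalidated.
  std::shared_ptr<const FragmentedListStub> Get(size_t slot) {
    auto cached = std::atomic_load(&slots_[slot]);
    if (cached == nullptr) {
      // Stand-in for fragmenting the memtable's range tombstones.
      cached = std::make_shared<const FragmentedListStub>();
      std::atomic_store(&slots_[slot], cached);
    }
    return cached;
  }

  // Write path: DeleteRange() invalidates the cache in every slot.
  void Invalidate() {
    for (auto& slot : slots_) {
      std::atomic_store(&slot, std::shared_ptr<const FragmentedListStub>());
    }
  }

 private:
  std::vector<std::shared_ptr<const FragmentedListStub>> slots_;
};
```

The point of the per-slot copies is that concurrent readers mostly bump reference counts in separate control blocks rather than contending on one shared atomic counter, which is what the commit message's note about multi-threaded reads refers to.
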
@liyichao (Author) commented Feb 13, 2023

@cbi42, in one of our environments, the read latency continues to increase:

[read latency graph]

After enabling RocksDB's perf stats, we see user_key_comparison_count = 2586508 and get_from_memtable_time = 117709781, and the CPU flame graph shows the time is spent on range tombstones:

[CPU flame graph]

After switching to scan-and-delete, the latency returns to normal:

[latency graph after switching to scan-and-delete]

We have now reverted back to the scan-and-delete method. Our workload uses delete_range to delete keys because one application key corresponds to many RocksDB keys, and they share a prefix. So the workload uses delete_range more frequently than in testing, and mixes delete_range with regular gets; I think that is why the cache is not working as expected.

@cbi42 (Member) commented Feb 15, 2023

Hi @liyichao, thanks for reporting this and for trying out DeleteRange(). I guess this is the "risk" use case you described in a previous comment. I assume this latency increase is caused by too many range tombstones being accumulated in a memtable, so each fragmented range tombstone list reconstruction becomes increasingly expensive. Curious how many range tombstones are accumulated in the memtable? If write perf is not a concern, then scan-and-delete is probably faster in this case.

@liyichao (Author) commented Feb 16, 2023

I cannot find the number of range tombstones now, as we have already reverted. Scan-and-delete introduces a lot of latency on our IO threads, so our write latency increases. It is bad because delete_range is our normal case rather than a rare case, and it may become a big problem when more load comes in. If anyone uses RocksDB as a backend for storing data with a model like Cassandra's (where a row has a partition key and other key columns), then a delete using the partition key becomes a delete range, and this problem will appear.
