Cache fragmented range tombstones in BlockBasedTableReader #4493
Conversation
table/block_based_table_reader.cc (Outdated)

      NewDataBlockIterator<DataBlockIter>(
          rep_, read_options, rep_->range_del_handle));
  }
  auto new_tombstone_fragments = std::make_shared<FragmentedRangeTombstoneList>(
There's a race here. If two threads try to call `NewRangeTombstoneIterator`, they may both create a new `FragmentedRangeTombstoneList` and overwrite `rep_->fragmented_tombstones`. Probably need to serialize access here.
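One way to address this, sketched below under the assumption that the lazy initialization stays in `NewRangeTombstoneIterator`, is to serialize construction with `std::call_once` (or a mutex) so only one thread ever builds and publishes the list. The `Rep` layout and the `once_flag` member here are illustrative stand-ins, not the PR's actual code:

```cpp
#include <memory>
#include <mutex>

// Illustrative stand-ins for the RocksDB types of the same name.
struct FragmentedRangeTombstoneList {};

struct Rep {
  std::shared_ptr<const FragmentedRangeTombstoneList> fragmented_tombstones;
  std::once_flag tombstone_init_flag;  // assumed member, added for this sketch
};

std::shared_ptr<const FragmentedRangeTombstoneList> GetOrBuildTombstones(
    Rep* rep) {
  // Only one thread runs the builder; the others block until the shared_ptr
  // has been published, so nobody overwrites an existing list.
  std::call_once(rep->tombstone_init_flag, [rep] {
    rep->fragmented_tombstones =
        std::make_shared<const FragmentedRangeTombstoneList>(
            /* fragment the unfragmented tombstone block here */);
  });
  return rep->fragmented_tombstones;
}
```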
Force-pushed from 9a5edb5 to 60c6f3a.
Force-pushed from 944c3ee to f1e5c97.
Summary: Previously, range tombstones were accumulated from every level, which was necessary if a range tombstone in a higher level covered a key in a lower level. However, RangeDelAggregator::AddTombstones's complexity is based on the number of tombstones that are currently stored in it, which is wasteful in the Get case, where we only need to know the highest sequence number of range tombstones that cover the key from higher levels, and compute the highest covering sequence number at the current level. This change introduces this optimization, and removes the use of RangeDelAggregator from the Get path.

In the benchmark results, the following command was used to initialize the database:

```
./db_bench -db=/dev/shm/5k-rts -use_existing_db=false -benchmarks=filluniquerandom -write_buffer_size=1048576 -compression_type=lz4 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -value_size=112 -key_size=16 -block_size=4096 -level_compaction_dynamic_level_bytes=true -num=5000000 -max_background_jobs=12 -benchmark_write_rate_limit=20971520 -range_tombstone_width=100 -writes_per_range_tombstone=100 -max_num_range_tombstones=50000 -bloom_bits=8
```

...and the following command was used to measure read throughput:

```
./db_bench -db=/dev/shm/5k-rts/ -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=5000000 -reads=100000 -threads=32
```

The filluniquerandom command was only run once, and the resulting database was used to measure read performance before and after the PR. Both binaries were compiled with `DEBUG_LEVEL=0`.

Readrandom results before PR:

```
readrandom : 4.544 micros/op 220090 ops/sec; 16.9 MB/s (63103 of 100000 found)
```

Readrandom results after PR:

```
readrandom : 11.147 micros/op 89707 ops/sec; 6.9 MB/s (63103 of 100000 found)
```

So it's actually slower right now, but this PR paves the way for future optimizations (see #4493).

----

Pull Request resolved: #4449

Differential Revision: D10370575

Pulled By: abhimadan

fbshipit-source-id: 9a2e152be1ef36969055c0e9eb4beb0d96c11f4d
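To make the optimization described in that summary concrete, here is a minimal sketch of the idea (the type and function names are illustrative stand-ins, not RocksDB's actual API): during Get, each level only contributes the largest sequence number among its tombstones that cover the lookup key, and a value is visible only if its own sequence number is larger than that running maximum.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Illustrative stand-in for RocksDB's range tombstone representation.
struct RangeTombstone {
  std::string start_key;
  std::string end_key;
  uint64_t seq;
};

// Highest sequence number among this level's tombstones that cover `key`.
uint64_t MaxCoveringTombstoneSeq(
    const std::vector<RangeTombstone>& level_tombstones,
    const std::string& key) {
  uint64_t max_seq = 0;
  for (const auto& t : level_tombstones) {
    if (key >= t.start_key && key < t.end_key) {
      max_seq = std::max(max_seq, t.seq);
    }
  }
  return max_seq;
}

// During Get, only the running maximum is carried from level to level: a value
// found with sequence number `value_seq` is visible iff it is newer than every
// covering tombstone seen so far.
bool IsValueVisible(uint64_t value_seq, uint64_t max_covering_tombstone_seq) {
  return value_seq > max_covering_tombstone_seq;
}
```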
Force-pushed from 885f6d6 to 0bb0198.
abhimadan has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
This allows tombstone fragmenting to only be performed when the table is first opened, and cached for subsequent accesses.
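As a rough illustration of that caching scheme (the type and member names below are simplified stand-ins, not the PR's exact code), the fragment list is built once when the table is opened and every iterator created afterwards simply shares it:

```cpp
#include <memory>
#include <utility>

// Simplified stand-ins for the RocksDB types involved.
struct FragmentedRangeTombstoneList {};

class FragmentedRangeTombstoneIterator {
 public:
  explicit FragmentedRangeTombstoneIterator(
      std::shared_ptr<const FragmentedRangeTombstoneList> list)
      : tombstones_(std::move(list)) {}

 private:
  // Shared ownership keeps the cached list alive as long as any iterator
  // still references it.
  std::shared_ptr<const FragmentedRangeTombstoneList> tombstones_;
};

struct TableRep {
  // Populated once at table-open time; immutable afterwards.
  std::shared_ptr<const FragmentedRangeTombstoneList> fragmented_tombstones;
};

// Cheap per-call work: no re-fragmenting, just wrap the cached list.
FragmentedRangeTombstoneIterator* NewRangeTombstoneIterator(TableRep* rep) {
  if (rep->fragmented_tombstones == nullptr) {
    return nullptr;  // the table has no range tombstones
  }
  return new FragmentedRangeTombstoneIterator(rep->fragmented_tombstones);
}
```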
Force-pushed from 0bb0198 to be445f8.
@abhimadan has updated the pull request. Re-import the pull request
abhimadan has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
nice design and code, lgtm
@@ -66,7 +100,7 @@ class FragmentedRangeTombstoneIterator : public InternalIterator {
   };

   void MaybePinKey() const {
-    if (pos_ != tombstones_.end() && pinned_pos_ != pos_) {
+    if (pos_ != tombstones_->end() && pinned_pos_ != pos_) {
       current_start_key_.Set(pos_->start_key_, pos_->seq_, kTypeRangeDeletion);
       pinned_pos_ = pos_;
I forgot to ask about this on the last PR. What does `pinned_pos_` do?
`pinned_pos_` points to the key that's currently pinned. It's mainly used to avoid re-pinning a key that is already pinned.
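As a self-contained sketch of that pattern (the container and key types are simplified stand-ins; only the member names mirror the diff above), `pinned_pos_` lets `key()` skip re-copying the key whenever the iterator has not moved since the last pin:

```cpp
#include <string>
#include <vector>

// Simplified stand-in for the iterator's lazy pinning logic shown above.
class PinningIterator {
 public:
  explicit PinningIterator(const std::vector<std::string>* tombstones)
      : tombstones_(tombstones),
        pos_(tombstones_->begin()),
        pinned_pos_(tombstones_->end()) {}

  void Next() { ++pos_; }  // moving is cheap; pinning happens lazily in key()

  // Returns the current key, pinning it only if needed.
  const std::string& key() const {
    MaybePinKey();
    return current_start_key_;
  }

 private:
  void MaybePinKey() const {
    // Only re-copy the key if the position changed since the last pin;
    // pinned_pos_ remembers which element is currently pinned.
    if (pos_ != tombstones_->end() && pinned_pos_ != pos_) {
      current_start_key_ = *pos_;
      pinned_pos_ = pos_;
    }
  }

  const std::vector<std::string>* tombstones_;
  std::vector<std::string>::const_iterator pos_;
  // Caches are mutable so key() can stay const, mirroring the real iterator.
  mutable std::vector<std::string>::const_iterator pinned_pos_;
  mutable std::string current_start_key_;
};
```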
table/block_based_table_reader.h (Outdated)

@@ -115,6 +116,9 @@ class BlockBasedTable : public TableReader {
   InternalIterator* NewRangeTombstoneIterator(
       const ReadOptions& read_options) override;

+  InternalIterator* NewUnfragmentedRangeTombstoneIterator(
should it be private?
Yeah it should be; thanks for catching that!
@abhimadan has updated the pull request. Re-import the pull request
abhimadan has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
This allows tombstone fragmenting to only be performed when the table is opened, and cached for subsequent accesses.
On the same DB used in #4449, running `readrandom` results in the following:

Now that Get performance in the presence of range tombstones is reasonable, I also compared the performance between a DB with range tombstones, "expanded" range tombstones (several point tombstones that cover the same keys the equivalent range tombstone would cover, a common workaround for DeleteRange), and no range tombstones. The created DBs had 5 million keys each, and DeleteRange was called at regular intervals (depending on the total number of range tombstones being written) after 4.5 million Puts. The table below summarizes the results of a `readwhilewriting` benchmark (in order to provide somewhat more realistic results):

After a large enough number of range tombstones is written, range tombstone Gets can become faster than reading from an equivalent DB with several point tombstones.
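For reference, a readwhilewriting run of the kind described above could be launched roughly like this; the command is a hedged reconstruction assembled from the flags quoted in the #4449 commands earlier, with a placeholder database path, and is not the exact invocation used for the results discussed here:

```
./db_bench -db=/path/to/range-tombstone-db -use_existing_db=true \
    -benchmarks=readwhilewriting -num=5000000 -reads=100000 -threads=32 \
    -benchmark_write_rate_limit=20971520
```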
Test Plan: make check