Use only "local" range tombstones during Get #4449
Conversation
Another future optimization would be to cache an SST's fragmented range tombstones in its table reader. This would also be useful for the iterator portion of the DeleteRange redesign.
db/range_tombstone_fragmenter.cc
}
std::sort(seqnums_to_flush.begin(), seqnums_to_flush.end(),
          std::greater<SequenceNumber>());
for (auto seqnum : seqnums_to_flush) {
I think I could simplify this part by only writing a single tombstone for this fragment using the largest seqnum.
// the internal key ordering already provided by the input iterator. If there
// are few overlaps, creating a FragmentedRangeTombstoneIterator should be
// O(n), while the RangeDelAggregator tombstone collapsing is always O(n log n).
class FragmentedRangeTombstoneIterator : public InternalIterator {
A note about making this an InternalIterator: in the follow-up work I'm planning, SSTs will cache their FragmentedRangeTombstoneIterators, since their contents won't change (though this will require some additional logic to disable snapshot filtering and the dropping of covered fragments). These fragmented iterators will be used everywhere a range tombstone iterator is required from an SST, and as a result the class needs to be compatible with the InternalIterator interface.
After all the allocation improvements, the benchmark actually changed significantly here:
I'll update the PR description with this (for posterity, the old number was ~9.5MB/s).
abhimadan has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: Previously, range tombstones were accumulated from every level, which was necessary if a range tombstone in a higher level covered a key in a lower level. However, RangeDelAggregator::AddTombstones's complexity is based on the number of tombstones that are currently stored in it, which is wasteful in the Get case, where we only need to know the highest sequence number of range tombstones that cover the key. This change introduces this optimization, and removes the use of RangeDelAggregator from the Get path. Preliminary db_bench results show a readrandom throughput increase from 5.7 MB/s to 8.7 MB/s. This change is still a WIP since it does not handle snapshots.

Test Plan: make check
@abhimadan has updated the pull request.
Btw, have you run …
db/range_tombstone_fragmenter.h
// meta block into an iterator over non-overlapping tombstone fragments. The
// tombstone fragmentation process should be more efficient than the range
// tombstone collapsing algorithm in RangeDelAggregator because this leverages
// the internal key ordering already provided by the input iterator. If there
the range tombstone meta-blocks were not sorted in older versions, I believe. That was a change Nikhil made.
you could do a linear scan to check ordering, then sort only if not. Then it'll at least stay O(n) in the most common case.
Thanks for pointing this out. I've removed that assumption while trying to make the sorted case fast. I think there's still some slowdown though, so I'll run the benchmarks again and compare the results.
Yeah, it slowed down by about 3x. Here's the updated readrandom result (using the same commands as in the PR description):
readrandom : 11.147 micros/op 89707 ops/sec; 6.9 MB/s (63103 of 100000 found)
Fortunately, I think the effect is minimal in #4493; I ran some cursory benchmarks on that PR after rebasing and they don't seem to have changed.
@abhimadan has updated the pull request.
@abhimadan has updated the pull request.
I ran the following db_stress command and didn't see any problems:
Looking good so far, thanks for the excellent code quality.
RangeDelAggregator* range_del_agg, SequenceNumber* seq,
const ReadOptions& read_opts, ReadCallback* callback,
bool* is_blob_index) {
SequenceNumber* max_covering_tombstone_seq,
If max_covering_tombstone_seq is nonzero, can we return immediately, knowing that a covering range tombstone in any earlier memtable must be covering the key?
I noticed in the next PR you do this for SST files (well, using the file's largest seqnum), but I didn't see any early return for memtables.
The main reason I didn't add this check to memtable is because I thought that having several memtables was fairly uncommon, and they're usually small enough that it doesn't matter whether or not a tombstone has been found (though I also didn't realize that max_covering_tombstone_seq being nonzero was sufficient; I think you're right that it should be, since the memtable, each L0 file, and each L1+ level cover disjoint bands of sequence numbers). I'll add this check to the second PR.
db/range_tombstone_fragmenter.cc
#include "db/range_tombstone_fragmenter.h"

#include <set>
we usually alphabetize imports. I wonder if make format does this automatically.
It does do that; I didn't run it earlier since I usually run clang-format from my editor, but it looks good now. I'll get into the habit of running make format before sending out PRs.
@abhimadan has updated the pull request.
@abhimadan has updated the pull request.
lgtm, great work!
db/range_tombstone_fragmenter.cc
// Given the next start key in unfragmented_tombstones,
// flush_current_tombstones writes every tombstone fragment that starts
// and ends with a key before next_start_key.
is "starts at or after cur_start_key" another constraint on the fragments output by this function?
Yeah, it is another constraint (cur_start_key is updated inside flush_current_tombstones, but the constraint holds relative to its value when the function is called).
cur_end_key = next_start_key;
}
SequenceNumber max_seqnum = 0;
for (auto flush_it = it; flush_it != cur_end_keys.end(); ++flush_it) {
Think I finally understand this. One suggestion I have for readability is writing down the invariants and how they're maintained as much as possible. Invariants can also be "documented" by adding asserts, or even more complicated verification logic surrounded by #ifndef NDEBUG.
For example, I think here [it, cur_end_keys.end()) all fully overlap the range [cur_start_key, cur_end_key). And the reason for this (my understanding) is:
(1) cur_end_keys has been populated only with range tombstones that start before or at cur_start_key.
(2) cur_end_key is at most *it, and all later elements in cur_end_keys must be even larger than *it.
Thanks for the suggestion; I'll add some invariant checks. And that invariant you gave is true, but not simple to verify because cur_end_keys doesn't store the start keys of the tombstones it contains.
I only ended up adding one more assertion, but I also expanded on some comments (some of which help explain the example invariant you gave). I'll land this now, but let me know if there are other things that would benefit from more explanation/assertions, and I'll add them in the follow-up PR.
@abhimadan has updated the pull request.
Summary: This allows tombstone fragmenting to only be performed when the table is opened, and cached for subsequent accesses. On the same DB used in #4449, running `readrandom` results in the following:
```
readrandom : 0.983 micros/op 1017076 ops/sec; 78.3 MB/s (63103 of 100000 found)
```
Now that Get performance in the presence of range tombstones is reasonable, I also compared the performance between a DB with range tombstones, "expanded" range tombstones (several point tombstones that cover the same keys the equivalent range tombstone would cover, a common workaround for DeleteRange), and no range tombstones. The created DBs had 5 million keys each, and DeleteRange was called at regular intervals (depending on the total number of range tombstones being written) after 4.5 million Puts. The table below summarizes the results of a `readwhilewriting` benchmark (in order to provide somewhat more realistic results):
```
Tombstones?       | avg micros/op | stddev micros/op | avg ops/s    | stddev ops/s
----------------- | ------------- | ---------------- | ------------ | ------------
None              | 0.6186        | 0.04637          | 1,625,252.90 | 124,679.41
500 Expanded      | 0.6019        | 0.03628          | 1,666,670.40 | 101,142.65
500 Unexpanded    | 0.6435        | 0.03994          | 1,559,979.40 | 104,090.52
1k Expanded       | 0.6034        | 0.04349          | 1,665,128.10 | 125,144.57
1k Unexpanded     | 0.6261        | 0.03093          | 1,600,457.50 | 79,024.94
5k Expanded       | 0.6163        | 0.05926          | 1,636,668.80 | 154,888.85
5k Unexpanded     | 0.6402        | 0.04002          | 1,567,804.70 | 100,965.55
10k Expanded      | 0.6036        | 0.05105          | 1,667,237.70 | 142,830.36
10k Unexpanded    | 0.6128        | 0.02598          | 1,634,633.40 | 72,161.82
25k Expanded      | 0.6198        | 0.04542          | 1,620,980.50 | 116,662.93
25k Unexpanded    | 0.5478        | 0.0362           | 1,833,059.10 | 121,233.81
50k Expanded      | 0.5104        | 0.04347          | 1,973,107.90 | 184,073.49
50k Unexpanded    | 0.4528        | 0.03387          | 2,219,034.50 | 170,984.32
```
After a large enough quantity of range tombstones are written, range tombstone Gets can become faster than reading from an equivalent DB with several point tombstones.

Pull Request resolved: #4493
Differential Revision: D10842844
Pulled By: abhimadan
fbshipit-source-id: a7d44534f8120e6aabb65779d26c6b9df954c509
Summary: Previously, range tombstones were accumulated from every level, which
was necessary if a range tombstone in a higher level covered a key in a lower
level. However, RangeDelAggregator::AddTombstones's complexity is based on
the number of tombstones that are currently stored in it, which is wasteful in
the Get case, where we only need to know the highest sequence number of range
tombstones that cover the key from higher levels, and compute the highest covering
sequence number at the current level. This change introduces this optimization, and
removes the use of RangeDelAggregator from the Get path.
Performance Results
In the benchmark results, the following command was used to initialize the database:
...and the following command was used to measure read throughput:
The filluniquerandom command was only run once, and the resulting database was used
to measure read performance before and after the PR. Both binaries were compiled with
DEBUG_LEVEL=0.
Readrandom results before PR:
Readrandom results after PR:
So it's actually slower right now, but this PR paves the way for future optimizations (see #4493).
Test Plan: make check