
Use only "local" range tombstones during Get #4449

Closed
abhimadan wants to merge 15 commits

Conversation

Contributor

@abhimadan abhimadan commented Oct 4, 2018

Summary: Previously, range tombstones were accumulated from every level, which
was necessary if a range tombstone in a higher level covered a key in a lower
level. However, the complexity of RangeDelAggregator::AddTombstones grows with
the number of tombstones already stored in it, which is wasteful for Get: there
we only need to carry the highest sequence number of the range tombstones from
higher levels that cover the key, and compute the highest covering sequence
number at the current level. This change introduces that optimization and
removes the use of RangeDelAggregator from the Get path.
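To make the idea concrete, here is a simplified sketch of per-level tombstone handling during Get. It is illustrative only: the types and function below (PointEntry, RangeTombstone, Level, Get) are made up for this example and are not the RocksDB implementation.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

using SequenceNumber = uint64_t;

struct PointEntry {
  std::string key;
  SequenceNumber seq;
  bool is_delete;  // point tombstone
  std::string value;
};

struct RangeTombstone {
  std::string start_key;  // inclusive
  std::string end_key;    // exclusive
  SequenceNumber seq;
};

struct Level {
  std::vector<PointEntry> points;          // sorted by (key asc, seq desc)
  std::vector<RangeTombstone> tombstones;  // only this level's tombstones
};

// Returns true and fills *value if `key` has a visible, non-deleted entry.
bool Get(const std::vector<Level>& levels_newest_first, const std::string& key,
         SequenceNumber snapshot, std::string* value) {
  // Highest seqnum of any range tombstone seen so far that covers `key`.
  SequenceNumber max_covering_tombstone_seq = 0;
  for (const Level& level : levels_newest_first) {
    // Fold in only this level's "local" tombstones; tombstones from newer
    // levels are already summarized by max_covering_tombstone_seq.
    for (const RangeTombstone& t : level.tombstones) {
      if (t.seq <= snapshot && t.start_key <= key && key < t.end_key) {
        max_covering_tombstone_seq =
            std::max(max_covering_tombstone_seq, t.seq);
      }
    }
    // Check this level's point entries against the accumulated seqnum.
    for (const PointEntry& p : level.points) {
      if (p.key == key && p.seq <= snapshot) {
        if (p.is_delete || p.seq < max_covering_tombstone_seq) {
          return false;  // deleted
        }
        *value = p.value;
        return true;
      }
    }
  }
  return false;  // not found
}
```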

Performance Results

In the benchmark results, the following command was used to initialize the database:

./db_bench -db=/dev/shm/5k-rts -use_existing_db=false -benchmarks=filluniquerandom -write_buffer_size=1048576 -compression_type=lz4 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -value_size=112 -key_size=16 -block_size=4096 -level_compaction_dynamic_level_bytes=true -num=5000000 -max_background_jobs=12 -benchmark_write_rate_limit=20971520 -range_tombstone_width=100 -writes_per_range_tombstone=100 -max_num_range_tombstones=50000 -bloom_bits=8

...and the following command was used to measure read throughput:

./db_bench -db=/dev/shm/5k-rts/ -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=5000000 -reads=100000 -threads=32

The filluniquerandom command was only run once, and the resulting database was used
to measure read performance before and after the PR. Both binaries were compiled with
DEBUG_LEVEL=0.

Readrandom results before PR:

readrandom   :       4.544 micros/op 220090 ops/sec;   16.9 MB/s (63103 of 100000 found)

Readrandom results after PR:

readrandom   :      11.147 micros/op 89707 ops/sec;    6.9 MB/s (63103 of 100000 found)

So it's actually slower right now, but this PR paves the way for future optimizations (see #4493).


Test Plan: make check

@abhimadan abhimadan added the WIP Work in progress label Oct 4, 2018
@abhimadan
Contributor Author

Another future optimization would be to cache an SST's fragmented range tombstones in its table reader. This would also be useful for the iterator portion of the DeleteRange redesign.

@abhimadan abhimadan removed the WIP Work in progress label Oct 4, 2018
}
std::sort(seqnums_to_flush.begin(), seqnums_to_flush.end(),
std::greater<SequenceNumber>());
for (auto seqnum : seqnums_to_flush) {
Contributor Author

I think I could simplify this part by only writing a single tombstone for this fragment using the largest seqnum.
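A minimal sketch of that simplification, assuming seqnums_to_flush is non-empty, cur_start_key/cur_end_key are in scope as in the surrounding fragmenter code, std::max_element from <algorithm> is available, and output_tombstones is a hypothetical container (this is not the code in the PR):

```cpp
// Hypothetical simplification: emit one fragment tagged with the largest
// seqnum instead of one fragment per seqnum.
SequenceNumber max_seqnum =
    *std::max_element(seqnums_to_flush.begin(), seqnums_to_flush.end());
output_tombstones.emplace_back(cur_start_key, cur_end_key, max_seqnum);
```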

// the internal key ordering already provided by the input iterator. If there
// are few overlaps, creating a FragmentedRangeTombstoneIterator should be
// O(n), while the RangeDelAggregator tombstone collapsing is always O(n log n).
class FragmentedRangeTombstoneIterator : public InternalIterator {
Contributor Author

A note about making this an InternalIterator: in the follow-up work I'm planning, SSTs will cache their FragmentedRangeTombstoneIterators, since their contents won't change (though this will require some additional logic to disable snapshot filtering and the dropping of covered fragments). These fragmented iterators will be used everywhere a range tombstone iterator is required from an SST, so the class needs to be compatible with the InternalIterator interface.

@abhimadan
Contributor Author

After all the allocation improvements, the benchmark actually changed significantly here:

readrandom   :       4.917 micros/op 203382 ops/sec;   15.7 MB/s (63103 of 100000 found)

I'll update the PR description with this (for posterity, the old number was ~9.5MB/s).

Contributor

@facebook-github-bot facebook-github-bot left a comment

abhimadan has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Summary: Previously, range tombstones were accumulated from every level, which
was necessary if a range tombstone in a higher level covered a key in a lower
level. However, RangeDelAggregator::AddTombstones's complexity is based on
the number of tombstones that are currently stored in it, which is wasteful in
the Get case, where we only need to know the highest sequence number of range
tombstones that cover the key. This change introduces this optimization, and
removes the use of RangeDelAggregator from the Get path.

Preliminary db_bench results show a readrandom throughput increase from 5.7 MB/s
to 8.7 MB/s.

This change is still a WIP since it does not handle snapshots.

Test Plan: make check
@facebook-github-bot
Contributor

@abhimadan has updated the pull request.

Contributor

@facebook-github-bot facebook-github-bot left a comment

abhimadan has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ajkr
Contributor

ajkr commented Oct 22, 2018

Btw, have you run db_stress on this to verify correctness? One special config I've been trying for DeleteRange lately is using a merge operator with -max_key set to a low value, to try to expose issues with files that have overlapping endpoints.

// meta block into an iterator over non-overlapping tombstone fragments. The
// tombstone fragmentation process should be more efficient than the range
// tombstone collapsing algorithm in RangeDelAggregator because this leverages
// the internal key ordering already provided by the input iterator. If there
Contributor

the range tombstone meta-blocks were not sorted in older versions, I believe. That was a change Nikhil made.

Contributor

You could do a linear scan to check the ordering, and sort only if it isn't already sorted. Then it'll at least stay O(n) in the most common case.
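A small self-contained sketch of that approach (the Tombstone struct and function name here are illustrative, not RocksDB's RangeTombstone or the fragmenter code):

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

struct Tombstone {
  std::string start_key;
  std::string end_key;
  uint64_t seq;
};

// Sort by start key only if the input isn't already ordered, so the common
// (already-sorted) case stays O(n) instead of always paying O(n log n).
void SortByStartKeyIfNeeded(std::vector<Tombstone>* tombstones) {
  auto by_start = [](const Tombstone& a, const Tombstone& b) {
    return a.start_key < b.start_key;
  };
  if (!std::is_sorted(tombstones->begin(), tombstones->end(), by_start)) {
    std::sort(tombstones->begin(), tombstones->end(), by_start);
  }
}
```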

Contributor Author

Thanks for pointing this out. I've removed that assumption while trying to make the sorted case fast. I think there's still some slowdown though, so I'll run the benchmarks again and compare the results.

Contributor Author

Yeah, it slowed down by about 3x. Here's the updated readrandom result (using the same commands as in the PR description):

readrandom   :      11.147 micros/op 89707 ops/sec;    6.9 MB/s (63103 of 100000 found)

Contributor Author

Fortunately, I think the effect is minimal in #4493; I ran some cursory benchmarks on that PR after rebasing and they don't seem to have changed.

@facebook-github-bot
Contributor

@abhimadan has updated the pull request.

Contributor

@facebook-github-bot facebook-github-bot left a comment

abhimadan has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@abhimadan has updated the pull request.

Contributor

@facebook-github-bot facebook-github-bot left a comment

abhimadan has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@abhimadan
Contributor Author

I ran the following db_stress command and didn't see any problems:

TEST_TMPDIR=/dev/shm ./db_stress --max_background_compactions=8 --subcompactions=8 --memtablerep=skip_list --acquire_snapshot_one_in=10000 --delpercent=4 --delrangepercent=1 --snapshot_hold_ops=100000 --allow_concurrent_memtable_write=1 --compact_files_one_in=10000 --clear_column_family_one_in=0 --writepercent=35 --readpercent=25 --write_buffer_size=1048576 --max_bytes_for_level_base=4194304 --target_file_size_base=1048576 --column_families=1 --compact_range_one_in=10000 --open_files=-1 --max_key=10000000 --prefixpercent=25 --ops_per_thread=1000000 --use_merge=true --max_key=100

Contributor

@ajkr ajkr left a comment

Looking good so far, thanks for the excellent code quality.


RangeDelAggregator* range_del_agg, SequenceNumber* seq,
const ReadOptions& read_opts, ReadCallback* callback,
bool* is_blob_index) {
SequenceNumber* max_covering_tombstone_seq,
Contributor

If max_covering_tombstone_seq is nonzero, can we return immediately, knowing that a covering range tombstone in any earlier memtable must cover the key?

I noticed that in the next PR you do this for SST files (well, using the file's largest seqnum), but I didn't see any early return for memtables.

Contributor Author

The main reason I didn't add this check to the memtable path is that I thought having several memtables was fairly uncommon, and they're usually small enough that it doesn't matter whether a tombstone has already been found. (I also didn't realize that max_covering_tombstone_seq being nonzero was sufficient; I think you're right that it is, since the memtables, each L0 file, and each L1+ level cover disjoint bands of sequence numbers.) I'll add this check to the second PR.
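A rough sketch of the kind of memtable early return being discussed (SimpleMemTable and the function below are illustrative stand-ins, not the actual MemTableListVersion code):

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

using SequenceNumber = uint64_t;

// Illustrative stand-in for a memtable's read-side interface.
struct SimpleMemTable {
  // Returns true and fills *value if `key` has a visible entry that is not
  // covered by a tombstone whose seqnum exceeds the entry's seqnum.
  bool GetPoint(const std::string& key, SequenceNumber max_covering_seq,
                std::string* value) const;
  // Largest seqnum of any range tombstone in this memtable covering `key`.
  SequenceNumber MaxCoveringTombstoneSeq(const std::string& key) const;
};

// Search memtables newest-to-oldest. Once any covering tombstone has been
// seen, every entry in an older memtable must be covered, so stop early.
bool GetFromMemTables(
    const std::vector<SimpleMemTable>& memtables_newest_first,
    const std::string& key, std::string* value) {
  SequenceNumber max_covering_tombstone_seq = 0;
  for (const SimpleMemTable& mem : memtables_newest_first) {
    if (max_covering_tombstone_seq > 0) {
      return false;  // covered by a tombstone from a newer memtable
    }
    max_covering_tombstone_seq = std::max(
        max_covering_tombstone_seq, mem.MaxCoveringTombstoneSeq(key));
    if (mem.GetPoint(key, max_covering_tombstone_seq, value)) {
      return true;
    }
  }
  return false;
}
```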


#include "db/range_tombstone_fragmenter.h"

#include <set>
Contributor

we usually alphabetize imports. I wonder if make format does this automatically.

Contributor Author

It does do that; I didn't run it earlier since I usually run clang-format from my editor, but it looks good now. I'll get into the habit of running make format before sending out PRs.

@facebook-github-bot
Contributor

@abhimadan has updated the pull request.

Contributor

@facebook-github-bot facebook-github-bot left a comment

abhimadan has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@abhimadan has updated the pull request.

Contributor

@facebook-github-bot facebook-github-bot left a comment

abhimadan has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Contributor

@ajkr ajkr left a comment

lgtm, great work!


// Given the next start key in unfragmented_tombstones,
// flush_current_tombstones writes every tombstone fragment that starts
// and ends with a key before next_start_key.
Contributor

is "starts at or after cur_start_key" another constraint on the fragments output by this function?

Contributor Author

Yeah, it is another constraint (cur_start_key is updated inside flush_current_tombstones, but the constraint holds with respect to its value when the function is called).

cur_end_key = next_start_key;
}
SequenceNumber max_seqnum = 0;
for (auto flush_it = it; flush_it != cur_end_keys.end(); ++flush_it) {
Contributor

Think I finally understand this. One suggestion I have for readability is to write down the invariants and how they're maintained as much as possible. Invariants can also be "documented" by adding asserts, or even more complicated verification logic surrounded by #ifndef NDEBUG.

For example, I think the tombstones whose end keys are in [it, cur_end_keys.end()) all fully overlap the range [cur_start_key, cur_end_key). The reason, as I understand it, is:
(1) cur_end_keys has been populated only with range tombstones that start at or before cur_start_key.
(2) cur_end_key is at most *it, and all later elements in cur_end_keys must be even larger than *it.
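A minimal sketch of such a debug-only check, simplified to std::string end keys (this is not the code in the PR; the function name and parameter types are illustrative):

```cpp
#include <cassert>
#include <set>
#include <string>

// Debug-only verification of the invariant above: every end key in
// [it, cur_end_keys.end()) is >= cur_end_key, so the corresponding tombstones
// fully overlap [cur_start_key, cur_end_key).
void CheckFlushInvariant(const std::set<std::string>& cur_end_keys,
                         std::set<std::string>::const_iterator it,
                         const std::string& cur_end_key) {
#ifndef NDEBUG
  for (auto check_it = it; check_it != cur_end_keys.end(); ++check_it) {
    assert(cur_end_key <= *check_it);
  }
#else
  (void)cur_end_keys;
  (void)it;
  (void)cur_end_key;
#endif  // NDEBUG
}
```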

Contributor Author

Thanks for the suggestion; I'll add some invariant checks. And that invariant you gave is true, but not simple to verify because cur_end_keys doesn't store the start keys of the tombstones it contains.

Contributor Author

I only ended up adding one more assertion, but I also expanded on some comments (some of which help explain the example invariant you gave). I'll land this now, but let me know if there are other things that would benefit from more explanation/assertions, and I'll add them in the follow-up PR.

@facebook-github-bot
Contributor

@abhimadan has updated the pull request.

Contributor

@facebook-github-bot facebook-github-bot left a comment

abhimadan has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@abhimadan abhimadan deleted the fragment-tombstones branch October 24, 2018 19:32
facebook-github-bot pushed a commit that referenced this pull request Oct 26, 2018
Summary:
This allows tombstone fragmenting to only be performed when the table is opened, and cached for subsequent accesses.

On the same DB used in #4449, running `readrandom` results in the following:
```
readrandom   :       0.983 micros/op 1017076 ops/sec;   78.3 MB/s (63103 of 100000 found)
```

Now that Get performance in the presence of range tombstones is reasonable, I also compared the performance between a DB with range tombstones, "expanded" range tombstones (several point tombstones that cover the same keys the equivalent range tombstone would cover, a common workaround for DeleteRange), and no range tombstones. The created DBs had 5 million keys each, and DeleteRange was called at regular intervals (depending on the total number of range tombstones being written) after 4.5 million Puts. The table below summarizes the results of a `readwhilewriting` benchmark (in order to provide somewhat more realistic results):
```
   Tombstones?    | avg micros/op | stddev micros/op |  avg ops/s   | stddev ops/s
----------------- | ------------- | ---------------- | ------------ | ------------
None              |        0.6186 |          0.04637 | 1,625,252.90 | 124,679.41
500 Expanded      |        0.6019 |          0.03628 | 1,666,670.40 | 101,142.65
500 Unexpanded    |        0.6435 |          0.03994 | 1,559,979.40 | 104,090.52
1k Expanded       |        0.6034 |          0.04349 | 1,665,128.10 | 125,144.57
1k Unexpanded     |        0.6261 |          0.03093 | 1,600,457.50 |  79,024.94
5k Expanded       |        0.6163 |          0.05926 | 1,636,668.80 | 154,888.85
5k Unexpanded     |        0.6402 |          0.04002 | 1,567,804.70 | 100,965.55
10k Expanded      |        0.6036 |          0.05105 | 1,667,237.70 | 142,830.36
10k Unexpanded    |        0.6128 |          0.02598 | 1,634,633.40 |  72,161.82
25k Expanded      |        0.6198 |          0.04542 | 1,620,980.50 | 116,662.93
25k Unexpanded    |        0.5478 |          0.0362  | 1,833,059.10 | 121,233.81
50k Expanded      |        0.5104 |          0.04347 | 1,973,107.90 | 184,073.49
50k Unexpanded    |        0.4528 |          0.03387 | 2,219,034.50 | 170,984.32
```

After a large enough quantity of range tombstones are written, range tombstone Gets can become faster than reading from an equivalent DB with several point tombstones.
Pull Request resolved: #4493

Differential Revision: D10842844

Pulled By: abhimadan

fbshipit-source-id: a7d44534f8120e6aabb65779d26c6b9df954c509