Use only "local" range tombstones during Get #4449
Conversation
Another future optimization would be to cache an SST's fragmented range tombstones in its table reader. This would also be useful for the iterator portion of the DeleteRange redesign.
db/range_tombstone_fragmenter.cc
}
std::sort(seqnums_to_flush.begin(), seqnums_to_flush.end(),
          std::greater<SequenceNumber>());
for (auto seqnum : seqnums_to_flush) {
I think I could simplify this part by only writing a single tombstone for this fragment using the largest seqnum.
// the internal key ordering already provided by the input iterator. If there
// are few overlaps, creating a FragmentedRangeTombstoneIterator should be
// O(n), while the RangeDelAggregator tombstone collapsing is always O(n log n).
class FragmentedRangeTombstoneIterator : public InternalIterator {
A note about making this an InternalIterator: in the follow-up work I'm planning, SSTs will cache their FragmentedRangeTombstoneIterators, since their contents won't change (though this will require some additional logic to disable snapshot filtering and the dropping of covered fragments). These fragmented iterators will be used everywhere a range tombstone iterator is required from an SST, and as a result the class needs to be compatible with the InternalIterator interface.
After all the allocation improvements, the benchmark actually changed significantly here:
I'll update the PR description with this (for posterity, the old number was ~9.5MB/s).
abhimadan has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: Previously, range tombstones were accumulated from every level, which was necessary if a range tombstone in a higher level covered a key in a lower level. However, RangeDelAggregator::AddTombstones's complexity is based on the number of tombstones that are currently stored in it, which is wasteful in the Get case, where we only need to know the highest sequence number of range tombstones that cover the key. This change introduces this optimization, and removes the use of RangeDelAggregator from the Get path. Preliminary db_bench results show a readrandom throughput increase from 5.7 MB/s to 8.7 MB/s. This change is still a WIP since it does not handle snapshots.

Test Plan: make check
@abhimadan has updated the pull request.
Btw, have you run …
db/range_tombstone_fragmenter.h
// meta block into an iterator over non-overlapping tombstone fragments. The
// tombstone fragmentation process should be more efficient than the range
// tombstone collapsing algorithm in RangeDelAggregator because this leverages
// the internal key ordering already provided by the input iterator. If there
the range tombstone meta-blocks were not sorted in older versions, I believe. That was a change Nikhil made.
you could do a linear scan to check ordering, then sort only if not. Then it'll at least stay O(n) in the most common case.
Thanks for pointing this out. I've removed that assumption while trying to make the sorted case fast. I think there's still some slowdown though, so I'll run the benchmarks again and compare the results.
Yeah, it slowed down by about 3x. Here's the updated readrandom result (using the same commands as in the PR description):
readrandom : 11.147 micros/op 89707 ops/sec; 6.9 MB/s (63103 of 100000 found)
Fortunately, I think the effect is minimal in #4493; I ran some cursory benchmarks on that PR after rebasing and they don't seem to have changed.
@abhimadan has updated the pull request.
@abhimadan has updated the pull request.
I ran the following db_stress command and didn't see any problems:
Looking good so far, thanks for the excellent code quality.
RangeDelAggregator* range_del_agg, SequenceNumber* seq,
const ReadOptions& read_opts, ReadCallback* callback,
bool* is_blob_index) {
SequenceNumber* max_covering_tombstone_seq,
If max_covering_tombstone_seq is nonzero, can we return immediately, knowing that a covering range tombstone in any earlier memtable must be covering the key?
I noticed in the next PR you do this for SST files (well, using the file's largest seqnum), but I didn't see any early return for memtables.
The main reason I didn't add this check to memtable is because I thought that having several memtables was fairly uncommon, and they're usually small enough that it doesn't matter whether or not a tombstone has been found (though I also didn't realize that max_covering_tombstone_seq being nonzero was sufficient; I think you're right that it should be, since the memtable, each L0 file, and each L1+ level cover disjoint bands of sequence numbers). I'll add this check to the second PR.
db/range_tombstone_fragmenter.cc
#include "db/range_tombstone_fragmenter.h"

#include <set>
we usually alphabetize imports. I wonder if make format does this automatically.
It does do that; I didn't run it earlier since I usually run clang-format from my editor, but it looks good now. I'll get into the habit of running make format before sending out PRs.
@abhimadan has updated the pull request.
@abhimadan has updated the pull request.
lgtm, great work!
db/range_tombstone_fragmenter.cc
// Given the next start key in unfragmented_tombstones,
// flush_current_tombstones writes every tombstone fragment that starts
// and ends with a key before next_start_key.
is "starts at or after cur_start_key" another constraint on the fragments output by this function?
Yeah, it is another constraint (cur_start_key is updated inside flush_current_tombstones, but the constraint holds relative to its value when the function is called).
cur_end_key = next_start_key;
}
SequenceNumber max_seqnum = 0;
for (auto flush_it = it; flush_it != cur_end_keys.end(); ++flush_it) {
Think I finally understand this. One suggestion I have for readability is writing down the invariants and how they're maintained as much as possible. Invariants can also be "documented" by adding asserts, or even more complicated verification logic surrounded by #ifndef NDEBUG.
For example, I think here [it, cur_end_keys.end()) all fully overlap the range [cur_start_key, cur_end_key). And the reason for this (my understanding) is:
(1) cur_end_keys has been populated only with range tombstones that start before or at cur_start_key.
(2) cur_end_key is at most *it, and all later elements in cur_end_keys must be even larger than *it.
Thanks for the suggestion; I'll add some invariant checks. And that invariant you gave is true, but not simple to verify because cur_end_keys doesn't store the start keys of the tombstones it contains.
I only ended up adding one more assertion, but I also expanded on some comments (some of which help explain the example invariant you gave). I'll land this now, but let me know if there are other things that would benefit from more explanation/assertions, and I'll add them in the follow-up PR.
@abhimadan has updated the pull request.
Summary: This allows tombstone fragmenting to only be performed when the table is opened, and cached for subsequent accesses. On the same DB used in #4449, running `readrandom` results in the following:
```
readrandom : 0.983 micros/op 1017076 ops/sec; 78.3 MB/s (63103 of 100000 found)
```
Now that Get performance in the presence of range tombstones is reasonable, I also compared the performance between a DB with range tombstones, "expanded" range tombstones (several point tombstones that cover the same keys the equivalent range tombstone would cover, a common workaround for DeleteRange), and no range tombstones. The created DBs had 5 million keys each, and DeleteRange was called at regular intervals (depending on the total number of range tombstones being written) after 4.5 million Puts. The table below summarizes the results of a `readwhilewriting` benchmark (in order to provide somewhat more realistic results):
```
Tombstones?       | avg micros/op | stddev micros/op | avg ops/s    | stddev ops/s
----------------- | ------------- | ---------------- | ------------ | ------------
None              | 0.6186        | 0.04637          | 1,625,252.90 | 124,679.41
500 Expanded      | 0.6019        | 0.03628          | 1,666,670.40 | 101,142.65
500 Unexpanded    | 0.6435        | 0.03994          | 1,559,979.40 | 104,090.52
1k Expanded       | 0.6034        | 0.04349          | 1,665,128.10 | 125,144.57
1k Unexpanded     | 0.6261        | 0.03093          | 1,600,457.50 | 79,024.94
5k Expanded       | 0.6163        | 0.05926          | 1,636,668.80 | 154,888.85
5k Unexpanded     | 0.6402        | 0.04002          | 1,567,804.70 | 100,965.55
10k Expanded      | 0.6036        | 0.05105          | 1,667,237.70 | 142,830.36
10k Unexpanded    | 0.6128        | 0.02598          | 1,634,633.40 | 72,161.82
25k Expanded      | 0.6198        | 0.04542          | 1,620,980.50 | 116,662.93
25k Unexpanded    | 0.5478        | 0.0362           | 1,833,059.10 | 121,233.81
50k Expanded      | 0.5104        | 0.04347          | 1,973,107.90 | 184,073.49
50k Unexpanded    | 0.4528        | 0.03387          | 2,219,034.50 | 170,984.32
```
After a large enough quantity of range tombstones are written, range tombstone Gets can become faster than reading from an equivalent DB with several point tombstones.

Pull Request resolved: #4493
Differential Revision: D10842844
Pulled By: abhimadan
fbshipit-source-id: a7d44534f8120e6aabb65779d26c6b9df954c509
Summary: Previously, range tombstones were accumulated from every level, which
was necessary if a range tombstone in a higher level covered a key in a lower
level. However, RangeDelAggregator::AddTombstones's complexity is based on
the number of tombstones that are currently stored in it, which is wasteful in
the Get case, where we only need to know the highest sequence number of range
tombstones that cover the key from higher levels, and compute the highest covering
sequence number at the current level. This change introduces this optimization, and
removes the use of RangeDelAggregator from the Get path.
Performance Results
In the benchmark results, the following command was used to initialize the database:
...and the following command was used to measure read throughput:
The filluniquerandom command was only run once, and the resulting database was used
to measure read performance before and after the PR. Both binaries were compiled with
DEBUG_LEVEL=0.
Readrandom results before PR:
Readrandom results after PR:
So it's actually slower right now, but this PR paves the way for future optimizations (see #4493).
Test Plan: make check