
Cache fragmented range tombstones in BlockBasedTableReader #4493

Closed

Conversation

@abhimadan (Contributor) commented Oct 15, 2018:

This allows tombstone fragmenting to be performed only when the table is opened; the fragmented tombstones are then cached for subsequent accesses.
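
The sketch below illustrates the ownership model described above, using simplified stand-in types (FragmentSketch, TableRepSketch, and RangeDelIterSketch are illustrative names, not the actual RocksDB classes): the fragmented list is built once and then shared by every iterator created afterwards.

```
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// Illustrative stand-ins, not the real RocksDB types.
struct FragmentSketch { std::string start_key, end_key; uint64_t seq; };
using FragmentListSketch = std::vector<FragmentSketch>;

struct TableRepSketch {
  // Built once when the table is opened, then treated as immutable.
  std::shared_ptr<const FragmentListSketch> fragmented_tombstones;
};

// Each range tombstone iterator shares ownership of the cached list instead
// of re-fragmenting the raw tombstones on every read.
class RangeDelIterSketch {
 public:
  explicit RangeDelIterSketch(std::shared_ptr<const FragmentListSketch> list)
      : tombstones_(std::move(list)), pos_(tombstones_->begin()) {}

 private:
  std::shared_ptr<const FragmentListSketch> tombstones_;
  FragmentListSketch::const_iterator pos_;
};
```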

On the same DB used in #4449, running readrandom results in the following:

```
readrandom   :       0.983 micros/op 1017076 ops/sec;   78.3 MB/s (63103 of 100000 found)
```

Now that Get performance in the presence of range tombstones is reasonable, I also compared the performance of a DB with range tombstones, a DB with "expanded" range tombstones (several point tombstones covering the same keys the equivalent range tombstone would cover, a common workaround for DeleteRange), and a DB with no range tombstones. Each DB had 5 million keys, and DeleteRange was called at regular intervals (depending on the total number of range tombstones being written) after 4.5 million Puts. The table below summarizes the results of a readwhilewriting benchmark (chosen to give somewhat more realistic results):

   Tombstones?    | avg micros/op | stddev micros/op |  avg ops/s   | stddev ops/s
----------------- | ------------- | ---------------- | ------------ | ------------
None              |        0.6186 |          0.04637 | 1,625,252.90 | 124,679.41
500 Expanded      |        0.6019 |          0.03628 | 1,666,670.40 | 101,142.65
500 Unexpanded    |        0.6435 |          0.03994 | 1,559,979.40 | 104,090.52
1k Expanded       |        0.6034 |          0.04349 | 1,665,128.10 | 125,144.57
1k Unexpanded     |        0.6261 |          0.03093 | 1,600,457.50 |  79,024.94
5k Expanded       |        0.6163 |          0.05926 | 1,636,668.80 | 154,888.85
5k Unexpanded     |        0.6402 |          0.04002 | 1,567,804.70 | 100,965.55
10k Expanded      |        0.6036 |          0.05105 | 1,667,237.70 | 142,830.36
10k Unexpanded    |        0.6128 |          0.02598 | 1,634,633.40 |  72,161.82
25k Expanded      |        0.6198 |          0.04542 | 1,620,980.50 | 116,662.93
25k Unexpanded    |        0.5478 |          0.0362  | 1,833,059.10 | 121,233.81
50k Expanded      |        0.5104 |          0.04347 | 1,973,107.90 | 184,073.49
50k Unexpanded    |        0.4528 |          0.03387 | 2,219,034.50 | 170,984.32

Once a large enough number of range tombstones has been written, Gets in a DB with range tombstones can become faster than reads from an equivalent DB with several point tombstones.

Test Plan: make check

@abhimadan added the WIP (work in progress) label Oct 15, 2018
```
NewDataBlockIterator<DataBlockIter>(
    rep_, read_options, rep_->range_del_handle));
}
auto new_tombstone_fragments = std::make_shared<FragmentedRangeTombstoneList>(
```
@abhimadan (Contributor, Author) commented on the diff above:

There's a race here. If two threads try to call NewRangeTombstoneIterator, they may both create a new FragmentedRangeTombstoneList and overwrite rep_->fragmented_tombstones. Probably need to serialize access here.
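
One possible way to serialize that access, sketched with hypothetical names (fragment_mutex_, GetOrBuildFragments, and the surrounding types are illustrative, not the PR's actual members):

```
#include <cstdint>
#include <memory>
#include <mutex>
#include <string>
#include <vector>

struct FragmentSketch { std::string start_key, end_key; uint64_t seq; };
using FragmentListSketch = std::vector<FragmentSketch>;

class RepSketch {
 public:
  // Lazily builds the fragmented list the first time any reader asks for it;
  // the mutex ensures only one thread performs the fragmentation, and later
  // callers reuse the cached result instead of overwriting it.
  std::shared_ptr<const FragmentListSketch> GetOrBuildFragments() {
    std::lock_guard<std::mutex> guard(fragment_mutex_);
    if (!fragmented_tombstones_) {
      fragmented_tombstones_ = BuildFragments();
    }
    return fragmented_tombstones_;
  }

 private:
  std::shared_ptr<const FragmentListSketch> BuildFragments() {
    // Placeholder for reading the range-del block and fragmenting it.
    return std::make_shared<FragmentListSketch>();
  }

  std::mutex fragment_mutex_;
  std::shared_ptr<const FragmentListSketch> fragmented_tombstones_;
};
```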

@abhimadan force-pushed the cache-table-fragment branch 7 times, most recently from 9a5edb5 to 60c6f3a, on October 22, 2018 at 23:28
@abhimadan force-pushed the cache-table-fragment branch 6 times, most recently from 944c3ee to f1e5c97, on October 24, 2018 at 18:28
facebook-github-bot pushed a commit that referenced this pull request Oct 24, 2018
Summary:
Previously, range tombstones were accumulated from every level, which
was necessary if a range tombstone in a higher level covered a key in a lower
level. However, RangeDelAggregator::AddTombstones's complexity is based on
the number of tombstones that are currently stored in it, which is wasteful in
the Get case, where we only need to know the highest sequence number of range
tombstones that cover the key from higher levels, and compute the highest covering
sequence number at the current level. This change introduces this optimization, and
removes the use of RangeDelAggregator from the Get path.

In the benchmark results, the following command was used to initialize the database:
```
./db_bench -db=/dev/shm/5k-rts -use_existing_db=false -benchmarks=filluniquerandom -write_buffer_size=1048576 -compression_type=lz4 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -value_size=112 -key_size=16 -block_size=4096 -level_compaction_dynamic_level_bytes=true -num=5000000 -max_background_jobs=12 -benchmark_write_rate_limit=20971520 -range_tombstone_width=100 -writes_per_range_tombstone=100 -max_num_range_tombstones=50000 -bloom_bits=8
```

...and the following command was used to measure read throughput:
```
./db_bench -db=/dev/shm/5k-rts/ -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=5000000 -reads=100000 -threads=32
```

The filluniquerandom command was only run once, and the resulting database was used
to measure read performance before and after the PR. Both binaries were compiled with
`DEBUG_LEVEL=0`.

Readrandom results before PR:
```
readrandom   :       4.544 micros/op 220090 ops/sec;   16.9 MB/s (63103 of 100000 found)
```

Readrandom results after PR:
```
readrandom   :      11.147 micros/op 89707 ops/sec;    6.9 MB/s (63103 of 100000 found)
```

So it's actually slower right now, but this PR paves the way for future optimizations (see #4493).

----
Pull Request resolved: #4449

Differential Revision: D10370575

Pulled By: abhimadan

fbshipit-source-id: 9a2e152be1ef36969055c0e9eb4beb0d96c11f4d
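
As a rough illustration of the optimization the commit message above describes, the Get path only needs the largest sequence number among a level's tombstones that cover the lookup key. A simplified sketch with stand-in types (TombstoneSketch and MaxCoveringSeq are illustrative; the real code operates on sorted fragments and internal keys):

```
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

struct TombstoneSketch { std::string start_key, end_key; uint64_t seq; };

// For a Get, the only thing needed from a level's range tombstones is the
// largest sequence number among tombstones covering the lookup key; that
// value is carried down and compared against candidate keys' sequence numbers.
uint64_t MaxCoveringSeq(const std::vector<TombstoneSketch>& tombstones,
                        const std::string& user_key) {
  uint64_t max_seq = 0;
  for (const auto& t : tombstones) {
    if (user_key >= t.start_key && user_key < t.end_key) {  // [start, end)
      max_seq = std::max(max_seq, t.seq);
    }
  }
  return max_seq;
}
```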
@abhimadan force-pushed the cache-table-fragment branch 2 times, most recently from 885f6d6 to 0bb0198, on October 24, 2018 at 20:27
@abhimadan removed the WIP (work in progress) label Oct 24, 2018
@abhimadan requested a review from ajkr on October 24, 2018 at 21:08
@facebook-github-bot (Contributor) left a comment:

abhimadan has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

This allows tombstone fragmenting to only be performed when the table is
first opened, and cached for subsequent accesses.

@facebook-github-bot (Contributor) commented:

@abhimadan has updated the pull request. Re-import the pull request

@facebook-github-bot (Contributor) left a comment:

abhimadan has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ajkr (Contributor) left a comment:

nice design and code, lgtm

```
@@ -66,7 +100,7 @@ class FragmentedRangeTombstoneIterator : public InternalIterator {
   };

   void MaybePinKey() const {
-    if (pos_ != tombstones_.end() && pinned_pos_ != pos_) {
+    if (pos_ != tombstones_->end() && pinned_pos_ != pos_) {
       current_start_key_.Set(pos_->start_key_, pos_->seq_, kTypeRangeDeletion);
       pinned_pos_ = pos_;
```
@ajkr (Contributor) commented on the diff above:

I forgot to ask about this on the last PR. What does pinned_pos_ do?

@abhimadan (Contributor, Author) replied:

pinned_pos_ points to the key that's currently pinned. It's mainly used to avoid re-pinning a key that's already pinned.
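
A stripped-down sketch of that memoization, using simplified stand-in types (IterSketch and friends are illustrative; the real iterator pins an InternalKey built from the fragment's start key and sequence number):

```
#include <cstdint>
#include <string>
#include <vector>

struct FragmentSketch { std::string start_key; uint64_t seq; };
using FragmentListSketch = std::vector<FragmentSketch>;

class IterSketch {
 public:
  explicit IterSketch(const FragmentListSketch* tombstones)
      : tombstones_(tombstones),
        pos_(tombstones->begin()),
        pinned_pos_(tombstones->end()) {}

  void MaybePinKey() {
    // Rebuild the pinned key only when the iterator has moved since the last
    // pin; if pinned_pos_ already equals pos_, the cached key is still valid.
    if (pos_ != tombstones_->end() && pinned_pos_ != pos_) {
      current_start_key_ = pos_->start_key;
      pinned_pos_ = pos_;
    }
  }

 private:
  const FragmentListSketch* tombstones_;
  FragmentListSketch::const_iterator pos_;
  FragmentListSketch::const_iterator pinned_pos_;
  std::string current_start_key_;
};
```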

```
@@ -115,6 +116,9 @@ class BlockBasedTable : public TableReader {
   InternalIterator* NewRangeTombstoneIterator(
       const ReadOptions& read_options) override;

+  InternalIterator* NewUnfragmentedRangeTombstoneIterator(
```
@ajkr (Contributor) commented on the diff above:

should it be private?

@abhimadan (Contributor, Author) replied:

Yeah it should be; thanks for catching that!

@facebook-github-bot (Contributor) commented:

@abhimadan has updated the pull request. Re-import the pull request

@facebook-github-bot (Contributor) left a comment:

abhimadan has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
