
Fix major bug with MultiGet, DeleteRange, and memtable Bloom #9453

Conversation

@pdillinger (Contributor) commented Jan 26, 2022

Summary: MemTable::MultiGet was not considering range tombstones before
querying the Bloom filter. This meant range tombstones would be skipped for
keys (or prefixes) with no other entries in the memtable, which could cause
old values for a key (in SST files) to still show up until the range
tombstone covering it was flushed.

This is fixed by essentially disabling the memtable Bloom filter when there
are any range tombstones. (This could be better optimized in the future, but
good enough for now.)
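
To make the fix concrete, here is a minimal sketch of the fixed control flow; the stand-in types (BloomStub, MemTableSketch, has_range_tombstones_) are illustrative, not the actual RocksDB internals:

```cpp
#include <string>
#include <vector>

// Minimal sketch of the fixed control flow; stand-in types, not the real
// RocksDB internals. The essential change: decide whether any range
// tombstones exist *before* consulting the memtable Bloom filter, and
// bypass the filter entirely when they do, since the filter only tracks
// point entries.
struct BloomStub {
  bool MayContain(const std::string& /*key_or_prefix*/) const {
    return false;  // stand-in; a real filter answers probabilistically
  }
};

struct MemTableSketch {
  const BloomStub* bloom_filter_ = nullptr;
  bool has_range_tombstones_ = false;

  void MultiGet(const std::vector<std::string>& keys,
                std::vector<bool>* may_match) const {
    const bool no_range_del = !has_range_tombstones_;
    for (size_t i = 0; i < keys.size(); ++i) {
      if (bloom_filter_ != nullptr && no_range_del) {
        // Safe to prune: any relevant memtable entry is a point entry
        // that the filter has seen.
        (*may_match)[i] = bloom_filter_->MayContain(keys[i]);
      } else {
        // Range tombstones present (or no filter): keep the key and do
        // the full lookup, which is the step that sees range tombstones.
        (*may_match)[i] = true;
      }
    }
  }
};
```

The pre-fix bug corresponds to taking the first branch unconditionally, letting the filter rule out a key whose only relevant memtable entry is a range tombstone.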

Did some other cleanup/optimization in the same code to (more than) offset
the cost of checking for range tombstones in more cases. There is now a
notable improvement when memtable_whole_key_filtering and prefix_extractor
are used together (an unusual combination), and this brings MultiGet closer
to the Get implementation.

Test Plan: new unit test added. Added memtable Bloom to crash test.

Performance testing

Build WAL-only DB (recovers to memtable):

```
TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=1000000 -write_buffer_size=250000000
```

Query test command, to maximize sensitivity to the changed code:

```
TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -use_existing_db -readonly -benchmarks=multireadrandom -num=10000000 -write_buffer_size=250000000 -memtable_bloom_size_ratio=0.015 -multiread_batched -batch_size=24 -threads=8 -memtable_whole_key_filtering=$MWKF -prefix_size=$PXS
```

(Note: -num here is 10x larger so that most lookups miss the memtable.)

Before & after run simultaneously, average over 10 iterations per data point, ops/sec.

| Configuration | Before (ops/sec) | After (ops/sec) |
| --- | --- | --- |
| MWKF=0 PXS=0 (Bloom disabled) | 5724844 | 6722066 |
| MWKF=0 PXS=7 (prefixes hardly unique; Bloom not useful) | 9981319 | 10237990 |
| MWKF=0 PXS=8 (prefixes unique; Bloom useful) | 12081715 | 12117603 |
| MWKF=1 PXS=0 (whole key Bloom useful) | 11944354 | 12096085 |
| MWKF=1 PXS=7 (whole key Bloom useful in new version; prefixes not useful in old version) | 9444299 | 11826029 |
| MWKF=1 PXS=8 (whole key Bloom useful in new version; prefixes useful in old version) | 11784465 | 11778591 |

Only in this last case is the 'before' slightly faster, perhaps because hashing prefixes is slightly faster than hashing whole keys. Otherwise, 'after' is faster.

@facebook-github-bot

@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@ajkr (Contributor) left a comment

LGTM

```cpp
        PERF_COUNTER_ADD(bloom_memtable_hit_count, 1);
      } else if (whole_key) {
        bloom_keys[num_keys] = iter->ukey_without_ts;
        range_indexes[num_keys++] = iter.index();
      } else if (prefix_extractor_->InDomain(iter->ukey_without_ts)) {
```

@ajkr:
Was it wrong to do this without checking memtable_whole_key_filtering?

@anand1976:
Don't think so. The comment for memtable_prefix_bloom_size_ratio says the feature is disabled if both prefix_extractor and memtable_whole_key_filtering are not set. And if both are set but the key is not in domain, we would have just skipped the Bloom filter and called GetFromTable.

@pdillinger (Author):
Like Anand said, just suboptimal, not wrong.
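
For context, a condensed sketch of the dispatch in the snippet above; BloomProbeKey is a hypothetical helper written for illustration, while SliceTransform::InDomain/Transform are the real RocksDB interfaces:

```cpp
#include "rocksdb/slice.h"
#include "rocksdb/slice_transform.h"

// Hypothetical helper condensing the dispatch in the diff above.
// whole_key reflects memtable_whole_key_filtering; when it is false, a
// configured prefix_extractor supplies the probe, but only for keys in
// its domain.
rocksdb::Slice BloomProbeKey(const rocksdb::Slice& ukey_without_ts,
                             bool whole_key,
                             const rocksdb::SliceTransform* prefix_extractor) {
  if (whole_key) {
    return ukey_without_ts;  // probe the Bloom filter with the whole key
  }
  if (prefix_extractor != nullptr &&
      prefix_extractor->InDomain(ukey_without_ts)) {
    // probe the Bloom filter with the extracted prefix
    return prefix_extractor->Transform(ukey_without_ts);
  }
  // Out of domain: no Bloom probe; the key goes straight to GetFromTable.
  return rocksdb::Slice();
}
```

As the thread concludes, reaching the InDomain branch without whole-key filtering configured is at worst suboptimal, not a correctness problem, since the memtable Bloom feature only exists when at least one of the two options is set.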

@anand1976 (Contributor) left a comment

LGTM. Great catch!


@pdillinger (Author):

I am reconsidering the approach based on the apparent small perf regression.

@pdillinger (Author):

A subtle quirk of this implementation is that any range tombstones effectively disable the memtable Bloom filter. Making that more clear in the code should simplify & optimize a bit.

@facebook-github-bot

@pdillinger has updated the pull request. You must reimport the pull request before landing.

```cpp
  std::array<Slice*, MultiGetContext::MAX_BATCH_SIZE> keys;
  std::array<bool, MultiGetContext::MAX_BATCH_SIZE> may_match = {{true}};
  autovector<Slice, MultiGetContext::MAX_BATCH_SIZE> prefixes;
  if (bloom_filter_ && no_range_del) {
```

@ajkr:
You want to skip all filter checks when any range tombstone is present? I thought in the previous version of this PR it only skipped checking for a given lookup key when a range tombstone covering that particular lookup key was found.

@ajkr:
Never mind, it was always like that. I guess you succeeded at "Making that more clear in the code"

@pdillinger (Author):
> You want to skip all filter checks when any range tombstone is present?

I think that's best for now (backport-able with low risk)

> I thought in the previous version ...

Looks can be deceiving. If you dig into the details, range_del_iter == nullptr is logically equivalent to no_range_del. I believe you'd have to check max_covering_tombstone_seq == 0 for a per-key coverage check, but I'm not super confident, and that's more than I want to write unit tests for at the moment.

@ajkr:
Sorry, let me start from the beginning. I originally wondered why we skip the filter at all when there are range tombstones. The filter checks seem useful whether we found a range tombstone or not. Perhaps that can be explained in the comment/TODO -- or, if we don't need to skip, perhaps it can give the optimization without introducing edge cases to test.
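
For reference, a hedged sketch of the per-key alternative raised in this thread, which the PR does not adopt; the surrounding types are illustrative, while max_covering_tombstone_seq is the field discussed above:

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Illustrative sketch only, not this PR's implementation: skip the Bloom
// filter just for lookup keys that some range tombstone covers (signaled
// by a nonzero max_covering_tombstone_seq) instead of disabling it
// wholesale. As noted above, validating this variant has edge cases that
// the blanket skip avoids.
struct KeyContextSketch {
  std::string ukey;
  uint64_t max_covering_tombstone_seq = 0;  // 0 means no covering tombstone
  bool may_match = true;
};

void FilterKeys(std::vector<KeyContextSketch>* keys,
                const std::function<bool(const std::string&)>& bloom_probe) {
  for (KeyContextSketch& k : *keys) {
    if (bloom_probe && k.max_covering_tombstone_seq == 0) {
      // Not covered by any range tombstone, so the point-entry Bloom
      // filter can be trusted to rule this key out.
      k.may_match = bloom_probe(k.ukey);
    }
    // Covered keys keep may_match == true and do the full lookup.
  }
}
```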

Summary: MemTable::MultiGet was not considering range tombstones before
querying Bloom filter. This means range tombstones would be skipped for
keys (or prefixes) with no other entries in the memtable. This could cause
old values for a key (in SST files) to still show up until the range
tombstone covering it has been flushed.

This is fixed by considering any range tombstones in the memtable before
querying the memtable Bloom filter.

Did some other cleanup/optimization in the same code that might help.

Test Plan: new unit test added. Added memtable Bloom to crash test.

TODO: performance testing
@pdillinger force-pushed the multiget_deleterange_memtable_bloom_bug branch from 1e079f3 to ed66040 on January 27, 2022 20:20
@facebook-github-bot

@pdillinger has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot

@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot

@pdillinger has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot

@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

pdillinger added a commit to pdillinger/rocksdb that referenced this pull request Jan 27, 2022
Summary: Changes in facebook#9453 could trigger
```
stderr:
Error: prefixpercent is non-zero while prefix_size is not positive!
```

Test Plan: run `make blackbox_crashtest` for a long time
facebook-github-bot pushed a commit that referenced this pull request Jan 28, 2022
Summary:
Changes in #9453 could trigger
```
stderr:
Error: prefixpercent is non-zero while prefix_size is not positive!
```

Pull Request resolved: #9461

Test Plan: run `make blackbox_crashtest` for a long time

Reviewed By: ajkr

Differential Revision: D33830751

Pulled By: pdillinger

fbshipit-source-id: be88377dcaa47e4bb7adb0347762639eff8f1476
pdillinger added a commit to pdillinger/rocksdb that referenced this pull request Jan 31, 2022

pdillinger added a commit to pdillinger/rocksdb that referenced this pull request Jan 31, 2022

pdillinger added a commit to pdillinger/rocksdb that referenced this pull request Jan 31, 2022

pdillinger added a commit to pdillinger/rocksdb that referenced this pull request Jan 31, 2022