Fix major bug with MultiGet, DeleteRange, and memtable Bloom #9453
Conversation
@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
LGTM
```cpp
      PERF_COUNTER_ADD(bloom_memtable_hit_count, 1);
    } else if (whole_key) {
      bloom_keys[num_keys] = iter->ukey_without_ts;
      range_indexes[num_keys++] = iter.index();
    } else if (prefix_extractor_->InDomain(iter->ukey_without_ts)) {
```
Was it wrong to do this without checking `memtable_whole_key_filtering`?
Don't think so. The comment for `memtable_prefix_bloom_size_ratio` says the feature is disabled if neither `prefix_extractor` nor `memtable_whole_key_filtering` is set. And if both are set but the key is not in domain, we would have just skipped the Bloom filter and called GetFromTable.
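To illustrate the decision being discussed — whole-key probe vs. prefix probe vs. skipping the filter entirely — here is a minimal standalone sketch. The types and names are simplified stand-ins, not RocksDB's actual interfaces:

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Simplified stand-in for rocksdb::SliceTransform.
struct PrefixExtractor {
  size_t prefix_len;
  bool InDomain(std::string_view key) const {
    return key.size() >= prefix_len;
  }
  std::string_view Transform(std::string_view key) const {
    return key.substr(0, prefix_len);
  }
};

// Decide what to probe in the memtable Bloom filter for one lookup key.
// Returns nullopt to skip the filter and fall through to the full memtable
// lookup (GetFromTable in RocksDB).
std::optional<std::string_view> BloomProbeKey(
    std::string_view user_key, bool whole_key_filtering,
    const PrefixExtractor* prefix_extractor) {
  if (whole_key_filtering) {
    return user_key;  // probe the whole key
  }
  if (prefix_extractor != nullptr && prefix_extractor->InDomain(user_key)) {
    return prefix_extractor->Transform(user_key);  // probe the prefix
  }
  // Neither whole-key filtering nor an applicable prefix: the filter is
  // effectively disabled for this key, so do the full lookup.
  return std::nullopt;
}
```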
Like Anand said, just suboptimal, not wrong.
LGTM. Great catch!
I am reconsidering the approach based on the apparent small perf regression.
A subtle quirk of this implementation is that any range tombstone effectively disables the memtable Bloom filter. Making that more clear in the code should simplify & optimize a bit.
@pdillinger has updated the pull request. You must reimport the pull request before landing.
```cpp
  // One slot per lookup key in the MultiGet batch.
  std::array<Slice*, MultiGetContext::MAX_BATCH_SIZE> keys;
  std::array<bool, MultiGetContext::MAX_BATCH_SIZE> may_match = {{true}};
  autovector<Slice, MultiGetContext::MAX_BATCH_SIZE> prefixes;
  // Any range tombstone in this memtable disables its Bloom filter check.
  if (bloom_filter_ && no_range_del) {
```
You want to skip all filter checks when any range tombstone is present? I thought in the previous version of this PR it only skipped checking for a given lookup key when a range tombstone covering that particular lookup key was found.
Never mind, it was always like that. I guess you succeeded at "Making that more clear in the code"
> You want to skip all filter checks when any range tombstone is present?

I think that's best for now (backport-able with low risk).

> I thought in the previous version ...

Looks can be deceiving. If you dig into the details, `range_del_iter == nullptr` is logically equivalent to `no_range_del`. I believe you'd have to check `max_covering_tombstone_seq == 0` for a per-key coverage check, but I'm not super confident, and that's more than I want to write unit tests for at the moment.
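For illustration, here is a standalone sketch of the hypothetical per-key gating mentioned above. The types are simplified stand-ins, not RocksDB internals, and — as noted in the thread — this finer-grained approach is untested; the PR deliberately disables the filter whenever any range tombstone exists instead:

```cpp
#include <cstdint>
#include <string_view>

// Simplified stand-in; RocksDB's real per-key lookup state is more involved.
struct KeyLookup {
  std::string_view user_key;
  // Sequence number of the newest range tombstone covering this key,
  // or 0 if no tombstone covers it.
  uint64_t max_covering_tombstone_seq = 0;
  bool may_match = true;  // whether the memtable must still be consulted
};

// Hypothetical per-key gating: only trust a Bloom "no" for keys that no
// range tombstone covers.
template <typename BloomMayContainFn>
void ApplyMemtableBloom(KeyLookup* k, BloomMayContainFn&& bloom_may_contain) {
  if (k->max_covering_tombstone_seq == 0) {
    // No covering tombstone: a Bloom miss safely rules out the memtable.
    k->may_match = bloom_may_contain(k->user_key);
  } else {
    // Covered by a tombstone: the memtable must be consulted regardless,
    // so the Bloom result cannot be used to skip it.
    k->may_match = true;
  }
}
```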
Sorry, let me start from the beginning. I originally wondered why we skip the filter at all when there are range tombstones. Bloom results seem useful whether we found a range tombstone or not. Perhaps that can be explained in the comment/TODO -- or, if we don't need to skip, perhaps it can give the optimization without introducing edge cases to test.
Summary: MemTable::MultiGet was not considering range tombstones before querying the Bloom filter. This means range tombstones would be skipped for keys (or prefixes) with no other entries in the memtable. This could cause old values for a key (in SST files) to still show up until the range tombstone covering it has been flushed. This is fixed by considering any range tombstones in the memtable before querying the memtable Bloom filter. Did some other cleanup/optimization in the same code that might help.

Test Plan: new unit test added. Added memtable Bloom to crash test. TODO: performance testing
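A minimal sketch of the failure scenario described above, using the public RocksDB API. The DB path and the choice of the batched MultiGet overload are illustrative; before this fix, the final lookup could return the stale SST value:

```cpp
#include <cassert>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // Enable the memtable Bloom filter, including on whole keys.
  options.memtable_prefix_bloom_size_ratio = 0.02;
  options.memtable_whole_key_filtering = true;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s =
      rocksdb::DB::Open(options, "/tmp/memtable_bloom_repro", &db);
  assert(s.ok());

  // Old value for key1 lands in an SST file.
  db->Put(rocksdb::WriteOptions(), "key1", "old");
  db->Flush(rocksdb::FlushOptions());

  // A range tombstone covering key1 sits in the memtable. key1 has no point
  // entry in the memtable, so its Bloom filter reports "not present".
  db->DeleteRange(rocksdb::WriteOptions(), db->DefaultColumnFamily(), "a", "z");

  rocksdb::Slice key("key1");
  rocksdb::PinnableSlice value;
  rocksdb::Status get_status;
  db->MultiGet(rocksdb::ReadOptions(), db->DefaultColumnFamily(),
               /*num_keys=*/1, &key, &value, &get_status);

  // Buggy behavior: MultiGet trusts the Bloom miss, skips the memtable and
  // its range tombstone, and returns the stale "old" value from the SST.
  // With the fix, the key is correctly reported as not found.
  assert(get_status.IsNotFound());

  delete db;
  return 0;
}
```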
@pdillinger force-pushed from 1e079f3 to ed66040.
@pdillinger has updated the pull request. You must reimport the pull request before landing.
@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@pdillinger has updated the pull request. You must reimport the pull request before landing.
@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Summary: Changes in #9453 could trigger

```
stderr:
Error: prefixpercent is non-zero while prefix_size is not positive!
```

Pull Request resolved: #9461
Test Plan: run `make blackbox_crashtest` for a long time
Reviewed By: ajkr
Differential Revision: D33830751
Pulled By: pdillinger
fbshipit-source-id: be88377dcaa47e4bb7adb0347762639eff8f1476
Fix major bug with MultiGet, DeleteRange, and memtable Bloom (facebook#9453)

Summary: MemTable::MultiGet was not considering range tombstones before querying the Bloom filter. This means range tombstones would be skipped for keys (or prefixes) with no other entries in the memtable. This could cause old values for a key (in SST files) to still show up until the range tombstone covering it has been flushed.

This is fixed by essentially disabling the memtable Bloom filter when there are any range tombstones. (This could be better optimized in the future, but good enough for now.)

Did some other cleanup/optimization in the same code to (more than) offset the cost of checking on range tombstones in more cases. There is now notable improvement when memtable_whole_key_filtering and prefix_extractor are used together (unusual), and this makes MultiGet closer to the Get implementation.

Pull Request resolved: facebook#9453

Test Plan: new unit test added. Added memtable Bloom to crash test.

Performance testing:

Build WAL-only DB (recovers to memtable):

```
TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=1000000 -write_buffer_size=250000000
```

Query test command, to maximize sensitivity to the changed code:

```
TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -use_existing_db -readonly -benchmarks=multireadrandom -num=10000000 -write_buffer_size=250000000 -memtable_bloom_size_ratio=0.015 -multiread_batched -batch_size=24 -threads=8 -memtable_whole_key_filtering=$MWKF -prefix_size=$PXS
```

(Note: -num here is 10x larger, for mostly memtable misses.)

Before & after run simultaneously, average over 10 iterations per data point, ops/sec:

| Configuration | Before | After |
|---|---|---|
| MWKF=0 PXS=0 (Bloom disabled) | 5724844 | 6722066 |
| MWKF=0 PXS=7 (prefixes hardly unique; Bloom not useful) | 9981319 | 10237990 |
| MWKF=0 PXS=8 (prefixes unique; Bloom useful) | 12081715 | 12117603 |
| MWKF=1 PXS=0 (whole key Bloom useful) | 11944354 | 12096085 |
| MWKF=1 PXS=7 (whole key Bloom useful in new version; prefixes not useful in old version) | 9444299 | 11826029 |
| MWKF=1 PXS=7 (whole key Bloom useful in new version; prefixes useful in old version) | 11784465 | 11778591 |

Only in this last case is the 'before' *slightly* faster, perhaps because hashing prefixes is slightly faster than hashing whole keys. Otherwise, 'after' is faster.

Reviewed By: ajkr
Differential Revision: D33805025
Pulled By: pdillinger
fbshipit-source-id: 597523cae4f4eafdf6ae6bb2bc6cb46f83b017bf