
Fix major bug with MultiGet, DeleteRange, and memtable Bloom #9453

Conversation

@pdillinger (Contributor) commented Jan 26, 2022

Summary: MemTable::MultiGet was not considering range tombstones before
querying the Bloom filter. This meant range tombstones would be skipped for
keys (or prefixes) with no other entries in the memtable, which could cause
old values for a key (in SST files) to still show up until the range
tombstone covering it was flushed.

This is fixed by essentially disabling the memtable Bloom filter when there
are any range tombstones. (This could be better optimized in the future, but
good enough for now.)
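
To make the fix concrete, here is a minimal sketch of the fixed control flow; the stand-in types (BloomStub, MemTableSketch, has_range_tombstones_) are illustrative, not the actual RocksDB internals:

```cpp
#include <string>
#include <vector>

// Minimal sketch of the fixed control flow; stand-in types, not the real
// RocksDB internals. The essential change: decide whether any range
// tombstones exist *before* consulting the memtable Bloom filter, and
// bypass the filter entirely when they do, since the filter only tracks
// point entries.
struct BloomStub {
  bool MayContain(const std::string& /*key_or_prefix*/) const {
    return false;  // stand-in; a real filter answers probabilistically
  }
};

struct MemTableSketch {
  const BloomStub* bloom_filter_ = nullptr;
  bool has_range_tombstones_ = false;

  void MultiGet(const std::vector<std::string>& keys,
                std::vector<bool>* may_match) const {
    const bool no_range_del = !has_range_tombstones_;
    for (size_t i = 0; i < keys.size(); ++i) {
      if (bloom_filter_ != nullptr && no_range_del) {
        // Safe to prune: any relevant memtable entry is a point entry
        // that the filter has seen.
        (*may_match)[i] = bloom_filter_->MayContain(keys[i]);
      } else {
        // Range tombstones present (or no filter): keep the key and do
        // the full lookup, which is the step that sees range tombstones.
        (*may_match)[i] = true;
      }
    }
  }
};
```

The pre-fix bug corresponds to taking the first branch unconditionally, letting the filter rule out a key whose only relevant memtable entry is a range tombstone.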

Did some other cleanup/optimization in the same code to (more than) offset
the cost of checking for range tombstones in more cases. There is now a
notable improvement when memtable_whole_key_filtering and prefix_extractor
are used together (an unusual combination), and this brings MultiGet closer
to the Get implementation.

Test Plan: new unit test added. Added memtable Bloom to crash test.

Performance testing

Build WAL-only DB (recovers to memtable):

```
TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=1000000 -write_buffer_size=250000000
```

Query test command, to maximize sensitivity to the changed code:

```
TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -use_existing_db -readonly -benchmarks=multireadrandom -num=10000000 -write_buffer_size=250000000 -memtable_bloom_size_ratio=0.015 -multiread_batched -batch_size=24 -threads=8 -memtable_whole_key_filtering=$MWKF -prefix_size=$PXS
```

(Note: -num here is 10x larger so that most lookups miss the memtable.)

Before & after run simultaneously, average over 10 iterations per data point, ops/sec.

| Configuration | Before (ops/sec) | After (ops/sec) |
| --- | --- | --- |
| MWKF=0 PXS=0 (Bloom disabled) | 5724844 | 6722066 |
| MWKF=0 PXS=7 (prefixes hardly unique; Bloom not useful) | 9981319 | 10237990 |
| MWKF=0 PXS=8 (prefixes unique; Bloom useful) | 12081715 | 12117603 |
| MWKF=1 PXS=0 (whole key Bloom useful) | 11944354 | 12096085 |
| MWKF=1 PXS=7 (whole key Bloom useful in new version; prefixes not useful in old version) | 9444299 | 11826029 |
| MWKF=1 PXS=8 (whole key Bloom useful in new version; prefixes useful in old version) | 11784465 | 11778591 |

Only in this last case is the 'before' slightly faster, perhaps because hashing prefixes is slightly faster than hashing whole keys. Otherwise, 'after' is faster.

@facebook-github-bot

@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@ajkr (Contributor) left a comment

LGTM

```cpp
        PERF_COUNTER_ADD(bloom_memtable_hit_count, 1);
      } else if (whole_key) {
        bloom_keys[num_keys] = iter->ukey_without_ts;
        range_indexes[num_keys++] = iter.index();
      } else if (prefix_extractor_->InDomain(iter->ukey_without_ts)) {
```

@ajkr:
Was it wrong to do this without checking memtable_whole_key_filtering?

@anand1976:
Don't think so. The comment for memtable_prefix_bloom_size_ratio says the feature is disabled if both prefix_extractor and memtable_whole_key_filtering are not set. And if both are set but the key is not in domain, we would have just skipped the Bloom filter and called GetFromTable.

@pdillinger (Author):
Like Anand said, just suboptimal, not wrong.
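
For context, a condensed sketch of the dispatch in the snippet above; BloomProbeKey is a hypothetical helper written for illustration, while SliceTransform::InDomain/Transform are the real RocksDB interfaces:

```cpp
#include "rocksdb/slice.h"
#include "rocksdb/slice_transform.h"

// Hypothetical helper condensing the dispatch in the diff above.
// whole_key reflects memtable_whole_key_filtering; when it is false, a
// configured prefix_extractor supplies the probe, but only for keys in
// its domain.
rocksdb::Slice BloomProbeKey(const rocksdb::Slice& ukey_without_ts,
                             bool whole_key,
                             const rocksdb::SliceTransform* prefix_extractor) {
  if (whole_key) {
    return ukey_without_ts;  // probe the Bloom filter with the whole key
  }
  if (prefix_extractor != nullptr &&
      prefix_extractor->InDomain(ukey_without_ts)) {
    // probe the Bloom filter with the extracted prefix
    return prefix_extractor->Transform(ukey_without_ts);
  }
  // Out of domain: no Bloom probe; the key goes straight to GetFromTable.
  return rocksdb::Slice();
}
```

As the thread concludes, reaching the InDomain branch without whole-key filtering configured is at worst suboptimal, not a correctness problem, since the memtable Bloom feature only exists when at least one of the two options is set.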

@anand1976 (Contributor) left a comment

LGTM. Great catch!


@pdillinger (Author):

I am reconsidering the approach based on the apparent small perf regression.

@pdillinger (Author):

A subtle quirk of this implementation is that any range tombstones effectively disable the memtable Bloom filter. Making that more clear in the code should simplify & optimize a bit.

@facebook-github-bot

@pdillinger has updated the pull request. You must reimport the pull request before landing.

```cpp
  std::array<Slice*, MultiGetContext::MAX_BATCH_SIZE> keys;
  std::array<bool, MultiGetContext::MAX_BATCH_SIZE> may_match = {{true}};
  autovector<Slice, MultiGetContext::MAX_BATCH_SIZE> prefixes;
  if (bloom_filter_ && no_range_del) {
```

@ajkr:
You want to skip all filter checks when any range tombstone is present? I thought in the previous version of this PR it only skipped checking for a given lookup key when a range tombstone covering that particular lookup key was found.

@ajkr:
Never mind, it was always like that. I guess you succeeded at "Making that more clear in the code"

@pdillinger (Author):
> You want to skip all filter checks when any range tombstone is present?

I think that's best for now (backport-able with low risk)

> I thought in the previous version ...

Looks can be deceiving. If you dig into the details, range_del_iter == nullptr is logically equivalent to no_range_del. I believe you'd have to check max_covering_tombstone_seq == 0 for a per-key coverage check, but I'm not super confident, and that's more than I want to write unit tests for at the moment.

@ajkr:
Sorry, let me start from the beginning. I originally wondered why we skip the filter at all when there are range tombstones. The filter checks seem useful whether we found a range tombstone or not. Perhaps that can be explained in the comment/TODO -- or, if we don't need to skip, perhaps it can give the optimization without introducing edge cases to test.
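
For reference, a hedged sketch of the per-key alternative raised in this thread, which the PR does not adopt; the surrounding types are illustrative, while max_covering_tombstone_seq is the field discussed above:

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Illustrative sketch only, not this PR's implementation: skip the Bloom
// filter just for lookup keys that some range tombstone covers (signaled
// by a nonzero max_covering_tombstone_seq) instead of disabling it
// wholesale. As noted above, validating this variant has edge cases that
// the blanket skip avoids.
struct KeyContextSketch {
  std::string ukey;
  uint64_t max_covering_tombstone_seq = 0;  // 0 means no covering tombstone
  bool may_match = true;
};

void FilterKeys(std::vector<KeyContextSketch>* keys,
                const std::function<bool(const std::string&)>& bloom_probe) {
  for (KeyContextSketch& k : *keys) {
    if (bloom_probe && k.max_covering_tombstone_seq == 0) {
      // Not covered by any range tombstone, so the point-entry Bloom
      // filter can be trusted to rule this key out.
      k.may_match = bloom_probe(k.ukey);
    }
    // Covered keys keep may_match == true and do the full lookup.
  }
}
```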

Summary: MemTable::MultiGet was not considering range tombstones before
querying Bloom filter. This means range tombstones would be skipped for
keys (or prefixes) with no other entries in the memtable. This could cause
old values for a key (in SST files) to still show up until the range
tombstone covering it has been flushed.

This is fixed by considering any range tombstones in the memtable before
querying the memtable Bloom filter.

Did some other cleanup/optimization in the same code that might help.

Test Plan: new unit test added. Added memtable Bloom to crash test.

TODO: performance testing
@pdillinger force-pushed the multiget_deleterange_memtable_bloom_bug branch from 1e079f3 to ed66040 on January 27, 2022 20:20
@facebook-github-bot

@pdillinger has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot

@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot

@pdillinger has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot

@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

pdillinger added a commit to pdillinger/rocksdb that referenced this pull request Jan 27, 2022
Summary: Changes in facebook#9453 could trigger
```
stderr:
Error: prefixpercent is non-zero while prefix_size is not positive!
```

Test Plan: run `make blackbox_crashtest` for a long time
facebook-github-bot pushed a commit that referenced this pull request Jan 28, 2022
Summary:
Changes in #9453 could trigger
```
stderr:
Error: prefixpercent is non-zero while prefix_size is not positive!
```

Pull Request resolved: #9461

Test Plan: run `make blackbox_crashtest` for a long time

Reviewed By: ajkr

Differential Revision: D33830751

Pulled By: pdillinger

fbshipit-source-id: be88377dcaa47e4bb7adb0347762639eff8f1476
pdillinger added a commit to pdillinger/rocksdb that referenced this pull request Jan 31, 2022

pdillinger added a commit to pdillinger/rocksdb that referenced this pull request Jan 31, 2022

pdillinger added a commit to pdillinger/rocksdb that referenced this pull request Jan 31, 2022

pdillinger added a commit to pdillinger/rocksdb that referenced this pull request Jan 31, 2022