[4/4][ResourceMngmt] Account Bloom/Ribbon filter construction memory in global memory limit #9073
Conversation
hx235 force-pushed from 784cdab to 8baad09
Update:
@hx235 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
util/bloom_test.cc
Outdated
const std::size_t num_entries = 10000;
char key_buffer[sizeof(int)];
for (std::size_t i = 0; i < num_entries; ++i) {
  filter_bits_builder->AddKey(Key(static_cast<int>(i), key_buffer));
Note to @pdillinger:
It's a bit difficult to verify the charged `hash_entries`' size, since we reserve and release its memory within `Finish()` to avoid the expense of reserving memory per insertion. After all, we don't need per-insertion granularity here, since we don't react to the cache insertion result.
The test `TEST_P(FullBloomTest, RibbonFilterFallBackOnLargeBanding)` indirectly verifies this through some math, and I also printed the values out during development for verification.
If more tests are needed, I am thinking of adding a sync point, but are there any other suggestions?
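For context, the bucket-charging idea mentioned above can be pictured with a minimal standalone sketch. This is not the RocksDB implementation; the class name and the 256 KiB bucket size are made up for illustration. The point is that the reservation only grows when usage crosses into a new bucket, rather than once per inserted key:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch: charge memory against a budget in coarse
// kDummyEntrySize buckets instead of once per inserted key.
class BucketedCharger {
 public:
  static constexpr std::size_t kDummyEntrySize = 256 * 1024;  // assumed size

  // Record `bytes` of new usage; only grow the reservation when the
  // used amount crosses into a new bucket.
  void Charge(std::size_t bytes) {
    used_ += bytes;
    while (reserved_ < used_) {
      reserved_ += kDummyEntrySize;
    }
  }

  // Release everything at once, e.g. at the end of Finish().
  void ReleaseAll() {
    used_ = 0;
    reserved_ = 0;
  }

  std::size_t used() const { return used_; }
  std::size_t reserved() const { return reserved_; }

 private:
  std::size_t used_ = 0;    // actual bytes in use
  std::size_t reserved_ = 0;  // bytes charged to the cache budget
};
```

With this scheme, thousands of small insertions trigger only a handful of reservation updates instead of one per key, at the cost of the reservation over-approximating usage by up to one bucket.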
If you have even a rough bound on peak memory usage by Bloom construction, that suffices to check for hash_entries because that is the overwhelming majority of usage for Bloom.
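A back-of-the-envelope form of such a bound check could look like the sketch below. The numbers and the 10% slack factor are assumptions for illustration, not measured RocksDB values; the idea is just that with n keys, Bloom construction keeps roughly n 64-bit hash entries, so the peak charge should sit near n * 8 bytes:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical bound check: hash entries dominate Bloom construction
// memory, so peak charge should be close to num_entries * sizeof(uint64_t).
bool PeakWithinBloomBound(std::size_t num_entries, std::size_t peak_charged,
                          double slack = 1.1) {
  const std::size_t expected = num_entries * sizeof(uint64_t);
  return peak_charged >= expected &&
         peak_charged <= static_cast<std::size_t>(expected * slack);
}
```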
util/bloom_test.cc
Outdated
// -1 = Marker for newer Bloom implementations
// -2 = Marker for Standard128 Ribbon
if (will_fall_back) {
  EXPECT_EQ(filter.data()[filter.size() - 5], static_cast<char>(-1));
Note to @pdillinger:
It's also a bit difficult to verify that the fallback is indeed due to large banding, considering there are multiple paths leading to fallback. I had to verify it by printing values out.
If more tests are needed, I am again thinking of using a sync point, but are there any other suggestions?
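In case it helps the discussion: the sync-point idea would amount to tagging the specific fallback cause in production code and observing it from the test. The sketch below is a minimal stand-in for that pattern, not RocksDB's actual SyncPoint API; the registry class, function names, and the point name are all hypothetical:

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>

// Minimal stand-in for a sync-point registry: production code calls
// Hit("name") at interesting points; tests install callbacks to observe
// which path (e.g. which fallback reason) was actually taken.
class MiniSyncPoint {
 public:
  void SetCallBack(const std::string& point, std::function<void()> cb) {
    callbacks_[point] = std::move(cb);
  }
  void Hit(const std::string& point) {
    auto it = callbacks_.find(point);
    if (it != callbacks_.end()) it->second();
  }

 private:
  std::map<std::string, std::function<void()>> callbacks_;
};

// Production-side sketch: tag the specific fallback cause.
void BuildFilter(MiniSyncPoint& sp, bool banding_too_large) {
  if (banding_too_large) {
    sp.Hit("RibbonFallback::LargeBanding");  // hypothetical point name
  }
}
```

A test can then assert the fallback happened for the expected reason rather than via one of the other paths.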
I'm satisfied with your single control (the `!will_fall_back` case), but if you wanted to also cover your other concern, you could have another case/iteration that turns off `reserve_table_builder_memory`.
table/block_based/filter_policy.cc
Outdated
if (cache_res_mgr) {
  Status s = cache_res_mgr->UpdateCacheReservation<
      CacheEntryRole::kXXPH3FilterConstruction>(
      cache_res_mgr->GetTotalMemoryUsed() + requested);
Question to @pdillinger:
In the branch under `#ifdef ROCKSDB_MALLOC_USABLE_SIZE` with `aggregate_rounding_balance_ != nullptr`, will the returned value `rv` be hugely different from the `requested` value that is actually used in allocation?
The reason I ask is that:
(1) I used the returned value `rv` (i.e., `len_with_metadata`) to approximate the length of `buf`/`mutable_buf` in all cases.
(2) I had a bit of difficulty fully understanding `RoundDownUsableSpace`, so I want to make sure it's okay to do so in this case (although I had a feeling that these two values are just around `kExtraPadding` apart).
(3) I also had some difficulty covering this branch (`#ifdef ROCKSDB_MALLOC_USABLE_SIZE` with `aggregate_rounding_balance_ != nullptr`) in testing, but I will try harder later.
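The round-down concern in (2) can be illustrated with a toy allocator model. This is not RocksDB's `RoundDownUsableSpace`; the fixed 16-byte granularity stands in for malloc's real (and platform-dependent) size classes. The point is that rounding a length down to a size whose usable size equals itself keeps the accounted and actually-usable sizes within one granule of the request:

```cpp
#include <cassert>
#include <cstddef>

// Toy allocator model: usable size is the request rounded up to a
// 16-byte boundary (real malloc behavior varies; this is an assumption).
constexpr std::size_t kToyGranularity = 16;

std::size_t ToyUsableSize(std::size_t requested) {
  return (requested + kToyGranularity - 1) / kToyGranularity * kToyGranularity;
}

// Round `len` down to a size whose usable size equals itself, so the
// accounted length and the actually-usable allocation agree exactly.
std::size_t RoundDownToUsable(std::size_t len) {
  return len / kToyGranularity * kToyGranularity;
}
```

Under this model the rounded value differs from the request by less than one granule, matching the intuition that `rv` and `requested` stay within a small padding of each other.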
util/bloom_test.cc
Outdated
@@ -601,6 +602,351 @@ TEST_P(FullBloomTest, OptimizeForMemory) {
  }
}

TEST_P(FullBloomTest, ReserveBloomAndRibbonFilterConstructionMemory) {
Note to @pdillinger:
Just to follow up on our discussion on supporting partitioned filters: for Partitioned Filter, I think we need to do testing at the abstraction level of `partitioned_filter_block_test.cc`. One reason is that we need to add up all the final filter lengths to make the right assertion on charging the whole `mutable_buf`. I haven't looked deep into that, but it should be doable.
We can either test that, or we can guard our feature from Partitioned Filter.
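The assertion I have in mind for the partitioned case would look roughly like this standalone sketch (function names are hypothetical): sum the per-partition final filter sizes and check the charged amount against the total rather than any single partition:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical check for partitioned filters: the charged amount should
// cover the sum of all partitions' final filter lengths, not just one.
std::size_t TotalFilterBytes(const std::vector<std::size_t>& partition_sizes) {
  std::size_t total = 0;
  for (std::size_t s : partition_sizes) total += s;
  return total;
}

bool ChargeCoversAllPartitions(const std::vector<std::size_t>& partition_sizes,
                               std::size_t charged) {
  return charged >= TotalFilterBytes(partition_sizes);
}
```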
hx235 force-pushed from 8baad09 to d5c8077
@hx235 has updated the pull request. You must reimport the pull request before landing.
@hx235 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
hx235 force-pushed from d5c8077 to 72d9dd3
LGTM, thanks!
@@ -175,6 +175,7 @@ TEST_F(OptionsSettableTest, BlockBasedTableOptionsAllFieldsSettable) {
    "optimize_filters_for_memory=true;"
    "index_block_restart_interval=4;"
    "filter_policy=bloomfilter:4:true;whole_key_filtering=1;"
    "reserve_table_builder_memory=false;"
Ideally you should also add a small case to OptionsTest.GetBlockBasedTableOptionsFromString to ensure the configuration string is properly connected to the field.
    std::make_tuple(true, BloomFilterPolicy::Mode::kDeprecatedBlock, false),
    std::make_tuple(true, BloomFilterPolicy::Mode::kLegacyBloom, false)));

TEST_P(DBFilterConstructionReserveMemoryTestWithParam, ReserveMemory) {
Maybe drop a TODO comment here about improving the running time of the test, possible approaches, and the challenges they present to keeping the production code simple and the test reflecting production reality.
…cket modular, api comment
hx235 force-pushed from 76c84f1 to f15e591
@hx235 has updated the pull request. You must reimport the pull request before landing.
@hx235 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Update:
…ter earlier (#9345)

Summary:
Note: rebase on and merge after #9349, as part of #9342

**Context:** #9073 charged the hash entries' memory in block cache with `CacheReservationHandle`. However, in the edge case where Ribbon Filter falls back to Bloom Filter and swaps its hash entries to the embedded Bloom Filter object, the handles associated with those entries are not swapped and thus not released as soon as those entries are cleared during the Bloom Filter's finish process. Although this is a minor issue, since RocksDB internally calls `FilterBitsBuilder->Reset()` right after `FilterBitsBuilder->Finish()` on the main path, which releases all the cache reservations related to both the Ribbon Filter and its embedded Bloom Filter, it is still worth fixing to avoid confusion.

**Summary:**
- Swapped the `CacheReservationHandle` associated with the hash entries on Ribbon Filter's fallback

Pull Request resolved: #9345

Test Plan:
- Added a unit test to verify the number of cache reservations after clearing hash entries, which failed before the change and now succeeds

Reviewed By: pdillinger
Differential Revision: D33377225
Pulled By: hx235
fbshipit-source-id: 7487f4c40dfb6ee7928232021f93ef2c5329cffa
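The fix described above amounts to keeping a reservation handle and the data it covers in lockstep, so a swap moves both together. A minimal sketch of that pattern (the handle type and charge bookkeeping here are illustrative stand-ins, not RocksDB's types):

```cpp
#include <cassert>
#include <cstdint>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Illustrative stand-in for a cache reservation handle: destroying it
// un-charges the reserved bytes from a shared counter.
struct ToyHandle {
  std::size_t* charged;
  std::size_t bytes;
  ~ToyHandle() { *charged -= bytes; }
};

// Bundle the hash entries with the handle that charges for them, so a
// swap moves both together and the charge always follows the data.
struct HashEntriesInfo {
  std::vector<uint64_t> entries;
  std::unique_ptr<ToyHandle> handle;

  void Swap(HashEntriesInfo& other) {
    std::swap(entries, other.entries);
    std::swap(handle, other.handle);
  }
};
```

Without swapping the handle alongside the entries, the original owner would keep holding the charge after its entries were already moved away, which mirrors the edge case the commit fixes.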
Summary:
Note: rebase on and merge after #9349, #9345, (optional) #9393

**Context:** (Quoted from pdillinger) Layers of information during new Bloom/Ribbon Filter construction in building block-based tables include the following:
a) set of keys to add to filter
b) set of hashes to add to filter (64-bit hash applied to each key)
c) set of Bloom indices to set in filter, with duplicates
d) set of Bloom indices to set in filter, deduplicated
e) final filter and its checksum

This PR aims to detect corruption (e.g., unexpected hardware/software corruption of data structures residing in memory for a long time) from b) to e), and leaves a) as future work at the application level.
- b)'s corruption is detected by verifying the XOR checksum of the hash entries, calculated as the entries accumulate before being added to the filter (i.e., `XXPH3FilterBitsBuilder::MaybeVerifyHashEntriesChecksum()`).
- c) - e)'s corruption is detected by verifying that the hash entries indeed exist in the constructed filter, by re-querying these hash entries in the filter (i.e., `FilterBitsBuilder::MaybePostVerify()`) after computing the block checksum (except for PartitionedFilter, where it is done right after each `FilterBitsBuilder::Finish` for implementation simplicity; see code comment for more). For this stage of detection, we assume hash entries are not corrupted after checking b), since the time interval from b) to c) is relatively short.

The option to enable this detection feature is `BlockBasedTableOptions::detect_filter_construct_corruption`, which is false by default.

**Summary:**
- Implemented new functions `XXPH3FilterBitsBuilder::MaybeVerifyHashEntriesChecksum()` and `FilterBitsBuilder::MaybePostVerify()`
- Ensured hash entries, final filter and banding and their [cache reservation](#9073) are released properly despite corruption
  - See [Filter.construction.artifacts.release.point.pdf](https://github.com/facebook/rocksdb/files/7923487/Design.Filter.construction.artifacts.release.point.pdf) for the high-level design
  - Bundled and refactored hash-entries-related artifacts in XXPH3FilterBitsBuilder into `HashEntriesInfo` for better control over the lifetime of these artifacts during `SwapEntries`, `ResetEntries`
- Ensured the RocksDB block-based table builder calls `FilterBitsBuilder::MaybePostVerify()` after constructing the filter with `FilterBitsBuilder::Finish()`
- When encountering such filter construction corruption, stop writing the filter content to files and mark the block-based table build non-ok by storing the corruption status in the builder

Pull Request resolved: #9342

Test Plan:
- Added new unit test `DBFilterConstructionCorruptionTestWithParam.DetectCorruption`
- Included this new feature in `DBFilterConstructionReserveMemoryTestWithParam.ReserveMemory`, as this feature heavily touches ReserveMemory's implementation
- For the fallback case, ran `./filter_bench -impl=3 -detect_filter_construct_corruption=true -reserve_table_builder_memory=true -strict_capacity_limit=true -quick -runs 10 | grep 'Build avg'` to make sure nothing breaks
- Added to `filter_bench`: increased filter construction time by **30%**, mostly by `MaybePostVerify()`
  - FastLocalBloom
    - Before change: `./filter_bench -impl=2 -quick -runs 10 | grep 'Build avg'`: **28.86643s**
    - After change:
      - `./filter_bench -impl=2 -detect_filter_construct_corruption=false -quick -runs 10 | grep 'Build avg'` (expect a tiny increase since MaybePostVerify is always called regardless): **27.6644s (-4%; this perf improvement might be because we no longer drop bloom hash entries in `AddAllEntries` along the iteration but in bulk later, same with the bypassing-MaybePostVerify case below)**
      - `./filter_bench -impl=2 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'` (expect acceptable increase): **34.41159s (+20%)**
      - `./filter_bench -impl=2 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'` (bypassing MaybePostVerify, expect minor increase): **27.13431s (-6%)**
  - Standard128Ribbon
    - Before change: `./filter_bench -impl=3 -quick -runs 10 | grep 'Build avg'`: **122.5384s**
    - After change:
      - `./filter_bench -impl=3 -detect_filter_construct_corruption=false -quick -runs 10 | grep 'Build avg'` (expect a tiny increase since MaybePostVerify is always called regardless; verified by removing MaybePostVerify under this case and finding only a +-1ns difference): **124.3588s (+2%)**
      - `./filter_bench -impl=3 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'` (expect acceptable increase): **159.4946s (+30%)**
      - `./filter_bench -impl=3 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'` (bypassing MaybePostVerify, expect minor increase): **125.258s (+2%)**
- Added to `db_stress`: `make crash_test`, `./db_stress --detect_filter_construct_corruption=true`
- Manually smoke-tested: manually corrupted the filter construction in some db-level tests with basic Put and background flush. As expected, the error did get returned to users in subsequent Put and Flush statuses.

Reviewed By: pdillinger
Differential Revision: D33746928
Pulled By: hx235
fbshipit-source-id: cb056426be5a7debc1cd16f23bc250f36a08ca57
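The hash-entry checksum in b) can be pictured with a small standalone sketch: maintain a running XOR of the 64-bit hashes as they accumulate, then recompute from storage and compare before building the filter. This mirrors the idea behind `MaybeVerifyHashEntriesChecksum`, not its exact code; the struct and method names are made up:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Running XOR checksum over hash entries: cheap to maintain per insert,
// and a single flipped bit in a stored entry changes the recomputed XOR.
struct CheckedEntries {
  std::vector<uint64_t> entries;
  uint64_t xor_checksum = 0;

  void Add(uint64_t h) {
    entries.push_back(h);
    xor_checksum ^= h;
  }

  // Recompute from storage and compare; false means corruption.
  bool Verify() const {
    uint64_t recomputed = 0;
    for (uint64_t h : entries) recomputed ^= h;
    return recomputed == xor_checksum;
  }
};
```

Note the usual XOR caveat: two identical corruptions cancel out, so this catches the common single-flip case rather than all possible corruption.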
Summary:
**Context:** Previous PRs #9748, #9073, #8428 added a separate flag for each charged memory area. Such an API design is not scalable as we charge more and more memory areas. Also, we foresee an opportunity to consolidate this feature with other cache-usage-related features such as `cache_index_and_filter_blocks` using `CacheEntryRole`. Therefore we decided to consolidate all these flags into `CacheUsageOptions cache_usage_options`, and this PR serves as the first step by consolidating the memory-charging-related flags.

**Summary:**
- Replaced old API references with new ones, including making `kCompressionDictionaryBuildingBuffer` opt-out, and added a unit test for that
- Added missing db bench/stress tests for some memory charging features
- Renamed related test suites to indicate they are under the same theme of memory charging
- Refactored a commonly used mocked cache component in memory-charging-related tests to reduce code duplication
- Replaced the phrases "memory tracking" / "cache reservation" (other than CacheReservationManager-related ones) with "memory charging" as the standard description of this feature

Pull Request resolved: #9926

Test Plan:
- New unit test for opt-out `kCompressionDictionaryBuildingBuffer`: `TEST_F(ChargeCompressionDictionaryBuildingBufferTest, Basic)`
- New unit test for option validation/sanitization: `TEST_F(CacheUsageOptionsOverridesTest, SanitizeAndValidateOptions)`
- CI
- db bench (in case querying new options introduces regression) **+0.5% micros/op**: `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_compression_dictionary_building_buffer=1(remove this for comparison) -compression_max_dict_bytes=10000 -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 | egrep 'fillseq'`

#-run | (pre-PR) avg micros/op | std micros/op | (post-PR) avg micros/op | std micros/op | change (%)
-- | -- | -- | -- | -- | --
10 | 3.9711 | 0.264408 | 3.9914 | 0.254563 | 0.5111933721
20 | 3.83905 | 0.0664488 | 3.8251 | 0.0695456 | **-0.3633711465**
40 | 3.86625 | 0.136669 | 3.8867 | 0.143765 | **0.5289363078**

- db_stress: `python3 tools/db_crashtest.py blackbox -charge_compression_dictionary_building_buffer=1 -charge_filter_construction=1 -charge_table_reader=1 -cache_size=1` killed as normal

Reviewed By: ajkr
Differential Revision: D36054712
Pulled By: hx235
fbshipit-source-id: d406e90f5e0c5ea4dbcb585a484ad9302d4302af
Note: This PR is the 4th part of a bigger PR stack (#9073) and will rebase/merge only after the first three PRs (#9070, #9071, #9130) merge.

Context:
Similar to #8428, this PR is to track memory usage during (new) Bloom Filter (i.e., FastLocalBloom) and Ribbon Filter (i.e., Ribbon128) construction, by charging dummy entries to block cache, moving toward the goal of a single global memory limit using block cache capacity. It also constrains the size of the banding portion of Ribbon Filter during construction, by falling back to Bloom Filter if that banding is, at some point, larger than the available space in the cache under `LRUCacheOptions::strict_capacity_limit=true`.

The option to turn on this feature is `BlockBasedTableOptions::reserve_table_builder_memory = true`, which by default is set to `false`. We decided not to have a separate option for each memory user in table building; therefore their memory accounting is all bundled under one general option.

Summary:
- Reserved/released memory for the main memory users in filter construction:
  - Bloom/Ribbon hash entries (i.e., `hash_entries.size()`; we bucket-charge hash entries during insertion for performance)
  - Ribbon banding (i.e., `bytes_coeff_rows` + `bytes_result_rows` + `bytes_backtrack`)
  - the final filter (i.e., `mutable_buf`'s size)
- In order to use `CacheReservationManager::CacheReservationHandle` to account the final filter's memory, we have to store the `CacheReservationManager` object and the `CacheReservationHandle` for the final filter in `XXPH3BitsFilterBuilder`, as well as explicitly delete the filter bits builder when done with the final filter in block-based table
- Added an option to run `filter_bench` with this memory reservation feature

Test:
- Added tests to `db_bloom_filter_test` to verify filter construction peak cache reservation under combinations of `BlockBasedTable::Rep::FilterType` (e.g., `kFullFilter`, `kPartitionedFilter`), `BloomFilterPolicy::Mode` (e.g., `kFastLocalBloom`, `kStandard128Ribbon`, `kDeprecatedBlock`) and `BlockBasedTableOptions::reserve_table_builder_memory`
  - The cases with `kFullFilter` + `kStandard128Ribbon` and with `kPartitionedFilter` take around 3000 - 6000 ms and others take around 1500 - 2000 ms, in total adding 20000 - 25000 ms to the test suite running locally
- Added a test to `bloom_test` to verify Ribbon Filter fallback on large banding in FullFilter
- Ran `filter_bench` to verify that this feature does not significantly slow down Bloom/Ribbon Filter construction speed. Local results averaged over 20 runs:
  - FastLocalBloom
    - `./filter_bench -impl=2 -quick -runs 20 | grep 'Build avg'`
    - `./filter_bench -impl=2 -quick -runs 20 -reserve_table_builder_memory=true | grep 'Build avg'`
    - `./filter_bench -impl=2 -quick -runs 20 -reserve_table_builder_memory=true -strict_capacity_limit=true | grep 'Build avg'`
  - Ribbon128
    - `./filter_bench -impl=3 -quick -runs 20 | grep 'Build avg'`
    - `./filter_bench -impl=3 -quick -runs 20 -reserve_table_builder_memory=true | grep 'Build avg'`
    - `./filter_bench -impl=3 -quick -runs 20 -reserve_table_builder_memory=true -strict_capacity_limit=true | grep 'Build avg'`
- Verified that `"Cache reservation for Ribbon filter banding failed due to cache full"` is indeed logged to console
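The banding-size constraint described in the Context above can be sketched end to end: attempt to charge the banding against a strictly capped budget, and fall back to Bloom when the charge fails. This is a standalone model under assumed names, not RocksDB's `CacheReservationManager`:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Toy strict-capacity budget standing in for block cache capacity
// under strict_capacity_limit=true.
class StrictBudget {
 public:
  explicit StrictBudget(std::size_t capacity) : capacity_(capacity) {}

  bool TryCharge(std::size_t bytes) {
    if (used_ + bytes > capacity_) return false;  // "cache full"
    used_ += bytes;
    return true;
  }

 private:
  std::size_t capacity_;
  std::size_t used_ = 0;
};

// Choose the filter implementation based on whether the Ribbon banding
// fits in the remaining budget; otherwise fall back to Bloom.
std::string ChooseFilter(StrictBudget& budget, std::size_t banding_bytes) {
  if (budget.TryCharge(banding_bytes)) return "ribbon";
  return "bloom";  // fallback on reservation failure
}
```

In the real feature the failed reservation is also what triggers the "Cache reservation for Ribbon filter banding failed due to cache full" log line mentioned above.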