Fix PinSelf() read-after-free in DB::GetMergeOperands() #9507
Conversation
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Force-pushed c79bf82 → 669e53b
@hx235 has updated the pull request. You must reimport the pull request before landing.
Force-pushed 669e53b → 139351d
@hx235 has updated the pull request. You must reimport the pull request before landing.
Force-pushed 139351d → 4801e05
@hx235 has updated the pull request. You must reimport the pull request before landing.
Force-pushed 4801e05 → b75f73d
@hx235 has updated the pull request. You must reimport the pull request before landing.
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Force-pushed b75f73d → 7daf2b9
@hx235 has updated the pull request. You must reimport the pull request before landing.
I think the test is great and sounds like you plan to keep working on solving it, but just wanted to mention one thing early on. We should try to solve it without adding heap (de)allocations in the Get() path if possible.
Got it - is it because it's slow? Now it makes sense to me why the previous PR was not using shared_ptr. Will update this.
Thank you for debugging this and finding the bug!
db/db_impl/db_impl.cc (outdated)
```diff
@@ -1881,6 +1881,8 @@ Status DBImpl::GetImpl(const ReadOptions& read_options, const Slice& key,
       return s;
     }
   }
+  std::shared_ptr<PinnedIteratorsManager> pinned_iters_mgr(
+      new PinnedIteratorsManager());
```
Besides the good comment by @ajkr, shared_ptr here doesn't appear to be necessary, as it is never shared. A unique_ptr is more straightforward, and something allocated on the stack would be even better, if possible.
db/version_set.cc (outdated)
```diff
@@ -1967,7 +1967,8 @@ void Version::Get(const ReadOptions& read_options, const LookupKey& k,
                   MergeContext* merge_context,
                   SequenceNumber* max_covering_tombstone_seq, bool* value_found,
                   bool* key_exists, SequenceNumber* seq, ReadCallback* callback,
-                  bool* is_blob, bool do_merge) {
+                  bool* is_blob, bool do_merge,
+                  std::shared_ptr<PinnedIteratorsManager> pinned_iters_mgr) {
```
Passing shared_ptr in this way can be expensive, as it will increment an atomic counter. We had several performance regressions due to this.
Not related to this PR, but the way we do it is that if the caller holds ownership of this object, we just pass the raw pointer down to the child functions. Not sure whether there is going to be a better way of doing it with C++17 though.
I agree that we should use a stack-allocated object. And just to clarify: a stack-allocated object can't be held by a shared_ptr or a unique_ptr, so the earlier suggestion about unique_ptr won't be adopted.
Force-pushed 7daf2b9 → d1011cc
@hx235 has updated the pull request. You must reimport the pull request before landing.
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
db/version_set.cc (outdated)
```diff
@@ -1978,7 +1979,14 @@ void Version::Get(const ReadOptions& read_options, const LookupKey& k,
     *key_exists = true;
   }

-  PinnedIteratorsManager pinned_iters_mgr;
+  PinnedIteratorsManager local_pinned_iters_mgr;
```
Do we still need to allocate it here if, in the common path, there is one passed in?
I don't like this either - still working around it, but I couldn't think of another solution to bypass the allocation other than (void)-ing it later. (In fact, I'm not even sure whether void-ing it actually frees the memory earlier.)
The issue I had is that I tried conditional allocation like

```cpp
if (use_local_pinned_iters_mgr) {
  PinnedIteratorsManager local_pinned_iters_mgr;
  pinned_iters_mgr = &local_pinned_iters_mgr;
}
```

but then `local_pinned_iters_mgr` is local to the conditional branch's scope.
Maybe I'm just not aware of a general coding pattern for this kind of issue - any thoughts?
Can we always force the caller to allocate one?
Good question - I was thinking of a minimal code change, so I didn't touch any of the callers. But yeah, we could do that for all the callers (6 of them) and make sure the passed-in `PinnedIteratorsManager`'s lifetime is well taken care of. Will update in the next draft.
Force-pushed d1011cc → 8f4a0c7
@hx235 has updated the pull request. You must reimport the pull request before landing.
Update:
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
LGTM. I don't know what the range of measured values was in the 10 benchmark runs but am assuming the regression on average would be well within the noise.
Let me double check with the team.
Update - will run more benchmarking as suggested by the team.
@ajkr - updated the benchmark command and result. The result seems fine to me, but I would like an extra review on it before merging the PR.
Force-pushed 8f4a0c7 → 944ea6a
@hx235 has updated the pull request. You must reimport the pull request before landing.
Update: rebase
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
LGTM still!
> std of micros/op

I had a feeling this was about to show up. @jay-zhuang's google benchmark integration will probably be useful for automating this process of finding a stable result.
Summary:

**Context:** Running the new test `DBMergeOperandTest.MergeOperandReadAfterFreeBug` prior to this fix surfaces the read-after-free bug of `PinSelf()` as below:

```
READ of size 8 at 0x60400002529d thread T0
    #5 0x7f199a in rocksdb::PinnableSlice::PinSelf(rocksdb::Slice const&) include/rocksdb/slice.h:171
    #6 0x7f199a in rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::DBImpl::GetImplOptions&) db/db_impl/db_impl.cc:1919
    #7 0x540d63 in rocksdb::DBImpl::GetMergeOperands(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::PinnableSlice*, rocksdb::GetMergeOperandsOptions*, int*) db/db_impl/db_impl.h:203

freed by thread T0 here:
    #3 0x1191399 in rocksdb::cache_entry_roles_detail::RegisteredDeleter<rocksdb::Block, (rocksdb::CacheEntryRole)0>::Delete(rocksdb::Slice const&, void*) cache/cache_entry_roles.h:99
    #4 0x719348 in rocksdb::LRUHandle::Free() cache/lru_cache.h:205
    #5 0x71047f in rocksdb::LRUCacheShard::Release(rocksdb::Cache::Handle*, bool) cache/lru_cache.cc:547
    #6 0xa78f0a in rocksdb::Cleanable::DoCleanup() include/rocksdb/cleanable.h:60
    #7 0xa78f0a in rocksdb::Cleanable::Reset() include/rocksdb/cleanable.h:38
    #8 0xa78f0a in rocksdb::PinnedIteratorsManager::ReleasePinnedData() db/pinned_iterators_manager.h:71
    #9 0xd0c21b in rocksdb::PinnedIteratorsManager::~PinnedIteratorsManager() db/pinned_iterators_manager.h:24
    #10 0xd0c21b in rocksdb::Version::Get(rocksdb::ReadOptions const&, rocksdb::LookupKey const&, rocksdb::PinnableSlice*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, rocksdb::Status*, rocksdb::MergeContext*, unsigned long*, bool*, bool*, unsigned long*, rocksdb::ReadCallback*, bool*, bool) db/pinned_iterators_manager.h:22
    #11 0x7f0fdf in rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::DBImpl::GetImplOptions&) db/db_impl/db_impl.cc:1886
    #12 0x540d63 in rocksdb::DBImpl::GetMergeOperands(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::PinnableSlice*, rocksdb::GetMergeOperandsOptions*, int*) db/db_impl/db_impl.h:203

previously allocated by thread T0 here:
    #1 0x1239896 in rocksdb::AllocateBlock(unsigned long, rocksdb::MemoryAllocator*) memory/memory_allocator.h:35
    #2 0x1239896 in rocksdb::BlockFetcher::CopyBufferToHeapBuf() table/block_fetcher.cc:171
    #3 0x1239896 in rocksdb::BlockFetcher::GetBlockContents() table/block_fetcher.cc:206
    #4 0x122eae5 in rocksdb::BlockFetcher::ReadBlockContents() table/block_fetcher.cc:325
    #5 0x11b1f45 in rocksdb::Status rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::Block>(rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, bool, rocksdb::CachableEntry<rocksdb::Block>*, rocksdb::BlockType, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::BlockContents*) const table/block_based/block_based_table_reader.cc:1503
```

Here is the analysis:

- We have [PinnedIteratorsManager](https://github.com/facebook/rocksdb/blob/6.28.fb/db/version_set.cc#L1980) with `Cleanable` capability on our `Version::Get()` path. It is responsible for managing the lifetime of pinned iterators and invoking registered cleanup functions during its own destruction.
- For example, in the case above, the merge operands' cleanup gets associated with this manager in [GetContext::push_operand](https://github.com/facebook/rocksdb/blob/6.28.fb/table/get_context.cc#L405). During PinnedIteratorsManager's [destruction](https://github.com/facebook/rocksdb/blob/6.28.fb/db/pinned_iterators_manager.h#L67), the release function associated with the merge operand data is invoked. **That is the "freed by thread T0 here" section of the ASAN report.**
- Bug 🐛: `PinnedIteratorsManager` is local to `Version::Get()`, while the merge operand data needs to outlive `Version::Get()` and stay alive until [PinSelf()](https://github.com/facebook/rocksdb/blob/6.28.fb/db/db_impl/db_impl.cc#L1905) copies it, **which is the read-after-free in the ASAN report.**
- This bug is likely an oversight of `PinnedIteratorsManager` when the `DB::GetMergeOperands` API was developed, because the current logic works fine in the existing case of getting the *merged value*, where the operands do not need to live that long.
- This bug was not surfaced much (even by its unit test) because the release function associated with the merge operands (which are actually blocks placed in cache, as seen in `BlockBasedTable::MaybeReadBlockAndLoadToCache` **in the "previously allocated by" section of the ASAN report**) is a cache entry deleter. The deleter calls `Cache::Release()`, which, for the LRU cache, won't immediately deallocate the block [unless the cache is full or instructed to force-erase](https://github.com/facebook/rocksdb/blob/6.28.fb/cache/lru_cache.cc#L521-L531).
- `DBMergeOperandTest.MergeOperandReadAfterFreeBug` makes the cache extremely small to force it to be full.

**Summary:**

- Fix the bug by aligning `PinnedIteratorsManager`'s lifetime with the merge operands' lifetime.

Pull Request resolved: #9507

Test Plan:

- New test `DBMergeOperandTest.MergeOperandReadAfterFreeBug`
- db_bench on the read path
  - Setup (LSM tree with several levels, cache the whole db to avoid read IO, warm the cache with readseq to avoid read IO):
    `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks="fillrandom,readseq" -num=1000000 -cache_size=100000000 -write_buffer_size=10000 -statistics=1 -max_bytes_for_level_base=10000 -level0_file_num_compaction_trigger=1`
    `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks="readrandom" -num=1000000 -cache_size=100000000`
  - Actual command run (run a 20-run batch 20 times, then average each batch's micros/op):
    `for j in {1..20}; do (for i in {1..20}; do rm -rf /dev/shm/rocksdb/ && TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks="fillrandom,readseq,readrandom" -num=1000000 -cache_size=100000000 -write_buffer_size=10000 -statistics=1 -max_bytes_for_level_base=10000 -level0_file_num_compaction_trigger=1 | egrep 'readrandom'; done > rr_output_pre.txt && (awk '{sum+=$3; sum_sqrt+=$3^2}END{print sum/20, sqrt(sum_sqrt/20-(sum/20)^2)}' rr_output_pre.txt) >> rr_output_pre_2.txt); done`
  - **Result: Pre-change: 3.79193 micros/op; Post-change: 3.79528 micros/op (+0.09%)**

(pre-change) sorted avg micros/op of each 20-run | std of micros/op of each 20-run | (post-change) sorted avg micros/op of each 20-run | std of micros/op of each 20-run
-- | -- | -- | --
3.58355 | 0.265209 | 3.48715 | 0.382076
3.58845 | 0.519927 | 3.5832 | 0.382726
3.66415 | 0.452097 | 3.677 | 0.563831
3.68495 | 0.430897 | 3.68405 | 0.495355
3.70295 | 0.482893 | 3.68465 | 0.431438
3.719 | 0.463806 | 3.71945 | 0.457157
3.7393 | 0.453423 | 3.72795 | 0.538604
3.7806 | 0.527613 | 3.75075 | 0.444509
3.7817 | 0.426704 | 3.7683 | 0.468065
3.809 | 0.381033 | 3.8086 | 0.557378
3.80985 | 0.466011 | 3.81805 | 0.524833
3.8165 | 0.500351 | 3.83405 | 0.529339
3.8479 | 0.430326 | 3.86285 | 0.44831
3.85125 | 0.434108 | 3.8717 | 0.544098
3.8556 | 0.524602 | 3.895 | 0.411679
3.8656 | 0.476383 | 3.90965 | 0.566636
3.8911 | 0.488477 | 3.92735 | 0.608038
3.898 | 0.493978 | 3.9439 | 0.524511
3.97235 | 0.515008 | 3.9623 | 0.477416
3.9768 | 0.519993 | 3.98965 | 0.521481

- CI

Reviewed By: ajkr

Differential Revision: D34030519

Pulled By: hx235

fbshipit-source-id: a99ac585c11704c5ed93af033cb29ba0a7b16ae8
Example of this in test plan of #9611