SegFault during GetAggregateIntProperty call. #8382

matthewvon · 2021-06-10T13:41:13Z

I am logging this issue in case someone else has time to produce a test case before I do.

db->GetAggregatedIntProperty("rocksdb.estimate-table-readers-mem", & table_readers_size);

The above call internally acquires the database mutex before it walks the column families. In our use case, this call is on an independent thread that has been awoken by an EventListener event, specifically the compaction complete event. It appears that the compaction completion can release column family references leading to the delete of a column family without acquiring the database mutex. That means the thread walking the column families for GetAggregatedIntProperty ends up iterating a deleted object. Segfault results.

matthewvon · 2021-06-10T13:41:57Z

I have known about this issue for a couple of months. Feel bad for having not even mentioned it previously. My apologies.

akankshamahajan15 · 2021-06-10T17:57:34Z

@matthewvon Do you mind sharing the test code, if you have, to reproduce it.

matthewvon · 2021-06-10T17:59:18Z

@akankshamahajan15 That is the problem. Been too busy to create a sample test case. And not likely to get time for another few weeks. Felt it important to at least log this now, test case later.

Summary: The `ColumnFamilyData::UnrefAndTryDelete` code currently on the trunk unlocks the DB mutex before destroying the `ThreadLocalPtr` holding the per-thread `SuperVersion` pointers when the only remaining reference is the back reference from `super_version_`. The idea behind this was to break the circular dependency between `ColumnFamilyData` and `SuperVersion`: when the penultimate reference goes away, `ColumnFamilyData` can clean up the `SuperVersion`, which can in turn clean up `ColumnFamilyData`. (Assuming there is a `SuperVersion` and it is not referenced by anything else.) However, unlocking the mutex throws a wrench in this plan by making it possible for another thread to jump in and take another reference to the `ColumnFamilyData`, keeping the object alive in a zombie `ThreadLocalPtr`-less state. This can cause issues like #8440 , #8382 , and might also explain the `was_last_ref` assertion failures from the `ColumnFamilySet` destructor we sometimes observe during close in our stress tests. Digging through the archives, this unlocking goes way back to 2014 (or earlier). The original rationale was that `SuperVersionUnrefHandle` used to lock the mutex so it can call `SuperVersion::Cleanup`; however, this logic turned out to be deadlock-prone. #3510 fixed the deadlock but left the unlocking in place. #6147 then introduced the circular dependency and associated cleanup logic described above (in order to enable iterators to keep the `ColumnFamilyData` for dropped column families alive), and moved the unlocking-relocking snippet to its present location in `UnrefAndTryDelete`. Finally, #7749 fixed a memory leak but apparently exacerbated the race by (otherwise correctly) switching to `UnrefAndTryDelete` in `SuperVersion::Cleanup`. The patch simply eliminates the unlocking and relocking, which has been unnecessary ever since #3510 made `SuperVersionUnrefHandle` lock-free. This closes the window during which another thread could increase the reference count, and hopefully fixes the issues above. Pull Request resolved: #8605 Test Plan: Ran `make check` and stress tests locally. Reviewed By: pdillinger Differential Revision: D30051035 Pulled By: ltamasi fbshipit-source-id: 8fe559e4b4ad69fc142579f8bc393ef525918528

ltamasi · 2021-08-07T01:43:24Z

Hi @matthewvon ! Could you confirm if #8605 fixes the issue you've been seeing?

ltamasi · 2021-09-10T16:24:48Z

Assuming fixed, please feel free to reopen if you still see this after the fix mentioned above.

akankshamahajan15 added bug Confirmed RocksDB bugs segfault up-for-grabs Up for grabs labels Jun 10, 2021

ltamasi mentioned this issue Aug 2, 2021

Fix a race in ColumnFamilyData::UnrefAndTryDelete #8605

Closed

ltamasi closed this as completed Sep 10, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SegFault during GetAggregateIntProperty call. #8382

SegFault during GetAggregateIntProperty call. #8382

matthewvon commented Jun 10, 2021 •

edited

Loading

matthewvon commented Jun 10, 2021

akankshamahajan15 commented Jun 10, 2021

matthewvon commented Jun 10, 2021

ltamasi commented Aug 7, 2021

ltamasi commented Sep 10, 2021

SegFault during GetAggregateIntProperty call. #8382

SegFault during GetAggregateIntProperty call. #8382

Comments

matthewvon commented Jun 10, 2021 • edited Loading

matthewvon commented Jun 10, 2021

akankshamahajan15 commented Jun 10, 2021

matthewvon commented Jun 10, 2021

ltamasi commented Aug 7, 2021

ltamasi commented Sep 10, 2021

matthewvon commented Jun 10, 2021 •

edited

Loading