Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SegFault during GetAggregateIntProperty call. #8382

Closed
matthewvon opened this issue Jun 10, 2021 · 5 comments
Closed

SegFault during GetAggregateIntProperty call. #8382

matthewvon opened this issue Jun 10, 2021 · 5 comments
Labels
bug Confirmed RocksDB bugs segfault up-for-grabs Up for grabs

Comments

@matthewvon
Copy link
Contributor

matthewvon commented Jun 10, 2021

I am logging this issue in case someone else has time to produce a test case before I do.

db->GetAggregatedIntProperty("rocksdb.estimate-table-readers-mem", & table_readers_size);

The above call internally acquires the database mutex before it walks the column families. In our use case, this call is on an independent thread that has been awoken by an EventListener event, specifically the compaction complete event. It appears that the compaction completion can release column family references leading to the delete of a column family without acquiring the database mutex. That means the thread walking the column families for GetAggregatedIntProperty ends up iterating a deleted object. Segfault results.

@matthewvon
Copy link
Contributor Author

I have known about this issue for a couple of months. Feel bad for having not even mentioned it previously. My apologies.

@akankshamahajan15
Copy link
Contributor

@matthewvon Do you mind sharing the test code, if you have, to reproduce it.

@matthewvon
Copy link
Contributor Author

@akankshamahajan15 That is the problem. Been too busy to create a sample test case. And not likely to get time for another few weeks. Felt it important to at least log this now, test case later.

@akankshamahajan15 akankshamahajan15 added bug Confirmed RocksDB bugs segfault up-for-grabs Up for grabs labels Jun 10, 2021
facebook-github-bot pushed a commit that referenced this issue Aug 3, 2021
Summary:
The `ColumnFamilyData::UnrefAndTryDelete` code currently on the trunk
unlocks the DB mutex before destroying the `ThreadLocalPtr` holding
the per-thread `SuperVersion` pointers when the only remaining reference
is the back reference from `super_version_`. The idea behind this was to
break the circular dependency between `ColumnFamilyData` and `SuperVersion`:
when the penultimate reference goes away, `ColumnFamilyData` can clean up
the `SuperVersion`, which can in turn clean up `ColumnFamilyData`. (Assuming there
is a `SuperVersion` and it is not referenced by anything else.) However,
unlocking the mutex throws a wrench in this plan by making it possible for another thread
to jump in and take another reference to the `ColumnFamilyData`, keeping the
object alive in a zombie `ThreadLocalPtr`-less state. This can cause issues like
#8440 ,
#8382 ,
and might also explain the `was_last_ref` assertion failures from the `ColumnFamilySet`
destructor we sometimes observe during close in our stress tests.

Digging through the archives, this unlocking goes way back to 2014 (or earlier). The original
rationale was that `SuperVersionUnrefHandle` used to lock the mutex so it can call
`SuperVersion::Cleanup`; however, this logic turned out to be deadlock-prone.
#3510 fixed the deadlock but left the
unlocking in place. #6147 then introduced
the circular dependency and associated cleanup logic described above (in order
to enable iterators to keep the `ColumnFamilyData` for dropped column families alive),
and moved the unlocking-relocking snippet to its present location in `UnrefAndTryDelete`.
Finally, #7749 fixed a memory leak but
apparently exacerbated the race by (otherwise correctly) switching to `UnrefAndTryDelete`
in `SuperVersion::Cleanup`.

The patch simply eliminates the unlocking and relocking, which has been unnecessary
ever since #3510 made `SuperVersionUnrefHandle` lock-free.
This closes the window during which another thread could increase the reference count,
and hopefully fixes the issues above.

Pull Request resolved: #8605

Test Plan: Ran `make check` and stress tests locally.

Reviewed By: pdillinger

Differential Revision: D30051035

Pulled By: ltamasi

fbshipit-source-id: 8fe559e4b4ad69fc142579f8bc393ef525918528
ltamasi added a commit that referenced this issue Aug 4, 2021
Summary:
The `ColumnFamilyData::UnrefAndTryDelete` code currently on the trunk
unlocks the DB mutex before destroying the `ThreadLocalPtr` holding
the per-thread `SuperVersion` pointers when the only remaining reference
is the back reference from `super_version_`. The idea behind this was to
break the circular dependency between `ColumnFamilyData` and `SuperVersion`:
when the penultimate reference goes away, `ColumnFamilyData` can clean up
the `SuperVersion`, which can in turn clean up `ColumnFamilyData`. (Assuming there
is a `SuperVersion` and it is not referenced by anything else.) However,
unlocking the mutex throws a wrench in this plan by making it possible for another thread
to jump in and take another reference to the `ColumnFamilyData`, keeping the
object alive in a zombie `ThreadLocalPtr`-less state. This can cause issues like
#8440 ,
#8382 ,
and might also explain the `was_last_ref` assertion failures from the `ColumnFamilySet`
destructor we sometimes observe during close in our stress tests.

Digging through the archives, this unlocking goes way back to 2014 (or earlier). The original
rationale was that `SuperVersionUnrefHandle` used to lock the mutex so it can call
`SuperVersion::Cleanup`; however, this logic turned out to be deadlock-prone.
#3510 fixed the deadlock but left the
unlocking in place. #6147 then introduced
the circular dependency and associated cleanup logic described above (in order
to enable iterators to keep the `ColumnFamilyData` for dropped column families alive),
and moved the unlocking-relocking snippet to its present location in `UnrefAndTryDelete`.
Finally, #7749 fixed a memory leak but
apparently exacerbated the race by (otherwise correctly) switching to `UnrefAndTryDelete`
in `SuperVersion::Cleanup`.

The patch simply eliminates the unlocking and relocking, which has been unnecessary
ever since #3510 made `SuperVersionUnrefHandle` lock-free.
This closes the window during which another thread could increase the reference count,
and hopefully fixes the issues above.

Pull Request resolved: #8605

Test Plan: Ran `make check` and stress tests locally.

Reviewed By: pdillinger

Differential Revision: D30051035

Pulled By: ltamasi

fbshipit-source-id: 8fe559e4b4ad69fc142579f8bc393ef525918528
@ltamasi
Copy link
Contributor

ltamasi commented Aug 7, 2021

Hi @matthewvon ! Could you confirm if #8605 fixes the issue you've been seeing?

@ltamasi
Copy link
Contributor

ltamasi commented Sep 10, 2021

Assuming fixed, please feel free to reopen if you still see this after the fix mentioned above.

@ltamasi ltamasi closed this as completed Sep 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Confirmed RocksDB bugs segfault up-for-grabs Up for grabs
Projects
None yet
Development

No branches or pull requests

3 participants