Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race condition in SstFileManagerImpl error recovery code #9435

Closed
wants to merge 1 commit into from

Conversation

anand1976
Copy link
Contributor

Summary:
There is a race in SstFileManagerImpl between the ClearError() function
and CancelErrorRecovery(). The race can cause ClearError() to deref the
file system pointer after it has been freed. This is likely to occur
during process shutdown, when the order of destruction of the
DB/Env/FileSystem and SstFileManagerImpl is not deterministic.

Test Plan:
Reproduce the crash in a TSAN build by introducing sleeps in the code, and verify with
the fix.

Summary:
There is a race in SstFileManagerImpl between the ClearError() function
and CancelErrorRecovery(). The race can cause ClearError() to deref the
file system pointer after it has been freed. This is likely to occur
during process shutdown, when the order of destruction of the
DB/Env/FileSystem and SstFileManagerImpl is not deterministic.

Test Plan:
Reproduce the crash in a TSAN build by introducing sleeps in the code, and verify with
the fix.

Reviewers:

Subscribers:

Tasks:

Tags:
@facebook-github-bot
Copy link
Contributor

@anand1976 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Contributor

@siying siying left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fixing and I look forward to a unit test reproducing it.

@@ -256,7 +256,7 @@ void SstFileManagerImpl::ClearError() {
while (true) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this can be reproduced, we can try to add a unit test with a sync point here to reproduce this sequence.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... this is hard since this happens in test cleaning up phase. If it is too hard, then forget about it. We can have it here and see whether the test failure still h appends in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I couldn't figure out a repro using sync points. Ended up with sleeps in the SstFileManagerImpl destructor and ClearError to repro.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants