-
Notifications
You must be signed in to change notification settings - Fork 6.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix deadlock in fs_test.WALWriteRetryableErrorAutoRecover1
#7897
Conversation
The recovery thread could hold the db.mutex, which is needed from sync write in main thread. Make sure the write is done before recovery thread starts. Test Plan: `gtest-parallel ./error_handler_fs_test --gtest_filter=DBErrorHandlingFSTest.WALWriteRetryableErrorAutoRecover1 -r 10000 --workers=200`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jay-zhuang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the case is, the recover thread hold the db_mutex, and is at RecoverFromRetryableBGIOError:BeforeResume0 wait for WALWriteError1:0. However, the main thread try to grab db_mutex for sync, so waiting at that location, which is ahead of WALWriteError1:0. Do I understand correctly?
Yes exactly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for the fix!
@jay-zhuang merged this pull request in c6ff4c0. |
…k#7897) Summary: The recovery thread could hold the db.mutex, which is needed from sync write in main thread. Make sure the write is done before recovery thread starts. Pull Request resolved: facebook#7897 Test Plan: `gtest-parallel ./error_handler_fs_test --gtest_filter=DBErrorHandlingFSTest.WALWriteRetryableErrorAutoRecover1 -r 10000 --workers=200` Reviewed By: zhichao-cao Differential Revision: D26082933 Pulled By: jay-zhuang fbshipit-source-id: 226fc49228c0e5903f86ff45cc3fed3080abdb1f
The recovery thread could hold the db.mutex, which is needed from sync
write in main thread.
Make sure the write is done before recovery thread starts.
Test Plan:
gtest-parallel ./error_handler_fs_test --gtest_filter=DBErrorHandlingFSTest.WALWriteRetryableErrorAutoRecover1 -r 10000 --workers=200