Fix deadlock in `fs_test.WALWriteRetryableErrorAutoRecover1` #7897

jay-zhuang · 2021-01-26T20:29:27Z

The recovery thread could hold the db.mutex, which the sync write in the
main thread also needs.
Make sure the write is done before the recovery thread starts.

Test Plan: `gtest-parallel ./error_handler_fs_test --gtest_filter=DBErrorHandlingFSTest.WALWriteRetryableErrorAutoRecover1 -r 10000 --workers=200`

facebook-github-bot

@jay-zhuang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zhichao-cao

So the case is: the recovery thread holds the db_mutex and is at `RecoverFromRetryableBGIOError:BeforeResume0`, waiting for `WALWriteError1:0`. Meanwhile, the main thread tries to grab db_mutex for the sync write and blocks at that point, which comes before `WALWriteError1:0`. Do I understand correctly?

jay-zhuang · 2021-01-26T23:44:53Z

> So the case is: the recovery thread holds the db_mutex and is at `RecoverFromRetryableBGIOError:BeforeResume0`, waiting for `WALWriteError1:0`. Meanwhile, the main thread tries to grab db_mutex for the sync write and blocks at that point, which comes before `WALWriteError1:0`. Do I understand correctly?

Yes, exactly.
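The ordering discussed above can be modeled outside RocksDB. What follows is a minimal sketch under stated assumptions, not the actual test change: `OneShotSignal` and `RunFixedOrdering` are hypothetical names, a plain `std::mutex` stands in for db_mutex, and the one-shot signal stands in for a single sync-point dependency. The recovery thread waits for the write to finish before it takes the mutex, so it can never hold the mutex while the main thread still needs it.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical one-shot signal standing in for one sync-point dependency.
class OneShotSignal {
 public:
  void Notify() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      fired_ = true;
    }
    cv_.notify_all();
  }

  void Wait() {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return fired_; });
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  bool fired_ = false;
};

// Hypothetical model of the fixed ordering. Returns the events in the
// order in which they happened.
std::vector<std::string> RunFixedOrdering() {
  std::mutex db_mutex;              // assumption: stands in for the DB mutex
  OneShotSignal write_done;         // fired once the sync write has completed
  std::vector<std::string> events;  // only modified while holding db_mutex

  std::thread recovery([&] {
    // Wait for the write BEFORE taking db_mutex. Waiting while already
    // holding db_mutex would reproduce the deadlock, because the main
    // thread needs the mutex to finish the write and fire the signal.
    write_done.Wait();
    std::lock_guard<std::mutex> lock(db_mutex);
    events.push_back("recovery: resume");
  });

  {
    // Main thread: the sync write needs db_mutex.
    std::lock_guard<std::mutex> lock(db_mutex);
    events.push_back("main: sync write");
  }
  write_done.Notify();

  recovery.join();
  assert(events.size() == 2);
  return events;
}
```

Swapping the two statements at the top of the recovery lambda (lock first, then wait) gives the problematic interleaving whenever the recovery thread wins the race for the mutex.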

zhichao-cao

LGTM, thanks for the fix!

facebook-github-bot · 2021-01-27T01:02:17Z

@jay-zhuang merged this pull request in c6ff4c0.

…k#7897)

Summary: The recovery thread could hold the db.mutex, which is needed from sync write in main thread. Make sure the write is done before recovery thread starts.

Pull Request resolved: facebook#7897

Test Plan: `gtest-parallel ./error_handler_fs_test --gtest_filter=DBErrorHandlingFSTest.WALWriteRetryableErrorAutoRecover1 -r 10000 --workers=200`

Reviewed By: zhichao-cao

Differential Revision: D26082933

Pulled By: jay-zhuang

fbshipit-source-id: 226fc49228c0e5903f86ff45cc3fed3080abdb1f

facebook-github-bot added the CLA Signed label Jan 26, 2021

jay-zhuang linked an issue Jan 26, 2021 that may be closed by this pull request

flaky test: error_handler_fs_test #7472

Closed

facebook-github-bot reviewed Jan 26, 2021

jay-zhuang requested a review from zhichao-cao January 26, 2021 22:37

zhichao-cao reviewed Jan 26, 2021

zhichao-cao approved these changes Jan 27, 2021

facebook-github-bot closed this in c6ff4c0 Jan 27, 2021

facebook-github-bot added the Merged label Jan 27, 2021
