Fix a race condition in WAL tracking causing DB open failure #9715

riversand963 · 2022-03-18T05:32:45Z

There is a race condition if WAL tracking in the MANIFEST is enabled in a database that disables 2PC.

The race condition is between two background flush threads trying to install flush results to the MANIFEST.

Consider an example database with two column families: "default" (cfd0) and "cf1" (cfd1). Initially,
both column families have one mutable (active) memtable whose data is backed by 6.log.

1. Trigger a manual flush for "cf1", creating 7.log
2. Insert another key to "default", and trigger flush for "default", creating 8.log
3. BgFlushThread1 finishes writing 9.sst
4. BgFlushThread2 finishes writing 10.sst

Time  BgFlushThread1                                    BgFlushThread2
 |    mutex_.Lock()
 |    precompute min_wal_to_keep as 6
 |    mutex_.Unlock()
 |                                                     mutex_.Lock()
 |                                                     precompute min_wal_to_keep as 6
 |                                                     join MANIFEST write queue and mutex_.Unlock()
 |    write to MANIFEST
 |    mutex_.Lock()
 |    cfd1->log_number = 7
 |    Signal bg_flush_2 and mutex_.Unlock()
 |                                                     wake up and mutex_.Lock()
 |                                                     cfd0->log_number = 8
 |                                                     FindObsoleteFiles() with job_context->log_number == 7
 |                                                     mutex_.Unlock()
 |                                                     PurgeObsoleteFiles() deletes 6.log
 V

As shown above, BgFlushThread2 thinks that the min wal to keep is 6.log because "cf1" has unflushed data in 6.log (cf1.log_number=6).
Similarly, BgFlushThread1 thinks that the min wal to keep is also 6.log because "default" has unflushed data (default.log_number=6).
No WAL deletion will be written to MANIFEST because 6 is equal to versions_->wals_.min_wal_number_to_keep,
due to https://github.com/facebook/rocksdb/blob/7.1.fb/db/memtable_list.cc#L513:L514.
The bg flush thread that finishes last will perform file purging. job_context.log_number will be evaluated as 7, i.e.
the min wal that contains unflushed data, causing 6.log to be deleted. However, MANIFEST thinks 6.log should still exist.
If you close the db at this point, you won't be able to re-open it if track_and_verify_wal_in_manifest is true.
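
The interleaving above can be replayed in a small single-threaded model. This is only a sketch under stated assumptions: `ToyDb`, `PrecomputeMinWalToKeep`, and `ReplayRace` are hypothetical names made up for illustration, not RocksDB APIs, and the model ignores the mutex and the MANIFEST write queue entirely. It only shows how the MANIFEST's view and the files on disk can diverge.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>

// Hypothetical, simplified model; none of these names are RocksDB APIs.
struct ToyDb {
  std::map<std::string, uint64_t> cf_log_number;  // per-CF log_number
  uint64_t manifest_min_wal_to_keep = 0;          // min WAL recorded in MANIFEST
  std::set<uint64_t> wals_on_disk;                // WAL files that still exist
};

// Min WAL to keep as seen by a flush of `flushed_cf`: the flushed CF is
// assumed to advance to `new_log_number`, while every other CF is read at
// its current (possibly stale) log_number.
uint64_t PrecomputeMinWalToKeep(const ToyDb& db, const std::string& flushed_cf,
                                uint64_t new_log_number) {
  uint64_t min_wal = new_log_number;
  for (const auto& entry : db.cf_log_number) {
    if (entry.first != flushed_cf) min_wal = std::min(min_wal, entry.second);
  }
  return min_wal;
}

// Replays the interleaving from the timeline. Returns true if the MANIFEST
// still expects a WAL that has been deleted from disk.
bool ReplayRace() {
  ToyDb db;
  db.cf_log_number = {{"default", 6}, {"cf1", 6}};
  db.manifest_min_wal_to_keep = 6;
  db.wals_on_disk = {6, 7, 8};

  // Both threads precompute before either has installed its result.
  const uint64_t t1 = PrecomputeMinWalToKeep(db, "cf1", 7);
  const uint64_t t2 = PrecomputeMinWalToKeep(db, "default", 8);
  assert(t1 == 6 && t2 == 6);

  // Neither value exceeds what the MANIFEST already records, so no WAL
  // deletion is written; the recorded minimum stays at 6.
  db.manifest_min_wal_to_keep =
      std::max({db.manifest_min_wal_to_keep, t1, t2});
  db.cf_log_number["cf1"] = 7;
  db.cf_log_number["default"] = 8;

  // The last thread purges based on the live per-CF state, which is now 7.
  const uint64_t purge_below =
      std::min(db.cf_log_number["cf1"], db.cf_log_number["default"]);
  db.wals_on_disk.erase(db.wals_on_disk.begin(),
                        db.wals_on_disk.lower_bound(purge_below));

  return db.wals_on_disk.count(db.manifest_min_wal_to_keep) == 0;
}
```

In this model `ReplayRace()` reports the inconsistency: 6.log has been removed while the recorded minimum is still 6.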

We must handle the case of multiple bg flush threads, and it is difficult for one bg flush thread to know
the correct min wal number until the other bg flush threads have finished committing to the manifest and updated
the cfd::log_number.
To fix this issue, we rename an existing variable min_log_number_to_keep_2pc to min_log_number_to_keep,
and use it to track WAL file deletion in non-2pc mode as well.
This variable is updated only 1) during recovery with mutex held, or 2) in the MANIFEST write thread.
min_log_number_to_keep means RocksDB will delete WALs below it, although there may be WALs
above it which are also obsolete. Formally, we will have [min_wal_to_keep, max_obsolete_wal]. During recovery, we
make sure that only WALs above max_obsolete_wal are checked and added back to alive_log_files_.
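
A rough sketch of the idea, assuming a toy `ToyVersionSet` with a `MarkMinLogNumberToKeep` helper and a `PurgeObsoleteWals` function (all hypothetical names, not the code in this PR): purging consults only the number that the MANIFEST write step has recorded, so a WAL is never deleted ahead of the MANIFEST, at the cost of sometimes keeping an obsolete WAL a little longer.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Hypothetical stand-in for the state owned by the MANIFEST writer.
struct ToyVersionSet {
  uint64_t min_log_number_to_keep = 0;

  // Assumed to be called only from the serialized MANIFEST write step or
  // from recovery. The value never moves backwards.
  void MarkMinLogNumberToKeep(uint64_t number) {
    if (min_log_number_to_keep < number) min_log_number_to_keep = number;
  }
};

// Deletes only WALs strictly below the recorded number and returns the
// WALs that remain. WALs at or above it are kept even if obsolete.
std::set<uint64_t> PurgeObsoleteWals(std::set<uint64_t> wals,
                                     const ToyVersionSet& versions) {
  wals.erase(wals.begin(), wals.lower_bound(versions.min_log_number_to_keep));
  return wals;
}

// Same interleaving as in the description: both flushes precomputed 6.
std::set<uint64_t> ReplayWithFix() {
  ToyVersionSet versions;
  versions.min_log_number_to_keep = 6;
  versions.MarkMinLogNumberToKeep(6);  // value from BgFlushThread1
  versions.MarkMinLogNumberToKeep(6);  // value from BgFlushThread2
  assert(versions.min_log_number_to_keep == 6);
  return PurgeObsoleteWals({6, 7, 8}, versions);
}
```

Here 6.log survives the purge even though it is obsolete; in this sketch it would be removed only once a later MANIFEST write marks a number above 6.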

Test plan:

make check

Also ran stress test below (with asan) to make sure it completes successfully.

TEST_TMPDIR=/dev/shm/rocksdb OPT=-g ASAN_OPTIONS=disable_coredump=0 \
CRASH_TEST_EXT_ARGS=--compression_type=zstd SKIP_FORMAT_BUCK_CHECKS=1 \
make J=52 -j52 blackbox_asan_crash_test

facebook-github-bot · 2022-03-18T05:40:36Z

@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2022-03-18T22:56:29Z

@riversand963 has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2022-03-18T22:59:46Z

@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

ltamasi

Thanks for the fix @riversand963 !

ltamasi · 2022-03-21T17:47:09Z

db/db_wal_test.cc

+  SyncPoint::GetInstance()->SetCallBack(
+      "MemTableList::TryInstallMemtableFlushResults:AfterComputeMinWalToKeep",
+      [&](void* /*arg*/) {
+        dbfull()->mutex()->AssertHeld();
+        if (!called) {
+          called = true;
+          SyncPoint::GetInstance()->LoadDependency({
+              {"VersionSet::LogAndApply:WriteManifestStart",
+               "DBWALTest::RaceInstallFlushResultsWithWalObsoletion:BgFlush2"},
+              {"DBWALTest::RaceInstallFlushResultsWithWalObsoletion:BgFlush2",
+               "VersionSet::LogAndApply:WriteManifest"},
+          });
+        } else {
+          TEST_SYNC_POINT(
+              "DBWALTest::RaceInstallFlushResultsWithWalObsoletion:BgFlush2");
+        }
+      });


I feel some comment/explanation re: how we set the race condition up here would be helpful.

Sure, will do.

ltamasi · 2022-03-21T17:59:24Z

db/version_set.cc

 // Called only either from ::LogAndApply which is protected by mutex or during
 // recovery which is single-threaded.


This comment makes me wonder if we even need an atomic...

I guess one benefit of using an atomic is that you can read from it in other places without acquiring mutex, e.g. reporting it as an internal stat.
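
To illustrate that trade-off, here is a minimal sketch assuming a hypothetical `WalTracker` class (not the actual `VersionSet` code): writers are serialized by a mutex, while a stats reader loads the atomic without taking the mutex.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical class for illustration only.
class WalTracker {
 public:
  // Writers are serialized by the mutex, so the load-then-store below does
  // not need a compare-and-swap loop.
  void MarkMinLogNumberToKeep(uint64_t number) {
    std::lock_guard<std::mutex> guard(mutex_);
    if (min_log_number_to_keep_.load(std::memory_order_relaxed) < number) {
      min_log_number_to_keep_.store(number, std::memory_order_relaxed);
    }
  }

  // Readers such as a stats reporter can read without taking the mutex.
  // The value may be slightly stale, but the read is never torn.
  uint64_t MinLogNumberToKeepForStats() const {
    return min_log_number_to_keep_.load(std::memory_order_relaxed);
  }

 private:
  std::mutex mutex_;
  std::atomic<uint64_t> min_log_number_to_keep_{0};
};

// Usage example: the tracked value only ever increases.
uint64_t ExampleUsage() {
  WalTracker tracker;
  tracker.MarkMinLogNumberToKeep(7);
  tracker.MarkMinLogNumberToKeep(6);  // ignored, would move backwards
  assert(tracker.MinLogNumberToKeepForStats() == 7);
  return tracker.MinLogNumberToKeepForStats();
}
```

If every reader already held the mutex, a plain `uint64_t` would be enough; the atomic matters only for the lock-free read path.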

riversand963 · 2022-03-21T18:39:15Z

Thanks @ltamasi for the review!

facebook-github-bot · 2022-03-23T01:19:05Z

@riversand963 has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2022-03-23T01:20:32Z

@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.


facebook-github-bot · 2022-03-24T00:15:39Z

@riversand963 has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2022-03-24T00:17:35Z

@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Summary:
There is a race condition if WAL tracking in the MANIFEST is enabled in a database that disables 2PC.

The race condition is between two background flush threads trying to install flush results to the MANIFEST.

Consider an example database with two column families: "default" (cfd0) and "cf1" (cfd1). Initially,
both column families have one mutable (active) memtable whose data is backed by 6.log.

1. Trigger a manual flush for "cf1", creating 7.log
2. Insert another key to "default", and trigger flush for "default", creating 8.log
3. BgFlushThread1 finishes writing 9.sst
4. BgFlushThread2 finishes writing 10.sst

```
Time  BgFlushThread1                                    BgFlushThread2
 |    mutex_.Lock()
 |    precompute min_wal_to_keep as 6
 |    mutex_.Unlock()
 |                                                     mutex_.Lock()
 |                                                     precompute min_wal_to_keep as 6
 |                                                     join MANIFEST write queue and mutex_.Unlock()
 |    write to MANIFEST
 |    mutex_.Lock()
 |    cfd1->log_number = 7
 |    Signal bg_flush_2 and mutex_.Unlock()
 |                                                     wake up and mutex_.Lock()
 |                                                     cfd0->log_number = 8
 |                                                     FindObsoleteFiles() with job_context->log_number == 7
 |                                                     mutex_.Unlock()
 |                                                     PurgeObsoleteFiles() deletes 6.log
 V
```

As shown above, BgFlushThread2 thinks that the min wal to keep is 6.log because "cf1" has unflushed data in 6.log (cf1.log_number=6).
Similarly, BgFlushThread1 thinks that the min wal to keep is also 6.log because "default" has unflushed data (default.log_number=6).
No WAL deletion will be written to MANIFEST because 6 is equal to `versions_->wals_.min_wal_number_to_keep`,
due to https://github.com/facebook/rocksdb/blob/7.1.fb/db/memtable_list.cc#L513:L514.
The bg flush thread that finishes last will perform file purging. `job_context.log_number` will be evaluated as 7, i.e.
the min wal that contains unflushed data, causing 6.log to be deleted. However, MANIFEST thinks 6.log should still exist.
If you close the db at this point, you won't be able to re-open it if `track_and_verify_wal_in_manifest` is true.

We must handle the case of multiple bg flush threads, and it is difficult for one bg flush thread to know
the correct min wal number until the other bg flush threads have finished committing to the manifest and updated
the `cfd::log_number`.
To fix this issue, we rename an existing variable `min_log_number_to_keep_2pc` to `min_log_number_to_keep`,
and use it to track WAL file deletion in non-2pc mode as well.
This variable is updated only 1) during recovery with mutex held, or 2) in the MANIFEST write thread.
`min_log_number_to_keep` means RocksDB will delete WALs below it, although there may be WALs
above it which are also obsolete. Formally, we will have [min_wal_to_keep, max_obsolete_wal]. During recovery, we
make sure that only WALs above max_obsolete_wal are checked and added back to `alive_log_files_`.

Pull Request resolved: #9715

Test Plan:
```
make check
```
Also ran stress test below (with asan) to make sure it completes successfully.
```
TEST_TMPDIR=/dev/shm/rocksdb OPT=-g ASAN_OPTIONS=disable_coredump=0 \
CRASH_TEST_EXT_ARGS=--compression_type=zstd SKIP_FORMAT_BUCK_CHECKS=1 \
make J=52 -j52 blackbox_asan_crash_test
```

Reviewed By: ltamasi

Differential Revision: D34984412

Pulled By: riversand963

fbshipit-source-id: c7b21a8d84751bb55ea79c9f387103d21b231005

facebook-github-bot added the CLA Signed label Mar 18, 2022

riversand963 requested review from ltamasi and ajkr March 18, 2022 05:46

riversand963 changed the title ~~Fix a race condition in WAL tracking causing DB open failure~~ [Draft] Fix a race condition in WAL tracking causing DB open failure Mar 18, 2022

riversand963 force-pushed the fix-race-track-wal-in-manifest branch from 2fa2ce5 to 166ee44 Compare March 18, 2022 22:56

riversand963 changed the title ~~[Draft] Fix a race condition in WAL tracking causing DB open failure~~ Fix a race condition in WAL tracking causing DB open failure Mar 18, 2022

ltamasi approved these changes Mar 21, 2022

riversand963 force-pushed the fix-race-track-wal-in-manifest branch from 166ee44 to d35458a Compare March 23, 2022 01:19

riversand963 added 14 commits March 23, 2022 17:14

minimal repro unit test

ad2b2d4

minor

abfc8b9

Try to repro with txndb, tbc

e9ea10d

Use wal set to determine log file number to keep

402e110

Remove fprintfs

8a5a989

Revert "Use wal set to determine log file number to keep"

f2e0843

This reverts commit e22e8024778af4b4d6e471c684e208ad950a10a4.

Use another var to track min wal in non-2pc

9e2c857

Update min_log_number_to_keep_non_2pc during recovery

c6e60aa

Remove unnecessary unit test for txn db

36e6ce7

Update HISTORY

d5a4bda

Combine min_log_number_to_keep_2pc and min_log_number_to_keep_non_2pc

c0a8bac

Add some comments about sync points

e389f09

Add some comments

b1bc4db

Update HISTORY

ae3b0bc

riversand963 force-pushed the fix-race-track-wal-in-manifest branch from d35458a to ae3b0bc Compare March 24, 2022 00:15

facebook-github-bot closed this in e0c84aa Mar 24, 2022

riversand963 deleted the fix-race-track-wal-in-manifest branch March 24, 2022 03:40
