
Allow DB resume after background errors #3997

Closed
wants to merge 13 commits

Conversation

anand1976 (Contributor):

Currently, if RocksDB encounters an error during a write operation (user-requested or background), it sets DBImpl::bg_error_ and fails subsequent writes. This PR allows the DB to be resumed for certain classes of errors. It consists of three parts:

  1. Introduce Status::Severity in rocksdb::Status to indicate whether a given error can be recovered from or not
  2. Refactor the error handling code so that setting bg_error_ and deciding on severity is in one place
  3. Provide an API for the user to clear the error and resume the DB instance

This whole change is broken up into multiple PRs. Initially, we only allow clearing the error for Status::NoSpace() errors during background flush/compaction. Subsequent PRs will expand this to include more errors and foreground operations such as Put(), and implement a polling mechanism for out-of-space errors.
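For illustration, here is a minimal sketch of how an application could observe the severity of a background error. It assumes the `Status::severity()` accessor and severity levels introduced by this PR, plus the existing `EventListener::OnBackgroundError()` hook; it is not part of this PR's diff.

```cpp
// Sketch only; assumes the Status::Severity API added in this PR.
#include "rocksdb/listener.h"
#include "rocksdb/status.h"

class BGErrorObserver : public rocksdb::EventListener {
 public:
  void OnBackgroundError(rocksdb::BackgroundErrorReason /*reason*/,
                         rocksdb::Status* bg_error) override {
    if (bg_error->severity() >= rocksdb::Status::Severity::kFatalError) {
      // Fatal/unrecoverable: DB::Resume() will not help; plan to reopen.
      fatal_seen_ = true;
    } else {
      // Soft/hard errors: recovery via DB::Resume() may be possible once
      // the underlying condition (e.g. disk space) has been addressed.
      recoverable_seen_ = true;
    }
  }

 private:
  bool fatal_seen_ = false;
  bool recoverable_seen_ = false;
};
```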

@siying (Contributor) left a comment:

It's great!

db/db_impl.h Outdated
@@ -1424,6 +1425,8 @@ class DBImpl : public DB {

// Flag to check whether Close() has been called on this DB
bool closed_;

std::unique_ptr<ErrorHandler> error_handler_;
Contributor:

Why a pointer rather than ErrorHandler error_handler_?

Contributor Author:

I wanted to explicitly construct it, passing DBImpl::mutex_ and DBImpl::immutable_db_options_. Although it occurs to me now that I can construct it during DBImpl construction instead.
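As a self-contained sketch of the alternative being discussed (hypothetical stand-in types, not the actual DBImpl code), embedding the handler lets it be constructed in the initializer list from the members it depends on:

```cpp
#include <mutex>

// Hypothetical stand-ins for the DBImpl members the handler needs.
struct ImmutableDBOptions {};

class ErrorHandler {
 public:
  ErrorHandler(const ImmutableDBOptions& opts, std::mutex* db_mutex)
      : opts_(opts), db_mutex_(db_mutex) {}

 private:
  const ImmutableDBOptions& opts_;
  std::mutex* db_mutex_;
};

class DBImplSketch {
 public:
  // The handler is constructed in the initializer list, so no unique_ptr
  // (and no separate heap allocation) is needed.
  DBImplSketch() : error_handler_(immutable_db_options_, &mutex_) {}

 private:
  // Declaration order matters: the members the handler depends on must be
  // declared (and therefore constructed) before error_handler_.
  std::mutex mutex_;
  ImmutableDBOptions immutable_db_options_;
  ErrorHandler error_handler_;
};

int main() {
  DBImplSketch db;
  (void)db;
  return 0;
}
```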

db/db_impl.cc Outdated
}

ROCKS_LOG_INFO(immutable_db_options_.info_log,
"Resuming DB");
Contributor:

`make format` may format it differently.

db/db_impl.cc Outdated

mutex_.Lock();
if (bg_work_paused_ == 0) {
MaybeScheduleFlushOrCompaction();
Contributor:

I feel it is safer to always call MaybeScheduleFlushOrCompaction().

db/db_impl.cc Outdated
// No need to check BGError again. If something happened, event listener would be
// notified and the operation causing it would have failed
ROCKS_LOG_INFO(immutable_db_options_.info_log,
"Successfully resumed DB");
Contributor:

We usually try to avoid logging while holding the DB mutex. This applies to multiple log statements in this function. We may not care in this case, where the DB is already stopped, but it would be great to keep this practice whenever possible.
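A self-contained sketch of the practice being described (hypothetical names, with std::mutex standing in for the DB mutex): prepare the message under the lock, emit it after releasing the lock.

```cpp
#include <cstdio>
#include <mutex>
#include <string>

std::mutex db_mutex;  // stand-in for DBImpl::mutex_

void LogInfo(const std::string& msg) { std::printf("%s\n", msg.c_str()); }

void ResumeSketch() {
  std::string log_msg;
  {
    std::lock_guard<std::mutex> guard(db_mutex);
    // Mutate DB state under the mutex, but only *prepare* the log message.
    log_msg = "Successfully resumed DB";
  }
  // Emit the log after releasing the mutex, so potentially slow logger I/O
  // never blocks other threads waiting on the DB mutex.
  LogInfo(log_msg);
}

int main() {
  ResumeSketch();
  return 0;
}
```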

db/db_impl.cc Outdated
Status s = error_handler_->GetBGError();
if (s.severity() > Status::Severity::kHardError) {
ROCKS_LOG_INFO(immutable_db_options_.info_log,
"DB resume requested but failed due to Fatal/Unrecoverable error");
Contributor:

So a hard error is something that can be recovered from?

Contributor Author:

As of now, the only hard error we can recover from is running out of space in SstFileManager during background compaction/flush. The user can increase the limit and resume the DB.
Eventually, I plan to add more cases, like the filesystem running out of space while writing the WAL, but that requires more work, such as flushing memtables.
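A sketch of the recovery flow just described, assuming an SstFileManager space limit caused the hard error. It uses `SstFileManager::SetMaxAllowedSpaceUsage()` and the `DB::Resume()` API from this PR; the limit value is illustrative.

```cpp
#include <cstdint>

#include "rocksdb/db.h"
#include "rocksdb/sst_file_manager.h"

rocksdb::Status TryResumeAfterNoSpace(rocksdb::DB* db,
                                      rocksdb::SstFileManager* sfm,
                                      uint64_t new_limit_bytes) {
  // 1. Raise (or remove) the space limit that triggered the NoSpace error.
  sfm->SetMaxAllowedSpaceUsage(new_limit_bytes);  // 0 means unlimited
  // 2. Ask the DB to clear the background error and resume normal operation.
  return db->Resume();
}
```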

return Status::OK();
}

auto paranoid = db_options_.paranoid_checks;
Contributor:

I prefer not to use auto in this case, as the type is not clear in this context.

}

auto paranoid = db_options_.paranoid_checks;
Status::Severity sev = Status::Severity::kNoError;
Contributor:

Is defaulting to no error risky? How about setting a default that stops the DB? That way, if the maps don't cover all the possible cases, at least we stop the DB.

Contributor Author:

Yes, we can set the default to fatal error to be on the safe side.
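A self-contained sketch of that idea (illustrative enums and map contents, not the PR's exact table): look up the severity by (reason, code) and fall back to a fatal default when the pair is unmapped.

```cpp
#include <map>
#include <utility>

enum class BackgroundErrorReason { kFlush, kCompaction, kWriteCallback };
enum class Code { kIOError, kNoSpace };
enum class Severity { kNoError, kSoftError, kHardError, kFatalError };

static const std::map<std::pair<BackgroundErrorReason, Code>, Severity>
    kErrorSeverityMap = {
        {{BackgroundErrorReason::kFlush, Code::kNoSpace}, Severity::kHardError},
        {{BackgroundErrorReason::kCompaction, Code::kNoSpace},
         Severity::kHardError},
};

Severity LookupSeverity(BackgroundErrorReason reason, Code code) {
  auto it = kErrorSeverityMap.find({reason, code});
  // Default to the most severe outcome: an unmapped (reason, code) pair
  // stops the DB rather than being silently treated as recoverable.
  return it == kErrorSeverityMap.end() ? Severity::kFatalError : it->second;
}
```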

};

Status ErrorHandler::SetBGError(const Status& bg_err, BackgroundErrorReason reason) {
if (bg_err.ok()) {
Contributor:

Assert that the DB mutex is held?

trig_no_space(false), trig_io_error(false) {}

void SetTrigNoSpace() {trig_no_space = true;}
void SetTrigIoError() {trig_io_error = true;}
Contributor:

`make format`

kFatalError = 3,
kUnrecoverableError = 4,
kMaxSeverity
};
Contributor:

How many bytes does it take by default? If more than 1, maybe it's a good idea to explicitly declare it to be a byte, just as we do with PerfLevel.

Contributor Author:

The default enum size is 4 bytes, I think? I don't mind setting it to unsigned char. Code and SubCode are the default size, though, so it might look a little inconsistent.

Contributor:

I mean change all of them.

Contributor Author:

What do you hope to optimize? The Status copy constructor explicitly copies each field, so I don't think we'll gain anything in terms of CPU. It is mostly used on the stack, so it won't change memory usage either.
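For reference, a sketch of what changing all of them would look like (field layout and enumerator values are illustrative, not the actual Status definition): giving each enum a one-byte underlying type shrinks the per-object footprint.

```cpp
class StatusSketch {
 public:
  // One-byte underlying types, analogous to what is suggested above.
  enum Code : unsigned char { kOk = 0, kNotFound, kIOError };
  enum SubCode : unsigned char { kNone = 0, kNoSpace };
  enum class Severity : unsigned char {
    kNoError = 0,
    kSoftError,
    kHardError,
    kFatalError,
    kUnrecoverableError,
    kMaxSeverity
  };

 private:
  Code code_ = kOk;
  SubCode subcode_ = kNone;
  Severity sev_ = Severity::kNoError;
  const char* state_ = nullptr;
};

static_assert(sizeof(StatusSketch::Severity) == 1,
              "one-byte underlying type");
```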

@anand1976 force-pushed the bg_error branch 2 times, most recently from 228c067 to a045782 on June 22, 2018 at 23:40.
@siying (Contributor) left a comment:

Great!

@facebook-github-bot left a comment:

@anand1976 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot:

@anand1976 has updated the pull request. View: changes, changes since last import

@facebook-github-bot left a comment:

@anand1976 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot:

@anand1976 has updated the pull request.

Anand Ananthabhotla added 12 commits June 27, 2018 15:52
Summary:
The local variable mutex in DBImpl::Resume shadows DBImpl::mutex().

Summary:
1. Move DBImpl::error_handler_ to the constructor so it's available to BG
threads during DB::Open()
2. Set up a sync point dependency so compaction starts only after the flush
call is finished

Summary:
1. Change DBImpl::error_handler_ from unique_ptr to embedded object
2. Make errors with reason BackgroundErrorReason::kWriteCallback fatal
3. Fix formatting
4. Misc

Summary:
This will hopefully reduce the size of the Status object and lower CPU
utilization due to fewer bytes to copy around.

Summary:
1. Formatting
2. Remove a dead store in db/error_handler.cc, line 121

@facebook-github-bot:

@anand1976 has updated the pull request.

@facebook-github-bot:

@anand1976 has updated the pull request. View: changes, changes since last import

@facebook-github-bot left a comment:

@anand1976 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot left a comment:

@anand1976 is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

rcane pushed a commit to rcane/rocksdb that referenced this pull request Sep 13, 2018
Summary:
Currently, if RocksDB encounters errors during a write operation (user requested or BG operations), it sets DBImpl::bg_error_ and fails subsequent writes. This PR allows the DB to be resumed for certain classes of errors. It consists of 3 parts -
1. Introduce Status::Severity in rocksdb::Status to indicate whether a given error can be recovered from or not
2. Refactor the error handling code so that setting bg_error_ and deciding on severity is in one place
3. Provide an API for the user to clear the error and resume the DB instance

This whole change is broken up into multiple PRs. Initially, we only allow clearing the error for Status::NoSpace() errors during background flush/compaction. Subsequent PRs will expand this to include more errors and foreground operations such as Put(), and implement a polling mechanism for out-of-space errors.
Closes facebook#3997

Differential Revision: D8653831

Pulled By: anand1976

fbshipit-source-id: 6dc835c76122443a7668497c0226b4f072bc6afd
facebook-github-bot pushed a commit that referenced this pull request Feb 25, 2022
Summary:
**Context:**
As part of #6949, file deletion is disabled for a faulty database on an IOError during MANIFEST write/sync and [re-enabled again during `DBImpl::Resume()` if all recovery is completed](e66199d#diff-d9341fbe2a5d4089b93b22c5ed7f666bc311b378c26d0786f4b50c290e460187R396). Before re-enabling file deletion, it runs `assert(versions_->io_status().ok());`, which IMO assumes `versions_` is **the** `versions_` used in the recovery process.

However, this is not necessarily true, because `s = error_handler_.ClearBGError();`, which happens before that assertion, can unblock some foreground thread via [`EventHelpers::NotifyOnErrorRecoveryEnd()`](https://github.com/facebook/rocksdb/blob/3122cb435875d720fc3d23a48eb7c0fa89d869aa/db/error_handler.cc#L552-L553) as part of `ClearBGError()`. That foreground thread can do whatever it wants, including closing/reopening the db and cleaning up that same `versions_`.

As a consequence, `assert(versions_->io_status().ok());` will call `io_status()` on a nullptr, and tests like `DBErrorHandlingFSTest.MultiCFWALWriteError` become flaky. The unblocked foreground thread (in this case, the testing thread) proceeds to [reopen the db](https://github.com/facebook/rocksdb/blob/6.29.fb/db/error_handler_fs_test.cc?fbclid=IwAR1kQOxSbTUmaHQPAGz5jdMHXtDsDFKiFl8rifX-vIz4B23Y0S9jBkssSCg#L1494), where [`versions_` gets reset to nullptr](https://github.com/facebook/rocksdb/blob/6.29.fb/db/db_impl/db_impl.cc?fbclid=IwAR2uRhwBiPKgmE9q_6CM2mzbfwjoRgsGpXOrHruSJUDcAKc9rYZtVSvKdOY#L678) as part of the old db clean-up. If this happens right before `assert(versions_->io_status().ok());` gets executed in the background thread, then we can see errors like
```
db/db_impl/db_impl.cc:420:5: runtime error: member call on null pointer of type 'rocksdb::VersionSet'
assert(versions_->io_status().ok());
```

**Summary:**
- I propose calling `s = error_handler_.ClearBGError();` after we know it's fine to wake up the foreground, which I think is right before we log `ROCKS_LOG_INFO(immutable_db_options_.info_log, "Successfully resumed DB");`
   - As context, the original #3997, which introduced `DBImpl::Resume()`, calls `s = error_handler_.ClearBGError();` very close to the call to `ROCKS_LOG_INFO(immutable_db_options_.info_log, "Successfully resumed DB");`, while the later #6949 distances these two calls a bit.
   - It also seems fine to me that `s = error_handler_.ClearBGError();` happens after `EnableFileDeletions(/*force=*/true);`, at least syntax-wise, since these two functions are orthogonal. And it seems okay to me that we re-enable file deletion before `s = error_handler_.ClearBGError();`, which is basically resetting some state variables.
- In addition, to preserve the previous behavior of #6949, where the status of re-enabling file deletion is not taken into account in the general status of resuming the db, I separated `enable_file_deletion_s` from the general `s`.
- In addition, to make the `ROCKS_LOG_INFO(immutable_db_options_.info_log, "Successfully resumed DB");` path clearer, I separated it into its own if-block.

Pull Request resolved: #9496

Test Plan:
- Manually reproduce the assertion failure in `DBErrorHandlingFSTest.MultiCFWALWriteError` by injecting a sleep like below, so that it's more likely for `assert(versions_->io_status().ok());` to execute after [reopening the db](https://github.com/facebook/rocksdb/blob/6.29.fb/db/error_handler_fs_test.cc?fbclid=IwAR1kQOxSbTUmaHQPAGz5jdMHXtDsDFKiFl8rifX-vIz4B23Y0S9jBkssSCg#L1494) in the foreground (i.e., testing) thread
```
sleep(1);
assert(versions_->io_status().ok());
```
   `python3 gtest-parallel/gtest_parallel.py -r 100 -w 100 rocksdb/error_handler_fs_test --gtest_filter=DBErrorHandlingFSTest.MultiCFWALWriteError`
   ```
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from DBErrorHandlingFSTest
[ RUN      ] DBErrorHandlingFSTest.MultiCFWALWriteError
Received signal 11 (Segmentation fault)
#0   rocksdb/error_handler_fs_test() [0x5818a4] rocksdb::DBImpl::ResumeImpl(rocksdb::DBRecoverContext)  /data/users/huixiao/rocksdb/db/db_impl/db_impl.cc:421
#1   rocksdb/error_handler_fs_test() [0x6379ff] rocksdb::ErrorHandler::RecoverFromBGError(bool) /data/users/huixiao/rocksdb/db/error_handler.cc:600
#2   rocksdb/error_handler_fs_test() [0x7c5362] rocksdb::SstFileManagerImpl::ClearError()       /data/users/huixiao/rocksdb/file/sst_file_manager_impl.cc:310
#3   rocksdb/error_handler_fs_test()
   ```
- The assertion failure does not happen with PR
`python3 gtest-parallel/gtest_parallel.py -r 100 -w 100 rocksdb/error_handler_fs_test --gtest_filter=DBErrorHandlingFSTest.MultiCFWALWriteError`
`[100/100] DBErrorHandlingFSTest.MultiCFWALWriteError (43785 ms)  `

Reviewed By: riversand963, anand1976

Differential Revision: D33990099

Pulled By: hx235

fbshipit-source-id: 2e0259a471fa8892ff177da91b3e1c0792dd7bab
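As a self-contained sketch of the ordering change described in the commit message above (hypothetical stand-ins, not the actual RocksDB source): file deletions are re-enabled first, and `ClearBGError()` runs last, immediately before announcing success, so nothing after it touches state a woken foreground thread could destroy.

```cpp
#include <cstdio>
#include <string>

struct Status {
  bool ok_ = true;
  bool ok() const { return ok_; }
  static Status OK() { return Status{}; }
};

// Stand-ins for the real calls discussed in the commit message above.
Status EnableFileDeletionsSketch() { return Status::OK(); }
Status ClearBGErrorSketch() { return Status::OK(); }  // may wake foreground
void LogInfo(const std::string& msg) { std::printf("%s\n", msg.c_str()); }

Status ResumeImplSketch() {
  // The status of re-enabling file deletion is tracked separately so it does
  // not feed into the overall resume status.
  Status enable_file_deletion_s = EnableFileDeletionsSketch();
  if (!enable_file_deletion_s.ok()) {
    LogInfo("Failed to re-enable file deletions");
  }

  // ClearBGError() comes last: once it runs, foreground threads may proceed
  // (even close/reopen the DB), so nothing after it may touch old DB state.
  Status s = ClearBGErrorSketch();
  if (s.ok()) {
    LogInfo("Successfully resumed DB");
  }
  return s;
}

int main() { return ResumeImplSketch().ok() ? 0 : 1; }
```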
ajkr pushed a commit that referenced this pull request Mar 3, 2022
ajkr pushed a commit that referenced this pull request Mar 7, 2022