Auto recovery from out of space errors #4164

anand1976 · 2018-07-22T04:40:25Z

Summary:
This commit implements automatic recovery from a Status::NoSpace() error
during background operations such as write callback, flush and
compaction. The broad design is as follows -

Compaction errors are treated as soft errors and don't put the
database in read-only mode. A compaction is delayed until enough free
disk space is available to accomodate the compaction outputs, which is
estimated based on the input size. This means that users can continue to
write, and we rely on the WriteController to delay or stop writes if the
compaction debt becomes too high due to persistent low disk space
condition
Errors during write callback and flush are treated as hard errors,
i.e the database is put in read-only mode and goes back to read-write
only fater certain recovery actions are taken.
Both types of recovery rely on the SstFileManagerImpl to poll for
sufficient disk space. We assume that there is a 1-1 mapping between an
SFM and the underlying OS storage container. For cases where multiple
DBs are hosted on a single storage container, the user is expected to
allocate a single SFM instance and use the same one for all the DBs. If
no SFM is specified by the user, DBImpl::Open() will allocate one, but
this will be one per DB and each DB will recover independently. The
recovery implemented by SFM is as follows -
a) On the first occurance of an out of space error during compaction,
subsequent
compactions will be delayed until the disk free space check indicates
enough available space. The required space is computed as the sum of
input sizes.
b) The free space check requirement will be removed once the amount of
free space is greater than the size reserved by in progress
compactions when the first error occured
c) If the out of space error is a hard error, a background thread in
SFM will poll for sufficient headroom before triggering the recovery
of the database and putting it in write-only mode. The headroom is
calculated as the sum of the write_buffer_size of all the DB instances
associated with the SFM
EventListener callbacks will be called at the start and completion of
automatic recovery. Users can disable the auto recov ery in the start
callback, and later initiate it manually by calling DB::Resume()

Todo:

More extensive testing
Add disk full condition to db_stress (follow-on PR)

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

anand1976 · 2018-07-22T15:12:50Z

Setting the WIP tag to account for the remaining testing. The implementation is largely complete, I think, pending feedback/comments.

sagar0 · 2018-08-16T20:49:05Z

include/rocksdb/listener.h

@@ -27,6 +27,7 @@ enum class TableFileCreationReason {
  kFlush,
  kCompaction,
  kRecovery,
+  kMisc,


It might be good to be more specific with the reason here.

Let me rethink this. I was using TableFileCreationReason to signal SstFileManagerImpl::OnAddFile that the file was added due to compaction (with kMisc as sort of the default), but it doesn't necessarily mean the file was just created. It could be called during DB::Open, when its just listing all the files. So maybe TableFileCreationReason is not appropriate. Instead, I can have an explicit parameter in OnAddFile to specify whether the file being added was created by compaction.

siying

I'm still reviewing. This is more complicated than I though.

siying · 2018-08-22T20:36:52Z

.travis.yml

@@ -116,6 +121,7 @@ script:
      mkdir build && cd build && cmake -DJNI=1 .. -DCMAKE_BUILD_TYPE=Release && make -j4 rocksdb rocksdbjni
      ;;
    esac
+  - for i in $(find ./ -maxdepth 1 -name 'core*' -print); do gdb ./error_handler_test core* -ex "thread apply all bt" -ex "set pagination 0" -batch; done;


Oh wow, this is complicated unit test technique...

This is just temporary. I'm not going to merge this file. One of the tests is crashing only in Travis, so this is trying to print the backtrace.

siying · 2018-08-22T20:40:34Z

db/error_handler.h

-  private:
+    bool IsRecoveryInProgress() { return recovery_in_prog_; }
+
+    Status RecoverFromBGError(bool exclusive = false);


Can you explain why we need exclusive and non-exclusive mode, and can't make everything exclusive?

That's in fact the idea - to have only one kind of recovery going on (either auto or manual). I think the implementation is probably confusing. The idea is that a user that doesn't want auto recovery would turn it off in the OnErrorRecoveryBegin event callback, and then call DB::Resume(). The exclusive parameter here is to indicate whether the call is from manual recovery or not. I should change the name to be more explicit.

siying · 2018-08-22T20:42:16Z

db/db_impl.h

@@ -844,6 +845,8 @@ class DBImpl : public DB {
                 bool read_only = false, bool error_if_log_file_exist = false,
                 bool error_if_data_exists_in_logs = false);

+  Status ResumeImpl();


Can you write a comment describing what the function is doing?

siying · 2018-08-22T23:43:12Z

db/db_impl.cc

-                 "Shutdown: canceling all background work");
+  if (shutdown) {
+    ROCKS_LOG_INFO(immutable_db_options_.info_log,
+                   "Shutdown: canceling all background work");


If we set a flag here, will the race condition between recovery and shutdown be easier to resolve?

That's a interesting suggestion. We could set a new flag here, and in DBImpl::CloseHelper, call error_handler_.CancelErrorRecovery before CancelAllBackgroundWork. That way, we would not have both trying to schedule memtable flushes. Let me look into this a bit more.

siying · 2018-08-22T23:50:17Z

db/db_impl.cc

 }

 // Will lock the mutex_,  will wait for completion if wait is true
-void DBImpl::CancelAllBackgroundWork(bool wait) {
+void DBImpl::CancelAllBackgroundWork(bool wait, bool shutdown) {


Is all what it does in shutdown=false case these lines?

while (bg_bottom_compaction_scheduled_ || bg_compaction_scheduled_ || bg_flush_scheduled_) { bg_cv_.Wait(); }

If that is the case, is it easier to make these lines a helper function, rather than fitting into CancelAllBackgroundWork()? Otherwise, it doesn't feel cancelling any background work.

That's a good point. It is effectively just waiting for background work to be completed. This can be moved to a helper and reused.

siying · 2018-08-23T00:03:00Z

util/sst_file_manager_impl.cc

+      closing_(false),
+      reserved_disk_buffer_(0),
+      free_space_trigger_(0) {
+  bg_thread_.reset(new port::Thread(&SstFileManagerImpl::ClearError, this));


Is there a way to start it only when out-of-space happened?

Yes. I think that's a good idea. That way, if an out of space error never happens, the behavior is exactly the same as it is now.

siying · 2018-08-23T00:12:08Z

util/sst_file_manager_impl.cc

+    Status s = env_->GetFreeSpace(path_, &free_space);
+    if (s.ok()) {
+      if (bg_err_.severity() == Status::Severity::kHardError) {
+        // Give priority to hard errors


Can you comment more about what it means?

Its referring to the case where some DB instances experienced soft errors (i.e compaction failure) and some experienced hard errors (i.e flush/WAL write failures), we would estimate required free space as if all instances experienced hard errors. For hard errors, the free space threshold is reserved_disk_buffer_, which is basically the sum of write_buffer_size of all instances. For soft errors, the threshold is free_space_trigger_, which is the value of cur_compactions_reserved_size_ at the time of error.

siying · 2018-08-23T00:23:09Z

util/sst_file_manager_impl.cc

+      (needed_headroom + total_files_size_ > max_allowed_space_)) {
+    cur_compactions_reserved_size_ -= size_added_by_compaction;
+    return false;
+  }


This is a fairly complicated logic. I wonder whether we can simplify it, trading off some feature supports.

There are a couple of things I added here that increased the complexity - 1. The in_progress_files_size_ was added in order to account for a compaction that has generated some outputs that are already counted in cur_compactions_reserved_size_, and 2. Since compaction_buffer_size_ is a user provided parameter during SstFileManager construction, choose a default if they didn't set it.

I can move these down below to the NoSpace case in order to avoid impacting the normal case. Maybe #2 can also be removed, depending on how conservative we want to be when estimating headroom after getting out of space error.

siying · 2018-08-23T00:25:32Z

util/sst_file_manager_impl.cc

+  } else {
+    // Take a snapshot of cur_compactions_reserved_size_ for when we encounter
+    // a NoSpace error.
+    free_space_trigger_ = cur_compactions_reserved_size_;


I have a worry using this one.
For example, if a user runs universal compaction, and has 200 shards, and it is out-of-memory. Eventually, full compaction of all the shards will reserve the size equivalent to the whole shard size, and fail. The compaction trigger would be equal to the total size of the 200 shards. Then users need to drop the disk space to lower than 50% to make it recover. This is going to be much more than what is needed. Is it a valid worry?

The cur_compactions_reserved_size_ is the space reserved by currently running compactions. In your example, would the user tune max_background_jobs to limit the number of concurrent compactions? If they set it to a high number, they will always run the risk of doubling the space used. This check may, in fact benefit that case, as it does not stop background work but rather just slows it down. If EnoughRoomForCompaction returns false, the compaction is retried after some time. So it would end up throttling concurrent compactions to some lower number that can actually be accommodated by the available disk space.

siying · 2018-08-23T17:18:43Z

When an SST file manager manages multiple DBs, can we recover them slowly, rather than enable them all at one shot? I'm worry about the sudden space increase may complicate things.

@anand1976

Summary: Search or jump to… Pull requests Issues Marketplace Explore @anand1976 Sign out 882 11,071 2,425 facebook/rocksdb Code Issues 124 Pull requests 88 Projects 1 Wiki Insights Auto recovery from out of space errors Summary: This commit implements automatic recovery from a Status::NoSpace() error during background operations such as write callback, flush and compaction. The broad design is as follows - 1. Compaction errors are treated as soft errors and don't put the database in read-only mode. A compaction is delayed until enough free disk space is available to accomodate the compaction outputs, which is estimated based on the input size. This means that users can continue to write, and we rely on the WriteController to delay or stop writes if the compaction debt becomes too high due to persistent low disk space condition 2. Errors during write callback and flush are treated as hard errors, i.e the database is put in read-only mode and goes back to read-write only fater certain recovery actions are taken. 3. Both types of recovery rely on the SstFileManagerImpl to poll for sufficient disk space. We assume that there is a 1-1 mapping between an SFM and the underlying OS storage container. For cases where multiple DBs are hosted on a single storage container, the user is expected to allocate a single SFM instance and use the same one for all the DBs. If no SFM is specified by the user, DBImpl::Open() will allocate one, but this will be one per DB and each DB will recover independently. The recovery implemented by SFM is as follows - a) On the first occurance of an out of space error during compaction, subsequent compactions will be delayed until the disk free space check indicates enough available space. The required space is computed as the sum of input sizes. b) The free space check requirement will be removed once the amount of free space is greater than the size reserved by in progress compactions when the first error occured c) If the out of space error is a hard error, a background thread in SFM will poll for sufficient headroom before triggering the recovery of the database and putting it in write-only mode. The headroom is calculated as the sum of the write_buffer_size of all the DB instances associated with the SFM 4. EventListener callbacks will be called at the start and completion of automatic recovery. Users can disable the auto recov ery in the start callback, and later initiate it manually by calling DB::Resume() Todo: 1. More extensive testing 2. Add disk full condition to db_stress (follow-on PR) Test Plan: Reviewers: Subscribers: Tasks: Tags:

Summary: 1. Fix corner cases where compaction state is left inconsistent a) Compaction triggered by recovery b) Compaction failure after writes are stopped (level0_stop_writes_trigger) - add it back to compaction queue 2. Prevent indefinite wait in DBImpl::DelayWrite() 3. Fix required free space calculation in SstFileManagerImpl Test Plan: Tested using db_bench on a loopback filesystem and periodically filling up the fs Reviewers: Subscribers: Tasks: Tags:

Summary: 1. Rebase and resolve conflicts 2. Work around GetFreeSpace conflict in Windows build 3. Misc. Travis compilation errors Test Plan: Reviewers: Subscribers: Tasks: Tags:

Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:

Summary: .travis.yml with a specific config and coredump/gdb enabled Test Plan: Reviewers: Subscribers: Tasks: Tags:

Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:

Summary: 1. Add tests for multiple column family and multiple DB cases 2. Change SstFileManager::OnAddFile signature 3. Better isolate DB instances in the multiple instance case by not impacting one instance if another has run into error, in order to preserve existing behavior Test Plan: Reviewers: Subscribers: Tasks: Tags:

Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:

Summary: The main conflict is with the commit 7daae51, which refactors flush request processing. Test Plan: Reviewers: Subscribers: Tasks: Tags:

siying · 2018-09-11T23:12:25Z

db/db_impl.h

+  // error recovery from going on in parallel. The latter, shutting_down_,
+  // is set a little later during the shutdown after scheduling memtable
+  // flushes
+  bool shutdown_flag_;


Maybe call it shutdown_initiated_ directly?

siying · 2018-09-11T23:13:15Z

db/db_impl.h

+  // error recovery from going on in parallel. The latter, shutting_down_,
+  // is set a little later during the shutdown after scheduling memtable
+  // flushes
+  bool shutdown_flag_;


Maybe I'm picky but if possible can we place it together with those bool variables, for better padding?

siying · 2018-09-12T17:13:47Z

util/sst_file_manager_impl.cc

+      // NoSpace. If that happens, we abort the recovery here and start
+      // over.
+      mu_.Unlock();
+      s = cur_instance_->RecoverFromBGError();


Is it possible that, before the mutex is acquired again, the same DB recovered, but quickly failed again. Since this DB is still en error_handler_list_, it is not inserted again, but we poped it up in line 294, so the error handle of this failure DB is out of track?

That's a good catch. I think the safest way for now is to check the DB's error status again after acquiring the mutex, and leave it in the list if there is an error.

Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:

Summary: Temporarily disbling these tests in Travis as they are only failing in Travis and not reproducible elsewhere. Test Plan: Reviewers: Subscribers: Tasks: Tags:

facebook-github-bot

anand1976 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Summary: 1. Add override keyword to overridden virtual functions in EventListener 2. Fix a memory corruption that can happen during DB shutdown when in read-only mode due to a background write error 3. Fix uninitialized buffers in error_handler_test.cc that cause valgrind to complain Test Plan: make check make valgrind_test Reviewers: Subscribers: Tasks: Tags:

Summary: 1. Add override keyword to overridden virtual functions in EventListener 2. Fix a memory corruption that can happen during DB shutdown when in read-only mode due to a background write error 3. Fix uninitialized buffers in error_handler_test.cc that cause valgrind to complain Test Plan: make check make check (clang) make valgrind_test make ubsan_check Test Plan: Reviewers: Subscribers: Tasks: Tags:

Summary: 1. Add override keyword to overridden virtual functions in EventListener 2. Fix a memory corruption that can happen during DB shutdown when in read-only mode due to a background write error 3. Fix uninitialized buffers in error_handler_test.cc that cause valgrind to complain Pull Request resolved: #4375 Differential Revision: D9875779 Pulled By: anand1976 fbshipit-source-id: 022ede1edc01a9f7e21ecf4c61ef7d46545d0640

yiwu-arbug · 2019-05-10T22:38:14Z

tools/db_bench_tool.cc

+      cv_.SignalAll();
+    }
+
+    bool WaitForRecovery(uint64_t /*abs_time_us*/) {


@anand1976 Hi! Seems the abs_time_us here should be uncomment? Otherwise db_bench will wait forever on error.

Summary: A new interface method Env::GetFreeSpace was added in #4164. It needs to be implemented for Windows port. Some error_handler_test cases fail on Windows because recovery cannot succeed without free space being reported. Pull Request resolved: #6265 Differential Revision: D19303065 fbshipit-source-id: 1f1a83e53f334284781cf61feabc996e87b945ca

rohansuri · 2020-05-17T16:02:50Z

Hi @anand1976 / @siying, I had some questions around this change. Thanks in advance!

The check here shouldn't have size_added_by_compaction since it is already included in needed_headroom before in the variable initialization?
It seems we never pass compaction=true to SstFileManagerImpl::OnAddFile which means the logic around in_progress_files_size_ is never executed. Any reason why it is turned off?
Finally, in what scenario will an already tracked file by SstFileManagerImpl be added again? And why do subtract its size from cur_compactions_reserved_size_ in that case?

anand1976 requested a review from siying July 22, 2018 04:40

facebook-github-bot added the CLA Signed label Jul 22, 2018

anand1976 requested a review from ajkr July 22, 2018 04:40

anand1976 added the WIP Work in progress label Jul 22, 2018

anand1976 force-pushed the out_of_space_recovery branch from 32e8acf to 32e8767 Compare August 6, 2018 21:55

sagar0 reviewed Aug 16, 2018

View reviewed changes

anand1976 removed the WIP Work in progress label Aug 17, 2018

siying reviewed Aug 22, 2018

View reviewed changes

siying reviewed Aug 23, 2018

View reviewed changes

Anand Ananthabhotla added 16 commits August 27, 2018 18:15

Fix build failures

0222a47

Summary: 1. Rebase and resolve conflicts 2. Work around GetFreeSpace conflict in Windows build 3. Misc. Travis compilation errors Test Plan: Reviewers: Subscribers: Tasks: Tags:

Fix more Travis errors

d3a3c49

Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:

Fix ROCKSDB_LITE build

681d78a

Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:

Debug Travis CI failures

709b6b9

Summary: .travis.yml with a specific config and coredump/gdb enabled Test Plan: Reviewers: Subscribers: Tasks: Tags:

Reenable builds in Travis CI to debug a coredump

24002c6

Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:

Reenable all builds in Travis CI and keep gdb/coredump

a685441

Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:

Debug Travis CI failure: Only enable coredump for TEST_GROUP 1

30548bc

Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:

Fix syntax error in .travis.yml

d2185a2

Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:

Revert all changes in .travis.yml

8653dc1

Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:

2nd attempt at printing stack trace in Travis build

39a473f

Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:

Don't install gdb for osx builds

3daaca9

Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:

Address code review comments

feea658

Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:

Rebase to latest master after resolving conflicts

a59b8b2

Summary: The main conflict is with the commit 7daae51, which refactors flush request processing. Test Plan: Reviewers: Subscribers: Tasks: Tags:

anand1976 force-pushed the out_of_space_recovery branch from 0c05ac8 to a59b8b2 Compare August 29, 2018 00:08

siying approved these changes Sep 12, 2018

View reviewed changes

Address final few code review comments

359d98a

Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:

Anand Ananthabhotla added 8 commits September 13, 2018 19:25

Fix a race between recovery completion and DB shutdown

832fa03

Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:

Fix one more race of shutdown initiated before recovery attempt

7989ce6

Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:

Enable gdb in travis

dbef9f0

Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:

2nd attempt at Travis gdb

893bcc6

Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:

3rd attempt at Travis

98cafaa

Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:

Revert .travis.yml and fix some mem leaks in error_handler_test

cdaae4e

Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:

Disable tests failing only in Travis

6e278e1

Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:

Disable a few more tests in Travis

d4760ff

Summary: Temporarily disbling these tests in Travis as they are only failing in Travis and not reproducible elsewhere. Test Plan: Reviewers: Subscribers: Tasks: Tags:

facebook-github-bot reviewed Sep 15, 2018

View reviewed changes

facebook-github-bot closed this in a27fce4 Sep 15, 2018

yiwu-arbug reviewed May 10, 2019

View reviewed changes

hliu18 mentioned this pull request Jan 7, 2020

Implement getfreespace for WinEnv #6265

Closed

rohansuri mentioned this pull request May 19, 2020

[WIP] Background error handling cockroachdb/pebble#573

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Auto recovery from out of space errors #4164

Auto recovery from out of space errors #4164

anand1976 commented Jul 22, 2018

anand1976 commented Jul 22, 2018

sagar0 Aug 16, 2018

anand1976 Aug 17, 2018

siying left a comment

siying Aug 22, 2018

anand1976 Aug 23, 2018

siying Aug 22, 2018

anand1976 Aug 23, 2018

siying Aug 22, 2018

siying Aug 22, 2018

anand1976 Aug 23, 2018 •

edited

Loading

siying Aug 22, 2018

anand1976 Aug 23, 2018

siying Aug 23, 2018

anand1976 Aug 23, 2018

siying Aug 23, 2018

anand1976 Aug 23, 2018

siying Aug 23, 2018

anand1976 Aug 23, 2018

siying Aug 23, 2018

anand1976 Aug 23, 2018

siying commented Aug 23, 2018

siying Sep 11, 2018

siying Sep 11, 2018

siying Sep 12, 2018

anand1976 Sep 13, 2018

facebook-github-bot left a comment

yiwu-arbug May 10, 2019

rohansuri commented May 17, 2020

Auto recovery from out of space errors #4164

Auto recovery from out of space errors #4164

Conversation

anand1976 commented Jul 22, 2018

anand1976 commented Jul 22, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

siying left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

anand1976 Aug 23, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

siying commented Aug 23, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

facebook-github-bot left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rohansuri commented May 17, 2020

anand1976 Aug 23, 2018 •

edited

Loading