Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Auto recovery from out of space errors #4164

Closed
wants to merge 25 commits into from

Conversation

anand1976
Copy link
Contributor

Summary:
This commit implements automatic recovery from a Status::NoSpace() error
during background operations such as write callback, flush and
compaction. The broad design is as follows -

  1. Compaction errors are treated as soft errors and don't put the
    database in read-only mode. A compaction is delayed until enough free
    disk space is available to accomodate the compaction outputs, which is
    estimated based on the input size. This means that users can continue to
    write, and we rely on the WriteController to delay or stop writes if the
    compaction debt becomes too high due to persistent low disk space
    condition
  2. Errors during write callback and flush are treated as hard errors,
    i.e the database is put in read-only mode and goes back to read-write
    only fater certain recovery actions are taken.
  3. Both types of recovery rely on the SstFileManagerImpl to poll for
    sufficient disk space. We assume that there is a 1-1 mapping between an
    SFM and the underlying OS storage container. For cases where multiple
    DBs are hosted on a single storage container, the user is expected to
    allocate a single SFM instance and use the same one for all the DBs. If
    no SFM is specified by the user, DBImpl::Open() will allocate one, but
    this will be one per DB and each DB will recover independently. The
    recovery implemented by SFM is as follows -
    a) On the first occurance of an out of space error during compaction,
    subsequent
    compactions will be delayed until the disk free space check indicates
    enough available space. The required space is computed as the sum of
    input sizes.
    b) The free space check requirement will be removed once the amount of
    free space is greater than the size reserved by in progress
    compactions when the first error occured
    c) If the out of space error is a hard error, a background thread in
    SFM will poll for sufficient headroom before triggering the recovery
    of the database and putting it in write-only mode. The headroom is
    calculated as the sum of the write_buffer_size of all the DB instances
    associated with the SFM
  4. EventListener callbacks will be called at the start and completion of
    automatic recovery. Users can disable the auto recov ery in the start
    callback, and later initiate it manually by calling DB::Resume()

Todo:

  1. More extensive testing
  2. Add disk full condition to db_stress (follow-on PR)

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

@anand1976 anand1976 requested a review from siying July 22, 2018 04:40
@anand1976 anand1976 requested a review from ajkr July 22, 2018 04:40
@anand1976 anand1976 added the WIP Work in progress label Jul 22, 2018
@anand1976
Copy link
Contributor Author

Setting the WIP tag to account for the remaining testing. The implementation is largely complete, I think, pending feedback/comments.

@@ -27,6 +27,7 @@ enum class TableFileCreationReason {
kFlush,
kCompaction,
kRecovery,
kMisc,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be good to be more specific with the reason here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me rethink this. I was using TableFileCreationReason to signal SstFileManagerImpl::OnAddFile that the file was added due to compaction (with kMisc as sort of the default), but it doesn't necessarily mean the file was just created. It could be called during DB::Open, when its just listing all the files. So maybe TableFileCreationReason is not appropriate. Instead, I can have an explicit parameter in OnAddFile to specify whether the file being added was created by compaction.

@anand1976 anand1976 removed the WIP Work in progress label Aug 17, 2018
Copy link
Contributor

@siying siying left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still reviewing. This is more complicated than I though.

.travis.yml Outdated
@@ -116,6 +121,7 @@ script:
mkdir build && cd build && cmake -DJNI=1 .. -DCMAKE_BUILD_TYPE=Release && make -j4 rocksdb rocksdbjni
;;
esac
- for i in $(find ./ -maxdepth 1 -name 'core*' -print); do gdb ./error_handler_test core* -ex "thread apply all bt" -ex "set pagination 0" -batch; done;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wow, this is complicated unit test technique...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just temporary. I'm not going to merge this file. One of the tests is crashing only in Travis, so this is trying to print the backtrace.

private:
bool IsRecoveryInProgress() { return recovery_in_prog_; }

Status RecoverFromBGError(bool exclusive = false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain why we need exclusive and non-exclusive mode, and can't make everything exclusive?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's in fact the idea - to have only one kind of recovery going on (either auto or manual). I think the implementation is probably confusing. The idea is that a user that doesn't want auto recovery would turn it off in the OnErrorRecoveryBegin event callback, and then call DB::Resume(). The exclusive parameter here is to indicate whether the call is from manual recovery or not. I should change the name to be more explicit.

@@ -844,6 +845,8 @@ class DBImpl : public DB {
bool read_only = false, bool error_if_log_file_exist = false,
bool error_if_data_exists_in_logs = false);

Status ResumeImpl();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you write a comment describing what the function is doing?

db/db_impl.cc Outdated
"Shutdown: canceling all background work");
if (shutdown) {
ROCKS_LOG_INFO(immutable_db_options_.info_log,
"Shutdown: canceling all background work");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we set a flag here, will the race condition between recovery and shutdown be easier to resolve?

Copy link
Contributor Author

@anand1976 anand1976 Aug 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a interesting suggestion. We could set a new flag here, and in DBImpl::CloseHelper, call error_handler_.CancelErrorRecovery before CancelAllBackgroundWork. That way, we would not have both trying to schedule memtable flushes. Let me look into this a bit more.

db/db_impl.cc Outdated
}

// Will lock the mutex_, will wait for completion if wait is true
void DBImpl::CancelAllBackgroundWork(bool wait) {
void DBImpl::CancelAllBackgroundWork(bool wait, bool shutdown) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is all what it does in shutdown=false case these lines?

   while (bg_bottom_compaction_scheduled_ || bg_compaction_scheduled_ ||
          bg_flush_scheduled_) {
     bg_cv_.Wait();
 }

If that is the case, is it easier to make these lines a helper function, rather than fitting into CancelAllBackgroundWork()? Otherwise, it doesn't feel cancelling any background work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point. It is effectively just waiting for background work to be completed. This can be moved to a helper and reused.

closing_(false),
reserved_disk_buffer_(0),
free_space_trigger_(0) {
bg_thread_.reset(new port::Thread(&SstFileManagerImpl::ClearError, this));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to start it only when out-of-space happened?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I think that's a good idea. That way, if an out of space error never happens, the behavior is exactly the same as it is now.

Status s = env_->GetFreeSpace(path_, &free_space);
if (s.ok()) {
if (bg_err_.severity() == Status::Severity::kHardError) {
// Give priority to hard errors
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you comment more about what it means?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its referring to the case where some DB instances experienced soft errors (i.e compaction failure) and some experienced hard errors (i.e flush/WAL write failures), we would estimate required free space as if all instances experienced hard errors. For hard errors, the free space threshold is reserved_disk_buffer_, which is basically the sum of write_buffer_size of all instances. For soft errors, the threshold is free_space_trigger_, which is the value of cur_compactions_reserved_size_ at the time of error.

(needed_headroom + total_files_size_ > max_allowed_space_)) {
cur_compactions_reserved_size_ -= size_added_by_compaction;
return false;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a fairly complicated logic. I wonder whether we can simplify it, trading off some feature supports.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a couple of things I added here that increased the complexity - 1. The in_progress_files_size_ was added in order to account for a compaction that has generated some outputs that are already counted in cur_compactions_reserved_size_, and 2. Since compaction_buffer_size_ is a user provided parameter during SstFileManager construction, choose a default if they didn't set it.

I can move these down below to the NoSpace case in order to avoid impacting the normal case. Maybe #2 can also be removed, depending on how conservative we want to be when estimating headroom after getting out of space error.

} else {
// Take a snapshot of cur_compactions_reserved_size_ for when we encounter
// a NoSpace error.
free_space_trigger_ = cur_compactions_reserved_size_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a worry using this one.
For example, if a user runs universal compaction, and has 200 shards, and it is out-of-memory. Eventually, full compaction of all the shards will reserve the size equivalent to the whole shard size, and fail. The compaction trigger would be equal to the total size of the 200 shards. Then users need to drop the disk space to lower than 50% to make it recover. This is going to be much more than what is needed. Is it a valid worry?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cur_compactions_reserved_size_ is the space reserved by currently running compactions. In your example, would the user tune max_background_jobs to limit the number of concurrent compactions? If they set it to a high number, they will always run the risk of doubling the space used. This check may, in fact benefit that case, as it does not stop background work but rather just slows it down. If EnoughRoomForCompaction returns false, the compaction is retried after some time. So it would end up throttling concurrent compactions to some lower number that can actually be accommodated by the available disk space.

@siying
Copy link
Contributor

siying commented Aug 23, 2018

When an SST file manager manages multiple DBs, can we recover them slowly, rather than enable them all at one shot? I'm worry about the sudden space increase may complicate things.

Anand Ananthabhotla added 16 commits August 27, 2018 18:15
Summary:

Search or jump to…

Pull requests
Issues
Marketplace
Explore
 @anand1976 Sign out
882
11,071 2,425 facebook/rocksdb
 Code  Issues 124  Pull requests 88  Projects 1  Wiki  Insights
Auto recovery from out of space errors

Summary:
This commit implements automatic recovery from a Status::NoSpace() error
during background operations such as write callback, flush and
compaction. The broad design is as follows -
1. Compaction errors are treated as soft errors and don't put the
database in read-only mode. A compaction is delayed until enough free
disk space is available to accomodate the compaction outputs, which is
estimated based on the input size. This means that users can continue to
write, and we rely on the WriteController to delay or stop writes if the
compaction debt becomes too high due to persistent low disk space
condition
2. Errors during write callback and flush are treated as hard errors,
i.e the database is put in read-only mode and goes back to read-write
only fater certain recovery actions are taken.
3. Both types of recovery rely on the SstFileManagerImpl to poll for
sufficient disk space. We assume that there is a 1-1 mapping between an
SFM and the underlying OS storage container. For cases where multiple
DBs are hosted on a single storage container, the user is expected to
allocate a single SFM instance and use the same one for all the DBs. If
no SFM is specified by the user, DBImpl::Open() will allocate one, but
this will be one per DB and each DB will recover independently. The
recovery implemented by SFM is as follows -
  a) On the first occurance of an out of space error during compaction,
subsequent
  compactions will be delayed until the disk free space check indicates
  enough available space. The required space is computed as the sum of
  input sizes.
  b) The free space check requirement will be removed once the amount of
  free space is greater than the size reserved by in progress
  compactions when the first error occured
  c) If the out of space error is a hard error, a background thread in
  SFM will poll for sufficient headroom before triggering the recovery
  of the database and putting it in write-only mode. The headroom is
  calculated as the sum of the write_buffer_size of all the DB instances
  associated with the SFM
4. EventListener callbacks will be called at the start and completion of
automatic recovery. Users can disable the auto recov ery in the start
callback, and later initiate it manually by calling DB::Resume()

Todo:
1. More extensive testing
2. Add disk full condition to db_stress (follow-on PR)

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
Summary:
1. Fix corner cases where compaction state is left inconsistent
  a) Compaction triggered by recovery
  b) Compaction failure after writes are stopped
  (level0_stop_writes_trigger) - add it back to compaction queue
2. Prevent indefinite wait in DBImpl::DelayWrite()
3. Fix required free space calculation in SstFileManagerImpl

Test Plan:
Tested using db_bench on a loopback filesystem and periodically filling
up the fs

Reviewers:

Subscribers:

Tasks:

Tags:
Summary:
1. Rebase and resolve conflicts
2. Work around GetFreeSpace conflict in Windows build
3. Misc. Travis compilation errors

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
Summary:
.travis.yml with a specific config and coredump/gdb enabled

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
Summary:
1. Add tests for multiple column family and multiple DB cases
2. Change SstFileManager::OnAddFile signature
3. Better isolate DB instances in the multiple instance case by not
impacting one instance if another has run into error, in order to
preserve existing behavior

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
Summary:
The main conflict is with the commit 7daae51, which refactors flush
request processing.

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
db/db_impl.h Outdated
// error recovery from going on in parallel. The latter, shutting_down_,
// is set a little later during the shutdown after scheduling memtable
// flushes
bool shutdown_flag_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe call it shutdown_initiated_ directly?

db/db_impl.h Outdated
// error recovery from going on in parallel. The latter, shutting_down_,
// is set a little later during the shutdown after scheduling memtable
// flushes
bool shutdown_flag_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm picky but if possible can we place it together with those bool variables, for better padding?

// NoSpace. If that happens, we abort the recovery here and start
// over.
mu_.Unlock();
s = cur_instance_->RecoverFromBGError();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible that, before the mutex is acquired again, the same DB recovered, but quickly failed again. Since this DB is still en error_handler_list_, it is not inserted again, but we poped it up in line 294, so the error handle of this failure DB is out of track?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good catch. I think the safest way for now is to check the DB's error status again after acquiring the mutex, and leave it in the list if there is an error.

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
Anand Ananthabhotla added 8 commits September 13, 2018 19:25
Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
Summary:
Temporarily disbling these tests in Travis as they are only failing in
Travis and not reproducible elsewhere.

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

anand1976 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

anand1976 pushed a commit to anand1976/rocksdb that referenced this pull request Sep 17, 2018
Summary:
1. Add override keyword to overridden virtual functions in EventListener
2. Fix a memory corruption that can happen during DB shutdown when in
read-only mode due to a background write error
3. Fix uninitialized buffers in error_handler_test.cc that cause
valgrind to complain

Test Plan:
make check
make valgrind_test

Reviewers:

Subscribers:

Tasks:

Tags:
anand1976 pushed a commit to anand1976/rocksdb that referenced this pull request Sep 17, 2018
Summary:
1. Add override keyword to overridden virtual functions in EventListener
2. Fix a memory corruption that can happen during DB shutdown when in
read-only mode due to a background write error
3. Fix uninitialized buffers in error_handler_test.cc that cause
valgrind to complain

Test Plan:
make check
make check (clang)
make valgrind_test
make ubsan_check

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
facebook-github-bot pushed a commit that referenced this pull request Sep 17, 2018
Summary:
1. Add override keyword to overridden virtual functions in EventListener
2. Fix a memory corruption that can happen during DB shutdown when in
read-only mode due to a background write error
3. Fix uninitialized buffers in error_handler_test.cc that cause
valgrind to complain
Pull Request resolved: #4375

Differential Revision: D9875779

Pulled By: anand1976

fbshipit-source-id: 022ede1edc01a9f7e21ecf4c61ef7d46545d0640
cv_.SignalAll();
}

bool WaitForRecovery(uint64_t /*abs_time_us*/) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anand1976 Hi! Seems the abs_time_us here should be uncomment? Otherwise db_bench will wait forever on error.

facebook-github-bot pushed a commit that referenced this pull request Jan 7, 2020
Summary:
A new interface method Env::GetFreeSpace was added in #4164. It needs to be implemented for Windows port. Some error_handler_test cases fail on Windows because recovery cannot succeed without free space being reported.
Pull Request resolved: #6265

Differential Revision: D19303065

fbshipit-source-id: 1f1a83e53f334284781cf61feabc996e87b945ca
@rohansuri
Copy link

Hi @anand1976 / @siying, I had some questions around this change. Thanks in advance!

  • The check here shouldn't have size_added_by_compaction since it is already included in needed_headroom before in the variable initialization?

  • It seems we never pass compaction=true to SstFileManagerImpl::OnAddFile which means the logic around in_progress_files_size_ is never executed. Any reason why it is turned off?

  • Finally, in what scenario will an already tracked file by SstFileManagerImpl be added again? And why do subtract its size from cur_compactions_reserved_size_ in that case?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants