Auto recovery from out of space errors #4164

Closed
wants to merge 25 commits into from

Commits on Aug 28, 2018

  1. Auto recovery from out of space errors

    Summary:
    This commit implements automatic recovery from a Status::NoSpace() error
    during background operations such as write callback, flush and
    compaction. The broad design is as follows:
    1. Compaction errors are treated as soft errors and don't put the
    database in read-only mode. A compaction is delayed until enough free
    disk space is available to accommodate the compaction outputs, which is
    estimated based on the input size. This means that users can continue to
    write, and we rely on the WriteController to delay or stop writes if the
    compaction debt becomes too high due to a persistent low disk space
    condition.
    2. Errors during write callback and flush are treated as hard errors,
    i.e. the database is put in read-only mode and goes back to read-write
    only after certain recovery actions are taken.
    3. Both types of recovery rely on the SstFileManagerImpl to poll for
    sufficient disk space. We assume that there is a 1-1 mapping between an
    SFM and the underlying OS storage container. For cases where multiple
    DBs are hosted on a single storage container, the user is expected to
    allocate a single SFM instance and use the same one for all the DBs. If
    no SFM is specified by the user, DBImpl::Open() will allocate one, but
    this will be one per DB and each DB will recover independently. The
    recovery implemented by SFM is as follows:
      a) On the first occurrence of an out of space error during compaction,
      subsequent compactions will be delayed until the disk free space check
      indicates enough available space. The required space is computed as the
      sum of input sizes.
      b) The free space check requirement will be removed once the amount of
      free space is greater than the size reserved by in-progress
      compactions when the first error occurred.
      c) If the out of space error is a hard error, a background thread in
      SFM will poll for sufficient headroom before triggering the recovery
      of the database and putting it back in read-write mode. The headroom is
      calculated as the sum of the write_buffer_size of all the DB instances
      associated with the SFM.
    4. EventListener callbacks will be called at the start and completion of
    automatic recovery. Users can disable the auto recovery in the start
    callback, and later initiate it manually by calling DB::Resume().
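
The space accounting in items 3a to 3c can be illustrated with a small stand-alone model. This is only a sketch under stated assumptions, not the code in SstFileManagerImpl: the names SpaceTracker, SumBytes, TryStartCompaction, OnCompactionNoSpace, OnCompactionDone and EnoughHeadroom are hypothetical, the order in which rules (a) and (b) are evaluated is assumed, and so is the use of >= in the headroom comparison.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical model of the space accounting in items 3a-3c. These names are
// illustrative and are not taken from SstFileManagerImpl.
inline uint64_t SumBytes(const std::vector<uint64_t>& sizes) {
  return std::accumulate(sizes.begin(), sizes.end(), uint64_t{0});
}

struct SpaceTracker {
  uint64_t reserved = 0;                 // bytes reserved by in-progress compactions
  bool check_enabled = false;            // true after the first out of space error
  uint64_t reserved_at_first_error = 0;  // snapshot used by rule (b)

  // Rule (a): the first compaction NoSpace error turns the check on.
  void OnCompactionNoSpace() {
    if (!check_enabled) {
      check_enabled = true;
      reserved_at_first_error = reserved;
    }
  }

  // Returns true if the compaction may start now, false if it must be delayed.
  bool TryStartCompaction(const std::vector<uint64_t>& input_sizes,
                          uint64_t free_bytes) {
    const uint64_t needed = SumBytes(input_sizes);
    // Rule (b): drop the check once free space exceeds the snapshot.
    if (check_enabled && free_bytes > reserved_at_first_error) {
      check_enabled = false;
    }
    // Rule (a): while the check is on, delay if the outputs may not fit.
    if (check_enabled && free_bytes < needed) {
      return false;
    }
    reserved += needed;
    return true;
  }

  // Releases the reservation made by a successful TryStartCompaction().
  void OnCompactionDone(const std::vector<uint64_t>& input_sizes) {
    const uint64_t needed = SumBytes(input_sizes);
    assert(reserved >= needed);  // must match an earlier successful start
    reserved -= needed;
  }
};

// Rule (c): headroom for hard error recovery is the sum of write_buffer_size
// over all DB instances sharing the SFM. Using >= here is an assumption.
inline bool EnoughHeadroom(const std::vector<uint64_t>& write_buffer_sizes,
                           uint64_t free_bytes) {
  return free_bytes >= SumBytes(write_buffer_sizes);
}
```

In this model a poller would call EnoughHeadroom() periodically with the current free space and trigger recovery once it returns true, while the compaction scheduler would call TryStartCompaction() before each job and requeue the job when it returns false.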
    
    Todo:
    1. More extensive testing
    2. Add disk full condition to db_stress (follow-on PR)
    
    Test Plan:
    
    Reviewers:
    
    Subscribers:
    
    Tasks:
    
    Tags:
    Anand Ananthabhotla committed Aug 28, 2018
    29f30b9
  2. Fix bugs found by additional testing

    Summary:
    1. Fix corner cases where compaction state is left inconsistent
      a) Compaction triggered by recovery
      b) Compaction failure after writes are stopped
      (level0_stop_writes_trigger) - add it back to compaction queue
    2. Prevent indefinite wait in DBImpl::DelayWrite()
    3. Fix required free space calculation in SstFileManagerImpl
    
    Test Plan:
    Tested using db_bench on a loopback filesystem and periodically filling
    up the fs
    
    Reviewers:
    
    Subscribers:
    
    Tasks:
    
    Tags:
    Anand Ananthabhotla committed Aug 28, 2018
    e0be66f
  3. Fix build failures

    Summary:
    1. Rebase and resolve conflicts
    2. Work around GetFreeSpace conflict in Windows build
    3. Misc. Travis compilation errors
    
    Test Plan:
    
    Reviewers:
    
    Subscribers:
    
    Tasks:
    
    Tags:
    Anand Ananthabhotla committed Aug 28, 2018
    0222a47
  4. Fix more Travis errors

    Summary:
    
    Test Plan:
    
    Reviewers:
    
    Subscribers:
    
    Tasks:
    
    Tags:
    Anand Ananthabhotla committed Aug 28, 2018
    d3a3c49
  5. Fix ROCKSDB_LITE build

    Summary:
    
    Test Plan:
    
    Reviewers:
    
    Subscribers:
    
    Tasks:
    
    Tags:
    Anand Ananthabhotla committed Aug 28, 2018
    681d78a
  6. Debug Travis CI failures

    Summary:
    .travis.yml with a specific config and coredump/gdb enabled
    
    Test Plan:
    
    Reviewers:
    
    Subscribers:
    
    Tasks:
    
    Tags:
    Anand Ananthabhotla committed Aug 28, 2018
    709b6b9
  7. Reenable builds in Travis CI to debug a coredump

    Summary:
    
    Test Plan:
    
    Reviewers:
    
    Subscribers:
    
    Tasks:
    
    Tags:
    Anand Ananthabhotla committed Aug 28, 2018
    24002c6
  8. Reenable all builds in Travis CI and keep gdb/coredump

    Summary:
    
    Test Plan:
    
    Reviewers:
    
    Subscribers:
    
    Tasks:
    
    Tags:
    Anand Ananthabhotla committed Aug 28, 2018
    a685441
  9. Debug Travis CI failure: Only enable coredump for TEST_GROUP 1

    Summary:
    
    Test Plan:
    
    Reviewers:
    
    Subscribers:
    
    Tasks:
    
    Tags:
    Anand Ananthabhotla committed Aug 28, 2018
    30548bc
  10. Fix syntax error in .travis.yml

    Summary:
    
    Test Plan:
    
    Reviewers:
    
    Subscribers:
    
    Tasks:
    
    Tags:
    Anand Ananthabhotla committed Aug 28, 2018
    d2185a2
  11. Revert all changes in .travis.yml

    Summary:
    
    Test Plan:
    
    Reviewers:
    
    Subscribers:
    
    Tasks:
    
    Tags:
    Anand Ananthabhotla committed Aug 28, 2018
    8653dc1
  12. 2nd attempt at printing stack trace in Travis build

    Summary:
    
    Test Plan:
    
    Reviewers:
    
    Subscribers:
    
    Tasks:
    
    Tags:
    Anand Ananthabhotla committed Aug 28, 2018
    39a473f
  13. Don't install gdb for osx builds

    Summary:
    
    Test Plan:
    
    Reviewers:
    
    Subscribers:
    
    Tasks:
    
    Tags:
    Anand Ananthabhotla committed Aug 28, 2018
    3daaca9
  14. More tests, a bug fix and some clean-up/comments

    Summary:
    1. Add tests for multiple column family and multiple DB cases
    2. Change SstFileManager::OnAddFile signature
    3. Better isolate DB instances in the multiple instance case by not
    impacting one instance if another has run into error, in order to
    preserve existing behavior
    
    Test Plan:
    
    Reviewers:
    
    Subscribers:
    
    Tasks:
    
    Tags:
    Anand Ananthabhotla committed Aug 28, 2018
    ebdc71b
  15. Address code review comments

    Summary:
    
    Test Plan:
    
    Reviewers:
    
    Subscribers:
    
    Tasks:
    
    Tags:
    Anand Ananthabhotla committed Aug 28, 2018
    feea658

Commits on Aug 29, 2018

  1. Rebase to latest master after resolving conflicts

    Summary:
    The main conflict is with the commit 7daae51, which refactors flush
    request processing.
    
    Test Plan:
    
    Reviewers:
    
    Subscribers:
    
    Tasks:
    
    Tags:
    Anand Ananthabhotla committed Aug 29, 2018
    a59b8b2

Commits on Sep 13, 2018

  1. Address final few code review comments

    Summary:
    
    Test Plan:
    
    Reviewers:
    
    Subscribers:
    
    Tasks:
    
    Tags:
    Anand Ananthabhotla committed Sep 13, 2018
    359d98a

Commits on Sep 14, 2018

  1. Fix a race between recovery completion and DB shutdown

    Summary:
    
    Test Plan:
    
    Reviewers:
    
    Subscribers:
    
    Tasks:
    
    Tags:
    Anand Ananthabhotla committed Sep 14, 2018
    832fa03
  2. Fix one more race of shutdown initiated before recovery attempt

    Summary:
    
    Test Plan:
    
    Reviewers:
    
    Subscribers:
    
    Tasks:
    
    Tags:
    Anand Ananthabhotla committed Sep 14, 2018
    7989ce6

Commits on Sep 15, 2018

  1. Enable gdb in travis

    Summary:
    
    Test Plan:
    
    Reviewers:
    
    Subscribers:
    
    Tasks:
    
    Tags:
    Anand Ananthabhotla committed Sep 15, 2018
    dbef9f0
  2. 2nd attempt at Travis gdb

    Summary:
    
    Test Plan:
    
    Reviewers:
    
    Subscribers:
    
    Tasks:
    
    Tags:
    Anand Ananthabhotla committed Sep 15, 2018
    893bcc6
  3. 3rd attempt at Travis

    Summary:
    
    Test Plan:
    
    Reviewers:
    
    Subscribers:
    
    Tasks:
    
    Tags:
    Anand Ananthabhotla committed Sep 15, 2018
    98cafaa
  4. Revert .travis.yml and fix some mem leaks in error_handler_test

    Summary:
    
    Test Plan:
    
    Reviewers:
    
    Subscribers:
    
    Tasks:
    
    Tags:
    Anand Ananthabhotla committed Sep 15, 2018
    cdaae4e
  5. Disable tests failing only in Travis

    Summary:
    
    Test Plan:
    
    Reviewers:
    
    Subscribers:
    
    Tasks:
    
    Tags:
    Anand Ananthabhotla committed Sep 15, 2018
    6e278e1
  6. Disable a few more tests in Travis

    Summary:
    Temporarily disabling these tests in Travis as they are only failing in
    Travis and not reproducible elsewhere.
    
    Test Plan:
    
    Reviewers:
    
    Subscribers:
    
    Tasks:
    
    Tags:
    Anand Ananthabhotla committed Sep 15, 2018
    d4760ff