@anand1976 commented May 20, 2021

In the current error recovery logic, background write errors during flush/compaction are automatically retried under some circumstances (NoSpace, retryable errors in distributed file systems). Normally, these errors are not visible to the user and we can try to recover from them in the background. However, if recovery takes a long time, the memtables eventually fill up with buffered writes and the write controller stops writes. When that happens, we currently return Status::Incomplete rather than hang the write thread indefinitely. There are 2 problems with this approach -

  1. The Incomplete error may not be handled correctly by the user. It used to be returned only when write_options.no_slowdown was set.
  2. Other writes may be queued behind the incomplete write. If the background error recovery succeeds, the queued writes may be successful, which might cause inconsistency, especially with TransactionDB.

The solution is to stop all further writes once we return an error for a user write. This is accomplished in this PR as follows -

  1. When the write controller stops writes and there is a background error, stop all further writes by setting the severity in bg_error_ to kHardError.
  2. Return the bg_error_ rather than Status::Incomplete.
  3. Disable automatic error recovery in TransactionDB::Open() by setting db_options.max_bgerror_resume_count to 0. (Is this still required if we have step 1?)
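
For illustration, steps 1 and 2 can be sketched with a small standalone model. This is not RocksDB code: `ToyDB`, `ToySeverity`, `ToyWriteResult`, and every method name below are hypothetical stand-ins, and a real write thread would block on a stall instead of returning a "would block" result.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for illustration only; not RocksDB's real types.
enum class ToySeverity { kNoError, kSoftError, kHardError };

struct ToyWriteResult {
  bool ok;
  std::string msg;
};

class ToyDB {
 public:
  // A background flush/compaction failed; recovery is retried in background.
  void SetBackgroundError(const std::string& msg) {
    severity_ = ToySeverity::kSoftError;
    bg_msg_ = msg;
  }

  // Models the write controller stopping or resuming writes.
  void SetWritesStopped(bool stopped) { writes_stopped_ = stopped; }

  ToyWriteResult Write() {
    // Once a user write has failed, every later write fails the same way.
    if (severity_ == ToySeverity::kHardError) {
      return {false, bg_msg_};
    }
    if (writes_stopped_ && severity_ != ToySeverity::kNoError) {
      severity_ = ToySeverity::kHardError;  // step 1: escalate
      return {false, bg_msg_};              // step 2: surface the bg error
    }
    if (writes_stopped_) {
      return {false, "would block"};  // a real DB would wait here
    }
    return {true, ""};
  }

  ToySeverity severity() const { return severity_; }

 private:
  ToySeverity severity_ = ToySeverity::kNoError;
  std::string bg_msg_;
  bool writes_stopped_ = false;
};

// Usage: a soft error stays invisible until the write controller stops writes.
inline void ToyDemo() {
  ToyDB db;
  db.SetBackgroundError("IO error: no space");
  assert(db.Write().ok);   // still recovering in the background
  db.SetWritesStopped(true);
  assert(!db.Write().ok);  // first user-visible failure
  db.SetWritesStopped(false);
  assert(!db.Write().ok);  // queued writes fail too
}
```

The point of the model is the last assertion: once one write has observed the background error, lifting the stall does not let queued writes through.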

new_bg_err = OverrideNoSpaceError(new_bg_err, reason, &auto_recovery);
}

if ((!db_options_.max_bgerror_resume_count || !auto_recovery) &&
Contributor

How about compaction? Compaction does not do auto recovery; it just reschedules itself. We set it to a soft error. Should it be a hard error?

Contributor Author

If the user has disabled auto recovery, I think we should set it to hard error even for compaction. Otherwise, too many pending compactions could also lead to a write stall.
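
A rough sketch of the rule proposed in this reply, assuming a retryable background error and treating `max_bgerror_resume_count == 0` as "auto recovery disabled". `RecoveryOptions`, `BgSeverity`, and `SeverityForRetryableError` are hypothetical names for illustration, and the sketch does not reproduce the actual condition in the diff above.

```cpp
#include <cassert>
#include <limits>

// Hypothetical stand-ins; only the field name mirrors the real option.
enum class BgSeverity { kSoftError, kHardError };

struct RecoveryOptions {
  // Default picked for this sketch: effectively unlimited resume attempts.
  int max_bgerror_resume_count = std::numeric_limits<int>::max();
};

// If nothing will retry in the background (the user disabled auto recovery,
// or this particular error is not auto-recoverable), a soft error would leave
// flushes or compactions pending until writes stall, so escalate to hard.
// The same rule applies whether the failing job was a flush or a compaction.
inline BgSeverity SeverityForRetryableError(const RecoveryOptions& opts,
                                            bool auto_recovery) {
  assert(opts.max_bgerror_resume_count >= 0);
  const bool will_retry = opts.max_bgerror_resume_count > 0 && auto_recovery;
  return will_retry ? BgSeverity::kSoftError : BgSeverity::kHardError;
}
```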

// error. Since the background error is now user visible and caused a
// write to fail, stop the DB and fail subsequent writes as well. There
// may be other writes in the queue and might cause inconsistency if the
// recovery succeeds and the queued writes are allowed to go through.
Contributor

It's true that if users use RocksDB in a certain way, as MyRocks does, any write error might not be recoverable. How about other use cases, where writes are more or less independent, so one write failure can be skipped or independently retried later?

Contributor Author

That's a good point. Should we introduce an option to control this behavior? Perhaps it could also apply to other user write failures, such as an IO error during WAL append. If there was a WAL append failure, the automatic recovery will flush memtables and create a new WAL. Any writes during that time will fail, but subsequent writes will succeed.

Contributor

I think if we want to do that, we probably should go with an option. Do we have an API that allows users to manually recover? If we do, the option can be for manual recovery only.

Contributor Author

Yes, DB::Resume() allows users to manually recover. I added an option to freeze the DB on a user write failure. I kept it independent of auto recovery, since the auto recovery can continue as long as it's confined to the background and not visible to the user. If the freeze option is set and there is a user-visible failure (either due to reason kWriteCallback or write controller stoppage), then we put the DB in read-only mode and cancel any ongoing recovery.
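
To make the intended behavior concrete, here is a minimal standalone model of the freeze option. It is only a sketch: `FreezeOptions`, `freeze_on_write_failure`, `FreezableDB`, and its methods are hypothetical names, not the option or code added in this PR, and `Resume()` here merely stands in for a manual recovery call such as DB::Resume().

```cpp
#include <cassert>

// Hypothetical option name for illustration; not the PR's actual option.
struct FreezeOptions {
  bool freeze_on_write_failure = false;
};

class FreezableDB {
 public:
  explicit FreezableDB(const FreezeOptions& opts) : opts_(opts) {}

  // Auto recovery starts after a background error.
  void StartBackgroundRecovery() { recovery_in_progress_ = true; }

  // A user write has failed visibly (for example, the write controller
  // stopped writes while a background error was set).
  void OnUserWriteFailure() {
    if (!opts_.freeze_on_write_failure) {
      return;  // recovery stays in the background
    }
    read_only_ = true;              // fail all further writes
    recovery_in_progress_ = false;  // cancel the ongoing recovery
  }

  // Stand-in for a manual recovery call made by the user.
  void Resume() { read_only_ = false; }

  bool Write() const { return !read_only_; }
  bool read_only() const { return read_only_; }
  bool recovery_in_progress() const { return recovery_in_progress_; }

 private:
  FreezeOptions opts_;
  bool read_only_ = false;
  bool recovery_in_progress_ = false;
};

// Usage: with the option set, only an explicit Resume() re-enables writes.
inline void FreezeDemo() {
  FreezeOptions opts;
  opts.freeze_on_write_failure = true;
  FreezableDB db(opts);
  db.StartBackgroundRecovery();
  db.OnUserWriteFailure();
  assert(db.read_only() && !db.recovery_in_progress());
  db.Resume();
  assert(db.Write());
}
```

Keeping the freeze flag separate from the recovery state mirrors the reasoning above: background recovery is left alone unless a failure has actually reached the user.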
