-
Notifications
You must be signed in to change notification settings - Fork 6.7k
[RFC] Disable automatic error recovery for user write failures #8321
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
db/error_handler.cc
Outdated
| new_bg_err = OverrideNoSpaceError(new_bg_err, reason, &auto_recovery); | ||
| } | ||
|
|
||
| if ((!db_options_.max_bgerror_resume_count || !auto_recovery) && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about compaction? Compaction do not do auto recovery and it just reschedule by itself. We set it to soft error. Should it be hard error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the user has disabled auto recovery, I think we should set it to hard error even for compaction. Otherwise, too many pending compactions could also lead to a write stall.
| // error. Since the background error is now user visible and caused a | ||
| // write to fail, stop the DB and fail subsequent writes as well. There | ||
| // may be other writes in the queue and might cause inconsistency if the | ||
| // recovery succeeds and the queued writes are allowed to go through. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's true that if users use RocksDB in certain way, as what MyRocks does, any write error might not be recoverable. How about other use cases, where writes are more or less independent, so one write failure can be skipped or independently retried later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point. Should we introduce an option to control this behavior? And perhaps it could apply to other user write failures, such as IO error during WAL append. The automatic recovery will flush memtables and create a new WAL if there was a WAL append failure. Any writes during that time will be failed, but subsequent writes will succeed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if we want to do that, we probably should go with an option. Do we have an API that allows users to manually recover? If we do, the option can be for manual recovery only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, DB::Resume() allows users to manually recover. I added an option to freeze the DB on a user write failure. I kept it independent of auto recovery, since the auto recovery can continue as long as its confined to the background and not visible to the user. If the freeze options is set and there is a user visible failure (either due to reason kWriteCallback or write controller stoppage), then we put the DB in read-only mode and cancel any ongoing recovery.
In the current error recovery logic, background write errors during flush/compaction are automatically retried under some circumstances (NoSpace, retryable errors in distributed file systems). Normally, these errors are not visible to the user and we can try to recover from them in the background. However, if recovery takes a long time, the memtables eventually would become full with buffered writes and writes will be stopped by the write controller. When that happens, we currently return Status::Incomplete rather than indefinitely hang the write thread. There are 2 problems with this approach -
The solution is to stop all further writes once we return an error for a user write. This is accomplished in this PR as follows -
bg_error_tokHardError.db_options.max_bgerror_resume_countto 0. (Is this still required if we have Miss Spelling in README #1?)