-
Notifications
You must be signed in to change notification settings - Fork 6.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add the statistics and info log for Error handler #8050
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhichao-cao has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR!
ERROR_HANDLER_BG_IO_ERROR_COUNT((byte) -0x17), | ||
ERROR_HANDLER_BG_RETRYABLE_IO_ERROR_COUNT((byte) -0x18), | ||
ERROR_HANDLER_AUTORESUME_COUNT((byte) -0x19), | ||
ERROR_HANDLER_AUTORESUME_RETRY_TOTAL_COUNT((byte) -0x1A), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make more sense for this to be a histogram stat, so we would get a distribution of how long the outages lasted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make more sense for this to be a histogram stat, so we would get a distribution of how long the outages lasted?
You mean for "ERROR_HANDLER_AUTORESUME_RETRY_TOTAL_COUNT"? This is just a general counter for the total retries being called since DB is opened.
I used ERROR_HANDLER_AUTORESUME_RETRY_COUNT for histogram which shows how many times of retry will be called in one autoresume call. Did you mean this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Sorry, I misread.
ERROR_HANDLER_BG_IO_ERROR_COUNT((byte) -0x17), | ||
ERROR_HANDLER_BG_RETRYABLE_IO_ERROR_COUNT((byte) -0x18), | ||
ERROR_HANDLER_AUTORESUME_COUNT((byte) -0x19), | ||
ERROR_HANDLER_AUTORESUME_RETRY_TOTAL_COUNT((byte) -0x1A), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Sorry, I misread.
if (bg_error_stats_ != nullptr) { | ||
RecordTick(bg_error_stats_.get(), ERROR_HANDLER_BG_ERROR_COUNT); | ||
RecordTick(bg_error_stats_.get(), ERROR_HANDLER_BG_IO_ERROR_COUNT); | ||
RecordTick(bg_error_stats_.get(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we break this into 2 stats to distinguish between retryable errors RocksDB attempted to recover from and retryable errors that it didn't attempt? It'll help indirectly track the WAL related retryable errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ERROR_HANDLER_BG_IO_ERROR_COUNT - ERROR_HANDLER_AUTORESUME_COUNT is the retryable errors that we do not attempt the recover.
// The counters for error handler, not that, bg_io_error is the subset of | ||
// bg_error and bg_retryable_io_error is the subset of bg_io_error | ||
ERROR_HANDLER_BG_ERROR_COUNT, | ||
ERROR_HANDLER_BG_IO_ERROR_COUNT, | ||
ERROR_HANDLER_BG_RETRYABLE_IO_ERROR_COUNT, | ||
ERROR_HANDLER_AUTORESUME_COUNT, | ||
ERROR_HANDLER_AUTORESUME_RETRY_TOTAL_COUNT, | ||
ERROR_HANDLER_AUTORESUME_SUCCESS_COUNT, | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not arguing against these statistics, but are they at all meaningful to the RocksDB user? Do they know anything about the error handler and its potential states and failures? Is this opening a can of worms to document?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So those are the statistics requested by the internal users. And I believe all the users will benefit from the statistics since error handler is the default logic in RocksDB to deal with the write IO errors.
// The pointer of DB statistics. | ||
std::shared_ptr<Statistics> bg_error_stats_; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be a shared or raw pointer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is safer to use the shared pointer I think
a973fa4
to
6f2f089
Compare
@zhichao-cao has updated the pull request. You must reimport the pull request before landing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
db/error_handler.cc
Outdated
} | ||
ROCKS_LOG_INFO( | ||
db_options_.info_log, | ||
"ErrorHanlder: Compaction will schedule by itself to resume\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo - ErrorHanlder -> ErrorHandler
db/error_handler.cc
Outdated
ERROR_HANDLER_BG_RETRYABLE_IO_ERROR_COUNT); | ||
} | ||
ROCKS_LOG_INFO(db_options_.info_log, | ||
"ErrorHanlder: Set background retryable IO error\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo - ErrorHanlder -> ErrorHandler
db/error_handler.cc
Outdated
} | ||
ROCKS_LOG_INFO( | ||
db_options_.info_log, | ||
"ErrorHanlder: Set background IO error as unrecoverable error\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo - ErrorHanlder -> ErrorHandler
db/error_handler.cc
Outdated
} | ||
ROCKS_LOG_INFO( | ||
db_options_.info_log, | ||
"ErrorHanlder: Call StartRecoverFromRetryableBGIOError to resume\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo - ErrorHanlder -> ErrorHandler
db/error_handler.cc
Outdated
RecordTick(bg_error_stats_.get(), ERROR_HANDLER_BG_ERROR_COUNT); | ||
} | ||
ROCKS_LOG_INFO(db_options_.info_log, | ||
"ErrorHanlder: Set regular background error\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo - ErrorHanlder -> ErrorHandler
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I will fix all
6f2f089
to
2b2b703
Compare
@zhichao-cao has updated the pull request. You must reimport the pull request before landing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhichao-cao has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@zhichao-cao merged this pull request in 08ec5e7. |
Add statistics and info log for error handler: counters for bg error, bg io error, bg retryable io error, auto resume, auto resume total retry, and auto resume sucess; Histogram for auto resume retry count in each recovery call.
Test plan: make check and add test to error_handler_fs_test