Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add the statistics and info log for Error handler #8050

Closed
wants to merge 2 commits into from

Conversation

zhichao-cao
Copy link
Contributor

Add statistics and info log for error handler: counters for bg error, bg io error, bg retryable io error, auto resume, auto resume total retry, and auto resume sucess; Histogram for auto resume retry count in each recovery call.

Test plan: make check and add test to error_handler_fs_test

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhichao-cao has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Copy link
Contributor

@anand1976 anand1976 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

ERROR_HANDLER_BG_IO_ERROR_COUNT((byte) -0x17),
ERROR_HANDLER_BG_RETRYABLE_IO_ERROR_COUNT((byte) -0x18),
ERROR_HANDLER_AUTORESUME_COUNT((byte) -0x19),
ERROR_HANDLER_AUTORESUME_RETRY_TOTAL_COUNT((byte) -0x1A),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make more sense for this to be a histogram stat, so we would get a distribution of how long the outages lasted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make more sense for this to be a histogram stat, so we would get a distribution of how long the outages lasted?

You mean for "ERROR_HANDLER_AUTORESUME_RETRY_TOTAL_COUNT"? This is just a general counter for the total retries being called since DB is opened.

I used ERROR_HANDLER_AUTORESUME_RETRY_COUNT for histogram which shows how many times of retry will be called in one autoresume call. Did you mean this one?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Sorry, I misread.

ERROR_HANDLER_BG_IO_ERROR_COUNT((byte) -0x17),
ERROR_HANDLER_BG_RETRYABLE_IO_ERROR_COUNT((byte) -0x18),
ERROR_HANDLER_AUTORESUME_COUNT((byte) -0x19),
ERROR_HANDLER_AUTORESUME_RETRY_TOTAL_COUNT((byte) -0x1A),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Sorry, I misread.

if (bg_error_stats_ != nullptr) {
RecordTick(bg_error_stats_.get(), ERROR_HANDLER_BG_ERROR_COUNT);
RecordTick(bg_error_stats_.get(), ERROR_HANDLER_BG_IO_ERROR_COUNT);
RecordTick(bg_error_stats_.get(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we break this into 2 stats to distinguish between retryable errors RocksDB attempted to recover from and retryable errors that it didn't attempt? It'll help indirectly track the WAL related retryable errors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ERROR_HANDLER_BG_IO_ERROR_COUNT - ERROR_HANDLER_AUTORESUME_COUNT is the retryable errors that we do not attempt the recover.

Comment on lines +377 to +385
// The counters for error handler, not that, bg_io_error is the subset of
// bg_error and bg_retryable_io_error is the subset of bg_io_error
ERROR_HANDLER_BG_ERROR_COUNT,
ERROR_HANDLER_BG_IO_ERROR_COUNT,
ERROR_HANDLER_BG_RETRYABLE_IO_ERROR_COUNT,
ERROR_HANDLER_AUTORESUME_COUNT,
ERROR_HANDLER_AUTORESUME_RETRY_TOTAL_COUNT,
ERROR_HANDLER_AUTORESUME_SUCCESS_COUNT,

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not arguing against these statistics, but are they at all meaningful to the RocksDB user? Do they know anything about the error handler and its potential states and failures? Is this opening a can of worms to document?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So those are the statistics requested by the internal users. And I believe all the users will benefit from the statistics since error handler is the default logic in RocksDB to deal with the write IO errors.

Comment on lines +112 to +114
// The pointer of DB statistics.
std::shared_ptr<Statistics> bg_error_stats_;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a shared or raw pointer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is safer to use the shared pointer I think

@facebook-github-bot
Copy link
Contributor

@zhichao-cao has updated the pull request. You must reimport the pull request before landing.

Copy link
Contributor

@anand1976 anand1976 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

}
ROCKS_LOG_INFO(
db_options_.info_log,
"ErrorHanlder: Compaction will schedule by itself to resume\n");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo - ErrorHanlder -> ErrorHandler

ERROR_HANDLER_BG_RETRYABLE_IO_ERROR_COUNT);
}
ROCKS_LOG_INFO(db_options_.info_log,
"ErrorHanlder: Set background retryable IO error\n");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo - ErrorHanlder -> ErrorHandler

}
ROCKS_LOG_INFO(
db_options_.info_log,
"ErrorHanlder: Set background IO error as unrecoverable error\n");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo - ErrorHanlder -> ErrorHandler

}
ROCKS_LOG_INFO(
db_options_.info_log,
"ErrorHanlder: Call StartRecoverFromRetryableBGIOError to resume\n");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo - ErrorHanlder -> ErrorHandler

RecordTick(bg_error_stats_.get(), ERROR_HANDLER_BG_ERROR_COUNT);
}
ROCKS_LOG_INFO(db_options_.info_log,
"ErrorHanlder: Set regular background error\n");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo - ErrorHanlder -> ErrorHandler

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I will fix all

@facebook-github-bot
Copy link
Contributor

@zhichao-cao has updated the pull request. You must reimport the pull request before landing.

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhichao-cao has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@zhichao-cao merged this pull request in 08ec5e7.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants