Improve error handling and logging for Retryable IO Errors #2312

VasuDevrani · 2024-05-15T06:01:02Z

Search before asking

I searched the issues and found no similar issues.

Motivation

In server.cc, the code checks once a minute whether the storage is in a retryable IO error state and, if so, resumes the database.

if (counter != 0 && counter % 600 == 0 && storage->IsDBInRetryableIOError()) {
  storage->GetDB()->Resume();
  LOG(INFO) << "[server] Schedule to resume DB after retryable IO error";
  storage->SetDBInRetryableIOError(false);
}

Concerns:

- The code resumes the database and sets the retryable IO error state to false without checking whether the resume operation itself succeeded.
- It logs an INFO message, where ERROR or WARNING might be more appropriate.

Solution

Example code to resolve this:

if (counter != 0 && counter % 600 == 0 && storage->IsDBInRetryableIOError()) {
  auto status = storage->GetDB()->Resume();
  if (status.ok()) {
    LOG(WARNING) << "[server] Successfully resumed DB after retryable IO error: " << status.ToString();
    storage->SetDBInRetryableIOError(false);
  } else {
    LOG(ERROR) << "[server] Failed to resume DB after retryable IO error: " << status.ToString();
    // Additional error handling, such as retrying or notifying the administrator
  }
}
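To make the intended behaviour easier to check in isolation, here is a minimal, self-contained sketch of the same control flow. `FakeStatus`, `FakeStorage`, and `TryResume` are hypothetical stand-ins invented for illustration (they are not Kvrocks or RocksDB types), and plain streams replace the `LOG` macros. The point is only that the error flag is cleared when, and only when, the resume call reports success.

```cpp
#include <iostream>
#include <string>

// Hypothetical stand-in for a status object with ok()/ToString().
struct FakeStatus {
  bool ok_flag;
  std::string message;
  bool ok() const { return ok_flag; }
  std::string ToString() const { return ok_flag ? "OK" : message; }
};

// Hypothetical stand-in for the storage object; not the real Kvrocks class.
struct FakeStorage {
  bool in_retryable_io_error = true;
  bool resume_will_succeed = false;  // set by the caller to simulate the disk state

  FakeStatus Resume() const {
    if (resume_will_succeed) return FakeStatus{true, ""};
    return FakeStatus{false, "IO error: No space left on device"};
  }
};

// Returns true when the DB was resumed and the error flag was cleared.
bool TryResume(FakeStorage &storage) {
  if (!storage.in_retryable_io_error) return false;  // nothing to resume

  FakeStatus status = storage.Resume();
  if (!status.ok()) {
    std::cerr << "[server] Failed to resume DB after retryable IO error: "
              << status.ToString() << '\n';
    return false;  // flag stays set, so the next scheduled check tries again
  }

  std::cout << "[server] Successfully resumed DB after retryable IO error\n";
  storage.in_retryable_io_error = false;
  return true;
}
```

With this shape, a failed resume leaves the state untouched, so the periodic check naturally retries on its next run without any extra bookkeeping.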

Are you willing to submit a PR?

I'm willing to submit a PR!

git-hulk · 2024-05-15T06:25:14Z

@VasuDevrani Looks good, you can go ahead to improve this.

VasuDevrani · 2024-05-15T10:38:32Z

> @VasuDevrani Looks good, you can go ahead to improve this.

Note that the code proposed as the solution only logs an ERROR in the else block.
How about adding a fixed number of retries for the DB resume operation in case of failure, and logging a CRITICAL error if it still fails after those retries?

git-hulk · 2024-05-15T10:52:07Z

I'm wondering if it's a good idea to do that. I prefer letting users decide for themselves whether to terminate, instead of retrying N times by default. @PragmaTwice @caipengbo @torwig What do you think?

caipengbo · 2024-05-15T11:03:22Z

Yes, I also don't think it's necessary to set a number of retries. Alternatively, we could provide a RESUME command, which would give administrators more options.
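As a rough illustration of what such a command could look like, the sketch below shows a handler that an administrator would trigger manually. Everything in it is an assumption made for this example: `HandleResumeCommand`, the reply strings, and the `resume` callback (which stands in for the actual DB resume call and returns an empty string on success) are hypothetical and do not exist in Kvrocks.

```cpp
#include <functional>
#include <string>

// Hypothetical handler for an admin-issued RESUME command.
// `resume` stands in for the real DB resume call: it returns an empty
// string on success or an error description on failure.
std::string HandleResumeCommand(bool &in_retryable_io_error,
                                const std::function<std::string()> &resume) {
  if (!in_retryable_io_error) {
    return "+OK (DB is not in a retryable IO error state)";
  }

  std::string err = resume();
  if (!err.empty()) {
    // Leave the flag set so the administrator can fix the cause and retry.
    return "-ERR resume failed: " + err;
  }

  in_retryable_io_error = false;
  return "+OK";
}
```

The administrator decides when to call it, for example after freeing disk space, which keeps the retry policy out of the server.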

mapleFU · 2024-05-15T11:55:33Z

Out of curiosity, are we able to recover from the error if Resume() has failed once? It seems we can only recover from a few cases, like a compaction output error or a flush error?

caipengbo · 2024-05-15T12:08:57Z

> are we able to recover from the error if Resume() has failed once?

There might be some cases (no disk space left?) where something needs to be fixed externally before Resume() can succeed.

mapleFU · 2024-05-15T12:40:37Z

That's a good point, so we should let the user decide what to do in this case...

git-hulk · 2024-05-16T02:04:50Z

@VasuDevrani As discussed above, improving the logging message is good, but we don't expect to escalate to a FATAL error after N retries. What do you think?

VasuDevrani · 2024-05-16T02:12:41Z

> @VasuDevrani As discussed above, improving the logging message is good, but we don't expect to escalate to a FATAL error after N retries. What do you think?

Yeah, I've considered the conversation above. I'm learning a bit more about Kvrocks and its internals, and will update here with my thoughts after a while. Thanks for following up.

git-hulk · 2024-05-16T02:21:42Z

@VasuDevrani Cool, don't hesitate to raise if you have any ideas.

VasuDevrani · 2024-05-16T13:17:47Z

> @VasuDevrani As discussed above, improving the logging message is good, but we don't expect to escalate to a FATAL error after N retries. What do you think?

I agree with just improving the logging message for now. (Should I continue with a PR?)

> @VasuDevrani Cool, don't hesitate to raise if you have any ideas.

Later, if the need arises, we can go with the idea of a RESUME command for admins, as suggested by @caipengbo,
or we can add a configuration option such as max_number_of_retry for this operation.
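For reference, a bounded-retry policy of the kind mentioned here could be sketched as follows. This is only an illustration under stated assumptions: `ResumeRetrier` and its members are hypothetical names, `max_retries` models the proposed max_number_of_retry option (which is not an existing Kvrocks setting), and the `resume` callback stands in for the real resume call, returning true on success.

```cpp
#include <functional>

// Hypothetical retry bookkeeping for the periodic resume check.
struct ResumeRetrier {
  int max_retries;          // models the proposed max_number_of_retry option
  int failed_attempts = 0;  // consecutive failures seen so far

  // True once the retry budget is exhausted.
  bool GaveUp() const { return failed_attempts >= max_retries; }

  // Called once per scheduled check. Returns true when the DB resumed.
  // After max_retries consecutive failures it no longer calls `resume`.
  bool Tick(const std::function<bool()> &resume) {
    if (GaveUp()) return false;
    if (resume()) {
      failed_attempts = 0;  // success resets the budget
      return true;
    }
    ++failed_attempts;
    return false;
  }
};
```

Once `GaveUp()` returns true the caller could log at a higher severity. What to do beyond that (keep running, or terminate) is the policy question raised earlier in this thread.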

git-hulk · 2024-05-16T13:20:43Z

> I agree with just improving the logging message for now. (Should I continue with a PR?)

Sure, you can continue improving the logging message.

caipengbo · 2024-05-16T13:25:40Z

> I agree with just improving the logging message for now. (Should I continue with a PR?)

Yes, RESUME is not a high priority.

VasuDevrani added the enhancement label May 15, 2024

VasuDevrani mentioned this issue May 16, 2024

Improve logging message for retryable background IO errors #2317

Merged

PragmaTwice closed this as completed in #2317 May 17, 2024
