Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve error handling and logging for Retryable IO Errors #2312

Closed
2 tasks done
VasuDevrani opened this issue May 15, 2024 · 13 comments · Fixed by #2317
Closed
2 tasks done

Improve error handling and logging for Retryable IO Errors #2312

VasuDevrani opened this issue May 15, 2024 · 13 comments · Fixed by #2317
Labels
enhancement type enhancement

Comments

@VasuDevrani
Copy link
Contributor

Search before asking

  • I had searched in the issues and found no similar issues.

Motivation

In server.cc file, the code checks every minute if the storage is in a retryable IO error state and resumes the database if necessary.

 if (counter != 0 && counter % 600 == 0 && storage->IsDBInRetryableIOError()) {
    storage->GetDB()->Resume();
    LOG(INFO) << "[server] Schedule to resume DB after retryable IO error";
    storage->SetDBInRetryableIOError(false);
  }

Concerns:

  1. The code resumes the database and sets the retryable IO error state to false without checking if the resume operation itself succeeded.
  2. logs INFO message which may be ERROR or WARNING instead

Solution

example code to resolve this:

if (counter != 0 && counter % 600 == 0 && storage->IsDBInRetryableIOError()) {
  auto status = storage->GetDB()->Resume();
  if (status.ok()) {
    LOG(WARNING) << "[server] Successfully resumed DB after retryable IO error: " << status.ToString();
    storage->SetDBInRetryableIOError(false);
  } else {
    LOG(ERROR) << "[server] Failed to resume DB after retryable IO error: " << status.ToString();
    // Additional error handling, such as retrying or notifying the administrator
  }
}

Are you willing to submit a PR?

  • I'm willing to submit a PR!
@VasuDevrani VasuDevrani added the enhancement type enhancement label May 15, 2024
@git-hulk
Copy link
Member

@VasuDevrani Looks good, you can go ahead to improve this.

@VasuDevrani
Copy link
Contributor Author

@VasuDevrani Looks good, you can go ahead to improve this.

See, the code proposed as solution just prints ERROR in else block.
How about we add a set number of retry for resuming DB operation in case of failure, and if it still fails after retries, then we log a CRITICAL error finally.

@git-hulk
Copy link
Member

I'm wondering if it's a good idea to do that. I prefer letting users determine whether to terminate themself instead of by default N times retry. @PragmaTwice @caipengbo @torwig What do you think?

@caipengbo
Copy link
Contributor

caipengbo commented May 15, 2024

Yes, I also don't think it's necessary to set a number of retries. Alternatively, we could provide a RESUME command, which would give administrators more options.

@mapleFU
Copy link
Member

mapleFU commented May 15, 2024

Out-of curiousity, do we have ability to recovery from error if Resume() failed for one-times? Seems we can only recover from merely case like Compaction output error or flush error?

@caipengbo
Copy link
Contributor

caipengbo commented May 15, 2024

do we have ability to recovery from error if Resume() failed for one-times?

There might be some cases (disk no space?) where it needs to be done externally for the Resume() to succeed.

@mapleFU
Copy link
Member

mapleFU commented May 15, 2024

Thats a good point, so we should let user decide what to do during this case...

@git-hulk
Copy link
Member

@VasuDevrani As discussed above, improving the logging message is good, but don't expect to escalate the FATAL error after N time tries. What do you think?

@VasuDevrani
Copy link
Contributor Author

@VasuDevrani As discussed above, improving the logging message is good, but don't expect to escalate the FATAL error after N time tries. What do you think?

Yeah I've considered above conversation. I'm learning a bit more about Kvrocks, its internals. will update here with my thoughts after a while. Thanks for following up.

@git-hulk
Copy link
Member

@VasuDevrani Cool, don't hesitate to raise if you have any ideas.

@VasuDevrani
Copy link
Contributor Author

@VasuDevrani As discussed above, improving the logging message is good, but don't expect to escalate the FATAL error after N time tries. What do you think?

I agree with just improving the logging message for now. (should i make continue with a PR?)

@VasuDevrani Cool, don't hesitate to raise if you have any ideas.

Later, if need arise we can go with idea of having RESUME command for admins as suggested by @caipengbo
or, we can add a configuration option for max_number_of_retry for this operation

@git-hulk
Copy link
Member

I agree with just improving the logging message for now. (should i make continue with a PR?)

Sure, you can continue improving the logging message.

@caipengbo
Copy link
Contributor

I agree with just improving the logging message for now. (should i make continue with a PR?)

Yes, RESUME is not a high priority.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement type enhancement
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants