Behavior when dqlite's raft node enters the RAFT_UNAVAILABLE state. #213
Open

MathieuBordere opened this issue Nov 21, 2022 · 6 comments

Labels: Bug (Confirmed to be a bug), Feature (New feature, not a bug)

MathieuBordere (Contributor) commented Nov 21, 2022

When an unrecoverable error occurs, a raft node can enter the RAFT_UNAVAILABLE state and will never leave it until the process running the raft node is restarted. From my understanding the app package does not yet handle this case. Ideally we should detect this unrecoverable state and restart the node.

MathieuBordere (Contributor, Author) commented Nov 21, 2022

This will happen more frequently now due to canonical/dqlite#434, which unmasked a class of unrecoverable errors that used to be ignored (failed application of raft log entries).

MathieuBordere (Contributor, Author) commented Nov 22, 2022

What do you think @freeekanayaka?
My initial thought would be to monitor for the RAFT_UNAVAILABLE state (or whatever state we translate it to for dqlite) in the app's run loop and restart the dqlite node (and everything else necessary) ad infinitum, roughly along the lines of the sketch below.
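
A rough sketch of the kind of watchdog I have in mind; the `isUnavailable` and `restart` callbacks are placeholders for functionality the app package would need to grow, nothing like them exists today:

```go
package watchdog

import (
	"context"
	"log"
	"time"
)

// watchNode polls the node and restarts it whenever it reports the
// unavailable state. The isUnavailable and restart callbacks are
// placeholders: the app package does not expose anything like them yet.
func watchNode(ctx context.Context, isUnavailable func() bool, restart func() error) {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if !isUnavailable() {
				continue
			}
			if err := restart(); err != nil {
				log.Printf("restart after RAFT_UNAVAILABLE failed: %v", err)
			}
		}
	}
}
```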

freeekanayaka (Contributor) commented:
First of all, do we have an idea of what errors are being triggered by the FSM? I understand that with the disk-mode feature the error surface area is larger (I/O-related errors, such as out-of-space), but for the in-memory one I don't quite see a reason for failure (except perhaps the switch default: case for unknown commands). If you see failures other than that one, they may actually be a symptom of other bugs.

That being said, I'm not entirely sure we should handle this transparently (i.e. perform automatic restarts, either in the app Go package or directly within the dqlite C engine itself).

In the in-memory case, from dqlite's and raft's perspective this should pretty much be an unrecoverable error: the FSM is supposed to be deterministic, so restarting the node should trigger the exact same problem again. For the disk-mode case, there might be transient I/O errors like disk full, and for those a retry might help.

In both cases I'd say that the error is kind of a show-stopper that requires a human to look at the situation (for example you might need to upgrade the dqlite version if there is an unknown command, or free disk space if it's full). So I'm thinking that perhaps the best course of action would be to propagate the error up the stack to the app's run loop and probably issue a panic with a description of the failure, or alternatively enter some state in the dqlite engine where every wire protocol request will return that error.
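
A minimal sketch of the "latch the error" idea, for illustration only; these types are hypothetical and nothing like them exists in dqlite or the app package:

```go
package example

import "sync"

// fatalState latches the first unrecoverable error reported by the node so
// that every later request can fail with the same descriptive message
// instead of panicking. Purely illustrative.
type fatalState struct {
	mu  sync.Mutex
	err error
}

// Fail records the first fatal error, e.g. "disk is full" or
// "unknown command: please upgrade dqlite".
func (f *fatalState) Fail(err error) {
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.err == nil {
		f.err = err
	}
}

// Check is consulted before serving each wire protocol request; a non-nil
// result would be returned to the client as the request's error.
func (f *fatalState) Check() error {
	f.mu.Lock()
	defer f.mu.Unlock()
	return f.err
}
```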

MathieuBordere (Contributor, Author) commented:
It's indeed related to disk mode: when the disk is full, sqlite can return an error when opening a database connection. I think what you propose makes more sense than retrying, thank you.

tomponline (Member) commented:
Could I request that we don't panic in the library, as that would prevent the app from taking any remedial or notification action?

freeekanayaka (Contributor) commented Nov 24, 2022

As mentioned, perhaps making all wire protocol requests fail is enough then. If error propagation works all the way through, you should see the error with a reasonable explanation (e.g. "out of disk", or "out-of-date dqlite engine") in, say, LXD logs or command line output.
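
For illustration, assuming the error is latched as sketched above, this is roughly where an app-package user would see it surface; `exerciseDB`, the database name and the error strings are just examples, only `app.App.Open` is the real API:

```go
package example

import (
	"context"
	"log"

	"github.com/canonical/go-dqlite/app"
)

// exerciseDB shows where such an error would surface for a caller: once the
// node refuses requests, every statement issued through the database/sql
// handle returns the descriptive error, ready to be logged by LXD or shown
// on the command line.
func exerciseDB(ctx context.Context, dqliteApp *app.App) {
	db, err := dqliteApp.Open(ctx, "mydb")
	if err != nil {
		log.Printf("open failed: %v", err) // e.g. "disk is full"
		return
	}
	defer db.Close()
	if _, err := db.ExecContext(ctx, "CREATE TABLE IF NOT EXISTS t (n INT)"); err != nil {
		// With the proposed behaviour this carries the root cause rather
		// than panicking the whole process.
		log.Printf("dqlite request failed: %v", err)
	}
}
```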

MathieuBordere added the Bug and Feature labels on Jun 12, 2023