Behavior when dqlite's raft node enters the RAFT_UNAVAILABLE state. #213
Open

MathieuBordere opened this issue Nov 21, 2022 · 6 comments

Labels: Bug (Confirmed to be a bug), Feature (New feature, not a bug)

MathieuBordere (Contributor) commented Nov 21, 2022

When an unrecoverable error occurs, a raft node can enter the RAFT_UNAVAILABLE state and will never leave it until the process running the raft node is restarted. From my understanding the app package does not yet handle this case. Ideally we should detect this unrecoverable state and restart the node.

MathieuBordere (Contributor, Author) commented Nov 21, 2022

This will happen more frequently now due to canonical/dqlite#434, which unmasked a class of unrecoverable errors that used to be ignored (failed application of raft log entries).

MathieuBordere (Contributor, Author) commented Nov 22, 2022

What do you think @freeekanayaka?
My initial thought would be to monitor for the RAFT_UNAVAILABLE state (or whatever state we translate it to for dqlite) in the app's run loop and restart the dqlite node (and everything else necessary) ad infinitum, roughly along the lines of the sketch below.
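
A rough sketch of the kind of watchdog I have in mind; the `isUnavailable` and `restart` callbacks are placeholders for functionality the app package would need to grow, nothing like them exists today:

```go
package watchdog

import (
	"context"
	"log"
	"time"
)

// watchNode polls the node and restarts it whenever it reports the
// unavailable state. The isUnavailable and restart callbacks are
// placeholders: the app package does not expose anything like them yet.
func watchNode(ctx context.Context, isUnavailable func() bool, restart func() error) {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if !isUnavailable() {
				continue
			}
			if err := restart(); err != nil {
				log.Printf("restart after RAFT_UNAVAILABLE failed: %v", err)
			}
		}
	}
}
```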

freeekanayaka (Contributor) commented:
First of all, do we have an idea of what errors are being triggered by the FSM? I understand that with the disk-mode feature the error surface area is larger (I/O-related errors, such as out-of-space), but for the in-memory one I don't quite see a reason for failure (except perhaps the switch default: case for unknown commands). If you see failures other than that one, they may actually be a symptom of other bugs.

That being said, I'm not entirely sure we should handle this transparently (i.e. perform automatic restarts, either in the app Go package or directly within the dqlite C engine itself).

In the in-memory case, from dqlite's and raft's perspective this should pretty much be an unrecoverable error: the FSM is supposed to be deterministic, so restarting the node should trigger the exact same problem again. For the disk-mode case, there might be transient I/O errors like disk full, and for those a retry might help.

In both cases I'd say that the error is kind of a show-stopper that requires a human to look at the situation (for example you might need to upgrade the dqlite version if there is an unknown command, or free disk space if it's full). So I'm thinking that perhaps the best course of action would be to propagate the error up the stack to the app's run loop and probably issue a panic with a description of the failure, or alternatively enter some state in the dqlite engine where every wire protocol request will return that error.
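
A minimal sketch of the "latch the error" idea, for illustration only; these types are hypothetical and nothing like them exists in dqlite or the app package:

```go
package example

import "sync"

// fatalState latches the first unrecoverable error reported by the node so
// that every later request can fail with the same descriptive message
// instead of panicking. Purely illustrative.
type fatalState struct {
	mu  sync.Mutex
	err error
}

// Fail records the first fatal error, e.g. "disk is full" or
// "unknown command: please upgrade dqlite".
func (f *fatalState) Fail(err error) {
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.err == nil {
		f.err = err
	}
}

// Check is consulted before serving each wire protocol request; a non-nil
// result would be returned to the client as the request's error.
func (f *fatalState) Check() error {
	f.mu.Lock()
	defer f.mu.Unlock()
	return f.err
}
```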

MathieuBordere (Contributor, Author) commented:
It's indeed related to disk mode: when the disk is full, sqlite can return an error when opening a database connection. I think what you propose makes more sense than retrying, thank you.

tomponline (Member) commented:
Could I request that we don't panic in the library, as that would prevent the app from taking any remedial or notification action?

freeekanayaka (Contributor) commented Nov 24, 2022

As mentioned, perhaps making all wire protocol requests fail is enough then. If error propagation works all the way through, you should see the error with a reasonable explanation (e.g. "out of disk", or "out-of-date dqlite engine") in, say, LXD logs or command line output.
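
For illustration, assuming the error is latched as sketched above, this is roughly where an app-package user would see it surface; `exerciseDB`, the database name and the error strings are just examples, only `app.App.Open` is the real API:

```go
package example

import (
	"context"
	"log"

	"github.com/canonical/go-dqlite/app"
)

// exerciseDB shows where such an error would surface for a caller: once the
// node refuses requests, every statement issued through the database/sql
// handle returns the descriptive error, ready to be logged by LXD or shown
// on the command line.
func exerciseDB(ctx context.Context, dqliteApp *app.App) {
	db, err := dqliteApp.Open(ctx, "mydb")
	if err != nil {
		log.Printf("open failed: %v", err) // e.g. "disk is full"
		return
	}
	defer db.Close()
	if _, err := db.ExecContext(ctx, "CREATE TABLE IF NOT EXISTS t (n INT)"); err != nil {
		// With the proposed behaviour this carries the root cause rather
		// than panicking the whole process.
		log.Printf("dqlite request failed: %v", err)
	}
}
```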

MathieuBordere added the Bug and Feature labels on Jun 12, 2023