
Leader election and dqlite errors when recovering nodes in HA cluster #2819

Closed
bc185174 opened this issue Jan 4, 2022 · 6 comments

bc185174 commented Jan 4, 2022

Hello,

We have an HA cluster with three nodes, each running v1.21.7:

NAME                    STATUS   ROLES    AGE   VERSION                    INTERNAL-IP      EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
masterA                 Ready    <none>   37m   v1.21.7-3+7700880a5c71e2   X.X.X.X          <none>        Ubuntu 18.04.6 LTS   5.4.0-79-generic   containerd://1.4.4
masterB                 Ready    <none>   32m   v1.21.7-3+7700880a5c71e2   X.X.X.X          <none>        Ubuntu 18.04.6 LTS   5.4.0-79-generic   containerd://1.4.4
masterC                 Ready    <none>   42m   v1.21.7-3+7700880a5c71e2   X.X.X.X          <none>        Ubuntu 18.04.6 LTS   5.4.0-79-generic   containerd://1.4.4

We came across an issue when two of the nodes, masterA and masterB, were removed ungracefully and shut down. The elected leader was masterA. The following errors occurred around the same time on the remaining node, masterC:

leaderelection.go:325] error retrieving resource lock kube-system/kube-scheduler: Get "https://127.0.0.1:16443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=15s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}: context canceled
apiserver was unable to write a JSON response: http: Handler timeout
apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout
apiserver was unable to write a fallback JSON response: http: Handler timeout
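
For completeness, the lease object named in the first error can be inspected directly once the API server is responding again; a minimal check, assuming the bundled kubectl:

# Show the kube-scheduler leader-election lease; spec.holderIdentity and
# spec.renewTime indicate which node last held the lease and when it renewed it.
microk8s kubectl -n kube-system get lease kube-scheduler -o yaml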

At this point the elected leader remained masterA, which was still powered off; however, when we powered on masterB, it failed to start the kubelite service:

microk8s.daemon-kubelite[8542]: Error: start node: raft_start(): io: load closed segment 0000000001324915-0000000001325281: entries batch 52 starting at byte 1041968: data checksum mismatch

Could the segment be corrupt, or would this suggest that masterB cannot sync its dqlite files to the elected leader masterA, which is still unavailable? If so, is there a way we can validate the checksum? To recover masterB, I had to delete the mentioned dqlite segment and restart the kubelite service. Once masterB and masterC were available, a new leader (masterB) was elected and we were able to recover the cluster.
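
For reference, this is roughly the recovery we performed, expressed as shell commands. It is only a sketch: it assumes a standard snap install where the dqlite data lives under /var/snap/microk8s/current/var/kubernetes/backend, and the backup path is just an example.

# Stop all microk8s services before touching dqlite's on-disk state.
sudo snap stop microk8s

# dqlite data for microk8s normally lives here on snap installs (path may differ).
cd /var/snap/microk8s/current/var/kubernetes/backend

# Back the whole directory up first.
sudo cp -a . /root/dqlite-backup-$(date +%F)

# Move the closed segment named in the error out of the way instead of deleting it outright.
sudo mv 0000000001324915-0000000001325281 /root/

# Restart microk8s; kubelite should come back up and re-sync state from the leader.
sudo snap start microk8s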

From the HA documentation, would having only a single node available render the cluster inoperable? Essentially, would we need more than one node available at any time?
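
For context, this is how we check which nodes the datastore currently counts as voters; a sketch, assuming an HA-enabled microk8s where the status command reports the dqlite voters:

# Reports whether HA is enabled and which nodes are dqlite (datastore) voters.
microk8s status

# Typical output includes lines such as:
#   high-availability: yes
#   datastore master nodes: <ip>:19001 <ip>:19001 <ip>:19001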

There were a few other suggestions, such as increasing the following arguments for the kube-scheduler and kube-controller-manager (source):

--leader-elect-lease-duration=60s
--leader-elect-renew-deadline=40s
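
In microk8s these would presumably go into the component argument files under /var/snap/microk8s/current/args/; a sketch, assuming a standard snap install (the values above are the suggested ones, not defaults):

# Append the longer leader-election timeouts to the scheduler and controller-manager args.
echo '--leader-elect-lease-duration=60s' | sudo tee -a /var/snap/microk8s/current/args/kube-scheduler
echo '--leader-elect-renew-deadline=40s' | sudo tee -a /var/snap/microk8s/current/args/kube-scheduler
echo '--leader-elect-lease-duration=60s' | sudo tee -a /var/snap/microk8s/current/args/kube-controller-manager
echo '--leader-elect-renew-deadline=40s' | sudo tee -a /var/snap/microk8s/current/args/kube-controller-manager

# Restart the combined kubelite daemon so the new flags take effect.
sudo snap restart microk8s.daemon-kubelite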

A number of comments mention the same issue on microk8s v1.21. The last potential cause raised was a "resource crunch or network issue", mentioned here. We have not yet been able to replicate the issue, but we would appreciate it if anyone could shed some light on this.

molnarp commented Jan 6, 2022

I am having the same issue on a single-node cluster on Debian 11 with MicroK8s v1.22.4 (rev 2695). After leader election is lost, the process is apparently terminated and restarted by systemd, which throws all the pods into turmoil.
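
To confirm whether systemd really is restarting kubelite after the election loss, something like the following can be used; a sketch, assuming the standard snap service name:

# Show the service state and recent restarts for kubelite as managed by the snap.
systemctl status snap.microk8s.daemon-kubelite

# Pull recent log lines mentioning leader election or restarts.
journalctl -u snap.microk8s.daemon-kubelite --since "1 hour ago" | grep -iE "leader|restart|fatal"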

balchua (Collaborator) commented Jan 7, 2022

@bc185174 your observation is about right. In a raft cluster (whether etcd or dqlite), there must be a majority.
So a 3-node cluster can sustain 1 node down, while a 5-node cluster can sustain 2 nodes down.

With regards to that error on your node masterB, my guess is that the ungraceful shutdown of the node caused the issue.

bc185174 (Author) commented Jan 10, 2022

> @bc185174 your observation is about right. In a raft cluster (whether etcd or dqlite), there must be a majority. So a 3-node cluster can sustain 1 node down, while a 5-node cluster can sustain 2 nodes down.
>
> With regards to that error on your node masterB, my guess is that the ungraceful shutdown of the node caused the issue.

Thank you for clarifying. In the dqlite repo there is some documentation on the raft_start() error. Currently the only solution is to remove the offending segment and restart the kubelite service.

ktsakalozos (Member) commented

@bc185174, the error raft_start(): io: load closed segment 0000000001324915-0000000001325281: entries batch 52 starting at byte 1041968: data checksum mismatch indicates some form of data corruption. This probably happened because of the unclean way the node was taken down. Maybe @MathieuBordere knows if there are any plans to perform some kind of (semi) automated "fsck" on the data and recover from such cases.

MathieuBordere commented

> @bc185174, the error raft_start(): io: load closed segment 0000000001324915-0000000001325281: entries batch 52 starting at byte 1041968: data checksum mismatch indicates some form of data corruption. This probably happened because of the unclean way the node was taken down. Maybe @MathieuBordere knows if there are any plans to perform some kind of (semi) automated "fsck" on the data and recover from such cases.

It's not planned immediately, but it has already been discussed and IMO it would be useful to add. I'll try to get to it within a reasonable timeframe.


stale bot commented Dec 6, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the inactive label on Dec 6, 2022
stale bot closed this as completed on Jan 5, 2023