
Leader election and dqlite errors when recovering nodes in HA cluster #2819

Closed
bc185174 opened this issue Jan 4, 2022 · 6 comments

bc185174 commented Jan 4, 2022

Hello,

We have an HA cluster with three nodes, each running v1.21.7:

NAME                    STATUS   ROLES    AGE   VERSION                    INTERNAL-IP      EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
masterA                 Ready    <none>   37m   v1.21.7-3+7700880a5c71e2   X.X.X.X          <none>        Ubuntu 18.04.6 LTS   5.4.0-79-generic   containerd://1.4.4
masterB                 Ready    <none>   32m   v1.21.7-3+7700880a5c71e2   X.X.X.X          <none>        Ubuntu 18.04.6 LTS   5.4.0-79-generic   containerd://1.4.4
masterC                 Ready    <none>   42m   v1.21.7-3+7700880a5c71e2   X.X.X.X          <none>        Ubuntu 18.04.6 LTS   5.4.0-79-generic   containerd://1.4.4

We came across an issue when two of the nodes, masterA and masterB, were removed ungracefully and shut down. The elected leader was masterA. The following errors occurred around the same time on the remaining node, masterC:

leaderelection.go:325] error retrieving resource lock kube-system/kube-scheduler: Get "https://127.0.0.1:16443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=15s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}: context canceled
apiserver was unable to write a JSON response: http: Handler timeout
apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout
apiserver was unable to write a fallback JSON response: http: Handler timeout
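
For completeness, the lease object named in the first error can be inspected directly once the API server is responding again; a minimal check, assuming the bundled kubectl:

# Show the kube-scheduler leader-election lease; spec.holderIdentity and
# spec.renewTime indicate which node last held the lease and when it renewed it.
microk8s kubectl -n kube-system get lease kube-scheduler -o yaml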

At this point the elected leader remained masterA, which was still powered off; however, when we powered on masterB, it failed to start the kubelite service:

microk8s.daemon-kubelite[8542]: Error: start node: raft_start(): io: load closed segment 0000000001324915-0000000001325281: entries batch 52 starting at byte 1041968: data checksum mismatch

Could the segment be corrupt, or would this suggest that masterB cannot sync its dqlite files to the elected leader masterA, which is still unavailable? If so, is there a way we can validate the checksum? To recover masterB, I had to delete the mentioned dqlite segment and restart the kubelite service. Once masterB and masterC were available, a new leader (masterB) was elected and we were able to recover the cluster.
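
For reference, this is roughly the recovery we performed, expressed as shell commands. It is only a sketch: it assumes a standard snap install where the dqlite data lives under /var/snap/microk8s/current/var/kubernetes/backend, and the backup path is just an example.

# Stop all microk8s services before touching dqlite's on-disk state.
sudo snap stop microk8s

# dqlite data for microk8s normally lives here on snap installs (path may differ).
cd /var/snap/microk8s/current/var/kubernetes/backend

# Back the whole directory up first.
sudo cp -a . /root/dqlite-backup-$(date +%F)

# Move the closed segment named in the error out of the way instead of deleting it outright.
sudo mv 0000000001324915-0000000001325281 /root/

# Restart microk8s; kubelite should come back up and re-sync state from the leader.
sudo snap start microk8s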

From the HA documentation, would having only a single node available render the cluster inoperable? Essentially, would we need more than one node available at any time?
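
For context, this is how we check which nodes the datastore currently counts as voters; a sketch, assuming an HA-enabled microk8s where the status command reports the dqlite voters:

# Reports whether HA is enabled and which nodes are dqlite (datastore) voters.
microk8s status

# Typical output includes lines such as:
#   high-availability: yes
#   datastore master nodes: <ip>:19001 <ip>:19001 <ip>:19001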

There were a few other suggestions, such as increasing the following arguments for the kube-scheduler and kube-controller-manager (source):

--leader-elect-lease-duration=60s
--leader-elect-renew-deadline=40s
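
In microk8s these would presumably go into the component argument files under /var/snap/microk8s/current/args/; a sketch, assuming a standard snap install (the values above are the suggested ones, not defaults):

# Append the longer leader-election timeouts to the scheduler and controller-manager args.
echo '--leader-elect-lease-duration=60s' | sudo tee -a /var/snap/microk8s/current/args/kube-scheduler
echo '--leader-elect-renew-deadline=40s' | sudo tee -a /var/snap/microk8s/current/args/kube-scheduler
echo '--leader-elect-lease-duration=60s' | sudo tee -a /var/snap/microk8s/current/args/kube-controller-manager
echo '--leader-elect-renew-deadline=40s' | sudo tee -a /var/snap/microk8s/current/args/kube-controller-manager

# Restart the combined kubelite daemon so the new flags take effect.
sudo snap restart microk8s.daemon-kubelite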

A number of comments mention the same issue on microk8s v1.21. The last potential cause raised was a "resource crunch or network issue", mentioned here. We have not yet been able to replicate the issue, but we would appreciate it if anyone could shed some light on this.

molnarp commented Jan 6, 2022

I am having the same issue on a single-node cluster on Debian 11 with MicroK8s v1.22.4 (rev 2695). After leader election is lost, the process is apparently terminated and restarted by systemd, which throws all the pods into turmoil.
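
To confirm whether systemd really is restarting kubelite after the election loss, something like the following can be used; a sketch, assuming the standard snap service name:

# Show the service state and recent restarts for kubelite as managed by the snap.
systemctl status snap.microk8s.daemon-kubelite

# Pull recent log lines mentioning leader election or restarts.
journalctl -u snap.microk8s.daemon-kubelite --since "1 hour ago" | grep -iE "leader|restart|fatal"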

balchua (Collaborator) commented Jan 7, 2022

@bc185174 your observation is about right. In a raft cluster (whether etcd or dqlite), there must be a majority.
So a 3-node cluster can sustain 1 node down, while a 5-node cluster can sustain 2 nodes down.

With regards to that error on your node masterB, my guess is that the ungraceful shutdown of the node caused the issue.

bc185174 (Author) commented Jan 10, 2022

> @bc185174 your observation is about right. In a raft cluster (whether etcd or dqlite), there must be a majority. So a 3-node cluster can sustain 1 node down, while a 5-node cluster can sustain 2 nodes down.
>
> With regards to that error on your node masterB, my guess is that the ungraceful shutdown of the node caused the issue.

Thank you for clarifying. In the dqlite repo there is some documentation on the raft_start() error. Currently the only solution is to remove the offending segment and restart the kubelite service.

ktsakalozos (Member) commented

@bc185174, the error raft_start(): io: load closed segment 0000000001324915-0000000001325281: entries batch 52 starting at byte 1041968: data checksum mismatch indicates some form of data corruption. This probably happened because of the unclean way the node was taken down. Maybe @MathieuBordere knows if there are any plans to perform some kind of (semi) automated "fsck" on the data and recover from such cases.

MathieuBordere commented

> @bc185174, the error raft_start(): io: load closed segment 0000000001324915-0000000001325281: entries batch 52 starting at byte 1041968: data checksum mismatch indicates some form of data corruption. This probably happened because of the unclean way the node was taken down. Maybe @MathieuBordere knows if there are any plans to perform some kind of (semi) automated "fsck" on the data and recover from such cases.

It's not planned immediately, but it has already been discussed and IMO it would be useful to add. I'll try to get to it within a reasonable timeframe.


stale bot commented Dec 6, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the inactive label on Dec 6, 2022
stale bot closed this as completed on Jan 5, 2023