Very frequent Leader Election with High compaction time? #14071
Comments
@iamejboy thanks for reporting the issue. You may want to try tuning etcd for your environment: https://etcd.io/docs/v3.5/tuning/ As you said, this kind of issue is difficult to reproduce. If you can dig in more and would like to provide a PR, that would be great.
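For reference, the tuning guide linked above centers on two flags: the heartbeat interval (roughly the round-trip time between members) and the election timeout (typically around 10x the heartbeat interval). A sketch using etcd's documented default values; these are not a recommendation for this specific cluster:

```shell
# Heartbeat interval: on the order of the peer round-trip time (etcd default: 100 ms).
# Election timeout: roughly 10x the heartbeat interval (etcd default: 1000 ms).
# Tune both to the observed network, rather than copying these defaults blindly.
etcd --heartbeat-interval=100 \
     --election-timeout=1000
```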
If I interpret the impact correctly, the frequent leader elections are disrupting the cluster's availability. The etcd version in use can be seen from Lines 113 to 114 in 3cf2f69, and can be double-confirmed from the …
That's one way to mitigate the impact of network packet loss, delay, and partitions. I did not see the old leader's warning log about failing to send out heartbeats within 2*heartbeat timeout, so I guess it's not a disk problem on the leader. What is the peer-to-peer round-trip time between the old leader and the new leader? Also, would you mind considering the latest 3.4 minor version of etcd? https://github.com/etcd-io/etcd/tree/v3.4.18 Feel free to report back with the network round-trip-time metric between peers if the issue still exists after upgrading.
131 µs avg, 145 µs max, 121 µs min over the last 10 days, per SmokePing stats.
If a follower doesn't receive any message from the leader within randomizedElectionTimeout, it may kick off a new election. This looks like just a performance issue to me. Suggested action: tune the heartbeat interval and election timeout for your environment, per the tuning guide linked above.
Closing this ticket for now. Please feel free to reopen it or raise a new issue if you have any other queries or issues.
The etcd error 'Failed to update lock: etcdserver: request timed out' does not seem to be related to Cilium. Thus, we will add it to the list of exceptions and not fail the CI because of this error. It appears to be a consequence of etcd errors such as:
- 'Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"'
- 'Failed to update lock: etcdserver: request timed out'
These issues are referenced in the etcd repository:
- etcd-io/etcd#14071
- etcd-io/etcd#14027 (comment)
Signed-off-by: André Martins <andre@cilium.io>
What happened?
Frequent leader changes occur with no "MsgTimeoutNow" log from the leader and no "rafthttp: lost the TCP streaming connection with peer" on the followers; a follower simply initiates an election, replacing the leader. We also see occasional 'grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"' errors and compactions taking more than 5s.
What did you expect to happen?
How can we reproduce it (as minimally and precisely as possible)?
Anything else we need to know?
Etcd version (please run commands below)
Etcd configuration (command line flags or environment variables)
/usr/local/bin/etcd --name host-000005 \
  --initial-advertise-peer-urls https://host-000005:2380 \
  --listen-peer-urls https://0.0.0.0:2380 \
  --listen-client-urls https://0.0.0.0:2379 \
  --advertise-client-urls https://host-000005:2379 \
  --initial-cluster host-000000=https://host-000000:2380,host-000001=https://host-000001:2380,host-000002=https://host-000002:2380,host-000003=https://host-000003:2380,host-000004=https://host-000004:2380,host-000005=https://host-000005:2380,host-000006=https://host-000006:2380 \
  --initial-cluster-token host \
  --initial-cluster-state existing \
  --data-dir /var/etcd/data \
  --quota-backend-bytes=8388608000 \
  --client-cert-auth \
  --trusted-ca-file=/etc/ssl/ca.pem \
  --cert-file=/etc/ssl/etcd.pem \
  --key-file=/etc/ssl/etcd-key.pem \
  --peer-client-cert-auth \
  --peer-trusted-ca-file=/etc/ssl/ca.pem \
  --peer-cert-file=/etc/ssl/etcd.pem \
  --peer-key-file=/etc/ssl/etcd-key.pem \
  --cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
Etcd debug information (please run commands below; feel free to obfuscate the IP addresses or FQDNs in the output)
Relevant log output