LeaseKeepAlive Unavailable #13632
Comments
bump?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
Any work on this?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
It's unfortunate that nobody even replied.
Contributions are welcome!
Could you provide a complete etcd log with the debug level enabled?
Hello, we are facing the same issue, so I'll answer instead (also on v3.5.8). I'm not sure the failures mentioned in this issue are what produces the Unavailable metrics.
Worse, there is no mention of them in our logs. I tried enabling go-grpc messages as described in the go-grpc documentation, but they are not propagated to etcd's output. Any further guidance would be appreciated.
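For reference, the grpc-go logging knobs are plain environment variables set on the etcd process. Below is a minimal sketch of what "enabling go-grpc messages" usually means; whether etcd's logger actually surfaces this output is exactly the open question here.

```sh
# grpc-go's standard log controls (grpclog); set before starting etcd.
export GRPC_GO_LOG_SEVERITY_LEVEL=info
export GRPC_GO_LOG_VERBOSITY_LEVEL=99

# Combine with etcd's own debug logging (all other flags elided here).
etcd --log-level=debug
```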
@tjungblu Have you ever seen a similar issue in OpenShift?
Never seen this before.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
Introduction (tl;dr at the bottom)
Hello, I am unsure whether this is an etcd bug, a gRPC one, or something else entirely, so I'll try to be as explicit as possible.
As you may know, there have been a few reports/issues related to grpc_code="Unavailable",grpc_method="Watch" (#12196) and the interpretation of such metrics by Alertmanager (#13127). It has also been discussed in other issues and even in other GitHub projects.
I come to you today with either a similar problem or a misconfiguration, leading to the pair grpc_code="Unavailable",grpc_method="LeaseKeepAlive" appearing a lot in etcd's metrics.
Environments:
The etcd members run outside of the k8s cluster environment.
The etcd configuration is pretty similar for each env:
We've had the fabled Watch Unavailable for a while now, but it hasn't bothered me much since I knew it was being looked at, so I've sort of been ignoring the following alert for that error:
- alert: EtcdHighNumberOfFailedGrpcRequests
  expr: (sum(rate(grpc_server_handled_total{grpc_code=~"Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded"}[5m])) without (grpc_type, grpc_code)
        / sum(rate(grpc_server_handled_total{}[5m])) without (grpc_type, grpc_code)) > 0.01
  for: 10m
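When this alert fires, a per-method breakdown of the same counter makes it obvious which RPC is driving the ratio. A minimal sketch using the Prometheus HTTP API, assuming Prometheus is reachable at prometheus:9090 (hostname and port are placeholders):

```sh
# Break the Unavailable rate down by gRPC method.
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=sum by (grpc_method) (rate(grpc_server_handled_total{grpc_code="Unavailable"}[5m]))'
```

Running this per environment quickly shows whether Watch, LeaseKeepAlive, or something else is contributing.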
Issue
Symptoms of the issue
Recently, however, I started receiving EtcdHighNumberOfFailedGrpcRequests alerts triggered by LeaseKeepAlive coupled with the Unavailable code on all 3 environments.
Example from Env2 (but it's the same on Env1 and Env3):
In-depth look
The metrics generated by the etcds confirm this:
How come we have the Unavailable handled count at 71 while received/sent are at 70 and started is at 71? Either I do not understand these metrics or the numbers don't add up.
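For anyone who wants to reproduce the comparison, the raw counters can be pulled straight from a member's metrics endpoint. A sketch assuming a plain HTTP client URL on 127.0.0.1:2379 (adjust the address and add TLS options to match your deployment):

```sh
# Dump the LeaseKeepAlive-related gRPC counters from one etcd member.
curl -s http://127.0.0.1:2379/metrics \
  | grep 'grpc_method="LeaseKeepAlive"' \
  | grep -E 'grpc_server_(started|handled|msg_received|msg_sent)_total'
```

Note that LeaseKeepAlive is a streaming RPC: started/handled count whole streams (handled records the stream's final status code), while msg_received/msg_sent count individual messages, so these counters do not have to line up exactly.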
I added --log-level=debug to one etcd member to see what is actually happening, and I was able to spot multiple requests marked with "failure":
As you can see, there are some failures related to the lease resource (and other things, but I am not sure this is the right venue to discuss them now :D).
Attempts at fixing the issue
I've researched the issue a bit online and decided to try a newer etcd image, since there have been multiple fixes since then.
I upgraded Env1 from etcd 3.3.15 to 3.5.1, as this is my test environment.
As expected, the grpc_code="Unavailable",grpc_method="Watch" errors disappeared, but I am still facing grpc_code="Unavailable",grpc_method="LeaseKeepAlive" at the same rate as before the upgrade.
Confusion
The Lease resources on K8s seem correct, as do the kube-scheduler, kube-apiserver, and kube-controller-manager; well... the whole cluster seems healthy (the checks are sketched below).
Not expired:
All nodes ready:
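For completeness, sanity checks along these lines cover the Kubernetes side and the etcd side (a sketch, not the exact commands used above; namespace and connection flags depend on your setup):

```sh
# Kubernetes side: node heartbeat Leases (inspect renewTime with -o yaml if needed).
kubectl get leases -n kube-node-lease
# All nodes should report Ready.
kubectl get nodes

# etcd side: list the leases currently granted by the cluster
# (add --endpoints/--cacert/--cert/--key to match the deployment).
etcdctl lease list
```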
tl;dr
grpc_code="Unavailable",grpc_method="LeaseKeepAlive" popping up a lot on the cluster. Don't know why; not fixed by upgrading etcd; doesn't seem to have an impact on the cluster.