LeaseKeepAlive Unavailable #13632

Open
Joseph94m opened this issue Jan 21, 2022 · 11 comments


Joseph94m commented Jan 21, 2022

Introduction (tl;dr at the bottom)

Hello, I am unsure whether this is an etcd bug, a gRPC one, or something else entirely, so I'll try to be as explicit as possible.

As you may know, there have been a few reports/issues related to grpc_code="Unavailable",grpc_method="Watch" (#12196) and the interpretation of such metrics by Alertmanager (#13127). This has also been discussed in other issues and even on other GitHub projects.

I come to you today with either a similar problem or a misconfiguration: the pair grpc_code="Unavailable",grpc_method="LeaseKeepAlive" appears a lot in the etcd metrics.


Environments:

The etcds run outside of the k8s cluster environment.

  • Env1: 1 etcd version 3.3.15, k8s 1.18.20, runtime docker 18.06.3-ce
  • Env2: 3 etcd version 3.4.3, k8s 1.18.20, runtime docker 18.06.3-ce
  • Env3: 5 etcd version 3.4.3, k8s 1.17.17, runtime docker 18.06.3-ce

The etcd configuration is pretty similar in each env:

[screenshot: etcd configuration]

We've had the fabled Watch Unavailable for a while now but it hasn't bothered me much since I knew it was being looked at, so I've sort of been ignoring the following alert for that error:

- alert: EtcdHighNumberOfFailedGrpcRequests
  expr: |
    (sum(rate(grpc_server_handled_total{grpc_code=~"Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded"}[5m])) without (grpc_type, grpc_code)
      /
     sum(rate(grpc_server_handled_total{}[5m])) without (grpc_type, grpc_code)) > 0.01
  for: 10m
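
For anyone triaging something similar: a quick way to see which method/code pairs actually drive this alert is to break the same counter down by grpc_method and grpc_code. A minimal sketch, assuming Prometheus answers on localhost:9090 (the address is a placeholder; adjust it and any extra labels to your setup):

# list the per-method rate of Unavailable responses
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum by (grpc_method, grpc_code) (rate(grpc_server_handled_total{grpc_code="Unavailable"}[5m]))'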

Issue

Symptoms of the issue

Recently, however, I started receiving EtcdHighNumberOfFailedGrpcRequests alerts triggered by LeaseKeepAlive coupled with the Unavailable code on all 3 environments.

Example from Env2 (but it's the same on Env1 and Env3):

[screenshot: EtcdHighNumberOfFailedGrpcRequests alerts for LeaseKeepAlive]

In-depth look

The metrics generated by the etcds confirm this:

[screenshot: grpc_server_* metrics for LeaseKeepAlive]

How come the Unavailable code is at 71 while received/sent are at 70 and started is at 71? Either I do not understand these metrics or the numbers don't add up.
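
One way to cross-check these numbers is to read the raw counters straight from the etcd /metrics endpoint rather than through the dashboard. A minimal sketch, assuming plain HTTP on the default client port (add the client certificate options to curl if your endpoints are TLS):

# dump every LeaseKeepAlive-related counter from one member
curl -s http://127.0.0.1:2379/metrics | grep 'grpc_method="LeaseKeepAlive"'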

I added --log-level=debug to one etcd to see what's actually happening, and I was able to spot multiple requests marked with "failure":

[screenshot: debug log entries marked with "failure"]

As you can see, there are some failures related to the lease resource (and other things, but I am not sure this is the right venue to discuss them now :D).
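
For anyone who wants to reproduce this: restart one member with --log-level=debug and filter the lease-related entries out of its log stream. A rough sketch, assuming a docker-based deployment like the ones above (the container name is a placeholder; with systemd you would follow journalctl -u etcd instead):

# follow one member's debug log and keep only lease/failure entries
docker logs -f <etcd-container> 2>&1 | grep -iE 'lease|failure'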

Attempts at fixing the issue

I researched the issue a bit online and decided to try a newer etcd image, since there have been multiple fixes since then.

I upgraded Env1 from etcd 3.3.15 to 3.5.1 as this is my test environment.

As expected, grpc_code="Unavailable",grpc_method="Watch" disappeared, but I am still seeing grpc_code="Unavailable",grpc_method="LeaseKeepAlive" at the same rate as before the update.

Confusion

The Lease resources on K8s look correct, as do the kube-scheduler, kube-apiserver, and kube-controller-manager; well... the whole cluster seems healthy.

[screenshot: Lease resources in the cluster]

Not expired:
[screenshot: leases not expired]

All nodes ready:

[screenshot: all nodes in Ready state]

tl;dr

grpc_code="Unavailable",grpc_method="LeaseKeepAlive" popping up a lot on all clusters. Don't know why; not fixed by upgrading etcd. Doesn't seem to have an impact on the clusters.

@Joseph94m
Author

bump ?


stale bot commented Jul 10, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jul 10, 2022
@Alvaro-Campesino

any work on this?

@stale stale bot removed the stale label Jul 27, 2022

stale bot commented Oct 29, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Oct 29, 2022
@stale stale bot closed this as completed Dec 31, 2022

liry commented Aug 30, 2023

It's unfortunate that nobody even replied.
Still happening (v3.5.8) 😞

@serathius
Member

Contributions are welcomed!

@ahrtr ahrtr reopened this Aug 31, 2023
@ahrtr ahrtr removed the stale label Aug 31, 2023
Member

ahrtr commented Aug 31, 2023

> It's unfortunate that nobody even replied. Still happening (v3.5.8) 😞

Could you provide complete etcd log with the debug level (--log-level=debug)?


vladorf commented Oct 31, 2023

> It's unfortunate that nobody even replied. Still happening (v3.5.8) 😞
>
> Could you provide complete etcd log with the debug level (--log-level=debug)?

Hello, we are facing the same issue, so I'll answer instead (also v3.5.8).

I'm not sure the failures mentioned in this issue are what results in the Unavailable grpc code. Running with GODEBUG=http2debug=1 reveals:

{"level":"debug","ts":"2023-10-31T15:29:28.638134Z","caller":"auth/store.go:1151","msg":"found command name","common-name":"<<redacted>>","user-name":"<<redacted>>","revision":88}
{"level":"debug","ts":"2023-10-31T15:29:28.638531Z","caller":"etcdserver/server.go:2142","msg":"Applying entries","num-entries":1}
{"level":"debug","ts":"2023-10-31T15:29:28.638547Z","caller":"etcdserver/server.go:2145","msg":"Applying entry","index":549010374,"term":31,"type":"EntryNormal"}
{"level":"debug","ts":"2023-10-31T15:29:28.638556Z","caller":"etcdserver/server.go:2204","msg":"apply entry normal","consistent-index":549010373,"entry-index":549010374,"should-applyV3":true}
{"level":"debug","ts":"2023-10-31T15:29:28.638569Z","caller":"etcdserver/server.go:2231","msg":"applyEntryNormal","raftReq":"header:<ID:16718641124967440555 username:\"<<redacted>>\" auth_revision:88 > txn:<compare:<target:MOD key:\"/registry/masterleases/10.95.235.36\" mod_revision:86363570 > success:<request_put:<key:\"/registry/masterleases/10.95.235.36\" value:\"k8s\\000\\n\\017\\n\\002v1\\022\\tEndpoints\\022)\\n\\023\\n\\000\\022\\000\\032\\000\\\"\\000*\\0002\\0008\\321\\002B\\000z\\000\\022\\022\\n\\020\\n\\01410.95.235.36\\032\\000\\032\\000\\\"\\000\" lease:7495269088112664745 > > failure:<request_range:<key:\"/registry/masterleases/10.95.235.36\" > > > "}
{"level":"debug","ts":"2023-10-31T15:29:28.638775Z","caller":"v3rpc/interceptor.go:182","msg":"request stats","start time":"2023-10-31T15:29:28.638103Z","time spent":"642.116µs","remote":"10.95.235.36:58620","response type":"/etcdserverpb.KV/Txn","request count":1,"request size":119,"response count":0,"response size":43,"request content":"compare:<target:MOD key:\"/registry/masterleases/10.95.235.36\" mod_revision:86363570 > success:<request_put:<key:\"/registry/masterleases/10.95.235.36\" value_size:68 lease:7495269088112664745 >> failure:<request_range:<key:\"/registry/masterleases/10.95.235.36\" > >"}
2023/10/31 15:29:28 http2: server encoding header ":status" = "200"
2023/10/31 15:29:28 http2: server encoding header "access-control-allow-headers" = "accept, content-type, authorization"
2023/10/31 15:29:28 http2: server encoding header "access-control-allow-methods" = "POST, GET, OPTIONS, PUT, DELETE"
2023/10/31 15:29:28 http2: server encoding header "access-control-allow-origin" = "*"
2023/10/31 15:29:28 http2: server encoding header "content-type" = "application/grpc"
2023/10/31 15:29:28 http2: server encoding header "trailer" = "Grpc-Status"
2023/10/31 15:29:28 http2: server encoding header "trailer" = "Grpc-Message"
2023/10/31 15:29:28 http2: server encoding header "trailer" = "Grpc-Status-Details-Bin"
2023/10/31 15:29:28 http2: server encoding header "grpc-status" = "0"

Worse, there is no mention of "grpc-status" = "14" whatsoever in the logs, so it's difficult to pinpoint the reason from logs alone.

I tried enabling go-grpc messages as described in the go-grpc documentation, but they are not propagated to the etcd output.
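
For reference, the knobs the go-grpc documentation describes are the grpclog environment variables, which have to be set in the environment of the etcd process itself. A sketch of those settings, per the grpc-go README (getting etcd's own logger to surface the resulting messages is exactly the part that did not work for me):

# grpc-go internal log verbosity, read by grpclog at startup
export GRPC_GO_LOG_SEVERITY_LEVEL=info
export GRPC_GO_LOG_VERBOSITY_LEVEL=99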

Any further guidance will be appreciated.

Member

ahrtr commented Oct 31, 2023

@tjungblu Have you ever seen a similar issue in OpenShift?

@tjungblu
Contributor

Never seen this before.


stale bot commented Mar 17, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Mar 17, 2024