LeaseKeepAlive Unavailable #13632

Open
Joseph94m opened this issue Jan 21, 2022 · 11 comments


Joseph94m commented Jan 21, 2022

Introduction (tl;dr at the bottom)

Hello, I am unsure whether this is an etcd bug, a gRPC one, or something else entirely, so I'll try to be as explicit as possible.

As you may know, there have been a few reports/issues related to grpc_code="Unavailable",grpc_method="Watch" (#12196) and the interpretation of such metrics by Alertmanager (#13127). This has also been discussed in other issues and even on other GitHub projects.

I come to you today with either a similar problem or a misconfiguration: the pair grpc_code="Unavailable",grpc_method="LeaseKeepAlive" appears a lot in the etcd metrics.


Environments:

The etcds run outside of the k8s cluster environment.

  • Env1: 1 etcd version 3.3.15, k8s 1.18.20, runtime docker 18.06.3-ce
  • Env2: 3 etcd version 3.4.3, k8s 1.18.20, runtime docker 18.06.3-ce
  • Env3: 5 etcd version 3.4.3, k8s 1.17.17, runtime docker 18.06.3-ce

The etcd configuration is pretty similar in each env:

[screenshot: etcd configuration]

We've had the fabled Watch Unavailable for a while now but it hasn't bothered me much since I knew it was being looked at, so I've sort of been ignoring the following alert for that error:

- alert: EtcdHighNumberOfFailedGrpcRequests
  expr: |
    (sum(rate(grpc_server_handled_total{grpc_code=~"Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded"}[5m])) without (grpc_type, grpc_code)
      /
     sum(rate(grpc_server_handled_total{}[5m])) without (grpc_type, grpc_code)) > 0.01
  for: 10m
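
For anyone triaging something similar: a quick way to see which method/code pairs actually drive this alert is to break the same counter down by grpc_method and grpc_code. A minimal sketch, assuming Prometheus answers on localhost:9090 (the address is a placeholder; adjust it and any extra labels to your setup):

# list the per-method rate of Unavailable responses
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum by (grpc_method, grpc_code) (rate(grpc_server_handled_total{grpc_code="Unavailable"}[5m]))'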

Issue

Symptoms of the issue

Recently, however, I started receiving EtcdHighNumberOfFailedGrpcRequests alerts triggered by LeaseKeepAlive coupled with the Unavailable code on all 3 environments.

Example from Env2 (but it's the same on Env1 and Env3):

[screenshot: EtcdHighNumberOfFailedGrpcRequests alerts for LeaseKeepAlive]

In-depth look

The metrics generated by the etcds confirm this:

[screenshot: grpc_server_* metrics for LeaseKeepAlive]

How come the Unavailable code is at 71 while received/sent are at 70 and started is at 71? Either I do not understand these metrics or the numbers don't add up.
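
One way to cross-check these numbers is to read the raw counters straight from the etcd /metrics endpoint rather than through the dashboard. A minimal sketch, assuming plain HTTP on the default client port (add the client certificate options to curl if your endpoints are TLS):

# dump every LeaseKeepAlive-related counter from one member
curl -s http://127.0.0.1:2379/metrics | grep 'grpc_method="LeaseKeepAlive"'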

I added --log-level=debug to one etcd to see what's actually happening, and I was able to spot multiple requests marked with "failure":

[screenshot: debug log entries marked with "failure"]

As you can see, there are some failures related to the lease resource (and other things, but I am not sure this is the right venue to discuss them now :D).
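
For anyone who wants to reproduce this: restart one member with --log-level=debug and filter the lease-related entries out of its log stream. A rough sketch, assuming a docker-based deployment like the ones above (the container name is a placeholder; with systemd you would follow journalctl -u etcd instead):

# follow one member's debug log and keep only lease/failure entries
docker logs -f <etcd-container> 2>&1 | grep -iE 'lease|failure'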

Attempts at fixing the issue

I researched the issue a bit online and decided to try a newer etcd image, since there have been multiple fixes since then.

I upgraded Env1 from etcd 3.3.15 to 3.5.1 as this is my test environment.

As expected, grpc_code="Unavailable",grpc_method="Watch" disappeared, but I am still seeing grpc_code="Unavailable",grpc_method="LeaseKeepAlive" at the same rate as before the update.

Confusion

The Lease resources on K8s look correct, as do the kube-scheduler, kube-apiserver, and kube-controller-manager; well... the whole cluster seems healthy.

[screenshot: Lease resources in the cluster]

Not expired:
[screenshot: leases not expired]

All nodes ready:

[screenshot: all nodes in Ready state]

tl;dr

grpc_code="Unavailable",grpc_method="LeaseKeepAlive" popping up a lot on all clusters. Don't know why; not fixed by upgrading etcd. Doesn't seem to have an impact on the clusters.

@Joseph94m
Author

bump ?


stale bot commented Jul 10, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jul 10, 2022
@Alvaro-Campesino

any work on this?

@stale stale bot removed the stale label Jul 27, 2022

stale bot commented Oct 29, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Oct 29, 2022
@stale stale bot closed this as completed Dec 31, 2022

liry commented Aug 30, 2023

It's unfortunate that nobody even replied.
Still happening (v3.5.8) 😞

@serathius
Member

Contributions are welcomed!

@ahrtr ahrtr reopened this Aug 31, 2023
@ahrtr ahrtr removed the stale label Aug 31, 2023
Member

ahrtr commented Aug 31, 2023

> It's unfortunate that nobody even replied. Still happening (v3.5.8) 😞

Could you provide complete etcd log with the debug level (--log-level=debug)?


vladorf commented Oct 31, 2023

> It's unfortunate that nobody even replied. Still happening (v3.5.8) 😞
>
> Could you provide complete etcd log with the debug level (--log-level=debug)?

Hello, we are facing the same issue, so I'll answer instead (also v3.5.8).

I'm not sure the failures mentioned in this issue are what results in the Unavailable grpc code. Running with GODEBUG=http2debug=1 reveals:

{"level":"debug","ts":"2023-10-31T15:29:28.638134Z","caller":"auth/store.go:1151","msg":"found command name","common-name":"<<redacted>>","user-name":"<<redacted>>","revision":88}
{"level":"debug","ts":"2023-10-31T15:29:28.638531Z","caller":"etcdserver/server.go:2142","msg":"Applying entries","num-entries":1}
{"level":"debug","ts":"2023-10-31T15:29:28.638547Z","caller":"etcdserver/server.go:2145","msg":"Applying entry","index":549010374,"term":31,"type":"EntryNormal"}
{"level":"debug","ts":"2023-10-31T15:29:28.638556Z","caller":"etcdserver/server.go:2204","msg":"apply entry normal","consistent-index":549010373,"entry-index":549010374,"should-applyV3":true}
{"level":"debug","ts":"2023-10-31T15:29:28.638569Z","caller":"etcdserver/server.go:2231","msg":"applyEntryNormal","raftReq":"header:<ID:16718641124967440555 username:\"<<redacted>>\" auth_revision:88 > txn:<compare:<target:MOD key:\"/registry/masterleases/10.95.235.36\" mod_revision:86363570 > success:<request_put:<key:\"/registry/masterleases/10.95.235.36\" value:\"k8s\\000\\n\\017\\n\\002v1\\022\\tEndpoints\\022)\\n\\023\\n\\000\\022\\000\\032\\000\\\"\\000*\\0002\\0008\\321\\002B\\000z\\000\\022\\022\\n\\020\\n\\01410.95.235.36\\032\\000\\032\\000\\\"\\000\" lease:7495269088112664745 > > failure:<request_range:<key:\"/registry/masterleases/10.95.235.36\" > > > "}
{"level":"debug","ts":"2023-10-31T15:29:28.638775Z","caller":"v3rpc/interceptor.go:182","msg":"request stats","start time":"2023-10-31T15:29:28.638103Z","time spent":"642.116µs","remote":"10.95.235.36:58620","response type":"/etcdserverpb.KV/Txn","request count":1,"request size":119,"response count":0,"response size":43,"request content":"compare:<target:MOD key:\"/registry/masterleases/10.95.235.36\" mod_revision:86363570 > success:<request_put:<key:\"/registry/masterleases/10.95.235.36\" value_size:68 lease:7495269088112664745 >> failure:<request_range:<key:\"/registry/masterleases/10.95.235.36\" > >"}
2023/10/31 15:29:28 http2: server encoding header ":status" = "200"
2023/10/31 15:29:28 http2: server encoding header "access-control-allow-headers" = "accept, content-type, authorization"
2023/10/31 15:29:28 http2: server encoding header "access-control-allow-methods" = "POST, GET, OPTIONS, PUT, DELETE"
2023/10/31 15:29:28 http2: server encoding header "access-control-allow-origin" = "*"
2023/10/31 15:29:28 http2: server encoding header "content-type" = "application/grpc"
2023/10/31 15:29:28 http2: server encoding header "trailer" = "Grpc-Status"
2023/10/31 15:29:28 http2: server encoding header "trailer" = "Grpc-Message"
2023/10/31 15:29:28 http2: server encoding header "trailer" = "Grpc-Status-Details-Bin"
2023/10/31 15:29:28 http2: server encoding header "grpc-status" = "0"

Worse, there is no mention of "grpc-status" = "14" whatsoever in the logs, so it's difficult to pinpoint the reason from logs alone.

I tried enabling go-grpc messages as described in the go-grpc documentation, but they are not propagated to the etcd output.
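
For reference, the knobs the go-grpc documentation describes are the grpclog environment variables, which have to be set in the environment of the etcd process itself. A sketch of those settings, per the grpc-go README (getting etcd's own logger to surface the resulting messages is exactly the part that did not work for me):

# grpc-go internal log verbosity, read by grpclog at startup
export GRPC_GO_LOG_SEVERITY_LEVEL=info
export GRPC_GO_LOG_VERBOSITY_LEVEL=99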

Any further guidance will be appreciated.

Member

ahrtr commented Oct 31, 2023

@tjungblu Have you ever seen a similar issue in OpenShift?

@tjungblu
Contributor

Never seen this before.


stale bot commented Mar 17, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Mar 17, 2024