Livez/Readyz #16007

logicalhan · 2023-06-05T14:53:59Z

What would you like to be added?

etcd currently has a single health endpoint, /health, which Kubernetes distros use for both liveness and readiness checks. To be fully API-compliant, we should have both a liveness check (i.e. /livez), which verifies that this individual etcd member is "alive" and does not need to be restarted, and a readiness check (i.e. /readyz), which signals that the etcd member is ready to accept traffic.

Why is this needed?

There is a difference between "please restart me, I'm that unhealthy" and "please send me all sorts of traffic, I'm ready for it".


chaochn47 · 2023-06-05T17:08:25Z

Please check this #13340 (comment)

logicalhan · 2023-06-05T17:14:50Z

> Please check this #13340 (comment)

Yeah, I don't buy it. No one is going to dig up an obscure GitHub issue in order to properly configure etcd for Kubernetes.

chaochn47 · 2023-06-05T17:24:45Z

Yeah, that makes sense. We should rethink it and document it properly, for example which etcd versions it applies to.

logicalhan · 2023-06-05T17:28:01Z

> Yeah, that makes sense. We should rethink it and document it properly, for example which etcd versions it applies to.

As long as you don't touch the existing health endpoint, it's completely backwards compatible and therefore can even be backported.

logicalhan · 2023-06-05T17:28:15Z

for ref: #16008

ahrtr · 2023-06-05T23:37:21Z

Thanks @logicalhan for raising this request. I am supportive of it. /health?serializable=<true|false> isn't an explicit API, and it also requires people to understand what serializable means.

/livez and /readyz are more explicit and easy to understand & use.

/livez is similar to (or syntactic sugar for) /health?serializable=true. It only checks the local etcd instance's health, and should return healthy as long as the local instance is running and healthy. We shouldn't restart an etcd instance when the cluster is unhealthy (e.g. quorum is lost), because that would make the situation even worse.

/readyz, by contrast, should require quorum and actually check the health of the cluster. An etcd instance isn't ready to receive traffic until the cluster is healthy. It's similar to (or syntactic sugar for) /health?serializable=false, or plain /health.
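
To make the split above concrete, here is a minimal sketch of the two endpoints, assuming a hypothetical memberState snapshot. The type, field names, and handlers are illustrative only and are not etcd's implementation: /livez looks only at the local member, while /readyz also requires quorum.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// memberState is a hypothetical snapshot of what a probe would inspect.
// Neither the type nor its fields exist in etcd.
type memberState struct {
	localHealthy bool // the local process can answer a serializable (local) read
	hasQuorum    bool // the cluster currently has a leader and quorum
}

// newProbeMux wires /livez to the local-only check and /readyz to the
// quorum-requiring check.
func newProbeMux(state func() memberState) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/livez", func(w http.ResponseWriter, r *http.Request) {
		writeProbe(w, state().localHealthy)
	})
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		s := state()
		writeProbe(w, s.localHealthy && s.hasQuorum)
	})
	return mux
}

// writeProbe answers 200 "ok" on success and 503 otherwise.
func writeProbe(w http.ResponseWriter, ok bool) {
	if !ok {
		http.Error(w, "not ok", http.StatusServiceUnavailable)
		return
	}
	fmt.Fprintln(w, "ok")
}

// probeStatus issues an in-memory GET against the mux and returns the status code.
func probeStatus(mux *http.ServeMux, path string) int {
	rec := httptest.NewRecorder()
	mux.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	// Quorum is lost, but the local member itself is fine.
	mux := newProbeMux(func() memberState {
		return memberState{localHealthy: true, hasQuorum: false}
	})
	fmt.Println("/livez:", probeStatus(mux, "/livez"))
	fmt.Println("/readyz:", probeStatus(mux, "/readyz"))
}
```

In this scenario the member stays live (no restart) but is reported as not ready, which is the behavior argued for above when quorum is lost.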

ahrtr · 2023-06-06T01:14:28Z

cc @neolit123

neolit123 · 2023-06-06T06:02:39Z

+1

serathius · 2023-06-06T07:15:13Z

I don't want to rush into adding livez/readyz probes. The main problem with the existing health probe is that we added it just to have one, without proper consideration.

I want livez to properly reflect the fact that etcd needs a restart, for example when etcd is stuck on stalled storage: https://docs.google.com/document/d/1U9hAcZQp3Y36q_JFiw2VBJXVAo2dK2a-8Rsbqv3GgDo/edit?usp=sharing.

Readyz should properly reflect the fact that etcd is ready to serve traffic. I don't think alarms matter here. An alarm is a degradation, but it doesn't mean we shouldn't serve reads.

TL;DR: I would like to have a design written that does a proper analysis of etcd failure modes and proposes matching probes to detect them. Example: kubernetes-sigs/metrics-server#542

wenjiaswe · 2023-07-06T01:09:38Z

Thanks for bringing this up @logicalhan.

I will continue working on this.

ahrtr · 2023-07-24T15:51:45Z

Link to #15440

chaochn47 · 2023-09-19T19:27:52Z

I reached out to @wenjiaswe to collaborate on the latest version of the design doc, etcd livez and readyz. The updates address the comments and feedback raised in this issue and in the PoC #16008.

/cc @dims

wenjiaswe · 2023-09-19T20:32:35Z

Thank you @chaochn47, could you please use a google doc so we could comment? Thanks!

wenjiaswe · 2023-09-19T20:33:43Z

cc @marukozh who is also working on this.

chaochn47 · 2023-09-19T21:42:04Z

> Thank you @chaochn47, could you please use a google doc so we could comment? Thanks!

Done. Anyone in etcd-dev@googlegroups.com should have access to the doc, etcd livez and readyz, and can comment on it.

Github issue: etcd-io#16007 (comment)
Design doc: https://docs.google.com/document/d/1PaUAp76j1X92h3jZF47m32oVlR8Y-p-arB5XOB7Nb6U/edit?usp=sharing

ahrtr · 2023-09-29T10:12:02Z

The discussion is scattered across several places, so I'm raising my comments under this ticket.

liveness probe

A node is live when both of the following are satisfied:

It can serve serializable requests. Note, however, that a node in the middle of defragmentation can't serve any requests; in that case we should NOT treat the node as not live because of this rule. See "gRPC health server sets serving status to NOT_SERVING on defrag" (#16278).
The raft loop isn't blocked. There is one exception: in a one-node cluster, the raft loop will indeed be blocked if there are no client requests. One possible way to resolve this is to intentionally trigger events periodically for the liveness check.
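
The two liveness rules above can be sketched as a single decision function. This is only an illustration under stated assumptions: livenessInput, its fields, and the maxStall threshold are hypothetical and are not etcd APIs. In particular, how the time since the last raft loop iteration would be measured is left open, and the sketch assumes the periodic event suggested above keeps it fresh in a one-node cluster.

```go
package main

import (
	"fmt"
	"time"
)

// maxStall is an assumed threshold for how long the raft loop may go
// without making progress before the member is considered stuck.
const maxStall = 5 * time.Second

// livenessInput is a hypothetical snapshot; none of these fields are etcd APIs.
type livenessInput struct {
	defragInProgress   bool          // a defragmentation is currently running
	serializableReadOK bool          // a local (serializable) read succeeded
	sinceLastRaftTick  time.Duration // time since the raft loop last iterated
}

// isLive applies the two rules: serializable reads must work unless a defrag
// is running, and the raft loop must not be stalled beyond maxStall.
func isLive(in livenessInput) (bool, string) {
	// Rule 1: a failed read only counts when no defrag is in progress.
	if !in.defragInProgress && !in.serializableReadOK {
		return false, "serializable read failed"
	}
	// Rule 2: the raft loop must have iterated recently. In a one-node
	// cluster this assumes a periodic event keeps the loop ticking.
	if in.sinceLastRaftTick > maxStall {
		return false, "raft loop appears blocked"
	}
	return true, "ok"
}

func main() {
	ok, reason := isLive(livenessInput{defragInProgress: true, sinceLastRaftTick: time.Second})
	fmt.Println("during defrag:", ok, reason)

	ok, reason = isLive(livenessInput{serializableReadOK: true, sinceLastRaftTick: time.Minute})
	fmt.Println("stalled raft loop:", ok, reason)
}
```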

readiness probe

Basically, it shares the same logic as the existing health check (see below), with two open points:

A node should be considered ready even if the NOSPACE alarm is activated.
Should we differentiate between the local member being ready and the cluster being ready?

etcd/server/etcdserver/api/etcdhttp/health.go

Lines 47 to 57 in 0a3dc1a

    
    func HandleHealth(lg *zap.Logger, mux *http.ServeMux, srv ServerHealth) {
    	mux.Handle(PathHealth, NewHealthHandler(lg, func(excludedAlarms AlarmSet, serializable bool) Health {
    		if h := checkAlarms(lg, srv, excludedAlarms); h.Health != "true" {
    			return h
    		}
    		if h := checkLeader(lg, srv, serializable); h.Health != "true" {
    			return h
    		}
    		return checkAPI(lg, srv, serializable)
    	}))
    }
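
As a sketch of the readiness proposal, the same three steps (alarms, leader, API) can be kept while excluding NOSPACE from the alarms that block readiness. The helper names and the boolean inputs below are hypothetical stand-ins for the real checks and are not etcd code; only the alarm names NOSPACE and CORRUPT come from etcd.

```go
package main

import "fmt"

// alarmSet is a hypothetical set of alarm names to ignore.
type alarmSet map[string]struct{}

// blockingAlarms returns the active alarms that are not excluded.
func blockingAlarms(active []string, excluded alarmSet) []string {
	var blocking []string
	for _, a := range active {
		if _, skip := excluded[a]; skip {
			continue
		}
		blocking = append(blocking, a)
	}
	return blocking
}

// isReady mirrors the order of the existing health check (alarms, then
// leader, then API), but treats NOSPACE as non-blocking, as proposed above.
// hasLeader and apiOK stand in for the real leader and API checks.
func isReady(activeAlarms []string, hasLeader, apiOK bool) bool {
	excluded := alarmSet{"NOSPACE": {}}
	if len(blockingAlarms(activeAlarms, excluded)) > 0 {
		return false
	}
	return hasLeader && apiOK
}

func main() {
	fmt.Println("NOSPACE only:", isReady([]string{"NOSPACE"}, true, true))
	fmt.Println("CORRUPT:", isReady([]string{"CORRUPT"}, true, true))
	fmt.Println("no leader:", isReady(nil, false, true))
}
```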

Compatibility

Do not break the existing /health endpoint!

serathius · 2023-09-29T10:25:03Z

Please leave your comments on the document https://docs.google.com/document/d/1PaUAp76j1X92h3jZF47m32oVlR8Y-p-arB5XOB7Nb6U/edit?usp=sharing

siyuanfoundation · 2023-10-02T19:32:15Z

Created a k/k issue to track this: kubernetes/kubernetes#120970

siyuanfoundation · 2023-10-02T22:08:28Z

Tracking work

add livez/readyz endpoint with basic structure for checkers
add checker for defrag
add checker for readIndex
add checker for local file read
raise an issue and fix the existing health probe not checking defrag
raise an issue and fix the existing health probe not respecting the context from the HTTP request
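
The first tracking item, a "basic structure for checkers", could look roughly like the registry below. This is a hedged sketch: the registry type, the check names, and the "[+]"/"[-]" report format (loosely modeled on Kubernetes-style verbose health output) are assumptions for illustration, not the structure that was merged.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"strings"
)

// checkFunc is a single named health check; nil means the check passed.
type checkFunc func() error

// registry holds the named checks behind one endpoint (e.g. /readyz).
type registry struct {
	checks map[string]checkFunc
}

// register adds or replaces a named check.
func (r *registry) register(name string, fn checkFunc) {
	if r.checks == nil {
		r.checks = make(map[string]checkFunc)
	}
	r.checks[name] = fn
}

// passing and failing are small helpers for building example checks.
func passing() checkFunc           { return func() error { return nil } }
func failing(msg string) checkFunc { return func() error { return errors.New(msg) } }

// run executes every check in name order and returns the overall result
// plus a per-check report.
func (r *registry) run() (bool, string) {
	names := make([]string, 0, len(r.checks))
	for n := range r.checks {
		names = append(names, n)
	}
	sort.Strings(names)

	ok := true
	var b strings.Builder
	for _, n := range names {
		if err := r.checks[n](); err != nil {
			ok = false
			fmt.Fprintf(&b, "[-]%s failed: %v\n", n, err)
			continue
		}
		fmt.Fprintf(&b, "[+]%s ok\n", n)
	}
	return ok, b.String()
}

func main() {
	var readyz registry
	readyz.register("defrag", passing())
	readyz.register("read_index", failing("timeout"))

	ok, report := readyz.run()
	fmt.Print(report)
	fmt.Println("ready:", ok)
}
```

Keeping each check behind a name makes it straightforward to add the defrag, readIndex, and local file read checkers listed above one at a time.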

scuzhanglei · 2024-04-30T08:49:25Z

Is there any plan to add an endpoint live command to etcdctl?
My situation: I run etcd in a Docker Compose container, and Docker Compose's healthcheck command runs inside the container. etcd's base image doesn't contain curl, so I can't use curl localhost:2379/livez directly to check it. An etcdctl endpoint live command would be useful.
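
Until such a command exists, one possible workaround for images without curl is a tiny statically linked probe binary copied into the container. The sketch below assumes a plain-HTTP client endpoint; a TLS-enabled cluster would need client configuration that is not shown. The probe and demoServer functions are hypothetical, and the demo server exists only so the example runs without a live etcd.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"os"
	"time"
)

// defaultTimeout bounds the whole probe request (an assumed value).
const defaultTimeout = 2 * time.Second

// probe returns nil only when the endpoint answers with HTTP 200.
func probe(url string, timeout time.Duration) error {
	client := &http.Client{Timeout: timeout}
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %d from %s", resp.StatusCode, url)
	}
	return nil
}

// demoServer starts an in-process HTTP server that always answers with the
// given status, standing in for etcd in this example.
func demoServer(status int) (url string, stop func()) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(status)
	}))
	return srv.URL, srv.Close
}

func main() {
	// In a real container the URL would be something like
	// http://localhost:2379/livez instead of the demo server.
	url, stop := demoServer(http.StatusOK)
	err := probe(url+"/livez", defaultTimeout)
	stop()
	if err != nil {
		fmt.Fprintln(os.Stderr, "unhealthy:", err)
		os.Exit(1)
	}
	fmt.Println("healthy")
}
```

The non-zero exit code on failure is what a container healthcheck command would act on.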

serathius · 2024-04-30T09:47:44Z

etcdctl uses gRPC only, so we would need to make an equivalent of /livez and /readyz in gRPC.

tjungblu · 2024-04-30T11:15:57Z

@scuzhanglei I think we have that filed under #16276; it just needs a command line in etcdctl.

Since we're bumping this thread already, is there anything left to pick up? I've just "saved" one PR, #16959 from @siyuanfoundation, from being reaped as stale, and I think #16858 is also going to fall prey to the evil bot soon.

siyuanfoundation · 2024-04-30T20:20:26Z

> is there any plan to add a endpoint live command to etcdctl. my situation is I run etcd in a docker compose container, docker compose's healthcheck command is running in the container, but etcd's base image doesn't contain curl, so can't use curl localhost:2379/livez directly to check it, if there is a etcdctl endpoint live command would be useful.

I tried to add the commands before, in 293f087#diff-ab6fb0684315e16355f6ebe0f4b3cf860b9b2ff5a0fe1b4e4308a680b19f1b0c. Currently I don't have time to rebase it onto the most recent implementation of livez/readyz.

Hope someone can pick it up.

henrybear327 · 2024-05-02T15:55:32Z

> > is there any plan to add a endpoint live command to etcdctl. my situation is I run etcd in a docker compose container, docker compose's healthcheck command is running in the container, but etcd's base image doesn't contain curl, so can't use curl localhost:2379/livez directly to check it, if there is a etcdctl endpoint live command would be useful.
>
> I have tried to add the commands before 293f087#diff-ab6fb0684315e16355f6ebe0f4b3cf860b9b2ff5a0fe1b4e4308a680b19f1b0c. Currently I don't have time to rebase it to the most recent implementation of livez/readyz.
>
> Hope someone can pick it up.

Hey @siyuanfoundation, I will pick this issue up!

logicalhan added the type/feature label Jun 5, 2023

wenjiaswe mentioned this issue Jul 7, 2023

Asking for approval to regain maintainer status #16197

Merged

serathius mentioned this issue Jul 24, 2023

Introduce grpc health check in etcd client #16276

Open

ahrtr added this to the etcd-v3.6 milestone Jul 26, 2023

chaochn47 mentioned this issue Jul 28, 2023

disk write failed and network partitioned leader was not able to step down to follower #13527

Open

ahrtr added the priority/important-soon label Sep 29, 2023

ahrtr mentioned this issue Sep 29, 2023

Add livez and readyz for etcd #16651

Merged

siyuanfoundation mentioned this issue Oct 2, 2023

Split etcd /health endpoint to /livez and /readyz kubernetes/kubernetes#120970

Closed

siyuanfoundation mentioned this issue Oct 5, 2023

Optimize RangeRequest Revision count #16510

Open

This was referenced Oct 6, 2023

add existing http health check handler e2e test #16698

Merged

http health check bug fixes #16697

Merged

ahrtr mentioned this issue Oct 8, 2023

Add method (*EtcdServer) IsRaftLoopBlocked to support checking whether the raft loop is blocked #16710

Draft

This was referenced Oct 17, 2023

add readIndex check in readyz #16792

Draft

etcdserver: add metric counters for livez/readyz health checks. #16797

Merged

siyuanfoundation mentioned this issue Oct 31, 2023

etcdserver: make livez return ok when defrag is active. #16858

Open

siyuanfoundation mentioned this issue Nov 27, 2023

[3.5] Backport livez/readyz #17039

Merged

siyuanfoundation mentioned this issue Dec 8, 2023

[3.5] Backport e2e tests for livez/readyz. #17083

Merged

siyuanfoundation mentioned this issue Jan 4, 2024

[concept] add livez/readyz for etcd #16008

Closed

ahrtr mentioned this issue Feb 22, 2024

All leases are revoked when the etcd leader is stuck in handling raft Ready due to slow fdatasync or high CPU. #15247

Closed

siyuanfoundation mentioned this issue Apr 1, 2024

Change etcd liveness probes to the new livez and readyz endpoints kubernetes/kubeadm#3039

Closed

henrybear327 mentioned this issue May 2, 2024

Make equivalent of /livez and /readyz in gRPC and etcdctl command #17925

Open

BobVanB mentioned this issue Jun 14, 2024

LivenessProbes not working bitnami/charts#26398

Closed
