
gRPC health server sets serving status to NOT_SERVING on defrag #16278

Merged · 3 commits into etcd-io:main · Oct 25, 2023

Conversation

@chaochn47 (Member) commented on Jul 20, 2023

Trying to fix kubernetes/kubernetes#93280 on the etcd side, since etcd is part of the Kubernetes SIGs.

Before this change, the failure rate is around 33%:

v3_failover_test.go:101: etcd client failed to fail over, error (context deadline exceeded)
v3_failover_test.go:103: request failure rate is 33.59%, traffic volume success 172 requests, total 259 requests

After this change, the failure rate is 0%:

v3_failover_test.go:103: request failure rate is 0.00%, traffic volume success 2116 requests, total 2116 requests
2023-07-20T22:18:20.243Z [[core] [Channel #19 SubChannel #20] Subchannel picks a new address "unix:localhost:m0" to connect]
2023-07-20T22:18:20.243Z [[balancer] base.baseBalancer: handle SubConn state change: 0xc0006aad68, CONNECTING]
2023-07-20T22:18:20.243Z [[core] [Channel #19 SubChannel #20] Subchannel Connectivity change to READY]
2023-07-20T22:18:20.243Z [[balancer] base.baseBalancer: handle SubConn state change: 0xc0006aad68, READY]
2023-07-20T22:18:20.243Z [[roundrobin] roundrobinPicker: Build called with info: {map[0xc0006aad68:{{
          "Addr": "unix:localhost:m0",
          "ServerName": "localhost",
          "Attributes": null,
          "BalancerAttributes": null,
          "Type": 0,
          "Metadata": null
        }} 0xc0006aadf8:{{
          "Addr": "unix:localhost:m2",
          "ServerName": "localhost",
          "Attributes": null,
          "BalancerAttributes": null,
          "Type": 0,
          "Metadata": null
        }}]}]


2023-07-20T22:18:20.743Z INFO	m0	starting defragment	{"member": "m0"}
2023-07-20T22:18:20.743Z [[core] [Channel #19 SubChannel #20] Subchannel Connectivity change to TRANSIENT_FAILURE]
2023-07-20T22:18:20.743Z [[balancer] base.baseBalancer: handle SubConn state change: 0xc0006aad68, TRANSIENT_FAILURE]
2023-07-20T22:18:20.743Z [[roundrobin] roundrobinPicker: Build called with info: {map[0xc0006aadb0:{{
          "Addr": "unix:localhost:m1",
          "ServerName": "localhost",
          "Attributes": null,
          "BalancerAttributes": null,
          "Type": 0,
          "Metadata": null
        }} 0xc0006aadf8:{{
          "Addr": "unix:localhost:m2",
          "ServerName": "localhost",
          "Attributes": null,
          "BalancerAttributes": null,
          "Type": 0,
          "Metadata": null
        }}]}]

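For context, the mechanism behind this change is the standard gRPC health service: the server reports NOT_SERVING while defragmentation runs, so health-checking clients drop that endpoint from their picker and fail over to other members. A minimal sketch in plain gRPC-Go terms (illustrative only; the actual change is in server/etcdserver/api/v3rpc/health.go and may be structured differently):

```go
package healthsketch

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// Register the standard gRPC health service on the gRPC server and start out
// as SERVING. The empty service name ("") is the one a client's
// healthCheckConfig {"serviceName": ""} matches against.
func newServerWithHealth() (*grpc.Server, *health.Server) {
	srv := grpc.NewServer()
	hs := health.NewServer()
	hs.SetServingStatus("", healthpb.HealthCheckResponse_SERVING)
	healthpb.RegisterHealthServer(srv, hs)
	return srv, hs
}

// Around defragmentation, flip the status so health-checking clients drop this
// endpoint and fail over to other members, then restore SERVING afterwards.
func defragment(hs *health.Server, doDefrag func() error) error {
	hs.SetServingStatus("", healthpb.HealthCheckResponse_NOT_SERVING)
	defer hs.SetServingStatus("", healthpb.HealthCheckResponse_SERVING)
	return doDefrag()
}
```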

@ahrtr (Member) commented on Jul 26, 2023

Please see #16276 (comment)

@chaochn47 force-pushed the grpc_health_check_poc branch 4 times, most recently from 14b7905 to 3d4aa56 on September 21, 2023 17:28
@chaochn47 (Member, Author) commented

Hi @ahrtr @serathius @jmhbnz @fuweid, could you please take a look at this approach and share your thoughts on it?

  1. making defrag not block quorum reads
  2. making the etcd client's load balancer switch to a follower if the leader is doing defrag (I believe today we only have a round-robin LB, am I correct?)

I was inspired by what kubernetes/kubernetes#93280 (comment) proposed in the k8s issue.

It will help eliminate the impact of online defrag.

/cc @lavacat @dims

@chaochn47 changed the title from "introduce grpc health check in etcd client poc with fail over on defrag test" to "introduce grpc health check in etcd client poc with fail over on defrag" on Sep 21, 2023
@chaochn47 changed the title from "introduce grpc health check in etcd client poc with fail over on defrag" to "gRPC health server sets serving status to NOT_SERVING on defrag" on Sep 21, 2023
@lavacat commented on Sep 26, 2023

Please see #16276 (comment)

@ahrtr are you referring to Livez/Readyz, the 3.7 defrag plans, or the experimental dependency?

  • Livez/Readyz: It will be a larger change and IMO will build on this implementation. Also, the current design doesn't address client-side traffic failover during defrag.
  • 3.7 defrag: not clear if it will be backported to 3.5, and it will take longer to implement.
  • experimental dependency: I don't see this PR adding new dependencies. It does add "healthCheckConfig": {"serviceName": ""}, deepening the current dependency (it adds a new stream behind the scenes; see the client-side sketch below).

IMO this PR is a bugfix for an existing problem with defrag. Fundamentally, if we don't want to use https://pkg.go.dev/google.golang.org/grpc/health, we shouldn't merge this; otherwise I think it's a very useful fix.
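For reference, the client side of that dependency looks roughly like the following: grpc-go's per-connection health checking is enabled through the service config, which is what the "healthCheckConfig": {"serviceName": ""} entry refers to. A minimal sketch, assuming an application opts in through clientv3's DialOptions (the exact wiring used by this PR's tests may differ):

```go
package clientsketch

import (
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"google.golang.org/grpc"
	// Blank import registers grpc-go's client-side health checking function.
	_ "google.golang.org/grpc/health"
)

// Opt the client in to health-check based failover: the round_robin picker
// skips subchannels whose health service reports NOT_SERVING, e.g. a member
// that is currently defragmenting.
func newClient(endpoints []string) (*clientv3.Client, error) {
	return clientv3.New(clientv3.Config{
		Endpoints:   endpoints,
		DialTimeout: 5 * time.Second,
		DialOptions: []grpc.DialOption{
			grpc.WithDefaultServiceConfig(
				`{"loadBalancingConfig": [{"round_robin": {}}], "healthCheckConfig": {"serviceName": ""}}`),
		},
	})
}
```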

Review comments on tests/e2e/failover_test.go (outdated, resolved)
@ahrtr mentioned this pull request on Sep 29, 2023
@fuweid (Member) commented on Sep 30, 2023

@chaochn47 For the change, it looks good to me. It's helpful to the kube-apiserver, which has multiple endpoints instead of one service; the kube-apiserver can forward the requests to available servers. What do you think about introducing the flag to enable this feature? It might bring heavy traffic to other servers.

@chaochn47 (Member, Author) commented on Oct 4, 2023

Thanks @fuweid for the feedback!

What do you think about introducing the flag to enable this feature? It might bring heavy traffic to other servers.

Yeah, the heuristic is that defrag won't take too long. Once it's completed, new requests will be routed to this server again; watches will break and have a 1/3 chance of being established on this server.

There is no built-in capacity-aware load balancer in gRPC-go. I will explore it and explain it in the design doc after livez/readyz.

As for disabling this feature with a flag in the etcd client: as I explained in the #16278 (comment), we could instead add a flag in the application that consumes the etcd client.

Review comments on server/etcdserver/api/v3rpc/health.go and tests/e2e/failover_test.go (resolved)
@ahrtr (Member) commented on Oct 10, 2023

Please also rebase the PR.

@chaochn47 force-pushed the grpc_health_check_poc branch 3 times, most recently from d1544fd to 3397052 on October 12, 2023 19:23
@chaochn47 (Member, Author) commented on Oct 12, 2023

Please also rebase the PR.

Done. /cc @ahrtr


IMHO, the PR itself is good to merge and safe to cherry-pick to the release branch, since it's recommended to defrag one member at a time and client-side failover is disabled by default. Some users may configure gRPC probes in their livez setup; we can just call out that they should consider moving to the HTTP livez probe when upgrading.

One of the remaining risks to evaluate before expanding this to other readyz checks is that the client should fail open (at least keep one connection in the picker if all the etcd health servers report NOT_SERVING).

@ahrtr (Member) commented on Oct 13, 2023

I see that the e2e test case is very similar to the integration test case; is it possible to write a common test case?

@ahrtr (Member) commented on Oct 13, 2023

Another big question is: should we add an experimental flag, something like "--experimental-stop-grpc-service-on-defragmentation"? I think the answer should be YES. Any new feature might also bring new problems; at least we add an option for users to disable it.

@chaochn47 (Member, Author) commented

I see that the e2e test case is very similar to the integration test case; is it possible to write a common test case?

Enabling failpoints in both e2e and integration tests has not yet been built into the common test framework. This can be a follow-up in a separate PR.

Another big question is: should we add an experimental flag, something like "--experimental-stop-grpc-service-on-defragmentation"? I think the answer should be YES. Any new feature might also bring new problems; at least we add an option for users to disable it.

Okay.

@chaochn47 (Member, Author) commented

Thanks @serathius for the review. I have added aggressive test thresholds for the happy and negative test cases.

@chaochn47 (Member, Author) commented

The e2e test failed on the 386 arch because gofail is not enabled in that CI job. The test should be skipped after rebasing on top of the changes #16698 introduces.

2023-10-18T19:24:18.1851440Z     failover_test.go:145: request failure rate is 0.00%, traffic volume successfulRequestCount 15786 requests, total 15786 requests
2023-10-18T19:24:18.1852640Z     failover_test.go:169: 
2023-10-18T19:24:18.1853387Z         	Error Trace:	/__w/etcd/etcd/tests/e2e/failover_test.go:169
2023-10-18T19:24:18.1854532Z         	            				/__w/etcd/etcd/tests/e2e/failover_test.go:146
2023-10-18T19:24:18.1855364Z         	Error:      	"0" is not greater than or equal to "90"
2023-10-18T19:24:18.1857108Z         	Test:       	TestFailoverOnDefrag/defrag_blocks_one-third_of_requests_with_stopGRPCServiceOnDefrag_set_to_true_and_client_health_check_disabled

ref. https://github.com/etcd-io/etcd/actions/runs/6565503688/job/17834165379?pr=16278

@chaochn47 (Member, Author) commented

Kindly ping @serathius @wenjiaswe @ptabor @mitake @spzala @jmhbnz ~

@chaochn47 (Member, Author) commented on Oct 24, 2023

Hi @ahrtr @fuweid

What's the suggested process to move this PR forward? I think all the review comments have been addressed.

Edit: found one reference

Poke reviewer if needed
Reviewers are responsive in a timely fashion, but considering everyone is busy, give them some time after requesting review if quick response is not provided. If response is not provided in 10 days, feel free to contact them via adding a comment in the PR or sending an email or message on the Slack.

https://etcd.io/docs/v3.5/triage/prs/#poke-reviewer-if-needed

@ahrtr (Member) commented on Oct 24, 2023

This is a nice-to-have feature; it can prevent the client from being blocked on a member that is in the middle of defragmentation.

We also have the flag --experimental-stop-grpc-service-on-defrag (defaults to false). There is no impact on users if they don't intentionally enable it, so it should be safe.

I'll leave it to other maintainers to take a second look.

Thanks @chaochn47
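
To illustrate how an opt-in flag like --experimental-stop-grpc-service-on-defrag keeps the default behavior unchanged, here is a hypothetical sketch that extends the earlier one; the field and method names are made up for illustration and are not the identifiers used in this PR:

```go
package flagsketch

import (
	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// maintenanceServer is a stand-in for the server-side handler; the flag field
// would be populated from --experimental-stop-grpc-service-on-defrag.
type maintenanceServer struct {
	stopGRPCServiceOnDefrag bool
	healthServer            *health.Server
}

// Only touch the health status when the flag is enabled, so deployments that
// don't opt in see no change in behavior during defragmentation.
func (ms *maintenanceServer) defragment(doDefrag func() error) error {
	if ms.stopGRPCServiceOnDefrag {
		ms.healthServer.SetServingStatus("", healthpb.HealthCheckResponse_NOT_SERVING)
		defer ms.healthServer.SetServingStatus("", healthpb.HealthCheckResponse_SERVING)
	}
	return doDefrag()
}
```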

Signed-off-by: Chao Chen <chaochn@amazon.com>
Signed-off-by: Chao Chen <chaochn@amazon.com>
Signed-off-by: Chao Chen <chaochn@amazon.com>
@serathius (Member) commented

I think this is good enough to merge; still, please consider implementing the follow-ups.

@serathius merged commit 4d77fd1 into etcd-io:main on Oct 25, 2023
27 checks passed
@chaochn47 deleted the grpc_health_check_poc branch on October 25, 2023 19:41
tjungblu added a commit to tjungblu/etcd that referenced this pull request Apr 30, 2024
gRPC health server sets serving status to NOT_SERVING on defrag
Backport from 3.6 in etcd-io#16278

Co-authored-by: Chao Chen <chaochn@amazon.com>
Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>
tjungblu added a commit to tjungblu/etcd that referenced this pull request Apr 30, 2024
gRPC health server sets serving status to NOT_SERVING on defrag
Backport from 3.6 in etcd-io#16278

Co-authored-by: Chao Chen <chaochn@amazon.com>
Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>
Successfully merging this pull request may close these issues.

Defrag on etcd node causing API 500s