
gRPC health server sets serving status to NOT_SERVING on defrag #16278

Merged · 3 commits into etcd-io:main · Oct 25, 2023

Conversation

@chaochn47 (Member) commented on Jul 20, 2023

Trying to fix kubernetes/kubernetes#93280 on the etcd side, since etcd is part of the Kubernetes SIGs.

Before this change, the failure rate is around 33%:

v3_failover_test.go:101: etcd client failed to fail over, error (context deadline exceeded)
v3_failover_test.go:103: request failure rate is 33.59%, traffic volume success 172 requests, total 259 requests

After this change, the failure rate is 0%:

v3_failover_test.go:103: request failure rate is 0.00%, traffic volume success 2116 requests, total 2116 requests
2023-07-20T22:18:20.243Z [[core] [Channel #19 SubChannel #20] Subchannel picks a new address "unix:localhost:m0" to connect]
2023-07-20T22:18:20.243Z [[balancer] base.baseBalancer: handle SubConn state change: 0xc0006aad68, CONNECTING]
2023-07-20T22:18:20.243Z [[core] [Channel #19 SubChannel #20] Subchannel Connectivity change to READY]
2023-07-20T22:18:20.243Z [[balancer] base.baseBalancer: handle SubConn state change: 0xc0006aad68, READY]
2023-07-20T22:18:20.243Z [[roundrobin] roundrobinPicker: Build called with info: {map[0xc0006aad68:{{
          "Addr": "unix:localhost:m0",
          "ServerName": "localhost",
          "Attributes": null,
          "BalancerAttributes": null,
          "Type": 0,
          "Metadata": null
        }} 0xc0006aadf8:{{
          "Addr": "unix:localhost:m2",
          "ServerName": "localhost",
          "Attributes": null,
          "BalancerAttributes": null,
          "Type": 0,
          "Metadata": null
        }}]}]


2023-07-20T22:18:20.743Z INFO	m0	starting defragment	{"member": "m0"}
2023-07-20T22:18:20.743Z [[core] [Channel #19 SubChannel #20] Subchannel Connectivity change to TRANSIENT_FAILURE]
2023-07-20T22:18:20.743Z [[balancer] base.baseBalancer: handle SubConn state change: 0xc0006aad68, TRANSIENT_FAILURE]
2023-07-20T22:18:20.743Z [[roundrobin] roundrobinPicker: Build called with info: {map[0xc0006aadb0:{{
          "Addr": "unix:localhost:m1",
          "ServerName": "localhost",
          "Attributes": null,
          "BalancerAttributes": null,
          "Type": 0,
          "Metadata": null
        }} 0xc0006aadf8:{{
          "Addr": "unix:localhost:m2",
          "ServerName": "localhost",
          "Attributes": null,
          "BalancerAttributes": null,
          "Type": 0,
          "Metadata": null
        }}]}]

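For context, the mechanism behind this change is the standard gRPC health service: the server reports NOT_SERVING while defragmentation runs, so health-checking clients drop that endpoint from their picker and fail over to other members. A minimal sketch in plain gRPC-Go terms (illustrative only; the actual change is in server/etcdserver/api/v3rpc/health.go and may be structured differently):

```go
package healthsketch

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// Register the standard gRPC health service on the gRPC server and start out
// as SERVING. The empty service name ("") is the one a client's
// healthCheckConfig {"serviceName": ""} matches against.
func newServerWithHealth() (*grpc.Server, *health.Server) {
	srv := grpc.NewServer()
	hs := health.NewServer()
	hs.SetServingStatus("", healthpb.HealthCheckResponse_SERVING)
	healthpb.RegisterHealthServer(srv, hs)
	return srv, hs
}

// Around defragmentation, flip the status so health-checking clients drop this
// endpoint and fail over to other members, then restore SERVING afterwards.
func defragment(hs *health.Server, doDefrag func() error) error {
	hs.SetServingStatus("", healthpb.HealthCheckResponse_NOT_SERVING)
	defer hs.SetServingStatus("", healthpb.HealthCheckResponse_SERVING)
	return doDefrag()
}
```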

@ahrtr (Member) commented on Jul 26, 2023

Please see #16276 (comment)

@chaochn47 force-pushed the grpc_health_check_poc branch 4 times, most recently from 14b7905 to 3d4aa56 on September 21, 2023 17:28
@chaochn47 (Member, Author) commented

Hi @ahrtr @serathius @jmhbnz @fuweid, could you please take a look at this approach and share your thoughts on it?

  1. making defrag not block quorum reads
  2. making the etcd client's load balancer switch to a follower if the leader is doing defrag (I believe today we only have a round-robin LB, am I correct?)

I was inspired by what kubernetes/kubernetes#93280 (comment) proposed in the k8s issue.

It will help eliminate the impact of online defrag.

/cc @lavacat @dims

@chaochn47 changed the title from "introduce grpc health check in etcd client poc with fail over on defrag test" to "introduce grpc health check in etcd client poc with fail over on defrag" on Sep 21, 2023
@chaochn47 changed the title from "introduce grpc health check in etcd client poc with fail over on defrag" to "gRPC health server sets serving status to NOT_SERVING on defrag" on Sep 21, 2023
@lavacat commented on Sep 26, 2023

Please see #16276 (comment)

@ahrtr are you referring to Livez/Readyz, the 3.7 defrag plans, or the experimental dependency?

  • Livez/Readyz: It will be a larger change and IMO will build on this implementation. Also, the current design doesn't address client-side traffic failover during defrag.
  • 3.7 defrag: not clear if it will be backported to 3.5, and it will take longer to implement.
  • experimental dependency: I don't see this PR adding new dependencies. It does add "healthCheckConfig": {"serviceName": ""}, deepening the current dependency (it adds a new stream behind the scenes; see the client-side sketch below).

IMO this PR is a bugfix for an existing problem with defrag. Fundamentally, if we don't want to use https://pkg.go.dev/google.golang.org/grpc/health, we shouldn't merge this; otherwise I think it's a very useful fix.
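For reference, the client side of that dependency looks roughly like the following: grpc-go's per-connection health checking is enabled through the service config, which is what the "healthCheckConfig": {"serviceName": ""} entry refers to. A minimal sketch, assuming an application opts in through clientv3's DialOptions (the exact wiring used by this PR's tests may differ):

```go
package clientsketch

import (
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"google.golang.org/grpc"
	// Blank import registers grpc-go's client-side health checking function.
	_ "google.golang.org/grpc/health"
)

// Opt the client in to health-check based failover: the round_robin picker
// skips subchannels whose health service reports NOT_SERVING, e.g. a member
// that is currently defragmenting.
func newClient(endpoints []string) (*clientv3.Client, error) {
	return clientv3.New(clientv3.Config{
		Endpoints:   endpoints,
		DialTimeout: 5 * time.Second,
		DialOptions: []grpc.DialOption{
			grpc.WithDefaultServiceConfig(
				`{"loadBalancingConfig": [{"round_robin": {}}], "healthCheckConfig": {"serviceName": ""}}`),
		},
	})
}
```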

Review comments on tests/e2e/failover_test.go (outdated, resolved)
@ahrtr mentioned this pull request on Sep 29, 2023
@fuweid (Member) commented on Sep 30, 2023

@chaochn47 For the change, it looks good to me. It's helpful to the kube-apiserver, which has multiple endpoints instead of one service; the kube-apiserver can forward the requests to available servers. What do you think about introducing the flag to enable this feature? It might bring heavy traffic to other servers.

@chaochn47 (Member, Author) commented on Oct 4, 2023

Thanks @fuweid for the feedback!

What do you think about introducing the flag to enable this feature? It might bring heavy traffic to other servers.

Yeah, the heuristic is that defrag won't take too long. Once it's completed, new requests will be routed to this server again; watches will break and have a 1/3 chance of being established on this server.

There is no built-in capacity-aware load balancer in gRPC-go. I will explore it and explain it in the design doc after livez/readyz.

As for disabling this feature with a flag in the etcd client: as I explained in the #16278 (comment), we could instead add a flag in the application that consumes the etcd client.

Review comments on server/etcdserver/api/v3rpc/health.go and tests/e2e/failover_test.go (resolved)
@ahrtr (Member) commented on Oct 10, 2023

Please also rebase the PR.

@chaochn47 force-pushed the grpc_health_check_poc branch 3 times, most recently from d1544fd to 3397052 on October 12, 2023 19:23
@chaochn47 (Member, Author) commented on Oct 12, 2023

Please also rebase the PR.

Done. /cc @ahrtr


IMHO, the PR itself is good to merge and safe to cherry-pick to the release branch, since it's recommended to defrag one member at a time and client-side failover is disabled by default. Some users may configure gRPC probes in their livez setup; we can just call out that they should consider moving to the HTTP livez probe when upgrading.

One of the remaining risks to evaluate before expanding this to other readyz checks is that the client should fail open (at least keep one connection in the picker if all the etcd health servers report NOT_SERVING).

@ahrtr (Member) commented on Oct 13, 2023

I see that the e2e test case is very similar to the integration test case; is it possible to write a common test case?

@ahrtr (Member) commented on Oct 13, 2023

Another big question is: should we add an experimental flag, something like "--experimental-stop-grpc-service-on-defragmentation"? I think the answer should be YES. Any new feature might also bring new problems; at least we add an option for users to disable it.

@chaochn47 (Member, Author) commented

I see that the e2e test case is very similar to the integration test case; is it possible to write a common test case?

Enabling failpoints in both e2e and integration tests has not yet been built into the common test framework. This can be a follow-up in a separate PR.

Another big question is: should we add an experimental flag, something like "--experimental-stop-grpc-service-on-defragmentation"? I think the answer should be YES. Any new feature might also bring new problems; at least we add an option for users to disable it.

Okay.

@chaochn47 (Member, Author) commented

Thanks @serathius for the review. I have added aggressive test thresholds for the happy and negative test cases.

@chaochn47 (Member, Author) commented

The e2e test failed on the 386 arch because gofail is not enabled in that CI job. The test should be skipped after rebasing on top of the changes #16698 introduces.

2023-10-18T19:24:18.1851440Z     failover_test.go:145: request failure rate is 0.00%, traffic volume successfulRequestCount 15786 requests, total 15786 requests
2023-10-18T19:24:18.1852640Z     failover_test.go:169: 
2023-10-18T19:24:18.1853387Z         	Error Trace:	/__w/etcd/etcd/tests/e2e/failover_test.go:169
2023-10-18T19:24:18.1854532Z         	            				/__w/etcd/etcd/tests/e2e/failover_test.go:146
2023-10-18T19:24:18.1855364Z         	Error:      	"0" is not greater than or equal to "90"
2023-10-18T19:24:18.1857108Z         	Test:       	TestFailoverOnDefrag/defrag_blocks_one-third_of_requests_with_stopGRPCServiceOnDefrag_set_to_true_and_client_health_check_disabled

ref. https://github.com/etcd-io/etcd/actions/runs/6565503688/job/17834165379?pr=16278

@chaochn47 (Member, Author) commented

Kindly ping @serathius @wenjiaswe @ptabor @mitake @spzala @jmhbnz ~

@chaochn47 (Member, Author) commented on Oct 24, 2023

Hi @ahrtr @fuweid

What's the suggested process to move this PR forward? I think all the review comments have been addressed.

Edit: found one reference

Poke reviewer if needed
Reviewers are responsive in a timely fashion, but considering everyone is busy, give them some time after requesting review if quick response is not provided. If response is not provided in 10 days, feel free to contact them via adding a comment in the PR or sending an email or message on the Slack.

https://etcd.io/docs/v3.5/triage/prs/#poke-reviewer-if-needed

@ahrtr (Member) commented on Oct 24, 2023

This is a nice-to-have feature; it can prevent the client from being blocked on a member that is in the middle of defragmentation.

We also have the flag --experimental-stop-grpc-service-on-defrag (defaults to false). There is no impact on users if they don't intentionally enable it, so it should be safe.

I'll leave it to other maintainers to take a second look.

Thanks @chaochn47
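
To illustrate how an opt-in flag like --experimental-stop-grpc-service-on-defrag keeps the default behavior unchanged, here is a hypothetical sketch that extends the earlier one; the field and method names are made up for illustration and are not the identifiers used in this PR:

```go
package flagsketch

import (
	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// maintenanceServer is a stand-in for the server-side handler; the flag field
// would be populated from --experimental-stop-grpc-service-on-defrag.
type maintenanceServer struct {
	stopGRPCServiceOnDefrag bool
	healthServer            *health.Server
}

// Only touch the health status when the flag is enabled, so deployments that
// don't opt in see no change in behavior during defragmentation.
func (ms *maintenanceServer) defragment(doDefrag func() error) error {
	if ms.stopGRPCServiceOnDefrag {
		ms.healthServer.SetServingStatus("", healthpb.HealthCheckResponse_NOT_SERVING)
		defer ms.healthServer.SetServingStatus("", healthpb.HealthCheckResponse_SERVING)
	}
	return doDefrag()
}
```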

Signed-off-by: Chao Chen <chaochn@amazon.com>
Signed-off-by: Chao Chen <chaochn@amazon.com>
Signed-off-by: Chao Chen <chaochn@amazon.com>
@serathius (Member) commented

I think this is good enough to merge; still, please consider implementing the follow-ups.

@serathius merged commit 4d77fd1 into etcd-io:main on Oct 25, 2023
27 checks passed
@chaochn47 deleted the grpc_health_check_poc branch on October 25, 2023 19:41
tjungblu added a commit to tjungblu/etcd that referenced this pull request Apr 30, 2024
gRPC health server sets serving status to NOT_SERVING on defrag
Backport from 3.6 in etcd-io#16278

Co-authored-by: Chao Chen <chaochn@amazon.com>
Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>
tjungblu added a commit to tjungblu/etcd that referenced this pull request Apr 30, 2024
gRPC health server sets serving status to NOT_SERVING on defrag
Backport from 3.6 in etcd-io#16278

Co-authored-by: Chao Chen <chaochn@amazon.com>
Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>
Successfully merging this pull request may close these issues.

Defrag on etcd node causing API 500s