Switch default load balancer policy from "pick_first" to "round_robin" #1318

Closed
jcferretti opened this issue Feb 15, 2024 · 0 comments · Fixed by #1319

jcferretti (Contributor) commented Feb 15, 2024

Is your feature request related to a problem? Please describe.

The (default) "pick_first" load balancer policy is "sticky": it tries to send RPCs to the same endpoint as long as that endpoint is connected.

https://github.com/grpc/grpc/blob/master/doc/load-balancing.md#pick_first.

"pick_first" has some desirable properties when things are working, eg, you can pick an order for the endpoints where you privilege a "closer" host, say if the client happens to be running in a machine that also runs one of the etcd servers. However, "pick_first" is problematic in some failure scenarios. Consider a client that is talking to an etcd server that is (becomes) partitioned from the master, and makes an etcd request that requires a master. Eg, a write (put), linearized read (default get without the serialized option set) or watcher with the "required leader" option. In this scenario the RPC will fail with gRPC Status UNAVAILABLE. The issue is, with "pick_first" any retries will be routed to the same etcd server, which most likely is still partitioned. A better strategy is to use "round_robin", which will try to use the next endpoint for the retry. The etcd go client already uses "round_robin".

https://github.com/etcd-io/etcd/blob/840d4869234a94e7ec7b669cc7e9bcb79606bab2/client/v3/internal/resolver/resolver.go#L44
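
To make the failure mode concrete, here is a minimal sketch in Java of the kind of retry loop a caller might write around a leader-requiring RPC. The helper class and the Supplier-based call are hypothetical (not part of this client's API); Status and StatusRuntimeException are standard grpc-java classes. Under "pick_first" every attempt below lands on the same endpoint, so the loop rarely helps:

```java
import io.grpc.Status;
import io.grpc.StatusRuntimeException;
import java.util.function.Supplier;

final class UnavailableRetry {
    // Hypothetical helper: retry a call a few times when it fails with UNAVAILABLE.
    // With "pick_first" each retry is routed to the same (likely still partitioned)
    // endpoint; with "round_robin" the next attempt moves to another endpoint.
    static <T> T retryOnUnavailable(Supplier<T> call, int maxAttempts) {
        StatusRuntimeException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (StatusRuntimeException e) {
                if (e.getStatus().getCode() != Status.Code.UNAVAILABLE) {
                    throw e; // not a routing problem, don't retry blindly
                }
                last = e;
            }
        }
        throw last; // all attempts hit UNAVAILABLE
    }
}
```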

Some of the rationale is described in the documentation for the client balancer. Note that that documentation is written in a way that suggests the etcd client uses a custom load balancer; that does not appear to be true in the current etcd go client sources, which seem to use the stock "round_robin" gRPC-go load balancing policy:

https://etcd.io/docs/v3.6/learning/design-client/#clientv3-grpc123-balancer-overview

Describe the solution you'd like
Use "round_robin" as the default load balancing policy.

Describe alternatives you've considered
A custom load balancer that is "sticky" ("stable" may be a better word), sending to a single working endpoint like "pick_first" does, but switching to another already-connected endpoint when an UNAVAILABLE error occurs, would be desirable. However, such a custom load balancer would be additional code to write and maintain, and it would interface with gRPC APIs that change frequently between versions: many of the features around name resolvers and load balancing in gRPC are still marked experimental. Now is probably not a good time to write custom load balancers.

Additional context
Another characteristic of "pick_first" is that when a subchannel (TCP connection) fails, the other subchannels are not already connected, so failing over implies making a new connection and takes longer than with "round_robin" (*). In some circumstances, e.g. trying to reach the IP of a machine or kubernetes pod that is down, a new connection attempt can take a long time to fail. If the machine is up but the port is not open, the kernel on the server machine answers the SYN (with a reset) and the TCP connection attempt fails immediately; if the machine is not up, however, TCP SYN retries continue for about 2 minutes (with Linux defaults) before the connection attempt fails. In this scenario "round_robin" is also better in that it does its best to keep connections established to all alternative endpoints.

See https://www.evanjones.ca/tcp-connection-timeouts.html, under the heading "Connecting to a failed process/machine" (warning: some of the descriptions of gRPC behavior on that page are out of date and/or inaccurate, because those behaviors depend on configuration parameters, e.g. waitForReady, automated gRPC retries, etc.; the description of TCP SYN retries on that page is relevant, however).

(*) This is actually slightly worse than it sounds at first: with "pick_first", if a subchannel connection drops while no RPC is being attempted, the channel is just marked IDLE; on the next RPC attempt gRPC will (1) first try to re-connect the IDLE channel, and (2) only if that fails try the next address. All of this eats into the RPC deadline budget if one is set.
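
A small illustration of the deadline-budget point, using the standard grpc-java per-call deadline (the generated stub type and variable names are hypothetical): any time spent re-connecting the IDLE subchannel before trying the next address is charged against this same deadline.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical generated stub; withDeadlineAfter is standard grpc-java
// (AbstractStub) API. With "pick_first", a re-connect of the IDLE subchannel
// plus a possible fail-over to the next address must all fit in these 2 seconds.
KVGrpc.KVBlockingStub boundedStub = kvStub.withDeadlineAfter(2, TimeUnit.SECONDS);
```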
