
rpc: circuit breaker livelock if 2nd check comes too fast after the 1st #68419

Closed
knz opened this issue Aug 4, 2021 · 4 comments
Labels
A-kv-server (Relating to the KV-level RPC server), A-server-networking (Pertains to network addressing, routing, initialization), C-bug (Code not up to spec/doc, specs & docs deemed correct; solution expected to change code/behavior), T-server-and-security (DB Server & Security)

Comments

@knz
Contributor

knz commented Aug 4, 2021

Investigated by @nvanbenschoten and @erikgrinaker.

When a network connection drops, we also trip its associated circuit breaker. For it to recover, the breaker needs to enter a "half-open" state, where it lets an occasional (once per second) request through to try to re-establish the connection. If that succeeds, the breaker is moved to closed and considered recovered. What we found is that the Raft transport was checking the circuit breaker associated with a given (destination, rpc class) pair twice in order to establish a connection:

First, before creating a new RaftTransport queue:

if !t.dialer.GetCircuitBreaker(toNodeID, class).Ready() {

Second, when the RaftTransport queue started up and dialed the destination node via

conn, err := t.dialer.Dial(ctx, toNodeID, class)

which performs this check internally:

if breaker != nil && !breaker.Ready() {

So the theory is that once a second, a request will make it through the first call to Breaker.Ready. However, when it does, it launches a new RaftTransport queue that immediately checks the breaker again. Since we haven't waited a second between the two calls to Breaker.Ready, this second call will always return false. So even in the cases where we pass the first breaker check, we always immediately fail the second. And since we never pass the second check and successfully dial, we never mark the breaker as closed. Instead, we shut down the RaftTransport queue and start over again.

This is a fascinating pathology. In some sense, breakers are not reentrant. This patch to https://github.com/cockroachdb/circuitbreaker demonstrates this:

func TestTrippableBreakerState(t *testing.T) {
	c := clock.NewMock()
	cb := NewBreaker()
	cb.Clock = c
	if !cb.Ready() {
		t.Fatal("expected breaker to be ready")
	}
	cb.Trip()
	if cb.Ready() {
		t.Fatal("expected breaker to not be ready")
	}
	c.Add(cb.nextBackOff + 1)
	if !cb.Ready() {
		t.Fatal("expected breaker to be ready after reset timeout")
	}
+	if !cb.Ready() {
+		// !!! This fails !!!
+		// There's no breaker affinity.
+		t.Fatal("expected breaker to be ready after reset timeout")
+	}
	cb.Fail(nil)
	c.Add(cb.nextBackOff + 1)
	if !cb.Ready() {
		t.Fatal("expected breaker to be ready after reset timeout, post failure")
	}
}

So any code that requires two consecutive calls to a breaker's Ready() function in order to reset the breaker is bound to be starved forever.

It's not yet clear what the best fix is for this. One solution is to expose an option from nodedialer.Dialer.Dial to skip the breaker check. Another is to do something more clever around breaker affinity.
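
As a rough illustration of the first option, here is a minimal sketch, with hypothetical names (Dialer, DialNoBreaker, breakerFor, Conn) rather than the real nodedialer types: a Dial variant that skips the Ready() gate but still reports the outcome, so the breaker's state stays accurate.

// Hedged sketch only; names and signatures are illustrative, not the actual
// nodedialer API.
package nodedialer

import (
	"context"
	"errors"
)

var errBreakerOpen = errors.New("unable to dial: breaker open")

// Conn stands in for *grpc.ClientConn to keep the sketch self-contained.
type Conn struct{}

// Breaker is the subset of the breaker API the sketch relies on.
type Breaker interface {
	Ready() bool    // when tripped, admits only the occasional trial request
	Success()       // records success, closing the breaker
	Fail(err error) // records failure, keeping the breaker tripped
}

type Dialer struct {
	breakerFor func(nodeID int32, class int) Breaker
	dialRaw    func(ctx context.Context, nodeID int32, class int) (*Conn, error)
}

// Dial gates the attempt on the breaker; a caller that already consumed a
// Ready() check (like the Raft transport queue) would livelock here.
func (d *Dialer) Dial(ctx context.Context, nodeID int32, class int) (*Conn, error) {
	br := d.breakerFor(nodeID, class)
	if !br.Ready() {
		return nil, errBreakerOpen
	}
	return d.dial(ctx, nodeID, class, br)
}

// DialNoBreaker skips the Ready() gate but still feeds the result back, so a
// successful dial resets the breaker and a failure trips it again.
func (d *Dialer) DialNoBreaker(ctx context.Context, nodeID int32, class int) (*Conn, error) {
	return d.dial(ctx, nodeID, class, d.breakerFor(nodeID, class))
}

func (d *Dialer) dial(ctx context.Context, nodeID int32, class int, br Breaker) (*Conn, error) {
	conn, err := d.dialRaw(ctx, nodeID, class)
	if err != nil {
		br.Fail(err)
		return nil, err
	}
	br.Success()
	return conn, nil
}

With something along these lines, the Raft transport would call DialNoBreaker after its own Ready() check, so the breaker is only consulted once per connection attempt.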

@knz knz added the C-bug, A-kv-server, and A-server-networking labels Aug 4, 2021
@knz knz added this to Incoming in KV via automation Aug 4, 2021
@blathers-crl blathers-crl bot added the T-kv KV Team label Aug 4, 2021
@knz knz added this to To do in DB Server & Security via automation Aug 4, 2021
@blathers-crl blathers-crl bot added the T-server-and-security DB Server & Security label Aug 4, 2021
@piyush-singh piyush-singh moved this from To do to Linked issues (from the roadmap columns on the right) in DB Server & Security Aug 9, 2021
tbg added a commit to cockroachdb/circuitbreaker that referenced this issue Aug 26, 2021
@tbg
Member

tbg commented Aug 26, 2021

I looked a little into how this should work instead. The basic idea is that instead of letting through the occasional request, a breaker should have associated with it a "watcher" goroutine. If the breaker trips, it is the watcher's job (and only the watcher's job) to try to determine when the target is available again. In the meantime, everyone else will fail-fast 100% of the time; effectively all calls to Ready() will be replaced by calls to Tripped() (which is a pure read, unlike Ready()).
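
As a minimal sketch of that idea (hand-rolled for illustration, not the existing *circuit.Breaker or any real API): the breaker owns a single watcher goroutine that probes the target once the breaker trips, while all other callers only read Tripped() and fail fast.

// Illustrative sketch only; not an existing API.
package probebreaker

import (
	"sync/atomic"
	"time"
)

// Breaker trips on failure and is reset only by its watcher goroutine, never
// by recruiting a caller's request.
type Breaker struct {
	tripped int32         // 1 while the target is considered unreachable
	probe   func() error  // cheap health check of the target
	period  time.Duration // pause between probe attempts
}

func New(probe func() error, period time.Duration) *Breaker {
	return &Breaker{probe: probe, period: period}
}

// Tripped is a pure read; callers fail fast while it returns true.
func (b *Breaker) Tripped() bool { return atomic.LoadInt32(&b.tripped) == 1 }

// Trip marks the target as down and starts the watcher, the only actor
// allowed to reset the breaker.
func (b *Breaker) Trip() {
	if atomic.CompareAndSwapInt32(&b.tripped, 0, 1) {
		go b.watch()
	}
}

func (b *Breaker) watch() {
	for b.Tripped() {
		if b.probe() == nil {
			atomic.StoreInt32(&b.tripped, 0) // target is back; reset
			return
		}
		time.Sleep(b.period) // back off before the next probe
	}
}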

In practice, the tangliness of our RPC layer raises additional questions, especially if we're also trying to fix parts of #53410.

The circuit breakers sit at the level of (NodeID, ConnectionClass) tuples in the node dialer. So right there, anything that dials addresses directly doesn't have any circuit breaker protection (not sure who exactly does this, but there's probably someone). Then again, such callers sort of do have protection: the way *rpc.Context works is that it doesn't return the connection until it has been "proven healthy" through a recent PingRequest (and while it's intermittently unhealthy, callers will fail fast). So if a node goes down, the previously healthy connection will turn unhealthy and callers will be bounced here:

// Connect returns the underlying grpc.ClientConn after it has been validated,
// or an error if dialing or validation fails.
func (c *Connection) Connect(ctx context.Context) (*grpc.ClientConn, error) {
	if c.dialErr != nil {
		return nil, c.dialErr
	}

But on certain errors the connection gets removed from rpc.Context's pool and we lose that state. I think this realistically only happens when a node gets decommissioned, in which case we also try to persist local state on all cluster nodes that makes them fail-fast outgoing connection attempts to the decommissioned node (so in a sense we already have permanent circuit breakers for this case). Anyway, when a dial creates a new Connection object, it shouldn't block there, but I'm not so sure, since it will hit this path:

cockroach/pkg/rpc/context.go

Lines 1097 to 1101 in 57ad801

conn.initOnce.Do(func() {
	// Either we kick off the heartbeat loop (and clean up when it's done),
	// or we clean up the connKey entries immediately.
	var redialChan <-chan struct{}
	conn.grpcConn, redialChan, conn.dialErr = ctx.grpcDialRaw(target, remoteNodeID, class)

and I'm not sure if grpcDialRaw blocks.

I think what we want is:

  • the circuit breakers move to rpc.Context and are scoped at an (address, class) level.
  • they are never cleared.
  • the heartbeat loop (and only the heartbeat loop) for the connection manages the state of the breaker. When a breaker is tripped but nobody inquires about the state of the breaker for a while, we stop the heartbeat loop. (It will start up again should the breaker be checked again in the future)
  • code that needs to connect somewhere can check breaker.Tripped() before dialing (not .Ready(); that whole half-open concept should be disabled in our case). But you don't even need to check anything in the common case, because the dial will check the breaker for you. (A rough sketch of this arrangement follows after this list.)
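
A rough sketch of how these pieces could fit together, with illustrative names (Context.Breaker, Peer, heartbeatLoop) that are not the real rpc.Context API; the heartbeat loop is the only actor that flips the breaker, and callers only read Tripped():

// Illustrative sketch only; rpcctx stands in for pkg/rpc.
package rpcctx

import (
	"context"
	"sync"
	"time"
)

// ConnectionClass mirrors the notion of an RPC connection class.
type ConnectionClass int

type breakerKey struct {
	addr  string
	class ConnectionClass
}

// Peer holds the per-(address, class) breaker state. It is never removed
// from the map, even after decommission or auth errors.
type Peer struct {
	mu      sync.Mutex
	tripped bool
}

// Tripped is a pure read; callers fail fast while it returns true.
func (p *Peer) Tripped() bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.tripped
}

func (p *Peer) setTripped(v bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.tripped = v
}

// Context stands in for rpc.Context in this sketch.
type Context struct {
	mu    sync.Mutex
	peers map[breakerKey]*Peer
	ping  func(ctx context.Context, addr string, class ConnectionClass) error
}

func NewContext(ping func(ctx context.Context, addr string, class ConnectionClass) error) *Context {
	return &Context{peers: map[breakerKey]*Peer{}, ping: ping}
}

// Breaker returns the (address, class) breaker, creating it and starting its
// heartbeat loop on first use.
func (c *Context) Breaker(ctx context.Context, addr string, class ConnectionClass) *Peer {
	c.mu.Lock()
	defer c.mu.Unlock()
	k := breakerKey{addr: addr, class: class}
	p, ok := c.peers[k]
	if !ok {
		p = &Peer{}
		c.peers[k] = p
		go c.heartbeatLoop(ctx, k, p)
	}
	return p
}

// heartbeatLoop is the single actor that flips the breaker: a failed ping
// trips it, a successful ping resets it. (The real version would also pause
// itself while nobody is asking about the breaker.)
func (c *Context) heartbeatLoop(ctx context.Context, k breakerKey, p *Peer) {
	t := time.NewTicker(time.Second)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			p.setTripped(c.ping(ctx, k.addr, k.class) != nil)
		}
	}
}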

The result should be that each node that "anyone" actually tries to inquire about has a watcher goroutine associated with it. Connection attempts from CRDB code will never be recruited as canary requests, and so don't eat the possibly disastrous latencies that come with that. We also collapse the different layers of circuit breaking and notions of connection health, which should significantly simplify the code. I think we are then also in a position to relatively easily address what must be the issues behind #53410 - we make sure that the initial connection attempt to a node is timeboxed to, say, 1s (not sure what the current state is), and then the managed breakers do the rest.

There is a lot of hard-earned wisdom in this tangly code, so we need to be careful.

@erikgrinaker
Contributor

I like this direction.

the heartbeat loop (and only the heartbeat loop) for the connection manages the state of the breaker. When a breaker is tripped but nobody inquires about the state of the breaker for a while, we stop the heartbeat loop. (It will start up again should the breaker be checked again in the future)

Do we actually need the breaker at all then? If we were to always keep a heartbeat loop running, presumably that could periodically try to connect to the remote node and fail any requests in the meanwhile. I suppose it might still be useful to keep the breaker as internal state, but I do like the idea of having a single actor responsible for managing the connection.

@tbg
Member

tbg commented Aug 26, 2021

I think it becomes a game of naming things. There will be something like a circuit breaker; the question is whether there will be a *circuit.Breaker. Because *circuit.Breaker is so bought into this "recruit calls to do discovery" pattern, I don't see how we can continue using it.

presumably that could periodically try to connect to the remote node and fail any requests in the meanwhile

That is how it works today, except... the code is pretty hard to understand and we are not confident how well it works; it definitely doesn't work that well for nodes for which we get an IsAuthError, because that causes all state to be removed, so the next request (if there is one) will start from scratch.

tbg added a commit to tbg/cockroach that referenced this issue Aug 26, 2021
See cockroachdb#68419.

Release justification: bug fix
Release note (bug fix): Previously, after a temporary node outage, other
nodes in the cluster could fail to connect to the restarted node due to
their circuit breakers not resetting. This would manifest in the logs
via messages "unable to dial nXX: breaker open", where `XX` is the ID
of the restarted node. (Note that such errors are expected for nodes
that are truly unreachable, and may still occur around the time of
the restart, but for no longer than a few seconds).
tbg added a commit to tbg/cockroach that referenced this issue Aug 26, 2021
See cockroachdb#68419. We now use
`DialNoBreaker` for the raft transport, taking into account the previous
`Ready()` check.

`DialNoBreaker` was previously bypassing the breaker as it ought to but
was also *not reporting to the breaker* the result of the operation;
this is not ideal and was caught by the tests. This commit changes
`DialNoBreaker` to report the result (i.e. fail or success).

Release justification: bug fix
Release note (bug fix): Previously, after a temporary node outage, other
nodes in the cluster could fail to connect to the restarted node due to
their circuit breakers not resetting. This would manifest in the logs
via messages "unable to dial nXX: breaker open", where `XX` is the ID
of the restarted node. (Note that such errors are expected for nodes
that are truly unreachable, and may still occur around the time of
the restart, but for no longer than a few seconds).
craig bot pushed a commit that referenced this issue Aug 27, 2021
69405: kvserver: remove extraneous circuit breaker check in Raft transport r=erikgrinaker a=tbg

Co-authored-by: Tobias Grieger <tobias.schottdorf@gmail.com>
@erikgrinaker
Contributor

Opened #70111, which expands on @tbg's proposal above with a component that is also responsible for maintaining RPC connections to all nodes by dialing them when appropriate, in addition to managing health checks.

Maybe this issue can be subsumed by #70111 now that #69405 has been merged?

@tbg tbg closed this as completed Sep 13, 2021
KV automation moved this from Incoming to Closed Sep 13, 2021
DB Server & Security automation moved this from Linked issues (from the roadmap columns on the right) to Done 21.2 Sep 13, 2021
blathers-crl bot pushed a commit that referenced this issue Sep 16, 2021
@exalate-issue-sync exalate-issue-sync bot removed the T-kv KV Team label Sep 16, 2021
tbg added a commit to tbg/cockroach that referenced this issue Sep 17, 2021
tbg added a commit to tbg/cockroach that referenced this issue Sep 30, 2021
tbg added a commit to tbg/cockroach that referenced this issue Oct 1, 2021
See cockroachdb#68419 (comment) for the original discussion.

This commit adds a new `circuit` package that uses probing-based
circuit breakers. This breaker does *not* recruit the occasional
request to carry out the probing. Instead, the circuit breaker
is configured with an "asynchronous probe" that effectively
determines when the breaker should reset.

We prefer this approach precisely because it avoids recruiting
regular traffic, which is often tied to end-user requests, and
led to unacceptable latencies there.

The potential downside of the probing approach is that the breaker setup
is more complex and there is residual risk of configuring the probe
differently from the actual client requests. In the worst case, the
breaker would be perpetually tripped even though everything should be
fine. This isn't expected - our two uses of circuit breakers are pretty
clear about what they protect - but it is worth mentioning as this
consideration likely influenced the design of the original breaker.

Touches cockroachdb#69888
Touches cockroachdb#70111
Touches cockroachdb#53410

Also, this breaker was designed to be a good fit for:
cockroachdb#33007
which will use the `Signal()` call.

Release note: None
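
An illustrative-only sketch of the probing-breaker shape this commit message describes, including a Signal()-style call; this is not the actual circuit package API, just the idea of an asynchronous probe plus a fail-fast signal.

// Illustrative sketch only; not the real pkg/util/circuit API.
package main

import (
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

type Breaker struct {
	mu    sync.Mutex
	err   error         // non-nil while tripped
	ch    chan struct{} // closed when the breaker resets
	probe func() error  // asynchronous probe, runs only while tripped
}

func NewBreaker(probe func() error) *Breaker {
	ch := make(chan struct{})
	close(ch) // healthy: channel already closed, error nil
	return &Breaker{ch: ch, probe: probe}
}

// Signal returns the current error (nil if healthy) and a channel that is
// closed once the breaker resets; callers fail fast or select on it.
func (b *Breaker) Signal() (error, <-chan struct{}) {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.err, b.ch
}

// Report trips the breaker and starts the probe. No regular request is ever
// recruited to test the connection.
func (b *Breaker) Report(err error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.err != nil {
		return // already tripped, probe already running
	}
	b.err = err
	b.ch = make(chan struct{})
	go b.runProbe()
}

func (b *Breaker) runProbe() {
	for {
		if b.probe() == nil {
			b.mu.Lock()
			b.err = nil
			close(b.ch)
			b.mu.Unlock()
			return
		}
		time.Sleep(100 * time.Millisecond) // back off between probes
	}
}

func main() {
	var healthy atomic.Bool
	br := NewBreaker(func() error {
		if healthy.Load() {
			return nil
		}
		return errors.New("still down")
	})
	br.Report(errors.New("connection refused"))
	if err, _ := br.Signal(); err != nil {
		fmt.Println("fail fast:", err) // callers bounce while tripped
	}
	healthy.Store(true) // the probe observes recovery and resets the breaker
	_, ch := br.Signal()
	<-ch
	fmt.Println("breaker reset")
}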