
rpc: circuit breaker livelock if 2nd check comes too fast after the 1st #68419

Closed
knz opened this issue Aug 4, 2021 · 4 comments
Labels
A-kv-server (Relating to the KV-level RPC server), A-server-networking (Pertains to network addressing, routing, initialization), C-bug (Code not up to spec/doc, specs & docs deemed correct; solution expected to change code/behavior), T-server-and-security (DB Server & Security)

Comments

@knz
Contributor

knz commented Aug 4, 2021

Investigated by @nvanbenschoten and @erikgrinaker.

When a network connection drops, we also trip its associated circuit breaker. For it to recover, the breaker needs to enter a "half-open" state, where it lets an occasional (once per second) request through to try to re-establish the connection. If that succeeds, the breaker is moved to closed and considered recovered. What we found is that the Raft transport was checking the circuit breaker associated with a given (destination, rpc class) pair twice in order to establish a connection:

First, before creating a new RaftTransport queue:

if !t.dialer.GetCircuitBreaker(toNodeID, class).Ready() {

Second, when the RaftTransport queue started up and dialed the destination node via

conn, err := t.dialer.Dial(ctx, toNodeID, class)

which performs this check internally:

if breaker != nil && !breaker.Ready() {

So the theory is that once a second, a request will make it through the first call to Breaker.Ready. However, when it does, it launches a new RaftTransport queue that immediately checks the breaker again. Since we haven't waited a second between the two calls to Breaker.Ready, this second call will always return false. So even in the cases where we pass the first breaker check, we always immediately fail the second. And since we never pass the second check and successfully dial, we never mark the breaker as closed. Instead, we shut down the RaftTransport queue and start over again.

This is a fascinating pathology. In some sense, breakers are not reentrant. This patch to https://github.com/cockroachdb/circuitbreaker demonstrates this:

func TestTrippableBreakerState(t *testing.T) {
	c := clock.NewMock()
	cb := NewBreaker()
	cb.Clock = c
	if !cb.Ready() {
		t.Fatal("expected breaker to be ready")
	}
	cb.Trip()
	if cb.Ready() {
		t.Fatal("expected breaker to not be ready")
	}
	c.Add(cb.nextBackOff + 1)
	if !cb.Ready() {
		t.Fatal("expected breaker to be ready after reset timeout")
	}
+	if !cb.Ready() {
+		// !!! This fails !!!
+		// There's no breaker affinity.
+		t.Fatal("expected breaker to be ready after reset timeout")
+	}
	cb.Fail(nil)
	c.Add(cb.nextBackOff + 1)
	if !cb.Ready() {
		t.Fatal("expected breaker to be ready after reset timeout, post failure")
	}
}

So any code that requires two consecutive calls to a breaker's Ready() function in order to reset the breaker is bound to be starved forever.

It's not yet clear what the best fix is for this. One solution is to expose an option from nodedialer.Dialer.Dial to skip the breaker check. Another is to do something more clever around breaker affinity.
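
As a rough illustration of the first option, here is a minimal sketch, with hypothetical names (Dialer, DialNoBreaker, breakerFor, Conn) rather than the real nodedialer types: a Dial variant that skips the Ready() gate but still reports the outcome, so the breaker's state stays accurate.

// Hedged sketch only; names and signatures are illustrative, not the actual
// nodedialer API.
package nodedialer

import (
	"context"
	"errors"
)

var errBreakerOpen = errors.New("unable to dial: breaker open")

// Conn stands in for *grpc.ClientConn to keep the sketch self-contained.
type Conn struct{}

// Breaker is the subset of the breaker API the sketch relies on.
type Breaker interface {
	Ready() bool    // when tripped, admits only the occasional trial request
	Success()       // records success, closing the breaker
	Fail(err error) // records failure, keeping the breaker tripped
}

type Dialer struct {
	breakerFor func(nodeID int32, class int) Breaker
	dialRaw    func(ctx context.Context, nodeID int32, class int) (*Conn, error)
}

// Dial gates the attempt on the breaker; a caller that already consumed a
// Ready() check (like the Raft transport queue) would livelock here.
func (d *Dialer) Dial(ctx context.Context, nodeID int32, class int) (*Conn, error) {
	br := d.breakerFor(nodeID, class)
	if !br.Ready() {
		return nil, errBreakerOpen
	}
	return d.dial(ctx, nodeID, class, br)
}

// DialNoBreaker skips the Ready() gate but still feeds the result back, so a
// successful dial resets the breaker and a failure trips it again.
func (d *Dialer) DialNoBreaker(ctx context.Context, nodeID int32, class int) (*Conn, error) {
	return d.dial(ctx, nodeID, class, d.breakerFor(nodeID, class))
}

func (d *Dialer) dial(ctx context.Context, nodeID int32, class int, br Breaker) (*Conn, error) {
	conn, err := d.dialRaw(ctx, nodeID, class)
	if err != nil {
		br.Fail(err)
		return nil, err
	}
	br.Success()
	return conn, nil
}

With something along these lines, the Raft transport would call DialNoBreaker after its own Ready() check, so the breaker is only consulted once per connection attempt.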

@knz knz added the C-bug, A-kv-server, and A-server-networking labels Aug 4, 2021
@knz knz added this to Incoming in KV via automation Aug 4, 2021
@blathers-crl blathers-crl bot added the T-kv KV Team label Aug 4, 2021
@knz knz added this to To do in DB Server & Security via automation Aug 4, 2021
@blathers-crl blathers-crl bot added the T-server-and-security DB Server & Security label Aug 4, 2021
@piyush-singh piyush-singh moved this from To do to Linked issues (from the roadmap columns on the right) in DB Server & Security Aug 9, 2021
tbg added a commit to cockroachdb/circuitbreaker that referenced this issue Aug 26, 2021
@tbg
Member

tbg commented Aug 26, 2021

I looked a little into how this should work instead. The basic idea is that instead of letting through the occasional request, a breaker should have associated with it a "watcher" goroutine. If the breaker trips, it is the watcher's job (and only the watcher's job) to try to determine when the target is available again. In the meantime, everyone else will fail-fast 100% of the time; effectively all calls to Ready() will be replaced by calls to Tripped() (which is a pure read, unlike Ready()).
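
As a minimal sketch of that idea (hand-rolled for illustration, not the existing *circuit.Breaker or any real API): the breaker owns a single watcher goroutine that probes the target once the breaker trips, while all other callers only read Tripped() and fail fast.

// Illustrative sketch only; not an existing API.
package probebreaker

import (
	"sync/atomic"
	"time"
)

// Breaker trips on failure and is reset only by its watcher goroutine, never
// by recruiting a caller's request.
type Breaker struct {
	tripped int32         // 1 while the target is considered unreachable
	probe   func() error  // cheap health check of the target
	period  time.Duration // pause between probe attempts
}

func New(probe func() error, period time.Duration) *Breaker {
	return &Breaker{probe: probe, period: period}
}

// Tripped is a pure read; callers fail fast while it returns true.
func (b *Breaker) Tripped() bool { return atomic.LoadInt32(&b.tripped) == 1 }

// Trip marks the target as down and starts the watcher, the only actor
// allowed to reset the breaker.
func (b *Breaker) Trip() {
	if atomic.CompareAndSwapInt32(&b.tripped, 0, 1) {
		go b.watch()
	}
}

func (b *Breaker) watch() {
	for b.Tripped() {
		if b.probe() == nil {
			atomic.StoreInt32(&b.tripped, 0) // target is back; reset
			return
		}
		time.Sleep(b.period) // back off before the next probe
	}
}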

In practice, the tangliness of our RPC layer raises additional questions, especially if we're also trying to fix parts of #53410.

The circuit breakers sit at the level of (NodeID, ConnectionClass) tuples in the node dialer. So right there, anything that dials addresses directly doesn't have any circuit breaker protection (not sure who exactly does this, but there's probably someone). Then again, such callers sort of do have protection: the way *rpc.Context works is that it doesn't return the connection until it has been "proven healthy" through a recent PingRequest (and while it's intermittently unhealthy, callers will fail fast). So if a node goes down, the previously healthy connection will turn unhealthy and callers will be bounced here:

// Connect returns the underlying grpc.ClientConn after it has been validated,
// or an error if dialing or validation fails.
func (c *Connection) Connect(ctx context.Context) (*grpc.ClientConn, error) {
	if c.dialErr != nil {
		return nil, c.dialErr
	}

But on certain errors the connection gets removed from rpc.Context's pool and we lose that state. I think this realistically only happens when a node gets decommissioned, in which case we also try to persist local state on all cluster nodes that makes them fail-fast outgoing connection attempts to the decommissioned node (so in a sense we already have permanent circuit breakers for this case). Anyway, when a dial creates a new Connection object, it shouldn't block there, but I'm not so sure, since it will hit this path:

cockroach/pkg/rpc/context.go

Lines 1097 to 1101 in 57ad801

conn.initOnce.Do(func() {
	// Either we kick off the heartbeat loop (and clean up when it's done),
	// or we clean up the connKey entries immediately.
	var redialChan <-chan struct{}
	conn.grpcConn, redialChan, conn.dialErr = ctx.grpcDialRaw(target, remoteNodeID, class)

and I'm not sure if grpcDialRaw blocks.

I think what we want is:

  • the circuit breakers move to rpc.Context and are scoped at an (address, class) level.
  • they are never cleared.
  • the heartbeat loop (and only the heartbeat loop) for the connection manages the state of the breaker. When a breaker is tripped but nobody inquires about the state of the breaker for a while, we stop the heartbeat loop. (It will start up again should the breaker be checked again in the future)
  • code that needs to connect somewhere can check breaker.Tripped() before dialing (not .Ready(); that whole half-open concept should be disabled in our case). But you don't even need to check anything in the common case, because the dial will check the breaker for you. (A rough sketch of this arrangement follows after this list.)
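
A rough sketch of how these pieces could fit together, with illustrative names (Context.Breaker, Peer, heartbeatLoop) that are not the real rpc.Context API; the heartbeat loop is the only actor that flips the breaker, and callers only read Tripped():

// Illustrative sketch only; rpcctx stands in for pkg/rpc.
package rpcctx

import (
	"context"
	"sync"
	"time"
)

// ConnectionClass mirrors the notion of an RPC connection class.
type ConnectionClass int

type breakerKey struct {
	addr  string
	class ConnectionClass
}

// Peer holds the per-(address, class) breaker state. It is never removed
// from the map, even after decommission or auth errors.
type Peer struct {
	mu      sync.Mutex
	tripped bool
}

// Tripped is a pure read; callers fail fast while it returns true.
func (p *Peer) Tripped() bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.tripped
}

func (p *Peer) setTripped(v bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.tripped = v
}

// Context stands in for rpc.Context in this sketch.
type Context struct {
	mu    sync.Mutex
	peers map[breakerKey]*Peer
	ping  func(ctx context.Context, addr string, class ConnectionClass) error
}

func NewContext(ping func(ctx context.Context, addr string, class ConnectionClass) error) *Context {
	return &Context{peers: map[breakerKey]*Peer{}, ping: ping}
}

// Breaker returns the (address, class) breaker, creating it and starting its
// heartbeat loop on first use.
func (c *Context) Breaker(ctx context.Context, addr string, class ConnectionClass) *Peer {
	c.mu.Lock()
	defer c.mu.Unlock()
	k := breakerKey{addr: addr, class: class}
	p, ok := c.peers[k]
	if !ok {
		p = &Peer{}
		c.peers[k] = p
		go c.heartbeatLoop(ctx, k, p)
	}
	return p
}

// heartbeatLoop is the single actor that flips the breaker: a failed ping
// trips it, a successful ping resets it. (The real version would also pause
// itself while nobody is asking about the breaker.)
func (c *Context) heartbeatLoop(ctx context.Context, k breakerKey, p *Peer) {
	t := time.NewTicker(time.Second)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			p.setTripped(c.ping(ctx, k.addr, k.class) != nil)
		}
	}
}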

The result should be that each node that "anyone" actually tries to inquire about has a watcher goroutine associated with it. Connection attempts from CRDB code will never be recruited as canary requests, and so don't eat the possibly disastrous latencies that come with that. We also collapse the different layers of circuit breaking and notions of connection health, which should significantly simplify the code. I think we are then also in a position to relatively easily address what must be the issues behind #53410 - we make sure that the initial connection attempt to a node is timeboxed to, say, 1s (not sure what the current state is), and then the managed breakers do the rest.

There is a lot of hard-earned wisdom in this tangly code, so we need to be careful.

@erikgrinaker
Contributor

I like this direction.

the heartbeat loop (and only the heartbeat loop) for the connection manages the state of the breaker. When a breaker is tripped but nobody inquires about the state of the breaker for a while, we stop the heartbeat loop. (It will start up again should the breaker be checked again in the future)

Do we actually need the breaker at all then? If we were to always keep a heartbeat loop running, presumably that could periodically try to connect to the remote node and fail any requests in the meanwhile. I suppose it might still be useful to keep the breaker as internal state, but I do like the idea of having a single actor responsible for managing the connection.

@tbg
Member

tbg commented Aug 26, 2021

I think it becomes a game of naming things. There will be something like a circuit breaker; the question is whether there will be a *circuit.Breaker. Because *circuit.Breaker is so bought into this "recruit calls to do discovery" pattern, I don't see how we can continue using it.

presumably that could periodically try to connect to the remote node and fail any requests in the meanwhile

That is how it works today, except... the code is pretty hard to understand and we are not confident how well it works; it definitely doesn't work that well for nodes for which we get an IsAuthError, because that causes all state to be removed, so the next request (if there is one) will start from scratch.

tbg added a commit to tbg/cockroach that referenced this issue Aug 26, 2021
See cockroachdb#68419.

Release justification: bug fix
Release note (bug fix): Previously, after a temporary node outage, other
nodes in the cluster could fail to connect to the restarted node due to
their circuit breakers not resetting. This would manifest in the logs
via messages "unable to dial nXX: breaker open", where `XX` is the ID
of the restarted node. (Note that such errors are expected for nodes
that are truly unreachable, and may still occur around the time of
the restart, but for no longer than a few seconds).
tbg added a commit to tbg/cockroach that referenced this issue Aug 26, 2021
See cockroachdb#68419. We now use
`DialNoBreaker` for the raft transport, taking into account the previous
`Ready()` check.

`DialNoBreaker` was previously bypassing the breaker as it ought to but
was also *not reporting to the breaker* the result of the operation;
this is not ideal and was caught by the tests. This commit changes
`DialNoBreaker` to report the result (i.e. fail or success).

Release justification: bug fix
Release note (bug fix): Previously, after a temporary node outage, other
nodes in the cluster could fail to connect to the restarted node due to
their circuit breakers not resetting. This would manifest in the logs
via messages "unable to dial nXX: breaker open", where `XX` is the ID
of the restarted node. (Note that such errors are expected for nodes
that are truly unreachable, and may still occur around the time of
the restart, but for no longer than a few seconds).
craig bot pushed a commit that referenced this issue Aug 27, 2021
69405: kvserver: remove extraneous circuit breaker check in Raft transport r=erikgrinaker a=tbg

Co-authored-by: Tobias Grieger <tobias.schottdorf@gmail.com>
@erikgrinaker
Contributor

Opened #70111, which expands on @tbg's proposal above with a component that is also responsible for maintaining RPC connections to all nodes by dialing them when appropriate, in addition to managing health checks.

Maybe this issue can be subsumed by #70111 now that #69405 has been merged?

@tbg tbg closed this as completed Sep 13, 2021
KV automation moved this from Incoming to Closed Sep 13, 2021
DB Server & Security automation moved this from Linked issues (from the roadmap columns on the right) to Done 21.2 Sep 13, 2021
blathers-crl bot pushed a commit that referenced this issue Sep 16, 2021
@exalate-issue-sync exalate-issue-sync bot removed the T-kv KV Team label Sep 16, 2021
tbg added a commit to tbg/cockroach that referenced this issue Sep 17, 2021
tbg added a commit to tbg/cockroach that referenced this issue Sep 30, 2021
tbg added a commit to tbg/cockroach that referenced this issue Oct 1, 2021
See cockroachdb#68419 (comment) for the original discussion.

This commit adds a new `circuit` package that uses probing-based
circuit breakers. This breaker does *not* recruit the occasional
request to carry out the probing. Instead, the circuit breaker
is configured with an "asynchronous probe" that effectively
determines when the breaker should reset.

We prefer this approach precisely because it avoids recruiting
regular traffic, which is often tied to end-user requests, and
led to unacceptable latencies there.

The potential downside of the probing approach is that the breaker setup
is more complex and there is residual risk of configuring the probe
differently from the actual client requests. In the worst case, the
breaker would be perpetually tripped even though everything should be
fine. This isn't expected - our two uses of circuit breakers are pretty
clear about what they protect - but it is worth mentioning as this
consideration likely influenced the design of the original breaker.

Touches cockroachdb#69888
Touches cockroachdb#70111
Touches cockroachdb#53410

Also, this breaker was designed to be a good fit for:
cockroachdb#33007
which will use the `Signal()` call.

Release note: None
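
An illustrative-only sketch of the probing-breaker shape this commit message describes, including a Signal()-style call; this is not the actual circuit package API, just the idea of an asynchronous probe plus a fail-fast signal.

// Illustrative sketch only; not the real pkg/util/circuit API.
package main

import (
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

type Breaker struct {
	mu    sync.Mutex
	err   error         // non-nil while tripped
	ch    chan struct{} // closed when the breaker resets
	probe func() error  // asynchronous probe, runs only while tripped
}

func NewBreaker(probe func() error) *Breaker {
	ch := make(chan struct{})
	close(ch) // healthy: channel already closed, error nil
	return &Breaker{ch: ch, probe: probe}
}

// Signal returns the current error (nil if healthy) and a channel that is
// closed once the breaker resets; callers fail fast or select on it.
func (b *Breaker) Signal() (error, <-chan struct{}) {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.err, b.ch
}

// Report trips the breaker and starts the probe. No regular request is ever
// recruited to test the connection.
func (b *Breaker) Report(err error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.err != nil {
		return // already tripped, probe already running
	}
	b.err = err
	b.ch = make(chan struct{})
	go b.runProbe()
}

func (b *Breaker) runProbe() {
	for {
		if b.probe() == nil {
			b.mu.Lock()
			b.err = nil
			close(b.ch)
			b.mu.Unlock()
			return
		}
		time.Sleep(100 * time.Millisecond) // back off between probes
	}
}

func main() {
	var healthy atomic.Bool
	br := NewBreaker(func() error {
		if healthy.Load() {
			return nil
		}
		return errors.New("still down")
	})
	br.Report(errors.New("connection refused"))
	if err, _ := br.Signal(); err != nil {
		fmt.Println("fail fast:", err) // callers bounce while tripped
	}
	healthy.Store(true) // the probe observes recovery and resets the breaker
	_, ch := br.Signal()
	<-ch
	fmt.Println("breaker reset")
}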