
kv: convert uni-directional network partitions to bi-directional #94778

Merged
merged 1 commit into cockroachdb:master on Mar 9, 2023

Conversation

@andrewbaptist (Collaborator) commented Jan 5, 2023

Previously, a one-way partition, in which a node could initiate a successful
TCP connection in one direction but the reverse connection failed, caused
problems. The node that initiates outgoing connections can acquire leases
and cause failures for reads and writes to those ranges. This is particularly
a problem if it acquires the liveness range leases, but it is a problem even
for other ranges.

This commit adds an additional check during server-to-server
communication where the recipient of a new PingRequest first validates
that it is able to open a reverse connection to the initiator before
responding. Additionally, it monitors over time whether it still has a
successful reverse connection and asynchronously re-validates reverse
connections to the sender. The ongoing validation is asynchronous to avoid
adding delays to PingResponses, since they are used for measuring clock
offsets.

Release note (bug fix): Detects and addresses one-way partitions.
Epic: none
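
For illustration, a minimal, self-contained sketch of the dialback idea; this is not the actual pkg/rpc implementation (which reuses the node's existing gRPC connections), and all names in it are illustrative:

package dialback

import (
	"context"
	"fmt"
	"net"
	"time"
)

// verifyDialback returns nil only if a reverse TCP connection to addr succeeds.
func verifyDialback(ctx context.Context, addr string) error {
	d := net.Dialer{Timeout: 3 * time.Second}
	conn, err := d.DialContext(ctx, "tcp", addr)
	if err != nil {
		return fmt.Errorf("reverse connection to %s failed: %w", addr, err)
	}
	return conn.Close()
}

// handlePing refuses to answer a ping unless the dialback check passes,
// turning a one-way partition into a symmetric, detectable failure.
func handlePing(ctx context.Context, senderAddr string) error {
	if err := verifyDialback(ctx, senderAddr); err != nil {
		return err
	}
	// ... normal ping handling (clock offset measurement, etc.) ...
	return nil
}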

@cockroach-teamcity (Member) commented: This change is Reviewable

@andrewbaptist andrewbaptist linked an issue Jan 5, 2023 that may be closed by this pull request
@andrewbaptist andrewbaptist force-pushed the 221228.network-partition branch 16 times, most recently from 77d9bfe to dfc5f86 Compare January 10, 2023 17:49
// We want this connection to block on connection so we verify it
// succeeds.
dialOpts = append(dialOpts, grpc.WithBlock())
conn, err := grpc.DialContext(dialCtx, target, dialOpts...)
Contributor commented:

As we discussed elsewhere, we should go via grpcDialNodeInternal() such that we'll retain this connection for future use and join any in-flight connection attempts.

For an initial inbound connection, this will naïvely result in a circular dependency since we'll be attempting to ping each other in either direction, both waiting on the other's ping response. However, for the dialback connection attempt we don't have to wait for the ping response itself, we can simply wait for the underlying gRPC connection to be established asynchronously, e.g. via ClientConn.WaitForStateChange.

We may want to limit this to only initial ping attempts, such that we later verify that we're actually receiving pings in both directions too, but this is a lesser concern.
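
As a sketch of the asynchronous wait described here, assuming a plain *grpc.ClientConn (the PR itself routes the dialback through grpcDialNodeInternal):

package dialback

import (
	"context"
	"fmt"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
)

// waitForReady waits until the gRPC connection reaches READY or the context is
// done, without waiting on any application-level ping response. This is what
// breaks the circular dependency of two nodes each waiting on the other's ping.
func waitForReady(ctx context.Context, conn *grpc.ClientConn) error {
	for {
		state := conn.GetState()
		if state == connectivity.Ready {
			return nil
		}
		if !conn.WaitForStateChange(ctx, state) {
			return fmt.Errorf("waiting for connection readiness: %w", ctx.Err())
		}
	}
}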

@@ -339,6 +339,18 @@ func NewServer(cfg Config, stopper *stop.Stopper) (*Server, error) {
nodeDialer := nodedialer.NewWithOpt(rpcContext, gossip.AddressResolver(g),
nodedialer.DialerOpt{TestingKnobs: dialerKnobs})

// This is somewhat tangled due to the need to use gossip to resolve the
// address. If we are not able to resolve, we simply skip this verification.
rpcContext.PingResolver = func(ctx context.Context, nodeID roachpb.NodeID) string {
Contributor commented:

We can avoid this tangle by not passing rpcContext in to gossip.New. Gossip only uses the RPC context once it starts making outbound connections, and that only happens once we call Gossip.Start() far below here. The cleanest approach is probably to remove the Gossip.rpcContext field entirely, and thread the RPC context via Gossip.Start() down into bootstrap() and manage(), but alternatively we can set the rpcContext field during Gossip.Start() as long as we properly synchronize access to it. This should preferably be done as a separate commit.

When we do that, we can set up gossip before the RPC context and just pass a nodedialer.AddressResolver to rpc.NewContext() (if we get away with it without any dependency cycles, alternatively use some other corresponding type).
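
A sketch of the resolver shape being discussed, with illustrative types (the real code uses roachpb.NodeID and a gossip-backed nodedialer.AddressResolver):

package dialback

import (
	"context"
	"net"
)

// AddressResolver maps a node ID to a network address, returning an error if
// the address is not (yet) known, e.g. because gossip has no entry for it.
type AddressResolver func(nodeID int32) (net.Addr, error)

// MakePingResolver adapts an AddressResolver into the PingResolver form shown
// above: it returns "" to skip dialback verification when resolution fails.
func MakePingResolver(resolve AddressResolver) func(ctx context.Context, nodeID int32) string {
	return func(ctx context.Context, nodeID int32) string {
		addr, err := resolve(nodeID)
		if err != nil {
			return "" // unknown address: skip verification rather than fail the ping
		}
		return addr.String()
	}
}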

func (rpcCtx *Context) VerifyDialback(ctx context.Context, nodeID roachpb.NodeID) error {
log.Errorf(ctx, "Verifying health of connection to n%d", nodeID)
if nodeID == 0 {
//FIXME: at startup, the nodeID might not be set. Unfortunately
Contributor commented:

This only happens when bootstrapping a new node, right? Once the node has joined a cluster, this will always be set even on the initial connection attempts? If so, this seems totally fine, since the node can't participate in consensus or acquire any leases before it's allocated a node ID.

// established. If there is already a healthy connection set up, it will
// simply return immediately, however if not, it attempts to establish a
// "dummy" connection which is never used to send messages on.
verifyDialback func(context.Context, roachpb.NodeID) error
Contributor commented:

We already have onHandlePing which effectively does the same thing, should we hook into that instead? We can't go via ContextOptions.OnIncomingPing because that's necessarily constructed before the RPC context itself, but we can inject a handler during RPC context construction before setting onHandlePing. Not particularly important, and I'm not sure if it's really making things more clear or less, so take it or leave it.

@andrewbaptist andrewbaptist force-pushed the 221228.network-partition branch 5 times, most recently from 84fff6b to 7467c6f Compare March 7, 2023 22:30
@erikgrinaker (Contributor) left a comment

Thanks for getting this across the line, I appreciate it was a longer slog than expected.

We should mark this as resolving #84289, and put the GA blocker label on that issue (labels on PRs don't count).

Reviewed 1 of 4 files at r5, 10 of 10 files at r14, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist)


pkg/rpc/context.go line 2589 at r6 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

Changed to wait until the dialback connection is completed before verifying, rather than just one ping later.

Thanks, appreciated -- blocking pings didn't really buy us anything, and only had downsides, so I think this is much better.

As for the RTT considerations, if we do want to drop the ping timeout, we would likely differentiate the initial ping timeout and subsequent timeouts to account for the additional dial RTTs, since we only have to account for the additional RTTs on the blocking ping. That would allow us to aggressively lower the ping timeout on low-latency (regional) clusters. There's also head-of-line blocking to consider.


pkg/rpc/context.go line 2249 at r14 (raw file):

			// so we ignore it.
			err := rpcCtx.runHeartbeat(ctx, conn, target)
			log.Health.Infof(ctx, "connection heartbeat loop ended with err: %v", err)

nit: will this result in duplicate logging? The error is stored in conn.err and propagated to callers via Connect() and Health() where they will presumably log it or propagate it further up the stack. If we should log it, shouldn't it be logged as an error?


pkg/rpc/context.go line 2297 at r14 (raw file):

	"rpc.dialback.enabled",
	"if true, require bidirectional RPC connections between nodes to prevent one-way network unavailability",
	true,

Should this be TenantReadOnly? I think we generally default to that unless we have a very good reason to use SystemOnly, since changes otherwise aren't visible to tenants and they'll always use the default value regardless of the host's value.
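
Roughly what a TenantReadOnly registration looks like; the variable name is illustrative, it assumes the github.com/cockroachdb/cockroach/pkg/settings package, and the exact helper/class names have shifted across releases:

var dialbackEnabled = settings.RegisterBoolSetting(
	settings.TenantReadOnly, // readable by tenants, writable only via the system tenant
	"rpc.dialback.enabled",
	"if true, require bidirectional RPC connections between nodes to prevent one-way network unavailability",
	true,
)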


pkg/rpc/context.go line 2578 at r14 (raw file):

	// not the node. In that case, we can't look up if we have a connection to the
	// node and instead need to always try dialback.
	var err error

nit: we can drop this declaration and use a locally scoped err := inside the branch, since we don't need to keep the result around for later.
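
In other words, the suggested shape is simply (schematic, with hypothetical needsDialback/check names):

// Instead of declaring `var err error` ahead of the branch:
if needsDialback {
	if err := check(ctx); err != nil {
		return err
	}
}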


pkg/rpc/context.go line 2600 at r14 (raw file):

		// Clear out the ServerTime on our response so the receiver does not use
		// this response in its latency calculations.
		response.ServerTime = 0

Did you consider checking this on the sender side instead, where we update the clock/latency measurements? They're only affected when we set BLOCKING on the request, and this is controlled by the sender, so it can locally choose to ignore the measurement in that case. That avoids plumbing through the response here, keeps the logic in one place, and also avoids any subtle ordering dependencies where e.g. the OnPing has to be called after ServerTime has been populated.

Also, in clusters with clock skew, this will prevent us from detecting the skew on the initial connection attempt, thus we'll run with known clock skew for up to several seconds, potentially violating linearizability in the meanwhile. Maybe this isn't that big of a deal since these checks are always going to be best-effort anyway, but it seems a bit unfortunate that we're relaxing clock skew protections here. Do you have any thoughts on this? Afaict, we're forced to relax either clock skew protection or asymmetric partition detection here.
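
A schematic of the sender-side alternative; all names here are hypothetical:

// Skip the clock-offset/latency update for a ping that requested a blocking
// dialback, since its round trip includes the extra reverse dial and is not a
// clean clock sample.
if sentBlockingDialback {
	// Do not feed this sample into remote clock / latency tracking.
} else {
	updateRemoteClock(nodeID, response.ServerTime, roundTripLatency)
}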


pkg/rpc/context.go line 2645 at r14 (raw file):

	previousAttempt := rpcCtx.dialbackMu.m[nodeID]

	// Block here for the previous connection to be completed (successfully or

nit: comment seems outdated, we don't block here.


pkg/rpc/context.go line 2652 at r14 (raw file):

	if previousAttempt != nil {
		select {
		case <-previousAttempt.initialHeartbeatDone:

nit: isn't this exactly the same logic as Health(), where the default branch corresponds to ErrNotHeartbeated?
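
Schematically, the check being compared to Health() looks like this (hypothetical helper; err is the outcome recorded by the previous attempt):

package dialback

import "errors"

var errNotHeartbeated = errors.New("not yet heartbeated")

// attemptStatus: a closed initialHeartbeatDone channel means the previous
// dialback attempt finished (err holds its outcome); an open channel is the
// ErrNotHeartbeated case.
func attemptStatus(initialHeartbeatDone <-chan struct{}, err error) error {
	select {
	case <-initialHeartbeatDone:
		return err // nil on success, otherwise the recorded failure
	default:
		return errNotHeartbeated // attempt still in flight
	}
}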


pkg/rpc/heartbeat.proto line 74 at r3 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

This works fine with mixed-version clusters and CLIs from different versions. I will do one more pass to make sure I didn't miss anything, but it seems best to not add a new flag.

Sure. I suppose we could use a version gate whenever we add an enum value instead.


pkg/rpc/context_test.go line 2618 at r14 (raw file):

	rpcCtx := NewContext(context.Background(), opts)
	// This is normally set up inside the server, we want to hold onto all PingRequests that come through.
	rpcCtx.OnIncomingPing = func(ctx context.Context, req *PingRequest, resp *PingResponse) error {

So we basically rely on failover/*/blackhole-(recv|send) to verify that this works in a real cluster, yeah? Do we feel like that coverage is sufficient, or should we wire up some rudimentary integration tests using a stock TestCluster too?
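
For reference, a sketch of how the hook quoted above can capture pings in a test (the mutex and slice are test-local helpers; requires the sync import):

var mu sync.Mutex
var pings []*PingRequest
rpcCtx.OnIncomingPing = func(ctx context.Context, req *PingRequest, resp *PingResponse) error {
	mu.Lock()
	defer mu.Unlock()
	pings = append(pings, req)
	return nil // or return an error to simulate a rejected dialback
}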


pkg/rpc/context_test.go line 2715 at r14 (raw file):

	// Forcibly shut down listener 2 and the connection node1 -> node2.
	// Verify the reverse connection will also close within a DialTimeout.

nit: DialTimeout doesn't come into play here because the closed listener will respond with an immediate TCP RST packet. I'm not sure we can easily test DialTimeout here since we have to fiddle with the OS TCP stack.

@andrewbaptist (Collaborator, Author) left a comment

TFTR - I updated the PR and comments also. I'm rerunning all the blackhole tests one more time with the final code and will push once that is done and successful.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)


pkg/rpc/context.go line 2589 at r6 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Thanks, appreciated -- blocking pings didn't really buy us anything, and only had downsides, so I think this is much better.

As for the RTT considerations, if we do want to drop the ping timeout, we would likely differentiate the initial ping timeout and subsequent timeouts to account for the additional dial RTTs, since we only have to account for the additional RTTs on the blocking ping. That would allow us to aggressively lower the ping timeout on low-latency (regional) clusters. There's also head-of-line blocking to consider.

Thanks, I agree this is better. I was worried about the timeouts on connection attempts, but it seems OK since we have a couple of failsafes in place. I'll add a note about this as well.


pkg/rpc/context.go line 2249 at r14 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nit: will this result in duplicate logging? The error is stored in conn.err and propagated to callers via Connect() and Health() where they will presumably log it or propagate it further up the stack. If we should log it, shouldn't it be logged as an error?

It is a duplicate, but it is better to know immediately when this fails rather than waiting for a future (non-heartbeat) message. Granted, future messages typically come soon, but there is no guarantee of when they actually happen. I left it as Info since the connection log is also Info and I wanted to be symmetric with that.


pkg/rpc/context.go line 2297 at r14 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Should this be TenantReadOnly? I think we generally default to that unless we have a very good reason to use SystemOnly, since changes otherwise aren't visible to tenants and they'll always use the default value regardless of the host's value.

Updated to TenantReadOnly.


pkg/rpc/context.go line 2578 at r14 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nit: we can drop this declaration and use a locally scoped err := inside the branch, since we don't need to keep the result around for later.

Done


pkg/rpc/context.go line 2600 at r14 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Did you consider checking this on the sender side instead, where we update the clock/latency measurements? They're only affected when we set BLOCKING on the request, and this is controlled by the sender, so it can locally choose to ignore the measurement in that case. That avoids plumbing through the response here, keeps the logic in one place, and also avoids any subtle ordering dependencies where e.g. the OnPing has to be called after ServerTime has been populated.

Also, in clusters with clock skew, this will prevent us from detecting the skew on the initial connection attempt, thus we'll run with known clock skew for up to several seconds, potentially violating linearizability in the meanwhile. Maybe this isn't that big of a deal since these checks are always going to be best-effort anyway, but it seems a bit unfortunate that we're relaxing clock skew protections here. Do you have any thoughts on this? Afaict, we're forced to relax either clock skew protection or asymmetric partition detection here.

Good idea! I changed to check on the sender side only and removed zeroing it out here.


pkg/rpc/context.go line 2645 at r14 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nit: comment seems outdated, we don't block here.

Updated comment


pkg/rpc/context.go line 2652 at r14 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nit: isn't this exactly the same logic as Health(), where the default branch corresponds to ErrNotHeartbeated?

It was very close to calling Health and I considered that, but it needed to handle dialbackMu.m differently, and by the time I had handled all the different cases it was harder to read than just copying the checks here. Two of the three "branches" do something different, so it really became easier to just reimplement this rather than using Health.


pkg/rpc/heartbeat.proto line 74 at r3 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Sure. I suppose we could use a version gate whenever we add an enum value instead.

I left it for now; I don't expect we will add new values, and it will probably require a version gate if we ever do. Leaving the default value as 0 simplified things a little.


pkg/rpc/context_test.go line 2618 at r14 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

So we basically rely on failover/*/blackhole-(recv|send) to verify that this work in a real cluster, yeah? Do we feel like that coverage is sufficient, or should we wire up some rudimentary integration tests using a stock TestCluster too?

I'm going to add a TODO to wire up a test for this. I don't want to hold up merging this PR though. I'll work on this later this week.


pkg/rpc/context_test.go line 2715 at r14 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nit: DialTimeout doesn't come into play here because the closed listener will respond with an immediate TCP RST packet. I'm not sure we can easily test DialTimeout here since we have to fiddle with the OS TCP stack.

Updated the comment.

@andrewbaptist (Collaborator, Author): bors r=erikgrinaker

@craig (craig bot) commented Mar 8, 2023: Build failed (retrying...)

@craig (craig bot) commented Mar 8, 2023: Build failed

Fixes: cockroachdb#84289

Previously, a one-way partition, in which a node could initiate a successful
TCP connection in one direction but the reverse connection failed, caused
problems. The node that initiates outgoing connections can acquire leases
and cause failures for reads and writes to those ranges. This is particularly
a problem if it acquires the liveness range leases, but it is a problem even
for other ranges.

This commit adds an additional check during server-to-server
communication where the recipient of a new PingRequest first validates
that it is able to open a reverse connection to the initiator before
responding. Additionally, it monitors over time whether it still has a
successful reverse connection and asynchronously re-validates reverse
connections to the sender. The ongoing validation is asynchronous to avoid
adding delays to PingResponses, since they are used for measuring clock
offsets.

Also, the onlyOnceDialer is intended to prevent retrying after a dial error;
however, it can get into a state where it continually retries for certain
network failures. This is not easy to reproduce in a unit test, as it
requires killing the connection using iptables (normal closes don't cause
this).

After this change, the onlyOnceDialer will no longer repeatedly retry to
reconnect after a broken connection during setup.

Release note (bug fix): RPC connections between nodes now require the
connection to be established in both directions; otherwise the connection
will be closed. This is done to prevent asymmetric network partitions, where
nodes are able to send outbound messages but not receive inbound messages,
which could result in persistent unavailability. This behavior can be
disabled via the cluster setting rpc.dialback.enabled.

Epic: CRDB-2488
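
For illustration, a self-contained sketch of the "dial at most once" behavior described above; this is not the actual rpc package implementation:

package dialback

import (
	"context"
	"errors"
	"net"
	"sync"
)

// onlyOnceDialer (sketch): after the first dial attempt, any further redial
// attempt returns an error instead of silently retrying, which is the
// behavior the commit restores for connections broken during setup.
type onlyOnceDialer struct {
	mu     sync.Mutex
	dialed bool
	err    error
}

func (d *onlyOnceDialer) dial(ctx context.Context, addr string) (net.Conn, error) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.dialed {
		if d.err == nil {
			d.err = errors.New("connection already dialed; refusing to redial")
		}
		return nil, d.err
	}
	d.dialed = true
	conn, err := (&net.Dialer{}).DialContext(ctx, "tcp", addr)
	d.err = err
	return conn, err
}
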
@andrewbaptist (Collaborator, Author): bors r=erikgrinaker

@craig (craig bot) commented Mar 9, 2023: Build succeeded

@craig craig bot merged commit 4dc9e98 into cockroachdb:master Mar 9, 2023
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Apr 7, 2023
This was added recently, in cockroachdb#94778, and contributes to log spam of the
following sort:

I230404 15:00:33.826337 2400 rpc/context.go:2249  [T1,n1,rnode=2,raddr=127.0.0.1:55941,class=default,rpc] 268  connection heartbeat loop ended with err: <nil>
I230404 15:00:33.826338 3986 rpc/context.go:2249  [T1,n2,rnode=3,raddr=127.0.0.1:55955,class=system,rpc] 269  connection heartbeat loop ended with err: <nil>
I230404 15:00:33.826367 3455 rpc/context.go:2249  [T1,n2,rnode=3,raddr=127.0.0.1:55955,class=default,rpc] 270  connection heartbeat loop ended with err: <nil>
I230404 15:00:33.826394 3354 rpc/context.go:2249  [T1,n2,rnode=2,raddr=127.0.0.1:55941,class=default,rpc] 271  connection heartbeat loop ended with err: <nil>

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Apr 27, 2023 (same commit message and log excerpt as above)
irfansharif added a commit to irfansharif/cockroach that referenced this pull request May 7, 2023 (same commit message and log excerpt as above)
@andrewbaptist andrewbaptist deleted the 221228.network-partition branch December 15, 2023 21:33
Successfully merging this pull request may close these issues:

rpc: refuse incoming connections unless dialback succeeds