
server,rpc: validate node IDs in RPC heartbeats #34197

Merged
merged 2 commits into cockroachdb:master from knz:20190123-rpc-id on Apr 24, 2019

Conversation

4 participants
@knz
Member

commented Jan 23, 2019

Fixes #34158.

Prior to this patch, it was possible for an RPC client to dial a node
ID and get a connection to another node instead. This is because the
mapping of node ID -> address may be stale, and a different node could
take the address of the intended node from "under" the dialer.

(See #34155 for a scenario.)

This happened to be "safe" in many cases where it matters because:

  • RPC requests for distSQL are OK with being served on a different
    node than intended (with potential performance drop);
  • RPC requests to the KV layer are OK with being served on a different
    node than intended (they would route underneath);
  • RPC requests to the storage layer are rejected by the
    remote node because the store ID in the request would not match.

However, this safety is largely accidental, and we should not work
under the assumption that any RPC request is safe to be mis-routed. (In
fact, we have not audited all the RPC endpoints and cannot establish
that this safety holds throughout.)

This patch works to prevent these mis-routings by adding a check of
the intended node ID during RPC heartbeats (including the initial
heartbeat), when the intended node ID is known. A new API
GRPCDialNode() is introduced to establish such connections.

Release note (bug fix): CockroachDB now performs fewer attempts to
communicate with the wrong node, when a node is restarted with another
node's address.

@knz knz requested review from tbg and petermattis Jan 23, 2019

@knz knz requested review from cockroachdb/core-prs as code owners Jan 23, 2019

@cockroach-teamcity

Member

commented Jan 23, 2019

This change is Reviewable

@knz

Member Author

commented Jan 23, 2019

A review of this code was made in #34155 (review)

@knz

Member Author

commented Jan 23, 2019

@petermattis I've seen your review on context.go:

One problem with this approach is that we won't remove this conn if there is a heartbeat or RPC error: we'll only remove the entry with the non-zero remoteNodeID.

Yes and what's the problem with that?

My understanding is that if the conn object gets closed on the non-zero path, it's going to be marked as unusable (the grpc code marks it as closed) and so further zero-nodeid rpc activity (gossip) will cause a re-dial. Is my understanding wrong?

@knz

Member Author

commented Jan 23, 2019

@petermattis you're going to love this. multiTestContext in storage/client_test.go uses a single rpc.Context for all the simulated nodes.

This is why enforcing args.NodeID != 0 && args.NodeID != nodeID in (*HeartbeatService).Ping() does not work: the nodeID comes from the current rpc.Context, but there is one heartbeat service per simulated node.

We have no way to distinguish the services (and give each heartbeat service a different value for nodeID) short of instantiating a separate rpc.Context for each simulated server.

This would require a major change to multiTestContext (quite out of my league).

Hence my proposal to revert the condition to args.NodeID != 0 && nodeID != 0 && args.NodeID != nodeID as I had initially.
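
For reference, a minimal sketch of that relaxed check; the types and helper are simplified stand-ins, and only the condition itself (plus the error wording that later shows up in the logs) comes from this discussion:

```go
package rpcsketch

import "fmt"

// NodeID stands in for roachpb.NodeID in this sketch.
type NodeID int32

// checkPeerNodeID sketches the relaxed validation proposed above: the
// check only fires when both the client-requested ID and the server's
// own ID are known (non-zero), so a shared rpc.Context (as in
// multiTestContext) or an ID-less dial (gossip, CLI) is not rejected.
func checkPeerNodeID(requested, own NodeID) error {
	if requested != 0 && own != 0 && requested != own {
		return fmt.Errorf(
			"client requested node ID %d doesn't match server node ID %d",
			requested, own)
	}
	return nil
}
```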

@knz knz force-pushed the knz:20190123-rpc-id branch 3 times, most recently from 5499d8d to 176b3c1 Jan 23, 2019

@tbg

Member

commented Jan 24, 2019

Hence my proposal to revert the condition to args.NodeID != 0 && nodeID != 0 && args.NodeID != nodeID as I had initially.

Before you do that, I would special-case nodeID -1 to mean "this is the multiTestContext, please let me do bad things".

@knz knz force-pushed the knz:20190123-rpc-id branch from 176b3c1 to d999e24 Jan 24, 2019

@knz

Member Author

commented Jan 24, 2019

Before you do that, I would special-case nodeID -1 to mean "this is the multiTestContext, please let me do bad things".

Done, with a testing knob instead.

@tbg
Member

left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @petermattis, and @tbg)


pkg/rpc/heartbeat.go, line 85 at r1 (raw file):

// populate separate node IDs for each heartbeat service.
// The returned callback should be called to cancel the effect.
func TestingAllowNamedRPCToAnonymousServer() func() {

This is a dated way of tweaking behavior for tests. Can you put a boolean on *HeartbeatService and Context (for plumbing from the latter into the former in NewServerWithInterceptor)? This is a bit more work, but now it's unclear that we'll really undo this change when an mtc-based test fails. I really want to avoid hacks like this, sorry to make you jump through another hoop. So concretely my suggestion is:

  1. add a TestingNoValidateNodeIDs into Context
  2. in NewServerWithInterceptor, carry the bool over into HeartbeatService
  3. use the bool in Ping

Perhaps there's a more direct way, but we never hold on to the heartbeat service (we just stick it into a grpc server) and doing so wouldn't add clarity (then it would seem that one could replace the heartbeat service, but that wouldn't work).
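
A rough sketch of that plumbing, using simplified stand-ins for rpc.Context and rpc.HeartbeatService; the knob name follows the suggestion above, while the constructor and Ping signature are condensed for illustration:

```go
package rpcsketch

import "fmt"

// NodeID stands in for roachpb.NodeID.
type NodeID int32

// Context is a simplified stand-in for rpc.Context; the knob name
// follows the suggestion above.
type Context struct {
	NodeID NodeID
	// TestingNoValidateNodeIDs disables node-ID validation in Ping, for
	// tests (e.g. multiTestContext) that share one rpc.Context across
	// all simulated nodes.
	TestingNoValidateNodeIDs bool
}

// HeartbeatService is a simplified stand-in for rpc.HeartbeatService.
type HeartbeatService struct {
	nodeID            NodeID
	noValidateNodeIDs bool
}

// newHeartbeatService mimics what NewServerWithInterceptor would do:
// carry the knob from the Context into the HeartbeatService it registers.
func newHeartbeatService(rpcCtx *Context) *HeartbeatService {
	return &HeartbeatService{
		nodeID:            rpcCtx.NodeID,
		noValidateNodeIDs: rpcCtx.TestingNoValidateNodeIDs,
	}
}

// Ping consults the knob before validating the requested node ID.
func (hs *HeartbeatService) Ping(requestedNodeID NodeID) error {
	if !hs.noValidateNodeIDs && requestedNodeID != 0 && requestedNodeID != hs.nodeID {
		return fmt.Errorf("client requested node ID %d doesn't match server node ID %d",
			requestedNodeID, hs.nodeID)
	}
	return nil
}
```

Compared to the TestingAllowNamedRPCToAnonymousServer() callback, the knob lives next to the rpc.Context that owns the behavior, so mtc-based tests opt out explicitly rather than through package-level state.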

@tbg
Member

left a comment

Reviewed 18 of 18 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @petermattis)

@petermattis

Contributor

commented Jan 24, 2019

I'm still a bit anxious about this PR (I just realized that 2 RPC connections means we'll be exchanging clock synchronization twice per node), and I won't have time to properly think about it until tomorrow. I appreciate the work you're doing here @knz, but I'd like to request that we don't try to rush this in.

@knz knz force-pushed the knz:20190123-rpc-id branch from d999e24 to 3e52502 Jan 25, 2019

@knz

Member Author

commented Jan 25, 2019

Can you put a boolean on *HeartbeatService and Context (for plumbing from the latter into the former in NewServerWithInterceptor)?

Done, RFAL

(I had no choice but to do that anyway, otherwise the race detector didn't like me during stress runs)

@knz knz force-pushed the knz:20190123-rpc-id branch from 3e52502 to f2506c7 Jan 25, 2019

@knz

Member Author

commented Jan 25, 2019

@petermattis

I'm still a bit anxious about this PR (I just realized that 2 RPC connections means we'll be exchanging clock synchronization twice per node)

I think that since I'm still doing the thing to share the *Connection object between gossip and "named" (specific node ID) dials, there should be just 1 RPC connection between two nodes.

@petermattis
Contributor

left a comment

I think that since I'm still doing the thing to share the *Connection object between gossip and "named" (specific node ID) dials, there should be just 1 RPC connection between two nodes.

I think gossip will frequently be the first connection to a remote node. That will prohibit sharing a connection.

Yes and what's the problem with that?

My understanding is that if the conn object gets closed on the non-zero path, it's going to be marked as unusable (the grpc code marks it as closed) and so further zero-nodeid rpc activity (gossip) will cause a re-dial. Is my understanding wrong?

I'm not sure what problems that will cause. I don't think this scenario is adequately tested to give us any confidence that the right thing occurs. Feel free to point to tests that I'm missing. My anxiety here is about relaxing a previous invariant: closing a connection removes the connection from the connection map.

I'm wondering what the aim of this PR should be. Should we be trying to make it impossible to send an RPC to a remote node with the wrong node ID? That's noble (and perhaps the right thing to do), but also risky in the short term. Another approach would be to focus on reducing the log spam (that's the near-term impetus for a change). For example, nodeDialer could poison the local gossip address cache for a node if it discovers the remote node has an unexpected ID (a hypothetical sketch follows below). I think that could be done while leaving the current connection behavior in place.

I apologize for the strictness of this review. I'm happy to take this change over if you'd like.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @petermattis, and @tbg)
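
For concreteness, a hypothetical sketch of the cache-poisoning alternative mentioned in the comment above. Nothing in this PR implements it; the type, field, and method names here are invented for illustration:

```go
package rpcsketch

import "sync"

// NodeID stands in for roachpb.NodeID.
type NodeID int32

// addrCache is a toy model of the dialer's view of the gossiped
// node ID -> address mapping; neither nodeDialer nor gossip actually
// looks like this.
type addrCache struct {
	mu       sync.Mutex
	addrs    map[NodeID]string
	poisoned map[NodeID]bool
}

// markMismatch would be called when a heartbeat reveals that the address
// cached for `id` answered as a different node; later dials then skip the
// stale entry until gossip refreshes it.
func (c *addrCache) markMismatch(id, actual NodeID) {
	if id == actual {
		return
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.poisoned == nil {
		c.poisoned = map[NodeID]bool{}
	}
	c.poisoned[id] = true
}

// lookup returns the cached address unless the entry has been poisoned.
func (c *addrCache) lookup(id NodeID) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.poisoned[id] {
		return "", false
	}
	addr, ok := c.addrs[id]
	return addr, ok
}
```

The point of the suggestion is that the existing connection behavior would stay untouched; only the address lookup would change.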

@knz

Member Author

commented Jan 26, 2019

Should we be trying to make it impossible to send an RPC to a remote node with the wrong node ID? That's noble (and perhaps the right thing to do),

The evidence (based on my limited testing) is that we do not have a disciplined approach to prevent mis-routed RPCs from being incorrectly served and breaking invariants. See the commit message at the top of this PR. On its own that's an important goal.

but also risky in the short term.

Can you clarify why?

For example, nodeDialer could poison the local gossip address cache for a node if it discovers the remote node has an unexpected ID.

We can certainly also do this (the current PR, as it stands, does not do this yet). I'm happy to add this.

@knz knz force-pushed the knz:20190123-rpc-id branch from f2506c7 to f3c1744 Jan 26, 2019

@tbg tbg added this to Cold storage in Core Jan 28, 2019

@tbg tbg moved this from Cold storage to Incoming in Core Jan 28, 2019

@knz knz force-pushed the knz:20190123-rpc-id branch from f3c1744 to 7b0309b Apr 15, 2019

@knz

Member Author

commented Apr 15, 2019

(rebased PR - will check what CI fallout we get)

@knz knz force-pushed the knz:20190123-rpc-id branch 4 times, most recently from 3e2e015 to 1ef60f1 Apr 15, 2019

@tbg

Member

commented Apr 15, 2019

Could you update the commit/PR message to indicate what the behavior you're introducing is? I gather that outbound connections to NodeID zero don't validate the peer. From reading the code, I think the situation is as follows:

  1. zero given, nothing cached: creates new connection that doesn't validate NodeID.
  2. NodeID given, but cached with zero: opens new connection, leaves old connection in place (so dialing to zero later still gives the unvalidated conn back)
  3. zero given, cached with nonzero: will use the cached connection.

Please state that (in better words) and give a bit of motivation. It'd also be good to learn who doesn't necessarily supply a NodeID. I think we should consider removing the nodeID-less API completely and force callers to explicitly pass a nodeID (even if it's zero) to make sure we don't regress in silly ways.

The major problem (but it's not a new problem) here seems to be that these unvalidated connections have no reason to ever go away. This is just fallout of the fact that we don't have a mechanism to gracefully tear down connections at all. We don't want to just brutally close the old connection because that gives ugly errors. And yet we know that consumers of a connection can hold on to it for essentially forever (for example Raft transport). Fixing this seems out of scope here and will require all callers to buy into the fact that they'll need to switch over at some point. It'll be a while until that's worth it.

I also looked into @petermattis' concern about clock offsets. We will have multiple connections open but the clock offset measurements key only on the address (i.e. we won't double count the measurements, though we'll measure and verify more frequently). That seems like it should be fine, but should also be mentioned in the commit, or even in the code, where appropriate.

I still have to review the code in detail, but it looked pretty solid. @petermattis had a good point that putting a connection in multiple slots in the map (zero and nonzero) might cause new behavior that needs to be tested appropriately.
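
A toy model of the three dial cases listed above, assuming the connection map is keyed by target address plus the requested node ID; the key shape, dialFn, and types are assumptions for illustration, and the real rpc.Context bookkeeping may differ:

```go
package rpcsketch

import "sync"

// NodeID stands in for roachpb.NodeID; connection stands in for
// *rpc.Connection.
type NodeID int32

type connection struct{ validatedNodeID NodeID }

// connKey is an assumed key shape: target address plus the node ID the
// caller asked for (zero meaning "no validation requested").
type connKey struct {
	addr   string
	nodeID NodeID
}

type connMap struct {
	mu    sync.Mutex
	conns map[connKey]*connection
}

// dial reproduces the three cases listed above. dialFn stands in for the
// actual gRPC dial and is part of the sketch, not the real API.
func (m *connMap) dial(addr string, nodeID NodeID, dialFn func() *connection) *connection {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.conns == nil {
		m.conns = map[connKey]*connection{}
	}
	// Exact hit (validated or not): reuse it.
	if c, ok := m.conns[connKey{addr, nodeID}]; ok {
		return c
	}
	// Case 3: zero requested, but a validated connection to this address
	// already exists: reuse it instead of opening an unvalidated one.
	if nodeID == 0 {
		for k, c := range m.conns {
			if k.addr == addr && k.nodeID != 0 {
				return c
			}
		}
	}
	// Cases 1 and 2: open a new connection under the requested key. Any
	// pre-existing zero-keyed (unvalidated) connection is left in place,
	// so a later zero-ID dial still gets the unvalidated conn back.
	c := dialFn()
	c.validatedNodeID = nodeID
	m.conns[connKey{addr, nodeID}] = c
	return c
}
```

The important property is that an unvalidated (zero-keyed) connection never satisfies a nonzero-ID dial, while a validated connection can serve an ID-less caller.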

@knz knz force-pushed the knz:20190123-rpc-id branch from 1ef60f1 to a4eb29f Apr 15, 2019

@knz

Member Author

commented Apr 15, 2019

Could you update the commit/PR message to indicate what the behavior you're introducing is?
Please state that (in better words) and give a bit of motivation.

Done.

It'd also be good to learn who doesn't necessarily supply a NodeID.

Done (Gossip + CLI client commands)

I think we should consider removing the nodeID-less API completely and force callers to explicitly pass a nodeID (even if it's zero) to make sure we don't regress in silly ways.

Agreed! I added a 2nd commit to do exactly that. PTAL!

The major problem (but it's not a new problem) here seems to be that these unvalidated connections have no reason to ever go away.

  • for gossip it does not matter
  • for CLI clients it does not matter
  • there are no other users
  • so we're fine!

@knz knz force-pushed the knz:20190123-rpc-id branch from a4eb29f to 9d2c1e4 Apr 15, 2019

@knz knz requested a review from cockroachdb/cli-prs as a code owner Apr 15, 2019

@knz knz force-pushed the knz:20190123-rpc-id branch from 9d2c1e4 to 25747e5 Apr 16, 2019

@tbg

tbg approved these changes Apr 17, 2019

Member

left a comment

Before you merge this, can you set up a local three node cluster (or just run the tpcc headroom roachtest) and watch the logs for annoying errors, especially early as the cluster is brought up? I don't expect there to be much (that isn't already there before this PR) but it's worth a look.

Thanks for pulling through!

Reviewed 2 of 8 files at r2, 16 of 16 files at r3, 17 of 17 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz)


pkg/cli/start.go, line 1137 at r4 (raw file):

	// We use GRPCGossipDial() here because it does not matter
	// which node we're talking to.
	conn, err := rpcContext.GRPCGossipDial(addr).Connect(ctx)

how about GRPCUnvalidatedDial?


pkg/rpc/heartbeat.go, line 60 at r3 (raw file):

	// currently used by the multiTestContext which does not suitably
	// populate separate node IDs for each heartbeat service.
	// The returned callback should be called to cancel the effect.

Remove this line.


pkg/server/server.go, line 1493 at r1 (raw file):

	// Now that we have a node ID, ensure that incoming RPC connections
	// are validated against this node ID.
	s.rpcContext.NodeID.Set(ctx, s.NodeID())

Why did you (have to) lower this into *Node?

knz added some commits Jan 22, 2019

server,rpc: validate node IDs in RPC heartbeats
Prior to this patch, it was possible for an RPC client to dial a node
ID and get a connection to another node instead. This is because the
mapping of node ID -> address may be stale, and a different node could
take the address of the intended node from "under" the dialer.

(See #34155 for a scenario.)

This happened to be "safe" in many cases where it matters because:

- RPC requests for distSQL are OK with being served on a different
  node than intended (with potential performance drop);
- RPC requests to the KV layer are OK with being served on a different
  node than intended (they would route underneath);
- RPC requests to the storage layer are rejected by the
  remote node because the store ID in the request would not match.

However, this safety is largely accidental, and we should not work
under the assumption that any RPC request is safe to be mis-routed. (In
fact, we have not audited all the RPC endpoints and cannot establish
that this safety holds throughout.)

This patch works to prevent these mis-routings by adding a check of
the intended node ID during RPC heartbeats (including the initial
heartbeat), when the intended node ID is known. A new API
`GRPCDialNode()` is introduced to establish such connections.

This behaves as follows:

- node ID zero given, no connection cached: creates new connection
  that doesn't validate NodeID.

  This is suitable for the initial GRPC handshake during gossip,
  before node IDs are known. It is also suitable for the CLI
  commands which do not care about which node they are talking to (and
  they do not know the node ID yet -- only the RPC address).

- nonzero NodeID given, but connection cached with node ID zero: opens
  new connection, leaves old connection in place (so dialing to node
  ID zero later still gives the unvalidated conn back.)

  This is suitable when setting up e.g. Raft clients after the
  peer node IDs are determined. At this point we want to introduce
  node ID validation.

  The old connection remains in place because the gossip code does not
  react well to having its connection closed from "under it".

- zero given, cached with nonzero: will use the cached connection.

  This is suitable when gossip needs to verify e.g. the health of
  some remote node known only by its address. In this case it's OK
  to have it use the connection that is already established.

This flexibility suggests that it is possible for client components to
"opt out" of node ID validation by specifying a zero value, in places
other than those strictly necessary for gossip and CLI commands. In fact,
the situation is even more uncomfortable: it requires extra work
to set up the node ID, and naive test code will be opting out of
validation implicitly, without clear feedback. This mis-design is
addressed by a subsequent commit.

Release note (bug fix): CockroachDB now performs fewer attempts to
communicate with the wrong node, when a node is restarted with another
node's address.
rpc: remove GRPCDial() and disallow anonymous non-gossip connections
The previous patch introduced node ID verification for GRPC
connections but preserved the `GRPCDial()` API, alongside
the ability to use node ID 0 with `GRPCDialNode()`, to signal
that node ID verification should be disabled.

Further examination revealed that this flexibility is 1) hard to
reason about and 2) unneeded.

So instead of keeping this option and then investing time into
producing tests for all the combinations of verification protocols,
this patch "cuts the Gordian knot" by removing this flexibility
altogether.

In summary:

- `GRPCDial()` is removed.
- `GRPCDialNode()` will call log.Fatal() if provided a 0 node ID.
- `GRPCUnvalidatedDial()` is introduced, with a clarification about
  its contract. I have audited the code to validate that this is
  indeed only used by gossip, and the CLI client commands that really
  don't care about the node ID.

Release note: None
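
Taken together, the two commits leave callers with exactly two dial paths. A sketch of the resulting usage follows; the exact signatures are assumed from this discussion (and the start.go excerpt quoted earlier) rather than checked against the merged code:

```go
package rpcsketch

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/rpc"
	"google.golang.org/grpc"
)

// dialValidated is the path for callers that know which node they want
// (Raft transport, DistSQL, KV): the node ID is checked during the
// connection heartbeat, and the dial fails if the peer answers as a
// different node. Signature assumed from the PR discussion; check
// pkg/rpc for the actual API.
func dialValidated(ctx context.Context, rpcCtx *rpc.Context, addr string, id roachpb.NodeID) (*grpc.ClientConn, error) {
	return rpcCtx.GRPCDialNode(addr, id).Connect(ctx)
}

// dialUnvalidated is the path for gossip bootstrap and CLI client
// commands, which only know an address and accept whichever node answers.
func dialUnvalidated(ctx context.Context, rpcCtx *rpc.Context, addr string) (*grpc.ClientConn, error) {
	return rpcCtx.GRPCUnvalidatedDial(addr).Connect(ctx)
}
```

The design choice is that the unvalidated path is explicit and easy to grep for, and GRPCDialNode() refusing a zero node ID keeps new callers from opting out of validation by accident.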

@knz knz force-pushed the knz:20190123-rpc-id branch from 25747e5 to 6641499 Apr 18, 2019

@knz
Member Author

left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/cli/start.go, line 1137 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

how about GRPCUnvalidatedDial?

I'll trust your judgement.


pkg/rpc/heartbeat.go, line 85 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This is a dated way of tweaking behavior for tests. Can you put a boolean on *HeartbeatService and Context (for plumbing from the latter into the former in NewServerWithInterceptor)? This is a bit more work, but now it's unclear that we'll really undo this change when an mtc-based test fails. I really want to avoid hacks like this, sorry to make you jump through another hoop. So concretely my suggestion is:

  1. add a TestingNoValidateNodeIDs into Context
  2. in NewServerWithInterceptor, carry the bool over into HeartbeatService
  3. use the bool in Ping

Perhaps there's a more direct way, but we never hold on to the heartbeat service (we just stick it into a grpc server) and doing so wouldn't add clarity (then it would seem that one could replace the heartbeat service, but that wouldn't work).

Done.


pkg/rpc/heartbeat.go, line 60 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Remove this line.

Done.


pkg/server/server.go, line 1493 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Why did you (have to) lower this into *Node?

I think you're misreading the code and commenting on a rebase diff. I haven't changed anything here.

@knz
Member Author

left a comment

Before you merge this, can you set up a local three node cluster [...] and watch the logs for annoying errors, especially early as the cluster is brought up?

I have used my testing script from earlier and I didn't find anything suspicious.

I did find what I was looking for, however:

I190418 13:02:08.329989 333 rpc/nodedialer/nodedialer.go:143  [n1] unable to connect to n4: failed to connect to n4 at localhost:26004: initial connection heartbeat failed: rpc error: code = Unknown desc = client requested node ID 4 doesn't match server node ID 3

(the new check)

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)

@tbg

tbg approved these changes Apr 24, 2019

Member

left a comment

:lgtm:

Reviewed 19 of 19 files at r5, 17 of 17 files at r6.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained

@knz

Member Author

commented Apr 24, 2019

Thank you

bors r=tbg

craig bot pushed a commit that referenced this pull request Apr 24, 2019

Merge #34197 #36952
34197: server,rpc: validate node IDs in RPC heartbeats r=tbg a=knz

36952: storage: deflake TestNodeLivenessStatusMap r=tbg a=knz

Fixes #35675.

Prior to this patch, this test would fail under `stressrace` after a few
dozen iterations.

With this patch, `stressrace` succeeds for thousands of iterations.

I have checked that the test logic is preserved: if I change one of
the expected statuses in `testData`, the test still fails properly.

Release note: None

Co-authored-by: Raphael 'kena' Poss <knz@cockroachlabs.com>
@craig


commented Apr 24, 2019

Build succeeded

@craig craig bot merged commit 6641499 into cockroachdb:master Apr 24, 2019

3 checks passed

GitHub CI (Cockroach): TeamCity build finished
bors: Build succeeded
license/cla: Contributor License Agreement is signed.

Core automation moved this from Incoming to Closed Apr 24, 2019
