New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
network: connection deduplication #4695
network: connection deduplication #4695
Conversation
Now, via headers passed between peers, an Identity Challenge is passed between peers. This object is used for cryptographic identification, currently for connection deduplication.
Also clean up debug logging lines and reorganize code slightly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should re-read netidentity.go more carefully, but a couple early notes
Codecov Report
@@ Coverage Diff @@
## master #4695 +/- ##
==========================================
+ Coverage 53.42% 53.62% +0.19%
==========================================
Files 431 432 +1
Lines 54372 54541 +169
==========================================
+ Hits 29050 29249 +199
+ Misses 23062 23039 -23
+ Partials 2260 2253 -7
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I think that got everything.
like prioScheme, only initialize identityScheme if it is enabled
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me in general, my comments and questions are minor and not-blocking
network/netidentity.go
Outdated
// as a header. It returns the identityChallengeValue used for this challenge, so the network can | ||
// confirm it later (by passing it to VerifyResponse), or returns an empty challenge if dedupName is | ||
// not set. | ||
func (i identityChallengePublicKeyScheme) AttachChallenge(attach http.Header, addr string) identityChallengeValue { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, feel free to drop: I'd prefer the attach
variable name to be something like attachTo
or baseHeaders
. To me attach
is ambiguous and the first parse of it sounds like the attachment itself rather than the target.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Easy enough! consider it done.
err := protocol.Decode(message.Data, &msg) | ||
if err != nil { | ||
networkPeerIdentityError.Inc(nil) | ||
peer.net.log.With("err", err).With("remote", peer.OriginAddress()).With("local", localAddr).Warn("peer identity verification could not be decoded, disconnecting") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to add a short named field for reason instead of having that information in the Message string itself?
It's a low cardinality value and it would make looking at the breakdown of reasons for identity related disconnects easier without breaking out the telemetry metric
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When a wsNetwork disconnects a peer, it has either the exported Disconnect(peer)
method, or the internal disconnect(peer, reason)
method. Since we are using the internal, I added a reason "disconnectDuplicateConnection". So, the accounting should already be handled.
Plus, all non-connections (due to duplicate or error) trigger metric counter increments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that made me think, what do you think of changing all the Disconnect(peer) calls added in this file where we disconnect due to identity failures to use a new disconnectReason
like disconnect(peer, disconnectBadIdentityData)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did it: 5d9c380
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's partially what I was thinking but even more granular on the reason for what went wrong in the identity. It's findable right now though via the message field in Elastic Search though so that's good enough.
) | ||
|
||
// The following msgp objects are implemented in this file: | ||
// disconnectReason |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like it's not hurting but do we actually need to generate methods for this type or is it safe to msgp:ignore
it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No clue, my goal was to get msgp working for the networking package. As others start using msgp in networking, they'll need to make sure the generated files work as intended for them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My guess is not if you are not using it and network didn't run msgp:generate
prior to your commit. nbd but you can msgp:ignore
it which will cut down a few tests and remove unnecessary interface implementation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, disconnectReason should be internal only and can be ignored
// will prevent this connection from attempting in the first place | ||
// in the real world, that isConnectedTo doesn't always trigger, if the hosts are behind | ||
// a load balancer or other NAT | ||
if _, ok := netA.tryConnectReserveAddr(addrB); ok || true { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're testing for ok || true
here but I assume this really means require.True(t, ok)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually it means you don't need to call tryConnectReserveAddr at all, right? Since the A->B conn is already connected, ok
will always be false
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right. I wrote it like this for a couple reasons:
-
All the tests I wrote that use tryConnect attempt a
tryConnectReserveAddr
because it's how tryConnect gets used in the wsNetwork. In this case I really don't care about the value, but I would rather explicitly overload it and demonstrate/explain the difference than simply not include it. -
At the original time of writing, I was unsure of the potential side effects that
tryConnectReserveAddr
would have on the upcomingtryConnect
, so I wanted to make sure it matched the real-world code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case it seems like you would want to assert that ok
is false to assert the behavior that tryConnectReserveAddr wouldn't let you connect twice to the same address, and that it didn't lose track of the first conn's address.
have mockIdentityTracker wrap real identityTracker
Identity tests require a full HTTP to Web-Socket peering, and are then checked for expected for the expected peerings. This means some time is required to let the connection fully settle, and let each network's readLoop for the peer find potentially closed connections. Also removes a duplicate test case
@@ -1165,12 +1214,14 @@ func (wn *WebsocketNetwork) ServeHTTP(response http.ResponseWriter, request *htt | |||
prioChallenge: challenge, | |||
createTime: trackedRequest.created, | |||
version: matchingVersion, | |||
identity: peerID, | |||
identityChallenge: peerIDChallenge, | |||
identityVerified: 0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: identityVerified is zero so it doesn't need to be listed here (and the other wsPeer struct you fill out an identity: peerID
below doesn't have the zero value initialized)
const maxAddressLen = 256 + 32 // Max DNS (255) + margin for port specification | ||
|
||
// identityChallengeValue is 32 random bytes used for identity challenge exchange | ||
type identityChallengeValue [32]byte |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit
type identityChallengeValue [32]byte | |
type identityChallengeValue crypto.Digest |
return identityChallengeValue{}, crypto.PublicKey{}, err | ||
} | ||
if !idChal.Verify() { | ||
return identityChallengeValue{}, crypto.PublicKey{}, fmt.Errorf("identity challenge incorrectly signed") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use named error created with errors.New(...)
we are trying to eliminate errors literals and new code should not add more
return crypto.PublicKey{}, []byte{}, err | ||
} | ||
if resp.Msg.Challenge != c { | ||
return crypto.PublicKey{}, []byte{}, fmt.Errorf("challenge response did not contain originally issued challenge value") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
named errors
return crypto.PublicKey{}, []byte{}, fmt.Errorf("challenge response did not contain originally issued challenge value") | ||
} | ||
if !resp.Verify() { | ||
return crypto.PublicKey{}, []byte{}, fmt.Errorf("challenge response incorrectly signed ") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
named errors
if err != nil { | ||
networkPeerIdentityError.Inc(nil) | ||
peer.net.log.With("err", err).With("remote", peer.OriginAddress()).With("local", localAddr).Warn("peer identity verification could not be decoded, disconnecting") | ||
peer.net.disconnect(peer, disconnectBadIdentityData) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why disconnecting right here instead of returning Action=Disconnect as other handler do?
if ok { | ||
url, err := url.Parse(addr) | ||
if err == nil { | ||
wn.config.PublicAddress = fmt.Sprintf("%s:%s", url.Hostname(), url.Port()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
modifying config that has constant/read only semantic in random parts of codebase is not good at all. Why not a local var since all the usage of it is right here?
Summary
What
Introduces "Identity Challenge" exchange during peering. This is a process in which two peers exchange signed challenges to register one another with public keys that validate their identities. This validated identity is then used as the mechanism to prevent duplicate and bidirectional connections between peers on the network.
Why
Today, Algorand nodes connect outwardly to a number of peers on the network (default of 4). Peer selection strategies within the node will regularly drop the worst connection, and seek a random new peer. This means that over time, nodes will select highly performant connections. However, it also means the likelihood of peers connecting to one another bidirectionally is high, since highly performant connections would tend to be mutually preferred.
Why can't we simply deduplicate against addresses, Telemetry ID, or other basic node information? Because Algorand is distributed, nodes need a way to trust the identity being presented as secure. A bad faith actor could impersonate a node's identity, and "claim" connections to all relay peers, which is a denial of service vector. Instead, we need to be able to validate a connection against a secret identity on the node. A cryptographic solution is the obvious one, where peers exchange signed messages.
How
From the package comment:
Additional Testing Feature:
Public Address: auto
Note: use only for local networks and performance tests.
Thank you very much @brianolson for this idea -- in order to enable IdentityExchange, nodes must have set their PublicAddress to what they expect the DNS knows them to be. This can be tricky in test scenarios (or even live scenarios) where the exact Public Address wouldn't be easily determined -- if "auto" is specified, the PublicAddress value is set once the listener is up. This allows for simpler tests, and could also have utility for node operators who maybe don't want to explicitly set Public Address.
Test Plan
Unit Tests:
Identity Challenges and Responses:
Identity Tracker Tests:
wsNetwork
Manual Testing
(last done Jan17)
tryConnect
. If the receiver does not have a ConectionDeduplicationName, the header payload is never even checked, and all peering continues without identity, as expected.setPeersByID
method to return false (indicating the key is in use) and observed that the connection is disconnected at the point that the peer has a verified identity, as expected.Potential outcomes: