
network: connection deduplication #4695

Merged

Conversation

AlgoAxel
Contributor

@AlgoAxel AlgoAxel commented Oct 25, 2022

Summary

What

Introduces "Identity Challenge" exchange during peering. This is a process in which two peers exchange signed challenges to register one another with public keys that validate their identities. This validated identity is then used as the mechanism to prevent duplicate and bidirectional connections between peers on the network.

Why

Today, Algorand nodes connect outwardly to a number of peers on the network (default of 4). Peer selection strategies within the node will regularly drop the worst connection, and seek a random new peer. This means that over time, nodes will select highly performant connections. However, it also means the likelihood of peers connecting to one another bidirectionally is high, since highly performant connections would tend to be mutually preferred.

Why can't we simply deduplicate against addresses, Telemetry ID, or other basic node information? Because Algorand is distributed, nodes need a way to trust the identity being presented as secure. A bad faith actor could impersonate a node's identity, and "claim" connections to all relay peers, which is a denial of service vector. Instead, we need to be able to validate a connection against a secret identity on the node. A cryptographic solution is the obvious one, where peers exchange signed messages.

How

From the package comment:

// netidentity.go implements functionality to participate in an "Identity Challenge Exchange"
// with the purpose of identifying redundant connections between peers, and preventing them.
// The identity challenge exchange protocol is a 3 way handshake that exchanges signed messages.
//
// Message 1 (Identity Challenge): when a request is made to start a gossip connection, an
// identityChallengeSigned message is added to HTTP request headers, containing:
// - a 32 byte random challenge
// - the requester's "identity" PublicKey
// - the PublicAddress of the intended recipient
// - Signature on the above by the requester's PublicKey
//
// Message 2 (Identity Challenge Response): when responding to the gossip connection request,
// if the identity challenge is valid, an identityChallengeResponseSigned message is added
// to the HTTP response headers, containing:
// - the original 32 byte random challenge from Message 1
// - a new "response" 32 byte random challenge
// - the responder's "identity" PublicKey
// - Signature on the above by the responder's PublicKey
//
// Message 3 (Identity Verification): if the identityChallengeResponse is valid, the requester
// sends a NetIDVerificationTag message over websockets to verify it owns its PublicKey, with:
// - Signature on the response challenge from Message 2, using the requester's PublicKey
//
// Upon receipt of Message 2, the requester has enough data to consider the responder's identity "verified".
// Upon receipt of Message 3, the responder has enough data to consider the requester's identity "verified".
// At each of these steps, if the peer's identity was verified, wsNetwork will attempt to add it to the
// identityTracker, which maintains a single peer per identity PublicKey. If the identity is already in use
// by another connected peer, we know this connection is a duplicate, and can be closed.
//
// Protocol Enablement:
// This exchange is optional, and is enabled by setting the configuration value "PublicAddress" to match the
// node's public endpoint address stored in other peers' phonebooks (like "r-aa.algorand-mainnet.network:4160").
//
// Protocol Error Handling:
// Message 1
// - If the Message is not included, assume the peer does not use identity exchange, and peer without attaching an identityChallengeResponse
// - If the Address included in the challenge is not this node's PublicAddress, peering continues without identity exchange.
//   this is so that if an operator misconfigures PublicAddress, it does not decline well meaning peering attempts
// - If the Message is malformed or cannot be decoded, the peering attempt is stopped
// - If the Signature in the challenge does not verify to the included key, the peering attempt is stopped
//
// Message 2
// - If the Message is not included, assume the peer does not use identity exchange, and do not send Message 3
// - If the Message is malformed or cannot be decoded, the peering attempt is stopped
// - If the original 32 byte challenge does not match the one sent in Message 1, the peering attempt is stopped
// - If the Signature in the challenge does not verify to the included key, the peering attempt is stopped
//
// Message 3
// - If the Message is malformed or cannot be decoded, the peer is disconnected
// - If the Signature in the challenge does not verify peer's assumed PublicKey and assigned Challenge Bytes, the peer is disconnected
// - If the Message is not received, no action is taken to disconnect the peer.
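The three messages in the package comment can be sketched in a single process. This is a minimal illustration only: it uses the standard library's `crypto/ed25519` rather than the node's own crypto package, skips the header and websocket encoding entirely, and every type and function name here (`challenge`, `message1`, `message2`, `runHandshake`) is hypothetical.

```go
package main

import (
	"bytes"
	"crypto/ed25519"
	"crypto/rand"
	"fmt"
)

// challenge is a hypothetical stand-in for the 32 random challenge bytes.
type challenge [32]byte

func newChallenge() challenge {
	var c challenge
	if _, err := rand.Read(c[:]); err != nil {
		panic(err)
	}
	return c
}

// message1 and message2 are illustrative shapes, not the real wire types.
type message1 struct {
	Challenge challenge
	Key       ed25519.PublicKey
	Address   string // PublicAddress of the intended recipient
	Sig       []byte
}

type message2 struct {
	Challenge         challenge // echoed from Message 1
	ResponseChallenge challenge
	Key               ed25519.PublicKey
	Sig               []byte
}

func (m message1) payload() []byte {
	return bytes.Join([][]byte{m.Challenge[:], m.Key, []byte(m.Address)}, nil)
}

func (m message2) payload() []byte {
	return bytes.Join([][]byte{m.Challenge[:], m.ResponseChallenge[:], m.Key}, nil)
}

// runHandshake plays both sides and reports which identities end up verified.
// responderAddr is the responder's configured address; dialedAddr is the
// address the requester places in Message 1.
func runHandshake(responderAddr, dialedAddr string) (responderVerified, requesterVerified bool) {
	reqPub, reqPriv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		panic(err)
	}
	respPub, respPriv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		panic(err)
	}

	// Message 1: the requester signs its challenge, key, and the dialed address.
	m1 := message1{Challenge: newChallenge(), Key: reqPub, Address: dialedAddr}
	m1.Sig = ed25519.Sign(reqPriv, m1.payload())

	// Responder: a bad signature stops peering, and an address mismatch
	// continues without identity. Neither case verifies anyone.
	if !ed25519.Verify(m1.Key, m1.payload(), m1.Sig) || m1.Address != responderAddr {
		return false, false
	}

	// Message 2: echo the original challenge and add a fresh response challenge.
	m2 := message2{Challenge: m1.Challenge, ResponseChallenge: newChallenge(), Key: respPub}
	m2.Sig = ed25519.Sign(respPriv, m2.payload())

	// Requester: after Message 2 the responder's identity is verified.
	responderVerified = m2.Challenge == m1.Challenge &&
		ed25519.Verify(m2.Key, m2.payload(), m2.Sig)
	if !responderVerified {
		return false, false
	}

	// Message 3: the requester signs the response challenge, and the responder
	// checks it against the key presented in Message 1.
	sig3 := ed25519.Sign(reqPriv, m2.ResponseChallenge[:])
	requesterVerified = ed25519.Verify(m1.Key, m2.ResponseChallenge[:], sig3)
	return responderVerified, requesterVerified
}

func main() {
	resp, req := runHandshake("relay.example:4160", "relay.example:4160")
	fmt.Println("matching address:", resp, req)
	resp, req = runHandshake("relay.example:4160", "10.0.0.7:4160")
	fmt.Println("mismatched address:", resp, req)
}
```

The sketch collapses "peering stopped" and "peering continues without identity" into the same result, since in both cases no identity is verified. The real code distinguishes them as the error-handling list describes.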

Additional Testing Feature: Public Address: auto

Note: use only for local networks and performance tests.
Thank you very much @brianolson for this idea -- in order to enable IdentityExchange, nodes must have set their PublicAddress to what they expect the DNS knows them to be. This can be tricky in test scenarios (or even live scenarios) where the exact Public Address wouldn't be easily determined -- if "auto" is specified, the PublicAddress value is set once the listener is up. This allows for simpler tests, and could also have utility for node operators who maybe don't want to explicitly set Public Address.

Test Plan

Unit Tests:

Identity Challenges and Responses:

  • Tests across Encode / Decode and Verify of Challenge Objects

Identity Tracker Tests:

  • SetIdentity and RemoveIdentity positive and negative testing
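For readers unfamiliar with the tracker, the deduplication rule it enforces (one connected peer per identity key) could look roughly like the sketch below. This is an assumption-laden illustration, not the code in this PR: `tracker`, `peerID`, and the use of a string to stand for a connection are all hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// peerID is a hypothetical stand-in for a peer's identity public key.
type peerID [32]byte

// tracker keeps at most one connection per identity.
type tracker struct {
	mu    sync.Mutex
	peers map[peerID]string // identity -> connection label (illustrative)
}

func newTracker() *tracker {
	return &tracker{peers: make(map[peerID]string)}
}

// setIdentity registers conn under id. It returns false when a different
// connection already holds id, which signals a duplicate connection.
func (t *tracker) setIdentity(id peerID, conn string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if existing, ok := t.peers[id]; ok {
		return existing == conn
	}
	t.peers[id] = conn
	return true
}

// removeIdentity releases id only if conn is the connection that holds it,
// so a rejected duplicate cannot evict the original peer.
func (t *tracker) removeIdentity(id peerID, conn string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if existing, ok := t.peers[id]; ok && existing == conn {
		delete(t.peers, id)
	}
}

func main() {
	tr := newTracker()
	id := peerID{1}
	fmt.Println(tr.setIdentity(id, "outgoing")) // first holder is accepted
	fmt.Println(tr.setIdentity(id, "incoming")) // duplicate is rejected
	tr.removeIdentity(id, "incoming")           // no effect: not the holder
	tr.removeIdentity(id, "outgoing")           // frees the identity
	fmt.Println(tr.setIdentity(id, "incoming")) // accepted after removal
}
```

The negative cases worth covering are the two shown in `main`: a second connection claiming an identity already in use, and a removal attempted by a connection that never held it.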

wsNetwork

  • "Happy Path" test where two networks connect with identity and we confirm that deduplication works as intended
  • "Single Sided" tests where only the sender (or receiver) is set for identity, identity challenge participation stops and the peers connect normally
  • "Incorrect Address" tests where two peers connect to one another, but the "Address" field is incorrect for one of them. Peers should be able to connect without identity.
  • Protocol Error Handling: all of the error handling described above is tested via wsNetwork tests which use a mockedIdentityScheme that accepts overloaded behavior for one or more of the handshake steps. Confirms that each type of incorrect behavior produces the correct outcome (in terms of number of peers for each node)

Manual Testing

(last done Jan 17)

  • Confirmed that on a 3 node network, if no node has a ConnectionDeduplicationName set, nodes connect to one another as usual and no identity challenge exchange prevents connection.
  • Confirmed that on a 3 node network, if one node does have a ConnectionDeduplicationName set, it will construct an ID Challenge when doing tryConnect. If the receiver does not have a ConnectionDeduplicationName, the header payload is never even checked, and all peering continues without identity, as expected.
  • Confirmed that on a 3 node network, if two nodes both have ConnectionDeduplicationName set, but the specified "Address" of the challenge is wrong, the ID challenge is inspected and the peer stops participating in the identity challenge once this mismatch is discovered. Peers connect as intended without identity.
  • Overloaded the ConnectionDeduplicationName check (because the gossip names have random ports in local testing) and peered with both sides having ConnectionDeduplicationName set. Observed the whole exchange, including final WS verification from the initiating side. Peers were added to the maps as expected.
  • Further Overloaded the setPeersByID method to return false (indicating the key is in use) and observed that the connection is disconnected at the point that the peer has a verified identity, as expected.

Potential outcomes:

  1. Lower bandwidth from fewer duplicate messages
  2. Better connected graph with fewer hops across the network (txn latency)

An Identity Challenge is now passed between peers via headers.
This object is used for cryptographic identification,
currently for connection deduplication.
Also clean up debug logging lines and reorganize code slightly.

@brianolson brianolson left a comment


I should re-read netidentity.go more carefully, but a couple early notes

@algorandskiy algorandskiy changed the title from "Feature/connection deduplication" to "network: connection deduplication" Oct 26, 2022
@AlgoAxel AlgoAxel marked this pull request as ready for review October 31, 2022 20:37

codecov bot commented Nov 1, 2022

Codecov Report

Merging #4695 (5d9c380) into master (7f7939d) will increase coverage by 0.19%.
The diff coverage is 95.93%.

@@            Coverage Diff             @@
##           master    #4695      +/-   ##
==========================================
+ Coverage   53.42%   53.62%   +0.19%     
==========================================
  Files         431      432       +1     
  Lines       54372    54541     +169     
==========================================
+ Hits        29050    29249     +199     
+ Misses      23062    23039      -23     
+ Partials     2260     2253       -7     
Impacted Files Coverage Δ
config/localTemplate.go 63.26% <ø> (ø)
network/connPerfMon.go 84.13% <ø> (ø)
network/phonebook.go 83.49% <ø> (ø)
network/requestTracker.go 70.81% <ø> (ø)
network/topics.go 97.91% <ø> (ø)
network/wsNetwork.go 68.80% <88.52%> (+3.64%) ⬆️
network/netidentity.go 100.00% <100.00%> (ø)
network/wsPeer.go 71.26% <100.00%> (+3.44%) ⬆️
ledger/blockqueue.go 82.25% <0.00%> (-2.69%) ⬇️
crypto/merkletrie/trie.go 66.42% <0.00%> (-2.19%) ⬇️
... and 10 more


brianolson
brianolson previously approved these changes Nov 2, 2022

@brianolson brianolson left a comment


LGTM. I think that got everything.

@algorandskiy algorandskiy requested a review from cce November 2, 2022 21:05
@algorandskiy algorandskiy requested review from iansuvak and removed request for brianolson February 9, 2023 21:03
iansuvak
iansuvak previously approved these changes Feb 10, 2023

@iansuvak iansuvak left a comment


Looks good to me in general, my comments and questions are minor and not-blocking

// as a header. It returns the identityChallengeValue used for this challenge, so the network can
// confirm it later (by passing it to VerifyResponse), or returns an empty challenge if dedupName is
// not set.
func (i identityChallengePublicKeyScheme) AttachChallenge(attach http.Header, addr string) identityChallengeValue {
Contributor

nit, feel free to drop: I'd prefer the attach variable name to be something like attachTo or baseHeaders. To me attach is ambiguous and the first parse of it sounds like the attachment itself rather than the target.

Contributor Author

Easy enough! consider it done.

err := protocol.Decode(message.Data, &msg)
if err != nil {
networkPeerIdentityError.Inc(nil)
peer.net.log.With("err", err).With("remote", peer.OriginAddress()).With("local", localAddr).Warn("peer identity verification could not be decoded, disconnecting")
Contributor

Would it make sense to add a short named field for reason instead of having that information in the Message string itself?

It's a low cardinality value and it would make looking at the breakdown of reasons for identity related disconnects easier without breaking out the telemetry metric

Contributor Author

When a wsNetwork disconnects a peer, it has either the exported Disconnect(peer) method, or the internal disconnect(peer, reason) method. Since we are using the internal, I added a reason "disconnectDuplicateConnection". So, the accounting should already be handled.

Plus, all non-connections (due to duplicate or error) trigger metric counter increments.

Contributor

that made me think, what do you think of changing all the Disconnect(peer) calls added in this file where we disconnect due to identity failures to use a new disconnectReason like disconnect(peer, disconnectBadIdentityData)?

Contributor

@cce cce Feb 16, 2023

I did it: 5d9c380

Contributor

That's partially what I was thinking, but even more granular on the reason for what went wrong with the identity. It's findable right now via the message field in Elasticsearch, though, so that's good enough.
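To make the idea in this thread concrete, here is a rough sketch of a low-cardinality reason value kept separate from the free-form log message. The names `disconnectBadIdentityData` and `disconnectDuplicateConnection` come from the discussion above. Modeling `disconnectReason` as a string type, the string values, and the `disconnectLog` helper are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// disconnectReason is modeled as a string type here (an assumption).
type disconnectReason string

// The string values are illustrative, not the ones used by the node.
const (
	disconnectBadIdentityData     disconnectReason = "BadIdentityData"
	disconnectDuplicateConnection disconnectReason = "DuplicateConnection"
)

// disconnectLog is a hypothetical helper that counts disconnects per reason.
type disconnectLog struct {
	mu     sync.Mutex
	counts map[disconnectReason]int
}

func newDisconnectLog() *disconnectLog {
	return &disconnectLog{counts: make(map[disconnectReason]int)}
}

// record counts one disconnect under reason. detail carries the free-form,
// high-cardinality text, so it stays out of the reason field.
func (l *disconnectLog) record(reason disconnectReason, detail string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.counts[reason]++
	fmt.Printf("disconnect reason=%s detail=%q\n", reason, detail)
}

// count returns how many disconnects were recorded under reason.
func (l *disconnectLog) count(reason disconnectReason) int {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.counts[reason]
}

func main() {
	l := newDisconnectLog()
	l.record(disconnectBadIdentityData, "verification could not be decoded")
	l.record(disconnectBadIdentityData, "signature did not verify")
	l.record(disconnectDuplicateConnection, "identity already in use")
	fmt.Println(l.count(disconnectBadIdentityData), l.count(disconnectDuplicateConnection))
}
```

A more granular breakdown, as suggested, would add further constants rather than encoding the cause in the message string.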

)

// The following msgp objects are implemented in this file:
// disconnectReason
Contributor

looks like it's not hurting but do we actually need to generate methods for this type or is it safe to msgp:ignore it?

Contributor Author

No clue, my goal was to get msgp working for the networking package. As others start using msgp in networking, they'll need to make sure the generated files work as intended for them.

Contributor

My guess is not if you are not using it and network didn't run msgp:generate prior to your commit. nbd but you can msgp:ignore it which will cut down a few tests and remove unnecessary interface implementation

Contributor

Agreed, disconnectReason should be internal only and can be ignored

@iansuvak iansuvak self-requested a review February 10, 2023 22:08
// will prevent this connection from attempting in the first place
// in the real world, that isConnectedTo doesn't always trigger, if the hosts are behind
// a load balancer or other NAT
if _, ok := netA.tryConnectReserveAddr(addrB); ok || true {
Contributor

You're testing for ok || true here but I assume this really means require.True(t, ok)?

Contributor

Actually it means you don't need to call tryConnectReserveAddr at all, right? Since the A->B conn is already connected, ok will always be false

Contributor Author

That's right. I wrote it like this for a couple reasons:

  • All the tests I wrote that use tryConnect attempt a tryConnectReserveAddr because it's how tryConnect gets used in the wsNetwork. In this case I really don't care about the value, but I would rather explicitly overload it and demonstrate/explain the difference than simply not include it.

  • At the original time of writing, I was unsure of the potential side effects that tryConnectReserveAddr would have on the upcoming tryConnect, so I wanted to make sure it matched the real-world code.

Contributor

In this case it seems like you would want to assert that ok is false to assert the behavior that tryConnectReserveAddr wouldn't let you connect twice to the same address, and that it didn't lose track of the first conn's address.

cce and others added 7 commits February 13, 2023 15:47
have mockIdentityTracker wrap real identityTracker
Identity tests require a full HTTP to WebSocket peering, and are then
checked for the expected peerings. This means some time is
required to let the connection fully settle, and let each network's
readLoop for the peer find potentially closed connections.

Also removes a duplicate test case
@cce cce requested a review from yossigi February 16, 2023 01:27
@@ -1165,12 +1214,14 @@ func (wn *WebsocketNetwork) ServeHTTP(response http.ResponseWriter, request *htt
prioChallenge: challenge,
createTime: trackedRequest.created,
version: matchingVersion,
identity: peerID,
identityChallenge: peerIDChallenge,
identityVerified: 0,
Contributor

nit: identityVerified is zero here, so it doesn't need to be listed (the other wsPeer struct literal below, where you set identity: peerID, doesn't initialize it explicitly either)

const maxAddressLen = 256 + 32 // Max DNS (255) + margin for port specification

// identityChallengeValue is 32 random bytes used for identity challenge exchange
type identityChallengeValue [32]byte
Contributor

nit

Suggested change
- type identityChallengeValue [32]byte
+ type identityChallengeValue crypto.Digest

return identityChallengeValue{}, crypto.PublicKey{}, err
}
if !idChal.Verify() {
return identityChallengeValue{}, crypto.PublicKey{}, fmt.Errorf("identity challenge incorrectly signed")
Contributor

Use a named error created with errors.New(...).
We are trying to eliminate error literals, and new code should not add more.
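As an illustration of what the reviewer is asking for, the sketch below declares package-level sentinel errors once and lets callers match them with `errors.Is`. The error messages are taken from the diff. The variable names and the `checkChallenge`, `handle`, and `isBadSignature` helpers are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical named errors replacing the fmt.Errorf literals in the diff.
var (
	errChallengeBadSignature  = errors.New("identity challenge incorrectly signed")
	errResponseWrongChallenge = errors.New("challenge response did not contain originally issued challenge value")
)

// checkChallenge stands in for the verification step and returns the
// sentinel instead of building a new error value at the call site.
func checkChallenge(signatureOK bool) error {
	if !signatureOK {
		return errChallengeBadSignature
	}
	return nil
}

// handle wraps the sentinel with context using %w so it stays matchable.
func handle(signatureOK bool) error {
	if err := checkChallenge(signatureOK); err != nil {
		return fmt.Errorf("peer 10.0.0.1: %w", err)
	}
	return nil
}

// isBadSignature shows how a caller or test can branch on the named error.
func isBadSignature(err error) bool {
	return errors.Is(err, errChallengeBadSignature)
}

func main() {
	err := handle(false)
	fmt.Println(err)
	fmt.Println(isBadSignature(err))
	fmt.Println(errors.Is(err, errResponseWrongChallenge))
}
```

Tests can then compare against the variable rather than matching on message text, which is what makes the literal-free style worthwhile.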

return crypto.PublicKey{}, []byte{}, err
}
if resp.Msg.Challenge != c {
return crypto.PublicKey{}, []byte{}, fmt.Errorf("challenge response did not contain originally issued challenge value")
Contributor

named errors

return crypto.PublicKey{}, []byte{}, fmt.Errorf("challenge response did not contain originally issued challenge value")
}
if !resp.Verify() {
return crypto.PublicKey{}, []byte{}, fmt.Errorf("challenge response incorrectly signed ")
Contributor

named errors

if err != nil {
networkPeerIdentityError.Inc(nil)
peer.net.log.With("err", err).With("remote", peer.OriginAddress()).With("local", localAddr).Warn("peer identity verification could not be decoded, disconnecting")
peer.net.disconnect(peer, disconnectBadIdentityData)
Contributor

Why disconnect right here instead of returning Action=Disconnect as the other handlers do?

if ok {
url, err := url.Parse(addr)
if err == nil {
wn.config.PublicAddress = fmt.Sprintf("%s:%s", url.Hostname(), url.Port())
Contributor

Modifying config, which has constant/read-only semantics, in random parts of the codebase is not good at all. Why not use a local var, since all of its usage is right here?
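One way to address this, sketched under assumptions: resolve the effective address into a value that the caller keeps, and leave the config untouched. The function `resolvePublicAddress`, its parameters, and the `errNoListenerAddress` sentinel are hypothetical. The "auto" value and the host:port derivation via `url.Parse` follow the PR description and the diff above.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/url"
)

// errNoListenerAddress is a hypothetical named error for this sketch.
var errNoListenerAddress = errors.New("listener URL has no host or port")

// resolvePublicAddress returns the address to use for identity challenges.
// configured is the PublicAddress setting, and listenerURL is the address
// the listener reports once it is up. Neither input is modified.
func resolvePublicAddress(configured, listenerURL string) (string, error) {
	if configured != "auto" {
		return configured, nil
	}
	u, err := url.Parse(listenerURL)
	if err != nil {
		return "", err
	}
	if u.Hostname() == "" || u.Port() == "" {
		return "", errNoListenerAddress
	}
	// JoinHostPort re-adds brackets around IPv6 hosts.
	return net.JoinHostPort(u.Hostname(), u.Port()), nil
}

func main() {
	addr, err := resolvePublicAddress("auto", "ws://127.0.0.1:4160")
	fmt.Println(addr, err)
	addr, err = resolvePublicAddress("relay.example:4160", "ws://127.0.0.1:4160")
	fmt.Println(addr, err)
}
```

The sketch uses `net.JoinHostPort` where the diff uses `fmt.Sprintf("%s:%s", ...)`. The two agree for hostnames and IPv4 addresses, but only the former produces a valid address for an IPv6 listener.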
