discv5-theory.md


Node Discovery Protocol v5 - Theory

Protocol version v5.1

This document explains the algorithms and data structures used by the protocol.

Nodes, Records and Distances

A participant in the Node Discovery Protocol is represented by a 'node record' as defined in EIP-778. The node record holds arbitrary information about the node. For the purposes of this protocol, the node must at least provide an IP address ("ip" or "ip6" key) and UDP port ("udp" key) in order to have its record relayed in the DHT.

Node records are signed according to an 'identity scheme'. Any scheme can be used with Node Discovery Protocol, and nodes using different schemes can communicate.

The identity scheme of a node record defines how a 32-byte 'node ID' is derived from the information contained in the record. The 'distance' between two node IDs is the bitwise XOR of the IDs, taken as the big-endian number.

distance(n₁, n₂) = n₁ XOR n₂

In many situations, the logarithmic distance (i.e. length of differing suffix in bits) is used in place of the actual distance.

logdistance(n₁, n₂) = log2(distance(n₁, n₂))
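As a minimal sketch (Python, treating node IDs as big-endian integers), the two metrics can be written as:

```python
def distance(n1: int, n2: int) -> int:
    """XOR distance between two node IDs, taken as big-endian integers."""
    return n1 ^ n2

def logdistance(n1: int, n2: int) -> int:
    """Logarithmic distance: floor(log2(distance)), i.e. the position of the
    highest differing bit. Undefined for identical IDs."""
    d = distance(n1, n2)
    if d == 0:
        raise ValueError("logdistance is undefined for identical node IDs")
    return d.bit_length() - 1
```

For example, two IDs differing only in their lowest bit have logdistance 0, while 256-bit IDs differing in the top bit have logdistance 255.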

Maintaining The Local Node Record

Participants should update their record, increase the sequence number and sign a new version of the record whenever their information changes. This is especially important for changes to the node's IP address and port. Implementations should determine the external endpoint (the Internet-facing IP address and port on which the node can be reached) and include it in their record.

If communication flows through a NAT device, the UPnP/NAT-PMP protocols or the mirrored UDP envelope IP and port found in the PONG message can be used to determine the external IP address and port.

If the endpoint cannot be determined (e.g. when the NAT doesn't support 'full-cone' translation), implementations should omit IP address and UDP port from the record.

Sessions

Discovery communication is encrypted and authenticated using session keys, established in the handshake. Since every node participating in the network acts as both client and server, a handshake can be initiated by either side of communication at any time.

Handshake Steps

Step 1: Node A sends message packet

In the following definitions, we assume that node A wishes to communicate with node B, e.g. to send a FINDNODE message. Node A must have a copy of node B's record in order to communicate with it.

If node A has session keys from prior communication with B, it encrypts its request with those keys. If no keys are known, it initiates the handshake by sending an ordinary message packet with random message content.

A -> B   FINDNODE message packet encrypted with unknown key

Step 2: Node B responds with challenge

Node B receives the message packet and extracts the source node ID from the packet header. If node B has session keys from prior communication with A, it attempts to decrypt the message data. If decryption and authentication of the message succeeds, there is no need for a handshake and node B can simply respond to the request.

If node B does not have session keys or decryption is not successful, it must initiate a handshake by responding with a WHOAREYOU packet.

It first generates a unique id-nonce value and includes it in the packet. Node B also checks if it has a copy of node A's record. If it does, it also includes the sequence number of this record in the challenge packet, otherwise it sets the enr-seq field to zero.

Node B must also store node A's record (if it has one) and the WHOAREYOU challenge for a short duration after sending the challenge to node A because they will be needed again in step 4.

A <- B   WHOAREYOU packet including id-nonce, enr-seq

Step 3: Node A processes the challenge

Node A receives the challenge sent by node B, which confirms that node B is alive and is ready to perform the handshake. The challenge can be traced back to the request packet which solicited it by checking the nonce, which mirrors the request packet's nonce.

Node A proceeds with the handshake by re-sending the FINDNODE request as a handshake message packet. This packet contains three parts in addition to the message: id-signature, ephemeral-pubkey and record.

The handshake uses the unmasked WHOAREYOU challenge as an input:

challenge-data     = masking-iv || static-header || authdata

Node A can now derive the new session keys. To do so, it first generates an ephemeral key pair on the elliptic curve used by node B's identity scheme. As an example, let's assume the node record of B uses the "v4" scheme. In this case the ephemeral-pubkey will be a public key on the secp256k1 curve.

ephemeral-key      = random private key generated by node A
ephemeral-pubkey   = public key corresponding to ephemeral-key

The ephemeral key is used to perform Diffie-Hellman key agreement with node B's static public key and the session keys are derived from it using the HKDF key derivation function.

dest-pubkey        = public key corresponding to node B's static private key
secret             = ecdh(dest-pubkey, ephemeral-key)
kdf-info           = "discovery v5 key agreement" || node-id-A || node-id-B
prk                = HKDF-Extract(secret, challenge-data)
key-data           = HKDF-Expand(prk, kdf-info)
initiator-key      = key-data[:16]
recipient-key      = key-data[16:]
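The derivation above can be sketched with Python's standard library by implementing HKDF (RFC 5869) directly. Following common discv5 implementations, challenge-data is assumed to act as the extract salt:

```python
import hashlib
import hmac

def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    """HKDF-Extract per RFC 5869, using HMAC-SHA256."""
    return hmac.new(salt, ikm, hashlib.sha256).digest()

def hkdf_expand(prk: bytes, info: bytes, length: int) -> bytes:
    """HKDF-Expand per RFC 5869, using HMAC-SHA256."""
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]

def derive_session_keys(secret: bytes, challenge_data: bytes,
                        node_id_a: bytes, node_id_b: bytes):
    """Derive (initiator-key, recipient-key) from the ECDH secret."""
    kdf_info = b"discovery v5 key agreement" + node_id_a + node_id_b
    prk = hkdf_extract(challenge_data, secret)  # challenge-data as salt (assumed)
    key_data = hkdf_expand(prk, kdf_info, 32)
    return key_data[:16], key_data[16:]
```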

Node A creates the id-signature, which proves that it controls the private key which signed its node record. The signature also prevents replay of the handshake.

id-signature-text  = "discovery v5 identity proof"
id-signature-input = id-signature-text || challenge-data || ephemeral-pubkey || node-id-B
id-signature       = id_sign(sha256(id-signature-input))
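Assembling and hashing the signature input can be sketched as follows; the id_sign step itself depends on the identity scheme and is elided:

```python
import hashlib

ID_SIGNATURE_TEXT = b"discovery v5 identity proof"

def id_signature_hash(challenge_data: bytes, ephemeral_pubkey: bytes,
                      node_id_b: bytes) -> bytes:
    """sha256 over id-signature-input; the digest is then signed with the
    node's static identity key (id_sign)."""
    sig_input = ID_SIGNATURE_TEXT + challenge_data + ephemeral_pubkey + node_id_b
    return hashlib.sha256(sig_input).digest()
```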

Finally, node A compares the enr-seq element of the WHOAREYOU challenge against its own node record sequence number. If the sequence number in the challenge is lower, it includes its record in the handshake message packet.

The request is now re-sent, with the message encrypted using the new session keys.

A -> B   FINDNODE handshake message packet, encrypted with new initiator-key

Step 4: Node B receives handshake message

When node B receives the handshake message packet, it first loads the node record and WHOAREYOU challenge which it sent and stored earlier.

If node B did not have the node record of node A, the handshake message packet must contain a node record. A record may also be present if node A determined that its record is newer than B's current copy. If the packet contains a node record, B must first validate it by checking the record's signature.

Node B then verifies the id-signature against the identity public key of A's record.

After that, B can perform the key derivation using its own static private key and the ephemeral-pubkey from the handshake packet. Using the resulting session keys, it attempts to decrypt the message contained in the packet.

If the message can be decrypted and authenticated, Node B considers the new session keys valid and responds to the message. In our example case, the response is a NODES message:

A <- B   NODES encrypted with new recipient-key

Step 5: Node A receives response message

Node A receives the message packet response and authenticates/decrypts it with the new session keys. If decryption/authentication succeeds, node B's identity is verified and node A also considers the new session keys valid.

Identity-Specific Cryptography in the Handshake

Establishment of session keys is dependent on the identity scheme used by the recipient (i.e. the node which sends WHOAREYOU). Likewise, the signature over id-signature-input is made by the identity key of the initiator. It is not required that initiator and recipient use the same identity scheme in their respective node records. Implementations must be able to perform the handshake for all supported identity schemes.

At this time, the only supported identity scheme is "v4".

id_sign(hash) creates a signature over hash using the node's static private key. The signature is encoded as the 64-byte array r || s, i.e. as the concatenation of the signature values.

ecdh(pubkey, privkey) creates a secret through elliptic-curve Diffie-Hellman key agreement. The public key is multiplied by the private key to create a secret ephemeral key eph = pubkey * privkey. The 33-byte secret output is y || eph.x where y is 0x02 when eph.y is even or 0x03 when eph.y is odd.
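Assuming the coordinates of the resulting point are available as integers (the curve multiplication itself is elided), the 33-byte secret encoding can be sketched as:

```python
def encode_shared_secret(x: int, y: int) -> bytes:
    """Compress the ECDH result point eph = (x, y): a parity byte
    (0x02 for even y, 0x03 for odd y) followed by the 32-byte x coordinate."""
    prefix = b"\x02" if y % 2 == 0 else b"\x03"
    return prefix + x.to_bytes(32, "big")
```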

Handshake Implementation Considerations

Since a handshake may happen at any time and UDP packets may be reordered by intervening network equipment, implementations must deal with certain subtleties regarding the handshake.

In general, implementations should keep a reference to all sent request packets until the request either times out, is answered by the corresponding response packet or answered by WHOAREYOU. If WHOAREYOU is received as the answer to a request, the request must be re-sent as a handshake packet.

If an implementation supports sending concurrent requests, multiple responses may be pending when WHOAREYOU is received, as in the following example:

A -> B   FINDNODE
A -> B   PING
A -> B   TOPICQUERY
A <- B   WHOAREYOU (nonce references PING)

When this happens, all buffered requests can be considered invalid (the remote end cannot decrypt them) and the packet referenced by the WHOAREYOU nonce (in this example: PING) must be re-sent as a handshake. When the response to the re-sent request is received, the new session is established and the other pending requests (in this example: FINDNODE, TOPICQUERY) may be re-sent.

Note that WHOAREYOU is only ever valid as a response to a previously sent request. If WHOAREYOU is received but no requests are pending, the handshake attempt can be ignored.
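The request bookkeeping described above can be sketched as follows; this is a simplified model (timeouts and the actual re-send are elided):

```python
class PendingRequests:
    """Track sent requests by packet nonce until each is answered,
    times out, or is challenged by WHOAREYOU."""

    def __init__(self):
        self.by_nonce = {}

    def sent(self, nonce: bytes, request: str) -> None:
        self.by_nonce[nonce] = request

    def on_whoareyou(self, nonce: bytes):
        """Return (request to re-send as handshake, other requests to re-send
        once the session is established)."""
        if nonce not in self.by_nonce:
            return None, []  # no request pending: ignore the challenge
        challenged = self.by_nonce.pop(nonce)
        # the remote end could not decrypt any of the buffered requests
        others = list(self.by_nonce.values())
        self.by_nonce.clear()
        return challenged, others
```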

Another important issue is the processing of ordinary message packets while a handshake is in progress: consider the case where node A has sent a packet that B cannot decrypt, and B has responded with WHOAREYOU.

A -> B   FINDNODE
A <- B   WHOAREYOU

Node B is now waiting for a handshake message packet to complete the new session, but instead receives another ordinary message packet.

A -> B   ORDINARY MESSAGE PACKET

In this case, implementations should respond with a new WHOAREYOU challenge referencing the message packet.

Session Cache

Nodes should store session keys for communication with other recently-seen nodes. Since sessions are ephemeral and can be re-established whenever necessary, it is sufficient to store a limited number of sessions in an in-memory LRU cache.

To prevent IP spoofing attacks, implementations must ensure that session secrets and the handshake are tied to a specific UDP endpoint. This is simple to implement by using the node ID and IP/port as the 'key' into the in-memory session cache. When a node switches endpoints, e.g. when roaming between different wireless networks, sessions will have to be re-established by handshaking again. This requires no effort on behalf of the roaming node because the recipients of protocol messages will simply refuse to decrypt messages from the new endpoint and reply with WHOAREYOU.

The number of messages which can be encrypted with a certain session key is limited because encryption of each message requires a unique nonce for AES-GCM. In addition to the keys, the session cache must also keep track of the count of outgoing messages to ensure the uniqueness of nonce values. Since the wire protocol uses 96 bit AES-GCM nonces, it is strongly recommended to generate them by encoding the current outgoing message count into the first 32 bits of the nonce and filling the remaining 64 bits with random data generated by a cryptographically secure random number generator.
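A sketch of the recommended nonce construction:

```python
import os
import struct

def gcm_nonce(outgoing_count: int) -> bytes:
    """96-bit AES-GCM nonce: the 32-bit outgoing message counter followed by
    64 bits from a cryptographically secure random number generator."""
    return struct.pack(">I", outgoing_count) + os.urandom(8)
```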

Node Table

Nodes keep information about other nodes in their neighborhood. Neighbor nodes are stored in a routing table consisting of 'k-buckets'. For each 0 ≤ i < 256, every node keeps a k-bucket for nodes of logdistance(self, n) == i. The Node Discovery Protocol uses k = 16, i.e. every k-bucket contains up to 16 node entries. The entries are sorted by time last seen — least-recently seen node at the head, most-recently seen at the tail.

Whenever a new node N₁ is encountered, it can be inserted into the corresponding bucket. If the bucket contains fewer than k entries, N₁ can simply be added as the first entry. If the bucket already contains k entries, the liveness of the least-recently seen node in the bucket, N₂, needs to be revalidated. If no reply is received from N₂, it is considered dead, removed, and N₁ added to the front of the bucket.
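A bucket can be sketched as follows. This sketch follows the usual Kademlia convention of appending the most-recently seen node at the tail, consistent with the ordering described above; the liveness check itself is elided:

```python
K = 16  # bucket capacity

class KBucket:
    """Least-recently seen node at the head, most-recently seen at the tail."""

    def __init__(self):
        self.entries = []       # node IDs, ordered by time last seen
        self.replacements = []  # replacement cache for full buckets

    def seen(self, node) -> None:
        if node in self.entries:
            # known node: move to the most-recently-seen end
            self.entries.remove(node)
            self.entries.append(node)
        elif len(self.entries) < K:
            self.entries.append(node)
        else:
            # bucket full: remember the node until a member fails revalidation
            self.replacements.append(node)

    def replace_dead(self, dead) -> None:
        """A member failed revalidation: drop it and promote a replacement."""
        self.entries.remove(dead)
        if self.replacements:
            self.entries.append(self.replacements.pop())
```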

Neighbors of very low distance are unlikely to occur in practice. Implementations may omit k-buckets for low distances.

Table Maintenance In Practice

Nodes are expected to keep track of their close neighbors and regularly refresh their information. To do so, a lookup targeting the least recently refreshed bucket should be performed at regular intervals.

Checking node liveness whenever a node is to be added to a bucket is impractical and creates a DoS vector. Implementations should perform liveness checks asynchronously with bucket addition and occasionally verify that a random node in a random bucket is live by sending PING. When the PONG response indicates that a new version of the node record is available, the liveness check should pull the new record and update it in the local table.

If a node's liveness has been verified many times, implementations may consider occasional non-responsiveness permissible and assume the node is live.

When responding to FINDNODE, implementations must avoid relaying any nodes whose liveness has not been verified. This is easy to achieve by storing an additional flag per node in the table, tracking whether the node has ever successfully responded to a PING request.

In order to keep all k-bucket positions occupied even when bucket members fail liveness checks, it is strongly recommended to maintain a 'replacement cache' alongside each bucket. This cache holds recently-seen nodes which would fall into the corresponding bucket but cannot become a member of the bucket because it is already at capacity. Once a bucket member becomes unresponsive, a replacement can be chosen from the cache.

Lookup

A 'lookup' locates the k closest nodes to a node ID.

The lookup initiator starts by picking α closest nodes to the target it knows of from the local table. The initiator then sends FINDNODE requests to those nodes. α is an implementation-defined concurrency parameter, typically 3. As NODES responses are received, the initiator resends FINDNODE to nodes it has learned about from previous queries. Of the k nodes the initiator has heard of closest to the target, it picks α that it has not yet queried and sends FINDNODE to them. The lookup terminates when the initiator has queried and gotten responses from the k closest nodes it has seen.
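The iterative algorithm can be sketched as follows, with the FINDNODE round trip abstracted into a query function and the α queries issued sequentially for clarity:

```python
K = 16      # number of closest nodes a lookup returns
ALPHA = 3   # concurrency factor (queries are sequential in this sketch)

def lookup(target: int, seeds, query):
    """Iterative lookup. `query(node)` stands in for a FINDNODE round trip
    and returns the nodes that `node` reports close to `target`."""
    seen = set(seeds)
    asked = set()
    while True:
        # the k closest candidates heard of so far, nearest first
        closest = sorted(seen, key=lambda n: n ^ target)[:K]
        batch = [n for n in closest if n not in asked][:ALPHA]
        if not batch:  # the k closest known nodes have all been queried
            return closest
        for node in batch:
            asked.add(node)
            seen.update(query(node))
```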

To improve the resilience of lookups against adversarial nodes, the algorithm may be adapted to perform network traversal on multiple disjoint paths. Not only does this approach benefit security, it also improves effectiveness because more nodes are visited during a single lookup. The initial k closest nodes are partitioned into multiple independent 'path' buckets, and concurrent FINDNODE requests are executed as described above, with one difference: results discovered on one path are not reused on another, i.e. each path attempts to reach the closest nodes to the lookup target independently, without reusing intermediate results found on another path. Note that it is still necessary to track previously asked nodes across all paths to keep the paths disjoint.

Lookup Protocol

This section shows how the wire protocol messages can be used to perform a lookup interaction against a single node.

Node A is looking for target x. It selects node B from the local table or intermediate lookup results. To query for nodes close to x on B, node A computes the query distance d = logdistance(B, x) and sends its request.

A -> B  FINDNODE [d]

Node B responds with multiple nodes messages containing the nodes at the queried distance.

A <- B  NODES [N₁, N₂, N₃]
A <- B  NODES [N₄, N₅]

Depending on the value of d and the content of B's table, the response to the initial query might contain very few nodes or no nodes at all. Should this be the case, A varies the distance to retrieve more nodes from adjacent k-buckets on B:

A -> B  FINDNODE [d+1]

B responds with more nodes:

A <- B  NODES [N₆, N₇]

Node A now sorts all received nodes by distance to the lookup target and proceeds by repeating the lookup procedure on another, closer node.

Topic Advertisement

The topic advertisement subsystem indexes participants by their provided services. A node's provided services are identified by arbitrary strings called 'topics'. A node providing a certain service is said to 'place an ad' for itself when it makes itself discoverable under that topic. Depending on the needs of the application, a node can advertise multiple topics or no topics at all. Every node participating in the discovery protocol acts as an advertisement medium, meaning that it accepts topic ads from other nodes and later returns them to nodes searching for the same topic.

Topic Table

Nodes store ads for any number of topics and a limited number of ads for each topic. The data structure holding advertisements is called the 'topic table'. The list of ads for a particular topic is called the 'topic queue' because it functions like a FIFO queue of limited length. The image below depicts a topic table containing three queues. The queue for topic T₁ is at capacity.

topic table

The queue size limit is implementation-defined. Implementations should place a global limit on the number of ads in the topic table regardless of the topic queue which contains them. Reasonable limits are 100 ads per queue and 50000 ads across all queues. Since ENRs are at most 300 bytes in size, these limits ensure that a full topic table consumes approximately 15MB of memory.

Any node may appear at most once in any topic queue; that is, registration of a node which is already registered for a given topic fails. Implementations may impose other restrictions on the table, such as restrictions on the number of IP addresses in a certain range or the number of occurrences of the same node across queues.

Tickets

Ads should remain in the queue for a constant amount of time, the target-ad-lifetime. To maintain this guarantee, new registrations are throttled and registrants must wait for a certain amount of time before they are admitted. When a node attempts to place an ad, it receives a 'ticket' which tells it how long it must wait before it will be accepted. It is up to the registrant node to keep the ticket and present it to the advertisement medium when the waiting time has elapsed.

The waiting time constant is:

target-ad-lifetime = 15min

The assigned waiting time for any registration attempt is determined according to the following rules:

  • When the table is full, the waiting time is assigned based on the lifetime of the oldest ad across the whole table, i.e. the registrant must wait for a table slot to become available.
  • When the topic queue is full, the waiting time depends on the lifetime of the oldest ad in the queue. The assigned time is target-ad-lifetime - oldest-ad-lifetime in this case.
  • Otherwise the ad may be placed immediately.
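The rules above can be sketched as follows (ages in seconds; the exact table-full formula is an assumption, since the text only requires waiting for a table slot to open):

```python
TARGET_AD_LIFETIME = 15 * 60  # seconds

def assigned_wait_time(table_full: bool, queue_full: bool,
                       oldest_table_ad_age: float,
                       oldest_queue_ad_age: float) -> float:
    if table_full:
        # wait until the oldest ad anywhere in the table expires (assumed formula)
        return max(TARGET_AD_LIFETIME - oldest_table_ad_age, 0)
    if queue_full:
        # target-ad-lifetime - oldest-ad-lifetime, as given above
        return max(TARGET_AD_LIFETIME - oldest_queue_ad_age, 0)
    return 0  # free slot: the ad may be placed immediately
```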

Tickets are opaque objects storing arbitrary information determined by the issuing node. While details of encoding and ticket validation are up to the implementation, tickets must contain enough information to verify that:

  • The node attempting to use the ticket is the node which requested it.
  • The ticket is valid for a single topic only.
  • The ticket can only be used within the registration window.
  • The ticket can't be used more than once.

Implementations may choose to include arbitrary other information in the ticket, such as the cumulative wait time spent by the advertiser. A practical way to handle tickets is to encrypt and authenticate them with a dedicated secret key:

ticket       = aesgcm_encrypt(ticket-key, ticket-nonce, ticket-pt, '')
ticket-pt    = [src-node-id, src-ip, topic, req-time, wait-time, cum-wait-time]
src-node-id  = node ID that requested the ticket
src-ip       = IP address that requested the ticket
topic        = the topic that ticket is valid for
req-time     = absolute time of REGTOPIC request
wait-time    = waiting time assigned when ticket was created
cum-wait     = cumulative waiting time of this node

Registration Window

The image below depicts a single ticket's validity over time. When the ticket is issued, the node keeping it must wait until the registration window opens. The length of the registration window is 10 seconds. The ticket becomes invalid after the registration window has passed.

ticket validity over time

Since all ticket waiting times are assigned to expire when a slot in the queue opens, the advertisement medium may receive multiple valid tickets during the registration window and must choose one of them to be admitted into the topic queue. The winning node is notified using a REGCONFIRMATION response.

Picking the winner can be achieved by keeping track of a single 'next ticket' per queue during the registration window. Whenever a new ticket is submitted, first determine its validity and compare it against the current 'next ticket' to determine which of the two is better according to an implementation-defined metric such as the cumulative wait time stored in the ticket.

Advertisement Protocol

This section explains how the topic-related protocol messages are used to place an ad.

Let us assume that node A provides topic T. It selects node C as advertisement medium and wants to register an ad, so that when node B (who is searching for topic T) asks C, C can return the registration entry of A to B.

Node A first attempts to register without a ticket by sending REGTOPIC to C.

A -> C  REGTOPIC [T, ""]

C replies with a ticket and waiting time.

A <- C  TICKET [ticket, wait-time]

Node A now waits for the duration of the waiting time. When the wait is over, A sends another registration request including the ticket. C does not need to remember its issued tickets since the ticket is authenticated and contains enough information for C to determine its validity.

A -> C  REGTOPIC [T, ticket]

Node C replies with another ticket. Node A must keep this ticket in place of the earlier one, and must also be prepared to handle a confirmation call in case registration was successful.

A <- C  TICKET [ticket, wait-time]

Node C waits for the registration window to end on the queue and selects A as the node which is registered. Node C places A into the topic queue for T and sends a REGCONFIRMATION response.

A <- C  REGCONFIRMATION [T]

Ad Placement And Topic Radius

Since every node may act as an advertisement medium for any topic, advertisers and nodes looking for ads must agree on a scheme by which ads for a topic are distributed. When the number of nodes advertising a topic is at least a certain percentage of the whole discovery network (rough estimate: at least 1%), ads may simply be placed on random nodes because searching for the topic on randomly selected nodes will locate the ads quickly enough.

However, topic search should be fast even when the number of advertisers for a topic is much smaller than the number of all live nodes. Advertisers and searchers must agree on a subset of nodes to serve as advertisement media for the topic. This subset is simply a region of the node ID address space, consisting of nodes whose Kademlia address is within a certain distance to the topic hash sha256(T). This distance is called the 'topic radius'.

Example: for a topic f3b2529e... with a radius of 2^240, the subset covers all nodes whose IDs have prefix f3b2.... A radius of 2^256 means the entire network, in which case advertisements are distributed uniformly among all nodes. The diagram below depicts a region of the address space with topic hash t in the middle and several nodes close to t surrounding it. Dots above the nodes represent entries in the node's queue for the topic.
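A sketch of radius membership, using a hypothetical topic string "T":

```python
import hashlib

topic_hash = int.from_bytes(hashlib.sha256(b"T").digest(), "big")

def in_topic_radius(node_id: int, radius: int) -> bool:
    """A node serves as advertisement medium for the topic when its XOR
    distance to the topic hash is below the radius."""
    return (node_id ^ topic_hash) < radius

# With radius 2^240, members share the top 16 bits (first 4 hex characters)
# of the topic hash; radius 2^256 covers every possible node ID.
```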

diagram explaining the topic radius concept

To place their ads, participants simply perform a random walk within the currently estimated radius and run the advertisement protocol by collecting tickets from all nodes encountered during the walk and using them when their waiting time is over.

Topic Radius Estimation

Advertisers must estimate the topic radius continuously in order to place their ads on nodes where they will be found. The radius must not fall below a certain size: restricting registration to too few nodes leaves the topic vulnerable to censorship and leads to long waiting times. If the radius were too large, searching nodes would take too long to find the ads.

Estimating the radius uses the waiting time as an indicator of how many other nodes are attempting to place ads in a certain region. This is achieved by keeping track of the average time to successful registration within segments of the address space surrounding the topic hash. Advertisers initially assume the radius is 2^256, i.e. the entire network. As tickets are collected, the advertiser samples the time it takes to place an ad in each segment and adjusts the radius such that registration at the chosen distance takes approximately target-ad-lifetime / 2 to complete.

Topic Search

Finding nodes that provide a certain topic is a continuous process which reads the content of topic queues inside the approximated topic radius. This is a much simpler process than topic advertisement because collecting tickets and waiting on them is not required.

To find nodes for a topic, the searcher generates random node IDs inside the estimated topic radius and performs Kademlia lookups for these IDs. All (intermediate) nodes encountered during lookup are asked for topic queue entries using the TOPICQUERY packet.

Radius estimation for topic search is similar to the estimation procedure for advertisement, but samples the average number of results from TOPICQUERY instead of average time to registration. The radius estimation value can be shared with the registration algorithm if the same topic is being registered and searched for.