test_three_node_network_connectivity fails: asymmetric CheckConnectivity prevents NAT traversal #1907

@sanity

Summary

The test_three_node_network_connectivity test fails because peer-to-peer connections cannot be established when peers join at different times. Investigation shows that the CheckConnectivity message flow is asymmetric: only one peer initiates an outbound connection. This prevents NAT traversal, which requires both sides to send packets.

Test Configuration

  • 3 nodes: 1 gateway + 2 regular peers
  • All nodes configured with min_connections=2, max_connections=2
  • Peers join sequentially: Gateway at T0, Peer1 at T+12s, Peer2 at T+17s

Expected Behavior

Full mesh connectivity:

  • Gateway: 2 connections (Peer1 + Peer2)
  • Peer1: 2 connections (Gateway + Peer2)
  • Peer2: 2 connections (Gateway + Peer1)

Actual Behavior

Partial connectivity - peers stuck at 1 connection each:

  • Gateway: 2 connections ✓
  • Peer1: 1 connection (Gateway only) ✗
  • Peer2: 1 connection (Gateway only) ✗

Root Cause Analysis

Two Bugs Identified

Bug #1: Off-by-one error in max_connections check ✅ FIXED

Location: crates/core/src/ring/connection_manager.rs:169

Problem:

} else if total_conn >= self.max_connections {  // Should be >
    tracing::debug!(%peer_id, "Rejected connection, max connections reached");
    false

With max_connections=2, the condition >= means peers reject connections when total_conn=2, allowing only 1 connection instead of 2.

Fix: Change to total_conn > self.max_connections

Impact: This prevented Peer1 from accepting Peer2's connection attempt, but was not the only issue.
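
To make the off-by-one concrete, here is a minimal sketch of the two comparisons side by side. The function names are hypothetical and not taken from connection_manager.rs. It also assumes that total_conn already counts the connection being evaluated, which is what the behavior described above implies.

```rust
/// Mirrors the current check. Hypothetical helper, not the real method.
/// Assumption: `total_conn` includes the connection under evaluation.
fn accepts_with_ge(total_conn: usize, max_connections: usize) -> bool {
    if total_conn >= max_connections {
        // Rejected: "max connections reached"
        false
    } else {
        true
    }
}

/// Mirrors the proposed fix: reject only once the limit is exceeded.
fn accepts_with_gt(total_conn: usize, max_connections: usize) -> bool {
    if total_conn > max_connections {
        false
    } else {
        true
    }
}
```

Under that assumption, with max_connections=2 the `>=` variant rejects the second connection (total_conn=2), while the `>` variant accepts it and only rejects a third.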

Bug #2: Asymmetric CheckConnectivity prevents NAT traversal ❌ PRIMARY ISSUE

Problem: When the gateway introduces two peers, it only sends CheckConnectivity to ONE of them, not both.

Evidence from debug logs:

  1. Peer1 asks gateway for connections (18:58:16):

    FindOptimalPeer(joiner=Peer1, ideal_location=random)
    → Gateway response: "No desirable peer found" 
    (Peer2 hasn't joined yet)
    
  2. Peer2 asks gateway for connections (18:58:21):

    FindOptimalPeer(joiner=Peer2, ideal_location=random)
    → Gateway response: "Found desirable peer: Peer1"
    → Gateway sends: CheckConnectivity(target=Peer1, joiner=Peer2)
    
  3. Peer1 receives CheckConnectivity:

    [18:58:21.469767] Peer1: Accepting connection from, joiner: Peer2
    [18:58:21.471767] Peer1: Connecting to peer, remote: Peer2
    [18:58:21.471954] Peer1: Starting outbound connection to 127.0.0.1:51683 (Peer2)
    

    → Peer1 initiates outbound NAT traversal to Peer2 ✅

  4. Peer2's transport layer rejects Peer1's packets:

    [18:58:21.472823] Peer2: unexpected packet from non-gateway node, remote_addr: 127.0.0.1:37039 (Peer1)
    

    → Peer2 has NOT initiated outbound to Peer1, so rejects inbound packets ✗

Why NAT Traversal Fails

NAT traversal requires bidirectional packet exchange:

From crates/core/src/transport/connection_handler.rs:405-412:

if let Some((packets_sender, open_connection)) = ongoing_connections.remove(&remote_addr) {
    // Process packet from expected peer
    ...
} else if !self.is_gateway {
    tracing::debug!(%remote_addr, "unexpected packet from non-gateway node");
    continue;  // Reject packet
}

Non-gateway peers only accept inbound packets from peers to which they have already initiated an outbound connection (i.e., peers present in the ongoing_connections map).

The ongoing_connections map is only populated when the peer calls NodeEvent::ConnectPeer, which happens when processing CheckConnectivity (connect.rs:305-310).

Result:

  • Peer1 receives CheckConnectivity(joiner=Peer2) → adds Peer2 to ongoing_connections → accepts Peer2's packets ✅
  • Peer2 NEVER receives CheckConnectivity(joiner=Peer1) → never adds Peer1 to ongoing_connections → rejects Peer1's packets ✗
  • NAT traversal fails because only one side is sending packets
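
The result above can be reproduced with a small model of the packet filter. This is a simplified sketch, not the real transport code: the Peer struct and its methods are hypothetical, the map is reduced to a set of addresses, and the lookup uses contains rather than remove. The addresses are the ones from the log lines above.

```rust
use std::collections::HashSet;
use std::net::SocketAddr;

/// Hypothetical, simplified stand-in for a node's transport state.
struct Peer {
    is_gateway: bool,
    /// Remotes this peer has started an outbound connection to.
    ongoing_connections: HashSet<SocketAddr>,
}

impl Peer {
    fn new(is_gateway: bool) -> Self {
        Peer {
            is_gateway,
            ongoing_connections: HashSet::new(),
        }
    }

    /// Models the effect of handling CheckConnectivity: the peer starts an
    /// outbound connection and records the remote as expected.
    fn connect_peer(&mut self, remote: SocketAddr) {
        self.ongoing_connections.insert(remote);
    }

    /// Models the filter: packets from expected remotes are processed;
    /// anything else is dropped unless this node is a gateway.
    fn accepts_packet_from(&self, remote: SocketAddr) -> bool {
        self.ongoing_connections.contains(&remote) || self.is_gateway
    }
}
```

With peer1_addr = 127.0.0.1:37039 and peer2_addr = 127.0.0.1:51683, calling peer1.connect_peer(peer2_addr) and nothing on Peer2 gives peer1.accepts_packet_from(peer2_addr) == true and peer2.accepts_packet_from(peer1_addr) == false, which matches the "unexpected packet from non-gateway node" log line.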

Timing Race Condition

The asymmetry occurs because FindOptimalPeer responses depend on when the gateway learns about each peer:

  1. Peer1 joins at T0 (times in this list are relative to Peer1's join, which is T+12s in the test configuration): Sends FindOptimalPeer immediately via aggressive_initial_connections()

    • Gateway hasn't seen Peer2 yet → "No desirable peer found"
    • Peer1 learns about: nobody
  2. Peer2 joins at T+5s: Sends FindOptimalPeer immediately

    • Gateway NOW knows about Peer1 → "Found: Peer1"
    • Gateway sends CheckConnectivity(target=Peer1, joiner=Peer2)
    • Peer2 learns about: Peer1
  3. Later retries (T+10s, T+32s): Peer1 sends more FindOptimalPeer requests via connection_maintenance()

    • But Peer1 has no peer knowledge to share (only knows Gateway)
    • Gateway queries Peer1: "Who should I connect to?"
    • Peer1 responds: "No desirable peer found" (only knows Gateway, which is in skip list)

The early-joining peer never learns about late-joining peers.
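
The ordering dependence for the initial joins can be sketched as follows. This is a toy model under stated assumptions, not the gateway's actual selection logic: initial_introductions and the CheckConnectivity struct here are hypothetical, the gateway is assumed to suggest the first peer it already knows, and the later connection_maintenance() retries are not modeled.

```rust
/// Hypothetical record of one introduction sent by the gateway.
#[derive(Debug, PartialEq, Eq)]
struct CheckConnectivity {
    target: String,
    joiner: String,
}

/// For each joiner, in join order, the gateway can only suggest a peer it
/// has already seen. Returns the CheckConnectivity messages it would send.
fn initial_introductions(join_order: &[&str]) -> Vec<CheckConnectivity> {
    let mut known: Vec<&str> = Vec::new();
    let mut sent = Vec::new();
    for &joiner in join_order {
        // Assumption: pick the first known peer. With no known peers the
        // gateway answers "No desirable peer found" and sends nothing.
        if let Some(&target) = known.first() {
            sent.push(CheckConnectivity {
                target: target.to_string(),
                joiner: joiner.to_string(),
            });
        }
        known.push(joiner);
    }
    sent
}
```

Calling initial_introductions(&["Peer1", "Peer2"]) yields a single message with target Peer1 and joiner Peer2. No message ever targets Peer2, so in this model Peer2 is never told to open an outbound connection to Peer1.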

Questions for Nacho

  1. Is the asymmetric CheckConnectivity intentional?

    • Should the protocol send CheckConnectivity to BOTH peers when introducing them?
    • Or is there another mechanism that should trigger Peer2 to connect to Peer1?
  2. How should NAT traversal work with one-way CheckConnectivity?

    • The transport layer rejects unexpected packets from non-gateway nodes (connection_handler.rs:405-412)
    • But if only one peer initiates outbound, the other peer will reject packets
    • Should the transport layer handle unsolicited peer connections differently?
  3. Should peers retry FindOptimalPeer after initial join?

    • Currently connection_maintenance() sends FindOptimalPeer periodically
    • But if peers have no peer knowledge, they can't help the gateway find connections
    • Should the gateway proactively push peer introductions when new peers join?
  4. Is this a known limitation in small networks?

    • The protocol works fine in large networks where many peers have diverse connections
    • But in 3-node networks, the timing race is deterministic
    • Should there be special handling for small networks?

Reproduction

cd ~/code/freenet/freenet-core/main

# Apply off-by-one fix
sed -i 's/total_conn >= self.max_connections/total_conn > self.max_connections/' \
    crates/core/src/ring/connection_manager.rs

# Run test (will still fail due to asymmetric CheckConnectivity)
cargo test --test connectivity test_three_node_network_connectivity -- --nocapture

# Expected: Peers stuck at 1 connection after 60s
# Logs show: Peer2 rejects Peer1's packets as "unexpected from non-gateway node"

Debug Logs

Full debug logs with detailed message flow available at: /tmp/connectivity_debug.log

Key events:

  • 18:58:16: Peer1 FindOptimalPeer → "No desirable peer found"
  • 18:58:21: Peer2 FindOptimalPeer → "Found: Peer1"
  • 18:58:21: Gateway → Peer1: CheckConnectivity(joiner=Peer2)
  • 18:58:21: Peer1 initiates outbound to Peer2
  • 18:58:21: Peer2 rejects Peer1's packets ("unexpected from non-gateway")
  • 18:58:54: Gateway asks Peer1 for recommendations → "No desirable peer"

Test Location

crates/core/tests/connectivity.rs:424 - test_three_node_network_connectivity

[AI-assisted debugging and comment]
