Skip to content

Transaction ID Reuse Causes Gateway Shutdown in test_three_node_network_connectivity #1908

@sanity

Description

@sanity

Transaction ID Reuse Causes Gateway Shutdown in test_three_node_network_connectivity

Summary

After fixing the off-by-one bug (#1905) and implementing bidirectional connection initiation (#1907), the test_three_node_network_connectivity test still fails due to a transaction ID reuse race condition that causes the gateway's node.run().await to return an error, shutting down the gateway process.

Impact: While this bug hasn't been observed in production (checked production gateway logs - no unexpected restarts), the test reliably reproduces a race condition that could cause node shutdowns under specific timing conditions.

Environment

  • Test: crates/core/tests/connectivity.rs:424 - test_three_node_network_connectivity
  • Configuration: 3 nodes (1 gateway + 2 peers), all with min_connections=2, max_connections=2
  • Timing: Peers join sequentially at T+0s (gateway), T+12s (peer1), T+17s (peer2)

Observed Behavior

Gateway process exits after 33 seconds with error:

[21:52:44.090160] ERROR freenet::node::network_bridge::p2p_protoc: Timeout or error establishing connection to v6MWKgqJtu7MJ9jv
thread 'test_three_node_network_connectivity' panicked at crates/core/tests/connectivity.rs:772:13:
error: anyhow::Error - Gateway error: channel closed

What happens:

  • Gateway's node.run().await returns Err("channel closed") instead of running indefinitely
  • In production, this would cause the freenet process to exit (cleanly, not panic)
  • Systemd would restart it if configured with Restart=always
  • The test panics because it uses tokio::select! and didn't expect the gateway task to finish

Note: The gateway doesn't panic - it exits cleanly when an internal channel closes. In production this would be a process exit, not a crash.

Background: How the Connect Protocol Works

This issue involves the interaction between two layers in the codebase:

Layer 1: Ring Topology (ConnectionManager)

  • Lives in crates/core/src/ring/connection_manager.rs
  • Tracks logical connections between peers in the ring
  • Methods: should_accept(), add_connection(), num_connections()
  • Purpose: Determines routing and which peers we're "connected to" for DHT purposes

Layer 2: P2P Network (self.connections in p2p_protoc.rs)

Key insight: A peer can be registered in the ring but not have an active P2P connection yet (or vice versa). These layers can temporarily be out of sync.

The Join Protocol Flow

When a peer wants to join the network:

  1. Peer sends StartJoinReq → Gateway (with transaction ID, e.g., "tx A")
  2. Gateway finds optimal peer for the joiner to connect to
  3. Gateway sends CheckConnectivity → Existing peer (asking "will you accept this joiner?")
    • Uses the same transaction ID "tx A" from the original join request
  4. Existing peer responds AcceptedBy → Gateway (using same "tx A")
  5. Gateway forwards AcceptedBy → Original joiner (using same "tx A")

The same transaction ID flows through the entire protocol to track the join operation.

Bootstrap Connection Special Case

When a gateway has zero connections and receives its first join:

Code path (connect.rs:1140-1151):

if num_connections == 0 {
    if num_reserved == 1 && is_gateway && accepted {
        tracing::info!("Gateway bootstrap: accepting first connection directly");
        // Returns ConnectState::AwaitingConnectivity
        // This triggers immediate registration in ring (p2p_protoc.rs:836)
    }
}

What happens:

  • Connection gets added to ConnectionManager (ring topology)
  • Connection gets added to self.connections (P2P layer) when handshake completes
  • But there's a timing window where ring thinks we're connected before P2P layer does

Root Cause Analysis

Timeline of Events

Peer IDs:

  • Gateway: v6MWKgqJtu7MJ9jv
  • Peer1: v6MWKgqJ46eKhx16 (starts T+12s)
  • Peer2: v6MWKgqJE4HoFnZW (starts T+17s)

Sequence:

  1. 21:52:29 - Peer2 initiates join to gateway with tx 01K6P0ZVBVWQCW0JKYP4N2J480

    [21:52:29.051238] INFO freenet::operations::connect: Connecting to gateway,
        tx: 01K6P0ZVBVWQCW0JKYP4N2J480, gateway: v6MWKgqJtu7MJ9jv
    
  2. 21:52:34.075 - Peer1's bootstrap connection to gateway completes

    [21:52:34.075684] INFO freenet::ring: Adding connection to peer,
        peer: v6MWKgqJtu7MJ9jv, this: Some(v6MWKgqJ46eKhx16), was_reserved: true
    
    • Connection registered in Peer1's ring topology (ConnectionManager) ✓
    • P2P connection may still be establishing (timing race)
  3. 21:52:39.069 - Peer2's join reaches P2P layer

    [21:52:39.069451] INFO freenet::node::network_bridge::p2p_protoc: Connecting to peer,
        tx: 01K6P0ZVBVWQCW0JKYP4N2J480, remote: v6MWKgqJtu7MJ9jv
    
    • Peer2's join (tx 01K6P0ZVBVWQCW0JKYP4N2J480) is still active
  4. 21:52:39.083 - Gateway forwards CheckConnectivity to Peer1 reusing Peer2's tx

    [21:52:39.083277] INFO freenet::operations::connect: CheckConnectivity received - TRACE,
        tx: 01K6P0ZVBVWQCW0JKYP4N2J480, at: v6MWKgqJ46eKhx16,
        joiner: v6MWKgqJE4HoFnZW, sender: v6MWKgqJtu7MJ9jv
    
    • Gateway asks Peer1: "Will you accept connection from Peer2?"
    • Uses Peer2's join transaction ID 01K6P0ZVBVWQCW0JKYP4N2J480
  5. 21:52:39.083 - Peer1 processes CheckConnectivity, decides to accept, creates AcceptedBy response

  6. 21:52:39.088 - Peer1 tries to send AcceptedBy response back to gateway

    Code path (p2p_protoc.rs:247-258):

    match self.connections.get(&target_peer.peer) {
        Some(peer_connection) => {
            // Send message directly ← We want this path
        }
        None => {
            tracing::warn!("No existing outbound connection, establishing connection first");
            // Initiate new connection ← But we hit this path instead!
        }
    }

    What went wrong:

    [21:52:39.088928] WARN freenet::node::network_bridge::p2p_protoc:
        No existing outbound connection, establishing connection first,
        id: 01K6P0ZVBVWQCW0JKYP4N2J480, target: v6MWKgqJtu7MJ9jv
    
    • Peer1's ring says it's connected to gateway (added at :34.075) ✓
    • Peer1's P2P layer (self.connections) doesn't have gateway yet ✗
    • P2P layer tries to establish connection using the message's transaction ID
    • Uses tx 01K6P0ZVBVWQCW0JKYP4N2J480 - the same one Peer2 is using!
  7. 21:52:44.090 - Connection attempt times out after 5 seconds

    [21:52:44.090160] ERROR freenet::node::network_bridge::p2p_protoc:
        Timeout or error establishing connection to v6MWKgqJtu7MJ9jv
    
    • The connection attempt (with conflicting transaction ID) fails
    • Gateway's network event listener receives timeout error
    • Error propagates up through the async task stack
    • Gateway's node.run().await returns with error
    • In production: freenet process exits
    • In test: tokio::select! picks up gateway completion and test panics

The Core Problems

Problem 1: Transaction ID Reuse Across Concurrent Operations

Where it happens (connect.rs:1265-1275):

fn create_forward_message(
    id: Transaction,  // ← Reuses transaction from original join request
    request_peer: &PeerKeyLocation,
    joiner: &PeerKeyLocation,
    target: &PeerKeyLocation,
    hops_to_live: usize,
    max_hops_to_live: usize,
    skip_connections: HashSet<PeerId>,
    skip_forwards: HashSet<PeerId>,
) -> NetMessage {
    NetMessage::from(ConnectMsg::Request {
        id,  // ← Same transaction ID flows through entire protocol
        target: target.clone(),
        msg: ConnectRequest::CheckConnectivity {
            sender: request_peer.clone(),
            joiner: joiner.clone(),
            hops_to_live,
            max_hops_to_live,
            skip_connections,
            skip_forwards,
        },
    })
}

The flow:

  1. Peer2 starts join with tx 01K6P0ZVBVWQCW0JKYP4N2J480
  2. Gateway calls create_forward_message(01K6P0ZVBVWQCW0JKYP4N2J480, ...) to create CheckConnectivity
  3. Peer1 receives CheckConnectivity with tx 01K6P0ZVBVWQCW0JKYP4N2J480
  4. Peer1's P2P layer needs to send response but has no connection to gateway
  5. P2P layer tries to establish connection using the message's tx 01K6P0ZVBVWQCW0JKYP4N2J480
  6. Conflict: Both Peer2's join AND Peer1's response connection use the same transaction ID

Design question: Is this intentional? The protocol appears designed to track an entire join operation with one transaction ID, but this assumes operations don't overlap or create conflicts.

Problem 2: Bootstrap Connections Don't Guarantee Active P2P Links

Where bootstrap happens (connect.rs:1140-1151):

// When gateway has zero connections, accept first joiner directly
if num_connections == 0 {
    if num_reserved == 1 && is_gateway && accepted {
        tracing::info!(
            tx = %id,
            joiner = %joiner.peer,
            "Gateway bootstrap: accepting first connection directly"
        );
        let connectivity_info = ConnectivityInfo::new_bootstrap(
            joiner.clone(),
            1, // Single check for direct connection
        );
        // Returns immediately, triggering ring registration
        return Ok(Some(ConnectState::AwaitingConnectivity(connectivity_info)));
    }
}

Then in handshake handler (p2p_protoc.rs:836):

if connectivity_info.is_bootstrap_acceptance {
    tracing::info!(
        id = %id,
        joiner = %joiner.peer,
        location = %location,
        "Bootstrap connection: immediately registering in ring"
    );
    // Adds to ring topology
    self.ring.add_connection(location, joiner.peer.clone(), true).await;
}

What's missing:

  • The code adds the peer to self.ring (ConnectionManager)
  • But doesn't ensure the peer is in self.connections (active P2P transport layer)
  • This creates a window where ring thinks we're connected but P2P layer doesn't

Why this causes issues:
When Peer1 receives CheckConnectivity and tries to send AcceptedBy response:

// In network_bridge/p2p_protoc.rs when sending a message:
match self.connections.get(&target_peer.peer) {
    Some(_) => { /* send directly */ }
    None => {
        // This path gets hit even though ring shows connection!
        tracing::warn!("No existing outbound connection, establishing connection first");
        // Tries to establish new connection with message's transaction ID
    }
}

Why This Matters

The combination of these two problems creates a race condition:

  1. Peer1's bootstrap registers in ring but P2P layer connection may not be ready
  2. When Peer1 needs to respond to CheckConnectivity, it tries to establish P2P connection
  3. Uses the same transaction ID that Peer2 is actively using for its join
  4. Connection conflict causes timeout, which closes channels and causes gateway to exit

Impact Assessment

Test environment: Reproduces reliably (100% failure rate with current timing)

Production environment:

  • Checked production gateway logs (past 7 days): No "channel closed" errors observed
  • No unexpected restarts (only manual restarts for maintenance)
  • Gateway uptime: ~2 days without issues
  • Why it doesn't happen in production:
    • Larger network: More peers, different timing patterns
    • Less predictable join timing: Peers don't join in rapid succession like the test
    • Race condition requires very specific timing window

Severity: Medium

  • Could cause node shutdowns in production if timing aligns
  • Systemd would auto-restart (if configured), but creates service gap
  • Not currently impacting production, but is a latent bug

Expected vs Actual Behavior

Expected:

  • Peer1 bootstraps to gateway → ring registration + active P2P connection
  • Peer2 joins → gateway sends CheckConnectivity to Peer1
  • Peer1 sends AcceptedBy response over existing P2P connection
  • Both peers initiate bidirectional connections to each other
  • Full mesh connectivity achieved

Actual:

  • Peer1 bootstraps → ring registration, P2P connection may not be ready
  • Gateway reuses Peer2's join tx for CheckConnectivity
  • Peer1 tries to respond → no P2P connection in self.connections
  • P2P layer establishes new connection using Peer2's tx
  • Transaction conflict → timeout → gateway exits with error

Reproduction

cd ~/code/freenet/freenet-core/main

# Apply previous fixes from PR #1905 and issue #1907
# 1. Off-by-one fix in connection_manager.rs:169 (>= → >)
# 2. Bidirectional connection initiation in AcceptedBy handler
# 3. Graceful channel closure handling in CheckConnectivity/AcceptedBy

# Run test
cargo test --test connectivity test_three_node_network_connectivity -- --nocapture

# Expected: Gateway exits after ~33s with "channel closed" error
# Test panics at line 772
# Logs show transaction ID conflict

Questions for Discussion

  1. Is transaction ID reuse intentional design?

    • Should CheckConnectivity messages reuse the joiner's transaction ID?
    • The current design allows tracking an entire join operation with one transaction
    • But this assumes no concurrent operations will conflict
    • Was there a specific reason for this design vs. generating new transaction IDs?
  2. Should bootstrap connections guarantee active P2P links?

    • Currently bootstrap adds to ring topology immediately
    • Should it wait for P2P layer connection to be fully established?
    • Or is the timing race acceptable and we should handle it differently?
  3. How should message sending handle missing P2P connections?

    • Current behavior: Establish new connection using message's transaction ID
    • Should we queue messages and wait for connection from elsewhere?
    • Should we use a different transaction ID for connection establishment?
  4. Is this a regression or was the test always flaky?

Potential Solutions

Option A: Don't Reuse Transaction IDs for CheckConnectivity

Generate fresh transaction IDs when forwarding CheckConnectivity:

fn create_forward_message(
    // Remove id parameter
    request_peer: &PeerKeyLocation,
    joiner: &PeerKeyLocation,
    // ...
) -> NetMessage {
    let new_tx = Transaction::new::<ConnectOp>();  // Fresh transaction
    NetMessage::from(ConnectMsg::Request {
        id: new_tx,  // New ID for this specific CheckConnectivity
        target: target.clone(),
        msg: ConnectRequest::CheckConnectivity {
            sender: request_peer.clone(),
            joiner: joiner.clone(),
            // ...
        },
    })
}

Pros:

  • Eliminates transaction ID conflicts
  • Each operation has its own lifecycle
  • Simple, targeted fix

Cons:

  • Breaks transaction flow tracking (can't trace entire join with one ID)
  • May affect other parts of protocol that assume transaction continuity
  • Requires audit of code that tracks transactions

Recommendation: This seems like the cleanest fix - the transaction ID conflict is the root cause.

Option B: Ensure Bootstrap Creates Active P2P Connections

Modify bootstrap path to wait for P2P connection before ring registration:

if num_connections == 0 && num_reserved == 1 && is_gateway && accepted {
    tracing::info!("Gateway bootstrap: accepting first connection directly");

    // First ensure bidirectional P2P connection exists
    let (callback, mut result) = tokio::sync::mpsc::channel(10);
    op_manager.notify_node_event(NodeEvent::ConnectPeer {
        peer: joiner.peer.clone(),
        tx: id,
        callback,
        is_gw: false,
    }).await?;

    // Wait for P2P layer to establish connection
    result.recv().await.ok_or(OpError::NotificationError)??;

    // Now register in ring (both layers are in sync)
    return Ok(Some(ConnectState::AwaitingConnectivity(connectivity_info)));
}

Pros:

  • Ensures consistent state between ring and P2P layers
  • Fixes the root cause of "No existing outbound connection"
  • Makes bootstrap more predictable

Cons:

  • May slow down bootstrap (adds async wait)
  • Changes bootstrap semantics (was fast path, now waits)
  • Need to verify this doesn't break other bootstrap assumptions
  • Doesn't fix the transaction ID conflict issue

Recommendation: This is a good defensive improvement, but doesn't fully solve the problem.

Option C: Use Separate Transaction for Connection Establishment

When sending responses requires establishing P2P connection, use a different transaction ID:

None => {
    tracing::warn!("No existing outbound connection, establishing connection first");

    let conn_tx = Transaction::new::<ConnectOp>();  // Fresh tx for this connection
    let (callback, mut result) = tokio::sync::mpsc::channel(10);

    self.bridge.ev_listener_tx.send(Right(NodeEvent::ConnectPeer {
        peer: target_peer.peer.clone(),
        tx: conn_tx,  // Use new transaction, not message's tx
        callback,
        is_gw: false,
    })).await?;

    // Wait for connection, then send original message with its original tx
    match timeout(Duration::from_secs(5), result.recv()).await {
        Ok(Some(Ok(_))) => {
            // Connection established, now send the original message
            if let Some(peer_connection) = self.connections.get(&target_peer.peer) {
                peer_connection.send(Left(msg)).await?;
            }
        }
        // Handle errors...
    }
}

Pros:

  • Minimal protocol changes
  • Fixes immediate issue without changing transaction semantics
  • Preserves transaction ID tracking for join operations

Cons:

  • Creates additional transactions (one for join, one for connection)
  • May complicate transaction lifecycle tracking
  • Adds complexity to message sending logic
  • Doesn't address underlying bootstrap timing issue

Recommendation: This is a surgical fix that targets the specific conflict point.

Recommended Approach

Combination of Option A + Option B:

  1. Short term: Implement Option A to fix the transaction ID conflict
  2. Medium term: Implement Option B to make bootstrap more robust

This ensures both:

  • Immediate fix for the test failure (Option A)
  • Long-term improvement to prevent similar timing issues (Option B)

Test Location

crates/core/tests/connectivity.rs:424 - test_three_node_network_connectivity

Related Issues

Additional Context

Two-layer architecture reminder:

  • Ring layer (ConnectionManager): Logical peer-to-peer connections for DHT routing
  • P2P layer (self.connections): Actual transport channels for sending messages

These can temporarily be out of sync, especially during bootstrap or rapid connection changes.

Transaction lifecycle:

  • Transaction IDs are created per-operation (Transaction::new::<ConnectOp>())
  • Currently, a single transaction flows through: join request → CheckConnectivity → AcceptedBy response
  • This allows tracking the entire operation but assumes no concurrent conflicts

Production testing:

  • Checked production gateway on nova.locut.us
  • No "channel closed" errors in past 7 days
  • No unexpected restarts
  • Only manual restarts for maintenance

[AI-assisted debugging and comment]

Metadata

Metadata

Assignees

No one assigned

    Labels

    A-networkingArea: Networking, ring protocol, peer discoveryP-highHigh priority

    Type

    No type

    Projects

    Status

    Done

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions