- 
          
- 
                Notifications
    You must be signed in to change notification settings 
- Fork 106
Description
Transaction ID Reuse Causes Gateway Shutdown in test_three_node_network_connectivity
Summary
After fixing the off-by-one bug (#1905) and implementing bidirectional connection initiation (#1907), the test_three_node_network_connectivity test still fails due to a transaction ID reuse race condition that causes the gateway's node.run().await to return an error, shutting down the gateway process.
Impact: While this bug hasn't been observed in production (checked production gateway logs - no unexpected restarts), the test reliably reproduces a race condition that could cause node shutdowns under specific timing conditions.
Environment
- Test: crates/core/tests/connectivity.rs:424-test_three_node_network_connectivity
- Configuration: 3 nodes (1 gateway + 2 peers), all with min_connections=2,max_connections=2
- Timing: Peers join sequentially at T+0s (gateway), T+12s (peer1), T+17s (peer2)
Observed Behavior
Gateway process exits after 33 seconds with error:
[21:52:44.090160] ERROR freenet::node::network_bridge::p2p_protoc: Timeout or error establishing connection to v6MWKgqJtu7MJ9jv
thread 'test_three_node_network_connectivity' panicked at crates/core/tests/connectivity.rs:772:13:
error: anyhow::Error - Gateway error: channel closed
What happens:
- Gateway's node.run().awaitreturnsErr("channel closed")instead of running indefinitely
- In production, this would cause the freenet process to exit (cleanly, not panic)
- Systemd would restart it if configured with Restart=always
- The test panics because it uses tokio::select!and didn't expect the gateway task to finish
Note: The gateway doesn't panic - it exits cleanly when an internal channel closes. In production this would be a process exit, not a crash.
Background: How the Connect Protocol Works
This issue involves the interaction between two layers in the codebase:
Layer 1: Ring Topology (ConnectionManager)
- Lives in crates/core/src/ring/connection_manager.rs
- Tracks logical connections between peers in the ring
- Methods: should_accept(),add_connection(),num_connections()
- Purpose: Determines routing and which peers we're "connected to" for DHT purposes
Layer 2: P2P Network (self.connections in p2p_protoc.rs)
- Lives in crates/core/src/node/network_bridge/p2p_protoc.rs
- Tracks active transport-layer connections (UDP/TCP channels)
- Type: HashMap<PeerId, mpsc::Sender<NetMessage>>
- Purpose: Actually sends messages between peers
Key insight: A peer can be registered in the ring but not have an active P2P connection yet (or vice versa). These layers can temporarily be out of sync.
The Join Protocol Flow
When a peer wants to join the network:
- Peer sends StartJoinReq→ Gateway (with transaction ID, e.g., "tx A")
- Gateway finds optimal peer for the joiner to connect to
- Gateway sends CheckConnectivity→ Existing peer (asking "will you accept this joiner?")- Uses the same transaction ID "tx A" from the original join request
 
- Existing peer responds AcceptedBy→ Gateway (using same "tx A")
- Gateway forwards AcceptedBy→ Original joiner (using same "tx A")
The same transaction ID flows through the entire protocol to track the join operation.
Bootstrap Connection Special Case
When a gateway has zero connections and receives its first join:
Code path (connect.rs:1140-1151):
if num_connections == 0 {
    if num_reserved == 1 && is_gateway && accepted {
        tracing::info!("Gateway bootstrap: accepting first connection directly");
        // Returns ConnectState::AwaitingConnectivity
        // This triggers immediate registration in ring (p2p_protoc.rs:836)
    }
}What happens:
- Connection gets added to ConnectionManager(ring topology)
- Connection gets added to self.connections(P2P layer) when handshake completes
- But there's a timing window where ring thinks we're connected before P2P layer does
Root Cause Analysis
Timeline of Events
Peer IDs:
- Gateway: v6MWKgqJtu7MJ9jv
- Peer1: v6MWKgqJ46eKhx16(starts T+12s)
- Peer2: v6MWKgqJE4HoFnZW(starts T+17s)
Sequence:
- 
21:52:29 - Peer2 initiates join to gateway with tx 01K6P0ZVBVWQCW0JKYP4N2J480[21:52:29.051238] INFO freenet::operations::connect: Connecting to gateway, tx: 01K6P0ZVBVWQCW0JKYP4N2J480, gateway: v6MWKgqJtu7MJ9jv
- 
21:52:34.075 - Peer1's bootstrap connection to gateway completes [21:52:34.075684] INFO freenet::ring: Adding connection to peer, peer: v6MWKgqJtu7MJ9jv, this: Some(v6MWKgqJ46eKhx16), was_reserved: true- Connection registered in Peer1's ring topology (ConnectionManager) ✓
- P2P connection may still be establishing (timing race)
 
- 
21:52:39.069 - Peer2's join reaches P2P layer [21:52:39.069451] INFO freenet::node::network_bridge::p2p_protoc: Connecting to peer, tx: 01K6P0ZVBVWQCW0JKYP4N2J480, remote: v6MWKgqJtu7MJ9jv- Peer2's join (tx 01K6P0ZVBVWQCW0JKYP4N2J480) is still active
 
- Peer2's join (tx 
- 
21:52:39.083 - Gateway forwards CheckConnectivityto Peer1 reusing Peer2's tx[21:52:39.083277] INFO freenet::operations::connect: CheckConnectivity received - TRACE, tx: 01K6P0ZVBVWQCW0JKYP4N2J480, at: v6MWKgqJ46eKhx16, joiner: v6MWKgqJE4HoFnZW, sender: v6MWKgqJtu7MJ9jv- Gateway asks Peer1: "Will you accept connection from Peer2?"
- Uses Peer2's join transaction ID 01K6P0ZVBVWQCW0JKYP4N2J480
 
- 
21:52:39.083 - Peer1 processes CheckConnectivity, decides to accept, createsAcceptedByresponse
- 
21:52:39.088 - Peer1 tries to send AcceptedByresponse back to gatewayCode path ( p2p_protoc.rs:247-258):match self.connections.get(&target_peer.peer) { Some(peer_connection) => { // Send message directly ← We want this path } None => { tracing::warn!("No existing outbound connection, establishing connection first"); // Initiate new connection ← But we hit this path instead! } } What went wrong: [21:52:39.088928] WARN freenet::node::network_bridge::p2p_protoc: No existing outbound connection, establishing connection first, id: 01K6P0ZVBVWQCW0JKYP4N2J480, target: v6MWKgqJtu7MJ9jv- Peer1's ring says it's connected to gateway (added at :34.075) ✓
- Peer1's P2P layer (self.connections) doesn't have gateway yet ✗
- P2P layer tries to establish connection using the message's transaction ID
- Uses tx 01K6P0ZVBVWQCW0JKYP4N2J480- the same one Peer2 is using!
 
- 
21:52:44.090 - Connection attempt times out after 5 seconds [21:52:44.090160] ERROR freenet::node::network_bridge::p2p_protoc: Timeout or error establishing connection to v6MWKgqJtu7MJ9jv- The connection attempt (with conflicting transaction ID) fails
- Gateway's network event listener receives timeout error
- Error propagates up through the async task stack
- Gateway's node.run().awaitreturns with error
- In production: freenet process exits
- In test: tokio::select!picks up gateway completion and test panics
 
The Core Problems
Problem 1: Transaction ID Reuse Across Concurrent Operations
Where it happens (connect.rs:1265-1275):
fn create_forward_message(
    id: Transaction,  // ← Reuses transaction from original join request
    request_peer: &PeerKeyLocation,
    joiner: &PeerKeyLocation,
    target: &PeerKeyLocation,
    hops_to_live: usize,
    max_hops_to_live: usize,
    skip_connections: HashSet<PeerId>,
    skip_forwards: HashSet<PeerId>,
) -> NetMessage {
    NetMessage::from(ConnectMsg::Request {
        id,  // ← Same transaction ID flows through entire protocol
        target: target.clone(),
        msg: ConnectRequest::CheckConnectivity {
            sender: request_peer.clone(),
            joiner: joiner.clone(),
            hops_to_live,
            max_hops_to_live,
            skip_connections,
            skip_forwards,
        },
    })
}The flow:
- Peer2 starts join with tx 01K6P0ZVBVWQCW0JKYP4N2J480
- Gateway calls create_forward_message(01K6P0ZVBVWQCW0JKYP4N2J480, ...)to createCheckConnectivity
- Peer1 receives CheckConnectivitywith tx01K6P0ZVBVWQCW0JKYP4N2J480
- Peer1's P2P layer needs to send response but has no connection to gateway
- P2P layer tries to establish connection using the message's tx 01K6P0ZVBVWQCW0JKYP4N2J480
- Conflict: Both Peer2's join AND Peer1's response connection use the same transaction ID
Design question: Is this intentional? The protocol appears designed to track an entire join operation with one transaction ID, but this assumes operations don't overlap or create conflicts.
Problem 2: Bootstrap Connections Don't Guarantee Active P2P Links
Where bootstrap happens (connect.rs:1140-1151):
// When gateway has zero connections, accept first joiner directly
if num_connections == 0 {
    if num_reserved == 1 && is_gateway && accepted {
        tracing::info!(
            tx = %id,
            joiner = %joiner.peer,
            "Gateway bootstrap: accepting first connection directly"
        );
        let connectivity_info = ConnectivityInfo::new_bootstrap(
            joiner.clone(),
            1, // Single check for direct connection
        );
        // Returns immediately, triggering ring registration
        return Ok(Some(ConnectState::AwaitingConnectivity(connectivity_info)));
    }
}Then in handshake handler (p2p_protoc.rs:836):
if connectivity_info.is_bootstrap_acceptance {
    tracing::info!(
        id = %id,
        joiner = %joiner.peer,
        location = %location,
        "Bootstrap connection: immediately registering in ring"
    );
    // Adds to ring topology
    self.ring.add_connection(location, joiner.peer.clone(), true).await;
}What's missing:
- The code adds the peer to self.ring(ConnectionManager)
- But doesn't ensure the peer is in self.connections(active P2P transport layer)
- This creates a window where ring thinks we're connected but P2P layer doesn't
Why this causes issues:
When Peer1 receives CheckConnectivity and tries to send AcceptedBy response:
// In network_bridge/p2p_protoc.rs when sending a message:
match self.connections.get(&target_peer.peer) {
    Some(_) => { /* send directly */ }
    None => {
        // This path gets hit even though ring shows connection!
        tracing::warn!("No existing outbound connection, establishing connection first");
        // Tries to establish new connection with message's transaction ID
    }
}Why This Matters
The combination of these two problems creates a race condition:
- Peer1's bootstrap registers in ring but P2P layer connection may not be ready
- When Peer1 needs to respond to CheckConnectivity, it tries to establish P2P connection
- Uses the same transaction ID that Peer2 is actively using for its join
- Connection conflict causes timeout, which closes channels and causes gateway to exit
Impact Assessment
Test environment: Reproduces reliably (100% failure rate with current timing)
Production environment:
- Checked production gateway logs (past 7 days): No "channel closed" errors observed
- No unexpected restarts (only manual restarts for maintenance)
- Gateway uptime: ~2 days without issues
- Why it doesn't happen in production:
- Larger network: More peers, different timing patterns
- Less predictable join timing: Peers don't join in rapid succession like the test
- Race condition requires very specific timing window
 
Severity: Medium
- Could cause node shutdowns in production if timing aligns
- Systemd would auto-restart (if configured), but creates service gap
- Not currently impacting production, but is a latent bug
Expected vs Actual Behavior
Expected:
- Peer1 bootstraps to gateway → ring registration + active P2P connection
- Peer2 joins → gateway sends CheckConnectivityto Peer1
- Peer1 sends AcceptedByresponse over existing P2P connection
- Both peers initiate bidirectional connections to each other
- Full mesh connectivity achieved
Actual:
- Peer1 bootstraps → ring registration, P2P connection may not be ready
- Gateway reuses Peer2's join tx for CheckConnectivity
- Peer1 tries to respond → no P2P connection in self.connections
- P2P layer establishes new connection using Peer2's tx
- Transaction conflict → timeout → gateway exits with error
Reproduction
cd ~/code/freenet/freenet-core/main
# Apply previous fixes from PR #1905 and issue #1907
# 1. Off-by-one fix in connection_manager.rs:169 (>= → >)
# 2. Bidirectional connection initiation in AcceptedBy handler
# 3. Graceful channel closure handling in CheckConnectivity/AcceptedBy
# Run test
cargo test --test connectivity test_three_node_network_connectivity -- --nocapture
# Expected: Gateway exits after ~33s with "channel closed" error
# Test panics at line 772
# Logs show transaction ID conflictQuestions for Discussion
- 
Is transaction ID reuse intentional design? - Should CheckConnectivitymessages reuse the joiner's transaction ID?
- The current design allows tracking an entire join operation with one transaction
- But this assumes no concurrent operations will conflict
- Was there a specific reason for this design vs. generating new transaction IDs?
 
- Should 
- 
Should bootstrap connections guarantee active P2P links? - Currently bootstrap adds to ring topology immediately
- Should it wait for P2P layer connection to be fully established?
- Or is the timing race acceptable and we should handle it differently?
 
- 
How should message sending handle missing P2P connections? - Current behavior: Establish new connection using message's transaction ID
- Should we queue messages and wait for connection from elsewhere?
- Should we use a different transaction ID for connection establishment?
 
- 
Is this a regression or was the test always flaky? - The test was removed from CI due to topology manager bugs (issue Topology manager requests duplicate connections to same peer instead of diversifying #1889)
- Was this transaction conflict issue present before?
- Or did recent changes (bidirectional initiation, etc.) expose it?
 
Potential Solutions
Option A: Don't Reuse Transaction IDs for CheckConnectivity
Generate fresh transaction IDs when forwarding CheckConnectivity:
fn create_forward_message(
    // Remove id parameter
    request_peer: &PeerKeyLocation,
    joiner: &PeerKeyLocation,
    // ...
) -> NetMessage {
    let new_tx = Transaction::new::<ConnectOp>();  // Fresh transaction
    NetMessage::from(ConnectMsg::Request {
        id: new_tx,  // New ID for this specific CheckConnectivity
        target: target.clone(),
        msg: ConnectRequest::CheckConnectivity {
            sender: request_peer.clone(),
            joiner: joiner.clone(),
            // ...
        },
    })
}Pros:
- Eliminates transaction ID conflicts
- Each operation has its own lifecycle
- Simple, targeted fix
Cons:
- Breaks transaction flow tracking (can't trace entire join with one ID)
- May affect other parts of protocol that assume transaction continuity
- Requires audit of code that tracks transactions
Recommendation: This seems like the cleanest fix - the transaction ID conflict is the root cause.
Option B: Ensure Bootstrap Creates Active P2P Connections
Modify bootstrap path to wait for P2P connection before ring registration:
if num_connections == 0 && num_reserved == 1 && is_gateway && accepted {
    tracing::info!("Gateway bootstrap: accepting first connection directly");
    // First ensure bidirectional P2P connection exists
    let (callback, mut result) = tokio::sync::mpsc::channel(10);
    op_manager.notify_node_event(NodeEvent::ConnectPeer {
        peer: joiner.peer.clone(),
        tx: id,
        callback,
        is_gw: false,
    }).await?;
    // Wait for P2P layer to establish connection
    result.recv().await.ok_or(OpError::NotificationError)??;
    // Now register in ring (both layers are in sync)
    return Ok(Some(ConnectState::AwaitingConnectivity(connectivity_info)));
}Pros:
- Ensures consistent state between ring and P2P layers
- Fixes the root cause of "No existing outbound connection"
- Makes bootstrap more predictable
Cons:
- May slow down bootstrap (adds async wait)
- Changes bootstrap semantics (was fast path, now waits)
- Need to verify this doesn't break other bootstrap assumptions
- Doesn't fix the transaction ID conflict issue
Recommendation: This is a good defensive improvement, but doesn't fully solve the problem.
Option C: Use Separate Transaction for Connection Establishment
When sending responses requires establishing P2P connection, use a different transaction ID:
None => {
    tracing::warn!("No existing outbound connection, establishing connection first");
    let conn_tx = Transaction::new::<ConnectOp>();  // Fresh tx for this connection
    let (callback, mut result) = tokio::sync::mpsc::channel(10);
    self.bridge.ev_listener_tx.send(Right(NodeEvent::ConnectPeer {
        peer: target_peer.peer.clone(),
        tx: conn_tx,  // Use new transaction, not message's tx
        callback,
        is_gw: false,
    })).await?;
    // Wait for connection, then send original message with its original tx
    match timeout(Duration::from_secs(5), result.recv()).await {
        Ok(Some(Ok(_))) => {
            // Connection established, now send the original message
            if let Some(peer_connection) = self.connections.get(&target_peer.peer) {
                peer_connection.send(Left(msg)).await?;
            }
        }
        // Handle errors...
    }
}Pros:
- Minimal protocol changes
- Fixes immediate issue without changing transaction semantics
- Preserves transaction ID tracking for join operations
Cons:
- Creates additional transactions (one for join, one for connection)
- May complicate transaction lifecycle tracking
- Adds complexity to message sending logic
- Doesn't address underlying bootstrap timing issue
Recommendation: This is a surgical fix that targets the specific conflict point.
Recommended Approach
Combination of Option A + Option B:
- Short term: Implement Option A to fix the transaction ID conflict
- Medium term: Implement Option B to make bootstrap more robust
This ensures both:
- Immediate fix for the test failure (Option A)
- Long-term improvement to prevent similar timing issues (Option B)
Test Location
crates/core/tests/connectivity.rs:424 - test_three_node_network_connectivity
Related Issues
- Issue test_three_node_network_connectivity fails: peers don't connect to each other #1904: Original test failure report
- PR Restore test_three_node_network_connectivity as minimal failing test for peer connectivity bug #1905: Off-by-one fix in max_connections check
- PR Fix peer recommendation to use full skip_connections set #1906: Skip-list fix - merged
- Issue test_three_node_network_connectivity fails: asymmetric CheckConnectivity prevents NAT traversal #1907: Missing bidirectional connection initiation
- Issue Topology manager requests duplicate connections to same peer instead of diversifying #1889: Topology manager bugs that caused test to be removed from CI
Additional Context
Two-layer architecture reminder:
- Ring layer (ConnectionManager): Logical peer-to-peer connections for DHT routing
- P2P layer (self.connections): Actual transport channels for sending messages
These can temporarily be out of sync, especially during bootstrap or rapid connection changes.
Transaction lifecycle:
- Transaction IDs are created per-operation (Transaction::new::<ConnectOp>())
- Currently, a single transaction flows through: join request → CheckConnectivity → AcceptedBy response
- This allows tracking the entire operation but assumes no concurrent conflicts
Production testing:
- Checked production gateway on nova.locut.us
- No "channel closed" errors in past 7 days
- No unexpected restarts
- Only manual restarts for maintenance
[AI-assisted debugging and comment]
Metadata
Metadata
Assignees
Labels
Type
Projects
Status