Transaction ID Reuse Causes Gateway Shutdown in test_three_node_network_connectivity

# Transaction ID Reuse Causes Gateway Shutdown in test_three_node_network_connectivity

## Summary

After fixing the off-by-one bug (#1905) and implementing bidirectional connection initiation (#1907), the `test_three_node_network_connectivity` test still fails due to a **transaction ID reuse race condition** that causes the gateway's `node.run().await` to return an error, shutting down the gateway process.

**Impact:** While this bug hasn't been observed in production (checked production gateway logs - no unexpected restarts), the test reliably reproduces a race condition that could cause node shutdowns under specific timing conditions.

## Environment

- Test: [`crates/core/tests/connectivity.rs:424`](https://github.com/freenet/freenet-core/blob/main/crates/core/tests/connectivity.rs#L424) - `test_three_node_network_connectivity`
- Configuration: 3 nodes (1 gateway + 2 peers), all with `min_connections=2`, `max_connections=2`
- Timing: Peers join sequentially at T+0s (gateway), T+12s (peer1), T+17s (peer2)

## Observed Behavior

Gateway process exits after 33 seconds with error:
```
[21:52:44.090160] ERROR freenet::node::network_bridge::p2p_protoc: Timeout or error establishing connection to v6MWKgqJtu7MJ9jv
thread 'test_three_node_network_connectivity' panicked at crates/core/tests/connectivity.rs:772:13:
error: anyhow::Error - Gateway error: channel closed
```

**What happens:**
- Gateway's `node.run().await` returns `Err("channel closed")` instead of running indefinitely
- In production, this would cause the freenet process to exit (cleanly, not panic)
- Systemd would restart it if configured with `Restart=always`
- The **test** panics because it uses `tokio::select!` and didn't expect the gateway task to finish

**Note:** The gateway doesn't panic - it exits cleanly when an internal channel closes. In production this would be a process exit, not a crash.

## Background: How the Connect Protocol Works

This issue involves the interaction between two layers in the codebase:

### Layer 1: Ring Topology (`ConnectionManager`)
- Lives in [`crates/core/src/ring/connection_manager.rs`](https://github.com/freenet/freenet-core/blob/main/crates/core/src/ring/connection_manager.rs)
- Tracks **logical connections** between peers in the ring
- Methods: `should_accept()`, `add_connection()`, `num_connections()`
- Purpose: Determines routing and which peers we're "connected to" for DHT purposes

### Layer 2: P2P Network (`self.connections` in `p2p_protoc.rs`)
- Lives in [`crates/core/src/node/network_bridge/p2p_protoc.rs`](https://github.com/freenet/freenet-core/blob/main/crates/core/src/node/network_bridge/p2p_protoc.rs)
- Tracks **active transport-layer connections** (UDP/TCP channels)
- Type: `HashMap<PeerId, mpsc::Sender<NetMessage>>`
- Purpose: Actually sends messages between peers

**Key insight:** A peer can be registered in the ring but not have an active P2P connection yet (or vice versa). These layers can temporarily be out of sync.

### The Join Protocol Flow

When a peer wants to join the network:

1. **Peer sends `StartJoinReq`** → Gateway (with transaction ID, e.g., "tx A")
2. **Gateway finds optimal peer** for the joiner to connect to
3. **Gateway sends `CheckConnectivity`** → Existing peer (asking "will you accept this joiner?")
   - **Uses the same transaction ID "tx A"** from the original join request
4. **Existing peer responds `AcceptedBy`** → Gateway (using same "tx A")
5. **Gateway forwards `AcceptedBy`** → Original joiner (using same "tx A")

The **same transaction ID flows through the entire protocol** to track the join operation.

### Bootstrap Connection Special Case

When a gateway has zero connections and receives its first join:

**Code path** ([`connect.rs:1140-1151`](https://github.com/freenet/freenet-core/blob/main/crates/core/src/operations/connect.rs#L1140-L1151)):
```rust
if num_connections == 0 {
    if num_reserved == 1 && is_gateway && accepted {
        tracing::info!("Gateway bootstrap: accepting first connection directly");
        // Returns ConnectState::AwaitingConnectivity
        // This triggers immediate registration in ring (p2p_protoc.rs:836)
    }
}
```

**What happens:**
- Connection gets added to `ConnectionManager` (ring topology)
- Connection gets added to `self.connections` (P2P layer) when handshake completes
- But there's a **timing window** where ring thinks we're connected before P2P layer does

## Root Cause Analysis

### Timeline of Events

**Peer IDs:**
- Gateway: `v6MWKgqJtu7MJ9jv`
- Peer1: `v6MWKgqJ46eKhx16` (starts T+12s)
- Peer2: `v6MWKgqJE4HoFnZW` (starts T+17s)

**Sequence:**

1. **21:52:29** - Peer2 initiates join to gateway with tx `01K6P0ZVBVWQCW0JKYP4N2J480`
   ```
   [21:52:29.051238] INFO freenet::operations::connect: Connecting to gateway,
       tx: 01K6P0ZVBVWQCW0JKYP4N2J480, gateway: v6MWKgqJtu7MJ9jv
   ```

2. **21:52:34.075** - Peer1's bootstrap connection to gateway completes
   ```
   [21:52:34.075684] INFO freenet::ring: Adding connection to peer,
       peer: v6MWKgqJtu7MJ9jv, this: Some(v6MWKgqJ46eKhx16), was_reserved: true
   ```
   - Connection registered in Peer1's **ring topology** (ConnectionManager) ✓
   - P2P connection may still be establishing (timing race)

3. **21:52:39.069** - Peer2's join reaches P2P layer
   ```
   [21:52:39.069451] INFO freenet::node::network_bridge::p2p_protoc: Connecting to peer,
       tx: 01K6P0ZVBVWQCW0JKYP4N2J480, remote: v6MWKgqJtu7MJ9jv
   ```
   - Peer2's join (tx `01K6P0ZVBVWQCW0JKYP4N2J480`) is still active

4. **21:52:39.083** - Gateway forwards `CheckConnectivity` to Peer1 **reusing Peer2's tx**
   ```
   [21:52:39.083277] INFO freenet::operations::connect: CheckConnectivity received - TRACE,
       tx: 01K6P0ZVBVWQCW0JKYP4N2J480, at: v6MWKgqJ46eKhx16,
       joiner: v6MWKgqJE4HoFnZW, sender: v6MWKgqJtu7MJ9jv
   ```
   - Gateway asks Peer1: "Will you accept connection from Peer2?"
   - Uses Peer2's join transaction ID `01K6P0ZVBVWQCW0JKYP4N2J480`

5. **21:52:39.083** - Peer1 processes `CheckConnectivity`, decides to accept, creates `AcceptedBy` response

6. **21:52:39.088** - Peer1 tries to send `AcceptedBy` response back to gateway

   **Code path** ([`p2p_protoc.rs:247-258`](https://github.com/freenet/freenet-core/blob/main/crates/core/src/node/network_bridge/p2p_protoc.rs#L247-L258)):
   ```rust
   match self.connections.get(&target_peer.peer) {
       Some(peer_connection) => {
           // Send message directly ← We want this path
       }
       None => {
           tracing::warn!("No existing outbound connection, establishing connection first");
           // Initiate new connection ← But we hit this path instead!
       }
   }
   ```

   **What went wrong:**
   ```
   [21:52:39.088928] WARN freenet::node::network_bridge::p2p_protoc:
       No existing outbound connection, establishing connection first,
       id: 01K6P0ZVBVWQCW0JKYP4N2J480, target: v6MWKgqJtu7MJ9jv
   ```
   - Peer1's **ring** says it's connected to gateway (added at :34.075) ✓
   - Peer1's **P2P layer** (`self.connections`) doesn't have gateway yet ✗
   - P2P layer tries to establish connection using the message's transaction ID
   - **Uses tx `01K6P0ZVBVWQCW0JKYP4N2J480`** - the same one Peer2 is using!

7. **21:52:44.090** - Connection attempt times out after 5 seconds
   ```
   [21:52:44.090160] ERROR freenet::node::network_bridge::p2p_protoc:
       Timeout or error establishing connection to v6MWKgqJtu7MJ9jv
   ```
   - The connection attempt (with conflicting transaction ID) fails
   - Gateway's network event listener receives timeout error
   - Error propagates up through the async task stack
   - Gateway's `node.run().await` returns with error
   - In production: freenet process exits
   - In test: `tokio::select!` picks up gateway completion and test panics

### The Core Problems

#### Problem 1: Transaction ID Reuse Across Concurrent Operations

**Where it happens** ([`connect.rs:1265-1275`](https://github.com/freenet/freenet-core/blob/main/crates/core/src/operations/connect.rs#L1265-L1275)):
```rust
fn create_forward_message(
    id: Transaction,  // ← Reuses transaction from original join request
    request_peer: &PeerKeyLocation,
    joiner: &PeerKeyLocation,
    target: &PeerKeyLocation,
    hops_to_live: usize,
    max_hops_to_live: usize,
    skip_connections: HashSet<PeerId>,
    skip_forwards: HashSet<PeerId>,
) -> NetMessage {
    NetMessage::from(ConnectMsg::Request {
        id,  // ← Same transaction ID flows through entire protocol
        target: target.clone(),
        msg: ConnectRequest::CheckConnectivity {
            sender: request_peer.clone(),
            joiner: joiner.clone(),
            hops_to_live,
            max_hops_to_live,
            skip_connections,
            skip_forwards,
        },
    })
}
```

**The flow:**
1. Peer2 starts join with tx `01K6P0ZVBVWQCW0JKYP4N2J480`
2. Gateway calls `create_forward_message(01K6P0ZVBVWQCW0JKYP4N2J480, ...)` to create `CheckConnectivity`
3. Peer1 receives `CheckConnectivity` with tx `01K6P0ZVBVWQCW0JKYP4N2J480`
4. Peer1's P2P layer needs to send response but has no connection to gateway
5. P2P layer tries to establish connection using the message's tx `01K6P0ZVBVWQCW0JKYP4N2J480`
6. **Conflict:** Both Peer2's join AND Peer1's response connection use the same transaction ID

**Design question:** Is this intentional? The protocol appears designed to track an entire join operation with one transaction ID, but this assumes operations don't overlap or create conflicts.

#### Problem 2: Bootstrap Connections Don't Guarantee Active P2P Links

**Where bootstrap happens** ([`connect.rs:1140-1151`](https://github.com/freenet/freenet-core/blob/main/crates/core/src/operations/connect.rs#L1140-L1151)):
```rust
// When gateway has zero connections, accept first joiner directly
if num_connections == 0 {
    if num_reserved == 1 && is_gateway && accepted {
        tracing::info!(
            tx = %id,
            joiner = %joiner.peer,
            "Gateway bootstrap: accepting first connection directly"
        );
        let connectivity_info = ConnectivityInfo::new_bootstrap(
            joiner.clone(),
            1, // Single check for direct connection
        );
        // Returns immediately, triggering ring registration
        return Ok(Some(ConnectState::AwaitingConnectivity(connectivity_info)));
    }
}
```

**Then in handshake handler** ([`p2p_protoc.rs:836`](https://github.com/freenet/freenet-core/blob/main/crates/core/src/node/network_bridge/p2p_protoc.rs#L836)):
```rust
if connectivity_info.is_bootstrap_acceptance {
    tracing::info!(
        id = %id,
        joiner = %joiner.peer,
        location = %location,
        "Bootstrap connection: immediately registering in ring"
    );
    // Adds to ring topology
    self.ring.add_connection(location, joiner.peer.clone(), true).await;
}
```

**What's missing:**
- The code adds the peer to `self.ring` (ConnectionManager)
- But doesn't ensure the peer is in `self.connections` (active P2P transport layer)
- This creates a window where ring thinks we're connected but P2P layer doesn't

**Why this causes issues:**
When Peer1 receives `CheckConnectivity` and tries to send `AcceptedBy` response:
```rust
// In network_bridge/p2p_protoc.rs when sending a message:
match self.connections.get(&target_peer.peer) {
    Some(_) => { /* send directly */ }
    None => {
        // This path gets hit even though ring shows connection!
        tracing::warn!("No existing outbound connection, establishing connection first");
        // Tries to establish new connection with message's transaction ID
    }
}
```

### Why This Matters

The combination of these two problems creates a race condition:
1. Peer1's bootstrap registers in ring but P2P layer connection may not be ready
2. When Peer1 needs to respond to `CheckConnectivity`, it tries to establish P2P connection
3. Uses the same transaction ID that Peer2 is actively using for its join
4. Connection conflict causes timeout, which closes channels and causes gateway to exit

## Impact Assessment

**Test environment:** Reproduces reliably (100% failure rate with current timing)

**Production environment:**
- Checked production gateway logs (past 7 days): No "channel closed" errors observed
- No unexpected restarts (only manual restarts for maintenance)
- Gateway uptime: ~2 days without issues
- **Why it doesn't happen in production:**
  - Larger network: More peers, different timing patterns
  - Less predictable join timing: Peers don't join in rapid succession like the test
  - Race condition requires very specific timing window

**Severity:** Medium
- **Could** cause node shutdowns in production if timing aligns
- Systemd would auto-restart (if configured), but creates service gap
- Not currently impacting production, but is a latent bug

## Expected vs Actual Behavior

**Expected:**
- Peer1 bootstraps to gateway → ring registration + active P2P connection
- Peer2 joins → gateway sends `CheckConnectivity` to Peer1
- Peer1 sends `AcceptedBy` response over existing P2P connection
- Both peers initiate bidirectional connections to each other
- Full mesh connectivity achieved

**Actual:**
- Peer1 bootstraps → ring registration, P2P connection may not be ready
- Gateway reuses Peer2's join tx for `CheckConnectivity`
- Peer1 tries to respond → no P2P connection in `self.connections`
- P2P layer establishes new connection using Peer2's tx
- Transaction conflict → timeout → gateway exits with error

## Reproduction

```bash
cd ~/code/freenet/freenet-core/main

# Apply previous fixes from PR #1905 and issue #1907
# 1. Off-by-one fix in connection_manager.rs:169 (>= → >)
# 2. Bidirectional connection initiation in AcceptedBy handler
# 3. Graceful channel closure handling in CheckConnectivity/AcceptedBy

# Run test
cargo test --test connectivity test_three_node_network_connectivity -- --nocapture

# Expected: Gateway exits after ~33s with "channel closed" error
# Test panics at line 772
# Logs show transaction ID conflict
```

## Questions for Discussion

1. **Is transaction ID reuse intentional design?**
   - Should `CheckConnectivity` messages reuse the joiner's transaction ID?
   - The current design allows tracking an entire join operation with one transaction
   - But this assumes no concurrent operations will conflict
   - Was there a specific reason for this design vs. generating new transaction IDs?

2. **Should bootstrap connections guarantee active P2P links?**
   - Currently bootstrap adds to ring topology immediately
   - Should it wait for P2P layer connection to be fully established?
   - Or is the timing race acceptable and we should handle it differently?

3. **How should message sending handle missing P2P connections?**
   - Current behavior: Establish new connection using message's transaction ID
   - Should we queue messages and wait for connection from elsewhere?
   - Should we use a different transaction ID for connection establishment?

4. **Is this a regression or was the test always flaky?**
   - The test was removed from CI due to topology manager bugs (issue #1889)
   - Was this transaction conflict issue present before?
   - Or did recent changes (bidirectional initiation, etc.) expose it?

## Potential Solutions

### Option A: Don't Reuse Transaction IDs for CheckConnectivity

Generate fresh transaction IDs when forwarding `CheckConnectivity`:

```rust
fn create_forward_message(
    // Remove id parameter
    request_peer: &PeerKeyLocation,
    joiner: &PeerKeyLocation,
    // ...
) -> NetMessage {
    let new_tx = Transaction::new::<ConnectOp>();  // Fresh transaction
    NetMessage::from(ConnectMsg::Request {
        id: new_tx,  // New ID for this specific CheckConnectivity
        target: target.clone(),
        msg: ConnectRequest::CheckConnectivity {
            sender: request_peer.clone(),
            joiner: joiner.clone(),
            // ...
        },
    })
}
```

**Pros:**
- Eliminates transaction ID conflicts
- Each operation has its own lifecycle
- Simple, targeted fix

**Cons:**
- Breaks transaction flow tracking (can't trace entire join with one ID)
- May affect other parts of protocol that assume transaction continuity
- Requires audit of code that tracks transactions

**Recommendation:** This seems like the cleanest fix - the transaction ID conflict is the root cause.

### Option B: Ensure Bootstrap Creates Active P2P Connections

Modify bootstrap path to wait for P2P connection before ring registration:

```rust
if num_connections == 0 && num_reserved == 1 && is_gateway && accepted {
    tracing::info!("Gateway bootstrap: accepting first connection directly");

    // First ensure bidirectional P2P connection exists
    let (callback, mut result) = tokio::sync::mpsc::channel(10);
    op_manager.notify_node_event(NodeEvent::ConnectPeer {
        peer: joiner.peer.clone(),
        tx: id,
        callback,
        is_gw: false,
    }).await?;

    // Wait for P2P layer to establish connection
    result.recv().await.ok_or(OpError::NotificationError)??;

    // Now register in ring (both layers are in sync)
    return Ok(Some(ConnectState::AwaitingConnectivity(connectivity_info)));
}
```

**Pros:**
- Ensures consistent state between ring and P2P layers
- Fixes the root cause of "No existing outbound connection"
- Makes bootstrap more predictable

**Cons:**
- May slow down bootstrap (adds async wait)
- Changes bootstrap semantics (was fast path, now waits)
- Need to verify this doesn't break other bootstrap assumptions
- Doesn't fix the transaction ID conflict issue

**Recommendation:** This is a good defensive improvement, but doesn't fully solve the problem.

### Option C: Use Separate Transaction for Connection Establishment

When sending responses requires establishing P2P connection, use a different transaction ID:

```rust
None => {
    tracing::warn!("No existing outbound connection, establishing connection first");

    let conn_tx = Transaction::new::<ConnectOp>();  // Fresh tx for this connection
    let (callback, mut result) = tokio::sync::mpsc::channel(10);

    self.bridge.ev_listener_tx.send(Right(NodeEvent::ConnectPeer {
        peer: target_peer.peer.clone(),
        tx: conn_tx,  // Use new transaction, not message's tx
        callback,
        is_gw: false,
    })).await?;

    // Wait for connection, then send original message with its original tx
    match timeout(Duration::from_secs(5), result.recv()).await {
        Ok(Some(Ok(_))) => {
            // Connection established, now send the original message
            if let Some(peer_connection) = self.connections.get(&target_peer.peer) {
                peer_connection.send(Left(msg)).await?;
            }
        }
        // Handle errors...
    }
}
```

**Pros:**
- Minimal protocol changes
- Fixes immediate issue without changing transaction semantics
- Preserves transaction ID tracking for join operations

**Cons:**
- Creates additional transactions (one for join, one for connection)
- May complicate transaction lifecycle tracking
- Adds complexity to message sending logic
- Doesn't address underlying bootstrap timing issue

**Recommendation:** This is a surgical fix that targets the specific conflict point.

## Recommended Approach

**Combination of Option A + Option B:**

1. **Short term:** Implement Option A to fix the transaction ID conflict
2. **Medium term:** Implement Option B to make bootstrap more robust

This ensures both:
- Immediate fix for the test failure (Option A)
- Long-term improvement to prevent similar timing issues (Option B)

## Test Location

[`crates/core/tests/connectivity.rs:424`](https://github.com/freenet/freenet-core/blob/main/crates/core/tests/connectivity.rs#L424) - `test_three_node_network_connectivity`

## Related Issues

- Issue #1904: Original test failure report
- PR #1905: Off-by-one fix in max_connections check
- PR #1906: Skip-list fix - merged
- Issue #1907: Missing bidirectional connection initiation
- Issue #1889: Topology manager bugs that caused test to be removed from CI

## Additional Context

**Two-layer architecture reminder:**
- **Ring layer** (`ConnectionManager`): Logical peer-to-peer connections for DHT routing
- **P2P layer** (`self.connections`): Actual transport channels for sending messages

These can temporarily be out of sync, especially during bootstrap or rapid connection changes.

**Transaction lifecycle:**
- Transaction IDs are created per-operation (`Transaction::new::<ConnectOp>()`)
- Currently, a single transaction flows through: join request → CheckConnectivity → AcceptedBy response
- This allows tracking the entire operation but assumes no concurrent conflicts

**Production testing:**
- Checked production gateway on nova.locut.us
- No "channel closed" errors in past 7 days
- No unexpected restarts
- Only manual restarts for maintenance

[AI-assisted debugging and comment]


Uh oh!

Transaction ID Reuse Causes Gateway Shutdown in test_three_node_network_connectivity #1908

Description

Transaction ID Reuse Causes Gateway Shutdown in test_three_node_network_connectivity

Summary

Environment

Observed Behavior

Background: How the Connect Protocol Works

Layer 1: Ring Topology (ConnectionManager)

Layer 2: P2P Network (self.connections in p2p_protoc.rs)

The Join Protocol Flow

Bootstrap Connection Special Case

Root Cause Analysis

Timeline of Events

The Core Problems

Problem 1: Transaction ID Reuse Across Concurrent Operations

Problem 2: Bootstrap Connections Don't Guarantee Active P2P Links

Why This Matters

Impact Assessment

Expected vs Actual Behavior

Reproduction

Questions for Discussion

Potential Solutions

Option A: Don't Reuse Transaction IDs for CheckConnectivity

Option B: Ensure Bootstrap Creates Active P2P Connections

Option C: Use Separate Transaction for Connection Establishment

Recommended Approach

Test Location

Related Issues

Additional Context

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Layer 1: Ring Topology (`ConnectionManager`)

Layer 2: P2P Network (`self.connections` in `p2p_protoc.rs`)