
Conversation


@sanity sanity commented Nov 22, 2025

Problem

Transient (gateway) connections could bypass admission checks, leak reservations, or fail to promote into the ring, leaving gateways without routable neighbors and starving joiners.

This change

  • Track transient connections separately with explicit budget/TTL enforcement.
  • Re-apply admission logic on promotion and clean up pending records/unused helpers.
  • Restore reserved/pending accessors needed by the admission path.

Testing

  • cargo check -p freenet (local).

@sanity sanity marked this pull request as draft November 22, 2025 05:34
@freenet freenet deleted a comment from claude bot Nov 23, 2025
sanity commented Nov 23, 2025

@claude Please carefully review this PR, both high-level and the details.

claude bot commented Nov 23, 2025

Claude finished @sanity's task.


PR Review: fix: harden transient connection handling

Todo List

  • Read all changed files
  • Review git diff against origin/main
  • Analyze high-level design and architecture
  • Review implementation details for bugs/issues
  • Provide comprehensive feedback

High-Level Design Review

Overview

This PR addresses a critical issue where transient (gateway) connections could bypass admission checks, leak reservations, or fail to promote into the ring. The solution introduces explicit transient connection tracking with budget enforcement and TTL, separate from the main ring topology.

Strengths:

  1. Clear separation of concerns: Transient connections are now tracked separately (transient_connections map) from established ring connections
  2. Budget enforcement: The transient_budget prevents unbounded inbound connections
  3. TTL mechanism: Automatic cleanup via spawned task prevents stale transients from accumulating
  4. Simplified architecture: Removal of pending_locations map reduces complexity

Concerns:

  1. ⚠️ Potential race condition in promotion path (see detailed findings below)
  2. ⚠️ Routing exclusion could starve operations if transients aren't promoted promptly
  3. ⚠️ Test removal reduces coverage - the deleted tests validated important invariants

Detailed Code Review

crates/core/src/ring/connection_manager.rs

Positive Changes:

Lines 14-21: TransientEntry documentation

  • Good: Added clear documentation explaining the purpose of transient entries
  • Good: Documents that these are unsolicited inbound connections

Lines 124-141: Simplified ConnectionManager initialization

  • Good: Removed pending_locations field, reducing state management complexity
  • Good: Transient tracking now centralized in one place

Lines 247-252: Early return for first connection

if open == 0 {
    tracing::debug!(%peer_id, "should_accept: first connection -> accepting");
    return true;
}
  • Good: Cleaner early return pattern
  • Good: Removes nested conditionals

Lines 254-276: Gateway direct-accept limit enforcement

  • Good: Enforces GATEWAY_DIRECT_ACCEPT_LIMIT to prevent gateways from accepting too many direct connections
  • Good: Properly cleans up reservation counter on rejection

Lines 278-282: Duplicate peer detection

if self.location_for_peer.read().get(peer_id).is_some() {
    tracing::debug!(%peer_id, "Peer already pending/connected; acknowledging acceptance");
    return true;
}
  • Good: Prevents duplicate reservations for the same peer
  • Good: Returns true to acknowledge existing reservation

Lines 300-320: record_pending_location simplified

  • IMPORTANT FIX: Now records directly to location_for_peer instead of pending_locations
  • This makes pending peers immediately routable, which could be good or bad depending on intention
  • Question: Is it intentional that pending peers are now routable before handshake completes?

Lines 365-430: Transient connection management

pub fn try_register_transient(&self, peer: PeerId, location: Option<Location>) -> bool
pub fn drop_transient(&self, peer: &PeerId) -> Option<TransientEntry>
pub fn is_transient(&self, peer: &PeerId) -> bool
  • Good: Clean API for transient lifecycle management
  • Good: Budget enforcement with atomic operations
  • Good: Handles race conditions (lines 389-394)
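  For reference, the budget enforcement being praised here can be sketched as a compare-and-swap loop over an atomic counter. This is an illustrative model only, not the actual freenet implementation; the type and method names below are hypothetical:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Illustrative transient budget: admit a connection only while the
/// counter is below `budget`, using CAS so concurrent callers cannot
/// overshoot the limit.
struct TransientBudget {
    active: AtomicUsize,
    budget: usize,
}

impl TransientBudget {
    fn new(budget: usize) -> Self {
        Self {
            active: AtomicUsize::new(0),
            budget,
        }
    }

    /// Returns true if a slot was reserved for the new transient.
    fn try_register(&self) -> bool {
        let mut current = self.active.load(Ordering::SeqCst);
        loop {
            if current >= self.budget {
                return false; // budget exhausted: reject the inbound connection
            }
            match self.active.compare_exchange(
                current,
                current + 1,
                Ordering::SeqCst,
                Ordering::SeqCst,
            ) {
                Ok(_) => return true,
                Err(actual) => current = actual, // lost the race; retry with fresh value
            }
        }
    }

    /// Releases a slot when a transient is dropped or promoted.
    fn drop_one(&self) {
        self.active.fetch_sub(1, Ordering::SeqCst);
    }
}

fn main() {
    let b = TransientBudget::new(2);
    assert!(b.try_register());
    assert!(b.try_register());
    assert!(!b.try_register()); // third inbound rejected
    b.drop_one();
    assert!(b.try_register()); // slot freed, admit again
    println!("ok");
}
```

  The CAS loop is what makes two simultaneous inbound connections unable to both claim the last slot, which is the race-handling property noted above.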

Lines 451-481: add_connection cleanup

  • Good: Removed pending_locations manipulation
  • Simpler flow for promoting connections to active status

Lines 528-561: prune_connection unified

  • Good: Unified handling for both active and in-transit connections
  • Good: Uses single location_for_peer map instead of two separate maps
  • Good: Proper cleanup of both reservation and connection counters

Lines 589-604: routing() excludes transients

if self.is_transient(&conn.location.peer) {
    return None;
}
  • Good: Prevents routing through transient connections
  • ⚠️ Potential issue: If transients aren't promoted quickly, this could starve routing options for gateways

⚠️ Issues and Concerns:

Lines 300-320: record_pending_location behavior change

pub fn record_pending_location(&self, peer_id: &PeerId, location: Location) {
    let mut locations = self.location_for_peer.write();  // ← Changed from pending_locations
  • ⚠️ This is a significant behavioral change
  • Before: Pending connections were isolated in pending_locations, not visible to routing
  • After: Pending connections immediately appear in location_for_peer, making them routable
  • Risk: If a handshake is in progress but not complete, routing logic might try to use this peer
  • Recommendation: The doc comment says this "makes the peer discoverable to the routing layer even before the connection is fully established". This needs careful consideration: is it intentional, and could it cause premature routing to peers that aren't ready?

Lines 620-622: has_connection_or_pending simplified

pub fn has_connection_or_pending(&self, peer: &PeerId) -> bool {
    self.location_for_peer.read().contains_key(peer)
}
  • Good: Simpler implementation
  • But: With the removal of pending_locations, there's no distinction between pending and active anymore
  • This is consistent with the architectural change, but callers might need to be aware

Test removal (lines 626-733 deleted)

  • Major concern: Two important tests were deleted:
    1. pending_connections_hidden_from_known_locations - validated isolation
    2. should_accept_does_not_leak_reservations_for_duplicate_peer - validated reservation safety
  • Issue: The second test specifically validated that duplicate should_accept calls don't leak reservations
  • Current code: Lines 278-282 now handle duplicates by returning true early
  • Recommendation: These tests should be updated to reflect new behavior, not deleted. The invariants they tested are still important.

crates/core/src/node/network_bridge/p2p_protoc.rs

Positive Changes:

Lines 1278-1314: Transient promotion in connect_peer

if self.connections.contains_key(&peer) {
    let connection_manager = &self.bridge.op_manager.ring.connection_manager;
    if let Some(entry) = connection_manager.drop_transient(&peer) {
        let loc = entry.location.unwrap_or_else(|| Location::from_address(&peer.addr));
        self.bridge.op_manager.ring.add_connection(loc, peer.clone(), false).await;
        tracing::info!(tx = %tx, remote = %peer, "connect_peer: promoted transient");
    }
  • Good: Promotes transient to ring when connect_peer is called
  • Good: Uses stored location or derives from address
  • Good: Proper cleanup of transient entry

Lines 1611-1628: Budget enforcement on inbound

  • Good: Double-checks budget before accepting transient
  • Good: Cleans up callbacks if budget exhausted
  • Good: Removes pending transactions

Lines 1688-1702: Budget enforcement before insert

  • Good: Additional safety check before inserting into connections map
  • Prevents edge cases where budget could be exceeded

Lines 1729-1754: TTL-based cleanup task

tokio::spawn(async move {
    sleep(ttl).await;
    if cm.drop_transient(&peer).is_some() {
        tracing::info!(%peer, "Transient connection expired; dropping");
        if let Err(err) = drop_tx.send(Right(NodeEvent::DropConnection(peer.clone()))).await {
            tracing::warn!(%peer, ?err, "Failed to dispatch DropConnection for expired transient");
        }
    }
});
  • Good: Automatic cleanup prevents resource leaks
  • Good: Only drops if still transient (handles promotion case)

⚠️ Issues and Concerns:

Lines 1286-1296: Promotion logic race condition

if let Some(entry) = connection_manager.drop_transient(&peer) {
    let loc = entry.location.unwrap_or_else(|| Location::from_address(&peer.addr));
    self.bridge.op_manager.ring.add_connection(loc, peer.clone(), false).await;
    tracing::info!(tx = %tx, remote = %peer, "connect_peer: promoted transient");
}
  • ⚠️ Potential race: What if the transient expires (TTL task runs) between connections.contains_key() and drop_transient()?
  • The TTL cleanup task could send DropConnection while we're trying to promote
  • Recommendation: Consider adding a state field to TransientEntry (e.g., promoting: AtomicBool) to prevent concurrent expiration during promotion
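  A minimal sketch of that recommendation, under the assumption of a hypothetical guard type (not the actual TransientEntry): whichever side wins a compare_exchange on a shared flag owns the entry, so the TTL task and the promotion path can never both act on it:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Illustrative guard: the first caller to flip `claimed` owns the
/// transient entry, so TTL expiry cannot drop a connection that is
/// mid-promotion (and vice versa).
struct TransientGuard {
    claimed: AtomicBool,
}

impl TransientGuard {
    fn new() -> Arc<Self> {
        Arc::new(Self {
            claimed: AtomicBool::new(false),
        })
    }

    /// Returns true for exactly one caller (promoter or TTL task).
    fn try_claim(&self) -> bool {
        self.claimed
            .compare_exchange(false, true, Ordering::SeqCst, Ordering::SeqCst)
            .is_ok()
    }
}

fn main() {
    let guard = TransientGuard::new();
    // Promotion path claims first...
    assert!(guard.try_claim());
    // ...so the TTL task's later claim fails and it skips the drop.
    assert!(!guard.try_claim());
    println!("ok");
}
```

  Both the promotion path and the TTL task would call try_claim before touching the entry; the loser simply does nothing.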

Lines 1292-1296: Missing admission re-check on promotion

  • ⚠️ Critical issue: When promoting a transient, the code calls add_connection(loc, peer.clone(), false) with was_reserved=false
  • This means the connection is added to the ring without re-checking admission logic
  • The PR description says "Re-apply admission logic on promotion" but I don't see where this happens
  • Current flow:
    1. Transient admitted (bypassing normal admission)
    2. connect_peer called → promotes directly to ring
    3. No should_accept() call during promotion
  • Expected flow based on PR description:
    1. Transient admitted (with budget limit)
    2. When promoting, re-run should_accept() to validate admission
    3. If rejected, drop the transient instead
  • Recommendation: Add should_accept() check before add_connection() at line 1292-1294:
if let Some(entry) = connection_manager.drop_transient(&peer) {
    let loc = entry.location.unwrap_or_else(|| Location::from_address(&peer.addr));
    if connection_manager.should_accept(loc, &peer) {
        self.bridge.op_manager.ring.add_connection(loc, peer.clone(), false).await;
        tracing::info!(tx = %tx, remote = %peer, "connect_peer: promoted transient after admission check");
    } else {
        tracing::warn!(tx = %tx, remote = %peer, "connect_peer: transient failed admission on promotion; dropping");
        // Connection will be dropped naturally when callback returns Ok
    }
}

crates/core/tests/test_network_integration.rs

Lines 10-21: Simpler test network setup

  • Good: Uses freenet-test-network crate for cleaner tests
  • Good: Removes static network, avoiding Sync issues

crates/core/tests/token_expiration.rs

Lines 135-137: Transient config added

  • Good: Tests now properly configure transient settings
  • Minor: The None values preserve existing defaults

Summary of Findings

Critical Issues

  1. ❌ Missing admission re-check on promotion (p2p_protoc.rs:1292-1296)

    • PR description claims "Re-apply admission logic on promotion" but this doesn't happen
    • Transients can bypass admission checks when promoted
    • Fix: Call should_accept() before add_connection() during promotion
  2. ⚠️ Potential race in promotion vs TTL expiration (p2p_protoc.rs:1286-1296)

    • TTL cleanup task could fire while promotion is in progress
    • Could lead to connection being dropped mid-promotion
    • Fix: Add atomic flag or use better synchronization

Design Concerns

  1. ⚠️ Pending connections now routable (connection_manager.rs:300-320)

    • Behavior change: record_pending_location now writes to location_for_peer instead of pending_locations
    • Makes in-handshake peers visible to routing layer
    • Question: Is this intentional? Could cause routing to incomplete connections
  2. ⚠️ Routing exclusion could starve gateways (connection_manager.rs:592-594)

    • Transients are excluded from routing
    • If promotion is slow or fails, gateways may have no routable peers
    • Recommendation: Monitor promotion latency

Testing Concerns

  1. ❌ Important tests deleted (connection_manager.rs:626-733)
    • pending_connections_hidden_from_known_locations - validated isolation
    • should_accept_does_not_leak_reservations_for_duplicate_peer - validated safety
    • Fix: Update and restore these tests to validate new behavior

Recommendations

Must Fix (Before Merge)

  1. Add admission re-check during promotion - Critical for security
  2. Fix promotion/TTL race condition - Prevents connection corruption
  3. Restore and update deleted tests - Maintains test coverage

Should Consider

  1. Clarify pending peer routability - Document whether this is intentional
  2. Add metrics - Track transient count, promotion rate, expiration rate
  3. Add test - Verify promotion respects admission logic
  4. Add test - Verify TTL expiration vs promotion race handling

Nice to Have

  1. Add integration test - Full transient lifecycle (accept → promote → route)
  2. Document transient behavior - Add module-level docs explaining transient concept

Overall Assessment: The architectural direction is sound (separate transient tracking with budget/TTL), but there are critical gaps in the implementation:

  • Missing admission re-check contradicts PR description
  • Race conditions in promotion path
  • Reduced test coverage

With the recommended fixes, this PR will properly harden transient connection handling as intended.


@sanity sanity marked this pull request as ready for review November 23, 2025 02:50
Copilot finished reviewing on behalf of sanity November 23, 2025 02:53
Copilot AI left a comment


Pull request overview

This PR aims to harden transient (gateway) connection handling to prevent admission bypasses, reservation leaks, and failed promotions. However, several bugs were introduced that undermine these goals.

Key Changes

  • Merged pending_locations tracking into location_for_peer to simplify connection state management
  • Added explicit transient connection budget enforcement with TTL-based expiration
  • Restructured should_accept admission logic to check for duplicate peers after reservation increment
  • Added new config fields transient_budget and transient_ttl_secs to test configurations

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file

  • crates/core/tests/token_expiration.rs: Added transient config fields to test setup (None values use defaults)
  • crates/core/tests/test_network_integration.rs: Changed from static singleton network to fresh network per test
  • crates/core/src/ring/mod.rs: Removed unused SocketAddr import (now in connection_manager.rs)
  • crates/core/src/ring/connection_manager.rs: Merged pending/active location tracking, restructured admission logic, removed 118 lines of tests
  • crates/core/src/node/network_bridge/p2p_protoc.rs: Added transient promotion path and budget enforcement checks
Comments suppressed due to low confidence (1)

crates/core/src/node/network_bridge/p2p_protoc.rs:1295

  • Transient promotion bypasses admission checks. When promoting a transient connection to a normal connection, add_connection is called directly without re-applying should_accept. This could allow connections that would normally be rejected by admission logic (e.g., max connections reached, topology manager rejection) to bypass these checks. According to the PR description, admission logic should be "re-applied on promotion".
            if let Some(entry) = connection_manager.drop_transient(&peer) {
                let loc = entry
                    .location
                    .unwrap_or_else(|| Location::from_address(&peer.addr));
                self.bridge
                    .op_manager
                    .ring
                    .add_connection(loc, peer.clone(), false)
                    .await;



if self.location_for_peer.read().get(peer_id).is_some() {
// We've already accepted this peer (pending or active); treat as a no-op acceptance.
tracing::debug!(%peer_id, "Peer already pending/connected; acknowledging acceptance");

Copilot AI Nov 23, 2025

Reservation leak when peer is already pending/connected. When should_accept is called for a peer already in location_for_peer, the function increments reserved_connections at the start (line 189-205) but returns early at line 251 without decrementing it. This will leak the reservation counter. The counter should be decremented before returning, similar to lines 241-242 and 286-287.

Suggested change
  tracing::debug!(%peer_id, "Peer already pending/connected; acknowledging acceptance");
+ self.reserved_connections
+     .fetch_sub(1, std::sync::atomic::Ordering::SeqCst);

Collaborator

Is it though? If it already got reserved and we return true, that doesn't sound right; we should decrement it.

Collaborator Author

Fixed the reservation leak: duplicate/pending hits now release the reservation immediately and the promotion path now calls add_connection(.., was_reserved = true) so the counter is decremented when the connection is established. First-connection acceptance also records the pending location so we keep the bookkeeping consistent without leaking slots.
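
The invariant described here can be modeled with a toy counter (hypothetical names, not the freenet types): every increment taken by should_accept is matched by exactly one decrement, either on the duplicate early-return or when add_connection runs with was_reserved = true:

```rust
use std::collections::HashSet;
use std::sync::atomic::{AtomicIsize, Ordering};
use std::sync::Mutex;

/// Toy model of the reservation invariant: `reserved` never drifts
/// upward across repeated should_accept calls for the same peer.
struct Reservations {
    reserved: AtomicIsize,
    pending: Mutex<HashSet<u64>>, // peer ids already accepted
}

impl Reservations {
    fn new() -> Self {
        Self {
            reserved: AtomicIsize::new(0),
            pending: Mutex::new(HashSet::new()),
        }
    }

    /// Reserve a slot, then release it immediately if the peer is a duplicate.
    fn should_accept(&self, peer: u64) -> bool {
        self.reserved.fetch_add(1, Ordering::SeqCst);
        let mut pending = self.pending.lock().unwrap();
        if !pending.insert(peer) {
            // Duplicate/pending hit: release the reservation we just took.
            self.reserved.fetch_sub(1, Ordering::SeqCst);
        }
        true
    }

    /// Establishing the connection consumes the original reservation.
    fn add_connection(&self, peer: u64, was_reserved: bool) {
        self.pending.lock().unwrap().remove(&peer);
        if was_reserved {
            self.reserved.fetch_sub(1, Ordering::SeqCst);
        }
    }
}

fn main() {
    let r = Reservations::new();
    assert!(r.should_accept(1));
    assert!(r.should_accept(1)); // duplicate releases its slot immediately
    r.add_connection(1, true);   // promotion releases the original slot
    assert_eq!(r.reserved.load(Ordering::SeqCst), 0); // no leak
    println!("ok");
}
```

Walking through the toy: the first call leaves the counter at 1, the duplicate call nets to 0 change, and add_connection with was_reserved = true brings it back to 0, which is the balance the fix restores.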

Comment on lines 297 to 299
/// This makes the peer discoverable to the routing layer even before the connection
/// is fully established. The entry is removed automatically if the handshake fails
/// via `prune_in_transit_connection`.

Copilot AI Nov 23, 2025

Misleading documentation: pending peers are not routable. The comment claims "This makes the peer discoverable to the routing layer even before the connection is fully established," but this is incorrect. The routing method (line 589) only uses connections_by_location, which is populated by add_connection, not record_pending_location. Pending locations are only in location_for_peer and are not used for routing until the connection is fully established.

Suggested change
- /// This makes the peer discoverable to the routing layer even before the connection
- /// is fully established. The entry is removed automatically if the handshake fails
- /// via `prune_in_transit_connection`.
+ /// This records the peer's advertised location before the connection is fully established.
+ /// The entry is removed automatically if the handshake fails via `prune_in_transit_connection`.

Collaborator

+1

Collaborator Author

Updated the comment to state that pending locations live in location_for_peer purely for handshake de-duplication; routing still keys off connections_by_location, so pending peers are not routable until promotion.


let accepted = if open == 0 {
if open == 0 {
tracing::debug!(%peer_id, "should_accept: first connection -> accepting");

Copilot AI Nov 23, 2025

First connection bypasses pending location tracking. When open == 0 (first connection), the function returns early at line 227 after incrementing reserved_connections, but without calling record_pending_location(peer_id, location). This means the first connection's location won't be tracked in location_for_peer until add_connection is called. This is inconsistent with the normal flow (line 290) where pending locations are recorded immediately upon acceptance.

Suggested change
  tracing::debug!(%peer_id, "should_accept: first connection -> accepting");
+ self.record_pending_location(peer_id, location);

Collaborator Author

Covered this in the latest push: the first-connection fast path now records the pending location before returning, and promotion paths call add_connection with was_reserved = true so the reservation bookkeeping stays balanced.


sanity commented Nov 23, 2025

Thanks for the thorough review. I’ve added an admission/cap recheck when promoting an existing transient transport (so promotions can’t bypass should_accept or max-connections). I’m keeping the pending-location visibility and TTL handling as-is in this slice to avoid scope creep; the later stacked PRs continue to tighten transient lifecycle, so we can revisit the sweep vs per-connection timer trade-off there. Tests remain in their current form here because the later PRs add broader coverage around the new semantics.

@iduartgomez iduartgomez self-requested a review November 23, 2025 11:47
// Only insert if connection doesn't already exist to avoid dropping existing channel
let mut newly_inserted = false;
if !self.connections.contains_key(&peer_id) {
if is_transient {
Collaborator

Can all this logic leaking into here be avoided? It sounds like an architectural issue.

Collaborator Author

Agreed it is not ideal for the bridge to know so much about transient promotion; for now the admission/cap logic lives in ConnectionManager and this block just orchestrates the promotion for the existing transport. If we keep expanding this path we should extract a helper on the ring side and have the bridge call that instead of threading the details here.

/// Entry tracking a transient connection that hasn't been added to the ring topology yet.
/// Transient connections are typically unsolicited inbound connections to gateways.
#[derive(Clone)]
pub(crate) struct TransientEntry {
Collaborator

Remove this if it's not used; why is it not being used? I think it was meant for potential timeouts?

Collaborator Author

Removed the unused opened_at field from TransientEntry—the TTL enforcement in the bridge handles expiry, so we do not need to keep the timestamp here.



/// via `prune_in_transit_connection`.
pub fn record_pending_location(&self, peer_id: &PeerId, location: Location) {
let mut locations = self.pending_locations.write();
let mut locations = self.location_for_peer.write();
Collaborator

Wait, why are we writing a pending location into the definitive map?

Collaborator Author

location_for_peer now explicitly holds pending entries only for de-duping in-flight handshakes; routing still keys off connections_by_location, so these pending inserts are just to keep promotion bookkeeping consistent and avoid duplicate acceptances.

Collaborator

location_for_peer also contains definitive connections though, no? Can we document this thoroughly for future reference?

"connect_peer: transient promotion rejected due to capacity"
);
} else {
self.bridge
Collaborator

As per Claude's suggestion, beware of races between dropping and adding:

Apply the suggested fix to transient connections: add an atomic flag or better synchronization, OR re-check should_accept.

Collaborator Author

The promotion flow now re-runs should_accept after dropping the transient and then calls add_connection(.., was_reserved = true) so the reservation counters and topology bookkeeping stay in sync. The drop + add happen on the same task (no concurrent promotion), so we are not racing another add; if we start doing concurrent promotions we can wrap this in a small guard, but for the current single-task path this keeps the state consistent.

@iduartgomez iduartgomez added this pull request to the merge queue Nov 23, 2025
Merged via the queue into main with commit 592c668 Nov 23, 2025
12 checks passed
@iduartgomez iduartgomez deleted the fix/transient-split-1 branch November 23, 2025 20:07