
Conversation


@sanity sanity commented Nov 22, 2025

Problem

Transient (gateway) connections could bypass admission checks, leak reservations, or fail to promote into the ring, leaving gateways without routable neighbors and starving joiners.

This change

  • Track transient connections separately with explicit budget/TTL enforcement.
  • Re-apply admission logic on promotion and clean up pending records/unused helpers.
  • Restore reserved/pending accessors needed by the admission path.

Testing

  • cargo check -p freenet (local).

@sanity sanity marked this pull request as draft November 22, 2025 05:34
@freenet freenet deleted a comment from claude bot Nov 23, 2025
sanity commented Nov 23, 2025

@claude Please carefully review this PR, both high-level and the details.

claude bot commented Nov 23, 2025

Claude finished @sanity's task.


PR Review: fix: harden transient connection handling

Todo List

  • Read all changed files
  • Review git diff against origin/main
  • Analyze high-level design and architecture
  • Review implementation details for bugs/issues
  • Provide comprehensive feedback

High-Level Design Review

Overview

This PR addresses a critical issue where transient (gateway) connections could bypass admission checks, leak reservations, or fail to promote into the ring. The solution introduces explicit transient connection tracking with budget enforcement and TTL, separate from the main ring topology.

Strengths:

  1. Clear separation of concerns: Transient connections are now tracked separately (transient_connections map) from established ring connections
  2. Budget enforcement: The transient_budget prevents unbounded inbound connections
  3. TTL mechanism: Automatic cleanup via spawned task prevents stale transients from accumulating
  4. Simplified architecture: Removal of pending_locations map reduces complexity

Concerns:

  1. ⚠️ Potential race condition in promotion path (see detailed findings below)
  2. ⚠️ Routing exclusion could starve operations if transients aren't promoted promptly
  3. ⚠️ Test removal reduces coverage - the deleted tests validated important invariants

Detailed Code Review

crates/core/src/ring/connection_manager.rs

Positive Changes:

Lines 14-21: TransientEntry documentation

  • Good: Added clear documentation explaining the purpose of transient entries
  • Good: Documents that these are unsolicited inbound connections

Lines 124-141: Simplified ConnectionManager initialization

  • Good: Removed pending_locations field, reducing state management complexity
  • Good: Transient tracking now centralized in one place

Lines 247-252: Early return for first connection

if open == 0 {
    tracing::debug!(%peer_id, "should_accept: first connection -> accepting");
    return true;
}
  • Good: Cleaner early return pattern
  • Good: Removes nested conditionals

Lines 254-276: Gateway direct-accept limit enforcement

  • Good: Enforces GATEWAY_DIRECT_ACCEPT_LIMIT to prevent gateways from accepting too many direct connections
  • Good: Properly cleans up reservation counter on rejection

Lines 278-282: Duplicate peer detection

if self.location_for_peer.read().get(peer_id).is_some() {
    tracing::debug!(%peer_id, "Peer already pending/connected; acknowledging acceptance");
    return true;
}
  • Good: Prevents duplicate reservations for the same peer
  • Good: Returns true to acknowledge existing reservation

Lines 300-320: record_pending_location simplified

  • IMPORTANT FIX: Now records directly to location_for_peer instead of pending_locations
  • This makes pending peers immediately routable, which could be good or bad depending on intention
  • Question: Is it intentional that pending peers are now routable before handshake completes?

Lines 365-430: Transient connection management

pub fn try_register_transient(&self, peer: PeerId, location: Option<Location>) -> bool
pub fn drop_transient(&self, peer: &PeerId) -> Option<TransientEntry>
pub fn is_transient(&self, peer: &PeerId) -> bool
  • Good: Clean API for transient lifecycle management
  • Good: Budget enforcement with atomic operations
  • Good: Handles race conditions (lines 389-394)
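  For reference, the budget enforcement being praised here can be sketched as a compare-and-swap loop over an atomic counter. This is an illustrative model only, not the actual freenet implementation; the type and method names below are hypothetical:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Illustrative transient budget: admit a connection only while the
/// counter is below `budget`, using CAS so concurrent callers cannot
/// overshoot the limit.
struct TransientBudget {
    active: AtomicUsize,
    budget: usize,
}

impl TransientBudget {
    fn new(budget: usize) -> Self {
        Self {
            active: AtomicUsize::new(0),
            budget,
        }
    }

    /// Returns true if a slot was reserved for the new transient.
    fn try_register(&self) -> bool {
        let mut current = self.active.load(Ordering::SeqCst);
        loop {
            if current >= self.budget {
                return false; // budget exhausted: reject the inbound connection
            }
            match self.active.compare_exchange(
                current,
                current + 1,
                Ordering::SeqCst,
                Ordering::SeqCst,
            ) {
                Ok(_) => return true,
                Err(actual) => current = actual, // lost the race; retry with fresh value
            }
        }
    }

    /// Releases a slot when a transient is dropped or promoted.
    fn drop_one(&self) {
        self.active.fetch_sub(1, Ordering::SeqCst);
    }
}

fn main() {
    let b = TransientBudget::new(2);
    assert!(b.try_register());
    assert!(b.try_register());
    assert!(!b.try_register()); // third inbound rejected
    b.drop_one();
    assert!(b.try_register()); // slot freed, admit again
    println!("ok");
}
```

  The CAS loop is what makes two simultaneous inbound connections unable to both claim the last slot, which is the race-handling property noted above.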

Lines 451-481: add_connection cleanup

  • Good: Removed pending_locations manipulation
  • Simpler flow for promoting connections to active status

Lines 528-561: prune_connection unified

  • Good: Unified handling for both active and in-transit connections
  • Good: Uses single location_for_peer map instead of two separate maps
  • Good: Proper cleanup of both reservation and connection counters

Lines 589-604: routing() excludes transients

if self.is_transient(&conn.location.peer) {
    return None;
}
  • Good: Prevents routing through transient connections
  • ⚠️ Potential issue: If transients aren't promoted quickly, this could starve routing options for gateways

⚠️ Issues and Concerns:

Lines 300-320: record_pending_location behavior change

pub fn record_pending_location(&self, peer_id: &PeerId, location: Location) {
    let mut locations = self.location_for_peer.write();  // ← Changed from pending_locations
  • ⚠️ This is a significant behavioral change
  • Before: Pending connections were isolated in pending_locations, not visible to routing
  • After: Pending connections immediately appear in location_for_peer, making them routable
  • Risk: If a handshake is in progress but not complete, routing logic might try to use this peer
  • Recommendation: The doc comment says this "makes the peer discoverable to the routing layer even before the connection is fully established". This needs careful consideration: is it intentional, and could it cause premature routing to peers that aren't ready?

Lines 620-622: has_connection_or_pending simplified

pub fn has_connection_or_pending(&self, peer: &PeerId) -> bool {
    self.location_for_peer.read().contains_key(peer)
}
  • Good: Simpler implementation
  • But: With the removal of pending_locations, there's no distinction between pending and active anymore
  • This is consistent with the architectural change, but callers might need to be aware

Test removal (lines 626-733 deleted)

  • Major concern: Two important tests were deleted:
    1. pending_connections_hidden_from_known_locations - validated isolation
    2. should_accept_does_not_leak_reservations_for_duplicate_peer - validated reservation safety
  • Issue: The second test specifically validated that duplicate should_accept calls don't leak reservations
  • Current code: Lines 278-282 now handle duplicates by returning true early
  • Recommendation: These tests should be updated to reflect new behavior, not deleted. The invariants they tested are still important.

crates/core/src/node/network_bridge/p2p_protoc.rs

Positive Changes:

Lines 1278-1314: Transient promotion in connect_peer

if self.connections.contains_key(&peer) {
    let connection_manager = &self.bridge.op_manager.ring.connection_manager;
    if let Some(entry) = connection_manager.drop_transient(&peer) {
        let loc = entry.location.unwrap_or_else(|| Location::from_address(&peer.addr));
        self.bridge.op_manager.ring.add_connection(loc, peer.clone(), false).await;
        tracing::info!(tx = %tx, remote = %peer, "connect_peer: promoted transient");
    }
  • Good: Promotes transient to ring when connect_peer is called
  • Good: Uses stored location or derives from address
  • Good: Proper cleanup of transient entry

Lines 1611-1628: Budget enforcement on inbound

  • Good: Double-checks budget before accepting transient
  • Good: Cleans up callbacks if budget exhausted
  • Good: Removes pending transactions

Lines 1688-1702: Budget enforcement before insert

  • Good: Additional safety check before inserting into connections map
  • Prevents edge cases where budget could be exceeded

Lines 1729-1754: TTL-based cleanup task

tokio::spawn(async move {
    sleep(ttl).await;
    if cm.drop_transient(&peer).is_some() {
        tracing::info!(%peer, "Transient connection expired; dropping");
        if let Err(err) = drop_tx.send(Right(NodeEvent::DropConnection(peer.clone()))).await {
            tracing::warn!(%peer, ?err, "Failed to dispatch DropConnection for expired transient");
        }
    }
});
  • Good: Automatic cleanup prevents resource leaks
  • Good: Only drops if still transient (handles promotion case)

⚠️ Issues and Concerns:

Lines 1286-1296: Promotion logic race condition

if let Some(entry) = connection_manager.drop_transient(&peer) {
    let loc = entry.location.unwrap_or_else(|| Location::from_address(&peer.addr));
    self.bridge.op_manager.ring.add_connection(loc, peer.clone(), false).await;
    tracing::info!(tx = %tx, remote = %peer, "connect_peer: promoted transient");
}
  • ⚠️ Potential race: What if the transient expires (TTL task runs) between connections.contains_key() and drop_transient()?
  • The TTL cleanup task could send DropConnection while we're trying to promote
  • Recommendation: Consider adding a state field to TransientEntry (e.g., promoting: AtomicBool) to prevent concurrent expiration during promotion
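  A minimal sketch of that recommendation, under the assumption of a hypothetical guard type (not the actual TransientEntry): whichever side wins a compare_exchange on a shared flag owns the entry, so the TTL task and the promotion path can never both act on it:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Illustrative guard: the first caller to flip `claimed` owns the
/// transient entry, so TTL expiry cannot drop a connection that is
/// mid-promotion (and vice versa).
struct TransientGuard {
    claimed: AtomicBool,
}

impl TransientGuard {
    fn new() -> Arc<Self> {
        Arc::new(Self {
            claimed: AtomicBool::new(false),
        })
    }

    /// Returns true for exactly one caller (promoter or TTL task).
    fn try_claim(&self) -> bool {
        self.claimed
            .compare_exchange(false, true, Ordering::SeqCst, Ordering::SeqCst)
            .is_ok()
    }
}

fn main() {
    let guard = TransientGuard::new();
    // Promotion path claims first...
    assert!(guard.try_claim());
    // ...so the TTL task's later claim fails and it skips the drop.
    assert!(!guard.try_claim());
    println!("ok");
}
```

  Both the promotion path and the TTL task would call try_claim before touching the entry; the loser simply does nothing.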

Lines 1292-1296: Missing admission re-check on promotion

  • ⚠️ Critical issue: When promoting a transient, the code calls add_connection(loc, peer.clone(), false) with was_reserved=false
  • This means the connection is added to the ring without re-checking admission logic
  • The PR description says "Re-apply admission logic on promotion" but I don't see where this happens
  • Current flow:
    1. Transient admitted (bypassing normal admission)
    2. connect_peer called → promotes directly to ring
    3. No should_accept() call during promotion
  • Expected flow based on PR description:
    1. Transient admitted (with budget limit)
    2. When promoting, re-run should_accept() to validate admission
    3. If rejected, drop the transient instead
  • Recommendation: Add should_accept() check before add_connection() at line 1292-1294:
if let Some(entry) = connection_manager.drop_transient(&peer) {
    let loc = entry.location.unwrap_or_else(|| Location::from_address(&peer.addr));
    if connection_manager.should_accept(loc, &peer) {
        self.bridge.op_manager.ring.add_connection(loc, peer.clone(), false).await;
        tracing::info!(tx = %tx, remote = %peer, "connect_peer: promoted transient after admission check");
    } else {
        tracing::warn!(tx = %tx, remote = %peer, "connect_peer: transient failed admission on promotion; dropping");
        // Connection will be dropped naturally when callback returns Ok
    }
}

crates/core/tests/test_network_integration.rs

Lines 10-21: Simpler test network setup

  • Good: Uses freenet-test-network crate for cleaner tests
  • Good: Removes static network, avoiding Sync issues

crates/core/tests/token_expiration.rs

Lines 135-137: Transient config added

  • Good: Tests now properly configure transient settings
  • Minor: The None values preserve existing defaults

Summary of Findings

Critical Issues

  1. ❌ Missing admission re-check on promotion (p2p_protoc.rs:1292-1296)

    • PR description claims "Re-apply admission logic on promotion" but this doesn't happen
    • Transients can bypass admission checks when promoted
    • Fix: Call should_accept() before add_connection() during promotion
  2. ⚠️ Potential race in promotion vs TTL expiration (p2p_protoc.rs:1286-1296)

    • TTL cleanup task could fire while promotion is in progress
    • Could lead to connection being dropped mid-promotion
    • Fix: Add atomic flag or use better synchronization

Design Concerns

  1. ⚠️ Pending connections now routable (connection_manager.rs:300-320)

    • Behavior change: record_pending_location now writes to location_for_peer instead of pending_locations
    • Makes in-handshake peers visible to routing layer
    • Question: Is this intentional? Could cause routing to incomplete connections
  2. ⚠️ Routing exclusion could starve gateways (connection_manager.rs:592-594)

    • Transients are excluded from routing
    • If promotion is slow or fails, gateways may have no routable peers
    • Recommendation: Monitor promotion latency

Testing Concerns

  1. ❌ Important tests deleted (connection_manager.rs:626-733)
    • pending_connections_hidden_from_known_locations - validated isolation
    • should_accept_does_not_leak_reservations_for_duplicate_peer - validated safety
    • Fix: Update and restore these tests to validate new behavior

Recommendations

Must Fix (Before Merge)

  1. Add admission re-check during promotion - Critical for security
  2. Fix promotion/TTL race condition - Prevents connection corruption
  3. Restore and update deleted tests - Maintains test coverage

Should Consider

  1. Clarify pending peer routability - Document whether this is intentional
  2. Add metrics - Track transient count, promotion rate, expiration rate
  3. Add test - Verify promotion respects admission logic
  4. Add test - Verify TTL expiration vs promotion race handling

Nice to Have

  1. Add integration test - Full transient lifecycle (accept → promote → route)
  2. Document transient behavior - Add module-level docs explaining transient concept

Overall Assessment: The architectural direction is sound (separate transient tracking with budget/TTL), but there are critical gaps in the implementation:

  • Missing admission re-check contradicts PR description
  • Race conditions in promotion path
  • Reduced test coverage

With the recommended fixes, this PR will properly harden transient connection handling as intended.


@sanity sanity marked this pull request as ready for review November 23, 2025 02:50
Copilot finished reviewing on behalf of sanity November 23, 2025 02:53
Copilot AI left a comment


Pull request overview

This PR aims to harden transient (gateway) connection handling to prevent admission bypasses, reservation leaks, and failed promotions. However, several bugs were introduced that undermine these goals.

Key Changes

  • Merged pending_locations tracking into location_for_peer to simplify connection state management
  • Added explicit transient connection budget enforcement with TTL-based expiration
  • Restructured should_accept admission logic to check for duplicate peers after reservation increment
  • Added new config fields transient_budget and transient_ttl_secs to test configurations

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file

  • crates/core/tests/token_expiration.rs: Added transient config fields to test setup (None values use defaults)
  • crates/core/tests/test_network_integration.rs: Changed from static singleton network to fresh network per test
  • crates/core/src/ring/mod.rs: Removed unused SocketAddr import (now in connection_manager.rs)
  • crates/core/src/ring/connection_manager.rs: Merged pending/active location tracking, restructured admission logic, removed 118 lines of tests
  • crates/core/src/node/network_bridge/p2p_protoc.rs: Added transient promotion path and budget enforcement checks
Comments suppressed due to low confidence (1)

crates/core/src/node/network_bridge/p2p_protoc.rs:1295

  • Transient promotion bypasses admission checks. When promoting a transient connection to a normal connection, add_connection is called directly without re-applying should_accept. This could allow connections that would normally be rejected by admission logic (e.g., max connections reached, topology manager rejection) to bypass these checks. According to the PR description, admission logic should be "re-applied on promotion".
            if let Some(entry) = connection_manager.drop_transient(&peer) {
                let loc = entry
                    .location
                    .unwrap_or_else(|| Location::from_address(&peer.addr));
                self.bridge
                    .op_manager
                    .ring
                    .add_connection(loc, peer.clone(), false)
                    .await;



if self.location_for_peer.read().get(peer_id).is_some() {
// We've already accepted this peer (pending or active); treat as a no-op acceptance.
tracing::debug!(%peer_id, "Peer already pending/connected; acknowledging acceptance");

Copilot AI Nov 23, 2025

Reservation leak when peer is already pending/connected. When should_accept is called for a peer already in location_for_peer, the function increments reserved_connections at the start (line 189-205) but returns early at line 251 without decrementing it. This will leak the reservation counter. The counter should be decremented before returning, similar to lines 241-242 and 286-287.

Suggested change
  tracing::debug!(%peer_id, "Peer already pending/connected; acknowledging acceptance");
+ self.reserved_connections
+     .fetch_sub(1, std::sync::atomic::Ordering::SeqCst);

Collaborator

Is it though? If it already got reserved and we return true, that doesn't sound right; we should decrement it.

Collaborator Author

Fixed the reservation leak: duplicate/pending hits now release the reservation immediately and the promotion path now calls add_connection(.., was_reserved = true) so the counter is decremented when the connection is established. First-connection acceptance also records the pending location so we keep the bookkeeping consistent without leaking slots.
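
The invariant described here can be modeled with a toy counter (hypothetical names, not the freenet types): every increment taken by should_accept is matched by exactly one decrement, either on the duplicate early-return or when add_connection runs with was_reserved = true:

```rust
use std::collections::HashSet;
use std::sync::atomic::{AtomicIsize, Ordering};
use std::sync::Mutex;

/// Toy model of the reservation invariant: `reserved` never drifts
/// upward across repeated should_accept calls for the same peer.
struct Reservations {
    reserved: AtomicIsize,
    pending: Mutex<HashSet<u64>>, // peer ids already accepted
}

impl Reservations {
    fn new() -> Self {
        Self {
            reserved: AtomicIsize::new(0),
            pending: Mutex::new(HashSet::new()),
        }
    }

    /// Reserve a slot, then release it immediately if the peer is a duplicate.
    fn should_accept(&self, peer: u64) -> bool {
        self.reserved.fetch_add(1, Ordering::SeqCst);
        let mut pending = self.pending.lock().unwrap();
        if !pending.insert(peer) {
            // Duplicate/pending hit: release the reservation we just took.
            self.reserved.fetch_sub(1, Ordering::SeqCst);
        }
        true
    }

    /// Establishing the connection consumes the original reservation.
    fn add_connection(&self, peer: u64, was_reserved: bool) {
        self.pending.lock().unwrap().remove(&peer);
        if was_reserved {
            self.reserved.fetch_sub(1, Ordering::SeqCst);
        }
    }
}

fn main() {
    let r = Reservations::new();
    assert!(r.should_accept(1));
    assert!(r.should_accept(1)); // duplicate releases its slot immediately
    r.add_connection(1, true);   // promotion releases the original slot
    assert_eq!(r.reserved.load(Ordering::SeqCst), 0); // no leak
    println!("ok");
}
```

Walking through the toy: the first call leaves the counter at 1, the duplicate call nets to 0 change, and add_connection with was_reserved = true brings it back to 0, which is the balance the fix restores.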

Comment on lines 297 to 299
/// This makes the peer discoverable to the routing layer even before the connection
/// is fully established. The entry is removed automatically if the handshake fails
/// via `prune_in_transit_connection`.

Copilot AI Nov 23, 2025

Misleading documentation: pending peers are not routable. The comment claims "This makes the peer discoverable to the routing layer even before the connection is fully established," but this is incorrect. The routing method (line 589) only uses connections_by_location, which is populated by add_connection, not record_pending_location. Pending locations are only in location_for_peer and are not used for routing until the connection is fully established.

Suggested change
- /// This makes the peer discoverable to the routing layer even before the connection
- /// is fully established. The entry is removed automatically if the handshake fails
- /// via `prune_in_transit_connection`.
+ /// This records the peer's advertised location before the connection is fully established.
+ /// The entry is removed automatically if the handshake fails via `prune_in_transit_connection`.

Collaborator

+1

Collaborator Author

Updated the comment to state that pending locations live in location_for_peer purely for handshake de-duplication; routing still keys off connections_by_location, so pending peers are not routable until promotion.


let accepted = if open == 0 {
if open == 0 {
tracing::debug!(%peer_id, "should_accept: first connection -> accepting");

Copilot AI Nov 23, 2025

First connection bypasses pending location tracking. When open == 0 (first connection), the function returns early at line 227 after incrementing reserved_connections, but without calling record_pending_location(peer_id, location). This means the first connection's location won't be tracked in location_for_peer until add_connection is called. This is inconsistent with the normal flow (line 290) where pending locations are recorded immediately upon acceptance.

Suggested change
  tracing::debug!(%peer_id, "should_accept: first connection -> accepting");
+ self.record_pending_location(peer_id, location);

Collaborator Author

Covered this in the latest push: the first-connection fast path now records the pending location before returning, and promotion paths call add_connection with was_reserved = true so the reservation bookkeeping stays balanced.


sanity commented Nov 23, 2025

Thanks for the thorough review. I’ve added an admission/cap recheck when promoting an existing transient transport (so promotions can’t bypass should_accept or max-connections). I’m keeping the pending-location visibility and TTL handling as-is in this slice to avoid scope creep; the later stacked PRs continue to tighten transient lifecycle, so we can revisit the sweep vs per-connection timer trade-off there. Tests remain in their current form here because the later PRs add broader coverage around the new semantics.

@iduartgomez iduartgomez self-requested a review November 23, 2025 11:47
// Only insert if connection doesn't already exist to avoid dropping existing channel
let mut newly_inserted = false;
if !self.connections.contains_key(&peer_id) {
if is_transient {
Collaborator

Can all this logic leaking into here be avoided? It sounds like an architectural issue.

Collaborator Author

Agreed it is not ideal for the bridge to know so much about transient promotion; for now the admission/cap logic lives in ConnectionManager and this block just orchestrates the promotion for the existing transport. If we keep expanding this path we should extract a helper on the ring side and have the bridge call that instead of threading the details here.

/// Entry tracking a transient connection that hasn't been added to the ring topology yet.
/// Transient connections are typically unsolicited inbound connections to gateways.
#[derive(Clone)]
pub(crate) struct TransientEntry {
Collaborator

Remove this if it's not used; why is it not being used? I think it was meant for potential timeouts?

Collaborator Author

Removed the unused opened_at field from TransientEntry—the TTL enforcement in the bridge handles expiry, so we do not need to keep the timestamp here.



/// via `prune_in_transit_connection`.
pub fn record_pending_location(&self, peer_id: &PeerId, location: Location) {
let mut locations = self.pending_locations.write();
let mut locations = self.location_for_peer.write();
Collaborator

Wait, why are we writing a pending location into the definitive map?

Collaborator Author

location_for_peer now explicitly holds pending entries only for de-duping in-flight handshakes; routing still keys off connections_by_location, so these pending inserts are just to keep promotion bookkeeping consistent and avoid duplicate acceptances.

Collaborator

location_for_peer also contains definitive connections though, no? Can we document this thoroughly for future reference?

"connect_peer: transient promotion rejected due to capacity"
);
} else {
self.bridge
Collaborator

As per Claude's suggestion, beware of races between dropping and adding:

Apply the suggested fix to transient connections: add an atomic flag or better synchronization, OR re-check should_accept.

Collaborator Author

The promotion flow now re-runs should_accept after dropping the transient and then calls add_connection(.., was_reserved = true) so the reservation counters and topology bookkeeping stay in sync. The drop + add happen on the same task (no concurrent promotion), so we are not racing another add; if we start doing concurrent promotions we can wrap this in a small guard, but for the current single-task path this keeps the state consistent.

@iduartgomez iduartgomez added this pull request to the merge queue Nov 23, 2025
Merged via the queue into main with commit 592c668 Nov 23, 2025
12 checks passed
@iduartgomez iduartgomez deleted the fix/transient-split-1 branch November 23, 2025 20:07