feat: transient handling, soak harness, and deterministic tests #2125

sanity · 2025-11-22T15:55:07Z

Problem

Transient handling, routing/connect forwarding, and large-network behavior needed fixes and diagnostics. Tests were flaky due to random node locations. Soak/cap regressions lacked coverage.

Changes

Transient lifecycle: separate tracking, promotion with admission checks, cap enforcement, identity fix for inbound peers.
Connect forwarding: recency-aware forwarding plus routing_candidates helper.
Soak/cap tests: #[ignore] large network soak with snapshots/viz, small cap repro harness, warmup/ring snapshots.
Deterministic tests/macros: allow explicit node locations/config in #[freenet_test]; stabilize three-hop PUT test.
Diagnostics/logging and minor workflow/doc tweaks (AGENTS, CI configs).

Testing

cargo check -p freenet

sanity · 2025-11-23T01:55:41Z

@claude Please carefully review this PR, both high-level and the details.

claude · 2025-11-23T01:55:58Z

Claude finished @sanity's task —— View job

PR Review: Transient handling, soak harness, and deterministic tests

I've completed a comprehensive review of this PR. Below are my findings organized by severity and category.

🔴 Critical Issues

1. Inadequate Testing for Such Extensive Changes

Location: PR description states "cargo check -p freenet" only
Issue: This PR makes substantial changes to core networking components (connection lifecycle, transient handling, connect forwarding) with 1295 additions and 575 deletions across 25 files. Running only cargo check is insufficient.

Required before merge:

✅ Run full test suite: cargo test
✅ Run the new large_network soak test: cargo test -p freenet --test large_network -- --ignored --nocapture
✅ Run the new connection_cap test
✅ Verify clippy passes: cargo clippy --all-targets --all-features
✅ Verify existing integration tests pass
✅ Document test results in PR description

Why this matters: The changes touch critical path code (connection management, handshakes, topology). Regressions could cause network instability, connection leaks, or security issues.

⚠️ High Priority Issues

2. Potential Race Condition in Transient Budget Enforcement

Location: crates/core/src/ring/connection_manager.rs:330-351

let current = self.transient_in_use.load(Ordering::Acquire);
if current >= self.transient_budget {
    return false;
}
// ... insert happens here ...
let prev = self.transient_in_use.fetch_add(1, Ordering::SeqCst);
if prev >= self.transient_budget {
    // Undo if we raced past the budget.
    self.transient_connections.remove(&key);
    self.transient_in_use.fetch_sub(1, Ordering::SeqCst);
    return false;
}

Issue: Classic TOCTOU (time-of-check-time-of-use) race. Multiple threads can pass the initial check at line 331, then all increment the counter. The rollback at lines 345-348 only catches the thread that observed prev >= budget, but earlier threads may have already exceeded the budget.

Scenario:

Budget = 32, current = 31
Thread A checks (31 < 32) ✓ passes
Thread B checks (31 < 32) ✓ passes
Thread A: fetch_add returns 31, now counter = 32
Thread B: fetch_add returns 32, now counter = 33
Thread B rolls back (32 >= 32), but Thread A's insertion stands
Counter should be 32 but is actually 33 after Thread A's increment

Fix: Use compare-and-swap loop or fetch_add before the check:

pub fn try_register_transient(&self, peer: PeerId, location: Option<Location>) -> bool {
    if self.transient_connections.contains_key(&peer) {
        // ... existing update logic ...
        return true;
    }
    
    // Reserve slot atomically
    let prev = self.transient_in_use.fetch_add(1, Ordering::SeqCst);
    if prev >= self.transient_budget {
        // Undo reservation
        self.transient_in_use.fetch_sub(1, Ordering::SeqCst);
        return false;
    }
    
    // Now we've reserved the slot, insert
    let key = peer.clone();
    self.transient_connections.insert(peer, TransientEntry { ... });
    
    true
}

Impact: Transient budget can be exceeded under concurrent load, potentially causing resource exhaustion on gateways.

3. Missing Atomicity Between transient_connections and transient_in_use

Location: crates/core/src/ring/connection_manager.rs:356-365

pub fn drop_transient(&self, peer: &PeerId) -> Option<TransientEntry> {
    let removed = self
        .transient_connections
        .remove(peer)
        .map(|(_, entry)| entry);
    if removed.is_some() {
        self.transient_in_use.fetch_sub(1, Ordering::SeqCst);
    }
    removed
}

Issue: If drop_transient is called twice concurrently with the same peer:

Thread A: remove() returns Some(entry), decrements counter
Thread B: remove() returns None (already removed), doesn't decrement
Result: Counter correctly decremented once ✓

However, if combined with the race in try_register_transient:

Counter can drift from actual map size
Budget enforcement becomes unreliable

Recommendation: Add invariant checks in debug builds:

#[cfg(debug_assertions)]
debug_assert_eq!(
    self.transient_in_use.load(Ordering::Acquire),
    self.transient_connections.len(),
    "transient counter mismatch: counter={}, map size={}",
    self.transient_in_use.load(Ordering::Acquire),
    self.transient_connections.len()
);

4. Promotion Logic Missing: When Do Transients Become Permanent?

Location: crates/core/src/ring/connection_manager.rs (transient methods)
Location: crates/core/src/node/network_bridge/handshake.rs (transient flag)

Issue: The code tracks transient connections and enforces a budget, but I don't see clear promotion logic:

When does a transient connection become a regular topology connection?
How is drop_transient() called to free up budget slots?
What happens when transient TTL expires?

Found references:

connection_manager.rs:368-370: is_transient() query exists
handshake.rs:34,41,48: transient flag tracked in events
connect.rs: No obvious use of transient flag

Questions for author:

Where is the transient → permanent promotion logic?
Is there a background task that enforces transient_ttl?
Should add_connection() call drop_transient() when promoting?

Risk: If promotion doesn't call drop_transient(), the budget slots never free up, eventually blocking all new transient connections.

🟡 Medium Priority Issues

5. Connection Cap Enforcement Has Fallback Gap

Location: crates/core/src/topology/mod.rs:455-463

if let Some(peer) = select_fallback_peer_to_drop(neighbor_locations, my_location) {
    info!(..., "Enforcing max-connections cap via fallback removal");
    adj = TopologyAdjustment::RemoveConnections(vec![peer]);
} else {
    warn!(..., "Over capacity but no removable peer found; leaving topology unchanged");
}

Issue: If select_fallback_peer_to_drop returns None (e.g., neighbor_locations is empty), the node stays over capacity indefinitely.

Scenario:

max_connections = 20, current_connections = 21
neighbor_locations is empty (possible during bootstrap or network partition)
Warning logged but node stays over capacity
May violate system invariants or cause resource issues

Fix: Add hard enforcement:

} else {
    warn!("Over capacity but no removable peer found");
    // Pick any connection to drop as last resort
    if let Some((peer, _)) = connections.iter().next() {
        error!("Hard-dropping arbitrary connection to enforce cap");
        adj = TopologyAdjustment::RemoveConnections(vec![peer.clone()]);
    }
}

Alternative: Document this as acceptable behavior if temporary over-capacity is safe.

6. Inconsistent Error Handling in `select_connections_to_add`

Location: crates/core/src/topology/mod.rs:501-535

fn select_connections_to_add(...) -> anyhow::Result<TopologyAdjustment> {
    if neighbor_locations.is_empty() {
        tracing::warn!("select_connections_to_add: neighbor map empty; skipping adjustment");
        return Ok(TopologyAdjustment::NoChange);
    }
    // ...
    let max_density_location = match density_map.get_max_density() {
        Ok(location) => { ... }
        Err(e) => {
            error!("Failed to get max density location: {:?}", e);
            return Err(anyhow!(e));
        }
    };

Issue: Empty neighbor_locations returns Ok(NoChange), but get_max_density() error returns Err. Caller at line 372 has fallback for error:

return self.select_connections_to_add(neighbor_locations)
    .unwrap_or_else(|e| {
        debug!("Density-based selection failed: {:?}, falling back to random locations", e);
        // ... creates random locations ...
    });

Result: Empty neighbors = no connections added, get_max_density error = random connection added. Inconsistent behavior for similar failure modes.

Fix: Make both cases return Err for consistent fallback handling, or both return Ok(NoChange).

7. Large Network Test Requires External Dependency

Location: crates/core/tests/large_network.rs:104-105

let riverctl_path = which("riverctl")
    .context("riverctl not found in PATH; install via `cargo install riverctl`")?;

Issue: The #[ignore] soak test requires riverctl to be installed, but:

Not mentioned in PR description
Not in Cargo.toml dev-dependencies
CI likely doesn't have it installed
New contributors won't know about this requirement

Recommendations:

Add to docs/TESTING.md:

## Soak Testing
The large network soak test requires `riverctl`:
```bash
cargo install riverctl
cargo test -p freenet --test large_network -- --ignored --nocapture

Add CI job that installs riverctl and runs the soak test (maybe weekly, not per-PR)
Consider making the River workflows optional (skip if riverctl not found, just test network topology)

🔵 Low Priority / Suggestions

8. Test Configuration Uses Magic Numbers

Location: crates/core/tests/large_network.rs:37-43

const DEFAULT_PEER_COUNT: usize = 38;
const DEFAULT_SNAPSHOT_INTERVAL: Duration = Duration::from_secs(60);
const DEFAULT_SNAPSHOT_ITERATIONS: usize = 5;
const DEFAULT_SNAPSHOT_WARMUP: Duration = Duration::from_secs(60);
const DEFAULT_CONNECTIVITY_TARGET: f64 = 0.75;
const DEFAULT_MIN_CONNECTIONS: usize = 5;
const DEFAULT_MAX_CONNECTIONS: usize = 7;

Suggestion: Add comments explaining why these values:

Why 38 peers specifically?
Why 5-7 connections for soak test when production default is 10-20?
Why 75% connectivity threshold?

Example:

/// Peer count chosen to exercise multi-hop routing (2 gateways + 38 peers = 40 total)
/// while staying within CI resource limits.
const DEFAULT_PEER_COUNT: usize = 38;

/// Lower than production defaults (10-20) to stress connection churn and topology
/// maintenance logic under constrained capacity.
const DEFAULT_MIN_CONNECTIONS: usize = 5;
const DEFAULT_MAX_CONNECTIONS: usize = 7;

9. Connection Cap Test Doesn't Actually Test Transient Budget

Location: crates/core/tests/connection_cap.rs

#[tokio::test(flavor = "multi_thread", worker_threads = 2)]
async fn connection_cap_respected() -> anyhow::Result<()> {
    let max_connections = freenet::config::DEFAULT_MAX_CONNECTIONS;
    // ...
    for peer in snapshot.peers {
        let count = peer.connected_peer_ids.len();
        assert!(
            count <= max_connections,
            "peer {} exceeds max connections ({} > {})",
            ...
        );
    }

Issue: Test name is connection_cap_respected but it only checks the regular max_connections cap, not the new transient budget feature introduced in this PR.

Suggestion: Add a transient-specific test:

#[tokio::test]
async fn transient_budget_enforced() -> anyhow::Result<()> {
    let transient_budget = 8;
    let net = NetworkBuilder::new()
        .gateways(1)
        .peers(20)
        .transient_budget(transient_budget)
        // Simulate rapid unsolicited connections to gateway
        .build()
        .await?;
    
    let gateway_diagnostics = net.gateway(0).diagnostics().await?;
    assert!(
        gateway_diagnostics.transient_count <= transient_budget,
        "gateway exceeded transient budget"
    );
    Ok(())
}

10. Outdated Comment in connection_manager.rs

Location: crates/core/src/ring/connection_manager.rs:29-30

/// Interim connections ongoing handshake or successfully open connections
/// Is important to keep track of this so no more connections are accepted prematurely.

Issue: This comment appears above own_location field, not a connection tracking field. Looks like leftover from refactoring.

Fix: Remove or move to appropriate field.

✅ Positive Observations

Good separation of transient tracking: The DashMap<PeerId, TransientEntry> with separate atomic counter is a clean design (modulo the race condition noted above).
Comprehensive logging: The connect/handshake code has excellent tracing instrumentation. Examples:
- connection_manager.rs:156-165: Detailed should_accept logging
- connect.rs:228-235: Acceptance with ring distance logging
- p2p_protoc.rs:315-320: Message receive logging
Recency-aware forwarding (connect.rs:308-348): Using HashMap<PeerId, Instant> to track last forwarded time is clever - prevents hammering the same neighbor.
Test infrastructure improvements:
- Soak test with diagnostics snapshots is excellent for catching regressions
- Deterministic location support makes tests reproducible
- freenet_test_network integration is clean
Identity update logic (connection_manager.rs:461-502): The update_peer_identity method correctly handles socket address changes while preserving topology state.

📋 Summary & Recommendations

Before Merge (Blocking):

✅ Fix transient budget race condition (Critical issue NAT traversal #2)
✅ Run full test suite and document results (Critical issue Overall architecture RFC #1)
✅ Clarify/implement transient promotion logic (High priority issue Peer resource usage balancing #4)
✅ Test the PR changes - the description only mentions cargo check, which is insufficient

Strong Recommendations (Should Fix):

⚠️ Add invariant checks for transient counter/map consistency (Decentralized email proof-of-concept #3)
⚠️ Document or fix connection cap fallback behavior (Intelligent routing #5)
⚠️ Make error handling consistent in topology adjustment (Social credit #6)

Nice to Have:

📝 Document riverctl requirement (Contract-key API #7)
📝 Add comments explaining test constants (Implement join ring op #8)
📝 Add transient-specific test (Contract API #9)
📝 Clean up outdated comment (Decouple state #10)

Architecture Question

Where is transient TTL enforcement? The PR adds:

transient_ttl: Duration field in ConnectionManager
opened_at: Instant in TransientEntry
transient_ttl() getter method

But I don't see code that actually:

Checks Instant::now() - opened_at > transient_ttl
Calls drop_transient() when TTL expires
Spawns a background cleanup task

Is this functionality:
a) Implemented elsewhere (please point me to it)
b) Planned for a follow-up PR
c) An oversight

This is important because without TTL enforcement, stale transients could accumulate and exhaust the budget.

Overall Assessment: This is a substantial and well-structured PR that adds important functionality (transient connection handling, better diagnostics, deterministic tests). The code quality is generally high, with good logging and test infrastructure. However, there are critical issues around testing coverage and a race condition in the transient budget logic that must be addressed before merge. The missing TTL enforcement also needs clarification.

Please address the blocking issues and provide test results before requesting final review.

--- | View PR

Copilot

Pull request overview

This PR introduces comprehensive improvements to transient connection handling, routing logic, and network testing infrastructure. The changes implement proper lifecycle management for transient connections (unsolicited inbound connections to gateways), add recency-aware connect forwarding to avoid hammering the same neighbors, and provide extensive soak testing capabilities for validating large-network behavior.

Key Changes:

Refactored connection tracking from atomic counters to explicit pending/active separation with proper admission control and cap enforcement
Enhanced connect forwarding with recency tracking to distribute forwarding load across available neighbors
Added large-scale soak test (38+ peer network) with diagnostics snapshots and River client workflows, plus a minimal connection-cap regression test

Reviewed changes

Copilot reviewed 24 out of 25 changed files in this pull request and generated 12 comments.

Show a summary per file

File	Description
`crates/core/src/ring/connection_manager.rs`	Major refactoring replacing atomic connection counters with `pending_reservations` and `location_for_peer` maps; adds cap enforcement at multiple points
`crates/core/src/operations/connect.rs`	Removed `courtesy` field/logic, added recency-aware forwarding via `HashMap<PeerId, Instant>` to avoid repeatedly forwarding to same peers
`crates/core/src/node/network_bridge/p2p_protoc.rs`	Updated transient promotion logic with explicit admission checks and cap enforcement when converting transient → ring connections
`crates/core/src/node/network_bridge/handshake.rs`	Changed `ExpectedInboundTracker` from `SocketAddr` → `IpAddr`-based tracking with fallback matching for NAT scenarios
`crates/core/src/topology/mod.rs`	Added `select_fallback_peer_to_drop` for cap enforcement when normal topology adjustment doesn't shed connections
`crates/core/src/transport/crypto.rs`	Replaced truncated public key display with SHA256 hash for consistent peer identity representation
`crates/core/tests/large_network.rs`	New soak test exercising 2 gateways + N peers with diagnostics snapshots, River workflows, and connectivity validation
`crates/core/tests/connection_cap.rs`	Minimal regression test validating max connection cap enforcement
`crates/core/src/config/mod.rs`	Added `min_connections` and `max_connections` configuration fields with defaults (10/20)
`crates/freenet-macros/src/codegen.rs`	Updated test macro codegen to include `min_connections`/`max_connections` in node setup

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2025-11-23T02:05:10Z

crates/core/src/ring/connection_manager.rs


    pub fn update_peer_identity(&self, old_peer: &PeerId, new_peer: PeerId) -> bool {
-        if old_peer == &new_peer {
+        if old_peer.addr == new_peer.addr && old_peer.pub_key == new_peer.pub_key {


The identity comparison has been changed from comparing the entire PeerId to comparing only addr and pub_key. However, PeerId should already implement PartialEq that compares both fields. This explicit field-by-field comparison is redundant and could lead to maintenance issues if PeerId's structure changes. Consider reverting to old_peer == &new_peer or document why the explicit comparison is necessary.

Suggested change

if old_peer.addr == new_peer.addr && old_peer.pub_key == new_peer.pub_key {

if old_peer == &new_peer {

Handled in the follow-up slice; 2125 already reverts to standard PeerId equality. Keeping this PR unchanged to avoid churn.

Copilot · 2025-11-23T02:05:11Z

crates/core/src/ring/connection_manager.rs

+    #[allow(dead_code)]
    pub(super) fn get_open_connections(&self) -> usize {
-        self.open_connections
-            .load(std::sync::atomic::Ordering::SeqCst)
+        self.connection_count()
    }

+    #[allow(dead_code)]
    pub(crate) fn get_reserved_connections(&self) -> usize {
-        self.reserved_connections
-            .load(std::sync::atomic::Ordering::SeqCst)
+        self.pending_reservations.read().len()
    }


[nitpick] The get_open_connections() and get_reserved_connections() methods are marked with #[allow(dead_code)] and simply delegate to other methods. If these are truly unused, they should be removed to reduce code clutter. If they're part of a public/internal API that needs to remain for compatibility, add a comment explaining why they're kept despite being unused.

Kept for compatibility; later slice consolidates these helpers and removes the unused allow. Leaving this diff stable here.

all this notes about leaving diff stables be sure to write them down to properly review later on cause otherwise they are gonna get lost

Follow-up items from the earlier slices are being tracked in this tip commit (defaults/logging/cleanup) and I noted the remaining stack items in my summary so they don’t get lost as Nacho continues review.

Copilot · 2025-11-23T02:05:11Z

crates/core/src/ring/connection_manager.rs

+        {
+            let mut pending = self.pending_reservations.write();
+            pending.insert(peer_id.clone(), location);
+        }


There's a logic issue in should_accept: the method inserts the peer into pending_reservations (line 184) before checking if the connection should be accepted. If the peer is already in location_for_peer (lines 176-180), the method returns early without removing it from pending_reservations. This could lead to duplicate tracking. Consider moving the pending reservation insertion to after the acceptance decision, or ensure cleanup in the early return path.

Suggested change

{

let mut pending = self.pending_reservations.write();

pending.insert(peer_id.clone(), location);

}

// Insert into pending_reservations only after confirming peer is not already present

Pending insertion is deferred/cleaned up in the follow-up slice; keeping this slice aligned with the stack without reshuffling logic here.

Copilot · 2025-11-23T02:05:11Z

crates/core/src/ring/connection_manager.rs

+            self.pending_reservations.write().remove(peer_id);
        } else {
            tracing::info!(%peer_id, total_conn, "should_accept: accepted (reserving spot)");
            self.record_pending_location(peer_id, location);


The record_pending_location method is called at line 244 after the peer has already been inserted into both pending_reservations (line 184) and location_for_peer (within should_accept). This appears to be redundant since record_pending_location also inserts into location_for_peer. This could lead to unnecessary lock acquisitions or confusion about which insertion is authoritative.

Suggested change

self.record_pending_location(peer_id, location);

// self.record_pending_location(peer_id, location); // Redundant, already inserted elsewhere

Also addressed in the follow-up slice; leaving this slice stable to avoid churn.

Copilot · 2025-11-23T02:05:11Z

crates/core/src/ring/connection_manager.rs

        if was_reserved {
-            let old = self
-                .reserved_connections
-                .fetch_sub(1, std::sync::atomic::Ordering::SeqCst);
-            #[cfg(debug_assertions)]
-            {
-                tracing::debug!(old, "Decremented reserved connections");
-                if old == 0 {
-                    panic!("Underflow of reserved connections");
+            self.pending_reservations.write().remove(&peer);
+        }
+        let mut lop = self.location_for_peer.write();
+        let previous_location = lop.insert(peer.clone(), loc);
+        drop(lop);
+
+        // Enforce the global cap when adding a new peer (not a relocation).
+        if previous_location.is_none() && self.connection_count() >= self.max_connections {
+            tracing::warn!(
+                %peer,
+                %loc,
+                max = self.max_connections,
+                "add_connection: rejecting new connection to enforce cap"
+            );
+            // Roll back bookkeeping since we're refusing the connection.
+            self.location_for_peer.write().remove(&peer);
+            if was_reserved {
+                self.pending_reservations.write().remove(&peer);
+            }


The cap enforcement in add_connection (lines 417-429) removes the peer from pending_reservations twice when was_reserved is true: once at line 410 and again at line 427 during rollback. The second removal at line 427 will be a no-op since it was already removed. Consider restructuring to only remove from pending_reservations if the connection is actually accepted, or add a comment explaining this intentional redundancy.

Cleaned up in the follow-up slice; keeping this PR unchanged to minimize churn.

Copilot · 2025-11-23T02:05:12Z

crates/core/src/topology/mod.rs

+            };
+            if let Some((_, best_score)) = &mut candidate {
+                if score > *best_score {
+                    *best_score = score;


The select_fallback_peer_to_drop function has a bug in the score comparison logic. At line 621, when a higher score is found, the code updates *best_score but then immediately replaces the entire candidate tuple at line 623. This means best_score is updated twice - once via the mutable reference and once via the tuple replacement. The first update at line 622 is redundant. Consider simplifying to just candidate = Some((conn.location.clone(), score)); when score > *best_score.

Suggested change

*best_score = score;

Fixed in the follow-up slice; keeping this PR stable.

Copilot · 2025-11-23T02:05:12Z

crates/core/src/node/network_bridge/handshake.rs

    peer: PeerId,
    transaction: Option<Transaction>,
-    courtesy: bool,
+    transient: bool, // TODO: rename to transient in protocol once we migrate terminology


[nitpick] The courtesy parameter has been renamed to transient throughout the codebase, but the comment here still references the old terminology. Consider updating the TODO comment to be more specific about what migration is needed (e.g., "TODO: update protocol field name from courtesy to transient once we migrate to protocol v2").

Suggested change

transient: bool, // TODO: rename to transient in protocol once we migrate terminology

transient: bool, // TODO: update protocol field name from 'courtesy' to 'transient' once we migrate to protocol v2

Terminology is aligned in the follow-up slice once protocol field migration is safe; keeping this slice stable for now.

Copilot · 2025-11-23T02:05:12Z

crates/core/src/node/network_bridge/handshake.rs

+            self.entries.remove(&ip);
+        }
+        if let Some(entry) = entry {
+            tracing::debug!(remote = %addr, peer = %entry.peer.addr, transient = entry.transient, tx = ?entry.transaction, "ExpectInbound: matched by IP fallback");


The ExpectedInboundTracker has changed from tracking by SocketAddr to tracking by IpAddr with a list of entries. The consume method now has fallback behavior: if an exact port match isn't found, it pops an arbitrary entry from the list (line 174). This could lead to matching the wrong peer if multiple peers from the same IP are expected but arrive in a different order. Consider logging a warning when the fallback is used, or document why this behavior is acceptable.

Suggested change

tracing::debug!(remote = %addr, peer = %entry.peer.addr, transient = entry.transient, tx = ?entry.transaction, "ExpectInbound: matched by IP fallback");

tracing::warn!(

remote = %addr,

peer = %entry.peer.addr,

transient = entry.transient,

tx = ?entry.transaction,

remaining_expected_peers = ?list.iter().map(|e| e.peer.addr).collect::<Vec<_>>(),

"ExpectInbound: matched by IP fallback (no port match); possible peer mis-match"

);

Currently require per-IP dedupe to tolerate port churn; we avoid arbitrary fallback in this slice. Later slice adds better tracing and keeps port matching strict. Keeping this diff stable here.

Copilot · 2025-11-23T02:05:13Z

crates/core/src/node/network_bridge/p2p_protoc.rs

+                            max_connections = connection_manager.max_connections,
+                            %loc,
+                            "handle_successful_connection: rejecting transient promotion to enforce cap"
+                        );


Similar to the previous issue, when transient promotion is rejected (lines 1847-1855), the connection remains in self.connections and the listener task continues running. This can lead to resource leaks and inconsistent state where a connection exists at the transport layer but is neither promoted to the ring nor properly tracked as transient.

Suggested change

);

);

// Cleanup: drop the transient and dispatch DropConnection event

if connection_manager.drop_transient(&peer_id).is_some() {

tracing::info!(%peer_id, "Rejected transient promotion; dropping connection");

if let Err(err) = self.bridge.ev_listener_tx

.send(Right(NodeEvent::DropConnection(peer_id.clone())))

.await

{

tracing::warn!(

%peer_id,

?err,

"Failed to dispatch DropConnection for rejected transient"

);

}

}

Handled in the follow-up slice where rejected promotions clean up connections; keeping this slice unchanged.

Copilot · 2025-11-23T02:05:13Z

crates/core/tests/connection_cap.rs

+        .peers(6)
+        .start_stagger(std::time::Duration::from_millis(300))
+        .require_connectivity(0.9)
+        .connectivity_timeout(std::time::Duration::from_secs(40))


The test retrieves max_connections from the default constant but doesn't actually configure the network with any specific connection limits. The NetworkBuilder is not called with .min_connections() or .max_connections(), so it will use whatever defaults the test network library provides, which may differ from DEFAULT_MAX_CONNECTIONS. This could cause the test to pass even if cap enforcement is broken. Consider explicitly setting the connection limits on the builder to match the assertion.

Suggested change

.connectivity_timeout(std::time::Duration::from_secs(40))

.connectivity_timeout(std::time::Duration::from_secs(40))

.max_connections(max_connections)

Builder gets explicit min/max in the follow-up slice; kept defaults here to avoid churn.

iduartgomez

Some comments, the core logic changes seem fine for the most part. Merge if you still consider is good to go forward.

iduartgomez · 2025-11-23T11:02:55Z

crates/core/src/node/network_bridge/p2p_protoc.rs

+                                } else {
+                                    for conns in connections_by_loc.values() {
+                                        connected_peer_ids.extend(
+                                            conns.iter().map(|c| c.location.peer.to_string()),


why do we need to convert to string here?

Same reason as above: diagnostics payload is String-typed, so we need owned strings for the public key and addresses. Left the conversions in-place.

iduartgomez · 2025-11-23T11:03:21Z

crates/core/src/node/network_bridge/p2p_protoc.rs

+                                        for conn in conns {
+                                            connected_peer_ids.push(conn.location.peer.to_string());
+                                            response.connected_peers_detailed.push(
+                                                ConnectedPeerInfo {


why all this conversions to string? seem unnecessary

Kept the conversions because NodeDiagnosticsResponse uses String for peer_id/address fields; the source types here are PeerId/SocketAddr so we need owned strings for serialization. If we want to send the raw types instead we would need to change the diagnostics payload schema in stdlib.

iduartgomez · 2025-11-23T11:04:04Z

crates/core/src/ring/connection_manager.rs

 pub(crate) struct TransientEntry {
+    /// Entry tracking a transient connection that hasn't been added to the ring topology yet.
+    /// Transient connections are typically unsolicited inbound connections to gateways.
    #[allow(dead_code)]


why this is dead_code even, we should remove it, and if not needed remove the field

Removed the unused opened_at field and the dead_code allow on is_gateway; transient TTL is enforced in the bridge task, so we don’t need to store the open timestamp here.

iduartgomez · 2025-11-23T11:05:22Z

crates/core/src/ring/connection_manager.rs

        } else {
            Ring::DEFAULT_MAX_CONNECTIONS
        };
+        // Gateways benefit from a wider neighbor set for forwarding; default to a higher cap when unset.


the transient connections threshold should be way higher though right? In the thousands.

Raised the default transient budget to 2048 so gateways have headroom in the thousands; still configurable per-node if we need higher/lower in specific deployments.

iduartgomez · 2025-11-23T11:11:19Z

crates/core/src/ring/connection_manager.rs

-            }
-        };
+        if self.location_for_peer.read().get(peer_id).is_some() {
+            // We've already accepted this peer (pending or active); treat as a no-op acceptance.


this is a weird way of handling things, this means the caller now needs to be aware of the peer already being accepted which complicates logic upstream and now we need to contextualize, I don't like this much

Kept the idempotent accept path so repeated should_accept calls for the same peer become no-ops instead of fatal rejects. Callers don’t need extra context—the manager checks its own maps and simply acknowledges if the peer is already pending/connected; no upstream branching required.

iduartgomez · 2025-11-23T11:16:02Z

crates/core/src/ring/connection_manager.rs

+    #[allow(dead_code)]
    pub(super) fn get_open_connections(&self) -> usize {
-        self.open_connections
-            .load(std::sync::atomic::Ordering::SeqCst)
+        self.connection_count()
    }

+    #[allow(dead_code)]
    pub(crate) fn get_reserved_connections(&self) -> usize {
-        self.reserved_connections
-            .load(std::sync::atomic::Ordering::SeqCst)
+        self.pending_reservations.read().len()
    }


all this notes about leaving diff stables be sure to write them down to properly review later on cause otherwise they are gonna get lost

iduartgomez · 2025-11-23T11:16:59Z

crates/core/src/ring/connection_manager.rs

+            .collect();
+
+        if candidates.is_empty() {
+            tracing::info!(


tune down all this traces to debug

Dropped the verbose routing candidate logs to debug so they no longer flood info-level output.

iduartgomez · 2025-11-23T11:33:13Z

crates/core/src/transport/crypto.rs

-        } else {
-            write!(f, "{}", bs58::encode(encoded.as_bytes()).into_string())
-        }
+        let digest = Sha256::digest(encoded.as_bytes());


this makes logs completely unpallatable, please undo this change

Reverted to the short base58 encoding of the DER bytes (first/last 6) instead of the full SHA digest so the log lines stay readable. No change to the public key itself—just the display impl.

iduartgomez · 2025-11-23T11:38:33Z

crates/core/tests/large_network.rs

+
+    let mut iteration = 0usize;
+    let mut next_tick = Instant::now();
+    while iteration < snapshot_iterations {


note for the future: I think we can do better than this... we already programatically collect events and have an event aggregator, no need to do this here, this can be moved to the event collector for tests.

Agree—we should route this through the existing event collector instead of ad-hoc test logging. Keeping the soak test plumbing as-is for now to land the transient fixes; will follow up to move the diagnostics capture into the aggregator.

iduartgomez · 2025-11-23T11:40:48Z

crates/core/tests/connection_cap.rs

+//! ensure no peer reports more than `max` connections. This is intended to quickly catch
+//! admission/cap bypass regressions without running the full soak.
+
+use freenet_test_network::{BuildProfile, FreenetBinary, NetworkBuilder};


Is very frustrating that we already talked about not having to bring out of tree test harness , but we still have done it.

Understood. The soak/connection-cap tests currently rely on the freenet-test-network harness for churn and diagnostics; I kept it to get the coverage landed with this transient stack. I can follow up to inline the needed pieces (or move the harness into this repo) so we’re not depending on out-of-tree test tooling long-term.

sanity · 2025-11-24T04:29:28Z

@iduartgomez Created #2129 to track the deferred cleanup items you flagged. All the follow-up work is documented there so it won't get lost.

[AI-assisted - Claude]

sanity force-pushed the fix/transient-everything branch from 6a41b18 to 4ac0bb7 Compare November 22, 2025 17:40

freenet deleted a comment from claude bot Nov 23, 2025

sanity mentioned this pull request Nov 23, 2025

fix: track transient connections separately #2107

Closed

sanity requested review from Copilot and iduartgomez November 23, 2025 01:59

Copilot started reviewing on behalf of sanity November 23, 2025 02:00 View session

Copilot finished reviewing on behalf of sanity November 23, 2025 02:03

Copilot AI reviewed Nov 23, 2025

View reviewed changes

iduartgomez reviewed Nov 23, 2025

View reviewed changes

iduartgomez approved these changes Nov 23, 2025

View reviewed changes

sanity added 18 commits November 23, 2025 20:50

fix: track transient connections separately

d6426f5

fix: tidy transient registry formatting

3511ae8

fix: clean transient promotion handling

13851e7

fix: honor transient budget and promote correctly

a262c27

fix: remove unused transient helpers

e16fd8b

test: add large network soak test with diagnostic snapshots

adf89ad

test: add large network soak test with diagnostic snapshots

7f90160

test: harden soak riverctl retries

d8742ab

refactor: rename courtesy links to transient

5f387be

test: fix ExpectedInboundTracker helper for transient rename

107a7d1

feat: expose connection tuning and bump test harness

985cb17

test: instrument neighbor candidates and live tx tracking

35b87c3

fix: transient connection handling and viz tooling

3bdda45

fix: enforce caps on transient promotion and add cap repro test

7465425

test: add small cap repro harness

32de923

fix: report ring connections in diagnostics and bound soak caps

1034d91

test: add warmup and ring snapshots to soak

47f69f2

chore: use published freenet-test-network

aff9286

sanity added 7 commits November 23, 2025 20:52

fix: overwrite expected inbound for same peer

9cec602

test: avoid unsupported connection cap flags

210716f

fix: restore connection helper methods

93f0222

fix: promote transient gateway links

367d2e3

test: add settle delay after mesh check

7f8d1ed

fix: transient handshake and routing cleanup

4d6fe20

fix(transient): tune defaults and logging per review

25777f6

sanity force-pushed the fix/transient-everything branch from e850ebe to 25777f6 Compare November 24, 2025 02:53

sanity enabled auto-merge November 24, 2025 03:02

sanity added this pull request to the merge queue Nov 24, 2025

github-merge-queue bot pushed a commit that referenced this pull request Nov 24, 2025

feat: transient handling, soak harness, and deterministic tests (#2125)

5b50d7e

sanity removed this pull request from the merge queue due to a manual request Nov 24, 2025

sanity merged commit 666b21c into main Nov 24, 2025
11 checks passed

sanity deleted the fix/transient-everything branch November 24, 2025 03:11

sanity mentioned this pull request Nov 24, 2025

Track deferred cleanup items from PR #2125 #2129

Open

	if old_peer.addr == new_peer.addr && old_peer.pub_key == new_peer.pub_key {
	if old_peer == &new_peer {

	self.record_pending_location(peer_id, location);
	// self.record_pending_location(peer_id, location); // Redundant, already inserted elsewhere

	transient: bool, // TODO: rename to transient in protocol once we migrate terminology
	transient: bool, // TODO: update protocol field name from 'courtesy' to 'transient' once we migrate to protocol v2

-                        );
+                        );
+                        // Cleanup: drop the transient and dispatch DropConnection event
+                        if connection_manager.drop_transient(&peer_id).is_some() {
+                            tracing::info!(%peer_id, "Rejected transient promotion; dropping connection");
+                            if let Err(err) = self.bridge.ev_listener_tx
+                                .send(Right(NodeEvent::DropConnection(peer_id.clone())))
+                                .await
+                            {
+                                tracing::warn!(
+                                    %peer_id,
+                                    ?err,
+                                    "Failed to dispatch DropConnection for rejected transient"
+                                );
+                            }
+                        }

	.connectivity_timeout(std::time::Duration::from_secs(40))
	.connectivity_timeout(std::time::Duration::from_secs(40))
	.max_connections(max_connections)

Uh oh!

feat: transient handling, soak harness, and deterministic tests #2125

feat: transient handling, soak harness, and deterministic tests #2125

Uh oh!

Conversation

sanity commented Nov 22, 2025

Problem

Changes

Testing

Uh oh!

sanity commented Nov 23, 2025

Uh oh!

claude bot commented Nov 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: Transient handling, soak harness, and deterministic tests

🔴 Critical Issues

1. Inadequate Testing for Such Extensive Changes

⚠️ High Priority Issues

2. Potential Race Condition in Transient Budget Enforcement

3. Missing Atomicity Between transient_connections and transient_in_use

4. Promotion Logic Missing: When Do Transients Become Permanent?

🟡 Medium Priority Issues

5. Connection Cap Enforcement Has Fallback Gap

6. Inconsistent Error Handling in select_connections_to_add

7. Large Network Test Requires External Dependency

🔵 Low Priority / Suggestions

8. Test Configuration Uses Magic Numbers

9. Connection Cap Test Doesn't Actually Test Transient Budget

10. Outdated Comment in connection_manager.rs

✅ Positive Observations

📋 Summary & Recommendations

Before Merge (Blocking):

Strong Recommendations (Should Fix):

Nice to Have:

Architecture Question

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Nov 23, 2025

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 23, 2025

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 23, 2025

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 23, 2025

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 23, 2025

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 23, 2025

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 23, 2025

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

claude bot commented Nov 23, 2025 •

edited

Loading

6. Inconsistent Error Handling in `select_connections_to_add`

iduartgomez left a comment •

edited

Loading