
Conversation


@sanity sanity commented Nov 22, 2025

Problem

We need an opt-in large-network soak to debug topology formation and transient handling. Existing tooling lacked diagnostics, retries, and warmup snapshots.

This change

  • Add the #[ignore] large_network soak test (38 peers) with snapshotting and optional HTML viz output.
  • Harden the soak harness with riverctl retries, warmup, and bounded caps in diagnostics.
  • Improve transient handling/viz integration and align courtesy fields for this slice.

Testing

  • cargo check -p freenet (local).

@sanity sanity marked this pull request as draft November 22, 2025 05:34
@freenet freenet deleted a comment from claude bot Nov 23, 2025

sanity commented Nov 23, 2025

@claude Please carefully review this PR, both high-level and the details.


claude bot commented Nov 23, 2025

Claude finished @sanity's task


PR Review: test: soak harness and viz improvements

I've completed a comprehensive review of this PR. Overall, this is a well-structured improvement to the testing infrastructure and transient connection handling. The changes are significant but focused, with good test coverage. Below are my findings organized by category.


✅ High-Level Assessment

Strengths:

  • Excellent new test infrastructure with the large_network soak test
  • Meaningful refactoring of connection tracking (atomic counters → BTreeMap)
  • Good diagnostics improvements for debugging topology issues
  • Terminology cleanup (courtesy → transient) improves clarity
  • All CI checks passing ✅

Areas for improvement:

  • Some edge cases in connection manager need attention
  • Documentation could be more explicit in a few places
  • Minor potential race condition

🔍 Detailed Findings

1. Connection Manager Refactoring (connection_manager.rs) ⚠️

Issue: Potential duplicate acceptance logic
crates/core/src/ring/connection_manager.rs:172-177

if self.location_for_peer.read().get(peer_id).is_some() {
    // We've already accepted this peer (pending or active); treat as a no-op acceptance.
    tracing::debug!(%peer_id, "Peer already pending/connected; acknowledging acceptance");
    return true;
}

This check happens after the new code inserts into pending_reservations (lines 178-181). This creates a potential issue:

  1. Thread A checks location_for_peer → not found
  2. Thread B checks location_for_peer → not found
  3. Thread A inserts into pending_reservations
  4. Thread B inserts into pending_reservations (duplicate!)
  5. Both proceed

Recommendation: Move the location_for_peer check before inserting into pending_reservations, or make the check-and-insert atomic.


Issue: Overflow check after mutation
crates/core/src/ring/connection_manager.rs:183-198

The code inserts into pending_reservations first, then checks for overflow. If overflow is detected, it removes the entry. This works, but is unconventional and could leave a brief window where the count is incorrect.

Recommendation: Consider checking overflow before insertion for cleaner semantics:

let Some(total_conn) = reserved_before
    .checked_add(1)
    .and_then(|val| val.checked_add(open))
else {
    tracing::error!(%peer_id, reserved_before, open,
        "connection counters would overflow; rejecting connection");
    return false;
};

{
    let mut pending = self.pending_reservations.write();
    pending.insert(peer_id.clone(), location);
}

Positive: Removal of atomic counters

The shift from AtomicUsize counters (open_connections, reserved_connections) to a BTreeMap<PeerId, Location> for pending_reservations is a good design improvement:

  • Eliminates compare-and-swap retry loops
  • Provides better visibility into which peers are pending
  • Easier to debug and reason about

However, there's now more lock contention on the RwLock. This is probably fine given the relatively low frequency of connection events, but worth monitoring in production.


2. Gateway Inbound Handling (p2p_protoc.rs) ✅

Good fix: Lines 638-644 add immediate pruning when a peer connection fails:

// Immediately prune topology counters so we don't leak open connection slots.
ctx.bridge
    .op_manager
    .ring
    .prune_connection(peer.clone())
    .await;

This prevents the connection slot leak mentioned in the gateway_inbound_identity.rs test. Well done!


Minor: Diagnostics now use topology connections

Lines 837-847 change diagnostics to report topology-backed connections instead of raw transport connections. This is more accurate, but could be confusing if transient connections exist that haven't been promoted yet.

Recommendation: Consider adding a separate field like transient_connections: usize to the diagnostics response to give visibility into the full picture.


3. Connect Operation Changes (connect.rs) ✅

Good: Terminology consistency

The rename from courtesy → transient throughout is excellent for clarity. The term "transient" better conveys the temporary nature of these connections.


Good: Enhanced logging

Lines 224-232 add detailed logging when issuing acceptances:

tracing::info!(
    acceptor_peer = %acceptor.peer,
    joiner_peer = %self.request.joiner.peer,
    acceptor_loc = ?acceptor.location,
    joiner_loc = ?self.request.joiner.location,
    ring_distance = ?dist,
    transient,
    "connect: acceptance issued"
);

This will be invaluable for debugging topology formation issues.


4. Ring Maintenance (ring/mod.rs) ⚠️

Issue: Overly aggressive fallback

crates/core/src/ring/mod.rs:492-505

if neighbor_locations.is_empty() && connections_considered > 0 {
    tracing::warn!(
        current_connections,
        connections_considered,
        live_tx_peers = live_tx_tracker.len(),
        "Neighbor filtering removed all candidates; using all connections"
    );

    neighbor_locations = peers
        .iter()
        .map(|(loc, conns)| (*loc, conns.clone()))
        .filter(|(_, conns)| !conns.is_empty())
        .collect();
}

This fallback triggers when all connections have live transactions. While this prevents starvation, it means the topology manager might make poor decisions (e.g., removing a connection that's actively in use).

Recommendation: Add a comment explaining this trade-off, or consider a more nuanced approach (e.g., only remove the connection with the oldest live transaction if over capacity).


Good: Max connections enforcement

Lines 507-510 ensure that when over capacity, all connections are considered for removal:

if current_connections > self.connection_manager.max_connections {
    // When over capacity, consider all connections for removal regardless of live_tx filter.
    neighbor_locations = peers.clone();
}

Combined with the topology changes (topology/mod.rs:452-471), this properly enforces max connection limits. Good work!


5. Topology Manager (topology/mod.rs) ✅

Good: Max connections enforcement

The new select_fallback_peer_to_drop function (lines 604-625) provides a sensible fallback when over capacity: drop the peer furthest from your own location.

Potential improvement: The function uses a simple distance metric. For a more sophisticated approach, consider:

  • Dropping peers with the lowest recent request density
  • Dropping peers with duplicate coverage (i.e., another peer is very close to them)

But the current approach is reasonable as a first pass.


6. Test Quality ✅✅

Excellent: large_network.rs

This is a very well-designed soak test:

  • Configurable via environment variables
  • Captures diagnostics snapshots for post-mortem analysis
  • Exercises real application flows (River messaging)
  • Includes retry logic for riverctl (lines 290-326)
  • Good documentation in the file header

Minor suggestion: Consider adding a CLI flag or env var to optionally generate HTML visualizations automatically (currently requires manual post-processing).


Good: connection_cap.rs

Simple, focused regression test. Does exactly what it says on the tin.


Good: gateway_inbound_identity.rs

Excellent regression test with clear documentation explaining the bug it prevents (lines 1-6). This is a model for how regression tests should be written.


7. Transport Changes (connection_handler.rs) ⚠️

Issue: Logic inversion

crates/core/src/transport/connection_handler.rs:419-453

The old code had a complex nested condition for non-gateway peers. The new code inverts this to:

if self.is_gateway || is_known_gateway {
    // Handle gateway-intro packets
    // ...
} else {
    // Non-gateway peers: mark as expected
    self.expected_non_gateway.insert(remote_addr.ip());
    tracing::debug!(%remote_addr, "unexpected peer intro; marking expected_non_gateway");
    continue;
}

This is cleaner, but the behavior change is subtle:

Old behavior: Non-gateway peers would reject unexpected peer intros unless from known gateways
New behavior: Non-gateway peers accept (mark as expected) all peer intros

Question: Is this intentional? If so, please add a comment explaining why the restriction was relaxed. If not, this could allow unsolicited peer connections.


8. Documentation & Comments ✅

Good: The PR description explains the why (debugging topology formation and transient handling) and the what (soak test, diagnostics, hardening).

Suggestion: Add a brief note about the connection_manager refactoring (atomic counters → BTreeMap) in the PR description, as it's a significant architectural change.


🐛 Bugs & Security

No security issues found. The changes don't introduce obvious vulnerabilities.

Potential bugs:

  1. Duplicate peer acceptance (connection_manager.rs, described above)
  2. Transport logic inversion (connection_handler.rs, needs clarification)

📊 Code Quality

Strengths:

  • Good use of tracing/logging for debugging
  • Consistent error handling
  • Tests are well-documented

Minor issues:

  • Some functions are growing long (e.g., P2pConnManager::run in p2p_protoc.rs is ~1200 lines). Consider breaking into smaller functions.
  • A few #[allow(dead_code)] attributes suggest incomplete cleanup or features in progress

✅ Recommendations Summary

Must Fix (before merge):

  1. connection_manager.rs:172: Fix potential race in duplicate peer check
  2. connection_handler.rs:419: Clarify whether transport logic change is intentional

Should Fix (if time permits):

  1. Add transient_connections field to diagnostics
  2. Document the live_tx fallback trade-off in ring/mod.rs
  3. Add comment about connection_manager refactoring in PR description

Nice to Have:

  1. Consider more sophisticated peer-drop heuristics in topology manager
  2. Add auto-visualization option to large_network test
  3. Refactor long functions in p2p_protoc.rs

🎯 Verdict

This is a solid PR that meaningfully improves the testing infrastructure and fixes important bugs. The soak test is particularly valuable for catching regressions in topology formation.

The issues I've identified are relatively minor and can likely be addressed quickly. Once the "Must Fix" items are resolved, this should be safe to merge.

Great work on the test coverage and diagnostics improvements! 🎉


| PR branch: fix/transient-split-4b

Contributor

Copilot AI left a comment


Pull request overview

This PR introduces a large-scale soak test infrastructure to debug topology formation and transient connection handling in networks with 38+ peers. The changes include a new opt-in test harness with diagnostics snapshotting, riverctl integration with retry logic, and significant refactoring of connection management to replace atomic counters with map-based tracking.

Key Changes

  • Added large_network.rs soak test with configurable peer count, snapshot intervals, and River client workflows to exercise contract operations during the soak
  • Refactored connection management to eliminate atomic counters (open_connections, reserved_connections) in favor of unified location_for_peer and pending_reservations maps
  • Improved transient connection handling with budget enforcement, TTL-based expiration, and clearer admission/promotion logic

Reviewed changes

Copilot reviewed 16 out of 17 changed files in this pull request and generated 16 comments.

Show a summary per file
File Description
crates/core/tests/large_network.rs New ignored soak test with 38-peer network, riverctl integration, diagnostics snapshots, and connectivity validation
crates/core/tests/gateway_inbound_identity.rs New regression test ensuring gateways track real peer IDs on inbound connections
crates/core/tests/connection_cap.rs New test validating max-connections enforcement
crates/core/tests/test_network_integration.rs Simplified network initialization by removing static OnceCell
crates/core/tests/connectivity.rs Increased retry limit from 30 to 60 for CI stability
crates/core/src/transport/crypto.rs Changed TransportPublicKey display to use SHA256 hash instead of first/last 6 bytes
crates/core/src/transport/connection_handler.rs Restructured gateway intro packet handling to prioritize known gateways
crates/core/src/topology/mod.rs Added fallback peer removal when over max-connections capacity
crates/core/src/ring/mod.rs Replaced connection-age filtering with live-tx filtering and added over-capacity handling
crates/core/src/ring/live_tx.rs Added len() method for tracking count
crates/core/src/ring/connection_manager.rs Major refactoring replacing atomic counters with map-based tracking; removed unit tests
crates/core/src/ring/connection.rs Removed open_at timestamp field from Connection struct
crates/core/src/operations/connect.rs Renamed courtesy to transient throughout; simplified bootstrap loop
crates/core/src/node/network_bridge/p2p_protoc.rs Enhanced transient promotion logic with re-admission checks and budget enforcement
crates/core/Cargo.toml Updated freenet-test-network to 0.1.3, added sha2 and regex dependencies
AGENTS.md Added documentation for running the large_network soak test


Comment on lines +919 to +936
select_all
.for_each(|(res, gateway)| async move {
if let Err(error) = res {
if !matches!(
error,
OpError::ConnError(
crate::node::ConnectionError::UnwantedConnection
)
) {
tracing::error!(
%gateway,
%error,
"Failed while attempting connection to gateway"
);
}
}
}
in_flight_gateways.remove(&gateway.peer);
}
} else if open_conns < BOOTSTRAP_THRESHOLD && available_count == 0 {
tracing::debug!(
open_connections = open_conns,
inflight = in_flight_gateways.len(),
"Below threshold but all gateways are already connected or in-flight"
);
})
.await;

Copilot AI Nov 23, 2025


The for_each combinator consumes the stream without propagating errors. If any gateway connection fails with an error other than UnwantedConnection, it's logged but the error is not propagated to the caller. This could hide failures in the bootstrap process. Consider using try_for_each instead, or collecting results to ensure all errors are properly handled.

Collaborator Author


This is refactored in the later slice (PR 2125) where bootstrap uses error-propagating combinators; keeping this slice stable to avoid churn.

Comment on lines +508 to +511
if current_connections > self.connection_manager.max_connections {
// When over capacity, consider all connections for removal regardless of live_tx filter.
neighbor_locations = peers.clone();
}

Copilot AI Nov 23, 2025


The logic on lines 508-511 appears to bypass max connections enforcement for over-capacity scenarios. The comment states "when over capacity, consider all connections for removal," but this unconditionally replaces neighbor_locations with all peers, potentially interfering with the removal logic that follows. This could prevent the topology manager from properly enforcing the connection cap.

Suggested change
if current_connections > self.connection_manager.max_connections {
// When over capacity, consider all connections for removal regardless of live_tx filter.
neighbor_locations = peers.clone();
}
// Do not unconditionally replace neighbor_locations with all peers when over capacity.
// The fallback to all peers is already handled above if neighbor_locations is empty.

Collaborator Author


Handled in the later slice where cap enforcement/over-cap selection was rewritten; leaving this slice unchanged to keep the stack diff stable.

};
if let Some((_, best_score)) = &mut candidate {
    if score > *best_score {
        *best_score = score;

Copilot AI Nov 23, 2025


The function select_fallback_peer_to_drop has a bug in its candidate selection logic. On line 618, when a better candidate is found, the code updates best_score but then immediately replaces the entire candidate tuple with a new one constructed from conn.location. This means the update to *best_score on line 617 is pointless since the tuple is replaced on line 618. The correct approach would be to just update the candidate without the intermediate mutation: candidate = Some((conn.location.clone(), score));

Suggested change
*best_score = score;

Collaborator Author


Fixed in the later slice when we reworked fallback selection; keeping this diff stable here.

tracing::debug!(old, "Decremented reserved connections");
if old == 0 {
panic!("Underflow of reserved connections");
self.pending_reservations.write().remove(&peer);

Copilot AI Nov 23, 2025


The add_connection method now unconditionally removes the peer from pending_reservations when was_reserved is true (line 406), but this parameter's value depends on how the caller invokes the method. If a connection is added without being properly reserved first, or if the reservation was already cleared, this could lead to inconsistent state. The method should verify that a reservation actually exists before removing it, or document the invariants more clearly.

Suggested change
self.pending_reservations.write().remove(&peer);
let mut pending = self.pending_reservations.write();
if pending.contains_key(&peer) {
pending.remove(&peer);
}

Collaborator Author


Reservation tracking is cleaned up in the later slice; keeping this slice unchanged to avoid stack churn.

}

let provided_peer = peer.clone();
let _provided_peer = peer.clone();

Copilot AI Nov 23, 2025


Similarly, _provided_peer is assigned but never used. This appears to be leftover from refactoring and should be removed.

Collaborator Author


Cleaned up in the later slice; keeping this slice unchanged to minimize conflicts.

} => {
let conn_manager = &self.bridge.op_manager.ring.connection_manager;
tracing::info!(provided = ?peer, transient = courtesy, tx = ?transaction, "InboundConnection event");
let _conn_manager = &self.bridge.op_manager.ring.connection_manager;

Copilot AI Nov 23, 2025


The variable _conn_manager is assigned but never used (indicated by the leading underscore). This suggests incomplete refactoring or dead code that should be removed to improve code clarity.

Suggested change
let _conn_manager = &self.bridge.op_manager.ring.connection_manager;

Collaborator Author


Removed in the later slice; leaving this slice unchanged.

.add_connection(loc, peer_id.clone(), true)
.await;
if is_transient {
connection_manager.drop_transient(&peer_id);

Copilot AI Nov 23, 2025


After dropping a transient connection on line 1778, the code does not prune the topology counters or clean up the connection from the connections map. This could leave the connection in an inconsistent state where it's been removed from the transient pool but still exists in the transport layer. Consider calling prune_connection or DropConnection to ensure complete cleanup.

Suggested change
connection_manager.drop_transient(&peer_id);
connection_manager.drop_transient(&peer_id);
if let Err(err) = self.bridge.ev_listener_tx
.send(Right(NodeEvent::DropConnection(peer_id.clone())))
.await
{
tracing::warn!(
%peer_id,
?err,
"Failed to dispatch DropConnection for transient"
);
}

Collaborator Author


Cleanup is handled in the later slice; keeping this diff focused here.

self.pending_reservations.write().remove(peer_id);
} else {
tracing::info!(%peer_id, total_conn, "should_accept: accepted (reserving spot)");
self.record_pending_location(peer_id, location);

Copilot AI Nov 23, 2025


The method calls record_pending_location() at line 240, which inserts the peer into location_for_peer (line 251). However, the reservation was already inserted into pending_reservations at line 180. This creates a redundant tracking of the same peer in two different data structures (pending_reservations and location_for_peer). This could lead to inconsistencies and makes the state management more complex than necessary.

Suggested change
self.record_pending_location(peer_id, location);
// self.record_pending_location(peer_id, location); // Redundant: already tracked in pending_reservations

Collaborator Author


Simplified in the later slice; keeping this diff minimal here.

.op_manager
.ring
.add_connection(loc, peer_id.clone(), false)
.add_connection(loc, peer_id.clone(), true)

Copilot AI Nov 23, 2025


The connection is promoted into the ring with was_reserved=true (line 1775), but the comment on line 1750 states "Re-apply admission logic on promotion to avoid bypassing capacity/heuristic checks." This creates a logical inconsistency: if admission logic is being re-applied via should_accept(), then the connection should be treated as a new reservation, not an existing one. Passing was_reserved=true suggests the slot was already reserved, which contradicts the re-application of admission checks. This should likely be was_reserved=false unless the reservation semantics are different than expected.

Suggested change
.add_connection(loc, peer_id.clone(), true)
.add_connection(loc, peer_id.clone(), false)

Collaborator Author


Promotion semantics adjusted in the later slice; keeping this slice aligned with the stack.

Comment on lines 447 to 451
self.expected_non_gateway.insert(remote_addr.ip());
tracing::debug!(
    %remote_addr,
    "unexpected peer intro; marking expected_non_gateway"
);

Copilot AI Nov 23, 2025


[nitpick] The logic has been restructured but now non-gateway peers that receive unexpected intro packets will mark them as expected and continue (lines 446-452), which silently ignores potentially invalid packets. The old code had a warning about "unexpected packet from non-gateway node; dropping intro packet". The new behavior may be intentional for allowing peer-to-peer connections, but the lack of any warning or debug logging about this state change makes it harder to diagnose connection issues.

Suggested change
self.expected_non_gateway.insert(remote_addr.ip());
tracing::debug!(
%remote_addr,
"unexpected peer intro; marking expected_non_gateway"
);
tracing::warn!(
%remote_addr,
"unexpected intro packet from non-gateway node; marking as expected and continuing"
);
self.expected_non_gateway.insert(remote_addr.ip());

Collaborator Author


Added a warn-level log so unexpected intros from non-gateway peers stand out while still letting the handshake proceed.

@sanity sanity marked this pull request as ready for review November 23, 2025 02:50

sanity commented Nov 23, 2025

Thanks for the thorough review. The duplicate acceptance race and transport intro behavior you highlighted are carried forward into the top-of-stack follow-ups; I’ll keep this slice stable to avoid reshuffling the stack, and address those clarifications/tests in the final aggregate PR.


sanity commented Nov 23, 2025

Fixed the CI failure in this slice by keying expected inbound entries by IP and deduplicating per port, so newer expectations win and NAT/port changes aren't dropped. The three-hop put test now passes locally; CI should re-run on push.

@sanity sanity force-pushed the fix/transient-split-4b branch from a17f265 to de9e648 Compare November 24, 2025 00:25
@sanity sanity added this pull request to the merge queue Nov 24, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 24, 2025
@sanity sanity added this pull request to the merge queue Nov 24, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 24, 2025
@sanity sanity added this pull request to the merge queue Nov 24, 2025
Merged via the queue into main with commit 4c318ee Nov 24, 2025
11 checks passed
@sanity sanity deleted the fix/transient-split-4b branch November 24, 2025 01:43