test: soak harness and viz improvements #2123
Conversation
@claude Please carefully review this PR, both high-level and the details.
Claude finished @sanity's task: PR Review: test: soak harness and viz improvements

I've completed a comprehensive review of this PR. Overall, this is a well-structured improvement to the testing infrastructure and transient connection handling. The changes are significant but focused, with good test coverage. Below are my findings organized by category.

✅ High-Level Assessment

Strengths:

Areas for improvement:

🔍 Detailed Findings

1. Connection Manager Refactoring (connection_manager.rs)
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR introduces a large-scale soak test infrastructure to debug topology formation and transient connection handling in networks with 38+ peers. The changes include a new opt-in test harness with diagnostics snapshotting, riverctl integration with retry logic, and significant refactoring of connection management to replace atomic counters with map-based tracking.
Key Changes
- Added `large_network.rs` soak test with configurable peer count, snapshot intervals, and River client workflows to exercise contract operations during the soak
- Refactored connection management to eliminate atomic counters (`open_connections`, `reserved_connections`) in favor of unified `location_for_peer` and `pending_reservations` maps
- Improved transient connection handling with budget enforcement, TTL-based expiration, and clearer admission/promotion logic
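As a rough illustration of the budget-plus-TTL admission described in the last bullet, the following sketch admits a transient peer only after expiring stale entries and checking the budget. The `TransientPool` type, its fields, and the limits are illustrative assumptions, not the crate's actual API.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Hypothetical sketch: budget enforcement + TTL-based expiration for
// transient connections. Names are illustrative, not freenet-core's API.
struct TransientPool {
    budget: usize,
    ttl: Duration,
    entries: HashMap<String, Instant>,
}

impl TransientPool {
    /// Admit a peer if, after expiring stale entries, the budget allows it.
    fn admit(&mut self, peer: &str, now: Instant) -> bool {
        // TTL-based expiration: drop entries older than `ttl`.
        let ttl = self.ttl;
        self.entries
            .retain(|_, opened| now.duration_since(*opened) < ttl);
        // Budget enforcement: refuse admission when the pool is full.
        if self.entries.len() >= self.budget {
            return false;
        }
        self.entries.insert(peer.to_string(), now);
        true
    }
}
```

Passing `now` in explicitly keeps the expiry logic deterministic and easy to unit-test.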
Reviewed changes
Copilot reviewed 16 out of 17 changed files in this pull request and generated 16 comments.
| File | Description |
|---|---|
| `crates/core/tests/large_network.rs` | New ignored soak test with 38-peer network, riverctl integration, diagnostics snapshots, and connectivity validation |
| `crates/core/tests/gateway_inbound_identity.rs` | New regression test ensuring gateways track real peer IDs on inbound connections |
| `crates/core/tests/connection_cap.rs` | New test validating max-connections enforcement |
| `crates/core/tests/test_network_integration.rs` | Simplified network initialization by removing static `OnceCell` |
| `crates/core/tests/connectivity.rs` | Increased retry limit from 30 to 60 for CI stability |
| `crates/core/src/transport/crypto.rs` | Changed `TransportPublicKey` display to use SHA256 hash instead of first/last 6 bytes |
| `crates/core/src/transport/connection_handler.rs` | Restructured gateway intro packet handling to prioritize known gateways |
| `crates/core/src/topology/mod.rs` | Added fallback peer removal when over max-connections capacity |
| `crates/core/src/ring/mod.rs` | Replaced connection-age filtering with live-tx filtering and added over-capacity handling |
| `crates/core/src/ring/live_tx.rs` | Added `len()` method for tracking count |
| `crates/core/src/ring/connection_manager.rs` | Major refactoring replacing atomic counters with map-based tracking; removed unit tests |
| `crates/core/src/ring/connection.rs` | Removed `open_at` timestamp field from `Connection` struct |
| `crates/core/src/operations/connect.rs` | Renamed `courtesy` to `transient` throughout; simplified bootstrap loop |
| `crates/core/src/node/network_bridge/p2p_protoc.rs` | Enhanced transient promotion logic with re-admission checks and budget enforcement |
| `crates/core/Cargo.toml` | Updated `freenet-test-network` to 0.1.3, added `sha2` and `regex` dependencies |
| `AGENTS.md` | Added documentation for running the `large_network` soak test |
```rust
select_all
    .for_each(|(res, gateway)| async move {
        if let Err(error) = res {
            if !matches!(
                error,
                OpError::ConnError(
                    crate::node::ConnectionError::UnwantedConnection
                )
            ) {
                tracing::error!(
                    %gateway,
                    %error,
                    "Failed while attempting connection to gateway"
                );
            }
        }
        in_flight_gateways.remove(&gateway.peer);
    }
} else if open_conns < BOOTSTRAP_THRESHOLD && available_count == 0 {
    tracing::debug!(
        open_connections = open_conns,
        inflight = in_flight_gateways.len(),
        "Below threshold but all gateways are already connected or in-flight"
    );
})
.await;
```
Copilot AI (Nov 23, 2025):
The for_each combinator consumes the stream without propagating errors. If any gateway connection fails with an error other than UnwantedConnection, it's logged but the error is not propagated to the caller. This could hide failures in the bootstrap process. Consider using try_for_each instead, or collecting results to ensure all errors are properly handled.
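As an illustration of the "collect results" alternative this comment suggests, the following synchronous sketch ignores `UnwantedConnection` but surfaces the first non-ignorable error to the caller instead of only logging it. The `OpError` variants shown and the `bootstrap_results` helper are illustrative stand-ins, not the crate's real types.

```rust
// Hypothetical sketch: propagate the first real bootstrap error instead of
// swallowing it. `OpError::Timeout` is an assumed variant for illustration.
#[derive(Debug, PartialEq)]
enum OpError {
    UnwantedConnection,
    Timeout,
}

fn bootstrap_results(
    results: Vec<(&str, Result<(), OpError>)>,
) -> Result<(), (String, OpError)> {
    let mut first_err: Option<(String, OpError)> = None;
    for (gateway, res) in results {
        if let Err(error) = res {
            if matches!(error, OpError::UnwantedConnection) {
                continue; // expected during bootstrap; safe to ignore
            }
            // Keep only the first non-ignorable error for propagation.
            if first_err.is_none() {
                first_err = Some((gateway.to_string(), error));
            }
        }
    }
    match first_err {
        Some(err) => Err(err),
        None => Ok(()),
    }
}
```

The same shape applies to the async stream case: collect per-gateway results, then decide which errors are ignorable, rather than discarding them inside `for_each`.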
This is refactored in the later slice (PR 2125) where bootstrap uses error-propagating combinators; keeping this slice stable to avoid churn.
```rust
if current_connections > self.connection_manager.max_connections {
    // When over capacity, consider all connections for removal regardless of live_tx filter.
    neighbor_locations = peers.clone();
}
```
Copilot AI (Nov 23, 2025):
The logic on lines 508-511 appears to bypass max connections enforcement for over-capacity scenarios. The comment states "when over capacity, consider all connections for removal," but this unconditionally replaces neighbor_locations with all peers, potentially interfering with the removal logic that follows. This could prevent the topology manager from properly enforcing the connection cap.
Suggested change:

```diff
-if current_connections > self.connection_manager.max_connections {
-    // When over capacity, consider all connections for removal regardless of live_tx filter.
-    neighbor_locations = peers.clone();
-}
+// Do not unconditionally replace neighbor_locations with all peers when over capacity.
+// The fallback to all peers is already handled above if neighbor_locations is empty.
```
Handled in the later slice where cap enforcement/over-cap selection was rewritten; leaving this slice unchanged to keep the stack diff stable.
```rust
};
if let Some((_, best_score)) = &mut candidate {
    if score > *best_score {
        *best_score = score;
```
Copilot AI (Nov 23, 2025):
The function select_fallback_peer_to_drop has a bug in its candidate selection logic. On line 618, when a better candidate is found, the code updates best_score but then immediately replaces the entire candidate tuple with a new one constructed from conn.location. This means the update to *best_score on line 617 is pointless since the tuple is replaced on line 618. The correct approach would be to just update the candidate without the intermediate mutation: candidate = Some((conn.location.clone(), score));
Suggested change:

```diff
-*best_score = score;
```
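The corrected selection pattern the comment describes can be sketched as follows, with simplified types standing in for the real `select_fallback_peer_to_drop`: replace the whole candidate tuple when a better score is found, rather than mutating `best_score` and then overwriting the tuple anyway.

```rust
// Minimal sketch: pick the (location, score) pair with the highest score.
// `u64`/`f64` are simplified stand-ins for the real location and score types.
fn select_fallback(conns: &[(u64, f64)]) -> Option<(u64, f64)> {
    let mut candidate: Option<(u64, f64)> = None;
    for &(location, score) in conns {
        let better = match candidate {
            Some((_, best_score)) => score > best_score,
            None => true,
        };
        if better {
            // Replace the whole tuple in one step; no intermediate mutation.
            candidate = Some((location, score));
        }
    }
    candidate
}
```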
Fixed in the later slice when we reworked fallback selection; keeping this diff stable here.
```diff
-tracing::debug!(old, "Decremented reserved connections");
-if old == 0 {
-    panic!("Underflow of reserved connections");
+self.pending_reservations.write().remove(&peer);
```
Copilot AI (Nov 23, 2025):
The add_connection method now unconditionally removes the peer from pending_reservations when was_reserved is true (line 406), but this parameter's value depends on how the caller invokes the method. If a connection is added without being properly reserved first, or if the reservation was already cleared, this could lead to inconsistent state. The method should verify that a reservation actually exists before removing it, or document the invariants more clearly.
Suggested change:

```diff
-self.pending_reservations.write().remove(&peer);
+let mut pending = self.pending_reservations.write();
+if pending.contains_key(&peer) {
+    pending.remove(&peer);
+}
```
Reservation tracking is cleaned up in the later slice; keeping this slice unchanged to avoid stack churn.
```diff
-let provided_peer = peer.clone();
+let _provided_peer = peer.clone();
```
Copilot AI (Nov 23, 2025):
Similarly, _provided_peer is assigned but never used. This appears to be leftover from refactoring and should be removed.
Cleaned up in the later slice; keeping this slice unchanged to minimize conflicts.
```diff
 } => {
-    let conn_manager = &self.bridge.op_manager.ring.connection_manager;
     tracing::info!(provided = ?peer, transient = courtesy, tx = ?transaction, "InboundConnection event");
+    let _conn_manager = &self.bridge.op_manager.ring.connection_manager;
```
Copilot AI (Nov 23, 2025):
The variable _conn_manager is assigned but never used (indicated by the leading underscore). This suggests incomplete refactoring or dead code that should be removed to improve code clarity.
Suggested change:

```diff
-let _conn_manager = &self.bridge.op_manager.ring.connection_manager;
```
Removed in the later slice; leaving this slice unchanged.
```rust
.add_connection(loc, peer_id.clone(), true)
.await;
if is_transient {
    connection_manager.drop_transient(&peer_id);
```
Copilot AI (Nov 23, 2025):
After dropping a transient connection on line 1778, the code does not prune the topology counters or clean up the connection from the connections map. This could leave the connection in an inconsistent state where it's been removed from the transient pool but still exists in the transport layer. Consider calling prune_connection or DropConnection to ensure complete cleanup.
Suggested change:

```diff
 connection_manager.drop_transient(&peer_id);
+if let Err(err) = self.bridge.ev_listener_tx
+    .send(Right(NodeEvent::DropConnection(peer_id.clone())))
+    .await
+{
+    tracing::warn!(
+        %peer_id,
+        ?err,
+        "Failed to dispatch DropConnection for transient"
+    );
+}
```
Cleanup is handled in the later slice; keeping this diff focused here.
```rust
self.pending_reservations.write().remove(peer_id);
} else {
    tracing::info!(%peer_id, total_conn, "should_accept: accepted (reserving spot)");
    self.record_pending_location(peer_id, location);
```
Copilot AI (Nov 23, 2025):
The method calls record_pending_location() at line 240, which inserts the peer into location_for_peer (line 251). However, the reservation was already inserted into pending_reservations at line 180. This creates a redundant tracking of the same peer in two different data structures (pending_reservations and location_for_peer). This could lead to inconsistencies and makes the state management more complex than necessary.
Suggested change:

```diff
-self.record_pending_location(peer_id, location);
+// self.record_pending_location(peer_id, location); // Redundant: already tracked in pending_reservations
```
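One way to avoid tracking the same peer in two structures, as this comment suggests, is a single state map where a peer is either reserved or connected but never both. The sketch below is purely illustrative; `PeerTracker`, `PeerState`, and the method names are hypothetical, not the crate's actual API.

```rust
use std::collections::HashMap;

// Hypothetical unified tracker: one map, one state per peer.
#[derive(Debug, PartialEq)]
enum PeerState {
    Reserved,
    Connected(f64), // ring location once the connection is promoted
}

struct PeerTracker {
    peers: HashMap<String, PeerState>,
}

impl PeerTracker {
    fn reserve(&mut self, peer: &str) {
        self.peers.insert(peer.to_string(), PeerState::Reserved);
    }

    /// Promotion succeeds only when a reservation exists, making the
    /// reserved -> connected invariant explicit in one place.
    fn promote(&mut self, peer: &str, location: f64) -> bool {
        match self.peers.get_mut(peer) {
            Some(state) => {
                *state = PeerState::Connected(location);
                true
            }
            None => false,
        }
    }
}
```

With a single map there is no way for `pending_reservations` and `location_for_peer` to disagree, which is the inconsistency the comment warns about.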
Simplified in the later slice; keeping this diff minimal here.
```diff
 .op_manager
 .ring
-.add_connection(loc, peer_id.clone(), false)
+.add_connection(loc, peer_id.clone(), true)
```
Copilot AI (Nov 23, 2025):
The connection is promoted into the ring with was_reserved=true (line 1775), but the comment on line 1750 states "Re-apply admission logic on promotion to avoid bypassing capacity/heuristic checks." This creates a logical inconsistency: if admission logic is being re-applied via should_accept(), then the connection should be treated as a new reservation, not an existing one. Passing was_reserved=true suggests the slot was already reserved, which contradicts the re-application of admission checks. This should likely be was_reserved=false unless the reservation semantics are different than expected.
Suggested change:

```diff
-.add_connection(loc, peer_id.clone(), true)
+.add_connection(loc, peer_id.clone(), false)
```
Promotion semantics adjusted in the later slice; keeping this slice aligned with the stack.
```rust
self.expected_non_gateway.insert(remote_addr.ip());
tracing::debug!(
    %remote_addr,
    "unexpected peer intro; marking expected_non_gateway"
);
```
Copilot AI (Nov 23, 2025):
[nitpick] The logic has been restructured but now non-gateway peers that receive unexpected intro packets will mark them as expected and continue (lines 446-452), which silently ignores potentially invalid packets. The old code had a warning about "unexpected packet from non-gateway node; dropping intro packet". The new behavior may be intentional for allowing peer-to-peer connections, but the lack of any warning or debug logging about this state change makes it harder to diagnose connection issues.
Suggested change:

```diff
-self.expected_non_gateway.insert(remote_addr.ip());
-tracing::debug!(
-    %remote_addr,
-    "unexpected peer intro; marking expected_non_gateway"
-);
+tracing::warn!(
+    %remote_addr,
+    "unexpected intro packet from non-gateway node; marking as expected and continuing"
+);
+self.expected_non_gateway.insert(remote_addr.ip());
```
Added a warn-level log so unexpected intros from non-gateway peers stand out while still letting the handshake proceed.
Thanks for the thorough review. The duplicate acceptance race and transport intro behavior you highlighted are carried forward into the top-of-stack follow-ups; I'll keep this slice stable to avoid reshuffling the stack, and address those clarifications/tests in the final aggregate PR.
Fixed CI failure in this slice by letting expected inbound entries be keyed by IP and deduped per port, so newer expectations win and NAT/port changes don’t get dropped. Three-hop put test now passes locally; CI should re-run on push. |
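The keyed-by-IP dedupe described above can be illustrated with a minimal sketch, assuming a simple map where a newer expectation for the same IP replaces the older port (for example after a NAT port change). `ExpectedInbound` and its method names are hypothetical, not the crate's actual types.

```rust
use std::collections::HashMap;
use std::net::{IpAddr, SocketAddr};

// Hypothetical sketch: expected inbound entries keyed by IP, deduped so
// that the newest expected port for an IP wins.
struct ExpectedInbound {
    by_ip: HashMap<IpAddr, u16>,
}

impl ExpectedInbound {
    fn expect_from(&mut self, addr: SocketAddr) {
        // Newer expectations win for the same IP.
        self.by_ip.insert(addr.ip(), addr.port());
    }

    fn is_expected(&self, addr: &SocketAddr) -> bool {
        self.by_ip.get(&addr.ip()) == Some(&addr.port())
    }
}
```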
Force-pushed from a17f265 to de9e648.
Problem
We need an opt-in large-network soak to debug topology formation and transient handling. Existing tooling lacked diagnostics, retries, and warmup snapshots.
This change
Testing