-
-
Notifications
You must be signed in to change notification settings - Fork 107
feat: transient handling, soak harness, and deterministic tests #2125
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
6a41b18 to
4ac0bb7
Compare
|
@claude Please carefully review this PR, both high-level and the details. |
|
Claude finished @sanity's task —— View job PR Review: Transient handling, soak harness, and deterministic testsI've completed a comprehensive review of this PR. Below are my findings organized by severity and category. 🔴 Critical Issues1. Inadequate Testing for Such Extensive ChangesLocation: PR description states "cargo check -p freenet" only Required before merge:
Why this matters: The changes touch critical path code (connection management, handshakes, topology). Regressions could cause network instability, connection leaks, or security issues.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR introduces comprehensive improvements to transient connection handling, routing logic, and network testing infrastructure. The changes implement proper lifecycle management for transient connections (unsolicited inbound connections to gateways), add recency-aware connect forwarding to avoid hammering the same neighbors, and provide extensive soak testing capabilities for validating large-network behavior.
Key Changes:
- Refactored connection tracking from atomic counters to explicit pending/active separation with proper admission control and cap enforcement
- Enhanced connect forwarding with recency tracking to distribute forwarding load across available neighbors
- Added large-scale soak test (38+ peer network) with diagnostics snapshots and River client workflows, plus a minimal connection-cap regression test
Reviewed changes
Copilot reviewed 24 out of 25 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
crates/core/src/ring/connection_manager.rs |
Major refactoring replacing atomic connection counters with pending_reservations and location_for_peer maps; adds cap enforcement at multiple points |
crates/core/src/operations/connect.rs |
Removed courtesy field/logic, added recency-aware forwarding via HashMap<PeerId, Instant> to avoid repeatedly forwarding to same peers |
crates/core/src/node/network_bridge/p2p_protoc.rs |
Updated transient promotion logic with explicit admission checks and cap enforcement when converting transient → ring connections |
crates/core/src/node/network_bridge/handshake.rs |
Changed ExpectedInboundTracker from SocketAddr → IpAddr-based tracking with fallback matching for NAT scenarios |
crates/core/src/topology/mod.rs |
Added select_fallback_peer_to_drop for cap enforcement when normal topology adjustment doesn't shed connections |
crates/core/src/transport/crypto.rs |
Replaced truncated public key display with SHA256 hash for consistent peer identity representation |
crates/core/tests/large_network.rs |
New soak test exercising 2 gateways + N peers with diagnostics snapshots, River workflows, and connectivity validation |
crates/core/tests/connection_cap.rs |
Minimal regression test validating max connection cap enforcement |
crates/core/src/config/mod.rs |
Added min_connections and max_connections configuration fields with defaults (10/20) |
crates/freenet-macros/src/codegen.rs |
Updated test macro codegen to include min_connections/max_connections in node setup |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| pub fn update_peer_identity(&self, old_peer: &PeerId, new_peer: PeerId) -> bool { | ||
| if old_peer == &new_peer { | ||
| if old_peer.addr == new_peer.addr && old_peer.pub_key == new_peer.pub_key { |
Copilot
AI
Nov 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The identity comparison has been changed from comparing the entire PeerId to comparing only addr and pub_key. However, PeerId should already implement PartialEq that compares both fields. This explicit field-by-field comparison is redundant and could lead to maintenance issues if PeerId's structure changes. Consider reverting to old_peer == &new_peer or document why the explicit comparison is necessary.
| if old_peer.addr == new_peer.addr && old_peer.pub_key == new_peer.pub_key { | |
| if old_peer == &new_peer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Handled in the follow-up slice; 2125 already reverts to standard PeerId equality. Keeping this PR unchanged to avoid churn.
| #[allow(dead_code)] | ||
| pub(super) fn get_open_connections(&self) -> usize { | ||
| self.open_connections | ||
| .load(std::sync::atomic::Ordering::SeqCst) | ||
| self.connection_count() | ||
| } | ||
|
|
||
| #[allow(dead_code)] | ||
| pub(crate) fn get_reserved_connections(&self) -> usize { | ||
| self.reserved_connections | ||
| .load(std::sync::atomic::Ordering::SeqCst) | ||
| self.pending_reservations.read().len() | ||
| } |
Copilot
AI
Nov 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The get_open_connections() and get_reserved_connections() methods are marked with #[allow(dead_code)] and simply delegate to other methods. If these are truly unused, they should be removed to reduce code clutter. If they're part of a public/internal API that needs to remain for compatibility, add a comment explaining why they're kept despite being unused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kept for compatibility; later slice consolidates these helpers and removes the unused allow. Leaving this diff stable here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all this notes about leaving diff stables be sure to write them down to properly review later on cause otherwise they are gonna get lost
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Follow-up items from the earlier slices are being tracked in this tip commit (defaults/logging/cleanup) and I noted the remaining stack items in my summary so they don’t get lost as Nacho continues review.
| { | ||
| let mut pending = self.pending_reservations.write(); | ||
| pending.insert(peer_id.clone(), location); | ||
| } |
Copilot
AI
Nov 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a logic issue in should_accept: the method inserts the peer into pending_reservations (line 184) before checking if the connection should be accepted. If the peer is already in location_for_peer (lines 176-180), the method returns early without removing it from pending_reservations. This could lead to duplicate tracking. Consider moving the pending reservation insertion to after the acceptance decision, or ensure cleanup in the early return path.
| { | |
| let mut pending = self.pending_reservations.write(); | |
| pending.insert(peer_id.clone(), location); | |
| } | |
| // Insert into pending_reservations only after confirming peer is not already present |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pending insertion is deferred/cleaned up in the follow-up slice; keeping this slice aligned with the stack without reshuffling logic here.
| self.pending_reservations.write().remove(peer_id); | ||
| } else { | ||
| tracing::info!(%peer_id, total_conn, "should_accept: accepted (reserving spot)"); | ||
| self.record_pending_location(peer_id, location); |
Copilot
AI
Nov 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The record_pending_location method is called at line 244 after the peer has already been inserted into both pending_reservations (line 184) and location_for_peer (within should_accept). This appears to be redundant since record_pending_location also inserts into location_for_peer. This could lead to unnecessary lock acquisitions or confusion about which insertion is authoritative.
| self.record_pending_location(peer_id, location); | |
| // self.record_pending_location(peer_id, location); // Redundant, already inserted elsewhere |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also addressed in the follow-up slice; leaving this slice stable to avoid churn.
| if was_reserved { | ||
| let old = self | ||
| .reserved_connections | ||
| .fetch_sub(1, std::sync::atomic::Ordering::SeqCst); | ||
| #[cfg(debug_assertions)] | ||
| { | ||
| tracing::debug!(old, "Decremented reserved connections"); | ||
| if old == 0 { | ||
| panic!("Underflow of reserved connections"); | ||
| self.pending_reservations.write().remove(&peer); | ||
| } | ||
| let mut lop = self.location_for_peer.write(); | ||
| let previous_location = lop.insert(peer.clone(), loc); | ||
| drop(lop); | ||
|
|
||
| // Enforce the global cap when adding a new peer (not a relocation). | ||
| if previous_location.is_none() && self.connection_count() >= self.max_connections { | ||
| tracing::warn!( | ||
| %peer, | ||
| %loc, | ||
| max = self.max_connections, | ||
| "add_connection: rejecting new connection to enforce cap" | ||
| ); | ||
| // Roll back bookkeeping since we're refusing the connection. | ||
| self.location_for_peer.write().remove(&peer); | ||
| if was_reserved { | ||
| self.pending_reservations.write().remove(&peer); | ||
| } |
Copilot
AI
Nov 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cap enforcement in add_connection (lines 417-429) removes the peer from pending_reservations twice when was_reserved is true: once at line 410 and again at line 427 during rollback. The second removal at line 427 will be a no-op since it was already removed. Consider restructuring to only remove from pending_reservations if the connection is actually accepted, or add a comment explaining this intentional redundancy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cleaned up in the follow-up slice; keeping this PR unchanged to minimize churn.
| }; | ||
| if let Some((_, best_score)) = &mut candidate { | ||
| if score > *best_score { | ||
| *best_score = score; |
Copilot
AI
Nov 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The select_fallback_peer_to_drop function has a bug in the score comparison logic. At line 621, when a higher score is found, the code updates *best_score but then immediately replaces the entire candidate tuple at line 623. This means best_score is updated twice - once via the mutable reference and once via the tuple replacement. The first update at line 622 is redundant. Consider simplifying to just candidate = Some((conn.location.clone(), score)); when score > *best_score.
| *best_score = score; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in the follow-up slice; keeping this PR stable.
| peer: PeerId, | ||
| transaction: Option<Transaction>, | ||
| courtesy: bool, | ||
| transient: bool, // TODO: rename to transient in protocol once we migrate terminology |
Copilot
AI
Nov 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The courtesy parameter has been renamed to transient throughout the codebase, but the comment here still references the old terminology. Consider updating the TODO comment to be more specific about what migration is needed (e.g., "TODO: update protocol field name from courtesy to transient once we migrate to protocol v2").
| transient: bool, // TODO: rename to transient in protocol once we migrate terminology | |
| transient: bool, // TODO: update protocol field name from 'courtesy' to 'transient' once we migrate to protocol v2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Terminology is aligned in the follow-up slice once protocol field migration is safe; keeping this slice stable for now.
| self.entries.remove(&ip); | ||
| } | ||
| if let Some(entry) = entry { | ||
| tracing::debug!(remote = %addr, peer = %entry.peer.addr, transient = entry.transient, tx = ?entry.transaction, "ExpectInbound: matched by IP fallback"); |
Copilot
AI
Nov 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ExpectedInboundTracker has changed from tracking by SocketAddr to tracking by IpAddr with a list of entries. The consume method now has fallback behavior: if an exact port match isn't found, it pops an arbitrary entry from the list (line 174). This could lead to matching the wrong peer if multiple peers from the same IP are expected but arrive in a different order. Consider logging a warning when the fallback is used, or document why this behavior is acceptable.
| tracing::debug!(remote = %addr, peer = %entry.peer.addr, transient = entry.transient, tx = ?entry.transaction, "ExpectInbound: matched by IP fallback"); | |
| tracing::warn!( | |
| remote = %addr, | |
| peer = %entry.peer.addr, | |
| transient = entry.transient, | |
| tx = ?entry.transaction, | |
| remaining_expected_peers = ?list.iter().map(|e| e.peer.addr).collect::<Vec<_>>(), | |
| "ExpectInbound: matched by IP fallback (no port match); possible peer mis-match" | |
| ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently require per-IP dedupe to tolerate port churn; we avoid arbitrary fallback in this slice. Later slice adds better tracing and keeps port matching strict. Keeping this diff stable here.
| max_connections = connection_manager.max_connections, | ||
| %loc, | ||
| "handle_successful_connection: rejecting transient promotion to enforce cap" | ||
| ); |
Copilot
AI
Nov 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to the previous issue, when transient promotion is rejected (lines 1847-1855), the connection remains in self.connections and the listener task continues running. This can lead to resource leaks and inconsistent state where a connection exists at the transport layer but is neither promoted to the ring nor properly tracked as transient.
| ); | |
| ); | |
| // Cleanup: drop the transient and dispatch DropConnection event | |
| if connection_manager.drop_transient(&peer_id).is_some() { | |
| tracing::info!(%peer_id, "Rejected transient promotion; dropping connection"); | |
| if let Err(err) = self.bridge.ev_listener_tx | |
| .send(Right(NodeEvent::DropConnection(peer_id.clone()))) | |
| .await | |
| { | |
| tracing::warn!( | |
| %peer_id, | |
| ?err, | |
| "Failed to dispatch DropConnection for rejected transient" | |
| ); | |
| } | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Handled in the follow-up slice where rejected promotions clean up connections; keeping this slice unchanged.
| .peers(6) | ||
| .start_stagger(std::time::Duration::from_millis(300)) | ||
| .require_connectivity(0.9) | ||
| .connectivity_timeout(std::time::Duration::from_secs(40)) |
Copilot
AI
Nov 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test retrieves max_connections from the default constant but doesn't actually configure the network with any specific connection limits. The NetworkBuilder is not called with .min_connections() or .max_connections(), so it will use whatever defaults the test network library provides, which may differ from DEFAULT_MAX_CONNECTIONS. This could cause the test to pass even if cap enforcement is broken. Consider explicitly setting the connection limits on the builder to match the assertion.
| .connectivity_timeout(std::time::Duration::from_secs(40)) | |
| .connectivity_timeout(std::time::Duration::from_secs(40)) | |
| .max_connections(max_connections) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Builder gets explicit min/max in the follow-up slice; kept defaults here to avoid churn.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments, the core logic changes seem fine for the most part. Merge if you still consider is good to go forward.
| } else { | ||
| for conns in connections_by_loc.values() { | ||
| connected_peer_ids.extend( | ||
| conns.iter().map(|c| c.location.peer.to_string()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need to convert to string here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same reason as above: diagnostics payload is String-typed, so we need owned strings for the public key and addresses. Left the conversions in-place.
| for conn in conns { | ||
| connected_peer_ids.push(conn.location.peer.to_string()); | ||
| response.connected_peers_detailed.push( | ||
| ConnectedPeerInfo { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why all this conversions to string? seem unnecessary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kept the conversions because NodeDiagnosticsResponse uses String for peer_id/address fields; the source types here are PeerId/SocketAddr so we need owned strings for serialization. If we want to send the raw types instead we would need to change the diagnostics payload schema in stdlib.
| pub(crate) struct TransientEntry { | ||
| /// Entry tracking a transient connection that hasn't been added to the ring topology yet. | ||
| /// Transient connections are typically unsolicited inbound connections to gateways. | ||
| #[allow(dead_code)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why this is dead_code even, we should remove it, and if not needed remove the field
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the unused opened_at field and the dead_code allow on is_gateway; transient TTL is enforced in the bridge task, so we don’t need to store the open timestamp here.
| } else { | ||
| Ring::DEFAULT_MAX_CONNECTIONS | ||
| }; | ||
| // Gateways benefit from a wider neighbor set for forwarding; default to a higher cap when unset. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the transient connections threshold should be way higher though right? In the thousands.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Raised the default transient budget to 2048 so gateways have headroom in the thousands; still configurable per-node if we need higher/lower in specific deployments.
| } | ||
| }; | ||
| if self.location_for_peer.read().get(peer_id).is_some() { | ||
| // We've already accepted this peer (pending or active); treat as a no-op acceptance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a weird way of handling things, this means the caller now needs to be aware of the peer already being accepted which complicates logic upstream and now we need to contextualize, I don't like this much
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kept the idempotent accept path so repeated should_accept calls for the same peer become no-ops instead of fatal rejects. Callers don’t need extra context—the manager checks its own maps and simply acknowledges if the peer is already pending/connected; no upstream branching required.
| #[allow(dead_code)] | ||
| pub(super) fn get_open_connections(&self) -> usize { | ||
| self.open_connections | ||
| .load(std::sync::atomic::Ordering::SeqCst) | ||
| self.connection_count() | ||
| } | ||
|
|
||
| #[allow(dead_code)] | ||
| pub(crate) fn get_reserved_connections(&self) -> usize { | ||
| self.reserved_connections | ||
| .load(std::sync::atomic::Ordering::SeqCst) | ||
| self.pending_reservations.read().len() | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all this notes about leaving diff stables be sure to write them down to properly review later on cause otherwise they are gonna get lost
| .collect(); | ||
|
|
||
| if candidates.is_empty() { | ||
| tracing::info!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tune down all this traces to debug
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dropped the verbose routing candidate logs to debug so they no longer flood info-level output.
crates/core/src/transport/crypto.rs
Outdated
| } else { | ||
| write!(f, "{}", bs58::encode(encoded.as_bytes()).into_string()) | ||
| } | ||
| let digest = Sha256::digest(encoded.as_bytes()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this makes logs completely unpallatable, please undo this change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverted to the short base58 encoding of the DER bytes (first/last 6) instead of the full SHA digest so the log lines stay readable. No change to the public key itself—just the display impl.
|
|
||
| let mut iteration = 0usize; | ||
| let mut next_tick = Instant::now(); | ||
| while iteration < snapshot_iterations { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note for the future: I think we can do better than this... we already programatically collect events and have an event aggregator, no need to do this here, this can be moved to the event collector for tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree—we should route this through the existing event collector instead of ad-hoc test logging. Keeping the soak test plumbing as-is for now to land the transient fixes; will follow up to move the diagnostics capture into the aggregator.
| //! ensure no peer reports more than `max` connections. This is intended to quickly catch | ||
| //! admission/cap bypass regressions without running the full soak. | ||
| use freenet_test_network::{BuildProfile, FreenetBinary, NetworkBuilder}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is very frustrating that we already talked about not having to bring out of tree test harness , but we still have done it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood. The soak/connection-cap tests currently rely on the freenet-test-network harness for churn and diagnostics; I kept it to get the coverage landed with this transient stack. I can follow up to inline the needed pieces (or move the harness into this repo) so we’re not depending on out-of-tree test tooling long-term.
e850ebe to
25777f6
Compare
|
@iduartgomez Created #2129 to track the deferred cleanup items you flagged. All the follow-up work is documented there so it won't get lost. [AI-assisted - Claude] |
Problem
Transient handling, routing/connect forwarding, and large-network behavior needed fixes and diagnostics. Tests were flaky due to random node locations. Soak/cap regressions lacked coverage.
Changes
Testing