-
-
Notifications
You must be signed in to change notification settings - Fork 107
fix: harden transient connection handling #2120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@claude Please carefully review this PR, both high-level and the details. |
|
Claude finished @sanity's task —— View job PR Review: fix: harden transient connection handlingTodo List
High-Level Design ReviewOverviewThis PR addresses a critical issue where transient (gateway) connections could bypass admission checks, leak reservations, or fail to promote into the ring. The solution introduces explicit transient connection tracking with budget enforcement and TTL, separate from the main ring topology. Strengths:
Concerns:
Detailed Code Reviewcrates/core/src/ring/connection_manager.rs✅ Positive Changes:Lines 14-21: TransientEntry documentation
Lines 124-141: Simplified ConnectionManager initialization
Lines 247-252: Early return for first connection if open == 0 {
tracing::debug!(%peer_id, "should_accept: first connection -> accepting");
return true;
}
Lines 254-276: Gateway direct-accept limit enforcement
Lines 278-282: Duplicate peer detection if self.location_for_peer.read().get(peer_id).is_some() {
tracing::debug!(%peer_id, "Peer already pending/connected; acknowledging acceptance");
return true;
}
Lines 300-320: record_pending_location simplified
Lines 365-430: Transient connection management pub fn try_register_transient(&self, peer: PeerId, location: Option<Location>) -> bool
pub fn drop_transient(&self, peer: &PeerId) -> Option<TransientEntry>
pub fn is_transient(&self, peer: &PeerId) -> bool
Lines 451-481: add_connection cleanup
Lines 528-561: prune_connection unified
Lines 589-604: routing() excludes transients if self.is_transient(&conn.location.peer) {
return None;
}
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR aims to harden transient (gateway) connection handling to prevent admission bypasses, reservation leaks, and failed promotions. However, several bugs were introduced that undermine these goals.
Key Changes
- Merged
pending_locationstracking intolocation_for_peerto simplify connection state management - Added explicit transient connection budget enforcement with TTL-based expiration
- Restructured
should_acceptadmission logic to check for duplicate peers after reservation increment - Added new config fields
transient_budgetandtransient_ttl_secsto test configurations
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
crates/core/tests/token_expiration.rs |
Added transient config fields to test setup (None values use defaults) |
crates/core/tests/test_network_integration.rs |
Changed from static singleton network to fresh network per test |
crates/core/src/ring/mod.rs |
Removed unused SocketAddr import (now in connection_manager.rs) |
crates/core/src/ring/connection_manager.rs |
Merged pending/active location tracking, restructured admission logic, removed 118 lines of tests |
crates/core/src/node/network_bridge/p2p_protoc.rs |
Added transient promotion path and budget enforcement checks |
Comments suppressed due to low confidence (1)
crates/core/src/node/network_bridge/p2p_protoc.rs:1295
- Transient promotion bypasses admission checks. When promoting a transient connection to a normal connection,
add_connectionis called directly without re-applyingshould_accept. This could allow connections that would normally be rejected by admission logic (e.g., max connections reached, topology manager rejection) to bypass these checks. According to the PR description, admission logic should be "re-applied on promotion".
if let Some(entry) = connection_manager.drop_transient(&peer) {
let loc = entry
.location
.unwrap_or_else(|| Location::from_address(&peer.addr));
self.bridge
.op_manager
.ring
.add_connection(loc, peer.clone(), false)
.await;
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| if self.location_for_peer.read().get(peer_id).is_some() { | ||
| // We've already accepted this peer (pending or active); treat as a no-op acceptance. | ||
| tracing::debug!(%peer_id, "Peer already pending/connected; acknowledging acceptance"); |
Copilot
AI
Nov 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reservation leak when peer is already pending/connected. When should_accept is called for a peer already in location_for_peer, the function increments reserved_connections at the start (line 189-205) but returns early at line 251 without decrementing it. This will leak the reservation counter. The counter should be decremented before returning, similar to lines 241-242 and 286-287.
| tracing::debug!(%peer_id, "Peer already pending/connected; acknowledging acceptance"); | |
| tracing::debug!(%peer_id, "Peer already pending/connected; acknowledging acceptance"); | |
| self.reserved_connections | |
| .fetch_sub(1, std::sync::atomic::Ordering::SeqCst); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it though? if it already got reserved and we return true doesnt sound right we should decrement it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed the reservation leak: duplicate/pending hits now release the reservation immediately and the promotion path now calls add_connection(.., was_reserved = true) so the counter is decremented when the connection is established. First-connection acceptance also records the pending location so we keep the bookkeeping consistent without leaking slots.
| /// This makes the peer discoverable to the routing layer even before the connection | ||
| /// is fully established. The entry is removed automatically if the handshake fails | ||
| /// via `prune_in_transit_connection`. |
Copilot
AI
Nov 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Misleading documentation: pending peers are not routable. The comment claims "This makes the peer discoverable to the routing layer even before the connection is fully established," but this is incorrect. The routing method (line 589) only uses connections_by_location, which is populated by add_connection, not record_pending_location. Pending locations are only in location_for_peer and are not used for routing until the connection is fully established.
| /// This makes the peer discoverable to the routing layer even before the connection | |
| /// is fully established. The entry is removed automatically if the handshake fails | |
| /// via `prune_in_transit_connection`. | |
| /// This records the peer's advertised location before the connection is fully established. | |
| /// The entry is removed automatically if the handshake fails via `prune_in_transit_connection`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the comment to state that pending locations live in location_for_peer purely for handshake de-duplication; routing still keys off connections_by_location, so pending peers are not routable until promotion.
|
|
||
| let accepted = if open == 0 { | ||
| if open == 0 { | ||
| tracing::debug!(%peer_id, "should_accept: first connection -> accepting"); |
Copilot
AI
Nov 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First connection bypasses pending location tracking. When open == 0 (first connection), the function returns early at line 227 after incrementing reserved_connections, but without calling record_pending_location(peer_id, location). This means the first connection's location won't be tracked in location_for_peer until add_connection is called. This is inconsistent with the normal flow (line 290) where pending locations are recorded immediately upon acceptance.
| tracing::debug!(%peer_id, "should_accept: first connection -> accepting"); | |
| tracing::debug!(%peer_id, "should_accept: first connection -> accepting"); | |
| self.record_pending_location(peer_id, location); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Covered this in the latest push: the first-connection fast path now records the pending location before returning, and promotion paths call add_connection with was_reserved = true so the reservation bookkeeping stays balanced.
|
Thanks for the thorough review. I’ve added an admission/cap recheck when promoting an existing transient transport (so promotions can’t bypass should_accept or max-connections). I’m keeping the pending-location visibility and TTL handling as-is in this slice to avoid scope creep; the later stacked PRs continue to tighten transient lifecycle, so we can revisit the sweep vs per-connection timer trade-off there. Tests remain in their current form here because the later PRs add broader coverage around the new semantics. |
| // Only insert if connection doesn't already exist to avoid dropping existing channel | ||
| let mut newly_inserted = false; | ||
| if !self.connections.contains_key(&peer_id) { | ||
| if is_transient { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all this logic leaking into here cannot be avoided? sounds like an architectural issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed it is not ideal for the bridge to know so much about transient promotion; for now the admission/cap logic lives in ConnectionManager and this block just orchestrates the promotion for the existing transport. If we keep expanding this path we should extract a helper on the ring side and have the bridge call that instead of threading the details here.
| #[derive(Clone)] | ||
| pub(crate) struct TransientEntry { | ||
| /// Entry tracking a transient connection that hasn't been added to the ring topology yet. | ||
| /// Transient connections are typically unsolicited inbound connections to gateways. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove if not used, why is not being used? I think this was used for potential timeouts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the unused opened_at field from TransientEntry—the TTL enforcement in the bridge handles expiry, so we do not need to keep the timestamp here.
|
|
||
| if self.location_for_peer.read().get(peer_id).is_some() { | ||
| // We've already accepted this peer (pending or active); treat as a no-op acceptance. | ||
| tracing::debug!(%peer_id, "Peer already pending/connected; acknowledging acceptance"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it though? if it already got reserved and we return true doesnt sound right we should decrement it
| /// This makes the peer discoverable to the routing layer even before the connection | ||
| /// is fully established. The entry is removed automatically if the handshake fails | ||
| /// via `prune_in_transit_connection`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
| /// via `prune_in_transit_connection`. | ||
| pub fn record_pending_location(&self, peer_id: &PeerId, location: Location) { | ||
| let mut locations = self.pending_locations.write(); | ||
| let mut locations = self.location_for_peer.write(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wait why are we writting a pending location into the definitive map?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
location_for_peer now explicitly holds pending entries only for de-duping in-flight handshakes; routing still keys off connections_by_location, so these pending inserts are just to keep promotion bookkeeping consistent and avoid duplicate acceptances.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
location_for_peer containe also definitive connections too though no? Can we document this throughtfully for future reference?
| "connect_peer: transient promotion rejected due to capacity" | ||
| ); | ||
| } else { | ||
| self.bridge |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as per Claude suggestion, beware of races between dropping and adding:
Apply the suggested fix to transient connections: Add atomic flag or use better synchronization OR check if should_accept again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The promotion flow now re-runs should_accept after dropping the transient and then calls add_connection(.., was_reserved = true) so the reservation counters and topology bookkeeping stay in sync. The drop + add happen on the same task (no concurrent promotion), so we are not racing another add; if we start doing concurrent promotions we can wrap this in a small guard, but for the current single-task path this keeps the state consistent.
Problem
Transient (gateway) connections could bypass admission checks, leak reservations, or fail to promote into the ring, leaving gateways without routable neighbors and starving joiners.
This change
Testing