-
-
Notifications
You must be signed in to change notification settings - Fork 107
refactor: transient terminology and tuning defaults #2121
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@claude Please carefully review this PR, both high-level and the details. |
|
Claude finished @sanity's task —— View job PR Review: Transient Terminology and Tuning DefaultsSummaryThis PR successfully addresses the stated goals:
The changes are well-structured and the terminology refactoring is comprehensive. However, I've identified several areas for improvement. High-Level Architecture ReviewPositive Aspects
Concerns
Detailed Implementation Reviewconnection_manager.rsLines 348-380: Race condition in pub fn try_register_transient(&self, peer: PeerId, location: Option<Location>) -> bool {
if self.transient_connections.contains_key(&peer) {
// ... update existing
return true;
}
let current = self.transient_in_use.load(Ordering::Acquire);
if current >= self.transient_budget { // ❌ TOCTOU race
return false;
}
// ... insert ...
let prev = self.transient_in_use.fetch_add(1, Ordering::SeqCst);
if prev >= self.transient_budget { // ⚠️ Mitigation, not prevention
// Undo...
return false;
}
true
}Suggested Fix: pub fn try_register_transient(&self, peer: PeerId, location: Option<Location>) -> bool {
if self.transient_connections.contains_key(&peer) {
if let Some(mut entry) = self.transient_connections.get_mut(&peer) {
entry.location = location;
}
return true;
}
// Atomically reserve a slot
loop {
let current = self.transient_in_use.load(Ordering::Acquire);
if current >= self.transient_budget {
return false;
}
match self.transient_in_use.compare_exchange(
current,
current + 1,
Ordering::SeqCst,
Ordering::Acquire,
) {
Ok(_) => break,
Err(_) => continue, // Retry if another thread modified
}
}
// Now we have a reserved slot, insert the entry
self.transient_connections.insert(peer, TransientEntry { /* ... */ });
true
}Lines 434-465:
handshake.rsLines 125: TODO comment transient: bool, // TODO: rename to transient in protocol once we migrate terminologyThis TODO is now stale since the PR completes the terminology migration. Should be removed. Lines 134-149:
p2p_protoc.rsLines 1474, 1714: Transient flag inconsistency // Line 1474: Inbound connections
let is_transient = transient; // Respects the flag
// Line 1714: Promotion logic
let promote_to_ring = !is_transient || connection_manager.is_gateway();The promotion logic treats gateway connections specially, which makes sense. However:
Lines 1686-1698: Budget check duplication // First check at line 1609
if is_transient && !connection_manager.try_register_transient(peer_id.clone(), None) {
// ... drop connection ...
}
// Second check at line 1686
if is_transient {
let cm = &self.bridge.op_manager.ring.connection_manager;
let current = cm.transient_count();
if current >= cm.transient_budget() {
// ... drop connection again ...
}
}Recommendation: Remove the second check at line 1686 since config/mod.rsLines 48-49, 723-737: Good defaults const DEFAULT_TRANSIENT_BUDGET: usize = 32;
const DEFAULT_TRANSIENT_TTL_SECS: u64 = 30;These defaults are reasonable. However:
Lines 694-713: Serialization naming topology/mod.rsLines 483-486: Empty neighbor guard if neighbor_locations.is_empty() {
tracing::warn!("select_connections_to_add: neighbor map empty; skipping adjustment");
return Ok(TopologyAdjustment::NoChange);
}✅ This correctly prevents the panic that could occur when calling Question: Can this method be called when operations/connect.rsLines 302-307: Transient hint logic fn transient_hint(&self, _acceptor: &PeerKeyLocation, _joiner: &PeerKeyLocation) -> bool {
// Courtesy slots still piggyback on regular connections...
self.op_manager.ring.open_connections() == 0
}
Test Coverage ReviewTest UpdatesThe test files show appropriate updates:
Missing Test Coverage:
Recommendation: Add integration tests for: #[test]
fn test_transient_ttl_expiration() {
// Create a transient connection
// Wait for TTL to expire
// Verify connection is dropped
}
#[test]
fn test_transient_budget_limit() {
// Fill up transient budget
// Attempt one more connection
// Verify it's rejected
// Drop a transient
// Verify new connection is accepted
}Documentation and CommentsIssues
Recommendations
Performance and ScalabilityConcerns
Security ConsiderationsPotential Issues
Recommendations SummaryCritical (Should Fix Before Merge)
Important (Consider Fixing)
Nice to Have
ConclusionThis is a solid refactoring that improves code clarity and configuration flexibility. The core changes are well-implemented, but there are some edge cases and potential race conditions that should be addressed. Recommendation: Approve with requested changes, focusing on the race condition fix and test coverage improvements. |
|
Appreciate the detailed pass. Given this PR’s scope, I’m leaving the transient CAS/timer reshaping for the later slices (2123/2125 already build on top of this). The terminology nits and duplicate budget check are addressed downstream as well. If we want to backport the CAS change here too, I can do that, but I’d like to avoid churn on this slice unless necessary. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR refactors connection terminology from "courtesy" to "transient" throughout the codebase, exposes previously hidden connection tuning parameters (min/max connections), and adds a guard to prevent topology adjustments when the neighbor map is empty.
Key Changes:
- Renamed "courtesy" to "transient" across core networking code, tests, and public APIs
- Exposed
min_connectionsandmax_connectionsconfiguration options with defaults of 10 and 20 respectively - Added empty neighbor map check in topology manager to prevent crashes
Reviewed changes
Copilot reviewed 16 out of 17 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| crates/core/src/ring/connection_manager.rs | Consolidated pending/active peer tracking into single map; removed pending_locations; updated terminology; added transient routing filter |
| crates/core/src/operations/connect.rs | Renamed courtesy to transient in all structs, functions, and messages |
| crates/core/src/node/network_bridge/p2p_protoc.rs | Updated connection handling to use transient terminology; adjusted promotion logic |
| crates/core/src/node/network_bridge/handshake.rs | Refactored ExpectedInboundTracker to key by IP address with port matching; renamed courtesy to transient |
| crates/core/src/topology/mod.rs | Added guard to skip topology adjustments when neighbor map is empty |
| crates/core/src/config/mod.rs | Added min_connections and max_connections configuration fields with defaults |
| crates/core/src/node/mod.rs | Wired connection limit configs into NodeConfig |
| crates/core/tests/operations.rs | Added retry logic for flaky three-hop GET test |
| crates/core/tests/connectivity.rs | Increased retry attempts and delays for CI stability |
| crates/core/tests/test_network_integration.rs | Changed from static network to fresh instance per test |
| crates/freenet-macros/src/codegen.rs | Updated generated test configs with new connection fields |
| crates/core/Cargo.toml | Upgraded freenet-test-network to 0.1.3; added regex dev-dependency |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| spawn_outbound(outbound.clone(), events_tx.clone(), peer, transaction, transient, peer_ready.clone()); | ||
| } | ||
| Some(Command::ExpectInbound { peer, transaction, transient }) => { | ||
| expected_inbound.register(peer, transaction, transient /* transient */); |
Copilot
AI
Nov 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a redundant comment /* transient */ at the end of line 209 that should be removed. It appears to be leftover from the refactoring.
| expected_inbound.register(peer, transaction, transient /* transient */); | |
| expected_inbound.register(peer, transaction, transient); |
| let _conn_manager = &self.bridge.op_manager.ring.connection_manager; | ||
| let remote_addr = connection.remote_addr(); | ||
|
|
||
| if let Some(blocked_addrs) = &self.blocked_addresses { | ||
| if blocked_addrs.contains(&remote_addr) { | ||
| tracing::info!( | ||
| remote = %remote_addr, | ||
| courtesy, | ||
| transient = transient, | ||
| transaction = ?transaction, | ||
| "Inbound connection blocked by local policy" | ||
| ); | ||
| return Ok(()); | ||
| } | ||
| } | ||
|
|
||
| let provided_peer = peer.clone(); | ||
| let _provided_peer = peer.clone(); |
Copilot
AI
Nov 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The variable _conn_manager at line 1430 is assigned but never used (indicated by the underscore prefix). Similarly, _provided_peer at line 1445 is unused. These appear to be leftover from debugging or refactoring. Consider removing them to clean up the code.
| // Allow routing to settle and retry GET a few times to deflake under CI load. | ||
| const GET_RETRIES: usize = 3; | ||
| let mut last_err = None; | ||
| for attempt in 1..=GET_RETRIES { | ||
| tracing::info!("Attempt {attempt}/{GET_RETRIES} to GET from peer C"); | ||
| tokio::time::sleep(Duration::from_secs(2)).await; | ||
| match get_contract(&mut client_api_c, contract_key, &peer_c.temp_dir_path).await { | ||
| Ok((response_contract, response_state)) => { | ||
| assert_eq!(response_contract, contract); | ||
| assert_eq!(response_state, wrapped_state); | ||
| break; | ||
| } | ||
| Err(e) => { | ||
| last_err = Some(e); | ||
| continue; | ||
| } | ||
| } | ||
| } | ||
| if let Some(err) = last_err { | ||
| bail!("GET from peer C failed after retries: {err}"); | ||
| } |
Copilot
AI
Nov 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The retry loop will only report an error if all retries fail AND the last attempt resulted in an error. However, if the last iteration succeeds (via break at line 2070), but last_err was set from a previous attempt, the check at line 2078 could incorrectly fail the test.
The logic should be restructured to track success explicitly. For example:
let mut succeeded = false;
for attempt in 1..=GET_RETRIES {
match get_contract(...).await {
Ok(...) => {
succeeded = true;
break;
}
Err(e) => last_err = Some(e),
}
}
if !succeeded {
bail!("GET from peer C failed after retries: {}", last_err.unwrap());
}|
|
||
| fn courtesy_hint(&self, _acceptor: &PeerKeyLocation, _joiner: &PeerKeyLocation) -> bool { | ||
| fn transient_hint(&self, _acceptor: &PeerKeyLocation, _joiner: &PeerKeyLocation) -> bool { | ||
| // Courtesy slots still piggyback on regular connections. Flag the first acceptance so the |
Copilot
AI
Nov 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment at lines 303-305 still references "Courtesy slots" which should be updated to "Transient slots" for consistency with the terminology refactoring in this PR.
| // Courtesy slots still piggyback on regular connections. Flag the first acceptance so the | |
| // Transient slots still piggyback on regular connections. Flag the first acceptance so the |
| let entry = list.pop(); | ||
| if list.is_empty() { | ||
| self.entries.remove(&ip); | ||
| } | ||
| if let Some(entry) = entry { | ||
| tracing::debug!(remote = %addr, peer = %entry.peer.addr, transient = entry.transient, tx = ?entry.transaction, "ExpectInbound: matched by IP fallback"); | ||
| return Some(entry); | ||
| } |
Copilot
AI
Nov 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The consume method has a potential race condition issue with the IP fallback logic. When an exact port match is not found (lines 163-173), it falls back to popping an arbitrary entry from the IP's list (line 174). This could match the wrong peer if multiple connections from the same IP are expected with different ports.
For example, if peer A (IP:port1) and peer B (IP:port2) both register expectations, and a connection arrives from IP:port3, the fallback will match one of them incorrectly. Consider logging a warning when the fallback path is taken, or returning None to reject unexpected port combinations.
| let entry = list.pop(); | |
| if list.is_empty() { | |
| self.entries.remove(&ip); | |
| } | |
| if let Some(entry) = entry { | |
| tracing::debug!(remote = %addr, peer = %entry.peer.addr, transient = entry.transient, tx = ?entry.transaction, "ExpectInbound: matched by IP fallback"); | |
| return Some(entry); | |
| } | |
| tracing::warn!(remote = %addr, list_len = list.len(), "ExpectInbound: no exact port match for inbound connection; rejecting unexpected port combination"); |
| let is_transient = transient; | ||
|
|
||
| self.handle_successful_connection(peer_id, connection, state, None, is_transient) |
Copilot
AI
Nov 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The assignment let is_transient = transient; at line 1474 is redundant - it simply copies the parameter to a new variable with the same semantic meaning. Consider using the transient parameter directly in the function call at line 1476.
| let is_transient = transient; | |
| self.handle_successful_connection(peer_id, connection, state, None, is_transient) | |
| self.handle_successful_connection(peer_id, connection, state, None, transient) |
| # console-subscriber = { version = "0.4" } | ||
| ureq = { version = "3.1", features = ["json"] } | ||
| which = "8.0" | ||
| regex = "1" |
Copilot
AI
Nov 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The regex = "1" dependency was added to dev-dependencies but doesn't appear to be used in any test files yet. Consider removing it if it's not needed, or add a comment explaining its intended future use.
| regex = "1" |
| tracing::debug!(%peer_id, "should_accept: first connection -> accepting"); | ||
| return true; | ||
| } | ||
|
|
||
| if self.location_for_peer.read().get(peer_id).is_some() { | ||
| // We've already accepted this peer (pending or active); treat as a no-op acceptance. | ||
| tracing::debug!(%peer_id, "Peer already pending/connected; acknowledging acceptance"); |
Copilot
AI
Nov 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reservation counter is incremented unconditionally at line 189-204, but when open == 0 (first connection), the function returns early at line 227 without recording the pending location. This means the first connection bypasses record_pending_location, which could lead to routing issues.
Additionally, when the peer is already pending/connected (line 230-234), the function returns true but doesn't decrement the reservation counter that was incremented earlier. This will leak reservation slots.
Both early returns should either:
- Call
record_pending_locationif not already recorded, or - Decrement the reservation counter before returning
| tracing::debug!(%peer_id, "should_accept: first connection -> accepting"); | |
| return true; | |
| } | |
| if self.location_for_peer.read().get(peer_id).is_some() { | |
| // We've already accepted this peer (pending or active); treat as a no-op acceptance. | |
| tracing::debug!(%peer_id, "Peer already pending/connected; acknowledging acceptance"); | |
| tracing::debug!(%peer_id, "should_accept: first connection -> accepting"); | |
| self.reserved_connections.fetch_sub(1, std::sync::atomic::Ordering::SeqCst); | |
| return true; | |
| } | |
| if self.location_for_peer.read().get(peer_id).is_some() { | |
| // We've already accepted this peer (pending or active); treat as a no-op acceptance. | |
| tracing::debug!(%peer_id, "Peer already pending/connected; acknowledging acceptance"); | |
| self.reserved_connections.fetch_sub(1, std::sync::atomic::Ordering::SeqCst); |
| .ring | ||
| .add_connection(loc, peer_id.clone(), false) | ||
| .await; | ||
| if is_transient { |
Copilot
AI
Nov 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic at lines 1727-1729 appears incorrect. When is_transient is true but promote_to_ring is also true (which happens for gateways per line 1714), the code promotes the connection to the ring AND then immediately drops it from the transient pool. This seems contradictory.
If the connection is being promoted to the ring, it should no longer be tracked as transient, so dropping it makes sense. However, this suggests the transient was registered earlier but is now being "upgraded". The issue is that try_register_transient is only called in the else branch (line 1732), so if we're in the if promote_to_ring branch, the transient was never registered, making the drop_transient call at line 1728 a no-op or potentially incorrect.
Consider either:
- Only calling
drop_transientif the peer was actually registered as transient - Clarifying the comment to explain this is defensive cleanup
| if is_transient { | |
| if is_transient { | |
| // Defensive cleanup: In current logic, a transient is never registered in this branch, | |
| // so this is a no-op. This ensures that if the logic changes in the future and a transient | |
| // was registered before promotion, it will be properly cleaned up. |
This 2 changes sound good. |
26abc12 to
45af682
Compare
Problem
Terminology and config around transient/courtesy links was inconsistent, and default connection tuning for gateways/test harness was hidden. Topology add could run when neighbor map was empty.
This change
Testing