@sanity
Collaborator

@sanity sanity commented Nov 22, 2025

Problem

Terminology and config around transient/courtesy links was inconsistent, and default connection tuning for gateways/test harness was hidden. Topology add could run when neighbor map was empty.

This change

  • Rename courtesy links to transient across core and tests; update helper naming.
  • Expose connection tuning knobs (min/max) in configs and bump test harness defaults (gateway max).
  • Avoid topology add when neighbor map is empty.

Testing

  • cargo check -p freenet (local).

@sanity sanity marked this pull request as draft November 22, 2025 05:34
@freenet freenet deleted a comment from claude bot Nov 23, 2025
@sanity
Collaborator Author

sanity commented Nov 23, 2025

@claude Please carefully review this PR, both high-level and the details.

@claude

claude bot commented Nov 23, 2025

Claude finished @sanity's task


PR Review: Transient Terminology and Tuning Defaults

Summary

This PR successfully addresses the stated goals:

  1. ✅ Renames "courtesy" links to "transient" for consistency
  2. ✅ Exposes connection tuning configuration (min/max connections, transient budget/ttl)
  3. ✅ Fixes topology add when neighbor map is empty

The changes are well-structured and the terminology refactoring is comprehensive. However, I've identified several areas for improvement.


High-Level Architecture Review

Positive Aspects

  1. Consistent Terminology: The rename from "courtesy" to "transient" is thorough and improves code clarity. The term better describes the ephemeral nature of these connections.

  2. Config Exposure: Exposing min_connections, max_connections, transient_budget, and transient_ttl as configuration options is good. This enables better testing and production tuning.

  3. Topology Guard Fix: The guard in select_connections_to_add (topology/mod.rs:483-486) correctly prevents panics when the neighbor map is empty.

  4. Transient Promotion Logic: The gateway promotion logic in p2p_protoc.rs:1714 correctly treats gateway connections specially.

Concerns

  1. Incomplete Transient Lifecycle Management (connection_manager.rs:348-413):

    • The try_register_transient method has a race condition between checking the budget (line 359) and incrementing the counter (line 371).
    • The double-check after increment (lines 372-377) mitigates but doesn't eliminate the race.
    • Recommendation: Use a compare-and-swap loop similar to the reservation logic in should_accept (lines 178-205).
  2. Transient TTL Enforcement (p2p_protoc.rs:1741-1756):

    • The TTL enforcement spawns an async task per transient connection, which could create many concurrent timers.
    • No cleanup if the connection is dropped before TTL expires (the task will still fire).
    • Recommendation: Consider a single background task that periodically sweeps expired transients, or at minimum document why per-connection tasks are preferred.
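A minimal sketch of the sweep-based alternative, assuming a table keyed by peer with the registration time stored per entry. `TransientTable`, its string peer keys, and `sweep_expired` are hypothetical names for illustration and are not taken from the PR. One background task would call `sweep_expired` on a fixed interval, and dropping a connection early just removes its entry, so no stale timer is left to fire.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the transient table; peers are plain strings here.
pub struct TransientTable {
    ttl: Duration,
    /// peer -> instant at which the transient was registered
    entries: HashMap<String, Instant>,
}

impl TransientTable {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Record (or refresh) a transient registered at `now`.
    pub fn register(&mut self, peer: &str, now: Instant) {
        self.entries.insert(peer.to_string(), now);
    }

    /// Remove an entry before its TTL; returns whether it was present.
    pub fn drop_transient(&mut self, peer: &str) -> bool {
        self.entries.remove(peer).is_some()
    }

    /// Remove every entry whose age has reached the TTL and return the
    /// expired peers in sorted order.
    pub fn sweep_expired(&mut self, now: Instant) -> Vec<String> {
        let ttl = self.ttl;
        let mut expired: Vec<String> = self
            .entries
            .iter()
            .filter(|(_, registered)| now.duration_since(**registered) >= ttl)
            .map(|(peer, _)| peer.clone())
            .collect();
        expired.sort();
        for peer in &expired {
            self.entries.remove(peer);
        }
        expired
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }
}
```

The trade-off is precision: an entry can outlive its TTL by up to one sweep interval, which may or may not be acceptable for transient connections.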

Detailed Implementation Review

connection_manager.rs

Lines 348-380: Race condition in try_register_transient

pub fn try_register_transient(&self, peer: PeerId, location: Option<Location>) -> bool {
    if self.transient_connections.contains_key(&peer) {
        // ... update existing
        return true;
    }

    let current = self.transient_in_use.load(Ordering::Acquire);
    if current >= self.transient_budget {  // ❌ TOCTOU race
        return false;
    }

    // ... insert ...
    let prev = self.transient_in_use.fetch_add(1, Ordering::SeqCst);
    if prev >= self.transient_budget {  // ⚠️ Mitigation, not prevention
        // Undo...
        return false;
    }
    true
}

Suggested Fix:

pub fn try_register_transient(&self, peer: PeerId, location: Option<Location>) -> bool {
    if self.transient_connections.contains_key(&peer) {
        if let Some(mut entry) = self.transient_connections.get_mut(&peer) {
            entry.location = location;
        }
        return true;
    }

    // Atomically reserve a slot
    loop {
        let current = self.transient_in_use.load(Ordering::Acquire);
        if current >= self.transient_budget {
            return false;
        }
        match self.transient_in_use.compare_exchange(
            current,
            current + 1,
            Ordering::SeqCst,
            Ordering::Acquire,
        ) {
            Ok(_) => break,
            Err(_) => continue,  // Retry if another thread modified
        }
    }

    // Now we have a reserved slot, insert the entry
    self.transient_connections.insert(peer, TransientEntry { /* ... */ });
    true
}
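The reservation step can be exercised in isolation. The following is a standalone sketch, not the PR's code: `TransientBudget`, `try_reserve`, `release`, and `contend` are hypothetical names, and the counter is detached from any connection map. It shows a compare-and-swap loop that never lets the counter exceed the budget, plus a small helper that hammers it from several threads.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical counter mirroring `transient_in_use` / `transient_budget`.
pub struct TransientBudget {
    in_use: AtomicUsize,
    budget: usize,
}

impl TransientBudget {
    pub fn new(budget: usize) -> Self {
        Self { in_use: AtomicUsize::new(0), budget }
    }

    /// Reserve one slot; returns false when the budget is exhausted.
    pub fn try_reserve(&self) -> bool {
        let mut current = self.in_use.load(Ordering::Acquire);
        loop {
            if current >= self.budget {
                return false;
            }
            match self.in_use.compare_exchange_weak(
                current,
                current + 1,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return true,
                // Another thread changed the counter; retry with the value it left.
                Err(observed) => current = observed,
            }
        }
    }

    /// Release one slot, saturating at zero.
    pub fn release(&self) {
        let _ = self
            .in_use
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| n.checked_sub(1));
    }

    pub fn in_use(&self) -> usize {
        self.in_use.load(Ordering::Acquire)
    }
}

/// Spawn `threads` workers that each attempt `attempts` reservations and
/// return the total number of successful reservations.
pub fn contend(budget: usize, threads: usize, attempts: usize) -> usize {
    let shared = Arc::new(TransientBudget::new(budget));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let shared = Arc::clone(&shared);
            thread::spawn(move || (0..attempts).filter(|_| shared.try_reserve()).count())
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("worker thread panicked"))
        .sum()
}
```

Because a slot is only counted after a successful exchange, the total number of successful reservations across all threads cannot exceed the budget, which is the property a concurrency test for `try_register_transient` would want to assert.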

Lines 434-465: add_connection doesn't check for duplicate

  • The method doesn't check if the peer is already in location_for_peer before inserting.
  • This could cause unexpected behavior if called twice for the same peer.
  • Recommendation: Add an assertion or early return if the peer already exists.

handshake.rs

Line 125: TODO comment

transient: bool, // TODO: rename to transient in protocol once we migrate terminology

This TODO is now stale since the PR completes the terminology migration. Should be removed.

Lines 134-149: register method
The retain call (line 143) before push ensures no duplicates by port, which is good. However:

  • No limit on how many expectations can be registered per IP
  • Could theoretically exhaust memory with many expectations
  • Recommendation: Consider adding a per-IP limit or total limit on expected inbound connections.
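One way the per-IP limit could look, as a sketch under stated assumptions: `ExpectedInbound` here is a simplified stand-in that stores only socket addresses, and `MAX_EXPECTED_PER_IP` with its value of 4 is an illustrative choice, not a value from the PR. The retain-then-push de-duplication by port follows the behavior described above.

```rust
use std::collections::HashMap;
use std::net::{IpAddr, SocketAddr};

/// Assumed cap for illustration only.
pub const MAX_EXPECTED_PER_IP: usize = 4;

/// Simplified tracker: expected inbound addresses grouped by IP.
#[derive(Default)]
pub struct ExpectedInbound {
    entries: HashMap<IpAddr, Vec<SocketAddr>>,
}

impl ExpectedInbound {
    /// Register an expectation; returns false if the IP is already at its cap.
    pub fn register(&mut self, addr: SocketAddr) -> bool {
        let list = self.entries.entry(addr.ip()).or_default();
        // Replace any existing expectation for the same port.
        list.retain(|existing| existing.port() != addr.port());
        if list.len() >= MAX_EXPECTED_PER_IP {
            return false;
        }
        list.push(addr);
        true
    }

    /// Number of expectations currently held for `ip`.
    pub fn expected_for(&self, ip: IpAddr) -> usize {
        self.entries.get(&ip).map_or(0, Vec::len)
    }
}
```

Re-registering a port that is already tracked still succeeds at the cap, because the old entry is removed before the length check.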

p2p_protoc.rs

Lines 1474, 1714: Transient flag inconsistency

// Line 1474: Inbound connections
let is_transient = transient;  // Respects the flag

// Line 1714: Promotion logic
let promote_to_ring = !is_transient || connection_manager.is_gateway();

The promotion logic treats gateway connections specially, which makes sense. However:

  • The comment at lines 1472-1474 says "Treat only transient connections as transient" but doesn't explain the gateway exception
  • Recommendation: Add a comment explaining why gateways always promote transients.

Lines 1686-1698: Budget check duplication
There are two budget checks for transient connections (lines 1609 and 1686). This creates redundancy:

// First check at line 1609
if is_transient && !connection_manager.try_register_transient(peer_id.clone(), None) {
    // ... drop connection ...
}

// Second check at line 1686
if is_transient {
    let cm = &self.bridge.op_manager.ring.connection_manager;
    let current = cm.transient_count();
    if current >= cm.transient_budget() {
        // ... drop connection again ...
    }
}

Recommendation: Remove the second check at line 1686 since try_register_transient already enforces the budget.

config/mod.rs

Lines 48-49, 723-737: Good defaults

const DEFAULT_TRANSIENT_BUDGET: usize = 32;
const DEFAULT_TRANSIENT_TTL_SECS: u64 = 30;

These defaults are reasonable. However:

  • No documentation explaining why 32 connections and 30 seconds were chosen
  • Recommendation: Add comments explaining the rationale, especially for production deployments.

Lines 694-713: Serialization naming
The TOML serialization uses kebab-case (transient-budget, min-number-of-connections) which is consistent. Good job maintaining naming conventions.

topology/mod.rs

Lines 483-486: Empty neighbor guard

if neighbor_locations.is_empty() {
    tracing::warn!("select_connections_to_add: neighbor map empty; skipping adjustment");
    return Ok(TopologyAdjustment::NoChange);
}

✅ This correctly prevents the panic that could occur when calling create_density_map with empty neighbors.

Question: Can this method be called when neighbor_locations is empty in production? If this is only a bootstrap edge case, should it be a debug log instead of warn?

operations/connect.rs

Lines 302-307: Transient hint logic

fn transient_hint(&self, _acceptor: &PeerKeyLocation, _joiner: &PeerKeyLocation) -> bool {
    // Courtesy slots still piggyback on regular connections...
    self.op_manager.ring.open_connections() == 0
}
  • The comment still says "Courtesy slots" (should be "Transient connections")
  • The logic returns true only when there are no connections, which seems backwards - wouldn't you want transient connections when you're at capacity, not when empty?
  • Recommendation: Clarify the intended behavior and update the comment.

Test Coverage Review

Test Updates

The test files show appropriate updates:

  1. connectivity.rs: Added transient budget configuration
  2. error_notification.rs: Added configuration for transient settings
  3. operations.rs: Updated to use transient terminology
  4. test_network_integration.rs: Reduced duplication, good cleanup

Missing Test Coverage:

  1. No test verifying transient TTL expiration
  2. No test for transient budget exhaustion and recovery
  3. No test for the race condition in try_register_transient
  4. No test for the topology guard when neighbor map is empty

Recommendation: Add integration tests for:

#[test]
fn test_transient_ttl_expiration() {
    // Create a transient connection
    // Wait for TTL to expire
    // Verify connection is dropped
}

#[test]
fn test_transient_budget_limit() {
    // Fill up transient budget
    // Attempt one more connection
    // Verify it's rejected
    // Drop a transient
    // Verify new connection is accepted
}

Documentation and Comments

Issues

  1. Stale TODO (handshake.rs:125): Remove since terminology is now migrated
  2. Unclear comment (connect.rs:303): Still refers to "courtesy" instead of "transient"
  3. Missing rationale: No explanation for default values (32 budget, 30s TTL)

Recommendations

  1. Update all comments to use "transient" terminology
  2. Add module-level documentation explaining transient connection lifecycle
  3. Document the relationship between transient budget and max connections

Performance and Scalability

Concerns

  1. Per-Connection Timers (p2p_protoc.rs:1741): With a default budget of 32, you could have 32 concurrent timer tasks. Each spawned task has overhead.

    • Impact: Marginal at 32, but could be significant if budget is increased
    • Recommendation: Profile timer overhead or switch to a sweep-based approach
  2. DashMap Usage (connection_manager.rs:35): Using DashMap for transients is good for concurrent access, but the contains_key + get_mut pattern (lines 351-355) is inefficient.

    • Recommendation: Use entry API to avoid double lookup
  3. Atomic Counter Contention (connection_manager.rs:371): fetch_add on transient_in_use could cause contention under high connection churn.

    • Impact: Likely negligible unless thousands of connections/sec
    • Recommendation: Monitor in production, consider batching if needed
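The double-lookup point can be illustrated with the standard library's `HashMap`, whose entry API has the same shape; DashMap also provides an `entry` method, but std is used here to keep the sketch dependency-free. `upsert_location` and the `Option<f64>` location type are hypothetical simplifications.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

/// Update the location of a known peer or insert a new one with a single
/// lookup. Returns true when a new entry was created.
pub fn upsert_location(
    map: &mut HashMap<String, Option<f64>>,
    peer: &str,
    location: Option<f64>,
) -> bool {
    match map.entry(peer.to_string()) {
        Entry::Occupied(mut existing) => {
            // Peer already tracked: overwrite the location in place.
            *existing.get_mut() = location;
            false
        }
        Entry::Vacant(slot) => {
            slot.insert(location);
            true
        }
    }
}
```

In the real method the vacant branch is also where the budget reservation would have to happen, so the two concerns interact and would need to be designed together.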

Security Considerations

Potential Issues

  1. DoS via Transient Exhaustion: An attacker could exhaust the transient budget by opening many connections, preventing legitimate joiners from connecting.

    • Mitigation: The budget limit helps, but no IP-based rate limiting
    • Recommendation: Consider per-IP limits or exponential backoff for repeated transient attempts
  2. TTL Manipulation: Since TTL is configurable, setting it too high could allow transients to persist indefinitely.

    • Recommendation: Add validation to ensure TTL is reasonable (e.g., max 300 seconds)
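A sketch of the suggested TTL validation. The 300-second ceiling is the example figure from the recommendation above, and `validate_transient_ttl` and `MAX_TRANSIENT_TTL_SECS` are hypothetical names; rejecting a zero TTL is an extra assumption made here.

```rust
use std::time::Duration;

/// Upper bound taken from the review's example; not a value from the codebase.
pub const MAX_TRANSIENT_TTL_SECS: u64 = 300;

/// Convert a configured TTL in seconds into a `Duration`, rejecting values
/// that are zero or above the ceiling.
pub fn validate_transient_ttl(secs: u64) -> Result<Duration, String> {
    if secs == 0 {
        return Err("transient TTL must be greater than zero".to_string());
    }
    if secs > MAX_TRANSIENT_TTL_SECS {
        return Err(format!(
            "transient TTL of {secs}s exceeds the maximum of {MAX_TRANSIENT_TTL_SECS}s"
        ));
    }
    Ok(Duration::from_secs(secs))
}
```

Returning an error rather than silently clamping makes a misconfiguration visible at startup; clamping would be the gentler alternative.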

Recommendations Summary

Critical (Should Fix Before Merge)

  1. Fix race condition in try_register_transient - Use compare-and-swap loop
  2. Remove duplicate budget check in p2p_protoc.rs (lines 1686-1698)
  3. Update stale comments - Remove TODO, fix "courtesy" references

Important (Consider Fixing)

  1. Add tests for transient TTL expiration and budget exhaustion
  2. Clarify transient promotion logic - Add comments explaining gateway exception
  3. Document default values - Explain why 32/30s were chosen

Nice to Have

  1. Optimize DashMap usage - Use entry API instead of contains_key + get_mut
  2. Consider sweep-based TTL instead of per-connection timers
  3. Add per-IP limits to prevent transient budget exhaustion attacks

Conclusion

This is a solid refactoring that improves code clarity and configuration flexibility. The core changes are well-implemented, but there are some edge cases and potential race conditions that should be addressed.

Recommendation: Approve with requested changes, focusing on the race condition fix and test coverage improvements.


@sanity sanity requested a review from Copilot November 23, 2025 02:49
@sanity sanity marked this pull request as ready for review November 23, 2025 02:50
Copilot finished reviewing on behalf of sanity November 23, 2025 02:54
@sanity
Collaborator Author

sanity commented Nov 23, 2025

Appreciate the detailed pass. Given this PR’s scope, I’m leaving the transient CAS/timer reshaping for the later slices (2123/2125 already build on top of this). The terminology nits and duplicate budget check are addressed downstream as well. If we want to backport the CAS change here too, I can do that, but I’d like to avoid churn on this slice unless necessary.

Contributor

Copilot AI left a comment

Pull request overview

This PR refactors connection terminology from "courtesy" to "transient" throughout the codebase, exposes previously hidden connection tuning parameters (min/max connections), and adds a guard to prevent topology adjustments when the neighbor map is empty.

Key Changes:

  • Renamed "courtesy" to "transient" across core networking code, tests, and public APIs
  • Exposed min_connections and max_connections configuration options with defaults of 10 and 20 respectively
  • Added empty neighbor map check in topology manager to prevent crashes

Reviewed changes

Copilot reviewed 16 out of 17 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| crates/core/src/ring/connection_manager.rs | Consolidated pending/active peer tracking into single map; removed pending_locations; updated terminology; added transient routing filter |
| crates/core/src/operations/connect.rs | Renamed courtesy to transient in all structs, functions, and messages |
| crates/core/src/node/network_bridge/p2p_protoc.rs | Updated connection handling to use transient terminology; adjusted promotion logic |
| crates/core/src/node/network_bridge/handshake.rs | Refactored ExpectedInboundTracker to key by IP address with port matching; renamed courtesy to transient |
| crates/core/src/topology/mod.rs | Added guard to skip topology adjustments when neighbor map is empty |
| crates/core/src/config/mod.rs | Added min_connections and max_connections configuration fields with defaults |
| crates/core/src/node/mod.rs | Wired connection limit configs into NodeConfig |
| crates/core/tests/operations.rs | Added retry logic for flaky three-hop GET test |
| crates/core/tests/connectivity.rs | Increased retry attempts and delays for CI stability |
| crates/core/tests/test_network_integration.rs | Changed from static network to fresh instance per test |
| crates/freenet-macros/src/codegen.rs | Updated generated test configs with new connection fields |
| crates/core/Cargo.toml | Upgraded freenet-test-network to 0.1.3; added regex dev-dependency |

spawn_outbound(outbound.clone(), events_tx.clone(), peer, transaction, transient, peer_ready.clone());
}
Some(Command::ExpectInbound { peer, transaction, transient }) => {
expected_inbound.register(peer, transaction, transient /* transient */);

Copilot AI Nov 23, 2025

There's a redundant comment /* transient */ at the end of line 209 that should be removed. It appears to be leftover from the refactoring.

Suggested change
- expected_inbound.register(peer, transaction, transient /* transient */);
+ expected_inbound.register(peer, transaction, transient);

Copilot uses AI. Check for mistakes.
Comment on lines 1430 to 1445
  let _conn_manager = &self.bridge.op_manager.ring.connection_manager;
  let remote_addr = connection.remote_addr();

  if let Some(blocked_addrs) = &self.blocked_addresses {
      if blocked_addrs.contains(&remote_addr) {
          tracing::info!(
              remote = %remote_addr,
-             courtesy,
+             transient = transient,
              transaction = ?transaction,
              "Inbound connection blocked by local policy"
          );
          return Ok(());
      }
  }

- let provided_peer = peer.clone();
+ let _provided_peer = peer.clone();

Copilot AI Nov 23, 2025

The variable _conn_manager at line 1430 is assigned but never used (indicated by the underscore prefix). Similarly, _provided_peer at line 1445 is unused. These appear to be leftover from debugging or refactoring. Consider removing them to clean up the code.

Comment on lines +2060 to +2080
// Allow routing to settle and retry GET a few times to deflake under CI load.
const GET_RETRIES: usize = 3;
let mut last_err = None;
for attempt in 1..=GET_RETRIES {
    tracing::info!("Attempt {attempt}/{GET_RETRIES} to GET from peer C");
    tokio::time::sleep(Duration::from_secs(2)).await;
    match get_contract(&mut client_api_c, contract_key, &peer_c.temp_dir_path).await {
        Ok((response_contract, response_state)) => {
            assert_eq!(response_contract, contract);
            assert_eq!(response_state, wrapped_state);
            break;
        }
        Err(e) => {
            last_err = Some(e);
            continue;
        }
    }
}
if let Some(err) = last_err {
    bail!("GET from peer C failed after retries: {err}");
}

Copilot AI Nov 23, 2025

The retry loop never clears last_err after a successful attempt. If an earlier attempt fails and a later one succeeds (via break at line 2070), last_err is still Some, so the check at line 2078 fails the test even though the GET succeeded.

The logic should be restructured to track success explicitly. For example:

let mut succeeded = false;
for attempt in 1..=GET_RETRIES {
    match get_contract(...).await {
        Ok(...) => {
            succeeded = true;
            break;
        }
        Err(e) => last_err = Some(e),
    }
}
if !succeeded {
    bail!("GET from peer C failed after retries: {}", last_err.unwrap());
}
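The same idea can be factored into a small helper that returns on the first success and only surfaces the error from the final attempt. This is a synchronous sketch for illustration; the test in the PR is async and would need an async equivalent, and `retry` is a hypothetical name.

```rust
/// Call `op` up to `max_attempts` times, passing the 1-based attempt number.
/// Returns the first success, or the error from the last attempt.
pub fn retry<T, E>(
    max_attempts: usize,
    mut op: impl FnMut(usize) -> Result<T, E>,
) -> Result<T, E> {
    assert!(max_attempts > 0, "max_attempts must be at least 1");
    let mut attempt = 1;
    loop {
        match op(attempt) {
            // A success ends the loop immediately; earlier errors are discarded.
            Ok(value) => return Ok(value),
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(_) => attempt += 1,
        }
    }
}
```

Because success returns directly, there is no leftover error state that could fail the caller after a later attempt succeeded.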


- fn courtesy_hint(&self, _acceptor: &PeerKeyLocation, _joiner: &PeerKeyLocation) -> bool {
+ fn transient_hint(&self, _acceptor: &PeerKeyLocation, _joiner: &PeerKeyLocation) -> bool {
      // Courtesy slots still piggyback on regular connections. Flag the first acceptance so the

Copilot AI Nov 23, 2025

The comment at lines 303-305 still references "Courtesy slots" which should be updated to "Transient slots" for consistency with the terminology refactoring in this PR.

Suggested change
- // Courtesy slots still piggyback on regular connections. Flag the first acceptance so the
+ // Transient slots still piggyback on regular connections. Flag the first acceptance so the

Comment on lines +174 to +181
let entry = list.pop();
if list.is_empty() {
    self.entries.remove(&ip);
}
if let Some(entry) = entry {
    tracing::debug!(remote = %addr, peer = %entry.peer.addr, transient = entry.transient, tx = ?entry.transaction, "ExpectInbound: matched by IP fallback");
    return Some(entry);
}

Copilot AI Nov 23, 2025

The consume method has a potential race condition issue with the IP fallback logic. When an exact port match is not found (lines 163-173), it falls back to popping an arbitrary entry from the IP's list (line 174). This could match the wrong peer if multiple connections from the same IP are expected with different ports.

For example, if peer A (IP:port1) and peer B (IP:port2) both register expectations, and a connection arrives from IP:port3, the fallback will match one of them incorrectly. Consider logging a warning when the fallback path is taken, or returning None to reject unexpected port combinations.

Suggested change
- let entry = list.pop();
- if list.is_empty() {
-     self.entries.remove(&ip);
- }
- if let Some(entry) = entry {
-     tracing::debug!(remote = %addr, peer = %entry.peer.addr, transient = entry.transient, tx = ?entry.transaction, "ExpectInbound: matched by IP fallback");
-     return Some(entry);
- }
+ tracing::warn!(remote = %addr, list_len = list.len(), "ExpectInbound: no exact port match for inbound connection; rejecting unexpected port combination");
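A sketch of the stricter matching, assuming a simplified tracker that stores only socket addresses. `Expectations` and `consume_exact` are hypothetical names; the real tracker carries peer, transaction, and transient fields. An inbound connection matches only an expectation registered for the same IP and port, and a connection from an unexpected port leaves the other expectations untouched.

```rust
use std::collections::HashMap;
use std::net::{IpAddr, SocketAddr};

/// Simplified tracker: expected inbound addresses grouped by IP.
#[derive(Default)]
pub struct Expectations {
    entries: HashMap<IpAddr, Vec<SocketAddr>>,
}

impl Expectations {
    pub fn register(&mut self, addr: SocketAddr) {
        self.entries.entry(addr.ip()).or_default().push(addr);
    }

    /// Consume the expectation matching both IP and port, if any.
    /// There is no fallback to another entry on the same IP.
    pub fn consume_exact(&mut self, remote: SocketAddr) -> Option<SocketAddr> {
        let list = self.entries.get_mut(&remote.ip())?;
        let pos = list.iter().position(|e| e.port() == remote.port())?;
        let entry = list.remove(pos);
        let now_empty = list.is_empty();
        if now_empty {
            // Drop the per-IP bucket once its last expectation is consumed.
            self.entries.remove(&remote.ip());
        }
        Some(entry)
    }

    pub fn pending(&self) -> usize {
        self.entries.values().map(Vec::len).sum()
    }
}
```

Whether strict matching is viable depends on how often the observed source port differs from the advertised one (for example behind NAT), which is presumably why the fallback exists in the first place.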

Comment on lines 1474 to 1476
let is_transient = transient;

self.handle_successful_connection(peer_id, connection, state, None, is_transient)

Copilot AI Nov 23, 2025

[nitpick] The assignment let is_transient = transient; at line 1474 is redundant - it simply copies the parameter to a new variable with the same semantic meaning. Consider using the transient parameter directly in the function call at line 1476.

Suggested change
- let is_transient = transient;
- self.handle_successful_connection(peer_id, connection, state, None, is_transient)
+ self.handle_successful_connection(peer_id, connection, state, None, transient)

# console-subscriber = { version = "0.4" }
ureq = { version = "3.1", features = ["json"] }
which = "8.0"
regex = "1"

Copilot AI Nov 23, 2025

[nitpick] The regex = "1" dependency was added to dev-dependencies but doesn't appear to be used in any test files yet. Consider removing it if it's not needed, or add a comment explaining its intended future use.

Suggested change
- regex = "1"

Comment on lines 226 to 231
tracing::debug!(%peer_id, "should_accept: first connection -> accepting");
return true;
}

if self.location_for_peer.read().get(peer_id).is_some() {
// We've already accepted this peer (pending or active); treat as a no-op acceptance.
tracing::debug!(%peer_id, "Peer already pending/connected; acknowledging acceptance");

Copilot AI Nov 23, 2025

The reservation counter is incremented unconditionally at lines 189-204, but when open == 0 (first connection), the function returns early at line 227 without recording the pending location. This means the first connection bypasses record_pending_location, which could lead to routing issues.

Additionally, when the peer is already pending/connected (lines 230-234), the function returns true but doesn't decrement the reservation counter that was incremented earlier. This will leak reservation slots.

Both early returns should either:

  1. Call record_pending_location if not already recorded, or
  2. Decrement the reservation counter before returning
Suggested change
- tracing::debug!(%peer_id, "should_accept: first connection -> accepting");
- return true;
- }
- if self.location_for_peer.read().get(peer_id).is_some() {
- // We've already accepted this peer (pending or active); treat as a no-op acceptance.
- tracing::debug!(%peer_id, "Peer already pending/connected; acknowledging acceptance");
+ tracing::debug!(%peer_id, "should_accept: first connection -> accepting");
+ self.reserved_connections.fetch_sub(1, std::sync::atomic::Ordering::SeqCst);
+ return true;
+ }
+ if self.location_for_peer.read().get(peer_id).is_some() {
+ // We've already accepted this peer (pending or active); treat as a no-op acceptance.
+ tracing::debug!(%peer_id, "Peer already pending/connected; acknowledging acceptance");
+ self.reserved_connections.fetch_sub(1, std::sync::atomic::Ordering::SeqCst);
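An alternative to decrementing by hand on every early return is a drop guard that releases the reservation unless it is explicitly committed. This is a different technique from the suggestion above, sketched under assumptions: `Reservation`, `commit`, and `should_accept_sketch` are hypothetical names, and the `already_known` flag stands in for the `location_for_peer` lookup.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Holds one reserved slot and gives it back on drop unless committed.
pub struct Reservation<'a> {
    counter: &'a AtomicUsize,
    committed: bool,
}

impl<'a> Reservation<'a> {
    /// Increment the counter and return a guard for that slot.
    pub fn acquire(counter: &'a AtomicUsize) -> Self {
        counter.fetch_add(1, Ordering::SeqCst);
        Self { counter, committed: false }
    }

    /// Keep the slot: the guard is consumed and its drop does nothing.
    pub fn commit(mut self) {
        self.committed = true;
    }
}

impl Drop for Reservation<'_> {
    fn drop(&mut self) {
        if !self.committed {
            self.counter.fetch_sub(1, Ordering::SeqCst);
        }
    }
}

/// Simplified acceptance path: a peer that is already known is acknowledged
/// without consuming a new reservation.
pub fn should_accept_sketch(reserved: &AtomicUsize, already_known: bool) -> bool {
    let reservation = Reservation::acquire(reserved);
    if already_known {
        // Early return: the guard is dropped here and the slot is released.
        return true;
    }
    reservation.commit();
    true
}
```

With a guard, any early return added later releases the slot by default, so a leak requires an explicit `commit` rather than a forgotten `fetch_sub`.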

.ring
.add_connection(loc, peer_id.clone(), false)
.await;
if is_transient {

Copilot AI Nov 23, 2025

The logic at lines 1727-1729 appears incorrect. When is_transient is true but promote_to_ring is also true (which happens for gateways per line 1714), the code promotes the connection to the ring AND then immediately drops it from the transient pool. This seems contradictory.

If the connection is being promoted to the ring, it should no longer be tracked as transient, so dropping it makes sense. However, this suggests the transient was registered earlier but is now being "upgraded". The issue is that try_register_transient is only called in the else branch (line 1732), so if we're in the if promote_to_ring branch, the transient was never registered, making the drop_transient call at line 1728 a no-op or potentially incorrect.

Consider either:

  1. Only calling drop_transient if the peer was actually registered as transient
  2. Clarifying the comment to explain this is defensive cleanup
Suggested change
if is_transient {
if is_transient {
// Defensive cleanup: In current logic, a transient is never registered in this branch,
// so this is a no-op. This ensures that if the logic changes in the future and a transient
// was registered before promotion, it will be properly cleaned up.

@iduartgomez
Collaborator

Rename courtesy links to transient across core and tests; update helper naming.
Expose connection tuning knobs (min/max) in configs and bump test harness defaults (gateway max).

These two changes sound good.

@sanity sanity force-pushed the fix/transient-split-2 branch from 26abc12 to 45af682 Compare November 23, 2025 20:17
@sanity sanity enabled auto-merge November 23, 2025 20:17
@sanity sanity added this pull request to the merge queue Nov 23, 2025
Merged via the queue into main with commit 920dfc8 Nov 23, 2025
11 checks passed
@sanity sanity deleted the fix/transient-split-2 branch November 23, 2025 20:52
3 participants