@sanity
Collaborator

@sanity sanity commented Nov 23, 2025

Problem

Connect forwarding is capacity-blind during join bursts. Gateways/early peers keep routing ConnectRequests to already-full neighbors, leaving joiners stuck with gateway-only links and slow small-world formation.

Solution

  • Track per-op forward attempts (success on ConnectResponse, failure on timeout) and learn a monotonic success-vs-distance curve via an isotonic estimator with per-peer adjustments.
  • Score routing candidates with that estimate while preserving ring-distance bias and a small recency cooldown to avoid hammering one neighbor.
  • Failures are recorded automatically when attempts age out (20s), so the model learns even without responses. All logic stays inside ConnectOp; cold start falls back to existing distance routing.
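As a rough illustration of what a "monotonic success-vs-distance curve" means, the sketch below fits a non-increasing step curve to forward outcomes with the pool-adjacent-violators algorithm. It is not the PR's estimator: the direction of the curve (success not increasing with distance) is an assumption, per-peer adjustments are omitted, and `Block` and `fit_non_increasing` are hypothetical names.

```rust
/// One pooled block of observations: summed successes and total weight.
#[derive(Debug, Clone, Copy)]
struct Block {
    sum: f64,
    weight: f64,
}

impl Block {
    fn mean(&self) -> f64 {
        self.sum / self.weight
    }
}

/// Fit a non-increasing curve to (distance, success) outcomes using
/// pool-adjacent-violators. Returns (distance, fitted success rate) pairs
/// sorted by ascending distance.
/// ASSUMPTION: success probability does not increase with distance.
fn fit_non_increasing(mut points: Vec<(f64, bool)>) -> Vec<(f64, f64)> {
    points.sort_by(|a, b| a.0.total_cmp(&b.0));
    // Each entry is a pooled block plus the number of points it covers.
    let mut blocks: Vec<(Block, usize)> = Vec::new();
    for &(_, ok) in &points {
        let mut cur = (Block { sum: if ok { 1.0 } else { 0.0 }, weight: 1.0 }, 1usize);
        // Merge backwards while the previous block's mean is lower than the
        // current one, since that would violate the non-increasing constraint.
        while let Some(&(prev, n)) = blocks.last() {
            if prev.mean() < cur.0.mean() {
                blocks.pop();
                let merged = Block { sum: prev.sum + cur.0.sum, weight: prev.weight + cur.0.weight };
                cur = (merged, n + cur.1);
            } else {
                break;
            }
        }
        blocks.push(cur);
    }
    // Expand pooled blocks back to one fitted value per input point.
    let mut fitted = Vec::with_capacity(points.len());
    let mut idx = 0;
    for (block, n) in blocks {
        for _ in 0..n {
            fitted.push((points[idx].0, block.mean()));
            idx += 1;
        }
    }
    fitted
}
```

With outcomes success, failure, success, failure at increasing distances, the middle two observations are pooled, so the fitted curve steps down from 1.0 to 0.5 to 0.0 instead of oscillating.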

Testing

  • cargo test -p freenet --tests --no-run
  • 40-peer soak: SOAK_PEER_COUNT=40 SOAK_SNAPSHOT_ITERATIONS=2 SOAK_SNAPSHOT_INTERVAL_SECS=60 cargo test -p freenet --test large_network -- --ignored --nocapture (all peers healthy; snapshots preserved under the test network run root).

Notes

  • Draft PR stacked on the transient stack (base fix/transient-connection-tracking); intended to augment/replace PR 2125 with learned, capacity-aware forwarding.

@sanity
Collaborator Author

sanity commented Nov 23, 2025

@claude Please review, both high-level and in-detail

@claude

claude bot commented Nov 23, 2025

Claude finished @sanity's task


Code Review Complete

Reviewed PR #2126: feat: learn connect forwarding success with isotonic scoring

High-Level Assessment

This PR introduces a significant enhancement to the connect forwarding mechanism using machine learning (isotonic regression) to make capacity-aware routing decisions. The approach is sound and addresses a real problem (capacity-blind forwarding during join bursts), but there are several concerns around complexity, potential bugs, and testing.

Overall Verdict: Needs revisions before merge. The core algorithm is solid, but implementation details need attention.


Architectural Review

✅ Strengths

  1. Well-scoped solution: The isotonic learning stays entirely within ConnectOp, keeping concerns localized
  2. Graceful degradation: Falls back to distance-based routing when the model lacks data
  3. Automatic failure detection: 20s timeout mechanism learns from unresponsive peers without explicit negative feedback
  4. Clever recency cooldown: Prevents hammering the same neighbor repeatedly

⚠️ Concerns

  1. Global static state: FORWARD_ESTIMATOR (lines 257-258 in connect.rs) is a global Lazy<RwLock<ForwardEstimator>>. This creates several issues:

    • Thread contention: All connect operations across all peers compete for the same lock
    • No isolation: All transactions share the same estimator, so a bad peer affects everyone
    • Testing nightmares: Global state makes tests non-deterministic and prevents parallel test execution
    • Memory leak potential: Never freed, grows unbounded
  2. Seed events use dummy peer: Lines 213-232 create a dummy peer with hardcoded 127.0.0.1:0 just to bootstrap the estimator. This is a code smell suggesting the estimator API needs rethinking.

  3. Missing capacity signal: The estimator learns success/failure of forwarding attempts, but doesn't directly observe peer capacity. A peer might accept the forward but still be nearly full, leading to suboptimal small-world formation.


Detailed Code Review

crates/core/src/operations/connect.rs

Critical Issues

Lines 257-258: Global mutable state

static FORWARD_ESTIMATOR: Lazy<RwLock<ForwardEstimator>> =
    Lazy::new(|| RwLock::new(ForwardEstimator::new()));

Problem: This creates a global shared estimator that all operations fight over.

Recommendation: Move estimator ownership to OpManager or Ring and pass it by reference. Each node should have its own estimator, not share one globally.
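One possible shape for per-node ownership is sketched below. `NodeState` is a hypothetical stand-in for whichever owner is chosen (the review suggests OpManager or Ring), and this `ForwardEstimator` is a toy counter rather than the PR's type; the point is only that each node holds its own `Arc<RwLock<...>>` and hands clones to its connect operations.

```rust
use std::sync::{Arc, RwLock};

/// Toy stand-in for the PR's ForwardEstimator: counts outcomes only.
#[derive(Default)]
struct ForwardEstimator {
    successes: u64,
    failures: u64,
}

impl ForwardEstimator {
    fn record(&mut self, success: bool) {
        if success {
            self.successes += 1;
        } else {
            self.failures += 1;
        }
    }

    /// None until at least one outcome has been recorded.
    fn success_rate(&self) -> Option<f64> {
        let total = self.successes + self.failures;
        (total > 0).then(|| self.successes as f64 / total as f64)
    }
}

/// Hypothetical per-node owner; in the real code this role would be played
/// by something like OpManager.
struct NodeState {
    connect_forward_estimator: Arc<RwLock<ForwardEstimator>>,
}

impl NodeState {
    fn new() -> Self {
        Self {
            connect_forward_estimator: Arc::new(RwLock::new(ForwardEstimator::default())),
        }
    }

    /// Each connect operation gets a handle to its own node's estimator.
    fn estimator_handle(&self) -> Arc<RwLock<ForwardEstimator>> {
        Arc::clone(&self.connect_forward_estimator)
    }
}
```

Two `NodeState` values created in the same process share nothing, so outcomes recorded through one node's handle are not visible through another's.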


Lines 523-535: Timeout cleanup logic

fn expire_forward_attempts(&mut self, now: Instant) {
    let mut expired = Vec::new();
    for (peer, attempt) in self.forward_attempts.iter() {
        if now.duration_since(attempt.sent_at) >= FORWARD_ATTEMPT_TIMEOUT {
            expired.push((peer.clone(), attempt.desired));
        }
    }
    for (peer, desired) in expired {
        if let Some(attempt) = self.forward_attempts.remove(&peer) {
            self.record_forward_outcome(&attempt.peer, desired, false);
        }
    }
}

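⚠️ Potential inefficiency: Double HashMap lookup (lines 526 and 531): each expired entry is found once while iterating and again on remove. Consider a single-pass approach such as HashMap::retain, or extract_if (the renamed drain_filter) where the toolchain supports it, or more simply collecting only the keys first.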

Recommendation:

fn expire_forward_attempts(&mut self, now: Instant) {
    let expired: Vec<PeerId> = self.forward_attempts
        .iter()
        .filter(|(_, attempt)| now.duration_since(attempt.sent_at) >= FORWARD_ATTEMPT_TIMEOUT)
        .map(|(peer, _)| peer.clone())
        .collect();
    
    for peer in expired {
        if let Some(attempt) = self.forward_attempts.remove(&peer) {
            self.record_forward_outcome(&attempt.peer, attempt.desired, false);
        }
    }
}
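If the goal is to avoid the second lookup entirely, a single pass with `HashMap::retain` is one option. The sketch below uses simplified, hypothetical types (`String` peer keys and a two-field `ForwardAttempt`); expired entries are returned so the caller can record each as a failure after the map borrow has ended.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

const FORWARD_ATTEMPT_TIMEOUT: Duration = Duration::from_secs(20);

/// Simplified, hypothetical attempt record; the PR's type carries more fields.
struct ForwardAttempt {
    sent_at: Instant,
    desired: f64,
}

/// Remove timed-out attempts in one pass over the map and return
/// (peer, desired location) for each, so the caller can record failures.
fn take_expired(
    attempts: &mut HashMap<String, ForwardAttempt>,
    now: Instant,
) -> Vec<(String, f64)> {
    let mut expired = Vec::new();
    attempts.retain(|peer, attempt| {
        let timed_out = now.duration_since(attempt.sent_at) >= FORWARD_ATTEMPT_TIMEOUT;
        if timed_out {
            expired.push((peer.clone(), attempt.desired));
        }
        // Keep only attempts that are still within the timeout.
        !timed_out
    });
    expired
}
```

Returning the expired entries instead of recording inside the closure sidesteps borrowing `self` twice, which is the usual obstacle to calling a method like `record_forward_outcome` from within `retain`.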

Lines 412-454: Scoring and fallback logic

let mut scored: Vec<(f64, PeerKeyLocation)> = Vec::new();
let mut fallback: Vec<PeerKeyLocation> = Vec::new();

for cand in candidates {
    if let Some(ts) = recency.get(&cand.peer) {
        if now.duration_since(*ts) < RECENCY_COOLDOWN {
            continue;
        }
    }
    if cand.location.is_some() {
        if let Some(score) = estimator.estimate(&cand, desired_location) {
            scored.push((score, cand.clone()));
            continue;
        }
    }
    fallback.push(cand.clone());
}

Good: Clear separation between scored and fallback candidates.

⚠️ Concern: Line 441 - multiple candidates can tie for best_score. The code collects all ties and passes them to select_peer, which picks one. This is correct but worth documenting.

Suggestion: Add a comment explaining tie-breaking strategy:

// If multiple candidates have the same best score, collect all of them and let
// the router's distance-based logic break ties (lines 437-444).
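The cooldown filter and tie collection described above can be sketched in isolation as follows. This is a simplified illustration, not the PR's code: peers are plain string slices, `scores` stands in for the estimator, and candidates without a score are simply skipped here, whereas the PR routes them to a fallback list.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

const RECENCY_COOLDOWN: Duration = Duration::from_secs(5);

/// Return every candidate that shares the best score, skipping peers used
/// within the cooldown window. The caller is expected to break ties among
/// the returned peers (for example by ring distance).
fn best_candidates<'a>(
    candidates: &[&'a str],
    scores: &HashMap<&str, f64>,
    recency: &HashMap<&str, Instant>,
    now: Instant,
) -> Vec<&'a str> {
    let mut best: Vec<&'a str> = Vec::new();
    let mut best_score = f64::NEG_INFINITY;
    for &cand in candidates {
        if let Some(ts) = recency.get(cand) {
            if now.duration_since(*ts) < RECENCY_COOLDOWN {
                continue; // used too recently: skip to avoid hammering it
            }
        }
        // No estimate for this peer: skipped in this sketch.
        let Some(&score) = scores.get(cand) else { continue };
        if score > best_score {
            best_score = score;
            best.clear();
            best.push(cand);
        } else if score == best_score {
            // Exact tie with the current best: keep both.
            best.push(cand);
        }
    }
    best
}
```

Note that a peer with the highest score is still excluded while it is inside the cooldown window, and becomes eligible again once the window has passed.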

Lines 884-888: Success recording on response

if let Some(fwd) = forwarded {
    self.record_forward_outcome(&fwd, desired, true);
}

Correct: Records success when a ConnectResponse arrives.

Question: What if multiple responses arrive (rare but possible if TTL allows multiple acceptors)? Current code only records the first. Is this intentional?


Minor Issues

Line 30: Hardcoded timeout

const FORWARD_ATTEMPT_TIMEOUT: Duration = Duration::from_secs(20);

📝 Suggestion: Make this configurable or at least document why 20s was chosen.

Line 31: Recency cooldown

const RECENCY_COOLDOWN: Duration = Duration::from_secs(5);

📝 Suggestion: Same as above. Consider making these part of ConnectionTuning config.
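A minimal sketch of what configurable values could look like, assuming optional overrides in seconds that fall back to the current constants. `ForwardTuning` and its field names are hypothetical and are not part of the PR or of ConnectionTuning.

```rust
use std::time::Duration;

/// Hypothetical tuning knobs for connect forwarding.
#[derive(Debug, Clone, PartialEq)]
struct ForwardTuning {
    forward_attempt_timeout: Duration,
    recency_cooldown: Duration,
}

impl Default for ForwardTuning {
    /// Defaults mirror the constants quoted in the review (20s and 5s).
    fn default() -> Self {
        Self {
            forward_attempt_timeout: Duration::from_secs(20),
            recency_cooldown: Duration::from_secs(5),
        }
    }
}

impl ForwardTuning {
    /// Build from optional overrides in seconds, falling back to defaults.
    fn from_overrides(timeout_secs: Option<u64>, cooldown_secs: Option<u64>) -> Self {
        let defaults = Self::default();
        Self {
            forward_attempt_timeout: timeout_secs
                .map(Duration::from_secs)
                .unwrap_or(defaults.forward_attempt_timeout),
            recency_cooldown: cooldown_secs
                .map(Duration::from_secs)
                .unwrap_or(defaults.recency_cooldown),
        }
    }
}
```

Keeping the defaults in one `Default` impl also gives a natural place to document why 20s and 5s were chosen.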


crates/core/src/router/isotonic_estimator.rs

Issues

Lines 23-24: Magic number

const ADJUSTMENT_PRIOR_SIZE: u64 = 10;

And again at line 51:

let adjustment_prior_size = 20;

Inconsistency: The constant is defined as 10 but overridden to 20 in the function. This is confusing.

Recommendation: Use the constant consistently or document why the override is necessary.


Line 129: Unwrap on location

let distance: f64 = contract_location.distance(peer.location.unwrap()).as_f64();

⚠️ Potential panic: If peer.location is None, this will panic.

Check lines 237-238 in connect.rs:

if peer.location.is_none() {
    return;
}

The caller checks before calling estimator.record(), but estimate_retrieval_time() should also guard against this.

Recommendation:

let peer_location = peer.location.ok_or(EstimationError::InsufficientData)?;
let distance: f64 = contract_location.distance(peer_location).as_f64();

Lines 98-100: Duplicate magic number

let adjustment_prior_size = 20;
let global_regression_big_enough_to_estimate_peer_adjustments =
    self.global_regression.len() >= adjustment_prior_size;

Same issue as above.


crates/core/src/config/mod.rs

Issues

Lines 103-104: New config fields

transient_budget: Some(DEFAULT_TRANSIENT_BUDGET),
transient_ttl_secs: Some(DEFAULT_TRANSIENT_TTL_SECS),

Good: Transient connection limits are now configurable.

Question: Where are the min_connections and max_connections tuning parameters that the PR description mentions? I see the config changes but no new parameters for the isotonic scoring itself (e.g., FORWARD_ATTEMPT_TIMEOUT).


crates/core/src/ring/connection_manager.rs

Issues

Lines 147-260: should_accept() changes
The logic has become quite complex with multiple early returns, gateway-specific limits, and topology manager integration.

Good:

  • Checks for duplicate peers (lines 172-176)
  • Reserves capacity atomically (lines 179-181)
  • Gateway-specific limit (lines 206-219)

⚠️ Concern: Lines 183-198 - the overflow check is good, but this is defensive programming around a scenario that shouldn't happen in practice. If counters overflow, we have bigger problems.

Lines 206-219: Gateway direct-accept limit

const GATEWAY_DIRECT_ACCEPT_LIMIT: usize = 2;
if self.is_gateway {
    let direct_total = open + reserved_before;
    if direct_total >= GATEWAY_DIRECT_ACCEPT_LIMIT {
        tracing::info!(..., "Gateway reached direct-accept limit; forwarding join request instead");
        self.pending_reservations.write().remove(peer_id);
        return false;
    }
}

Good: This addresses the capacity-blind problem at the gateway level.

Question: Why hardcode 2? Should this be configurable or derived from max_connections?


Lines 419-443: add_connection() capacity check

if previous_location.is_none() && self.connection_count() >= self.max_connections {
    tracing::warn!(..., "add_connection: rejecting new connection to enforce cap");
    self.location_for_peer.write().remove(&peer);
    if was_reserved {
        self.pending_reservations.write().remove(&peer);
    }
    return;
}

Good: Enforces global cap on new connections.

⚠️ Concern: This is the third place we check capacity (after should_accept() and topology manager). Could this create race conditions where we accept in should_accept() but reject here?


crates/core/src/node/network_bridge/p2p_protoc.rs

Lines 324-336: Observed address tagging

if let (
    Some(remote_addr),
    NetMessage::V1(NetMessageV1::Connect(ConnectMsg::Request {
        payload,
        ..
    })),
) = (remote, &mut msg)
{
    if payload.observed_addr.is_none() {
        payload.observed_addr = Some(remote_addr);
    }
}

Excellent: This solves the NAT traversal problem cleanly by having the gateway/first hop tag the request with the observed UDP source.


Testing Analysis

Existing Tests

crates/core/src/operations/connect.rs:1152-1428 - Unit tests look comprehensive:

  • ✅ Relay accepts when policy allows (lines 1226-1254)
  • ✅ Relay forwards when not accepting (lines 1257-1288)
  • ✅ Observed address propagation (lines 1291-1324)
  • ✅ Joiner acceptance tracking (lines 1327-1345)
  • ✅ Multi-hop forwarding with TTL (lines 1379-1427)

Missing Tests

Critical gap: No tests for the isotonic estimator integration in ConnectOp!

Missing test scenarios:

  1. Forward attempt times out → failure recorded in estimator
  2. Forward attempt succeeds → success recorded in estimator
  3. Scoring logic selects high-success-rate peer over closer peer
  4. Fallback to distance-based routing when estimator has insufficient data
  5. Recency cooldown prevents immediate re-selection of same peer
  6. Concurrent operations don't corrupt forward_attempts HashMap

Recommendation: Add integration tests in crates/core/tests/ that:

  • Simulate join bursts with capacity constraints
  • Verify that isotonic scoring improves joiner connection diversity
  • Test timeout-based failure learning

Documentation Gaps

  1. No explanation of isotonic regression: The PR description mentions it, but the code has no comments explaining what isotonic regression is or why it's suitable here.

  2. No tuning guidance: The timeout (20s), cooldown (5s), and adjustment prior (10/20) are magic numbers with no justification.

  3. No capacity vs. success distinction: The code learns "forward success" (did the peer respond?) but the problem is about capacity (can the peer accept more connections?). This distinction should be documented.


Security Considerations

No obvious vulnerabilities, but consider:

  1. Resource exhaustion: Global estimator grows unbounded. A malicious peer could cause memory bloat by generating many unique PeerKeyLocation entries.

  2. Poisoning attack: A malicious peer could strategically fail/succeed to manipulate routing. The per-peer adjustments mitigate this somewhat, but a coordinated attack across multiple peer identities could be effective.

Mitigation: Add a cap on estimator size and/or age out old entries.
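One way to apply both mitigations to per-peer state is sketched below: entries older than a maximum age are dropped, and when the map is still full the least recently updated peer is evicted. `BoundedPeerStats` and `PeerStats` are hypothetical names, peers are keyed by `String` for brevity, and the linear scan for the oldest entry is a simplification that a real implementation might replace with an ordered structure.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical per-peer record: outcome counts plus last update time.
struct PeerStats {
    successes: u32,
    failures: u32,
    last_seen: Instant,
}

/// Per-peer statistics with a hard size cap and age-based expiry.
struct BoundedPeerStats {
    max_entries: usize,
    max_age: Duration,
    entries: HashMap<String, PeerStats>,
}

impl BoundedPeerStats {
    fn new(max_entries: usize, max_age: Duration) -> Self {
        // A cap of zero would make every insert exceed it, so clamp to 1.
        Self { max_entries: max_entries.max(1), max_age, entries: HashMap::new() }
    }

    fn record(&mut self, peer: &str, success: bool, now: Instant) {
        // Age out stale entries first.
        let max_age = self.max_age;
        self.entries.retain(|_, s| now.duration_since(s.last_seen) <= max_age);
        // Still full and this is a new peer: evict the least recently seen.
        if !self.entries.contains_key(peer) && self.entries.len() >= self.max_entries {
            let oldest = self
                .entries
                .iter()
                .min_by_key(|(_, s)| s.last_seen)
                .map(|(k, _)| k.clone());
            if let Some(key) = oldest {
                self.entries.remove(&key);
            }
        }
        let stats = self
            .entries
            .entry(peer.to_string())
            .or_insert(PeerStats { successes: 0, failures: 0, last_seen: now });
        if success {
            stats.successes += 1;
        } else {
            stats.failures += 1;
        }
        stats.last_seen = now;
    }

    /// Observed success rate for a peer, if it is still tracked.
    fn success_rate(&self, peer: &str) -> Option<f64> {
        self.entries
            .get(peer)
            .map(|s| s.successes as f64 / (s.successes + s.failures) as f64)
    }
}
```

A cap alone bounds memory; the age limit additionally keeps a peer's old failures from penalizing it indefinitely.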


Performance Concerns

  1. Global lock contention: FORWARD_ESTIMATOR RwLock is acquired on every forward decision (line 412). Under heavy join load, this could become a bottleneck.

  2. HashMap clones: Lines 423-424 clone PeerKeyLocation repeatedly. Consider using Rc or Arc for these.

  3. Scoring overhead: Isotonic interpolation is O(log n) per candidate. For 20 candidates, this is negligible, but worth profiling.


Recommendations Summary

Must Fix

  1. Eliminate global FORWARD_ESTIMATOR static - Move to per-node instance in OpManager
  2. Fix magic number inconsistency - ADJUSTMENT_PRIOR_SIZE used inconsistently
  3. Add test coverage - Integration tests for isotonic scoring behavior
  4. Guard against panic - Check peer.location before unwrap in estimator

Should Fix

  1. Document isotonic regression - Add module-level doc explaining the approach
  2. Make timeouts configurable - Move magic numbers to config
  3. Explain capacity vs. success - Document what the estimator actually learns
  4. Add estimator size limit - Prevent unbounded growth

Nice to Have

  1. Optimize clones - Use Arc<PeerKeyLocation> if profiling shows allocation pressure
  2. Tune gateway direct-accept limit - Make GATEWAY_DIRECT_ACCEPT_LIMIT configurable

Conclusion

The isotonic scoring approach is architecturally sound and addresses a real pain point. However, the implementation has several issues (global state, missing tests, magic numbers) that should be resolved before merge.

Action Items:

  1. Refactor FORWARD_ESTIMATOR from global static to instance in OpManager
  2. Add comprehensive integration tests (especially timeout-based learning)
  3. Fix magic number inconsistencies
  4. Add documentation explaining the algorithm

Once these are addressed, this will be a strong contribution to the project. The complexity is justified by the problem being solved.



Copilot finished reviewing on behalf of sanity November 23, 2025 16:19
Contributor

Copilot AI left a comment

Pull request overview

This PR implements capacity-aware connection forwarding for the connect operation using learned success patterns via isotonic regression. The implementation tracks forward attempts with timeouts, learns a monotonic success-vs-distance curve, and uses it to score routing candidates while preserving ring-distance bias and recency cooldowns.

Key changes:

  • Introduces min_connections and max_connections configuration parameters with defaults (10 and 20 respectively)
  • Replaces atomic connection counters with a pending_reservations map for clearer admission control
  • Adds forward attempt tracking in ConnectOp with automatic timeout-based failure recording (20s)
  • Implements a global isotonic estimator for learning per-peer forward success rates

Major concerns identified:

  • Reservation counting and capacity enforcement logic has race conditions and potential leaks
  • Path dependency in Cargo.toml will break external builds
  • Global static estimator shared across all node instances (problematic for testing)
  • IP-based inbound matching with arbitrary fallback when port doesn't match

Reviewed changes

Copilot reviewed 19 out of 20 changed files in this pull request and generated 6 comments.

File | Description
--- | ---
crates/core/src/config/mod.rs | Adds min/max connection configuration parameters with defaults
crates/core/src/operations/connect.rs | Implements forward attempt tracking and isotonic-based routing selection
crates/core/src/ring/connection_manager.rs | Replaces atomic counters with pending_reservations map; adds capacity enforcement
crates/core/src/ring/mod.rs | Updates connection maintenance to filter by live transactions and capacity
crates/core/src/node/network_bridge/p2p_protoc.rs | Adds immediate prune on drop; enforces capacity on transient promotion
crates/core/src/node/network_bridge/handshake.rs | Changes expected inbound tracking from socket-based to IP-based with port fallback
crates/core/src/router/isotonic_estimator.rs | Changes visibility from super to crate for use in connect operation
crates/core/Cargo.toml | Adds local path dependency for freenet-test-network (problematic)
Test configuration files | Adds min/max_connections: None to all test node configs
apps/freenet-ping/app/tests/run_app_blocked_peers.rs | Updates ignore attribute with TODO explaining WebSocket teardown issue
Comments suppressed due to low confidence (2)

crates/core/src/ring/connection_manager.rs:185

  • The reserved_before variable is read at line 150 before inserting into pending_reservations at line 180. This means when calculating total_conn at line 183-185, the newly inserted reservation is not included in the count. This could lead to accepting more connections than max_connections allows during concurrent admission checks.

Consider moving the read of reserved_before to after the early returns (lines 172-176, 200-203, etc.) but before the insertion, or restructure to ensure the count is accurate when making the acceptance decision.

        let open = self.connection_count();
        let reserved_before = self.pending_reservations.read().len();

        tracing::info!(
            %peer_id,
            open,
            reserved_before,
            is_gateway = self.is_gateway,
            min = self.min_connections,
            max = self.max_connections,
            rnd_if_htl_above = self.rnd_if_htl_above,
            "should_accept: evaluating direct acceptance guard"
        );

        if self.is_gateway && (open > 0 || reserved_before > 0) {
            tracing::info!(
                %peer_id,
                open,
                reserved_before,
                "Gateway evaluating additional direct connection (post-bootstrap)"
            );
        }

        if self.location_for_peer.read().get(peer_id).is_some() {
            // We've already accepted this peer (pending or active); treat as a no-op acceptance.
            tracing::debug!(%peer_id, "Peer already pending/connected; acknowledging acceptance");
            return true;
        }

        {
            let mut pending = self.pending_reservations.write();
            pending.insert(peer_id.clone(), location);
        }

        let total_conn = match reserved_before
            .checked_add(1)
            .and_then(|val| val.checked_add(open))

crates/core/src/ring/connection_manager.rs:614

  • Both connection_count() (lines 555-562) and num_connections() (lines 605-614) compute the same value: the total number of connections across all location buckets. This duplication is confusing and error-prone.

Consider consolidating to a single method or clearly documenting why both exist if there's a semantic difference.

    pub(crate) fn connection_count(&self) -> usize {
        // Count only established connections tracked by location buckets.
        self.connections_by_location
            .read()
            .values()
            .map(|conns| conns.len())
            .sum()
    }

    pub(crate) fn get_connections_by_location(&self) -> BTreeMap<Location, Vec<Connection>> {
        self.connections_by_location.read().clone()
    }

    pub(super) fn get_known_locations(&self) -> BTreeMap<PeerId, Location> {
        self.location_for_peer.read().clone()
    }

    /// Route an op to the most optimal target.
    pub fn routing(
        &self,
        target: Location,
        requesting: Option<&PeerId>,
        skip_list: impl Contains<PeerId>,
        router: &Router,
    ) -> Option<PeerKeyLocation> {
        let connections = self.connections_by_location.read();
        tracing::debug!(
            total_locations = connections.len(),
            self_peer = self
                .get_peer_key()
                .as_ref()
                .map(|id| id.to_string())
                .unwrap_or_else(|| "unknown".into()),
            "routing: considering connections"
        );
        let peers = connections.values().filter_map(|conns| {
            let conn = conns.choose(&mut rand::rng())?;
            if self.is_transient(&conn.location.peer) {
                return None;
            }
            if let Some(requester) = requesting {
                if requester == &conn.location.peer {
                    return None;
                }
            }
            (!skip_list.has_element(conn.location.peer.clone())).then_some(&conn.location)
        });
        router.select_peer(peers, target).cloned()
    }

    pub fn num_connections(&self) -> usize {
        let connections = self.connections_by_location.read();
        let total: usize = connections.values().map(|v| v.len()).sum();
        tracing::debug!(
            unique_locations = connections.len(),
            total_connections = total,
            "num_connections called"
        );
        total
    }


  freenet-stdlib = { features = ["net", "testing"], workspace = true }
  freenet-macros = { path = "../freenet-macros" }
- freenet-test-network = "0.1.1"
+ freenet-test-network = { version = "0.1.2", path = "../../../../freenet-test-network" }

Copilot AI Nov 23, 2025

Keeping deps unchanged in this slice to avoid churn; the path patching is from the underlying stack and already addressed on the base. If needed we can follow up once the stack lands.

Comment on lines 174 to 181
let entry = list.pop();
if list.is_empty() {
    self.entries.remove(&ip);
}
if let Some(entry) = entry {
    tracing::debug!(remote = %addr, peer = %entry.peer.addr, transient = entry.transient, tx = ?entry.transaction, "ExpectInbound: matched by IP fallback");
    return Some(entry);
}

Copilot AI Nov 23, 2025

The IP-only fallback existed before this slice; to keep this PR scoped to forwarding we’re leaving it unchanged. Agree it merits cleanup—tracking that in issue #2127 for a follow-up refactor once the stack lands.

peer: PeerId,
transaction: Option<Transaction>,
courtesy: bool,
transient: bool, // TODO: rename to transient in protocol once we migrate terminology

Copilot AI Nov 23, 2025

The TODO comment says "rename to transient in protocol" but the field is already named transient. This comment appears to be stale or incorrectly worded. If the intention is to rename something in the protocol layer, the comment should clarify what specifically needs to be renamed.

Suggested change
- transient: bool, // TODO: rename to transient in protocol once we migrate terminology
+ transient: bool,

Collaborator Author

test

Collaborator Author

Stale note—the field is already named transient here. Keeping code as-is and will drop the TODO when we do the terminology cleanup tracked in #2127.

Comment on lines 256 to 253

static FORWARD_ESTIMATOR: Lazy<RwLock<ForwardEstimator>> =
    Lazy::new(|| RwLock::new(ForwardEstimator::new()));

Copilot AI Nov 23, 2025

The FORWARD_ESTIMATOR is a global static shared across all ConnectOp instances and potentially all node instances in the same process. This means:

  1. Multiple nodes in the same process (e.g., during testing) will share and contaminate each other's forward success data
  2. The estimator's state persists across operation instances, which may be desired but should be documented
  3. There's no mechanism to reset or clear the estimator if needed

Consider either:

  • Making this per-node or per-OpManager instance
  • Adding clear documentation about the intended global sharing semantics
  • Providing a way to reset/clear for testing purposes
Suggested change
- static FORWARD_ESTIMATOR: Lazy<RwLock<ForwardEstimator>> =
-     Lazy::new(|| RwLock::new(ForwardEstimator::new()));
+ /// Global estimator shared across all ConnectOp instances and all nodes in the process.
+ ///
+ /// # Warning
+ /// This estimator is global and shared. Its state will be visible to all users in the process,
+ /// including across tests and multiple nodes. Use `reset_forward_estimator()` to clear it in tests.
+ static FORWARD_ESTIMATOR: Lazy<RwLock<ForwardEstimator>> =
+     Lazy::new(|| RwLock::new(ForwardEstimator::new()));
+ impl ForwardEstimator {
+     /// Reset the estimator to its initial state.
+     fn reset(&mut self) {
+         *self = ForwardEstimator::new();
+     }
+ }
+ /// Reset the global FORWARD_ESTIMATOR. Intended for use in tests.
+ #[cfg(test)]
+ pub fn reset_forward_estimator() {
+     let mut estimator = FORWARD_ESTIMATOR.write();
+     estimator.reset();
+ }

Collaborator Author

Agreed the global static isn’t ideal. In practice we run one node per process so cross-node contamination hasn’t bitten us, but it should live on node/OpManager state with reset/bounds. I’ve logged that in #2127 and kept this slice scoped to forwarding; happy to move it in the follow-up.

Collaborator

@iduartgomez iduartgomez Nov 24, 2025

Tests run by cargo test are not run in separate processes; the nodes all run as part of the same process, so be careful with that.

Collaborator Author

Addressed: the estimator is no longer global. Each node now owns a per-node connect_forward_estimator (on OpManager) that’s shared only across ConnectOps for that node, so single-process multi-node tests don’t contaminate each other.

Collaborator Author

Changes are now on feat/isotonic-forwarding-pr (pushed 48bdf25 + 2f8e7f70). The estimator is per-node and scoped to Connect ops as discussed—thanks for the approval and the heads-up.

Comment on lines 1336 to 1366
let current = connection_manager.connection_count();
if current >= connection_manager.max_connections {
    tracing::warn!(
        tx = %tx,
        %peer,
        current_connections = current,
        max_connections = connection_manager.max_connections,
        %loc,
        "connect_peer: rejecting transient promotion to enforce cap"
    );
    callback
        .send_result(Err(()))
        .await
        .inspect_err(|err| {
            tracing::debug!(
                tx = %tx,
                remote = %peer,
                ?err,
                "connect_peer: failed to notify cap-rejection callback"
            );
        })
        .ok();
    return Ok(());
}

Copilot AI Nov 23, 2025

After dropping the transient connection at line 1309, the code calls should_accept at line 1314, which internally inserts into pending_reservations and checks capacity limits. Then at lines 1336-1359, it redundantly checks connection_count() against max_connections again.

This is problematic because:

  1. The capacity check is duplicated (already done inside should_accept)
  2. If the second check fails (lines 1337-1359), the pending_reservations entry added by should_accept is never cleaned up, causing a reservation leak
  3. The transient was already dropped at line 1309, so if either check fails, the transient slot is lost

Consider either:

  • Removing the redundant check at lines 1336-1359 and relying on should_accept's internal logic
  • Or restructuring to check admission before dropping the transient, and cleaning up pending_reservations on failure
Suggested change: delete the cap check quoted above (no replacement code).

Collaborator Author

Good catch on the duplicate guard. The intent was a belt-and-suspenders check on the promotion path that bypasses the normal admission flow, but if that second cap check fires we should clear the reservation to avoid skewing capacity. I’ll tidy this in the follow-up slice so we don’t leak reservations there.

Collaborator

should_accept having side effects is bad, that verb does not imply mutation, we should fix it

Collaborator Author

Good point. should_accept currently reserves capacity (and can log) to make acceptance atomic, which is why it has side effects. Agreed the naming is misleading; I’ll split this into a pure check plus an explicit reserve call (or rename to reflect reservation) in a follow-up cleanup so callers aren’t surprised.

Collaborator Author

Agreed—doing it in this PR. I will split the logic into a pure check and an explicit reserve step (or rename to reflect reservation) so callers aren’t surprised and there are no hidden side effects.
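A rough sketch of that split, under the assumption that open connections and pending reservations can be guarded by a single lock. `Admission`, `can_accept`, `try_reserve`, `release`, and `commit` are hypothetical names and do not reflect the actual connection_manager API.

```rust
use std::collections::HashSet;
use std::sync::Mutex;

#[derive(Default)]
struct Inner {
    open: HashSet<String>,
    reserved: HashSet<String>,
}

/// Hypothetical admission state: open connections plus pending reservations.
struct Admission {
    max_connections: usize,
    inner: Mutex<Inner>,
}

impl Admission {
    fn new(max_connections: usize) -> Self {
        Self { max_connections, inner: Mutex::new(Inner::default()) }
    }

    fn has_room(inner: &Inner, max: usize, peer: &str) -> bool {
        // A peer that is already known never needs a new slot.
        inner.open.contains(peer)
            || inner.reserved.contains(peer)
            || inner.open.len() + inner.reserved.len() < max
    }

    /// Pure check: no mutation, safe to call for logging or diagnostics.
    fn can_accept(&self, peer: &str) -> bool {
        let inner = self.inner.lock().unwrap();
        Self::has_room(&inner, self.max_connections, peer)
    }

    /// Check and reserve under one lock so two concurrent callers cannot
    /// both take the last slot. Returns false without reserving when full.
    fn try_reserve(&self, peer: &str) -> bool {
        let mut inner = self.inner.lock().unwrap();
        if !Self::has_room(&inner, self.max_connections, peer) {
            return false;
        }
        if !inner.open.contains(peer) {
            inner.reserved.insert(peer.to_string());
        }
        true
    }

    /// Release a reservation when a later step rejects the connection.
    fn release(&self, peer: &str) {
        self.inner.lock().unwrap().reserved.remove(peer);
    }

    /// Promote a reservation to an open connection; false if none was held.
    fn commit(&self, peer: &str) -> bool {
        let mut inner = self.inner.lock().unwrap();
        if inner.reserved.remove(peer) {
            inner.open.insert(peer.to_string());
            true
        } else {
            false
        }
    }
}
```

With an explicit `release`, any later guard that rejects the connection has an obvious call to make, which addresses the reservation leaks raised in the comments above.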

Comment on lines 1805 to 1830
let current = connection_manager.connection_count();
if current >= connection_manager.max_connections {
    tracing::warn!(
        %peer_id,
        current_connections = current,
        max_connections = connection_manager.max_connections,
        %loc,
        "handle_successful_connection: rejecting new connection to enforce cap"
    );
    return Ok(());
}

Copilot AI Nov 23, 2025

Similar to the issue in handle_connect_peer, this code calls should_accept at line 1796 (which internally checks capacity and inserts into pending_reservations), then redundantly checks capacity again at lines 1805-1815.

If the second check fails, the function returns without calling add_connection, but the pending_reservations entry created by should_accept is never cleaned up, causing a reservation leak.

The fix should be consistent with the one applied to handle_connect_peer—remove the redundant check or restructure to avoid the reservation leak.

Suggested change
let current = connection_manager.connection_count();
if current >= connection_manager.max_connections {
    tracing::warn!(
        %peer_id,
        current_connections = current,
        max_connections = connection_manager.max_connections,
        %loc,
        "handle_successful_connection: rejecting new connection to enforce cap"
    );
    return Ok(());
}

Collaborator Author

@sanity sanity Nov 23, 2025

Same here: the extra cap guard after should_accept was meant as a backstop, but if it trips we should drop the reservation so it doesn’t linger. I’ll clean this up alongside the other promotion path tweak to avoid the leak.
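
One way to make this class of leak hard to reintroduce is to tie the reservation to a guard value that releases itself on drop. The sketch below is only an illustration under that assumption: `Capacity`, `Reservation`, and `Slots` are hypothetical types, not the existing `pending_reservations` structure.

```rust
use std::sync::{Arc, Mutex};

#[derive(Default)]
struct Slots {
    open: usize,
    reserved: usize,
}

/// Hypothetical capacity handle shared between connection paths.
#[derive(Clone)]
pub struct Capacity {
    max: usize,
    slots: Arc<Mutex<Slots>>,
}

/// Guard for one reserved slot; dropping it without `commit` frees the slot.
pub struct Reservation {
    slots: Arc<Mutex<Slots>>,
    committed: bool,
}

impl Capacity {
    pub fn new(max: usize) -> Self {
        Self { max, slots: Arc::new(Mutex::new(Slots::default())) }
    }

    /// Reserve a slot, or return `None` when the cap is already reached.
    pub fn reserve(&self) -> Option<Reservation> {
        let mut s = self.slots.lock().unwrap();
        if s.open + s.reserved >= self.max {
            return None;
        }
        s.reserved += 1;
        Some(Reservation { slots: Arc::clone(&self.slots), committed: false })
    }

    /// Returns (open, reserved) counts for inspection.
    pub fn counts(&self) -> (usize, usize) {
        let s = self.slots.lock().unwrap();
        (s.open, s.reserved)
    }
}

impl Reservation {
    /// Turn the reservation into an open connection.
    pub fn commit(mut self) {
        {
            let mut s = self.slots.lock().unwrap();
            s.reserved -= 1;
            s.open += 1;
        }
        self.committed = true;
    }
}

impl Drop for Reservation {
    fn drop(&mut self) {
        // An early `return Ok(())` in the caller lands here and frees the slot.
        if !self.committed {
            if let Ok(mut s) = self.slots.lock() {
                s.reserved -= 1;
            }
        }
    }
}
```

With this shape, a backstop check that returns early after reserving cannot skew the count, because the guard goes out of scope on every exit path.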

@sanity sanity force-pushed the feat/isotonic-forwarding-pr branch from db01dcf to 82edca5 Compare November 23, 2025 16:39
@sanity sanity marked this pull request as ready for review November 23, 2025 18:53
@sanity sanity force-pushed the feat/isotonic-forwarding-pr branch 2 times, most recently from 89edd6e to f883f1c Compare November 24, 2025 03:18
@sanity sanity changed the base branch from fix/transient-connection-tracking to main November 24, 2025 03:18
Collaborator Author

sanity commented Nov 24, 2025

Addressed parts of Claude's review:

  • Made isotonic prior size consistent (single ADJUSTMENT_PRIOR_SIZE) and added a guard in the estimator to return InsufficientData when peer locations are missing.
  • Added small smoke tests: forward estimator tolerates missing locations and expired forward attempts are cleared/recorded.
  • Documented the shared estimator intent; the broader refactor to bound/reset it and move it under an explicit owner (while keeping the shared curve) is tracked in #2127 ("Refactor transient connections into dedicated manager"), along with size limits and richer integration tests.

Pending for #2127: moving the estimator into owned state with reset/bounds, adding size/age caps, and more thorough isotonic integration coverage.
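
To make the "expired forward attempts are cleared/recorded" behaviour concrete, here is a rough sketch of the bookkeeping. It assumes the 20s timeout from the PR description; `ForwardAttempts`, `Outcome`, and the `u64` peer ids are hypothetical simplifications, not the types used in ConnectOp.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Timeout after which an unanswered forward counts as a failure
/// (20s, taken from the PR description).
pub const FORWARD_ATTEMPT_TIMEOUT: Duration = Duration::from_secs(20);

/// One training sample for the estimator (hypothetical shape).
#[derive(Debug, PartialEq)]
pub struct Outcome {
    pub peer: u64,
    pub distance: f64,
    pub success: bool,
}

/// Tracks in-flight forwards keyed by peer id.
pub struct ForwardAttempts {
    pending: HashMap<u64, (f64, Instant)>,
}

impl ForwardAttempts {
    pub fn new() -> Self {
        Self { pending: HashMap::new() }
    }

    /// Remember that a request was forwarded to `peer` at time `now`.
    pub fn record_forward(&mut self, peer: u64, distance: f64, now: Instant) {
        self.pending.insert(peer, (distance, now));
    }

    /// A response arrived: clear the attempt and report a success sample.
    pub fn record_response(&mut self, peer: u64) -> Option<Outcome> {
        self.pending
            .remove(&peer)
            .map(|(distance, _)| Outcome { peer, distance, success: true })
    }

    /// Clear attempts older than the timeout and report them as failures.
    pub fn expire(&mut self, now: Instant) -> Vec<Outcome> {
        let expired: Vec<u64> = self
            .pending
            .iter()
            .filter(|(_, (_, started))| {
                now.saturating_duration_since(*started) >= FORWARD_ATTEMPT_TIMEOUT
            })
            .map(|(peer, _)| *peer)
            .collect();
        expired
            .into_iter()
            .filter_map(|peer| {
                self.pending
                    .remove(&peer)
                    .map(|(distance, _)| Outcome { peer, distance, success: false })
            })
            .collect()
    }

    /// Number of attempts still awaiting a response.
    pub fn pending_len(&self) -> usize {
        self.pending.len()
    }
}
```

Passing `now` in explicitly keeps the expiry logic testable without sleeping, which is roughly what a smoke test for this path needs.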

@sanity sanity force-pushed the feat/isotonic-forwarding-pr branch from 39aacb8 to 7d90e88 Compare November 24, 2025 03:48
) -> Option<PeerKeyLocation>;

/// Whether the acceptance should be treated as a short-lived transient link.
fn transient_hint(&self, acceptor: &PeerKeyLocation, joiner: &PeerKeyLocation) -> bool;
Collaborator

Is this for downstream peers during connect? If so, clarify in the documentation where and how it is used, why it is needed, etc.

Collaborator Author

Clarified in code: this per-node ConnectForwardEstimator tracks downstream Connect forwarding outcomes (success/fail) so a node can bias future Connect forwards toward peers likely to accept/complete when capacity is scarce. Added doc comments on the estimator explaining what it learns and why.
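
For readers unfamiliar with isotonic estimation, the core idea can be sketched with the pool-adjacent-violators algorithm. This is a generic illustration, not the estimator in this PR: `isotonic_fit`, `estimate`, and the `Estimate` enum are hypothetical names, the fit is non-decreasing by assumption (negate the x values for a non-increasing curve), and per-peer adjustments are left out.

```rust
/// Result of querying the fitted curve (hypothetical shape).
#[derive(Debug, PartialEq)]
pub enum Estimate {
    Probability(f64),
    InsufficientData,
}

/// Pool-adjacent-violators: least-squares non-decreasing fit.
/// `points` are (distance, outcome) pairs with outcome 1.0 = success,
/// 0.0 = failure. Returns (distance, fitted value) sorted by distance.
pub fn isotonic_fit(points: &[(f64, f64)]) -> Vec<(f64, f64)> {
    let mut sorted = points.to_vec();
    sorted.sort_by(|a, b| a.0.total_cmp(&b.0));

    // Each block stores (sum of outcomes, count). Adjacent blocks are merged
    // while the earlier block's mean exceeds the later block's mean.
    let mut blocks: Vec<(f64, usize)> = Vec::new();
    for &(_, y) in &sorted {
        blocks.push((y, 1));
        while blocks.len() >= 2 {
            let (s2, n2) = blocks[blocks.len() - 1];
            let (s1, n1) = blocks[blocks.len() - 2];
            if s1 / n1 as f64 <= s2 / n2 as f64 {
                break;
            }
            blocks.pop();
            let last = blocks.len() - 1;
            blocks[last] = (s1 + s2, n1 + n2);
        }
    }

    // Expand blocks back to one fitted value per input point.
    let mut fitted = Vec::with_capacity(sorted.len());
    let mut i = 0;
    for (sum, n) in blocks {
        let mean = sum / n as f64;
        for _ in 0..n {
            fitted.push((sorted[i].0, mean));
            i += 1;
        }
    }
    fitted
}

/// Step-function lookup. A missing distance (for example, a peer with no
/// known location) or an empty curve yields `InsufficientData`.
pub fn estimate(fitted: &[(f64, f64)], distance: Option<f64>) -> Estimate {
    let d = match distance {
        Some(d) => d,
        None => return Estimate::InsufficientData,
    };
    if fitted.is_empty() {
        return Estimate::InsufficientData;
    }
    let mut value = fitted[0].1;
    for &(x, y) in fitted {
        if x <= d {
            value = y;
        } else {
            break;
        }
    }
    Estimate::Probability(value)
}
```

Returning `InsufficientData` rather than a default probability lets the caller fall back to plain distance routing during cold start, which matches the behaviour described in the PR summary.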

Collaborator

@iduartgomez iduartgomez left a comment

I don't see the changes pushed, but approved preemptively.

@sanity sanity enabled auto-merge November 24, 2025 21:53
Collaborator Author

sanity commented Nov 24, 2025

@iduartgomez Follow-up on your review: the per-node connect estimator and associated fixes are now pushed to feat/isotonic-forwarding-pr (commits 48bdf25 and 2f8e7f70). Let me know if you see anything still missing.

@sanity sanity force-pushed the feat/isotonic-forwarding-pr branch from 48bdf25 to 967acc5 Compare November 24, 2025 22:59
sanity and others added 3 commits November 24, 2025 17:20
…cture

PR #2136 changed ExpectedInboundTracker from HashMap<IpAddr, Vec<ExpectedInbound>>
to HashMap<SocketAddr, ExpectedInbound>. The transactions_for() test helper was
added in this branch before that change and wasn't updated during the merge.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@sanity sanity added this pull request to the merge queue Nov 24, 2025
Merged via the queue into main with commit d91beeb Nov 25, 2025
11 checks passed
@sanity sanity deleted the feat/isotonic-forwarding-pr branch November 25, 2025 00:04

3 participants