feat: learn connect forwarding success with isotonic scoring #2126

sanity · 2025-11-23T16:02:07Z

Problem

Connect forwarding is capacity-blind during join bursts. Gateways/early peers keep routing ConnectRequests to already-full neighbors, leaving joiners stuck with gateway-only links and slow small-world formation.

Solution

Track per-op forward attempts (success on ConnectResponse, failure on timeout) and learn a monotonic success-vs-distance curve via an isotonic estimator with per-peer adjustments.
Score routing candidates with that estimate while preserving ring-distance bias and a small recency cooldown to avoid hammering one neighbor.
Failures are recorded automatically when attempts age out (20s), so the model learns even without responses. All logic stays inside ConnectOp; cold start falls back to existing distance routing.
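
For readers unfamiliar with isotonic fitting, here is a minimal sketch of the pool-adjacent-violators step that such an estimator is usually built on. It is not the code in isotonic_estimator.rs: the function name, the plain f64 outcomes (1.0 for success, 0.0 for failure), and the non-decreasing direction are all assumptions made for illustration. Outcomes are expected to be pre-sorted by the x variable (for example ring distance).

```rust
/// Pool-adjacent-violators: returns the non-decreasing sequence closest
/// (in least squares) to `ys`. Hypothetical helper, not the PR's estimator.
/// `ys` must already be ordered by the x variable (e.g. ring distance).
pub fn isotonic_non_decreasing(ys: &[f64]) -> Vec<f64> {
    // Each block holds (sum of pooled values, number of pooled points).
    let mut blocks: Vec<(f64, usize)> = Vec::with_capacity(ys.len());
    for &y in ys {
        blocks.push((y, 1));
        // Merge backwards while the previous block's mean exceeds the last one's.
        while blocks.len() >= 2 {
            let (s2, n2) = blocks[blocks.len() - 1];
            let (s1, n1) = blocks[blocks.len() - 2];
            if s1 / n1 as f64 > s2 / n2 as f64 {
                blocks.pop();
                let last = blocks.last_mut().unwrap();
                *last = (s1 + s2, n1 + n2);
            } else {
                break;
            }
        }
    }
    // Expand every block back into one fitted value per input point.
    blocks
        .iter()
        .flat_map(|&(sum, n)| std::iter::repeat(sum / n as f64).take(n))
        .collect()
}
```

If the curve is expected to fall with distance instead, the same routine can be applied to outcomes sorted by descending distance.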

Testing

cargo test -p freenet --tests --no-run
40-peer soak: SOAK_PEER_COUNT=40 SOAK_SNAPSHOT_ITERATIONS=2 SOAK_SNAPSHOT_INTERVAL_SECS=60 cargo test -p freenet --test large_network -- --ignored --nocapture (all peers healthy; snapshots preserved under the test network run root).

Notes

Draft PR stacked on the transient stack (base fix/transient-connection-tracking); intended to augment/replace PR 2125 with learned, capacity-aware forwarding.

sanity · 2025-11-23T16:16:14Z

@claude Please review, both high-level and in-detail

claude · 2025-11-23T16:16:36Z

Claude finished @sanity's task

Code Review Complete

Reviewed PR #2126: feat: learn connect forwarding success with isotonic scoring

High-Level Assessment

This PR introduces a significant enhancement to the connect forwarding mechanism using machine learning (isotonic regression) to make capacity-aware routing decisions. The approach is sound and addresses a real problem (capacity-blind forwarding during join bursts), but there are several concerns around complexity, potential bugs, and testing.

Overall Verdict: Needs revisions before merge. The core algorithm is solid, but implementation details need attention.

Architectural Review

✅ Strengths

Well-scoped solution: The isotonic learning stays entirely within ConnectOp, keeping concerns localized
Graceful degradation: Falls back to distance-based routing when the model lacks data
Automatic failure detection: 20s timeout mechanism learns from unresponsive peers without explicit negative feedback
Clever recency cooldown: Prevents hammering the same neighbor repeatedly

⚠️ Concerns

Global static state: FORWARD_ESTIMATOR (lines 257-258 in connect.rs) is a global Lazy<RwLock<ForwardEstimator>>. This creates several issues:
- Thread contention: All connect operations across all peers compete for the same lock
- No isolation: All transactions share the same estimator, so a bad peer affects everyone
- Testing nightmares: Global state makes tests non-deterministic and prevents parallel test execution
- Memory leak potential: Never freed, grows unbounded
Seed events use dummy peer: Lines 213-232 create a dummy peer with hardcoded 127.0.0.1:0 just to bootstrap the estimator. This is a code smell suggesting the estimator API needs rethinking.
Missing capacity signal: The estimator learns success/failure of forwarding attempts, but doesn't directly observe peer capacity. A peer might accept the forward but still be nearly full, leading to suboptimal small-world formation.

Detailed Code Review

`crates/core/src/operations/connect.rs`

Critical Issues

Lines 257-258: Global mutable state

static FORWARD_ESTIMATOR: Lazy<RwLock<ForwardEstimator>> =
    Lazy::new(|| RwLock::new(ForwardEstimator::new()));

❌ Problem: This creates a global shared estimator that all operations fight over.

Recommendation: Move estimator ownership to OpManager or Ring and pass it by reference. Each node should have its own estimator, not share one globally.
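
A rough sketch of what per-node ownership could look like. Everything here is a stand-in: NodeState plays the role of OpManager or Ring, ForwardEstimatorSketch is a toy counter rather than the real ForwardEstimator, peers are plain strings, and std's RwLock is used where the codebase appears to use a lock whose write() does not return a Result.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Toy stand-in for the real estimator: per-peer (successes, failures).
#[derive(Default)]
pub struct ForwardEstimatorSketch {
    per_peer: HashMap<String, (u64, u64)>,
}

impl ForwardEstimatorSketch {
    pub fn record(&mut self, peer: &str, success: bool) {
        let entry = self.per_peer.entry(peer.to_string()).or_insert((0, 0));
        if success {
            entry.0 += 1;
        } else {
            entry.1 += 1;
        }
    }

    /// None until at least one outcome has been recorded for `peer`.
    pub fn success_rate(&self, peer: &str) -> Option<f64> {
        let (ok, fail) = *self.per_peer.get(peer)?;
        Some(ok as f64 / (ok + fail) as f64)
    }
}

/// Hypothetical per-node owner (OpManager or Ring in the real code).
pub struct NodeState {
    forward_estimator: Arc<RwLock<ForwardEstimatorSketch>>,
}

impl NodeState {
    pub fn new() -> Self {
        Self {
            forward_estimator: Arc::new(RwLock::new(ForwardEstimatorSketch::default())),
        }
    }

    /// Handle passed to each connect operation belonging to this node.
    pub fn estimator(&self) -> Arc<RwLock<ForwardEstimatorSketch>> {
        Arc::clone(&self.forward_estimator)
    }
}
```

With this shape, two nodes created in the same test process learn independently, and a fresh NodeState is an implicit reset.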

Lines 523-535: Timeout cleanup logic

fn expire_forward_attempts(&mut self, now: Instant) {
    let mut expired = Vec::new();
    for (peer, attempt) in self.forward_attempts.iter() {
        if now.duration_since(attempt.sent_at) >= FORWARD_ATTEMPT_TIMEOUT {
            expired.push((peer.clone(), attempt.desired));
        }
    }
    for (peer, desired) in expired {
        if let Some(attempt) = self.forward_attempts.remove(&peer) {
            self.record_forward_outcome(&attempt.peer, desired, false);
        }
    }
}

⚠️ Potential inefficiency: Double HashMap lookup (lines 526 and 531). Consider using extract_if (formerly drain_filter) where the toolchain supports it, or collecting keys first.

Recommendation:

fn expire_forward_attempts(&mut self, now: Instant) {
    let expired: Vec<PeerId> = self.forward_attempts
        .iter()
        .filter(|(_, attempt)| now.duration_since(attempt.sent_at) >= FORWARD_ATTEMPT_TIMEOUT)
        .map(|(peer, _)| peer.clone())
        .collect();
    
    for peer in expired {
        if let Some(attempt) = self.forward_attempts.remove(&peer) {
            self.record_forward_outcome(&attempt.peer, attempt.desired, false);
        }
    }
}
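
A single-pass alternative, sketched with HashMap::retain so each entry is visited once and no second lookup is needed. The types are simplified assumptions: peers are strings, desired is a bare f64, and take_expired is a hypothetical free function whose result the caller would feed into record_forward_outcome.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Simplified stand-in for the PR's forward-attempt record.
pub struct ForwardAttempt {
    pub desired: f64,
    pub sent_at: Instant,
}

/// Removes every attempt older than `timeout` and returns (peer, desired)
/// for each one, so the caller can record them as failures afterwards.
pub fn take_expired(
    attempts: &mut HashMap<String, ForwardAttempt>,
    now: Instant,
    timeout: Duration,
) -> Vec<(String, f64)> {
    let mut expired = Vec::new();
    attempts.retain(|peer, attempt| {
        let is_expired = now.duration_since(attempt.sent_at) >= timeout;
        if is_expired {
            expired.push((peer.clone(), attempt.desired));
        }
        // retain keeps entries for which the closure returns true.
        !is_expired
    });
    expired
}
```

Returning the expired entries instead of recording inside the closure avoids borrowing self mutably twice.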

Lines 412-454: Scoring and fallback logic

let mut scored: Vec<(f64, PeerKeyLocation)> = Vec::new();
let mut fallback: Vec<PeerKeyLocation> = Vec::new();

for cand in candidates {
    if let Some(ts) = recency.get(&cand.peer) {
        if now.duration_since(*ts) < RECENCY_COOLDOWN {
            continue;
        }
    }
    if cand.location.is_some() {
        if let Some(score) = estimator.estimate(&cand, desired_location) {
            scored.push((score, cand.clone()));
            continue;
        }
    }
    fallback.push(cand.clone());
}

✅ Good: Clear separation between scored and fallback candidates.

⚠️ Concern: Line 441 - multiple candidates can tie for best_score. The code collects all ties and picks randomly via select_peer. This is correct but worth documenting.

Suggestion: Add a comment explaining tie-breaking strategy:

// If multiple candidates have the same best score, collect all of them and let
// the router's distance-based logic break ties (lines 437-444).
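
To make the tie handling concrete, a small sketch of collecting every candidate that shares the best score. best_by_score is a hypothetical helper and the generic T stands in for PeerKeyLocation; how the real code then chooses among the ties is not shown here.

```rust
/// Returns all candidates whose score equals the maximum score.
/// Hypothetical helper; scores are assumed to be finite probabilities.
pub fn best_by_score<T: Clone>(scored: &[(f64, T)]) -> Vec<T> {
    // Highest score seen, or negative infinity for an empty slice.
    let best = scored
        .iter()
        .map(|(score, _)| *score)
        .fold(f64::NEG_INFINITY, f64::max);
    scored
        .iter()
        .filter(|(score, _)| (best - *score).abs() <= f64::EPSILON)
        .map(|(_, candidate)| candidate.clone())
        .collect()
}
```

The returned set would then be handed to whatever tie-breaker the router applies.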

Lines 884-888: Success recording on response

if let Some(fwd) = forwarded {
    self.record_forward_outcome(&fwd, desired, true);
}

✅ Correct: Records success when a ConnectResponse arrives.

❓ Question: What if multiple responses arrive (rare but possible if TTL allows multiple acceptors)? Current code only records the first. Is this intentional?

Minor Issues

Line 30: Hardcoded timeout

const FORWARD_ATTEMPT_TIMEOUT: Duration = Duration::from_secs(20);

📝 Suggestion: Make this configurable or at least document why 20s was chosen.

Line 31: Recency cooldown

const RECENCY_COOLDOWN: Duration = Duration::from_secs(5);

📝 Suggestion: Same as above. Consider making these part of ConnectionTuning config.
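
One possible shape for that, as a sketch only. ForwardTuning and its field names are invented here; the defaults simply mirror the two constants above, and how this would attach to ConnectionTuning is left open.

```rust
use std::time::Duration;

/// Hypothetical tuning block for connect forwarding.
#[derive(Debug, Clone, PartialEq)]
pub struct ForwardTuning {
    /// How long a forward attempt may stay unanswered before it counts as a failure.
    pub attempt_timeout: Duration,
    /// Minimum gap before the same neighbor is chosen again.
    pub recency_cooldown: Duration,
}

impl Default for ForwardTuning {
    fn default() -> Self {
        Self {
            attempt_timeout: Duration::from_secs(20),
            recency_cooldown: Duration::from_secs(5),
        }
    }
}

impl ForwardTuning {
    /// True while a neighbor that was used `since_last_forward` ago should be skipped.
    pub fn in_cooldown(&self, since_last_forward: Duration) -> bool {
        since_last_forward < self.recency_cooldown
    }

    /// True once an attempt of the given age should be recorded as a failure.
    pub fn is_expired(&self, age: Duration) -> bool {
        age >= self.attempt_timeout
    }
}
```

Tests could then shorten the timeout with struct update syntax, e.g. `ForwardTuning { attempt_timeout: Duration::from_millis(50), ..Default::default() }`.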

`crates/core/src/router/isotonic_estimator.rs`

Issues

Lines 23-24: Magic number

const ADJUSTMENT_PRIOR_SIZE: u64 = 10;

And again at line 51:

let adjustment_prior_size = 20;

❌ Inconsistency: The constant is defined as 10 but overridden to 20 in the function. This is confusing.

Recommendation: Use the constant consistently or document why the override is necessary.

Line 129: Unwrap on location

let distance: f64 = contract_location.distance(peer.location.unwrap()).as_f64();

⚠️ Potential panic: If peer.location is None, this will panic.

Check lines 237-238 in connect.rs:

if peer.location.is_none() {
    return;
}

The caller checks before calling estimator.record(), but estimate_retrieval_time() should also guard against this.

Recommendation:

let peer_location = peer.location.ok_or(EstimationError::InsufficientData)?;
let distance: f64 = contract_location.distance(peer_location).as_f64();
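
The same guard as a self-contained sketch. Locations are modeled as plain f64 values in [0, 1) with wrap-around distance, which is an assumption about the ring rather than the real Location type; EstimationError is redeclared locally with only the one variant used above, and distance_to_peer is a hypothetical name.

```rust
/// Local stand-in for the estimator's error type.
#[derive(Debug, PartialEq)]
pub enum EstimationError {
    InsufficientData,
}

/// Wrap-around distance between two points on a unit ring (assumed model).
pub fn ring_distance(a: f64, b: f64) -> f64 {
    let d = (a - b).abs();
    d.min(1.0 - d)
}

/// Returns an error instead of panicking when the peer has no known location.
pub fn distance_to_peer(
    target: f64,
    peer_location: Option<f64>,
) -> Result<f64, EstimationError> {
    let location = peer_location.ok_or(EstimationError::InsufficientData)?;
    Ok(ring_distance(target, location))
}
```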

Lines 98-100: Duplicate magic number

let adjustment_prior_size = 20;
let global_regression_big_enough_to_estimate_peer_adjustments =
    self.global_regression.len() >= adjustment_prior_size;

Same issue as above.

`crates/core/src/config/mod.rs`

Issues

Lines 103-104: New config fields

transient_budget: Some(DEFAULT_TRANSIENT_BUDGET),
transient_ttl_secs: Some(DEFAULT_TRANSIENT_TTL_SECS),

✅ Good: Transient connection limits are now configurable.

❓ Question: Where are the min_connections and max_connections tuning parameters that the PR description mentions? I see the config changes but no new parameters for isotonic scoring itself (e.g., FORWARD_ATTEMPT_TIMEOUT).

`crates/core/src/ring/connection_manager.rs`

Issues

Lines 147-260: should_accept() changes
The logic has become quite complex with multiple early returns, gateway-specific limits, and topology manager integration.

✅ Good:

Checks for duplicate peers (lines 172-176)
Reserves capacity atomically (lines 179-181)
Gateway-specific limit (lines 206-219)

⚠️ Concern: Lines 183-198 - overflow check is good, but this is defensive programming around a scenario that shouldn't happen in practice. If counters overflow, we have bigger problems.

Lines 206-219: Gateway direct-accept limit

const GATEWAY_DIRECT_ACCEPT_LIMIT: usize = 2;
if self.is_gateway {
    let direct_total = open + reserved_before;
    if direct_total >= GATEWAY_DIRECT_ACCEPT_LIMIT {
        tracing::info!(..., "Gateway reached direct-accept limit; forwarding join request instead");
        self.pending_reservations.write().remove(peer_id);
        return false;
    }
}

✅ Good: This addresses the capacity-blind problem at the gateway level.

❓ Question: Why hardcode 2? Should this be configurable or derived from max_connections?

Lines 419-443: add_connection() capacity check

if previous_location.is_none() && self.connection_count() >= self.max_connections {
    tracing::warn!(..., "add_connection: rejecting new connection to enforce cap");
    self.location_for_peer.write().remove(&peer);
    if was_reserved {
        self.pending_reservations.write().remove(&peer);
    }
    return;
}

✅ Good: Enforces global cap on new connections.

⚠️ Concern: This is the third place we check capacity (after should_accept() and topology manager). Could this create race conditions where we accept in should_accept() but reject here?

`crates/core/src/node/network_bridge/p2p_protoc.rs`

Lines 324-336: Observed address tagging

if let (
    Some(remote_addr),
    NetMessage::V1(NetMessageV1::Connect(ConnectMsg::Request {
        payload,
        ..
    })),
) = (remote, &mut msg)
{
    if payload.observed_addr.is_none() {
        payload.observed_addr = Some(remote_addr);
    }
}

✅ Excellent: This solves the NAT traversal problem cleanly by having the gateway/first hop tag the request with the observed UDP source.

Testing Analysis

Existing Tests

crates/core/src/operations/connect.rs:1152-1428 - Unit tests look comprehensive:

✅ Relay accepts when policy allows (lines 1226-1254)
✅ Relay forwards when not accepting (lines 1257-1288)
✅ Observed address propagation (lines 1291-1324)
✅ Joiner acceptance tracking (lines 1327-1345)
✅ Multi-hop forwarding with TTL (lines 1379-1427)

Missing Tests

❌ Critical gap: No tests for the isotonic estimator integration in ConnectOp!

Missing test scenarios:

Forward attempt times out → failure recorded in estimator
Forward attempt succeeds → success recorded in estimator
Scoring logic selects high-success-rate peer over closer peer
Fallback to distance-based routing when estimator has insufficient data
Recency cooldown prevents immediate re-selection of same peer
Concurrent operations don't corrupt forward_attempts HashMap

Recommendation: Add integration tests in crates/core/tests/ that:

Simulate join bursts with capacity constraints
Verify that isotonic scoring improves joiner connection diversity
Test timeout-based failure learning

Documentation Gaps

No explanation of isotonic regression: The PR description mentions it, but the code has no comments explaining what isotonic regression is or why it's suitable here.
No tuning guidance: The timeout (20s), cooldown (5s), and adjustment prior (10/20) are magic numbers with no justification.
No capacity vs. success distinction: The code learns "forward success" (did the peer respond?) but the problem is about capacity (can the peer accept more connections?). This distinction should be documented.

Security Considerations

✅ No obvious vulnerabilities, but consider:

Resource exhaustion: Global estimator grows unbounded. A malicious peer could cause memory bloat by generating many unique PeerKeyLocation entries.
Poisoning attack: A malicious peer could strategically fail/succeed to manipulate routing. The per-peer adjustments mitigate this somewhat, but a coordinated attack across multiple peer identities could be effective.

Mitigation: Add a cap on estimator size and/or age out old entries.
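
A minimal sketch of that mitigation, assuming per-peer counters keyed by a string id. BoundedForwardStats, its fields, and the evict-oldest policy are illustrative choices, not what the PR implements, and max_entries is assumed to be at least 1.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

struct PeerStats {
    successes: u64,
    failures: u64,
    last_seen: Instant,
}

/// Hypothetical bounded store: at most `max_entries` peers, none older than `max_age`.
pub struct BoundedForwardStats {
    max_entries: usize,
    max_age: Duration,
    entries: HashMap<String, PeerStats>,
}

impl BoundedForwardStats {
    pub fn new(max_entries: usize, max_age: Duration) -> Self {
        Self { max_entries, max_age, entries: HashMap::new() }
    }

    pub fn record(&mut self, peer: &str, success: bool, now: Instant) {
        // Age out entries that have not been updated within `max_age`.
        let max_age = self.max_age;
        self.entries.retain(|_, e| now.duration_since(e.last_seen) <= max_age);
        // If a new peer would exceed the cap, evict the least recently updated one.
        if !self.entries.contains_key(peer) && self.entries.len() >= self.max_entries {
            let oldest = self
                .entries
                .iter()
                .min_by_key(|(_, e)| e.last_seen)
                .map(|(key, _)| key.clone());
            if let Some(key) = oldest {
                self.entries.remove(&key);
            }
        }
        let entry = self.entries.entry(peer.to_string()).or_insert(PeerStats {
            successes: 0,
            failures: 0,
            last_seen: now,
        });
        if success {
            entry.successes += 1;
        } else {
            entry.failures += 1;
        }
        entry.last_seen = now;
    }

    pub fn success_rate(&self, peer: &str) -> Option<f64> {
        let e = self.entries.get(peer)?;
        Some(e.successes as f64 / (e.successes + e.failures) as f64)
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }

    pub fn contains(&self, peer: &str) -> bool {
        self.entries.contains_key(peer)
    }
}
```

Eviction here is a linear scan, which is fine for small caps; a larger cap would want an ordered index.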

Performance Concerns

Global lock contention: FORWARD_ESTIMATOR RwLock is acquired on every forward decision (line 412). Under heavy join load, this could become a bottleneck.
HashMap clones: Lines 423-424 clone PeerKeyLocation repeatedly. Consider using Rc or Arc for these.
Scoring overhead: Isotonic interpolation is O(log n) per candidate. For 20 candidates, this is negligible, but worth profiling.

Recommendations Summary

Must Fix

Eliminate global FORWARD_ESTIMATOR static - Move to per-node instance in OpManager
Fix magic number inconsistency - ADJUSTMENT_PRIOR_SIZE used inconsistently
Add test coverage - Integration tests for isotonic scoring behavior
Guard against panic - Check peer.location before unwrap in estimator

Should Fix

Document isotonic regression - Add module-level doc explaining the approach
Make timeouts configurable - Move magic numbers to config
Explain capacity vs. success - Document what the estimator actually learns
Add estimator size limit - Prevent unbounded growth

Nice to Have

Optimize clones - Use Arc<PeerKeyLocation> if profiling shows allocation pressure
Tune gateway direct-accept limit - Make GATEWAY_DIRECT_ACCEPT_LIMIT configurable

Conclusion

The isotonic scoring approach is architecturally sound and addresses a real pain point. However, the implementation has several issues (global state, missing tests, magic numbers) that should be resolved before merge.

Action Items:

Refactor FORWARD_ESTIMATOR from global static to instance in OpManager
Add comprehensive integration tests (especially timeout-based learning)
Fix magic number inconsistencies
Add documentation explaining the algorithm

Once these are addressed, this will be a strong contribution to the project. The complexity is justified by the problem being solved.

Copilot

Pull request overview

This PR implements capacity-aware connection forwarding for the connect operation using learned success patterns via isotonic regression. The implementation tracks forward attempts with timeouts, learns a monotonic success-vs-distance curve, and uses it to score routing candidates while preserving ring-distance bias and recency cooldowns.

Key changes:

Introduces min_connections and max_connections configuration parameters with defaults (10 and 20 respectively)
Replaces atomic connection counters with a pending_reservations map for clearer admission control
Adds forward attempt tracking in ConnectOp with automatic timeout-based failure recording (20s)
Implements a global isotonic estimator for learning per-peer forward success rates

Major concerns identified:

Reservation counting and capacity enforcement logic has race conditions and potential leaks
Path dependency in Cargo.toml will break external builds
Global static estimator shared across all node instances (problematic for testing)
IP-based inbound matching with arbitrary fallback when port doesn't match

Reviewed changes

Copilot reviewed 19 out of 20 changed files in this pull request and generated 6 comments.

File	Description
crates/core/src/config/mod.rs	Adds min/max connection configuration parameters with defaults
crates/core/src/operations/connect.rs	Implements forward attempt tracking and isotonic-based routing selection
crates/core/src/ring/connection_manager.rs	Replaces atomic counters with pending_reservations map; adds capacity enforcement
crates/core/src/ring/mod.rs	Updates connection maintenance to filter by live transactions and capacity
crates/core/src/node/network_bridge/p2p_protoc.rs	Adds immediate prune on drop; enforces capacity on transient promotion
crates/core/src/node/network_bridge/handshake.rs	Changes expected inbound tracking from socket-based to IP-based with port fallback
crates/core/src/router/isotonic_estimator.rs	Changes visibility from super to crate for use in connect operation
crates/core/Cargo.toml	Adds local path dependency for freenet-test-network (problematic)
Test configuration files	Adds min/max_connections: None to all test node configs
apps/freenet-ping/app/tests/run_app_blocked_peers.rs	Updates ignore attribute with TODO explaining WebSocket teardown issue

Comments suppressed due to low confidence (2)

crates/core/src/ring/connection_manager.rs:185

The reserved_before variable is read at line 150 before inserting into pending_reservations at line 180. This means when calculating total_conn at lines 183-185, the newly inserted reservation is not included in the count. This could lead to accepting more connections than max_connections allows during concurrent admission checks.

Consider moving the read of reserved_before to after the early returns (lines 172-176, 200-203, etc.) but before the insertion, or restructure to ensure the count is accurate when making the acceptance decision.

        let open = self.connection_count();
        let reserved_before = self.pending_reservations.read().len();

        tracing::info!(
            %peer_id,
            open,
            reserved_before,
            is_gateway = self.is_gateway,
            min = self.min_connections,
            max = self.max_connections,
            rnd_if_htl_above = self.rnd_if_htl_above,
            "should_accept: evaluating direct acceptance guard"
        );

        if self.is_gateway && (open > 0 || reserved_before > 0) {
            tracing::info!(
                %peer_id,
                open,
                reserved_before,
                "Gateway evaluating additional direct connection (post-bootstrap)"
            );
        }

        if self.location_for_peer.read().get(peer_id).is_some() {
            // We've already accepted this peer (pending or active); treat as a no-op acceptance.
            tracing::debug!(%peer_id, "Peer already pending/connected; acknowledging acceptance");
            return true;
        }

        {
            let mut pending = self.pending_reservations.write();
            pending.insert(peer_id.clone(), location);
        }

        let total_conn = match reserved_before
            .checked_add(1)
            .and_then(|val| val.checked_add(open))

crates/core/src/ring/connection_manager.rs:614

Both connection_count() (line 555-562) and num_connections() (line 605-614) compute the same value: the total number of connections across all location buckets. This duplication is confusing and error-prone.

Consider consolidating to a single method or clearly documenting why both exist if there's a semantic difference.

    pub(crate) fn connection_count(&self) -> usize {
        // Count only established connections tracked by location buckets.
        self.connections_by_location
            .read()
            .values()
            .map(|conns| conns.len())
            .sum()
    }

    pub(crate) fn get_connections_by_location(&self) -> BTreeMap<Location, Vec<Connection>> {
        self.connections_by_location.read().clone()
    }

    pub(super) fn get_known_locations(&self) -> BTreeMap<PeerId, Location> {
        self.location_for_peer.read().clone()
    }

    /// Route an op to the most optimal target.
    pub fn routing(
        &self,
        target: Location,
        requesting: Option<&PeerId>,
        skip_list: impl Contains<PeerId>,
        router: &Router,
    ) -> Option<PeerKeyLocation> {
        let connections = self.connections_by_location.read();
        tracing::debug!(
            total_locations = connections.len(),
            self_peer = self
                .get_peer_key()
                .as_ref()
                .map(|id| id.to_string())
                .unwrap_or_else(|| "unknown".into()),
            "routing: considering connections"
        );
        let peers = connections.values().filter_map(|conns| {
            let conn = conns.choose(&mut rand::rng())?;
            if self.is_transient(&conn.location.peer) {
                return None;
            }
            if let Some(requester) = requesting {
                if requester == &conn.location.peer {
                    return None;
                }
            }
            (!skip_list.has_element(conn.location.peer.clone())).then_some(&conn.location)
        });
        router.select_peer(peers, target).cloned()
    }

    pub fn num_connections(&self) -> usize {
        let connections = self.connections_by_location.read();
        let total: usize = connections.values().map(|v| v.len()).sum();
        tracing::debug!(
            unique_locations = connections.len(),
            total_connections = total,
            "num_connections called"
        );
        total
    }

Copilot · 2025-11-23T16:20:31Z

crates/core/Cargo.toml

 freenet-stdlib = { features = ["net", "testing"], workspace = true }
 freenet-macros = { path = "../freenet-macros" }
-freenet-test-network = "0.1.1"
+freenet-test-network = { version = "0.1.2", path = "../../../../freenet-test-network" }


Keeping deps unchanged in this slice to avoid churn; the path patching is from the underlying stack and already addressed on the base. If needed we can follow up once the stack lands.

Copilot · 2025-11-23T16:20:32Z

crates/core/src/node/network_bridge/handshake.rs

+        let entry = list.pop();
+        if list.is_empty() {
+            self.entries.remove(&ip);
+        }
+        if let Some(entry) = entry {
+            tracing::debug!(remote = %addr, peer = %entry.peer.addr, transient = entry.transient, tx = ?entry.transaction, "ExpectInbound: matched by IP fallback");
+            return Some(entry);
+        }


The IP-only fallback existed before this slice; to keep this PR scoped to forwarding we’re leaving it unchanged. Agree it merits cleanup—tracking that in issue #2127 for a follow-up refactor once the stack lands.

Copilot · 2025-11-23T16:20:32Z

crates/core/src/node/network_bridge/handshake.rs

    peer: PeerId,
    transaction: Option<Transaction>,
-    courtesy: bool,
+    transient: bool, // TODO: rename to transient in protocol once we migrate terminology


The TODO comment says "rename to transient in protocol" but the field is already named transient. This comment appears to be stale or incorrectly worded. If the intention is to rename something in the protocol layer, the comment should clarify what specifically needs to be renamed.

Suggested change

transient: bool, // TODO: rename to transient in protocol once we migrate terminology

transient: bool,

Stale note—the field is already named transient here. Keeping code as-is and will drop the TODO when we do the terminology cleanup tracked in #2127.

Copilot · 2025-11-23T16:20:32Z

crates/core/src/operations/connect.rs

+
+static FORWARD_ESTIMATOR: Lazy<RwLock<ForwardEstimator>> =
+    Lazy::new(|| RwLock::new(ForwardEstimator::new()));
+


The FORWARD_ESTIMATOR is a global static shared across all ConnectOp instances and potentially all node instances in the same process. This means:

Multiple nodes in the same process (e.g., during testing) will share and contaminate each other's forward success data

The estimator's state persists across operation instances, which may be desired but should be documented

There's no mechanism to reset or clear the estimator if needed

Consider either:

Making this per-node or per-OpManager instance

Adding clear documentation about the intended global sharing semantics

Providing a way to reset/clear for testing purposes

Suggested change

static FORWARD_ESTIMATOR: Lazy<RwLock<ForwardEstimator>> =
    Lazy::new(|| RwLock::new(ForwardEstimator::new()));

/// Global estimator shared across all ConnectOp instances and all nodes in the process.
///
/// # Warning
/// This estimator is global and shared. Its state will be visible to all users in the process,
/// including across tests and multiple nodes. Use `reset_forward_estimator()` to clear it in tests.
static FORWARD_ESTIMATOR: Lazy<RwLock<ForwardEstimator>> =
    Lazy::new(|| RwLock::new(ForwardEstimator::new()));

impl ForwardEstimator {
    /// Reset the estimator to its initial state.
    fn reset(&mut self) {
        *self = ForwardEstimator::new();
    }
}

/// Reset the global FORWARD_ESTIMATOR. Intended for use in tests.
#[cfg(test)]
pub fn reset_forward_estimator() {
    let mut estimator = FORWARD_ESTIMATOR.write();
    estimator.reset();
}

Agreed the global static isn’t ideal. In practice we run one node per process so cross-node contamination hasn’t bitten us, but it should live on node/OpManager state with reset/bounds. I’ve logged that in #2127 and kept this slice scoped to forwarding; happy to move it in the follow-up.

Tests run by cargo test are not run in multiple processes; the nodes all run as part of the same process, so be careful with that.

Addressed: the estimator is no longer global. Each node now owns a per-node connect_forward_estimator (on OpManager) that’s shared only across ConnectOps for that node, so single-process multi-node tests don’t contaminate each other.

Changes are now on feat/isotonic-forwarding-pr (pushed 48bdf25 + 2f8e7f70). The estimator is per-node and scoped to Connect ops as discussed—thanks for the approval and the heads-up.

Copilot · 2025-11-23T16:20:33Z

crates/core/src/node/network_bridge/p2p_protoc.rs

+                let current = connection_manager.connection_count();
+                if current >= connection_manager.max_connections {
+                    tracing::warn!(
+                        tx = %tx,
+                        %peer,
+                        current_connections = current,
+                        max_connections = connection_manager.max_connections,
+                        %loc,
+                        "connect_peer: rejecting transient promotion to enforce cap"
+                    );
+                    callback
+                        .send_result(Err(()))
+                        .await
+                        .inspect_err(|err| {
+                            tracing::debug!(
+                                tx = %tx,
+                                remote = %peer,
+                                ?err,
+                                "connect_peer: failed to notify cap-rejection callback"
+                            );
+                        })
+                        .ok();
+                    return Ok(());
+                }


After dropping the transient connection at line 1309, the code calls should_accept at line 1314, which internally inserts into pending_reservations and checks capacity limits. Then at lines 1336-1359, it redundantly checks connection_count() against max_connections again.

This is problematic because:

The capacity check is duplicated (already done inside should_accept)

If the second check fails (lines 1337-1359), the pending_reservations entry added by should_accept is never cleaned up, causing a reservation leak

The transient was already dropped at line 1309, so if either check fails, the transient slot is lost

Consider either:

Removing the redundant check at lines 1336-1359 and relying on should_accept's internal logic

Or restructuring to check admission before dropping the transient, and cleaning up pending_reservations on failure

Suggested change

let current = connection_manager.connection_count();
if current >= connection_manager.max_connections {
    tracing::warn!(
        tx = %tx,
        %peer,
        current_connections = current,
        max_connections = connection_manager.max_connections,
        %loc,
        "connect_peer: rejecting transient promotion to enforce cap"
    );
    callback
        .send_result(Err(()))
        .await
        .inspect_err(|err| {
            tracing::debug!(
                tx = %tx,
                remote = %peer,
                ?err,
                "connect_peer: failed to notify cap-rejection callback"
            );
        })
        .ok();
    return Ok(());
}

Good catch on the duplicate guard. The intent was a belt-and-suspenders check on the promotion path that bypasses the normal admission flow, but if that second cap check fires we should clear the reservation to avoid skewing capacity. I’ll tidy this in the follow-up slice so we don’t leak reservations there.

should_accept having side effects is bad, that verb does not imply mutation, we should fix it

Good point. should_accept currently reserves capacity (and can log) to make acceptance atomic, which is why it has side effects. Agreed the naming is misleading; I’ll split this into a pure check plus an explicit reserve call (or rename to reflect reservation) in a follow-up cleanup so callers aren’t surprised.

Agreed—doing it in this PR. I will split the logic into a pure check and an explicit reserve step (or rename to reflect reservation) so callers aren’t surprised and there are no hidden side effects.
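
A sketch of how that split could look, with the check and the reservation kept under one lock so two concurrent callers cannot both take the last slot, plus an explicit release for paths that reject after reserving. Admission and its method names are hypothetical, peers are strings, and std's Mutex stands in for whatever locking the connection manager actually uses.

```rust
use std::collections::HashSet;
use std::sync::Mutex;

struct AdmissionState {
    open: usize,
    reserved: HashSet<String>,
}

/// Hypothetical admission controller separating query from mutation.
pub struct Admission {
    max_connections: usize,
    state: Mutex<AdmissionState>,
}

impl Admission {
    pub fn new(max_connections: usize) -> Self {
        Self {
            max_connections,
            state: Mutex::new(AdmissionState { open: 0, reserved: HashSet::new() }),
        }
    }

    /// Pure check: reads state and never mutates it.
    pub fn would_accept(&self, peer: &str) -> bool {
        let s = self.state.lock().unwrap();
        s.reserved.contains(peer) || s.open + s.reserved.len() < self.max_connections
    }

    /// Checks capacity and reserves a slot in one critical section.
    pub fn try_reserve(&self, peer: &str) -> bool {
        let mut s = self.state.lock().unwrap();
        if s.reserved.contains(peer) {
            return true; // already holding a slot
        }
        if s.open + s.reserved.len() >= self.max_connections {
            return false;
        }
        s.reserved.insert(peer.to_string());
        true
    }

    /// Gives the slot back when a later step rejects the connection.
    pub fn release(&self, peer: &str) {
        self.state.lock().unwrap().reserved.remove(peer);
    }

    /// Converts a reservation into an open connection; false if none was held.
    pub fn promote(&self, peer: &str) -> bool {
        let mut s = self.state.lock().unwrap();
        if s.reserved.remove(peer) {
            s.open += 1;
            true
        } else {
            false
        }
    }

    pub fn reserved_count(&self) -> usize {
        self.state.lock().unwrap().reserved.len()
    }
}
```

Callers that bail out after try_reserve would call release on every rejection path, which is the cleanup missing from the two promotion paths flagged in this review.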

Copilot · 2025-11-23T16:20:33Z

crates/core/src/node/network_bridge/p2p_protoc.rs

+                let current = connection_manager.connection_count();
+                if current >= connection_manager.max_connections {
+                    tracing::warn!(
+                        %peer_id,
+                        current_connections = current,
+                        max_connections = connection_manager.max_connections,
+                        %loc,
+                        "handle_successful_connection: rejecting new connection to enforce cap"
+                    );
+                    return Ok(());
+                }


Similar to the issue in handle_connect_peer, this code calls should_accept at line 1796 (which internally checks capacity and inserts into pending_reservations), then redundantly checks capacity again at lines 1805-1815.

If the second check fails, the function returns without calling add_connection, but the pending_reservations entry created by should_accept is never cleaned up, causing a reservation leak.

The fix should be consistent with the one applied to handle_connect_peer—remove the redundant check or restructure to avoid the reservation leak.

Suggested change

let current = connection_manager.connection_count();
if current >= connection_manager.max_connections {
    tracing::warn!(
        %peer_id,
        current_connections = current,
        max_connections = connection_manager.max_connections,
        %loc,
        "handle_successful_connection: rejecting new connection to enforce cap"
    );
    return Ok(());
}

Same here—extra cap guard after should_accept was meant as a backstop, but if it trips we should drop the reservation so it doesn’t linger. I’ll clean this up alongside the other promotion path tweak to avoid the leak.

sanity · 2025-11-24T03:39:26Z

Addressed parts of Claude's review:

Made isotonic prior size consistent (single ADJUSTMENT_PRIOR_SIZE) and added a guard in the estimator to return InsufficientData when peer locations are missing.
Added small smoke tests: forward estimator tolerates missing locations and expired forward attempts are cleared/recorded.
Documented the shared estimator intent; broader refactor to bound/reset and move it under an explicit owner (while keeping the shared curve) is tracked in Refactor transient connections into dedicated manager #2127 along with size limits and richer integration tests.

Pending for #2127: moving the estimator into owned state with reset/bounds, adding size/age caps, and more thorough isotonic integration coverage.

iduartgomez · 2025-11-24T21:10:17Z

crates/core/src/operations/connect.rs

    ) -> Option<PeerKeyLocation>;
+
+    /// Whether the acceptance should be treated as a short-lived transient link.
+    fn transient_hint(&self, acceptor: &PeerKeyLocation, joiner: &PeerKeyLocation) -> bool;


Is this for downstreams during connect? If so, clarify in the documentation: where/how it is used, why it is needed, etc.

Clarified in code: this per-node ConnectForwardEstimator tracks downstream Connect forwarding outcomes (success/fail) so a node can bias future Connect forwards toward peers likely to accept/complete when capacity is scarce. Added doc comments on the estimator explaining what it learns and why.
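
As background on what such an estimator learns, here is a generic pool-adjacent-violators (PAV) sketch that fits a monotonic curve to outcomes sorted by distance. It is not the code in `isotonic_estimator.rs`: the function names are hypothetical, there are no per-peer adjustments or priors, and the assumption that success rate falls as distance grows is made only for illustration.

```rust
/// Pool-adjacent-violators: returns the non-decreasing sequence that is
/// closest to `ys` in the least-squares sense (unit weights).
fn isotonic_non_decreasing(ys: &[f64]) -> Vec<f64> {
    // Each block is (sum of pooled values, number of pooled values).
    let mut blocks: Vec<(f64, usize)> = Vec::new();
    for &y in ys {
        blocks.push((y, 1));
        // Merge backwards while the previous block's mean exceeds the last one's.
        while blocks.len() >= 2 {
            let (s2, n2) = blocks[blocks.len() - 1];
            let (s1, n1) = blocks[blocks.len() - 2];
            if s1 / n1 as f64 > s2 / n2 as f64 {
                blocks.pop();
                if let Some(last) = blocks.last_mut() {
                    *last = (s1 + s2, n1 + n2);
                }
            } else {
                break;
            }
        }
    }
    // Expand each block back into one fitted value per input point.
    blocks
        .iter()
        .flat_map(|&(sum, n)| std::iter::repeat(sum / n as f64).take(n))
        .collect()
}

/// Non-increasing fit, obtained by fitting the reversed sequence.
/// Input: outcomes (1.0 = success, 0.0 = failure) ordered by increasing distance.
fn isotonic_non_increasing(ys: &[f64]) -> Vec<f64> {
    let reversed: Vec<f64> = ys.iter().rev().copied().collect();
    let mut fitted = isotonic_non_decreasing(&reversed);
    fitted.reverse();
    fitted
}
```

Given outcomes `[1.0, 0.0, 1.0, 0.0]` ordered by distance, the non-increasing fit pools the two middle points into a 0.5 estimate, so a candidate at that distance would score between the nearest and farthest ones.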

iduartgomez

I don't see the changes pushed, but approved preemptively.

sanity · 2025-11-24T22:56:01Z

@iduartgomez Follow-up on your review: the per-node connect estimator and associated fixes are now pushed to feat/isotonic-forwarding-pr (commits 48bdf25 and 2f8e7f70). Let me know if you see anything still missing.

…cture PR #2136 changed ExpectedInboundTracker from HashMap<IpAddr, Vec<ExpectedInbound>> to HashMap<SocketAddr, ExpectedInbound>. The transactions_for() test helper was added in this branch before that change and wasn't updated during the merge. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

sanity requested review from Copilot and iduartgomez November 23, 2025 16:15

Copilot started reviewing on behalf of sanity November 23, 2025 16:16

Copilot finished reviewing on behalf of sanity November 23, 2025 16:19

Copilot AI reviewed Nov 23, 2025

sanity force-pushed the feat/isotonic-forwarding-pr branch from db01dcf to 82edca5 Compare November 23, 2025 16:39

sanity marked this pull request as ready for review November 23, 2025 18:53

sanity force-pushed the feat/isotonic-forwarding-pr branch 2 times, most recently from 89edd6e to f883f1c Compare November 24, 2025 03:18

sanity changed the base branch from fix/transient-connection-tracking to main November 24, 2025 03:18

sanity force-pushed the feat/isotonic-forwarding-pr branch from 39aacb8 to 7d90e88 Compare November 24, 2025 03:48

iduartgomez reviewed Nov 24, 2025

iduartgomez approved these changes Nov 24, 2025

sanity enabled auto-merge November 24, 2025 21:53

sanity added 7 commits November 24, 2025 16:58

feat: learn connect forwarding success with isotonic scoring

28fb354

fix: replace expected inbound entry by port

104d984

test: retry riverctl when contract not yet present

81188e3

test: add ignored riverctl propagation smoke test

79409b7

fix: tighten isotonic estimator guards and tests

d2759c8

fix(connect): make forward estimator per-node and document usage

f304d3b

fix(connect): avoid holding estimator lock across awaits

967acc5

sanity force-pushed the feat/isotonic-forwarding-pr branch from 48bdf25 to 967acc5 Compare November 24, 2025 22:59

sanity and others added 3 commits November 24, 2025 17:20

Merge branch 'main' into feat/isotonic-forwarding-pr

647c7b7

Merge branch 'main' into feat/isotonic-forwarding-pr

057c1e5

sanity added this pull request to the merge queue Nov 24, 2025

Merged via the queue into main with commit d91beeb Nov 25, 2025
11 checks passed

sanity deleted the feat/isotonic-forwarding-pr branch November 25, 2025 00:04

	transient: bool, // TODO: rename to transient in protocol once we migrate terminology
	transient: bool,


-static FORWARD_ESTIMATOR: Lazy<RwLock<ForwardEstimator>> =
-    Lazy::new(|| RwLock::new(ForwardEstimator::new()));
+/// Global estimator shared across all ConnectOp instances and all nodes in the process.
+///
+/// # Warning
+/// This estimator is global and shared. Its state will be visible to all users in the process,
+/// including across tests and multiple nodes. Use `reset_forward_estimator()` to clear it in tests.
+static FORWARD_ESTIMATOR: Lazy<RwLock<ForwardEstimator>> =
+    Lazy::new(|| RwLock::new(ForwardEstimator::new()));
+impl ForwardEstimator {
+    /// Reset the estimator to its initial state.
+    fn reset(&mut self) {
+        *self = ForwardEstimator::new();
+    }
+}
+/// Reset the global FORWARD_ESTIMATOR. Intended for use in tests.
+#[cfg(test)]
+pub fn reset_forward_estimator() {
+    let mut estimator = FORWARD_ESTIMATOR.write();
+    estimator.reset();
+}
