feat: learn connect forwarding success with isotonic scoring #2126
Conversation
@claude Please review, both high-level and in-detail
Claude finished @sanity's task — View job

**Code Review Complete**

Reviewed PR #2126: feat: learn connect forwarding success with isotonic scoring

**High-Level Assessment**

This PR introduces a significant enhancement to the connect forwarding mechanism, using machine learning (isotonic regression) to make capacity-aware routing decisions. The approach is sound and addresses a real problem (capacity-blind forwarding during join bursts), but there are several concerns around complexity, potential bugs, and testing.

Overall Verdict: Needs revisions before merge. The core algorithm is solid, but implementation details need attention.

**Architectural Review**

✅ Strengths
Pull request overview
This PR implements capacity-aware connection forwarding for the connect operation using learned success patterns via isotonic regression. The implementation tracks forward attempts with timeouts, learns a monotonic success-vs-distance curve, and uses it to score routing candidates while preserving ring-distance bias and recency cooldowns.
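The "monotonic success-vs-distance curve" mentioned above is the kind of fit produced by the classic pool-adjacent-violators (PAV) algorithm. Below is a minimal standalone sketch of that fit; it is not the crate's actual `IsotonicEstimator`, whose API and data model differ:

```rust
/// Pool-adjacent-violators (PAV): fit the best non-decreasing sequence to
/// `ys` under squared error. A minimal standalone sketch of an isotonic
/// fit; the crate's real `IsotonicEstimator` has its own API.
fn pav(ys: &[f64]) -> Vec<f64> {
    // Each block holds (weight, mean) of a pooled run of points.
    let mut blocks: Vec<(f64, f64)> = Vec::new();
    for &y in ys {
        blocks.push((1.0, y));
        // Merge backwards while monotonicity is violated.
        while blocks.len() >= 2 {
            let n = blocks.len();
            if blocks[n - 2].1 > blocks[n - 1].1 {
                let (w1, m1) = blocks[n - 2];
                let (w2, m2) = blocks[n - 1];
                blocks.truncate(n - 2);
                blocks.push((w1 + w2, (w1 * m1 + w2 * m2) / (w1 + w2)));
            } else {
                break;
            }
        }
    }
    // Expand each pooled block back to per-point fitted values.
    let mut fit = Vec::with_capacity(ys.len());
    for (w, m) in blocks {
        for _ in 0..w as usize {
            fit.push(m);
        }
    }
    fit
}
```

For a success-vs-distance relationship that must be non-increasing, the same routine can be applied to the reversed sample order.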
Key changes:
- Introduces `min_connections` and `max_connections` configuration parameters with defaults (10 and 20 respectively)
- Replaces atomic connection counters with a `pending_reservations` map for clearer admission control
- Adds forward attempt tracking in `ConnectOp` with automatic timeout-based failure recording (20s)
- Implements a global isotonic estimator for learning per-peer forward success rates
Major concerns identified:
- Reservation counting and capacity enforcement logic has race conditions and potential leaks
- Path dependency in Cargo.toml will break external builds
- Global static estimator shared across all node instances (problematic for testing)
- IP-based inbound matching with arbitrary fallback when port doesn't match
Reviewed changes
Copilot reviewed 19 out of 20 changed files in this pull request and generated 6 comments.
Summary per file:
| File | Description |
|---|---|
| crates/core/src/config/mod.rs | Adds min/max connection configuration parameters with defaults |
| crates/core/src/operations/connect.rs | Implements forward attempt tracking and isotonic-based routing selection |
| crates/core/src/ring/connection_manager.rs | Replaces atomic counters with pending_reservations map; adds capacity enforcement |
| crates/core/src/ring/mod.rs | Updates connection maintenance to filter by live transactions and capacity |
| crates/core/src/node/network_bridge/p2p_protoc.rs | Adds immediate prune on drop; enforces capacity on transient promotion |
| crates/core/src/node/network_bridge/handshake.rs | Changes expected inbound tracking from socket-based to IP-based with port fallback |
| crates/core/src/router/isotonic_estimator.rs | Changes visibility from super to crate for use in connect operation |
| crates/core/Cargo.toml | Adds local path dependency for freenet-test-network (problematic) |
| Test configuration files | Adds min/max_connections: None to all test node configs |
| apps/freenet-ping/app/tests/run_app_blocked_peers.rs | Updates ignore attribute with TODO explaining WebSocket teardown issue |
Comments suppressed due to low confidence (2)
crates/core/src/ring/connection_manager.rs:185
- The `reserved_before` variable is read at line 150, before inserting into `pending_reservations` at line 180. This means that when `total_conn` is calculated at lines 183-185, the newly inserted reservation is not included in the count, which could lead to accepting more connections than `max_connections` allows during concurrent admission checks. Consider moving the read of `reserved_before` to after the early returns (lines 172-176, 200-203, etc.) but before the insertion, or restructuring to ensure the count is accurate when making the acceptance decision.
```rust
let open = self.connection_count();
let reserved_before = self.pending_reservations.read().len();
tracing::info!(
    %peer_id,
    open,
    reserved_before,
    is_gateway = self.is_gateway,
    min = self.min_connections,
    max = self.max_connections,
    rnd_if_htl_above = self.rnd_if_htl_above,
    "should_accept: evaluating direct acceptance guard"
);
if self.is_gateway && (open > 0 || reserved_before > 0) {
    tracing::info!(
        %peer_id,
        open,
        reserved_before,
        "Gateway evaluating additional direct connection (post-bootstrap)"
    );
}
if self.location_for_peer.read().get(peer_id).is_some() {
    // We've already accepted this peer (pending or active); treat as a no-op acceptance.
    tracing::debug!(%peer_id, "Peer already pending/connected; acknowledging acceptance");
    return true;
}
{
    let mut pending = self.pending_reservations.write();
    pending.insert(peer_id.clone(), location);
}
let total_conn = match reserved_before
    .checked_add(1)
    .and_then(|val| val.checked_add(open))
```
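The fix the review suggests amounts to a check-and-reserve done atomically under one lock. A hypothetical sketch of that pattern (names and types are illustrative, not the crate's actual API):

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical sketch of the suggested fix: read the pending count and
/// insert the reservation under the same lock, so two concurrent admission
/// checks cannot both pass a nearly-full capacity test.
struct Reservations {
    max_connections: usize,
    pending: Mutex<HashMap<String, u64 /* illustrative location */>>,
}

impl Reservations {
    /// Records the reservation and returns true only if capacity allows,
    /// counting the candidate itself in the total.
    fn try_reserve(&self, peer_id: &str, location: u64, open: usize) -> bool {
        let mut pending = self.pending.lock().unwrap();
        let total = open + pending.len() + 1; // include this candidate
        if total > self.max_connections {
            return false;
        }
        pending.insert(peer_id.to_string(), location);
        true
    }
}
```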
crates/core/src/ring/connection_manager.rs:614
- Both `connection_count()` (lines 555-562) and `num_connections()` (lines 605-614) compute the same value: the total number of connections across all location buckets. This duplication is confusing and error-prone. Consider consolidating them into a single method, or clearly documenting why both exist if there is a semantic difference.
```rust
pub(crate) fn connection_count(&self) -> usize {
    // Count only established connections tracked by location buckets.
    self.connections_by_location
        .read()
        .values()
        .map(|conns| conns.len())
        .sum()
}

pub(crate) fn get_connections_by_location(&self) -> BTreeMap<Location, Vec<Connection>> {
    self.connections_by_location.read().clone()
}

pub(super) fn get_known_locations(&self) -> BTreeMap<PeerId, Location> {
    self.location_for_peer.read().clone()
}

/// Route an op to the most optimal target.
pub fn routing(
    &self,
    target: Location,
    requesting: Option<&PeerId>,
    skip_list: impl Contains<PeerId>,
    router: &Router,
) -> Option<PeerKeyLocation> {
    let connections = self.connections_by_location.read();
    tracing::debug!(
        total_locations = connections.len(),
        self_peer = self
            .get_peer_key()
            .as_ref()
            .map(|id| id.to_string())
            .unwrap_or_else(|| "unknown".into()),
        "routing: considering connections"
    );
    let peers = connections.values().filter_map(|conns| {
        let conn = conns.choose(&mut rand::rng())?;
        if self.is_transient(&conn.location.peer) {
            return None;
        }
        if let Some(requester) = requesting {
            if requester == &conn.location.peer {
                return None;
            }
        }
        (!skip_list.has_element(conn.location.peer.clone())).then_some(&conn.location)
    });
    router.select_peer(peers, target).cloned()
}

pub fn num_connections(&self) -> usize {
    let connections = self.connections_by_location.read();
    let total: usize = connections.values().map(|v| v.len()).sum();
    tracing::debug!(
        unique_locations = connections.len(),
        total_connections = total,
        "num_connections called"
    );
    total
}
```
crates/core/Cargo.toml (outdated)

```diff
  freenet-stdlib = { features = ["net", "testing"], workspace = true }
  freenet-macros = { path = "../freenet-macros" }
- freenet-test-network = "0.1.1"
+ freenet-test-network = { version = "0.1.2", path = "../../../../freenet-test-network" }
```
Copilot AI · Nov 23, 2025 (edited by sanity)
Keeping deps unchanged in this slice to avoid churn; the path patching is from the underlying stack and already addressed on the base. If needed we can follow up once the stack lands.
```rust
let entry = list.pop();
if list.is_empty() {
    self.entries.remove(&ip);
}
if let Some(entry) = entry {
    tracing::debug!(remote = %addr, peer = %entry.peer.addr, transient = entry.transient, tx = ?entry.transaction, "ExpectInbound: matched by IP fallback");
    return Some(entry);
}
```
Copilot AI · Nov 23, 2025 (edited by sanity)
The IP-only fallback existed before this slice; to keep this PR scoped to forwarding we’re leaving it unchanged. Agree it merits cleanup—tracking that in issue #2127 for a follow-up refactor once the stack lands.
```rust
peer: PeerId,
transaction: Option<Transaction>,
courtesy: bool,
transient: bool, // TODO: rename to transient in protocol once we migrate terminology
```
Copilot AI · Nov 23, 2025
The TODO comment says "rename to transient in protocol" but the field is already named transient. This comment appears to be stale or incorrectly worded. If the intention is to rename something in the protocol layer, the comment should clarify what specifically needs to be renamed.
Suggested change:

```diff
- transient: bool, // TODO: rename to transient in protocol once we migrate terminology
+ transient: bool,
```
test
Stale note—the field is already named transient here. Keeping code as-is and will drop the TODO when we do the terminology cleanup tracked in #2127.
```rust
static FORWARD_ESTIMATOR: Lazy<RwLock<ForwardEstimator>> =
    Lazy::new(|| RwLock::new(ForwardEstimator::new()));
```
Copilot AI · Nov 23, 2025
The FORWARD_ESTIMATOR is a global static shared across all ConnectOp instances and potentially all node instances in the same process. This means:
- Multiple nodes in the same process (e.g., during testing) will share and contaminate each other's forward success data
- The estimator's state persists across operation instances, which may be desired but should be documented
- There's no mechanism to reset or clear the estimator if needed
Consider either:
- Making this per-node or per-OpManager instance
- Adding clear documentation about the intended global sharing semantics
- Providing a way to reset/clear for testing purposes
Suggested change:

```diff
- static FORWARD_ESTIMATOR: Lazy<RwLock<ForwardEstimator>> =
-     Lazy::new(|| RwLock::new(ForwardEstimator::new()));
+ /// Global estimator shared across all ConnectOp instances and all nodes in the process.
+ ///
+ /// # Warning
+ /// This estimator is global and shared. Its state will be visible to all users in the process,
+ /// including across tests and multiple nodes. Use `reset_forward_estimator()` to clear it in tests.
+ static FORWARD_ESTIMATOR: Lazy<RwLock<ForwardEstimator>> =
+     Lazy::new(|| RwLock::new(ForwardEstimator::new()));
+
+ impl ForwardEstimator {
+     /// Reset the estimator to its initial state.
+     fn reset(&mut self) {
+         *self = ForwardEstimator::new();
+     }
+ }
+
+ /// Reset the global FORWARD_ESTIMATOR. Intended for use in tests.
+ #[cfg(test)]
+ pub fn reset_forward_estimator() {
+     let mut estimator = FORWARD_ESTIMATOR.write();
+     estimator.reset();
+ }
```
Agreed the global static isn’t ideal. In practice we run one node per process so cross-node contamination hasn’t bitten us, but it should live on node/OpManager state with reset/bounds. I’ve logged that in #2127 and kept this slice scoped to forwarding; happy to move it in the follow-up.
Note that cargo does not run tests in separate processes; in these tests all nodes run as part of the same process, so be careful with that.
Addressed: the estimator is no longer global. Each node now owns a per-node connect_forward_estimator (on OpManager) that’s shared only across ConnectOps for that node, so single-process multi-node tests don’t contaminate each other.
Changes are now on feat/isotonic-forwarding-pr (pushed 48bdf25 + 2f8e7f70). The estimator is per-node and scoped to Connect ops as discussed—thanks for the approval and the heads-up.
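The per-node ownership described above can be sketched as follows: the estimator lives on the node's operation manager and each Connect op gets a shared handle, so separate nodes in one test process keep separate state. All names here are illustrative stand-ins, not the actual crate types:

```rust
use std::sync::{Arc, RwLock};

/// Illustrative stand-in for the forward-success estimator.
#[derive(Default)]
struct ForwardEstimatorSketch {
    samples: Vec<(f64, bool)>, // (ring distance, forward succeeded)
}

/// Per-node owner: each node constructs its own estimator, so two nodes
/// running in the same test process cannot contaminate each other's data.
struct OpManagerSketch {
    connect_forward_estimator: Arc<RwLock<ForwardEstimatorSketch>>,
}

impl OpManagerSketch {
    fn new() -> Self {
        Self {
            connect_forward_estimator: Arc::new(RwLock::new(
                ForwardEstimatorSketch::default(),
            )),
        }
    }

    /// Every Connect op for this node shares this handle.
    fn estimator_handle(&self) -> Arc<RwLock<ForwardEstimatorSketch>> {
        self.connect_forward_estimator.clone()
    }
}
```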
```rust
let current = connection_manager.connection_count();
if current >= connection_manager.max_connections {
    tracing::warn!(
        tx = %tx,
        %peer,
        current_connections = current,
        max_connections = connection_manager.max_connections,
        %loc,
        "connect_peer: rejecting transient promotion to enforce cap"
    );
    callback
        .send_result(Err(()))
        .await
        .inspect_err(|err| {
            tracing::debug!(
                tx = %tx,
                remote = %peer,
                ?err,
                "connect_peer: failed to notify cap-rejection callback"
            );
        })
        .ok();
    return Ok(());
}
```
Copilot AI · Nov 23, 2025
After dropping the transient connection at line 1309, the code calls should_accept at line 1314, which internally inserts into pending_reservations and checks capacity limits. Then at lines 1336-1359, it redundantly checks connection_count() against max_connections again.
This is problematic because:
- The capacity check is duplicated (already done inside `should_accept`)
- If the second check fails (lines 1337-1359), the `pending_reservations` entry added by `should_accept` is never cleaned up, causing a reservation leak
- The transient was already dropped at line 1309, so if either check fails, the transient slot is lost

Consider either:
- Removing the redundant check at lines 1336-1359 and relying on `should_accept`'s internal logic
- Restructuring to check admission before dropping the transient, and cleaning up `pending_reservations` on failure
Suggested change (delete the redundant guard):

```diff
- let current = connection_manager.connection_count();
- if current >= connection_manager.max_connections {
-     tracing::warn!(
-         tx = %tx,
-         %peer,
-         current_connections = current,
-         max_connections = connection_manager.max_connections,
-         %loc,
-         "connect_peer: rejecting transient promotion to enforce cap"
-     );
-     callback
-         .send_result(Err(()))
-         .await
-         .inspect_err(|err| {
-             tracing::debug!(
-                 tx = %tx,
-                 remote = %peer,
-                 ?err,
-                 "connect_peer: failed to notify cap-rejection callback"
-             );
-         })
-         .ok();
-     return Ok(());
- }
```
Good catch on the duplicate guard. The intent was a belt-and-suspenders check on the promotion path that bypasses the normal admission flow, but if that second cap check fires we should clear the reservation to avoid skewing capacity. I’ll tidy this in the follow-up slice so we don’t leak reservations there.
`should_accept` having side effects is bad; that verb does not imply mutation. We should fix it.
Good point. should_accept currently reserves capacity (and can log) to make acceptance atomic, which is why it has side effects. Agreed the naming is misleading; I’ll split this into a pure check plus an explicit reserve call (or rename to reflect reservation) in a follow-up cleanup so callers aren’t surprised.
Agreed—doing it in this PR. I will split the logic into a pure check and an explicit reserve step (or rename to reflect reservation) so callers aren’t surprised and there are no hidden side effects.
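The split described above (a pure check plus an explicit reserve step) can be sketched like this; types and names are hypothetical illustrations of the design, not the actual `ConnectionManager` API:

```rust
use std::collections::HashSet;

/// Hypothetical admission state illustrating the proposed split:
/// `can_accept` is a side-effect-free predicate, `reserve` performs the
/// mutation, so callers compose them explicitly and nothing is hidden.
struct AdmissionSketch {
    max_connections: usize,
    pending: HashSet<String>,
}

impl AdmissionSketch {
    /// Pure check: no side effects, safe to call speculatively.
    fn can_accept(&self, open_connections: usize) -> bool {
        open_connections + self.pending.len() < self.max_connections
    }

    /// Explicit mutation: record the reservation the caller decided to take.
    fn reserve(&mut self, peer_id: &str) {
        self.pending.insert(peer_id.to_string());
    }

    /// Release a reservation if the connection never materializes,
    /// avoiding the leak described in the review comments above.
    fn release(&mut self, peer_id: &str) {
        self.pending.remove(peer_id);
    }
}
```

A caller that needs atomicity would hold the surrounding lock across both calls, which keeps the atomic behavior of the current `should_accept` without the surprising hidden insertion.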
```rust
let current = connection_manager.connection_count();
if current >= connection_manager.max_connections {
    tracing::warn!(
        %peer_id,
        current_connections = current,
        max_connections = connection_manager.max_connections,
        %loc,
        "handle_successful_connection: rejecting new connection to enforce cap"
    );
    return Ok(());
}
```
Copilot AI · Nov 23, 2025
Similar to the issue in `handle_connect_peer`, this code calls `should_accept` at line 1796 (which internally checks capacity and inserts into `pending_reservations`), then redundantly checks capacity again at lines 1805-1815. If the second check fails, the function returns without calling `add_connection`, but the `pending_reservations` entry created by `should_accept` is never cleaned up, causing a reservation leak. The fix should be consistent with the one applied to `handle_connect_peer`: remove the redundant check, or restructure to avoid the reservation leak.
Suggested change (delete the redundant guard):

```diff
- let current = connection_manager.connection_count();
- if current >= connection_manager.max_connections {
-     tracing::warn!(
-         %peer_id,
-         current_connections = current,
-         max_connections = connection_manager.max_connections,
-         %loc,
-         "handle_successful_connection: rejecting new connection to enforce cap"
-     );
-     return Ok(());
- }
```
Same here—extra cap guard after should_accept was meant as a backstop, but if it trips we should drop the reservation so it doesn’t linger. I’ll clean this up alongside the other promotion path tweak to avoid the leak.
Addressed parts of Claude's review:

Pending for #2127: moving the estimator into owned state with reset/bounds, adding size/age caps, and more thorough isotonic integration coverage.
```rust
) -> Option<PeerKeyLocation>;

/// Whether the acceptance should be treated as a short-lived transient link.
fn transient_hint(&self, acceptor: &PeerKeyLocation, joiner: &PeerKeyLocation) -> bool;
```
Is this for downstreams during connect? If so, clarify in the documentation where and how it is used, why it is needed, etc.
Clarified in code: this per-node ConnectForwardEstimator tracks downstream Connect forwarding outcomes (success/fail) so a node can bias future Connect forwards toward peers likely to accept/complete when capacity is scarce. Added doc comments on the estimator explaining what it learns and why.
iduartgomez left a comment:
I don't see the changes pushed, but approved preemptively.
@iduartgomez Follow-up on your review: the per-node connect estimator and associated fixes are now pushed to feat/isotonic-forwarding-pr (commits 48bdf25 and 2f8e7f70). Let me know if you see anything still missing.
…cture

PR #2136 changed ExpectedInboundTracker from `HashMap<IpAddr, Vec<ExpectedInbound>>` to `HashMap<SocketAddr, ExpectedInbound>`. The `transactions_for()` test helper was added in this branch before that change and wasn't updated during the merge.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Problem
Connect forwarding is capacity-blind during join bursts. Gateways/early peers keep routing ConnectRequests to already-full neighbors, leaving joiners stuck with gateway-only links and slow small-world formation.
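The remedy this PR pursues is to score forwarding candidates by learned success probability while keeping a ring-distance bias and recency cooldowns. A hypothetical scorer in that spirit (the real implementation uses the crate's isotonic estimator; every name here is illustrative):

```rust
/// Hypothetical forwarding candidate, combining a learned success estimate
/// with ring-distance bias and a recency cooldown, to illustrate the
/// capacity-aware selection this PR introduces.
struct Candidate {
    ring_distance: f64,   // distance from candidate to join target, in [0, 0.5]
    est_success: f64,     // learned forward-success probability, in [0, 1]
    recently_tried: bool, // recency cooldown flag
}

fn score(c: &Candidate) -> f64 {
    if c.recently_tried {
        return f64::NEG_INFINITY; // cooldown: skip peers we just tried
    }
    // Prefer high estimated success, penalize distance from the target.
    c.est_success - c.ring_distance
}

fn best(cands: &[Candidate]) -> Option<&Candidate> {
    cands
        .iter()
        .filter(|c| score(c).is_finite())
        .max_by(|a, b| score(a).partial_cmp(&score(b)).unwrap())
}
```

The additive weighting between success and distance is an assumption for illustration; the actual trade-off in the PR may differ.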
Solution
Testing
- `cargo test -p freenet --tests --no-run`
- `SOAK_PEER_COUNT=40 SOAK_SNAPSHOT_ITERATIONS=2 SOAK_SNAPSHOT_INTERVAL_SECS=60 cargo test -p freenet --test large_network -- --ignored --nocapture` (all peers healthy; snapshots preserved under the test network run root).

Notes
`fix/transient-connection-tracking`); intended to augment/replace PR 2125 with learned, capacity-aware forwarding.