
fix: ttl-aware reservation cleanup prevents node isolation #3015

Merged
sanity merged 3 commits into main from fix/2888-improve-reservation-cleanup on Feb 16, 2026

Conversation

@netsirius
Collaborator

Problem

When a ConnectOp reserves a gateway slot via should_accept() but the actual connection fails (e.g., NAT traversal failure, blocked acceptor), the pending reservation persists indefinitely. has_connection_or_pending() treats stale reservations as active connections, causing is_not_connected() to filter out the gateway on retry. This creates permanent node isolation — the node believes it's connected to the gateway but has no actual connection.

A previous workaround bypassed is_not_connected() filtering when open_conns == 0, but this was fragile: it only helped fully isolated nodes and ignored partial isolation where some connections exist but a stale reservation blocks reconnection to a specific gateway.

Solution

Replace the open_conns == 0 workaround with TTL-aware reservation visibility. has_connection_or_pending() now ignores reservations older than PENDING_RESERVATION_TTL (60s), making stale entries invisible without requiring explicit cleanup. This handles all isolation scenarios uniformly — both full and partial.
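
To make the mechanism concrete, here is a minimal sketch of the visibility check, assuming illustrative field and type names rather than the exact ones in connection_manager.rs:

use std::collections::BTreeMap;
use std::net::SocketAddr;
use std::time::{Duration, Instant};

/// Illustrative constant; the PR's value is PENDING_RESERVATION_TTL = 60s.
const PENDING_RESERVATION_TTL: Duration = Duration::from_secs(60);

struct ConnectionManagerSketch {
    // addr -> (ring location placeholder, reservation creation time)
    pending_reservations: BTreeMap<SocketAddr, (f64, Instant)>,
    open_connections: Vec<SocketAddr>,
}

impl ConnectionManagerSketch {
    fn has_connection_or_pending(&self, addr: SocketAddr) -> bool {
        if self.open_connections.contains(&addr) {
            return true;
        }
        // A reservation counts only while it is younger than the TTL; stale
        // entries are invisible here even before periodic cleanup removes them.
        self.pending_reservations
            .get(&addr)
            .map(|(_, created)| created.elapsed() <= PENDING_RESERVATION_TTL)
            .unwrap_or(false)
    }
}

The other sketches in this description build on this struct.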

Changes

connection_manager.rs — TTL check in has_connection_or_pending()

  • Reservations older than 60s are treated as expired and ignored
  • inject_reservation() method for test state injection
  • age_reservation test helper for clean TTL manipulation
  • 3 new unit tests: TTL boundary, re-reservation after expiry, cleanup frees slots (helpers and one such test are sketched after this list)
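
A hedged sketch of how the helpers and one TTL-boundary test might fit together, building on the struct above; the real signatures in connection_manager.rs may differ:

impl ConnectionManagerSketch {
    /// Test-only: insert a reservation directly (mirrors the idea behind inject_reservation()).
    fn inject_reservation(&mut self, addr: SocketAddr, location: f64) {
        self.pending_reservations.insert(addr, (location, Instant::now()));
    }

    /// Test-only: rewind a reservation's creation time (mirrors age_reservation).
    fn age_reservation(&mut self, addr: SocketAddr, by: Duration) {
        if let Some((_, created)) = self.pending_reservations.get_mut(&addr) {
            if let Some(earlier) = created.checked_sub(by) {
                *created = earlier;
            }
        }
    }
}

#[test]
fn reservation_becomes_invisible_after_ttl() {
    let mut cm = ConnectionManagerSketch {
        pending_reservations: BTreeMap::new(),
        open_connections: Vec::new(),
    };
    let addr: SocketAddr = "127.0.0.1:31337".parse().unwrap();

    cm.inject_reservation(addr, 0.42);
    assert!(cm.has_connection_or_pending(addr), "fresh reservation is visible");

    // Age the entry just past the 60s TTL instead of sleeping in the test.
    cm.age_reservation(addr, PENDING_RESERVATION_TTL + Duration::from_secs(1));
    assert!(!cm.has_connection_or_pending(addr), "expired reservation is invisible");
}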

connect.rs — Removed open_conns == 0 override

  • is_not_connected() filtering now applies uniformly since TTL handles staleness (see the sketch below)
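
A rough sketch of the candidate filtering this bullet describes, built on the struct above; the function name and shape are illustrative, not taken from connect.rs:

// Previously, a special case skipped this filter when the node had zero open
// connections. It now applies unconditionally, because a stale reservation no
// longer makes has_connection_or_pending() report the gateway as taken.
fn retry_candidates<'a>(
    cm: &'a ConnectionManagerSketch,
    gateways: &'a [SocketAddr],
) -> impl Iterator<Item = &'a SocketAddr> {
    gateways
        .iter()
        .filter(move |addr| !cm.has_connection_or_pending(**addr))
}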

testing_impl.rs — SimNetwork infrastructure for connection state inspection

  • ConnectionManager exposed per node via shared_cm rendezvous pattern
  • has_connection_or_pending(), inject_stale_reservation(), connection_count() accessors (accessor shape sketched below)
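
A loose sketch of the accessor shape, assuming the node label type and registration wiring; the real shared_cm rendezvous in testing_impl.rs is more involved:

use std::collections::HashMap;
use std::sync::Arc;

struct SimNetworkSketch {
    // Each simulated node registers its ConnectionManager here after boot, so
    // tests can inspect live connection state without reaching into node internals.
    node_cms: HashMap<String, Arc<ConnectionManagerSketch>>,
}

impl SimNetworkSketch {
    // None means the node has not registered its ConnectionManager yet.
    fn has_connection_or_pending(&self, node: &str, addr: SocketAddr) -> Option<bool> {
        self.node_cms
            .get(node)
            .map(|cm| cm.has_connection_or_pending(addr))
    }

    fn connection_count(&self, node: &str) -> Option<usize> {
        self.node_cms.get(node).map(|cm| cm.open_connections.len())
    }
}

inject_stale_reservation() would additionally need to write through the ConnectionManager's internal locks, so it is omitted from this sketch.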

simulation_smoke.rs — Integration test

  • test_stale_reservation_ttl_expiry: verifies fresh reservations are visible, expired ones are invisible, and established connections are unaffected

Fixes

Closes #2888

netsirius changed the title from "fix: TTL-aware reservation cleanup prevents node isolation (#2888)" to "fix: ttl-aware reservation cleanup prevents node isolation" on Feb 14, 2026
@netsirius
Collaborator Author

@claude review this PR

@claude

claude Bot commented Feb 14, 2026

Claude finished @netsirius's task in 3m 45s


Code Review ✓

  • Read project guidelines and architecture docs
  • Review core implementation changes
  • Analyze test coverage and quality
  • Check for potential issues and edge cases
  • Provide comprehensive feedback

Summary

This PR effectively solves the stale reservation problem by making expired reservations invisible rather than requiring explicit cleanup. The TTL-aware approach is cleaner and more robust than the previous open_conns == 0 workaround. However, there are a few issues to address before merging.

Critical Issues

1. Race Condition in should_accept() (High Priority)

Location: connection_manager.rs:262-298

Problem: There's a TOCTOU (time-of-check-time-of-use) race between reading reserved_before and inserting the new reservation:

let reserved_before = self.pending_reservations.read().len();  // Line 262
// ... 30+ lines of logic ...
{
    let mut pending = self.pending_reservations.write();
    pending.insert(addr, (location, Instant::now()));  // Line 298
}
let total_conn = match reserved_before.checked_add(1)...  // Line 301

Impact: Under concurrent load, multiple threads could:

  1. Read the same reserved_before count (e.g., 18)
  2. Each calculate total_conn = 18 + 1 + open = 19 < max_connections (20)
  3. All accept, causing the pending reservation count to climb to 21+ (exceeding the limit)

This violates the max_connections invariant and could cause resource exhaustion on gateways.

Fix: Move the reservation insertion before the count read, or hold the write lock for the entire capacity check:

// Option 1: Insert first, then check
{
    let mut pending = self.pending_reservations.write();
    if pending.contains_key(&addr) {
        return true;  // Already reserved/connected
    }
    pending.insert(addr, (location, Instant::now()));
}
let reserved_after = self.pending_reservations.read().len();
let total_conn = reserved_after.checked_add(open)?;
// ... capacity checks, cleanup on rejection ...


2. Stale Reservation Count in Capacity Decisions

Location: connection_manager.rs:262

Problem: reserved_before = self.pending_reservations.read().len() counts all reservations, including expired ones that are invisible to has_connection_or_pending(). This creates an asymmetry:

  • Capacity enforcement: Counts stale reservations (inflates the count)
  • Visibility checks: Ignores stale reservations

Scenario:

  1. Node has 5 stale reservations (age > 60s) + 15 fresh ones = 20 total
  2. should_accept() reads reserved_before = 20
  3. New connection attempt: total_conn = 20 + 1 + 0 = 21 >= max (20) → REJECTED
  4. But from is_not_connected()'s perspective, only 15 are visible (5 are expired)
  5. The node appears to have capacity but rejects connections

Fix: Filter expired reservations when counting:

let now = Instant::now();
let reserved_before = self.pending_reservations.read()
    .iter()
    .filter(|(_, (_, created))| now.duration_since(*created) <= PENDING_RESERVATION_TTL)
    .count();


Medium Priority Issues

3. Inconsistent State After Rejection

Location: connection_manager.rs:314, 372

Problem: When rejection happens, the code removes the reservation from pending_reservations but not from location_for_peer (inserted at line 380 via record_pending_location). However, the current flow calls record_pending_location after the acceptance decision (line 380), so this might not be an issue.

Clarification needed: Does location_for_peer get populated before or after should_accept() returns? If line 380 is the only path, this is fine. If there's early insertion (line 286 check), verify cleanup consistency.

4. Cleanup Timing Uncertainty

Location: ring/mod.rs:1223

Issue: cleanup_stale_reservations() is only called during the maintenance loop (every iteration, but no fixed interval is documented for this loop). This means:

  • Stale reservations might persist in the BTreeMap for an unbounded time
  • Memory usage grows proportionally to the rate of failed connections
  • The 60s TTL is a "soft" expiry (invisible to queries) but not a hard cleanup guarantee

Recommendation: Document the expected cleanup interval in the constant definition:

/// Maximum time a pending reservation can remain before being considered stale.
/// Reservations exceeding this TTL become invisible to `has_connection_or_pending()`
/// and are removed by periodic cleanup (called every maintenance loop iteration,
/// typically every few seconds).
const PENDING_RESERVATION_TTL: Duration = Duration::from_secs(60);

Minor Issues

5. Test Coverage Gaps

Unit tests are excellent, but the integration test (simulation_smoke.rs) could be stronger:

  1. Doesn't test the actual isolation scenario: The test injects stale reservations but doesn't verify that a retry to the same gateway succeeds after TTL expiry. The core bug (#2888) was that retries were blocked — the test should demonstrate this is fixed.

  2. Missing concurrent scenario: No test validates behavior when multiple ConnectOps race to reserve the same gateway.

Suggested addition:

// After injecting stale reservation
assert!(sim.inject_stale_reservation(&gw, phantom_addr, phantom_loc, Duration::from_secs(120)));
assert_eq!(sim.has_connection_or_pending(&gw, phantom_addr), Some(false));

// Verify retry is allowed (the core fix for #2888)
let retry_result = sim.attempt_connect_to(&gw, phantom_addr).await;
assert!(retry_result.is_ok(), "Retry should succeed after TTL expiry");

6. Documentation Clarity

Location: connection_manager.rs:23-25

The comment is good but could clarify the dual mechanism:

/// Maximum time a pending reservation can remain before being considered stale.
/// After this TTL:
/// - `has_connection_or_pending()` treats the reservation as non-existent (immediate effect)
/// - `cleanup_stale_reservations()` removes it from the BTreeMap (periodic cleanup)
/// This prevents permanent node isolation when CONNECT operations fail to complete.
const PENDING_RESERVATION_TTL: Duration = Duration::from_secs(60);

7. Magic Number in Test

Location: simulation_smoke.rs:958

Duration::from_secs(120) should use a constant relative to PENDING_RESERVATION_TTL:

// Expired reservation (age=120s > 60s TTL) is invisible
let expired_age = Duration::from_secs(120);  // 2 × TTL
assert!(sim.inject_stale_reservation(..., expired_age));

Positive Aspects

  • Elegant solution: TTL-aware visibility avoids the fragility of the open_conns == 0 workaround
  • Comprehensive unit tests: Excellent coverage of TTL boundary conditions
  • Proper test infrastructure: inject_reservation() and age_reservation() helpers are well-designed
  • No breaking changes: The fix is backward-compatible
  • Clear commit message: Follows conventional commits

Recommendations

Before merging:

  1. ✅ Fix the race condition in should_accept() (critical)
  2. ✅ Fix stale reservation counting in capacity checks (critical)
  3. ⚠️ Verify location_for_peer cleanup consistency (medium)
  4. ⚠️ Add integration test for retry-after-TTL scenario (medium)
  5. 📝 Improve documentation clarity (minor)

Post-merge follow-up:

  • Consider adding metrics for stale reservation cleanup rate to monitor effectiveness in production

Conclusion

The core idea is sound and solves the problem effectively. The TTL-aware approach is more robust than the previous workaround. However, the race condition and stale count issues must be fixed before merging, as they could cause max_connections violations under load.


@netsirius
Collaborator Author

The race condition in should_accept() is pre-existing (not introduced by this PR). process_message spawns each inbound message as a separate task, so concurrent ConnectOps can read the same reserved_before count and both accept, briefly exceeding max_connections.

What do you think, @claude? This should be handled in a separate issue.

FYI @iduartgomez

@claude

claude Bot commented Feb 14, 2026

Claude finished @netsirius's task in 1m 3s


Analyzing race condition context

  • Check git diff to see what changed in should_accept()
  • Analyze if race condition is pre-existing or new
  • Provide recommendation on separate issue vs blocking

Analysis

You're correct — the race condition is pre-existing and not introduced by this PR. Here's what I found:

Original Code (main branch)

let reserved_before = self.pending_reservations.read().len();  // Line 274
// ... logic ...
{
    let mut pending = self.pending_reservations.write();
    pending.insert(addr, (location, Instant::now()));  // Line 306
}
let total_conn = reserved_before.checked_add(1)...  // Line 310

This PR's Changes

This PR only changed lines 262-267 to add TTL filtering when counting reservations:

let now = Instant::now();
let reserved_before = self
    .pending_reservations
    .read()
    .iter()
    .filter(|(_, (_, created))| now.duration_since(*created) <= PENDING_RESERVATION_TTL)
    .count();

The TOCTOU pattern (read count → insert → calculate) existed before and remains unchanged by this PR.

Recommendation

This should NOT block the PR. Here's why:

  1. Pre-existing issue: The race has been in the codebase since should_accept() was first written. This PR doesn't make it worse.

  2. This PR's scope: The PR solves the stale reservation isolation problem (issue #2888). The race condition is a separate capacity enforcement issue.

  3. Correctness vs robustness trade-off: The race produces a brief overshoot of the capacity limit (a robustness concern) rather than incorrect connection state, so it can be fixed independently.

  4. Testing burden: Fixing the race would require significant concurrency test infrastructure (not present in this PR's scope).

However, I recommend two things:

1. Actually, the TTL filtering does improve the race slightly

My original review incorrectly claimed this PR worsens the race. In fact, the TTL filtering reduces the attack surface:

  • Before this PR: Race window = all pending reservations (including stale ones)
  • After this PR: Race window = only fresh reservations (<60s old)

So if anything, this PR makes a small improvement to the race condition by not counting stale reservations.

2. Create a follow-up issue for the race

I suggest filing a separate issue with this context:

Title: fix: race condition in ConnectionManager::should_accept() allows max_connections violations

Description:

## Problem

`should_accept()` has a TOCTOU race between reading `reserved_before` and inserting the new reservation. Under concurrent `ConnectOp` load (via `process_message` task spawning), multiple threads can:

1. Read the same `reserved_before` count (e.g., 18)
2. Each calculate `total_conn = 18 + 1 + open = 19 < max_connections`
3. All accept, causing `pending_reservations.len()` to exceed `max_connections`

## Impact

- Brief violations of `max_connections` capacity limit on gateways
- Potential resource exhaustion under sustained concurrent connection attempts
- Severity: Medium (affects robustness, not correctness)

## Solution

Move reservation insertion before capacity check, or hold write lock during entire decision:

```rust
// Option 1: Insert first, check after
{
    let mut pending = self.pending_reservations.write();
    if pending.contains_key(&addr) { return true; }
    pending.insert(addr, (location, Instant::now()));
}
let reserved_after = self.pending_reservations.read().len();
let total_conn = reserved_after.checked_add(open)?;
// ... capacity checks, cleanup on rejection ...
```

## Context


Conclusion

Approve the PR — it fixes the critical isolation bug without introducing new issues. The race condition should be tracked separately and fixed in a follow-up PR with proper concurrency tests.

The only remaining critical issue from my original review was #2 (Stale Reservation Count in Capacity Decisions), but re-checking the diff shows this PR already addresses it: the new code at lines 262-267 filters out expired reservations when calculating reserved_before.

Final verdict

The PR is good to merge. The race condition exists but should be handled separately. No blocking issues remain.

@iduartgomez
Collaborator

@netsirius let's open an issue for the race condition before merging, and then we can merge this if it is good

@iduartgomez
Collaborator

If we can have the suggested sim network tests, that would be good.

@netsirius
Collaborator Author

@iduartgomez I just added sim network integration tests for the retry-after-TTL scenario

Two integration tests added to simulation_smoke.rs with supporting methods on SimNetwork:

  • test_stale_reservation_allows_gateway_retry — Verifies the core #2888 fix end-to-end: injects a fresh reservation (blocks retry), ages it past the 60s TTL (becomes invisible), then confirms should_accept() admits the retry from the same address (see the sketch after this list)
  • test_stale_reservation_cleanup_frees_capacity — Verifies that expired reservations don't consume capacity slots and that cleanup_stale_reservations() removes them from the map
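
Roughly, the first test exercises the shape below; this should_accept is a simplified stand-in built on the earlier sketches, not the real capacity logic in connection_manager.rs:

impl ConnectionManagerSketch {
    /// Simplified stand-in: reject while a live reservation or connection exists,
    /// otherwise reserve the slot and accept. The real method also checks capacity.
    fn should_accept(&mut self, addr: SocketAddr, location: f64) -> bool {
        if self.has_connection_or_pending(addr) {
            return false;
        }
        self.pending_reservations.insert(addr, (location, Instant::now()));
        true
    }
}

#[test]
fn retry_admitted_after_ttl_expiry() {
    let mut cm = ConnectionManagerSketch {
        pending_reservations: BTreeMap::new(),
        open_connections: Vec::new(),
    };
    let peer: SocketAddr = "127.0.0.1:40001".parse().unwrap();

    assert!(cm.should_accept(peer, 0.1), "first attempt reserves the slot");
    assert!(!cm.should_accept(peer, 0.1), "a fresh reservation blocks the immediate retry");

    // Past the TTL the reservation becomes invisible, so the retry goes through.
    cm.age_reservation(peer, PENDING_RESERVATION_TTL + Duration::from_secs(1));
    assert!(cm.should_accept(peer, 0.1), "retry is admitted once the reservation expired");
}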

iduartgomez force-pushed the fix/2888-improve-reservation-cleanup branch from f7329c6 to 4aadbac on February 16, 2026 09:32
sanity added this pull request to the merge queue Feb 16, 2026
Merged via the queue into main with commit f7efc2c Feb 16, 2026
26 of 31 checks passed
sanity deleted the fix/2888-improve-reservation-cleanup branch February 16, 2026 20:24