
fix: ttl-aware reservation cleanup prevents node isolation #3015

Merged
sanity merged 3 commits into main from fix/2888-improve-reservation-cleanup on Feb 16, 2026

Conversation

@netsirius
Collaborator

Problem

When a ConnectOp reserves a gateway slot via should_accept() but the actual connection fails (e.g., NAT traversal failure, blocked acceptor), the pending reservation persists indefinitely. has_connection_or_pending() treats stale reservations as active connections, causing is_not_connected() to filter out the gateway on retry. This creates permanent node isolation — the node believes it's connected to the gateway but has no actual connection.

A previous workaround bypassed is_not_connected() filtering when open_conns == 0, but this was fragile: it only helped fully isolated nodes and ignored partial isolation where some connections exist but a stale reservation blocks reconnection to a specific gateway.

Solution

Replace the open_conns == 0 workaround with TTL-aware reservation visibility. has_connection_or_pending() now ignores reservations older than PENDING_RESERVATION_TTL (60s), making stale entries invisible without requiring explicit cleanup. This handles all isolation scenarios uniformly — both full and partial.
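
To make the mechanism concrete, here is a minimal sketch of the visibility check, assuming illustrative field and type names rather than the exact ones in connection_manager.rs:

use std::collections::BTreeMap;
use std::net::SocketAddr;
use std::time::{Duration, Instant};

/// Illustrative constant; the PR's value is PENDING_RESERVATION_TTL = 60s.
const PENDING_RESERVATION_TTL: Duration = Duration::from_secs(60);

struct ConnectionManagerSketch {
    // addr -> (ring location placeholder, reservation creation time)
    pending_reservations: BTreeMap<SocketAddr, (f64, Instant)>,
    open_connections: Vec<SocketAddr>,
}

impl ConnectionManagerSketch {
    fn has_connection_or_pending(&self, addr: SocketAddr) -> bool {
        if self.open_connections.contains(&addr) {
            return true;
        }
        // A reservation counts only while it is younger than the TTL; stale
        // entries are invisible here even before periodic cleanup removes them.
        self.pending_reservations
            .get(&addr)
            .map(|(_, created)| created.elapsed() <= PENDING_RESERVATION_TTL)
            .unwrap_or(false)
    }
}

The other sketches in this description build on this struct.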

Changes

connection_manager.rs — TTL check in has_connection_or_pending()

  • Reservations older than 60s are treated as expired and ignored
  • inject_reservation() method for test state injection
  • age_reservation test helper for clean TTL manipulation
  • 3 new unit tests: TTL boundary, re-reservation after expiry, cleanup frees slots (helpers and one such test are sketched after this list)
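
A hedged sketch of how the helpers and one TTL-boundary test might fit together, building on the struct above; the real signatures in connection_manager.rs may differ:

impl ConnectionManagerSketch {
    /// Test-only: insert a reservation directly (mirrors the idea behind inject_reservation()).
    fn inject_reservation(&mut self, addr: SocketAddr, location: f64) {
        self.pending_reservations.insert(addr, (location, Instant::now()));
    }

    /// Test-only: rewind a reservation's creation time (mirrors age_reservation).
    fn age_reservation(&mut self, addr: SocketAddr, by: Duration) {
        if let Some((_, created)) = self.pending_reservations.get_mut(&addr) {
            if let Some(earlier) = created.checked_sub(by) {
                *created = earlier;
            }
        }
    }
}

#[test]
fn reservation_becomes_invisible_after_ttl() {
    let mut cm = ConnectionManagerSketch {
        pending_reservations: BTreeMap::new(),
        open_connections: Vec::new(),
    };
    let addr: SocketAddr = "127.0.0.1:31337".parse().unwrap();

    cm.inject_reservation(addr, 0.42);
    assert!(cm.has_connection_or_pending(addr), "fresh reservation is visible");

    // Age the entry just past the 60s TTL instead of sleeping in the test.
    cm.age_reservation(addr, PENDING_RESERVATION_TTL + Duration::from_secs(1));
    assert!(!cm.has_connection_or_pending(addr), "expired reservation is invisible");
}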

connect.rs — Removed open_conns == 0 override

  • is_not_connected() filtering now applies uniformly since TTL handles staleness (see the sketch below)
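
A rough sketch of the candidate filtering this bullet describes, built on the struct above; the function name and shape are illustrative, not taken from connect.rs:

// Previously, a special case skipped this filter when the node had zero open
// connections. It now applies unconditionally, because a stale reservation no
// longer makes has_connection_or_pending() report the gateway as taken.
fn retry_candidates<'a>(
    cm: &'a ConnectionManagerSketch,
    gateways: &'a [SocketAddr],
) -> impl Iterator<Item = &'a SocketAddr> {
    gateways
        .iter()
        .filter(move |addr| !cm.has_connection_or_pending(**addr))
}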

testing_impl.rs — SimNetwork infrastructure for connection state inspection

  • ConnectionManager exposed per node via shared_cm rendezvous pattern
  • has_connection_or_pending(), inject_stale_reservation(), connection_count() accessors (accessor shape sketched below)
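
A loose sketch of the accessor shape, assuming the node label type and registration wiring; the real shared_cm rendezvous in testing_impl.rs is more involved:

use std::collections::HashMap;
use std::sync::Arc;

struct SimNetworkSketch {
    // Each simulated node registers its ConnectionManager here after boot, so
    // tests can inspect live connection state without reaching into node internals.
    node_cms: HashMap<String, Arc<ConnectionManagerSketch>>,
}

impl SimNetworkSketch {
    // None means the node has not registered its ConnectionManager yet.
    fn has_connection_or_pending(&self, node: &str, addr: SocketAddr) -> Option<bool> {
        self.node_cms
            .get(node)
            .map(|cm| cm.has_connection_or_pending(addr))
    }

    fn connection_count(&self, node: &str) -> Option<usize> {
        self.node_cms.get(node).map(|cm| cm.open_connections.len())
    }
}

inject_stale_reservation() would additionally need to write through the ConnectionManager's internal locks, so it is omitted from this sketch.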

simulation_smoke.rs — Integration test

  • test_stale_reservation_ttl_expiry: verifies fresh reservations are visible, expired ones are invisible, and established connections are unaffected

Fixes

Closes #2888

netsirius changed the title from "fix: TTL-aware reservation cleanup prevents node isolation (#2888)" to "fix: ttl-aware reservation cleanup prevents node isolation" on Feb 14, 2026
@netsirius
Collaborator Author

@claude review this PR

@claude

claude Bot commented Feb 14, 2026

Claude finished @netsirius's task in 3m 45s


Code Review ✓

  • Read project guidelines and architecture docs
  • Review core implementation changes
  • Analyze test coverage and quality
  • Check for potential issues and edge cases
  • Provide comprehensive feedback

Summary

This PR effectively solves the stale reservation problem by making expired reservations invisible rather than requiring explicit cleanup. The TTL-aware approach is cleaner and more robust than the previous open_conns == 0 workaround. However, there are a few issues to address before merging.

Critical Issues

1. Race Condition in should_accept() (High Priority)

Location: connection_manager.rs:262-298

Problem: There's a TOCTOU (time-of-check-time-of-use) race between reading reserved_before and inserting the new reservation:

let reserved_before = self.pending_reservations.read().len();  // Line 262
// ... 30+ lines of logic ...
{
    let mut pending = self.pending_reservations.write();
    pending.insert(addr, (location, Instant::now()));  // Line 298
}
let total_conn = match reserved_before.checked_add(1)...  // Line 301

Impact: Under concurrent load, multiple threads could:

  1. Read the same reserved_before count (e.g., 18)
  2. Each calculate total_conn = 18 + 1 + open = 19 < max_connections (20)
  3. All accept, causing the pending reservation count to climb to 21+ (exceeding the limit)

This violates the max_connections invariant and could cause resource exhaustion on gateways.

Fix: Move the reservation insertion before the count read, or hold the write lock for the entire capacity check:

// Option 1: Insert first, then check
{
    let mut pending = self.pending_reservations.write();
    if pending.contains_key(&addr) {
        return true;  // Already reserved/connected
    }
    pending.insert(addr, (location, Instant::now()));
}
let reserved_after = self.pending_reservations.read().len();
let total_conn = reserved_after.checked_add(open)?;
// ... capacity checks, cleanup on rejection ...


2. Stale Reservation Count in Capacity Decisions

Location: connection_manager.rs:262

Problem: reserved_before = self.pending_reservations.read().len() counts all reservations, including expired ones that are invisible to has_connection_or_pending(). This creates an asymmetry:

  • Capacity enforcement: Counts stale reservations (inflates the count)
  • Visibility checks: Ignores stale reservations

Scenario:

  1. Node has 5 stale reservations (age > 60s) + 15 fresh ones = 20 total
  2. should_accept() reads reserved_before = 20
  3. New connection attempt: total_conn = 20 + 1 + 0 = 21 >= max (20) → REJECTED
  4. But from is_not_connected()'s perspective, only 15 are visible (5 are expired)
  5. The node appears to have capacity but rejects connections

Fix: Filter expired reservations when counting:

let now = Instant::now();
let reserved_before = self.pending_reservations.read()
    .iter()
    .filter(|(_, (_, created))| now.duration_since(*created) <= PENDING_RESERVATION_TTL)
    .count();


Medium Priority Issues

3. Inconsistent State After Rejection

Location: connection_manager.rs:314, 372

Problem: When rejection happens, the code removes the reservation from pending_reservations but not from location_for_peer (inserted at line 380 via record_pending_location). However, the current flow calls record_pending_location after the acceptance decision (line 380), so this might not be an issue.

Clarification needed: Does location_for_peer get populated before or after should_accept() returns? If line 380 is the only path, this is fine. If there's early insertion (line 286 check), verify cleanup consistency.

4. Cleanup Timing Uncertainty

Location: ring/mod.rs:1223

Issue: cleanup_stale_reservations() is only called during the maintenance loop (every iteration, but no fixed interval is documented for this loop). This means:

  • Stale reservations might persist in the BTreeMap for an unbounded time
  • Memory usage grows proportionally to the rate of failed connections
  • The 60s TTL is a "soft" expiry (invisible to queries) but not a hard cleanup guarantee

Recommendation: Document the expected cleanup interval in the constant definition:

/// Maximum time a pending reservation can remain before being considered stale.
/// Reservations exceeding this TTL become invisible to `has_connection_or_pending()`
/// and are removed by periodic cleanup (called every maintenance loop iteration,
/// typically every few seconds).
const PENDING_RESERVATION_TTL: Duration = Duration::from_secs(60);

Minor Issues

5. Test Coverage Gaps

Unit tests are excellent, but the integration test (simulation_smoke.rs) could be stronger:

  1. Doesn't test the actual isolation scenario: The test injects stale reservations but doesn't verify that a retry to the same gateway succeeds after TTL expiry. The core bug (#2888) was that retries were blocked — the test should demonstrate this is fixed.

  2. Missing concurrent scenario: No test validates behavior when multiple ConnectOps race to reserve the same gateway.

Suggested addition:

// After injecting stale reservation
assert!(sim.inject_stale_reservation(&gw, phantom_addr, phantom_loc, Duration::from_secs(120)));
assert_eq!(sim.has_connection_or_pending(&gw, phantom_addr), Some(false));

// Verify retry is allowed (the core fix for #2888)
let retry_result = sim.attempt_connect_to(&gw, phantom_addr).await;
assert!(retry_result.is_ok(), "Retry should succeed after TTL expiry");

6. Documentation Clarity

Location: connection_manager.rs:23-25

The comment is good but could clarify the dual mechanism:

/// Maximum time a pending reservation can remain before being considered stale.
/// After this TTL:
/// - `has_connection_or_pending()` treats the reservation as non-existent (immediate effect)
/// - `cleanup_stale_reservations()` removes it from the BTreeMap (periodic cleanup)
/// This prevents permanent node isolation when CONNECT operations fail to complete.
const PENDING_RESERVATION_TTL: Duration = Duration::from_secs(60);

7. Magic Number in Test

Location: simulation_smoke.rs:958

Duration::from_secs(120) should use a constant relative to PENDING_RESERVATION_TTL:

// Expired reservation (age=120s > 60s TTL) is invisible
let expired_age = Duration::from_secs(120);  // 2 × TTL
assert!(sim.inject_stale_reservation(..., expired_age));

Positive Aspects

  • Elegant solution: TTL-aware visibility avoids the fragility of the open_conns == 0 workaround
  • Comprehensive unit tests: Excellent coverage of TTL boundary conditions
  • Proper test infrastructure: inject_reservation() and age_reservation() helpers are well-designed
  • No breaking changes: The fix is backward-compatible
  • Clear commit message: Follows conventional commits

Recommendations

Before merging:

  1. ✅ Fix the race condition in should_accept() (critical)
  2. ✅ Fix stale reservation counting in capacity checks (critical)
  3. ⚠️ Verify location_for_peer cleanup consistency (medium)
  4. ⚠️ Add integration test for retry-after-TTL scenario (medium)
  5. 📝 Improve documentation clarity (minor)

Post-merge follow-up:

  • Consider adding metrics for stale reservation cleanup rate to monitor effectiveness in production

Conclusion

The core idea is sound and solves the problem effectively. The TTL-aware approach is more robust than the previous workaround. However, the race condition and stale count issues must be fixed before merging, as they could cause max_connections violations under load.


@netsirius
Collaborator Author

The race condition in should_accept() is pre-existing (not introduced by this PR). process_message spawns each inbound message as a separate task, so concurrent ConnectOps can read the same reserved_before count and both accept, briefly exceeding max_connections.

What do you think, @claude? This should be handled in a separate issue.

FYI @iduartgomez

@claude

claude Bot commented Feb 14, 2026

Claude finished @netsirius's task in 1m 3s


Analyzing race condition context

  • Check git diff to see what changed in should_accept()
  • Analyze if race condition is pre-existing or new
  • Provide recommendation on separate issue vs blocking

Analysis

You're correct — the race condition is pre-existing and not introduced by this PR. Here's what I found:

Original Code (main branch)

let reserved_before = self.pending_reservations.read().len();  // Line 274
// ... logic ...
{
    let mut pending = self.pending_reservations.write();
    pending.insert(addr, (location, Instant::now()));  // Line 306
}
let total_conn = reserved_before.checked_add(1)...  // Line 310

This PR's Changes

This PR only changed lines 262-267 to add TTL filtering when counting reservations:

let now = Instant::now();
let reserved_before = self
    .pending_reservations
    .read()
    .iter()
    .filter(|(_, (_, created))| now.duration_since(*created) <= PENDING_RESERVATION_TTL)
    .count();

The TOCTOU pattern (read count → insert → calculate) existed before and remains unchanged by this PR.

Recommendation

This should NOT block the PR. Here's why:

  1. Pre-existing issue: The race has been in the codebase since should_accept() was first written. This PR doesn't make it worse.

  2. This PR's scope: The PR solves the stale reservation isolation problem (issue #2888). The race condition is a separate capacity enforcement issue.

  3. Correctness vs robustness trade-off: The race produces a brief overshoot of the capacity limit (a robustness concern) rather than incorrect connection state, so it can be fixed independently.

  4. Testing burden: Fixing the race would require significant concurrency test infrastructure (not present in this PR's scope).

However, I recommend two things:

1. Actually, the TTL filtering does improve the race slightly

My original review incorrectly claimed this PR worsens the race. In fact, the TTL filtering reduces the attack surface:

  • Before this PR: Race window = all pending reservations (including stale ones)
  • After this PR: Race window = only fresh reservations (<60s old)

So if anything, this PR makes a small improvement to the race condition by not counting stale reservations.

2. Create a follow-up issue for the race

I suggest filing a separate issue with this context:

Title: fix: race condition in ConnectionManager::should_accept() allows max_connections violations

Description:

## Problem

`should_accept()` has a TOCTOU race between reading `reserved_before` and inserting the new reservation. Under concurrent `ConnectOp` load (via `process_message` task spawning), multiple threads can:

1. Read the same `reserved_before` count (e.g., 18)
2. Each calculate `total_conn = 18 + 1 + open = 19 < max_connections`
3. All accept, causing `pending_reservations.len()` to exceed `max_connections`

## Impact

- Brief violations of `max_connections` capacity limit on gateways
- Potential resource exhaustion under sustained concurrent connection attempts
- Severity: Medium (affects robustness, not correctness)

## Solution

Move reservation insertion before capacity check, or hold write lock during entire decision:

```rust
// Option 1: Insert first, check after
{
    let mut pending = self.pending_reservations.write();
    if pending.contains_key(&addr) { return true; }
    pending.insert(addr, (location, Instant::now()));
}
let reserved_after = self.pending_reservations.read().len();
let total_conn = reserved_after.checked_add(open)?;
// ... capacity checks, cleanup on rejection ...
```

## Context


Conclusion

Approve the PR — it fixes the critical isolation bug without introducing new issues. The race condition should be tracked separately and fixed in a follow-up PR with proper concurrency tests.

The only remaining critical issue from my original review was #2 (Stale Reservation Count in Capacity Decisions), but re-checking the diff shows this PR already addresses it: the new code at lines 262-267 filters out expired reservations when calculating reserved_before.

Final verdict

The PR is good to merge. The race condition exists but should be handled separately. No blocking issues remain.

@iduartgomez
Collaborator

@netsirius let's open an issue for the race condition before merging, and then we can merge this if it is good

@iduartgomez
Collaborator

If we can have the suggested sim network tests, that would be good.

@netsirius
Collaborator Author

@iduartgomez I just added sim network integration tests for the retry-after-TTL scenario

Two integration tests added to simulation_smoke.rs with supporting methods on SimNetwork:

  • test_stale_reservation_allows_gateway_retry — Verifies the core #2888 fix end-to-end: injects a fresh reservation (blocks retry), ages it past the 60s TTL (becomes invisible), then confirms should_accept() admits the retry from the same address (see the sketch after this list)
  • test_stale_reservation_cleanup_frees_capacity — Verifies that expired reservations don't consume capacity slots and that cleanup_stale_reservations() removes them from the map
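
Roughly, the first test exercises the shape below; this should_accept is a simplified stand-in built on the earlier sketches, not the real capacity logic in connection_manager.rs:

impl ConnectionManagerSketch {
    /// Simplified stand-in: reject while a live reservation or connection exists,
    /// otherwise reserve the slot and accept. The real method also checks capacity.
    fn should_accept(&mut self, addr: SocketAddr, location: f64) -> bool {
        if self.has_connection_or_pending(addr) {
            return false;
        }
        self.pending_reservations.insert(addr, (location, Instant::now()));
        true
    }
}

#[test]
fn retry_admitted_after_ttl_expiry() {
    let mut cm = ConnectionManagerSketch {
        pending_reservations: BTreeMap::new(),
        open_connections: Vec::new(),
    };
    let peer: SocketAddr = "127.0.0.1:40001".parse().unwrap();

    assert!(cm.should_accept(peer, 0.1), "first attempt reserves the slot");
    assert!(!cm.should_accept(peer, 0.1), "a fresh reservation blocks the immediate retry");

    // Past the TTL the reservation becomes invisible, so the retry goes through.
    cm.age_reservation(peer, PENDING_RESERVATION_TTL + Duration::from_secs(1));
    assert!(cm.should_accept(peer, 0.1), "retry is admitted once the reservation expired");
}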

iduartgomez force-pushed the fix/2888-improve-reservation-cleanup branch from f7329c6 to 4aadbac on February 16, 2026 09:32
sanity added this pull request to the merge queue Feb 16, 2026
Merged via the queue into main with commit f7efc2c Feb 16, 2026
26 of 31 checks passed
sanity deleted the fix/2888-improve-reservation-cleanup branch February 16, 2026 20:24