Skip to content

Conversation

@sanity
Copy link
Collaborator

@sanity sanity commented Nov 22, 2025

Problem

Transient caps and inbound identity handling had regressions: promotion paths could exceed caps and gateways could register inbound peers under placeholder identities. We also lacked targeted diagnostics around neighbor selection and a small-cap repro harness.

This change

  • Add instrumentation of neighbor candidates/live tx tracking to help debug connect maintenance.
  • Enforce transient promotion caps and add a regression harness for small cap scenarios.
  • Add gateway inbound identity regression test to ensure inbound peers are registered with real IDs.

Testing

  • cargo check -p freenet (local).

@sanity sanity marked this pull request as draft November 22, 2025 05:34
@freenet freenet deleted a comment from claude bot Nov 23, 2025
@sanity
Copy link
Collaborator Author

sanity commented Nov 23, 2025

@claude Please carefully review this PR, both high-level and the details.

@claude
Copy link

claude bot commented Nov 23, 2025

Claude finished @sanity's task —— View job


PR Review Complete ✅

I've conducted a thorough review of this PR, examining both the high-level architecture and implementation details. Here's my assessment:


High-Level Assessment

Problem Statement: Clear and well-articulated. The PR addresses two critical regressions:

  1. Transient connections could be promoted beyond max connection caps
  2. Gateways registered inbound peers under placeholder identities instead of real IDs

Solution Approach: Sound. The changes enforce cap checks during transient promotion and add targeted regression tests to prevent recurrence.


Detailed Code Review

connection_manager.rs - Major Refactoring (163 additions, 273 deletions)

Strengths:

  • Simplified state tracking: Replaced separate AtomicUsize counters (open_connections, reserved_connections) with a single source of truth via BTreeMap lengths. This eliminates potential race conditions and counter drift.
  • Unified location tracking: Merged pending_locations into location_for_peer, making pending peers immediately discoverable for routing. Good architectural simplification.
  • Cap enforcement at promotion: Lines 436-449 now enforce the global cap when adding connections, preventing the regression where transient promotions bypassed limits.
  • Better idempotency: Lines 172-176 now treat duplicate should_accept() calls as no-ops rather than rejecting them, which is more robust for network retries.

Concerns:

  1. Potential race in cap enforcement (lines 436-449):

    if previous_location.is_none() && self.connection_count() >= self.max_connections {
        // ... reject and rollback
    }

    Between the connection_count() check and the rollback, another thread could add a connection. Consider using a write lock held across both the check and the insert/rollback to make this atomic.

  2. Double reservation cleanup (lines 422-429):

    if was_reserved {
        let old = self.pending_reservations.write().remove(&peer);
        if old.is_none() {
            tracing::warn!(%peer, "add_connection: expected pending reservation missing");
        }
    }
    if was_reserved {  // ← Duplicate check here
        self.pending_reservations.write().remove(&peer);
    }

    This appears to be redundant code. The second if was_reserved block (lines 428-430) repeats the same removal. Should remove one.

  3. Removed unit tests: The refactoring deleted two unit tests:

    • pending_connections_hidden_from_known_locations
    • should_accept_does_not_leak_reservations_for_duplicate_peer

    These tested important invariants. While the behavior changed (pending peers are now in location_for_peer), consider adding equivalent tests for the new semantics.

p2p_protoc.rs - Transient Promotion Guards

Strengths:

  • Defense in depth: Lines 1293-1340 add both admission logic (should_accept()) AND a direct cap check before promotion. This ensures no bypass path exists.
  • Clear error handling: Proper callback notifications when promotion is rejected.
  • Instrumentation: Enhanced logging helps diagnose cap violations in production.

Minor issue:

  • Line 1348: was_reserved parameter is now true, which is correct since should_accept() now reserves the slot. However, this assumes should_accept() was called—if there's any code path that promotes a transient without calling should_accept() first, this would be incorrect. Verify all promotion paths call should_accept().

New Regression Tests

connection_cap.rs:

  • Simple, focused test that spins up a small network and verifies no peer exceeds max_connections
  • Good use of freenet-test-network builder API
  • Suggestion: Consider lowering max_connections in the test config rather than using the default. The comment mentions "low max-connections setting (min=5, max=6)" but the code uses DEFAULT_MAX_CONNECTIONS which is likely much higher. This would make the test more sensitive to regressions.

gateway_inbound_identity.rs:

  • Well-documented test purpose and historical context
  • Correctly verifies that gateway connections reference real peer IDs, not the gateway itself
  • Minor: Line 46 assertion message could be clearer about what it means for a connection to "reference remote peers, not itself"

Instrumentation Changes

live_tx.rs:

  • Added len() method for debugging/telemetry. Simple and safe.

ring/mod.rs:

  • Enhanced maintenance logging to include candidates and live_tx_peers counts
  • Helpful for diagnosing connection state issues

Potential Issues & Questions

  1. Race condition in add_connection(): The cap enforcement check at line 436 is not atomic with the insertion. Could lead to temporary cap violations under concurrent load.

  2. Duplicate cleanup code: Lines 422-429 contain redundant pending_reservations.write().remove(&peer) calls.

  3. Missing test coverage: The deleted unit tests covered important edge cases. While integration tests now exist, consider restoring unit tests for:

    • Idempotent should_accept() behavior
    • Reservation tracking invariants
    • Cap enforcement under concurrent adds
  4. Transient promotion assumption: The was_reserved=true parameter in p2p_protoc.rs:1348 assumes should_accept() was called. Is this guaranteed for all promotion code paths?

  5. Test configuration mismatch: connection_cap.rs comments mention "min=5, max=6" but uses DEFAULT_MAX_CONNECTIONS. Should the test override this to create a tighter constraint?


Testing Assessment

Testing claimed: "cargo check -p freenet (local)"

Concern: cargo check only verifies compilation, not correctness. For a PR addressing regressions with new tests, please run:

cargo test -p freenet-core --test connection_cap
cargo test -p freenet-core --test gateway_inbound_identity
cargo test -p freenet-core  # Full suite

Recommendations

Before Merge:

  1. ✅ Remove duplicate pending_reservations.write().remove(&peer) at lines 428-430 in connection_manager.rs
  2. ⚠️ Address the race condition in add_connection() cap enforcement (consider atomic check-and-insert pattern)
  3. ⚠️ Run full test suite, not just cargo check
  4. 💡 Consider adding unit tests for the refactored connection manager invariants

Post-Merge:

  1. Monitor production logs for "add_connection: rejecting new connection to enforce cap" warnings to verify the fix works as expected
  2. Consider adding metrics for transient promotion success/failure rates

Verdict

Architecturally sound with good test coverage for the regressions. The refactoring simplifies state management and closes the bypass paths. However, there are implementation details (race condition, duplicate code) that should be addressed before merge.

The PR description could be slightly improved per AGENTS.md guidelines—it lists what changed but could better explain why the previous approach (separate counters, separate pending map) was inadequate and what insight led to unifying them.

Recommendation: Request changes to fix the duplicate cleanup code and race condition, then approve after verification.


@sanity sanity marked this pull request as ready for review November 23, 2025 02:50
@sanity
Copy link
Collaborator Author

sanity commented Nov 23, 2025

Thanks for the review. The concerns you raised (race in add_connection, duplicate reservation cleanup, restoring unit tests) are addressed in the later slices of this stack—trying to keep this PR narrowly scoped so we don’t churn the stack order. If we need the duplicate cleanup fix here too, I can backport, but it’s already covered in the follow-ups.

@sanity sanity force-pushed the fix/transient-split-3 branch from 70213aa to 7764539 Compare November 23, 2025 21:22
@sanity sanity enabled auto-merge November 23, 2025 22:03
@sanity sanity added this pull request to the merge queue Nov 23, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 23, 2025
@sanity sanity enabled auto-merge November 23, 2025 22:44
@sanity sanity added this pull request to the merge queue Nov 23, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 23, 2025
@sanity sanity added this pull request to the merge queue Nov 23, 2025
Merged via the queue into main with commit 134b3b3 Nov 24, 2025
11 checks passed
@sanity sanity deleted the fix/transient-split-3 branch November 24, 2025 00:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants