Skip to content

Refactor transient connections into dedicated manager #2127

@sanity

Description

@sanity

Problem

Transient handling spans connection_manager, p2p_protoc, and routing code. Budgeting, reservation, promotion, and cleanup are scattered across multiple maps and counters, making it hard to reason about correctness and concurrency. Nacho called out in Matrix that this muddiness makes concurrency issues hard to spot; AI reviewers also struggle to follow the lifecycle.

Specific pain points:

  • Budget and lifecycle logic (reserve, promote, drop, expire) are spread between connection_manager and the bridge.
  • Promotion requires touching several data structures manually (pending reservations, transient map, connections_by_location), which is error-prone.
  • TTL enforcement is anchored in the bridge task rather than encapsulated.
  • Admission idempotency/race handling is duplicated.

Proposal

Introduce a dedicated TransientConnectionManager that owns transient lifecycle and budget.

  • API surface: try_reserve(peer, location) -> bool, promote(peer, location) -> Result<(), PromoteError>, drop(peer), expire_due() or internal cleanup task, is_transient(peer), count(), budget().
  • Encapsulate map + counter and enforce atomicity inside the manager (no cross-module map/counter mutations).
  • ConnectionManager holds a reference and delegates transient operations; routing/topology use a single source of truth for transient state.
  • Promotion path: bridge calls drop_transient/promote via the manager; add_connection only handles established peers and does not need transient bookkeeping knowledge.
  • TTL: transient manager owns timers/cleanup (single task or timer-wheel) instead of ad-hoc tasks in the bridge.
  • Idempotency: manager handles duplicate admissions/promotions centrally to reduce TOCTOU races.

Rationale

  • Clear separation between transient and established connections should reduce concurrency bugs and make reviews/maintenance easier.
  • Limits the number of places that mutate transient state, making invariants testable.
  • Makes it easier to add targeted tests (unit + integration) for transient lifecycle without threading through the bridge.

Isotonic / forwarding follow-up

  • Keep the global isotonic regression (shared curve across peers) but bound/reset it so it doesn’t grow unbounded and can adapt. Per-peer offsets stay layered on the shared curve.
  • Add a simple recency bias that’s easy to reason about (e.g., short cooldown or small recent-window weighting) to avoid overfitting to stale data; keep it minimal to avoid hard-to-debug behaviors.
  • Move storage out of the ad-hoc global static into a controlled owner (even if process-global) with explicit reset/bounds.

Questions / Decisions

  • Should promotion be synchronous (returns Result) or enqueue a promotion task? (lean: synchronous to keep invariants tight)
  • Do we need metrics/hooks for transient events (reserve, promote, drop, expire) as part of the refactor?
  • TTL enforcement strategy: per-connection timer vs. periodic sweep.
  • Recency handling: what’s the minimal cooldown/decay that still prevents hammering the same neighbor?

Next Steps

  • Sketch API in a small RFC/PRD comment thread on this issue.
  • Draft refactor in a follow-up PR (after current transient stack) with focused tests:
    • Concurrent reserve/promote/drop stress
    • TTL expiry cleanup
    • Promotion idempotency and cap enforcement
    • Estimator bounds/reset + minimal recency bias
  • Wire bridge and connection_manager to use the new manager, minimize surface changes elsewhere.

CC: @iduartgomez for architectural alignment.

Metadata

Metadata

Assignees

No one assigned

    Labels

    A-developer-xpArea: developer experienceA-networkingArea: Networking, ring protocol, peer discoveryE-hardExperience needed to fix/implement: Hard / a lotS-needs-designStatus: Needs architectural design or RFCT-enhancementType: Improvement to existing functionality

    Type

    No type

    Projects

    Status

    Triage

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions