Labels: A-developer-xp (Area: developer experience), A-networking (Area: Networking, ring protocol, peer discovery), E-hard (Experience needed to fix/implement: Hard / a lot), S-needs-design (Status: Needs architectural design or RFC), T-enhancement (Type: Improvement to existing functionality)
Description
Problem
Transient handling spans connection_manager, p2p_protoc, and routing code. Budgeting, reservation, promotion, and cleanup are scattered across multiple maps and counters, making it hard to reason about correctness and concurrency. Nacho called out in Matrix that this muddiness makes concurrency issues hard to spot; AI reviewers also struggle to follow the lifecycle.
Specific pain points:
- Budget and lifecycle logic (reserve, promote, drop, expire) are spread between connection_manager and the bridge.
- Promotion requires touching several data structures manually (pending reservations, transient map, connections_by_location), which is error-prone.
- TTL enforcement is anchored in the bridge task rather than encapsulated.
- Admission idempotency/race handling is duplicated.
Proposal
Introduce a dedicated TransientConnectionManager that owns transient lifecycle and budget.
- API surface: `try_reserve(peer, location) -> bool`, `promote(peer, location) -> Result<(), PromoteError>`, `drop(peer)`, `expire_due()` (or an internal cleanup task), `is_transient(peer)`, `count()`, `budget()`.
- Encapsulate the map + counter and enforce atomicity inside the manager (no cross-module map/counter mutations).
- ConnectionManager holds a reference and delegates transient operations; routing/topology use a single source of truth for transient state.
- Promotion path: the bridge calls `drop_transient`/`promote` via the manager; `add_connection` only handles established peers and needs no transient bookkeeping knowledge.
- TTL: the transient manager owns timers/cleanup (a single task or timer wheel) instead of ad-hoc tasks in the bridge.
- Idempotency: manager handles duplicate admissions/promotions centrally to reduce TOCTOU races.
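The API surface above could be sketched roughly as follows. This is a minimal illustration, not the real implementation: `PeerId` and `Location` are placeholder types standing in for the crate's actual ones, the TTL uses a periodic-sweep strategy (one of the options listed under Questions below), and `drop` is spelled `drop_transient` here to avoid shadowing `std::mem::drop`.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Placeholder types; the real crate's peer/location types would replace these.
type PeerId = u64;
type Location = f64;

#[derive(Debug, PartialEq)]
enum PromoteError {
    NotReserved,
}

/// Owns all transient state: one map, one budget, no cross-module mutation.
struct TransientConnectionManager {
    budget: usize,
    ttl: Duration,
    // peer -> (location, reserved_at)
    transients: HashMap<PeerId, (Location, Instant)>,
}

impl TransientConnectionManager {
    fn new(budget: usize, ttl: Duration) -> Self {
        Self { budget, ttl, transients: HashMap::new() }
    }

    /// Reserve a transient slot; idempotent for an already-reserved peer.
    fn try_reserve(&mut self, peer: PeerId, location: Location) -> bool {
        if self.transients.contains_key(&peer) {
            return true; // duplicate admission is a no-op, still reserved
        }
        if self.transients.len() >= self.budget {
            return false; // budget exhausted
        }
        self.transients.insert(peer, (location, Instant::now()));
        true
    }

    /// Promote a reserved transient to established. The entry is removed from
    /// the transient map atomically, so callers never touch the map directly.
    fn promote(&mut self, peer: PeerId, _location: Location) -> Result<(), PromoteError> {
        if self.transients.remove(&peer).is_some() {
            Ok(())
        } else {
            Err(PromoteError::NotReserved)
        }
    }

    fn drop_transient(&mut self, peer: PeerId) {
        self.transients.remove(&peer);
    }

    /// Periodic sweep: evict reservations older than the TTL.
    /// Returns how many entries were expired.
    fn expire_due(&mut self) -> usize {
        let ttl = self.ttl;
        let before = self.transients.len();
        self.transients.retain(|_, (_, reserved_at)| reserved_at.elapsed() < ttl);
        before - self.transients.len()
    }

    fn is_transient(&self, peer: PeerId) -> bool {
        self.transients.contains_key(&peer)
    }
    fn count(&self) -> usize {
        self.transients.len()
    }
    fn budget(&self) -> usize {
        self.budget
    }
}
```

Because every mutation goes through `&mut self` on one owner, the reserve/promote/drop invariants hold per call without the caller coordinating several maps and counters.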
Rationale
- Clear separation between transient and established connections should reduce concurrency bugs and make reviews/maintenance easier.
- Limits the number of places that mutate transient state, making invariants testable.
- Makes it easier to add targeted tests (unit + integration) for transient lifecycle without threading through the bridge.
Isotonic / forwarding follow-up
- Keep the global isotonic regression (shared curve across peers) but bound/reset it so it doesn’t grow unbounded and can adapt. Per-peer offsets stay layered on the shared curve.
- Add a simple recency bias that’s easy to reason about (e.g., short cooldown or small recent-window weighting) to avoid overfitting to stale data; keep it minimal to avoid hard-to-debug behaviors.
- Move storage out of the ad-hoc global static into a controlled owner (even if process-global) with explicit reset/bounds.
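One way the "simple recency bias" above could look: a cooldown tracker that scales a candidate's forwarding score down right after use and decays the penalty back to neutral over a short window. Everything here is an assumption for illustration: `PeerId`, the linear decay, and the 0.5 floor are arbitrary choices, and the tracker would live in whatever controlled owner replaces the global static.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Placeholder peer identifier; the real crate's type would replace this.
type PeerId = u64;

/// Tracks when each peer was last forwarded to and applies a bounded,
/// short-lived penalty so a just-used neighbor is deprioritized, not excluded.
struct RecencyBias {
    cooldown: Duration,
    last_used: HashMap<PeerId, Instant>,
}

impl RecencyBias {
    fn new(cooldown: Duration) -> Self {
        Self { cooldown, last_used: HashMap::new() }
    }

    fn record_use(&mut self, peer: PeerId) {
        self.last_used.insert(peer, Instant::now());
    }

    /// Multiply the base score by a penalty that decays linearly back to 1.0
    /// over the cooldown window. The floor of 0.5 (an arbitrary choice) keeps
    /// the bias bounded and easy to reason about.
    fn weighted_score(&self, peer: PeerId, base_score: f64) -> f64 {
        match self.last_used.get(&peer) {
            Some(t) => {
                let elapsed = t.elapsed().as_secs_f64();
                let window = self.cooldown.as_secs_f64();
                let decay = (elapsed / window).min(1.0); // 0.0 just after use, 1.0 once cooled down
                base_score * (0.5 + 0.5 * decay)
            }
            None => base_score, // never used recently: no penalty
        }
    }
}
```

Keeping the penalty bounded and linear avoids the hard-to-debug behaviors a more elaborate decay curve could introduce, while still preventing repeated hammering of the same neighbor.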
Questions / Decisions
- Should promotion be synchronous (returns Result) or enqueue a promotion task? (lean: synchronous to keep invariants tight)
- Do we need metrics/hooks for transient events (reserve, promote, drop, expire) as part of the refactor?
- TTL enforcement strategy: per-connection timer vs. periodic sweep.
- Recency handling: what’s the minimal cooldown/decay that still prevents hammering the same neighbor?
Next Steps
- Sketch API in a small RFC/PRD comment thread on this issue.
- Draft the refactor in a follow-up PR (after the current transient stack) with focused tests:
  - Concurrent reserve/promote/drop stress
  - TTL expiry cleanup
  - Promotion idempotency and cap enforcement
  - Estimator bounds/reset + minimal recency bias
- Wire the bridge and connection_manager to use the new manager, minimizing surface changes elsewhere.
CC: @iduartgomez for architectural alignment.