
fix(storer): unstick radius decrement on live networks #5463

Open

martinconic wants to merge 1 commit into master from fix/5396-radius-decrement

Conversation

@martinconic
Contributor

martinconic commented May 13, 2026

Checklist

  • I have read the coding guide.
  • My change requires a documentation update, and I have done it.
  • I have added tests to cover my changes.
  • I have filled out the description and linked the related issues.

Description

Nodes whose local storageRadius drifts above the neighborhood consensus stay stuck there indefinitely. Salud (pkg/salud/salud.go:243) compares CommittedDepth = capacityDoubling + Radius against the network mode with zero tolerance, flips the node unhealthy on any mismatch, and redistribution stops playing rounds. The reporter of #5428 observed this on bee 2.7.1-rc2 after an RPC instability event; the reporter of #5396 reproduced the same condition with code-level analysis.
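
For orientation, a minimal sketch of that comparison (the function name and parameters here are illustrative, not bee's actual API; the real check lives in pkg/salud/salud.go):

```go
// isHealthy paraphrases the zero-tolerance check described above.
// networkMode is the most common committed depth among sampled peers.
func isHealthy(capacityDoubling, radius, networkMode uint8) bool {
	committedDepth := capacityDoubling + radius
	// Any mismatch, high or low, marks the node unhealthy, and an
	// unhealthy node stops playing redistribution rounds.
	return committedDepth == networkMode
}
```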

The reserve worker's radius-decrement gate db.syncer.SyncRate() == 0 is structurally unreachable on a live mainnet. SyncRate() measures historical sync only, but historical sync never goes quiet for the 15-minute rate window, because (1) peer churn keeps starting new cursor-based historical workers and (2) every successful decrement calls puller.onChange → resetIntervals(prevRadius), which discards non-neighbor sync history and forces historical sync to re-run, keeping SyncRate above zero for the next ~15-30 minutes. The gate locks closed permanently, so a stuck radius never recovers on its own, and a restart doesn't help because waitNetworkRFunc re-loads the wrong local value.
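
To make the deadlock concrete, a simplified sketch of the pre-fix tick (names and the threshold stand-in are assumptions, not the verbatim worker code):

```go
// preFixTick paraphrases the old reserve-worker decision. The gate
// requires historical sync to be quiet for the whole rate window,
// which never happens on a live network.
func preFixTick(syncRate float64, count, capacity int, radius uint8) uint8 {
	if syncRate != 0 {
		return radius // gate locked: no decrement this tick
	}
	if count < threshold(capacity) {
		// A successful decrement triggers puller.onChange ->
		// resetIntervals(prevRadius), restarting historical sync and
		// pushing syncRate back above zero for the next ticks anyway.
		radius--
	}
	return radius
}

// threshold is a stand-in: the description below implies roughly half
// of capacity must sit within radius for the radius to be correct.
func threshold(capacity int) int { return capacity / 2 }
```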

Capacity-doubled nodes are structurally worse off: they pull more data, their SyncRate floor is higher, and the +1 from doubling amplifies any Radius mismatch into CommittedDepth mismatch.

Three changes, applied together (a sketch of the combined tick logic follows the list):

  1. Remove the SyncRate gate. The threshold check (count < threshold(capacity)) is already a correctness signal — if less than half of capacity sits within radius, the radius is wrong, regardless of sync activity. The unreserve path raises radius without consulting SyncRate; the asymmetry was unjustified.

  2. Allow a two-step decrement when count is far below threshold (count*4 <= threshold). Under approximately uniform CAC bin distribution, each step doubles count-within-radius, so two steps land at or below threshold rather than overshooting. Capped at two to keep behavior bounded; each radius change triggers one resetIntervals so multi-step in one tick is strictly cheaper than the equivalent single-step ticks 5 minutes apart.

  3. Reduce reserveWakeUpDuration from 15 to 5 minutes. With the gate gone, faster ticks bound worst-case recovery; tick cost is one countWithinRadius iteration.
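
Taken together, changes (1) and (2) make the tick look roughly like this — a sketch under the same illustrative names as above, reusing the threshold stand-in from the earlier snippet:

```go
// postFixTick sketches the new decision: no SyncRate argument at all,
// and up to two steps per tick. minimumRadius bounds the walk-down.
func postFixTick(count, capacity int, radius, minimumRadius uint8) uint8 {
	if count >= threshold(capacity) || radius <= minimumRadius {
		return radius
	}
	step := uint8(1)
	// Under roughly uniform CAC bin distribution each step doubles
	// count-within-radius, so when count*4 <= threshold two steps
	// still land at or below threshold instead of overshooting.
	if count*4 <= threshold(capacity) && radius-1 > minimumRadius {
		step = 2
	}
	return radius - step
}
```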

The existing test "radius doesn't change due to non-zero pull rate" encoded the buggy behavior as an invariant. It was renamed and inverted to assert the new contract: with an empty reserve, the worker decrements toward minimumRadius regardless of the syncer's reported rate.
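
A self-contained check of that contract against the sketch above (this exercises the illustrative postFixTick in a _test.go file, not bee's real test harness):

```go
func TestPostFixTickDecrementsWithEmptyReserve(t *testing.T) {
	// Empty reserve (count = 0); note there is no sync-rate argument
	// left to gate on -- the decrement must proceed regardless.
	radius, minimum := uint8(10), uint8(4)
	for i := 0; radius > minimum && i < 32; i++ {
		radius = postFixTick(0, 1000, radius, minimum)
	}
	if radius != minimum {
		t.Fatalf("radius stuck at %d, want %d", radius, minimum)
	}
}
```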

No protocol, storage layout, persistence format, or interface changes. The Syncer.SyncRate() interface stays — it's still used for "fully synced" determination at startup and for the status protocol's pullsyncRate field, both correct uses.

Fixes #5396
Fixes #5428


AI Disclosure

  • This PR contains code that has been generated by an LLM.
  • I have reviewed the AI generated code thoroughly.
  • I possess the technical expertise to responsibly review the code generated in this PR.

@martinconic martinconic self-assigned this May 13, 2026
@martinconic martinconic added this to the 2026 milestone May 13, 2026
@acud
Contributor

acud commented May 14, 2026

would like to get input from @significance on this
