chore(release): promote rc-2026.5.4 to 0.2.6,0.2.5 #102

Merged
jacderida merged 30 commits into main from rc-2026.5.4
May 28, 2026

@jacderida
Contributor

Promote ant-core to 0.2.6 and ant-cli to 0.2.5. Rewrites ant-protocol to 2.1.1 (already pinned), rewrites ant-node refs (runtime optional + dev-deps) to 0.11.5.

jacderida and others added 30 commits May 22, 2026 18:16
Hardens the download path against two distinct failure modes observed
running `ant file download` against the production network: residential
link saturation and per-peer / DHT transient errors that previously
fatally aborted multi-hundred-chunk downloads.

End to end, this takes a residential `ant file download` from "aborts
on the first 256-wide concurrent batch's saturation event" to
"completes 11/11 files including 2+ GB downloads," and on a fat-pipe
droplet from "matches baseline" to "matches baseline" — no regression
on the warm-start path that production downloaders actually exercise.

Six related changes:

1. retry-on-Ok(None) with unanimous-NotFound threshold in chunk_get.
   When the close group returns Ok(None) (no peer has the chunk),
   retry once with a fresh find_closest_peers lookup, unless every
   queried peer responded with an authoritative NotFound (the only
   safe stop for genuine data absence). The previous behaviour treated
   Ok(None) as fatal on first occurrence, which on a saturated link
   meant any single chunk's transient close-group exhaustion aborted
   the whole download.

2. rebucketed_unordered in file.rs instead of buffer_unordered for the
   in-flight chunk fetches. The adaptive limiter's cap can now shrink
   the in-flight count mid-batch under sustained pressure;
   buffer_unordered snapshotted the cap once at pipeline build and
   ignored later Decrease decisions.

3. observe-outer with Ok(None) -> Outcome::Timeout instead of
   observe-per-peer. The controller sees one observation per chunk_get
   (not one per peer attempt), classified via a new chunk_get_outcome
   helper that treats Ok(None) as a load-shedding signal. Avoids the
   per-peer noise floor on the production network where some peers in
   any K=7 close group are unreachable from any given client even on a
   healthy link — that noise was driving spurious Decrease decisions
   on the droplet and pinning steady-state cap low.

4. ChannelStart::fetch: 64 -> 4 cold-start. The 64-wide initial burst
   saturated residential connections before the controller had any
   observation to act on. 4 is the value confirmed safe on a real
   residential link. On droplets the cost is a one-off cold-start
   warm-up of ~16 min on the first 2.5 GB file; subsequent files
   warm-start from the persisted client_adaptive.json snapshot (which
   the controller cleanly grows to cap=256, the channel ceiling).

5. Deferred-retry pass in streaming_decrypt's consumer. When chunk_get
   returns Ok(None) for a chunk during a batch, the chunk is deferred
   rather than aborting the batch. After the main batch settles, the
   deferred chunks are retried serially with sleeps of 10/30/60 s.
   This rides out transient saturation events that hit multiple
   in-flight chunks at once — by the time the batch has drained and
   the first sleep elapses, the link has usually settled. A chunk
   only becomes fatal after all 3 deferred attempts fail.

6. Per-peer protocol-error tolerance and deferred-retry transient-error
   tolerance. A single peer returning Error::Protocol (e.g. "Chunk
   verification failed" from a corrupted local copy) no longer aborts
   the close-group sweep — the loop counts it and continues to the
   next peer. Similarly, an Err(_) from chunk_get_observed during a
   deferred-retry attempt logs and falls through to the next attempt's
   longer backoff rather than escalating.
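
The classification in item 3 can be sketched as below. This is a minimal
illustration, not the actual helper: only `chunk_get_outcome`, `Ok(None)`
and `Outcome::Timeout` come from the text above. The `Success` and `Failure`
variants, the mapping of `Err(_)`, and the generic signature are assumptions
made for the example.

```rust
/// One observation per chunk_get, as fed to the adaptive controller.
/// `Success` and `Failure` are hypothetical variant names.
#[derive(Debug, PartialEq, Eq)]
enum Outcome {
    Success,
    Timeout,
    Failure,
}

/// Classify the outer result of a chunk fetch (assumed signature).
fn chunk_get_outcome<T, E>(result: &Result<Option<T>, E>) -> Outcome {
    match result {
        // The chunk was retrieved from some peer in the close group.
        Ok(Some(_)) => Outcome::Success,
        // No peer produced the chunk: treated as a load-shedding signal.
        Ok(None) => Outcome::Timeout,
        // Assumption: hard errors are reported separately from timeouts.
        Err(_) => Outcome::Failure,
    }
}
```

Because the controller sees exactly one outcome per chunk, an unreachable
peer inside an otherwise successful close-group sweep never reaches it.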

Also: latency_inflation_factor default 2.0 -> 4.0. On the production
network, natural close-group fallback latency routinely doubles the
EWMA baseline (a single peer hitting fallback adds ~10 s on top of a
sub-second median), so the 2.0 default was firing spurious Decrease
decisions even on the droplet. 4.0 is the value validated on the
previously-merged tune-latency-inflation-factor branch.

Test plan:
  - 296 ant-core unit tests pass.
  - End-to-end residential download (PROD-LOCAL-DL-04): 11/11 files
    completed including 2.51 GB (42m 43s) and 2.76 GB (46m 26s).
    During an earlier residential run 14 chunks went through the
    deferred-retry path and every one recovered on attempt 1/3 after
    the 10 s sleep.
  - End-to-end droplet download (PROD-DL-05): 20/20 files completed.
    The first 2.5 GB file paid 16m 9s of cold-start cost; subsequent
    multi-GB files ran in 3-6 min each, near the pre-change ~5 min
    baseline on PROD-DL-02. No retry mechanism fired across the
    20-file run — healthy-network success.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous commit fixed residential saturation and abort-on-first-
failure, but field reports from fast connections (e.g. an Oracle VPS
that gets full speed on the released client) showed the opposite
problem: the fetch cap stayed pinned at ~13-24 across an entire 36-file
run and never climbed toward the 256 ceiling, so multi-GB files took
~22 min each instead of ~5.

Three compounding causes, all from having tuned exclusively for the
saturated-home case:

1. Cap can't grow. AIMD exits slow-start permanently on the first
   Decrease, then grows +1 per 32-observation window. On a link with a
   steady ~4% close-group-exhaustion trickle, intermittent Decreases
   fire often enough that additive +1 never gets ahead — equilibrium
   ~20. Additive growth simply cannot reach a useful cap from a low
   base before a file finishes.

   Fix: add `LimiterConfig::slow_start_ramp_threshold`. Below it, a
   Decrease still halves the cap but keeps slow-start armed, so the
   next healthy window doubles back up instead of crawling. The fetch
   channel sets it to the channel ceiling, so download concurrency
   tracks the connection's real capacity. Default 0 preserves the
   original behaviour for quote/store.

2. The p95-latency Decrease misfires on fetch. `chunk_get_observed`'s
   latency includes the internal 1 s retry sleep and the slow retry
   sweep for chunks that needed one, so a window with a couple of
   retry-path chunks has a wildly inflated p95 that reads as
   congestion. Fix: add `LimiterConfig::latency_decrease_enabled`,
   false for fetch. Genuine fetch congestion still surfaces via the
   Ok(None) -> Timeout rate, which the timeout_ceiling check catches.

3. The deferred-retry pass was a throughput sink. It retried deferred
   chunks SERIALLY with a mandatory 10 s pre-sleep each; a batch that
   deferred ~20 chunks burned minutes of near-zero throughput even
   though every chunk succeeded on its first retry (the 10 s sleep was
   pure waste — the deferrals were peer-side noise that clears in <1 s).
   Fix: retry deferred chunks in CONCURRENT rounds reusing the fetch
   limiter, with the first round firing immediately and later rounds
   backing off (0/15/45 s) only for chunks that survive a round. Both
   Ok(None) and transient errors re-defer to the next round; only the
   final round's leftovers are fatal.

Quote and store channel behaviour is unchanged (threshold 0,
latency-decrease enabled). New unit tests cover protected-vs-additive
recovery, the disabled latency check, and that the controller applies
the download tuning to fetch only.
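
A rough sketch of the protected-vs-additive recovery described in cause 1,
under stated assumptions: the `Limiter` struct, its constructor and the two
method names are hypothetical, and the exit condition (`cap >= threshold` at
the moment of the Decrease) is inferred from the prose rather than copied
from adaptive.rs. Only `slow_start_ramp_threshold` and `left_slow_start` are
names taken from the commits.

```rust
/// Simplified stand-in for the adaptive limiter (hypothetical shape).
#[derive(Debug)]
struct Limiter {
    cap: usize,
    min: usize,
    max: usize,
    left_slow_start: bool,
    slow_start_ramp_threshold: usize,
}

impl Limiter {
    fn new(cap: usize, max: usize, slow_start_ramp_threshold: usize) -> Self {
        Limiter { cap, min: 1, max, left_slow_start: false, slow_start_ramp_threshold }
    }

    /// A Decrease always halves the cap, but only exits slow-start when the
    /// cap had already reached the ramp threshold. Threshold 0 always exits.
    fn on_decrease(&mut self) {
        if self.cap >= self.slow_start_ramp_threshold {
            self.left_slow_start = true;
        }
        self.cap = (self.cap / 2).max(self.min);
    }

    /// A healthy window doubles in slow-start and adds one otherwise.
    fn on_healthy_window(&mut self) {
        let next = if self.left_slow_start {
            self.cap.saturating_add(1)
        } else {
            self.cap.saturating_mul(2)
        };
        self.cap = next.min(self.max);
    }
}
```

With a threshold of 256 a cap of 16 goes 16 -> 8 -> 16 across one Decrease
and one healthy window; with the default threshold of 0 it goes 16 -> 8 -> 9.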

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ceiling

The previous commit added slow-start protection so the fetch cap could
double back up after transient Decreases — but a re-test on the same
fast-but-lossy VPS showed the cap still pinned in the ~6-26 range,
growing only additively. Root cause: `warm_start` unconditionally set
`left_slow_start = true`.

Every `ant file download` is a fresh process that warm-starts the
controller from the persisted snapshot. With warm_start always exiting
slow-start, the protection only ever applied to the very first
(cold-start) file; every subsequent file began with slow-start already
exited and could only grow the fetch cap by +1 per 32-observation
window. Additive growth from a low warm value cannot climb to the
ceiling against the connection's intermittent close-group-exhaustion
trickle, so the cap drifted down across files (observed: 15, 8, 26, 20,
14, 12, 12, 9, 6...) and 2.5 GB files stayed at ~17-23 min.

Fix: warm_start now sets `left_slow_start = clamped >=
slow_start_ramp_threshold`. For quote/store (threshold 0) this is
unchanged — always exits, so a learned warm value isn't doubled on the
first healthy window. For fetch (threshold == ceiling) a warm value
below the ceiling keeps slow-start armed, so the cap doubles back
toward the connection's real capacity instead of crawling.
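
The rule above reduces to a single comparison. The sketch below is an
illustration only: `warm_start_state` and its parameters are hypothetical
names, and the real `warm_start` presumably mutates the limiter in place
rather than returning a tuple.

```rust
/// Returns (cap, left_slow_start) for a limiter restored from a snapshot.
/// Assumes min <= max; `persisted` is the cap read from the snapshot.
fn warm_start_state(
    persisted: usize,
    min: usize,
    max: usize,
    slow_start_ramp_threshold: usize,
) -> (usize, bool) {
    let clamped = persisted.clamp(min, max);
    // Threshold 0 (quote/store): always true, so slow-start is exited.
    // Threshold == ceiling (fetch): false below the ceiling, stays armed.
    let left_slow_start = clamped >= slow_start_ramp_threshold;
    (clamped, left_slow_start)
}
```

For example, a persisted fetch cap of 20 with a ceiling and threshold of 256
yields `(20, false)`, while the same value on a threshold-0 channel yields
`(20, true)`.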

This was the missing piece: the in-process slow-start protection and
the cross-process warm-start path have to agree, or the multi-file CLI
pattern silently defeats the protection. Added a regression test
covering both the protected (doubles after warm_start) and default
(stays additive) channels.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add an optional --upgrade-channel argument (stable|beta) to `ant node
add`. When supplied, it is persisted in the node registry and passed
through to the ant-node process as --upgrade-channel, which accepts the
same values.

- New UpgradeChannel enum in ant-core (serde snake_case, Display maps to
  the lowercase values ant-node expects).
- Threaded through AddNodeOpts and NodeConfig (serde-default for
  backward-compatible registry deserialization).
- build_node_args emits --upgrade-channel only when set.
- CLI exposes a clap ValueEnum that converts into the core enum, keeping
  ant-core clap-free.
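
A dependency-free sketch of the enum and the argument emission, for
illustration. The serde and clap attributes are left out to keep it
standalone, and the single-parameter `build_node_args` signature is an
assumption; the real function takes the full node configuration.

```rust
use std::fmt;

/// Upgrade channel persisted in the node registry (serde attributes omitted).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum UpgradeChannel {
    Stable,
    Beta,
}

impl fmt::Display for UpgradeChannel {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Lowercase values, matching what ant-node accepts.
        f.write_str(match self {
            UpgradeChannel::Stable => "stable",
            UpgradeChannel::Beta => "beta",
        })
    }
}

/// Simplified: only the upgrade-channel portion of the argument list.
fn build_node_args(upgrade_channel: Option<UpgradeChannel>) -> Vec<String> {
    let mut args = Vec::new();
    // Emit the flag only when a channel was supplied to `ant node add`.
    if let Some(channel) = upgrade_channel {
        args.push("--upgrade-channel".to_string());
        args.push(channel.to_string());
    }
    args
}
```

Keeping the field optional means registries written before this change
deserialize to `None` and produce the same arguments as before.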

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… them stopped

Nodes adopted across a daemon restart (adopt_from_registry) have no owning
monitor_node task, so all restart/respawn logic — which lives in monitor_node —
never runs for them. The liveness monitor was their only supervisor, and it only
ever marks a dead node Stopped. Result: an adopted node that auto-upgrades exits
cleanly under --stop-on-upgrade expecting the service manager (the daemon) to
restart it, but nothing does. The node stays dead, reported Stopped, with the
registry version left stale. Reproduced deterministically on a testnet node
(0.11.3 -> 0.11.14-rc.1): process gone, daemon emitted only node_stopped.

Track adopted nodes in a HashSet on the Supervisor (set in adopt_from_registry,
cleared once this daemon owns a monitor_node for the node). When the liveness
monitor finds an adopted node's process dead and the on-disk binary version has
drifted from the registry, treat it as an upgrade exit: respawn on the new binary
via respawn_upgraded_node and hand it a monitor_node, so it comes back on the new
version and is properly supervised from then on. Non-upgrade exits and
daemon-spawned nodes keep their existing behaviour.
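
The decision the liveness monitor now makes for a dead process could look
roughly like this. Everything except the idea of a `HashSet` of adopted
nodes on the Supervisor is assumed for the sketch: the `u32` node id, the
string versions, the `DeadNodeAction` enum and both method names are
hypothetical.

```rust
use std::collections::HashSet;

/// What the liveness sweep should do with a node whose process is gone.
#[derive(Debug, PartialEq, Eq)]
enum DeadNodeAction {
    /// Respawn on the new binary and attach a monitor task.
    RespawnUpgraded,
    /// Existing behaviour: mark the node Stopped.
    MarkStopped,
}

struct Supervisor {
    /// Nodes adopted from the registry that have no owning monitor task yet.
    adopted: HashSet<u32>,
}

impl Supervisor {
    fn on_dead_node(&self, node_id: u32, registry_version: &str, disk_version: &str) -> DeadNodeAction {
        // Only an adopted node whose on-disk binary drifted from the
        // registry version is treated as an upgrade exit.
        if self.adopted.contains(&node_id) && registry_version != disk_version {
            DeadNodeAction::RespawnUpgraded
        } else {
            DeadNodeAction::MarkStopped
        }
    }

    /// Called once this daemon owns a monitor task for the node.
    fn on_monitor_attached(&mut self, node_id: u32) {
        self.adopted.remove(&node_id);
    }
}
```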

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stops the daemon and all nodes, kills lingering processes, removes the
installed `ant` binary and all daemon/registry state, and clears node
data/logs under /mnt/nodes. POSIX sh for Alpine/busybox; safe to re-run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The external-signer merkle upload path (`merkle_upload_chunks`) aborted the
entire file on the first chunk that returned `InsufficientPeers`, and never
retried. Because all chunks in a batch share one `winner_pool`, the few nodes
whose routing tables disagree about that pool's midpoint reject it consistently
for every chunk whose close group includes them — turning a transient,
self-healing shortfall into a fatal partial upload.

C2.1: collect per-chunk `InsufficientPeers` failures rather than `?`-propagating
them; non-quorum errors (e.g. a missing proof) stay fatal.

C2.2: retry the failed set with the same reusable proofs across a total budget
of MERKLE_STORE_MAX_ATTEMPTS attempts (initial + 3 retries, matching the wave
path), re-collecting each chunk's close group per attempt and applying a
jittered 30s backoff between rounds so a large failed set does not re-probe the
same divergent nodes in lockstep. The retry round a chunk lands on is recorded
in `retries_histogram[round]`.
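
The collect-then-retry loop from C2.1 and C2.2 can be sketched synchronously
as follows. This is an illustration under assumptions, not the real
`merkle_store_with_retry`: chunks are plain indices, the store operation is
a caller-supplied closure, the error enum is hypothetical, and the jittered
backoff between rounds is reduced to a comment.

```rust
/// Initial attempt plus 3 retries, as stated in the commit text.
const MERKLE_STORE_MAX_ATTEMPTS: usize = 4;

/// Hypothetical error type: one retryable quorum shortfall, one fatal case.
#[derive(Debug, PartialEq)]
enum StoreError {
    InsufficientPeers,
    Fatal(String),
}

struct RetryReport {
    /// Chunks still failing after the attempt budget is exhausted.
    failed: Vec<usize>,
    /// retries_histogram[round] = chunks that succeeded on that round.
    retries_histogram: [usize; MERKLE_STORE_MAX_ATTEMPTS],
}

fn store_with_retry<F>(chunks: &[usize], mut store: F) -> Result<RetryReport, StoreError>
where
    F: FnMut(usize, usize) -> Result<(), StoreError>,
{
    let mut pending = chunks.to_vec();
    let mut retries_histogram = [0usize; MERKLE_STORE_MAX_ATTEMPTS];
    for round in 0..MERKLE_STORE_MAX_ATTEMPTS {
        if pending.is_empty() {
            break;
        }
        // The real path sleeps with a jittered backoff before each retry round.
        let mut still_failed = Vec::new();
        for &chunk in &pending {
            match store(chunk, round) {
                Ok(()) => retries_histogram[round] += 1,
                // Quorum shortfalls are collected, not propagated.
                Err(StoreError::InsufficientPeers) => still_failed.push(chunk),
                // Anything else (e.g. a missing proof) stays fatal.
                Err(other) => return Err(other),
            }
        }
        // Only the failed set is retried in the next round.
        pending = still_failed;
    }
    Ok(RetryReport { failed: pending, retries_histogram })
}
```

A caller can then report `failed.len()` against the total chunk count, or
turn a non-empty `failed` into a hard error where a partial store cannot be
expressed.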

The store loop is extracted into `merkle_store_with_retry`, unit-tested for
collect-not-abort, non-quorum-fatal, retry-only-the-failed-set,
retry-counted-once, and exhausted-budget behaviour. `file.rs` reports honest
`chunks_failed`/`total_chunks`; the `data.rs` path returns a hard error on a
residual shortfall rather than a success with an undownloadable data map
(`DataUploadResult` cannot express a partial store).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`ci.yml` only triggered on `main`, so PRs targeting release-candidate
branches (e.g. `rc-2026.5.4`) ran no fmt/clippy/test checks — the
`pull_request` branch filter is matched against the PR's base branch.

Add an `rc-*` glob to both the `push` and `pull_request` filters so every
release-candidate branch is covered. Takes effect for the rc branch once
this merges (GitHub reads the `pull_request` trigger from the base branch),
and future rc cuts inherit it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI on the `--all-features`/E2E/doc jobs failed to compile: `ant-protocol` is
pinned to its git `rc-2026.5.4` branch (→ `saorsa-core 0.24.5-rc.1`) but
`ant-node` was still on crates.io `0.11.4` (→ `saorsa-core 0.24.4`), so two
incompatible `saorsa-core` versions clashed on `MultiAddr` in `node/devnet.rs`.

Point `ant-node` (runtime + dev dep) and the direct `saorsa-core` dev-dep at
their `rc-2026.5.4` git branches so the whole graph resolves to a single
`saorsa-core 0.24.5-rc.1`, matching `ant-protocol`. `Cargo.lock` updated to the
current branch tips (saorsa-core 1be73520, which includes the
`find_closest_nodes_local_by_distance` method ant-node's verifier needs).

Verified locally: cargo clippy --all-targets --all-features, cargo doc
--all-features, and compilation of all e2e/merkle test targets now pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fix(merkle): retry quorum shortfalls instead of aborting the file upload
…slow-start, require well-sampled NotFound

Addresses three findings from @mickvandijke's review:

P1 (file.rs / chunk.rs): a transient error on a single chunk's INITIAL
close-group lookup could still fail a whole download. chunk_get took
`chunk_get_try_close_group(...).await?`, so an error from
close_group_peers (e.g. a momentary DHT/InsufficientPeers failure)
propagated before the retry path, and the main download pass aborted on
any Err from chunk_get_observed. Two changes:
  - chunk_get now treats a first-attempt lookup error as a
    non-authoritative miss (zeroed outcome) and falls through to its
    retry path instead of propagating.
  - the main streaming-decrypt pass defers a chunk on Err (same as
    Ok(None)) rather than aborting; only a chunk that survives all
    deferred retry rounds is fatal.

P2 (adaptive.rs): fetch slow-start protection was lost at the ceiling.
With `slow_start_ramp_threshold = max_concurrency`, a Decrease while the
cap sat at the ceiling satisfied `current >= threshold` and exited
slow-start, so 256 -> 128 recovered by +1 windows instead of doubling —
re-pinning a fast-but-lossy link after a single transient decrease. Set
the fetch threshold to `usize::MAX` so slow-start never exits and the
cap always doubles back.

P3 (chunk.rs): authoritative NotFound only required `not_found ==
queried`, so a thin/under-sampled DHT walk (close_group_peers accepts
any non-empty result) returning 1/1 or 3/3 NotFound was treated as
final absence with no retry, even though the real replica majority may
lie outside that narrow view. Now also requires
`queried >= CLOSE_GROUP_MAJORITY`, so an under-sampled walk falls
through to the retry (which re-walks the DHT).

Tests: updated the NotFound test for the well-sampled requirement;
added a ceiling slow-start regression test (MAX-threshold out-recovers
a max_concurrency-threshold limiter after stress at the ceiling).
301 ant-core unit tests pass; fmt and clippy --all-targets -D warnings
clean.
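
The tightened P3 condition, written out as a standalone predicate for
illustration. The function name is hypothetical, and the value 4 for
`CLOSE_GROUP_MAJORITY` is an assumption based on a K=7 close group; the
actual constant lives in the codebase.

```rust
/// Assumed value for a K = 7 close group; not taken from the source.
const CLOSE_GROUP_MAJORITY: usize = 4;

/// A miss is authoritative only if every queried peer said NotFound AND
/// enough peers were queried for that answer to be representative.
fn is_authoritative_not_found(not_found: usize, queried: usize) -> bool {
    not_found == queried && queried >= CLOSE_GROUP_MAJORITY
}
```

Under this rule a 1/1 or 3/3 NotFound result falls through to the retry,
while 4/4 stops the search.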

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fix(download): residential saturation + transient failure hardening
…ty guard)

The liveness monitor snapshots each Running node's PID, and on finding that
PID dead it re-checked only that the status was still Running before flipping
the node to Stopped. But between the snapshot and the action, monitor_node can
respawn the node after an auto-upgrade (or crash), installing a live PID_new
while the status stays Running. The sweep — still holding the dead PID_old —
would then clobber the healthy respawned process to Stopped and delete its pid
file, leaving the node reported as Stopped while it actually runs the new
version. A daemon restart re-adopts the live process and masks it, which is the
temporary workaround users have been relying on.

Extract the decision into liveness_should_stop and require the recorded PID to
still equal the one observed dead, so a respawn under the sweep is left alone.
Add a regression test.
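
A minimal sketch of the identity guard, assuming a simplified signature.
Only the name `liveness_should_stop` comes from the commit; the `NodeStatus`
enum, the `u32` PIDs and the parameter list are illustrative.

```rust
#[derive(Debug, PartialEq, Eq)]
enum NodeStatus {
    Running,
    Stopped,
}

/// Decide whether the liveness sweep may flip a node to Stopped.
/// `observed_dead_pid` is the PID snapshotted earlier and found dead;
/// `recorded_pid` is the PID recorded for the node at decision time.
fn liveness_should_stop(
    status: &NodeStatus,
    recorded_pid: Option<u32>,
    observed_dead_pid: u32,
) -> bool {
    // If the node was respawned in between, the recorded PID differs from
    // the dead one and the healthy process must be left alone.
    *status == NodeStatus::Running && recorded_pid == Some(observed_dead_pid)
}
```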

Reproduced on a testnet node via the real upgrade path before fixing: SSE
showed node_started(PID_new)+node_upgraded followed by a spurious node_stopped
~26s later, with PID_new still serving the new version.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fix(daemon): node wrongly reported stopped after auto-upgrade; add --upgrade-channel
Semver: patch

Use a byte-aware throughput hill climber for chunk fetch concurrency so downloads back away from caps that do not improve goodput.

Apply the rolling fetch scheduler to file/data download paths so cap changes can take effect while large downloads are still in progress.
Semver: patch

Measure fetch hill epochs from operation start time rather than first completion, and size epochs to cover full concurrency waves so upward probes are judged on steady goodput.

Migrate schema-1 adaptive snapshots by preserving quote/store warm-starts while resetting fetch to the hill-climber cold start.
feat(chunk): add chunk get peer diagnostics
…rency

fix(data): tune fetch concurrency with throughput hill climb
The saorsa-core and ant-protocol rc-2026.5.4 branches are being abandoned
(saorsa-core #121/#122 reverted; see WithAutonomi/ant-node#116). Point
ant-core's direct ant-protocol dep and its dev-dep saorsa-core back at their
crates.io releases (2.1.1 / 0.24.4) and refresh the lock. The optional/dev
ant-node deps stay on rc-2026.5.4 (that branch survives and now itself pins
crates.io saorsa-core/ant-protocol).
…doned-deps

Drop abandoned saorsa-core/ant-protocol rc pins back to crates.io
@jacderida merged commit eeba52b into main on May 28, 2026
24 checks passed