fix(download): residential saturation + transient failure hardening by jacderida · Pull Request #95 · WithAutonomi/ant-client

jacderida · 2026-05-22T17:09:33Z

Summary

Hardens the download path against two distinct failure modes observed running ant file download against the production network: residential link saturation and per-peer / DHT transient errors that previously fatally aborted multi-hundred-chunk downloads. Together the changes take an ant file download run on a residential connection from "aborts on the first 256-wide concurrent batch's saturation event" to "completes 11/11 files including 2+ GB downloads," and on a fat-pipe droplet from "matches baseline" to "matches baseline" — no regression on the warm-start path that production downloaders actually exercise.

What's in here

Six related pieces:

1. retry-on-Ok(None) with unanimous-NotFound threshold in chunk_get. When the close group returns Ok(None) (no peer has the chunk), retry once with a fresh find_closest_peers lookup, unless every queried peer responded with an authoritative NotFound (which is the only safe stop for genuine data absence).
2. rebucketed_unordered in file.rs instead of buffer_unordered for the in-flight chunk fetches, so the adaptive limiter's cap can shrink the in-flight count mid-batch under sustained pressure.
3. observe-outer with Ok(None) → Outcome::Timeout instead of observe-per-peer. The controller sees one observation per chunk_get (not one per peer attempt), classified via a new chunk_get_outcome helper that treats Ok(None) as a load-shedding signal. Avoids the per-peer noise floor on the production network where some peers in any K=7 close group are unreachable from any given client even on a healthy link.
4. ChannelStart::fetch: 64 → 4 cold-start. The original 64-wide initial burst would saturate residential connections before the controller had any observation to act on. 4 is the value confirmed safe on the operator's home link. On droplets the cost is a one-off cold-start warm-up of ~16 min on the first 2.5 GB file; subsequent files warm-start from the persisted snapshot (which reaches cap=256 cleanly).
5. Deferred-retry pass in streaming_decrypt's consumer. When chunk_get returns Ok(None) for a chunk during a batch, the chunk is deferred rather than aborting the batch. After the main batch settles, the deferred chunks are retried serially with sleeps of 10/30/60 s, giving the link time to clear any transient saturation. A chunk only becomes fatal after all 3 deferred attempts fail.
6. Per-peer protocol-error tolerance and deferred-retry transient-error tolerance. A single peer returning Error::Protocol (e.g., "Chunk verification failed" from a corrupted local copy) no longer aborts the close-group sweep; the loop counts it and continues to the next peer. Similarly, an Err(_) from chunk_get_observed during a deferred-retry attempt logs and falls through to the next attempt's longer backoff, rather than escalating.
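The observe-outer classification described above could look roughly like the sketch below. Only `chunk_get_outcome` and `Outcome::Timeout` are named in this PR; the other `Outcome` variants, the generic signature, and the mapping of `Err` are assumptions made for illustration, not the actual ant-core code.

```rust
/// Outcome reported to the adaptive controller, one per `chunk_get` call.
/// Only `Timeout` is named in the PR; `Success` and `Error` are assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    Success,
    Timeout,
    Error,
}

/// Classify a whole `chunk_get` result rather than each peer attempt.
/// `T` stands in for the chunk type and `E` for the client error type.
pub fn chunk_get_outcome<T, E>(result: &Result<Option<T>, E>) -> Outcome {
    match result {
        // Some peer in the close group served the chunk.
        Ok(Some(_)) => Outcome::Success,
        // Close group exhausted without a hit: read as load shedding.
        Ok(None) => Outcome::Timeout,
        // Assumed mapping for hard errors.
        Err(_) => Outcome::Error,
    }
}
```

Because the controller sees one outcome per chunk, a single unreachable peer inside an otherwise successful sweep does not register as a failure.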

Also: latency_inflation_factor default 2.0 → 4.0 (cherry-picked from the previously-validated tune-latency-inflation-factor branch — natural close-group fallback latency on the production network routinely doubles vs the EWMA baseline, and was firing spurious Decrease decisions on the droplet).
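To illustrate why the factor matters, here is a minimal sketch that assumes the controller fires a Decrease when a window's p95 latency exceeds the EWMA baseline multiplied by `latency_inflation_factor`. The helper name and the exact comparison are hypothetical; the PR only states that latency which roughly doubles against the baseline was triggering Decreases at 2.0.

```rust
/// Hypothetical check: does this window's p95 latency trigger a Decrease?
/// Assumes the rule is `p95 > ewma_baseline * latency_inflation_factor`.
pub fn latency_triggers_decrease(
    p95_ms: f64,
    ewma_baseline_ms: f64,
    latency_inflation_factor: f64,
) -> bool {
    p95_ms > ewma_baseline_ms * latency_inflation_factor
}
```

With an illustrative 500 ms baseline and a 1100 ms p95 (slightly more than doubled), the check fires at a factor of 2.0 but not at 4.0.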

What's not in here

- No new user-facing CLI flags. All controls are internal adaptive knobs.
- No retry behavior at the per-peer level inside chunk_get_from_peer. The retry happens at the close-group sweep level (once inside chunk_get, once at the deferred pass).

Evidence

The most recent end-to-end runs both completed cleanly:

Local download (residential connection, `PROD-LOCAL-DL-04`)

11/11 files completed including 2.51 GB and 2.76 GB downloads:

| # | Size | Duration | Started (UTC) |
|---|------|----------|---------------|
| 1 | 18 B | 24.2 s | 2026-05-22 14:43:33 |
| 2 | 150.4 KB | 28.8 s | 2026-05-22 14:43:57 |
| 3 | 15.0 MB | 42.7 s | 2026-05-22 14:44:26 |
| 4 | 2.51 GB | 42m 43.3s | 2026-05-22 14:45:09 |
| 5 | 2.76 GB | 46m 26.5s | 2026-05-22 15:27:52 |
| 6 | 65.9 MB | 1m 21.0s | 2026-05-22 16:14:19 |
| 7 | 6.0 MB | 32.5 s | 2026-05-22 16:15:40 |
| 8 | 802.6 KB | 37.4 s | 2026-05-22 16:16:12 |
| 9 | 2.2 MB | 36.6 s | 2026-05-22 16:16:50 |
| 10 | 12.8 MB | 50.6 s | 2026-05-22 16:17:26 |
| 11 | 961.6 KB | 25.3 s | 2026-05-22 16:18:17 |

Files 4 and 5 are the multi-GB workloads that previously aborted on the first close-group exhaustion within the first few minutes. They now complete via the deferred-retry mechanism — during the earlier successful home test, 14 chunks were deferred and every single one recovered on attempt 1/3 after the 10 s sleep.

Droplet download (production, `PROD-DL-05`)

20/20 files completed:

| # | Size | Duration |
|---|------|----------|
| 1 | 18 B | 25.1 s |
| 2 | 150.4 KB | 25.0 s |
| 3 | 15.0 MB | 38.8 s |
| 4 | 2.51 GB | 16m 9.4s ← cold-start, only file paying warm-up cost |
| 5 | 2.76 GB | 6m 22.5s |
| 6 | 65.9 MB | 48.0 s |
| 7 | 6.0 MB | 26.8 s |
| 8 | 802.6 KB | 30.1 s |
| 9 | 2.2 MB | 25.2 s |
| 10 | 12.8 MB | 47.7 s |
| 11 | 961.6 KB | 19.1 s |
| 12 | 1.77 GB | 3m 53.9s |
| 13 | 2.25 GB | 5m 4.3s |
| 14 | 2.34 GB | 3m 55.2s |
| 15 | 2.28 GB | 4m 17.7s |
| 16 | 2.47 GB | 5m 7.0s |
| 17 | 2.55 GB | 5m 12.8s |
| 18 | 2.50 GB | 4m 32.6s |
| 19 | 2.75 GB | 5m 5.6s |
| 20 | 2.84 GB | 4m 53.0s |

File #4 is the cold-start cost: the adaptive limiter ramps from ChannelStart::fetch=4 through doublings to the channel ceiling of 256, and the snapshot persists at 256 for subsequent runs. Files #5 onwards run at near-baseline speeds: 3-6 min per 2+ GB file, vs the pre-change baseline of ~5 min on PROD-DL-02.
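The cold-start ramp can be worked through with a small sketch. It assumes every window is healthy and that each healthy slow-start window doubles the cap, clamped at the ceiling; `ramp_schedule` is a hypothetical helper, not part of ant-core.

```rust
/// Caps visited while doubling from `start` up to `ceiling`, assuming
/// every window is healthy (no Decrease fires during the ramp).
pub fn ramp_schedule(start: usize, ceiling: usize) -> Vec<usize> {
    let mut cap = start.clamp(1, ceiling.max(1));
    let mut caps = vec![cap];
    while cap < ceiling {
        // Slow-start step: double, but never overshoot the ceiling.
        cap = cap.saturating_mul(2).min(ceiling);
        caps.push(cap);
    }
    caps
}
```

Starting from 4, the cap passes through 8, 16, 32, 64 and 128 before reaching 256, which is six doublings. The previous start value of 64 needed only two.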

Grepping the per-file logs on PROD-DL-05 shows none of the new retry mechanisms fired across the 20-file run — every chunk_get succeeded on its first close-group sweep. So this is a healthy-network success, not a "saved by deferred retry" success. The deferred-retry mechanism is proven on the home runs (14 successful recoveries in PROD-LOCAL-DL-03); the droplet just didn't need it.

Test plan

- All 296 ant-core unit tests pass (cargo test -p ant-core --lib).
- End-to-end residential download (PROD-LOCAL-DL-04): 11/11 files completed.
- End-to-end droplet download (PROD-DL-05): 20/20 files completed.

🤖 Generated with Claude Code

Hardens the download path against two distinct failure modes observed running `ant file download` against the production network: residential link saturation and per-peer / DHT transient errors that previously fatally aborted multi-hundred-chunk downloads. End to end, this takes a residential `ant file download` from "aborts on the first 256-wide concurrent batch's saturation event" to "completes 11/11 files including 2+ GB downloads," and on a fat-pipe droplet from "matches baseline" to "matches baseline": no regression on the warm-start path that production downloaders actually exercise.

Six related changes:

1. retry-on-Ok(None) with unanimous-NotFound threshold in chunk_get. When the close group returns Ok(None) (no peer has the chunk), retry once with a fresh find_closest_peers lookup, unless every queried peer responded with an authoritative NotFound (the only safe stop for genuine data absence). The previous behaviour treated Ok(None) as fatal on first occurrence, which on a saturated link meant any single chunk's transient close-group exhaustion aborted the whole download.

2. rebucketed_unordered in file.rs instead of buffer_unordered for the in-flight chunk fetches. The adaptive limiter's cap can now shrink the in-flight count mid-batch under sustained pressure; buffer_unordered snapshotted the cap once at pipeline build and ignored later Decrease decisions.

3. observe-outer with Ok(None) -> Outcome::Timeout instead of observe-per-peer. The controller sees one observation per chunk_get (not one per peer attempt), classified via a new chunk_get_outcome helper that treats Ok(None) as a load-shedding signal. Avoids the per-peer noise floor on the production network where some peers in any K=7 close group are unreachable from any given client even on a healthy link; that noise was driving spurious Decrease decisions on the droplet and pinning steady-state cap low.

4. ChannelStart::fetch: 64 -> 4 cold-start. The 64-wide initial burst saturated residential connections before the controller had any observation to act on. 4 is the value confirmed safe on a real residential link. On droplets the cost is a one-off cold-start warm-up of ~16 min on the first 2.5 GB file; subsequent files warm-start from the persisted client_adaptive.json snapshot (which the controller cleanly grows to cap=256, the channel ceiling).

5. Deferred-retry pass in streaming_decrypt's consumer. When chunk_get returns Ok(None) for a chunk during a batch, the chunk is deferred rather than aborting the batch. After the main batch settles, the deferred chunks are retried serially with sleeps of 10/30/60 s. This rides out transient saturation events that hit multiple in-flight chunks at once: by the time the batch has drained and the first sleep elapses, the link has usually settled. A chunk only becomes fatal after all 3 deferred attempts fail.

6. Per-peer protocol-error tolerance and deferred-retry transient-error tolerance. A single peer returning Error::Protocol (e.g. "Chunk verification failed" from a corrupted local copy) no longer aborts the close-group sweep; the loop counts it and continues to the next peer. Similarly, an Err(_) from chunk_get_observed during a deferred-retry attempt logs and falls through to the next attempt's longer backoff rather than escalating.

Also: latency_inflation_factor default 2.0 -> 4.0. Natural close-group fallback latency on the production network routinely doubles vs the EWMA baseline (a single peer hitting fallback adds ~10 s on top of a sub-second median), and was firing spurious Decrease decisions even on the droplet. 4.0 is the value validated on the previously-merged tune-latency-inflation-factor branch.

Test plan:

- 296 ant-core unit tests pass.
- End-to-end residential download (PROD-LOCAL-DL-04): 11/11 files completed including 2.51 GB (42m 43s) and 2.76 GB (46m 26s). During an earlier residential run 14 chunks went through the deferred-retry path and every one recovered on attempt 1/3 after the 10 s sleep.
- End-to-end droplet download (PROD-DL-05): 20/20 files completed. The first 2.5 GB file paid 16m 9s of cold-start cost; subsequent multi-GB files ran in 3-6 min each, near the pre-change ~5 min baseline on PROD-DL-02. No retry mechanism fired across the 20-file run, so this was a healthy-network success.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous commit fixed residential saturation and abort-on-first-failure, but field reports from fast connections (e.g. an Oracle VPS that gets full speed on the released client) showed the opposite problem: the fetch cap stayed pinned at ~13-24 across an entire 36-file run and never climbed toward the 256 ceiling, so multi-GB files took ~22 min each instead of ~5.

Three compounding causes, all from having tuned exclusively for the saturated-home case:

1. Cap can't grow. AIMD exits slow-start permanently on the first Decrease, then grows +1 per 32-observation window. On a link with a steady ~4% close-group-exhaustion trickle, intermittent Decreases fire often enough that additive +1 never gets ahead, giving an equilibrium of ~20. Additive growth simply cannot reach a useful cap from a low base before a file finishes.

   Fix: add `LimiterConfig::slow_start_ramp_threshold`. Below it, a Decrease still halves the cap but keeps slow-start armed, so the next healthy window doubles back up instead of crawling. The fetch channel sets it to the channel ceiling, so download concurrency tracks the connection's real capacity. Default 0 preserves the original behaviour for quote/store.

2. The p95-latency Decrease misfires on fetch. `chunk_get_observed`'s latency includes the internal 1 s retry sleep and the slow retry sweep for chunks that needed one, so a window with a couple of retry-path chunks has a wildly inflated p95 that reads as congestion.

   Fix: add `LimiterConfig::latency_decrease_enabled`, false for fetch. Genuine fetch congestion still surfaces via the Ok(None) -> Timeout rate, which the timeout_ceiling check catches.

3. The deferred-retry pass was a throughput sink. It retried deferred chunks SERIALLY with a mandatory 10 s pre-sleep each; a batch that deferred ~20 chunks burned minutes of near-zero throughput even though every chunk succeeded on its first retry (the 10 s sleep was pure waste, since the deferrals were peer-side noise that clears in <1 s).

   Fix: retry deferred chunks in CONCURRENT rounds reusing the fetch limiter, with the first round firing immediately and later rounds backing off (0/15/45 s) only for chunks that survive a round. Both Ok(None) and transient errors re-defer to the next round; only the final round's leftovers are fatal.

Quote and store channel behaviour is unchanged (threshold 0, latency-decrease enabled). New unit tests cover protected-vs-additive recovery, the disabled latency check, and that the controller applies the download tuning to fetch only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
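The slow-start protection from cause 1 can be sketched as a toy limiter. This is a simplified model, not the real adaptive.rs: `slow_start_ramp_threshold`, `left_slow_start`, `max_concurrency`, the halving on Decrease and the +1 additive step come from this thread, while `min_concurrency`, the method names, and the reduction of a 32-observation window to a single call are assumptions.

```rust
/// Simplified limiter configuration. `min_concurrency` is assumed.
#[derive(Debug, Clone)]
pub struct LimiterConfig {
    pub min_concurrency: usize,
    pub max_concurrency: usize,
    /// 0 reproduces the original behaviour (first Decrease exits slow-start).
    pub slow_start_ramp_threshold: usize,
}

#[derive(Debug)]
pub struct Limiter {
    pub cfg: LimiterConfig,
    pub current: usize,
    pub left_slow_start: bool,
}

impl Limiter {
    /// Expects `min_concurrency <= max_concurrency`.
    pub fn new(cfg: LimiterConfig, start: usize) -> Self {
        let current = start.clamp(cfg.min_concurrency, cfg.max_concurrency);
        Self { cfg, current, left_slow_start: false }
    }

    /// A Decrease always halves the cap, but only exits slow-start once
    /// the cap has reached the ramp threshold.
    pub fn on_decrease(&mut self) {
        if self.current >= self.cfg.slow_start_ramp_threshold {
            self.left_slow_start = true;
        }
        self.current = (self.current / 2).max(self.cfg.min_concurrency);
    }

    /// One healthy window: double while slow-start is armed, else add 1.
    pub fn on_healthy_window(&mut self) {
        let next = if self.left_slow_start {
            self.current + 1
        } else {
            self.current.saturating_mul(2)
        };
        self.current = next.min(self.cfg.max_concurrency);
    }
}
```

In this model, a limiter with threshold 256 that is halved from 64 to 32 returns to 64 after one healthy window. With the default threshold of 0 it only reaches 33.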

…ceiling

The previous commit added slow-start protection so the fetch cap could double back up after transient Decreases, but a re-test on the same fast-but-lossy VPS showed the cap still pinned in the ~6-26 range, growing only additively.

Root cause: `warm_start` unconditionally set `left_slow_start = true`. Every `ant file download` is a fresh process that warm-starts the controller from the persisted snapshot. With warm_start always exiting slow-start, the protection only ever applied to the very first (cold-start) file; every subsequent file began with slow-start already exited and could only grow the fetch cap by +1 per 32-observation window. Additive growth from a low warm value cannot climb to the ceiling against the connection's intermittent close-group-exhaustion trickle, so the cap drifted down across files (observed: 15, 8, 26, 20, 14, 12, 12, 9, 6...) and 2.5 GB files stayed at ~17-23 min.

Fix: warm_start now sets `left_slow_start = clamped >= slow_start_ramp_threshold`. For quote/store (threshold 0) this is unchanged: it always exits, so a learned warm value isn't doubled on the first healthy window. For fetch (threshold == ceiling) a warm value below the ceiling keeps slow-start armed, so the cap doubles back toward the connection's real capacity instead of crawling.

This was the missing piece: the in-process slow-start protection and the cross-process warm-start path have to agree, or the multi-file CLI pattern silently defeats the protection. Added a regression test covering both the protected (doubles after warm_start) and default (stays additive) channels.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
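The warm-start rule from this commit reduces to a single comparison. The sketch below is hedged: the expression `clamped >= slow_start_ramp_threshold` is quoted from the commit message, but the `WarmState` struct, the free-function shape, and the parameter names are hypothetical.

```rust
/// Hypothetical result of warm-starting a limiter from a persisted cap.
#[derive(Debug, PartialEq, Eq)]
pub struct WarmState {
    pub current: usize,
    pub left_slow_start: bool,
}

/// Clamp the persisted cap into range, then decide whether slow-start
/// stays armed. Expects `min <= max`.
pub fn warm_start(
    snapshot_cap: usize,
    min: usize,
    max: usize,
    slow_start_ramp_threshold: usize,
) -> WarmState {
    let clamped = snapshot_cap.clamp(min, max);
    WarmState {
        current: clamped,
        // Threshold 0 (quote/store) always exits slow-start; a higher
        // threshold keeps it armed for any warm value below it.
        left_slow_start: clamped >= slow_start_ramp_threshold,
    }
}
```

For a persisted cap of 20, threshold 0 yields `left_slow_start == true` (additive growth), while threshold 256 yields `false`, so the next healthy window can double the cap.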

Sorry, codex found some issues

mickvandijke

Findings:

P1 ant-core/src/data/client/file.rs:1982: the first download pass still aborts on any Err from chunk_get_observed; only Ok(None) enters deferred retry. Since ant-core/src/data/client/chunk.rs:371 propagates an initial close_group_peers/DHT failure before its retry path, one transient DHT/network error for one chunk can still fail a large file download. Retryable first-pass errors should be deferred too, or chunk_get should retry the initial close-group lookup before returning Err.
P2 ant-core/src/data/client/adaptive.rs:576: fetch slow-start protection is lost when the cap is at the ceiling. Fetch sets slow_start_ramp_threshold = max_concurrency, but apply_decision marks left_slow_start = true before halving, so 256 -> 128 recovers by +1 windows, not doubling. That contradicts the PR intent that the next healthy window doubles it back, and can re-pin fast-but-lossy links after a single transient decrease.
P2 ant-core/src/data/client/chunk.rs:68: authoritative NotFound only checks not_found == queried. Because ant-core/src/data/client/mod.rs:472 accepts any non-empty DHT result, a thin lookup like 1/1 NotFound or 3/3 NotFound becomes final absence with no retry, even though the real replica majority may be outside that under-sampled view. Require a full close group, or at least CLOSE_GROUP_MAJORITY, before treating NotFound as authoritative.

Verification:

cargo test -p ant-core adaptive
cargo test -p ant-core data::client::chunk
cargo test -p ant-core data::client::file
cargo fmt --all -- --check
cargo clippy -p ant-core --all-targets --all-features -- -D warnings

@mickvandijke

…slow-start, require well-sampled NotFound

Addresses three findings from @mickvandijke's review:

P1 (file.rs / chunk.rs): a transient error on a single chunk's INITIAL close-group lookup could still fail a whole download. chunk_get took `chunk_get_try_close_group(...).await?`, so an error from close_group_peers (e.g. a momentary DHT/InsufficientPeers failure) propagated before the retry path, and the main download pass aborted on any Err from chunk_get_observed. Two changes:

- chunk_get now treats a first-attempt lookup error as a non-authoritative miss (zeroed outcome) and falls through to its retry path instead of propagating.
- the main streaming-decrypt pass defers a chunk on Err (same as Ok(None)) rather than aborting; only a chunk that survives all deferred retry rounds is fatal.

P2 (adaptive.rs): fetch slow-start protection was lost at the ceiling. With `slow_start_ramp_threshold = max_concurrency`, a Decrease while the cap sat at the ceiling satisfied `current >= threshold` and exited slow-start, so 256 -> 128 recovered by +1 windows instead of doubling, re-pinning a fast-but-lossy link after a single transient decrease. Set the fetch threshold to `usize::MAX` so slow-start never exits and the cap always doubles back.

P3 (chunk.rs): authoritative NotFound only required `not_found == queried`, so a thin/under-sampled DHT walk (close_group_peers accepts any non-empty result) returning 1/1 or 3/3 NotFound was treated as final absence with no retry, even though the real replica majority may lie outside that narrow view. Now also requires `queried >= CLOSE_GROUP_MAJORITY`, so an under-sampled walk falls through to the retry (which re-walks the DHT).

Tests: updated the NotFound test for the well-sampled requirement; added a ceiling slow-start regression test (MAX-threshold out-recovers a max_concurrency-threshold limiter after stress at the ceiling). 301 ant-core unit tests pass; fmt and clippy --all-targets -D warnings clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jacderida · 2026-05-26T16:37:14Z

Thanks @mickvandijke — all three addressed in c5823a6.

P1 (first-pass error aborts a large download) — fixed at both layers you noted:

chunk_get no longer propagates an error from its initial chunk_get_try_close_group call. A first-attempt lookup/transport error (incl. the close_group_peers DHT/InsufficientPeers failure) is now treated as a non-authoritative miss (a zeroed outcome) and falls through to the existing retry path.
The main streaming-decrypt pass now defers a chunk on Err exactly as it does on Ok(None), so one transient error can't abort the batch. Only a chunk that survives all deferred retry rounds is fatal.
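The round-based deferral can be sketched as follows. This is a deliberately simplified, synchronous model: the real pass runs each round concurrently under the fetch limiter, which is omitted here. The 0/15/45 s backoffs and the rule that both Ok(None) and Err re-defer come from this PR; `retry_deferred`, its closure parameters, and the use of `u64` chunk ids are hypothetical.

```rust
use std::time::Duration;

/// Backoff before each retry round, as described in the PR (0/15/45 s).
pub const ROUND_BACKOFFS: [Duration; 3] = [
    Duration::from_secs(0),
    Duration::from_secs(15),
    Duration::from_secs(45),
];

/// Retry deferred chunk ids in rounds. Returns the recovered chunks and
/// the ids still missing after the final round (the fatal leftovers).
pub fn retry_deferred<T, E>(
    mut deferred: Vec<u64>,
    mut fetch: impl FnMut(u64) -> Result<Option<T>, E>,
    mut sleep: impl FnMut(Duration),
) -> (Vec<(u64, T)>, Vec<u64>) {
    let mut recovered = Vec::new();
    for backoff in ROUND_BACKOFFS {
        if deferred.is_empty() {
            break;
        }
        // The first round fires immediately; later rounds back off.
        if !backoff.is_zero() {
            sleep(backoff);
        }
        let mut still_deferred = Vec::new();
        for id in deferred {
            match fetch(id) {
                Ok(Some(chunk)) => recovered.push((id, chunk)),
                // Misses and transient errors both go to the next round.
                Ok(None) | Err(_) => still_deferred.push(id),
            }
        }
        deferred = still_deferred;
    }
    (recovered, deferred)
}
```

Passing `sleep` as a closure keeps the sketch testable without waiting. A chunk that recovers in the first round never pays a backoff, which is the throughput point made in the second commit.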

P2 (slow-start protection lost at the ceiling) — fixed. You're right that threshold = max_concurrency exits slow-start at the ceiling (current >= threshold holds at 256), so 256→128 recovered additively. The fetch threshold is now usize::MAX, so slow-start never exits and the cap doubles back after a Decrease at any level. Added a regression test (slow_start_stays_armed_at_ceiling_with_max_threshold) that pins MAX-threshold out-recovering a max_concurrency-threshold limiter after stress at the ceiling.
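The ceiling case can be checked with a little arithmetic. The sketch assumes the exit condition is exactly `current >= threshold`, as quoted above, and that each healthy window either doubles the cap or adds 1; both helper names are hypothetical.

```rust
/// From the PR: additive growth adds +1 per 32-observation window.
pub const OBSERVATIONS_PER_WINDOW: usize = 32;

/// Exit condition quoted in the review: a Decrease exits slow-start once
/// the cap has reached the ramp threshold.
pub fn decrease_exits_slow_start(current: usize, slow_start_ramp_threshold: usize) -> bool {
    current >= slow_start_ramp_threshold
}

/// Healthy windows needed to grow the cap from `from` back up to `to`.
pub fn windows_to_recover(from: usize, to: usize, slow_start_armed: bool) -> usize {
    let mut cap = from.max(1);
    let mut windows = 0;
    while cap < to {
        cap = if slow_start_armed { cap.saturating_mul(2) } else { cap + 1 };
        windows += 1;
    }
    windows
}
```

With a threshold of 256, a Decrease at cap 256 exits slow-start, and climbing from 128 back to 256 takes 128 healthy windows, or 4096 observations. With `usize::MAX` the exit condition never holds and the same recovery takes one window.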

P3 (under-sampled NotFound treated as authoritative) — fixed. is_authoritative_not_found now also requires queried >= CLOSE_GROUP_MAJORITY, so a thin 1/1 or 3/3 NotFound from an under-sampled DHT walk falls through to the retry (which re-walks the DHT) instead of being declared final absence. Updated the test accordingly.
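The tightened predicate might look like the sketch below. The two conditions are taken from the description above; the value of `CLOSE_GROUP_MAJORITY` is an assumption (a simple majority of the K=7 close group), and the real function signature in chunk.rs may differ.

```rust
/// Assumed value: a simple majority of a K=7 close group. The actual
/// constant in ant-core may be defined differently.
pub const CLOSE_GROUP_MAJORITY: usize = 4;

/// NotFound is final only when the walk was well sampled and every
/// queried peer answered NotFound.
pub fn is_authoritative_not_found(not_found: usize, queried: usize) -> bool {
    queried >= CLOSE_GROUP_MAJORITY && not_found == queried
}
```

Under this assumption, a 1/1 or 3/3 result is not authoritative and falls through to the retry, while 7/7 is treated as genuine absence.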

Verification (your list):

cargo test -p ant-core adaptive ✓
cargo test -p ant-core data::client::chunk ✓
cargo test -p ant-core data::client::file ✓ (301 lib tests pass overall)
cargo fmt --all -- --check ✓
cargo clippy -p ant-core --all-targets --all-features -- -D warnings ✓

jacderida force-pushed the home-download-fixes branch from e8b1de5 to 13070d9 May 22, 2026 17:17

jacderida and others added 2 commits May 24, 2026 22:48

mickvandijke previously approved these changes May 25, 2026


mickvandijke requested changes May 26, 2026


jacderida changed the base branch from main to rc-2026.5.4 May 26, 2026 16:27

jacderida merged commit 8e2bb6b into WithAutonomi:rc-2026.5.4 May 26, 2026
12 checks passed

jacderida merged 4 commits into WithAutonomi:rc-2026.5.4 from jacderida:home-download-fixes
