fix(merkle): retry quorum shortfalls instead of aborting the file upload by jacderida · Pull Request #96 · WithAutonomi/ant-client

jacderida · 2026-05-24T17:40:54Z

Problem

The external-signer merkle upload path (merkle_upload_chunks) aborted the entire file on the first chunk that returned InsufficientPeers, and never retried.

Because all chunks in a batch share one winner_pool, the few nodes whose routing tables disagree about that pool's midpoint reject it consistently for every chunk whose close group includes them — turning a transient, self-healing quorum shortfall into a fatal partial upload. (Observed: file 3 → 3/4 stores, file 4 → 2/4 stores; a node that rejected a chunk at 20:30 accepted the identical chunk+proof at 20:35 once its routing table converged.)

Changes

C2.1 — stop the whole-file abort. Per-chunk InsufficientPeers failures are now collected rather than ?-propagated. Non-quorum errors (e.g. a missing proof, or a chunk/address count mismatch) stay fatal.
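The collect-versus-propagate split in C2.1 can be sketched as below. This is a minimal illustration, not the code from the PR: `StoreError`, `store_once`, and the `u32` chunk ids are hypothetical stand-ins for the real ant-core types.

```rust
// Hypothetical error type for illustration; the real ant-core error enum
// has more variants than the two shown here.
#[derive(Debug, Clone, PartialEq)]
enum StoreError {
    InsufficientPeers(String),
    MissingProof(String),
}

/// Try to store every chunk once.
/// Quorum shortfalls are collected; any other error aborts the whole call.
fn store_once<F>(chunks: &[u32], mut put: F) -> Result<(Vec<u32>, Vec<u32>), StoreError>
where
    F: FnMut(u32) -> Result<(), StoreError>,
{
    let mut stored = Vec::new();
    let mut failed = Vec::new();
    for &chunk in chunks {
        match put(chunk) {
            Ok(()) => stored.push(chunk),
            // Transient quorum shortfall: remember the chunk and keep going.
            Err(StoreError::InsufficientPeers(_)) => failed.push(chunk),
            // Non-quorum errors stay fatal and propagate to the caller.
            Err(other) => return Err(other),
        }
    }
    Ok((stored, failed))
}
```

The caller receives the failed set as data and can decide whether to retry it, rather than losing the whole file to the first `?`.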

C2.2 — bounded same-pool retry. The failed set is retried with the same reusable proofs across a total budget of MERKLE_STORE_MAX_ATTEMPTS attempts (initial + 3 retries, matching the wave path's 0..=MAX_RETRIES contract and the 4-slot retries_histogram), re-collecting each chunk's close group per attempt so a converged routing table can yield a fresh group. No re-payment, no new pool. The backoff (~30s) is jittered ±10% so a large failed set doesn't re-probe the same divergent nodes in lockstep. The retry round a chunk lands on is recorded in retries_histogram[round] (previously vestigial — always [0] += 1).
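As a rough sketch of the ±10% jitter described above: the function below scales a base backoff by a factor in [0.9, 1.1]. The name `jittered_backoff` and the `unit` parameter are assumptions made for illustration; `unit` stands in for a random draw in [0, 1] so the example needs no external crate, whereas the PR itself draws from an RNG.

```rust
use std::time::Duration;

/// Scale `base` by a factor in [0.9, 1.1].
/// `unit` is expected in [0.0, 1.0] and is clamped if it falls outside;
/// it stands in for a random draw (an assumption of this sketch).
fn jittered_backoff(base: Duration, unit: f64) -> Duration {
    let unit = unit.clamp(0.0, 1.0);
    // unit = 0.0 gives -10%, unit = 1.0 gives +10%.
    let factor = 0.9 + 0.2 * unit;
    base.mul_f64(factor)
}
```

With a 30s base, the resulting delay falls between roughly 27s and 33s, so chunks retried by different uploads are less likely to hit the same nodes at the same instant.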

The store loop is extracted into merkle_store_with_retry, a seam unit-tested for:

collect-not-abort
non-quorum-error-stays-fatal
retry-only-the-failed-set
retry-success-counted-once + recorded in retries_histogram[1]
exhausted-retry-budget (all fail → reported, not propagated)
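The behaviours in that list can be illustrated with a synchronous sketch of a retry seam. Everything here is hypothetical (`PutError`, `Outcome`, `store_with_retry`, `MAX_ATTEMPTS`, the `u32` chunk ids); the real `merkle_store_with_retry` is async, re-collects close groups, and sleeps for the backoff between rounds.

```rust
// Total attempt budget: initial attempt + 3 retries, as stated in the PR text.
const MAX_ATTEMPTS: usize = 4;

// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum PutError {
    InsufficientPeers,
    Fatal(String),
}

// Hypothetical outcome type, loosely modelled on the described MerkleStoreOutcome.
#[derive(Debug, Default, PartialEq)]
struct Outcome {
    stored: usize,
    failed: usize,
    retries_histogram: [usize; 4],
}

/// Retry only the chunks that fell short of quorum, up to MAX_ATTEMPTS attempts.
/// `put` receives the chunk id and the zero-based attempt number.
fn store_with_retry<F>(chunks: Vec<u32>, mut put: F) -> Result<Outcome, PutError>
where
    F: FnMut(u32, usize) -> Result<(), PutError>,
{
    let mut outcome = Outcome::default();
    let mut pending = chunks;
    for attempt in 0..MAX_ATTEMPTS {
        if pending.is_empty() {
            break;
        }
        // A real client would wait for the jittered backoff here when attempt > 0.
        let mut still_failed = Vec::new();
        for chunk in pending {
            match put(chunk, attempt) {
                Ok(()) => {
                    // Count the success once, in the slot for the round it landed on.
                    outcome.stored += 1;
                    let last = outcome.retries_histogram.len().saturating_sub(1);
                    outcome.retries_histogram[attempt.min(last)] += 1;
                }
                Err(PutError::InsufficientPeers) => still_failed.push(chunk),
                // Non-quorum errors stay fatal.
                Err(fatal) => return Err(fatal),
            }
        }
        pending = still_failed;
    }
    // Whatever is still pending after the budget is reported, not propagated.
    outcome.failed = pending.len();
    Ok(outcome)
}
```

Passing the attempt number to `put` keeps the sketch easy to drive from a test: a closure can fail a chunk on round 0 and accept it on round 1, which is the shape of the production incident described in the Problem section.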

Honest failure reporting. merkle_upload_chunks returns a MerkleStoreOutcome { stored, failed, stats }. file.rs reports honest chunks_failed/total_chunks. The data.rs path returns a hard InsufficientPeers error on a residual shortfall rather than a success with an undownloadable data map (DataUploadResult cannot express a partial store).

Layers on top of the merkle preflight (#94): merkle_upload_chunks takes the preflight's stored_offset/total_chunks so progress and the returned stored count reflect the whole file.

Scope

Client-side only, no wire change → takes effect as soon as the uploader binary is rebuilt; no fleet upgrade required.

Not included (follow-ups): eager full-close-group fan-out (C2.3), over-fetch beyond CLOSE_GROUP_SIZE (C2.4), and the gated --merkle-repay-stragglers re-pay escalation.

Testing

cargo check -p ant-core is clean on rc-2026.5.4. Merkle unit tests (incl. the 5 new ones) pass. cargo clippy --all-targets and cargo test do not build locally on this base because of a pre-existing split between two saorsa-core versions (git ant-protocol vs crates.io ant-node dev-dep) that clash in node/devnet.rs; this is unrelated to this change.

🤖 Generated with Claude Code

dirvine

Review Summary

Overall: Solid fix for the exact problem Chris observed (files 3 and 4 partially uploaded). The collect-not-abort + bounded retry pattern is clean, well-tested, and brings the external-signer merkle path to parity with the wave path in batch.rs.

What I like

Collect-not-abort (C2.1): Per-chunk InsufficientPeers failures are collected rather than fatal — this was the root cause of partial uploads
Bounded retry (C2.2): Same proof, fresh close group, 3 rounds × 30s backoff — well-chosen parameters that balance recovery speed with network load
Honest reporting: chunks_failed/total_chunks instead of hardcoded zero
4 unit tests covering: collect-not-abort, non-quorum-error-fatal, retry-only-failed-set, retry-counting-in-histogram
Client-side only, no wire change → safe to deploy without fleet upgrade

CI status

No checks reported on this branch. May need to verify the RC branch CI pipeline is configured for this repo.

👍 No blockers

Approving. The retry logic interacts correctly with the other PRs in this stack:

Reachability-aware close-group selection (saorsa-core #121) → retried chunks benefit from better peer selection
Local-table-first closeness check (ant-node #111) → matters for storer side, not client side

One minor suggestion: consider making MERKLE_STORE_MAX_ATTEMPTS and MERKLE_RETRY_BACKOFF configurable (env var or CLI flag) for operational flexibility, but this is non-blocking.

grumbach

One real major + some minor + nits. Not blocking, but the data.rs thing is worth fixing before merge.

major ant-core/src/data/client/data.rs:128-137 — merkle_upload_chunks can now return Ok with outcome.failed > 0, but this path swallows it: a warn! is logged and DataUploadResult { chunks_stored: outcome.stored } is returned as success. DataUploadResult has no chunks_failed/total_chunks field at all (unlike FileUploadResult), so a caller that uploaded N chunks and only stored N-k cannot tell — and the data_map they get back will fail to download. Either return Err(InsufficientPeers) when outcome.failed > 0 here, or extend DataUploadResult with chunks_failed/total_chunks so callers can decide. The file.rs path is honest about it; data.rs regressed.
minor (naming/docs) ant-core/src/data/client/merkle.rs:574-580 — MERKLE_STORE_MAX_ATTEMPTS = 3 is the total attempt budget (initial + 2 retries) because the helper iterates for attempt in 0..attempts. The PR body advertises "up to 3 rounds" of retry, and retries_histogram: [usize; 4] is sized for 3 retries (the wave path's contract). Net effect: index 3 of the histogram is unreachable on this path, and the merkle path retries one fewer time than the wave path. Either rename the constant to MERKLE_STORE_ATTEMPT_BUDGET and align docs/PR body, or bump it to 4 to match the histogram and the wave path's semantics.
minor (thundering herd) ant-core/src/data/client/merkle.rs retry loop — fixed 30s backoff, no jitter, and the entire failed set re-fires at full store_concurrency on the next round. For a 178-chunk file with dozens of shortfalls (the prod case in the PR description), every retry hammers the same midpoint-disagreeing nodes in lockstep. Add small jitter (±10%) and consider staggering the round start.
minor (test coverage gap) — no case exercises "all chunks still short after the final retry" (outcome.stored == 0 && outcome.failed == total), and the data.rs/file.rs integration of MerkleStoreOutcome is untested (a regression that re-introduces the silent-success in data.rs wouldn't fail any test). Add at least the exhausted-retries case to the helper tests.
nit ant-core/src/data/client/merkle.rs — let idx = attempt.min(outcome.stats.retries_histogram.len() - 1); is panic-safe only because the array length is a literal 4 > 0; if anyone reduces the array size to 0 this underflows. Use attempt.min(3) against a named const for defence-in-depth.

Positives: proof reuse is correctly idempotent (no re-pay, just re-PUT of an already-paid chunk that the storers will dedupe); the helper is a clean seam with focused unit tests; layers cleanly on PR #84 (cached MerkleBatchPaymentResult is exactly the input to this retry loop) and PR #94 (preflight already narrowed the chunk set); brings the external-signer path to parity with the wave path's collect-and-retry behaviour as advertised.

The external-signer merkle upload path (`merkle_upload_chunks`) aborted the entire file on the first chunk that returned `InsufficientPeers`, and never retried. Because all chunks in a batch share one `winner_pool`, the few nodes whose routing tables disagree about that pool's midpoint reject it consistently for every chunk whose close group includes them, turning a transient, self-healing shortfall into a fatal partial upload.

C2.1: collect per-chunk `InsufficientPeers` failures rather than `?`-propagating them; non-quorum errors (e.g. a missing proof) stay fatal.

C2.2: retry the failed set with the same reusable proofs across a total budget of `MERKLE_STORE_MAX_ATTEMPTS` attempts (initial + 3 retries, matching the wave path), re-collecting each chunk's close group per attempt and applying a jittered 30s backoff between rounds so a large failed set does not re-probe the same divergent nodes in lockstep. The retry round a chunk lands on is recorded in `retries_histogram[round]`.

The store loop is extracted into `merkle_store_with_retry`, unit-tested for collect-not-abort, non-quorum-fatal, retry-only-the-failed-set, retry-counted-once, and exhausted-budget behaviour.

`file.rs` reports honest `chunks_failed`/`total_chunks`; the `data.rs` path returns a hard error on a residual shortfall rather than a success with an undownloadable data map (`DataUploadResult` cannot express a partial store).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jacderida · 2026-05-26T14:56:29Z

Thanks @grumbach — all five addressed, and rebased onto the (force-updated) rc-2026.5.4 so it layers cleanly on the merkle preflight (#94).

major — data.rs swallowing partial failure. Fixed. A residual shortfall after retries is now a hard error rather than a success with an undownloadable data map:
```
if outcome.failed > 0 {
    return Err(Error::InsufficientPeers(format!(
        "Data merkle upload incomplete: {} of {} chunk(s) short of quorum after retries",
        outcome.failed, chunk_count
    )));
}
```
Went with the hard-error option rather than extending DataUploadResult, since for data_upload a partial result is never usable (the data map won't resolve). file.rs keeps its honest chunks_failed/total_chunks as before.
minor — attempt budget vs histogram. Bumped MERKLE_STORE_MAX_ATTEMPTS to 4 (initial + 3 retries) to match the wave path's 0..=MAX_RETRIES contract and fully use the 4-slot retries_histogram (final retry now lands in retries_histogram[3]). Doc/PR body updated to say "attempts" not "rounds".
minor — thundering herd. Added ±10% jitter to the backoff so the failed set doesn't re-probe the same divergent nodes in lockstep. thread_rng is !Send, so the value is computed and the rng dropped before the await to keep the future Send (the _merkle_upload_chunks_is_send assertion still holds). Note the existing store-concurrency limiter already bounds in-flight PUTs per round, so I kept round-level jitter rather than per-chunk staggering — happy to add the latter if metrics later show it's needed.
minor — test gap. Added store_with_retry_reports_all_failed_when_retries_exhausted: all chunks fail through the full budget → Ok with stored == 0, failed == total, attempted exactly MERKLE_STORE_MAX_ATTEMPTS times, empty histogram.
nit — histogram index underflow. Now attempt.min(retries_histogram.len().saturating_sub(1)).
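On the histogram nit: `saturating_sub(1)` removes the subtraction underflow, but indexing a zero-length array at slot 0 would still be out of bounds. A hedged variant that makes the empty case explicit might look like this; `histogram_slot` is a hypothetical helper, not a function from the PR.

```rust
/// Clamp `attempt` to the last valid slot of a histogram with `len` slots.
/// Returns None when the histogram has no slots at all, so the caller
/// cannot index an empty array by accident.
fn histogram_slot(attempt: usize, len: usize) -> Option<usize> {
    if len == 0 {
        None
    } else {
        Some(attempt.min(len - 1))
    }
}
```

For the 4-slot histogram in this PR the result is always `Some`, with any attempt number above 3 folded into the last slot.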

One heads-up unrelated to this PR: cargo test/--all-features doesn't build locally on rc-2026.5.4 — the lockfile carries two saorsa-core versions (0.24.5-rc.1 via the git ant-protocol vs 0.24.4 via the crates.io ant-node dev-dep), which clash on MultiAddr in node/devnet.rs. The library itself is clean (cargo check -p ant-core passes); the merkle unit tests (incl. the new ones) passed against the pre-force-push base. Flagging in case CI hits the same dev-dep split.

`ci.yml` only triggered on `main`, so PRs targeting release-candidate branches (e.g. `rc-2026.5.4`) ran no fmt/clippy/test checks: the `pull_request` branch filter is matched against the PR's base branch. Add an `rc-*` glob to both the `push` and `pull_request` filters so every release-candidate branch is covered. Takes effect for the rc branch once this merges (GitHub reads the `pull_request` trigger from the base branch), and future rc cuts inherit it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CI on the `--all-features`/E2E/doc jobs failed to compile: `ant-protocol` is pinned to its git `rc-2026.5.4` branch (→ `saorsa-core 0.24.5-rc.1`) but `ant-node` was still on crates.io `0.11.4` (→ `saorsa-core 0.24.4`), so two incompatible `saorsa-core` versions clashed on `MultiAddr` in `node/devnet.rs`.

Point `ant-node` (runtime + dev dep) and the direct `saorsa-core` dev-dep at their `rc-2026.5.4` git branches so the whole graph resolves to a single `saorsa-core 0.24.5-rc.1`, matching `ant-protocol`. `Cargo.lock` updated to the current branch tips (saorsa-core 1be73520, which includes the `find_closest_nodes_local_by_distance` method ant-node's verifier needs).

Verified locally: `cargo clippy --all-targets --all-features`, `cargo doc --all-features`, and compilation of all e2e/merkle test targets now pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

dirvine approved these changes May 24, 2026


jacderida force-pushed the rc-2026.5.4 branch from fa9c2e9 to a95d338 Compare May 25, 2026 12:56

grumbach reviewed May 26, 2026


jacderida force-pushed the fix/merkle-client-retry branch from 23178ab to 5cad982 Compare May 26, 2026 14:44

jacderida force-pushed the fix/merkle-client-retry branch from 5cad982 to 88a1391 Compare May 26, 2026 14:55

jacderida and others added 2 commits May 26, 2026 16:12

jacderida merged commit 573476f into WithAutonomi:rc-2026.5.4 May 26, 2026
12 checks passed

jacderida merged 3 commits into WithAutonomi:rc-2026.5.4 from jacderida:fix/merkle-client-retry
