fix(merkle): retry quorum shortfalls instead of aborting the file upload#96
Conversation
dirvine
left a comment
There was a problem hiding this comment.
Review Summary
Overall: Solid fix for the exact problem Chris observed (files 3 and 4 partially uploaded). The collect-not-abort + bounded retry pattern is clean, well-tested, and brings the external-signer merkle path to parity with the wave path in batch.rs.
What I like
- Collect-not-abort (C2.1): Per-chunk
InsufficientPeersfailures are collected rather than fatal — this was the root cause of partial uploads - Bounded retry (C2.2): Same proof, fresh close group, 3 rounds × 30s backoff — well-chosen parameters that balance recovery speed with network load
- Honest reporting:
chunks_failed/total_chunksinstead of hardcoded zero - 4 unit tests covering: collect-not-abort, non-quorum-error-fatal, retry-only-failed-set, retry-counting-in-histogram
- Client-side only, no wire change → safe to deploy without fleet upgrade
CI status
No checks reported on this branch. May need to verify the RC branch CI pipeline is configured for this repo.
👍 No blockers
Approving. The retry logic interacts correctly with the other PRs in this stack:
- Reachability-aware close-group selection (saorsa-core #121) → retried chunks benefit from better peer selection
- Local-table-first closeness check (ant-node #111) → matters for storer side, not client side
One minor suggestion: consider making MERKLE_STORE_MAX_ATTEMPTS and MERKLE_RETRY_BACKOFF configurable (env var or CLI flag) for operational flexibility, but this is non-blocking.
grumbach
left a comment
There was a problem hiding this comment.
One real major + some minor + nits. Not blocking, but the data.rs thing is worth fixing before merge.
-
major
ant-core/src/data/client/data.rs:128-137—merkle_upload_chunkscan now returnOkwithoutcome.failed > 0, but this path swallows it: awarn!is logged andDataUploadResult { chunks_stored: outcome.stored }is returned as success.DataUploadResulthas nochunks_failed/total_chunksfield at all (unlikeFileUploadResult), so a caller that uploaded N chunks and only stored N-k cannot tell — and thedata_mapthey get back will fail to download. Either returnErr(InsufficientPeers)whenoutcome.failed > 0here, or extendDataUploadResultwithchunks_failed/total_chunksso callers can decide. Thefile.rspath is honest about it;data.rsregressed. -
minor (naming/docs)
ant-core/src/data/client/merkle.rs:574-580—MERKLE_STORE_MAX_ATTEMPTS = 3is the total attempt budget (initial + 2 retries) because the helper iteratesfor attempt in 0..attempts. The PR body advertises "up to 3 rounds" of retry, andretries_histogram: [usize; 4]is sized for 3 retries (the wave path's contract). Net effect: index 3 of the histogram is unreachable on this path, and the merkle path retries one fewer time than the wave path. Either rename the constant toMERKLE_STORE_ATTEMPT_BUDGETand align docs/PR body, or bump it to4to match the histogram and the wave path's semantics. -
minor (thundering herd)
ant-core/src/data/client/merkle.rsretry loop — fixed 30s backoff, no jitter, and the entire failed set re-fires at fullstore_concurrencyon the next round. For a 178-chunk file with dozens of shortfalls (the prod case in the PR description), every retry hammers the same midpoint-disagreeing nodes in lockstep. Add small jitter (±10%) and consider staggering the round start. -
minor (test coverage gap) — no case exercises "all chunks still short after the final retry" (
outcome.stored == 0 && outcome.failed == total), and thedata.rs/file.rsintegration ofMerkleStoreOutcomeis untested (a regression that re-introduces the silent-success indata.rswouldn't fail any test). Add at least the exhausted-retries case to the helper tests. -
nit
ant-core/src/data/client/merkle.rs—let idx = attempt.min(outcome.stats.retries_histogram.len() - 1);is panic-safe only because the array length is a literal4 > 0; if anyone reduces the array size to 0 this underflows. Useattempt.min(3)against a named const for defence-in-depth.
Positives: proof reuse is correctly idempotent (no re-pay, just re-PUT of an already-paid chunk that the storers will dedupe); the helper is a clean seam with focused unit tests; layers cleanly on PR #84 (cached MerkleBatchPaymentResult is exactly the input to this retry loop) and PR #94 (preflight already narrowed the chunk set); brings the external-signer path to parity with the wave path's collect-and-retry behaviour as advertised.
23178ab to
5cad982
Compare
The external-signer merkle upload path (`merkle_upload_chunks`) aborted the entire file on the first chunk that returned `InsufficientPeers`, and never retried. Because all chunks in a batch share one `winner_pool`, the few nodes whose routing tables disagree about that pool's midpoint reject it consistently for every chunk whose close group includes them — turning a transient, self-healing shortfall into a fatal partial upload. C2.1: collect per-chunk `InsufficientPeers` failures rather than `?`-propagating them; non-quorum errors (e.g. a missing proof) stay fatal. C2.2: retry the failed set with the same reusable proofs across a total budget of MERKLE_STORE_MAX_ATTEMPTS attempts (initial + 3 retries, matching the wave path), re-collecting each chunk's close group per attempt and applying a jittered 30s backoff between rounds so a large failed set does not re-probe the same divergent nodes in lockstep. The retry round a chunk lands on is recorded in `retries_histogram[round]`. The store loop is extracted into `merkle_store_with_retry`, unit-tested for collect-not-abort, non-quorum-fatal, retry-only-the-failed-set, retry-counted-once, and exhausted-budget behaviour. `file.rs` reports honest `chunks_failed`/`total_chunks`; the `data.rs` path returns a hard error on a residual shortfall rather than a success with an undownloadable data map (`DataUploadResult` cannot express a partial store). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5cad982 to
88a1391
Compare
|
Thanks @grumbach — all five addressed, and rebased onto the (force-updated)
One heads-up unrelated to this PR: |
`ci.yml` only triggered on `main`, so PRs targeting release-candidate branches (e.g. `rc-2026.5.4`) ran no fmt/clippy/test checks — the `pull_request` branch filter is matched against the PR's base branch. Add an `rc-*` glob to both the `push` and `pull_request` filters so every release-candidate branch is covered. Takes effect for the rc branch once this merges (GitHub reads the `pull_request` trigger from the base branch), and future rc cuts inherit it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI on the `--all-features`/E2E/doc jobs failed to compile: `ant-protocol` is pinned to its git `rc-2026.5.4` branch (→ `saorsa-core 0.24.5-rc.1`) but `ant-node` was still on crates.io `0.11.4` (→ `saorsa-core 0.24.4`), so two incompatible `saorsa-core` versions clashed on `MultiAddr` in `node/devnet.rs`. Point `ant-node` (runtime + dev dep) and the direct `saorsa-core` dev-dep at their `rc-2026.5.4` git branches so the whole graph resolves to a single `saorsa-core 0.24.5-rc.1`, matching `ant-protocol`. `Cargo.lock` updated to the current branch tips (saorsa-core 1be73520, which includes the `find_closest_nodes_local_by_distance` method ant-node's verifier needs). Verified locally: cargo clippy --all-targets --all-features, cargo doc --all-features, and compilation of all e2e/merkle test targets now pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Problem
The external-signer merkle upload path (
merkle_upload_chunks) aborted the entire file on the first chunk that returnedInsufficientPeers, and never retried.Because all chunks in a batch share one
winner_pool, the few nodes whose routing tables disagree about that pool's midpoint reject it consistently for every chunk whose close group includes them — turning a transient, self-healing quorum shortfall into a fatal partial upload. (Observed: file 3 → 3/4 stores, file 4 → 2/4 stores; a node that rejected a chunk at 20:30 accepted the identical chunk+proof at 20:35 once its routing table converged.)Changes
C2.1 — stop the whole-file abort. Per-chunk
InsufficientPeersfailures are now collected rather than?-propagated. Non-quorum errors (e.g. a missing proof, or a chunk/address count mismatch) stay fatal.C2.2 — bounded same-pool retry. The failed set is retried with the same reusable proofs across a total budget of
MERKLE_STORE_MAX_ATTEMPTSattempts (initial + 3 retries, matching the wave path's0..=MAX_RETRIEScontract and the 4-slotretries_histogram), re-collecting each chunk's close group per attempt so a converged routing table can yield a fresh group. No re-payment, no new pool. The backoff (~30s) is jittered ±10% so a large failed set doesn't re-probe the same divergent nodes in lockstep. The retry round a chunk lands on is recorded inretries_histogram[round](previously vestigial — always[0] += 1).The store loop is extracted into
merkle_store_with_retry, a seam unit-tested for:retries_histogram[1]Honest failure reporting.
merkle_upload_chunksreturns aMerkleStoreOutcome { stored, failed, stats }.file.rsreports honestchunks_failed/total_chunks. Thedata.rspath returns a hardInsufficientPeerserror on a residual shortfall rather than a success with an undownloadable data map (DataUploadResultcannot express a partial store).Layers on top of the merkle preflight (#94):
merkle_upload_chunkstakes the preflight'sstored_offset/total_chunksso progress and the returnedstoredcount reflect the whole file.Scope
Client-side only, no wire change → takes effect as soon as the uploader binary is rebuilt; no fleet upgrade required.
Not included (follow-ups): eager full-close-group fan-out (C2.3), over-fetch beyond
CLOSE_GROUP_SIZE(C2.4), and the gated--merkle-repay-stragglersre-pay escalation.Testing
cargo check -p ant-coreclean onrc-2026.5.4. Merkle unit tests (incl. the 5 new ones) pass.cargo clippy --all-targets/cargo testdoes not build locally on this base due to a pre-existing two-saorsa-core-version split (gitant-protocolvs crates.ioant-nodedev-dep) clashing innode/devnet.rs— unrelated to this change.🤖 Generated with Claude Code