
fix(merkle): retry quorum shortfalls instead of aborting the file upload #96

Merged
jacderida merged 3 commits into WithAutonomi:rc-2026.5.4 from jacderida:fix/merkle-client-retry
May 26, 2026


@jacderida
Contributor

@jacderida jacderida commented May 24, 2026

Problem

The external-signer merkle upload path (merkle_upload_chunks) aborted the entire file on the first chunk that returned InsufficientPeers, and never retried.

Because all chunks in a batch share one winner_pool, the few nodes whose routing tables disagree about that pool's midpoint reject it consistently for every chunk whose close group includes them — turning a transient, self-healing quorum shortfall into a fatal partial upload. (Observed: file 3 → 3/4 stores, file 4 → 2/4 stores; a node that rejected a chunk at 20:30 accepted the identical chunk+proof at 20:35 once its routing table converged.)

Changes

C2.1 — stop the whole-file abort. Per-chunk InsufficientPeers failures are now collected rather than ?-propagated. Non-quorum errors (e.g. a missing proof, or a chunk/address count mismatch) stay fatal.
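The C2.1 rule can be sketched as a small classification step. This is a minimal illustration under stated assumptions: `StoreError`, `RoundResult`, and `collect_round` are hypothetical names invented for the example, and chunks are identified by plain indices rather than the real ant-core types.

```rust
/// Hypothetical, simplified error type; the real ant-core error has more variants.
#[derive(Debug, Clone, PartialEq)]
enum StoreError {
    /// Quorum shortfall: treated as transient and collected for a later retry.
    InsufficientPeers(String),
    /// Any other failure (e.g. a missing proof): stays fatal.
    Fatal(String),
}

/// Result of one pass over a set of chunks, identified here by index.
#[derive(Debug, Default, PartialEq)]
struct RoundResult {
    stored: Vec<usize>,
    short_of_quorum: Vec<usize>,
}

/// Sort per-chunk results into stored and short-of-quorum sets.
/// Only a non-quorum error aborts the whole pass.
fn collect_round(
    results: Vec<(usize, Result<(), StoreError>)>,
) -> Result<RoundResult, StoreError> {
    let mut round = RoundResult::default();
    for (chunk, result) in results {
        match result {
            Ok(()) => round.stored.push(chunk),
            Err(StoreError::InsufficientPeers(_)) => round.short_of_quorum.push(chunk),
            Err(fatal) => return Err(fatal),
        }
    }
    Ok(round)
}
```

The point of the sketch is the third match arm: a quorum shortfall never reaches it, so one short chunk no longer ends the pass for the chunks that follow.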

C2.2 — bounded same-pool retry. The failed set is retried with the same reusable proofs across a total budget of MERKLE_STORE_MAX_ATTEMPTS attempts (initial + 3 retries, matching the wave path's 0..=MAX_RETRIES contract and the 4-slot retries_histogram), re-collecting each chunk's close group per attempt so a converged routing table can yield a fresh group. No re-payment, no new pool. The backoff (~30s) is jittered ±10% so a large failed set doesn't re-probe the same divergent nodes in lockstep. The retry round a chunk lands on is recorded in retries_histogram[round] (previously vestigial — always [0] += 1).
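The ±10% jitter described above might look roughly like the following. This is only a sketch: `BASE_BACKOFF`, `JITTER_FRACTION`, and `jittered_backoff` are hypothetical names, and the random sample is passed in by the caller (rather than drawn from an rng) so the example needs nothing beyond the standard library.

```rust
use std::time::Duration;

/// Base delay between rounds; the PR describes it as roughly 30s.
const BASE_BACKOFF: Duration = Duration::from_secs(30);
/// Maximum jitter as a fraction of the base delay (plus or minus 10%).
const JITTER_FRACTION: f64 = 0.10;

/// Scale `base` by a factor in [1 - JITTER_FRACTION, 1 + JITTER_FRACTION].
/// `unit` is a random sample in [0, 1) supplied by the caller; out-of-range
/// input is clamped and a non-finite value falls back to the midpoint.
fn jittered_backoff(base: Duration, unit: f64) -> Duration {
    let unit = if unit.is_finite() { unit.clamp(0.0, 1.0) } else { 0.5 };
    let factor = 1.0 - JITTER_FRACTION + 2.0 * JITTER_FRACTION * unit;
    base.mul_f64(factor)
}
```

With a 30s base this yields delays between about 27s and 33s, which is enough spread that a large failed set does not re-probe the same nodes at the same instant.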

The store loop is extracted into merkle_store_with_retry, a seam unit-tested for:

  • collect-not-abort
  • non-quorum-error-stays-fatal
  • retry-only-the-failed-set
  • retry-success-counted-once + recorded in retries_histogram[1]
  • exhausted-retry-budget (all fail → reported, not propagated)
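The behaviours in that list can be pictured with a synchronous stand-in for the helper. Everything here is an assumption made for illustration: `store_with_retry`, `Outcome`, `MAX_ATTEMPTS`, and the `try_store` callback are hypothetical, chunks are plain integers, and the real `merkle_store_with_retry` is async and works with proofs and close groups.

```rust
/// Total attempt budget: the initial attempt plus three retries.
const MAX_ATTEMPTS: usize = 4;

#[derive(Debug, Default, PartialEq)]
struct Outcome {
    stored: usize,
    failed: usize,
    /// retries_histogram[n] counts chunks that were stored on attempt n.
    retries_histogram: [usize; MAX_ATTEMPTS],
}

/// Try to store every chunk, retrying only those that fall short of quorum.
/// `try_store(chunk, attempt)` returns Ok(true) when stored, Ok(false) on a
/// quorum shortfall, and Err(_) for anything that must stay fatal.
fn store_with_retry<F>(chunks: &[u32], mut try_store: F) -> Result<Outcome, String>
where
    F: FnMut(u32, usize) -> Result<bool, String>,
{
    let mut outcome = Outcome::default();
    let mut pending: Vec<u32> = chunks.to_vec();
    for attempt in 0..MAX_ATTEMPTS {
        if pending.is_empty() {
            break;
        }
        // A real client would wait for the jittered backoff here when attempt > 0.
        let mut still_short = Vec::new();
        for chunk in pending {
            if try_store(chunk, attempt)? {
                outcome.stored += 1;
                outcome.retries_histogram[attempt] += 1;
            } else {
                still_short.push(chunk);
            }
        }
        pending = still_short;
    }
    // Whatever is still pending after the budget is reported, not propagated.
    outcome.failed = pending.len();
    Ok(outcome)
}
```

A chunk is counted once, in the slot of the attempt that stored it, and a chunk that never stores leaves the histogram untouched and shows up only in `failed`.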

Honest failure reporting. merkle_upload_chunks returns a MerkleStoreOutcome { stored, failed, stats }. file.rs reports honest chunks_failed/total_chunks. The data.rs path returns a hard InsufficientPeers error on a residual shortfall rather than a success with an undownloadable data map (DataUploadResult cannot express a partial store).

Layers on top of the merkle preflight (#94): merkle_upload_chunks takes the preflight's stored_offset/total_chunks so progress and the returned stored count reflect the whole file.

Scope

Client-side only, no wire change → takes effect as soon as the uploader binary is rebuilt; no fleet upgrade required.

Not included (follow-ups): eager full-close-group fan-out (C2.3), over-fetch beyond CLOSE_GROUP_SIZE (C2.4), and the gated --merkle-repay-stragglers re-pay escalation.

Testing

cargo check -p ant-core is clean on rc-2026.5.4. Merkle unit tests (incl. the 5 new ones) pass. cargo clippy --all-targets and cargo test do not build locally on this base because of a pre-existing split between two saorsa-core versions (git ant-protocol vs crates.io ant-node dev-dep) that clash in node/devnet.rs; this is unrelated to this change.

🤖 Generated with Claude Code


@dirvine dirvine left a comment


Review Summary

Overall: Solid fix for the exact problem Chris observed (files 3 and 4 partially uploaded). The collect-not-abort + bounded retry pattern is clean, well-tested, and brings the external-signer merkle path to parity with the wave path in batch.rs.

What I like

  • Collect-not-abort (C2.1): Per-chunk InsufficientPeers failures are collected rather than fatal — this was the root cause of partial uploads
  • Bounded retry (C2.2): Same proof, fresh close group, 3 rounds × 30s backoff — well-chosen parameters that balance recovery speed with network load
  • Honest reporting: chunks_failed/total_chunks instead of hardcoded zero
  • 4 unit tests covering: collect-not-abort, non-quorum-error-fatal, retry-only-failed-set, retry-counting-in-histogram
  • Client-side only, no wire change → safe to deploy without fleet upgrade

CI status

No checks reported on this branch. May need to verify the RC branch CI pipeline is configured for this repo.

👍 No blockers

Approving. The retry logic interacts correctly with the other PRs in this stack:

  • Reachability-aware close-group selection (saorsa-core #121) → retried chunks benefit from better peer selection
  • Local-table-first closeness check (ant-node #111) → matters for storer side, not client side

One minor suggestion: consider making MERKLE_STORE_MAX_ATTEMPTS and MERKLE_RETRY_BACKOFF configurable (env var or CLI flag) for operational flexibility, but this is non-blocking.

Contributor

@grumbach grumbach left a comment


One real major + some minor + nits. Not blocking, but the data.rs thing is worth fixing before merge.

  • major ant-core/src/data/client/data.rs:128-137: merkle_upload_chunks can now return Ok with outcome.failed > 0, but this path swallows it: a warn! is logged and DataUploadResult { chunks_stored: outcome.stored } is returned as success. DataUploadResult has no chunks_failed/total_chunks field at all (unlike FileUploadResult), so a caller that uploaded N chunks and only stored N-k cannot tell, and the data_map they get back will fail to download. Either return Err(InsufficientPeers) when outcome.failed > 0 here, or extend DataUploadResult with chunks_failed/total_chunks so callers can decide. The file.rs path is honest about it; data.rs regressed.

  • minor (naming/docs) ant-core/src/data/client/merkle.rs:574-580: MERKLE_STORE_MAX_ATTEMPTS = 3 is the total attempt budget (initial + 2 retries) because the helper iterates for attempt in 0..attempts. The PR body advertises "up to 3 rounds" of retry, and retries_histogram: [usize; 4] is sized for 3 retries (the wave path's contract). Net effect: index 3 of the histogram is unreachable on this path, and the merkle path retries one fewer time than the wave path. Either rename the constant to MERKLE_STORE_ATTEMPT_BUDGET and align docs/PR body, or bump it to 4 to match the histogram and the wave path's semantics.

  • minor (thundering herd) ant-core/src/data/client/merkle.rs retry loop — fixed 30s backoff, no jitter, and the entire failed set re-fires at full store_concurrency on the next round. For a 178-chunk file with dozens of shortfalls (the prod case in the PR description), every retry hammers the same midpoint-disagreeing nodes in lockstep. Add small jitter (±10%) and consider staggering the round start.

  • minor (test coverage gap) — no case exercises "all chunks still short after the final retry" (outcome.stored == 0 && outcome.failed == total), and the data.rs/file.rs integration of MerkleStoreOutcome is untested (a regression that re-introduces the silent-success in data.rs wouldn't fail any test). Add at least the exhausted-retries case to the helper tests.

  • nit ant-core/src/data/client/merkle.rs: let idx = attempt.min(outcome.stats.retries_histogram.len() - 1); is panic-safe only because the array length is a literal 4 > 0; if anyone reduces the array size to 0 this underflows. Use attempt.min(3) against a named const for defence-in-depth.
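For the histogram nit, one way to make the clamp tolerate any slice length is sketched below. `record_attempt` and `HISTOGRAM_SLOTS` are hypothetical names for this example, not identifiers from merkle.rs, and the use of `get_mut` is a choice made here rather than something the PR is known to do.

```rust
/// Number of histogram slots: the initial attempt plus three retries.
const HISTOGRAM_SLOTS: usize = 4;

/// Record that a chunk was stored on `attempt`, folding any attempt beyond
/// the last slot into that last slot. `saturating_sub` avoids the underflow
/// on an empty slice, and `get_mut` avoids the out-of-bounds index that a
/// bare `histogram[0]` would still hit in that case.
fn record_attempt(histogram: &mut [usize], attempt: usize) {
    let last = histogram.len().saturating_sub(1);
    if let Some(slot) = histogram.get_mut(attempt.min(last)) {
        *slot += 1;
    }
}
```

Calling `record_attempt(&mut h, 9)` on a four-slot array increments the final slot, and calling it on an empty slice is a no-op rather than a panic.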

Positives: proof reuse is correctly idempotent (no re-pay, just re-PUT of an already-paid chunk that the storers will dedupe); the helper is a clean seam with focused unit tests; layers cleanly on PR #84 (cached MerkleBatchPaymentResult is exactly the input to this retry loop) and PR #94 (preflight already narrowed the chunk set); brings the external-signer path to parity with the wave path's collect-and-retry behaviour as advertised.

@jacderida jacderida force-pushed the fix/merkle-client-retry branch from 23178ab to 5cad982 Compare May 26, 2026 14:44
The external-signer merkle upload path (`merkle_upload_chunks`) aborted the
entire file on the first chunk that returned `InsufficientPeers`, and never
retried. Because all chunks in a batch share one `winner_pool`, the few nodes
whose routing tables disagree about that pool's midpoint reject it consistently
for every chunk whose close group includes them — turning a transient,
self-healing shortfall into a fatal partial upload.

C2.1: collect per-chunk `InsufficientPeers` failures rather than `?`-propagating
them; non-quorum errors (e.g. a missing proof) stay fatal.

C2.2: retry the failed set with the same reusable proofs across a total budget
of MERKLE_STORE_MAX_ATTEMPTS attempts (initial + 3 retries, matching the wave
path), re-collecting each chunk's close group per attempt and applying a
jittered 30s backoff between rounds so a large failed set does not re-probe the
same divergent nodes in lockstep. The retry round a chunk lands on is recorded
in `retries_histogram[round]`.

The store loop is extracted into `merkle_store_with_retry`, unit-tested for
collect-not-abort, non-quorum-fatal, retry-only-the-failed-set,
retry-counted-once, and exhausted-budget behaviour. `file.rs` reports honest
`chunks_failed`/`total_chunks`; the `data.rs` path returns a hard error on a
residual shortfall rather than a success with an undownloadable data map
(`DataUploadResult` cannot express a partial store).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jacderida jacderida force-pushed the fix/merkle-client-retry branch from 5cad982 to 88a1391 Compare May 26, 2026 14:55
@jacderida
Contributor Author

Thanks @grumbach — all five addressed, and rebased onto the (force-updated) rc-2026.5.4 so it layers cleanly on the merkle preflight (#94).

  • major — data.rs swallowing partial failure. Fixed. A residual shortfall after retries is now a hard error rather than a success with an undownloadable data map:

    if outcome.failed > 0 {
        return Err(Error::InsufficientPeers(format!(
            "Data merkle upload incomplete: {} of {} chunk(s) short of quorum after retries",
            outcome.failed, chunk_count
        )));
    }

    Went with the hard-error option rather than extending DataUploadResult, since for data_upload a partial result is never usable (the data map won't resolve). file.rs keeps its honest chunks_failed/total_chunks as before.

  • minor — attempt budget vs histogram. Bumped MERKLE_STORE_MAX_ATTEMPTS to 4 (initial + 3 retries) to match the wave path's 0..=MAX_RETRIES contract and fully use the 4-slot retries_histogram (final retry now lands in retries_histogram[3]). Doc/PR body updated to say "attempts" not "rounds".

  • minor — thundering herd. Added ±10% jitter to the backoff so the failed set doesn't re-probe the same divergent nodes in lockstep. thread_rng is !Send, so the value is computed and the rng dropped before the await to keep the future Send (the _merkle_upload_chunks_is_send assertion still holds). Note the existing store-concurrency limiter already bounds in-flight PUTs per round, so I kept round-level jitter rather than per-chunk staggering — happy to add the latter if metrics later show it's needed.

  • minor — test gap. Added store_with_retry_reports_all_failed_when_retries_exhausted: all chunks fail through the full budget → Ok with stored == 0, failed == total, attempted exactly MERKLE_STORE_MAX_ATTEMPTS times, empty histogram.

  • nit — histogram index underflow. Now attempt.min(retries_histogram.len().saturating_sub(1)).
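The Send point in the thundering-herd reply can be illustrated without any external crates. In this sketch `LocalRng` is a hypothetical stand-in that wraps an `Rc` (which, like `thread_rng`'s handle, is `!Send`), `std::future::ready` stands in for the real async sleep, and `assert_send` is written in the spirit of the `_merkle_upload_chunks_is_send` assertion rather than copied from it.

```rust
use std::future::ready;
use std::rc::Rc;
use std::time::Duration;

/// Stand-in for a thread-local rng handle: `Rc` is `!Send`.
struct LocalRng(Rc<f64>);

impl LocalRng {
    fn new() -> Self {
        LocalRng(Rc::new(0.25))
    }
    /// Pretend sample in [0, 1); fixed so the sketch is deterministic.
    fn sample(&self) -> f64 {
        *self.0
    }
}

/// Compute the delay inside a block so the `!Send` handle is dropped before
/// the await point and never becomes part of the future's state.
async fn backoff_round(base: Duration) -> Duration {
    let delay = {
        let rng = LocalRng::new();
        base.mul_f64(0.9 + 0.2 * rng.sample())
    }; // `rng` is dropped here, before any await
    ready(()).await; // stand-in for the real async sleep
    delay
}

/// Compile-time check: this only type-checks when `T` is `Send`.
fn assert_send<T: Send>(_: &T) {}
```

Passing `&backoff_round(base)` to `assert_send` compiles as written; moving the `LocalRng` out of the inner block so that it is still alive across the `.await` would be expected to make that check fail to compile.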

One heads-up unrelated to this PR: cargo test/--all-features doesn't build locally on rc-2026.5.4 — the lockfile carries two saorsa-core versions (0.24.5-rc.1 via the git ant-protocol vs 0.24.4 via the crates.io ant-node dev-dep), which clash on MultiAddr in node/devnet.rs. The library itself is clean (cargo check -p ant-core passes); the merkle unit tests (incl. the new ones) passed against the pre-force-push base. Flagging in case CI hits the same dev-dep split.

jacderida and others added 2 commits May 26, 2026 16:12
`ci.yml` only triggered on `main`, so PRs targeting release-candidate
branches (e.g. `rc-2026.5.4`) ran no fmt/clippy/test checks — the
`pull_request` branch filter is matched against the PR's base branch.

Add an `rc-*` glob to both the `push` and `pull_request` filters so every
release-candidate branch is covered. Takes effect for the rc branch once
this merges (GitHub reads the `pull_request` trigger from the base branch),
and future rc cuts inherit it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI on the `--all-features`/E2E/doc jobs failed to compile: `ant-protocol` is
pinned to its git `rc-2026.5.4` branch (→ `saorsa-core 0.24.5-rc.1`) but
`ant-node` was still on crates.io `0.11.4` (→ `saorsa-core 0.24.4`), so two
incompatible `saorsa-core` versions clashed on `MultiAddr` in `node/devnet.rs`.

Point `ant-node` (runtime + dev dep) and the direct `saorsa-core` dev-dep at
their `rc-2026.5.4` git branches so the whole graph resolves to a single
`saorsa-core 0.24.5-rc.1`, matching `ant-protocol`. `Cargo.lock` updated to the
current branch tips (saorsa-core 1be73520, which includes the
`find_closest_nodes_local_by_distance` method ant-node's verifier needs).

Verified locally: cargo clippy --all-targets --all-features, cargo doc
--all-features, and compilation of all e2e/merkle test targets now pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jacderida jacderida merged commit 573476f into WithAutonomi:rc-2026.5.4 May 26, 2026
12 checks passed
3 participants