fix(payment): gate quote freshness on own quote only, not neighbours'#127
Conversation
There was a problem hiding this comment.
Pull request overview
This PR adjusts the quote price “freshness” (anti-staleness) gate in payment::verifier so that a node only evaluates its own quote in a multi-peer payment bundle, instead of incorrectly judging neighbours’ quotes against the verifying node’s local record count. This aligns the freshness logic with what the verifying node can actually re-derive (its own price-from-fullness) and prevents heterogeneous close-group fullness from causing systematic false rejections and failed PUTs after on-chain payment.
Changes:
- Add a mechanism to resolve the verifying node’s own peer ID (from attached
P2PNode, with a test-only override fallback) and use it to select which quote to freshness-check. - Update
validate_quote_freshnessto skip all non-self quotes and improve the rejection message wording. - Add/adjust unit tests to cover neighbour-quote scenarios, self-stale rejection, and fail-open behavior when self identity is unavailable.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
cf1ba3e to
51f7d55
Compare
grumbach
left a comment
There was a problem hiding this comment.
Reviewed the fix end to end (read verify_evm_payment, validate_quote_freshness, the self-ID resolution, the signing/binding path, and the merkle path). This is a clean, correct, well-scoped fix and the diagnosis matches the incident. Approving, with a couple of notes for your call.
What I verified
- Root cause is real and the fix is the right shape. A close group spanning 47..=1788 records makes a full node's 75% floor exceed an empty node's honest price, so the old gate rejected perfectly fresh neighbour quotes. A node can only re-derive its own price from its own record count, so gating only its own quote is correct. The neighbour's stale-replay risk is genuinely that neighbour's gate to enforce when the PUT reaches it.
- The self-ID comparison actually matches in production.
self_peer_id_bytes()returns*node.peer_id().as_bytes(), and anEncodedPeerIdis constructed from exactly those bytes (EncodedPeerId::new(*ant_peer_id.as_bytes())in this file). Same 32-byte encoding, soencoded_peer_id.as_bytes() != &self_peer_idis sound and won't silently skip our own quote. - No bypass from the pre-binding ordering.
validate_quote_freshnessruns beforevalidate_peer_bindingsand signature verification, so theencoded_peer_idlabel isn't authenticated yet when we pick out "our" quote. I chased the relabeling angle: an attacker could mislabel a forged cheap quote under a neighbour's peer ID to dodge the own-quote gate, but that quote then fails the BLAKE3 peer-id binding + signature (price is signed viabytes_for_signing). Both paths reject, so there's no surviving relabeled quote. The only untrusted-data op freshness performs (derive_records_stored_from_price) is already saturating/panic-proof for exactly this pre-auth reason. Fine as-is. - Scope is right.
validate_quote_freshnessis only called from the single-node path; the merkle/batch path has no analogous own-fullness gate, so nothing to mirror there.
Notes (non-blocking)
-
Two independent self-identity notions now coexist in the verifier.
validate_local_recipientidentifies "us" byquote.rewards_address == local_rewards_address; the new gate identifies "us" byP2PNodepeer ID. If those two ever disagree for a given node (config mismatch, or one handle attached but not the other),validate_local_recipientcould still consider the node a recipient while the freshness gate matches no quote and silently no-ops. It fails open, so it's not a security hole, just a latent inconsistency. Might be worth a one-line comment noting the two are expected to refer to the same node, or a debug log when the bundle contains our rewards address but no quote under our peer ID. -
Tiny doc-comment nit. The new doc on
validate_quote_freshnessis excellent, but it says "this node has no basis to judge it against its own record count" for neighbour quotes, which is the key insight worth keeping front and center. Consider also stating explicitly that a bundle is expected to contain exactly one quote per peer (so the loop matches at most one own quote) so a future reader doesn't wonder about multiple-own-quote handling.
Tests
The three new tests cover the regression shape (own fresh @1788 + honest neighbours @47/@978), own-still-bites, and fail-open-without-self-id. The fail-open test relies on no override being set, which is the right negative. Since these call validate_quote_freshness directly, the pub_key: [0u8; 64] fakes are fine (binding/signature aren't on this path). Coverage matches the change well.
Nice surgical fix. LGTM once you've eyeballed note 1.
The price-staleness gate (56dd370) checked every quote in the payment bundle against the verifying node's OWN record count. Fullness across a close group is wildly heterogeneous on a real network — on ant-prod-01 (2026-06-04) a close group spanned 47..=1788 records, so the three fullest nodes each found the emptiest node's honest, 10-second-old quote below their 75% floor and rejected the PUT after the client had already paid 3x the median on-chain (more than 2x even the fullest node's current price). With one more node disk-full, the DataMap store landed on 3/7 peers — below the majority of 4 — failing the upload and burning the payment. The same gate also blocked the replication paid-notify path, so the chunk could not heal past 3 copies. A node can only re-derive its own price from its own record count, so its own quote is the only one it can legitimately call stale. Resolve the node's peer ID from the attached P2PNode and skip every quote in the bundle that isn't ours; fail open (with a debug log) when no identity source is attached, mirroring the missing-record-count posture. The on-chain median payment binding is unaffected. Also reword the rejection message: it claimed "paid X" while actually printing the quote price, not the on-chain payment. Adds a regression test reproducing the incident shape (own fresh quote at 1788 records alongside honest neighbour quotes at 47 and 978), a test that an own stale quote still rejects among expensive neighbours, and a fail-open test for the missing-identity path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
51f7d55 to
8f69ae1
Compare
|
Thanks for the thorough review @grumbach — both notes addressed in the amended commit ( Note 1 (two self-identity notions): added a comment in Note 2 (at-most-one own quote): added a doc sentence noting that No behaviour change on any rejecting path; clippy (CI flags) and all 61 verifier tests still pass. 🤖 Generated with Claude Code |
| /// tolerance — on ant-prod-01 a close group spanning 47..=1788 records | ||
| /// made the three fullest nodes reject every bundle containing the | ||
| /// emptiest node's (perfectly fresh, 10-second-old) quote, failing the | ||
| /// PUT after the client had already paid on-chain. The node can only |
A proof-of-payment presented during replication is a receipt for a sale that already closed, but the verifier ran the full client-PUT check set against it. Two of those checks interrogate the present and therefore guarantee false rejections for replicated records: the own-quote price-freshness gate (record counts only grow, so every receipt's quoted price eventually drops below the verifier's live floor) and the local-recipient check (close groups churn, so a post-churn member receiving a record via replication was never a payee on the original receipt). The merkle candidate-closeness check has the same shape: it validates the winner pool against the live DHT, but the pool was sampled from the DHT of the original sale. On DEV-01 (2026-06-05) this rejected nearly 100% of replication proof-of-payment transfers within an hour of launch: 4M+ "PoP verification error ... stale" rejections at ~300k/hour, records pinned below target redundancy, close-group record counts diverging 150x (75..=11,231 per service), and a permanent ~500 MB/s fleet-wide re-offer storm (~25 TB egress in 16h). The divergence the rejections caused is also what made the client-PUT freshness gate (fixed for the heterogeneous-neighbour case in #127) keep firing: the two failure modes fed each other. Introduce VerificationContext { ClientPut, Replication } and thread it through verify_payment. Under Replication the verifier skips only the storer-being-paid-now checks (own-quote freshness, local recipient, merkle candidate closeness). Every receipt-authenticity check still runs in both contexts: quote structure, content binding to the exact address, peer-ID/ pub-key bindings, ML-DSA signatures, and the on-chain settlement lookup — a record cannot be admitted via replication without an authentic, settled payment for that record. The verified-XorName cache is context-aware to match: each entry records whether its verification ran the full client-PUT check set, a Replication-verified entry satisfies later replication lookups (re-offers of the same key are routine) but never a later ClientPut fast-path, and a full ClientPut verification upgrades the entry without ever being downgraded back. Without this, a replication receipt would let a later proof-less client PUT bypass the context-gated checks via the cache. Deliberate trade-off (documented on the enum): skipping the recipient and closeness checks for replication admits receipts from self-dealing payers who settle the median payment to their own wallet on-chain. The client-PUT path still rejects such pools, replication admission still requires responsibility for the key, and the abuse costs a settled on-chain payment per chunk; closing it properly belongs in quote issuance / payment policy rather than in the replication hot path, where the equivalent defence provably destroys the network's ability to heal. Call sites: the chunk PUT handler passes ClientPut (behaviour unchanged); the fresh-offer and paid-notify replication handlers pass Replication. Test results: payment::verifier 66/66 (5 new context tests: stale own quote and non-recipient receipts pass the gated checks under Replication, failing at the later binding/signature stage; content mismatch rejected under both contexts; duplicate-candidate merkle pool rejected under ClientPut but past the closeness check under Replication; Replication-verified cache entry does not satisfy a ClientPut fast-path, upgrades on full verification, never downgrades), replication 230/230, storage 29/29. cargo clippy --all-targets clean. Note: config::tests::test_bootstrap_peers_discover_env_var fails on machines with a real ~/.config/ant/bootstrap_peers.toml — pre-existing on main, unrelated. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Problem
The price-staleness gate introduced in 56dd370 checks every quote in the payment bundle against the verifying node's own record count:
Fullness across a close group is wildly heterogeneous on a real network. On ant-prod-01 (2026-06-04, PROD-UL-01 upload run) the close group for a public DataMap chunk spanned 47..=1788 records. The client collected 7 quotes and paid 3× the median on-chain (14,520,949,218,750,000 atto — more than 2× even the fullest node's current price), yet:
All quotes were 10 seconds old — no staleness was involved, only cross-node fullness spread. With a fourth close-group node disk-full, the PUT landed on 3/7 peers (majority is 4): the upload failed after the on-chain payment was burned, and the 500 already-stored body chunks became unreferenced. The same gate also rejects the replication paid-notify path (observed on two further nodes at 1180/1443 records), so the chunk cannot heal past the 3 copies.
Every close group whose fullest member's 75% floor exceeds its emptiest member's price will reject all single-node payments — guaranteed steady-state on prod where freshly joined nodes coexist with established ones.
Fix
A node can only re-derive its own price from its own record count, so its own quote is the only one it can legitimately call stale (the doc comment's stated intent — "a node storing a few replicated records between quoting and verifying" — is exactly the self-staleness case). The gate now:
P2PNode(same handle the merkle closeness check uses) and skips every quote in the bundle that isn't ours;The on-chain median payment binding (
completedPayments, 3× median) is unaffected.Tests
test_neighbour_cheap_quote_not_rejected— regression test reproducing the incident shape: own fresh quote at 1788 records alongside honest neighbour quotes at 47 and 978; must pass.test_own_stale_quote_still_rejected_among_neighbours— the gate still bites on a genuinely stale own quote.test_freshness_skipped_without_self_peer_id— fail-open without an identity source.set_peer_id_for_tests, cfg-gated likeset_records_stored_for_tests).All 61
payment::verifiertests pass; clippy clean. (config::tests::test_bootstrap_peers_discover_env_varfails identically on untouchedmain— environment-sensitive, unrelated.)Possible follow-up
An additional/alternative gate could compare the verified on-chain amount against the node's current price after
single_payment.verify()— that would also fix the "rejected despite aggregate overpayment" semantics. Kept out of this PR to stay surgical.Companion PR (reporting): WithAutonomi/ant-client#107 —
ant --json file uploademits no JSON when the public DataMap store fails, so harnesses report 0/0 chunks for uploads like this one.🤖 Generated with Claude Code