chore(release): promote rc-2026.5.4 #118

Merged
jacderida merged 28 commits into main from rc-2026.5.4
May 28, 2026

@jacderida
Collaborator

Promotes rc-2026.5.4 to release version(s): 0.11.5.

  • strips -rc.* from [package].version
  • rewrites internal git+branch deps to crates.io version pins
  • regenerates Cargo.lock

Once merged, the release tag will be pushed to fire the publish workflow.

grumbach and others added 28 commits May 23, 2026 16:14
The shared upgrade binary cache stored the extracted binary and, on a
cache hit, returned it after only a SHA-256 check against a sibling
.meta.json. An unkeyed SHA-256 digest stored beside the file is not a
security control: anyone able to write to the shared cache directory (a
co-located process, a shared container volume, a low-privilege foothold
on the host) could drop a malicious binary plus a forged matching
metadata hash, and the next ant-node instance to upgrade would execute
it with no signature verification at all, giving persistent RCE on every
co-located node. The ML-DSA-65 signature covers the archive and was only
checked on the initial download, never on a cache hit.

Changes:

- Cache the signed *archive + detached signature* instead of the
  extracted binary. `BinaryCache::get_verified_archive` re-runs ML-DSA-65
  verification on every cache hit; the binary is always extracted fresh
  from the just-verified archive. A tampered archive, tampered or
  missing signature, or forged metadata fails verification against the
  pinned release public key, so a poisoned cache entry is rejected and a
  fresh verified download runs.

- Stage cached files into the caller's process-private temp directory
  and verify that copy, then extract from the same private path. Closes
  the verify-vs-extract TOCTOU on the shared cache files: an attacker
  cannot swap the bytes between when the verifier reads them and when
  the extractor reads them.

- Size policy before any copy or read. `fs::symlink_metadata` +
  `file_type().is_file()` rejects symlinks / FIFOs / devices outright;
  archive size is bounded by `MAX_ARCHIVE_SIZE_BYTES` and the signature
  must be exactly `SIGNATURE_SIZE` bytes. Otherwise an attacker could
  plant `cached.archive -> /dev/zero` (stats as 0 bytes) and force
  unbounded disk fill in the staging dir or OOM in `signature::verify`.

- Cache only after successful extraction. A validly-signed-but-malformed
  release no longer becomes a shared cache poison pill that every later
  node downloads, fails to extract, and re-downloads.

- `cache_dir.rs` restricts the shared upgrade cache directory to 0700
  on Unix as defence in depth; the ML-DSA gate is the primary control.

- `store_archive` mirrors the same size / file-type / signature checks
  before persisting, so a poisoned entry cannot be created through the
  supported path either.
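
The stage-then-verify flow described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `stage_and_verify`, `toy_verify`, and the staged file names are hypothetical, and `toy_verify` is a deliberately trivial stand-in for the real ML-DSA-65 check. The size and file-type policy is omitted here to keep the focus on verifying the private copy rather than the shared path.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for the real signature check (ML-DSA-65 in the PR).
/// Toy rule for illustration only: the "signature" is the archive reversed.
pub fn toy_verify(archive: &[u8], signature: &[u8]) -> bool {
    archive.iter().rev().eq(signature.iter())
}

/// Copy the cached archive and signature into `private_dir`, then verify the
/// private copies. Returns the staged archive path on success, or `None` when
/// verification fails and the caller should fall back to a fresh download.
pub fn stage_and_verify(
    cached_archive: &Path,
    cached_sig: &Path,
    private_dir: &Path,
    verify: fn(&[u8], &[u8]) -> bool,
) -> io::Result<Option<PathBuf>> {
    let staged_archive = private_dir.join("staged.archive");
    let staged_sig = private_dir.join("staged.sig");

    // After these copies, nothing below touches the shared cache paths again.
    fs::copy(cached_archive, &staged_archive)?;
    fs::copy(cached_sig, &staged_sig)?;

    let archive_bytes = fs::read(&staged_archive)?;
    let sig_bytes = fs::read(&staged_sig)?;

    if verify(&archive_bytes, &sig_bytes) {
        // The caller extracts from this same private path.
        Ok(Some(staged_archive))
    } else {
        // Poisoned or stale entry: discard the staged copies.
        let _ = fs::remove_file(&staged_archive);
        let _ = fs::remove_file(&staged_sig);
        Ok(None)
    }
}
```

Because verification and extraction both read the staged copy, a later swap of the shared cache file cannot change the bytes that were verified.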

Tests in `src/upgrade/binary_cache.rs` cover the tamper path
(SHA-256-forged swap on disk rejected by the signature re-check), the
post-hit shared-file swap (private copy unaffected), the symlink-to-
`/dev/zero` bypass attempt, oversize archive / wrong-sized signature
rejection, and round-trip storage. Production verifies against the
pinned `RELEASE_SIGNING_KEY`; tests use a `#[cfg(test)]`-only
constructor that injects a generated key without weakening the
production trust anchor.

Residual: cache entries are not bound to a specific release version
(the ML-DSA signing context is constant across versions), so a
same-UID attacker who already has any past validly-signed release can
plant it under a newer version's cache key and force a downgrade to
that old signed binary. Not RCE (still legitimately-signed bytes) and
a same-UID attacker has easier paths anyway; closing it cleanly
requires coordinated changes in the release-signing pipeline,
ant-keygen, ant-node, and ant-client, and is tracked in the
binary_cache module docs.

Review feedback on the upgrade binary cache:

- `meta.json` was read with an unbounded `fs::read_to_string`. An
  attacker with write access to the shared cache directory could plant
  the metadata sidecar as a symlink to `/dev/zero` or as a huge file
  and stall the read into a hang/OOM before the archive/sig hardening
  ran. The metadata path now goes through the same
  open-once-and-validate gate as the archive: regular-file check on
  the opened handle, capped at `MAX_META_BYTES` (4 KiB).

- Archive + signature staging previously did `symlink_metadata` (path)
  followed by `fs::copy` (path), leaving a small TOCTOU window where
  an attacker could race-swap the path to a symlink/FIFO/device or an
  oversized file between the check and the copy. Both files are now
  opened once via `open_regular_capped`, validated on the resulting
  `File` handle (size + file-type), and copied into the private
  staging dir from the open handle (wrapped in `Read::take(len)` as
  belt-and-braces against a post-open extension). All subsequent
  operations on those files use the staged private bytes, never the
  shared path.

- Comment fix: the prior comment claimed `sha256_file` loads the
  archive into memory in full. It actually streams in 8 KiB chunks;
  the memory-pressure concern is `signature::verify_from_file*`
  (FIPS-204 requires the message as a slice). Wording updated.

- Stale error message "Failed to serialize binary cache meta" updated
  to "Failed to serialize cached archive metadata" — the cache now
  stores archive metadata, not extracted-binary metadata.
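
The open-once-and-validate gate from the first two items can be sketched along these lines. The function names `open_regular_capped_sketch` and `copy_capped` and their signatures are assumptions made for illustration; the real `open_regular_capped` may differ. The sketch uses only the standard library and does not include any protection against blocking on a FIFO at open time.

```rust
use std::fs::File;
use std::io::{self, Read, Write};
use std::path::Path;

/// Open `path` once and validate the resulting handle rather than the path,
/// so a later swap of the path cannot change what was checked.
/// Returns the open file together with its validated length.
pub fn open_regular_capped_sketch(path: &Path, max_len: u64) -> io::Result<(File, u64)> {
    let file = File::open(path)?;
    // Metadata is taken from the open handle, not looked up by path again.
    let meta = file.metadata()?;
    if !meta.is_file() {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "not a regular file"));
    }
    let len = meta.len();
    if len > max_len {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "file exceeds size cap"));
    }
    Ok((file, len))
}

/// Copy at most `len` bytes from the already-validated handle into `dest`.
/// `Read::take` bounds the copy even if the file grows after it was opened.
pub fn copy_capped(file: File, len: u64, dest: &Path) -> io::Result<u64> {
    let mut out = File::create(dest)?;
    let copied = io::copy(&mut file.take(len), &mut out)?;
    out.flush()?;
    Ok(copied)
}
```

A caller would pass a cap such as the 4 KiB `MAX_META_BYTES` for the metadata sidecar and write `dest` inside its private staging directory.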

Two new tests:
  test_oversized_meta_is_rejected
  test_meta_symlink_to_special_file_is_rejected  (Unix-only)

488 lib tests pass; cfd clean.

Close a local DoS on auto-upgrade: a cache-dir attacker could plant a
FIFO at ant-node-<ver>.archive (or .sig / .meta.json) and open() for
reading would block indefinitely waiting for a writer, hanging the
upgrade. open_regular_capped previously only checked file type AFTER
the blocking open.

Two-layer defence in open_regular_capped:
- Pre-check via fs::metadata (follows symlinks), reject non-regular
  files before open(). A symlink-to-regular is still accepted as
  before; a symlink-to-FIFO/device/socket is rejected.
- On Unix, also open with O_NONBLOCK so a race between the pre-check
  and open() cannot reopen the FIFO window. Reads on regular files
  ignore O_NONBLOCK, so this is a no-op for the happy path. Platform-
  specific constant (0o4000 Linux, 0x0004 macOS/BSD); fallback to no
  flag on unknown unix-likes.

The existing post-open is_file() check on the file handle remains the
TOCTOU-safe final gate.

New regression test test_fifo_cached_archive_does_not_hang plants a
real FIFO via mkfifo and asserts return in well under 2s. 14/14
binary_cache tests pass; cfd clean.

Round 2 from adversarial review:

- Replace hand-coded O_NONBLOCK constants with libc::O_NONBLOCK. The
  previous 0o4000/0x0004 per-OS values were correct on
  x86_64/aarch64/arm but wrong on Linux/MIPS (0o200) and Linux/SPARC
  (0x4000), where 0o4000 maps to O_NOATIME. Using the libc constant
  always picks the right value for the target arch. Add libc as a
  Unix-only direct dependency (was already transitive).

- Test test_fifo_cached_archive_does_not_hang: replace the mkfifo
  shell-out with libc::mkfifo so a CI image that drops coreutils
  cannot silently skip this test. Bump the budget from 2s to 5s to
  absorb GitHub Actions macOS runner cold-start variance, since the
  failure mode "O_NONBLOCK wrong on this arch" and "CI runner slow"
  look identical from the assertion.

- Document the load-bearing invariant on get_verified_archive's
  private_dir: callers MUST supply a process-private 0o700 dir
  (apply.rs already does via tempfile + permissions). Without that the
  reopens-by-path in sha256_file/verify_archive would reopen a TOCTOU
  window.

- Add a cross-reference comment explaining the intentional asymmetry
  between store_archive (uses symlink_metadata, rejects symlinks) and
  open_regular_capped (uses fs::metadata, accepts symlink-to-regular)
  so a later editor doesn't unify them in the wrong direction.
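
The asymmetry in the last item comes down to which standard-library metadata call is used. A minimal sketch, with hypothetical helper names that do not appear in the PR:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Store-side style check: `symlink_metadata` inspects the path entry itself,
/// so a symlink is reported as a symlink and therefore rejected.
pub fn is_regular_no_follow(path: &Path) -> io::Result<bool> {
    Ok(fs::symlink_metadata(path)?.file_type().is_file())
}

/// Read-side style pre-check: `metadata` follows symlinks, so a symlink that
/// resolves to a regular file is accepted, while one that resolves to a
/// FIFO, device, or socket is not.
pub fn is_regular_follow(path: &Path) -> io::Result<bool> {
    Ok(fs::metadata(path)?.file_type().is_file())
}
```

For a symlink pointing at a regular file, the first helper returns `false` and the second returns `true`, which is the difference the cross-reference comment is meant to preserve.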

14/14 binary_cache tests pass, 489/489 lib tests pass, cfd clean.

Switch both Linux release targets from glibc to musl so the published
binaries run on any Linux distribution, including Alpine and other
musl-based systems. Asset filenames are unchanged
(ant-node-cli-linux-{arm64,x64}.tar.gz) so existing auto-upgraders on
deployed nodes continue to find them.

x86_64-unknown-linux-musl now uses `cross` for the musl toolchain
(matching aarch64). musl-static binaries have no dynamic linker
dependency and execute on glibc hosts as well as musl hosts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

musl's default malloc is notably slower than glibc's under concurrent
allocation churn — the steady-state shape of a DHT-bridged P2P node.
Switching the global allocator to mimalloc neutralises that regression
for the musl Linux builds, and tends to outperform glibc's allocator as
well, so all builds benefit.

Applied to both ant-node and ant-devnet binaries.
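
Rust selects the process-wide allocator through a `#[global_allocator]` static. With the mimalloc crate that static would be of type `mimalloc::MiMalloc`; the sketch below instead wraps the standard `System` allocator in a hypothetical `CountingAlloc` so that it builds without external crates. It illustrates the mechanism only and says nothing about performance.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Number of allocations routed through the custom global allocator.
static ALLOCATIONS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical wrapper type; a real build would name `mimalloc::MiMalloc`
/// in the `#[global_allocator]` static instead of defining this.
pub struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATIONS.fetch_add(1, Ordering::Relaxed);
        // Delegate the actual allocation to the platform allocator.
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

// One static per binary: every heap allocation in the process goes through it.
#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Read the allocation counter.
pub fn allocation_count() -> usize {
    ALLOCATIONS.load(Ordering::Relaxed)
}
```

Since the attribute applies per binary, it would need to be declared in each binary's entry point.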

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The Merkle pay-yourself defence verified candidate closeness with an iterative Kademlia
*network* lookup (find_closest_nodes_network) on the PUT-handling hot path. That lookup runs
up to MAX_ITERATIONS rounds bounded by CLOSENESS_LOOKUP_TIMEOUT (240s) and is the dominant
term in slow per-chunk store times; its instability (fresh transient peers pulled in on every
call) also contributes to the closeness disagreements that cause outright rejections.

Answer instead from the local routing table (find_closest_nodes_local, a pure in-memory
k-bucket read with no network I/O), matching the precedent already used for the close-group
responsibility check (find_closest_nodes_local_with_self). Fall back to the network lookup
only when the local table is genuinely too sparse to be authoritative (fewer than
CLOSENESS_LOOKUP_WIDTH peers near the midpoint). The fallback is gated on local table size,
not match outcome, so a forged pool cannot force the expensive 240s path -- an attacker
cannot make a victim's local routing table sparse.

check_closeness_match and the single-flight pass-cache wrapper are unchanged. Node-side only,
no wire/protocol change, so this is backwards compatible across a mixed-version fleet. The
fallback decision is extracted into a pure const fn (closeness_should_fall_back_to_network)
so its CLOSENESS_LOOKUP_WIDTH boundary is unit tested without standing up a P2PNode.
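
A pure `const fn` of the kind described could look like the sketch below. The parameter name and the value of `CLOSENESS_LOOKUP_WIDTH` are assumptions (the commit message does not state either); only the comparison against the lookup width is taken from the text.

```rust
/// Hypothetical value for illustration; the real constant is not given here.
pub const CLOSENESS_LOOKUP_WIDTH: usize = 8;

/// Fall back to the network lookup only when the local routing table holds
/// fewer than `CLOSENESS_LOOKUP_WIDTH` peers near the midpoint. The decision
/// depends on table size alone, never on whether the candidate pool matched.
pub const fn closeness_should_fall_back_to_network(local_peers_near_midpoint: usize) -> bool {
    local_peers_near_midpoint < CLOSENESS_LOOKUP_WIDTH
}

// Being `const`, the boundary can even be checked at compile time.
const _: () = assert!(closeness_should_fall_back_to_network(CLOSENESS_LOOKUP_WIDTH - 1));
const _: () = assert!(!closeness_should_fall_back_to_network(CLOSENESS_LOOKUP_WIDTH));
```

Keeping the decision free of any node state is what lets a unit test cover the boundary without constructing a P2PNode.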

Test results:
- cargo fmt -- --check: clean
- cargo clippy --lib --all-features -- -D clippy::panic -D clippy::unwrap_used
  -D clippy::expect_used: no warnings
- cargo test --lib payment::verifier: 67 passed, 0 failed (incl. new boundary test
  closeness_falls_back_to_network_only_below_lookup_width)
- e2e test target (--test e2e --features test-utils): compiles

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Addresses the coordination concern raised in review (dirvine): the Merkle closeness check is a
*verification* that must mirror the uploader's pure XOR-distance view, not the reachability
re-rank used for storage selection. With saorsa-core's reachability-aware
find_closest_nodes_local (WithAutonomi/saorsa-core#121), a re-rank could demote an XOR-close
relay-only peer out of the compared window and falsely reject an honest candidate pool that
legitimately contains that peer.

Switch the closeness check to find_closest_nodes_local_by_distance, the XOR-only variant added
to saorsa-core#121 for exactly this purpose. check_closeness_match (the set-membership helper)
is unchanged. Also rename the local variable network_peers -> closeness_peers for readability
(review feedback, grumbach), since it now usually holds local-table results.
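
For readers unfamiliar with the distinction, a pure XOR-distance ranking depends only on the identifiers, not on reachability. The sketch below is a generic Kademlia-style illustration with hypothetical names (`PeerId`, `xor_distance`, `closest_by_xor`) and an assumed 32-byte identifier; it is not saorsa-core's `find_closest_nodes_local_by_distance`.

```rust
/// Assumed 32-byte identifier, for illustration only.
pub type PeerId = [u8; 32];

/// Bytewise XOR of two identifiers. Comparing the result lexicographically
/// is the same as comparing the distances as big-endian integers.
pub fn xor_distance(a: &PeerId, b: &PeerId) -> [u8; 32] {
    std::array::from_fn(|i| a[i] ^ b[i])
}

/// Return the `k` peers closest to `target` by XOR distance alone.
/// No reachability or liveness signal influences the ordering.
pub fn closest_by_xor(target: &PeerId, peers: &[PeerId], k: usize) -> Vec<PeerId> {
    let mut sorted = peers.to_vec();
    sorted.sort_by_key(|peer| xor_distance(target, peer));
    sorted.truncate(k);
    sorted
}
```

Two nodes with the same peer set compute the same top-`k` under this ordering, which is the property a verification step needs in order to mirror the uploader's view.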

The rc-2026.5.4 dependency pins (saorsa-core, ant-protocol) come from the release-cut base
commit; this commit only advances Cargo.lock to the rc-2026.5.4 tip so the pin includes the
merged #121 (find_closest_nodes_local_by_distance), which the base's release cut predated.

Test results (against the rc-2026.5.4 deps):
- cargo fmt -- --check: clean
- cargo clippy --lib --all-features (-D panic -D unwrap_used -D expect_used): no warnings
- cargo test --lib payment::verifier: 67 passed, 0 failed

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

fix: answer Merkle closeness check from local routing table

… on mismatch

Two upload-breaking regressions on testnets with a meaningful NAT
fraction, both from storer-side closeness verification diverging from
the uploader's network-walked peer selection.

Single-node close-group check (introduced in #107): switch off the
reachability-reranked find_closest_nodes_local_with_self onto the
XOR-only find_closest_nodes_local_by_distance_with_self. The re-rank
(saorsa-core #121) demoted XOR-close relay-only / NAT'd peers out of the
local top-CLOSE_GROUP_SIZE, dropping 2-3 of the uploader's 7 quoted
peers and breaching the >=5 threshold. This mirrors the fix already
applied to the Merkle path; it remains a pure local lookup, so no added
network cost.

Merkle candidate-pool check (changed in #111): #111 moved the check off
the authoritative network lookup onto the local routing table, with the
fallback gated on local-table *size*, not match *outcome*. On a real
network the local k-bucket sample legitimately diverges from the
uploader's network-walked candidates (which include reachable responders
from positions 17-32), so honest pools were hard-rejected with no
escalation. Keep #111's local fast path (accept on a local match), but
escalate to the authoritative network lookup on match *failure* too.

Bound the reopened network-fallback path with a new
closeness_fallback_permits semaphore (CLOSENESS_NETWORK_FALLBACK_CONCURRENCY
= 16): inflight_closeness already collapses same-pool concurrency, and
this caps the distinct-pool case so a forged-pool flood cannot spawn
unbounded 240s Kademlia walks -- addressing the DoS rationale #111 gave
for the size-only gate.
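
The combination of a local fast path with a concurrency-capped escalation can be sketched as below. Everything here is illustrative: `FallbackPermits`, `Verdict`, and `check_closeness` are hypothetical names, the permit pool is a small standard-library counter rather than whatever semaphore type the PR uses, and returning `Busy` when no permit is free is an assumption (the real code may wait for a permit instead).

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Minimal counting permit pool (stand-in for a real semaphore).
pub struct FallbackPermits {
    available: AtomicUsize,
}

/// RAII guard: the permit is returned to the pool when dropped.
pub struct Permit<'a> {
    owner: &'a FallbackPermits,
}

impl FallbackPermits {
    pub const fn new(max: usize) -> Self {
        Self { available: AtomicUsize::new(max) }
    }

    /// Take a permit if one is free, without blocking.
    pub fn try_acquire(&self) -> Option<Permit<'_>> {
        self.available
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| n.checked_sub(1))
            .ok()
            .map(|_| Permit { owner: self })
    }

    pub fn available(&self) -> usize {
        self.available.load(Ordering::Acquire)
    }
}

impl Drop for Permit<'_> {
    fn drop(&mut self) {
        self.owner.available.fetch_add(1, Ordering::Release);
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    Accepted,
    Rejected,
    Busy,
}

/// Accept on a local match; otherwise escalate to the (expensive) network
/// lookup, but only while a fallback permit is available.
pub fn check_closeness(
    local_match: bool,
    permits: &FallbackPermits,
    network_lookup: impl FnOnce() -> bool,
) -> Verdict {
    if local_match {
        return Verdict::Accepted;
    }
    // Held until the end of the function, then released by `Drop`.
    let Some(_permit) = permits.try_acquire() else {
        return Verdict::Busy;
    };
    if network_lookup() { Verdict::Accepted } else { Verdict::Rejected }
}
```

With a pool of 16 permits, at most 16 distinct-pool lookups run at once regardless of how many mismatching pools arrive.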

Requires saorsa-core's find_closest_nodes_local_by_distance_with_self
(WithAutonomi/saorsa-core#122) on rc-2026.5.4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Picks up find_closest_nodes_local_by_distance_with_self
(WithAutonomi/saorsa-core#122, now merged to rc-2026.5.4) that the
single-node close-group verification change depends on. The crate is
pinned to `branch = "rc-2026.5.4"`, so this only advances Cargo.lock
from 1be7352 to 82bb541; no manifest change. ant-node now compiles
against the published branch without a local patch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CI only triggered for push/pull_request against `main`, so PRs targeting
release branches (e.g. rc-2026.5.4) ran no checks. Add `rc-*` to both
branch filters.

Note: the pull_request branch filter is evaluated against the PR's base
branch, so this only starts firing for rc-targeted PRs once it has landed
on the rc-2026.5.4 branch itself (i.e. after this PR merges).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…closeness-verification

fix(payment): verify closeness against pure-XOR view; escalate Merkle on mismatch

feat: build Linux releases against musl with mimalloc allocator

…-merkle-closeness-verification"

This reverts commit 59da17b, reversing
changes made to ed939af.

…ocal-lookup"

This reverts commit 5ac1f76, reversing
changes made to baa7dd2.

The saorsa-core and ant-protocol rc-2026.5.4 branches are being abandoned
(their only changes, saorsa-core #121/#122, are reverted). Point both deps
back at their crates.io releases (saorsa-core 0.24.4, ant-protocol 2.1.1)
and refresh the lock.

The reverts of #114/#111 also dropped the rc-* branch filters from the CI
workflow. Restore them so push/PR CI still runs for rc-* base branches.

CI re-enabled on rc-* branches surfaced a pre-existing doc_markdown lint
(clippy 1.95) in binary_cache.rs that fails under -D warnings.

…le-prs

Revert #114 + #111; drop abandoned saorsa-core/ant-protocol rc pins

…ment-verification"

This reverts commit bece788, reversing
changes made to 360c2fc.

Revert #107 (enforce single-node proof verification)

Copilot AI review requested due to automatic review settings May 28, 2026 11:55

Copilot AI left a comment

Pull request overview

This PR promotes the crate to 0.11.5 while also introducing broader runtime, upgrade-cache, payment-verifier, and release workflow changes.

Changes:

  • Bumps package/lockfile version and adds mimalloc/libc dependencies.
  • Reworks upgrade caching to store signed archives and re-verify signatures on cache hits.
  • Changes single-node payment verification and switches Linux release artifacts to musl builds.

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| Cargo.toml | Bumps version and adds allocator/libc dependencies. |
| Cargo.lock | Regenerates dependency lockfile for version/dependency changes. |
| src/upgrade/cache_dir.rs | Tightens Unix upgrade cache directory permissions. |
| src/upgrade/binary_cache.rs | Replaces cached binaries with signed archive/signature cache validation. |
| src/upgrade/apply.rs | Extracts verified cached archives and caches downloaded signed archives. |
| src/payment/verifier.rs | Simplifies single-node payment validation and delegates median payment verification. |
| src/bin/ant-node/main.rs | Sets mimalloc as global allocator. |
| src/bin/ant-devnet/main.rs | Sets mimalloc as global allocator. |
| .github/workflows/release.yml | Changes Linux release targets to musl/cross builds. |
| .github/workflows/ci.yml | Runs CI on rc-* branches. |


Comment thread src/payment/verifier.rs
Comment on lines +463 to +464
Self::validate_peer_bindings(payment)?;
self.validate_local_recipient(payment)?;
Comment thread Cargo.toml
Comment on lines +26 to +30
# Global allocator. musl's default malloc is significantly slower than
# glibc's under concurrent allocation churn, which matches the node's
# steady-state workload. mimalloc neutralises that regression for the
# musl Linux builds (and tends to beat glibc's allocator too).
mimalloc = "0.1"
@jacderida jacderida merged commit d91a4a3 into main May 28, 2026
22 of 23 checks passed
@jacderida jacderida deleted the rc-2026.5.4 branch May 28, 2026 14:03
3 participants