feat(bench): VM integration test infrastructure + UDP broadcast race fix #28

Merged: maxholman merged 28 commits into main from feat/vm-integration-tests on Feb 22, 2026
maxholman (Contributor) commented Feb 22, 2026

Summary

  • QEMU VM integration test harness (bench/): two microVMs connected via virtio-net over a Unix socketpair. Entry VM runs wallhack entry, exit VM runs wallhack exit. The entry VM runs a deterministic payload transfer (TCP + UDP echo via the tunnel) and reports a JSON result token. The Python test runner (run_tests.py) orchestrates both VMs, drains their stdout into ring buffers, and checks for pass/fail tokens. Covers smoke (connectivity) and resilience (netem loss/delay) scenarios for QUIC and WebSocket transports.

  • UDP broadcast race fix (fix(transport): subscribe before spawn): broadcast::send() returns Err(SendError) when there are no active receivers, not when receivers are lagging. The subscription was created inside the spawned run_send_responses task, after open_uni() returned. For WebSocket/yamux, open_uni() requires a round-trip through the yamux driver; a fast UDP echo could arrive before the subscription was established, killing run_udp_recv silently. Fix: subscribe before spawning and before open_uni(). Functions renamed run_data_out_{instructions,responses} → run_send_{instructions,responses}, with signatures that accept a pre-created broadcast::Receiver<T>, making the subscribe-before-open contract compiler-enforced.

  • socat EOF datagram fix (fix(bench/vm)): socat -T 5 - UDP4:... sends a 0-byte UDP datagram when stdin reaches EOF. This races against the real echo response through the tunnel — the 0-byte echo arrives first, socat writes 0 bytes and exits, producing a SHA256 mismatch (e3b0c44 = hash of empty string). Replaced with nc -u -w 5 which has no EOF-signalling behaviour.

  • Debugging improvements: failure log tail bumped 50 → 150 lines; saved log directory path printed on failure; UDP empty-response error distinguishes 0-byte from corrupt with an actionable message.

Test plan

  • just bench::smoke — 2/2 scenarios pass (QUIC + WebSocket)
  • just bench::resilience — 6/6 scenarios (netem loss/delay combinations)
  • just check — fmt, clippy, cargo tests, smoke, resilience

maxholman and others added 28 commits February 21, 2026 11:31
Replace the netns/pytest/sudo/pyroute2 test suite with a self-contained
QEMU VM pair architecture. The old suite required root, accumulated sudo
prompts that silently blocked automated runs, and produced log noise that
made CI debugging impossible.

New architecture:
- Two QEMU VMs per test scenario connected by a socketpair L2 link
- VMs are self-contained: kernel boots our PID 1 init script directly
- Configuration via kernel cmdline params (wallhack.role/scenario/transport/…)
- Results reported as WALLHACK_RESULT: JSON on the serial console
- Host runner (Python stdlib only, no venv) polls stdout ring buffers
- No SSH, no root, no sudo, no pytest, no pyroute2
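
The result-reporting convention above can be sketched in Python. This is a minimal illustration rather than the code in run_tests.py: only the `WALLHACK_RESULT:` prefix comes from the commit message, while `parse_result` and the JSON field names (`status`, `scenario`) are hypothetical.

```python
import json

RESULT_PREFIX = "WALLHACK_RESULT:"  # token named in the commit message


def parse_result(line):
    """Return the decoded JSON payload if the line carries a result token, else None."""
    idx = line.find(RESULT_PREFIX)
    if idx == -1:
        return None
    payload = line[idx + len(RESULT_PREFIX):].strip()
    try:
        return json.loads(payload)
    except json.JSONDecodeError:
        # A truncated or interleaved console line is treated as "no result yet".
        return None


# Hypothetical console lines; the field names are invented for illustration.
console = [
    "[init] starting wallhack entry",
    'WALLHACK_RESULT: {"status": "pass", "scenario": "smoke"}',
]
results = [r for r in map(parse_result, console) if r is not None]
```

Returning None for malformed JSON lets a polling loop keep waiting instead of failing on a partially written serial line.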

New files:
- bench/vm/init.sh       — VM PID 1 init script (exit + entry roles)
- bench/run_tests.py     — smoke + resilience + debug-topology runner
- bench/run_benchmarks.py — benchmark runner (iperf3, min/median/max JSON)
- bench/setup-vm.sh      — builds base.qcow2 via cloud-init (run once)

Justfile additions: smoke, resilience, benchmark, build-release,
debug-topology, setup-vm, fetch-iperf3.

just check now runs smoke + resilience alongside cargo test.
just benchmark is separate (not in check).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Calling socket.abort() sends RST, which causes the peer to lose data
still in its receive buffer — resulting in ECONNRESET even when the
transfer completed successfully. Using socket.close() sends FIN and
lets the four-way shutdown complete normally.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
UDP send_to and response-delivery messages were at TRACE level, making
them invisible with --debug. Elevating them means the full UDP round-trip
is visible in the ring buffer during integration test failures without
needing --trace.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
getrandom(2) blocks until the CRNG has 256 bits of entropy, causing
silent hangs in crypto startup on entropy-starved VMs. Probing
/dev/random with O_NONBLOCK at startup surfaces the wait as a visible
warning rather than an unexplained hang.
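
The non-blocking probe can be illustrated in Python. This is a sketch of the idea only (the actual check lives in the Rust binary); the function name `entropy_ready` and its `path` parameter are assumptions made for this example.

```python
import os


def entropy_ready(path="/dev/random"):
    """Non-blocking probe: True if a read returns now, False if it would block."""
    fd = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
    try:
        os.read(fd, 1)
        return True
    except BlockingIOError:
        # EAGAIN: the kernel would have blocked, so the caller can log a warning.
        return False
    finally:
        os.close(fd)
```

A caller would typically log a warning when this returns False and then proceed with the normal blocking path, so the wait becomes visible in the log.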

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- init.sh: replace urandom payloads with deterministic inputs (all-zero
  TCP, 'U'*64 UDP) so hash mismatches immediately reveal whether the
  response was empty or corrupted; add hexdump on UDP mismatch
- init.sh: replace sleep 2 with socat ,shut-down for half-close so the
  TCP echo round-trip completes without a mandatory 2s delay per test
- init.sh: remove shellcheck disables — use ${DEBUG:+"--debug"} for
  optional wallhack flags and ${DELAY:+...}/${LOSS:+...} for tc args
- init.sh: add socat-tcp log to _fail diagnostics; tee wallhack output
  to serial console and log file simultaneously
- run_tests.py / run_benchmarks.py: fix seen_count deque rotation bug
  where tokens could be missed once the ring buffer reached maxlen
- run_tests.py / run_benchmarks.py: suppress kernel boot noise with
  quiet loglevel=0 to keep the ring buffer useful
- bench.just: add iperf3 to build-initrd so benchmark scenario works
- bench/vm/kernel/: add Docker build context for static kernel + socat
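
One plausible shape of the seen_count bug and its fix is sketched below. This is an assumption about the failure mode, not the runner's actual code: if the scanner remembers how many buffer entries it has already checked and slices from that index, it sees nothing once the deque is full, because the length stops growing while old lines are evicted. Counting every append separately avoids that. `RingLog` and its methods are hypothetical names.

```python
from collections import deque


class RingLog:
    """Bounded log buffer that also counts every line ever appended."""

    def __init__(self, maxlen):
        self.lines = deque(maxlen=maxlen)
        self.total = 0  # monotonic; keeps growing after the deque is full

    def append(self, line):
        self.lines.append(line)
        self.total += 1

    def unseen(self, seen_total):
        """Return (lines appended since seen_total, new seen_total)."""
        new = self.total - seen_total
        # Lines evicted before being scanned are gone; cap at what remains.
        new = min(new, len(self.lines))
        fresh = list(self.lines)[len(self.lines) - new:]
        return fresh, self.total


log = RingLog(maxlen=3)
for text in ("a", "b", "c"):
    log.append(text)
_, seen = log.unseen(0)          # buffer is now full and fully scanned

log.append("TOKEN")              # evicts "a"; len(log.lines) stays 3
naive = list(log.lines)[3:]      # index-by-length scan: sees nothing
fresh, seen = log.unseen(seen)   # counter-based scan: sees the new line
```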

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Extracts bench and website targets into dedicated module files
(bench/bench.just, website.just) so the root justfile stays focused on
the top-level gate. Also fixes gitignore to cover python cache and the
old root-level /bin/ artifact directory.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
bench/bin/iperf3 is obsolete (iperf3 now lives under bench/vm/staging/,
covered by bench/.gitignore). bench/results duplicates the results/
entry already in bench/.gitignore.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
broadcast::send() returns Err(SendError) when there are no active
receivers, not when receivers are lagging — this silently killed
run_udp_recv on the exit side if a UDP echo arrived before the
run_send_responses task had subscribed.

The subscription was created inside the spawned task, after open_uni()
returned. For WebSocket/yamux, open_uni() requires a round-trip through
the yamux driver; a fast UDP echo from the far side of the tunnel could
arrive and be sent to the broadcast channel before the subscription was
established.

Fix: subscribe() is called before the task is spawned and before
open_uni(), so buffered messages are preserved even during stream setup.

Also renames run_data_out_{instructions,responses} to
run_send_{instructions,responses} and changes signatures to accept
a pre-created broadcast::Receiver<T> instead of &broadcast::Sender<T>,
making the subscribe-before-open contract explicit and compiler-enforced.
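
The ordering problem can be modelled with a toy channel in Python. This is an analogy under stated assumptions, not the Rust implementation: `Broadcast`, `NoReceivers`, and `open_stream` are invented stand-ins for the broadcast channel, its send error, and the stream open that needs a round-trip.

```python
import asyncio


class NoReceivers(Exception):
    """Stands in for the send error raised when nobody is subscribed."""


class Broadcast:
    """Toy broadcast channel: send() fails when there are no subscribers."""

    def __init__(self):
        self._queues = []

    def subscribe(self):
        q = asyncio.Queue()
        self._queues.append(q)
        return q

    def send(self, item):
        if not self._queues:
            raise NoReceivers
        for q in self._queues:
            q.put_nowait(item)


async def open_stream():
    # Stands in for a stream open that needs a round-trip before returning.
    await asyncio.sleep(0.01)


async def late_subscriber(chan, out):
    await open_stream()
    rx = chan.subscribe()          # too late: a send may already have failed
    out.append(await rx.get())


async def early_subscriber(rx, out):
    await open_stream()            # messages sent meanwhile queue up in rx
    out.append(await rx.get())


async def demo(subscribe_first):
    chan, out = Broadcast(), []
    if subscribe_first:
        task = asyncio.create_task(early_subscriber(chan.subscribe(), out))
    else:
        task = asyncio.create_task(late_subscriber(chan, out))
    try:
        chan.send("echo")          # the fast echo arrives immediately
    except NoReceivers:
        task.cancel()
        return "send failed"
    await task
    return out[0]


buggy = asyncio.run(demo(subscribe_first=False))
fixed = asyncio.run(demo(subscribe_first=True))
```

Passing the already-created receiver into the task, as `early_subscriber` does, mirrors the signature change described above: the task cannot be started without a subscription in hand.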

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
socat -T 5 - UDP4:... sends a 0-byte UDP datagram when stdin reaches EOF
(after the payload file is fully read). This races against the real echo
response arriving through the tunnel — if the 0-byte echo arrives first,
socat writes 0 bytes to the output file and exits, producing a SHA256
mismatch (response_size=0, hash matches empty string).

nc -u does not have this EOF-signalling behaviour; it reads stdin, sends
it as a single datagram, then waits for the -w timeout before exiting.
Both QUIC and WebSocket smoke tests pass reliably with nc.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… style

AGENTS.md: expand naming conventions section to cover domain logic vs
transport layer distinction, prohibited directional terms, required
vector-based terminology, and CLI consistency rules.

TODO.md: add backlog item for topology/directional naming refactor.

wallhack.rs: remove redundant info log on default startup; use let-else
chain for entropy check (more idiomatic).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
run_tests.py:
- Print saved log directory path on failure so logs are immediately
  findable without having to browse bench/results/logs manually.
- Bump default failure tail from 50 → 150 lines; 50 was routinely
  missing the interesting early-boot context on init script failures.

init.sh:
- On UDP sha256 mismatch, distinguish empty response (0 bytes) from a
  genuine corrupt response with a specific, actionable error message.
  The original "UDP echo mismatch: expected=... got=e3b0c44..." output
  silently hid that the response was empty — e3b0c44 is the SHA256 of
  the empty string, which was opaque during debugging.
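
The distinction can be sketched in Python (the real check is in the shell init script; `classify_echo` is a hypothetical helper for illustration). The empty-input digest is computed rather than hard-coded, and it begins with the e3b0c44 prefix quoted above.

```python
import hashlib

EMPTY_SHA256 = hashlib.sha256(b"").hexdigest()  # begins with "e3b0c44"


def classify_echo(sent, received):
    """Return 'ok', 'empty', or 'corrupt' for a UDP echo response."""
    if len(received) == 0:
        return "empty"
    if hashlib.sha256(received).digest() == hashlib.sha256(sent).digest():
        return "ok"
    return "corrupt"


payload = b"U" * 64  # deterministic UDP payload of 64 'U' bytes
```

Checking the length before comparing hashes is what turns an opaque digest mismatch into an actionable "response was empty" message.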

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ebsite.just

bridge.rs: replace |r| std::mem::discriminant(r) with std::mem::discriminant
(clippy::redundant_closure).

website.just: add `set working-directory := "website"` so pnpm commands
run from the package root; the module was executing from the repo root
where no package.json exists, causing `just check` to fail.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Start entry VM first (it's the listener) and wait for
WALLHACK_ENTRY_READY_MAGIC_TOKEN before starting exit. The exit node
now connects on its first attempt, eliminating the ~500ms retry backoff
that occurred when exit started while entry wasn't yet listening.

WALLHACK_ENTRY_READY_MAGIC_TOKEN is emitted after wallhack binds its
listen port, detected via /proc/net/{udp,tcp} poll at 100ms intervals —
no arbitrary sleeps.
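
A Python sketch of the readiness poll follows. The real check runs in the shell init script; the helper names, the sample table, and the 5-second timeout here are assumptions. It relies only on the /proc/net table layout, where the second column is the local address as hex IP and hex port.

```python
import time


def port_is_bound(proc_net_text, port):
    """Return True if any socket in a /proc/net/{tcp,udp} table has this local port."""
    for row in proc_net_text.splitlines()[1:]:      # skip the header row
        fields = row.split()
        if len(fields) < 2:
            continue
        local = fields[1]                            # e.g. "00000000:1F90"
        _, _, port_hex = local.rpartition(":")
        if int(port_hex, 16) == port:
            return True
    return False


def wait_for_port(read_table, port, timeout=5.0, interval=0.1):
    """Poll read_table() every `interval` seconds until the port shows up."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if port_is_bound(read_table(), port):
            return True
        time.sleep(interval)
    return False


# Hypothetical table contents; 0x1F90 is port 8080.
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 00000000:1F90 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 12345
"""
```

On a Linux host, `read_table` could be a function that reads `/proc/net/tcp` or `/proc/net/udp`; the ready token would be printed once `wait_for_port` returns True.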

Also:
- INITIAL_RETRY_DELAY: 1s → 50ms (exit + relay)
- Add RECONNECT_DELAY: 500ms for post-session drops with reset
- Remove unused _wait_for_host from init.sh
- Extract _qemu_base() to DRY up QEMU machine args in run_tests.py
- Add debug-shell subcommand: boots single interactive busybox VM
  via rdinit=/bin/sh for kernel/OS debugging
- gzip -1 for faster initrd packing

Result: wallhack handshake drops from ~1s to ~62ms (over 10×).
Smoke suite: 2/2 pass in ~24s wall clock (both transports).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
busybox nc -u -w N waits the full N-second inactivity timeout after
receiving the echo reply, even though the response is already in hand.
1s is still generous for any realistic tunnel RTT (netem max is 100ms).

Smoke suite: 2/2 pass in ~8s (was ~24s).

Also add `just bench shell` recipe: boots a single interactive busybox
VM via rdinit=/bin/sh for kernel/OS debugging.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
parse_listen_addr expanded bare ':port' to '[::]:port', so the
display always claimed IPv6 regardless of intent. Two fixes:

1. Expand ':port' to '0.0.0.0:port' — explicit IPv4 wildcard that
   matches the common case and the exit node's connect behaviour.
   IPv6 listeners can still be configured explicitly via '[::]:port'.

2. Add local_addr() to the Server trait (impl via endpoint/listener)
   and use it for the "Listening on ..." log line, so the display
   reflects the address the OS actually bound rather than the pre-bind
   parsed argument.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
0.0.0.0 was a breaking change: [::] on Linux is dual-stack by
default (IPV6_V6ONLY=0), accepting both IPv4 and IPv6 clients, which
is the desired behaviour. The confusing display was the real issue,
not the bind address. local_addr() now shows the actual OS-bound
address ([::]:port) which is accurate.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add crates/cli/src/net.rs with SocketAddrExt trait (.bind_addr()) and
  shared parse_listen_addr() that probes IPv6 availability and falls back
  to 0.0.0.0 on IPv4-only kernels, and resolves hostnames via ToSocketAddrs
- Remove three duplicate parse_listen_addr() functions and bind_for() from
  exit.rs; all callers now use crate::net
- Fix all ClientConfig constructions in entry.rs and relay.rs to set
  bind: addr.bind_addr() instead of inheriting DEFAULT_BIND_ADDRESS
  (Ipv6Addr::UNSPECIFIED), which crashed on IPv4-only kernels
- All listen displays now use server.local_addr()? (actual bound address)
  instead of the pre-bind parsed address

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- bench/vm/kernel/Dockerfile: add -d IPV6 to build an IPv4-only kernel,
  add COPY of .config to final stage so kernel config is inspectable
- bench/vm/init.sh: explicitly bind entry on 0.0.0.0 for IPv4-only kernel,
  reduce _wait_for_tun timeout from 45s to 5s
- bench/run_tests.py: minor formatting fixes (black style), add panic=-1
  to debug-shell kernel cmdline so crashes exit immediately

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the hardcoded DEFAULT_BIND_ADDRESS (Ipv6Addr::UNSPECIFIED) with
a runtime probe using socket2: ask the kernel to create an AF_INET6 DGRAM
socket — no port allocated, purely a capability check. Falls back to
Ipv4Addr::UNSPECIFIED on IPv4-only kernels.

ipv6_supported() is public in wallhack::client::config so cli::net can
reuse it for parse_listen_addr() bare-port expansion, keeping the probe
logic in one place.
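
The capability probe and bare-port expansion can be illustrated in Python. This is an analogue of the approach, not the socket2-based Rust code: the function names echo the ones in the commit message, but the bodies are assumptions written for this sketch.

```python
import socket


def ipv6_supported():
    """Capability check: can the kernel create an AF_INET6 datagram socket?"""
    if not socket.has_ipv6:
        return False
    try:
        probe = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
    except OSError:
        return False
    probe.close()  # never bound, so no port was allocated
    return True


def default_bind_host():
    """Wildcard host to use when the caller did not name one."""
    return "::" if ipv6_supported() else "0.0.0.0"


def expand_bare_port(addr):
    """Expand ':port' to a wildcard address; leave explicit hosts untouched."""
    if addr.startswith(":") and not addr.startswith("::"):
        host = default_bind_host()
        return f"[{host}]{addr}" if ":" in host else f"{host}{addr}"
    return addr
```

Keeping the probe in one function means the listen-address parser and the client default cannot disagree about which wildcard to use.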

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- bench/vm/kernel/Dockerfile: remove -d IPV6, restore default kernel
  (IPv6 available via olddefconfig default y)
- bench/vm/init.sh: revert 0.0.0.0 back to bare :port so parse_listen_addr
  probe mechanism is exercised in the bench environment

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
socat version bumps no longer invalidate the kernel layer cache (~5min
build). Each now has its own Dockerfile and just recipe with independent
caching. bench/vm/socat/Dockerfile uses a lighter apt install (no kernel
build tools). build-initrd now depends on build-kernel and build-socat
separately.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- bench/vm/socat/Dockerfile: remove default from ARG SOCAT_VERSION;
  version is required from caller (bench.just is sole source of truth),
  consistent with KERNEL_VERSION pattern
- client/config.rs: add #[must_use] and AF_INET6 backtick to ipv6_supported()
- exit.rs: remove unused Context import, backtick INITIAL_RETRY_DELAY in
  doc, drop redundant ..Default::default() where all fields are explicit

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Eliminate ~150 lines of duplicated code between run_tests.py and
run_benchmarks.py by pulling shared constants, preflight, QEMU helpers,
log drain, and wait utilities into a common module.

qemu_cmd() is unified with optional netem/metric kwargs so both runners
share a single implementation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Switch inter-VM networking from AF_UNIX socketpair + fd= to TCP
  listen/connect ports. The socketpair approach triggered a QEMU 10.2.0
  assertion (net_fill_rstate: size == 0) under high-throughput iperf3
  TCP load. TCP listen/connect avoids this code path entirely and matches
  the TASK.md spec.

- Replace ping-based latency with iperf3 --json mean_rtt. Busybox ping
  hung indefinitely waiting for late ICMP replies when packets were
  dropped. iperf3 TCP RTT (mean_rtt field, microseconds) is more reliable
  and consistent with the other benchmark scenarios.

- Fix iperf3 JSON parsing: iperf3 3.20 uses tab after colon
  ("field":\tvalue) not space, so -F': ' never split. Use match() to
  extract the first digit sequence on the matching line.

- Switch all throughput scenarios from plain text -f m (integer Mbps) to
  --json bits_per_second / 1e6 for sub-Mbps precision.

- Fix wallhack_version() parsing: --version output is multi-line;
  split()[-1] was grabbing the compiler date. Use split()[1] instead.

- Add docstring to qemu_base() documenting append vs extra parameters.
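
The whitespace-insensitive field extraction can be sketched in Python. The fix described above is in the VM-side script; `extract_number` and the sample fragments are hypothetical, and real iperf3 output contains many more fields.

```python
import re


def extract_number(text, field):
    """Return the first number after "field": on the first line that mentions it."""
    pattern = re.compile(
        r'"%s":\s*(-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)' % re.escape(field)
    )
    for line in text.splitlines():
        m = pattern.search(line)
        if m:
            return float(m.group(1))
    return None


# Hypothetical fragments, one tab-separated and one space-separated.
tabbed = '\t"bits_per_second":\t941500000.5,\n\t"mean_rtt":\t1840,'
spaced = '"bits_per_second": 941500000.5,'

naive = '"mean_rtt":\t1840,'.split(": ")   # colon-space never matches a tab
mbps = extract_number(tabbed, "bits_per_second") / 1e6
rtt_us = extract_number(tabbed, "mean_rtt")
```

Matching `\s*` after the colon accepts both the tab and the space form, so the parser does not depend on a particular iperf3 release's formatting.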

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@maxholman maxholman merged commit 1514b46 into main Feb 22, 2026
4 checks passed
maxholman added a commit that referenced this pull request May 6, 2026
Sweep of website/ deps to latest within ranges, plus a vite downgrade
from 8 -> 7 to match astro's transitive vite (7.3.2) and avoid a
rolldown regression with @tailwindcss/vite 4.2.4.

Closes alerts #28 #29 #30 #31 #33 #34 #35 #36 #37 #38 #39 #40 #44 #48
covering vite, picomatch, postcss, yaml, astro, smol-toml.

- vite ^8.0.1 -> ^7.3.2 (drops the now-redundant vite 8 lineage; astro
  pulls 7.3.2 transitively, which is the patched version)
- astro 6.0.6 -> 6.2.2 (#44)
- @tailwindcss/vite 4.2.2 -> 4.2.4
- smol-toml: lockfile bump to 1.6.1 (#28)
- postcss: lockfile bump to 8.5.14 (#48)
- picomatch: lockfile bumps to 2.3.2 + 4.0.4 (#29 #30 #39 #40)
- yaml is now omitted entirely (it was an optional vite peer)

Verified: pnpm build succeeds; no @tailwindcss/vite peer-dep warnings.