68 changes: 50 additions & 18 deletions README.md
@@ -19,8 +19,9 @@ NetCopy is **not**:
bearer token,
- a backup tool — there is no scheduling, snapshotting, or retention.

Tested on Linux. Runs on macOS and Windows too; see [Known issues](#known-issues)
for platform caveats.
Linux is the only platform under CI and the only one we ship release images
for. The pure-Java parts run on macOS and Windows too, but a few platform
quirks aren't tested on every commit — see [Known issues](#known-issues).

## Quick start

@@ -145,40 +146,71 @@ NetCopy splits cleanly into a **control plane** and a **data plane**.
| | | | data | | | |
| HttpPuller TcpPuller (port 7778 server) |<------>| HttpPuller TcpPuller (port 7778 server) |
| | | | | | | |
| SidecarStore (data.partial + chunks.bitmap + meta.json) |
| SidecarStore (data.partial + chunks.bitmap + chunks.hashes + meta.json) |
| JsonJobStore (<state-dir>/jobs/<id>.json) |
+---------------------------------------------+ +---------------------------------------------+
```

**Control plane (HTTP + WebSocket via Javalin, port 7777):**

- `GET /api/health` — liveness probe (no auth).
- `GET /api/browse` — list a directory under one of the peer's `--shared-root`s.
- `POST /api/manifest` — ask the peer to plan a transfer; returns a
`manifestId` plus a flat list of files, sizes, mtimes, and chunk plans.
- `POST /api/transfers` — start a job locally that will pull a manifest from
a remote peer. Persists a `JobState` in `<state-dir>/jobs/<id>.json`.
- `GET /api/transfers/{id}` — poll job state.
- `WS /ws/progress` — live `ProgressEvent`s (subscribed per `transferId`).
| Endpoint | Auth | Purpose |
|---|---|---|
| `GET /api/health` | no | Liveness probe (open). |
| `GET /api/peer/info` | yes | Peer self-description: hostname, version, TCP blob port, root counts. |
| `GET /api/browse?root=&path=` | yes | List a directory under a `--shared-root`. |
| `GET /api/browse-local?root=&path=` | yes | Same shape, rooted under a `--receive-root` (UI uses it for the target panel). |
| `POST /api/browse/stats` | yes | Recursive file count + byte total per path; powers the selection-stats footer. |
| `POST /api/manifest` | yes | Plan a transfer. Returns the full manifest (entries, sizes, mtimes, chunk plans, `manifestId`). |
| `POST /api/manifest/register` | yes | Re-register a previously-issued manifest (used by the puller after a source-side restart). |
| `GET /api/blob/{manifestId}/{fileId}` | yes | HTTP data-plane: file bytes (with `Range` support, `X-Chunk-Hash` response header). |
| `GET /api/hash/{manifestId}/{fileId}` | yes | Lazy XXH3-128 of a manifest entry; returns `202` while computing. |
| `POST /api/transfers` | yes | Start a job (target host pulls from a remote source). |
| `GET /api/transfers` | yes | List status snapshots (newest first). |
| `GET /api/transfers/{id}` | yes | Single status snapshot, including per-file table and per-chunk metrics. |
| `POST /api/transfers/{id}/{pause,resume,cancel}` | yes | Lifecycle controls. |
| `DELETE /api/transfers/{id}` | yes | Dismiss a terminal-state job from the persistent store. |
| `POST /api/relay/push` | yes | "Push from here to peer" — proxies `POST /api/transfers` to the peer using its token. |
| `GET /api/metrics` | yes | Host metrics (CPU/RAM/disk/GC, top threads) + per-server serve metrics. |
| `WS /ws/progress` | yes | Live `ProgressEvent` stream (subscribe per transfer or wildcard). |
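
As a concrete example, here is a minimal client-side call to `POST /api/transfers`
using `java.net.http`. Only the `protocol` field is documented above; the
`manifestId` request field, the bearer-token header form, and the `NETCOPY_TOKEN`
variable are illustrative assumptions, not NetCopy's actual request schema.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class StartTransferExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical request body: only "protocol" appears in the table above;
        // "manifestId" as a request field name is an assumption, not the real schema.
        String manifestId = args[0];            // id returned earlier by POST /api/manifest
        String body = """
                { "manifestId": "%s", "protocol": "tcp" }
                """.formatted(manifestId);

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:7777/api/transfers"))
                // NETCOPY_TOKEN and the Bearer scheme are assumptions for illustration.
                .header("Authorization", "Bearer " + System.getenv("NETCOPY_TOKEN"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

The returned job can then be polled with `GET /api/transfers/{id}` or watched
live on `WS /ws/progress`.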

**Data plane (two interchangeable protocols):**

- `GET /api/blob/{manifestId}/{fileId}` with HTTP `Range` headers, served by
Javalin via `FileChannel.transferTo`.
- A custom binary TCP protocol on port 7778: framed `[len:u32][type:u8][payload]`
with `HELLO`/`REQUEST`/`DATA_HEAD`/`DATA`/`DATA_END`/`ERR`/`BYE`. Designed to
reuse one connection across many `pullChunk` calls and avoid HTTP parsing
overhead at the price of a more interesting wire format.
with `HELLO` / `REQUEST` / `DATA_HEAD` / `DATA` / `DATA_END` / `DATA_END_V2`
(xxh3 trailer, single-pass; v0.3.0+) / `ERR` / `BYE`. Designed to reuse one
connection across many `pullChunk` calls and avoid HTTP parsing overhead at
the price of a more interesting wire format.
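
As a rough orientation, a sketch of the `[len:u32][type:u8][payload]` framing
described above. The concrete type-byte values, and whether `len` includes the
type byte, are not specified here; both are assumptions in this sketch.

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Sketch of the [len:u32][type:u8][payload] framing. The concrete type-byte
// values, and whether `len` counts the type byte, are assumptions here.
final class FrameWriter {
    // Hypothetical codes for HELLO / REQUEST / DATA_HEAD / DATA / DATA_END /
    // DATA_END_V2 / ERR / BYE.
    static final int HELLO = 0, REQUEST = 1, DATA_HEAD = 2, DATA = 3,
            DATA_END = 4, DATA_END_V2 = 5, ERR = 6, BYE = 7;

    private final DataOutputStream out;

    FrameWriter(OutputStream raw) {
        this.out = new DataOutputStream(raw);
    }

    /** Writes one frame; here `len` is taken to be the payload length only. */
    void write(int type, byte[] payload) throws IOException {
        out.writeInt(payload.length); // len: u32, big-endian
        out.writeByte(type);          // type: u8
        out.write(payload);           // payload bytes
        out.flush();
    }
}
```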

The protocol is selected per job at start time. See
[docs/protocol-comparison.md](docs/protocol-comparison.md) for benchmarks.
[docs/protocol-comparison.md](docs/protocol-comparison.md) for design notes.

**State and resume:**

- Each in-progress target file owns a sidecar directory `<file>.netcopy/`
containing `data.partial` (sparse, written at offsets), `meta.json`
(size, mtime, chunk plan), and `chunks.bitmap` (1 bit per chunk, set after
the chunk is downloaded **and** its xxh3-128 hash verified).
containing four files:
- `data.partial` — sparse, pre-allocated to the final size, written via
positional FileChannel writes;
- `meta.json` — immutable per-file descriptor (relPath, size, sourceMtime,
chunk plan, `schemaVersion`);
- `chunks.bitmap` — one bit per chunk, set after the chunk's bytes are
fsynced **and** its **xxh3-128** chunk-level hash matches what the source
advertised on the wire;
- `chunks.hashes` — fixed-size array of XXH3-128 digests (16 bytes per
chunk), positionally written as each chunk completes. Used by the
selective re-verify path on full-file hash mismatch so resume re-pulls
only the corrupted chunks instead of the whole file.
- Hashing has two layers:
- **Per-chunk** verification (and the on-the-wire `X-Chunk-Hash` /
    `DATA_END_V2`) is **XXH3-128** — fast (~10 GB/s on x86) and only needs a
    small per-chunk buffer.
- **Full-file finalize** is **SHA-256** in 256 KiB strides. Streaming
XXH3-128 in this codebase buffers all bytes into a `ByteArrayOutputStream`
that overflows the array-size limit on multi-GiB files — SHA-256 streams
cleanly via `MessageDigest.update`. The resulting digest lives in the
JSON's `hashHex` field for v0.x wire-format stability (the field name
will change in a future major bump).
- After all chunks are verified, `FileFinalizer` rehashes the whole file and
  atomic-renames `data.partial` to the final target path (a minimal sketch of
  this step follows the list).
- A job's overall state lives in `<state-dir>/jobs/<id>.json` (one JSON per
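
For illustration, a minimal sketch of the finalize step described above (not
the actual `FileFinalizer`): SHA-256 streamed in 256 KiB strides via
`MessageDigest.update`, followed by an atomic rename of `data.partial` onto
the final target path.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

final class FinalizeSketch {

    /** Streams SHA-256 over the completed data.partial in 256 KiB strides. */
    static String sha256Hex(Path dataPartial) throws IOException, NoSuchAlgorithmException {
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        byte[] stride = new byte[256 * 1024];               // 256 KiB per update
        try (InputStream in = Files.newInputStream(dataPartial)) {
            int n;
            while ((n = in.read(stride)) != -1) {
                sha256.update(stride, 0, n);                // no whole-file buffering
            }
        }
        return HexFormat.of().formatHex(sha256.digest());   // value stored as `hashHex`
    }

    /** Once the digest checks out, promote data.partial to the final path. */
    static void promote(Path dataPartial, Path finalTarget) throws IOException {
        Files.move(dataPartial, finalTarget, StandardCopyOption.ATOMIC_MOVE);
    }
}
```
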
160 changes: 83 additions & 77 deletions docs/protocol-comparison.md
@@ -1,82 +1,88 @@
# HTTP vs TCP — protocol comparison

NetCopy ships two interchangeable data-plane protocols. This document is
the home of the quantitative comparison between them. The numbers below
are produced by task **V5 — protocol comparison** and are placeholders
until that pass runs.

## What we are measuring

The same workload runs back-to-back over both protocols, on the same two
hosts, with the same chunk plan. Each row in the table below should report
median and p95 of three runs.

- **Throughput**: useful payload bytes per wall-clock second, averaged over
the whole transfer.
- **Time to first byte (TTFB)**: from `POST /api/transfers` accepting to the
first `ChunkCompleted` ProgressEvent.
- **CPU time**: server-side and client-side `getrusage` deltas, normalised
per GB transferred.
- **Connection count**: peak concurrent sockets the data plane opened.
- **Behaviour under loss**: same transfer with `tc qdisc add ... netem
loss 1%` applied to the receive interface — does the protocol recover
  cleanly, and what is the throughput delta?

## Test workloads
# HTTP vs TCP — protocol design notes

NetCopy ships two interchangeable data-plane protocols. The user picks one
per transfer. This document explains the trade-offs and points to a manual
reproduction for benchmark numbers.

## What's different

Both protocols carry the same byte payload (file contents, in chunks, with
XXH3-128 chunk-level verification). They differ in framing and how the
hash gets onto the wire:

- **HTTP** — `GET /api/blob/{manifestId}/{fileId}` with a `Range:
bytes=START-END` header per chunk. Connection reuse via keep-alive. Server
pre-computes the chunk's XXH3-128, sets it as `X-Chunk-Hash` response
header, then streams the body via `FileChannel.transferTo` (which on Linux
decays to `sendfile(2)`). Pro: trivial to debug with `curl`, plays well with
any HTTP-aware proxy. Con: HTTP parsing overhead per chunk, and HTTP/1.1
connection-per-concurrent-chunk.
- **TCP** — one long-lived connection per peer, multiplexed by `reqId`.
Custom binary framing (see `tasks/contracts/data-formats.md`). Versioned
protocol: v1 is two-pass (hash → DataHead → stream → DataEnd, identical to
the HTTP path conceptually); **v2 (default since v0.3.0)** streams and
hashes in a single pass, putting the digest in a trailing `DataEndV2`
frame. Pro: fewer TCP connections (one per peer), no HTTP overhead, single
read pass on the source-side disk. Con: needs its own port (`--tcp-port`),
not curl-debuggable.
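
For illustration, a minimal sketch of pulling one chunk over the HTTP path with
`java.net.http`: a `Range` request against `/api/blob/{manifestId}/{fileId}`,
reading back the `X-Chunk-Hash` response header. The host, chunk size,
bearer-token header form, and `NETCOPY_TOKEN` variable are placeholders;
XXH3-128 verification is left to whichever implementation the codebase uses.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PullChunkOverHttp {
    public static void main(String[] args) throws Exception {
        String manifestId = args[0], fileId = args[1];      // placeholders
        long start = 0, end = 4 * 1024 * 1024 - 1;          // one 4 MiB chunk, as an example

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://source-host:7777/api/blob/" + manifestId + "/" + fileId))
                // Token variable and Bearer scheme are assumptions for illustration.
                .header("Authorization", "Bearer " + System.getenv("NETCOPY_TOKEN"))
                .header("Range", "bytes=" + start + "-" + end)
                .GET()
                .build();

        HttpResponse<byte[]> response =
                client.send(request, HttpResponse.BodyHandlers.ofByteArray());
        String advertised = response.headers().firstValue("X-Chunk-Hash").orElse("");
        byte[] chunk = response.body();
        // Verify `chunk` against `advertised` with the XXH3-128 implementation the
        // codebase uses before recording the chunk as done.
        System.out.printf("status=%d bytes=%d x-chunk-hash=%s%n",
                response.statusCode(), chunk.length, advertised);
    }
}
```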

## Where the difference matters

- **Many small files (≤ 1 MB each).** TCP wins clearly. HTTP pays a full
request line + headers per chunk; with thousands of files this dominates.
- **One big file (multi-GB) on a fast disk.** Mostly identical. Both
protocols are CPU-bound on the hash and IO-bound on the disk; framing
overhead is in the noise.
- **One big file on a cold-cache HDD.** TCP v2 is meaningfully faster
because it does one disk read per chunk on the source instead of two.
v1's two-pass design was tractable on SSDs (the second pass came from the
page cache) but on HDD the source ended up reading the file twice with
cold seeks. v0.3.0 fixed that.
- **Lossy network.** Both rely on the kernel's TCP retransmit; the
application layers don't differ. NetCopy retries failed chunks with
exponential backoff identically.

In practice the user-visible bottleneck on a LAN is almost always **the
slower of the two disks** (source HDD seek + receiver fsync), not the
protocol. We've measured ~50–60 MB/s sustained from a single HDD source
with 8 parallel chunks regardless of which protocol we pick.

## Reproducing a comparison by hand

1. Start two daemons with identical flags except `--port`, `--tcp-port`, and
roots. Pin the JVM with `-XX:ActiveProcessorCount=N` if you want to
compare across CPU budgets.
2. Pre-generate the workload under one daemon's `--shared-root` (a sketch for
   generating W2 follows these steps).
3. From the UI on the other daemon, plan a transfer, then start it twice in
a row — once with `protocol: "http"`, once with `"tcp"`. Record the
`TransferCompleted` event's `totalDurationMs` and `avgThroughputBps`,
and screenshot the Performance modal's "This transfer (chunks)" tile for
per-chunk timings.
4. For loss runs:

```bash
sudo tc qdisc add dev <iface> root netem loss 1%
# ... run the transfer ...
sudo tc qdisc del dev <iface> root
```

5. Repeat with the TCP server disabled (`--tcp-port 0`) on the source side
to confirm the HTTP fallback works.
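
A hypothetical helper for step 2: it lays out workload W2 from the table below
(1000 files of ~64 KB of incompressible bytes) under a shared root. The
directory and file names are arbitrary.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Random;

// Hypothetical helper for step 2: lays out workload W2 (1000 files of ~64 KB)
// under a shared root. Directory and file names are arbitrary.
public class GenerateW2 {
    public static void main(String[] args) throws Exception {
        Path root = Path.of(args[0]);                 // pass the --shared-root path
        Path dir = Files.createDirectories(root.resolve("w2-small-files"));
        Random rng = new Random(42);                  // fixed seed for reproducible bytes
        byte[] buf = new byte[64 * 1024];
        for (int i = 0; i < 1000; i++) {
            rng.nextBytes(buf);                       // incompressible payload
            Files.write(dir.resolve(String.format("file-%04d.bin", i)), buf);
        }
    }
}
```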

We deliberately don't ship a canned benchmark table here: numbers from a
single hardware setup mislead readers comparing to their own. The
Performance modal already exposes the per-chunk timings (source latency,
wire time, persist time, pool acquire wait) you need to identify your own
bottleneck.

## Suggested workloads

| ID | Description |
|---|---|
| W1 | One 32 GB file (large-chunk path) |
| W2 | 1000 small files of ~64 KB each (small-chunk path, file-parallelism dominates) |
| W1 | One 32 GB file (large-chunk path; tests sustained throughput) |
| W2 | 1000 small files of ~64 KB each (request count dominates) |
| W3 | Mixed: 4 GB ISO + 50 MB of small docs (typical real-world mix) |
| W4 | W1 again, but with `--file-parallelism=1 --chunks-per-file=1` (single-stream baseline) |

Each workload runs once over HTTP (`--tcp-port 0` on the server side) and
once over TCP (`protocol: "tcp"` in the transfer request).

## Results — placeholder
| W4 | W1 again with `--file-parallelism=1 --chunks-per-file=1` (single-stream baseline) |

Filled in by V5.

| Workload | Protocol | Throughput (MB/s) | TTFB (ms) | Server CPU (s/GB) | Peak conns | Loss 1% throughput |
|---|---|---|---|---|---|---|
| W1 | HTTP | _TBD_ | _TBD_ | _TBD_ | _TBD_ | _TBD_ |
| W1 | TCP | _TBD_ | _TBD_ | _TBD_ | _TBD_ | _TBD_ |
| W2 | HTTP | _TBD_ | _TBD_ | _TBD_ | _TBD_ | _TBD_ |
| W2 | TCP | _TBD_ | _TBD_ | _TBD_ | _TBD_ | _TBD_ |
| W3 | HTTP | _TBD_ | _TBD_ | _TBD_ | _TBD_ | _TBD_ |
| W3 | TCP | _TBD_ | _TBD_ | _TBD_ | _TBD_ | _TBD_ |
| W4 | HTTP | _TBD_ | _TBD_ | _TBD_ | _TBD_ | _TBD_ |
| W4 | TCP | _TBD_ | _TBD_ | _TBD_ | _TBD_ | _TBD_ |

## Provisional reasoning

Until V5 produces real numbers, the design intuition is:

- **W1 (one big file)**: the two protocols should be within a few percent.
Both are dominated by `FileChannel.transferTo` on the server and direct
`pwrite` on the client; the framing overhead is amortised across multi-MB
chunks.
- **W2 (many small files)**: TCP should win materially. HTTP pays a full
request/response round-trip per chunk, plus header parsing; TCP reuses
one connection and sends only an 8-byte `REQUEST` frame per chunk.
- **W3 (mixed)**: closer to W1 by byte count; closer to W2 by request count.
Expect TCP to be modestly ahead.
- **W4 (single stream)**: both protocols saturate one TCP flow; the
bottleneck is the kernel and the NIC, not the framing.

## Reproducing the benchmark

V5 will publish a script under `verify/V5/` that drives both daemons in
the same JVM (or two JVMs on the same host) using a tmpfs receive root
to factor out disk speed. Until then, reproduce by hand:

1. Start two daemons with identical flags except `--port`, `--tcp-port`,
and roots. Pin the JVM with `-XX:ActiveProcessorCount=N` if you want
to compare across CPU budgets.
2. Pre-generate the workload under one daemon's `--shared-root`.
3. From the UI on the other daemon, plan a transfer, then start it twice
in a row — once with protocol HTTP, once with TCP. Record the
`TransferCompleted` event's `totalDurationMs` and `avgThroughputBps`.
4. For loss runs: `sudo tc qdisc add dev <iface> root netem loss 1%`.
Don't forget to `tc qdisc del` afterwards.
W2 is the workload where TCP shows its largest advantage; W4 is where the
two protocols converge.