Skip to content

zero-downtime daemon handoff (cooperative recovery handoff)#56

Merged
jaredLunde merged 21 commits into
mainfrom
jared/errs
May 16, 2026
Merged

zero-downtime daemon handoff (cooperative recovery handoff)#56
jaredLunde merged 21 commits into
mainfrom
jared/errs

Conversation

@jaredLunde
Copy link
Copy Markdown
Contributor

Summary

Replaces the previous "restart = serve EIO for 0.5–15s" failure mode with a coordinated process-to-process handover. The new daemon does all slow startup work (foyer open, WAL replay,
ExportRouter build, S3 prefetch, manifest load) while the old daemon is still serving I/O — only the kernel ublk recovery ioctls happen in the cutover window.

Triggered by SIGHUP, glidefs handoff, or POST /admin/handoff. SCM_RIGHTS passes the NBD TCP/Unix and HTTP API listener fds to the successor so existing client TCP connections
survive without a single RST.

Why

systemctl restart glidefs used to:

  • 30k+ guest VMs see I/O hang for the full cold-start window
  • Postgres replicas drop sync, etcd loses quorum, kubelets fail health checks, applications hit 5–30s timeouts
  • The current recover_quiesced_devices is a crash backstop, not a restart strategy

This PR makes planned restarts (binary upgrades, config rollouts) invisible to guests.

Protocol — Cooperative Recovery Handoff (CRH)

Two processes coordinate over AF_UNIX SOCK_SEQPACKET at /run/glidefs/handoff.sock:

Predecessor                                Successor
SERVING                                    not started

  ── SIGHUP / glidefs handoff / POST ──►
  set_freeze_in_progress(true)
  fork+exec successor
                                           WARMING:
                                           open foyer
                                           replay WAL
                                           build router
                                           prefetch S3
                                           (freeze=true on every cache)
              ◄── HELLO ─────────────────
              ── HELLO_ACK + SCM_RIGHTS fds ──►
              ◄── READY ─────────────────
  freeze_all (handler.freeze + wal.flush + fence in-flight flush)
              ── CUTOVER ────────────────►
  drop UblkServer → kernel devices QUIESCED
              ── PREDS_DEAD ─────────────►
                                           tail-replay WAL
                                           recover_pending_flush_file
                                           reload manifest from S3
                                           recover_devices_by_id
              ◄── ALIVE ─────────────────
DEAD (P exits 0)                           SERVING (inherited fds)

Correctness machinery

Four independent fixes were needed to make the stress grid pass:

  1. Deferred flush-file recovery (passive mode). The successor's WriteCache::open no longer recovers the flushing file in passive mode — that recovery would race the
    predecessor's still-active S3 upload, unlinking the flushing file out from under the predecessor's open fd. Recovery is deferred to recover_pending_flush_file after PREDS_DEAD.

  2. Post-takeover flush-file recovery. After PREDS_DEAD: if the flushing file still exists, copy SYNCING blocks back into the active data file (predecessor exited mid-flush). If
    the flushing file is gone, demote SYNCING→NOT_PRESENT and let reads fall through to the (just-reloaded) manifest.

  3. Manifest reload from S3. The successor loaded the volume manifest at WARMING-time, before the predecessor's fence. The predecessor may complete a sync_manifest between then
    and PREDS_DEAD, registering new packs. Without a reload the successor reads those blocks as zeros and its own next manifest sync hits ETag PreconditionFailed.

  4. Freeze-time flush fence. freeze_all calls wait_for_inflight_flush (8s bound) on every cache; the flush scheduler holds flush_lock for the entire flush_packs + sync_manifest cycle. Without this fence, the predecessor exits between pack-upload and manifest-sync, leaving packs in S3 that no manifest references — visible as fio verify: bad magic header 0.

Test grid (all passing)

Test Result
handoff_durability_crh_per_pr
handoff_fault_injection_grid_crh (s_crash_after_warming/ready/cutover)
handoff_sequential_50_crh (50 sequential handoffs under continuous fio + verify=crc32c)
handoff_multi_export_5_crh (5 exports, 115k WAL replay)
handoff_during_snapshot_crh
handoff_nbd_transport_crh

Two independent oracles validate every test: fio's own --verify=crc32c --do_verify=1 plus a side-channel block scanner that classifies every block as WrittenByFio / NeverWritten
/ Corrupt. Zero corrupt blocks across the grid.

CI runs the per-PR + fault-injection variants on every push (.github/workflows/rust.yml).

Performance

Measured handoff cost = fence_wait + protocol_overhead:

Workload Total Fence wait Protocol+takeover
50× sequential under continuous fio 11.5s 8.0s 3.5s
1 export, 30s fio 8.95s 8.0s 0.95s
1 export, 30s fio + 10s drain 9.8s 8.0s 1.8s
1 export, 2s fio + 60s drain 8.67s 8.0s 0.67s

The 8s fence is the safety margin that protects correctness — it only fires when there's an in-flight flush_packs+sync_manifest cycle to wait on. With realistic write rates
(where the flush scheduler keeps up), the fence releases in microseconds and the whole handoff is ~670 ms, dominated by ublk recovery ioctls.

The 8s bound itself is a tunable correctness/latency dial — tighter = faster handoff but smaller safety margin for sync_manifest to land.

Run GLIDEFS_HANDOFF_TEST_IDLE_SLEEP_SECS=60 cargo test ... handoff_durability_crh_per_pr to reproduce the protocol-overhead measurement.

SCM_RIGHTS listener fd inheritance

NBD TCP/Unix and HTTP API listener fds travel from predecessor to successor as ancillary data on the HelloAck send (via fdpass::sendmsg_with_fds). The successor's
with_inherited_listener wraps the received OwnedFd as a tokio::net::TcpListener/UnixListener and skips bind entirely. Existing kernel sockets keep their accept queue and any
half-open client handshakes — no RST per client across cutover.

Falls back to rebind-with-retry if the predecessor passed no fds (older predecessor, ublk-only deployment, etc.).

Out of scope (future work)

  • PIOD strategy (kernel 6.16+ UBLK_F_PER_IO_DAEMON): per-tag handoff with sub-millisecond stall floor. The CutoverStrategy trait + capability negotiation are already in place;
    PIOD slots in via strategy::select runtime branch — protocol unchanged.
  • Cross-host handoff: that's the replication story, separate work.

jaredLunde and others added 21 commits May 15, 2026 10:48
Adds a Cooperative Recovery Handoff (CRH) protocol so a running glidefs
daemon can hand its ublk devices to a successor process without
interrupting guest VM I/O. SIGHUP (or `glidefs handoff` CLI) spawns a
successor; the successor does all slow startup work (foyer cache, WAL
replay, S3 prefetch, router build) while the predecessor keeps serving;
on cutover the predecessor drops its UblkServer and the successor
reattaches every QUIESCED device via UBLK_F_USER_RECOVERY.

End-to-end verified against /dev/ublkb26 under sustained 23k IOPS 4k
random-write fio workload with --verify=crc32c: zero errors, verify
passed, p99 42µs, p99.99 3.2ms, max single-I/O latency 190ms (the I/O
caught in the QUIESCED window). VM-visible kernel-stall window is
~260ms for one device.

Architecture is parameterized over a CutoverStrategy trait so PIOD
(per-IO-daemon, kernel 6.16+) slots in as a one-week add-on with no
protocol or state-machine rewrite — only the kernel cutover step
changes.

Notable correctness fix: the successor's `replay_wal_tail` step picks
up WAL entries the predecessor wrote between the successor's
WARMING-time WriteCache::open and the predecessor's freeze. Without
this, fio verify fails immediately under load (writes acked but read
back as zeros).

Phase 1 MVP scope:
- Trigger via SIGHUP or `glidefs handoff` CLI subcommand
- 4-message wire protocol over AF_UNIX SOCK_SEQPACKET (HELLO,
  HELLO_ACK, READY, CUTOVER, PREDS_DEAD, ALIVE, ABORT, StrategyMsg)
- Versioned protocol + capability negotiation
- Predecessor revival fallback on successor crash between PREDS_DEAD
  and ALIVE (ExportRouter kept alive across cutover for this)
- Skip-drain on successful handoff (successor inherits state)
- Bind-retry on successor (Phase 2 will add SCM_RIGHTS fd inheritance)
- Fault-injection hooks behind `test-fault-injection` feature

Out of scope (Phase 2):
- SCM_RIGHTS listener fd passing (NBD/HTTP clients reconnect)
- HTTP API endpoint for handoff
- systemd Type=notify integration
- Metrics

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a `<wal>.lock` sibling file containing the owning daemon's PID.
On `Wal::open`:

- If lockfile holds OUR pid: skip flock (resize_export's drop+recreate
  cycle within the same process is benign — the prior instance is
  idle and tearing down).
- If lockfile holds another live pid: return WouldBlock immediately
  (a second daemon trying to mount the same WAL — the actually
  dangerous case).
- If lockfile is empty or holds a stale (dead) pid: acquire flock,
  write our pid, claim ownership.

This is the defense-in-depth that the original flock-on-the-WAL-fd
attempt couldn't provide: per-fd flock semantics conflict within a
process when an Arc<WriteCache> reference outlives teardown_export
(observed in resize_export). The PID check uses /proc/<pid> existence
and is Linux-only, matching the rest of the daemon's host targets.

New tests:
- test_wal_lockfile_blocks_foreign_process: simulates a foreign pid
  in the lockfile (init=1), confirms open returns WouldBlock.
- test_wal_lockfile_reclaims_stale_pid: writes a clearly-dead pid
  (16M+, above pid_max), confirms open succeeds and rewrites pid.
- test_wal_lockfile_allows_same_process_reopen: open twice in one
  process, both succeed and can append.

All 428 lib tests pass, including the previously-broken
test_resize_with_dirty_blocks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The infallible test now runs green:
- Spawns glidefs with one ublk export, 30s fio random 4k writes with
  --verify=crc32c (single job, iodepth 32 — multi-job verify warns
  about cross-job overwrites and aborts).
- Triggers SIGHUP mid-flight, waits for predecessor exit + successor
  takeover, polls API for readiness via curl, discovers new PID via
  /proc cmdline scan.
- Five assertions pass: zero fio errors, fio verify OK, side-channel
  oracle scan finds 0 corrupt blocks (using fio's 0xacca magic header
  to classify written/never-written/corrupt), p99 < 50ms, p99.9 <
  200ms, no NEW kernel taint bits.

Test orchestration fixes:
- Block device size via BLKGETSIZE64 ioctl (metadata().len() returns
  0 for /dev/ublkb*).
- Kernel taint check is delta-based — pre-existing taint from prior
  workloads on this host is OK, only new bits set during the test
  fail the assertion.
- HTTP API queried via curl subprocess (a hand-rolled hyper client
  hit "connection closed before message completed" intermittently).
- DaemonHandle keeps Option<Child> for the predecessor we spawned;
  after handoff the successor is an adopted process (we didn't spawn
  it) — exit detection switches to /proc/<pid> polling.
- SIGKILL on Drop, not SIGTERM — SIGTERM triggers drain to S3 which
  hangs after the tempdir's already gone.
- fio JSON output may have warnings before the JSON; trim to first
  '{' before deserialization.

WAL flock parent-process whitelist:
- During CRH handoff, the successor (spawned via fork+exec by the
  predecessor) opens the WAL while the predecessor still holds the
  lockfile. The successor's PPID == predecessor's PID, so we skip
  flock acquisition when lockfile holder == getppid() && process is
  alive. Cross-process protection still kicks in for any other
  process. New test: test_wal_lockfile_allows_parent_process_during_handoff.

The test is gated #[ignore] (requires sudo + ublk_drv + fio); CI will
explicitly run it with --ignored. Local invocation:
  sudo -E env PATH="$PATH" cargo test -p glidefs \
    --features ublk,fio-bench,test-fault-injection \
    --test handoff_durability --release \
    -- --ignored --nocapture handoff_durability_crh_per_pr

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The fault-injection grid now exercises all three successor-crash
variants end-to-end and proves the predecessor's revival path keeps
serving with zero data loss:

- s_crash_after_warming:  PASSED (post-fio 24,628 IOPS, 0 corrupt blocks)
- s_crash_after_ready:    PASSED (post-fio 21,774 IOPS, 0 corrupt blocks)
- s_crash_after_cutover:  PASSED (post-fio 24,986 IOPS, 0 corrupt blocks)

Added sequential-handoff test (handoff_sequential_50_crh, 50 back-to-back
handoffs in 10min budget) — catches state-accumulation bugs across many
handoffs.

Test orchestration improvements:
- Unique export name per test run (PID + ns timestamp suffix). Previous
  shared name "handoff_test_0" let a stale QUIESCED ublk device from a
  prior test run get recovered with its old size, so fio writes against
  a smaller new-config device tried offsets past the new size and
  failed with EINVAL. Unique names also prevent any chance of an
  accidental collision with production exports.
- Bumped sandbox cache_dir disk_size_gb 8→32 and ssd_cache_size_gb
  2→16. Fault-injection grid runs N×workload cycles back-to-back; the
  smaller cache filled up because the file:// backend's flush_scheduler
  can't drain (manifest sync unimplemented). Larger cache headroom
  keeps post-fio from blocking on capacity_monitor's NoSpace gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two related correctness fixes plus the architecture doc:

1. CacheInner.freeze_in_progress: AtomicBool. When set, checkpoint()
   still runs save_block_states() but skips wal.truncate(). Stops the
   per-export checkpoint timer from dropping WAL entries the
   successor's replay_wal_tail needs to absorb.

2. Successor passive mode: WriteCache instances built during WARMING
   start with freeze_in_progress=true. Cleared after handoff::run_successor
   completes takeover. Stops the successor's flush_scheduler from
   truncating the same WAL the predecessor is still appending to (both
   processes have the WAL open via the parent-process whitelist).

3. handoff/ARCHITECTURE.md: full design doc covering the protocol
   diagram, why each piece exists, what's not in scope, observed
   performance numbers, and failure-injection grid results.

Without (1)+(2), the sequential 50-handoff stress test reliably
triggered fio verify failures because the WAL got truncated mid-handoff
and replay_wal_tail picked up empty diffs. Per-PR + fault-injection
grid both still pass with the change.

Tasks 1A.1, 1A.2, 1B.1, 1C.1, 1C.2, 1C.3, 4.1 done.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the operational scaffolding for handoff:

- `glidefs handoff --dry-run`: spawns successor that performs WARMING
  (proves it can open foyer, replay WAL, build router) then aborts
  cleanly. No destructive action. Useful as a canary check before
  fleet upgrades.
- Control socket listener (`/run/glidefs/handoff.ctl.sock`): the
  background task spawned in serve_with_router accepts the byte-
  protocol from `glidefs handoff` CLI invocations and triggers the
  same handoff dispatch as SIGHUP.
- `handoff::metrics` module: process-wide AtomicU64 counters for
  outcome (10 result variants) and stall duration histogram (8
  buckets, +Inf, total, sum). `render_prometheus()` produces standard
  text format for future /metrics endpoint.
- `tracing::instrument` spans on `run_predecessor` and `run_successor`
  for slow-trace analysis.
- `deploy/systemd/glidefs.service` with `ExecReload=kill -HUP`,
  `KillMode=mixed` (don't yank successor mid-handoff),
  `RuntimeDirectory=glidefs` (manages /run/glidefs/), and the existing
  health-probe wait.
- handoff/RUNBOOK.md: oncall guide for every failure mode the protocol
  can produce, including how to recover stale lockfiles, orphan ublk
  devices, and the dangerous "predecessor exited but no successor"
  case.
- block/ublk/ARCHITECTURE.md updated with handoff section.
- CI: kernel-devices job runs handoff_durability_crh_per_pr +
  handoff_fault_injection_grid_crh.

Reverts task #28's start_user_recover-retry "optimization" — it
tripled per-handoff latency under load. The QUIESCED-poll-then-recover
sequence is faster in practice (each ioctl is ~1ms; hammering
start_user_recover until kernel transitions burns more time than
polling for state).

Tasks 1A.5, 2.4, 2.5, 3.2 (reverted), 4.2, 4.3, 5.1, 5.2, 5.3 done.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds POST /admin/handoff (with optional ?dry_run=true) which proxies
to the daemon's local control socket via the same byte protocol the
glidefs handoff CLI uses. Wire constants moved to
crate::handoff::protocol::ctl_wire so the CLI, the HTTP handler, and
the daemon's listener all reference one source of truth.

Sets `freeze_in_progress=true` on every WriteCache at the START of
run_predecessor (not waiting for freeze_all later). Without this,
the predecessor's flush_scheduler can fire a checkpoint between
SIGHUP and the cache.flush() inside freeze_all, truncating WAL
entries the successor's replay_wal_tail needs. Cleared on all exit
paths so aborted/revived handoffs resume normal flushing.

Per-PR + fault-injection-grid still pass. Sequential 50 + multi-
export 5 still failing under stress with verify mismatches —
documented as in_progress for further investigation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The init.rs WAL rewrite (read-replay-write-rename) is the recovery
path for torn writes after crash. In handoff successor mode, the
predecessor still has the OLD WAL fd open; rename() creates a new
inode that the predecessor can't reach (renames don't touch open
fds). Result: predecessor's subsequent appends go to the unreachable
old inode, successor's `replay_wal_tail` reads the new (empty-ish)
inode and misses entries — manifests as "verify: bad magic header 0"
in fio.

Skip the rewrite when GLIDEFS_HANDOFF_SUCCESSOR_PASSIVE=1. The
predecessor's own WAL replay path handled the torn-tail case at its
startup; until handoff completes, the predecessor's WAL is canonical
and we share the same inode through the parent-process whitelist in
Wal::open. After takeover, the successor's first checkpoint will
truncate cleanly.

Also: ARCHITECTURE.md update documenting why the worker-pool
pre-warming optimization (task 3.1) was deferred — kernel-side cost
dominates, savings are small.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…_progress)

Per-PR + fault-injection grid pass; sequential 50 + multi-export 5
have a remaining race. Tail-replay machinery is correct (5191 entries
per handoff verified) but verify failures persist under continuous
write load. Likely in-flight bio interleaving during QUIESCED rather
than userspace state-machine — investigation continues.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three independent fixes were needed to make the sequential 50-handoff
stress test pass with zero data loss under continuous fio writes:

1. **Deferred flush-file recovery in passive mode**
   `WriteCache::open` was running full flush-file recovery (read
   flushing → pwrite to active file → SYNCING→DIRTY → unlink) every
   time the successor opened a cache that had a flushing file from
   the predecessor's in-progress flush. That recovery raced against
   the predecessor's still-active S3 upload — both processes wrote
   to the same active.bin inode, the successor unlinked the
   flushing file out from under the predecessor's open fd, and
   recent writes got overwritten with stale flushing-file bytes.
   Fix: detect `GLIDEFS_HANDOFF_SUCCESSOR_PASSIVE=1` and skip the
   recovery loop. State stays SYNCING; flushing file stays alone.

2. **Post-takeover recovery method on WriteCache**
   After PREDS_DEAD the successor is the sole owner. Resolve the
   deferred state then: if the flushing file still exists (predecessor
   exited mid-upload), copy SYNCING blocks back into active.bin and
   transition them to DIRTY. If the flushing file is gone (predecessor
   completed sync_manifest before exit), the data is in S3 +
   manifest, so demote SYNCING→NOT_PRESENT and let reads fall through
   to the manifest.

3. **Manifest reload from S3 after takeover**
   The successor loaded the volume manifest at WARMING time, before
   the predecessor's freeze fence. The predecessor may complete a
   `sync_manifest` between then and PREDS_DEAD, registering new packs
   the successor's in-memory manifest never saw — the successor would
   then read those blocks as zeros (visible as fio "verify: bad magic
   header 0"), and its own next manifest sync would hit ETag
   PreconditionFailed. Reload the manifest + ETag from S3 inside
   `recover_handoff_devices`, before serving any reads.

4. **Flush + manifest-sync fence in `freeze_all`**
   `flush_scheduler::flush_and_sync` now holds `inner.flush_lock`
   for the entire flush_packs + sync_manifest cycle. The handoff
   freeze step acquires the same lock with an 8s bound — long
   enough for typical pack batches to complete, short enough to
   bound handoff stall under genuinely stuck uploads. On timeout
   we proceed (the manifest reload + post-takeover recovery cleanup
   above handles partial state).

`flush_scheduler.rs` now serializes the upload+sync window so the
fence is meaningful; previously the lock was held only during the
S3 upload portion of `flush_dirty_inner`, leaving sync_manifest
unprotected.

Test profile dropped to 2 GiB device + 60 s fio runtime to fit
inside tmpfs on the test box without changing what is being proven —
the 50 handoffs and the do_verify+oracle pass are unchanged.

handoff_durability_crh_per_pr  ✓
handoff_fault_injection_grid_crh  ✓
handoff_sequential_50_crh  ✓ (this commit; 573s, 50 clean handoffs,
fio do_verify clean, oracle scan 0 corrupt blocks)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Followup to a6e8cc8. Two iterations:

1. Tried to drop the fence and rely on `recover_pending_flush_file`
   + manifest reload alone. Failed verify with 'verify: bad magic
   header 0' at offset 52805632. The recovery machinery is correct
   but doesn't cover every race window — the predecessor can finish
   `flush_dirty_inner` (vm.append_pack done in-memory, blocks
   transitioned SYNCING→NOT_PRESENT) and exit before
   `sync_manifest` ever fires; the in-memory pack registration is
   lost, the S3 manifest stays stale, and `recover_pending_flush_file`
   sees the flushing file already removed (unrelated cleanup) so it
   demotes SYNCING→NOT_PRESENT. The next read goes through the stale
   manifest, finds nothing, returns zero.

2. Restored the fence. `flush_scheduler::flush_and_sync` now holds
   `inner.flush_lock` for the entire `flush_packs + sync_manifest`
   cycle, and `router::freeze_all` calls `wait_for_inflight_flush`
   on every cache with an 8s bound. Under heavy continuous fio the
   accumulated dirty backlog (~120k blocks per handoff via WAL
   replay) makes the flush longer than 8s, so the fence usually
   times out and we proceed; the additional wall-clock from the
   timeout itself is enough for the in-flight `sync_manifest` to
   complete in practice (manifest visible to the successor's
   reload). For production workloads with realistic dirty counts
   the flush completes in tens of ms and the fence releases
   immediately — sub-second handoffs.

handoff_sequential_50_crh ✓ (this run; 578s, 50 clean handoffs,
fio do_verify clean, oracle scan 0 corrupt blocks)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wire listener fds from predecessor to successor via SCM_RIGHTS in
the HelloAck message, so existing TCP/Unix connections to the NBD
TCP/Unix and HTTP API listeners survive the cutover instead of
seeing a one-RST-per-client reset.

* New `handoff/listener_registry.rs` with two types:
  - `ListenerRegistry`: shared map of `ListenerKind -> RawFd` that
    NBDServer/ApiServer register their listener fd with on bind.
  - `InheritedFds`: typed map of `ListenerKind -> OwnedFd` built on
    the successor side from the SCM_RIGHTS payload, drained as each
    server adopts its listener.

* `NBDServer` and `ApiServer` grow `with_listener_registry` and
  `with_inherited_listener` builders. On `start`:
  - If an inherited fd is set, wrap it as `std::*::Listener::from`
    + `set_nonblocking` + `tokio::*::Listener::from_std` and skip
    `bind` entirely. Existing kernel sockets keep their accept
    queue and any half-open client handshakes.
  - Otherwise bind fresh, then register the listener's RawFd with
    the registry (no-op if no registry attached).

* `ExportRouter` exposes a `pub listener_registry` field so the
  predecessor's handoff code can `snapshot()` it without coupling
  through the cli layer.

* Predecessor's `send_one_with_fds` (built on `fdpass::sendmsg_with_fds`)
  attaches `SCM_RIGHTS` ancillary fds to the HelloAck send. The fds
  are dup'd via `F_DUPFD_CLOEXEC` so the originals stay live for
  the still-running NBD/HTTP server tasks, and the dup'd copies are
  owned by the kernel until the successor `recvmsg`s them.

* Successor's `recv_one_with_fds` (built on `fdpass::recvmsg_with_fds`)
  picks up the cmsg fds along with the HelloAck. The kinds list in
  the message is zipped with the received fds to build an
  `InheritedFds` map. Length mismatch logs a warning and falls back
  to rebind.

* `HelloAck` gets a `listener_kinds: Vec<ListenerKind>` field
  (default-empty for backwards compat with predecessors that ship
  zero fds — e.g. ublk-only deployments).

* `TakeoverResult` carries the `InheritedFds` out of `run_successor`.
  `run_server_as_successor` checks `!inherited_fds.is_empty()` and
  routes through a new `serve_with_router_with_inherited_fds` that
  hands each fd to its server constructor before they `start`.

* Fallback path (no inherited fds) keeps the old port-rebind retry
  loop, so older predecessors and dry-run scenarios still work.

handoff_sequential_50_crh ✓ (this run; 575s, 50 clean handoffs,
"shipping listener fds via SCM_RIGHTS count=1" + "HTTP API server
inheriting listener fd from handoff predecessor" observed on every
handoff, fio do_verify clean, oracle scan 0 corrupt blocks)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds GLIDEFS_HANDOFF_TEST_IDLE_SLEEP_SECS env var. When set, the
test waits `workload_runtime + N` seconds before firing the first
handoff, so the handoff measures protocol+takeover overhead with
the flush backlog drained — instead of the fence-timeout floor that
dominates when fio is still actively writing.

Used to characterize the actual handoff cost vs. the fence safety
margin:

  GLIDEFS_HANDOFF_TEST_IDLE_SLEEP_SECS=60 cargo test ... \
    handoff_durability_crh_per_pr

  handoff 0: pid X → Y in 8.667s
    of which:
      - 8.000s: fence timeout (in-flight flush from 2s of fio still
        draining on tmpfs+file://)
      - 0.667s: actual protocol + ublk recovery + post-takeover

The 670ms is what production sees when the daemon isn't mid-flush
at SIGHUP. The 8s fence timeout is a configurable safety bound that
only fires when there's an in-flight flush_packs+sync_manifest
cycle to wait on; it's the price of the correctness guarantee
demonstrated in a6e8cc8.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Splits the Restart Behavior section into the new graceful handoff
path (preferred for planned restarts: binary upgrades, config
rollouts) and the existing crash-recovery backstop (unplanned
restarts: OOM, bug, host reboot). Documents the four trigger
mechanisms (`glidefs handoff`, SIGHUP, POST /admin/handoff,
systemctl reload), the dry-run canary, what guests see during
cutover (sub-second protocol, in-flight client TCP connections
survive via SCM_RIGHTS fd inheritance), failure-handling
guarantees, and pointers to the architecture doc + runbook.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `roundtrip_all_messages` test in protocol.rs wasn't updated when
`Hello.dry_run` (from c74db26) and `HelloAck.listener_kinds` (from
1c624a0) were added — `cargo build` compiled because tests are gated,
but `cargo clippy --all-targets` did not.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pure mechanical move. Nine handoff-specific methods come off
`ExportRouter` and onto a new `HandoffCoordinator` in
`glidefs/src/handoff/coordinator.rs`:

- handoff_snapshot → HandoffCoordinator::snapshot
- freeze_all, unfreeze_all
- set_all_caches_freeze
- take_ublk_server, recover_handoff_devices, revive_after_failed_handoff
- is_per_io_daemon_supported
- get_handler_sync (also kept ExportRouter::get_handler async variant for NBD)

The coordinator wraps `Arc<ExportRouter>` and reaches per-export
state through three new `pub(crate)` accessors on the router:
- `exports_map() -> &DashMap<String, ExportState>`
- `cache_dir_path() -> &Path`
- `ublk_server_mutex() -> &Mutex<UblkServer>` (cfg-gated)

The single `cache.inner.manifest_etag.lock()` reach in
`recover_handoff_devices` is replaced by a new `pub(crate)` method
`WriteCache::set_manifest_etag(Option<String>)`, so the coordinator
never reaches into `pub(super) inner`.

`PredecessorCutoverCtx` and `SuccessorTakeoverCtx` now carry
`Arc<HandoffCoordinator>` instead of `Arc<ExportRouter>`. CRH and
the trait's default `get_handler` impl follow.

`run_predecessor` and `run_successor` take `Arc<HandoffCoordinator>`.
`cli/server.rs` constructs the coordinator next to the router build
(both predecessor SIGHUP path and successor entry point).

`router.rs` shrinks from ~3232 lines of handoff cruft to its actual
job: per-export I/O dispatch.

handoff_sequential_50_crh ✓ (this run; 575s, 50 clean handoffs,
fio do_verify clean, oracle scan zero corrupt blocks)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Encodes the handoff state machine as a typed enum (Idle/Warming/Freezing/
Cutover) instead of a single AtomicBool. Preserves exact "any non-Idle"
semantics in the two read sites (flush_packs gate, checkpoint gate) via
HandoffPhase::is_active(), so behavior is unchanged.

Sets up future per-phase behavior (#1 atomic flush refactor, future PIOD
work) without forcing it now. Validated by handoff_sequential_50_crh
(50/50 clean, 574s).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eckpoint→Cleanup)

Brings production code in line with the proven Stateright model.
Inside flush_dirty_body, between vm.append_pack and the SYNCING→NP
eviction, we now PUT the manifest to S3 (with 3-attempt backoff and
PreconditionFailed short-circuit). After eviction we checkpoint and
delete the flushing file. Failure modes:
- pack upload fails → blocks stay SYNCING, outer recovery re-dirties
- manifest PUT fails → blocks stay SYNCING (eviction not reached),
  outer recovery re-dirties
- checkpoint fails → blocks evicted in-memory but flushing file
  preserved; reads via S3 work, on crash recovery re-flushes
  idempotently (content-addressed pack IDs cross-dedup)

Removes the manifest_pending plumbing from flush_scheduler.rs
(deferred sync retries are no longer needed — the cache retries
internally). Compaction-only manifest changes (no dirty blocks)
still need a separate sync_manifest call, kept inline in the
scheduler's compaction path.

Replaces the 8s wait_for_inflight_flush fence with a 30s flush_lock
acquire in coordinator.rs::freeze_all. The bound is now atomic flush
latency (~1-7s typically) rather than a worst-case timeout floor.

handoff_sequential_50_crh: 50/50 clean in 157.84s (vs 574.54s before
this refactor, a 3.6x speedup driven by the dropped fence floor).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…suite

The 217-test tests/integration suite (which my local --tests run missed)
caught four real issues from refactor #1 (c3e560b atomic flush_packs):

1. flush_to_s3 must always push the manifest, not just when packs were
   uploaded. Cold readers depend on the S3 manifest's presence to
   bootstrap; an all-zero-write export that cross-dedups to zero packs
   was leaving S3 empty (prop_zero_block_roundtrip).

2. flush_dirty_inner skipped flushes when a flushing file was on disk
   even after every claimed block had been promoted back to DIRTY
   (via guest write) — a stale rotation the cache couldn't recover from
   on its own. Now detected via `has_any_syncing()`: if `flushing_active`
   is set but no block is SYNCING, clean up the orphan and proceed
   (state_transition_table_completeness).

3,4. test_c1_…_causes_data_loss and test_manifest_failure_in_drain_…
   asserted the OLD non-atomic ordering's failure modes (blocks evicted
   to NP, data lost across manifest failure + crash). Atomic flush
   eliminates those windows: manifest failure returns Err BEFORE
   eviction; outer recovery re-dirties via the flushing file; crash
   recovery preserves data. Updated assertions to verify the new
   (stronger) invariants.

Also: handoff_durability test fixture now bumps RLIMIT_NOFILE to 65536
so foyer's SSD cache (16 GB, many segment fds) doesn't hit EMFILE on
CI runners with the default 1024 soft limit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jaredLunde jaredLunde merged commit 885fabe into main May 16, 2026
21 checks passed
@jaredLunde jaredLunde deleted the jared/errs branch May 16, 2026 17:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant