
perf: cut create-export latency by ~50% — three independent fixes#57

Merged
jaredLunde merged 16 commits into main from jared/long-pole on May 25, 2026

@jaredLunde jaredLunde commented May 24, 2026

Summary

End-to-end VM create (NATS claim → Running) dropped from ~1063ms to ~535ms on identical traces (same image, same agent-hash, same host). The "GlideFS CoW" span that was the visible long pole in Beyond's trace UI went from 415ms to 171ms, almost all of which is now the unavoidable S3 PUT for save_export.

| metric | before | after | delta |
| --- | --- | --- | --- |
| GlideFS CoW span | 415 ms | 171 ms | −244 ms (−59%) |
| boot_duration_ms | 584 ms | 265 ms | −319 ms (−55%) |
| NATS claim → VM Running | 1063 ms | 535 ms | −528 ms (−50%) |
| register_device_ms | 252 ms | 1 ms | −251 ms |
| START_DEV kernel ioctl | 250 ms | 0.5 ms | −249 ms |
| Sync sysfs queue tuning | 99 ms | 0 ms | −99 ms |
| manifest_fetch_ms (warm) | 123 ms | 0 ms | −123 ms |

The three fixes

1. Snapshot manifest cache (router.rs)

snapshots/{name}/{seq:020} is keyed by a monotonic, append-only sequence — write-once by construction, byte-identical for any given (s3_prefix, name, seq) triple. A misleading comment in the fork path claimed "snapshots are mutable" and the code refused to cache them, so every fork-from-snapshot paid a fresh ~123ms S3 GET. box-manager's ensure_derived_snapshot forks every VM from the same staging snapshot, so every single VM was re-fetching the identical bytes.

Adds a bounded HashMap<(s3_prefix, manifest_name, sequence), Arc<VolumeManifest>> next to base_manifest_cache. Pre-populates from snapshot_export so the daemon that wrote a snapshot can serve forks of it for free.
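
A minimal sketch of the cache shape described above, assuming a count cap with refusal-on-full (a later commit in this PR describes the original cache that way). `SnapshotCache`, `get_or_fetch`, and the `VolumeManifest` fields are hypothetical stand-ins, not the code in router.rs.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Stand-in for the real manifest type; the field is hypothetical.
#[derive(Debug, PartialEq)]
struct VolumeManifest {
    bytes: Vec<u8>,
}

/// (s3_prefix, manifest_name, sequence), as described above.
type SnapshotKey = (String, String, u64);

struct SnapshotCache {
    max_entries: usize,
    entries: HashMap<SnapshotKey, Arc<VolumeManifest>>,
}

impl SnapshotCache {
    fn new(max_entries: usize) -> Self {
        Self { max_entries, entries: HashMap::new() }
    }

    /// Inserts unless the cache is full. Returns whether the key is cached.
    fn insert(&mut self, key: SnapshotKey, manifest: Arc<VolumeManifest>) -> bool {
        if self.entries.contains_key(&key) {
            return true; // write-once key: the existing entry is identical
        }
        if self.entries.len() >= self.max_entries {
            return false; // bounded: refuse rather than grow
        }
        self.entries.insert(key, manifest);
        true
    }

    /// Returns the cached manifest, or runs `fetch` (the S3 GET) and caches it.
    fn get_or_fetch<F>(&mut self, key: SnapshotKey, fetch: F) -> Arc<VolumeManifest>
    where
        F: FnOnce() -> VolumeManifest,
    {
        if let Some(hit) = self.entries.get(&key) {
            return Arc::clone(hit);
        }
        let manifest = Arc::new(fetch());
        self.insert(key, Arc::clone(&manifest));
        manifest
    }
}

fn main() {
    let mut cache = SnapshotCache::new(64);
    let key = ("prefix".to_string(), "staging".to_string(), 7u64);
    let mut fetches = 0;
    for _ in 0..3 {
        // Three forks of the same snapshot: only the first pays the fetch.
        cache.get_or_fetch(key.clone(), || {
            fetches += 1;
            VolumeManifest { bytes: vec![1, 2, 3] }
        });
    }
    println!("fetches = {fetches}");
}
```

Because the key includes the sequence, a hit never needs revalidation; the only policy question is what to do when the map is full.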

2. Background the sysfs queue-tuning writes (ublk/device.rs)

wbt_lat_usec=0 and scheduler=none were written synchronously inside register_inner, costing ~50ms each; the block-layer reconfigure is surprisingly heavy on this kernel. They're tuning hints, and the device is fully functional without them. Moving them off the response path with spawn_blocking saves ~100ms per device-create.

3. Tick the executor before io_uring_enter (ublk/worker_pool.rs) — the big one

The biggest win and the most surprising bug. One block of code moved up bought 250ms per device-create.

The kernel's ublk_ctrl_start_dev blocks on wait_for_completion_interruptible(&ub->completion) until every queue's nr_io_ready reaches queue_depth — i.e. until every io_task has submitted its initial UBLK_IO_FETCH_REQ uring_cmd.

The worker loop order was:

  1. drain inbox (handle_add_queue spawns 64 io_task futures per queue)
  2. submit_with_args(to_wait=1, ...) ← blocks for up to WORKER_IDLE_NSEC = 250ms
  3. drain CQEs
  4. executor.tick()

io_tasks submit their FETCH_REQ SQEs on first poll — but the first poll only ran after the io_uring_enter wait. So the worker slept the full 250ms timeout waiting for CQEs that physically couldn't arrive (no SQE submitted yet), while START_DEV sat blocked on the matching completion the kernel was waiting for.

Moving the executor tick to before the submit fixes the ordering: the tick flushes the FETCH_REQ SQEs into the ring, the submit pushes them to the kernel immediately, ublk_mark_io_ready fires, complete_all(&ub->completion) runs, and START_DEV returns essentially instantly. The 250ms kernel ioctl is now 0.5-1ms.
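
The ordering problem can be reproduced in a toy model. The sketch below does not touch io_uring or ublk; `Worker`, `Order`, and `wasted_waits` are hypothetical names, and each "wasted wait" stands for one idle timeout (WORKER_IDLE_NSEC) spent before the kernel has seen every FETCH_REQ.

```rust
/// Toy model of the worker state; nothing here touches io_uring or ublk.
struct Worker {
    unpolled_tasks: usize, // io_tasks spawned but never polled
    ring_sqes: usize,      // SQEs queued in the ring, not yet submitted
    kernel_ready: usize,   // FETCH_REQs the kernel has seen
}

#[derive(Clone, Copy)]
enum Order {
    SubmitThenTick, // original loop order
    TickThenSubmit, // fixed loop order
}

impl Worker {
    /// First poll of each io_task queues its FETCH_REQ SQE.
    fn tick(&mut self) {
        self.ring_sqes += self.unpolled_tasks;
        self.unpolled_tasks = 0;
    }

    /// Hands queued SQEs to the kernel; the real call then waits for a CQE.
    fn submit(&mut self) {
        self.kernel_ready += self.ring_sqes;
        self.ring_sqes = 0;
    }
}

/// Counts idle timeouts spent before the kernel has seen every FETCH_REQ.
fn wasted_waits(order: Order, queue_depth: usize) -> u32 {
    // Inbox already drained: handle_add_queue has spawned the io_tasks.
    let mut w = Worker { unpolled_tasks: queue_depth, ring_sqes: 0, kernel_ready: 0 };
    let mut wasted = 0;
    while w.kernel_ready < queue_depth {
        if let Order::TickThenSubmit = order {
            w.tick();
        }
        w.submit();
        if w.kernel_ready < queue_depth {
            wasted += 1; // slept the full timeout while START_DEV stayed blocked
        }
        if let Order::SubmitThenTick = order {
            w.tick();
        }
    }
    wasted
}

fn main() {
    let old = wasted_waits(Order::SubmitThenTick, 64);
    let new = wasted_waits(Order::TickThenSubmit, 64);
    println!("submit-then-tick: {old} wasted wait(s)");
    println!("tick-then-submit: {new} wasted wait(s)");
}
```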

How we found it

Structured timing logs at target="glidefs.timing" on each step of create_export, register_inner, the tokio::join! legs in api.rs, and inside ublk-core's start_dev (prep/wait_buf_reg/start_ioctl breakdown). These were essential to localize the 250ms — initial guesses (S3 latency, partition scan, udev blkid) were disproven by an empirical warmup experiment before we found the actual cause in the worker loop ordering.

The instrumentation stays in for ongoing observability.

Incidental cleanup

ublk-core doctests referenced libublk::* (old vendored crate name) and UblkQueue<'_> (lifetime removed in a prior refactor). 10/10 doctests now pass.

Test plan

  • cargo test -p glidefs --features ublk — 893 / 893 pass
  • cargo test -p ublk-core — 10 / 10 pass (including doctests)
  • cargo clippy -p glidefs --features ublk --all-targets — no errors
  • Live deploy via systemctl reload glidefs (zero-downtime handoff) and verified create-VM E2E timing against the homelab. Numbers in the table above are from real boxman vm logs traces.
  • Watch the first few production VM creates for any regression in steady-state I/O (the worker loop change adds an extra executor.tick() call per iteration, which is O(woken-tasks): cheap when there is no work, but worth eyeballing once).

🤖 Generated with Claude Code

jaredLunde and others added 16 commits May 24, 2026 12:52
End-to-end VM create (NATS claim → Running) dropped from ~1063ms to
~535ms on identical traces (same image, same agent-hash). The
"GlideFS CoW" span that was the visible long pole in trace UI went
from 415ms to 171ms — almost all of which is now the unavoidable S3
PUT for save_export.

Three independent fixes, each landed against measured numbers from
structured timing logs added alongside.

1. Snapshot manifest cache (router.rs)

   `snapshots/{name}/{seq:020}` is keyed by a monotonic, append-only
   sequence — write-once by construction, byte-identical for any given
   (s3_prefix, name, seq) triple. A misleading comment claimed
   "snapshots are mutable" and the code refused to cache them, so every
   fork-from-snapshot paid a fresh ~123ms S3 GET. box-manager's
   ensure_derived_snapshot flow forks every VM from the same staging
   snapshot — i.e. every single VM was re-fetching the same bytes.

   Adds a bounded HashMap keyed by (s3_prefix, manifest_name, sequence)
   alongside base_manifest_cache. Pre-populates on snapshot_export so
   the daemon that wrote the snapshot can serve forks of it for free.
   Cache hit on warm path drops manifest_fetch_ms from 123 → 0.

2. Background sysfs queue tuning (ublk/device.rs)

   The wbt_lat_usec and scheduler sysfs writes ran inside register_inner
   and cost ~50ms each on this kernel — the block layer reconfigure is
   surprisingly heavy. They're tuning hints; the device works fine
   without them. spawn_blocking them off the response path saves ~100ms
   per device-create.

3. Tick the executor BEFORE io_uring_enter (ublk/worker_pool.rs)

   The biggest win and the most surprising bug. The kernel's
   `ublk_ctrl_start_dev` blocks on `wait_for_completion_interruptible`
   until every queue's `nr_io_ready` reaches `queue_depth` — i.e. until
   every io_task has submitted its initial UBLK_IO_FETCH_REQ uring_cmd.

   The worker loop order was: drain inbox (handle_add_queue spawns 64
   io_task futures per queue), `submit_with_args(to_wait=1, ...)`,
   drain CQEs, then finally `executor.tick()`. But io_tasks submit
   their FETCH_REQ SQEs on first poll — and the first poll only ran
   AFTER the io_uring_enter wait. So the worker slept the entire
   `WORKER_IDLE_NSEC = 250_000_000` (250ms) timeout waiting for CQEs
   that physically couldn't arrive, while START_DEV sat blocked on
   the matching completion.

   Moving the executor tick to BEFORE the submit flushes the FETCH_REQ
   SQEs into the ring first, the submit pushes them to the kernel,
   ublk_mark_io_ready fires, complete_all(&ub->completion) runs, and
   START_DEV returns essentially instantly. One block-of-code moved
   up bought 250ms per device-create. The 250ms ublk START_DEV
   ioctl is now 0.5-1ms.

Verification

  | metric                      | before  | after   | delta       |
  | GlideFS CoW span            | 415 ms  | 171 ms  | -244 ms     |
  | boot_duration_ms            | 584 ms  | 265 ms  | -319 ms     |
  | NATS claim → VM Running     | 1063 ms | 535 ms  | -528 ms     |
  | register_device_ms          | 252 ms  | 1 ms    | -251 ms     |
  | START_DEV kernel ioctl      | 250 ms  | 0.5 ms  | -249 ms     |
  | sysfs queue-tuning (sync)   | 99 ms   | 0 ms    | -99 ms      |
  | manifest_fetch_ms (warm)    | 123 ms  | 0 ms    | -123 ms     |

  cargo test -p glidefs --features ublk:  893 / 893 pass
  cargo test -p ublk-core:                  10 / 10 pass

Also: structured tracing logs at target="glidefs.timing" on each step
of create_export, register_inner, the tokio::join legs in api.rs, and
inside ublk-core's start_dev (prep/wait_buf_reg/start_ioctl breakdown).
These are what made the bug findable in the first place — keeping them
in for ongoing observability.

Incidental: ublk-core doctests referenced `libublk::*` (old vendored
crate name) and `UblkQueue<'_>` (lifetime removed in a prior refactor).
Fixed those — 10/10 doctests now pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous commit added `state.executor.tick()` unconditionally
before every `submit_with_args` so newly spawned `io_task` futures
could flush their initial FETCH_REQ SQEs into the ring before the
worker blocked in `io_uring_enter`. That cut `START_DEV` latency
from 250ms to <1ms in production — verified across many VM creates.

But the docker-tests `test_overwrite_survives_restart_ublk` (and
likely other tests exercising the shutdown → restart cycle) hung
with that change. Without the change: passes in 3s. With it:
indefinite. Root cause is a steady-state interaction I could not
isolate without running the test on the homelab host (which I can't
do safely — that path wedged the kernel earlier in this session).

Scope the tick to ONLY the iteration of the worker loop that just
processed an AddQueue message. In steady-state I/O — and during
shutdown / RemoveQueue drain — behavior is byte-for-byte identical
to the pre-fix code, so whatever invariant the test depends on is
preserved. The AddQueue speedup is unchanged: handle_add_queue
spawns the io_tasks, the new tick polls them, FETCH_REQs land in
the ring, the same iteration's submit_with_args pushes them to the
kernel, and `start_dev`'s `wait_for_completion_interruptible`
returns essentially instantly.
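
As a rough illustration of the scoping, the sketch below runs the early tick only on iterations whose inbox contained an AddQueue message and leaves the regular tick where it was. `Msg`, `Counters`, and `run_iteration` are hypothetical names; the real loop in worker_pool.rs is not shown in this PR text.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Msg {
    AddQueue,
    RemoveQueue,
}

/// Counts how often each tick site runs, in place of a real executor.
#[derive(Default)]
struct Counters {
    early_ticks: u32,
    late_ticks: u32,
}

/// One worker-loop iteration with the tick scoped to AddQueue.
fn run_iteration(inbox: &[Msg], counters: &mut Counters) {
    // 1. Drain the inbox and remember whether a queue was just added.
    let added_queue = inbox.iter().any(|m| *m == Msg::AddQueue);

    // 2. Only then poll the new io_tasks so their FETCH_REQs reach the ring.
    if added_queue {
        counters.early_ticks += 1;
    }

    // 3. submit_with_args(...) and CQE draining would happen here.

    // 4. The regular tick stays in its original position.
    counters.late_ticks += 1;
}

fn main() {
    let mut counters = Counters::default();
    run_iteration(&[Msg::AddQueue], &mut counters); // device create
    run_iteration(&[], &mut counters); // steady-state I/O
    run_iteration(&[Msg::RemoveQueue], &mut counters); // shutdown drain
    println!("early ticks: {}, late ticks: {}", counters.early_ticks, counters.late_ticks);
}
```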

Verification

  - `test_overwrite_survives_restart_ublk`: passes in 2.33s
    (previously hung indefinitely with the broad fix).
  - Production VM create on this binary: `register_device_ms=1`,
    `start_ioctl_us=478` cold / 419 warm, `PUT total_ms=49` warm.
    Same end-to-end speedup as before — ~10x.
  - 65-device recovery on daemon handoff: every `start_dev_us`
    sub-millisecond (60-2000 µs), no 250ms outliers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The earlier `spawn_blocking` of `wbt_lat_usec=0` + `scheduler=none`
saved ~99ms per device-create by moving those sysfs writes off the
critical path. Production VM-create kept working because the
firecracker boot (~150ms of guest kernel boot) gave the async sysfs
writes plenty of time to land before any I/O hit the device.

But four `docker_integration` ublk tests hung in CI:

  - test_unwritten_blocks_return_zeros_ublk
  - test_overwrite_survives_restart_ublk
  - test_cold_wake_from_different_node_ublk
  - test_export_discovery_from_s3_ublk

All four issue I/O to the device almost immediately after add — no
firecracker boot in between. With the sysfs writes backgrounded the
device still had the default `mq-deadline` scheduler when those
reads landed, and mq-deadline's deadline queue appears to hold
single, idle-device requests long enough that the tests don't make
progress within their timeout. The simple
`test_unwritten_blocks_return_zeros_ublk` case — single server,
single read at offset 512KB, no restart cycle — was the clearest
fingerprint.

Restore the synchronous writes. Costs us the 99ms back. The tick
fix in `worker_pool.rs` (250ms START_DEV → 1ms) is unaffected.

Verified locally with the four tests above all passing in 2.0-2.6s
after the revert.

Future direction: apply `scheduler=none` BEFORE `add_disk` rather
than after — either via a `udev` rule keyed on `KERNEL=="ublkb*"`
or via a kernel-side ublk_param. Either path eliminates the
post-add tuning window entirely.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reverts 1c3bb8d. That revert was based on a wrong attribution.

When the prior commit landed on the PR, CI reported four ublk
tests `running for over 60s` and I assumed the sysfs backgrounding
was at fault — the mq-deadline-vs-sync-scheduler-write theory was
plausible. Reverted it to "be safe."

But running the failing tests locally in --test-threads=4 parallel
mode (matching CI's contention model) under three configurations,
10 runs each:

  PR with ASYNC sysfs:  8/10 pass, 2/10 fail
  PR with SYNC  sysfs:  6/10 pass, 4/10 fail
  MAIN, no PR changes:  6/10 pass, 4/10 fail   ← same as sync!

The flakes are pre-existing on `main` — most likely MinIO under
parallel-test contention (each test spawns its own testcontainer,
4 of them compete for host resources). The CI "hanging" reports
were these intermittent EIO failures surfacing as "still running"
status before the panic; not actual hangs.

So the sync sysfs version isn't fixing anything. Restoring the
async path reclaims ~99ms per device-create with no observable
downside vs the sync path.

Net PR result: snapshot cache + conservative tick + sysfs bg →
~470ms savings per warm-cache create, same flake rate as main.
Flake fix tracked separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `docker_integration` job has been intermittently failing with
EIO on reads and empty discover_exports asserts. Hit rate suggested
~40% on this PR; verified the same rate on `main` (10x runs of the
4 most-flaky ublk tests with --test-threads=4: 6 pass / 4 fail on
main; 5/5 pass with --test-threads=1).

Root cause is contention from running multiple testcontainers in
parallel — each test calls `TestContext::new()` which spawns its
own MinIO container. On a 4-vCPU CI runner with the default cargo
test parallelism (= num_cpus = 4), four MinIO containers compete
for host resources, and MinIO returns transient errors that bubble
out as either `Input/output error (os error 5)` on data reads
(handler.read_into → cache.read → content_store S3 GET failure)
or as empty list results (`should discover at least one export
from S3`).

Cleanest near-term fix: run docker_integration tests one at a
time. Adds ~5min to the docker_integration job (137 tests at
~3-7s each, run serially rather than four at a time) but removes
the flake. A more elegant follow-up would be to share a single
MinIO container across tests via per-test bucket prefixes, but
that's structural test-harness work that doesn't need to ride this
PR.

The integrity-suite job (filter=integrity_suite) already has
--ignored --nocapture and runs few tests, so it's not affected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Each TestContext used to spawn its own MinIO testcontainer. Under
parallel test execution (cargo default = num_cpus) this produced
~40% flake rate on the homelab + CI runners: transient S3 errors
that surfaced as EIO on reads, empty discover_exports listings,
and "ublk read failed". Verified pre-existing on `main` — not a PR
regression — but worth fixing properly since the prior workaround
of "--test-threads=1 in CI" papered over the contention rather
than removing it.

The previous setup tested glidefs against four MinIOs competing
for host CPU/IO, not against a single S3 endpoint. That isn't
representative of production (one S3 backend per glidefs daemon,
even when serving many concurrent VMs).

This commit:

- Spins up ONE MinIO process-wide via a `tokio::sync::OnceCell`,
  reused for every `TestContext`. Container lives for the duration
  of the test process; teardown happens automatically at exit.
- Each `TestContext::new()` allocates a unique bucket
  (`test-bucket-NNNNNN`) from a monotonic counter, giving each
  test a fully isolated S3 namespace.
- Adds a `/minio/health/ready` probe loop on container startup —
  `start()` returns before the HTTP listener actually answers on
  heavily loaded hosts, which produced spurious "connection
  refused" failures during bucket creation.
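
A minimal sketch of the "one backend, one bucket per test" pattern, using only the standard library. It substitutes `std::sync::OnceLock` for the `tokio::sync::OnceCell` named above and does not start a container; `SharedMinio`, `STARTS`, and the endpoint URL are hypothetical.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::OnceLock;

/// Stand-in for the shared container handle; no container is started here.
struct SharedMinio {
    endpoint: String,
}

static MINIO: OnceLock<SharedMinio> = OnceLock::new();
static STARTS: AtomicU64 = AtomicU64::new(0); // how many times "startup" ran
static NEXT_BUCKET: AtomicU64 = AtomicU64::new(0); // monotonic bucket counter

/// Initializes the shared backend on first use and reuses it afterwards.
fn shared_minio() -> &'static SharedMinio {
    MINIO.get_or_init(|| {
        STARTS.fetch_add(1, Ordering::SeqCst);
        // Hypothetical endpoint; the real harness would start the container
        // and poll its readiness probe before returning.
        SharedMinio { endpoint: "http://127.0.0.1:9000".to_string() }
    })
}

struct TestContext {
    endpoint: &'static str,
    bucket: String,
}

impl TestContext {
    fn new() -> Self {
        let n = NEXT_BUCKET.fetch_add(1, Ordering::Relaxed);
        TestContext {
            endpoint: shared_minio().endpoint.as_str(),
            bucket: format!("test-bucket-{n:06}"), // isolated namespace per test
        }
    }
}

fn main() {
    let a = TestContext::new();
    let b = TestContext::new();
    println!("{} {}", a.endpoint, a.bucket);
    println!("{} {}", b.endpoint, b.bucket);
    println!("startups: {}", STARTS.load(Ordering::SeqCst));
}
```

Isolation here comes from the bucket name rather than from a separate server, which is what lets parallel tests share one endpoint.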

Verification

  Before:  PASS=6 / FAIL=4 / 10 runs (--test-threads=4, four MinIOs)
  After:   PASS=10 / FAIL=0 / 10 runs (--test-threads=4, one MinIO)

Each run is also ~25% faster (no per-test container startup): 2.6-3.5s
vs 3.5-4.5s. Re-enables parallel CI execution by reverting the
`--test-threads=1` workaround.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The shared-MinIO refactor (dd26dbf) eliminated the per-test
MinIO contention we'd been seeing, and locally the full ublk
suite (30 tests) runs reliably in --test-threads=4 with the
shared MinIO. But CI hung again on test_unwritten_blocks_return_zeros_ublk
on the most recent push.

Bisecting across the PR commits, all of them pass 10/10 locally
in isolation for that test (main, 1c3bb8d, 23dc94f, dd26dbf). So
the regression isn't tied to any single commit. The hang seems
to depend on something specific about the CI runner (kernel
version, num_cpus=4, ext-of-test concurrency).

Falling back to --test-threads=1 in CI: we don't have a story
for what specifically races, and running storage tests serially
when we can't reproduce the failure is the conservative call.
Locally with --test-threads=1 we measured ~7s per run instead of
~3.5s parallel — adds maybe 5min to docker_integration CI total.

This is *not* a satisfying resolution. Track-down items:

- The hang reproduces on host runs at ~1/10 in isolation when the
  kernel has hundreds of leaked QUIESCED ublk devices (from prior
  SIGTERM'd test runs) but passes 10/10 when device count is low.
  Suggests kernel ublk resource pressure interacts with our daemon
  path, but the specific deadlock is unidentified.
- The shared-MinIO refactor in dd26dbf was a real improvement and
  stays in; the bug we found there was real (per-test MinIO
  contention caused 40% flake at threads=4 on main).
- A real follow-up should investigate the test_unwritten ublk hang
  with kernel tracing in a CI-shaped environment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Found the actual cause of the docker_integration ublk test hangs.

Standing up an Ubuntu 24.04 VM with the same kernel CI runs
(linux-image-6.17.0-1013-azure) and capturing the kernel stack of
the hung thread reveals:

  blk_mq_freeze_queue_wait+0x97/0xe0
  blk_mq_freeze_queue_nomemsave+0x22/0x30
  elevator_change+0x79/0x180
  elv_iosched_store+0x18b/0x1e0
  queue_attr_store+0xe4/0x120
  sysfs_kf_write+0x4c/0x60
  ...

This is the `tokio::task::spawn_blocking` task that writes
`/sys/block/ublkbN/queue/scheduler=none`. On 6.17 the kernel's
`elv_iosched_store` calls `blk_mq_freeze_queue` which waits for
in-flight requests to drain — and the kernel counts our armed
FETCH_REQ uring_cmds as in-flight. They never "complete" because
they're long-lived (parked waiting for the next I/O). The freeze
waits forever. The spawn_blocking task hangs, the device is
otherwise functional but our test process eventually times out
waiting on something downstream that depends on it.

(The same code on kernel 6.12 happens to work — either earlier
kernels don't count uring_cmds toward the freeze or the timing
happens to never overlap. Either way, 6.17 made it deterministic.)

Fix: drop the `scheduler=none` write. Keep `wbt_lat_usec=0` (a
simple per-queue store, no freeze, safe on any kernel). The
default `mq-deadline` scheduler costs us some throughput overhead
under heavy load but is functionally fine for ublk. Reclaiming
the perf cleanly requires either a udev rule that fires during
`add_disk`'s KOBJ_ADD uevent (BEFORE FETCH_REQs are armed) or a
kernel-side ublk_param flag — tracked as follow-up.

Verification

  - 6.17 VM (the failing kernel):
      30/30 ublk tests pass in --test-threads=4, 22s
      (before this fix: test_export_discovery_from_s3_ublk hangs
      indefinitely with the kernel stack above)
  - 6.12 homelab (production kernel):
      29/30 ublk tests pass; the one failure
      (test_fs_crash_fsync_honored_ublk) is a pre-existing
      parallel-test flake unrelated to this fix — passes 1/1
      isolated, same flake exists on `main`.

Also retires the 56-day-old memory `project_ublk_617.md`
("START_DEV hangs on Azure 6.17, tests skip until fixed"). The
hang wasn't in START_DEV; it was in our sysfs cleanup running
after `add_disk`. The fix is in our code, not in skipping the
kernel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reclaim scheduler=none + the other tunables we wanted, properly.

The previous commit dropped the post-add_disk sysfs write because it
deadlocks on kernel 6.17 (`elv_iosched_store` →
`blk_mq_freeze_queue_wait` blocks forever waiting on armed FETCH_REQ
uring_cmds to "complete"). That fixed the hang but left us running
with the kernel default `mq-deadline` scheduler: functional, but a
real performance loss under load.

Real fix: apply the tunables via a udev rule that fires during the
kernel's `add_disk` KOBJ_ADD uevent — BEFORE userspace can open the
device and BEFORE any bios are routed through it. At that moment
there are no in-flight requests and no held queue references, so the
`blk_mq_freeze_queue_wait` inside `elv_iosched_store` completes
immediately. Verified on the Azure 6.17.0-1013 kernel: previously-
hanging tests pass with `scheduler=[none]` active on every ublk
device.

Files:

- `deploy/udev/99-glidefs-ublk.rules` (new): the rule. Sets
  scheduler=none, wbt_lat_usec=0, add_random=0, read_ahead_kb=0 on
  ublkb* device-add. Each tunable is documented inline with WHY it
  applies to ublk specifically (different from spinning rust /
  default-SSD assumptions baked into the kernel's defaults).

- `glidefs/src/cli/server.rs`: `run_server` now calls
  `install_ublk_udev_rule()` at startup. The rule body is
  `include_str!`-embedded from the file above, so the binary is the
  source of truth — operators can't accidentally ship a stale rule,
  and there's no out-of-band file to keep in sync. Idempotent:
  reads the existing file and skips the write+udevadm-reload if
  content already matches. Non-fatal on failure (read-only fs, no
  udevadm on PATH, etc.): daemon comes up with a warning and the
  devices fall back to kernel defaults.

- `glidefs/src/block/ublk/device.rs`: removed the
  `tokio::task::spawn_blocking` that was writing wbt_lat_usec
  post-add_disk.
  Redundant now that udev sets it at add-time, plus the spawn was
  a detached task that could leak its thread if the write ever
  blocked (as we proved it could on 6.17).
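
The idempotent, non-fatal install described for `install_ublk_udev_rule()` might look roughly like the sketch below. `install_if_changed`, the temp-dir path, and the placeholder body are hypothetical; the real function embeds the rule with `include_str!` and reloads udev after a write, neither of which is reproduced here.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Writes `body` to `path` only if the file is missing or differs.
/// Returns Ok(true) when a write happened, so the caller knows a
/// udev reload is needed.
fn install_if_changed(path: &Path, body: &str) -> io::Result<bool> {
    match fs::read_to_string(path) {
        Ok(existing) if existing == body => return Ok(false), // already current
        Ok(_) => {}                                           // stale: rewrite
        Err(e) if e.kind() == io::ErrorKind::NotFound => {}   // first install
        Err(e) => return Err(e),
    }
    fs::write(path, body)?;
    Ok(true)
}

fn main() {
    // Hypothetical location and body, chosen so the sketch runs unprivileged.
    let path = std::env::temp_dir().join("99-example-ublk.rules");
    let body = "# placeholder rule body\n";
    match install_if_changed(&path, body) {
        Ok(true) => println!("rule written; a udev reload would follow"),
        Ok(false) => println!("rule already current; nothing to do"),
        // Non-fatal: the daemon would log a warning and keep starting.
        Err(e) => eprintln!("warning: could not install rule: {e}"),
    }
}
```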

No changes needed in beyond/ansible — the binary handles installation
itself.

Verification

  Manually applied the rule on the 6.17 VM and ran the previously-
  hanging test set:

    30/30 ublk tests pass at --test-threads=4 in 22.58s
    `scheduler=[none]` active on every ublk device

  On 6.12 (homelab): no behavior change — the rule overrides what
  the old in-code write was already doing, just via a different
  mechanism. 29/30 tests pass at --test-threads=4; the one parallel
  flake (test_fs_crash_fsync_honored_ublk) is pre-existing and
  unrelated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Without this, a 503 from PUT /api/exports leaves the export in
`self.exports` but absent from S3. The next retry hits
`create_export`'s idempotency check, returns 200 immediately, but
`export.json` is still missing — so the export silently vanishes on
the next daemon restart.

`cleanup_failed_create` drops the in-memory entry, removes any
kernel device, tears down flush/prefetch tasks, and clears local
cache files so a retry re-runs `create_export` from scratch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PutFailingStore wraps InMemory and fails put_opts on demand; the
test arms it, fires PUT /api/exports/vol1 through the real handler,
and asserts the full Stage 2b contract:

  1. response is 503
  2. GET /api/exports/vol1 returns 404 (in-memory state torn down)
  3. retry after un-arming returns 201 (not 200) — proves the path
     re-ran create_export rather than hitting the idempotency check

Without the fix the retry would 200 from the idempotency branch and
S3 would still have no export.json.
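
The contract above can be modeled in a few lines. This is a sketch under assumptions: `Store`, `Router`, the object key layout, and the status-code return values are hypothetical simplifications of PutFailingStore and the real handler, and the cleanup only removes the in-memory entry.

```rust
use std::collections::HashSet;

/// Object-store stand-in whose put can be armed to fail.
struct Store {
    fail_puts: bool,
    objects: HashSet<String>,
}

impl Store {
    fn put(&mut self, key: &str) -> Result<(), ()> {
        if self.fail_puts {
            return Err(());
        }
        self.objects.insert(key.to_string());
        Ok(())
    }
}

struct Router {
    exports: HashSet<String>,
    store: Store,
}

impl Router {
    /// Returns an HTTP-like status for PUT /api/exports/{name}.
    fn put_export(&mut self, name: &str) -> u16 {
        if self.exports.contains(name) {
            return 200; // idempotency check: export already exists
        }
        self.exports.insert(name.to_string()); // in-memory create
        let key = format!("{name}/export.json"); // hypothetical key layout
        if self.store.put(&key).is_err() {
            self.exports.remove(name); // the cleanup step: undo the create
            return 503;
        }
        201
    }

    /// Returns an HTTP-like status for GET /api/exports/{name}.
    fn get_export(&self, name: &str) -> u16 {
        if self.exports.contains(name) { 200 } else { 404 }
    }
}

fn main() {
    let store = Store { fail_puts: true, objects: HashSet::new() };
    let mut router = Router { exports: HashSet::new(), store };
    println!("armed PUT  -> {}", router.put_export("vol1"));
    println!("GET        -> {}", router.get_export("vol1"));
    router.store.fail_puts = false;
    println!("retry PUT  -> {}", router.put_export("vol1"));
}
```

Deleting the `self.exports.remove(name)` line reproduces the bug: the retry returns 200 from the idempotency branch and the store never receives the object.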

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`run_server_as_successor` skipped the install_ublk_udev_rule() that
`run_server` calls on cold start. Result: on rolling deploys (handoff
predecessor → successor) the rule never lands on the host, and any
new ublk device created by the successor came up with default
tunables (mq-deadline, wbt_lat_usec=2000us, kernel readahead) — a
silent regression on every handoff.

The install function is idempotent (compares content, skips if
matches) and non-fatal on failure, so calling it from both paths is
safe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces base_manifest_cache + snapshot_cache (two count-bounded
HashMaps with refusal-on-full) with a single foyer::Cache keyed by
encoded prefix ("b:..."/"s:...") and weighted by VolumeManifest's
estimated heap bytes.

Two problems the previous design had:

1. Count-bounded, not memory-bounded. A 128GB-volume manifest is
   ~70KB; a 10TB-volume manifest is ~5.5MB. The same 64-entry cap
   sized the cache at 4.5MB or 350MB depending on the working-set
   geometry — invisible to the operator either way.

2. Refusal-on-full evicts nothing. The first 64 distinct manifests
   pinned the cache and every miss after that re-fetched from S3
   forever. Fine for tiny base fleets, broken once snapshot churn
   or volume diversity entered the mix.

S3-FIFO eviction (same policy as the block cache) handles the
working-set drift. 64MiB default budget, configurable via
RouterConfig.manifest_cache_bytes. Entries are immutable by
construction (base = sealed at bless, snapshot = monotonic-sequence
addressed), so no staleness concern regardless of eviction policy.
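
To make the count-bounded versus byte-bounded difference concrete, here is a small byte-budgeted cache. It uses plain FIFO eviction as a stand-in, not S3-FIFO, and does not use the foyer crate; `WeightedCache` and its methods are hypothetical, and values are omitted so only weights are tracked.

```rust
use std::collections::{HashMap, VecDeque};

/// Byte-budgeted cache with plain FIFO eviction (a stand-in, not S3-FIFO).
struct WeightedCache {
    budget_bytes: usize,
    used_bytes: usize,
    order: VecDeque<String>,         // insertion order, oldest first
    entries: HashMap<String, usize>, // key -> weight in bytes
}

impl WeightedCache {
    fn new(budget_bytes: usize) -> Self {
        Self { budget_bytes, used_bytes: 0, order: VecDeque::new(), entries: HashMap::new() }
    }

    /// Inserts an entry, evicting the oldest entries until it fits.
    /// Returns false only if the entry alone exceeds the budget.
    fn insert(&mut self, key: &str, weight: usize) -> bool {
        if weight > self.budget_bytes {
            return false;
        }
        if let Some(old) = self.entries.remove(key) {
            self.used_bytes -= old;
            self.order.retain(|k| k != key);
        }
        while self.used_bytes + weight > self.budget_bytes {
            let Some(oldest) = self.order.pop_front() else { break };
            if let Some(w) = self.entries.remove(&oldest) {
                self.used_bytes -= w;
            }
        }
        self.entries.insert(key.to_string(), weight);
        self.order.push_back(key.to_string());
        self.used_bytes += weight;
        true
    }

    fn contains(&self, key: &str) -> bool {
        self.entries.contains_key(key)
    }
}

fn main() {
    let mut cache = WeightedCache::new(64 * 1024 * 1024); // 64MiB budget
    // Manifests sized like the 10TB-volume case above (~5.5MB each).
    for i in 0..13 {
        cache.insert(&format!("s:{i}"), 5_500_000);
    }
    println!("used bytes: {}", cache.used_bytes);
    println!("oldest still cached: {}", cache.contains("s:0"));
}
```

With a byte budget the number of resident manifests adapts to their size, and a full cache keeps admitting new entries instead of pinning the first ones it saw.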

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
I missed the standalone integration tests when adding the new
RouterConfig field. Build was failing in CI on every job that
compiled the integration test crates (Build and Test, Data
Integrity Suite, Docker Integration Tests, Kernel Devices,
Clippy).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Workspace-wide autofix for safely widening casts (`x as u64` where x
is a narrower unsigned → `u64::from(x)`). 38 files, mechanical.
No behavior change.
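
A short illustration of why the rewrite is behavior-preserving yet still worth doing; the function name `widen` is hypothetical.

```rust
/// Lossless widening: `From` exists only for conversions that cannot truncate.
fn widen(x: u32) -> u64 {
    u64::from(x)
}

fn main() {
    // For a widening conversion both spellings give the same value.
    let x: u32 = u32::MAX;
    assert_eq!(x as u64, widen(x));

    // `as` also compiles in the narrowing direction and silently drops bits,
    // which is the mistake `From` rules out at compile time.
    let big: u64 = 1 << 40;
    let truncated = big as u32;
    println!("{big} as u32 = {truncated}");
    println!("u32::try_from({big}) fails: {}", u32::try_from(big).is_err());
}
```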

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jaredLunde jaredLunde merged commit 6a93e02 into main May 25, 2026
21 checks passed
@jaredLunde jaredLunde deleted the jared/long-pole branch May 25, 2026 05:22