ublk: bound bounce-buffer RSS via USER_COPY + per-worker pool #55
Merged
USER_COPY decouples buffer ownership from tag arming: instead of every
tag holding a stable 128 KB IoBuf for its entire lifetime, io_task
acquires a buffer on demand from a fixed-size per-worker pool when an
I/O CQE arrives and releases it after commit. Total bounce RSS is now
bounded by K * POOL_SLOTS * SLOT_SIZE = 512 MB regardless of device
count, rather than scaling linearly with armed tags.
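The bound above is plain arithmetic, sketched here. K = 16 is the worker count quoted later in this PR; POOL_SLOTS = 256 is not stated anywhere and is only inferred from 512 MB / (16 * 128 KB), so treat both constants and the helper names as illustrative rather than the real definitions.

```rust
// Illustrative constants only. K is the worker count quoted elsewhere in this PR;
// POOL_SLOTS = 256 is inferred from 512 MB / (16 * 128 KB), not read from the code.
const K: usize = 16;
const POOL_SLOTS: usize = 256;
const SLOT_SIZE: usize = 128 * 1024;

/// Upper bound on pooled bounce-buffer memory. Note that the device count
/// does not appear anywhere in the formula.
const fn pool_rss_bound(workers: usize, slots: usize, slot_size: usize) -> usize {
    workers * slots * slot_size
}

/// Legacy model for comparison: one stable buffer per armed tag, so memory
/// grows linearly with the number of armed tags.
const fn per_tag_rss(armed_tags: usize, buf_bytes: usize) -> usize {
    armed_tags * buf_bytes
}
```

With these assumed values, `pool_rss_bound(K, POOL_SLOTS, SLOT_SIZE)` is 512 MiB, while `per_tag_rss` passes that figure as soon as more than 4096 tags are armed.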
Empirical at N=5025 exports (4092 ublk-backed; ublks_max kernel cap):
- VmRSS 3.07 GB (~20 GB projected for bounce at same density)
- 750 KB per export
- Throughput +7% in N=1025 fio randrw test vs bounce
- All 6 fio_verify integrity cases pass under USER_COPY+pool
Other changes in this commit, related but smaller:
- IO_BUF_BYTES 512K -> 128K, matching the WriteCache block size; cuts
virtual address space (VAS) per tag by 4x with no measured throughput
cost beyond noise.
- DEFAULT_MAX_EXPORTS 256 -> 20_000 lifts the previous hard cap on
the router that refused exports past 256.
- IoSizeHistogram per-export, lock-free, Prometheus-exported as
glidefs_{read,write}_io_size_bytes — instrument for validating
IO_BUF_BYTES against real traffic post-launch.
- GLIDEFS_BOUNCE_MODE=1 env reverts to the legacy per-tag-stable
path as a kill switch.
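For the per-export histogram, a lock-free design can be as small as an array of atomic counters. The sketch below is an assumption about the shape, not the actual IoSizeHistogram: the power-of-two bucket bounds, the bucket count, and the method names are all hypothetical, and the Prometheus export is left out.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical layout: bucket i counts I/Os of at most 4 KiB << i bytes
/// (4K, 8K, ..., 256K); the last bucket catches everything larger.
const BUCKETS: usize = 8;

pub struct IoSizeHistogram {
    counts: [AtomicU64; BUCKETS],
}

impl IoSizeHistogram {
    pub fn new() -> Self {
        Self { counts: std::array::from_fn(|_| AtomicU64::new(0)) }
    }

    /// Map an I/O size to its bucket index.
    fn bucket_for(bytes: u64) -> usize {
        let mut bound = 4096u64;
        for i in 0..BUCKETS - 1 {
            if bytes <= bound {
                return i;
            }
            bound <<= 1;
        }
        BUCKETS - 1
    }

    /// Record one I/O. A single relaxed fetch_add: no lock, no allocation.
    pub fn record(&self, bytes: u64) {
        self.counts[Self::bucket_for(bytes)].fetch_add(1, Ordering::Relaxed);
    }

    /// Read all counters, e.g. when a metrics scrape arrives. The counters
    /// are read one by one, so this is not an atomic snapshot.
    pub fn snapshot(&self) -> [u64; BUCKETS] {
        std::array::from_fn(|i| self.counts[i].load(Ordering::Relaxed))
    }
}
```

Relaxed ordering is enough here because each counter is independent and readers only need eventually consistent totals.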
New module: glidefs/src/block/ublk/buffer_pool.rs (mmap-backed,
LIFO free-list, lazy thread-local init per worker, malloc fallback
on pool exhaustion).
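A minimal sketch of the LIFO free-list idea, under loud assumptions: it uses heap-allocated slots instead of mmap, skips the lazy thread-local init and the malloc fallback, and every type and method name is illustrative rather than taken from buffer_pool.rs.

```rust
use std::cell::RefCell;
use std::rc::Rc;

const SLOT_SIZE: usize = 128 * 1024;

/// Stack of free buffers. Vec push/pop gives LIFO order, so the most
/// recently released (and likely cache-warm) slot is handed out first.
type FreeList = Rc<RefCell<Vec<Box<[u8]>>>>;

/// Single-threaded pool, standing in for one worker's pool.
pub struct BufferPool {
    free: FreeList,
}

/// RAII handle: dropping it returns the buffer to its pool.
pub struct PoolSlot {
    buf: Option<Box<[u8]>>,
    home: FreeList,
}

impl BufferPool {
    pub fn new(slots: usize) -> Self {
        let free: Vec<Box<[u8]>> = (0..slots)
            .map(|_| vec![0u8; SLOT_SIZE].into_boxed_slice())
            .collect();
        Self { free: Rc::new(RefCell::new(free)) }
    }

    /// Pop a free slot, or return None when the pool is exhausted.
    pub fn try_acquire(&self) -> Option<PoolSlot> {
        let buf = self.free.borrow_mut().pop()?;
        Some(PoolSlot { buf: Some(buf), home: Rc::clone(&self.free) })
    }

    pub fn available(&self) -> usize {
        self.free.borrow().len()
    }
}

impl PoolSlot {
    pub fn as_mut_slice(&mut self) -> &mut [u8] {
        self.buf.as_mut().expect("buffer is present until drop")
    }
}

impl Drop for PoolSlot {
    fn drop(&mut self) {
        if let Some(buf) = self.buf.take() {
            self.home.borrow_mut().push(buf);
        }
    }
}
```

Tying release to `Drop` means a slot cannot leak on an early return or error path, which is what keeps the pool size an actual bound on live buffers.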
See scripts/bench-results/{iobuf-128k-step1, density-stress-1025,
usercopy-pool, usercopy-pool-5k}.txt for measurement traces.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three small production-readiness improvements for the USER_COPY buffer pool:
- Expose pool stats via /metrics:
glidefs_ublk_buffer_pool_acquires_total
glidefs_ublk_buffer_pool_exhaust_fallbacks_total
glidefs_ublk_buffer_pool_workers_initialized
The fallback counter is the operationally important one — a sustained
non-zero value means the pool is undersized for the workload's
concurrent-S3-await pattern and the structural RSS bound is being
broken via the malloc fallback path. Alert on it.
- Compile-time coupling between buffer_pool::SLOT_SIZE and
device::IO_BUF_BYTES. They MUST match; the build now fails on
mismatch rather than silently truncating I/Os or writing past slot
boundaries.
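One way to get a build failure on mismatch is a const assertion. This is a sketch of the technique only: the two constants stand in for device::IO_BUF_BYTES and buffer_pool::SLOT_SIZE, and the u32 type of IO_BUF_BYTES is an assumption.

```rust
// Stand-ins for device::IO_BUF_BYTES and buffer_pool::SLOT_SIZE.
// The u32 type is assumed for illustration.
const IO_BUF_BYTES: u32 = 128 * 1024;
const SLOT_SIZE: usize = 128 * 1024;

// Evaluated at compile time: if the two values ever diverge, the build
// stops with this message instead of producing a binary.
const _: () = assert!(
    SLOT_SIZE == IO_BUF_BYTES as usize,
    "buffer_pool::SLOT_SIZE must equal device::IO_BUF_BYTES"
);
```

Changing only one of the two constants turns the assertion into a compile error, so the mismatch can never reach a running daemon.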
Also documents the deliberate choice of blocking libc::pread / libc::pwrite
over io_uring for the USER_COPY data transfer: the blocking calls measured
28% higher throughput on single-device 4-job randwrite. The cdev op completes
in ~1 µs in the kernel, so async SQE/CQE/waker overhead exceeds the syscall
cost. The "blocking syscall in async path" concern was theoretical;
measurement shows synchronous is the right choice for this specific path.
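For illustration, the synchronous transfer can be sketched with the standard library's `FileExt` positional calls, which are implemented with pread/pwrite on Unix. The commit itself calls libc directly; the function names and the `offset` parameter below are hypothetical, and a regular file can stand in for the ublk char device when trying it out.

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt;

/// Pull a request's payload from the device into `buf` with one blocking
/// positional read. `offset` stands in for the per-request offset on the
/// char device (hypothetical here).
fn copy_from_cdev(cdev: &File, buf: &mut [u8], offset: u64) -> io::Result<()> {
    cdev.read_exact_at(buf, offset)
}

/// Push a completed read's payload back to the device with one blocking
/// positional write.
fn copy_to_cdev(cdev: &File, buf: &[u8], offset: u64) -> io::Result<()> {
    cdev.write_all_at(buf, offset)
}
```

Positional I/O never touches the file cursor, so several in-flight requests on the same descriptor can copy at different offsets without extra synchronization.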
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two structural fixes so the USER_COPY buffer pool's RSS bound holds
unconditionally, no monitoring required.
1. Async backpressure on pool exhaustion.
Replaces the silent malloc fallback with a FIFO waiter queue. When
a worker's pool is empty, `acquire()` returns a future that parks
on a waker list; `PoolSlot::Drop` wakes the oldest waiter. Total
bounce RSS is now a TRUE hard cap (K × POOL_SLOTS × SLOT_SIZE =
512 MB on 16 workers) — pool can't be silently broken by a
concurrent S3-await spike; throughput degrades gracefully instead.
Metric rename: `exhaust_fallbacks` → `backpressure_waits`. Same
"your pool is undersized" signal, different mechanism.
2. Bulletproof pre-fault for NUMA-local placement.
Replaces `ptr::write_bytes` with a per-page `write_volatile` loop.
`write_bytes` to MAP_ANONYMOUS memory can be elided by LLVM since
the destination is "already" zero from the kernel's zero-page CoW;
volatile forbids the elision so every page is actually faulted.
Pre-fault now runs from the CPU-pinned worker thread (pinning
happens before lazy pool init), so Linux's MPOL_DEFAULT first-touch
places each page on the worker's local NUMA node. Multi-socket
production hosts get NUMA-correct placement by construction, no
mbind needed.
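The per-page volatile pre-fault from item 2 can be sketched as follows. It assumes a fixed 4 KiB page size and a page-aligned, freshly mapped region; the function name is illustrative, and the CPU pinning that makes first-touch placement NUMA-local is not shown.

```rust
/// Assumed page size; real code would query sysconf(_SC_PAGESIZE).
const PAGE_SIZE: usize = 4096;

/// Store one zero byte into every page of `region` with a volatile write,
/// which the compiler is not allowed to elide. Returns the number of pages
/// touched. Intended for freshly mapped, page-aligned memory only: it
/// overwrites the first byte of each page.
fn prefault(region: &mut [u8]) -> usize {
    let base = region.as_mut_ptr();
    let mut touched = 0;
    let mut off = 0;
    while off < region.len() {
        // SAFETY: off < region.len(), so base.add(off) stays inside the
        // exclusively borrowed slice.
        unsafe { std::ptr::write_volatile(base.add(off), 0u8) };
        touched += 1;
        off += PAGE_SIZE;
    }
    touched
}
```

Under the default first-touch policy, the thread that executes these stores determines the node each page lands on, which is why the call has to run after the worker is pinned.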
Empirical:
- 5 pool unit tests pass (try-acquire, exhaustion, LIFO, async-park,
FIFO-wake).
- fio_verify sequential passes (data integrity preserved).
- Throughput on this homelab: single-device 880k IOPS, 8-device
aggregate 1.327M IOPS (no measurable regression from the async
path).
- All 16 worker pools now initialize cleanly post-restart; pre-fault
commits the expected 512 MB.
- Crash + recovery at N=1025: daemon kill -9, systemd auto-restart,
1025 USER_COPY devices recovered in 120 s. 4 randomly-sampled
recovered devices pass fio CRC32C verify (err=0). 0 backpressure
waits during the entire recovery + verify cycle.
See scripts/bench-results/usercopy-pool-backpressure.txt for traces.
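A minimal single-threaded sketch of the backpressure mechanism described in this commit: `acquire()` returns a future, an exhausted pool parks the caller's waker on a FIFO queue, and dropping a slot wakes the oldest waiter. Buffers are reduced to a free-slot count, all names are illustrative, and two simplifications are made on purpose: a re-polled pending future registers its waker again, and a newly arriving acquirer can take a slot before a woken waiter is polled.

```rust
use std::cell::RefCell;
use std::collections::VecDeque;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::task::{Context, Poll, Waker};

struct Inner {
    free: usize,              // free slots; the buffers themselves are elided
    waiters: VecDeque<Waker>, // FIFO: oldest parked task at the front
}

pub struct Pool(Rc<RefCell<Inner>>);
pub struct PoolSlot(Rc<RefCell<Inner>>);
pub struct Acquire(Rc<RefCell<Inner>>);

impl Pool {
    pub fn new(slots: usize) -> Self {
        Pool(Rc::new(RefCell::new(Inner { free: slots, waiters: VecDeque::new() })))
    }

    /// Never fails and never allocates a fallback buffer: the returned
    /// future resolves once a slot is available.
    pub fn acquire(&self) -> Acquire {
        Acquire(Rc::clone(&self.0))
    }
}

impl Future for Acquire {
    type Output = PoolSlot;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<PoolSlot> {
        let mut inner = self.0.borrow_mut();
        if inner.free > 0 {
            inner.free -= 1;
            Poll::Ready(PoolSlot(Rc::clone(&self.0)))
        } else {
            // Pool exhausted: park at the back of the queue.
            inner.waiters.push_back(cx.waker().clone());
            Poll::Pending
        }
    }
}

impl Drop for PoolSlot {
    fn drop(&mut self) {
        // Release the borrow before waking, in case the waker re-enters the pool.
        let waker = {
            let mut inner = self.0.borrow_mut();
            inner.free += 1;
            inner.waiters.pop_front()
        };
        if let Some(w) = waker {
            w.wake();
        }
    }
}
```

Because exhaustion yields `Poll::Pending` instead of a fresh allocation, the number of live slots can never exceed the pool size; the cost of an undersized pool shows up as waiting tasks rather than extra RSS.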
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Eliminate the SLOT_SIZE/IO_BUF_BYTES coupling rather than enforce it with
a const-assert. `buffer_pool::SLOT_SIZE` now derives directly from
`device::IO_BUF_BYTES_USIZE`. Single source of truth; bumping
`IO_BUF_BYTES` Just Works without a parallel edit elsewhere.
Compile-time identical to the previous const-assert version (same
constant value, same emitted bytes), so existing deployments don't need
to rebuild.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>