ublk: bound bounce-buffer RSS via USER_COPY + per-worker pool by jaredLunde · Pull Request #55 · beyondoss/glidefs

jaredLunde · 2026-05-13T07:18:20Z

USER_COPY decouples buffer ownership from tag arming: instead of every tag holding a stable 128 KB IoBuf for its entire lifetime, io_task acquires a buffer on demand from a fixed-size per-worker pool when an I/O CQE arrives and releases it after commit. Total bounce RSS is now bounded by K * POOL_SLOTS * SLOT_SIZE = 512 MB regardless of device count, rather than scaling linearly with armed tags.
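The bound quoted above can be checked with a few constants. This is only a sketch: the 16-worker count is taken from a later commit message in this PR, and `POOL_SLOTS = 256` is an assumption, chosen because it is the value that makes the stated 512 MB figure come out with 128 KB slots. The names `WORKERS`, `BOUNCE_RSS_CAP`, and `per_tag_bounce_bytes` are hypothetical.

```rust
/// K in the PR text: number of worker threads (16 per a later commit message).
pub const WORKERS: usize = 16;
/// Slot size, matching the 128 KB IoBuf described above.
pub const SLOT_SIZE: usize = 128 * 1024;
/// ASSUMPTION: not stated in the PR; implied by 512 MB / (16 * 128 KB).
pub const POOL_SLOTS: usize = 256;

/// Upper bound on pooled bounce memory; independent of device and tag count.
pub const BOUNCE_RSS_CAP: usize = WORKERS * POOL_SLOTS * SLOT_SIZE;

/// Legacy per-tag-stable model for comparison: one buffer per armed tag,
/// so memory grows linearly with the number of armed tags.
pub const fn per_tag_bounce_bytes(armed_tags: usize, buf_bytes: usize) -> usize {
    armed_tags * buf_bytes
}
```

Under these assumptions the cap is a compile-time constant, while the legacy figure depends on how many tags happen to be armed.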

Empirical at N=5025 exports (4092 ublk-backed; ublks_max kernel cap):

- VmRSS 3.07 GB (~20 GB projected for bounce at same density)
- 750 KB per export
- Throughput +7% in N=1025 fio randrw test vs bounce
- All 6 fio_verify integrity cases pass under USER_COPY+pool

Other changes in this commit, related but smaller:

- IO_BUF_BYTES 512K -> 128K matches WriteCache block size; 4x VAS cut per tag with no measured throughput cost beyond noise.
- DEFAULT_MAX_EXPORTS 256 -> 20_000 lifts the previous hard cap on the router that refused exports past 256.
- IoSizeHistogram: per-export, lock-free, Prometheus-exported as glidefs_{read,write}_io_size_bytes; an instrument for validating IO_BUF_BYTES against real traffic post-launch.
- GLIDEFS_BOUNCE_MODE=1 env reverts to the legacy per-tag-stable path as a kill switch.
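For the IoSizeHistogram item, a lock-free histogram can be as simple as an array of atomic counters. The sketch below is not the PR's implementation: the bucket bounds (4 KiB to 128 KiB, doubling) and the method names are assumptions made for illustration. It reports cumulative counts, which is how Prometheus `le` buckets are defined.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// ASSUMPTION: illustrative upper bounds in bytes, not taken from the PR.
const BOUNDS: [u64; 6] = [4096, 8192, 16384, 32768, 65536, 131072];
/// One bucket per bound plus an overflow (+Inf) bucket.
const BUCKETS: usize = BOUNDS.len() + 1;

pub struct IoSizeHistogram {
    counts: [AtomicU64; BUCKETS],
    sum: AtomicU64,
}

impl IoSizeHistogram {
    pub fn new() -> Self {
        Self {
            counts: std::array::from_fn(|_| AtomicU64::new(0)),
            sum: AtomicU64::new(0),
        }
    }

    /// Record one I/O. Only relaxed atomic adds, so no lock is taken.
    pub fn observe(&self, bytes: u64) {
        let idx = BOUNDS
            .iter()
            .position(|&bound| bytes <= bound)
            .unwrap_or(BOUNDS.len());
        self.counts[idx].fetch_add(1, Ordering::Relaxed);
        self.sum.fetch_add(bytes, Ordering::Relaxed);
    }

    /// Cumulative counts per bucket (Prometheus `le` semantics).
    pub fn cumulative(&self) -> [u64; BUCKETS] {
        let mut out = [0u64; BUCKETS];
        let mut running = 0;
        for (i, count) in self.counts.iter().enumerate() {
            running += count.load(Ordering::Relaxed);
            out[i] = running;
        }
        out
    }

    /// Total bytes observed.
    pub fn sum(&self) -> u64 {
        self.sum.load(Ordering::Relaxed)
    }
}
```

A distribution that piles up in the top bucket would suggest IO_BUF_BYTES is too small for real traffic; one concentrated in the low buckets would suggest there is room to shrink it further.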

New module: glidefs/src/block/ublk/buffer_pool.rs (mmap-backed, LIFO free-list, lazy thread-local init per worker, malloc fallback on pool exhaustion).
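The shape of such a pool can be sketched in a few lines. This is an illustration under stated assumptions, not the contents of buffer_pool.rs: a `Vec<u8>` stands in for the mmap region, `POOL_SLOTS` is shrunk to 4, and the names `Buf`, `BufferPool`, and `with_pool` are hypothetical. It shows the three properties named above: a LIFO free-list, lazy thread-local initialization, and a heap fallback when the pool is empty.

```rust
use std::cell::RefCell;

pub const SLOT_SIZE: usize = 128 * 1024;
/// ASSUMPTION: tiny pool for illustration only.
pub const POOL_SLOTS: usize = 4;

pub enum Buf {
    /// Index of a slot inside the pool's backing region.
    Pooled(usize),
    /// Heap allocation used when the pool is exhausted (the malloc fallback).
    Fallback(Vec<u8>),
}

pub struct BufferPool {
    backing: Vec<u8>, // stand-in for the mmap-backed region
    free: Vec<usize>, // LIFO stack of free slot indices
}

impl BufferPool {
    fn new() -> Self {
        BufferPool {
            backing: vec![0u8; SLOT_SIZE * POOL_SLOTS],
            // Reversed so that slot 0 is handed out first.
            free: (0..POOL_SLOTS).rev().collect(),
        }
    }

    pub fn acquire(&mut self) -> Buf {
        match self.free.pop() {
            Some(index) => Buf::Pooled(index),
            None => Buf::Fallback(vec![0u8; SLOT_SIZE]),
        }
    }

    /// Pooled slots go back on top of the stack; fallback buffers are just freed.
    pub fn release(&mut self, buf: Buf) {
        if let Buf::Pooled(index) = buf {
            self.free.push(index);
        }
    }

    pub fn bytes(&mut self, index: usize) -> &mut [u8] {
        &mut self.backing[index * SLOT_SIZE..(index + 1) * SLOT_SIZE]
    }
}

thread_local! {
    // Built on first access from each thread, i.e. lazily per worker.
    static POOL: RefCell<BufferPool> = RefCell::new(BufferPool::new());
}

pub fn with_pool<R>(f: impl FnOnce(&mut BufferPool) -> R) -> R {
    POOL.with(|pool| f(&mut *pool.borrow_mut()))
}
```

LIFO reuse means the most recently released slot is the next one handed out, which tends to keep the hot working set small and cache-warm.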

See scripts/bench-results/{iobuf-128k-step1, density-stress-1025, usercopy-pool, usercopy-pool-5k}.txt for measurement traces.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three small production-readiness improvements for the USER_COPY buffer pool:

- Expose pool stats via /metrics:
    glidefs_ublk_buffer_pool_acquires_total
    glidefs_ublk_buffer_pool_exhaust_fallbacks_total
    glidefs_ublk_buffer_pool_workers_initialized
  The fallback counter is the operationally important one: a sustained non-zero value means the pool is undersized for the workload's concurrent-S3-await pattern and the structural RSS bound is being broken via the malloc fallback path. Alert on it.
- Compile-time coupling between buffer_pool::SLOT_SIZE and device::IO_BUF_BYTES. They MUST match; the build now fails on mismatch rather than silently truncating I/Os or writing past slot boundaries.

Also documents the deliberate choice of blocking libc::pread / libc::pwrite over io_uring for the USER_COPY data transfer: empirically faster (28% throughput delta on single-device 4-job randwrite). The cdev op completes in ~1 µs in the kernel, so async SQE/CQE/waker overhead exceeds the syscall cost. The "blocking syscall in async path" concern was theoretical; measurement shows synchronous is the right choice for this specific path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
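As a rough illustration of the pool stats above, the counters can be plain atomics rendered in the Prometheus text exposition format. The metric names come from the commit message; everything else is an assumption: the `PoolStats` type, its methods, counting a fallback as an acquire, and treating `workers_initialized` as a gauge.

```rust
use std::fmt::Write as _;
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical container for the three pool metrics named in the commit.
#[derive(Default)]
pub struct PoolStats {
    acquires: AtomicU64,
    exhaust_fallbacks: AtomicU64,
    workers_initialized: AtomicU64,
}

impl PoolStats {
    /// ASSUMPTION: a fallback still counts as an acquire.
    pub fn record_acquire(&self, fell_back: bool) {
        self.acquires.fetch_add(1, Ordering::Relaxed);
        if fell_back {
            self.exhaust_fallbacks.fetch_add(1, Ordering::Relaxed);
        }
    }

    pub fn record_worker_init(&self) {
        self.workers_initialized.fetch_add(1, Ordering::Relaxed);
    }

    /// Render in Prometheus text format: a `# TYPE` line, then `name value`.
    pub fn render(&self) -> String {
        let rows = [
            ("glidefs_ublk_buffer_pool_acquires_total", "counter", &self.acquires),
            (
                "glidefs_ublk_buffer_pool_exhaust_fallbacks_total",
                "counter",
                &self.exhaust_fallbacks,
            ),
            (
                "glidefs_ublk_buffer_pool_workers_initialized",
                "gauge",
                &self.workers_initialized,
            ),
        ];
        let mut out = String::new();
        for (name, kind, value) in rows {
            let v = value.load(Ordering::Relaxed);
            // Writing to a String cannot fail, so the results are ignored.
            let _ = writeln!(out, "# TYPE {name} {kind}");
            let _ = writeln!(out, "{name} {v}");
        }
        out
    }
}
```

An alert on the rate of the fallback counter staying above zero would then implement the "alert on it" advice.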

Two structural fixes so the USER_COPY buffer pool's RSS bound holds unconditionally, no monitoring required.

1. Async backpressure on pool exhaustion. Replaces the silent malloc fallback with a FIFO waiter queue. When a worker's pool is empty, `acquire()` returns a future that parks on a waker list; `PoolSlot::Drop` wakes the oldest waiter. Total bounce RSS is now a TRUE hard cap (K × POOL_SLOTS × SLOT_SIZE = 512 MB on 16 workers): the pool can't be silently broken by a concurrent S3-await spike; throughput degrades gracefully instead.

   Metric rename: `exhaust_fallbacks` → `backpressure_waits`. Same "your pool is undersized" signal, different mechanism.

2. Bulletproof pre-fault for NUMA-local placement. Replaces `ptr::write_bytes` with a per-page `write_volatile` loop. `write_bytes` to MAP_ANONYMOUS memory can be elided by LLVM since the destination is "already" zero from the kernel's zero-page CoW; volatile forbids the elision so every page is actually faulted. Pre-fault now runs from the CPU-pinned worker thread (pinning happens before lazy pool init), so Linux's MPOL_DEFAULT first-touch places each page on the worker's local NUMA node. Multi-socket production hosts get NUMA-correct placement by construction, no mbind needed.

Empirical:

- 5 pool unit tests pass (try-acquire, exhaustion, LIFO, async-park, FIFO-wake).
- fio_verify sequential passes (data integrity preserved).
- Throughput on this homelab: single-device 880k IOPS, 8-device aggregate 1.327M IOPS (no measurable regression from the async path).
- All 16 worker pools now initialize cleanly post-restart; pre-fault commits the expected 512 MB.
- Crash + recovery at N=1025: daemon kill -9, systemd auto-restart, 1025 USER_COPY devices recovered in 120 s. 4 randomly-sampled recovered devices pass fio CRC32C verify (err=0). 0 backpressure waits during the entire recovery + verify cycle.

See scripts/bench-results/usercopy-pool-backpressure.txt for traces.
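The backpressure mechanism in item 1 can be sketched with only the standard library. This is one possible design, not the PR's code: here a released slot is handed directly to the oldest parked waiter, which is an assumption about how fairness is achieved, and the types are single-threaded (`Rc`/`RefCell`) on the premise that each pool belongs to one worker. All names other than `acquire()` and `PoolSlot` are hypothetical.

```rust
use std::cell::RefCell;
use std::collections::VecDeque;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::task::{Context, Poll, Waker};

/// A parked acquirer: its latest waker, plus a slot once one is handed over.
struct Waiter {
    waker: Option<Waker>,
    slot: Option<usize>,
}

struct Inner {
    free: Vec<usize>,                       // LIFO free-list of slot indices
    waiters: VecDeque<Rc<RefCell<Waiter>>>, // FIFO queue of parked acquirers
}

#[derive(Clone)]
pub struct Pool(Rc<RefCell<Inner>>);

pub struct PoolSlot {
    pub index: usize,
    pool: Pool,
}

pub struct Acquire {
    pool: Pool,
    waiter: Option<Rc<RefCell<Waiter>>>,
}

impl Pool {
    pub fn new(slots: usize) -> Self {
        let free = (0..slots).rev().collect();
        Pool(Rc::new(RefCell::new(Inner { free, waiters: VecDeque::new() })))
    }

    pub fn acquire(&self) -> Acquire {
        Acquire { pool: self.clone(), waiter: None }
    }

    /// Hand the slot to the oldest waiter if any, else return it to the free-list.
    fn release(&self, index: usize) {
        let oldest = self.0.borrow_mut().waiters.pop_front();
        match oldest {
            Some(w) => {
                let waker = {
                    let mut st = w.borrow_mut();
                    st.slot = Some(index);
                    st.waker.take()
                };
                // Wake only after all borrows are dropped.
                if let Some(waker) = waker {
                    waker.wake();
                }
            }
            None => self.0.borrow_mut().free.push(index),
        }
    }
}

impl Future for Acquire {
    type Output = PoolSlot;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<PoolSlot> {
        let this = self.get_mut();
        if let Some(w) = this.waiter.clone() {
            // Already parked: either a slot was handed over, or keep waiting.
            let mut st = w.borrow_mut();
            return match st.slot.take() {
                Some(index) => {
                    this.waiter = None;
                    Poll::Ready(PoolSlot { index, pool: this.pool.clone() })
                }
                None => {
                    st.waker = Some(cx.waker().clone());
                    Poll::Pending
                }
            };
        }
        let mut inner = this.pool.0.borrow_mut();
        if let Some(index) = inner.free.pop() {
            return Poll::Ready(PoolSlot { index, pool: this.pool.clone() });
        }
        // Pool empty: park at the back of the FIFO queue.
        let w = Rc::new(RefCell::new(Waiter { waker: Some(cx.waker().clone()), slot: None }));
        inner.waiters.push_back(w.clone());
        this.waiter = Some(w);
        Poll::Pending
    }
}

impl Drop for Acquire {
    /// Cancellation: pass on a handed-over slot, or leave the queue.
    fn drop(&mut self) {
        if let Some(w) = self.waiter.take() {
            let handed = w.borrow_mut().slot.take();
            match handed {
                Some(index) => self.pool.release(index),
                None => self.pool.0.borrow_mut().waiters.retain(|x| !Rc::ptr_eq(x, &w)),
            }
        }
    }
}

impl Drop for PoolSlot {
    fn drop(&mut self) {
        self.pool.release(self.index);
    }
}
```

Because no allocation happens when the pool is empty, the number of live slots can never exceed the pool size; callers simply wait, which is what turns the RSS bound into a hard cap.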
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
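The per-page pre-fault loop described in item 2 of the commit message above might look roughly like the following. It is a sketch under assumptions: the page size is hard-coded to 4096 (real code would query it), the function name is hypothetical, and the region is expected to be page-aligned and freshly mapped, as an mmap result would be, since the loop overwrites the first byte of each page with zero.

```rust
use std::ptr;

/// ASSUMPTION: 4 KiB pages; real code would query sysconf(_SC_PAGESIZE).
pub const PAGE_SIZE: usize = 4096;

/// Touch the first byte of every page in `region` with a volatile write so
/// the compiler cannot drop the stores. Returns the number of pages touched.
/// Intended for a fresh, page-aligned, zero-filled region.
pub fn prefault(region: &mut [u8]) -> usize {
    let base = region.as_mut_ptr();
    let len = region.len();
    let mut touched = 0;
    let mut offset = 0;
    while offset < len {
        // SAFETY: offset < len, so base.add(offset) stays inside the
        // exclusively borrowed slice.
        unsafe { ptr::write_volatile(base.add(offset), 0u8) };
        touched += 1;
        offset += PAGE_SIZE;
    }
    touched
}
```

Calling this from the already-pinned worker thread is what lets first-touch placement put the pages on that worker's NUMA node; the function itself knows nothing about NUMA.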

Eliminate the SLOT_SIZE/IO_BUF_BYTES coupling rather than enforce it with a const-assert. `buffer_pool::SLOT_SIZE` now derives directly from `device::IO_BUF_BYTES_USIZE`. Single source of truth; bumping `IO_BUF_BYTES` Just Works without a parallel edit elsewhere.

Compile-time identical to the previous const-assert version (same constant value, same emitted bytes), so existing deployments don't need to rebuild.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
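In outline, the derivation described above amounts to the following. The constants are shown flat rather than in their `device` and `buffer_pool` modules, the `u32` type of `IO_BUF_BYTES` is an assumption (suggested only by the existence of a separate `_USIZE` constant), and `slot_range` is a hypothetical helper added to show a use of the derived value.

```rust
// device.rs (sketch). ASSUMPTION: the base constant is a u32.
pub const IO_BUF_BYTES: u32 = 128 * 1024;
pub const IO_BUF_BYTES_USIZE: usize = IO_BUF_BYTES as usize;

// buffer_pool.rs (sketch): derived, so the two values cannot drift apart.
pub const SLOT_SIZE: usize = IO_BUF_BYTES_USIZE;

/// Hypothetical helper: byte range of slot `index` inside the pool region.
pub fn slot_range(index: usize) -> std::ops::Range<usize> {
    index * SLOT_SIZE..(index + 1) * SLOT_SIZE
}
```

With a derived constant there is nothing left to assert: changing `IO_BUF_BYTES` changes the slot size in the same build.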

jaredLunde and others added 4 commits May 13, 2026 00:16

jaredLunde merged commit 3b0944b into main May 13, 2026

jaredLunde deleted the jared/memories branch May 13, 2026 16:00
