ublk: bound bounce-buffer RSS via USER_COPY + per-worker pool #55

Merged: jaredLunde merged 4 commits into main from jared/memories on May 13, 2026
@jaredLunde jaredLunde commented May 13, 2026

USER_COPY decouples buffer ownership from tag arming: instead of every tag holding a stable 128 KB IoBuf for its entire lifetime, io_task acquires a buffer on demand from a fixed-size per-worker pool when an I/O CQE arrives and releases it after commit. Total bounce RSS is now bounded by K * POOL_SLOTS * SLOT_SIZE = 512 MB (K is the worker count) regardless of device count, rather than scaling linearly with armed tags.
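To make the bound concrete, here is a minimal arithmetic sketch. `SLOT_SIZE` (128 KB) comes from this PR and the worker count of 16 from a later commit message in it; `POOL_SLOTS = 256` is an assumption inferred from those figures, not a value the PR states, and MB is treated as MiB. The function names are hypothetical.

```rust
// Figures taken from this PR: 128 KB slots and a 512 MB bound on 16 workers.
pub const SLOT_SIZE: usize = 128 * 1024;
pub const WORKERS: usize = 16; // K in the formula

// Assumption: inferred as 512 MiB / (16 * 128 KiB); the PR does not state it.
pub const POOL_SLOTS: usize = 256;

/// Pooled design: the bound depends only on worker count and pool geometry.
pub const fn pool_bound_bytes(workers: usize, pool_slots: usize, slot_size: usize) -> usize {
    workers * pool_slots * slot_size
}

/// Legacy per-tag design: one stable buffer per armed tag, so the footprint
/// grows linearly with the number of armed tags across all devices.
pub const fn per_tag_bytes(armed_tags: usize, buf_bytes: usize) -> usize {
    armed_tags * buf_bytes
}
```

With these inputs, `pool_bound_bytes(WORKERS, POOL_SLOTS, SLOT_SIZE)` stays at 512 MiB no matter how many devices exist, while `per_tag_bytes` keeps growing as tags are armed.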

Measured at N=5025 exports (4092 ublk-backed, limited by the kernel's ublks_max cap):

  • VmRSS 3.07 GB (vs. ~20 GB projected for the legacy bounce path at the same density)
  • 750 KB per export
  • Throughput +7% vs. the bounce path in the N=1025 fio randrw test
  • All 6 fio_verify integrity cases pass under USER_COPY+pool

Other changes in this commit, related but smaller:

  • IO_BUF_BYTES 512K -> 128K matches WriteCache block size; 4x VAS cut per tag with no measured throughput cost beyond noise.
  • DEFAULT_MAX_EXPORTS 256 -> 20_000 lifts the previous hard cap on the router that refused exports past 256.
  • IoSizeHistogram per-export, lock-free, Prometheus-exported as glidefs_{read,write}_io_size_bytes — instrument for validating IO_BUF_BYTES against real traffic post-launch.
  • GLIDEFS_BOUNCE_MODE=1 env reverts to the legacy per-tag-stable path as a kill switch.
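As an illustration of the lock-free per-export histogram idea in the list above, here is a minimal sketch built on relaxed atomic counters. The bucket layout (power-of-two upper bounds from 512 B to 1 MiB) and all method names are assumptions for illustration; the PR does not show the real `IoSizeHistogram` internals or its Prometheus wiring.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical layout: bucket i counts I/Os of at most (512 << i) bytes;
// the last bucket also catches anything larger.
pub const BUCKETS: usize = 12;

pub struct IoSizeHistogram {
    counts: [AtomicU64; BUCKETS],
    sum_bytes: AtomicU64,
}

impl IoSizeHistogram {
    pub fn new() -> Self {
        Self {
            counts: std::array::from_fn(|_| AtomicU64::new(0)),
            sum_bytes: AtomicU64::new(0),
        }
    }

    /// Index of the first bucket whose upper bound is >= `bytes`.
    pub fn bucket_for(bytes: u64) -> usize {
        let mut i = 0;
        while i + 1 < BUCKETS && bytes > (512u64 << i) {
            i += 1;
        }
        i
    }

    /// Record one I/O. Two relaxed atomic adds, no locks on the hot path.
    pub fn observe(&self, bytes: u64) {
        self.counts[Self::bucket_for(bytes)].fetch_add(1, Ordering::Relaxed);
        self.sum_bytes.fetch_add(bytes, Ordering::Relaxed);
    }

    pub fn count(&self, bucket: usize) -> u64 {
        self.counts[bucket].load(Ordering::Relaxed)
    }

    pub fn sum_bytes(&self) -> u64 {
        self.sum_bytes.load(Ordering::Relaxed)
    }
}
```

Relaxed ordering is enough here because each counter is independent and a scrape only needs an approximately consistent snapshot.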

New module: glidefs/src/block/ublk/buffer_pool.rs (mmap-backed, LIFO free-list, lazy thread-local init per worker, malloc fallback on pool exhaustion).
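The shape of such a pool can be sketched roughly as follows. This is not the contents of `buffer_pool.rs`: it substitutes a `Vec<u8>` for the mmap-backed region to stay dependency-free, and `BufferPool`, `with_pool`, and the method names are hypothetical.

```rust
use std::cell::RefCell;

pub const SLOT_SIZE: usize = 128 * 1024;

pub struct BufferPool {
    backing: Vec<u8>, // stand-in for the mmap-backed region
    free: Vec<usize>, // LIFO stack of free slot indices
}

impl BufferPool {
    pub fn new(slots: usize) -> Self {
        Self {
            backing: vec![0u8; slots * SLOT_SIZE],
            // Reversed so that slot 0 is handed out first.
            free: (0..slots).rev().collect(),
        }
    }

    /// Pop the most recently released slot (likely still cache-warm).
    pub fn try_acquire(&mut self) -> Option<usize> {
        self.free.pop()
    }

    pub fn release(&mut self, index: usize) {
        self.free.push(index);
    }

    pub fn slot_mut(&mut self, index: usize) -> &mut [u8] {
        &mut self.backing[index * SLOT_SIZE..(index + 1) * SLOT_SIZE]
    }
}

thread_local! {
    // One pool per worker thread, created on first use.
    static POOL: RefCell<Option<BufferPool>> = RefCell::new(None);
}

/// Run `f` against this thread's pool, initializing it lazily with `slots` slots.
pub fn with_pool<R>(slots: usize, f: impl FnOnce(&mut BufferPool) -> R) -> R {
    POOL.with(|cell| {
        let mut guard = cell.borrow_mut();
        let pool = guard.get_or_insert_with(|| BufferPool::new(slots));
        f(pool)
    })
}
```

Because each pool is thread-local, acquire and release need no synchronization; the LIFO order means a worker tends to reuse the slot it touched last.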

See scripts/bench-results/{iobuf-128k-step1, density-stress-1025, usercopy-pool, usercopy-pool-5k}.txt for measurement traces.

jaredLunde and others added 4 commits May 13, 2026 00:16
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three small production-readiness improvements for the USER_COPY buffer pool:

  - Expose pool stats via /metrics:
      glidefs_ublk_buffer_pool_acquires_total
      glidefs_ublk_buffer_pool_exhaust_fallbacks_total
      glidefs_ublk_buffer_pool_workers_initialized
    The fallback counter is the operationally important one — a sustained
    non-zero value means the pool is undersized for the workload's
    concurrent-S3-await pattern and the structural RSS bound is being
    broken via the malloc fallback path. Alert on it.

  - Compile-time coupling between buffer_pool::SLOT_SIZE and
    device::IO_BUF_BYTES. They MUST match; the build now fails on
    mismatch rather than silently truncating I/Os or writing past slot
    boundaries.
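A compile-time coupling of this kind can be expressed in Rust with a const assertion. The sketch below is an assumption about the form, not the PR's code: the two constants are shown flat rather than in their `device` and `buffer_pool` modules, and the `u32` type for `IO_BUF_BYTES` is a guess.

```rust
// Stand-in for device::IO_BUF_BYTES (type assumed).
pub const IO_BUF_BYTES: u32 = 128 * 1024;

// Stand-in for buffer_pool::SLOT_SIZE, maintained separately.
pub const SLOT_SIZE: usize = 128 * 1024;

// Evaluated at compile time: if the two values drift apart, the build fails
// here instead of truncating I/Os or overrunning a slot at runtime.
const _: () = assert!(
    SLOT_SIZE == IO_BUF_BYTES as usize,
    "buffer_pool::SLOT_SIZE must equal device::IO_BUF_BYTES"
);
```

Changing either constant without the other turns the mismatch into a compile error carrying the message above.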

Also documents the deliberate choice of blocking libc::pread / libc::pwrite
over io_uring for the USER_COPY data transfer: the blocking calls measured
faster (a 28% throughput delta on single-device 4-job randwrite). The cdev op
completes in ~1 µs in the kernel, so async SQE/CQE/waker overhead exceeds the
syscall cost. The "blocking syscall in async path" concern was theoretical;
measurement shows synchronous is the right choice for this specific path.
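For illustration only, a synchronous positional transfer can be sketched with the standard library's `FileExt`, whose `read_at`/`write_at` family wraps `pread`/`pwrite` on Unix. The function names are hypothetical, the PR itself calls `libc::pread`/`libc::pwrite` directly, and the `pos` argument stands in for the per-request offset into the ublk character device, whose encoding is not shown in this PR.

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt;

/// WRITE request: pull the payload from the device file into a pool slot.
/// Blocks the calling worker until the whole slice is filled.
pub fn fetch_write_payload(cdev: &File, pos: u64, slot: &mut [u8]) -> io::Result<()> {
    cdev.read_exact_at(slot, pos)
}

/// READ request: push the data we produced back through the device file.
pub fn push_read_payload(cdev: &File, pos: u64, data: &[u8]) -> io::Result<()> {
    cdev.write_all_at(data, pos)
}
```

Both helpers take the offset explicitly, so they never touch the file cursor and can be called from any worker without coordinating seeks.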

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two structural fixes so the USER_COPY buffer pool's RSS bound holds
unconditionally, no monitoring required.

  1. Async backpressure on pool exhaustion.
     Replaces the silent malloc fallback with a FIFO waiter queue. When
     a worker's pool is empty, `acquire()` returns a future that parks
     on a waker list; `PoolSlot::Drop` wakes the oldest waiter. Total
     bounce RSS is now a TRUE hard cap (K × POOL_SLOTS × SLOT_SIZE =
     512 MB on 16 workers) — pool can't be silently broken by a
     concurrent S3-await spike; throughput degrades gracefully instead.

     Metric rename: `exhaust_fallbacks` → `backpressure_waits`. Same
     "your pool is undersized" signal, different mechanism.

  2. Bulletproof pre-fault for NUMA-local placement.
     Replaces `ptr::write_bytes` with a per-page `write_volatile` loop.
     `write_bytes` to MAP_ANONYMOUS memory can be elided by LLVM since
     the destination is "already" zero from the kernel's zero-page CoW;
     volatile forbids the elision so every page is actually faulted.

     Pre-fault now runs from the CPU-pinned worker thread (pinning
     happens before lazy pool init), so Linux's MPOL_DEFAULT first-touch
     places each page on the worker's local NUMA node. Multi-socket
     production hosts get NUMA-correct placement by construction, no
     mbind needed.
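The backpressure mechanism in item 1 can be sketched as a single-threaded pool whose `acquire()` future parks its waker in a FIFO queue and whose slot `Drop` wakes the oldest waiter. This is a simplified illustration under assumptions, not the PR's code: all type names except `PoolSlot` are hypothetical, slots are plain indices, a re-polled pending future registers its waker again, and `CountingWaker` and `poll_once` exist only to drive the sketch without an async runtime.

```rust
use std::cell::RefCell;
use std::collections::VecDeque;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

struct Inner {
    free: Vec<usize>,         // LIFO stack of free slot indices
    waiters: VecDeque<Waker>, // FIFO queue of parked tasks
}

#[derive(Clone)]
pub struct Pool(Rc<RefCell<Inner>>);

pub struct PoolSlot {
    pool: Pool,
    pub index: usize,
}

pub struct Acquire {
    pool: Pool,
}

impl Pool {
    pub fn new(slots: usize) -> Self {
        let inner = Inner { free: (0..slots).rev().collect(), waiters: VecDeque::new() };
        Pool(Rc::new(RefCell::new(inner)))
    }

    pub fn acquire(&self) -> Acquire {
        Acquire { pool: self.clone() }
    }
}

impl Future for Acquire {
    type Output = PoolSlot;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<PoolSlot> {
        let mut inner = self.pool.0.borrow_mut();
        match inner.free.pop() {
            Some(index) => Poll::Ready(PoolSlot { pool: self.pool.clone(), index }),
            None => {
                // Pool empty: park instead of allocating outside the pool.
                inner.waiters.push_back(cx.waker().clone());
                Poll::Pending
            }
        }
    }
}

impl Drop for PoolSlot {
    fn drop(&mut self) {
        let oldest = {
            let mut inner = self.pool.0.borrow_mut();
            inner.free.push(self.index);
            inner.waiters.pop_front()
        };
        // Wake after the borrow is released so the woken task can re-poll.
        if let Some(waker) = oldest {
            waker.wake();
        }
    }
}

/// Test helper: counts how many times it has been woken.
pub struct CountingWaker(pub AtomicUsize);

impl Wake for CountingWaker {
    fn wake(self: Arc<Self>) {
        self.0.fetch_add(1, Ordering::SeqCst);
    }
}

/// Test helper: poll a future exactly once with the given waker.
pub fn poll_once<F: Future + Unpin>(fut: &mut F, waker: &Waker) -> Poll<F::Output> {
    let mut cx = Context::from_waker(waker);
    Pin::new(fut).poll(&mut cx)
}
```

Because a parked task never allocates outside the pool, memory stays at the fixed pool size and excess demand shows up as latency instead.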

Empirical:
  - 5 pool unit tests pass (try-acquire, exhaustion, LIFO, async-park,
    FIFO-wake).
  - fio_verify sequential passes (data integrity preserved).
  - Throughput on this homelab: single-device 880k IOPS, 8-device
    aggregate 1.327M IOPS (no measurable regression from the async
    path).
  - All 16 worker pools now initialize cleanly post-restart; pre-fault
    commits the expected 512 MB.
  - Crash + recovery at N=1025: daemon kill -9, systemd auto-restart,
    1025 USER_COPY devices recovered in 120 s. 4 randomly-sampled
    recovered devices pass fio CRC32C verify (err=0). 0 backpressure
    waits during the entire recovery + verify cycle.
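The per-page pre-fault loop described in item 2 above can be sketched as follows. It is an illustration under assumptions: a plain byte slice stands in for the mmap'd region, the page size is passed in by the caller, and `prefault` is a hypothetical name.

```rust
use std::ptr;

/// Touch one byte in every page of `buf` so the kernel must back each page
/// with real memory. Returns the number of pages touched.
///
/// Writing zero is harmless for a fresh anonymous mapping (already zero),
/// but it would clobber data in a buffer that is in use.
pub fn prefault(buf: &mut [u8], page_size: usize) -> usize {
    assert!(page_size > 0, "page_size must be non-zero");
    let base = buf.as_mut_ptr();
    let len = buf.len();
    let mut touched = 0;
    let mut off = 0;
    while off < len {
        // SAFETY: off < len, so base.add(off) stays inside the slice.
        // The volatile write cannot be removed by the optimizer.
        unsafe { ptr::write_volatile(base.add(off), 0u8) };
        touched += 1;
        off += page_size;
    }
    touched
}
```

Running this from the already-pinned worker thread is what lets first-touch placement put each page on that worker's NUMA node, as the commit message describes.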

See scripts/bench-results/usercopy-pool-backpressure.txt for traces.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Eliminate the SLOT_SIZE/IO_BUF_BYTES coupling rather than enforce it
with a const-assert. `buffer_pool::SLOT_SIZE` now derives directly from
`device::IO_BUF_BYTES_USIZE`. Single source of truth; bumping
`IO_BUF_BYTES` Just Works without a parallel edit elsewhere.
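A derived constant of this kind might look like the sketch below. The constants are shown flat rather than in their `device` and `buffer_pool` modules, the `u32` type of `IO_BUF_BYTES` is an assumption, and `slot_offset` is a hypothetical helper added only to show the constant in use.

```rust
// Stand-ins for device::IO_BUF_BYTES and device::IO_BUF_BYTES_USIZE.
pub const IO_BUF_BYTES: u32 = 128 * 1024; // type assumed
pub const IO_BUF_BYTES_USIZE: usize = IO_BUF_BYTES as usize;

// Stand-in for buffer_pool::SLOT_SIZE: derived, so it cannot drift.
pub const SLOT_SIZE: usize = IO_BUF_BYTES_USIZE;

/// Byte offset of slot `index` inside a pool's backing region.
pub const fn slot_offset(index: usize) -> usize {
    index * SLOT_SIZE
}
```

With the derivation in place there is nothing left to assert: editing `IO_BUF_BYTES` changes `SLOT_SIZE` in the same build.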

Compile-time identical to the previous const-assert version — same
constant value, same emitted bytes — so existing deployments don't need
to rebuild.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jaredLunde jaredLunde merged commit 3b0944b into main May 13, 2026
@jaredLunde jaredLunde deleted the jared/memories branch May 13, 2026 16:00