Summary
On unified-memory APUs with a large BIOS VRAM carve-out (e.g. AMD Strix Halo / Ryzen AI MAX+ 395 set to a 96 GB iGPU split), ds4's ROCm backend runs out of memory at larger context sizes even though the machine has plenty of total memory. The model loads fine; session creation then fails:
ds4-bench: context buffers 1164.22 MiB (ctx=16417, backend=rocm, prefill_chunk=8192, ...)
ds4: ROCm tensor alloc failed: out of memory
ds4: ROCm tensor alloc failed: out of memory
... (repeated)
ds4-bench: failed to create session
Repro: Radeon 8060S (gfx1151), 128 GB LPDDR5X, BIOS UMA = 96 GB VRAM / ~32 GB OS. q2-imatrix model (~80.76 GiB). Running ds4-bench --ctx-start 16384 --ctx-max 16384 fails with OOM. A 2k context works; 16k fails.
Root cause
ds4_gpu_tensor_alloc() (rocm/ds4_rocm_runtime.cuh) allocates the general tensor class (prefill scratch, activations) with device-only cudaMalloc. Only the KV-cache class is conditionally routed through cudaMallocManaged (ds4_gpu_should_use_managed_kv_cache). With ~81 GB of weights resident in the 96 GB carve-out, the remaining VRAM headroom (~15 GB) is too small for the prefill scratch at higher context (prefill_chunk=8192), so the device-only allocations fail with OOM while the rest of the unified pool (including GTT beyond the carve-out) sits unused.
Proposed fix
An opt-in environment variable (DS4_CUDA_MANAGED=1) that routes ds4_gpu_tensor_alloc through cudaMallocManaged, so the general allocator can draw from the full UMA pool, complementing the existing auto-managed KV class. Zero overhead when unset.
Measured (Strix Halo, gfx1151, q2 80.76 GiB, 96 GB split): 16k context goes from OOM to 195 tok/s prefill / 13.5 tok/s generation.
This is platform-general (affects any large-carve-out UMA APU, Linux included), not Windows-specific. Happy to send the PR (one-line opt-in in the allocator).
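The opt-in described above could be sketched roughly as follows. This is a minimal illustration, not the actual patch: the helper names (AllocMode, choose_alloc_mode, general_alloc_mode, general_tensor_alloc) are hypothetical, and the sketch assumes the variable is treated as enabled only when it equals "1". The CUDA part is guarded so the policy logic also compiles with a plain C++ compiler.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical allocation modes for the general tensor class.
enum class AllocMode { DeviceOnly, Managed };

// Pure policy function: Managed only when the value is exactly "1"
// (assumption; the real patch may accept other values).
inline AllocMode choose_alloc_mode(const char *env_value) {
    if (env_value != nullptr && std::strcmp(env_value, "1") == 0) {
        return AllocMode::Managed;
    }
    return AllocMode::DeviceOnly;
}

// Reads DS4_CUDA_MANAGED once and caches the result, so later
// allocations only pay for a cached lookup.
inline AllocMode general_alloc_mode() {
    static const AllocMode mode =
        choose_alloc_mode(std::getenv("DS4_CUDA_MANAGED"));
    return mode;
}

#if defined(__CUDACC__)
#include <cuda_runtime.h>

// Hypothetical stand-in for the branch inside ds4_gpu_tensor_alloc():
// managed memory when opted in, device-only memory otherwise.
inline cudaError_t general_tensor_alloc(void **ptr, std::size_t bytes) {
    return general_alloc_mode() == AllocMode::Managed
               ? cudaMallocManaged(ptr, bytes)
               : cudaMalloc(ptr, bytes);
}
#endif
```

Keeping the decision in a small pure function makes the default path (variable unset) easy to verify independently of the GPU runtime.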
Environment
- GPU: AMD Radeon 8060S (gfx1151, Strix Halo)
- ROCm: 7.1
- Model: DeepSeek-V4-Flash q2-imatrix