
ROCm: high-context OOM on large-VRAM-carveout APUs (Strix Halo) — general tensor allocator is device-only #359

@jamesburton

Summary

On unified-memory APUs with a large BIOS VRAM carve-out (e.g. AMD Strix Halo / Ryzen AI MAX+ 395 set to a 96 GB iGPU split), ds4's ROCm backend OOMs at higher context even though the box has plenty of total memory. The model loads fine; session creation then fails:

ds4-bench: context buffers 1164.22 MiB (ctx=16417, backend=rocm, prefill_chunk=8192, ...)
ds4: ROCm tensor alloc failed: out of memory
ds4: ROCm tensor alloc failed: out of memory
... (repeated)
ds4-bench: failed to create session

Repro: Radeon 8060S (gfx1151), 128 GB LPDDR5X, BIOS UMA = 96 GB VRAM / ~32 GB OS. q2-imatrix model (~80.76 GiB). Running ds4-bench --ctx-start 16384 --ctx-max 16384 hits the OOM. A 2k context works; 16k fails.

Root cause

ds4_gpu_tensor_alloc() (rocm/ds4_rocm_runtime.cuh) allocates the general tensor class (prefill scratch, activations) with device-only cudaMalloc. Only the KV-cache class is conditionally routed through cudaMallocManaged (ds4_gpu_should_use_managed_kv_cache). With ~81 GB of weights resident in the 96 GB carve-out, only ~15 GB of VRAM headroom remains, which is too small for the prefill scratch at higher context (prefill_chunk=8192). The device-only allocations therefore OOM while the rest of the unified pool (including GTT beyond the carve-out) sits unused.
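To make the routing described above concrete, here is a minimal host-side sketch of the decision only. The enum and function names are hypothetical and not taken from ds4; the kv_managed parameter stands in for whatever ds4_gpu_should_use_managed_kv_cache returns.

```cpp
// Hypothetical model of the current routing; illustrative names, not ds4's API.
enum class AllocClass { General, KvCache };
enum class Backing { DeviceOnly, Managed };

// kv_managed stands in for the result of ds4_gpu_should_use_managed_kv_cache.
constexpr Backing choose_backing(AllocClass cls, bool kv_managed) {
    if (cls == AllocClass::KvCache && kv_managed) {
        return Backing::Managed;    // would map to cudaMallocManaged
    }
    // The general class (prefill scratch, activations) always lands here,
    // so it is confined to the VRAM carve-out.
    return Backing::DeviceOnly;     // would map to cudaMalloc
}
```

In this model no input lets the general class reach managed memory, which matches the failure mode reported here.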

Proposed fix

An opt-in environment variable (DS4_CUDA_MANAGED=1) that routes ds4_gpu_tensor_alloc through cudaMallocManaged, so the general allocator can draw from the full UMA pool, complementing the existing auto-managed KV class. Zero overhead when unset.
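A rough sketch of how such an opt-in could be read, assuming the flag is checked once and cached. The helper names env_flag_is_one and general_alloc_uses_managed are hypothetical, and treating only the exact value "1" as enabled is an assumption, since the issue does not say how other values should behave.

```cpp
#include <cstdlib>
#include <cstring>

// True only when the variable is set to exactly "1" (assumed semantics).
inline bool env_flag_is_one(const char *name) {
    const char *value = std::getenv(name);
    return value != nullptr && std::strcmp(value, "1") == 0;
}

// Hypothetical helper: reads the environment on first use and caches the
// answer, so later calls only test a static bool.
inline bool general_alloc_uses_managed() {
    static const bool enabled = env_flag_is_one("DS4_CUDA_MANAGED");
    return enabled;
}

// Inside the allocator the branch might look like this (not compiled here,
// since it needs the CUDA/HIP runtime headers):
//
//   cudaError_t err = general_alloc_uses_managed()
//       ? cudaMallocManaged(&ptr, bytes)   // may draw on the full UMA pool
//       : cudaMalloc(&ptr, bytes);         // device-only, existing behavior
```

With the variable unset, the existing cudaMalloc path is taken unchanged, and the only added cost is the cached check.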

Measured (Strix Halo, gfx1151, q2 80.76 GiB, 96 GB split): 16k context goes from OOM to 195 tok/s prefill and 13.5 tok/s generation.

This is platform-general (affects any large-carve-out UMA APU, Linux included), not Windows-specific. Happy to send the PR (one-line opt-in in the allocator).

Environment

  • GPU: AMD Radeon 8060S (gfx1151, Strix Halo)
  • ROCm: 7.1
  • Model: DeepSeek-V4-Flash q2-imatrix
