Skip to content

rocm: DS4_CUDA_MANAGED opt-in to avoid high-context OOM on large-carveout UMA APUs#361

Open
jamesburton wants to merge 1 commit into
antirez:mainfrom
jamesburton:pr-rocm-managed
Open

rocm: DS4_CUDA_MANAGED opt-in to avoid high-context OOM on large-carveout UMA APUs#361
jamesburton wants to merge 1 commit into
antirez:mainfrom
jamesburton:pr-rocm-managed

Conversation

@jamesburton

Copy link
Copy Markdown

Fixes #359.

On unified-memory APUs with a large BIOS VRAM carve-out (Strix Halo 96 GB split), ds4_gpu_tensor_alloc uses device-only cudaMalloc for the general tensor class (prefill scratch/activations); only the KV class is auto-managed. Once ~81 GB of weights are resident the ~15 GB VRAM headroom is too small for the prefill scratch at higher context, so it OOMs (16k ctx fails; model loads fine).

This adds an opt-in DS4_CUDA_MANAGED=1 that routes the general allocator through cudaMallocManaged so it draws from the full UMA pool (incl. GTT beyond the carve-out). Zero-overhead when unset; complements the existing auto-managed KV class.

Measured (Radeon 8060S gfx1151, q2 80.76 GiB, 96 GB split): 16k context OOM → 195 tok/s prefill / 13.5 gen. +11/-1 lines, ROCm-only.

…anaged memory

ds4_gpu_tensor_alloc used device-only cudaMalloc for all non-KV tensors
(prefill scratch, activations). On unified-memory APUs with a large BIOS
VRAM carve-out (e.g. Strix Halo, 96 GB), once ~81 GB of weights are resident
the remaining VRAM headroom is too small for the prefill scratch at higher
context, so cudaMalloc OOMs (observed: 16k ctx fails with repeated
"ROCm tensor alloc failed: out of memory", model loads fine).

DS4_CUDA_MANAGED=1 routes this allocator through cudaMallocManaged so it can
draw from the full UMA pool (incl. GTT beyond the carve-out), complementing
the existing auto-managed KV-cache class. Opt-in, zero-overhead when unset.

Measured on Strix Halo (Radeon 8060S, gfx1151), q2 model 80.76 GiB, BIOS
96 GB VRAM split: 16k context goes from OOM to 195 tok/s prefill / 13.5 gen.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds an opt-in path to allocate general GPU tensors from managed (unified) memory to reduce OOM risk on UMA systems with large VRAM carve-outs.

Changes:

  • Introduce DS4_CUDA_MANAGED env var to switch ds4_gpu_tensor_alloc from cudaMalloc to cudaMallocManaged.
  • Cache the env-var decision to avoid repeated getenv calls and include allocation-mode context in the error label.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread rocm/ds4_rocm_runtime.cuh
Comment on lines +1357 to +1360
static int managed_all = -1;
if (managed_all < 0) managed_all = (getenv("DS4_CUDA_MANAGED") != NULL) ? 1 : 0;
cudaError_t rc = managed_all ? cudaMallocManaged(&t->ptr, (size_t)bytes)
: cudaMalloc(&t->ptr, (size_t)bytes);
@jamesburton

Copy link
Copy Markdown
Author

Correction — withdrawing the original justification. On a clean re-test (gfx1151, 96 GB-VRAM split, GPU exclusively free, idle-sleep disabled), managed ON vs OFF are within noise at every practical context (16k: off 186.9 / on 195.2 prefill tok/s; both succeed; 32k both ~185). The "16k OOMs without managed" I first reported was a VRAM-contention artifact (a concurrent GPU consumer), not deterministic — my mistake.

managed OFF does genuinely OOM only at very large KV allocation (--ctx-alloc 1048576 → ~30.8 GiB context buffers > ~15 GiB VRAM headroom). But managed ON doesn't rescue that on this split: with ~32 GB OS-visible RAM, the ~30 GiB managed spill exhausts host memory (process OOM). So on a large-VRAM carve-out this override has no usable window — it just trades a device OOM for a host OOM. Its real benefit is the opposite config (small VRAM + large OS RAM, à la #313), which I can't exercise on this BIOS split.

The change is still gated/opt-in/zero-overhead-when-unset and correct in principle, but I can't substantiate a benefit on the hardware I have, so treat this as unvalidated — happy to close it, or leave it for someone who can test the small-VRAM UMA config. Sorry for the noise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ROCm: high-context OOM on large-VRAM-carveout APUs (Strix Halo) — general tensor allocator is device-only

2 participants