rocm: DS4_CUDA_MANAGED opt-in to avoid high-context OOM on large-carveout UMA APUs by jamesburton · Pull Request #361 · antirez/ds4

jamesburton · 2026-06-08T18:31:55Z

Fixes #359.

On unified-memory APUs with a large BIOS VRAM carve-out (Strix Halo 96 GB split), ds4_gpu_tensor_alloc uses device-only cudaMalloc for the general tensor class (prefill scratch/activations); only the KV class is auto-managed. Once ~81 GB of weights are resident the ~15 GB VRAM headroom is too small for the prefill scratch at higher context, so it OOMs (16k ctx fails; model loads fine).

This adds an opt-in DS4_CUDA_MANAGED=1 that routes the general allocator through cudaMallocManaged so it draws from the full UMA pool (incl. GTT beyond the carve-out). Zero-overhead when unset; complements the existing auto-managed KV class.

Measured (Radeon 8060S gfx1151, q2 80.76 GiB, 96 GB split): 16k context OOM → 195 tok/s prefill / 13.5 gen. +11/-1 lines, ROCm-only.

…anaged memory ds4_gpu_tensor_alloc used device-only cudaMalloc for all non-KV tensors (prefill scratch, activations). On unified-memory APUs with a large BIOS VRAM carve-out (e.g. Strix Halo, 96 GB), once ~81 GB of weights are resident the remaining VRAM headroom is too small for the prefill scratch at higher context, so cudaMalloc OOMs (observed: 16k ctx fails with repeated "ROCm tensor alloc failed: out of memory", model loads fine). DS4_CUDA_MANAGED=1 routes this allocator through cudaMallocManaged so it can draw from the full UMA pool (incl. GTT beyond the carve-out), complementing the existing auto-managed KV-cache class. Opt-in, zero-overhead when unset. Measured on Strix Halo (Radeon 8060S, gfx1151), q2 model 80.76 GiB, BIOS 96 GB VRAM split: 16k context goes from OOM to 195 tok/s prefill / 13.5 gen. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds an opt-in path to allocate general GPU tensors from managed (unified) memory to reduce OOM risk on UMA systems with large VRAM carve-outs.

Changes:

Introduce DS4_CUDA_MANAGED env var to switch ds4_gpu_tensor_alloc from cudaMalloc to cudaMallocManaged.
Cache the env-var decision to avoid repeated getenv calls and include allocation-mode context in the error label.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+    static int managed_all = -1;
+    if (managed_all < 0) managed_all = (getenv("DS4_CUDA_MANAGED") != NULL) ? 1 : 0;
+    cudaError_t rc = managed_all ? cudaMallocManaged(&t->ptr, (size_t)bytes)
+                                 : cudaMalloc(&t->ptr, (size_t)bytes);


jamesburton · 2026-06-09T01:11:35Z

Correction — withdrawing the original justification. On a clean re-test (gfx1151, 96 GB-VRAM split, GPU exclusively free, idle-sleep disabled), managed ON vs OFF are within noise at every practical context (16k: off 186.9 / on 195.2 prefill tok/s; both succeed; 32k both ~185). The "16k OOMs without managed" I first reported was a VRAM-contention artifact (a concurrent GPU consumer), not deterministic — my mistake.

managed OFF does genuinely OOM only at very large KV allocation (--ctx-alloc 1048576 → ~30.8 GiB context buffers > ~15 GiB VRAM headroom). But managed ON doesn't rescue that on this split: with ~32 GB OS-visible RAM, the ~30 GiB managed spill exhausts host memory (process OOM). So on a large-VRAM carve-out this override has no usable window — it just trades a device OOM for a host OOM. Its real benefit is the opposite config (small VRAM + large OS RAM, à la #313), which I can't exercise on this BIOS split.

The change is still gated/opt-in/zero-overhead-when-unset and correct in principle, but I can't substantiate a benefit on the hardware I have, so treat this as unvalidated — happy to close it, or leave it for someone who can test the small-VRAM UMA config. Sorry for the noise.

Copilot AI review requested due to automatic review settings June 8, 2026 18:31

jamesburton mentioned this pull request Jun 8, 2026

win: native Windows (AMD HIP SDK) ROCm + MinGW CPU build for gfx1151 + >2 GiB fix #362

Open

Copilot AI reviewed Jun 8, 2026

View reviewed changes

Comment thread rocm/ds4_rocm_runtime.cuh

Comment on lines +1357 to +1360

static int managed_all = -1;

if (managed_all < 0) managed_all = (getenv("DS4_CUDA_MANAGED") != NULL) ? 1 : 0;

cudaError_t rc = managed_all ? cudaMallocManaged(&t->ptr, (size_t)bytes)

: cudaMalloc(&t->ptr, (size_t)bytes);

jamesburton mentioned this pull request Jun 8, 2026

AMD GPU (ROCm/HIP) and especially Strix Halo discussion group #16

Open

jamesburton mentioned this pull request Jun 9, 2026

ROCm: high-context OOM on large-VRAM-carveout APUs (Strix Halo) — general tensor allocator is device-only #359

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rocm: DS4_CUDA_MANAGED opt-in to avoid high-context OOM on large-carveout UMA APUs#361

rocm: DS4_CUDA_MANAGED opt-in to avoid high-context OOM on large-carveout UMA APUs#361
jamesburton wants to merge 1 commit into
antirez:mainfrom
jamesburton:pr-rocm-managed

jamesburton commented Jun 8, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

jamesburton commented Jun 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jamesburton commented Jun 8, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

jamesburton commented Jun 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants