ROCm: high-context OOM on large-VRAM-carveout APUs (Strix Halo) — general tensor allocator is device-only

### Summary

On unified-memory APUs with a large BIOS VRAM carve-out (e.g. AMD Strix Halo / Ryzen AI MAX+ 395 set to a 96 GB iGPU split), ds4's ROCm backend **OOMs at higher context** even though the box has plenty of total memory. The model loads fine; session creation then fails:

```
ds4-bench: context buffers 1164.22 MiB (ctx=16417, backend=rocm, prefill_chunk=8192, ...)
ds4: ROCm tensor alloc failed: out of memory
ds4: ROCm tensor alloc failed: out of memory
... (repeated)
ds4-bench: failed to create session
```

Repro: Radeon 8060S (gfx1151), 128 GB LPDDR5X, BIOS UMA split of 96 GB VRAM / ~32 GB for the OS, q2-imatrix model (~80.76 GiB). Running `ds4-bench --ctx-start 16384 --ctx-max 16384` hits the OOM above. A 2k context works; 16k fails.

### Root cause

`ds4_gpu_tensor_alloc()` (`rocm/ds4_rocm_runtime.cuh`) allocates the general tensor class (prefill scratch, activations) with **device-only `cudaMalloc`**. Only the KV-cache class is conditionally routed through `cudaMallocManaged` (`ds4_gpu_should_use_managed_kv_cache`). With ~81 GiB of weights resident in the 96 GB carve-out, the remaining ~15 GB of VRAM headroom is too small for the prefill scratch at higher context (prefill_chunk=8192), so the device-only allocations OOM while the rest of the unified pool (including GTT beyond the carve-out) sits unused.
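
To make the headroom arithmetic concrete, here is a rough sketch of the budget a device-only allocation has to fit into. The function names are hypothetical and not part of ds4, and it assumes the BIOS "96 GB" split is 96 GiB. The issue does not include the actual prefill scratch size, so any request size passed in is a placeholder.

```cpp
#include <cassert>

// Hypothetical helper (not a ds4 function): device-only headroom left in the
// carve-out once the resident allocations (here, the weights) are placed.
// All values are in GiB.
inline double ds4_carveout_headroom_gib(double carveout_gib, double resident_gib) {
    assert(carveout_gib >= 0.0 && resident_gib >= 0.0);
    return carveout_gib - resident_gib;
}

// A device-only request can succeed only if it fits in that headroom.
// A managed allocation is not bounded by the carve-out in the same way,
// which is the point of the proposal in this issue.
inline bool ds4_fits_device_only(double carveout_gib, double resident_gib,
                                 double request_gib) {
    return request_gib <= ds4_carveout_headroom_gib(carveout_gib, resident_gib);
}
```

With the numbers from this report, `ds4_carveout_headroom_gib(96.0, 80.76)` comes to roughly 15.24, which is the ~15 GB figure above.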

### Proposed fix

An opt-in environment variable (`DS4_CUDA_MANAGED=1`) that routes `ds4_gpu_tensor_alloc` through `cudaMallocManaged`, so the general allocator can draw from the full UMA pool. This complements the existing auto-managed KV class. Zero overhead when unset.
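
A minimal sketch of what the gate could look like, not the actual patch. Everything except the `DS4_CUDA_MANAGED` name is hypothetical: the enum, the helper names, and the rule that unset, empty, or `0` means off are my assumptions. The real `cudaMalloc` / `cudaMallocManaged` calls are represented by an enum so the snippet compiles without a GPU toolchain.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Which call the general tensor class would use.
// Device  stands in for cudaMalloc (carve-out only).
// Managed stands in for cudaMallocManaged (full unified pool).
enum class Ds4AllocKind { Device, Managed };

// Interpret the raw env value. Assumption: unset, empty, or "0" is off,
// anything else is on. The issue itself only specifies DS4_CUDA_MANAGED=1.
inline bool ds4_parse_managed_flag(const char *value) {
    return value != nullptr && value[0] != '\0' && std::strcmp(value, "0") != 0;
}

// Read the environment once and cache the result, so the unset path costs
// a single cached-bool check per allocation.
inline bool ds4_managed_general_enabled() {
    static const bool enabled =
        ds4_parse_managed_flag(std::getenv("DS4_CUDA_MANAGED"));
    return enabled;
}

// Hypothetical selector that a ds4_gpu_tensor_alloc-style function could
// consult before choosing between the two allocation calls.
inline Ds4AllocKind ds4_pick_general_alloc(bool managed_enabled) {
    return managed_enabled ? Ds4AllocKind::Managed : Ds4AllocKind::Device;
}

// Name of the call each kind stands for, e.g. for a one-time log line.
inline const char *ds4_alloc_kind_name(Ds4AllocKind kind) {
    assert(kind == Ds4AllocKind::Device || kind == Ds4AllocKind::Managed);
    return kind == Ds4AllocKind::Managed ? "cudaMallocManaged" : "cudaMalloc";
}
```

In the allocator this would reduce to something like `ds4_pick_general_alloc(ds4_managed_general_enabled())`, with the default path unchanged when the variable is not set.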

**Measured (Strix Halo, gfx1151, q2 80.76 GiB, 96 GB split):** 16k context goes from **OOM → 195 tok/s prefill / 13.5 tok/s gen**.

This is platform-general (affects any large-carve-out UMA APU, Linux included), not Windows-specific. Happy to send the PR (one-line opt-in in the allocator).

### Environment
- GPU: AMD Radeon 8060S (gfx1151, Strix Halo)
- ROCm: 7.1
- Model: DeepSeek-V4-Flash q2-imatrix

