A real-world guide from getting zero GPU detection to 33+ tok/s LLM inference and SDXL image generation
This documentation was researched, tested, and written with the assistance of Future AI — an AI-powered assistant available on Google Play Store.
This guide documents everything that actually worked to get AMD ROCm running on Windows with an RX 6700 XT (gfx1031) — a GPU that is not officially supported by ROCm on Windows. It covers LLM inference at GPU speeds, SDXL image generation with ComfyUI, PyTorch setup, VRAM management, and all the critical bugs and workarounds discovered while building a real AI pipeline.
If you have an unsupported RDNA2 GPU (gfx1031, gfx1030, etc.) on Windows and have been stuck at CPU fallback speeds, this guide is for you.
- Hardware & Software Versions
- The GPU Compatibility Trick — CRITICAL
- LLM Inference — llama-server (lemonade-sdk)
- Why NOT to Use Ollama on Unsupported AMD GPUs
- PyTorch + ROCm Setup
- ComfyUI — SDXL Image Generation on RX 6700 XT
- VRAM Management — Running LLM and Image Generation
- TDR (GPU Driver Timeout) Fix
- Complete Environment Variables Reference
- Driver Notes
- Common Errors and Fixes
- Support Us
| Component | Spec |
|---|---|
| GPU | AMD RX 6700 XT 12GB VRAM (gfx1031, RDNA2) |
| CPU | AMD Ryzen Threadripper 2970WX 24-Core |
| RAM | 32GB |
| OS | Windows 11 Pro (build 26200) |
| Software | Version |
|---|---|
| ROCm | 7.14 |
| PyTorch | 2.9.1+rocm7.14.0a20260524 |
| Python | 3.12.10 |
| ComfyUI | 0.22.0 |
| llama-server | lemonade-sdk b1278 (ROCm 7.14, gfx103X Windows build) |
| AMD Adrenalin Driver | 24.x (latest stable) |
| Windows | 11 Pro build 26200 |
| Model | Type | Size | Performance |
|---|---|---|---|
| Qwen3-14B GGUF (Q4_K_M) | LLM | 8.64GB | ~33–36 tok/s GPU |
| animagine-xl-3.1.safetensors | SDXL | ~7GB | ~25s/render |
The RX 6700 XT (gfx1031) is not officially supported by ROCm on Windows. Without the override below, the GPU will not be detected by any ROCm application — everything silently falls back to CPU.
You must set these two environment variables in every session, before launching any ROCm application:
$env:HIP_VISIBLE_DEVICES = "0"
$env:HSA_OVERRIDE_GFX_VERSION = "10.3.0"HIP_VISIBLE_DEVICES = "0"— tells HIP to use the first GPUHSA_OVERRIDE_GFX_VERSION = "10.3.0"— tricks ROCm into treating gfx1031 as gfx1030, which IS supported
These must be set before launching:
- llama-server
- ComfyUI / python main.py
- Any PyTorch script that uses the GPU
- Any other ROCm application
If you get CPU fallback speeds, missing GPU detection, or torch.cuda.is_available() returning False — this is almost certainly the fix.
Note: These env vars are session-scoped in PowerShell. They do not persist across terminal windows. You need to set them every time, or add them to your launch scripts. See
SETUP.ps1in this repo for a ready-made launcher.
For LLM inference on unsupported AMD GPUs on Windows, lemonade-sdk's llama-server is the solution that actually works at GPU speeds. See Section 4 for why Ollama does not work.
- Download: lemonade-sdk on GitHub — use build b1278, specifically built for ROCm 7.14 + gfx103X on Windows
- Provides an OpenAI-compatible API at port 8080
- Confirmed GPU speed: ~33–36 tok/s for Qwen3-14B (vs 3–5 tok/s on CPU)
$env:HIP_VISIBLE_DEVICES = "0"
$env:HSA_OVERRIDE_GFX_VERSION = "10.3.0"
.\llama-server.exe `
-m "C:\path\to\your\model.gguf" `
--port 8080 `
--host 127.0.0.1 `
-ngl 99 `
-c 16384 `
--parallel 1 `
--threads 16| Flag | Value | Why It Matters |
|---|---|---|
-ngl |
99 |
Offloads ALL layers to GPU. Without this, layers run on CPU |
-c |
16384 |
Context window size. Do not use less than 8192 — smaller values cause JSON truncation on complex outputs |
--parallel |
1 |
Most important fix. Without this, llama.cpp auto-sets n_parallel=4, dividing context by 4 (16384/4 = 4096 per slot). Each request only gets 4096 tokens total. With a ~1500-token prompt, only ~2596 tokens remain for output — not enough for large JSON responses. --parallel 1 gives the full 16384 to every request |
--threads |
16 |
CPU threads for non-GPU operations. Match to your CPU core count |
--host |
127.0.0.1 |
Bind to localhost only for security |
- Do NOT launch on port
11434if Ollama is also running — Ollama uses that port and will conflict - Do NOT use
-c 4096— causes JSON truncation on complex structured outputs - Do NOT omit
--parallel 1— causes silent 4096-token-per-request limit (then_ctx_seq warning: 4096 < 40960log message is the tell) - Do NOT use
enable_thinking: trueor omitenable_thinking: falsefor Qwen3 models — thinking tokens consume your token budget, leaving insufficient space for actual output
llama-server exposes an OpenAI-compatible API. Here is a working example for a streaming request:
const response = await fetch('http://127.0.0.1:8080/v1/chat/completions', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
model: 'qwen3:14b',
messages: [
{ role: 'system', content: systemPrompt },
{ role: 'user', content: userPrompt }
],
stream: true,
temperature: 0.8,
top_p: 0.9,
max_tokens: 8192,
enable_thinking: false // CRITICAL for Qwen3 — prevents think token budget drain
})
});
enable_thinking: falseis a Qwen3-specific parameter. For other models, omit it. For Qwen3, always include it unless you specifically need chain-of-thought reasoning AND have verified your context window is large enough to accommodate it.
This is one of the most common failure paths, so it deserves its own section.
Ollama v0.24+ on Windows with gfx1031 returns size_vram=0 — meaning it allocates zero VRAM for the GPU. The model silently loads and runs entirely on CPU at 3–5 tok/s instead of GPU speeds. There is no error message. You only know it happened by checking tok/s or Ollama's verbose logs.
| Approach | Result |
|---|---|
| Stock Ollama | size_vram=0, CPU fallback, 3-5 tok/s |
| ByronLeeeee Ollama-For-AMD-Installer | Patches system Ollama but does NOT fix size_vram=0 for gfx1031 |
| Ollama RC23 with Vulkan flag | Works for some models but unreliable for production use |
Reference: ByronLeeeee/Ollama-For-AMD-Installer — worth watching for future improvements, but as of the testing period it does not solve gfx1031 GPU detection on Windows.
Use lemonade-sdk llama-server for all LLM inference on unsupported AMD GPUs on Windows. It is the approach that actually delivered GPU speeds.
You can still use Ollama for model management and the chat UI if desired, but route actual inference to llama-server.
If Ollama has a model loaded in VRAM and you need to free memory before running image generation:
await fetch('http://127.0.0.1:11434/api/generate', {
method: 'POST',
body: JSON.stringify({ model: 'qwen3:14b', keep_alive: 0 })
});keep_alive: 0 tells Ollama to immediately evict the model from VRAM.
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.1Confirmed working version: torch 2.9.1+rocm7.14.0a20260524
Reference build for ROCm 7 on Windows: guinmoon/rocm7_builds
After setting the env vars (Section 2), run:
import torch
print(torch.cuda.is_available()) # Must be True
print(torch.cuda.get_device_name(0)) # Should show your GPU nameIf cuda.is_available() returns False:
- Confirm
HIP_VISIBLE_DEVICESandHSA_OVERRIDE_GFX_VERSIONare set in the same terminal session - Confirm ROCm 7.14 is installed
- Confirm AMD Adrenalin driver 24.x is installed (not beta/preview)
$env:PYTORCH_HIP_ALLOC_CONF = "garbage_collection_threshold:0.8,max_split_size_mb:512"This helps PyTorch manage HIP memory more efficiently, reducing OOM errors on 12GB VRAM cards during large model operations.
python main.py `
--listen 127.0.0.1 `
--port 8188 `
--preview-method auto `
--output-directory "C:\path\to\output" `
--disable-auto-launch `
--lowvram `
--disable-dynamic-vram `
--disable-async-offload `
--fp16-vae| Flag | Why It's Needed |
|---|---|
--lowvram |
Required for 12GB VRAM cards running SDXL — without it, OOM during generation |
--fp16-vae |
Prevents VRAM overflow during VAE decode step |
--disable-dynamic-vram |
Dynamic VRAM reallocation causes instability on ROCm Windows |
--disable-async-offload |
Async offload causes race conditions on ROCm Windows — must be disabled |
This is not optional. The standard
VAEDecodenode causeshipErrorLaunchFailureon gfx1031 and hard-crashes the GPU driver.
In every ComfyUI workflow, replace ALL VAEDecode nodes with VAEDecodeTiled.
- Right-click any
VAEDecodenode → Replace Node → VAEDecodeTiled - Tile size of 512 works well on 12GB VRAM
This was discovered after repeated GPU driver crashes during the VAE decode step of SDXL generation. Once switched to VAEDecodeTiled, generation is stable.
- Do NOT use standard
VAEDecode— always useVAEDecodeTiled - Do NOT launch without
--lowvramon 12GB cards with SDXL - Do NOT run ComfyUI and llama-server simultaneously — they exceed 12GB VRAM combined (see Section 7)
With the setup above: ~25 seconds per SDXL render on RX 6700 XT 12GB.
| Workload | VRAM Used |
|---|---|
| LLM — Qwen3-14B GGUF | ~8.4 GB |
| SDXL checkpoint | ~7.0 GB |
| Combined | ~15.4 GB → OOM crash |
You cannot run both simultaneously on a 12GB card.
# Stop llama-server and wait for GPU driver to reclaim VRAM pages
Get-Process -Name "llama-server" -ErrorAction SilentlyContinue | Stop-Process -Force
Start-Sleep -Seconds 3The 3-second sleep is important — the GPU driver needs time to reclaim memory pages after the process is killed. Launching ComfyUI immediately after killing llama-server can still cause OOM errors.
For an AI pipeline that uses both LLM and image generation:
- Start llama-server → run all LLM inference
- Kill llama-server → wait 3 seconds
- Start ComfyUI → run all image generation
- Kill ComfyUI → restart llama-server if more LLM work is needed
Alternatively, reduce model quantization to fit both in VRAM (e.g., use a Q2_K or Q3_K_S GGUF for the LLM) at the cost of quality.
Windows has a default 2-second GPU timeout called TDR (Timeout Detection and Recovery). ROCm operations — especially during model loading, large matrix operations, and VAE decode — can take longer than 2 seconds, triggering a driver crash and recovery cycle.
Symptom: "Display driver stopped responding and has recovered" notification, or the screen briefly goes black during heavy GPU operations.
Run the following as Administrator, then reboot:
# Increase TDR delay to 60 seconds
reg add "HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" /v TdrDelay /t REG_DWORD /d 60 /f
reg add "HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" /v TdrDdiDelay /t REG_DWORD /d 60 /fA reboot is required for this change to take effect. The values are in seconds — 60 seconds gives enough headroom for even the most intensive ROCm operations.
Set these at the top of every launch script or in every new PowerShell session before running ROCm applications:
# REQUIRED — GPU detection and compatibility
$env:HIP_VISIBLE_DEVICES = "0" # Use first GPU
$env:HSA_OVERRIDE_GFX_VERSION = "10.3.0" # Override gfx1031 → gfx1030 compatibility
# RECOMMENDED — PyTorch memory management
$env:PYTORCH_HIP_ALLOC_CONF = "garbage_collection_threshold:0.8,max_split_size_mb:512"To make these permanent (for your user account), you can set them as system environment variables via:
- Windows Settings → System → Advanced system settings → Environment Variables
- Or in PowerShell (run once, persists):
[System.Environment]::SetEnvironmentVariable("HIP_VISIBLE_DEVICES", "0", "User")
[System.Environment]::SetEnvironmentVariable("HSA_OVERRIDE_GFX_VERSION", "10.3.0", "User")
[System.Environment]::SetEnvironmentVariable("PYTORCH_HIP_ALLOC_CONF", "garbage_collection_threshold:0.8,max_split_size_mb:512", "User")Note: After setting permanently, you must open a new terminal session for the variables to be active.
- Use the latest stable AMD Adrenalin driver (24.x or newer)
- Do NOT downgrade drivers for ROCm compatibility — the
HSA_OVERRIDE_GFX_VERSIONtrick works with modern Adrenalin drivers - Do NOT use beta or preview drivers — stick to stable releases for ROCm workloads
- The override trick (
10.3.0) routes gfx1031 through the gfx1030 code path, which is supported in ROCm 7.14
| Error | Cause | Fix |
|---|---|---|
size_vram=0 in Ollama |
gfx1031 not officially supported by Ollama GPU detection | Switch to llama-server (lemonade-sdk) |
hipErrorLaunchFailure in ComfyUI |
Standard VAEDecode node incompatible with gfx1031 | Replace ALL VAEDecode nodes with VAEDecodeTiled |
torch.cuda.is_available() returns False |
Missing env vars in current session | Set HIP_VISIBLE_DEVICES and HSA_OVERRIDE_GFX_VERSION |
| JSON output truncated from LLM | Context window too small, or --parallel splitting context |
Use -c 16384 --parallel 1 |
| "Display driver stopped responding" | Windows TDR timeout (2s default) too short for ROCm ops | Apply TDR registry fix and reboot |
| LLM running at 3–5 tok/s | CPU fallback — GPU not actually being used | Verify env vars are set; switch from Ollama to llama-server |
| OOM crash during image generation | LLM model still resident in VRAM | Kill llama-server and wait 3 seconds before starting ComfyUI |
| Qwen3 output incomplete or truncated | Thinking tokens consuming token budget | Set enable_thinking: false in API request body |
n_ctx_seq warning: 4096 < 40960 in llama-server log |
llama.cpp auto-parallel splitting context 4 ways | Add --parallel 1 to llama-server launch command |
Getting ROCm working on unsupported hardware cost significant time and money to figure out. If this guide helped you, please consider supporting us:
- 📱 Advertise or share our apps on Google Play Store — every download helps: Future AI on Google Play
- 💻 Need programming services? We build AI-powered applications. Contact us at purchase@futureati.app with subject line "Programming Services"
Here is the complete sequence to get everything running from a fresh PowerShell session:
# 1. Set env vars (required every session)
$env:HIP_VISIBLE_DEVICES = "0"
$env:HSA_OVERRIDE_GFX_VERSION = "10.3.0"
$env:PYTORCH_HIP_ALLOC_CONF = "garbage_collection_threshold:0.8,max_split_size_mb:512"
# 2. (Optional) Apply TDR fix if not already done — run once as Admin, then reboot
# reg add "HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" /v TdrDelay /t REG_DWORD /d 60 /f
# reg add "HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" /v TdrDdiDelay /t REG_DWORD /d 60 /f
# 3. Launch llama-server for LLM inference
.\llama-server.exe -m "C:\path\to\model.gguf" --port 8080 --host 127.0.0.1 -ngl 99 -c 16384 --parallel 1 --threads 16
# 4. When done with LLM work, kill llama-server before image generation
Get-Process -Name "llama-server" -ErrorAction SilentlyContinue | Stop-Process -Force
Start-Sleep -Seconds 3
# 5. Launch ComfyUI for image generation
cd C:\path\to\ComfyUI
python main.py --listen 127.0.0.1 --port 8188 --preview-method auto --disable-auto-launch --lowvram --disable-dynamic-vram --disable-async-offload --fp16-vaeSee SETUP.ps1 in this repository for a configurable version of this script.
Powered by Future AI — AI tools built for creators.