Skip to content

fpresiado/Future-AI-ROCM-support

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 

Repository files navigation

AMD ROCm on Windows — Complete Setup Guide for Unsupported GPUs (RX 6700 XT / gfx1031)

A real-world guide from getting zero GPU detection to 33+ tok/s LLM inference and SDXL image generation


This documentation was researched, tested, and written with the assistance of Future AI — an AI-powered assistant available on Google Play Store.


Overview

This guide documents everything that actually worked to get AMD ROCm running on Windows with an RX 6700 XT (gfx1031) — a GPU that is not officially supported by ROCm on Windows. It covers LLM inference at GPU speeds, SDXL image generation with ComfyUI, PyTorch setup, VRAM management, and all the critical bugs and workarounds discovered while building a real AI pipeline.

If you have an unsupported RDNA2 GPU (gfx1031, gfx1030, etc.) on Windows and have been stuck at CPU fallback speeds, this guide is for you.


Table of Contents

  1. Hardware & Software Versions
  2. The GPU Compatibility Trick — CRITICAL
  3. LLM Inference — llama-server (lemonade-sdk)
  4. Why NOT to Use Ollama on Unsupported AMD GPUs
  5. PyTorch + ROCm Setup
  6. ComfyUI — SDXL Image Generation on RX 6700 XT
  7. VRAM Management — Running LLM and Image Generation
  8. TDR (GPU Driver Timeout) Fix
  9. Complete Environment Variables Reference
  10. Driver Notes
  11. Common Errors and Fixes
  12. Support Us

1. Hardware & Software Versions

Hardware Used

Component Spec
GPU AMD RX 6700 XT 12GB VRAM (gfx1031, RDNA2)
CPU AMD Ryzen Threadripper 2970WX 24-Core
RAM 32GB
OS Windows 11 Pro (build 26200)

Confirmed Working Software Versions

Software Version
ROCm 7.14
PyTorch 2.9.1+rocm7.14.0a20260524
Python 3.12.10
ComfyUI 0.22.0
llama-server lemonade-sdk b1278 (ROCm 7.14, gfx103X Windows build)
AMD Adrenalin Driver 24.x (latest stable)
Windows 11 Pro build 26200

Models Tested

Model Type Size Performance
Qwen3-14B GGUF (Q4_K_M) LLM 8.64GB ~33–36 tok/s GPU
animagine-xl-3.1.safetensors SDXL ~7GB ~25s/render

2. The GPU Compatibility Trick — CRITICAL

The RX 6700 XT (gfx1031) is not officially supported by ROCm on Windows. Without the override below, the GPU will not be detected by any ROCm application — everything silently falls back to CPU.

You must set these two environment variables in every session, before launching any ROCm application:

$env:HIP_VISIBLE_DEVICES = "0"
$env:HSA_OVERRIDE_GFX_VERSION = "10.3.0"
  • HIP_VISIBLE_DEVICES = "0" — tells HIP to use the first GPU
  • HSA_OVERRIDE_GFX_VERSION = "10.3.0" — tricks ROCm into treating gfx1031 as gfx1030, which IS supported

These must be set before launching:

  • llama-server
  • ComfyUI / python main.py
  • Any PyTorch script that uses the GPU
  • Any other ROCm application

If you get CPU fallback speeds, missing GPU detection, or torch.cuda.is_available() returning False — this is almost certainly the fix.

Note: These env vars are session-scoped in PowerShell. They do not persist across terminal windows. You need to set them every time, or add them to your launch scripts. See SETUP.ps1 in this repo for a ready-made launcher.


3. LLM Inference — llama-server (lemonade-sdk)

Why lemonade-sdk llama-server?

For LLM inference on unsupported AMD GPUs on Windows, lemonade-sdk's llama-server is the solution that actually works at GPU speeds. See Section 4 for why Ollama does not work.

  • Download: lemonade-sdk on GitHub — use build b1278, specifically built for ROCm 7.14 + gfx103X on Windows
  • Provides an OpenAI-compatible API at port 8080
  • Confirmed GPU speed: ~33–36 tok/s for Qwen3-14B (vs 3–5 tok/s on CPU)

Exact Launch Command

$env:HIP_VISIBLE_DEVICES = "0"
$env:HSA_OVERRIDE_GFX_VERSION = "10.3.0"

.\llama-server.exe `
  -m "C:\path\to\your\model.gguf" `
  --port 8080 `
  --host 127.0.0.1 `
  -ngl 99 `
  -c 16384 `
  --parallel 1 `
  --threads 16

Critical Flags Explained

Flag Value Why It Matters
-ngl 99 Offloads ALL layers to GPU. Without this, layers run on CPU
-c 16384 Context window size. Do not use less than 8192 — smaller values cause JSON truncation on complex outputs
--parallel 1 Most important fix. Without this, llama.cpp auto-sets n_parallel=4, dividing context by 4 (16384/4 = 4096 per slot). Each request only gets 4096 tokens total. With a ~1500-token prompt, only ~2596 tokens remain for output — not enough for large JSON responses. --parallel 1 gives the full 16384 to every request
--threads 16 CPU threads for non-GPU operations. Match to your CPU core count
--host 127.0.0.1 Bind to localhost only for security

What NOT to Do with llama-server

  • Do NOT launch on port 11434 if Ollama is also running — Ollama uses that port and will conflict
  • Do NOT use -c 4096 — causes JSON truncation on complex structured outputs
  • Do NOT omit --parallel 1 — causes silent 4096-token-per-request limit (the n_ctx_seq warning: 4096 < 40960 log message is the tell)
  • Do NOT use enable_thinking: true or omit enable_thinking: false for Qwen3 models — thinking tokens consume your token budget, leaving insufficient space for actual output

API Usage (OpenAI-Compatible)

llama-server exposes an OpenAI-compatible API. Here is a working example for a streaming request:

const response = await fetch('http://127.0.0.1:8080/v1/chat/completions', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'qwen3:14b',
    messages: [
      { role: 'system', content: systemPrompt },
      { role: 'user', content: userPrompt }
    ],
    stream: true,
    temperature: 0.8,
    top_p: 0.9,
    max_tokens: 8192,
    enable_thinking: false  // CRITICAL for Qwen3 — prevents think token budget drain
  })
});

enable_thinking: false is a Qwen3-specific parameter. For other models, omit it. For Qwen3, always include it unless you specifically need chain-of-thought reasoning AND have verified your context window is large enough to accommodate it.


4. Why NOT to Use Ollama on Unsupported AMD GPUs

This is one of the most common failure paths, so it deserves its own section.

The Core Problem

Ollama v0.24+ on Windows with gfx1031 returns size_vram=0 — meaning it allocates zero VRAM for the GPU. The model silently loads and runs entirely on CPU at 3–5 tok/s instead of GPU speeds. There is no error message. You only know it happened by checking tok/s or Ollama's verbose logs.

What Was Tried and Failed

Approach Result
Stock Ollama size_vram=0, CPU fallback, 3-5 tok/s
ByronLeeeee Ollama-For-AMD-Installer Patches system Ollama but does NOT fix size_vram=0 for gfx1031
Ollama RC23 with Vulkan flag Works for some models but unreliable for production use

Reference: ByronLeeeee/Ollama-For-AMD-Installer — worth watching for future improvements, but as of the testing period it does not solve gfx1031 GPU detection on Windows.

Recommendation

Use lemonade-sdk llama-server for all LLM inference on unsupported AMD GPUs on Windows. It is the approach that actually delivered GPU speeds.

You can still use Ollama for model management and the chat UI if desired, but route actual inference to llama-server.

VRAM Eviction via Ollama (if you do use it)

If Ollama has a model loaded in VRAM and you need to free memory before running image generation:

await fetch('http://127.0.0.1:11434/api/generate', {
  method: 'POST',
  body: JSON.stringify({ model: 'qwen3:14b', keep_alive: 0 })
});

keep_alive: 0 tells Ollama to immediately evict the model from VRAM.


5. PyTorch + ROCm Setup

Installation

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.1

Confirmed working version: torch 2.9.1+rocm7.14.0a20260524

Reference build for ROCm 7 on Windows: guinmoon/rocm7_builds

Verify GPU Detection

After setting the env vars (Section 2), run:

import torch
print(torch.cuda.is_available())    # Must be True
print(torch.cuda.get_device_name(0))  # Should show your GPU name

If cuda.is_available() returns False:

  1. Confirm HIP_VISIBLE_DEVICES and HSA_OVERRIDE_GFX_VERSION are set in the same terminal session
  2. Confirm ROCm 7.14 is installed
  3. Confirm AMD Adrenalin driver 24.x is installed (not beta/preview)

Additional Recommended Environment Variable

$env:PYTORCH_HIP_ALLOC_CONF = "garbage_collection_threshold:0.8,max_split_size_mb:512"

This helps PyTorch manage HIP memory more efficiently, reducing OOM errors on 12GB VRAM cards during large model operations.


6. ComfyUI — SDXL Image Generation on RX 6700 XT

Launch Command

python main.py `
  --listen 127.0.0.1 `
  --port 8188 `
  --preview-method auto `
  --output-directory "C:\path\to\output" `
  --disable-auto-launch `
  --lowvram `
  --disable-dynamic-vram `
  --disable-async-offload `
  --fp16-vae

Critical Flags

Flag Why It's Needed
--lowvram Required for 12GB VRAM cards running SDXL — without it, OOM during generation
--fp16-vae Prevents VRAM overflow during VAE decode step
--disable-dynamic-vram Dynamic VRAM reallocation causes instability on ROCm Windows
--disable-async-offload Async offload causes race conditions on ROCm Windows — must be disabled

Critical Patch: VAEDecodeTiled (MANDATORY)

This is not optional. The standard VAEDecode node causes hipErrorLaunchFailure on gfx1031 and hard-crashes the GPU driver.

In every ComfyUI workflow, replace ALL VAEDecode nodes with VAEDecodeTiled.

  • Right-click any VAEDecode node → Replace Node → VAEDecodeTiled
  • Tile size of 512 works well on 12GB VRAM

This was discovered after repeated GPU driver crashes during the VAE decode step of SDXL generation. Once switched to VAEDecodeTiled, generation is stable.

What NOT to Do in ComfyUI

  • Do NOT use standard VAEDecode — always use VAEDecodeTiled
  • Do NOT launch without --lowvram on 12GB cards with SDXL
  • Do NOT run ComfyUI and llama-server simultaneously — they exceed 12GB VRAM combined (see Section 7)

Performance

With the setup above: ~25 seconds per SDXL render on RX 6700 XT 12GB.


7. VRAM Management — Running LLM and Image Generation

The Problem

Workload VRAM Used
LLM — Qwen3-14B GGUF ~8.4 GB
SDXL checkpoint ~7.0 GB
Combined ~15.4 GB → OOM crash

You cannot run both simultaneously on a 12GB card.

Solution: Kill llama-server Before Image Generation

# Stop llama-server and wait for GPU driver to reclaim VRAM pages
Get-Process -Name "llama-server" -ErrorAction SilentlyContinue | Stop-Process -Force
Start-Sleep -Seconds 3

The 3-second sleep is important — the GPU driver needs time to reclaim memory pages after the process is killed. Launching ComfyUI immediately after killing llama-server can still cause OOM errors.

Workflow Pattern

For an AI pipeline that uses both LLM and image generation:

  1. Start llama-server → run all LLM inference
  2. Kill llama-server → wait 3 seconds
  3. Start ComfyUI → run all image generation
  4. Kill ComfyUI → restart llama-server if more LLM work is needed

Alternatively, reduce model quantization to fit both in VRAM (e.g., use a Q2_K or Q3_K_S GGUF for the LLM) at the cost of quality.


8. TDR (GPU Driver Timeout) Fix

Windows has a default 2-second GPU timeout called TDR (Timeout Detection and Recovery). ROCm operations — especially during model loading, large matrix operations, and VAE decode — can take longer than 2 seconds, triggering a driver crash and recovery cycle.

Symptom: "Display driver stopped responding and has recovered" notification, or the screen briefly goes black during heavy GPU operations.

Fix

Run the following as Administrator, then reboot:

# Increase TDR delay to 60 seconds
reg add "HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" /v TdrDelay /t REG_DWORD /d 60 /f
reg add "HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" /v TdrDdiDelay /t REG_DWORD /d 60 /f

A reboot is required for this change to take effect. The values are in seconds — 60 seconds gives enough headroom for even the most intensive ROCm operations.


9. Complete Environment Variables Reference

Set these at the top of every launch script or in every new PowerShell session before running ROCm applications:

# REQUIRED — GPU detection and compatibility
$env:HIP_VISIBLE_DEVICES = "0"                  # Use first GPU
$env:HSA_OVERRIDE_GFX_VERSION = "10.3.0"        # Override gfx1031 → gfx1030 compatibility

# RECOMMENDED — PyTorch memory management
$env:PYTORCH_HIP_ALLOC_CONF = "garbage_collection_threshold:0.8,max_split_size_mb:512"

To make these permanent (for your user account), you can set them as system environment variables via:

  • Windows Settings → System → Advanced system settings → Environment Variables
  • Or in PowerShell (run once, persists):
[System.Environment]::SetEnvironmentVariable("HIP_VISIBLE_DEVICES", "0", "User")
[System.Environment]::SetEnvironmentVariable("HSA_OVERRIDE_GFX_VERSION", "10.3.0", "User")
[System.Environment]::SetEnvironmentVariable("PYTORCH_HIP_ALLOC_CONF", "garbage_collection_threshold:0.8,max_split_size_mb:512", "User")

Note: After setting permanently, you must open a new terminal session for the variables to be active.


10. Driver Notes

  • Use the latest stable AMD Adrenalin driver (24.x or newer)
  • Do NOT downgrade drivers for ROCm compatibility — the HSA_OVERRIDE_GFX_VERSION trick works with modern Adrenalin drivers
  • Do NOT use beta or preview drivers — stick to stable releases for ROCm workloads
  • The override trick (10.3.0) routes gfx1031 through the gfx1030 code path, which is supported in ROCm 7.14

11. Common Errors and Fixes

Error Cause Fix
size_vram=0 in Ollama gfx1031 not officially supported by Ollama GPU detection Switch to llama-server (lemonade-sdk)
hipErrorLaunchFailure in ComfyUI Standard VAEDecode node incompatible with gfx1031 Replace ALL VAEDecode nodes with VAEDecodeTiled
torch.cuda.is_available() returns False Missing env vars in current session Set HIP_VISIBLE_DEVICES and HSA_OVERRIDE_GFX_VERSION
JSON output truncated from LLM Context window too small, or --parallel splitting context Use -c 16384 --parallel 1
"Display driver stopped responding" Windows TDR timeout (2s default) too short for ROCm ops Apply TDR registry fix and reboot
LLM running at 3–5 tok/s CPU fallback — GPU not actually being used Verify env vars are set; switch from Ollama to llama-server
OOM crash during image generation LLM model still resident in VRAM Kill llama-server and wait 3 seconds before starting ComfyUI
Qwen3 output incomplete or truncated Thinking tokens consuming token budget Set enable_thinking: false in API request body
n_ctx_seq warning: 4096 < 40960 in llama-server log llama.cpp auto-parallel splitting context 4 ways Add --parallel 1 to llama-server launch command

Support Us

Getting ROCm working on unsupported hardware cost significant time and money to figure out. If this guide helped you, please consider supporting us:

  • 📱 Advertise or share our apps on Google Play Store — every download helps: Future AI on Google Play
  • 💻 Need programming services? We build AI-powered applications. Contact us at purchase@futureati.app with subject line "Programming Services"

Quick Reference: Full Startup Sequence

Here is the complete sequence to get everything running from a fresh PowerShell session:

# 1. Set env vars (required every session)
$env:HIP_VISIBLE_DEVICES = "0"
$env:HSA_OVERRIDE_GFX_VERSION = "10.3.0"
$env:PYTORCH_HIP_ALLOC_CONF = "garbage_collection_threshold:0.8,max_split_size_mb:512"

# 2. (Optional) Apply TDR fix if not already done — run once as Admin, then reboot
# reg add "HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" /v TdrDelay /t REG_DWORD /d 60 /f
# reg add "HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" /v TdrDdiDelay /t REG_DWORD /d 60 /f

# 3. Launch llama-server for LLM inference
.\llama-server.exe -m "C:\path\to\model.gguf" --port 8080 --host 127.0.0.1 -ngl 99 -c 16384 --parallel 1 --threads 16

# 4. When done with LLM work, kill llama-server before image generation
Get-Process -Name "llama-server" -ErrorAction SilentlyContinue | Stop-Process -Force
Start-Sleep -Seconds 3

# 5. Launch ComfyUI for image generation
cd C:\path\to\ComfyUI
python main.py --listen 127.0.0.1 --port 8188 --preview-method auto --disable-auto-launch --lowvram --disable-dynamic-vram --disable-async-offload --fp16-vae

See SETUP.ps1 in this repository for a configurable version of this script.


Powered by Future AI — AI tools built for creators.

About

Complete guide to running AMD ROCm on unsupported GPUs (RX 6700 XT / gfx1031) on Windows — LLM inference + SDXL image generation

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors