AMD ROCm on Windows — Complete Setup Guide for Unsupported GPUs (RX 6700 XT / gfx1031)

A real-world guide from getting zero GPU detection to 33+ tok/s LLM inference and SDXL image generation

This documentation was researched, tested, and written with the assistance of Future AI — an AI-powered assistant available on Google Play Store.

Overview

This guide documents everything that actually worked to get AMD ROCm running on Windows with an RX 6700 XT (gfx1031) — a GPU that is not officially supported by ROCm on Windows. It covers LLM inference at GPU speeds, SDXL image generation with ComfyUI, PyTorch setup, VRAM management, and all the critical bugs and workarounds discovered while building a real AI pipeline.

If you have an unsupported RDNA2 GPU (gfx1031, gfx1030, etc.) on Windows and have been stuck at CPU fallback speeds, this guide is for you.

Hardware & Software Versions
The GPU Compatibility Trick — CRITICAL
LLM Inference — llama-server (lemonade-sdk)
Why NOT to Use Ollama on Unsupported AMD GPUs
PyTorch + ROCm Setup
ComfyUI — SDXL Image Generation on RX 6700 XT
VRAM Management — Running LLM and Image Generation
TDR (GPU Driver Timeout) Fix
Complete Environment Variables Reference
Driver Notes
Common Errors and Fixes
Support Us

1. Hardware & Software Versions

Hardware Used

Component	Spec
GPU	AMD RX 6700 XT 12GB VRAM (gfx1031, RDNA2)
CPU	AMD Ryzen Threadripper 2970WX 24-Core
RAM	32GB
OS	Windows 11 Pro (build 26200)

Confirmed Working Software Versions

Software	Version
ROCm	7.14
PyTorch	2.9.1+rocm7.14.0a20260524
Python	3.12.10
ComfyUI	0.22.0
llama-server	lemonade-sdk b1278 (ROCm 7.14, gfx103X Windows build)
AMD Adrenalin Driver	24.x (latest stable)
Windows	11 Pro build 26200

Models Tested

Model	Type	Size	Performance
Qwen3-14B GGUF (Q4_K_M)	LLM	8.64GB	~33–36 tok/s GPU
animagine-xl-3.1.safetensors	SDXL	~7GB	~25s/render

2. The GPU Compatibility Trick — CRITICAL

The RX 6700 XT (gfx1031) is not officially supported by ROCm on Windows. Without the override below, the GPU will not be detected by any ROCm application — everything silently falls back to CPU.

You must set these two environment variables in every session, before launching any ROCm application:

$env:HIP_VISIBLE_DEVICES = "0"
$env:HSA_OVERRIDE_GFX_VERSION = "10.3.0"

HIP_VISIBLE_DEVICES = "0" — tells HIP to use the first GPU
HSA_OVERRIDE_GFX_VERSION = "10.3.0" — tricks ROCm into treating gfx1031 as gfx1030, which IS supported

These must be set before launching:

llama-server
ComfyUI / python main.py
Any PyTorch script that uses the GPU
Any other ROCm application

If you get CPU fallback speeds, missing GPU detection, or torch.cuda.is_available() returning False — this is almost certainly the fix.

Note: These env vars are session-scoped in PowerShell. They do not persist across terminal windows. You need to set them every time, or add them to your launch scripts. See SETUP.ps1 in this repo for a ready-made launcher.

3. LLM Inference — llama-server (lemonade-sdk)

Why lemonade-sdk llama-server?

For LLM inference on unsupported AMD GPUs on Windows, lemonade-sdk's llama-server is the solution that actually works at GPU speeds. See Section 4 for why Ollama does not work.

Download: lemonade-sdk on GitHub — use build b1278, specifically built for ROCm 7.14 + gfx103X on Windows
Provides an OpenAI-compatible API at port 8080
Confirmed GPU speed: ~33–36 tok/s for Qwen3-14B (vs 3–5 tok/s on CPU)

Exact Launch Command

$env:HIP_VISIBLE_DEVICES = "0"
$env:HSA_OVERRIDE_GFX_VERSION = "10.3.0"

.\llama-server.exe `
  -m "C:\path\to\your\model.gguf" `
  --port 8080 `
  --host 127.0.0.1 `
  -ngl 99 `
  -c 16384 `
  --parallel 1 `
  --threads 16

Critical Flags Explained

Flag	Value	Why It Matters
`-ngl`	`99`	Offloads ALL layers to GPU. Without this, layers run on CPU
`-c`	`16384`	Context window size. Do not use less than 8192 — smaller values cause JSON truncation on complex outputs
`--parallel`	`1`	Most important fix. Without this, llama.cpp auto-sets n_parallel=4, dividing context by 4 (16384/4 = 4096 per slot). Each request only gets 4096 tokens total. With a ~1500-token prompt, only ~2596 tokens remain for output — not enough for large JSON responses. `--parallel 1` gives the full 16384 to every request
`--threads`	`16`	CPU threads for non-GPU operations. Match to your CPU core count
`--host`	`127.0.0.1`	Bind to localhost only for security

What NOT to Do with llama-server

Do NOT launch on port 11434 if Ollama is also running — Ollama uses that port and will conflict
Do NOT use -c 4096 — causes JSON truncation on complex structured outputs
Do NOT omit --parallel 1 — causes silent 4096-token-per-request limit (the n_ctx_seq warning: 4096 < 40960 log message is the tell)
Do NOT use enable_thinking: true or omit enable_thinking: false for Qwen3 models — thinking tokens consume your token budget, leaving insufficient space for actual output

API Usage (OpenAI-Compatible)

llama-server exposes an OpenAI-compatible API. Here is a working example for a streaming request:

const response = await fetch('http://127.0.0.1:8080/v1/chat/completions', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'qwen3:14b',
    messages: [
      { role: 'system', content: systemPrompt },
      { role: 'user', content: userPrompt }
    ],
    stream: true,
    temperature: 0.8,
    top_p: 0.9,
    max_tokens: 8192,
    enable_thinking: false  // CRITICAL for Qwen3 — prevents think token budget drain
  })
});

enable_thinking: false is a Qwen3-specific parameter. For other models, omit it. For Qwen3, always include it unless you specifically need chain-of-thought reasoning AND have verified your context window is large enough to accommodate it.

4. Why NOT to Use Ollama on Unsupported AMD GPUs

This is one of the most common failure paths, so it deserves its own section.

The Core Problem

Ollama v0.24+ on Windows with gfx1031 returns size_vram=0 — meaning it allocates zero VRAM for the GPU. The model silently loads and runs entirely on CPU at 3–5 tok/s instead of GPU speeds. There is no error message. You only know it happened by checking tok/s or Ollama's verbose logs.

What Was Tried and Failed

Approach	Result
Stock Ollama	`size_vram=0`, CPU fallback, 3-5 tok/s
ByronLeeeee Ollama-For-AMD-Installer	Patches system Ollama but does NOT fix `size_vram=0` for gfx1031
Ollama RC23 with Vulkan flag	Works for some models but unreliable for production use

Reference: ByronLeeeee/Ollama-For-AMD-Installer — worth watching for future improvements, but as of the testing period it does not solve gfx1031 GPU detection on Windows.

Recommendation

Use lemonade-sdk llama-server for all LLM inference on unsupported AMD GPUs on Windows. It is the approach that actually delivered GPU speeds.

You can still use Ollama for model management and the chat UI if desired, but route actual inference to llama-server.

VRAM Eviction via Ollama (if you do use it)

If Ollama has a model loaded in VRAM and you need to free memory before running image generation:

await fetch('http://127.0.0.1:11434/api/generate', {
  method: 'POST',
  body: JSON.stringify({ model: 'qwen3:14b', keep_alive: 0 })
});

keep_alive: 0 tells Ollama to immediately evict the model from VRAM.

5. PyTorch + ROCm Setup

Installation

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.1

Confirmed working version: torch 2.9.1+rocm7.14.0a20260524

Reference build for ROCm 7 on Windows: guinmoon/rocm7_builds

Verify GPU Detection

After setting the env vars (Section 2), run:

import torch
print(torch.cuda.is_available())    # Must be True
print(torch.cuda.get_device_name(0))  # Should show your GPU name

If cuda.is_available() returns False:

Confirm HIP_VISIBLE_DEVICES and HSA_OVERRIDE_GFX_VERSION are set in the same terminal session
Confirm ROCm 7.14 is installed
Confirm AMD Adrenalin driver 24.x is installed (not beta/preview)

Additional Recommended Environment Variable

$env:PYTORCH_HIP_ALLOC_CONF = "garbage_collection_threshold:0.8,max_split_size_mb:512"

This helps PyTorch manage HIP memory more efficiently, reducing OOM errors on 12GB VRAM cards during large model operations.

6. ComfyUI — SDXL Image Generation on RX 6700 XT

Launch Command

python main.py `
  --listen 127.0.0.1 `
  --port 8188 `
  --preview-method auto `
  --output-directory "C:\path\to\output" `
  --disable-auto-launch `
  --lowvram `
  --disable-dynamic-vram `
  --disable-async-offload `
  --fp16-vae

Critical Flags

Flag	Why It's Needed
`--lowvram`	Required for 12GB VRAM cards running SDXL — without it, OOM during generation
`--fp16-vae`	Prevents VRAM overflow during VAE decode step
`--disable-dynamic-vram`	Dynamic VRAM reallocation causes instability on ROCm Windows
`--disable-async-offload`	Async offload causes race conditions on ROCm Windows — must be disabled

Critical Patch: VAEDecodeTiled (MANDATORY)

This is not optional. The standard VAEDecode node causes hipErrorLaunchFailure on gfx1031 and hard-crashes the GPU driver.

In every ComfyUI workflow, replace ALL VAEDecode nodes with VAEDecodeTiled.

Right-click any VAEDecode node → Replace Node → VAEDecodeTiled
Tile size of 512 works well on 12GB VRAM

This was discovered after repeated GPU driver crashes during the VAE decode step of SDXL generation. Once switched to VAEDecodeTiled, generation is stable.

What NOT to Do in ComfyUI

Do NOT use standard VAEDecode — always use VAEDecodeTiled
Do NOT launch without --lowvram on 12GB cards with SDXL
Do NOT run ComfyUI and llama-server simultaneously — they exceed 12GB VRAM combined (see Section 7)

Performance

With the setup above: ~25 seconds per SDXL render on RX 6700 XT 12GB.

7. VRAM Management — Running LLM and Image Generation

The Problem

Workload	VRAM Used
LLM — Qwen3-14B GGUF	~8.4 GB
SDXL checkpoint	~7.0 GB
Combined	~15.4 GB → OOM crash

You cannot run both simultaneously on a 12GB card.

Solution: Kill llama-server Before Image Generation

# Stop llama-server and wait for GPU driver to reclaim VRAM pages
Get-Process -Name "llama-server" -ErrorAction SilentlyContinue | Stop-Process -Force
Start-Sleep -Seconds 3

The 3-second sleep is important — the GPU driver needs time to reclaim memory pages after the process is killed. Launching ComfyUI immediately after killing llama-server can still cause OOM errors.

Workflow Pattern

For an AI pipeline that uses both LLM and image generation:

Start llama-server → run all LLM inference
Kill llama-server → wait 3 seconds
Start ComfyUI → run all image generation
Kill ComfyUI → restart llama-server if more LLM work is needed

Alternatively, reduce model quantization to fit both in VRAM (e.g., use a Q2_K or Q3_K_S GGUF for the LLM) at the cost of quality.

8. TDR (GPU Driver Timeout) Fix

Windows has a default 2-second GPU timeout called TDR (Timeout Detection and Recovery). ROCm operations — especially during model loading, large matrix operations, and VAE decode — can take longer than 2 seconds, triggering a driver crash and recovery cycle.

Symptom: "Display driver stopped responding and has recovered" notification, or the screen briefly goes black during heavy GPU operations.

Fix

Run the following as Administrator, then reboot:

# Increase TDR delay to 60 seconds
reg add "HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" /v TdrDelay /t REG_DWORD /d 60 /f
reg add "HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" /v TdrDdiDelay /t REG_DWORD /d 60 /f

A reboot is required for this change to take effect. The values are in seconds — 60 seconds gives enough headroom for even the most intensive ROCm operations.

9. Complete Environment Variables Reference

Set these at the top of every launch script or in every new PowerShell session before running ROCm applications:

# REQUIRED — GPU detection and compatibility
$env:HIP_VISIBLE_DEVICES = "0"                  # Use first GPU
$env:HSA_OVERRIDE_GFX_VERSION = "10.3.0"        # Override gfx1031 → gfx1030 compatibility

# RECOMMENDED — PyTorch memory management
$env:PYTORCH_HIP_ALLOC_CONF = "garbage_collection_threshold:0.8,max_split_size_mb:512"

To make these permanent (for your user account), you can set them as system environment variables via:

Windows Settings → System → Advanced system settings → Environment Variables
Or in PowerShell (run once, persists):

[System.Environment]::SetEnvironmentVariable("HIP_VISIBLE_DEVICES", "0", "User")
[System.Environment]::SetEnvironmentVariable("HSA_OVERRIDE_GFX_VERSION", "10.3.0", "User")
[System.Environment]::SetEnvironmentVariable("PYTORCH_HIP_ALLOC_CONF", "garbage_collection_threshold:0.8,max_split_size_mb:512", "User")

Note: After setting permanently, you must open a new terminal session for the variables to be active.

10. Driver Notes

Use the latest stable AMD Adrenalin driver (24.x or newer)
Do NOT downgrade drivers for ROCm compatibility — the HSA_OVERRIDE_GFX_VERSION trick works with modern Adrenalin drivers
Do NOT use beta or preview drivers — stick to stable releases for ROCm workloads
The override trick (10.3.0) routes gfx1031 through the gfx1030 code path, which is supported in ROCm 7.14

11. Common Errors and Fixes

Error	Cause	Fix
`size_vram=0` in Ollama	gfx1031 not officially supported by Ollama GPU detection	Switch to llama-server (lemonade-sdk)
`hipErrorLaunchFailure` in ComfyUI	Standard VAEDecode node incompatible with gfx1031	Replace ALL VAEDecode nodes with VAEDecodeTiled
`torch.cuda.is_available()` returns `False`	Missing env vars in current session	Set `HIP_VISIBLE_DEVICES` and `HSA_OVERRIDE_GFX_VERSION`
JSON output truncated from LLM	Context window too small, or `--parallel` splitting context	Use `-c 16384 --parallel 1`
"Display driver stopped responding"	Windows TDR timeout (2s default) too short for ROCm ops	Apply TDR registry fix and reboot
LLM running at 3–5 tok/s	CPU fallback — GPU not actually being used	Verify env vars are set; switch from Ollama to llama-server
OOM crash during image generation	LLM model still resident in VRAM	Kill llama-server and wait 3 seconds before starting ComfyUI
Qwen3 output incomplete or truncated	Thinking tokens consuming token budget	Set `enable_thinking: false` in API request body
`n_ctx_seq warning: 4096 < 40960` in llama-server log	llama.cpp auto-parallel splitting context 4 ways	Add `--parallel 1` to llama-server launch command

Support Us

Getting ROCm working on unsupported hardware cost significant time and money to figure out. If this guide helped you, please consider supporting us:

📱 Advertise or share our apps on Google Play Store — every download helps: Future AI on Google Play
💻 Need programming services? We build AI-powered applications. Contact us at purchase@futureati.app with subject line "Programming Services"

Quick Reference: Full Startup Sequence

Here is the complete sequence to get everything running from a fresh PowerShell session:

# 1. Set env vars (required every session)
$env:HIP_VISIBLE_DEVICES = "0"
$env:HSA_OVERRIDE_GFX_VERSION = "10.3.0"
$env:PYTORCH_HIP_ALLOC_CONF = "garbage_collection_threshold:0.8,max_split_size_mb:512"

# 2. (Optional) Apply TDR fix if not already done — run once as Admin, then reboot
# reg add "HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" /v TdrDelay /t REG_DWORD /d 60 /f
# reg add "HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" /v TdrDdiDelay /t REG_DWORD /d 60 /f

# 3. Launch llama-server for LLM inference
.\llama-server.exe -m "C:\path\to\model.gguf" --port 8080 --host 127.0.0.1 -ngl 99 -c 16384 --parallel 1 --threads 16

# 4. When done with LLM work, kill llama-server before image generation
Get-Process -Name "llama-server" -ErrorAction SilentlyContinue | Stop-Process -Force
Start-Sleep -Seconds 3

# 5. Launch ComfyUI for image generation
cd C:\path\to\ComfyUI
python main.py --listen 127.0.0.1 --port 8188 --preview-method auto --disable-auto-launch --lowvram --disable-dynamic-vram --disable-async-offload --fp16-vae

See SETUP.ps1 in this repository for a configurable version of this script.

Powered by Future AI — AI tools built for creators.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
README.md		README.md
SETUP.ps1		SETUP.ps1

Folders and files

Latest commit

History

Repository files navigation

AMD ROCm on Windows — Complete Setup Guide for Unsupported GPUs (RX 6700 XT / gfx1031)

Overview

Table of Contents

1. Hardware & Software Versions

Hardware Used

Confirmed Working Software Versions

Models Tested

2. The GPU Compatibility Trick — CRITICAL

3. LLM Inference — llama-server (lemonade-sdk)

Why lemonade-sdk llama-server?

Exact Launch Command

Critical Flags Explained

What NOT to Do with llama-server

API Usage (OpenAI-Compatible)

4. Why NOT to Use Ollama on Unsupported AMD GPUs

The Core Problem

What Was Tried and Failed

Recommendation

VRAM Eviction via Ollama (if you do use it)

5. PyTorch + ROCm Setup

Installation

Verify GPU Detection

Additional Recommended Environment Variable

6. ComfyUI — SDXL Image Generation on RX 6700 XT

Launch Command

Critical Flags

Critical Patch: VAEDecodeTiled (MANDATORY)

What NOT to Do in ComfyUI

Performance

7. VRAM Management — Running LLM and Image Generation

The Problem

Solution: Kill llama-server Before Image Generation

Workflow Pattern

8. TDR (GPU Driver Timeout) Fix

Fix

9. Complete Environment Variables Reference

10. Driver Notes

11. Common Errors and Fixes

Support Us

Quick Reference: Full Startup Sequence

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages