# Running RAG on AMD Radeon GPU

`Date`: 2025/1/15 | `Author`: Alex He

`Tag`: RAG, ROCm, LlamaIndex, Ollama

`Category`: AI, Inference

AMD Radeon GPUs are officially supported by [ROCm](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/), ensuring compatibility with industry-standard software frameworks. This Jupyter notebook uses [Ollama](https://ollama.com/) and [LlamaIndex](https://docs.llamaindex.ai/en/stable/), powered by ROCm, to build a Retrieval-Augmented Generation (RAG) application. LlamaIndex builds the pipeline, from reading PDFs to indexing the data and creating a query engine, while Ollama provides the backend service for Large Language Model (LLM) inference.

## Install ROCm

First, install ROCm for Radeon GPUs by following https://rocm.docs.amd.com/projects/radeon/en/latest/docs/install/native_linux/install-radeon.html.

## Install Ollama

Ollama supports AMD ROCm GPUs out of the box. We use it as the LLM inference provider in this Jupyter notebook. A single command installs the latest version on Linux.

The Ollama service starts automatically after installation. It can also be launched by hand, or checked, as shown below; if the service is already running, `ollama serve` reports that the address is already in use.

In [1]:
!ollama serve

Error: listen tcp 127.0.0.1:11434: bind: address already in use
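
The "address already in use" message above suggests that a service is already bound to port 11434. A minimal sketch of the same check from Python, using only the standard library, is shown below. The helper name `is_listening` is hypothetical, and the host and port are taken from the log line above; an open port only shows that something is listening there, not that it is Ollama.

```python
import socket


def is_listening(host: str = "127.0.0.1", port: int = 11434, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError (refused, timed out, ...) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `is_listening()` with the defaults probes the address shown in the error message.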


Then pull llama3.1 (8b) as the RAG LLM and nomic-embed-text as the embedding model, or search https://ollama.com/search for another model that suits your needs.

In [2]:
!ollama pull nomic-embed-text

pulling manifest
pulling 970aa74c0a90... 100% ▕████████████████▏ 274 MB                         
pulling c71d239df917... 100% ▕████████

In [3]:
!ollama list llama3.1

NAME           ID              SIZE      MODIFIED     
llama3.1:8b    42182419e950    4.7 GB    2 months ago    
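
The same listing can be obtained from Python through Ollama's HTTP API. The sketch below assumes the service is reachable at the default address seen earlier and that the `/api/tags` endpoint returns a JSON body with a `models` list; the helper names are hypothetical.

```python
import json
from typing import List
from urllib.request import urlopen

OLLAMA_URL = "http://127.0.0.1:11434"  # assumed default address of the local service


def model_names(tags_payload: dict) -> List[str]:
    """Extract model names from an /api/tags response body."""
    return [m["name"] for m in tags_payload.get("models", [])]


def list_local_models(base_url: str = OLLAMA_URL) -> List[str]:
    """Ask a running Ollama service which models are available locally."""
    with urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return model_names(json.load(resp))
```

With the service running, `list_local_models()` should include the models pulled above.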


For more information about using Ollama, see https://github.com/ollama/ollama.

## Install Torch [optional]

For this example, PyTorch is optional; we only use it to query the GPU and check that it works.

In [4]:
!pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2

Looking in indexes: https://download.pytorch.org/whl/rocm6.2


Check the installed torch/ROCm packages:

In [5]:
!pip list | grep torch

pytorch-triton-rocm                      3.1.0
torch                                    2.5.1+rocm6.2
torchaudio                               2.5.1+rocm6.2
torchvision                              0.20.1+rocm6.2
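
The `+rocm6.2` suffix in the versions above indicates that the wheels come from the ROCm index. As a small sketch, the suffix can be parsed with a regular expression; `rocm_tag` is a hypothetical helper, and the pattern assumes the suffix always has the form `+rocm<version>`.

```python
import re


def rocm_tag(version: str):
    """Return the ROCm version from a wheel version such as '2.5.1+rocm6.2', else None."""
    match = re.search(r"\+rocm(\d+(?:\.\d+)*)$", version)
    return match.group(1) if match else None
```

In the notebook this could be called as `rocm_tag(torch.__version__)` once `torch` is imported. PyTorch also exposes `torch.version.hip`, which is set on ROCm builds.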


In [6]:
import os
import torch

In [7]:
# Query GPU
if torch.cuda.is_available():
    device = torch.device("cuda")          # ROCm builds expose the GPU through the "cuda" device type
    print('Using GPU:', torch.cuda.get_device_name(0))
    print('GPU properties:', torch.cuda.get_device_properties(0))
else:
    device = torch.device("cpu")
    print('Using CPU')

Using GPU: AMD Radeon PRO W7900
GPU properties: _CudaDeviceProperties(name='AMD Radeon PRO W7900', major=11, minor=0, gcnArchName='gfx1100', total_memory=46064MB, multi_processor_count=48, uuid=36373564-3232-3936-3661-373939393363, L2_cache_size=6MB)


## Build RAG with LlamaIndex & Ollama

Convert the cell below to `Code` to install the llama-index packages the first time you run the notebook. Afterwards, change it back to `Raw` or `Markdown` to save time on later runs.

In [8]:
%pip install llama-index
%pip install llama-index-llms-ollama
%pip install llama-index-embeddings-ollama
%pip install llama-index-vector-stores-chroma
%pip install chromadb

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [9]:
!pip list | grep llama-index

llama-index                              0.12.10
llama-index-agent-openai                 0.4.1
llama-index-cli                          0.4.0
llama-index-core                         0.12.10.post1
llama-index-embeddings-ollama            0.5.0
llama-index-embeddings-openai            0.3.1
llama-index-indices-managed-llama-cloud  0.6.3
llama-index-llms-ollama                  0.5.0
llama-index-llms-openai                  0.3.13
llama-index-multi-modal-llms-openai      0.4.2
llama-index-program-openai               0.3.1
llama-index-question-gen-openai          0.3.0
llama-index-readers-file                 0.4.3
llama-index-readers-llama-parse          0.4.0
llama-index-readers-web                  0.3.3
llama-index-vector-stores-chroma         0.4.1


In [10]:
import chromadb
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import StorageContext
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.llms.ollama import Ollama

LlamaIndex implements the Ollama client interface to interact with the Ollama service. Here we request both the embedding and the LLM service from Ollama.

In [11]:
# Set embedding model
emb_fn="nomic-embed-text"
Settings.embed_model = OllamaEmbedding(model_name=emb_fn)

# Set ollama model
Settings.llm = Ollama(model="llama3.1:8b", request_timeout=120.0)
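
To make the client/service relationship more concrete, the sketch below builds the kind of HTTP request an embedding call boils down to. It is not LlamaIndex's implementation: it assumes Ollama's `/api/embeddings` endpoint with a `model`/`prompt` JSON body and the default local address, and the helper names are hypothetical.

```python
import json
from urllib.request import Request, urlopen


def build_embedding_request(text, model="nomic-embed-text",
                            base_url="http://127.0.0.1:11434"):
    """Build (but do not send) a POST request asking Ollama to embed `text`."""
    body = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    return Request(
        f"{base_url}/api/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(text, **kwargs):
    """Send the request to a running Ollama service and return the vector."""
    with urlopen(build_embedding_request(text, **kwargs), timeout=60) as resp:
        return json.load(resp)["embedding"]
```

In the notebook itself, `OllamaEmbedding` takes care of this exchange, so the helpers are only useful for debugging the service directly.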

Here we use the AMD Radeon ROCm usage documentation PDF as the RAG index source. Download it from "https://rocm.docs.amd.com/_/downloads/radeon/en/latest/pdf/" and save it in the ./data directory.

SimpleDirectoryReader is the most commonly used data connector and works out of the box. Simply pass in an input directory or a list of files, and it selects the best file reader based on the file extensions.
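
Before loading, it can help to confirm that the PDF actually landed in the input directory. The following is a minimal standard-library sketch; `find_pdfs` is a hypothetical helper and `./data` is the directory named above.

```python
from pathlib import Path


def find_pdfs(data_dir="./data"):
    """Return the PDF files directly inside data_dir (empty list if it is missing)."""
    root = Path(data_dir)
    if not root.is_dir():
        return []
    # Match the extension case-insensitively so 'Guide.PDF' is found too
    return sorted(p for p in root.iterdir() if p.is_file() and p.suffix.lower() == ".pdf")
```

An empty result from `find_pdfs()` means the reader would have nothing to index from that directory.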

In [12]:
documents = SimpleDirectoryReader(input_dir="./data/").load_data()

In [13]:
# Check the content
print(documents[10])

Doc ID: 7277da52-3909-4eec-9479-8ce40262f898
Text: CHAPTER TWO HOW TO GUIDES These guides walk you through the
various installation processes required to pair ROCm™ with the latest
high-end AMD Radeon™ 7000 series desktop GPUs. Linux WSL Linux How to
guide WSL How to guide 2.1 Linux How to guide - Use ROCm on Radeon
GPUs This guide walks you through the various installation processes
required to...


Chroma is an AI-native open-source vector database focused on developer productivity and happiness, and it is well integrated with LlamaIndex. We use it to build the vector store from the PDF file.

In [14]:
# Initialize client and save data
db = chromadb.PersistentClient(path="./chroma_db/rocm_db")
# create collection
chroma_collection = db.get_or_create_collection("rocm_db")

# assign chroma as the vector_store to the context
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

In [15]:
# Build vector index per-document
vector_index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    transformations=[SentenceSplitter(chunk_size=512, chunk_overlap=20)],
)
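
The `chunk_size` and `chunk_overlap` arguments control how much text goes into each chunk and how much adjacent chunks share. The sketch below illustrates only that idea with a plain sliding window over a token list. It is a simplification: `SentenceSplitter` itself measures length in tokens and prefers to break at sentence boundaries, which this hypothetical `sliding_chunks` helper does not attempt.

```python
def sliding_chunks(tokens, chunk_size, chunk_overlap):
    """Split tokens into windows of chunk_size that share chunk_overlap items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
        start += step
    return chunks
```

For example, ten tokens with `chunk_size=4` and `chunk_overlap=1` yield three chunks, each starting with the last token of the previous one.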

Next, create the query engine with a response mode. Choose the response mode that suits your purpose; the options are described at https://docs.llamaindex.ai/en/v0.10.19/module_guides/deploying/query_engine/response_modes.html

In [16]:
# Query your data
query_engine = vector_index.as_query_engine(response_mode="refine", similarity_top_k=10)
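
The `similarity_top_k=10` argument asks the retriever for the ten stored chunks closest to the query embedding. As a conceptual sketch only, the brute-force version of that search looks like the code below. The distance metric and index structure that Chroma actually uses may differ, and the names `cosine` and `top_k` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, stored, k):
    """Return the ids of the k vectors in `stored` most similar to query_vec."""
    ranked = sorted(stored, key=lambda cid: cosine(query_vec, stored[cid]), reverse=True)
    return ranked[:k]
```

Here `stored` is a mapping from a chunk id to its embedding vector; a real vector store avoids scanning every vector on each query.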

Define the prompts according to the task you want the RAG pipeline to perform.

In [17]:
# Updating Prompt for Q&A
from llama_index.core import PromptTemplate

template = (
    "You are a product expert on AMD ROCm and Radeon GPUs, very familiar with the ROCm documentation, and you guide end users.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the information from multiple sources and not prior knowledge,\n"
    "answer the question according to the index dataset.\n"
    "If the question is not related to ROCm and Radeon GPUs, just say it is not related to my knowledge base.\n"
    "If you don't know the answer, just say that you don't know.\n"
    "Answers need to be precise and concise.\n"
    "If the question is in Chinese, please translate it to English first.\n"
    "Query: {query_str}\n"
    "Answer: "
)
qa_template = PromptTemplate(template)
query_engine.update_prompts(
    {"response_synthesizer:text_qa_template": qa_template}
)

template = (
    "The original query is as follows: {query_str}.\n"
    "We have provided an existing answer: {existing_answer}.\n"
    "We have the opportunity to refine the existing answer (only if needed) with some more context below.\n"
    "-------------\n"
    "{context_msg}\n"
    "-------------\n"
    "Given the new context, refine the original answer to better answer the query. If the context isn't useful, return the original answer.\n"
    "If the question is 'who are you', just say I am an expert on AMD ROCm.\n"
    "Answers need to be precise and concise.\n"
    "Refined Answer: "
)

# Use a separate name so the refine prompt is not confused with the Q&A prompt
refine_template = PromptTemplate(template)

query_engine.update_prompts(
    {"response_synthesizer:refine_template": refine_template}
)
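
The two prompts work together in the "refine" response mode: the first retrieved chunk is answered with the Q&A template, and each further chunk is passed through the refine template together with the answer so far. The sketch below shows that loop with plain format strings and a stand-in `llm` callable. It is a simplified illustration, not LlamaIndex's implementation, and `refine_answer` is a hypothetical name.

```python
def refine_answer(llm, query, chunks, qa_template, refine_template):
    """Answer from the first chunk, then refine once per remaining chunk."""
    answer = None
    for index, chunk in enumerate(chunks):
        if index == 0:
            # First chunk: build the initial answer from context and query
            prompt = qa_template.format(context_str=chunk, query_str=query)
        else:
            # Later chunks: offer the existing answer plus new context
            prompt = refine_template.format(
                query_str=query, existing_answer=answer, context_msg=chunk
            )
        answer = llm(prompt)  # one LLM call per chunk
    return answer
```

One consequence of this design is cost: with ten retrieved chunks, a single query implies roughly ten sequential LLM calls.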

In [18]:
response = query_engine.query("brief the steps to install the ROCm?")
print(response)

Based on the provided context, here is a refined version of the original answer:

To install ROCm, follow these brief steps:

1. Run `amdgpu-install` script.
2. Specify the use case (Graphics or Workstation).
3. Choose the combination of components (Pro stack, user selection).

These steps guide you through installing the ROCm Software Stack and other Radeon software for Linux components on your system.

For a complete installation experience, refer to Option A: Install from pre-built binaries in the provided documentation.
Additionally, note that for advanced customization use cases or specific component installations, consider using pip3 install commands (e.g., `pip3 install /onnxruntime/build/Linux/Release/dist/*.whl`) as needed.

Note: The refined answer includes a mention of additional steps and context from the provided manual to provide a more comprehensive experience.


In [19]:
response = query_engine.query("Which chapter is about installing PyTorch?")
print(response)

The chapter about installing PyTorch using Docker (including instructions for checking GPU availability and displaying component information) is under "How to guides" (Chapter 2) in the section "Option B: Docker installation".


In [20]:
response = query_engine.query("How to verify PyTorch Installation?")
print(response)

Here's a refined version of the existing answer with more context:

To verify PyTorch installation on an AMD Radeon GPU setup using ROCm, follow these steps:

1. Run `python3 -c 'import torch' 2> /dev/null && echo 'Success' || echo 'Failure'` in the terminal.
	* This command checks if PyTorch is installed and can detect a GPU compute device.
	* If successful, it should print "Success". Otherwise, it will print "Failure".
2. To test if the GPU is available, run `nvidia-smi` or `rocm-smi` (for ROCm setup) in the terminal.
	* This command checks if the Radeon GPU is recognized by the system and is ready for use.

Additionally, to verify Torch-MIGraphX installation:

1. Run `python3 -c 'import torch_migraphx' 2> /dev/null && echo 'Success' || echo 'Failure'` in the terminal.
	* This command checks if Torch-MIGraphX can be imported as a Python module.
	* If successful, it should print "Success". Otherwise, it will print "Failure".
2. Run `pytest ./torch_migraphx/tests` to run unit tests.

N

In [21]:
response = query_engine.query("Could run ONNX on Radeon GPU?")
print(response)

Yes, with ROCm and ONNX Runtime installed on a Radeon GPU, you can run ONNX models. However, note that according to the ROCm documentation (Chapter 5, Section 5.1), there are known issues with running ONNX RT EP on some models (e.g., Llama2-7B) which may fall back to CPU execution.
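
To check which passages an answer drew on, the retrieved chunks can be inspected alongside the response. The sketch below assumes the response object exposes a `source_nodes` list whose items carry a `score` and a `node` with `metadata` and `get_content()`, and that PDF pages are labeled under a `page_label` metadata key. These are assumptions to verify against the installed LlamaIndex version, and `summarize_sources` is a hypothetical helper.

```python
def summarize_sources(response, max_chars=80):
    """Return (score, page_label, text preview) for each retrieved chunk."""
    rows = []
    for item in getattr(response, "source_nodes", []):
        metadata = getattr(item.node, "metadata", {}) or {}
        text = item.node.get_content()
        # page_label is an assumed metadata key; None is returned if it is absent
        rows.append((item.score, metadata.get("page_label"), text[:max_chars]))
    return rows
```

Printing the rows for a response makes it easier to judge whether a statement in the answer is supported by the indexed document.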
