# ChatQnA vLLM Deployment and Performance Evaluation Tutorial

## Table of Contents
1. [Overview](#overview)
2. [Prerequisites](#prerequisites)
3. [System Architecture](#system-architecture)
4. [Deployment Guide](#deployment-guide)
5. [Performance Evaluation](#performance-evaluation)
6. [Advanced Configuration](#advanced-configuration)
7. [Troubleshooting](#troubleshooting)
8. [Conclusion](#conclusion)

---

## Overview

ChatQnA is a Retrieval-Augmented Generation (RAG) system that combines document retrieval with LLM inference. This tutorial provides a comprehensive guide for deploying ChatQnA using vLLM on AMD GPUs with ROCm support, and performing pipeline performance evaluation.

### Key Features
- **vLLM Integration**: LLM serving with optimized inference on AMD Instinct GPUs
- **AMD GPU Support**: ROCm-based GPU acceleration
- **Vector Search**: Redis-based document retrieval
- **RAG Pipeline**: Complete question-answering system
- **Performance Monitoring**: Built-in metrics and evaluation tools

## Prerequisites

- **AMD Developer Cloud**: 1xMI300X GPU / 192 GB VRAM / 20 vCPU / 240 GB RAM Droplet
- **Hugging Face Token**: For model access

## System Architecture

### Service Components

The following is the complete system architecture diagram.

**Architecture Overview:**
```
┌───────────────────────────────────────────────────────────────────────────────────┐
│                               EXTERNAL ACCESS                                     │
│                                                                                   │
│   ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────────────────┐   │
│   │   Web Browser   │    │   API Clients   │    │      Monitoring Tools       │   │
│   │                 │    │                 │    │    (Grafana, Prometheus)    │   │
│   └─────────────────┘    └─────────────────┘    └─────────────────────────────┘   │
│           │                       │                           │                   │
│           │                       │                           │                   │
│           ▼                       ▼                           ▼                   │
│   ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────────────────┐   │
│   │   Nginx Proxy   │    │   Backend API   │    │        Redis Insight        │   │
│   │   (Port 8081)   │    │   (Port 8890)   │    │         (Port 8002)         │   │
│   └─────────────────┘    └─────────────────┘    └─────────────────────────────┘   │
│           │                       │                           │                   │
│           │                       │                           │                   │
│           ▼                       ▼                           ▼                   │
│   ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────────────────┐   │
│   │   Frontend UI   │    │     Backend     │    │   Redis Vector Database     │   │
│   │   (Port 5174)   │    │     Server      │    │         (Port 6380)         │   │
│   │   (React App)   │    │    (FastAPI)    │    │      (Vector Storage)       │   │
│   └─────────────────┘    └─────────────────┘    └─────────────────────────────┘   │
│                                   │                           │                   │
│                                   │                           │                   │
│                                   ▼                           ▼                   │
│  ┌─────────────────────────────────────────────────────────────────────────────┐  │
│  │                             RAG PIPELINE                                    │  │
│  │                                                                             │  │
│  │  ┌───────────────────┐ ┌─────────────────────┐ ┌─────────────────────────┐  │  │
│  │  │ Retriever Service │ │TEI Embedding Service│ │  TEI Reranking Service  │  │  │
│  │  │                   │ │                     │ │                         │  │  │
│  │  │   (Port 7001)     │ │    (Port 18091)     │ │      (Port 18809)       │  │  │
│  │  │                   │ │                     │ │                         │  │  │
│  │  │ • Vector Search   │ │ • Text Embedding    │ │ • Document Reranking    │  │  │
│  │  │ • Similarity      │ │ • BGE Model         │ │ • Relevance Scoring     │  │  │
│  │  │   Matching        │ │ • CPU Inference     │ │ • CPU Inference         │  │  │
│  │  └───────────────────┘ └─────────────────────┘ └─────────────────────────┘  │  │
│  │            │                      │                         │               │  │
│  │            │                      │                         │               │  │
│  │            ▼                      ▼                         ▼               │  │
│  │  ┌───────────────────────────────────────────────────────────────────────┐  │  │
│  │  │                           vLLM Service                                │  │  │
│  │  │                           (Port 18009)                                │  │  │
│  │  │                                                                       │  │  │
│  │  │                  • High-Performance LLM Inference                     │  │  │
│  │  │                  • AMD GPU Acceleration (ROCm)                        │  │  │
│  │  │                  • Qwen2.5-7B-Instruct Model                          │  │  │
│  │  │                  • Optimized for Throughput & Latency                 │  │  │
│  │  │                  • Tensor Parallel Support                            │  │  │
│  │  └───────────────────────────────────────────────────────────────────────┘  │  │
│  └─────────────────────────────────────────────────────────────────────────────┘  │
│                                      │                                            │
│                                      │                                            │
│                                      ▼                                            │
│  ┌─────────────────────────────────────────────────────────────────────────────┐  │
│  │                            DATA PIPELINE                                    │  │
│  │                                                                             │  │
│  │  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────────────┐  │  │
│  │  │   Dataprep      │    │   Model Cache   │    │   Document Storage      │  │  │
│  │  │   Service       │    │   (./data)      │    │   (Redis Vector DB)     │  │  │
│  │  │   (Port 18104)  │    │                 │    │                         │  │  │
│  │  │                 │    │ • Downloaded    │    │ • Vector Embeddings     │  │  │
│  │  │ • Document      │    │   Models        │    │ • Metadata Index        │  │  │
│  │  │   Processing    │    │ • Model Weights │    │ • Full-Text Search      │  │  │
│  │  │ • Text          │    │ • Cache Storage │    │ • Similarity Search     │  │  │
│  │  │   Extraction    │    │ • Shared Volume │    │ • Redis Stack           │  │  │
│  │  └─────────────────┘    └─────────────────┘    └─────────────────────────┘  │  │
│  └─────────────────────────────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────────────────────────────┘
```
**Additional Services:**
- **Dataprep Service** (Port 18104): Document processing and ingestion
- **Redis Insight** (Port 8002): Database monitoring interface
- **Model Cache** (./data): Shared volume for model storage

### Data Flow
1. **User Input**: Question submitted via frontend
2. **Embedding**: Question converted to vector using TEI service
3. **Retrieval**: Similar documents retrieved from Redis vector database
4. **Reranking**: Retrieved documents reranked for relevance
5. **LLM Inference**: vLLM generates answer using retrieved context
6. **Response**: Answer returned to user via frontend
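
The ordering of these steps can be sketched in plain Python. This is only an illustration of the data flow, not the OPEA implementation: `answer_question` and every `toy_*` function are hypothetical stand-ins for the embedding, retrieval, reranking, and vLLM services, so the sketch runs without any container.

```python
from typing import Callable, List


def answer_question(
    question: str,
    embed: Callable[[str], List[float]],
    retrieve: Callable[[List[float]], List[str]],
    rerank: Callable[[str, List[str]], List[str]],
    generate: Callable[[str, List[str]], str],
    top_n: int = 1,
) -> str:
    """Run the pipeline steps in order: embed, retrieve, rerank, generate."""
    vector = embed(question)                    # step 2: question -> vector
    candidates = retrieve(vector)               # step 3: similar documents
    ranked = rerank(question, candidates)       # step 4: order by relevance
    return generate(question, ranked[:top_n])   # step 5: answer with context


# Hypothetical stand-ins for the real services.
def toy_embed(text):
    return [float(len(text))]


def toy_retrieve(vector):
    return ["Redis stores vectors.", "vLLM serves the LLM."]


def toy_rerank(question, docs):
    # Rank by how many question words each document shares.
    words = set(question.lower().split())
    return sorted(
        docs,
        key=lambda d: -len(words & set(d.lower().rstrip(".").split())),
    )


def toy_generate(question, context):
    return f"Based on: {context[0]}"


print(answer_question("What does vLLM do?", toy_embed, toy_retrieve,
                      toy_rerank, toy_generate))
```

Passing each stage in as a function mirrors the microservice layout: any stage can be swapped without touching the others.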

## Deployment Guide

### Step 1: Pull source code from GitHub
First, we'll clone the Open Platform for Enterprise AI (OPEA) GenAIExamples repository, which contains the ChatQnA implementation and other AI examples needed for our deployment. Next, we'll clone the LaunchPad repository that provides one-click deployment scripts and configuration files specifically designed for the ChatQnA use case on AMD GPU environments.


In [None]:
# Open Platform for Enterprise AI (OPEA)
!git clone https://github.com/opea-project/GenAIExamples.git

In [None]:
# One click deployment scripts for the use case
!git clone https://github.com/Yu-amd/LaunchPad.git

The LaunchPad project uses the same directory hierarchy as the OPEA project. Copy the scripts, YAML files, and `.env` file from the LaunchPad ChatQnA directory to the corresponding directory in the OPEA folder:

In [None]:
# Copy necessary scripts and configuration files to the OPEA directory
!cp /root/LaunchPad/GenAIExamples/ChatQnA/docker_compose/amd/gpu/rocm/*.sh /root/GenAIExamples/ChatQnA/docker_compose/amd/gpu/rocm/
!cp /root/LaunchPad/GenAIExamples/ChatQnA/docker_compose/amd/gpu/rocm/*.yaml /root/GenAIExamples/ChatQnA/docker_compose/amd/gpu/rocm/
!cp /root/LaunchPad/GenAIExamples/ChatQnA/docker_compose/amd/gpu/rocm/.env /root/GenAIExamples/ChatQnA/docker_compose/amd/gpu/rocm/


### Step 2: Environment Setup
Now we need to navigate to the OPEA deployment directory where all the configuration files and scripts are located.


In [None]:
# Navigate to the OPEA deployment directory
%cd /root/GenAIExamples/ChatQnA/docker_compose/amd/gpu/rocm

We'll need to set up environment variable management. First, we install the python-dotenv package which allows us to load environment variables from a .env file. Then we import the necessary modules and load the environment variables from the .env file, which contains our configuration settings.


In [None]:
# Install and load .env
!pip install python-dotenv

In [None]:
# Load environment variables
from dotenv import load_dotenv
import os
load_dotenv()  # Loads variables from .env file

Now we need to configure your Hugging Face API token, which is required to download the AI models used by the ChatQnA system. Make sure to replace 'your_token_here' with your actual Hugging Face token.


In [None]:
# Set up your Hugging Face token and environment
import os
os.environ['CHATQNA_HUGGINGFACEHUB_API_TOKEN'] = 'your_token_here'
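
A forgotten placeholder is easy to miss until model downloads fail, so a small check before continuing can help. The sketch below only inspects the environment variable; it does not contact Hugging Face or validate the token. The helper name `token_problem` is an assumption made for this example.

```python
import os

PLACEHOLDER = "your_token_here"


def token_problem(env, name="CHATQNA_HUGGINGFACEHUB_API_TOKEN"):
    """Return a description of the problem, or None if the value looks usable."""
    value = env.get(name, "").strip()
    if not value:
        return f"{name} is not set"
    if value == PLACEHOLDER:
        return f"{name} still holds the placeholder value"
    return None


problem = token_problem(os.environ)
print(problem or "Token variable is set")
```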

Now we'll set up the vLLM environment using the provided script. This will configure all the necessary components for high-performance LLM inference on AMD GPUs.


In [None]:
# Setup vLLM environment
!./run_chatqna.sh setup-vllm

### Step 3: Deploy the workload
With the environment configured, we can now start the vLLM services. This will launch all the necessary containers and services for the ChatQnA system.


In [None]:
# Start vLLM services
!./run_chatqna.sh start-vllm

Let's check the status of all running services to ensure everything started correctly and is functioning properly.


In [None]:
# Check service status
!./run_chatqna.sh status
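
As a complement to the status command, a plain TCP check can show which services are accepting connections. This is a minimal sketch that assumes the ports in the architecture diagram are published on `localhost`; the helper names `port_open` and `closed_services` are made up for this example. An open port only shows that something is listening, not that the service is healthy.

```python
import socket

# Host ports as listed in the architecture diagram (assumed to be published locally).
SERVICE_PORTS = {
    "nginx": 8081,
    "backend": 8890,
    "frontend": 5174,
    "redis-vector-db": 6380,
    "redis-insight": 8002,
    "retriever": 7001,
    "tei-embedding": 18091,
    "tei-reranking": 18809,
    "vllm": 18009,
    "dataprep": 18104,
}


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def closed_services(ports, host="localhost"):
    """Return the sorted names of services whose port refuses connections."""
    return sorted(name for name, port in ports.items()
                  if not port_open(host, port))


print("Not accepting connections:", closed_services(SERVICE_PORTS) or "none")
```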

We'll monitor the vLLM service logs for 60 seconds to verify that the service is starting up correctly and to check for any potential issues during initialization.


In [None]:
# Check chatqna-vllm-service status
!timeout 60 docker logs -f chatqna-vllm-service

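Once the logs confirm that chatqna-vllm-service is ready, go straight to Step 4. The GPU memory steps below are only needed if the service fails to start because of memory issues.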


### GPU Memory Management

Before proceeding to verify deployment, it's important to ensure your GPU memory is properly managed.

#### Check GPU Memory Status


In [None]:
# Check current GPU memory usage
# Expected output shows VRAM% and GPU% usage
# If VRAM% is high (>80%) but GPU% is low, memory may be fragmented
!rocm-smi

#### Clear GPU Memory (If Needed)

If you encounter GPU memory issues or high VRAM usage with low GPU utilization:

**Option 1: Kill GPU Processes**


In [None]:
# Find processes using GPU
!sudo fuser -v /dev/kfd

In [None]:
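# Kill GPU-related processes
# Warning: this pattern also matches the Jupyter kernel (a python process)
# and the Docker daemon, so expect to lose the notebook session;
# narrow the pattern if you need the session to survive
!sudo pkill -f "python|vllm|docker"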

**Option 2: Restart GPU Services**


In [None]:
# Restart AMD GPU services
!sudo systemctl restart amdgpu
!sudo systemctl restart kfd

**Option 3: System Reboot (Most Reliable)**

In [None]:
# If other methods don't work, reboot the system
# Note: If you're on a remote server, wait approximately 30 seconds to 1 minute
# before attempting to SSH back into the server
!sudo reboot

After clearing GPU memory, verify it's free:

In [None]:
# Check GPU memory is now available
# Expected: VRAM% should be low (<20%) and GPU% should be 0%
!rocm-smi

### Step 4: Verify Deployment
Let's verify that all Docker containers are running properly by checking their current status and port mappings.


In [None]:
# Check running containers
!docker ps

Now let's test the backend API to ensure it's responding correctly. This will send a simple test message to verify that the ChatQnA service is working properly.


In [None]:
# Test backend API
!curl -X POST http://localhost:8890/v1/chatqna \
  -H "Content-Type: application/json" \
  -d '{"messages": "Hello, how are you?"}'
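
The same request can be sent from Python using only the standard library, which is handy when the result needs further processing in the notebook. This is a sketch under assumptions: the helper names `build_chatqna_request` and `ask` are invented here, and the response is returned as raw text because the exact response format (plain JSON or a streamed body) is not specified in this tutorial.

```python
import json
import urllib.request

CHATQNA_URL = "http://localhost:8890/v1/chatqna"


def build_chatqna_request(question, url=CHATQNA_URL):
    """Build the same POST request as the curl command above."""
    body = json.dumps({"messages": question}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question, timeout=120):
    """Send the question and return the raw response text."""
    request = build_chatqna_request(question)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


req = build_chatqna_request("Hello, how are you?")
print(req.get_method(), req.full_url)
# Uncomment once the services are running:
# print(ask("Hello, how are you?"))
```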

### Step 5: Upload Documents
Let's create a sample document and upload it to the system. This will demonstrate how to ingest documents into the ChatQnA system for retrieval and question answering.


In [None]:
# Create a text file
!echo "Your document content here" > document.txt

# Upload the file
!curl -X POST http://localhost:18104/v1/dataprep/ingest \
  -H "Content-Type: multipart/form-data" \
  -F "files=@document.txt"

Let's verify that the document was successfully uploaded and indexed by checking the contents of the Redis vector database.


In [None]:
# Verify the upload worked
# Check if the document was indexed
!curl -X POST http://localhost:18104/v1/dataprep/get \
  -H "Content-Type: application/json" \
  -d '{"index_name": "rag-redis"}'

You can also upload multiple documents at once. Here's how to create and upload several documents simultaneously to build up your knowledge base.


In [None]:
# For multiple documents
# Create multiple files
!echo "Document 1 content" > doc1.txt
!echo "Document 2 content" > doc2.txt

# Upload multiple files
!curl -X POST http://localhost:18104/v1/dataprep/ingest \
  -H "Content-Type: multipart/form-data" \
  -F "files=@doc1.txt" \
  -F "files=@doc2.txt"
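
For uploads driven from Python rather than curl, the multipart body can be assembled with the standard library. The sketch below repeats the `files` field once per document, as the curl command does. The helper names `encode_multipart` and `build_ingest_request` are assumptions for this example, and the `text/plain` part type is a guess that suits the `.txt` samples.

```python
import uuid
import urllib.request

INGEST_URL = "http://localhost:18104/v1/dataprep/ingest"


def encode_multipart(files, field="files"):
    """Encode (filename, bytes) pairs as multipart/form-data.

    Returns (body, content_type); content_type carries the boundary.
    """
    boundary = uuid.uuid4().hex
    parts = []
    for filename, content in files:
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{field}"; '
            f'filename="{filename}"\r\n'
            "Content-Type: text/plain\r\n\r\n"
        ).encode("utf-8")
        parts.append(header + content + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_ingest_request(files, url=INGEST_URL):
    """Build a POST request that uploads all files in one call."""
    body, content_type = encode_multipart(files)
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )


req = build_ingest_request([
    ("doc1.txt", b"Document 1 content\n"),
    ("doc2.txt", b"Document 2 content\n"),
])
print(len(req.data), "bytes,", req.get_header("Content-type"))
# To send it once the dataprep service is running:
# urllib.request.urlopen(req, timeout=120)
```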

## Performance Evaluation

### Overview

Performance evaluation helps you understand:
- **Throughput**: Requests per second
- **Latency**: Response time
- **Accuracy**: Answer quality
- **Resource Usage**: CPU, GPU, memory utilization

### Step 1: Setup Evaluation Environment
Now let's set up the evaluation environment. We'll navigate to the root directory and clone the GenAIEval repository, which contains the benchmarking tools we'll use to evaluate the ChatQnA system's performance.


In [None]:
# Change directory to the top of the workspace
%cd /root
# Pull from OPEA GitHub so GenAIExamples and GenAIEval are in the same directory
!git clone https://github.com/opea-project/GenAIEval.git

We'll copy the ChatQnA-specific evaluation scripts from the LaunchPad directory to the GenAIEval benchmark folder so we can run performance tests on our deployed system.


In [None]:
# Copy chatqna scripts from the LaunchPad directory
!cp /root/LaunchPad/GenAIEval/evals/benchmarks/* /root/GenAIEval/evals/benchmark/

Let's navigate to the GenAIEval directory where we'll set up and run our performance evaluation tests.


In [None]:
# Navigate to evaluation directory
%cd /root/GenAIEval

We need to install the Python virtual environment package to create an isolated environment for our evaluation tools.


In [None]:
# Install dependency
!apt install -y python3.12-venv

Now we'll create a virtual environment for our evaluation tools and activate it to ensure we have a clean, isolated Python environment.


In [None]:
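# Create virtual environment
!python3 -m venv opea_eval_env
# Note: each `!` command runs in its own subshell, so this activation does not
# carry over to later cells; later `!pip` calls use the pip on the notebook's PATH
!source opea_eval_env/bin/activate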

Let's install all the required dependencies for the evaluation tools and set up the GenAIEval package in development mode.


In [None]:
# Install evaluation dependencies
!pip install -r requirements.txt
!pip install -e .

### Step 2: Run Basic Evaluation
Now let's navigate back to the ChatQnA deployment directory where we'll run the performance evaluation tests on our deployed system.


In [None]:
# Navigate back to GenAIExamples/ChatQnA/docker_compose/amd/gpu/rocm/
%cd /root/GenAIExamples/ChatQnA/docker_compose/amd/gpu/rocm/

Now we'll run the vLLM evaluation script which will test the performance of our ChatQnA system, measuring metrics like throughput, latency, and response quality.


In [None]:
# Run vLLM evaluation
!./run_chatqna.sh vllm-eval

### Step 3: Performance Metrics

#### Throughput Testing
We'll install Apache Bench (ab), a tool that will help us perform load testing and measure the throughput of our ChatQnA API under various conditions.


In [None]:
# Install dependency
!apt install -y apache2-utils

Let's create a test file with a complex question that will help us evaluate how well the system handles detailed, multi-part queries and generates comprehensive responses.


In [None]:
# Create a complex test file
!echo '{"messages": "Can you provide a detailed explanation of how neural networks work, including the concepts of forward propagation, backpropagation, and gradient descent? Also explain how these concepts relate to deep learning and why they are important for modern AI systems."}' > test_data.json

Now we'll run a load test using Apache Bench that sends 100 requests in total (`-n 100`), with up to 10 in flight at the same time (`-c 10`). This measures the system's throughput under concurrent load.


In [None]:
# Test concurrent requests
!ab -n 100 -c 10 -p test_data.json -T application/json \
  http://localhost:8890/v1/chatqna
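
The same measurement idea (a fixed number of requests, a fixed concurrency, throughput as completed requests divided by wall-clock time) can be sketched in Python. This is not how Apache Bench is implemented; `run_load` is a hypothetical helper, and the demo uses a sleeping stand-in for the request so it runs without the service.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load(send, total=100, concurrency=10):
    """Call send() `total` times using `concurrency` worker threads."""
    def timed_call(_):
        start = time.perf_counter()
        ok = True
        try:
            send()
        except Exception:
            ok = False  # count the request as failed, keep going
        return ok, time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_call, range(total)))
    wall = time.perf_counter() - wall_start

    completed = sum(1 for ok, _ in results if ok)
    return {
        "completed": completed,
        "failed": total - completed,
        "seconds": wall,
        "requests_per_second": completed / wall if wall > 0 else 0.0,
        "latencies": [elapsed for _, elapsed in results],
    }


# Stand-in for a real request; replace with a function that POSTs to the API.
stats = run_load(lambda: time.sleep(0.01), total=20, concurrency=5)
print(stats["completed"], "completed,",
      round(stats["requests_per_second"], 1), "req/s")
```

Keeping the per-request latencies alongside the throughput figure makes it possible to compute percentiles from the same run.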

#### Latency Testing
Let's create a detailed timing format file for curl that will help us measure various latency metrics including DNS lookup, connection time, and total response time for precise performance analysis.


In [None]:
%%writefile curl-format.txt
     time_namelookup:  %{time_namelookup}\n
        time_connect:  %{time_connect}\n
     time_appconnect:  %{time_appconnect}\n
    time_pretransfer:  %{time_pretransfer}\n
       time_redirect:  %{time_redirect}\n
  time_starttransfer:  %{time_starttransfer}\n
                     ----------\n
          time_total:  %{time_total}\n
           http_code:  %{http_code}\n
       size_download:  %{size_download}\n
      speed_download:  %{speed_download}\n

Now we'll use curl with our detailed timing format to measure the precise response times for a single request, which will give us granular insights into each step of the request processing pipeline.


In [None]:
# Measure response times
!curl -w "@curl-format.txt" -X POST http://localhost:8890/v1/chatqna \
  -H "Content-Type: application/json" \
  -d '{"messages": "What is machine learning?"}'

### Step 4: Evaluation Results

Evaluation results include:
- **Response Time**: Average, median, 95th percentile
- **Throughput**: Requests per second
- **Accuracy**: Answer quality metrics
- **Resource Usage**: CPU, GPU, memory consumption
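
The response-time figures can be computed from a list of measured latencies as follows. This is a minimal sketch: `summarize` is a hypothetical helper, the sample values are illustrative rather than measurements, and the 95th percentile uses the nearest-rank method, which is one of several common definitions.

```python
import math
import statistics


def summarize(latencies):
    """Return average, median, and nearest-rank 95th percentile."""
    if not latencies:
        raise ValueError("need at least one latency value")
    ordered = sorted(latencies)
    rank = math.ceil(0.95 * len(ordered))  # 1-based nearest rank
    return {
        "average": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
    }


sample = [0.8, 1.1, 0.9, 1.0, 3.2]  # illustrative seconds, not measurements
print(summarize(sample))
```

A single slow request moves the average and the 95th percentile far more than the median, which is why the three are usually reported together.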

### Common Issues and Solutions

#### Issue 1: GPU Memory Errors
**Symptoms**: out-of-memory or similar errors

**Solution**:


In [None]:
# Reduce the maximum context length in the vLLM configuration
# Edit compose_vllm.yaml, modify vLLM service command:
--max-model-len 2048 --tensor-parallel-size 1

#### Issue 2: Service Startup Failures
**Symptoms**: Services fail to start or remain in "starting" state

**Solution**:


In [None]:
# Check logs for specific errors
!docker compose -f compose_vllm.yaml logs

In [None]:
# Restart services
!./run_chatqna.sh restart-vllm

#### Issue 3: Redis Index Issues
**Symptoms**: Retrieval service fails to find documents

**Solution**:


In [None]:
# Fix Redis index
!./fix_redis_index.sh

In [None]:
# Recreate index manually
!docker exec chatqna-redis-vector-db redis-cli FT.CREATE rag-redis ON HASH PREFIX 1 doc: SCHEMA content TEXT WEIGHT 1.0 distance NUMERIC

#### Issue 4: Model Download Failures
**Symptoms**: Services fail to download models

**Solution**:


In [None]:
# Check HF token
!echo $CHATQNA_HUGGINGFACEHUB_API_TOKEN

In [None]:
# Set token manually (a `!export` would only affect its own subshell)
import os
os.environ['CHATQNA_HUGGINGFACEHUB_API_TOKEN'] = 'your_token_here'

## Advanced Configuration

### Custom Model Configuration

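Edit set_env_vllm.sh to use different models. The cells below show the values to set. Make the change to the matching `export` lines inside the script, because an `!export` run from a notebook cell only affects that cell's own subshell: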


In [None]:
# Change LLM model
!export CHATQNA_LLM_MODEL_ID="Qwen/Qwen2.5-14B-Instruct"

In [None]:
# Change embedding model
!export CHATQNA_EMBEDDING_MODEL_ID="BAAI/bge-large-en-v1.5"

In [None]:
# Change reranking model
!export CHATQNA_RERANK_MODEL_ID="BAAI/bge-reranker-large"

## Troubleshooting

### Diagnostic Commands

In [None]:
# Check system resources
!./detect_issues.sh

In [None]:
# Test complete system
!./quick_test_chatqna.sh eval-only

In [None]:
# Check service health
!docker compose -f compose_vllm.yaml ps

### Log Analysis

In [None]:
# View all logs
!docker compose -f compose_vllm.yaml logs

In [None]:
# Follow specific service logs
!docker compose -f compose_vllm.yaml logs -f chatqna-vllm-service

In [None]:
# Check for errors
!docker compose -f compose_vllm.yaml logs | grep -i error
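
When several services log errors at once, grouping the matching lines by service can make the output easier to scan. The sketch below assumes each log line has the form `service-name | message`, which is how Docker Compose usually prefixes log output. The helper name `errors_by_service` and the sample log lines are invented for illustration.

```python
from collections import Counter


def errors_by_service(log_text):
    """Count lines containing 'error' (case-insensitive) per service prefix."""
    counts = Counter()
    for line in log_text.splitlines():
        if "error" not in line.lower():
            continue
        service, separator, _ = line.partition("|")
        counts[service.strip() if separator else "(unknown)"] += 1
    return counts


# Illustrative log lines, not real output.
sample = "\n".join([
    "chatqna-vllm-service  | INFO: server started",
    "chatqna-vllm-service  | ERROR: out of memory",
    "chatqna-redis-vector-db  | Error: index not found",
    "chatqna-vllm-service  | error: retrying",
])
print(errors_by_service(sample).most_common())
```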

## Conclusion

This tutorial provides a comprehensive guide for deploying ChatQnA with vLLM on AMD GPUs and performing detailed performance evaluation. The system offers:

- **High Performance**: vLLM-optimized inference
- **Scalability**: Docker-based microservices architecture
- **Monitoring**: Built-in performance metrics
- **Flexibility**: Configurable models and parameters

For additional support or advanced configurations, refer to the project documentation or create issues in the repository.

### Next Steps

1. **Customize Models**: Experiment with different LLM and embedding models
2. **Scale Deployment**: Add multiple GPU nodes for higher throughput
3. **Optimize Performance**: Fine-tune vLLM parameters for your specific use case
4. **Monitor Production**: Set up comprehensive monitoring for production deployments

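### Useful Commands Reference

- `./run_chatqna.sh setup-vllm`: set up the vLLM environment
- `./run_chatqna.sh start-vllm`: start the vLLM services
- `./run_chatqna.sh status`: check service status
- `./run_chatqna.sh restart-vllm`: restart the vLLM services
- `./run_chatqna.sh vllm-eval`: run the vLLM evaluation
- `docker compose -f compose_vllm.yaml logs`: view all service logs
- `rocm-smi`: check GPU memory usage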


---

*Note*: This tutorial assumes you have the necessary permissions and that all required software is installed. For production deployments, consider additional security measures and monitoring solutions.