# AgentQnA vLLM Deployment & Evaluation Tutorial

## Table of Contents
1. [Overview](#overview)
2. [Prerequisites](#prerequisites)
3. [System Architecture](#system-architecture)
4. [Deployment Guide](#deployment-guide)
5. [Performance Evaluation](#performance-evaluation)
6. [Monitoring](#monitoring)
7. [Troubleshooting](#troubleshooting)

---

## Overview <a id="overview"></a>

AgentQnA is a Retrieval-Augmented Generation (RAG) system built from three cooperative LangChain agents and a high-performance LLM served by **vLLM** on AMD GPUs (ROCm). This notebook walks you through single-node deployment and basic performance testing.

## Prerequisites <a id="prerequisites"></a>

* AMD MI300X (or another ROCm-capable GPU)
* Docker and Docker Compose
* Python 3.12
* Hugging Face access token

# Deploy AgentQnA on AMD GPU (ROCm)

This document outlines the single-node deployment of an AgentQnA application that uses the GenAIComps micro-services on an AMD GPU (ROCm) server. The steps are pulling the Docker images, deploying the containers with Docker Compose, and running the AgentQnA micro-services.

## Background & High-Level Architecture <a id="system-architecture"></a>

AgentQnA is a multi-tool, retrieval-augmented reasoning agent derived from the **OPEA** project. It couples several specialised sub-agents (RAG, SQL and ReAct) with a language-model back-end (**vLLM** or **TGI**) to answer heterogeneous user questions with minimal hallucination.

The end-to-end data flow is shown below.

```text
┌──────────────┐      ┌────────────────────┐     ┌────────────────────────┐
│  User Query  │ ───► │  Supervisor Agent  │ ──► │  Tool Invocation / RAG │
└──────────────┘      └─────────┬──────────┘     └────────────┬───────────┘
                                │                             │
                        ┌───────▼─────────┐     ┌─────────────▼─────────┐
                        │  Worker RAG     │     │   Worker SQL Agent    │
                        │  (LangChain)    │     │  (Table QA)           │
                        └───────┬─────────┘     └─────────────┬─────────┘
                                │                             │
                         ┌──────▼─────┐               ┌───────▼─────┐
                         │  Retriever │               │  Postgres   │
                         │ (Redis-IVF)│               │  or SQLite  │
                         └──────┬─────┘               └─────────────┘
                                │
                      ┌─────────▼─────────┐
                      │    vLLM / TGI     │  (LLM inference on ROCm GPUs)
                      └───────────────────┘
```

**Key points:**

* **vLLM (ROCm 6.4.1)** serves the LLM, providing fast autoregressive decoding with multi-GPU tensor parallelism.
* **Redis Vector** stores embeddings generated at ingestion time and serves similarity search during retrieval.
* **Prometheus + Grafana** (optional) expose GPU utilisation, token throughput and cache hit-rates.
* All services are orchestrated via **Docker Compose** and can be launched with one-liner helper scripts.
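To make the retrieval step concrete, here is a minimal in-memory sketch of the kind of similarity search a vector store performs. It is not the Redis Vector API: the `top_k` helper, the toy three-dimensional embeddings, and the document IDs are all hypothetical, and a real deployment would delegate this work to the Redis index.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query, index, k=2):
    """Return the k document IDs whose embeddings are closest to the query."""
    ranked = sorted(index, key=lambda doc_id: cosine(query, index[doc_id]), reverse=True)
    return ranked[:k]

# Toy embeddings (hypothetical); real ones come from the embedding model at ingestion time.
INDEX = {
    "doc-gpu": [0.9, 0.1, 0.0],
    "doc-sql": [0.0, 0.8, 0.6],
    "doc-rag": [0.7, 0.3, 0.1],
}
```

Calling `top_k([1.0, 0.0, 0.0], INDEX)` ranks the stored documents against a query embedding. A production index would typically use an approximate search structure rather than this exhaustive scan.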

The remainder of this tutorial focuses on deploying the vLLM-backed stack, validating the agent endpoints, and benchmarking performance.


## Deployment Guide <a id="deployment-guide"></a>

In [None]:
# 1. Clone repositories
!git clone https://github.com/opea-project/GenAIExamples.git
!git clone https://github.com/Yu-amd/LaunchPad.git  

In [None]:
# 2. Copy helper scripts from LaunchPad (identical tree layout)
%cd LaunchPad/GenAIExamples/AgentQnA/docker_compose/amd/gpu/rocm
!cp *.sh  /path/to/GenAIExamples/AgentQnA/docker_compose/amd/gpu/rocm/
!cp *.yaml /path/to/GenAIExamples/AgentQnA/docker_compose/amd/gpu/rocm/

In [None]:
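# 3. Configure environment
%cd /path/to/GenAIExamples/AgentQnA/docker_compose/amd/gpu/rocm
# %env sets the variable in the kernel, so later shell commands inherit it
%env AGENTQNA_HUGGINGFACEHUB_API_TOKEN=YOUR_HUGGING_FACE_TOKEN
# Check that the environment script loads. `source` only affects the shell it
# runs in, so any later shell that needs these variables must source it again.
!bash -c "source set_env_vllm.sh"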
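Before launching the stack, it can help to confirm that the expected variables are actually set. The following is a minimal sketch, not part of the helper scripts: `missing_vars` and `REQUIRED_VARS` are hypothetical names, and the assumption that `set_env_vllm.sh` defines `AGENTQNA_LLM_MODEL_ID` is inferred from its use in the inference test.

```python
import os

# Hypothetical list: the token name comes from this tutorial; the model ID
# variable is assumed to be defined by set_env_vllm.sh.
REQUIRED_VARS = ["AGENTQNA_HUGGINGFACEHUB_API_TOKEN", "AGENTQNA_LLM_MODEL_ID"]

def missing_vars(env=None, required=REQUIRED_VARS):
    """Return the required variables that are unset, empty, or still a placeholder."""
    env = os.environ if env is None else env
    placeholder = "YOUR_HUGGING_FACE_TOKEN"
    return [name for name in required if env.get(name, "") in ("", placeholder)]
```

Calling `missing_vars()` with no arguments inspects the notebook kernel's own environment. Variables that were sourced in a separate shell will not appear there, so an empty result is only meaningful for variables set in the kernel itself.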

### Option A – One-click script (recommended)

In [None]:
!./run_agentqna.sh setup-vllm
!./run_agentqna.sh start-vllm
!./run_agentqna.sh status

### Option B – Manual Compose launch

In [None]:
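%%bash
# Source the environment and launch in the same shell so Compose sees the variables
source set_env_vllm.sh
docker compose -f compose_vllm.yaml up -d
docker compose -f compose_vllm.yaml ps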

### Verify deployment

In [None]:
# List containers (braces are doubled because IPython collapses {{ }} to { } in `!` commands)
!docker ps --format "table {{{{.Names}}}}\t{{{{.Status}}}}\t{{{{.Ports}}}}"

In [None]:
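%%bash
# Quick inference test against the vLLM service. Sourcing here defines
# AGENTQNA_LLM_MODEL_ID in this shell (assumed to be set by set_env_vllm.sh).
source set_env_vllm.sh
# The JSON body is double-quoted so that the shell expands the model ID.
curl -X POST http://localhost:18009/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{\"model\":\"${AGENTQNA_LLM_MODEL_ID}\",\"messages\":[{\"role\":\"user\",\"content\":\"Hello!\"}]}"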
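The same check can be scripted from Python, which avoids shell quoting issues. This is a minimal sketch using only the standard library: `build_chat_request` and `first_reply` are hypothetical helpers, the port is taken from the curl command above, and the response shape assumes the OpenAI-style chat completion format.

```python
import json
import urllib.request

def build_chat_request(model, prompt, base_url="http://localhost:18009"):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def first_reply(response_body):
    """Extract the assistant text from an OpenAI-style chat completion response."""
    return json.loads(response_body)["choices"][0]["message"]["content"]
```

Once the stack is running, the request can be sent with `urllib.request.urlopen(build_chat_request(model_id, "Hello!"), timeout=30).read()` and the result passed to `first_reply`, where `model_id` is whatever model name the service was started with.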

## Performance Evaluation <a id="performance-evaluation"></a>

In [None]:
!./quick_eval_setup.sh   # installs GenAIEval + deps
!./performance_evaluation.sh
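The internals of `performance_evaluation.sh` are not shown here, so the following is only an illustrative sketch of how raw benchmark samples are commonly reduced to headline numbers. The `summarize` function and its output keys are hypothetical and are not taken from GenAIEval.

```python
import statistics

def summarize(latencies_s, output_tokens):
    """Summarise per-request latencies (seconds) and generated token counts."""
    if not latencies_s or len(latencies_s) != len(output_tokens):
        raise ValueError("need one token count per latency sample")
    ordered = sorted(latencies_s)
    # Nearest-rank p95: the smallest sample with at least 95% of samples at or below it.
    rank = max(1, -(-95 * len(ordered) // 100))  # integer ceil(0.95 * n)
    return {
        "mean_s": statistics.mean(ordered),
        "p95_s": ordered[rank - 1],
        # Valid for sequential requests; for concurrent runs divide by wall-clock time instead.
        "tokens_per_s": sum(output_tokens) / sum(latencies_s),
    }
```

Reporting a tail percentile alongside the mean matters because a few slow requests can dominate perceived latency even when the average looks healthy.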

## Monitoring <a id="monitoring"></a>

In [None]:
!./start_monitoring.sh   # Prometheus + Grafana stack
print("Grafana ⇒ http://<host_ip>:3000  (admin / admin)")
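Prometheus collects metrics by scraping a plain-text endpoint, so the values behind the dashboards can also be inspected directly. Below is a minimal sketch of parsing that text format. `parse_metrics` is a hypothetical helper, the sample metric names are placeholders rather than the names the stack actually exports, and timestamps or other optional fields are not handled.

```python
def parse_metrics(text):
    """Parse Prometheus text-format samples into {metric_with_labels: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        # The value is taken to be the last space-separated field on the line.
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples

# Illustrative sample only; these metric names are placeholders.
SAMPLE = """\
# HELP demo_requests_running Number of requests currently running.
# TYPE demo_requests_running gauge
demo_requests_running{model="demo"} 3
demo_cache_usage_ratio{model="demo"} 0.25
"""
```

Running `parse_metrics(SAMPLE)` yields a dictionary keyed by metric name and labels. The same function could be applied to the text returned by a service's metrics endpoint, assuming that endpoint is reachable from the notebook.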

## Troubleshooting <a id="troubleshooting"></a>

* **Service not starting** – inspect the logs with `docker compose logs -f`
* **GPU OOM** – lower `--max-model-len` in `compose_vllm.yaml`
* **Slow responses** – check GPU utilisation with `rocm-smi` and adjust the batch size
* **Restart stack** – run `./run_agentqna.sh restart-vllm`
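To see why lowering `--max-model-len` helps with GPU OOM, it is useful to estimate how much key/value cache a single full-length sequence needs. The sketch below uses the standard per-token formula for transformer KV caches. The model shape is hypothetical and should be replaced with the values from the deployed model's configuration; this is a rough estimate, not vLLM's actual memory accounting.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of key/value cache needed to hold num_tokens tokens."""
    # The factor 2 covers the separate key and value tensors in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * num_tokens

# Hypothetical model shape: 32 layers, 8 KV heads, head size 128, 16-bit cache.
SHAPE = dict(num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2)

for max_model_len in (32768, 8192):
    gib = kv_cache_bytes(max_model_len, **SHAPE) / 2**30
    print(f"max-model-len {max_model_len}: {gib:.1f} GiB of KV cache per full-length sequence")
```

Because the requirement scales linearly with sequence length, cutting the maximum length to a quarter also cuts the per-sequence cache to a quarter, leaving more memory for weights and concurrent requests.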

---
© 2025 Advanced Micro Devices, Inc.