# ChatQnA vLLM Deployment and Performance Evaluation Tutorial

## Table of Contents
1. [Overview](#overview)
2. [Prerequisites](#prerequisites)
3. [System Architecture](#system-architecture)
4. [Deployment Guide](#deployment-guide)
5. [Performance Evaluation](#performance-evaluation)
6. [Monitoring and Troubleshooting](#monitoring-and-troubleshooting)
7. [Advanced Configuration](#advanced-configuration)
8. [Troubleshooting](#troubleshooting)

---

## Overview

ChatQnA is a Retrieval-Augmented Generation (RAG) system that combines document retrieval with LLM inference. This tutorial provides a comprehensive guide for deploying ChatQnA using vLLM on AMD GPUs with ROCm support, and performing pipeline performance evaluation.

### Key Features
- **vLLM Integration**: LLM serving with optimized inference on AMD Instinct GPUs
- **AMD GPU Support**: ROCm-based GPU acceleration
- **Vector Search**: Redis-based document retrieval
- **RAG Pipeline**: Complete question-answering system
- **Performance Monitoring**: Built-in metrics and evaluation tools

## Prerequisites

- **AMD Developer Cloud**: 1xMI300X GPU / 192 GB VRAM / 20 vCPU / 240 GB RAM Droplet
- **Hugging Face Token**: For model access

## Deployment Guide

### Step 1: Clone the Repositories

In [None]:
# Open Platform for Enterprise AI (OPEA)
!git clone https://github.com/opea-project/GenAIExamples.git

In [None]:
# One click deployment scripts for the use case
!git clone https://github.com/Yu-amd/LaunchPad.git

The LaunchPad project uses the same directory hierarchy as the OPEA project. Copy the scripts and YAML files from the LaunchPad directory to the corresponding directory in your OPEA folder:

In [None]:
# Copy necessary scripts and configuration files to the OPEA directory
# Replace /path/to/OPEA with your actual OPEA path
!cp LaunchPad/GenAIExamples/ChatQnA/docker_compose/amd/gpu/rocm/*.sh /path/to/OPEA/GenAIExamples/ChatQnA/docker_compose/amd/gpu/rocm/
!cp LaunchPad/GenAIExamples/ChatQnA/docker_compose/amd/gpu/rocm/*.yaml /path/to/OPEA/GenAIExamples/ChatQnA/docker_compose/amd/gpu/rocm/

### Step 2: Environment Setup

In [None]:
# Navigate to the OPEA deployment directory
%cd /path/to/GenAIExamples/ChatQnA/docker_compose/amd/gpu/rocm

In [None]:
# Set up your Hugging Face token and environment
# Append your token to set_env_vllm.sh (replace the placeholder with your actual token)
!echo 'export CHATQNA_HUGGINGFACEHUB_API_TOKEN="your hugging face token"' >> set_env_vllm.sh

# Source the vLLM environment configuration
# Note: each `!` command runs in its own subshell, so variables sourced here do not persist to later cells
!source set_env_vllm.sh
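
Because `!source` runs in a throwaway subshell, one possible workaround is to read the `export` lines from the script and copy them into `os.environ`, which later `!` commands inherit. The sketch below assumes the script uses plain `export NAME=value` lines; it does not expand references such as `${VAR}` or run any shell logic. `parse_exports` and `load_env_file` are hypothetical helper names, not part of OPEA or LaunchPad.

```python
import os
import shlex


def parse_exports(script_text):
    """Collect NAME=value pairs from simple `export NAME=value` lines.

    Hypothetical helper: it does not expand $VARS or run any shell logic.
    """
    env = {}
    for line in script_text.splitlines():
        line = line.strip()
        if not line.startswith("export "):
            continue
        # shlex removes surrounding quotes and drops trailing `# comments`.
        for token in shlex.split(line[len("export "):], comments=True):
            name, sep, value = token.partition("=")
            if sep and name.isidentifier():
                env[name] = value
    return env


def load_env_file(path="set_env_vllm.sh"):
    """Copy the parsed variables into os.environ for later cells."""
    with open(path) as f:
        parsed = parse_exports(f.read())
    os.environ.update(parsed)
    return sorted(parsed)  # return names only, so the token is not echoed
```

Calling `load_env_file()` in a cell after editing the script returns the variable names it loaded. If the script derives values from other variables, those entries will contain the unexpanded text and need to be set by hand.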

### Step 3: Deploy Services

#### Option A: Using the Unified Script (Recommended)

In [None]:
# Setup vLLM environment
!./run_chatqna.sh setup-vllm

In [None]:
# Start vLLM services
!./run_chatqna.sh start-vllm

In [None]:
# Check service status
!./run_chatqna.sh status

In [None]:
# Follow the chatqna-vllm-service logs (interrupt the cell to stop following)
!docker logs -f chatqna-vllm-service

#### Option B: Manual Deployment

In [None]:
# Source environment variables and start all services in the same shell,
# because each `!` command runs in its own subshell
!bash -c "source set_env_vllm.sh && docker compose -f compose_vllm.yaml up -d"

# Check service status
!docker ps

### Step 4: Verify Deployment

In [None]:
# Check running containers
!docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
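
Reading the status table by eye gets tedious with many containers. As a rough sketch, the status column can be scanned programmatically. It assumes Docker's usual status strings (`Up ... (healthy)`, `Up ... (unhealthy)`, `Exited (...) ...`); `find_not_ready` and `check_containers` are hypothetical helper names.

```python
import subprocess


def find_not_ready(ps_output):
    """Return (name, status) pairs for containers that are not up and healthy.

    Expects tab-separated lines of the form "<name>\t<status>".
    """
    not_ready = []
    for line in ps_output.splitlines():
        if "\t" not in line:
            continue
        name, status = line.split("\t", 1)
        up = status.startswith("Up")
        pending = "unhealthy" in status or "health: starting" in status
        if not up or pending:
            not_ready.append((name, status))
    return not_ready


def check_containers():
    """Run `docker ps -a` and report containers that need attention."""
    out = subprocess.run(
        ["docker", "ps", "-a", "--format", "{{.Names}}\t{{.Status}}"],
        capture_output=True, text=True, check=True,
    ).stdout
    return find_not_ready(out)
```

An empty list from `check_containers()` suggests every container is running. The `-a` flag is used so that exited containers also show up.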

In [None]:
# Test backend API
!curl -X POST http://localhost:8890/v1/chatqna \
  -H "Content-Type: application/json" \
  -d '{"messages": "Hello, how are you?"}'
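
The same request can be issued from Python, which is convenient when later cells need to reuse it. This is a minimal sketch using only the standard library. The endpoint and payload mirror the `curl` call above; the helper names and the timeout are assumptions, and the response is returned as raw text because its exact format is not described here.

```python
import json
import urllib.request


def build_chatqna_request(question, base_url="http://localhost:8890"):
    """Build a POST request for the ChatQnA backend shown in the curl call."""
    payload = json.dumps({"messages": question}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/chatqna",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question, timeout=120):
    """Send the question and return the raw response body as text."""
    request = build_chatqna_request(question)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

For example, `print(ask("Hello, how are you?"))` should print whatever the backend returns once the services are up.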

In [None]:
# Get your public IP
!hostname -I | awk '{print $1}'

You can access the web interface at: 

### Step 5: Upload Documents

In [None]:
# Create a text file
!echo "Your document content here" > document.txt

# Upload the file
!curl -X POST http://localhost:18104/v1/dataprep/ingest \
  -H "Content-Type: multipart/form-data" \
  -F "files=@document.txt"

In [None]:
# Verify the upload worked
# Check if the document was indexed
!curl -X POST http://localhost:18104/v1/dataprep/get \
  -H "Content-Type: application/json" \
  -d '{"index_name": "rag-redis"}'

In [None]:
# For multiple documents
# Create multiple files
!echo "Document 1 content" > doc1.txt
!echo "Document 2 content" > doc2.txt

# Upload multiple files
!curl -X POST http://localhost:18104/v1/dataprep/ingest \
  -H "Content-Type: multipart/form-data" \
  -F "files=@doc1.txt" \
  -F "files=@doc2.txt"
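
For uploads driven from Python rather than `curl`, the multipart body can be assembled with the standard library. The sketch below assumes the ingest endpoint accepts repeated `files` fields, as in the `curl` commands above, and that the documents are plain text. `encode_multipart` and `ingest` are hypothetical helper names.

```python
import urllib.request
import uuid


def encode_multipart(files, field="files"):
    """Encode (filename, bytes) pairs as a multipart/form-data body.

    Returns the body and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    chunks = []
    for filename, content in files:
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{field}"; '
            f'filename="{filename}"\r\n'
            "Content-Type: text/plain\r\n\r\n"
        )
        chunks.append(header.encode("utf-8") + content + b"\r\n")
    chunks.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(chunks), f"multipart/form-data; boundary={boundary}"


def ingest(files, url="http://localhost:18104/v1/dataprep/ingest"):
    """POST the files to the dataprep service; return (status, body text)."""
    body, content_type = encode_multipart(files)
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request, timeout=300) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")
```

A call such as `ingest([("doc1.txt", b"Document 1 content"), ("doc2.txt", b"Document 2 content")])` would send both documents in one request.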

## Performance Evaluation

### Overview

Performance evaluation helps you understand:
- **Throughput**: Requests per second
- **Latency**: Response time
- **Accuracy**: Answer quality
- **Resource Usage**: CPU, GPU, memory utilization

### Step 1: Setup Evaluation Environment

In [None]:
# Pull from OPEA GitHub so GenAIExamples and GenAIEval are in the same directory
!git clone https://github.com/opea-project/GenAIEval.git

# Navigate to evaluation directory
%cd /path/to/GenAIEval

In [None]:
# Copy chatqna scripts from the LaunchPad directory
!cp /path/to/LaunchPad/GenAIEval/evals/benchmark/* /path/to/GenAIEval/evals/benchmark/

# Install dependency (-y skips the confirmation prompt, which a notebook cell cannot answer)
!apt install -y python3.12-venv

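# Create virtual environment
!python3 -m venv opea_eval_env

# Install evaluation dependencies inside the virtual environment
# (activation does not persist across separate `!` commands, so activate and install in one shell)
!bash -c "source opea_eval_env/bin/activate && pip install -r requirements.txt && pip install -e ."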

### Step 2: Run Basic Evaluation

In [None]:
# Navigate back to GenAIExamples/ChatQnA/docker_compose/amd/gpu/rocm/
%cd /path/to/GenAIExamples/ChatQnA/docker_compose/amd/gpu/rocm/

# Run vLLM evaluation
!./run_chatqna.sh vllm-eval

### Step 3: Performance Metrics

#### Throughput Testing

In [None]:
# Install dependency (-y skips the confirmation prompt, which a notebook cell cannot answer)
!apt install -y apache2-utils

# Create a complex test file
!echo '{"messages": "Can you provide a detailed explanation of how neural networks work, including the concepts of forward propagation, backpropagation, and gradient descent? Also explain how these concepts relate to deep learning and why they are important for modern AI systems."}' > test_data.json

# Test concurrent requests
!ab -n 100 -c 10 -p test_data.json -T application/json \
  http://localhost:8890/v1/chatqna
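
If `ab` is not available, a comparable measurement can be approximated in Python with a thread pool. This is only a sketch: `measure_throughput` is a hypothetical helper, `send` stands for any zero-argument callable that issues one request and raises on failure, and the result counts all attempts per second of wall-clock time, so it will not match `ab` output exactly.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure_throughput(send, total=100, concurrency=10):
    """Call send() `total` times across `concurrency` worker threads.

    Returns (successes, failures, requests_per_second).
    """
    def attempt(_):
        try:
            send()
            return True
        except Exception:
            return False

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(attempt, range(total)))
    elapsed = time.perf_counter() - start

    succeeded = sum(results)
    rate = total / elapsed if elapsed > 0 else float("inf")
    return succeeded, total - succeeded, rate
```

The defaults mirror `-n 100 -c 10`. Threads are adequate here because each worker spends its time waiting on the network rather than computing.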

#### Latency Testing

In [None]:
# Create curl-format.txt with the following content.
# The raw string (r"""...""") keeps each `\n` as a literal backslash-n, so curl rather than Python interprets it.
curl_format_content = r"""
     time_namelookup:  %{time_namelookup}\n
        time_connect:  %{time_connect}\n
     time_appconnect:  %{time_appconnect}\n
    time_pretransfer:  %{time_pretransfer}\n
       time_redirect:  %{time_redirect}\n
  time_starttransfer:  %{time_starttransfer}\n
                     ----------\n
          time_total:  %{time_total}\n
           http_code:  %{http_code}\n
       size_download:  %{size_download}\n
      speed_download:  %{speed_download}\n
"""

with open('curl-format.txt', 'w') as f:
    f.write(curl_format_content)

# Measure response times
!curl -w "@curl-format.txt" -X POST http://localhost:8890/v1/chatqna \
  -H "Content-Type: application/json" \
  -d '{"messages": "What is machine learning?"}'

### Step 4: Evaluation Results

Evaluation results include:
- **Response Time**: Average, median, 95th percentile
- **Throughput**: Requests per second
- **Accuracy**: Answer quality metrics
- **Resource Usage**: CPU, GPU, memory consumption
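
The response-time statistics listed above can be computed from a list of per-request latencies. The following sketch uses the nearest-rank definition of the 95th percentile, which is one of several common conventions; `summarize_latencies` is a hypothetical helper name.

```python
import statistics


def summarize_latencies(samples):
    """Return mean, median, and nearest-rank 95th percentile of the samples."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    # Nearest-rank method: 1-based rank = ceil(0.95 * n), in integer math.
    rank = (95 * len(ordered) + 99) // 100
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
    }
```

With few samples the 95th percentile is simply the slowest or second-slowest request, so it becomes meaningful only after a reasonable number of requests.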

## Monitoring and Troubleshooting

### Service Monitoring

#### Check Service Status

In [None]:
# Check all services
!./run_chatqna.sh status

In [None]:
# Check specific service logs
!docker compose -f compose_vllm.yaml logs -f chatqna-vllm-service

#### Monitor Performance

In [None]:
# Copy prometheus.yml from LaunchPad directory
!cp /path/to/LaunchPad/GenAIExamples/ChatQnA/docker_compose/amd/gpu/rocm/grafana/prometheus.yml /path/to/GenAIExamples/ChatQnA/docker_compose/amd/gpu/rocm/grafana/

# Start monitoring stack
!./run_chatqna.sh monitor-start

In [None]:
# Copy the Grafana files from the LaunchPad directory to the GenAIExamples directory
!cp -r /path/to/LaunchPad/GenAIExamples/ChatQnA/docker_compose/amd/gpu/rocm/grafana/* /path/to/GenAIExamples/ChatQnA/docker_compose/amd/gpu/rocm/grafana/

You can access Grafana dashboard at:  (admin/admin)

### Data Source Import

In [None]:
# Prometheus Data Source
!curl -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Basic YWRtaW46YWRtaW4=" \
  -d '{
    "name": "Prometheus",
    "type": "prometheus",
    "url": "http://prometheus:9090",
    "access": "proxy",
    "isDefault": true
  }' \
  http://localhost:3000/api/datasources
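
The `Authorization` value in the request above is the Base64 encoding of `admin:admin`, the credentials noted for Grafana. If those credentials are changed, the header has to be regenerated; the short sketch below shows one way to do that (`basic_auth_header` is a hypothetical helper name).

```python
import base64


def basic_auth_header(user, password):
    """Build the value of an HTTP Basic Authorization header."""
    raw = f"{user}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


print(basic_auth_header("admin", "admin"))  # Basic YWRtaW46YWRtaW4=
```

Base64 is an encoding, not encryption, so anyone who can read the header can recover the password.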

### Dashboard Imports

#### 1. Comprehensive Dashboard (TGI + vLLM) - Fixed
**Use this for remote nodes with both TGI and vLLM services**

In [None]:
!curl -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Basic YWRtaW46YWRtaW4=" \
  -d @grafana/dashboards/chatqna_comprehensive_dashboard_vllm_fixed_import.json \
  http://localhost:3000/api/dashboards/import

#### 2. AI Model Performance Dashboard
**Use this for detailed model-specific monitoring and performance analysis**

In [None]:
!curl -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Basic YWRtaW46YWRtaW4=" \
  -d @grafana/dashboards/chatqna_ai_model_dashboard_import.json \
  http://localhost:3000/api/dashboards/import

#### 3. TGI-Only Dashboard (Local Development)
**Use this for local development where vLLM is not available**

In [None]:
!curl -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Basic YWRtaW46YWRtaW4=" \
  -d @grafana/dashboards/chatqna_tgi_only_dashboard_import.json \
  http://localhost:3000/api/dashboards/import

### Common Issues and Solutions

#### Issue 1: GPU Memory Errors
**Symptoms**:  or similar errors
**Solution**:


#### Issue 2: Service Startup Failures
**Symptoms**: Services fail to start or remain in "starting" state
**Solution**:


#### Issue 3: Redis Index Issues
**Symptoms**: Retrieval service fails to find documents
**Solution**:


#### Issue 4: Model Download Failures
**Symptoms**: Services fail to download models
**Solution**:


## Advanced Configuration

### Custom Model Configuration

Edit  to use different models:



## Troubleshooting

### Diagnostic Commands

In [None]:
# Check system resources
!./detect_issues.sh

# Test complete system
!./quick_test_chatqna.sh eval-only

# Check service health
!docker compose -f compose_vllm.yaml ps

### Log Analysis

In [None]:
# View all logs
!docker compose -f compose_vllm.yaml logs

In [None]:
# Follow specific service logs
!docker compose -f compose_vllm.yaml logs -f chatqna-vllm-service

In [None]:
# Check for errors
!docker compose -f compose_vllm.yaml logs | grep -i error
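
When the `grep` output is long, grouping matches by service helps locate the failing component. The sketch below assumes the default `docker compose logs` layout, where each line starts with the service or container name followed by `|`. `count_errors_by_service` is a hypothetical helper name.

```python
from collections import Counter


def count_errors_by_service(log_text):
    """Count log lines containing 'error' (case-insensitive) per service prefix."""
    counts = Counter()
    for line in log_text.splitlines():
        prefix, sep, message = line.partition("|")
        # Lines without a "|" separator carry no service name and are skipped.
        if sep and "error" in message.lower():
            counts[prefix.strip()] += 1
    return counts
```

One way to feed it from the notebook is `logs = !docker compose -f compose_vllm.yaml logs --no-color` followed by `count_errors_by_service("\n".join(logs)).most_common()`.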

## Conclusion

This tutorial provides a comprehensive guide for deploying ChatQnA with vLLM on AMD GPUs and performing detailed performance evaluation. The system offers:

- **High Performance**: vLLM-optimized inference
- **Scalability**: Docker-based microservices architecture
- **Monitoring**: Built-in performance metrics
- **Flexibility**: Configurable models and parameters

For additional support or advanced configurations, refer to the project documentation or create issues in the repository.

### Next Steps

1. **Customize Models**: Experiment with different LLM and embedding models
2. **Scale Deployment**: Add multiple GPU nodes for higher throughput
3. **Optimize Performance**: Fine-tune vLLM parameters for your specific use case
4. **Monitor Production**: Set up comprehensive monitoring for production deployments

### Useful Commands Reference



---

**Note**: This tutorial assumes you have the necessary permissions and that all required software is installed. For production deployments, consider additional security measures and monitoring solutions.