# RAG Demo: Red Hat OpenShift AI Documentation

## Scenario 1: Enhancing LLM Knowledge with Official Documentation

**Business Context**: A Cloud Architect needs specific technical details about deploying RAG workloads in Red Hat OpenShift AI.

**Objective**: Demonstrate how RAG transforms generic responses into accurate, source-cited answers.

**Architecture**: This demo follows Red Hat's official RAG pattern:
- **Baseline**: Direct vLLM API call (OpenAI-compatible)
- **RAG**: Llama Stack Agent with vector retrieval
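
As a rough mental model, the two paths differ in what the model gets to read. The sketch below is purely conceptual: the function names and placeholder passages are hypothetical, and the Llama Stack Agent performs retrieval through a tool call rather than with this exact prompt layout.

```python
# Conceptual sketch only; names and passages here are illustrative assumptions.

def build_baseline_messages(question: str) -> list[dict]:
    """Baseline: the model sees only the user's question."""
    return [{"role": "user", "content": question}]


def build_rag_messages(question: str, retrieved_chunks: list[str]) -> list[dict]:
    """RAG: retrieved passages are supplied as numbered context before the question."""
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return [
        {"role": "system", "content": "Answer using only the context below.\n\n" + context},
        {"role": "user", "content": question},
    ]


example_question = "How do I deploy RAG workloads?"
example_chunks = ["<retrieved passage 1>", "<retrieved passage 2>"]  # placeholders

baseline_messages = build_baseline_messages(example_question)
rag_messages = build_rag_messages(example_question, example_chunks)

print(baseline_messages)
print(rag_messages[0]["content"])
```

Numbering the passages is one simple way to let the model refer back to its sources in the answer.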

---

In [None]:
# Install required packages into the active kernel's environment
%pip install llama-stack-client openai

In [None]:
# Setup and imports
from llama_stack_client import LlamaStackClient, Agent, AgentEventLogger
from openai import OpenAI
import uuid
import json

# Configuration
VLLM_URL = "https://mistral-24b-quantized-predictor-private-ai-demo.apps.cluster-qtvt5.qtvt5.sandbox2082.opentlc.com/v1"
LLAMASTACK_URL = "http://llamastack.private-ai-demo.svc.cluster.local:8000"
MODEL_ID = "mistral-24b-quantized"
VECTOR_DB_ID = "rag_documents"  # Shared vector database for all scenarios

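# Initialize clients
# The OpenAI client requires a non-empty api_key; a placeholder works when the
# vLLM server was started without --api-key.
vllm_client = OpenAI(base_url=VLLM_URL, api_key="dummy")
stack_client = LlamaStackClient(base_url=LLAMASTACK_URL)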

print(f"✅ vLLM client configured: {VLLM_URL}")
print(f"✅ Llama Stack client configured: {LLAMASTACK_URL}")

## Test Question

We'll ask the same question twice and compare the answers:
1. **Without RAG**: A generic response based only on the model's training data
2. **With RAG**: A specific response grounded in retrieved Red Hat documentation

In [None]:
# The test question
QUESTION = (
    "What are the exact hardware and software prerequisites for deploying "
    "a LlamaStack distribution in Red Hat OpenShift AI? Include GPU requirements, "
    "operator dependencies, and the exact configuration steps."
)

print("Test Question:")
print(QUESTION)

## 1. Baseline Query (Without RAG)

First, let's see what the model knows without access to the documentation.
We call vLLM directly using the OpenAI-compatible API.
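
"OpenAI-compatible" means the server accepts the same chat completions request shape that the OpenAI client sends. The sketch below builds that request by hand to show what goes over the wire. The helper name `build_chat_request` and the example URL are assumptions made for illustration, and nothing is actually sent.

```python
import json


def build_chat_request(base_url: str, model: str, question: str, max_tokens: int = 300):
    """Return the URL and JSON body of a chat completions request (hypothetical helper)."""
    url = base_url.rstrip("/") + "/chat/completions"
    body = {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "max_tokens": max_tokens,
    }
    return url, body


# Placeholder endpoint, not the demo cluster's URL.
url, body = build_chat_request(
    "https://vllm.example.com/v1",
    "mistral-24b-quantized",
    "What are the prerequisites?",
)

print("POST", url)
print(json.dumps(body, indent=2))
```

Because the request shape is the same, the standard `OpenAI` client only needs a different `base_url` to talk to vLLM.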

In [None]:
# Query without RAG - Direct vLLM call
print("=" * 70)
print("BASELINE RESPONSE (No RAG - Direct vLLM):")
print("=" * 70)

response_baseline = vllm_client.chat.completions.create(
    model=MODEL_ID,
    messages=[{"role": "user", "content": QUESTION}],
    max_tokens=300,  # caps the answer length; longer answers are truncated
)

print(response_baseline.choices[0].message.content)
print()

## 2. RAG Query (With Vector Retrieval)

Now let's use Llama Stack's Agent API to retrieve relevant documentation and enhance the response.
This follows Red Hat's official RAG pattern.
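
To make "vector retrieval" concrete, here is a toy sketch of the idea: represent the query and each document chunk as a vector, then return the chunks most similar to the query. It uses word counts and cosine similarity so it runs without any services. A real deployment would use a learned embedding model and a vector database instead, and the chunk texts below are invented for illustration, not taken from Red Hat documentation.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query, best match first."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]


# Invented example chunks (hypothetical content).
chunks = [
    "gpu nodes are required for model serving",
    "the operator must be installed before creating a distribution",
    "release notes for the web console theme",
]

results = top_k("which operator must be installed", chunks, k=1)
print(results)
```

The agent's `knowledge_search` tool plays the role of `top_k` here: it queries the configured vector database and hands the matching passages to the model.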

In [None]:
# Query with RAG - Using Llama Stack Agent
print("=" * 70)
print("RAG RESPONSE (With Vector Retrieval - Llama Stack Agent):")
print("=" * 70)

# Create RAG agent following Red Hat's pattern
rag_agent = Agent(
    stack_client,
    model=MODEL_ID,
    instructions="You are a helpful assistant with access to Red Hat OpenShift AI documentation.",
    tools=[
        {
            "name": "builtin::rag/knowledge_search",
            "args": {"vector_db_ids": [VECTOR_DB_ID]},
        }
    ],
)

# Create session and query
session_id = rag_agent.create_session(session_name=f"rag-demo-{uuid.uuid4().hex[:8]}")

response_rag = rag_agent.create_turn(
    messages=[{"role": "user", "content": QUESTION}],
    session_id=session_id,
    stream=True,
)

# Print the streamed events (tool calls and model output) as they arrive
for log in AgentEventLogger().log(response_rag):
    log.print()

print()

## 3. Comparison

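**Baseline (No RAG)**: A generic answer drawn only from the model's training data, which may omit or misstate product-specific details.

**RAG (With Retrieval)**: An answer grounded in passages retrieved from the Red Hat documentation in the vector database, so it should be more specific and traceable to its sources.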
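
One lightweight way to put a number on the comparison is to check how many expected terms each answer mentions. The sketch below is a crude proxy under stated assumptions: `term_coverage`, the term list, and both sample answers are hypothetical, and keyword coverage says nothing about whether an answer is correct.

```python
def term_coverage(answer: str, expected_terms: list[str]) -> float:
    """Fraction of expected terms that appear in the answer (case-insensitive)."""
    if not expected_terms:
        return 0.0
    text = answer.lower()
    found = [term for term in expected_terms if term.lower() in text]
    return len(found) / len(expected_terms)


# Hypothetical terms and answers, for illustration only.
expected_terms = ["GPU", "operator"]
sample_baseline = "You will generally need a GPU."
sample_rag = "Install the operator first, then confirm a GPU node is available."

baseline_score = term_coverage(sample_baseline, expected_terms)
rag_score = term_coverage(sample_rag, expected_terms)

print(f"Baseline coverage: {baseline_score:.2f}")
print(f"RAG coverage:      {rag_score:.2f}")
```

In this notebook, the baseline text to score would be `response_baseline.choices[0].message.content`. Reading both answers side by side remains the more reliable check.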

---

## Red Hat AI Four Pillars Demonstrated

1. **Efficient Inferencing**: Quantized Mistral 24B on vLLM
2. **Simplified Data Connection**: Milvus vector store via Llama Stack
3. **Hybrid Cloud Flexibility**: OpenShift AI on any infrastructure
4. **Agentic AI Delivery**: Llama Stack Agent with RAG tools