# RAG-only diagnostics with Llama Stack Agents (RHOAI)

This notebook demonstrates how to use the **Llama Stack Agents API** to perform
**RAG (Retrieval-Augmented Generation)** against a vector store loaded with
Special Payment Project knowledge (runbooks, known issues, post-incident reviews).

- It is designed to run against the **RHOAI Llama Stack image**  
  `rhoai/odh-llama-stack-core-rhel9:v3.0`.
- It connects to the Llama Stack instance via `LLAMA_BASE_URL`.
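- It uses the client-side **Agents API** (the `Agent` helper) rather than calling the `/v1/responses` file_search flow by hand. As the HTTP logs further down show, in this SDK version the helper itself issues `/v1/conversations` and `/v1/responses` requests. The agent is used to: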
  - Create an agent with a `file_search` tool
  - Bind that tool to a specific vector store
  - Create a session and ask a single RAG-backed question
  - Show the final answer (and any RAG trace we can see)


## 1. Install dependencies

This cell installs the `llama-stack-client` Python SDK (matching the server
version used by `rhoai/odh-llama-stack-core-rhel9:v3.0`), plus helpers for
environment variables and coloured output.
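
Before running the rest of the notebook, it can help to confirm which SDK version the kernel actually sees, since a mismatch with the server image is an easy thing to miss. The following is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name introduced here, and `0.3.0` is simply the pin from the install cell.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


expected = "0.3.0"  # assumption: the pin used in the %pip install cell
found = installed_version("llama-stack-client")

if found is None:
    print("llama-stack-client is not installed in this kernel")
elif found != expected:
    print(f"llama-stack-client {found} is installed, expected {expected}")
else:
    print(f"llama-stack-client {found} matches the pinned version")
```

If the versions differ after running the install cell, restarting the kernel is usually the first thing to try.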


In [1]:
%pip install --quiet "llama-stack-client==0.3.0" python-dotenv termcolor



[notice] A new release of pip is available: 24.2 -> 25.3
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


In [2]:
import os
from pprint import pprint

from dotenv import load_dotenv
from termcolor import cprint
from llama_stack_client import LlamaStackClient

# Load environment variables from .env (LLAMA_BASE_URL, etc.)
load_dotenv()

# Base URL of the Llama Stack server
base_url = os.getenv(
    "LLAMA_BASE_URL",
    "http://lsd-llama-milvus-inline-service.llama-stack-demo.svc.cluster.local:8321",
).rstrip("/")

client = LlamaStackClient(base_url=base_url)
print(f"Connected to Llama Stack server: {base_url}")

# List models so we can see what's available
models = list(client.models.list())
print("\nAvailable models:")
for m in models:
    ident = getattr(m, "identifier", None) or getattr(m, "model_id", None) or str(m)
    print(
        f" - {ident} "
        f"(type={getattr(m, 'model_type', None)}, provider={getattr(m, 'provider_id', None)})"
    )

# Prefer a vLLM-backed LLM if available, otherwise just take the first LLM
llm = next(
    (
        m
        for m in models
        if getattr(m, "model_type", None) == "llm"
        and getattr(m, "provider_id", None) == "vllm-inference"
    ),
    None,
)

if not llm:
    llm = next((m for m in models if getattr(m, "model_type", None) == "llm"), None)

assert llm, "No LLM models available on Llama Stack"

model_id = getattr(llm, "identifier", None) or getattr(llm, "model_id", None)
print(f"\nUsing model: {model_id}")

# Vector store id for RAG
VECTOR_STORE_ID = os.getenv(
    "VECTOR_STORE_ID",
    "vs_c246cf6a-40a4-425b-80c2-4d4e3f438fb1",
)
print(f"Using vector store: {VECTOR_STORE_ID}")


INFO:httpx:HTTP Request: GET http://lsd-llama-milvus-inline-service.llama-stack-demo.svc.cluster.local:8321/v1/models "HTTP/1.1 200 OK"


Connected to Llama Stack server: http://lsd-llama-milvus-inline-service.llama-stack-demo.svc.cluster.local:8321

Available models:
 - granite-embedding-125m (type=embedding, provider=sentence-transformers)
 - vllm-inference/llama-4-scout-17b-16e-w4a16 (type=llm, provider=vllm-inference)
 - sentence-transformers/nomic-ai/nomic-embed-text-v1.5 (type=embedding, provider=sentence-transformers)

Using model: vllm-inference/llama-4-scout-17b-16e-w4a16
Using vector store: vs_c246cf6a-40a4-425b-80c2-4d4e3f438fb1
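
The model selection in the cell above (prefer a `vllm-inference` LLM, otherwise fall back to the first LLM) can also be written as a small function, which makes it easy to try without a server. This is a sketch, not part of the SDK: `pick_llm` and `model_ident` are hypothetical helper names, and the `SimpleNamespace` objects are stand-ins that only mirror the attributes printed in the listing.

```python
from types import SimpleNamespace


def model_ident(model):
    """Return the model's identifier, tolerating either attribute name."""
    return getattr(model, "identifier", None) or getattr(model, "model_id", None)


def pick_llm(models, preferred_provider="vllm-inference"):
    """Pick an LLM, preferring `preferred_provider`; return None if there is no LLM."""
    llms = [m for m in models if getattr(m, "model_type", None) == "llm"]
    for m in llms:
        if getattr(m, "provider_id", None) == preferred_provider:
            return m
    return llms[0] if llms else None


# Stand-in objects mirroring two entries of the listing above (no server needed).
sample_models = [
    SimpleNamespace(
        identifier="granite-embedding-125m",
        model_type="embedding",
        provider_id="sentence-transformers",
    ),
    SimpleNamespace(
        identifier="vllm-inference/llama-4-scout-17b-16e-w4a16",
        model_type="llm",
        provider_id="vllm-inference",
    ),
]

chosen = pick_llm(sample_models)
print(model_ident(chosen))
```

With the real client, the same function could be called as `pick_llm(list(client.models.list()))`.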


In [3]:
rag_system_prompt = """
You are an incident diagnostics assistant for the Special Payment Project.

You have access to a RAG (Retrieval-Augmented Generation) tool that searches a
vector store containing:
- Known issues
- Runbooks
- Post-incident reviews
- Design and architecture notes for the Special Payment Project

Your job for ANY question is:

1. ALWAYS use the RAG / file_search tool FIRST to retrieve relevant context.
2. Read and synthesise the retrieved content carefully.
3. Base your answer ONLY on the retrieved context plus the user question.
4. If the vector store does not contain enough information to answer confidently:
   - Say that clearly.
   - Suggest what additional logs, metrics, or documentation a human should check.

When you answer:
- Start with a short summary (“TL;DR”) of the likely root cause or key insight.
- Then explain the reasoning, referencing the retrieved documents in natural language
  (e.g. “In the incident report about the checkout 502s…”, “In the payment API runbook…”).
- End with 2–3 concrete next steps for the on-call engineer.

Hard rules:
- Do NOT fabricate details that are not supported by the retrieved context.
- If multiple documents disagree, say so and explain the different possibilities.
- If nothing relevant is found, say “I couldn’t find any relevant entries in the known-issues KB”
  and switch to generic, high-level guidance.
""".strip()

print(rag_system_prompt[:400] + "...\n")


You are an incident diagnostics assistant for the Special Payment Project.

You have access to a RAG (Retrieval-Augmented Generation) tool that searches a
vector store containing:
- Known issues
- Runbooks
- Post-incident reviews
- Design and architecture notes for the Special Payment Project

Your job for ANY question is:

1. ALWAYS use the RAG / file_search tool FIRST to retrieve relevant contex...



In [4]:
from llama_stack_client import Agent

# Configure the Agent with RAG/file_search only
tools_spec = [
    {
        "type": "file_search",
        "vector_store_ids": [VECTOR_STORE_ID],
    }
]

rag_agent = Agent(
    client,
    model=model_id,
    instructions=rag_system_prompt,
    tools=tools_spec,
)

print("RAG Agent created with tools:", tools_spec)


RAG Agent created with tools: [{'type': 'file_search', 'vector_store_ids': ['vs_c246cf6a-40a4-425b-80c2-4d4e3f438fb1']}]
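
Because `vector_store_ids` is a list, the tool spec can in principle name more than one store. The sketch below builds the same spec shape from a comma-separated string, for example one read from an environment variable. `build_file_search_tools` is a hypothetical helper, the `vs_example_*` ids are placeholders, and whether a given server build actually searches several stores in one call is an assumption to verify against your deployment.

```python
def build_file_search_tools(raw_ids: str) -> list:
    """Build a file_search tool spec from a comma-separated list of vector store ids."""
    ids = []
    for part in raw_ids.split(","):
        vs_id = part.strip()
        # Skip empty fragments and duplicates, keeping first-seen order.
        if vs_id and vs_id not in ids:
            ids.append(vs_id)
    if not ids:
        raise ValueError("at least one vector store id is required")
    return [{"type": "file_search", "vector_store_ids": ids}]


# Placeholder ids for illustration only.
print(build_file_search_tools(" vs_example_a, vs_example_b ,vs_example_a,"))
```

Failing early on an empty id list avoids creating an agent whose RAG tool silently has nothing to search.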


In [5]:
from termcolor import cprint

question = (
    "Give some DNS names from the special payment project"
)

messages = [
    {"role": "user", "content": question},
]

cprint("User message:", "green")
print(question)

# 1) Create a session for the RAG agent
session = rag_agent.create_session(session_name="rag-only-demo")
session_id = getattr(session, "id", None) or getattr(session, "session_id", None) or str(session)
print("\nSession ID:", session_id)

# 2) Run a single non-streaming turn
rag_result = rag_agent.create_turn(
    messages=messages,
    session_id=session_id,
    stream=False,
)

print("\nRaw result type:", type(rag_result))


INFO:httpx:HTTP Request: POST http://lsd-llama-milvus-inline-service.llama-stack-demo.svc.cluster.local:8321/v1/conversations "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://lsd-llama-milvus-inline-service.llama-stack-demo.svc.cluster.local:8321/v1/responses "HTTP/1.1 200 OK"


User message:
Give some DNS names from the special payment project

Session ID: conv_931c75d9528f167f10b287a44485fecb24ca56e829f28c31

Raw result type: <class 'llama_stack_client.types.response_object.ResponseObject'>


In [6]:
from textwrap import indent

def show_rag_response(response, max_output_chars: int = 400, show_raw: bool = False):
    """
    Pretty-print file_search / RAG usage and the assistant's answer
    from a Llama Stack ResponseObject (via Agent API).
    """
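    # Normalise to a plain dict so the .get() lookups below are always safe
    if hasattr(response, "to_dict"):
        data = response.to_dict()
    elif isinstance(response, dict):
        data = response
    else:
        data = {}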

    # Try to find any file_search-related outputs (names may vary slightly by build)
    rag_items = [
        item
        for item in data.get("output", [])
        if isinstance(item, dict)
        and ("file_search" in str(item.get("type", "")).lower()
             or "retrieval" in str(item.get("type", "")).lower())
    ]

    cprint("\n=== RAG / file_search activity ===", "yellow")
    if not rag_items:
        print("(no explicit file_search entries found in output trace)")
    else:
        for idx, item in enumerate(rag_items, start=1):
            print(f"[RAG item {idx}] type={item.get('type')}")
            snippet = indent(str(item)[:max_output_chars], "    ")
            print(snippet)
            if len(str(item)) > max_output_chars:
                print("    ... [truncated]")
            print()

    # --- Assistant answer ---
    cprint("\n=== Assistant answer ===", "cyan")

    # Try the convenience field first (getattr already handles a missing attribute)
    text = getattr(response, "output_text", None)

    # Fallback: take the first non-empty output_text part from any message item.
    # Stop only once text is non-empty, so an empty first message does not end the search.
    if not text and isinstance(data, dict):
        for item in data.get("output", []):
            if not isinstance(item, dict) or item.get("type") != "message":
                continue
            for part in item.get("content", []):
                if isinstance(part, dict) and part.get("type") == "output_text" and part.get("text"):
                    text = part["text"]
                    break
            if text:
                break

    if text and str(text).strip():
        print(text)
    else:
        print("(Assistant returned an empty message – no natural-language answer.)")
        if show_raw:
            print("\n--- Raw response (debug) ---")
            pprint(data)

show_rag_response(rag_result)



=== RAG / file_search activity ===
[RAG item 1] type=file_search_call
    {'id': 'fc_520cd2ee-9385-479c-855e-a3c6d4f73898', 'queries': ['Special Payment Project DNS names'], 'status': 'completed', 'type': 'file_search_call', 'results': [{'attributes': {}, 'file_id': 'file-429b4839eae14654a952fe5d1af1b3e9', 'filename': 'file-429b4839eae14654a952fe5d1af1b3e9', 'score': 0.6768560409545898, 'text': 'Name`).\n* The existence and status of the gateway Service/FQDN in the prov
    ... [truncated]


=== Assistant answer ===
TL;DR: The Special Payment Project uses several DNS names, including `special-payment.<apps-domain>`, `card-gateway-dns`, and `card-gateway-sandbox.payments-provider-sim.svc.cluster.local`.

The Special Payment Project uses the following DNS names:

* `special-payment.<apps-domain>`: This is the user-facing route for the Special Payment Project application.
* `card-gateway-dns`: This is a Service of type `ExternalName` in the `special-payment-project` nam
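
The truncated trace above shows that each `file_search_call` item carries `queries` and a `results` list whose entries have `filename`, `score`, and `text`. A compact, ranked view of the retrieved chunks is often easier to read than the raw dict. This is a minimal sketch that assumes exactly those keys; `summarise_file_search` is a hypothetical helper, and the sample values are made up for illustration.

```python
def summarise_file_search(data: dict, max_chars: int = 80) -> list:
    """Return one row per retrieved chunk, highest score first."""
    rows = []
    for item in data.get("output", []):
        if not isinstance(item, dict) or item.get("type") != "file_search_call":
            continue
        for hit in item.get("results") or []:
            # Collapse newlines and runs of spaces so the snippet fits on one line.
            text = " ".join(str(hit.get("text", "")).split())
            rows.append(
                {
                    "queries": item.get("queries", []),
                    "filename": hit.get("filename"),
                    "score": hit.get("score"),
                    "snippet": text[:max_chars],
                }
            )
    # Chunks without a score sort last.
    rows.sort(
        key=lambda r: r["score"] if r["score"] is not None else float("-inf"),
        reverse=True,
    )
    return rows


# Made-up sample in the shape seen in the trace (not real retrieval output).
sample = {
    "output": [
        {
            "type": "file_search_call",
            "queries": ["example query"],
            "status": "completed",
            "results": [
                {"filename": "doc-a", "score": 0.41, "text": "first\nchunk  of text"},
                {"filename": "doc-b", "score": 0.68, "text": "second chunk"},
            ],
        },
        {"type": "message", "content": []},
    ]
}

for row in summarise_file_search(sample):
    print(f"{row['score']:.2f}  {row['filename']}  {row['snippet']}")
```

Against the notebook's own result, this could be called as `summarise_file_search(rag_result.to_dict())`, provided the response object exposes `to_dict()` as `show_rag_response` already checks for.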