# Llama Stack + Kubernetes MCP: Cluster Diagnostics Demo

**Note:** This Responses API setup currently works on v3.0. It has not yet been tested on v3.1.

This notebook shows how to use **Llama Stack’s Responses API** together with a **Kubernetes Model Context Protocol (MCP) server** to diagnose issues in a Kubernetes cluster.

It is designed to run against the **RHOAI Llama Stack image**
`rhoai/odh-llama-stack-core-rhel9:v3.0`.

We will:

- Connect to a running **Llama Stack** server and select an LLM.
- Define a **system prompt** that turns the model into a Kubernetes diagnostics assistant.
- Point the model at a **Kubernetes MCP server**, which exposes tools like `pods_list`, `pods_log`, and `resources_get`.
- Send a natural-language question asking it to assess the cluster and identify issues.
- Let the model **call MCP tools** to inspect the cluster and return likely root causes plus evidence.

Later, this same pattern can be extended with **RAG** so the model also follows your ingested SRE runbooks while using MCP to act on the cluster.


### 1. Install Python dependencies

We install the Llama Stack Python client plus a couple of helpers.

In Jupyter we use `%pip` so it installs into this kernel.

Updated initialization: the Llama Stack client in step 2 is configured without any web search provider or API key.


In [None]:
%pip install --quiet "llama-stack-client==0.3.0" python-dotenv termcolor



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.
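
Since the rest of the notebook relies on the pinned client version, it can be worth confirming what the kernel actually sees after the install. The following is a minimal sketch using only the standard library. `installed_version` is a hypothetical helper name introduced here for illustration, not part of Llama Stack.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Package names taken from the %pip line above.
for pkg in ("llama-stack-client", "python-dotenv", "termcolor"):
    print(f"{pkg}: {installed_version(pkg) or 'not installed'}")
```

If the reported version differs from the pin, restarting the kernel (as the pip note suggests) is the first thing to try.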


### 2. Import libraries and connect to the Llama Stack server

- `dotenv` loads connection settings (base URL, etc.) from `.env`
- `LlamaStackClient` is the main entrypoint to the Llama Stack APIs
- We list available models and pick a suitable LLM (prefer vLLM-backed if available).


In [None]:
import os
from pprint import pprint

from dotenv import load_dotenv
from termcolor import cprint
from llama_stack_client import LlamaStackClient

# Load environment variables from .env (LLAMA_BASE_URL, REMOTE_OCP_MCP_URL, etc.)
load_dotenv()

# Base URL of the Llama Stack server
base_url = os.getenv(
    "LLAMA_BASE_URL",
    "http://lsd-llama-milvus-inline-service.llama-stack-demo.svc.cluster.local:8321",
).rstrip("/")

client = LlamaStackClient(base_url=base_url)
print(f"Connected to Llama Stack server: {base_url}")

# List models so we can see what's available
models = list(client.models.list())
print("\nAvailable models:")
for m in models:
    print(" -", getattr(m, "identifier", m))

# Prefer a vLLM-backed LLM if available, otherwise just take the first LLM
llm = next(
    (
        m
        for m in models
        if getattr(m, "model_type", None) == "llm"
        and getattr(m, "provider_id", None) == "vllm-inference"
    ),
    None,
)

if not llm:
    llm = next((m for m in models if getattr(m, "model_type", None) == "llm"), None)

assert llm, "No LLM models available on Llama Stack"

model_id = llm.identifier
print(f"\nUsing model: {model_id}")


INFO:httpx:HTTP Request: GET http://lsd-llama-milvus-inline-service.llama-stack-demo.svc.cluster.local:8321/v1/models "HTTP/1.1 200 OK"


Connected to Llama Stack server: http://lsd-llama-milvus-inline-service.llama-stack-demo.svc.cluster.local:8321

Available models:
 - granite-embedding-125m
 - vllm-inference/llama-4-scout-17b-16e-w4a16
 - sentence-transformers/nomic-ai/nomic-embed-text-v1.5

Using model: vllm-inference/llama-4-scout-17b-16e-w4a16
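
The selection logic in the cell above (prefer a vLLM-backed LLM, otherwise take the first LLM) can also be expressed as a small function, which makes it easy to exercise without a running server. This is only a sketch: `pick_llm` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins whose provider IDs are made up for illustration. Real entries come from `client.models.list()`.

```python
from types import SimpleNamespace
from typing import Any, Iterable, Optional


def pick_llm(models: Iterable[Any], preferred_provider: str = "vllm-inference") -> Optional[Any]:
    """Return an LLM from `models`, preferring `preferred_provider`; None if no LLM exists."""
    llms = [m for m in models if getattr(m, "model_type", None) == "llm"]
    for m in llms:
        if getattr(m, "provider_id", None) == preferred_provider:
            return m
    # No preferred provider found: fall back to the first LLM, if any.
    return llms[0] if llms else None


# Stand-in objects for illustration only (provider IDs are assumptions).
fake_models = [
    SimpleNamespace(identifier="granite-embedding-125m", model_type="embedding", provider_id="example-embeddings"),
    SimpleNamespace(identifier="other/llm", model_type="llm", provider_id="other"),
    SimpleNamespace(
        identifier="vllm-inference/llama-4-scout-17b-16e-w4a16",
        model_type="llm",
        provider_id="vllm-inference",
    ),
]

chosen = pick_llm(fake_models)
print(chosen.identifier)
```

Returning `None` instead of asserting inside the helper leaves it to the caller to decide how to fail when no LLM is registered.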


Note: The legacy initialization is deprecated. Please use the updated initialization above.


### 3. Define the system prompt (generic behaviour)

This is where we tell the model **how** to behave in general.

Here we say:
- You’re a Kubernetes diagnostics assistant
- Use MCP tools instead of guessing
- Explain your reasoning and evidence

In [None]:
model_prompt = """
You are a Kubernetes diagnostics assistant working with a Model Context Protocol (MCP) server.
Your job is to investigate incidents using ONLY the Kubernetes MCP tools and then explain your findings.

Available MCP tools (do NOT invent new ones):
- configuration_view
- events_list
- helm_list
- namespaces_list
- nodes_log
- nodes_stats_summary
- nodes_top
- pods_get
- pods_list
- pods_list_in_namespace
- pods_log
- pods_top
- projects_list
- resources_get
- resources_list

High-level workflow for ANY incident / “something is broken” question:

PHASE 0 – Discover what exists in the project
1. If a Kubernetes namespace is mentioned (e.g. "special-payment-project"), treat it as the target namespace.
2. Call:
   - pods_list_in_namespace(namespace=<ns>)          # inventory pods
   - resources_list(apiVersion="v1",      kind="Service",    namespace=<ns>)   # inventory Services
   - resources_list(apiVersion="apps/v1", kind="Deployment", namespace=<ns>)   # inventory Deployments
   - Optionally, on OpenShift: resources_list(apiVersion="route.openshift.io/v1", kind="Route", namespace=<ns>)

PHASE 1 – Drill into the most relevant workloads
3. From the pod list, pick 1–3 pods whose names look most relevant to the question
   (e.g. contain "api", "frontend", "payment", "checkout").
4. You MUST call pods_log for at least one of those existing pods:
   - Only use pod names that appear EXACTLY in the pods_list_in_namespace output.
   - Never invent pod names like "payment" or "payment-api" if they were not listed.
   - It is allowed (and encouraged) to call pods_log for more than one relevant pod.
5. Optionally:
   - Call events_list(namespace=<ns>) to look for Warning/Error events related to those pods.
   - Use resources_get(...) if you need details for a specific Service or Deployment you already saw in resources_list.

Hard rules:
- You MUST call at least one MCP tool for EVERY answer.
- For incident / error questions, you MUST:
  - Call pods_list_in_namespace(namespace=<ns>) AND
  - Call pods_log(...) for at least one pod that actually exists.
- You MUST NOT talk about “checking logs” unless you have actually called pods_log in this conversation.
- You MUST NOT mention or use tools that are not in the list above (e.g. no services_list_in_namespace).

When you answer, ALWAYS:
- Start by listing which MCP tools you called and with which key arguments.
- Summarise what you observed:
  - Pods (names + status from pods_list_in_namespace)
  - Important log snippets from pods_log (even 1–2 lines is fine)
  - Any notable events or resource issues (from events_list / resources_list)
- Give your best diagnosis based on that evidence.
- End with 2–3 concrete next steps for an SRE/operator.

If tools fail or return nothing useful:
- Explicitly say which tools you tried and what they returned (e.g. “pod not found”, “no events”).
- State clearly that the evidence is inconclusive and what a human should check next.
""".strip()


### 4. Configure the Kubernetes MCP server

We point the model at our **Kubernetes MCP server**, which exposes tools like
`pods_list`, `pods_log`, and `resources_get`.

The URL normally comes from `REMOTE_OCP_MCP_URL` in `.env`.
If that’s not set, we fall back to the in-cluster service URL shown in the code below.

In [None]:
# URL for the Kubernetes MCP server (adjust default to your real MCP route if needed)
ocp_mcp_url = os.getenv(
    "REMOTE_OCP_MCP_URL",
    "http://kubernetes-mcp-server.llama-stack-demo.svc.cluster.local:8080/sse",
).rstrip("/")

print(f"Using Kubernetes MCP server: {ocp_mcp_url}")


Using Kubernetes MCP server: http://kubernetes-mcp-server.llama-stack-demo.svc.cluster.local:8080/sse
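
A malformed MCP URL tends to surface only later as an opaque failure inside the Responses call, so a quick sanity check up front can save time. The sketch below uses only the standard library. `check_mcp_url` is a hypothetical helper, and the expectation that the path ends in `/sse` is an assumption taken from the default URL above, not a rule of the MCP server.

```python
from typing import List
from urllib.parse import urlparse


def check_mcp_url(url: str) -> List[str]:
    """Return a list of problems found in `url`; an empty list means it looks plausible."""
    problems = []
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        problems.append(f"unexpected scheme: {parsed.scheme!r}")
    if not parsed.netloc:
        problems.append("missing host")
    # Assumption: the endpoint path ends in /sse, as in the default above.
    if not parsed.path.rstrip("/").endswith("/sse"):
        problems.append("path does not end in /sse")
    return problems


default_url = "http://kubernetes-mcp-server.llama-stack-demo.svc.cluster.local:8080/sse"
print(check_mcp_url(default_url) or "URL looks plausible")
```

This only checks the shape of the URL. It does not verify that the server is reachable.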


### 5. Build the conversation: system + user messages

The Responses API accepts a list of chat messages as its `input`.

Here we build a simple conversation:
- System message = our generic instructions
- User message = an incident report asking the model to investigate an HTTP 502 error in a specific project


In [None]:
question = (
    "I'm getting Payment failed: HTTP 502 error in the project: special-payment-project, please investigate."
)

messages = [
    {"role": "system", "content": model_prompt},
    {"role": "user", "content": question},
]

cprint("System message:", "yellow")
print(messages[0]["content"])
print()
cprint("User message:", "green")
print(messages[1]["content"])



### 6. Call the Responses API with MCP tools

Now we send the conversation to `/v1/responses`, telling Llama Stack:

- which **model** to use  
- our **messages** (system + user)  
- the **MCP server** whose tools the model may call (the Kubernetes MCP server)  
- `max_infer_iters` to control how many tool-calling / “thinking” steps it can take.

This is where the model can:
1. Read the question  
2. Decide which Kubernetes tools to call via MCP  
3. Inspect services/pods/logs  
4. Return an explanation of issues it found and supporting evidence.


In [None]:
response = client.responses.create(
    model=model_id,
    input=messages,
    tools=[
        {
            "type": "mcp",
            # For this client/server version, MCP config is top-level:
            "server_url": ocp_mcp_url,
            "server_label": "kubernetes-mcp",
            # Optional extras if you want to tighten behaviour later:
            # "allowed_tools": ["pods_list", "pods_log", "resources_get"],
            # "require_approval": "never",
        }
    ],
    instructions=None,      # We already provided the instructions as the system message
    temperature=0.0,        # Greedy decoding (no randomness) for deterministic demos
    max_infer_iters=10,     # How many reasoning / tool-use loops are allowed
    # stream=False by default
)

print("Raw ResponseObject type:", type(response))


INFO:httpx:HTTP Request: POST http://lsd-llama-milvus-inline-service.llama-stack-demo.svc.cluster.local:8321/v1/responses "HTTP/1.1 200 OK"


Raw ResponseObject type: <class 'llama_stack_client.types.response_object.ResponseObject'>
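
The MCP entry in `tools` is a plain dictionary, so it can be built by a small function when the same server is reused across several calls. The sketch below uses only the keys that appear in the cell above, including the two commented-out ones. `build_mcp_tool` is a hypothetical helper, and whether `allowed_tools` and `require_approval` are honoured by this client/server version has not been verified here.

```python
from typing import Any, Dict, List, Optional


def build_mcp_tool(
    server_url: str,
    server_label: str = "kubernetes-mcp",
    allowed_tools: Optional[List[str]] = None,
    require_approval: Optional[str] = None,
) -> Dict[str, Any]:
    """Build the MCP tool entry passed in the `tools` list of a Responses call."""
    tool: Dict[str, Any] = {
        "type": "mcp",
        "server_url": server_url,
        "server_label": server_label,
    }
    # Optional keys are only added when set, mirroring the commented-out lines above.
    if allowed_tools is not None:
        tool["allowed_tools"] = list(allowed_tools)
    if require_approval is not None:
        tool["require_approval"] = require_approval
    return tool


tool = build_mcp_tool(
    "http://kubernetes-mcp-server.llama-stack-demo.svc.cluster.local:8080/sse",
    allowed_tools=["pods_list_in_namespace", "pods_log", "events_list"],
)
print(tool)
```

If the tool list is restricted this way, it should include every tool the system prompt requires (for example `pods_list_in_namespace` and `pods_log`), otherwise the model cannot follow its own hard rules.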


### 7. Show MCP tools used + assistant answer

This helper prints:
- The MCP tools the server advertised
- A concise list of MCP tool calls (name, arguments, and a snippet of output)
- The final assistant answer

This is much easier to read than the full raw JSON structure.


In [None]:
from pprint import pprint
from textwrap import indent

def show_mcp_response(response, max_output_chars: int = 400, show_raw: bool = False):
    """
    Pretty-print MCP tool usage and the assistant's answer
    from a Llama Stack ResponseObject.

    If the assistant returns an empty message, we print a short notice
    instead of dumping the full raw structure (unless show_raw=True).
    """
    data = response.to_dict()

    # --- 1) Show MCP tools discovered (from mcp_list_tools) ---
    mcp_list = [item for item in data.get("output", []) if item.get("type") == "mcp_list_tools"]

    cprint("\n=== MCP tools discovered ===", "yellow")
    if mcp_list:
        tools = mcp_list[0].get("tools", [])
        names = [t.get("name") for t in tools if isinstance(t, dict)]
        print(", ".join(sorted(set(n for n in names if n))) or "(none)")
    else:
        print("(no mcp_list_tools entry)")

    # --- 2) Show actual MCP tool calls (mcp_call entries) ---
    mcp_calls = [item for item in data.get("output", []) if item.get("type") == "mcp_call"]

    cprint("\n=== MCP calls made ===", "yellow")
    if not mcp_calls:
        print("(no MCP tool calls were executed)")
    else:
        for call in mcp_calls:
            name = call.get("name")
            args = call.get("arguments")
            out = call.get("output", "") or ""
            print(f"- {name}({args})")
            if out:
                snippet = out[:max_output_chars]
                print(indent(snippet, "    "))
                if len(out) > max_output_chars:
                    print("    ... [truncated]")
            print()

    # --- 3) Extract assistant's final answer text ---
    cprint("\n=== Assistant answer ===", "cyan")

    # Try the convenience field first
    text = getattr(response, "output_text", None)

    # Fallback: take the text of the last message item that has any
    if not text:
        for item in reversed(data.get("output", [])):
            if item.get("type") != "message":
                continue
            content = item.get("content") or []
            if isinstance(content, str):
                # Plain-string content (handled defensively)
                text = content
            else:
                text = next(
                    (
                        part.get("text", "")
                        for part in content
                        if isinstance(part, dict) and part.get("type") == "output_text"
                    ),
                    "",
                )
            if text:
                break

    if text and str(text).strip():
        print(text)
    else:
        print("(Assistant returned an empty message – no natural-language answer.)")
        if show_raw:
            print("\n--- Raw response (debug) ---")
            pprint(data)

# Use this instead of pprint(response.to_dict())
show_mcp_response(response)


=== MCP tools discovered ===
configuration_view, events_list, helm_list, namespaces_list, nodes_log, nodes_stats_summary, nodes_top, pods_get, pods_list, pods_list_in_namespace, pods_log, pods_top, projects_list, resources_get, resources_list

=== MCP calls made ===
- pods_list_in_namespace({"namespace": "special-payment-project"})
    NAMESPACE                 APIVERSION   KIND   NAME                                READY   STATUS    RESTARTS   AGE    IP            NODE                                         NOMINATED NODE   READINESS GATES   LABELS
    special-payment-project   v1           Pod    checkout-api-84bff5f68d-2p775       1/1     Running   0          155m   10.128.2.96   ip-10-0-117-105.us-east-2.compute.internal   <non
    ... [truncated]

- resources_list({"apiVersion": "v1", "kind": "Service", "namespace": "special-payment-project"})
    NAMESPACE                 APIVERSION   KIND      NAME                TYPE           CLUSTER-IP       EXTERNAL-IP   
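
The system prompt's hard rules require `pods_list_in_namespace` and `pods_log` for incident questions, and the `mcp_call` items make it possible to check that programmatically rather than by eye. The following is a sketch under the assumption that the response dictionary has the shape used by `show_mcp_response` (an `output` list whose `mcp_call` items carry a `name`). `missing_required_calls` is a hypothetical helper and the sample data is a shortened stand-in.

```python
from typing import Any, Dict, List, Sequence

REQUIRED_TOOLS = ("pods_list_in_namespace", "pods_log")


def called_tool_names(data: Dict[str, Any]) -> List[str]:
    """Return the names of all MCP tool calls, in the order they were made."""
    return [item.get("name") for item in data.get("output", []) if item.get("type") == "mcp_call"]


def missing_required_calls(data: Dict[str, Any], required: Sequence[str] = REQUIRED_TOOLS) -> List[str]:
    """Return the required tools that were never called."""
    called = set(called_tool_names(data))
    return [tool for tool in required if tool not in called]


# Shortened stand-in for response.to_dict(), for illustration only.
sample = {
    "output": [
        {"type": "mcp_list_tools", "tools": []},
        {
            "type": "mcp_call",
            "name": "pods_list_in_namespace",
            "arguments": '{"namespace": "special-payment-project"}',
        },
        {"type": "message", "content": []},
    ]
}

print("Called:", called_tool_names(sample))
print("Missing required calls:", missing_required_calls(sample))
```

A non-empty result is a signal that the answer may describe logs the model never actually fetched.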

### 8. (Optional) Inspect the raw response structure for debugging

For debugging and for “show your workings” in the demo, we can inspect the
raw `ResponseObject`.

Here we can see:
- the `output` items, starting with the tool list advertised by the MCP server
- further down, the tool calls and tool results coming from MCP, followed by the assistant message.
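
Because the full dump is long, a compact count of the item types in `output` can help with orientation before scrolling through it. This is a minimal sketch that assumes the dictionary shape already used in this notebook (an `output` list of items with a `type` key). `summarize_output` is a hypothetical helper and the sample data is a stand-in.

```python
from collections import Counter
from typing import Any, Dict


def summarize_output(data: Dict[str, Any]) -> Counter:
    """Count the output items of a response dictionary by their `type`."""
    return Counter(item.get("type", "unknown") for item in data.get("output", []))


# Stand-in for response.to_dict(), for illustration only.
sample_response = {
    "output": [
        {"type": "mcp_list_tools"},
        {"type": "mcp_call"},
        {"type": "mcp_call"},
        {"type": "message"},
    ]
}

for item_type, count in summarize_output(sample_response).items():
    print(f"{item_type}: {count}")
```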


In [None]:
pprint(response.to_dict())

{'created_at': 1764263943,
 'error': None,
 'id': 'resp_6900430e-90ae-4de8-a9e9-36a895210a96',
 'model': 'vllm-inference/llama-4-scout-17b-16e-w4a16',
 'object': 'response',
 'output': [{'id': 'mcp_list_49d676c9-04e9-4d84-a8d3-d06b570d054b',
             'server_label': 'kubernetes-mcp',
             'tools': [{'description': 'Get the current Kubernetes '
                                       'configuration content as a kubeconfig '
                                       'YAML',
                        'input_schema': {'properties': {'minified': {'description': 'Return '
                                                                                    'a '
                                                                                    'minified '
                                                                                    'version '
                                                                                    'of '
                                                   