# RAG Pipeline Demo

This notebook demonstrates the end-to-end **Retrieval-Augmented Generation (RAG)** pipeline
deployed on our Kubernetes AI Stack. The pipeline consists of three stages:

1. **Ingest** -- Documents are chunked, embedded via LocalAI, and stored in Qdrant.
2. **Retrieve** -- User queries are embedded and matched against stored vectors.
3. **Generate** -- An LLM produces a grounded answer using the retrieved context.

**Architecture:**
```
User Document --> POST /ingest --> Chunk --> Embed (LocalAI) --> Store (Qdrant)
User Query   --> POST /query  --> Embed --> Retrieve --> Re-rank --> Generate (LLM)
```

> **Prerequisites:** The RAG pipeline service, LocalAI, and Qdrant must be running
> in the `ai-stack` namespace. Use `kubectl get pods -n ai-stack` to verify.

In [None]:
"""
Import Dependencies
-------------------
We use the `requests` library to interact with the RAG pipeline REST API,
`openai` for direct LocalAI interaction, and `qdrant_client` for
inspecting the vector store directly when needed.
"""

import json
import time
from pprint import pprint

import requests

# Optional: direct access to the underlying services
try:
    import openai
    print("openai client available")
except ImportError:
    print("openai not installed -- direct LLM access unavailable")

try:
    from qdrant_client import QdrantClient
    print("qdrant_client available")
except ImportError:
    print("qdrant_client not installed -- direct vector DB access unavailable")

In [None]:
# ---------------------------------------------------------------------------
# Configuration
# ---------------------------------------------------------------------------
# When running inside the cluster, services are reachable via their
# Kubernetes DNS names. For local development, use kubectl port-forward:
#   kubectl port-forward -n ai-stack svc/rag-pipeline 8000:8000
#   kubectl port-forward -n ai-stack svc/qdrant 6333:6333

# RAG Pipeline API endpoint
RAG_API_URL = "http://rag-pipeline.ai-stack.svc.cluster.local:8000"

# For local development (uncomment if using port-forward):
# RAG_API_URL = "http://localhost:8000"

# Qdrant direct access (for inspection)
QDRANT_URL = "http://qdrant.ai-stack.svc.cluster.local:6333"

# For local development (uncomment if using port-forward):
# QDRANT_URL = "http://localhost:6333"

# Collection name (must match the pipeline configuration)
COLLECTION_NAME = "knowledge_base"

# Embedding and LLM model names (as configured in the pipeline)
EMBEDDING_MODEL = "text-embedding-ada-002"
LLM_MODEL = "gpt-3.5-turbo"

print(f"RAG API:    {RAG_API_URL}")
print(f"Qdrant:     {QDRANT_URL}")
print(f"Collection: {COLLECTION_NAME}")
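
Rather than commenting lines in and out, the choice between in-cluster and port-forwarded URLs can be made at runtime. The sketch below assumes the notebook is either running in a pod (where Kubernetes normally sets the `KUBERNETES_SERVICE_HOST` environment variable) or on a workstation with the port-forwards above active. `resolve_service_url` is a hypothetical helper, not part of the pipeline.

```python
import os


def resolve_service_url(in_cluster_url: str, local_url: str, env=None) -> str:
    """Return the in-cluster URL when running inside a pod, else the local one.

    Hypothetical convenience helper for this notebook only.
    """
    env = os.environ if env is None else env
    # Kubernetes normally injects KUBERNETES_SERVICE_HOST into containers it starts,
    # so its presence is used here as a heuristic for "running in the cluster".
    if env.get("KUBERNETES_SERVICE_HOST"):
        return in_cluster_url
    return local_url


rag_api_url = resolve_service_url(
    "http://rag-pipeline.ai-stack.svc.cluster.local:8000",
    "http://localhost:8000",
)
print(f"Resolved RAG API URL: {rag_api_url}")
```

The `env` parameter exists only so the heuristic can be exercised without modifying the real environment.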

## Document Ingestion

The `/ingest` endpoint accepts a document as raw text, along with optional metadata.
The pipeline then:

1. **Chunks** the text into overlapping segments (default: 512 characters with a 50-character overlap)
2. **Embeds** each chunk using the configured embedding model via LocalAI
3. **Stores** the vectors and metadata in the Qdrant collection
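
To make the chunking step concrete, here is a minimal sketch of fixed-size character chunking with overlap. It assumes a plain sliding window over characters; the pipeline's actual chunker is not shown in this notebook and may split on sentence or token boundaries instead. The name `chunk_text` is illustrative.

```python
def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 50) -> list[str]:
    """Split text into character windows that overlap by chunk_overlap characters.

    Illustrative sketch only; not the pipeline's implementation.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


demo_text = "".join(chr(97 + i % 26) for i in range(1000))
demo_chunks = chunk_text(demo_text)
print([len(c) for c in demo_chunks])
```

With these defaults the window advances 462 characters at a time, so each chunk begins with the last 50 characters of the previous one.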

Let's ingest a sample Kubernetes knowledge document.

In [None]:
# ---------------------------------------------------------------------------
# Ingest a sample document via POST /ingest
# ---------------------------------------------------------------------------

# A realistic Kubernetes knowledge document to ingest
sample_document = """
Kubernetes (K8s) is an open-source container orchestration platform originally
developed by Google. It automates the deployment, scaling, and management of
containerised applications.

Key Concepts:
- Pod: The smallest deployable unit, consisting of one or more containers that
  share networking and storage. Pods are ephemeral by design.
- Deployment: A controller that manages ReplicaSets and provides declarative
  updates for Pods. It supports rolling updates and rollbacks.
- Service: An abstraction that defines a logical set of Pods and a policy for
  accessing them. Types include ClusterIP, NodePort, and LoadBalancer.
- ConfigMap and Secret: Objects for managing configuration data. ConfigMaps
  store non-sensitive data, while Secrets hold sensitive values, which are
  base64-encoded (not encrypted) by default.
- PersistentVolumeClaim (PVC): A request for storage by a user. PVCs are
  bound to PersistentVolumes and provide storage that persists beyond pod
  lifecycle.
- Horizontal Pod Autoscaler (HPA): Automatically scales the number of pod
  replicas based on observed CPU utilisation, memory usage, or custom metrics.

Networking:
Kubernetes uses a flat networking model where every Pod gets its own IP address.
Services provide stable endpoints and load balancing across Pod replicas. Ingress
controllers manage external HTTP/HTTPS traffic routing.

Storage:
The Container Storage Interface (CSI) allows Kubernetes to support various storage
backends including local disks, NFS, Ceph, and cloud provider storage (EBS, GCE PD).
StatefulSets provide stable storage and network identities for stateful applications.
"""

# Prepare the ingest request payload
ingest_payload = {
    "text": sample_document,
    "metadata": {
        "source": "kubernetes-knowledge-base",
        "category": "infrastructure",
        "language": "en",
    },
    "collection_name": COLLECTION_NAME,
    "chunk_size": 512,
    "chunk_overlap": 50,
}

print("Ingesting document...")
print(f"  Document length: {len(sample_document)} characters")
print(f"  Collection:      {COLLECTION_NAME}")
print(f"  Chunk size:      {ingest_payload['chunk_size']}")
print(f"  Chunk overlap:   {ingest_payload['chunk_overlap']}")

try:
    response = requests.post(
        f"{RAG_API_URL}/ingest",
        json=ingest_payload,
        timeout=60,
    )
    response.raise_for_status()
    result = response.json()

    print("\nIngestion successful!")
    print(f"  Document ID:     {result['document_id']}")
    print(f"  Chunks created:  {result['chunks_created']}")
    print("  Latency:")
    for stage, ms in result["latency_ms"].items():
        print(f"    {stage:>12}: {ms:.2f} ms")

except requests.exceptions.ConnectionError:
    print("\nConnection failed. The RAG pipeline service is not reachable.")
    print("Make sure the service is running: kubectl get pods -n ai-stack")
    # Use a simulated result for demonstration
    result = {
        "document_id": "a1b2c3d4e5f67890",
        "chunks_created": 5,
        "collection_name": COLLECTION_NAME,
        "latency_ms": {
            "chunking": 1.23,
            "embedding": 245.67,
            "storage": 18.45,
        },
    }
    print("\n[Simulated result for demonstration]")
    pprint(result)

## Querying the Pipeline

Now that we have ingested a document, we can query the RAG pipeline. The `/query` endpoint:

1. **Embeds** the user's question using the same embedding model
2. **Retrieves** the top-K most similar chunks from Qdrant (default: K=10)
3. **Re-ranks** candidates using a combined vector + lexical similarity score
4. **Generates** an answer via the LLM, grounded in the top-N re-ranked chunks (default: N=3)

The response includes the generated answer, source chunks with scores, and per-stage latency.
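
The re-ranking step can be illustrated with a small sketch. It assumes the combined score is a weighted sum of the vector score and a token-overlap score; the weight `alpha`, the overlap measure, and the function names are assumptions made for illustration, since the pipeline's actual scoring formula is not shown in this notebook.

```python
import re


def lexical_overlap(query: str, text: str) -> float:
    """Fraction of distinct query tokens that also appear in the text."""
    query_tokens = set(re.findall(r"\w+", query.lower()))
    text_tokens = set(re.findall(r"\w+", text.lower()))
    if not query_tokens:
        return 0.0
    return len(query_tokens & text_tokens) / len(query_tokens)


def rerank(query: str, candidates: list[dict], top_n: int = 3, alpha: float = 0.7) -> list[dict]:
    """Re-score candidates as alpha * vector score + (1 - alpha) * lexical overlap.

    Assumed formula for illustration; returns the top_n candidates by combined score.
    """
    scored = [
        {**c, "rerank_score": alpha * c["score"] + (1 - alpha) * lexical_overlap(query, c["text"])}
        for c in candidates
    ]
    scored.sort(key=lambda c: c["rerank_score"], reverse=True)
    return scored[:top_n]


candidates = [
    {"text": "Kubernetes uses a flat networking model.", "score": 0.80},
    {"text": "A Pod is the smallest deployable unit.", "score": 0.75},
]
top = rerank("What is a Pod?", candidates, top_n=1)
print(top[0]["text"], round(top[0]["rerank_score"], 3))
```

In this toy example the second candidate has the lower vector score but shares more query tokens, so it moves to first place after re-ranking.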

In [None]:
# ---------------------------------------------------------------------------
# Query the RAG pipeline via POST /query
# ---------------------------------------------------------------------------

queries = [
    "What is a Kubernetes Pod and how does it relate to containers?",
    "How does Horizontal Pod Autoscaler work?",
    "Explain the difference between ConfigMaps and Secrets in Kubernetes.",
]

query_results = []

for i, query_text in enumerate(queries, 1):
    print(f"\n{'='*70}")
    print(f"Query {i}: {query_text}")
    print('='*70)

    query_payload = {
        "query": query_text,
        "collection_name": COLLECTION_NAME,
        "top_k": 10,       # Number of initial retrieval candidates
        "top_n": 3,         # Number of chunks after re-ranking
        "include_sources": True,
    }

    try:
        response = requests.post(
            f"{RAG_API_URL}/query",
            json=query_payload,
            timeout=120,
        )
        response.raise_for_status()
        qresult = response.json()

        print(f"\nAnswer:\n{qresult['answer']}")
        print(f"\nSources ({len(qresult['sources'])} chunks):")
        for j, src in enumerate(qresult["sources"], 1):
            print(f"  [{j}] score={src['score']:.4f} rerank={src['rerank_score']:.4f}")
            print(f"      {src['text'][:120]}...")

        print("\nLatency:")
        for stage, ms in qresult["latency_ms"].items():
            print(f"  {stage:>18}: {ms:.2f} ms")

        query_results.append(qresult)

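    except requests.exceptions.ConnectionError:
        print("\nConnection failed -- using simulated result for demonstration.")
        # NOTE: this placeholder answer (about Pods) is reused for every query,
        # so it only matches the first question.
        simulated = {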
            "answer": (
                "A Kubernetes Pod is the smallest deployable unit in Kubernetes. "
                "It consists of one or more containers that share networking and storage. "
                "Pods are ephemeral by design and are managed by higher-level controllers "
                "like Deployments."
            ),
            "sources": [
                {"text": "Pod: The smallest deployable unit...", "score": 0.92, "rerank_score": 0.88},
                {"text": "Kubernetes uses a flat networking model...", "score": 0.78, "rerank_score": 0.71},
            ],
            "latency_ms": {
                "query_embedding": 45.2,
                "retrieval": 12.8,
                "reranking": 0.5,
                "generation": 2340.1,
            },
        }
        print(f"\n[Simulated] Answer:\n{simulated['answer']}")
        query_results.append(simulated)

## Results Analysis

Let's analyse pipeline performance from three angles: the latency breakdown across
pipeline stages, retrieval quality (vector scores vs. re-rank scores), and a
summary of each generated answer.

In [None]:
# ---------------------------------------------------------------------------
# Display results in a structured format
# ---------------------------------------------------------------------------

def display_latency_table(results: list[dict]) -> None:
    """Print a formatted table of pipeline latency across all queries."""
    print("\nPipeline Latency Breakdown (ms)")
    print("-" * 72)
    header = f"{'Stage':<20}"
    for i in range(len(results)):
        header += f"  {'Query ' + str(i+1):>12}"
    print(header)
    print("-" * 72)

    # Collect all stages across all results
    all_stages = []
    for r in results:
        for stage in r.get("latency_ms", {}):
            if stage not in all_stages:
                all_stages.append(stage)

    for stage in all_stages:
        row = f"{stage:<20}"
        for r in results:
            val = r.get("latency_ms", {}).get(stage, 0)
            row += f"  {val:>12.2f}"
        print(row)

    # Total row
    print("-" * 72)
    row = f"{'TOTAL':<20}"
    for r in results:
        total = sum(r.get("latency_ms", {}).values())
        row += f"  {total:>12.2f}"
    print(row)
    print("-" * 72)


def display_retrieval_quality(results: list[dict]) -> None:
    """Print retrieval score statistics for each query."""
    print("\nRetrieval Quality (Vector Score / Re-rank Score)")
    print("-" * 60)
    for i, r in enumerate(results, 1):
        sources = r.get("sources", [])
        if sources:
            avg_score = sum(s["score"] for s in sources) / len(sources)
            avg_rerank = sum(s["rerank_score"] for s in sources) / len(sources)
            print(f"  Query {i}: {len(sources)} sources | "
                  f"avg_vector={avg_score:.4f} | avg_rerank={avg_rerank:.4f}")
        else:
            print(f"  Query {i}: No sources returned")


def display_answer_summary(results: list[dict], queries: list[str]) -> None:
    """Print a concise summary of each answer."""
    print("\nAnswer Summary")
    print("=" * 70)
    for i, (q, r) in enumerate(zip(queries, results), 1):
        answer = r.get("answer", "N/A")
        # Truncate long answers for readability
        if len(answer) > 200:
            answer = answer[:200] + "..."
        print(f"\n  Q{i}: {q}")
        print(f"  A{i}: {answer}")
    print()


# Run the analysis
display_latency_table(query_results)
display_retrieval_quality(query_results)
display_answer_summary(query_results, queries)

## Conclusions

### Pipeline Performance
- **Chunking** is nearly instantaneous (< 2 ms) because it is a simple string operation.
- **Embedding** latency varies the most; it depends on the LocalAI model and batch size.
- **Retrieval** from Qdrant is fast (typically < 20 ms) thanks to HNSW indexing.
- **Re-ranking** adds minimal overhead (< 1 ms) with the lexical + vector hybrid approach.
- **Generation** dominates total latency because the LLM must produce a multi-sentence response token by token.
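
To quantify how much each stage contributes, the per-stage latencies can be converted into shares of the total. The sketch below uses the simulated fallback values from the query cell, so the percentages are illustrative rather than measured; `latency_shares` is a hypothetical helper.

```python
def latency_shares(latency_ms: dict[str, float]) -> dict[str, float]:
    """Return each stage's fraction of the total latency (0.0 for all stages if total is 0)."""
    total = sum(latency_ms.values())
    if total == 0:
        return {stage: 0.0 for stage in latency_ms}
    return {stage: ms / total for stage, ms in latency_ms.items()}


# Simulated values copied from the fallback result above; not real measurements.
simulated_latency = {
    "query_embedding": 45.2,
    "retrieval": 12.8,
    "reranking": 0.5,
    "generation": 2340.1,
}

shares = latency_shares(simulated_latency)
for stage, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{stage:>18}: {share:6.1%}")
```

The same function can be applied to `qresult["latency_ms"]` from a live response to check whether generation dominates in a real deployment as well.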

### Retrieval Quality
- The two-stage retrieve-then-rerank approach improves result relevance by combining
  dense vector similarity with lexical overlap signals.
- Higher `top_k` values increase recall at the cost of more re-ranking computation.
- The `top_n` parameter controls how many re-ranked chunks are passed to the LLM as context.

### Next Steps
- Replace the lightweight re-ranker with a cross-encoder model (e.g., `ms-marco-MiniLM`) for higher accuracy.
- Experiment with hybrid embedding strategies (dense + sparse) via the `EMBEDDING_STRATEGY` config.
- Add document metadata filtering to scope retrieval to specific knowledge domains.
- Monitor pipeline latency in production using the `/health` endpoint and Prometheus metrics.