# 1. Training Overview

* Session length: Half day (theory included)

* Skill level: Intermediate Python (familiar with transformers)

* Goal: Students will implement an Agentic RAG loop: an LLM that decides when to search, retrieves knowledge, and answers.

# 2. Learning Outcomes

By the end of the lab, students can:



*   Explain RAG and its limitations.
*   Understand what makes an AI system agentic.
*   Implement an agentic RAG using a small open-source model and free resources in Colab.
*   Visualize the reasoning steps (plan → retrieve → generate → reflect).

# 3. Libraries Installation

In [12]:
!pip install -q google-generativeai sentence-transformers numpy scikit-learn rich torch

# 4. Import libraries


In [13]:
import os, time, torch, numpy as np
import google.generativeai as genai
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from rich.console import Console

console = Console()

# 5. Setup

In [14]:
# --- Device Setup ---
device = "cuda" if torch.cuda.is_available() else "cpu"
console.print(f"[bold blue]Device detected:[/bold blue] {device}")

# Gemini API key (the model runs on Google infrastructure).
# Never commit a real key to a notebook; paste your own key here or set it in the environment beforehand.
os.environ["GOOGLE_API_KEY"] = "YOUR_GEMINI_API_KEY"  # Replace with your actual key
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

# Initialize Gemini 2.5 Flash (cloud execution)
model = genai.GenerativeModel("gemini-2.5-flash")
console.print("[green]Gemini 2.5 Flash initialized successfully (runs on cloud).[/green]")
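
A hard-coded key is easy to leak when a notebook is shared. The sketch below shows one way to read the key from the environment instead, with an optional hidden prompt. The helper name `load_api_key` and its `interactive` flag are illustrative assumptions, not part of the original lab.

```python
import os
from getpass import getpass


def load_api_key(env_var="GOOGLE_API_KEY", interactive=False):
    """Return the API key from the environment, optionally prompting for it.

    Hypothetical helper for this lab: the name and parameters are assumptions.
    """
    key = os.environ.get(env_var, "").strip()
    if not key and interactive:
        # getpass hides the typed value so it does not appear in cell output.
        key = getpass(f"Enter {env_var}: ").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set; export it or pass interactive=True.")
    return key
```

In the setup cell, this could be used as `genai.configure(api_key=load_api_key(interactive=True))`.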

# 6. Create Knowledge base / Synthetic data

In [15]:
knowledge_base = [
    "RAG (Retrieval-Augmented Generation) is an AI framework that combines an LLM with an external knowledge base to ground the model in current, accurate facts.",
    "Agentic AI refers to a system where an LLM is given the ability to reason, plan, and choose which tool or step to execute next (like search, code execution, or final answer).",
    "Agentic RAG specifically means the agent decides when and what to retrieve from the knowledge base, often using a 'Thought' or 'Decision' step.",
    "The MiniLM model, 'all-MiniLM-L6-v2', is a highly efficient sentence-transformer, known for being fast with a small size (384 dimensions) while offering good performance.",
    "Gemini 2.5 Flash is Google's most balanced and fast model, suitable for both the decision-making and final generation steps in this agentic workflow.",
    "A typical RAG pipeline is a fixed sequence (Query -> Retrieve -> Generate), whereas Agentic RAG is a dynamic loop (Query -> Decide -> Act -> Generate).",
    "Vector databases like FAISS or ChromaDB are often used to store the embeddings generated by models like MiniLM for efficient similarity search.",
]

# 7. Embedding model

In [21]:
console.print("[bold blue]Embedding Knowledge Base with MiniLM on GPU (if available)...[/bold blue]")
embedder = SentenceTransformer("all-MiniLM-L6-v2", device=device)

# Encode the knowledge base (list of text chunks) into dense vector embeddings
# using the MiniLM model. Each text becomes a numerical representation that captures its meaning.
corpus_embeddings = embedder.encode(knowledge_base, convert_to_tensor=True, device=device)
console.print(f"[green]Knowledge base embedded successfully on {device.upper()}.[/green]")
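
To make the idea of comparing embeddings concrete, here is a minimal pure-Python sketch of cosine similarity and top-k ranking. The 3-dimensional vectors are made up for illustration (real MiniLM embeddings have 384 dimensions), and the function names `cosine` and `rank_by_similarity` are assumptions, not part of the lab code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs, top_k=2):
    """Return (index, score) pairs for the top_k most similar documents."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy "embeddings": documents 0 and 2 point in a similar direction to the query.
toy_docs = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.3, 0.0],
]
toy_query = [0.8, 0.2, 0.0]

top = rank_by_similarity(toy_query, toy_docs, top_k=2)
print([(i, round(score, 3)) for i, score in top])
```

The retrieval helper in the next section does the same thing with `torch` tensors, which is much faster for large corpora.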


# 8. Helper Functions

In [22]:
def retrieve_context(query, top_k=2):
    console.print("[dim]Action: Searching Knowledge Base...[/dim]")
    #    This converts the query (a sentence) into a 384-dimensional tensor,
    #    so we can compare it numerically with all knowledge base embeddings.
    query_emb = embedder.encode([query], convert_to_tensor=True, device=device)

    #    Cosine similarity measures how "aligned" two vectors are in meaning.
    #    Higher similarity → closer semantic relationship.
    sims = torch.nn.functional.cosine_similarity(query_emb, corpus_embeddings)
    #    torch.topk() efficiently retrieves the highest similarity scores.
    #    We move results to CPU and convert to NumPy for indexing and display.
    top_idx = torch.topk(sims, top_k).indices.cpu().numpy()
    #   Using the indices from above, we pull the corresponding text from the knowledge base.
    docs = [knowledge_base[i] for i in top_idx]
    console.print(f"[dim]Top similarity score: {sims[top_idx[0]].item():.4f}[/dim]")
    return "\n".join(docs)


# This function sends a text prompt to the Gemini model, safely collects its generated response,
# and returns the output text, with built-in error handling in case the model returns nothing or fails.
def gemini_generate(prompt, max_output_tokens=300, temperature=0.5):
    """Safe Gemini text generation with fallback handling."""
    try:
        response = model.generate_content(
            prompt,
            generation_config={"max_output_tokens": max_output_tokens, "temperature": temperature},
        )
        if not hasattr(response, "candidates") or not response.candidates:
            return "[Gemini returned no candidates.]"
        parts = []
        for c in response.candidates:
            if hasattr(c, "content") and hasattr(c.content, "parts"):
                for p in c.content.parts:
                    if hasattr(p, "text"):
                        parts.append(p.text)
        return " ".join(parts).strip() if parts else "[Gemini produced no text.]"
    except Exception as e:
        return f"[Error: {e}]"


# This function asks Gemini to decide whether the agent should SEARCH for more context
# or ANSWER directly, based on how complete or relevant the current context is,
# then returns both the final decision and the model's raw reasoning text.
def get_decision(query, context):
    """Decide SEARCH or ANSWER using adaptive reasoning."""
    decision_prompt = f"""
You are an AI agent in an Agentic RAG system.

You can do one of two actions:
- SEARCH → if you need more information.
- ANSWER → if you already have enough context to give the final answer.

Question: {query}
Current context: {context}

If the context already seems to contain relevant facts (e.g. mentions of Agentic RAG or RAG),
then you should ANSWER instead of SEARCH.

Respond in one sentence that includes the word SEARCH or ANSWER.
"""
    decision_raw = gemini_generate(decision_prompt, max_output_tokens=100, temperature=0.6)
    if "ANSWER" in decision_raw.upper():
        return "ANSWER", decision_raw
    elif "SEARCH" in decision_raw.upper():
        return "SEARCH", decision_raw
    else:
        if "RAG" in context.upper():
            return "ANSWER", decision_raw
        return "SEARCH", decision_raw
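
Substring checks on a free-text reply can misfire when the model mentions both keywords, as in "I should SEARCH before I can ANSWER". The sketch below shows one possible parsing policy: match whole words only and let the first keyword mentioned win. `parse_decision` is a hypothetical helper, and "first mention wins" is an assumption rather than the lab's rule.

```python
import re


def parse_decision(reply, default="SEARCH"):
    """Map a free-text model reply to 'SEARCH' or 'ANSWER'.

    If both keywords appear, the one mentioned first wins; if neither
    appears as a whole word, `default` is returned.
    """
    match = re.search(r"\b(SEARCH|ANSWER)\b", reply.upper())
    return match.group(1) if match else default


print(parse_decision("I should SEARCH before I can answer."))
print(parse_decision("Answer: the context already covers this."))
print(parse_decision("I am still researching."))
```

The word-boundary match means "researching" does not count as SEARCH, so the third call falls back to the default.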

# 9. Agentic RAG

In [23]:
# This function runs the full Agentic RAG reasoning loop: it repeatedly asks Gemini to decide
# whether to SEARCH for more context or ANSWER using what's already known, retrieves information
# or generates the final answer accordingly, and prints each reasoning step for clarity.

def run_agentic_rag(query, max_steps=3):
    console.print("\n[bold magenta]*** AGENTIC RAG PROCESS STARTING ***[/bold magenta]")
    console.print(f"[bold cyan]User Query:[/bold cyan] {query}")
    context = "No information retrieved yet."

    for step in range(1, max_steps + 1):
        console.print(f"\n[bold yellow]--- STEP {step} ---[/bold yellow]")
        decision, decision_raw = get_decision(query, context)
        console.print(f"[bold green]Agent Decision:[/bold green] {decision}")

        if decision == "SEARCH":
            context = retrieve_context(query)
            console.print("[bold]Retrieved Chunks:[/bold]")
            console.print(f"RETRIEVED CONTEXT:\n{context}")
            # Keep looping while steps remain. On the last step, fall through
            # so the run always ends with a final answer.
            if step < max_steps:
                continue

        answer_prompt = f"""
You are completing the final step of an Agentic RAG process.

Use the retrieved context to give a clear, factual final answer.

Context:
{context}

Question:
{query}

Answer in 2-3 sentences.
"""
        answer = gemini_generate(answer_prompt, max_output_tokens=300, temperature=0.4)
        console.print(f"[bold magenta]Final Answer:[/bold magenta] {answer}")
        break

    console.print("[bold magenta]*** AGENTIC RAG PROCESS FINISHED ***[/bold magenta]")
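
The control flow of the loop can be exercised without calling Gemini by injecting stub functions. The sketch below is a simplified stand-in for the notebook function, not a copy of it: `run_loop` and its callable parameters are assumptions made for illustration. It also records a trace of decisions, which is one simple way to visualize the reasoning steps.

```python
def run_loop(query, decide, retrieve, answer, max_steps=3):
    """Run a SEARCH/ANSWER loop and return (final_answer, trace).

    decide(query, context) -> "SEARCH" or "ANSWER"
    retrieve(query)        -> context string
    answer(query, context) -> answer string
    """
    context = ""
    trace = []
    for step in range(1, max_steps + 1):
        decision = decide(query, context)
        trace.append((step, decision))
        if decision == "SEARCH":
            context = retrieve(query)
            if step < max_steps:
                continue  # let the agent inspect the new context
        # Reached on ANSWER, or when the step budget runs out after a search.
        return answer(query, context), trace
    return answer(query, context), trace  # only reached if max_steps < 1


# Stub agent that always wants to search, to exercise the step limit.
final, trace = run_loop(
    "What is RAG?",
    decide=lambda q, ctx: "SEARCH",
    retrieve=lambda q: "RAG combines an LLM with retrieval.",
    answer=lambda q, ctx: f"Based on: {ctx}",
    max_steps=2,
)
print(trace)
print(final)
```

Even with an agent that never chooses ANSWER, the loop stops at `max_steps` and still produces an answer from the last retrieved context.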


In [24]:
run_agentic_rag("What makes Agentic RAG different from the normal, standard RAG process?")

In [25]:
run_agentic_rag("What is MiniLM known for?")

In [26]:
run_agentic_rag("What is the main difference between Agentic RAG and a typical RAG pipeline?")

# 10. Student Exercises

*   Modify the corpus: add 5-10 knowledge sentences of your own.
*   Change the model: try microsoft/phi-2 or Qwen/Qwen2.5-3B-Instruct.
*   Add reflection: ask the model to judge its own answer, for example: "Was your answer sufficient? If not, what else do you need?"
*   Visualization: draw the reasoning loop (plan → retrieve → generate → reflect).
*   Optional (advanced): connect to Wikipedia via the `wikipedia` API for live retrieval.
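
As a starting point for the reflection exercise, here is a minimal sketch of a self-check step. The function `reflect`, the SUFFICIENT/INSUFFICIENT reply format, and the `generate` parameter are all assumptions for illustration; `generate` can be any callable that takes a prompt string and returns text, such as the notebook's `gemini_generate`.

```python
def reflect(question, answer, generate):
    """Ask the model to judge its own answer.

    Returns (is_sufficient, raw_reply).
    """
    prompt = (
        "You are reviewing your own answer.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Was your answer sufficient? If not, what else do you need?\n"
        "Start your reply with SUFFICIENT or INSUFFICIENT."
    )
    reply = generate(prompt)
    # startswith avoids matching the "SUFFICIENT" inside "INSUFFICIENT".
    is_sufficient = reply.strip().upper().startswith("SUFFICIENT")
    return is_sufficient, reply


# Stub model for a dry run without any API call.
ok, raw = reflect(
    "What is MiniLM known for?",
    "It is a fast, small sentence-transformer.",
    generate=lambda prompt: "INSUFFICIENT: mention the 384-dimensional embeddings.",
)
print(ok, raw)
```

When the verdict is insufficient, the raw reply could be fed back into the loop as a hint for another retrieval step.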