# RAG Verification Agent

This notebook implements a RAG verification agent: it retrieves passages from a PDF report (`Stanford.pdf`), then verifies and supplements them with a Tavily web search before generating an answer with Gemini.

In [17]:
import os
from dotenv import load_dotenv
import chromadb
from pypdf import PdfReader
from tavily import TavilyClient
from google import genai

# Load environment variables
load_dotenv()

GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
if not GOOGLE_API_KEY:
    GOOGLE_API_KEY = os.getenv("GEMINI_API_KEY")

TAVILY_API_KEY = os.getenv("TAVILY_API_KEY")

if not GOOGLE_API_KEY:
    raise ValueError("GOOGLE_API_KEY or GEMINI_API_KEY not found in environment variables")
if not TAVILY_API_KEY:
    raise ValueError("TAVILY_API_KEY not found in environment variables")

## 1. Initialize Clients

In [18]:
# Initialize Gemini Client
client = genai.Client(api_key=GOOGLE_API_KEY)

# Initialize Tavily Client
tavily_client = TavilyClient(api_key=TAVILY_API_KEY)

# Initialize ChromaDB Client
chroma_client = chromadb.PersistentClient(path="./chroma_db")
collection_name = "pdf_verification"

# Reset collection if needed for fresh ingestion
try:
    chroma_client.delete_collection(name=collection_name)
except Exception:
    pass  # Collection did not exist yet (or could not be deleted)

collection = chroma_client.create_collection(name=collection_name)

## 2. Ingest PDF

In [None]:
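def extract_text_from_pdf(pdf_path):
    """Return the concatenated text of every page in the PDF."""
    reader = PdfReader(pdf_path)
    text = ""
    for page in reader.pages:
        # Guard against pages that yield no text (e.g. scanned images)
        text += page.extract_text() or ""
    return text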

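def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into chunk_size-character windows that overlap by `overlap` characters."""
    if overlap >= chunk_size:
        # A non-positive step would make the loop below run forever
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start += chunk_size - overlap
    return chunks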

# Load and Chunk
pdf_path = "Stanford.pdf"
if not os.path.exists(pdf_path):
    print(f"File {pdf_path} not found. Please ensure it is in the working directory.")
else:
    pdf_text = extract_text_from_pdf(pdf_path)
    chunks = chunk_text(pdf_text)
    print(f"Created {len(chunks)} chunks.")

    print("Generating embeddings and storing in ChromaDB...")
    documents = []
    embeddings = []
    ids = []

    # Batching could be done here, but for simplicity we loop
    for i, chunk in enumerate(chunks):
        # Embed the chunk using Gemini
        response = client.models.embed_content(
            model="text-embedding-004",
            contents=chunk
        )
        # The google-genai SDK returns a response whose `embeddings` list holds one entry per input
        embedding = response.embeddings[0].values
        
        documents.append(chunk)
        embeddings.append(embedding)
        ids.append(f"chunk_{i}")

    collection.add(
        documents=documents,
        embeddings=embeddings,
        ids=ids
    )
    print("Ingestion complete.")

Created 2 chunks.
Generating embeddings and storing in ChromaDB...
Ingestion complete.
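
To make the overlap in `chunk_text` easier to see, here is a minimal sketch of the same sliding-window idea on a short string. The helper name `sliding_window_chunks` and the tiny `chunk_size`/`overlap` values are illustrative assumptions, not part of the notebook's pipeline.

```python
def sliding_window_chunks(text, chunk_size, overlap):
    """Illustrative re-statement of the chunking loop (hypothetical helper)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # each window starts `step` characters after the previous one
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(sample, chunk_size=10, overlap=3)
for chunk in demo_chunks:
    print(repr(chunk))
# 'abcdefghij'
# 'hijklmnopq'
# 'opqrstuvwx'
# 'vwxyz'
```

Each chunk repeats the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.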


## 3. Web Search Tool

In [24]:
def search_web(query):
    try:
        response = tavily_client.search(query, search_depth="basic")
        results = response.get("results", [])
        context = ""
        for result in results:
            context += f"Source: {result['url']}\nContent: {result['content']}\n\n"
        return context
    except Exception as e:
        return f"Error performing web search: {e}"
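
The string that `search_web` builds can be checked without calling Tavily. The sketch below is a rough, offline illustration: `format_search_results` and `mock_response` are hypothetical, and the mock only mirrors the `results`, `url`, and `content` keys that the function above reads.

```python
def format_search_results(response):
    """Turn a Tavily-style response dict into the context string built by search_web."""
    blocks = []
    for result in response.get("results", []):
        # .get() tolerates a result missing a field (an added safeguard, not in the original loop)
        url = result.get("url", "unknown")
        content = result.get("content", "")
        blocks.append(f"Source: {url}\nContent: {content}\n\n")
    return "".join(blocks)

# Hypothetical response, hand-written to mirror the keys search_web reads
mock_response = {
    "results": [
        {"url": "https://example.com/a", "content": "First snippet."},
        {"url": "https://example.com/b", "content": "Second snippet."},
    ]
}
mock_context = format_search_results(mock_response)
print(mock_context)
```

An empty or missing `results` list yields an empty string, which is worth keeping in mind because the prompt would then contain no web context at all.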

## 4. Verification Agent Logic

In [None]:
def verify_information(user_query):
    print(f"Processing query: {user_query}")
    
    # 1. Retrieve from PDF
    print("Retrieving context from PDF...")
    query_embedding_resp = client.models.embed_content(
        model="text-embedding-004",
        contents=user_query
    )
    query_embedding = query_embedding_resp.embeddings[0].values
    
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=3
    )
    
    pdf_context = "No relevant information found in PDF."
    if results['documents'] and results['documents'][0]:
        pdf_context = "\n---\n".join(results['documents'][0])
        
    # 2. Search Web
    print("Searching web for verification/supplemental info...")
    web_context = search_web(user_query)
    
    # 3. Generate Answer
    print("Generating final response...")
    prompt = f"""
    You are a verification agent. Your goal is to answer the user's query using information from a provided PDF and verify/supplement it with information from the web.

    User Query: {user_query}

    Context from PDF (Internal Knowledge):
    {pdf_context}

    Context from Web Search (External Verification):
    {web_context}

    Instructions:
    - Synthesize the information from both sources.
    - Highlight if the PDF information is consistent with the web information.
    - If there are discrepancies, point them out.
    - If the PDF lacks information, rely on the web but state that the PDF did not contain the info.
    - Provide a clear and concise answer.
    """
    
    response = client.models.generate_content(
        model="gemini-2.5-flash-lite",
        contents=prompt,
        config=genai.types.GenerateContentConfig(
            temperature=0.3
        ),
    )
    
    return response.text
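
The retrieval step in `verify_information` relies on `collection.query` returning one inner list of documents per query embedding, which is why the code indexes `results['documents'][0]`. The toy sketch below imitates that shape with a pure-Python nearest-neighbour search. It uses cosine similarity as an illustrative assumption (the metric ChromaDB actually applies depends on how the collection is configured), and `top_k_documents` and the toy vectors are hypothetical.

```python
import math

def top_k_documents(query_vec, doc_vecs, docs, k=3):
    """Rank documents by cosine similarity to the query (toy stand-in for collection.query)."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    ranked = sorted(zip(docs, doc_vecs), key=lambda pair: cosine(query_vec, pair[1]), reverse=True)
    # Mimic the nested shape used above: one inner list per query embedding
    return {"documents": [[doc for doc, _ in ranked[:k]]]}

toy_docs = ["context collapse", "poem", "baseline accuracy"]
toy_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
toy_results = top_k_documents([1.0, 0.1], toy_vecs, toy_docs, k=2)

# Same join as in verify_information
toy_context = "\n---\n".join(toy_results["documents"][0])
print(toy_context)
```

With these toy vectors the two chunks closest to the query are kept and joined with the `---` separator, which is the form in which they reach the prompt.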

## 5. Run the Agent

In [29]:
# Example Usage
query = "Summarize the main points of the report and verify whether there is a document from Stanford which talks about context with high accuracy suddenly collapsing in size and its performance decreasing significantly"
response = verify_information(query)
print("\n=== AGENT RESPONSE ===\n")
print(response)

Processing query: Summarize the main points of the report and verify whether there is a document from Stanford which talks about context with high accuracy suddenly collapsing in size and its performance decreasing significantly
Retrieving context from PDF...
Searching web for verification/supplemental info...
Generating final response...

=== AGENT RESPONSE ===

The main point of the report, as extracted from the provided PDF context, describes a significant event where a system's context, initially at 8,282 tokens with 66.7% accuracy, suddenly collapsed to just 122 tokens during an update. This collapse resulted in a substantial drop in accuracy to 57.1%, which was notably worse than the 63.7% baseline. The remaining text in the PDF appears to be a poetic or personal reflection and does not contribute to the summary of a technical report.

Regarding your query about a Stanford document discussing context with high accuracy suddenly collapsing in size and its performance decreasing si