# 🕵️ Project 4: The Data Detective

**Objective:** Build an agent that queries PDFs and verifies facts on Wikipedia.

## 📖 What You'll Learn

- RAG (Retrieval Augmented Generation) architecture
- Vector databases (ChromaDB)
- Document chunking and embedding
- Multi-source routing (PDF vs Web)
- Citation and source tracking

## 🎯 Architecture

```
User Question
     |
     v
  [Router] --> Is the answer in the PDF, or does it need a web search?
     |
     +---> [Vector Search] --> PDF chunks
     |
     +---> [Wikipedia API] --> Web facts
     |
     v
  [LLM + Context] --> Answer with citations
```

In [None]:
# Install required packages
# !pip install chromadb pypdf sentence-transformers wikipedia-api openai python-dotenv

In [None]:
import os
import re
from typing import List, Dict, Any
import chromadb
from pypdf import PdfReader
from sentence_transformers import SentenceTransformer
import wikipediaapi
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

print("✅ Setup complete!")

## Task 1: Create Sample PDF Document

To keep the notebook self-contained, we use a plain-text string about AI agents that stands in for text extracted from a PDF.
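
If you have a real PDF, its text can be extracted with `pypdf` and fed through the same pipeline in place of `SAMPLE_DOCUMENT`. The sketch below is one way to do it. The helper names `normalize_page_text` and `load_pdf_text` and the path `report.pdf` are hypothetical, and extraction quality depends heavily on how the PDF was produced.

```python
import re
from typing import List, Optional


def normalize_page_text(pages: List[Optional[str]]) -> str:
    """Join per-page text into one string with a blank line between pages."""
    cleaned = []
    for page_text in pages:
        # Collapse runs of spaces/tabs; keep newlines so paragraphs survive.
        text = re.sub(r"[ \t]+", " ", page_text or "").strip()
        if text:
            cleaned.append(text)
    return "\n\n".join(cleaned)


def load_pdf_text(path: str) -> str:
    """Extract text from every page of a PDF (hypothetical helper)."""
    # Imported here so normalize_page_text works without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    return normalize_page_text([page.extract_text() for page in reader.pages])


# Usage (assumes a file named report.pdf exists next to the notebook):
# document_text = load_pdf_text("report.pdf")
```

The blank line between pages matters here because the chunker in Task 2 splits on `\n\n`.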

In [None]:
# Sample document content (simulating PDF content)
SAMPLE_DOCUMENT = """
AI Agent Systems: Technical Overview

Introduction
AI agents are autonomous systems that perceive their environment, make decisions, and take actions to achieve specific goals. Modern AI agents combine large language models (LLMs) with tool use, memory systems, and planning capabilities.

Core Components

1. Profile: The agent's identity and capabilities
An agent's profile defines its role, expertise, and available tools. This includes system prompts that shape the agent's behavior and decision-making patterns.

2. Memory: Short-term and long-term storage
Memory systems enable agents to maintain context across interactions. Short-term memory uses the LLM's context window, while long-term memory typically employs vector databases like ChromaDB or Pinecone for persistent storage.

3. Planning: Breaking down complex tasks
Planning mechanisms allow agents to decompose large goals into executable steps. Common approaches include Chain-of-Thought prompting, Tree-of-Thought reasoning, and ReAct (Reasoning + Acting) patterns.

4. Action: Tool use and execution
Agents interact with external systems through tools. These can include APIs, databases, file systems, web browsers, and code interpreters. Tool use follows a pattern: the LLM generates a tool call, the system executes it, and results are fed back to the LLM.

RAG Architecture
Retrieval Augmented Generation (RAG) enhances LLMs by retrieving relevant information before generating responses. The process involves:
1. Chunking documents into smaller segments
2. Creating embeddings for each chunk
3. Storing embeddings in a vector database
4. Retrieving relevant chunks based on query similarity
5. Passing retrieved context to the LLM for generation

Technical Implementation
A production RAG system requires careful consideration of chunk size (typically 200-500 tokens), overlap between chunks (50-100 tokens), embedding models (e.g., sentence-transformers), and retrieval strategies (semantic search, hybrid search, or re-ranking).

Multi-Agent Systems
Advanced applications use multiple specialized agents that collaborate. For example, a software development system might include a Product Manager agent for requirements, a Coder agent for implementation, and a Reviewer agent for quality assurance. Communication between agents can be synchronous (direct calls) or asynchronous (message queues).

Performance Considerations
Key metrics for agent systems include latency (time to first token and total response time), cost (API calls and token usage), accuracy (task completion rate), and reliability (error handling and fallback mechanisms). Optimization strategies include caching, batching, and model fine-tuning.

Conclusion
AI agents represent a paradigm shift from passive LLMs to active autonomous systems. By combining reasoning, memory, planning, and action capabilities, they can tackle complex real-world tasks that require multi-step problem solving and external tool integration.
"""

print(f"📄 Sample document loaded ({len(SAMPLE_DOCUMENT)} characters)")

## Task 2: Implement Document Chunking
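
Before the paragraph-aware version, it can help to see `chunk_size` and `overlap` in isolation. The following is a minimal sketch of a fixed-size sliding window over characters. `sliding_window_chunks` is a hypothetical helper for illustration, not part of the project code.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into fixed-size character windows that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


demo = sliding_window_chunks("abcdefghij", chunk_size=4, overlap=2)
print(demo)
```

Each window repeats the last two characters of the previous one, which is what keeps a sentence that straddles a boundary retrievable from either chunk.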

In [None]:
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> List[Dict[str, Any]]:
    """
    Split text into overlapping chunks.
    
    Args:
        text: Full document text
        chunk_size: Target size in characters
        overlap: Overlap between chunks in characters
    
    Returns:
        List of chunk dictionaries with text and metadata
    """
    # Split into paragraphs first
    paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()]
    
    chunks = []
    current_chunk = ""
    chunk_id = 0
    
    for para in paragraphs:
        if len(current_chunk) + len(para) < chunk_size:
            current_chunk += para + "\n\n"
        else:
            if current_chunk:
                stripped = current_chunk.strip()
                chunks.append({
                    'id': f'chunk_{chunk_id}',
                    'text': stripped,
                    # Count the stored text, not the buffer with its trailing newlines
                    'metadata': {'chunk_id': chunk_id, 'char_count': len(stripped)}
                })
                chunk_id += 1
            
            # Start new chunk with overlap from previous (character-based, so it may begin mid-word)
            overlap_text = current_chunk[-overlap:] if len(current_chunk) > overlap else current_chunk
            current_chunk = overlap_text + para + "\n\n"
    
    # Add final chunk
    if current_chunk:
        stripped = current_chunk.strip()
        chunks.append({
            'id': f'chunk_{chunk_id}',
            'text': stripped,
            'metadata': {'chunk_id': chunk_id, 'char_count': len(stripped)}
        })
    
    return chunks

# Chunk the document
chunks = chunk_text(SAMPLE_DOCUMENT, chunk_size=600, overlap=100)

print(f"✅ Created {len(chunks)} chunks\n")
for i, chunk in enumerate(chunks[:3], 1):
    print(f"Chunk {i}:")
    print(f"  Length: {len(chunk['text'])} chars")
    print(f"  Preview: {chunk['text'][:100]}...\n")

## Task 3: Create Vector Store with ChromaDB

In [None]:
# Initialize ChromaDB
chroma_client = chromadb.Client()

# Create or get collection (get_or_create avoids an error when the cell is re-run)
collection = chroma_client.get_or_create_collection(
    name="ai_agents_doc",
    metadata={"description": "AI Agent technical documentation"}
)

# Initialize embedding model
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

print("📦 ChromaDB collection created")
print(f"📐 Embedding model loaded (dimension: {embedding_model.get_sentence_embedding_dimension()})")

In [None]:
# Add documents to ChromaDB
print("🔄 Adding documents to vector store...\n")

for chunk in chunks:
    # Create embedding
    embedding = embedding_model.encode(chunk['text']).tolist()
    
    # Upsert into ChromaDB so re-running the cell updates existing ids instead of re-adding them
    collection.upsert(
        ids=[chunk['id']],
        embeddings=[embedding],
        documents=[chunk['text']],
        metadatas=[chunk['metadata']]
    )
    print(f"✓ Added {chunk['id']}")

print(f"\n✅ Vector store ready with {collection.count()} documents")

## Task 4: Implement Semantic Search
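
Semantic search ranks chunks by how close their embedding vectors are to the query vector. ChromaDB handles this internally. The sketch below only illustrates the idea, using cosine similarity on toy 3-dimensional vectors. The vectors and helper names are made up for illustration, and the `distance` ChromaDB reports (lower is closer) is not necessarily this exact measure.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: List[float], vectors: Dict[str, List[float]], top_k: int = 2
) -> List[Tuple[str, float]]:
    """Return the top_k (id, similarity) pairs, most similar first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings (real ones from all-MiniLM-L6-v2 have 384 dimensions).
toy_vectors = {
    "chunk_0": [1.0, 0.0, 0.0],
    "chunk_1": [0.9, 0.1, 0.0],
    "chunk_2": [0.0, 0.0, 1.0],
}
top = rank_by_similarity([1.0, 0.0, 0.0], toy_vectors)
print(top)
```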

In [None]:
def search_documents(query: str, top_k: int = 3) -> List[Dict[str, Any]]:
    """
    Search the vector store for relevant chunks.
    """
    # Create query embedding
    query_embedding = embedding_model.encode(query).tolist()
    
    # Query ChromaDB
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k
    )
    
    # Format results
    retrieved = []
    for i in range(len(results['ids'][0])):
        retrieved.append({
            'id': results['ids'][0][i],
            'text': results['documents'][0][i],
            'distance': results['distances'][0][i] if 'distances' in results else None,
            'metadata': results['metadatas'][0][i]
        })
    
    return retrieved

# Test search
test_query = "What are the core components of an AI agent?"
results = search_documents(test_query)

print(f"🔍 Query: '{test_query}'\n")
print("📊 Top Results:\n")
for i, result in enumerate(results, 1):
    print(f"{i}. [{result['id']}]")
    print(f"   {result['text'][:150]}...\n")

## Task 5: Wikipedia Search Tool

In [None]:
# Initialize Wikipedia API
wiki = wikipediaapi.Wikipedia(
    user_agent='AIAgentProject/1.0',
    language='en'
)

def search_wikipedia(query: str, sentences: int = 3) -> Dict[str, Any]:
    """
    Search Wikipedia and return summary.
    """
    try:
        page = wiki.page(query)
        
        if not page.exists():
            return {
                'success': False,
                'error': f"No Wikipedia page found for '{query}'"
            }
        
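        # Get first N sentences (naive split; abbreviations such as "e.g." will confuse it)
        summary = page.summary.split('. ')[:sentences]
        # Avoid a doubled period when the last kept sentence already ends with one
        summary_text = '. '.join(summary).rstrip('.') + '.'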
        
        return {
            'success': True,
            'title': page.title,
            'summary': summary_text,
            'url': page.fullurl
        }
    except Exception as e:
        return {
            'success': False,
            'error': str(e)
        }

# Test Wikipedia search
test_result = search_wikipedia("Artificial Intelligence")
print("🌐 Wikipedia Test:\n")
print(f"Title: {test_result.get('title')}")
print(f"Summary: {test_result.get('summary', 'N/A')[:200]}...")
print(f"URL: {test_result.get('url')}")

## Task 6: Build Router Agent

The agent decides: Should I search the PDF or Wikipedia?
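
The LLM router defined in this task costs one API call per question. For offline testing, or as a fallback when the API is unavailable, a rule-based router can approximate the same decision. The following is a minimal sketch. The keyword list and the `route_by_keywords` name are assumptions chosen to match the sample document, not a tuned classifier.

```python
import re
from typing import Set

# Hypothetical keyword list drawn from topics the sample document covers.
DOCUMENT_KEYWORDS: Set[str] = {
    "agent", "agents", "rag", "chunk", "chunks", "chunking",
    "embedding", "embeddings", "retrieval", "planning", "memory",
}


def route_by_keywords(question: str) -> str:
    """Return 'DOCUMENT' if the question mentions a document topic, else 'WIKIPEDIA'."""
    # Lowercase and split on non-letters so "multi-agent" yields "agent".
    words = set(re.findall(r"[a-z]+", question.lower()))
    return "DOCUMENT" if words & DOCUMENT_KEYWORDS else "WIKIPEDIA"


print(route_by_keywords("How does RAG work?"))
print(route_by_keywords("Who invented the telephone?"))
```

A keyword match is brittle (a question about human memory would be sent to the document), so this is best kept as a fallback rather than a replacement.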

In [None]:
ROUTER_PROMPT = """You are a routing agent. Given a question, decide which source to query:

Sources:
1. DOCUMENT - Use for questions about AI agent architecture, RAG, implementation details
2. WIKIPEDIA - Use for general knowledge, historical facts, definitions

Respond with ONLY one word: DOCUMENT or WIKIPEDIA

Examples:
Q: What are the core components of an AI agent?
A: DOCUMENT

Q: Who invented the Transformer architecture?
A: WIKIPEDIA

Q: How does RAG work?
A: DOCUMENT

Q: What is machine learning?
A: WIKIPEDIA
"""

def route_query(question: str) -> str:
    """
    Determine which source to query.
    """
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": ROUTER_PROMPT},
            {"role": "user", "content": f"Question: {question}"}
        ],
        temperature=0
    )
    
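    raw = response.choices[0].message.content.strip().upper()
    # Normalize so stray punctuation or extra words cannot break the exact-match check downstream
    return "DOCUMENT" if "DOCUMENT" in raw else "WIKIPEDIA"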

# Test routing
test_questions = [
    "What is the chunk size recommendation for RAG?",
    "What is natural language processing?",
    "How do multi-agent systems communicate?"
]

print("🧭 Testing Router:\n")
for q in test_questions:
    route = route_query(q)
    print(f"Q: {q}")
    print(f"→ Route: {route}\n")

## Task 7: Complete RAG Agent with Citations
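
The agent in this task returns its sources next to the answer, but the prompt does not ask the model to cite them inline. One option is to label every retrieved chunk with its id and instruct the model to quote those labels. The sketch below assumes results shaped like the output of `search_documents` (dicts with `id` and `text`). `build_cited_context` and the instruction wording are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def build_cited_context(results: List[Dict[str, Any]]) -> Tuple[str, List[str]]:
    """Prefix each chunk with a bracketed label and return (context, labels)."""
    labels = []
    blocks = []
    for result in results:
        label = f"[{result['id']}]"
        labels.append(label)
        blocks.append(f"{label} {result['text']}")
    return "\n\n".join(blocks), labels


# Hypothetical instruction to prepend to the answer prompt.
CITATION_INSTRUCTION = (
    "Answer using ONLY the context. After each claim, cite the supporting "
    "chunk label in square brackets, for example [chunk_0]."
)

sample_results = [
    {"id": "chunk_0", "text": "Agents combine LLMs with tools."},
    {"id": "chunk_3", "text": "RAG retrieves context before generation."},
]
context, labels = build_cited_context(sample_results)
print(context)
```

Because the labels appear in the answer text, they can be checked against `labels` afterwards to flag citations the model invented.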

In [None]:
def answer_with_sources(question: str, verbose: bool = True) -> Dict[str, Any]:
    """
    Answer question using appropriate source and provide citations.
    """
    if verbose:
        print("="*80)
        print(f"🎯 Question: {question}")
        print("="*80 + "\n")
    
    # Step 1: Route the query
    route = route_query(question)
    if verbose:
        print(f"🧭 Routing decision: {route}\n")
    
    # Step 2: Retrieve context
    if route == "DOCUMENT":
        if verbose:
            print("📄 Searching document...\n")
        
        results = search_documents(question, top_k=2)
        context = "\n\n".join([r['text'] for r in results])
        sources = [f"Document chunk {r['id']}" for r in results]
        
    else:  # WIKIPEDIA
        if verbose:
            print("🌐 Searching Wikipedia...\n")
        
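        # Extract topic from question (simple heuristic): drop the leading question
        # phrase and trailing punctuation, e.g. "What is machine learning?" -> "machine learning"
        topic = re.sub(
            r"^(what|who|when|where|which|how)\s+(is|are|was|were|does|do|did)\s+((an?|the)\s+)?",
            "",
            question.strip().rstrip("?.! "),
            flags=re.IGNORECASE,
        )
        wiki_result = search_wikipedia(topic)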
        
        if wiki_result['success']:
            context = wiki_result['summary']
            sources = [wiki_result['url']]
        else:
            context = "No information found."
            sources = []
    
    if verbose:
        print(f"📚 Context retrieved ({len(context)} chars)\n")
    
    # Step 3: Generate answer with LLM
    answer_prompt = f"""Answer the question based ONLY on the provided context. 
If the context doesn't contain enough information, say so.
Keep your answer concise and factual.

Context:
{context}

Question: {question}

Answer:"""
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": answer_prompt}],
        temperature=0.3
    )
    
    answer = response.choices[0].message.content
    
    return {
        'question': question,
        'answer': answer,
        'source_type': route,
        'sources': sources,
        'context': context
    }

print("✅ RAG Agent ready!")

## 🧪 Test the Complete Agent

In [None]:
# Test 1: Document query
result = answer_with_sources("What are the four core components of an AI agent?")

print("\n📝 Answer:", result['answer'])
print("\n🔗 Sources:")
for source in result['sources']:
    print(f"  - {source}")

In [None]:
# Test 2: Wikipedia query
result = answer_with_sources("What is machine learning?")

print("\n📝 Answer:", result['answer'])
print("\n🔗 Sources:")
for source in result['sources']:
    print(f"  - {source}")

In [None]:
# Test 3: Complex query
result = answer_with_sources("How does retrieval work in RAG architecture?")

print("\n📝 Answer:", result['answer'])
print("\n🔗 Sources:")
for source in result['sources']:
    print(f"  - {source}")

## 🎓 Key Takeaways

### What You've Built:

A working RAG prototype with:
1. ✅ Document ingestion and chunking
2. ✅ Vector embeddings and search
3. ✅ Multi-source routing
4. ✅ Context-aware generation
5. ✅ Source citations

### Key Concepts:

- **Chunking Strategy**: Balance between context and specificity
- **Embeddings**: Convert text to searchable vectors
- **Vector Databases**: Efficient similarity search at scale
- **Routing**: Intelligent source selection
- **Grounding**: LLM answers constrained to retrieved context

### Production Considerations:

- Chunk size affects retrieval quality (experiment!)
- Embedding model choice impacts accuracy
- Re-ranking can improve top-k results
- Hybrid search (semantic + keyword) often performs best
- Cache embeddings for repeated queries
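
Hybrid search needs a way to merge a semantic ranking with a keyword ranking whose scores are not directly comparable. Reciprocal rank fusion (RRF) is one common choice because it uses only rank positions. The following is a minimal sketch. The input rankings are made-up ids, and `k=60` is a conventional default rather than a value tuned for this project.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge ranked id lists; each list contributes 1 / (k + rank) per id."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by id for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


semantic = ["chunk_2", "chunk_0", "chunk_5"]  # e.g. from vector search
keyword = ["chunk_0", "chunk_7", "chunk_2"]   # e.g. from a keyword index
fused = reciprocal_rank_fusion([semantic, keyword])
print(fused)
```

Chunks that appear near the top of both lists rise to the front, while a chunk found by only one retriever is still kept as a candidate.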

### Next Phase:

In Phase 3, you'll add **persistent memory** so agents remember conversations across sessions!