<a href="https://colab.research.google.com/github/gcosma/COP509/blob/main/TutorialRAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# COP509: Natural Language Processing
# Tutorial: Retrieval-Augmented Generation (RAG)

**Department of Computer Science, Loughborough University**

**Prof. Georgina Cosma** (g.cosma@lboro.ac.uk)

---

## 🎯 Learning Outcomes

By the end of this tutorial, you will be able to:

1. Explain what RAG is and why it addresses key limitations of large language models
2. Implement the retrieval component using vector similarity search
3. Understand how retrieved context is integrated with LLM prompts
4. Build a complete RAG pipeline in Python
5. Evaluate RAG systems and identify common failure modes

---

## 📚 Prerequisites

This tutorial builds on concepts from:
- **Weeks 2-3**: Vector Space Models (TF-IDF, word embeddings)
- **Week 4**: Similarity measures (cosine similarity)
- **Week 5**: LSI and semantic similarity
- **Week 6**: Transformer models (BERT, attention mechanism)

---

## 📖 How to Use This Notebook

| Symbol | Meaning |
|--------|--------|
| 📝 | Important concept - read carefully |
| 💻 | Code to run |
| ✅ | Exercise for you to complete |
| ⚠️ | Common mistake or warning |
| 🔑 | Key takeaway |

**Instructions:**
1. **Run cells in order** - each cell depends on previous cells
2. **Read the explanations** - don't just run the code!
3. **Check expected outputs** - verify your results match
4. **Complete the exercises** - practice is essential for learning

---

## Part 1: Setup and Installation

### 📝 What are we installing?

| Library | Purpose | Why we need it |
|---------|---------|----------------|
| `sentence-transformers` | Creates dense embeddings | Converts text to vectors that capture meaning |
| `chromadb` | Vector database | Stores and searches embeddings efficiently |
| `langchain-text-splitters` | Text splitters from the LangChain ecosystem | Splits long documents into chunks |
| `transformers` | Pre-trained models | Gives us access to language models like FLAN-T5 |

**⏱️ Note:** Installation takes 2-3 minutes. You only need to run this once per Colab session.

In [None]:
# 💻 STEP 1: Install required packages
# The -q flag means 'quiet' - it reduces output clutter
print("📦 Installing packages... this may take 2-3 minutes...")
!pip install -q sentence-transformers chromadb langchain-text-splitters
!pip install -q transformers accelerate
print("✅ Installation complete!")

In [None]:
# 💻 STEP 2: Import libraries
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
import warnings
warnings.filterwarnings('ignore')

print("✅ Libraries imported successfully!")

**Expected output:**
```
✅ Libraries imported successfully!
```

**⚠️ Troubleshooting:** If you see errors:
- Try running the installation cell again
- Go to **Runtime → Restart runtime**, then run both cells again
- Make sure you're connected to the internet

---

## Part 2: Understanding the Problem

### 📝 What is an LLM?

**LLM** = **L**arge **L**anguage **M**odel

These are neural networks trained on massive amounts of text to understand and generate human language.

**Examples:** ChatGPT, Claude, BERT, GPT-4, FLAN-T5

### 📝 The Problem: Why LLMs Need Help

LLMs have three major limitations:

| Limitation | What it means | Example |
|------------|--------------|--------|
| **Knowledge Cutoff** | Only knows information from training data | Cannot answer "Who won the 2025 election?" |
| **Hallucinations** | Generates plausible but FALSE information | May invent statistics or citations |
| **No Private Data** | Cannot access your documents | Cannot answer "What is our company's leave policy?" |

### 📝 The Solution: RAG

**RAG** = **R**etrieval-**A**ugmented **G**eneration

```
┌────────────────────────────────────────────────────────────────┐
│                          RAG PIPELINE                          │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│   User Query ──────┐                                           │
│                    │                                           │
│                    ▼                                           │
│            ┌──────────────┐      ┌──────────────────┐          │
│            │   RETRIEVE   │ ───► │  Relevant Docs   │          │
│            │   (Search)   │      │  from Database   │          │
│            └──────────────┘      └────────┬─────────┘          │
│                                           │                    │
│                    ┌──────────────────────┘                    │
│                    ▼                                           │
│            ┌──────────────┐                                    │
│            │   AUGMENT    │  Query + Retrieved Docs            │
│            │   (Combine)  │  = Augmented Prompt                │
│            └───────┬──────┘                                    │
│                    │                                           │
│                    ▼                                           │
│            ┌──────────────┐                                    │
│            │   GENERATE   │                                    │
│            │    (LLM)     │                                    │
│            └───────┬──────┘                                    │
│                    │                                           │
│                    ▼                                           │
│              Answer grounded                                   │
│              in real documents                                 │
│                                                                │
└────────────────────────────────────────────────────────────────┘
```

### 🔑 Key Analogy

| Standard LLM | RAG |
|--------------|-----|
| Closed-book exam | Open-book exam |
| Must answer from memory | Can look up information |
| May guess incorrectly | Answers grounded in sources |

---

## Part 3: The Retrieval Component

### 📝 What is Retrieval?

**Retrieval** = Finding relevant documents for a given query

This is exactly what you learned in **Weeks 2-5**! The process is:

```
Step 1: Convert all documents into vectors (numbers)
        ↓
Step 2: Convert the user's query into a vector
        ↓
Step 3: Calculate similarity between query and all documents
        ↓
Step 4: Return the top-k most similar documents
```

### 📝 What is "top-k"?

**k** is simply the number of documents to retrieve:
- **k=3** means retrieve the 3 most similar documents
- **k=5** means retrieve the 5 most similar documents

**Trade-off:** Higher k = more context, but may include irrelevant documents
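
To make the role of k concrete, here is a minimal sketch of top-k selection. The similarity scores are made-up values for illustration, and `top_k_indices` is a hypothetical helper rather than a library function.

```python
# Made-up similarity scores for five documents (illustration only)
scores = [0.12, 0.81, 0.35, 0.67, 0.05]

def top_k_indices(scores, k):
    """Return the indices of the k highest scores, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

print(top_k_indices(scores, k=1))  # [1]
print(top_k_indices(scores, k=3))  # [1, 3, 2]
```

With k=3 the weaker match at index 2 (score 0.35) is included as well, which is the extra context, and possible noise, that the trade-off refers to.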

### 📝 Dense vs Sparse Embeddings

| Type | Method | What it looks like | Pros | Cons |
|------|--------|-------------------|------|------|
| **Sparse** | TF-IDF (Week 2-3) | `[0, 0, 0, 1, 0, 0, ..., 0]` | Fast, interpretable | Only matches exact words |
| **Dense** | Neural embeddings | `[0.2, -0.5, 0.8, ..., 0.1]` | Captures meaning | Requires more computation |

### 🔑 Why Dense Embeddings are Better for RAG

**Example:** User asks "How many holiday days do I get?"

| Method | Can it find "Annual leave is 30 days"? | Why? |
|--------|---------------------------------------|------|
| TF-IDF (Sparse) | ❌ Low score | "holiday" ≠ "annual leave" (different words); only the generic word "days" matches |
| Dense Embeddings | ✅ High score | Understands "holiday" ≈ "annual leave" (same meaning) |
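
A simple word-overlap check makes the sparse limitation concrete. The sketch below is an illustration rather than a TF-IDF implementation: `tokens` is a hypothetical helper, and the two sentences are taken from the knowledge base built later in this tutorial.

```python
import re

def tokens(text):
    """Lowercase a text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

query = "How many holiday days do I get?"
docs = [
    "Annual leave allowance is 30 days per year for all full-time employees.",
    "Expenses must be submitted within 30 days of being incurred.",
]

# Show which words the query shares with each document
for doc in docs:
    shared = tokens(query) & tokens(doc)
    print(sorted(shared), "<-", doc)
```

Both documents share only the word "days" with the query, so exact word matching has nothing with which to prefer the annual leave policy over the expenses policy.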

### 📝 What is Sentence-Transformers?

**Sentence-Transformers** is a library that provides pre-trained models to convert text into dense vectors.

We'll use `all-MiniLM-L6-v2` which:
- ✅ Is **free** and runs locally (no API key!)
- ✅ Produces **384-dimensional** vectors
- ✅ Is **fast** and works well for most tasks
- ✅ Understands **semantic similarity**

In [None]:
# 💻 STEP 3: Load the embedding model
# First run downloads ~90MB - be patient!
print("📥 Loading embedding model (first run downloads ~90MB)...")
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
print("✅ Model loaded!")
print(f"   This model produces {embedding_model.get_sentence_embedding_dimension()}-dimensional vectors")

**Expected output:**
```
📥 Loading embedding model (first run downloads ~90MB)...
✅ Model loaded!
   This model produces 384-dimensional vectors
```

### 📝 Step 3.1: Create a Knowledge Base

In a real application, you would load documents from files. For this tutorial, we'll create a simple knowledge base of company policies.

In [None]:
# 💻 STEP 4: Create our knowledge base
# These are example company policy documents
documents = [
    "Annual leave allowance is 30 days per year for all full-time employees.",
    "Sick leave requires a doctor's note after 3 consecutive days of absence.",
    "Remote working is permitted 2 days per week with manager approval.",
    "The office dress code is business casual from Monday to Thursday.",
    "Friday is casual dress day and employees may wear jeans.",
    "All employees must complete mandatory health and safety training annually.",
    "Parental leave is 26 weeks for primary carers and 4 weeks for secondary carers.",
    "The company provides a pension contribution of 5% of salary.",
    "Performance reviews are conducted twice per year in March and September.",
    "Expenses must be submitted within 30 days of being incurred."
]

print(f"✅ Knowledge base created with {len(documents)} documents")
print(f"\n📄 Example document: '{documents[0]}'")

### 📝 Step 3.2: Convert Documents to Embeddings

Now we convert each document into a vector. This is called **indexing**.

**What happens:**
```
"Annual leave is 30 days..."  →  [0.023, -0.156, 0.089, ..., 0.045]
                                        (384 numbers)
```

In [None]:
# 💻 STEP 5: Create embeddings for all documents
# This is the INDEXING step - done ONCE before any queries
print("🔄 Creating document embeddings...")
doc_embeddings = embedding_model.encode(documents)

print(f"✅ Created embeddings!")
print(f"   Shape: {doc_embeddings.shape}")
print(f"   Meaning: {doc_embeddings.shape[0]} documents × {doc_embeddings.shape[1]} dimensions")
print(f"\n📊 First 5 values of document 1's embedding:")
print(f"   {doc_embeddings[0][:5]}")

**Expected output:**
```
🔄 Creating document embeddings...
✅ Created embeddings!
   Shape: (10, 384)
   Meaning: 10 documents × 384 dimensions

📊 First 5 values of document 1's embedding:
   [ 0.01234  -0.05678  0.12345  -0.09876  0.03456]
```
(The five values shown are placeholders; your numbers will differ.)

### 📝 Step 3.3: Create the Retrieval Function

Now we create a function that:
1. Takes a query
2. Converts it to an embedding
3. Finds the most similar documents

**This uses cosine similarity from Week 4!**
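
As a reminder of what the similarity score measures, here is a minimal sketch of cosine similarity in plain Python. The 3-dimensional vectors are made up for illustration (the real embeddings have 384 dimensions), and `cosine` is a hypothetical helper; the retrieval function in this tutorial uses scikit-learn's `cosine_similarity` instead.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors for illustration
query_vec = [1.0, 2.0, 0.0]
doc_a = [2.0, 4.0, 0.0]   # points in the same direction as the query
doc_b = [0.0, 0.0, 5.0]   # orthogonal to the query

print(round(cosine(query_vec, doc_a), 3))  # 1.0
print(round(cosine(query_vec, doc_b), 3))  # 0.0
```

The score depends only on direction, not length: `doc_a` is twice as long as the query vector but still scores 1.0.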

In [None]:
# 💻 STEP 6: Define the retrieval function
def retrieve_documents(query, doc_embeddings, documents, embedding_model, top_k=3):
    """
    Retrieve the most relevant documents for a query.

    Parameters:
    -----------
    query : str
        The user's question
    doc_embeddings : numpy array
        Pre-computed embeddings for all documents
    documents : list
        List of original document texts
    embedding_model : SentenceTransformer
        The model used to create embeddings
    top_k : int
        Number of documents to retrieve (default: 3)

    Returns:
    --------
    list of tuples: [(document_text, similarity_score), ...]
    """
    # Step A: Convert query to embedding (MUST use same model!)
    query_embedding = embedding_model.encode([query])

    # Step B: Calculate cosine similarity with ALL documents
    similarities = cosine_similarity(query_embedding, doc_embeddings)[0]

    # Step C: Get indices of top-k most similar documents
    top_indices = np.argsort(similarities)[::-1][:top_k]

    # Step D: Return documents with their scores
    results = [(documents[i], similarities[i]) for i in top_indices]

    return results

print("✅ Retrieval function defined!")

### 💻 Let's Test the Retrieval!

In [None]:
# 💻 STEP 7: Test the retrieval function
query = "How many holiday days do I get?"

print(f"🔍 Query: '{query}'")
print("\n" + "="*65)
print("📄 Retrieved documents (ranked by similarity):")
print("="*65)

results = retrieve_documents(query, doc_embeddings, documents, embedding_model, top_k=3)

for i, (doc, score) in enumerate(results, 1):
    print(f"\n[{i}] Similarity: {score:.3f}")
    print(f"    📄 {doc}")

### 🔑 Key Observation!

The query asked about **"holiday days"** but the top result mentions **"annual leave"**.

**This is the power of dense embeddings!** They understand that:
- "holiday" ≈ "annual leave"
- "days" ≈ "days"

TF-IDF can match only on the literal word "days", which also appears in unrelated documents (sick leave, remote working, expenses), so it cannot single out the annual leave policy.

### ✅ Exercise 3.1: Test Semantic Matching

Run the cell below and observe how the system finds relevant documents even when you use different words.

In [None]:
# ✅ EXERCISE: Observe semantic matching in action
test_queries = [
    "What should I wear to the office?",      # Should match → dress code
    "Can I work from home?",                   # Should match → remote working
    "When is my performance evaluation?",      # Should match → performance reviews
    "What happens if I'm sick?"                # Should match → sick leave
]

print("Testing semantic matching:")
print("="*65)

for query in test_queries:
    results = retrieve_documents(query, doc_embeddings, documents, embedding_model, top_k=1)
    doc, score = results[0]
    print(f"\n🔍 Query: '{query}'")
    print(f"   → Best match [{score:.3f}]: {doc[:50]}...")

### ✅ Exercise 3.2: Compare Sparse vs Dense

Let's see how TF-IDF (sparse) compares to dense embeddings.

In [None]:
# ✅ EXERCISE: Compare TF-IDF vs Dense Embeddings
from sklearn.feature_extraction.text import TfidfVectorizer

# Create TF-IDF vectors (sparse - like Week 2-3)
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(documents)

# Query where words DON'T match exactly
query = "How many holiday days do I get?"

# TF-IDF retrieval
query_tfidf = tfidf.transform([query])
tfidf_scores = cosine_similarity(query_tfidf, tfidf_matrix)[0]
tfidf_top = np.argmax(tfidf_scores)

# Dense retrieval
dense_results = retrieve_documents(query, doc_embeddings, documents, embedding_model, top_k=1)

print(f"🔍 Query: '{query}'")
print("\n" + "="*65)
print("TF-IDF (Sparse) - matches exact words only:")
print(f"   Score: {tfidf_scores[tfidf_top]:.3f}")
print(f"   Doc: {documents[tfidf_top][:50]}...")
print("\n" + "="*65)
print("Dense Embeddings - understands meaning:")
print(f"   Score: {dense_results[0][1]:.3f}")
print(f"   Doc: {dense_results[0][0][:50]}...")
# The two scores come from different vector spaces, so compare the
# retrieved documents rather than the raw score values.
print("\n🔑 Notice: compare WHICH document each method ranks first.")
print("   TF-IDF can only match the literal word 'days'; dense embeddings")
print("   relate 'holiday' to 'annual leave'.")

---

## Part 4: The Generation Component

### 📝 What is Generation?

After retrieving relevant documents, we need to **generate an answer** using an LLM.

**The key idea:** We don't just send the query to the LLM. We send the query **PLUS** the retrieved documents.

```
Standard LLM:     Query ────────────────────────► LLM ──► Answer (may hallucinate)

RAG:              Query + Retrieved Documents ──► LLM ──► Grounded Answer ✓
```

### 📝 What is Prompt Augmentation?

**Augmentation** = Adding retrieved documents to the prompt

We create a prompt that says:
1. "Here is some context" (the retrieved documents)
2. "Here is the question"
3. "Answer based ONLY on the context" (reduces hallucinations)

In [None]:
# 💻 STEP 8: Create the prompt augmentation function
def create_rag_prompt(query, retrieved_docs):
    """
    Create a prompt that includes retrieved context.
    This is the AUGMENTATION step.
    """
    # Format the retrieved documents
    context_parts = []
    for i, (doc, score) in enumerate(retrieved_docs, 1):
        context_parts.append(f"[Document {i}] {doc}")

    context = "\n".join(context_parts)

    # Create the prompt with clear instructions
    prompt = f"""Answer the question based ONLY on the following context.
If the context doesn't contain enough information, say "I don't have enough information."

Context:
{context}

Question: {query}

Answer:"""

    return prompt

print("✅ Prompt augmentation function defined!")

In [None]:
# 💻 Let's see what an augmented prompt looks like
query = "How many holiday days do I get?"
retrieved = retrieve_documents(query, doc_embeddings, documents, embedding_model, top_k=2)

prompt = create_rag_prompt(query, retrieved)

print("📝 Generated RAG Prompt:")
print("="*65)
print(prompt)
print("="*65)

### 📝 Loading the Language Model

We'll use **FLAN-T5**, a free, open-source model from Google:
- ✅ Runs locally on Colab (no API key needed)
- ✅ Good at following instructions
- ✅ Small enough to run without a GPU

**⏱️ Note:** First run downloads ~1GB. This takes 2-3 minutes.

In [None]:
# 💻 STEP 9: Load the language model
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM

print("📥 Loading language model (first run downloads ~1GB)...")
print("⏱️  This may take 2-3 minutes...")

model_name = "google/flan-t5-base"  # Free, no API key needed!
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

generator = pipeline(
    "text2text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=100
)

print("✅ Language model loaded!")

### 📝 The Complete RAG Pipeline

Now let's put it all together:

```
User Query
    │
    ▼
┌─────────────────┐
│  1. RETRIEVE    │  ← Find relevant documents
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  2. AUGMENT     │  ← Add documents to prompt
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  3. GENERATE    │  ← LLM produces answer
└────────┬────────┘
         │
         ▼
   Grounded Answer
```

In [None]:
# 💻 STEP 10: Define the complete RAG pipeline
def rag_answer(query, doc_embeddings, documents, embedding_model, generator, top_k=3):
    """
    Complete RAG pipeline: Retrieve → Augment → Generate
    """
    # Step 1: RETRIEVE relevant documents
    retrieved = retrieve_documents(query, doc_embeddings, documents, embedding_model, top_k)

    # Step 2: AUGMENT - create prompt with context
    prompt = create_rag_prompt(query, retrieved)

    # Step 3: GENERATE answer using the LLM
    response = generator(prompt)[0]['generated_text']

    return response, retrieved

print("✅ Complete RAG pipeline defined!")

In [None]:
# 💻 STEP 11: Test the complete RAG system!
query = "How many holiday days do I get?"

print(f"🔍 Question: {query}")
print("\n🔄 Processing...")

answer, sources = rag_answer(query, doc_embeddings, documents, embedding_model, generator)

print("\n" + "="*65)
print(f"💬 Answer: {answer}")
print("="*65)
print("\n📚 Sources used:")
for doc, score in sources:
    print(f"   [{score:.3f}] {doc}")

### 🔑 Key Observation

The answer **"30 days"** comes directly from the retrieved document!

The LLM didn't hallucinate - it extracted the information from the context we provided.
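
One crude way to check this kind of grounding automatically is to verify that every number in the answer also appears in the retrieved sources. The sketch below assumes a hypothetical helper, `numbers_supported`; it can flag invented figures but says nothing about whether the rest of the answer is faithful.

```python
import re

def numbers_supported(answer, sources):
    """Return True if every number in the answer appears in at least one source."""
    source_numbers = set(re.findall(r"\d+", " ".join(sources)))
    return all(num in source_numbers for num in re.findall(r"\d+", answer))

sources = ["Annual leave allowance is 30 days per year for all full-time employees."]
print(numbers_supported("30 days", sources))  # True
print(numbers_supported("25 days", sources))  # False
```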

### ✅ Exercise 4.1: Test with Different Questions

In [None]:
# ✅ EXERCISE: Test the RAG system with multiple questions
test_questions = [
    "What is the dress code on Friday?",
    "How long is parental leave for primary carers?",
    "When do I need a doctor's note?",
    "Can I work from home?"
]

print("Testing RAG system:")
print("="*65)

for q in test_questions:
    answer, _ = rag_answer(q, doc_embeddings, documents, embedding_model, generator)
    print(f"\n❓ Q: {q}")
    print(f"💬 A: {answer}")

---

## Part 5: Chunking - Splitting Long Documents

### 📝 What is Chunking?

**Chunking** = Splitting long documents into smaller pieces

### 📝 Why Do We Need Chunking?

| Reason | Explanation |
|--------|-------------|
| **Embedding model limits** | Models truncate long input (`all-MiniLM-L6-v2` reads at most 256 word pieces by default; many other models stop at 512 tokens) |
| **LLM context limits** | Can't fit entire documents in the prompt |
| **Precision** | We want specific paragraphs, not whole documents |

### 📝 Chunk Size Trade-offs

```
Chunks TOO SMALL                    Chunks TOO LARGE
────────────────                    ────────────────
❌ Lose context                      ❌ Include irrelevant info
❌ May split sentences               ❌ Less precise retrieval
❌ "Annual leave is" / "30 days"     ❌ Whole handbook in one chunk
```

**🔑 Goal:** Each chunk should contain ONE complete, coherent piece of information.

### 📝 What is Overlap?

**Overlap** = Shared text between adjacent chunks

```
Without overlap:  [Chunk 1: "Annual leave is"] [Chunk 2: "30 days per year"]
                                           ↑
                                  Information split! ❌

With overlap:     [Chunk 1: "Annual leave is 30 days"]
                                       [Chunk 2: "is 30 days per year"]
                                           ↑
                                  Information preserved! ✓
```

**Rule of thumb:** Set overlap to 10-20% of chunk size
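
The mechanics of overlap can be sketched with a plain sliding window over characters. This is a simplified illustration, not how the splitter used later in this tutorial works internally; `chunk_with_overlap` is a hypothetical helper, and the sizes are deliberately tiny so the shared text is easy to see.

```python
def chunk_with_overlap(text, chunk_size, overlap):
    """Split text into fixed-size character windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

text = "Annual leave is 30 days per year."
for chunk in chunk_with_overlap(text, chunk_size=20, overlap=8):
    print(repr(chunk))
```

Each chunk starts with the last 8 characters of the previous one, so a fact that straddles a boundary still appears whole in at least one chunk.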

In [None]:
# 💻 STEP 12: Chunking example
from langchain_text_splitters import RecursiveCharacterTextSplitter

# A longer document to demonstrate chunking
long_document = """
EMPLOYEE HANDBOOK - LEAVE POLICIES

Section 1: Annual Leave
All full-time employees are entitled to 30 days of annual leave per year.
Leave must be requested at least 2 weeks in advance for periods longer than 3 days.
Up to 5 days of unused leave may be carried over to the following year.

Section 2: Sick Leave
Employees may take sick leave when unwell without a doctor's note for up to 3 days.
For absences longer than 3 days, a medical certificate is required.
The company provides full pay for the first 10 days of sick leave per year.

Section 3: Parental Leave
Primary carers are entitled to 26 weeks of paid parental leave.
Secondary carers are entitled to 4 weeks of paid parental leave.
Parental leave must be taken within the first year of the child's birth.
"""

print(f"📄 Original document length: {len(long_document)} characters")

In [None]:
# 💻 STEP 13: Split the document into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,      # Maximum characters per chunk
    chunk_overlap=50,    # Overlap between chunks
    separators=["\n\n", "\n", ". ", " "]  # Try these in order
)

chunks = text_splitter.split_text(long_document)

print(f"✅ Split into {len(chunks)} chunks\n")
print("="*65)
for i, chunk in enumerate(chunks, 1):
    print(f"\n📄 Chunk {i} ({len(chunk)} chars):")
    print("-"*40)
    print(chunk.strip())

### 📝 How RecursiveCharacterTextSplitter Works

It tries separators **in order**:

1. First: `\n\n` (paragraph breaks) - keeps paragraphs together
2. Then: `\n` (line breaks) - keeps lines together
3. Then: `. ` (sentences) - keeps sentences together
4. Finally: ` ` (spaces) - splits between words as a last resort

This preserves natural text structure!
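
The "try separators in order" idea can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not LangChain's implementation: `recursive_split` is a hypothetical function, and unlike the real splitter it does not merge small pieces back together or add overlap.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", ". ", " ")):
    """Split on the first separator; recurse with the next one on pieces still too long."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, remaining = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        pieces.extend(recursive_split(part, chunk_size, remaining))
    return pieces

sample = ("Section 1\nAnnual leave is 30 days.\n\n"
          "Section 2\nSick leave needs a note after 3 days.")
for piece in recursive_split(sample, chunk_size=40):
    print(repr(piece))
```

The first paragraph fits within 40 characters and stays whole; the second is too long, so only that paragraph is split further at its line break.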

### ✅ Exercise 5.1: Experiment with Chunk Sizes

In [None]:
# ✅ EXERCISE: Try different chunk sizes
chunk_sizes = [100, 200, 300, 500]

print("Comparing chunk sizes:")
print("="*50)

for size in chunk_sizes:
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=size,
        chunk_overlap=size // 10  # 10% overlap
    )
    result_chunks = splitter.split_text(long_document)
    avg_len = np.mean([len(c) for c in result_chunks])

    print(f"\nChunk size {size}:")
    print(f"   → {len(result_chunks)} chunks (avg {avg_len:.0f} chars each)")

---

## Part 6: Vector Databases

### 📝 What is a Vector Database?

A **vector database** is designed to store and search embeddings efficiently.

| Regular Database | Vector Database |
|-----------------|----------------|
| Stores text, numbers | Stores vectors (embeddings) |
| Exact matching | Similarity matching |
| SQL queries | Vector similarity search |

### 📝 Why Use a Vector Database?

- When you have **thousands/millions** of documents
- Comparing every vector is **too slow**
- Vector DBs use special algorithms for **fast search**
- They can **persist data** between sessions (the in-memory client used in this tutorial does not)
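
To see what a vector database replaces, here is a toy in-memory store that searches by brute force, comparing the query against every stored vector. `TinyVectorStore` and its 2-dimensional vectors are hypothetical and for illustration only; production systems such as Chroma avoid this linear scan by using approximate nearest-neighbour indexes.

```python
import math

class TinyVectorStore:
    """Toy store: keeps (vector, text) per id and searches by brute force."""

    def __init__(self):
        self._items = {}

    def add(self, item_id, vector, text):
        self._items[item_id] = (vector, text)

    @staticmethod
    def _cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    def query(self, vector, n_results=2):
        # Score EVERY stored vector: cost grows linearly with collection size
        scored = [(self._cosine(vector, vec), item_id, text)
                  for item_id, (vec, text) in self._items.items()]
        scored.sort(reverse=True)
        return scored[:n_results]

store = TinyVectorStore()
store.add("a", [1.0, 0.0], "leave policy")
store.add("b", [0.0, 1.0], "dress code")
store.add("c", [0.9, 0.1], "holiday allowance")

hits = store.query([1.0, 0.0], n_results=2)
print([item_id for _, item_id, _ in hits])  # ['a', 'c']
```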

In [None]:
# 💻 STEP 14: Create a Chroma vector database
import chromadb
from chromadb.utils import embedding_functions

# Create an in-memory client (data is lost when the session ends)
chroma_client = chromadb.Client()

# Create embedding function (same model as before!)
embedding_func = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

# Create a collection that uses cosine distance.
# Chroma's default is squared L2 distance, for which "1 - distance"
# would NOT be a cosine similarity.
collection = chroma_client.create_collection(
    name="employee_handbook",
    embedding_function=embedding_func,
    metadata={"hnsw:space": "cosine"}
)

print("✅ Chroma vector database created!")

In [None]:
# 💻 STEP 15: Add chunks to the database
collection.add(
    documents=chunks,
    ids=[f"chunk_{i}" for i in range(len(chunks))],
    metadatas=[{"source": "handbook", "chunk_id": i} for i in range(len(chunks))]
)

print(f"✅ Added {collection.count()} chunks to the database")

In [None]:
# 💻 STEP 16: Query the vector database
query = "How many days can I carry over?"

results = collection.query(
    query_texts=[query],
    n_results=2
)

print(f"🔍 Query: '{query}'")
print("\n" + "="*65)
print("📄 Retrieved chunks:")
print("="*65)

for i, (doc, dist) in enumerate(zip(results['documents'][0], results['distances'][0]), 1):
    similarity = 1 - dist  # Convert distance to similarity
    print(f"\n[{i}] Similarity: {similarity:.3f}")
    print(f"    {doc[:100]}..." if len(doc) > 100 else f"    {doc}")

---

## Part 7: Evaluation

### 📝 Two Things to Evaluate

| Component | Question | If it fails... |
|-----------|----------|---------------|
| **Retrieval** | Did we find the right documents? | Wrong context → Wrong answer |
| **Generation** | Is the answer correct? | Even the right context can yield a wrong answer |

### 🔑 Important: Evaluate Retrieval FIRST!

If retrieval fails, generation cannot succeed.

### 📝 Retrieval Metrics (from Week 5!)

| Metric | Formula | Meaning |
|--------|---------|--------|
| **Precision@k** | Relevant in top-k ÷ k | What % of retrieved docs are relevant? |
| **Recall@k** | Relevant in top-k ÷ Total relevant | What % of relevant docs did we find? |
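
A small worked example shows how both metrics change with k. The ranked IDs and the relevant set below are made up for illustration, and `precision_recall_at_k` is a hypothetical helper that follows the two formulas in the table.

```python
def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """Compute Precision@k and Recall@k for a ranked list of document IDs."""
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

ranked_ids = [4, 1, 7, 2, 9]   # retrieval order, best first (made up)
relevant_ids = {1, 2, 3}       # ground-truth relevant documents (made up)

for k in (1, 3, 5):
    p, r = precision_recall_at_k(ranked_ids, relevant_ids, k)
    print(f"k={k}: precision={p:.2f}, recall={r:.2f}")
# k=1: precision=0.00, recall=0.00
# k=3: precision=0.33, recall=0.33
# k=5: precision=0.40, recall=0.67
```

Recall can never reach 1.0 here because document 3 is not retrieved at all, regardless of how large k becomes.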

In [None]:
# 💻 STEP 17: Evaluation function
def evaluate_retrieval(retrieved_ids, relevant_ids):
    """
    Calculate Precision and Recall for retrieval.
    """
    retrieved = set(retrieved_ids)
    relevant = set(relevant_ids)

    true_positives = len(retrieved & relevant)

    precision = true_positives / len(retrieved) if retrieved else 0
    recall = true_positives / len(relevant) if relevant else 0

    return {'precision': precision, 'recall': recall}

# Example evaluation
query = "What is the parental leave policy?"

# Find which chunks are actually relevant (ground truth)
relevant_ids = [i for i, chunk in enumerate(chunks) if 'parental' in chunk.lower()]

# Get retrieved chunks
results = collection.query(query_texts=[query], n_results=3)
# IDs look like "chunk_2"; keep the integer part (avoid shadowing the built-in `id`)
retrieved_ids = [int(chunk_id.split('_')[1]) for chunk_id in results['ids'][0]]

print(f"🔍 Query: '{query}'")
print(f"\n📊 Relevant chunk IDs: {relevant_ids}")
print(f"📊 Retrieved chunk IDs: {retrieved_ids}")

metrics = evaluate_retrieval(retrieved_ids, relevant_ids)
print(f"\n✅ Precision: {metrics['precision']:.2f}")
print(f"✅ Recall: {metrics['recall']:.2f}")

---

## Part 8: Common Failure Modes

### 📝 Understanding How RAG Can Fail

In [None]:
# 💻 Failure Mode 1: Information not in knowledge base
query = "What is the company's policy on pets in the office?"

results = collection.query(query_texts=[query], n_results=2)

print(f"🔍 Query: '{query}'")
print("\n⚠️ Retrieved documents:")
for doc in results['documents'][0]:
    print(f"   → {doc[:60]}...")

print("\n❌ PROBLEM: The knowledge base has NO information about pets!")
print("   A good RAG system should say 'I don't have that information.'")
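
One possible mitigation, sketched below, is to abstain when no retrieved document is similar enough to the query. `filter_by_score` is a hypothetical helper, the scores are made up, and the 0.3 cut-off is an illustrative assumption that would need tuning on real queries.

```python
def filter_by_score(retrieved, min_score=0.3):
    """Keep (text, score) pairs at or above min_score; return None if none qualify.

    The default of 0.3 is an illustrative assumption, not a recommended value.
    """
    kept = [(doc, score) for doc, score in retrieved if score >= min_score]
    return kept or None

# Made-up low scores, as might be returned for an off-topic question
retrieved = [("Section 1: Annual Leave", 0.18), ("Section 2: Sick Leave", 0.11)]

if filter_by_score(retrieved) is None:
    print("I don't have that information.")
```

Abstaining before generation means the LLM is never handed irrelevant context it might try to answer from.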

In [None]:
# 💻 Failure Mode 2: Different terminology
query = "What's the PTO policy?"  # PTO = paid time off (American term)

results = collection.query(query_texts=[query], n_results=2)

print(f"🔍 Query: '{query}'")
print("\n📄 Retrieved:")
for doc, dist in zip(results['documents'][0], results['distances'][0]):
    print(f"   [{1-dist:.3f}] {doc[:50]}...")

print("\n⚠️ NOTE: Our documents use 'annual leave' (British), not 'PTO' (American).")
print("   Dense embeddings may still work, but it's not guaranteed.")
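
A common workaround for terminology gaps is to expand the query with the vocabulary the documents actually use before embedding it. The sketch below assumes a hand-written glossary; `GLOSSARY` and `expand_query` are hypothetical names, and the entries are examples only.

```python
import re

# Hypothetical glossary: abbreviation -> terms used in the knowledge base
GLOSSARY = {
    "pto": "paid time off, annual leave",
    "wfh": "work from home, remote working",
}

def expand_query(query, glossary=GLOSSARY):
    """Append glossary expansions for any known abbreviation found in the query."""
    words = re.findall(r"[a-z]+", query.lower())
    extras = [glossary[word] for word in words if word in glossary]
    if not extras:
        return query
    return f"{query} ({'; '.join(extras)})"

print(expand_query("What's the PTO policy?"))
# What's the PTO policy? (paid time off, annual leave)
```

The expanded string is then passed to the retriever in place of the original query, so the embedding reflects both the user's wording and the documents' wording.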

---

## Part 9: Summary

### 🔑 Key Takeaways

| Concept | What You Learned |
|---------|------------------|
| **RAG** | Retrieve → Augment → Generate |
| **Why RAG** | Addresses LLM limitations (cutoff, hallucinations, no private data) |
| **Dense embeddings** | Capture semantic meaning ("holiday" ≈ "annual leave") |
| **Chunking** | Split documents for precise retrieval |
| **Vector databases** | Fast similarity search at scale |
| **Evaluation** | Check retrieval AND generation separately |

### 📊 Recommended Settings

| Parameter | Starting Value | Notes |
|-----------|---------------|-------|
| Chunk size | 500-1000 chars | Adjust for your documents |
| Chunk overlap | 50-100 chars | ~10% of chunk size |
| Top-k | 3-5 | Balance coverage vs noise |
| Embedding model | all-MiniLM-L6-v2 | Free, fast, good quality |

---

## ✅ Lab Exercise: Build Your Own RAG System

**Task:** Build a RAG system for product reviews.

**Steps:**
1. Create a Chroma collection
2. Add the reviews
3. Query the system
4. Generate answers

In [None]:
# Sample reviews for the exercise
sample_reviews = [
    "The pen writes smoothly and the ink quality is excellent. Great value.",
    "Disappointed. The scissors broke after just one use.",
    "These markers are perfect for art projects. Vibrant colours!",
    "Overpriced. I found better alternatives at half the price.",
    "The notebook paper is too thin. Ink bleeds through.",
    "Love these coloured pencils! They blend beautifully.",
    "Poor quality control. Two pens in the pack were dried out.",
    "Fast delivery and excellent packaging. Exceeded expectations.",
    "Not worth the money. The highlighters stopped working quickly.",
    "Best art supplies I've ever purchased. Will buy again!"
]

print(f"✅ Loaded {len(sample_reviews)} sample reviews")

In [None]:
# ✅ YOUR CODE HERE - Try it yourself first!
# Step 1: Create a new collection called "product_reviews"
# review_collection = chroma_client.create_collection(...)

# Step 2: Add the reviews
# review_collection.add(...)

# Step 3: Query for "quality issues"
# results = review_collection.query(...)

# Step 4: Print the results

print("💻 Try implementing this yourself first, then check the solution below!")

### 📝 Solution

Once you've tried it yourself, check your solution against this:

In [None]:
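# ✅ SOLUTION - Step 1: Create a new collection
# Use cosine distance so that "1 - distance" is a cosine similarity
# (Chroma's default is squared L2 distance)
review_collection = chroma_client.create_collection(
    name="product_reviews",
    embedding_function=embedding_func,
    metadata={"hnsw:space": "cosine"}
)
print("✅ Step 1: Collection created!")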

In [None]:
# ✅ SOLUTION - Step 2: Add the reviews to the collection
review_collection.add(
    documents=sample_reviews,
    ids=[f"review_{i}" for i in range(len(sample_reviews))],
    metadatas=[{"source": "art_supplies", "review_id": i} for i in range(len(sample_reviews))]
)
print(f"✅ Step 2: Added {review_collection.count()} reviews to the collection!")

In [None]:
# ✅ SOLUTION - Step 3: Query the collection
test_queries = [
    "Find reviews about quality issues",
    "What do people say about the price?",
    "Are there positive reviews about pens?"
]

print("✅ Step 3: Querying the review collection")
print("=" * 65)

for query in test_queries:
    results = review_collection.query(
        query_texts=[query],
        n_results=3
    )

    print(f"\n🔍 Query: '{query}'")
    print("-" * 50)
    for i, (doc, dist) in enumerate(zip(results['documents'][0], results['distances'][0]), 1):
        similarity = 1 - dist
        print(f"   [{i}] ({similarity:.3f}) {doc}")

In [None]:
# ✅ SOLUTION - Step 4: Use the full RAG pipeline to generate answers
# First, create embeddings for the reviews (for our rag_answer function)
review_embeddings = embedding_model.encode(sample_reviews)

print("✅ Step 4: Full RAG pipeline with reviews")
print("=" * 65)

rag_queries = [
    "What are the main quality complaints?",
    "Which products are recommended?"
]

for query in rag_queries:
    answer, sources = rag_answer(query, review_embeddings, sample_reviews, embedding_model, generator, top_k=3)

    print(f"\n❓ Question: {query}")
    print(f"💬 Answer: {answer}")
    print("📚 Sources:")
    for doc, score in sources:
        print(f"   [{score:.3f}] {doc[:50]}...")

### 🔑 What You Learned in This Lab

1. **Creating a collection** - `chroma_client.create_collection()`
2. **Adding documents** - `collection.add(documents=..., ids=..., metadatas=...)`
3. **Querying** - `collection.query(query_texts=[...], n_results=k)`
4. **Full RAG pipeline** - Combining retrieval with generation

### ✅ Self-Check Questions

Can you answer these?

1. Why do we need unique IDs for each document?
2. What happens if you increase `n_results` to 5?
3. What's the difference between the Chroma query and our `retrieve_documents` function?
4. How would you add new reviews to an existing collection?

---

## 📚 References

**Original RAG Paper:**
- Lewis et al. (2020), "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks"
- https://arxiv.org/abs/2005.11401

**Tools:**
- Sentence-Transformers: https://www.sbert.net/
- Chroma: https://www.trychroma.com/
- LangChain: https://python.langchain.com/

**Further Reading:**
- "Lost in the Middle" (Liu et al., 2023)
- RAGAS evaluation framework