# Module 10: RAG Pipeline - Building a Complete Q&A System

In this module, we connect **retrieval** (from Module 9) to **generation** to build a full Retrieval-Augmented Generation (RAG) pipeline.

**What is RAG?** RAG is an architecture that enhances LLM responses by grounding them in external knowledge. Instead of relying solely on the model's training data, we:

1. **Embed the user query** using the same embedding model used for our knowledge base
2. **Search a vector database** for the most relevant chunks of text
3. **Build a prompt** that includes the retrieved context alongside the user's question
4. **Generate an answer** with an LLM, grounded in the retrieved evidence

This end-to-end flow transforms a basic LLM into a knowledge-grounded Q&A system that can answer questions about **your** data.
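
The four steps above can be sketched end to end without any external services. This is only a toy illustration under stated assumptions: word counts stand in for a real embedding model, a two-sentence list stands in for the vector database, and the final LLM call is omitted (the function returns the prompt that would be sent). The names `embed`, `cosine`, and `toy_rag` are hypothetical and are not used elsewhere in this notebook.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def toy_rag(question: str, knowledge_base: list[str], k: int = 1) -> str:
    # 1. Embed the query with the same function used for the documents
    query_vec = embed(question)
    # 2. Rank documents by similarity and keep the top k
    ranked = sorted(knowledge_base, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    context = "\n".join(ranked[:k])
    # 3. Build a prompt containing the retrieved context and the question
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    # 4. A real pipeline would send this prompt to an LLM; here we just return it
    return prompt


kb = [
    "The Dartmouth workshop in 1956 founded the field of AI research.",
    "Transformers replaced recurrence with self-attention in 2017.",
]
toy_prompt = toy_rag("When was the Dartmouth workshop held?", kb)
print(toy_prompt)
```

The rest of the notebook replaces each toy piece with a real component: a sentence-transformer for `embed`, ChromaDB for the ranking step, and the OpenAI API for generation.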

**What we will build in this notebook:**
- A complete RAG pipeline from scratch (no frameworks)
- Multiple chain strategies: Stuff, Map-Reduce, and Refine
- Source attribution so users know where answers come from
- Multi-turn conversational RAG with memory
- A brief comparison with the LangChain framework

The **from-scratch implementation** is the primary focus. LangChain is shown at the end only for comparison.

---
## 1. Setup

In [None]:
!pip install -q chromadb openai python-dotenv sentence-transformers langchain langchain-openai langchain-community

In [None]:
from dotenv import load_dotenv
import os

# Load OPENAI_API_KEY from a .env file; change this path to point at your own file
load_dotenv("/home/amir/source/.env")

In [None]:
import torch
import chromadb
from openai import OpenAI
from sentence_transformers import SentenceTransformer
from typing import Optional

# Device detection for embeddings
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Initialize OpenAI client
client = OpenAI()

# Initialize embedding model
embed_model = SentenceTransformer("all-MiniLM-L6-v2", device=device)
print(f"Embedding model loaded. Dimension: {embed_model.get_sentence_embedding_dimension()}")

---
## 2. Prepare the Knowledge Base

We will create a knowledge base of 19 detailed paragraphs covering the **History and Concepts of Artificial Intelligence**. This gives us rich, factual content to query against.

After defining the documents, we will:
1. **Chunk** them using recursive character splitting (as in Module 9)
2. **Embed** each chunk using our sentence-transformer model
3. **Store** the embeddings in a ChromaDB collection

In [None]:
# Our knowledge base: detailed paragraphs about AI history and concepts
documents = [
    """The concept of artificial intelligence dates back to antiquity, with myths and stories of artificial beings endowed with intelligence. However, the formal field of AI research was founded at a workshop held at Dartmouth College in the summer of 1956. The workshop was organized by John McCarthy, Marvin Minsky, Nathaniel Rochester, and Claude Shannon. They proposed that 'every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it.' This optimistic vision set the stage for decades of research.""",

    """Alan Turing, often considered the father of theoretical computer science, made foundational contributions to AI. In his 1950 paper 'Computing Machinery and Intelligence,' he proposed the Turing Test as a measure of machine intelligence. The test involves a human evaluator who converses with both a human and a machine through text. If the evaluator cannot reliably distinguish the machine from the human, the machine is said to have passed the test. While the Turing Test has been debated and criticized over the decades, it remains an influential concept in AI philosophy.""",

    """The early years of AI (1956-1974) are sometimes called the 'Golden Age' of AI research. During this period, researchers developed programs that could solve algebra problems, prove logical theorems, and even speak English. The General Problem Solver (GPS), created by Herbert Simon and Allen Newell in 1957, was one of the first programs designed to imitate human problem-solving. ELIZA, created by Joseph Weizenbaum in 1966, simulated a psychotherapist and demonstrated natural language processing, though it relied on pattern matching rather than true understanding.""",

    """The first 'AI Winter' occurred roughly between 1974 and 1980. Government funding agencies like DARPA became disillusioned with AI research because early promises of human-level AI had not materialized. The Lighthill Report in the UK (1973) was particularly critical, arguing that AI had failed to achieve its ambitious goals. Funding was drastically cut, and many researchers left the field. This period demonstrated that AI progress would not be linear and that the gap between narrow demonstrations and general intelligence was far wider than initially thought.""",

    """Expert systems revived AI in the 1980s and led to a boom in commercial AI. These systems encoded human expert knowledge as if-then rules and could make decisions in specialized domains. MYCIN, developed at Stanford in the 1970s, diagnosed bacterial infections and recommended antibiotics. R1/XCON, used by Digital Equipment Corporation starting in 1980, configured computer orders and saved the company an estimated $40 million per year. By 1985, the AI industry was worth over $1 billion, with companies investing heavily in expert system technology.""",

    """The second AI Winter began in the late 1980s and lasted into the mid-1990s. Expert systems proved expensive to maintain, brittle in the face of unexpected inputs, and unable to learn from experience. The collapse of the Lisp machine market and reduced government spending contributed to widespread disillusionment. Many AI companies went bankrupt, and the term 'artificial intelligence' became something of a stigma in grant applications. Researchers rebranded their work under terms like 'machine learning,' 'knowledge-based systems,' or 'computational intelligence.'""",

    """Machine learning, a subfield of AI, focuses on algorithms that improve through experience. Rather than programming explicit rules, machine learning systems learn patterns from data. Supervised learning trains on labeled examples (e.g., classifying emails as spam or not spam), unsupervised learning finds hidden structure in unlabeled data (e.g., customer segmentation), and reinforcement learning trains agents through rewards and penalties (e.g., game playing). The shift from rule-based systems to data-driven learning was one of the most important transitions in AI history.""",

    """Neural networks, inspired by the structure of the human brain, consist of layers of interconnected nodes (neurons). Each connection has a weight that is adjusted during training. The backpropagation algorithm, popularized by Rumelhart, Hinton, and Williams in 1986, enabled efficient training of multi-layer networks by propagating error gradients backwards through the network. Despite this breakthrough, neural networks fell out of favor in the 1990s due to limited computational power and the difficulty of training deep architectures, which led to problems like vanishing gradients.""",

    """Deep learning, a subset of machine learning using neural networks with many layers, revolutionized AI starting around 2012. The breakthrough moment came when AlexNet, a deep convolutional neural network designed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, won the ImageNet competition by a large margin. This success was enabled by three factors: large datasets (ImageNet had millions of labeled images), powerful GPUs that could train deep networks efficiently, and algorithmic innovations like ReLU activation functions and dropout regularization.""",

    """Convolutional Neural Networks (CNNs) are specialized for processing grid-like data such as images. They use convolutional layers that apply learnable filters across the input, detecting features like edges, textures, and shapes. Pooling layers reduce spatial dimensions, making the network invariant to small translations. CNNs have achieved superhuman performance on many image recognition tasks. Notable architectures include LeNet (1998) for digit recognition, VGGNet (2014) for its simplicity and depth, ResNet (2015) which introduced skip connections to train very deep networks, and EfficientNet (2019) which optimized the trade-off between accuracy and computational cost.""",

    """Recurrent Neural Networks (RNNs) are designed for sequential data like text and time series. They maintain a hidden state that acts as memory, allowing information to persist across time steps. However, standard RNNs struggle with long-range dependencies due to vanishing gradients. Long Short-Term Memory (LSTM) networks, invented by Hochreiter and Schmidhuber in 1997, solved this with gating mechanisms that control information flow. Gated Recurrent Units (GRUs), introduced by Cho et al. in 2014, offered a simpler alternative with comparable performance. RNNs dominated natural language processing until the advent of transformers.""",

    """The Transformer architecture, introduced in the landmark paper 'Attention Is All You Need' by Vaswani et al. in 2017, replaced recurrence with self-attention mechanisms. Self-attention allows each token in a sequence to attend to every other token, capturing long-range dependencies efficiently. Transformers process all tokens in parallel rather than sequentially, making them much faster to train on modern hardware. The architecture consists of an encoder and decoder, each composed of multi-head attention layers and feed-forward networks with residual connections and layer normalization.""",

    """Large Language Models (LLMs) are transformer-based models trained on massive text corpora. GPT (Generative Pre-trained Transformer) by OpenAI demonstrated that pre-training on large amounts of text followed by fine-tuning on specific tasks could achieve state-of-the-art results across many NLP benchmarks. GPT-2 (2019) had 1.5 billion parameters and could generate remarkably coherent text. GPT-3 (2020) scaled to 175 billion parameters and exhibited few-shot learning abilities. GPT-4 (2023) further advanced capabilities with multimodal inputs. BERT, developed by Google in 2018, used bidirectional training and excelled at understanding tasks like question answering and sentiment analysis.""",

    """Retrieval-Augmented Generation (RAG) is a technique that enhances LLM responses by incorporating external knowledge retrieved from a database or document collection. Introduced by Lewis et al. in 2020, RAG addresses key limitations of LLMs: hallucination (generating plausible but incorrect information), knowledge cutoff (not knowing about events after training), and lack of source attribution. In a RAG pipeline, a user query is first used to retrieve relevant documents from a knowledge base, then these documents are provided as context to the LLM, which generates a grounded answer. RAG has become a standard pattern for building production AI applications.""",

    """Vector databases are specialized systems designed to store and efficiently search high-dimensional vectors (embeddings). Unlike traditional databases that match exact values, vector databases find the most similar vectors using distance metrics like cosine similarity, Euclidean distance, or dot product. Popular vector databases include Pinecone (managed cloud service), Weaviate (open-source with hybrid search), ChromaDB (lightweight and embeddable), Milvus (open-source and highly scalable), and Qdrant (Rust-based with filtering). These databases use approximate nearest neighbor (ANN) algorithms like HNSW or IVF to achieve sub-millisecond search times even with millions of vectors.""",

    """Embeddings are dense numerical representations of data (text, images, audio) in a continuous vector space. Text embeddings capture semantic meaning: words and sentences with similar meanings are mapped to nearby points in the embedding space. Early embedding models like Word2Vec (2013) and GloVe (2014) produced word-level vectors. Modern sentence embedding models like Sentence-BERT (2019) and OpenAI's text-embedding-ada-002 produce embeddings for entire sentences or paragraphs. The quality of embeddings is crucial for RAG systems because retrieval accuracy depends directly on how well the embeddings capture semantic similarity.""",

    """Prompt engineering is the practice of designing effective prompts to guide LLM behavior. Techniques include zero-shot prompting (giving no examples), few-shot prompting (providing several examples in the prompt), chain-of-thought prompting (asking the model to reason step by step), and system prompts (setting the model's role and behavior). In RAG systems, prompt engineering is critical for the generation step: the prompt must clearly instruct the model to answer based on the provided context, avoid making up information, and cite sources when possible. A well-engineered prompt can dramatically improve answer quality and reduce hallucinations.""",

    """Reinforcement Learning from Human Feedback (RLHF) is a technique used to align LLMs with human preferences. The process involves three steps: first, the model is pre-trained on large text corpora using standard language modeling objectives; second, human raters rank model outputs by quality, and a reward model is trained on these rankings; third, the language model is fine-tuned using reinforcement learning (typically PPO - Proximal Policy Optimization) to maximize the reward model's score. RLHF was a key ingredient in making ChatGPT more helpful, harmless, and honest compared to the base GPT model. Constitutional AI (CAI), developed by Anthropic, extends this idea by using AI feedback alongside human feedback.""",

    """AI safety and alignment research focuses on ensuring that AI systems behave in ways that are beneficial to humans. Key concerns include the alignment problem (ensuring AI goals match human intentions), robustness (ensuring AI performs reliably in novel situations), interpretability (understanding why AI makes certain decisions), and misuse prevention (stopping AI from being used for harmful purposes). Organizations like Anthropic, OpenAI, DeepMind, and MIRI conduct alignment research. Techniques include RLHF, constitutional AI, red-teaming (adversarial testing), mechanistic interpretability (reverse-engineering neural network computations), and scalable oversight (using AI to help humans supervise more powerful AI systems)."""
]

print(f"Knowledge base: {len(documents)} documents")
print(f"Sample (first 100 chars): {documents[0][:100]}...")

### Chunking the Documents

We use recursive character splitting to break long documents into smaller, overlapping chunks. Keeping chunks small (here roughly 300 characters) helps each one fit comfortably within the embedding model's input limit, and the overlap reduces the chance that text near a chunk boundary loses its surrounding context.
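
To make the role of overlap concrete, here is a minimal sliding-window sketch. It is not the recursive splitter itself, only the fixed-step fallback idea: each window starts `chunk_size - chunk_overlap` characters after the previous one, so neighbouring chunks share `chunk_overlap` characters. The helper name `sliding_window_chunks` and the alphabet sample are illustrative assumptions.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows that overlap by chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghijklmnopqrstuvwxyz"
window_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
for c in window_chunks:
    print(repr(c))
# 'abcdefghij'
# 'hijklmnopq'
# 'opqrstuvwx'
# 'vwxyz'
```

Each chunk begins with the last three characters of the previous one, which is the property that keeps boundary text visible in two chunks.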

In [None]:
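def recursive_character_split(text: str, chunk_size: int = 300, chunk_overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks using recursive character splitting.

    chunk_size and chunk_overlap are measured in characters, not tokens.
    """
    # Separators are tried in order, from coarsest (paragraphs) to finest (single characters)
    separators = ["\n\n", "\n", ". ", ", ", " ", ""]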

    def _split(text: str, sep_idx: int = 0) -> list[str]:
        if len(text) <= chunk_size:
            return [text.strip()] if text.strip() else []

        if sep_idx >= len(separators):
            # Last resort: hard split
            result = []
            for i in range(0, len(text), chunk_size - chunk_overlap):
                piece = text[i:i + chunk_size].strip()
                if piece:
                    result.append(piece)
            return result

        sep = separators[sep_idx]
        if sep == "":
            # Character-level split
            result = []
            for i in range(0, len(text), chunk_size - chunk_overlap):
                piece = text[i:i + chunk_size].strip()
                if piece:
                    result.append(piece)
            return result

        parts = text.split(sep)
        current = ""
        result = []

        for part in parts:
            candidate = current + sep + part if current else part
            if len(candidate) <= chunk_size:
                current = candidate
            else:
                if current.strip():
                    result.append(current.strip())
                # If a single part is too large, recursively split it
                if len(part) > chunk_size:
                    result.extend(_split(part, sep_idx + 1))
                    current = ""
                else:
                    # Start new chunk with overlap from the end of previous chunk
                    if current and chunk_overlap > 0:
                        overlap_text = current[-chunk_overlap:]
                        current = overlap_text + sep + part
                    else:
                        current = part

        if current.strip():
            result.append(current.strip())

        return result

    return _split(text)


# Chunk all documents and track source metadata
all_chunks = []
chunk_metadata = []  # Track which document each chunk came from

for doc_idx, doc in enumerate(documents):
    chunks = recursive_character_split(doc, chunk_size=300, chunk_overlap=50)
    for chunk in chunks:
        all_chunks.append(chunk)
        chunk_metadata.append({
            "doc_id": doc_idx,
            "source": f"Document {doc_idx + 1}",
            "preview": doc[:80] + "..."
        })

print(f"Total chunks created: {len(all_chunks)}")
print(f"\nSample chunk (index 0):")
print(f"  Text: {all_chunks[0][:150]}...")
print(f"  Metadata: {chunk_metadata[0]}")
print(f"\nSample chunk (index 5):")
print(f"  Text: {all_chunks[5][:150]}...")
print(f"  Metadata: {chunk_metadata[5]}")

### Embed and Store in ChromaDB

In [None]:
# Embed all chunks
print("Embedding chunks...")
chunk_embeddings = embed_model.encode(all_chunks, show_progress_bar=True)
print(f"Embeddings shape: {chunk_embeddings.shape}")

# Create ChromaDB collection
chroma_client = chromadb.Client()  # In-memory client

# Delete collection if it already exists (for re-running)
try:
    chroma_client.delete_collection(name="ai_knowledge_base")
except Exception:
    # The collection does not exist yet on a first run; nothing to delete
    pass

collection = chroma_client.create_collection(
    name="ai_knowledge_base",
    metadata={"hnsw:space": "cosine"}
)

# Add chunks to ChromaDB
collection.add(
    ids=[f"chunk_{i}" for i in range(len(all_chunks))],
    embeddings=chunk_embeddings.tolist(),
    documents=all_chunks,
    metadatas=chunk_metadata
)

print(f"\nChromaDB collection '{collection.name}' created with {collection.count()} chunks.")

---
## 3. Building RAG from Scratch (No Framework)

Now we build the complete RAG pipeline step by step. No LangChain, no abstractions -- just Python, ChromaDB, and the OpenAI API.

### Step 1: Embed the User Query

We embed the user's question using the **same embedding model** that was used for the knowledge base. This ensures the query vector lives in the same space as the document vectors.
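
Why the shared space matters can be seen in a small sketch of cosine similarity, the measure our collection is configured to use. The three-dimensional vectors below are made-up values for illustration (real embeddings from this notebook's model have far more dimensions), and `cosine_similarity` is a hypothetical helper, not a library call. Vectors from different models may differ in length or meaning, so comparing them is either impossible or meaningless.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


query_demo = [1.0, 0.0, 1.0]
doc_close = [2.0, 0.0, 2.0]   # same direction as the query, different length
doc_far = [0.0, 3.0, 0.0]     # orthogonal to the query

sim_close = cosine_similarity(query_demo, doc_close)
sim_far = cosine_similarity(query_demo, doc_far)
print(f"close: {sim_close:.3f}, far: {sim_far:.3f}")
```

Cosine similarity ignores vector length and looks only at direction, which is why `doc_close` scores as a near-perfect match even though its values are twice as large.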

In [None]:
def embed_query(query: str) -> list[float]:
    """Embed a user query using the same model as the knowledge base."""
    embedding = embed_model.encode(query)
    return embedding.tolist()

# Example
test_query = "What is the Transformer architecture?"
query_vec = embed_query(test_query)
print(f"Query: '{test_query}'")
print(f"Embedding dimension: {len(query_vec)}")
print(f"First 5 values: {query_vec[:5]}")

### Step 2: Retrieve Top-k Relevant Chunks from ChromaDB

In [None]:
def retrieve(query: str, collection, k: int = 3) -> dict:
    """Retrieve the top-k most relevant chunks for a query."""
    query_embedding = embed_query(query)
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=k,
        include=["documents", "metadatas", "distances"]
    )
    return results

# Example retrieval
results = retrieve("What is the Transformer architecture?", collection, k=3)

print("Retrieved chunks:")
for i, (doc, meta, dist) in enumerate(zip(
    results["documents"][0],
    results["metadatas"][0],
    results["distances"][0]
)):
    print(f"\n--- Chunk {i+1} (distance: {dist:.4f}, source: {meta['source']}) ---")
    print(doc[:200] + "...")
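
Because the collection was created with `"hnsw:space": "cosine"`, the distances returned are cosine distances (1 minus cosine similarity), so smaller values mean closer matches. One possible use of that number is to drop weak matches before building the prompt. The sketch below assumes the nested-list result shape shown above; `filter_by_distance`, the mock values, and the 0.5 cutoff are all illustrative assumptions, and a sensible threshold would need to be tuned on your own data.

```python
def filter_by_distance(results: dict, max_distance: float) -> list[dict]:
    """Keep only hits whose distance is at or below max_distance."""
    hits = []
    for doc, meta, dist in zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0],
    ):
        if dist <= max_distance:
            hits.append({"text": doc, "source": meta["source"], "distance": dist})
    return hits


# Mock data shaped like a ChromaDB query result for a single query (values are made up)
mock_results = {
    "documents": [["Transformers use self-attention.", "MYCIN diagnosed infections."]],
    "metadatas": [[{"source": "Document 12"}, {"source": "Document 5"}]],
    "distances": [[0.21, 0.83]],
}

kept = filter_by_distance(mock_results, max_distance=0.5)
print(kept)
```

If every hit is filtered out, the pipeline can skip generation and report that no relevant context was found.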

### Step 3: Build the Prompt with Context + Question

In [None]:
def build_prompt(query: str, context_chunks: list[str]) -> str:
    """Build a prompt that includes retrieved context and the user's question."""
    context = "\n\n".join(
        f"[Context {i+1}]: {chunk}" for i, chunk in enumerate(context_chunks)
    )

    prompt = f"""You are a helpful AI assistant. Answer the user's question based ONLY on the provided context.
If the context does not contain enough information to answer the question, say "I don't have enough information to answer this question."
Do not make up information that is not supported by the context.

Context:
{context}

Question: {query}

Answer:"""
    return prompt

# Example
sample_prompt = build_prompt(
    "What is the Transformer architecture?",
    results["documents"][0]
)
print(sample_prompt[:500] + "\n...")

### Step 4: Call OpenAI API to Generate the Answer

In [None]:
def generate_answer(prompt: str, model: str = "gpt-4o-mini", temperature: float = 0.2) -> str:
    """Generate an answer using OpenAI's chat completions API."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant that answers questions based on provided context."},
            {"role": "user", "content": prompt}
        ],
        temperature=temperature,
        max_tokens=1024
    )
    return response.choices[0].message.content

# Example generation
answer = generate_answer(sample_prompt)
print("Question: What is the Transformer architecture?")
print(f"\nAnswer:\n{answer}")

### Complete RAG Flow

Now let us put all four steps together into one function and show the complete flow with retrieved context printed alongside the answer.

In [None]:
def rag_query(question: str, collection, k: int = 3, verbose: bool = True) -> str:
    """Complete RAG pipeline: retrieve, build prompt, generate answer."""
    # Step 1 & 2: Retrieve relevant chunks
    results = retrieve(question, collection, k=k)
    context_chunks = results["documents"][0]
    metadatas = results["metadatas"][0]
    distances = results["distances"][0]

    if verbose:
        print("=" * 60)
        print(f"Question: {question}")
        print("=" * 60)
        print(f"\nRetrieved {len(context_chunks)} chunks:")
        for i, (chunk, meta, dist) in enumerate(zip(context_chunks, metadatas, distances)):
            print(f"  [{i+1}] (distance: {dist:.4f}, {meta['source']}) {chunk[:100]}...")

    # Step 3: Build prompt
    prompt = build_prompt(question, context_chunks)

    # Step 4: Generate answer
    answer = generate_answer(prompt)

    if verbose:
        print(f"\nAnswer:\n{answer}")
        print("=" * 60)

    return answer


# Test with several questions
questions = [
    "What caused the first AI Winter?",
    "How does RAG work and why is it useful?",
    "What is the difference between CNNs and RNNs?"
]

for q in questions:
    rag_query(q, collection)
    print()

### Exercise 1: Build a RAG System

Implement your own `rag_query_exercise` function that:
1. Takes a question and a ChromaDB collection
2. Retrieves the top-k relevant chunks
3. Builds a prompt with the context
4. Generates and returns the answer

Test it with the question: *"What is RLHF and how does it work?"*

In [None]:
def rag_query_exercise(question: str, collection, k: int = 3) -> str:
    """
    TODO: Implement a complete RAG pipeline.

    Steps:
    1. Embed the question using embed_query()
    2. Query the collection for top-k results
    3. Build a prompt with the retrieved context
    4. Generate and return the answer
    """
    # TODO: Retrieve relevant chunks
    retrieved = None

    # TODO: Extract document texts from results
    context_chunks = None

    # TODO: Build the prompt
    prompt = None

    # TODO: Generate and return the answer
    answer = None

    return answer


# Test
# answer = rag_query_exercise("What is RLHF and how does it work?", collection)
# print(answer)

### Solution

In [None]:
def rag_query_exercise(question: str, collection, k: int = 3) -> str:
    """
    Complete RAG pipeline: retrieve context, build prompt, generate answer.
    """
    # Step 1 & 2: Retrieve relevant chunks
    query_embedding = embed_query(question)
    retrieved = collection.query(
        query_embeddings=[query_embedding],
        n_results=k,
        include=["documents", "metadatas", "distances"]
    )

    # Extract document texts
    context_chunks = retrieved["documents"][0]

    # Step 3: Build the prompt
    context = "\n\n".join(
        f"[Context {i+1}]: {chunk}" for i, chunk in enumerate(context_chunks)
    )
    prompt = f"""You are a helpful AI assistant. Answer the question based ONLY on the provided context.
If the context does not contain enough information, say so.

Context:
{context}

Question: {question}

Answer:"""

    # Step 4: Generate the answer
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.2,
        max_tokens=1024
    )
    answer = response.choices[0].message.content
    return answer


# Test
answer = rag_query_exercise("What is RLHF and how does it work?", collection)
print("Question: What is RLHF and how does it work?")
print(f"\nAnswer:\n{answer}")

---
## 4. Chain Strategies

When we retrieve multiple chunks, how should we combine them and send them to the LLM? There are three common strategies:

| Strategy | Approach | Pros | Cons |
|----------|----------|------|------|
| **Stuff** | Concatenate all chunks into one prompt | Simple, one API call | Limited by context window |
| **Map-Reduce** | Summarize each chunk, then combine summaries | Handles many chunks | Multiple API calls, slower |
| **Refine** | Iteratively refine answer with each chunk | Preserves detail | Sequential, slowest |

Let us implement each one from scratch.
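
The cost column of the table can be made concrete by counting LLM calls. The helper below is a hypothetical sketch (`llm_calls` is not part of the notebook's pipeline) and assumes the simplest form of each strategy: one call for Stuff, one call per chunk plus a final reduce call for Map-Reduce, and one initial call plus one refinement per remaining chunk for Refine.

```python
def llm_calls(strategy: str, k: int) -> int:
    """Number of LLM calls a strategy makes for k retrieved chunks."""
    if k < 1:
        raise ValueError("k must be at least 1")
    counts = {
        "stuff": 1,           # all chunks in a single prompt
        "map_reduce": k + 1,  # k map calls + 1 reduce call
        "refine": k,          # 1 initial answer + (k - 1) refinements
    }
    return counts[strategy]


for name in ("stuff", "map_reduce", "refine"):
    print(f"{name}: {llm_calls(name, k=5)} calls")
# stuff: 1 calls
# map_reduce: 6 calls
# refine: 5 calls
```

Call count is only part of the picture: Map-Reduce makes the most calls, but its map calls are independent of each other, whereas each Refine call must wait for the previous one.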

### 4.1 Stuff Method

The simplest approach: concatenate all retrieved chunks into a single prompt.

**Pros:** Simple, one API call, preserves all context.  
**Cons:** Can exceed the LLM's context window if too many chunks are retrieved.
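
One way to guard against overflowing the context window is to cap the total size of the stuffed context. The sketch below is an assumption-laden simplification: it budgets in characters rather than tokens (a real limit is token-based and model-specific), and `fit_to_budget` is a hypothetical helper rather than part of the notebook's pipeline. It keeps chunks in ranked order and stops at the first one that would exceed the budget.

```python
def fit_to_budget(chunks: list[str], max_chars: int) -> list[str]:
    """Keep chunks in ranked order until the next one would exceed max_chars."""
    kept: list[str] = []
    used = 0
    for chunk in chunks:
        if used + len(chunk) > max_chars:
            break  # stop here so the best-ranked chunks are the ones kept
        kept.append(chunk)
        used += len(chunk)
    return kept


# Three placeholder chunks of 120, 100, and 90 characters
ranked_chunks = ["a" * 120, "b" * 100, "c" * 90]
budgeted = fit_to_budget(ranked_chunks, max_chars=250)
print([len(c) for c in budgeted])  # [120, 100]
```

Since retrieval returns chunks ordered from most to least similar, trimming from the end discards the weakest matches first.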

In [None]:
def stuff_chain(question: str, collection, k: int = 5) -> str:
    """Stuff method: concatenate all chunks into one prompt."""
    # Retrieve chunks
    results = retrieve(question, collection, k=k)
    context_chunks = results["documents"][0]

    # Stuff all chunks into a single context
    context = "\n\n".join(
        f"[Passage {i+1}]: {chunk}" for i, chunk in enumerate(context_chunks)
    )

    prompt = f"""You are a knowledgeable AI assistant. Answer the question thoroughly based ONLY on the provided passages.
If the passages do not contain enough information, say so.

Passages:
{context}

Question: {question}

Provide a comprehensive answer:"""

    answer = generate_answer(prompt)
    return answer


# Test
question = "Explain the evolution of neural networks from simple models to deep learning."
print(f"Question: {question}\n")
answer = stuff_chain(question, collection, k=5)
print(f"[Stuff Method] Answer:\n{answer}")

### 4.2 Map-Reduce Method

Two-phase approach:
1. **Map:** Summarize each retrieved chunk independently (the calls do not depend on each other, so they could run in parallel; the code below runs them one after another)
2. **Reduce:** Combine all summaries into a final answer

**Pros:** Can handle a large number of chunks without exceeding context limits.  
**Cons:** Multiple API calls (one per chunk + one final), slower and more expensive.
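
Since each map call depends only on its own chunk, the map phase is a natural candidate for concurrency. The following sketch shows one possible approach using a thread pool from the standard library. `fake_map_call` is a stand-in for a real LLM request (it simply returns the first sentence), and `parallel_map` is a hypothetical helper; a real version would also need to respect API rate limits.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_map_call(chunk: str) -> str:
    """Stand-in for one LLM call: returns the first sentence of the chunk."""
    return chunk.split(".")[0].strip()


def parallel_map(chunks: list[str], map_fn, max_workers: int = 4) -> list[str]:
    """Apply map_fn to every chunk concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as the inputs
        return list(pool.map(map_fn, chunks))


demo_chunks = [
    "AlexNet won ImageNet in 2012. More text follows.",
    "LSTMs were introduced in 1997. More text follows.",
]
demo_summaries = parallel_map(demo_chunks, fake_map_call)
print(demo_summaries)
# ['AlexNet won ImageNet in 2012', 'LSTMs were introduced in 1997']
```

Threads suit this workload because the time is spent waiting on network responses rather than on computation.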

In [None]:
def map_reduce_chain(question: str, collection, k: int = 5) -> str:
    """Map-Reduce method: summarize each chunk, then combine summaries."""
    # Retrieve chunks
    results = retrieve(question, collection, k=k)
    context_chunks = results["documents"][0]

    # MAP phase: extract relevant information from each chunk
    summaries = []
    for i, chunk in enumerate(context_chunks):
        map_prompt = f"""Given the following passage and a question, extract ONLY the information from the passage that is relevant to answering the question.
If the passage contains no relevant information, respond with "No relevant information."

Passage: {chunk}

Question: {question}

Relevant information:"""

        summary = generate_answer(map_prompt)
        summaries.append(summary)
        print(f"  Map step {i+1}/{len(context_chunks)} complete.")

    # REDUCE phase: combine all summaries into a final answer
    combined_summaries = "\n\n".join(
        f"[Summary {i+1}]: {s}" for i, s in enumerate(summaries)
    )

    reduce_prompt = f"""You are a knowledgeable AI assistant. Based on the following summaries extracted from multiple sources, provide a comprehensive answer to the question.
Synthesize the information and avoid repetition.

Summaries:
{combined_summaries}

Question: {question}

Comprehensive answer:"""

    answer = generate_answer(reduce_prompt)
    print("  Reduce step complete.")
    return answer


# Test
question = "Explain the evolution of neural networks from simple models to deep learning."
print(f"Question: {question}\n")
print("Running Map-Reduce chain...")
answer = map_reduce_chain(question, collection, k=5)
print(f"\n[Map-Reduce Method] Answer:\n{answer}")

### 4.3 Refine Method

Iterative approach: start with the first chunk, generate an initial answer, then refine it with each subsequent chunk.

**Pros:** Can incorporate nuanced details from each chunk progressively.  
**Cons:** Strictly sequential (each step depends on the previous), slowest of the three.
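
The sequential dependency is easiest to see when the LLM calls are replaced with stubs. In this sketch, `refine_loop` is a hypothetical helper that takes two callables: one producing a first draft from the first chunk, and one refining an existing answer with a new chunk. The lambda stubs just record the call structure as a string; they are assumptions for illustration, not real model calls.

```python
from typing import Callable


def refine_loop(
    chunks: list[str],
    first_step: Callable[[str], str],
    refine_step: Callable[[str, str], str],
) -> str:
    """Draft an answer from the first chunk, then refine it with each later chunk."""
    if not chunks:
        raise ValueError("need at least one chunk")
    answer = first_step(chunks[0])
    for chunk in chunks[1:]:
        # Each step needs the previous answer, so the steps cannot run in parallel
        answer = refine_step(answer, chunk)
    return answer


trace = refine_loop(
    ["c1", "c2", "c3"],
    first_step=lambda c: f"draft({c})",
    refine_step=lambda ans, c: f"refine({ans}, {c})",
)
print(trace)  # refine(refine(draft(c1), c2), c3)
```

The nested output shows that the answer after chunk 3 wraps the answer after chunk 2, which is exactly why this strategy cannot be parallelized the way the map phase can.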

In [None]:
def refine_chain(question: str, collection, k: int = 5) -> str:
    """Refine method: iteratively refine the answer with each chunk."""
    # Retrieve chunks
    results = retrieve(question, collection, k=k)
    context_chunks = results["documents"][0]

    # Initial answer from the first chunk
    initial_prompt = f"""You are a knowledgeable AI assistant. Answer the question based on the provided context.
If the context is insufficient, provide what you can and note what is missing.

Context: {context_chunks[0]}

Question: {question}

Answer:"""

    current_answer = generate_answer(initial_prompt)
    print("  Initial answer generated from chunk 1.")

    # Refine with each subsequent chunk
    for i, chunk in enumerate(context_chunks[1:], start=2):
        refine_prompt = f"""You are a knowledgeable AI assistant. You have an existing answer to a question, and now you have additional context.
Refine the existing answer by incorporating any new relevant information from the additional context.
If the new context is not relevant, keep the existing answer unchanged.
Do not remove correct information from the existing answer.

Existing answer: {current_answer}

Additional context: {chunk}

Question: {question}

Refined answer:"""

        current_answer = generate_answer(refine_prompt)
        print(f"  Refined with chunk {i}/{len(context_chunks)}.")

    return current_answer


# Test
question = "Explain the evolution of neural networks from simple models to deep learning."
print(f"Question: {question}\n")
print("Running Refine chain...")
answer = refine_chain(question, collection, k=5)
print(f"\n[Refine Method] Answer:\n{answer}")

### Comparing the Three Strategies

Let us run the same question through all three methods and compare the outputs side by side.
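
Speed is part of the comparison, so it can help to measure it rather than assume it. The sketch below is a minimal timing helper built on the standard library; `time_call` is a hypothetical name, and the lambda stands in for a real chain call so that the example runs without an API key.

```python
import time
from typing import Callable


def time_call(fn: Callable[[], str]) -> tuple[str, float]:
    """Run fn once and return its result together with the elapsed seconds."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stub in place of a real chain call
result, elapsed = time_call(lambda: "stub answer")
print(f"{result!r} took {elapsed:.6f}s")
```

In this notebook the stub could be swapped for something like `lambda: stuff_chain(comparison_question, collection, k=5)`. A single run is noisy because network latency varies, so repeated runs give a fairer comparison.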

In [None]:
comparison_question = "What are the key milestones in AI history?"

print("=" * 70)
print(f"Comparison Question: {comparison_question}")
print("=" * 70)

# Stuff
print("\n--- STUFF METHOD ---")
stuff_answer = stuff_chain(comparison_question, collection, k=5)
print(stuff_answer)

# Map-Reduce
print("\n--- MAP-REDUCE METHOD ---")
mr_answer = map_reduce_chain(comparison_question, collection, k=5)
print(mr_answer)

# Refine
print("\n--- REFINE METHOD ---")
refine_answer = refine_chain(comparison_question, collection, k=5)
print(refine_answer)

print("\n" + "=" * 70)
print("Observation: the three methods usually produce similar content but differ")
print("in structure, detail, and cost. Stuff needs a single LLM call, Map-Reduce")
print("keeps each prompt small (k + 1 calls), and Refine builds the answer")
print("incrementally (k sequential calls).")
print("=" * 70)

### Exercise 2: Implement Stuff and Map-Reduce from Scratch

Implement both the **Stuff** and **Map-Reduce** chain strategies yourself. Then run the same question through both and compare the quality of answers.

Question to test: *"What are vector databases and why are they important for AI?"*

In [None]:
def stuff_chain_exercise(question: str, collection, k: int = 5) -> str:
    """
    TODO: Implement the Stuff chain strategy.

    Steps:
    1. Retrieve top-k chunks
    2. Concatenate all chunks into a single context string
    3. Build a prompt with the combined context + question
    4. Generate and return the answer
    """
    # TODO: Retrieve chunks
    results = None

    # TODO: Build combined context
    context = None

    # TODO: Build prompt and generate answer
    answer = None

    return answer


def map_reduce_chain_exercise(question: str, collection, k: int = 5) -> str:
    """
    TODO: Implement the Map-Reduce chain strategy.

    Steps:
    1. Retrieve top-k chunks
    2. MAP: For each chunk, extract relevant information (one API call per chunk)
    3. REDUCE: Combine all extracted summaries into a final answer (one API call)
    """
    # TODO: Retrieve chunks
    results = None

    # TODO: MAP phase - summarize each chunk
    summaries = None

    # TODO: REDUCE phase - combine summaries into final answer
    answer = None

    return answer


# Test
# test_q = "What are vector databases and why are they important for AI?"
# print("Stuff:", stuff_chain_exercise(test_q, collection))
# print("\nMap-Reduce:", map_reduce_chain_exercise(test_q, collection))

### Solution

In [None]:
def stuff_chain_exercise(question: str, collection, k: int = 5) -> str:
    """Stuff method: put all chunks into one prompt."""
    # Retrieve
    results = retrieve(question, collection, k=k)
    context_chunks = results["documents"][0]

    # Build combined context
    context = "\n\n".join(
        f"[Passage {i+1}]: {chunk}" for i, chunk in enumerate(context_chunks)
    )

    # Build prompt and generate
    prompt = f"""Answer the question based ONLY on the provided passages.

Passages:
{context}

Question: {question}

Answer:"""

    answer = generate_answer(prompt)
    return answer


def map_reduce_chain_exercise(question: str, collection, k: int = 5) -> str:
    """Map-Reduce method: summarize each chunk, then combine."""
    # Retrieve
    results = retrieve(question, collection, k=k)
    context_chunks = results["documents"][0]

    # MAP: extract relevant info from each chunk
    summaries = []
    for chunk in context_chunks:
        map_prompt = f"""Extract information relevant to the question from this passage.
If nothing is relevant, say "No relevant information."

Passage: {chunk}
Question: {question}

Relevant information:"""
        summary = generate_answer(map_prompt)
        summaries.append(summary)

    # REDUCE: combine summaries
    combined = "\n\n".join(f"[{i+1}]: {s}" for i, s in enumerate(summaries))
    reduce_prompt = f"""Synthesize the following extracted information into a comprehensive answer.

Extracted information:
{combined}

Question: {question}

Answer:"""

    answer = generate_answer(reduce_prompt)
    return answer


# Test and compare
test_q = "What are vector databases and why are they important for AI?"
print(f"Question: {test_q}\n")

print("--- STUFF ---")
print(stuff_chain_exercise(test_q, collection))

print("\n--- MAP-REDUCE ---")
print(map_reduce_chain_exercise(test_q, collection))

---
## 5. Source Attribution

A production RAG system should tell users **where** each answer came from. This builds trust and lets users verify the information.

We will modify our RAG pipeline to:
1. Number each retrieved context passage
2. Instruct the LLM to cite sources using bracket notation like [1], [2]
3. Display the retrieved sources below the answer so each citation can be checked
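
One way to narrow the displayed list to sources the model actually cited is to parse the bracket citations out of the generated answer. This is a minimal sketch, assuming the answer uses the `[n]` notation requested in the prompt and that each source is a dict with an `id` key; `extract_cited_ids` and `filter_cited_sources` are hypothetical helpers, not part of the pipeline in this notebook.

```python
import re


def extract_cited_ids(answer: str) -> list[int]:
    """Return the distinct source numbers cited as [n], in order of first appearance."""
    seen = []
    for match in re.finditer(r"\[(\d+)\]", answer):
        num = int(match.group(1))
        if num not in seen:
            seen.append(num)
    return seen


def filter_cited_sources(answer: str, sources: list[dict]) -> list[dict]:
    """Keep only the sources whose 'id' is cited in the answer."""
    cited = set(extract_cited_ids(answer))
    return [src for src in sources if src["id"] in cited]


# Made-up answer and sources, for illustration only
sample_answer = "First claim [2]. Second claim [1][2]."
sample_sources = [
    {"id": 1, "source": "doc_a", "text": "..."},
    {"id": 2, "source": "doc_b", "text": "..."},
    {"id": 3, "source": "doc_c", "text": "..."},
]

print(extract_cited_ids(sample_answer))
print([s["id"] for s in filter_cited_sources(sample_answer, sample_sources)])
```

The pattern only matches single-number brackets such as `[1]`; a grouped form like `[1, 2]` would need a broader pattern. A cited number with no matching source id is also a useful signal that the model invented a citation.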

In [None]:
def rag_query_with_sources(question: str, collection, k: int = 3) -> dict:
    """
    RAG pipeline with source attribution.
    Returns a dict with 'answer' and 'sources'.
    """
    # Retrieve
    results = retrieve(question, collection, k=k)
    context_chunks = results["documents"][0]
    metadatas = results["metadatas"][0]

    # Build numbered context
    context = "\n\n".join(
        f"[{i+1}]: {chunk}" for i, chunk in enumerate(context_chunks)
    )

    # Prompt instructs the model to cite sources
    prompt = f"""You are a helpful AI assistant. Answer the question based ONLY on the provided numbered sources.
IMPORTANT: Cite your sources by including the source number in square brackets (e.g., [1], [2]) after the relevant statements in your answer.
Every factual claim should have at least one citation.
If the sources do not contain enough information, say so.

Sources:
{context}

Question: {question}

Answer (with citations):"""

    answer = generate_answer(prompt)

    # Build source references
    sources = []
    for i, (chunk, meta) in enumerate(zip(context_chunks, metadatas)):
        sources.append({
            "id": i + 1,
            "source": meta["source"],
            # Only add an ellipsis when the chunk was actually truncated
            "text": chunk[:150] + ("..." if len(chunk) > 150 else "")
        })

    return {"answer": answer, "sources": sources}


def display_answer_with_sources(result: dict):
    """Pretty-print an answer with its sources."""
    print("Answer:")
    print(result["answer"])
    print("\n--- Sources ---")
    for src in result["sources"]:
        print(f"  [{src['id']}] ({src['source']}) {src['text']}")


# Test
result = rag_query_with_sources("What is the Turing Test?", collection, k=3)
display_answer_with_sources(result)

In [None]:
# Another example
result = rag_query_with_sources(
    "What are the different types of machine learning?",
    collection,
    k=4
)
display_answer_with_sources(result)

### Exercise 3: Add Source Attribution

Implement `rag_with_citations` that:
1. Retrieves relevant chunks
2. Includes source numbers in the prompt
3. Instructs the LLM to cite sources with [1], [2], etc.
4. Returns both the answer and a list of source references

Test with: *"How do expert systems work and why did they fall out of favor?"*

In [None]:
def rag_with_citations(question: str, collection, k: int = 3) -> dict:
    """
    TODO: Implement RAG with source attribution.

    Returns:
        dict with keys:
        - 'answer': str with inline [1], [2] citations
        - 'sources': list of dicts with 'id', 'source', 'text'
    """
    # TODO: Retrieve chunks
    results = None

    # TODO: Build numbered context for the prompt
    context = None

    # TODO: Build prompt that instructs the LLM to cite sources
    prompt = None

    # TODO: Generate answer
    answer = None

    # TODO: Build source references list
    sources = None

    return {"answer": answer, "sources": sources}


# Test
# result = rag_with_citations("How do expert systems work and why did they fall out of favor?", collection)
# display_answer_with_sources(result)

### Solution

In [None]:
def rag_with_citations(question: str, collection, k: int = 3) -> dict:
    """
    RAG with source attribution using inline citations.
    """
    # Retrieve chunks
    results = retrieve(question, collection, k=k)
    context_chunks = results["documents"][0]
    metadatas = results["metadatas"][0]

    # Build numbered context
    context = "\n\n".join(
        f"[{i+1}]: {chunk}" for i, chunk in enumerate(context_chunks)
    )

    # Build prompt with citation instructions
    prompt = f"""You are a helpful AI assistant. Answer the question using ONLY the provided numbered sources.
Cite sources inline using bracket notation like [1], [2], [3] after relevant statements.
Every factual claim must include at least one citation.

Sources:
{context}

Question: {question}

Answer with inline citations:"""

    # Generate answer
    answer = generate_answer(prompt)

    # Build source references
    sources = []
    for i, (chunk, meta) in enumerate(zip(context_chunks, metadatas)):
        sources.append({
            "id": i + 1,
            "source": meta["source"],
            # Only add an ellipsis when the chunk was actually truncated
            "text": chunk[:150] + ("..." if len(chunk) > 150 else "")
        })

    return {"answer": answer, "sources": sources}


# Test
result = rag_with_citations(
    "How do expert systems work and why did they fall out of favor?",
    collection,
    k=4
)
display_answer_with_sources(result)

---
## 6. Multi-turn Conversation

Real Q&A systems need to handle follow-up questions. A user might ask:
1. "What is deep learning?"
2. "When did it become popular?" (refers to "deep learning" from Q1)
3. "Who were the key researchers?" (still about deep learning)

To support this, we:
- Maintain a **conversation history** (list of previous Q&A pairs)
- Include conversation history in the prompt so the LLM can resolve references
- Still retrieve fresh context for each new question
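
The history bookkeeping can be sketched without any API calls. The example below is a minimal illustration using made-up messages: it stores each turn as a role/content dict and renders only the most recent `max_history` messages into a prompt string. Here `format_history` is a standalone stand-in for the method a conversational class would expose.

```python
def format_history(history: list[dict], max_history: int = 10) -> str:
    """Render the last `max_history` messages as 'User: ...' / 'Assistant: ...' lines."""
    if not history:
        return "No previous conversation."
    lines = []
    for entry in history[-max_history:]:
        role = "User" if entry["role"] == "user" else "Assistant"
        lines.append(f"{role}: {entry['content']}")
    return "\n".join(lines)


# Made-up conversation for illustration
history = [
    {"role": "user", "content": "What is deep learning?"},
    {"role": "assistant", "content": "A family of methods based on multi-layer neural networks."},
    {"role": "user", "content": "When did it become popular?"},
]

print(format_history(history, max_history=2))
```

Note that `max_history` counts individual messages, not Q&A pairs, so a window of 10 covers five full turns. It should also be a positive number: with `max_history=0` the slice `history[-0:]` returns the whole list rather than an empty one.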

In [None]:
class ConversationRAG:
    """RAG system with multi-turn conversation memory."""

    def __init__(self, collection, embed_model, openai_client, k: int = 3, max_history: int = 10):
        self.collection = collection
        self.embed_model = embed_model
        self.client = openai_client
        self.k = k
        self.max_history = max_history  # Max number of messages (not Q&A pairs) included in the prompt
        self.history = []  # List of {"role": "user"/"assistant", "content": ...}

    def _format_history(self) -> str:
        """Format conversation history for the prompt."""
        if not self.history:
            return "No previous conversation."
        lines = []
        for entry in self.history[-self.max_history:]:
            role = "User" if entry["role"] == "user" else "Assistant"
            lines.append(f"{role}: {entry['content']}")
        return "\n".join(lines)

    def ask(self, question: str, verbose: bool = True) -> str:
        """Ask a question with conversation context."""
        # Retrieve relevant chunks for the current question
        query_embedding = self.embed_model.encode(question).tolist()
        results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=self.k,
            include=["documents", "metadatas"]
        )
        context_chunks = results["documents"][0]

        # Build context string
        context = "\n\n".join(
            f"[Source {i+1}]: {chunk}" for i, chunk in enumerate(context_chunks)
        )

        # Build the full prompt with history
        history_str = self._format_history()

        prompt = f"""You are a helpful AI assistant engaged in a conversation. Use the provided sources to answer the current question.
Consider the conversation history to understand context and resolve pronouns or references (e.g., "it", "they", "that").
Answer based ONLY on the provided sources. If the sources are insufficient, say so.

Conversation History:
{history_str}

Sources:
{context}

Current Question: {question}

Answer:"""

        # Generate
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "You are a helpful conversational AI assistant. Answer questions based on provided sources and conversation history."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.2,
            max_tokens=1024
        )
        answer = response.choices[0].message.content

        # Update conversation history
        self.history.append({"role": "user", "content": question})
        self.history.append({"role": "assistant", "content": answer})

        if verbose:
            print(f"User: {question}")
            print(f"\nAssistant: {answer}\n")
            print("-" * 50)

        return answer

    def reset(self):
        """Clear conversation history."""
        self.history = []
        print("Conversation history cleared.")


# Demo: multi-turn conversation
print("=" * 60)
print("Multi-turn Conversation Demo")
print("=" * 60 + "\n")

conv = ConversationRAG(collection, embed_model, client, k=3)

# Turn 1
conv.ask("What is deep learning?")

# Turn 2 - references "it" from Turn 1
conv.ask("When did it become popular and what triggered its success?")

# Turn 3 - implicitly refers to the rise of deep learning discussed in Turns 1-2
conv.ask("Who were the key researchers involved?")

# Turn 4 - follow-up about a related topic
conv.ask("How did this lead to modern large language models?")

### Exercise 4: Multi-turn Conversational RAG

Implement a `ConversationRAGExercise` class that:
1. Maintains a list of previous Q&A pairs as conversation history
2. Includes the conversation history in each prompt
3. Retrieves fresh context for each question
4. Has an `ask(question)` method that returns the answer

Test it with a 3-4 turn conversation about AI safety.

In [None]:
class ConversationRAGExercise:
    """
    TODO: Implement a multi-turn conversational RAG system.
    """

    def __init__(self, collection, embed_model, openai_client, k: int = 3):
        self.collection = collection
        self.embed_model = embed_model
        self.client = openai_client
        self.k = k
        # TODO: Initialize conversation history
        self.history = None

    def ask(self, question: str) -> str:
        """
        TODO: Implement the ask method.

        Steps:
        1. Retrieve relevant chunks for the question
        2. Format conversation history
        3. Build a prompt with history + context + question
        4. Generate the answer
        5. Add the Q&A pair to history
        6. Return the answer
        """
        # TODO: Retrieve relevant chunks
        context_chunks = None

        # TODO: Format history string
        history_str = None

        # TODO: Build prompt with history + context + question
        prompt = None

        # TODO: Generate answer
        answer = None

        # TODO: Update history

        return answer

    def reset(self):
        """Clear conversation history."""
        # TODO: Reset history to the same empty value you initialize in __init__
        self.history = None


# Test
# conv = ConversationRAGExercise(collection, embed_model, client)
# conv.ask("What is AI safety?")
# conv.ask("What techniques are used to address it?")
# conv.ask("Which organizations are leading this research?")

### Solution

In [None]:
class ConversationRAGExercise:
    """Multi-turn conversational RAG system."""

    def __init__(self, collection, embed_model, openai_client, k: int = 3):
        self.collection = collection
        self.embed_model = embed_model
        self.client = openai_client
        self.k = k
        self.history = []

    def ask(self, question: str) -> str:
        """Ask a question with full conversation context."""
        # Retrieve relevant chunks
        query_embedding = self.embed_model.encode(question).tolist()
        results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=self.k,
            include=["documents"]
        )
        context_chunks = results["documents"][0]

        # Format history
        if self.history:
            history_str = "\n".join(
                f"{'User' if h['role'] == 'user' else 'Assistant'}: {h['content']}"
                for h in self.history
            )
        else:
            history_str = "No previous conversation."

        # Build context
        context = "\n\n".join(
            f"[Source {i+1}]: {chunk}" for i, chunk in enumerate(context_chunks)
        )

        # Build prompt
        prompt = f"""You are a helpful conversational AI assistant. Use the sources to answer the question.
Use the conversation history to resolve references like "it", "they", "that topic".
Answer based ONLY on the sources provided.

Conversation History:
{history_str}

Sources:
{context}

Question: {question}

Answer:"""

        # Generate answer
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "You are a helpful conversational assistant."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.2,
            max_tokens=1024
        )
        answer = response.choices[0].message.content

        # Update history
        self.history.append({"role": "user", "content": question})
        self.history.append({"role": "assistant", "content": answer})

        print(f"User: {question}")
        print(f"\nAssistant: {answer}\n")
        print("-" * 50)

        return answer

    def reset(self):
        """Clear conversation history."""
        self.history = []
        print("Conversation history cleared.")


# Demo conversation about AI safety
print("=" * 60)
print("Multi-turn Conversation: AI Safety")
print("=" * 60 + "\n")

conv_exercise = ConversationRAGExercise(collection, embed_model, client)

# Turn 1
conv_exercise.ask("What is AI safety and alignment?")

# Turn 2 - "it" refers to AI safety
conv_exercise.ask("What techniques are used to address it?")

# Turn 3 - "this research" refers to AI safety research
conv_exercise.ask("Which organizations are leading this research?")

# Turn 4 - deeper follow-up
conv_exercise.ask("How does RLHF specifically help with alignment?")

---
## 7. Using LangChain (Comparison)

Now that we have built everything from scratch, let us see how LangChain simplifies the same pipeline. This section is a brief comparison -- our from-scratch version gives us more control and understanding.

In [None]:
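# Note: these import paths match the LangChain release this notebook was written against.
# Newer releases have relocated some of them (for example, Document is also available
# from langchain_core.documents), so adjust the imports if you hit an ImportError.
from langchain_openai import ChatOpenAI, OpenAIEmbeddings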
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document

In [None]:
# Create LangChain documents from our knowledge base
lc_docs = [
    Document(page_content=doc, metadata={"source": f"Document {i+1}"})
    for i, doc in enumerate(documents)
]

# Split using LangChain's splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=50
)
lc_chunks = text_splitter.split_documents(lc_docs)
print(f"LangChain created {len(lc_chunks)} chunks")

# Create a Chroma vector store using OpenAI embeddings
# (the from-scratch pipeline used all-MiniLM-L6-v2, so the retrieved chunks may differ)
lc_embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
lc_vectorstore = Chroma.from_documents(
    documents=lc_chunks,
    embedding=lc_embeddings,
    collection_name="ai_knowledge_base_lc"
)

# Create the RetrievalQA chain
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.2)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # Same as our stuff method
    retriever=lc_vectorstore.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True
)

print("LangChain RetrievalQA chain created.")

In [None]:
# Query with LangChain
lc_question = "What is the Transformer architecture and why was it important?"

lc_result = qa_chain.invoke({"query": lc_question})

print(f"Question: {lc_question}\n")
print(f"Answer:\n{lc_result['result']}")
print(f"\nSource documents used: {len(lc_result['source_documents'])}")
for i, doc in enumerate(lc_result['source_documents']):
    print(f"  [{i+1}] ({doc.metadata.get('source', 'N/A')}) {doc.page_content[:100]}...")

### From-Scratch vs LangChain: Comparison

| Aspect | From Scratch | LangChain |
|--------|-------------|----------|
| **Lines of code** | ~30-50 for basic RAG | ~10-15 |
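| **Customization** | Full control over every step | Configured through chain types and parameters; deeper changes mean working within the framework's abstractions |
| **Prompt engineering** | Write your own prompts | Uses built-in prompt templates by default; custom prompts can be passed in |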
| **Debugging** | Easy -- you see every step | Harder -- abstraction layers hide details |
| **Chain strategies** | Implement exactly what you need | `chain_type` parameter: "stuff", "map_reduce", "refine" |
| **Learning value** | High -- understand the full pipeline | Lower -- learn the API, not the concepts |
| **Production readiness** | Requires more engineering | Provides many utilities out of the box |

**Recommendation:** Start by building from scratch to understand the concepts (as we did in this notebook). Then use LangChain or similar frameworks when you need to move fast in production.

---
## 8. Summary and References

### Key Takeaways

In this module, you built a **complete RAG system from scratch**:

1. **Knowledge Base Preparation**: Chunked documents and stored embeddings in ChromaDB
2. **Core RAG Pipeline**: Query embedding, vector search, context building, LLM generation
3. **Chain Strategies**: Implemented Stuff (simple), Map-Reduce (scalable), and Refine (iterative) approaches
4. **Source Attribution**: Added numbered citations so users can verify answers
5. **Multi-turn Conversations**: Added memory so follow-up questions work naturally
6. **LangChain Comparison**: Saw how frameworks simplify the code but trade off control

The from-scratch approach gives you deep understanding of every component. When building production systems, you can choose to use frameworks like LangChain for convenience or keep your custom implementation for maximum control.

### References

- **Documentation**: [LangChain - Question Answering](https://python.langchain.com/docs/use_cases/question_answering/)
- **Course**: [DeepLearning.AI - Building and Evaluating Advanced RAG](https://www.deeplearning.ai/short-courses/building-evaluating-advanced-rag/)
- **Blog**: [LlamaIndex - Building RAG from Scratch](https://docs.llamaindex.ai/en/stable/optimizing/building_rag_from_scratch/)
- **Paper**: Gao et al., "Retrieval-Augmented Generation for Large Language Models: A Survey" (2023) - [arXiv:2312.10997](https://arxiv.org/abs/2312.10997)
- **Paper**: Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (2020) - [arXiv:2005.11401](https://arxiv.org/abs/2005.11401)
- **ChromaDB Documentation**: [https://docs.trychroma.com/](https://docs.trychroma.com/)
- **Sentence Transformers**: [https://www.sbert.net/](https://www.sbert.net/)