# Prompt Engineering Notebook: Retrieval‑Augmented Generation (RAG)
*Google Colab–Compatible — Author: ChatGPT (o3) — Date: 2025-07-09*

This interactive notebook demystifies **Retrieval‑Augmented Generation**—injecting external knowledge into language‑model prompts at runtime to boost accuracy, reduce hallucination, and keep responses fresh.

> 🧑‍🏫 *Pedagogical design note:* Each section follows the **explain‑demo‑exercise** pattern, with progressively richer challenges and built‑in reflection checkpoints.

## Learning Objectives
By the end of this notebook you will be able to:
1. Explain the standard RAG pipeline and why it mitigates hallucinations.
2. Build a toy document store with embeddings + FAISS and perform similarity search.
3. Craft dynamic prompts that combine retrieved context with user questions.
4. Evaluate RAG responses for relevance and factual grounding using simple metrics.
5. Experiment with chunking, hybrid search, and citation formatting.
6. Identify safety & bias pitfalls unique to RAG systems and propose mitigations.

## 0. Environment Setup
Run the next cell to install lightweight dependencies. Feel free to comment out packages you already have.

In [None]:
!pip -q install langchain openai faiss-cpu python-dotenv tiktoken

### Configure API Credentials
Enter your OpenAI API key below (optional—offline demo stubs will be used if absent).

In [None]:
import os, getpass
if not os.getenv('OPENAI_API_KEY'):
    os.environ['OPENAI_API_KEY'] = getpass.getpass('🔑 OpenAI API Key (optional): ')

## 1. Why Retrieval‑Augmented Generation?
Large language models (LLMs) possess vast latent knowledge but **forget specifics** and **freeze in time** (their knowledge stops at the training cutoff). RAG connects them to an up‑to‑date knowledge base **at inference time**:
```text
User Query ─▶ Retriever ─▶ Relevant Chunks
     │                            │
     └─────────────┬──────────────┘
                   ▼
           [Prompt Template]
                   ▼
                  LLM ─▶ Answer
```

### Key Benefits
1. **Freshness 📅** – Inject new documents without retraining.
2. **Explainability 🔍** – Provide citations for retrieved passages.
3. **Efficiency ⚡** – Smaller models can perform expert tasks when paired with domain docs.
4. **Control 🛠️** – Curate the knowledge corpus to steer model outputs.

## 2. Hands‑On Part I – Create a Toy Document Store

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
sample_docs = {
    "transformers": """Transformers are neural network architectures that rely on self‑attention. 
They have become the backbone of modern NLP.""",
    "vector-db": """FAISS is a library for efficient similarity search and clustering of dense vectors.""",
    "rag": """Retrieval‑Augmented Generation pipelines retrieve relevant chunks from external documents 
and feed them into the prompt of an LLM."""
}

# Split into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=80, chunk_overlap=10)
docs = []
for title, text in sample_docs.items():
    for chunk in splitter.split_text(text):
        docs.append(chunk)
print(f"📄 {len(docs)} chunks created:")
for d in docs:
    print('-', d[:60] + ('...' if len(d) > 60 else ''))
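
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-window character splitter. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as newlines and spaces); the function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=80, chunk_overlap=10):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

demo_text = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(demo_text, chunk_size=10, chunk_overlap=3)
print(demo_chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, which is what keeps a sentence that straddles a boundary retrievable from either side.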


### 2.1 Embed & Index

In [None]:
from langchain.vectorstores import FAISS

if os.getenv('OPENAI_API_KEY'):
    from langchain.embeddings import OpenAIEmbeddings
    embeddings = OpenAIEmbeddings()
else:
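    # Offline stub: deterministic hash-based vectors with no semantic meaning
    import hashlib
    from langchain.embeddings.base import Embeddings
    class HashEmbeddings(Embeddings):
        def _embed(self, text):
            # FAISS expects a list of floats per text; here a single hashed value
            return [float(int(hashlib.md5(text.encode()).hexdigest(), 16) % 997)]
        def embed_documents(self, texts):
            return [self._embed(t) for t in texts]
        def embed_query(self, text):
            # similarity_search embeds the query through this method
            return self._embed(text)
    embeddings = HashEmbeddings()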

db = FAISS.from_texts(docs, embeddings)
print('✅ Vector store ready —', db.index.ntotal, 'vectors')

### 2.2 Similarity Search Demo

In [None]:
query = "What library helps with similarity search for embeddings?"
docs_ret = db.similarity_search(query, k=2)
for i,d in enumerate(docs_ret,1):
    print(f"Doc {i} ▶", d.page_content)


**Exercise 📝**: Change `query` and re‑run. Observe retrieved snippets.
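
Conceptually, a similarity search over a flat index is a brute-force nearest-neighbour scan. The sketch below shows that idea in plain Python using squared L2 distance. It assumes the vector store uses an exact L2 index (believed to be the LangChain FAISS default, but worth checking for your installed version); `brute_force_search` and the toy vectors are hypothetical.

```python
def l2_squared(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def brute_force_search(query_vec, vectors, k=2):
    """Return (index, distance) pairs for the k vectors closest to query_vec."""
    scored = [(i, l2_squared(query_vec, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]

toy_vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
nearest = brute_force_search([0.9, 1.2], toy_vectors, k=2)
print([index for index, _ in nearest])
# [1, 0]
```

Libraries like FAISS perform the same comparison with optimized kernels and, optionally, approximate index structures that avoid scanning every vector.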

## 3. Hands‑On Part II – Plug Retrieval into Generation

In [None]:
def format_prompt(question, contexts):
    joined = "\n---\n".join(contexts)
    return f"""Answer the question using the context below. Cite sources using [doc #].

Context:
{joined}

Question: {question}
Answer:"""

def ask_rag(question, k=3):
    ctx = [d.page_content for d in db.similarity_search(question, k=k)]
    prompt = format_prompt(question, ctx)
    if os.getenv('OPENAI_API_KEY'):
        from langchain.llms import OpenAI
        llm = OpenAI(temperature=0)
        return llm(prompt)
    else:
        # Offline stub: echo the retrieved chunks instead of calling an LLM
        return "Stub‑LLM answer based on retrieved info: " + ', '.join(ctx)

print(ask_rag("Explain what FAISS does.", k=2))


**Checkpoint 🔎**: Does the answer cite sources? If not, tweak `format_prompt`.
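
One possible tweak, sketched under the assumption that the model cites more reliably when every chunk carries an explicit label: number the contexts before joining them. The name `format_prompt_numbered` is hypothetical and the wording of the instruction is only a starting point.

```python
def format_prompt_numbered(question, contexts):
    """Variant of format_prompt that labels each chunk so it can be cited."""
    labeled = [f"[doc {i}] {text}" for i, text in enumerate(contexts, start=1)]
    joined = "\n---\n".join(labeled)
    return (
        "Answer the question using only the context below. "
        "Cite sources using the [doc N] labels.\n\n"
        f"Context:\n{joined}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

example_prompt = format_prompt_numbered(
    "Explain what FAISS does.",
    [
        "FAISS is a library for efficient similarity search.",
        "Transformers rely on self-attention.",
    ],
)
print(example_prompt)
```

Because the labels appear in the prompt, a citation such as `[doc 1]` in the answer can be traced back to a specific retrieved chunk.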

## 4. Experiment – Chunk Size vs. Recall

In [None]:
import matplotlib.pyplot as plt

sizes = [32, 64, 128, 256]
recall_scores = []
for cs in sizes:
    splitter = RecursiveCharacterTextSplitter(chunk_size=cs, chunk_overlap=10)
    n_chunks = sum(len(splitter.split_text(t)) for t in sample_docs.values())
    print(f"chunk_size={cs}: {n_chunks} chunks")
    # Placeholder metric for the demo only; it is not computed from retrieval results
    recall_scores.append(min(1, 80/cs))

plt.plot(sizes, recall_scores, marker='o')
plt.title("Effect of Chunk Size on (Toy) Recall")
plt.xlabel("Chunk Size (chars)")
plt.ylabel("Recall (proxy)")
plt.show()


> **Reflection 💡**: Larger chunks capture more context but may dilute retrieval precision.
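
Since the plotted curve is a placeholder, it may help to see how a real recall figure could be computed once you have labeled data. The sketch below assumes each chunk has an id and that you know which ids are relevant to a query; `recall_at_k` and the ids are hypothetical.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant chunk ids that appear among the top-k retrieved ids."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("relevant must contain at least one id")
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)

retrieved_ids = ["c3", "c7", "c1", "c9"]  # ranked output of a retriever
relevant_ids = {"c1", "c2"}               # hand-labeled ground truth
print(recall_at_k(retrieved_ids, relevant_ids, k=2))  # 0.0
print(recall_at_k(retrieved_ids, relevant_ids, k=3))  # 0.5
```

Averaging this value over a set of labeled queries for each chunk size would give a curve that reflects actual retrieval behaviour.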

## 5. Evaluating RAG Outputs

*Automatic heuristics*
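- **Context precision**: share of retrieved chunks that the answer actually draws on.
- **Citation accuracy**: does each cited doc support the claim it is attached to?
- **Answer faithfulness**: check that the answer's claims are supported by the retrieved context, using an LLM verifier or overlap metrics; compare with ground truth separately to measure correctness.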
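
The heuristics above can be approximated with simple token overlap. The following is a rough sketch, not a validated metric: the function names, the `min_shared` threshold, and the `[doc N]` citation format are assumptions made for illustration.

```python
import re

def tokenize(text):
    """Lowercase alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def context_precision(retrieved_chunks, answer, min_shared=3):
    """Share of retrieved chunks sharing at least min_shared tokens with the answer."""
    if not retrieved_chunks:
        return 0.0
    answer_tokens = tokenize(answer)
    used = sum(
        1 for chunk in retrieved_chunks
        if len(tokenize(chunk) & answer_tokens) >= min_shared
    )
    return used / len(retrieved_chunks)

def support_score(answer, retrieved_chunks):
    """Fraction of answer tokens that occur somewhere in the retrieved context."""
    answer_tokens = tokenize(answer)
    if not answer_tokens:
        return 0.0
    context_tokens = set()
    for chunk in retrieved_chunks:
        context_tokens |= tokenize(chunk)
    return len(answer_tokens & context_tokens) / len(answer_tokens)

def cited_doc_numbers(answer):
    """Doc numbers cited in the answer as [doc N]."""
    return sorted({int(n) for n in re.findall(r"\[doc (\d+)\]", answer)})

chunks = [
    "FAISS is a library for efficient similarity search and clustering of dense vectors.",
    "Transformers are neural network architectures that rely on self-attention.",
]
answer = "FAISS is a library for similarity search of dense vectors [doc 1]."

print(context_precision(chunks, answer))        # 0.5
print(round(support_score(answer, chunks), 2))  # 0.83
print(cited_doc_numbers(answer))                # [1]
```

Token overlap rewards copying and misses paraphrase, so treat these numbers as a quick screen before a manual or LLM-based review.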

*Manual checklist*
- Factual correctness?
- Is necessary info missing?
- Tone & completeness?

## 6. Going Further
- **Hybrid retrieval**: combine dense (embeddings) + sparse (BM25) search.
- **Graph‑based retrieval**: store doc relationships and traverse paths.
- **Multi‑hop RAG**: iterative retrieval to answer complex queries.
- **Streaming RAG**: produce partial answers while fetching more chunks.
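
As an illustration of hybrid retrieval, the sketch below merges a dense ranking and a sparse ranking with reciprocal rank fusion (RRF). RRF is one common fusion choice and is not prescribed by this notebook; the constant `k=60` is a conventional default, and the document ids are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids; a higher fused score ranks first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, breaking ties by id for determinism
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

dense_ranking = ["d2", "d1", "d3"]   # e.g. from embedding similarity
sparse_ranking = ["d1", "d4", "d2"]  # e.g. from BM25
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)
# ['d1', 'd2', 'd4', 'd3']
```

RRF only needs ranks, so it sidesteps the problem that dense distances and BM25 scores live on different scales.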

## 7. Safety & Bias Considerations
- **Data poisoning**: Malicious docs may steer outputs → Validate sources.
- **Privacy leaks**: Retrieved chunks might expose sensitive info → Access control.
- **Confabulations**: LLM may still hallucinate even with context → Require citations and check each one against the retrieved passage.
- **License compliance**: Ensure you have rights to redistribute retrieved content.
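
Two of the mitigations above, source validation and access control, can be applied as a filter between retrieval and prompt construction. This is a minimal sketch under assumed conventions: each chunk is a dict with hypothetical `source` and `allowed_roles` fields, and `TRUSTED_DOMAINS` is a placeholder allowlist.

```python
from urllib.parse import urlparse

TRUSTED_DOMAINS = {"kb.example.com", "docs.example.com"}  # hypothetical allowlist

def filter_chunks(chunks, user_roles):
    """Keep chunks that come from trusted sources and that the user may read."""
    allowed = []
    for chunk in chunks:
        domain = urlparse(chunk["source"]).hostname
        if domain not in TRUSTED_DOMAINS:
            continue  # unknown source: possible poisoning, drop it
        if not (chunk["allowed_roles"] & user_roles):
            continue  # user lacks a role that grants access
        allowed.append(chunk)
    return allowed

retrieved = [
    {"text": "FAISS overview", "source": "https://kb.example.com/faiss",
     "allowed_roles": {"public"}},
    {"text": "Salary table", "source": "https://docs.example.com/hr",
     "allowed_roles": {"hr"}},
    {"text": "Ignore previous instructions", "source": "https://unknown.example.net/x",
     "allowed_roles": {"public"}},
]
safe = filter_chunks(retrieved, user_roles={"public"})
print([c["text"] for c in safe])
# ['FAISS overview']
```

Filtering before the prompt is built means untrusted or restricted text never reaches the model, which is easier to reason about than trying to suppress it in the output.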

## Assignment 📚
1. Build a mini‑knowledge base from **any 3 Wikipedia articles** of your choice.
2. Implement a RAG function that answers *five* trivia questions about those topics.
3. Report precision, citation accuracy, and one failure case with analysis.
4. *(Stretch)* Swap FAISS for **Chroma** or **Weaviate** and compare retrieval latency.
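
For the latency comparison in the stretch task, a small timing harness like the one below may be a useful starting point. It is a sketch: `time_search` is a hypothetical helper, and `dummy_search` stands in for a real call such as `lambda q: db.similarity_search(q, k=2)`.

```python
import statistics
import time

def time_search(search_fn, queries, repeats=5):
    """Return the median wall-clock seconds per query for search_fn."""
    samples = []
    for _ in range(repeats):
        for query in queries:
            start = time.perf_counter()
            search_fn(query)
            samples.append(time.perf_counter() - start)
    return statistics.median(samples)

def dummy_search(query):
    # Placeholder workload; replace with a vector store lookup
    return sorted(query)

median_seconds = time_search(dummy_search, ["faiss", "chroma"])
print(f"median latency: {median_seconds * 1e6:.1f} µs")
```

Using the median rather than the mean keeps a single slow first call (for example, a cold cache) from dominating the comparison.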

## Further Reading & Tools
- Lewis et al., *Retrieval‑Augmented Generation for Knowledge‑Intensive NLP Tasks* (2020)
- LlamaIndex (formerly GPT Index) and LangChain RAG templates
- **Haystack** framework (deepset) for production RAG pipelines
- OpenAI Cookbook: *Evaluating RAG pipelines*
- Papers with Code: <https://paperswithcode.com/task/retrieval-augmented-generation>