Retrieval-Augmented Generation (RAG) is an architecture that **combines retrieval-based and generation-based approaches** to build more accurate and scalable NLP systems, especially for tasks like **open-domain question answering**.

---

## 🧠 What Is Retrieval-Augmented Generation (RAG)?

RAG enhances Large Language Models (LLMs) by **injecting external knowledge** dynamically at inference time. Instead of relying solely on the model’s internal parameters (which can be limited or outdated), RAG **retrieves relevant documents** from a large corpus and uses them to **inform the generation** of a final answer.

---

## 🔁 General Flow of a RAG Pipeline

Here’s the step-by-step flow:

```
  ┌──────────────────────────────┐
  │          User Query          │
  └──────────────┬───────────────┘
                 ↓
  ┌──────────────────────────────┐
  │ Question Encoder (e.g. DPR)  │
  └──────────────┬───────────────┘
                 ↓
  ┌──────────────────────────────┐    ┌───────────────────────┐
  │ FAISS or other index         │<───┤ Context Encoder (DPR) │
  │ (Document Embeddings)        │    └───────────────────────┘
  └──────────────┬───────────────┘
                 │
      Top-k Relevant Passages
                 ↓
  ┌──────────────────────────────┐
  │ Concatenate query + context  │
  └──────────────┬───────────────┘
                 ↓
  ┌──────────────────────────────┐
  │ Generator Model (e.g. GPT-2) │
  └──────────────┬───────────────┘
                 ↓
      Final Generated Answer
```
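The flow in the diagram can be sketched in a few lines of plain Python. This is a toy illustration, not the DPR/FAISS implementation: the bag-of-words `encode` function, the `STOPWORDS` set, and the helper names `retrieve` and `build_prompt` are all assumptions made for the example, and the generator step is left out so that only the retrieve-then-concatenate part is shown.

```python
from collections import Counter

# Hypothetical stopword list so that filler words do not drive the match
STOPWORDS = {"what", "is", "the", "a", "of", "to", "be"}

def encode(text):
    """Toy stand-in for a dense encoder: a bag-of-words count vector."""
    words = text.lower().replace("?", "").replace(".", "").split()
    return Counter(w for w in words if w not in STOPWORDS)

def dot(a, b):
    """Inner product of two sparse count vectors."""
    return sum(a[t] * b[t] for t in a if t in b)

def retrieve(query, corpus, k=2):
    """Return the k passages whose vectors score highest against the query."""
    q = encode(query)
    ranked = sorted(corpus, key=lambda doc: dot(q, encode(doc)), reverse=True)
    return ranked[:k]

def build_prompt(query, passages):
    """Concatenate retrieved passages and the query into one generator input."""
    return "Context: " + " ".join(passages) + "\nQuestion: " + query + "\nAnswer:"

corpus = [
    "The company offers 15 days of paid vacation per year.",
    "Mobile phones must be secured with a company-approved password.",
    "Remote work is allowed up to 3 days per week.",
]
query = "What is the mobile phone policy?"
passages = retrieve(query, corpus, k=1)
prompt = build_prompt(query, passages)
print(prompt)
```

In a real pipeline the prompt would then be passed to the generator model; the word-overlap scoring here only stands in for dense vector similarity.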

---

## 🧩 Key Components of a RAG System

### 1. **Retriever**

* **Goal**: Find relevant documents/passages from a corpus.
* **Common choice**: Dense Passage Retriever (DPR)
* **Parts**:

  * `DPRQuestionEncoder`: encodes the user query into a dense vector
  * `DPRContextEncoder`: pre-encodes documents in the corpus
* **Similarity metric**: a FAISS index compares vectors by L2 distance or inner product (dot product); DPR itself is trained with dot-product similarity
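The choice between L2 and inner product is not only cosmetic: on unnormalized vectors the two can rank documents differently. The sketch below uses made-up 2-D vectors (the names `short_aligned` and `long_offaxis` are purely illustrative) to show one such disagreement, and that the rankings agree again once vectors are scaled to unit length.

```python
import math

def dot(a, b):
    """Inner product of two dense vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_sq(a, b):
    """Squared Euclidean distance between two dense vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

query = [1.0, 0.0]
docs = {
    "short_aligned": [1.0, 0.0],  # same direction as the query, small norm
    "long_offaxis": [3.0, 3.0],   # 45 degrees off, but a much larger norm
}

# Inner product rewards large norms; L2 distance rewards closeness
best_ip = max(docs, key=lambda name: dot(query, docs[name]))
best_l2 = min(docs, key=lambda name: l2_sq(query, docs[name]))

# After normalization, ||a - b||^2 = 2 - 2 * (a . b), so both rankings agree
unit_docs = {name: normalize(vec) for name, vec in docs.items()}
best_ip_unit = max(unit_docs, key=lambda name: dot(query, unit_docs[name]))
best_l2_unit = min(unit_docs, key=lambda name: l2_sq(query, unit_docs[name]))

print(best_ip, best_l2, best_ip_unit, best_l2_unit)
```

Because DPR embeddings are not unit-normalized, an index metric that differs from the training metric may return a somewhat different top-k.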

### 2. **Generator**

* **Goal**: Generate natural language output based on the query + retrieved context
* **Common choice**: GPT-2, BART, T5, etc.
* **Input**: Query + top-k retrieved passages (concatenated)
* **Output**: Final response or answer

---

## 🔧 What Gets Trained?

Training typically happens in **two phases**, with joint end-to-end training as an optional third:

### ▶️ **Pretraining (optional)**

* You may start with pretrained models:

  * DPR (retriever): already trained on QA tasks like Natural Questions.
  * Generator: pretrained language model (e.g., GPT-2 or BART).

### 🏋️‍♀️ **Fine-tuning**

You can fine-tune:

1. **Retriever**:

   * Use **contrastive learning** (positive vs negative passages).
   * Objective: bring questions closer to correct context embeddings.

2. **Generator**:

   * Fine-tune to condition on query + retrieved documents to generate better outputs.
   * Loss: language modeling loss (e.g., cross-entropy).
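One common form of the retriever's contrastive objective is the negative log-likelihood of the positive passage under a softmax over dot-product scores. The sketch below is a minimal plain-Python version under that assumption; the function name `contrastive_loss` and the 2-D example vectors are hypothetical, and a real setup would compute this over batches of encoder outputs.

```python
import math

def dot(a, b):
    """Inner product of two dense vectors."""
    return sum(x * y for x, y in zip(a, b))

def contrastive_loss(q, positive, negatives):
    """Negative log-likelihood of the positive passage among all candidates."""
    scores = [dot(q, positive)] + [dot(q, n) for n in negatives]
    m = max(scores)  # subtract the max before exponentiating, for stability
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[0]

q = [1.0, 0.0]
negatives = [[0.0, 1.0], [-1.0, 0.0]]

# A positive passage far from the question gives a higher loss...
loss_far = contrastive_loss(q, [0.2, 0.9], negatives)
# ...than one whose embedding already points the same way as the question
loss_near = contrastive_loss(q, [0.9, 0.2], negatives)

print(round(loss_far, 4), round(loss_near, 4))
```

Minimizing this loss pulls the question embedding toward the correct passage and pushes it away from the negatives, which is the objective described above.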

### 👥 End-to-End (optional but advanced):

* Train both retriever and generator together.
* This is **more complex** and **less stable**, so often not done unless needed.

---

## 🎯 Why Use RAG?

### ✅ Advantages:

* Reduces hallucination by grounding output in actual retrieved facts.
* Extensible to large corpora without retraining the LLM.
* Memory-efficient: doesn't force the model to memorize all world knowledge.

### ❌ Limitations:

* Retrieval quality heavily affects generation quality.
* Concatenating long contexts can exceed the model's token limit (e.g., GPT-2's 1024-token context window).
* Retrieval + generation latency is higher than generation-only.
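One way to stay under the token limit is to add passages in rank order until a budget is used up. The sketch below assumes a hypothetical helper `pack_passages` and uses a whitespace word count as a crude stand-in for a tokenizer; with a real tokenizer you would pass a function that returns the true token count, and set the budget to the context window minus the question and the tokens reserved for generation.

```python
def pack_passages(passages, budget, count_tokens=lambda s: len(s.split())):
    """Keep passages in rank order until the token budget is used up."""
    kept, used = [], 0
    for passage in passages:
        cost = count_tokens(passage)
        if used + cost > budget:
            break  # stop at the first passage that would overflow the budget
        kept.append(passage)
        used += cost
    return kept

ranked = [
    "Mobile phones must be secured with a company-approved password.",  # 9 words
    "Remote work is allowed up to 3 days per week.",                    # 10 words
    "The company offers 15 days of paid vacation per year.",            # 10 words
]
kept = pack_passages(ranked, budget=20)
print(kept)
```

Stopping at the first overflow keeps the highest-ranked passages intact instead of truncating one of them mid-sentence.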

---

## Example in Plain Terms

> Q: “What is our company’s mobile phone policy?”

* **Step 1**: Question is encoded into a vector.
* **Step 2**: That vector is used to **search a corpus** of HR documents using FAISS.
* **Step 3**: Top 5 matching paragraphs are retrieved.
* **Step 4**: These paragraphs are **fed into GPT-2** along with the original question.
* **Step 5**: GPT-2 **generates an answer**, grounded in the retrieved data.

---

## Summary Table

| Component        | Model Type | Purpose                    | Trainable?    |
| ---------------- | ---------- | -------------------------- | ------------- |
| Question Encoder | DPR        | Encode user query          | Yes           |
| Context Encoder  | DPR        | Encode documents for FAISS | Yes           |
| Retriever Index  | FAISS      | Fast similarity search     | No (prebuilt) |
| Generator        | GPT2/BART  | Generate final answer      | Yes           |

---

Would you like a diagram or a PyTorch code version of a minimal RAG implementation?


In [None]:
!pip install transformers faiss-cpu torch

In [39]:
import torch
import numpy as np
import faiss

from transformers import (
    DPRContextEncoder, DPRContextEncoderTokenizer,
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
    AutoTokenizer, AutoModelForCausalLM
)

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# -----------------------------------------------------------------------------------
# STEP 1: Prepare a small corpus of documents
# -----------------------------------------------------------------------------------
documents = [
    "The company offers 15 days of paid vacation per year.",
    "Employees should submit reimbursement forms within 30 days.",
    "Mobile phones must be secured with a company-approved password.",
    "Remote work is allowed up to 3 days per week.",
    "Drinking alcohol during work hours is strictly prohibited.",
    "The company policy prohibits sex at work."
]


In [40]:

# -----------------------------------------------------------------------------------
# STEP 2: Load DPR Context Encoder and Tokenizer
# These will convert documents into dense embeddings
# -----------------------------------------------------------------------------------
context_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base").to(device)

# AutoTokenizer (imported above) is used here in place of DPRContextEncoderTokenizer
context_tokenizer = AutoTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

# Tokenize and encode each document
context_embeddings = []
for doc in documents:
    inputs = context_tokenizer(doc, return_tensors='pt', max_length=256, truncation=True, padding=True).to(device)
    with torch.no_grad():
        embedding = context_encoder(**inputs).pooler_output  # (1, 768)
    context_embeddings.append(embedding.cpu().numpy())

# Stack into numpy array for FAISS
context_embeddings_np = np.vstack(context_embeddings).astype('float32')  # Shape: (6, 768), one row per document

# Show each document and the first 5 values of its embedding vector
for i, (doc, emb) in enumerate(zip(documents, context_embeddings_np)):
    print(f"\n📄 Document {i+1}:")
    print(doc)
    print(f"\n🔢 First 5 values of embedding vector (shape: {emb.shape}):")
    print(emb[:5])  # Only show the first 5 values


Some weights of the model checkpoint at facebook/dpr-ctx_encoder-single-nq-base were not used when initializing DPRContextEncoder: ['ctx_encoder.bert_model.pooler.dense.bias', 'ctx_encoder.bert_model.pooler.dense.weight']
- This IS expected if you are initializing DPRContextEncoder from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DPRContextEncoder from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



📄 Document 1:
The company offers 15 days of paid vacation per year.

🔢 First 5 values of embedding vector (shape: (768,)):
[ 0.42181766  0.16109067  0.3723451  -0.09172494  0.28965074]

📄 Document 2:
Employees should submit reimbursement forms within 30 days.

🔢 First 5 values of embedding vector (shape: (768,)):
[ 0.3361708   0.27786404  0.5407016  -0.33155122  0.07597482]

📄 Document 3:
Mobile phones must be secured with a company-approved password.

🔢 First 5 values of embedding vector (shape: (768,)):
[0.24592635 0.49385566 0.31940734 0.09724129 0.7529858 ]

📄 Document 4:
Remote work is allowed up to 3 days per week.

🔢 First 5 values of embedding vector (shape: (768,)):
[0.18697365 0.00790876 0.4691345  0.08455902 0.34109434]

📄 Document 5:
Drinking alcohol during work hours is strictly prohibited.

🔢 First 5 values of embedding vector (shape: (768,)):
[ 0.58111304  0.6653758   0.17271869 -0.36189055 -0.12203238]

📄 Document 6:
The company policy prohibits sex at work.

🔢 First 5

In [41]:

# -----------------------------------------------------------------------------------
# STEP 3: Create FAISS index
# This allows us to retrieve similar documents by vector similarity
# -----------------------------------------------------------------------------------
embedding_dim = context_embeddings_np.shape[1]
index = faiss.IndexFlatL2(embedding_dim)  # L2 = Euclidean distance
index.add(context_embeddings_np)  # Add document embeddings to index


In [42]:

# -----------------------------------------------------------------------------------
# STEP 4: Load DPR Question Encoder
# This will embed the user query in the same vector space
# -----------------------------------------------------------------------------------
question_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base").to(device)
question_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")


Some weights of the model checkpoint at facebook/dpr-question_encoder-single-nq-base were not used when initializing DPRQuestionEncoder: ['question_encoder.bert_model.pooler.dense.bias', 'question_encoder.bert_model.pooler.dense.weight']
- This IS expected if you are initializing DPRQuestionEncoder from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DPRQuestionEncoder from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [43]:

# -----------------------------------------------------------------------------------
# STEP 5: Define a query and retrieve top-k relevant documents
# -----------------------------------------------------------------------------------
query = "At work, Is it ok if Julie has sex in the utility closit?"

# Tokenize and encode the question
inputs = question_tokenizer(query, return_tensors="pt").to(device)
with torch.no_grad():
    query_embedding = question_encoder(**inputs).pooler_output.cpu().numpy()

# Search FAISS for top 2 closest documents
D, I = index.search(query_embedding, k=2)

print("Top matching documents:")
for idx in I[0]:
    print("-", documents[idx])


Top matching documents:
- The company policy prohibits sex at work.
- Remote work is allowed up to 3 days per week.
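To make the `D, I` return values concrete, here is a plain-Python sketch of what a flat L2 index computes: an exhaustive scan that returns the squared Euclidean distances and positions of the k nearest vectors. The function name `flat_l2_search` and the example vectors are assumptions for illustration. Unlike FAISS, which takes a batch of queries and returns 2-D arrays (hence `I[0]` in the cell above), this sketch handles a single query.

```python
def flat_l2_search(query, vectors, k):
    """Return (distances, indices) of the k nearest vectors by squared L2 distance."""
    dists = [sum((q - v) ** 2 for q, v in zip(query, vec)) for vec in vectors]
    order = sorted(range(len(vectors)), key=lambda i: dists[i])[:k]
    return [dists[i] for i in order], order

vectors = [[0.0, 0.0], [1.0, 2.0], [4.0, 0.0]]
D, I = flat_l2_search([1.0, 0.0], vectors, k=2)
print(D, I)
```

Smaller values in `D` mean closer matches, and `I` holds the row positions used to look up the original document text.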


[User Query] ──▶ DPR Question Encoder ──▶ vector
                                     │
                                     ▼
                           Search FAISS Index
                                     │
                                     ▼
              Top-k Relevant Docs (raw text) ──▶ GPT-2 ──▶ Final Answer


In [44]:

# -----------------------------------------------------------------------------------
# STEP 6: Load GPT-2 for generation
# -----------------------------------------------------------------------------------
gpt_tokenizer = AutoTokenizer.from_pretrained("gpt2")
gpt_tokenizer.pad_token = gpt_tokenizer.eos_token

gpt_model = AutoModelForCausalLM.from_pretrained("gpt2").to(device)
gpt_model.eval()

# Set special token to avoid warnings
gpt_model.generation_config.pad_token_id = gpt_tokenizer.eos_token_id


In [8]:
raise RuntimeError("Skipping")  # stop here on purpose: this earlier version is superseded by the next cell
# -----------------------------------------------------------------------------------
# STEP 7a: Generate answer WITHOUT context
# -----------------------------------------------------------------------------------
def generate_without_context(query):
    inputs = gpt_tokenizer(query, return_tensors="pt").to(device)
    output = gpt_model.generate(inputs["input_ids"], max_new_tokens=50)
    return gpt_tokenizer.decode(output[0], skip_special_tokens=True)

print("\nAnswer without context:")
print(generate_without_context(query))


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.



Answer without context:
What is the mobile phone policy?

The mobile phone policy is a policy that allows you to use your mobile phone for any purpose. It is a policy that allows you to use your mobile phone for any purpose. It is a policy that allows you to use your mobile phone for


In [45]:
# -----------------------------------------------------------------------------------
# STEP 7a: Generate answer WITHOUT context
# -----------------------------------------------------------------------------------
def generate_without_context(query):
    inputs = gpt_tokenizer(query, return_tensors="pt", padding=True).to(device)
    input_ids = inputs["input_ids"]
    attention_mask = inputs["attention_mask"]
    output = gpt_model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_new_tokens=50
    )
    return gpt_tokenizer.decode(output[0], skip_special_tokens=True)
print("\nAnswer without context:")
print(generate_without_context(query))


Answer without context:
At work, Is it ok if Julie has sex in the utility closit?

I don't know. I don't know. I don't know. I don't know. I don't know. I don't know. I don't know. I don't know. I don't know. I don't


Great — you're now entering the **generation phase** of the RAG pipeline. Let’s walk through:

---

## 🔹 **Purpose of Step 7a:**

You're **testing GPT-2 on its own**, without feeding it any extra knowledge (i.e., *not* using FAISS-retrieved documents).

> 🎯 This is a baseline. You're asking:
> “What does GPT-2 already know about the question just from pretraining?”

---

## 🧠 What This Code Does (Line by Line):

### 1. **Define a function: `generate_without_context(query)`**

```python
def generate_without_context(query):
```

This function takes a **query string** (like `"What is the mobile phone policy?"`) and returns a GPT-2 generated response.

---

### 2. **Tokenize the input**

```python
inputs = gpt_tokenizer(query, return_tensors="pt").to(device)
```

* Converts the string into token IDs using the **GPT-2 tokenizer**
* Wraps in a PyTorch tensor
* Moves it to the correct device (`cpu` or `cuda`)

Example, using token IDs from the `input_ids` printed in the Step 7b output: `"Context: The company"` → `[21947, 25, 383, 1664]`

---

### 3. **Generate a response**

```python
output = gpt_model.generate(inputs["input_ids"], max_new_tokens=50)
```

* Uses GPT-2 to **generate up to 50 new tokens** starting from the prompt
* No extra knowledge — just what GPT-2 already learned during pretraining

---

### 4. **Decode the output tokens**

```python
return gpt_tokenizer.decode(output[0], skip_special_tokens=True)
```

* Converts token IDs back into a human-readable string
* Removes special tokens such as `<|endoftext|>`, if any

---

### 5. **Call the function and print**

```python
print("\nAnswer without context:")
print(generate_without_context(query))
```

* Runs your function with the original query
* Shows what GPT-2 says on its own

---

## 🧪 Example Output

Query:

```python
"What is the mobile phone policy?"
```

GPT-2 might say:

```text
"The mobile phone policy may vary by organization. Employees are usually expected to keep phones off during meetings..."
```

> But **it could also be vague, wrong, or hallucinated** — because it doesn't know your specific documents yet.

---

## 📊 Why This Step Matters

This is your **control group** in the experiment.

Later, you'll compare it to:

* ✅ `generate_with_context(query, retrieved_docs)`

And see how **RAG improves accuracy** by injecting relevant info from FAISS.

---

## ✅ Summary Table

| Step                       | Purpose                                           |
| -------------------------- | ------------------------------------------------- |
| `generate_without_context` | Run GPT-2 by itself, no external help             |
| Tokenizer                  | Converts text to token IDs for GPT-2              |
| `generate(...)`            | GPT-2 makes predictions based on pretraining only |
| Output                     | Baseline answer — can be vague, biased, or wrong  |

---

Let me know when you're ready to explain `generate_with_context()` — that's where RAG shines.


In [22]:
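raise RuntimeError("skipping")  # stop here on purpose: this earlier version is superseded by the pretty-printing cell below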
# -----------------------------------------------------------------------------------
# STEP 7b: Generate answer WITH top-k context
# -----------------------------------------------------------------------------------
def generate_with_context(query, retrieved_docs):
    full_input = query + " " + " ".join(retrieved_docs)
    inputs = gpt_tokenizer(full_input, return_tensors="pt", truncation=True, max_length=1024).to(device)
    output = gpt_model.generate(inputs["input_ids"], max_new_tokens=50)
    return gpt_tokenizer.decode(output[0], skip_special_tokens=True)

# Use the top 2 FAISS matches
retrieved = [documents[i] for i in I[0]]

print("\nAnswer with retrieved context:")
print(generate_with_context(query, retrieved))



Answer with retrieved context:
What is the mobile phone policy? Mobile phones must be secured with a company-approved password. The company offers 15 days of paid vacation per year.

What is the mobile phone policy? Mobile phones must be secured with a company-approved password. The company offers 15 days of paid vacation per year. What is the mobile phone policy? Mobile phones must be secured with a company-approved password


In [46]:
import textwrap

def wrap_text(text, width=80, indent=4):
    indent_str = ' ' * indent
    return '\n'.join(textwrap.wrap(text, width=width, subsequent_indent=indent_str))

# -----------------------------------------------------------------------------------
# STEP 7b: Generate answer WITH top-k context (pretty print)
# -----------------------------------------------------------------------------------
def generate_with_context(query, retrieved_docs):
    # Build a prompt that puts the retrieved passages before the question
    full_input = "Context: " + " ".join(retrieved_docs) + "\nQuestion: " + query + "\nAnswer:"
    print(f"full input = {full_input}")

    # Truncate to GPT-2's 1024-token window and pass the attention mask explicitly
    inputs = gpt_tokenizer(full_input, return_tensors="pt", truncation=True, max_length=1024, padding=True).to(device)
    input_ids = inputs["input_ids"]
    attention_mask = inputs["attention_mask"]
    print(f"input_ids = {input_ids}")
    print(f"attention_mask = {attention_mask}")

    # Sample with a mild repetition penalty to discourage repetitive loops
    output = gpt_model.generate(
      input_ids=input_ids,
      attention_mask=attention_mask,
      max_new_tokens=100,
      repetition_penalty=1.2,
      do_sample=True,
      temperature=0.8
    )

    print(f"output = {output}")
    generated_text = gpt_tokenizer.decode(output[0], skip_special_tokens=True)
    print(f"generated_text = {generated_text}")
    print("--------------------------------------------------")
    print("Question:")
    print(wrap_text(query))
    print("\nCombined Input:")
    print(wrap_text(full_input))
    print("\nOutput:")
    print(wrap_text(generated_text))
    print("--------------------------------------------------\n")

# Use the top 2 FAISS matches
retrieved = [documents[i] for i in I[0]]
print(f"retrieved = {retrieved}")

#print("\nAnswer with retrieved context:")
generate_with_context(query, retrieved)


retrieved = ['The company policy prohibits sex at work.', 'Remote work is allowed up to 3 days per week.']
full input = Context: The company policy prohibits sex at work. Remote work is allowed up to 3 days per week.
Question: At work, Is it ok if Julie has sex in the utility closit?
Answer:
input_ids = tensor([[21947,    25,   383,  1664,  2450, 24059,  1714,   379,   670,    13,
         21520,   670,   318,  3142,   510,   284,   513,  1528,   583,  1285,
            13,   198, 24361,    25,  1629,   670,    11,  1148,   340, 12876,
           611, 21946,   468,  1714,   287,   262, 10361,  3542,   270,    30,
           198, 33706,    25]])
attention_mask = tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
output = tensor([[21947,    25,   383,  1664,  2450, 24059,  1714,   379,   670,    13,
         21520,   670,   318,  3142,   510,   284,   513,  1528,   583,  1285,
            13

That kind of output, `"I don't know. I don't know. ..."`, is a classic **degenerate repetition loop** in language models like GPT-2. Here's what it means and how to fix it:

---

## 🤖 Why It Happens

1. **Lack of Strong Context**

   * If the retrieved context from FAISS isn't relevant or is too weak, the model may revert to generic phrases like “I don’t know.”

2. **No Fine-Tuning**

   * GPT-2 isn't specifically fine-tuned on Q\&A tasks with knowledge grounding. It’s just predicting the next likely token — and if it’s uncertain, it often loops simple outputs.

3. **Greedy Decoding**

   * `generate()` decodes greedily by default (`do_sample=False`). Always picking the single most likely next token makes loops like this common, and `temperature` has no effect until sampling is enabled.

4. **No Repetition Penalty**

   * If you don’t apply a `repetition_penalty`, the model is prone to repeating the same token patterns.
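To see why a repetition penalty breaks such loops, here is a minimal plain-Python sketch of the usual rule: logits of tokens that were already generated are divided by the penalty when positive and multiplied by it when negative, so they always become less likely. The function name `apply_repetition_penalty` and the four-token vocabulary are assumptions for illustration, not the library's internal code.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Lower the scores of tokens that already appear in the generated sequence."""
    out = list(logits)
    for tok in set(generated_ids):
        # Dividing a positive logit or multiplying a negative one both reduce it
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

logits = [2.0, 1.5, -1.0, 0.5]  # toy scores for a 4-token vocabulary
penalized = apply_repetition_penalty(logits, generated_ids=[0, 2, 0], penalty=1.5)

# Greedy choice before and after the penalty
greedy_before = max(range(len(logits)), key=lambda i: logits[i])
greedy_after = max(range(len(penalized)), key=lambda i: penalized[i])
print(greedy_before, greedy_after)
```

In this toy case token 0 has already been emitted, so its score drops below that of token 1 and the greedy pick changes.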

---

## ✅ Fixes to Try

### 🔧 Update `generate()` with these parameters:

```python
output = gpt_model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    max_new_tokens=100,
    repetition_penalty=1.5,   # discourages "I don't know" loops
    temperature=0.8,          # adds variability
    top_p=0.95,               # nucleus sampling
    do_sample=True            # enables sampling instead of greedy decoding
)
```

These help:

* Encourage variation
* Avoid repetitive loops
* Explore better outputs
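As a rough illustration of what `temperature` and `top_p` do during sampling, the sketch below rescales toy logits, keeps the smallest set of tokens whose probability mass reaches `top_p`, and samples from that set. The helper names (`softmax`, `top_p_filter`, `sample_token`) and the logits are hypothetical; this is a plain-Python approximation of the idea, not the library's implementation.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their total probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}  # renormalize over the nucleus

def sample_token(logits, temperature=0.8, top_p=0.95, rng=random):
    """Draw one token id from the temperature-scaled, top-p-filtered distribution."""
    nucleus = top_p_filter(softmax(logits, temperature), top_p)
    tokens = list(nucleus)
    return rng.choices(tokens, weights=[nucleus[t] for t in tokens], k=1)[0]

logits = [3.0, 2.0, 0.0, -4.0]
nucleus = top_p_filter(softmax(logits, temperature=0.8), top_p=0.95)
token = sample_token(logits, rng=random.Random(0))
print(sorted(nucleus), token)
```

With these toy logits the two least likely tokens fall outside the nucleus, so sampling can vary the output without ever picking very unlikely tokens.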

---

### 🚀 Optional Upgrade: Use a better model

GPT-2 is small and not trained for retrieval-based generation.

* **Try**: `facebook/bart-large` or `facebook/bart-large-cnn`
* **Or RAG models**: `facebook/rag-token-base`

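Of these, only the RAG checkpoints are designed to generate from retrieved documents. `facebook/bart-large` is a general-purpose pretrained seq2seq model and `facebook/bart-large-cnn` is fine-tuned for summarization, so either would still need fine-tuning for grounded question answering.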

---

Would you like me to help you switch the generator model to one of those? Or adjust your current `generate_with_context()` to include these new settings?
