# CS769 Tutorial 4


<div id="Agenda"></div>

## Agenda

- [Introduction of Retrieval-augmented generation](#first)
- [Implementation of Retrieval-augmented generation](#second)
---

<div id="first"></div>

## 1 Introduction of Retrieval-augmented generation

Retrieval-augmented generation (RAG) models combine a pretrained dense retriever (Dense Passage Retrieval, DPR) with a pretrained sequence-to-sequence generator. The approach was proposed in a [NeurIPS 2020 paper](https://arxiv.org/abs/2005.11401).

**QUIZ**: What is Dense Passage Retrieval (DPR)?

**QUIZ**: What is a sequence-to-sequence model?


### 1.1 Work Flow

RAG models:
- Retrieve documents
- Pass them to a seq2seq model
- Marginalize to generate outputs


The retriever and seq2seq modules are initialized from pretrained models, and fine-tuned jointly, allowing both retrieval and generation to adapt to downstream tasks.
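To make the "marginalize" step more concrete, here is a small sketch with made-up numbers. The retrieval scores and per-document token probabilities are assumptions chosen for illustration, not outputs of a real model. Following the formulation in the RAG paper, each retrieved document z receives a probability p(z|x) from a softmax over retrieval scores, and the generator's next-token distributions are combined as the weighted sum of p(z|x) · p(y|x, z) over the retrieved documents.

```python
import math

def softmax(scores):
    """Turn raw retrieval scores into a probability distribution over documents."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def marginalize(doc_probs, token_probs_per_doc):
    """Weighted sum of per-document token distributions: sum_z p(z|x) * p(y|x,z)."""
    vocab = token_probs_per_doc[0].keys()
    return {
        tok: sum(p_z * probs[tok] for p_z, probs in zip(doc_probs, token_probs_per_doc))
        for tok in vocab
    }

# Hypothetical retrieval scores for three documents (higher = more relevant)
retrieval_scores = [2.0, 1.0, 0.1]
doc_probs = softmax(retrieval_scores)

# Hypothetical next-token distributions from the generator, one per document
token_probs_per_doc = [
    {"Paris": 0.9, "Lyon": 0.1},
    {"Paris": 0.6, "Lyon": 0.4},
    {"Paris": 0.2, "Lyon": 0.8},
]

marginal = marginalize(doc_probs, token_probs_per_doc)
print(marginal)
```

Because the most relevant document carries the largest weight, its preferred token dominates the combined distribution, while less relevant documents still contribute a little.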


For example:

  - User Question: *What is NLP?*

  - Context:
    1. NLP stands for Natural Language Processing, a subfield of artificial intelligence that focuses on the interaction between computers and humans through natural language.
    2. It involves the development of algorithms and models that enable machines to understand, interpret, and generate human language in a way that is both meaningful and useful.
    3. NLP encompasses a variety of tasks including text analysis, translation, sentiment analysis, and speech recognition.

  - The answer might be:

    NLP, or Natural Language Processing, is a field of artificial intelligence that focuses on the interaction between computers and human languages. It involves creating algorithms and models that allow computers to understand, interpret, and generate human language. This technology is used in applications such as text analysis, translation, sentiment analysis, and speech recognition, enabling machines to process and respond to human language in a meaningful way.

**QUIZ**: Based on the example, what do you think is the biggest difference between RAG and ordinary QA?

### 1.2 Retrieval-augmented generation (RAG)

RAG idea:

![RAG](https://api.wandb.ai/files/cosmo3769/images/projects/38097019/02184762.png)


[Back to Agenda](#Agenda)

---


## 2 Instruction of Retrieval-augmented generation

### 2.1 Essential Packages

We need to install [PyTorch](https://pytorch.org/) and the Hugging Face Transformers library.

#### **Step 1**: Activate Conda Environment
```
# Move to your project directory (Optional)
cd </path/to/your/directory>

# Activate your environment
conda activate <env_name>
```

#### **Step 2**: Install pytorch
```
# GPU version (CUDA 11.8 build; requires a driver that supports CUDA 11.8 or newer)
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

# CPU version
conda install pytorch torchvision torchaudio cpuonly -c pytorch
```

#### **Step 3**: Install transformers
```
conda install -c conda-forge transformers
```

To skip the installation time, we can use [Colab](https://colab.research.google.com/), which already provides these packages.
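Before moving on, it can help to confirm that the packages are visible to the current Python environment. The check below is a small sketch that uses only the standard library. The helper name `check_packages` is made up for this tutorial.

```python
import importlib.util

def check_packages(names):
    """Return {package_name: True/False} depending on whether the package can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

# Report whether the two packages used in this tutorial are available
for name, found in check_packages(["torch", "transformers"]).items():
    print(f"{name}: {'installed' if found else 'NOT found'}")
```

If a package is reported as not found, make sure the right conda environment is active and repeat the corresponding installation step.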

[Back to Agenda](#Agenda)

### 2.2 RAG Sample

First, we need to know the structure of RAG.

**QUIZ**: Can you find any errors or missing parts in the following RAG pseudo code?



In [None]:
# Pseudo code for RAG (Retrieval-Augmented Generation)

# Function to perform RAG-based question answering
def RAG_system(question):

    # Step 1: Retrieve relevant documents or passages from the knowledge base
    relevant_docs = retrieve_documents(question)

    # Step 2: Combine the retrieved documents into a context for the generation model
    combined_context = combine_contexts(relevant_docs)

    # Step 3: Generate the answer using the generation model with the provided context
    generated_answer = generate_answer(question, combined_context)

    # Step 4: Return the final generated answer
    return generated_answer

# Function to retrieve relevant documents based on the question
def retrieve_documents(question):
    # Assume we have a search index or knowledge base
    # Use the question to query and retrieve relevant documents
    retrieved_docs = search_knowledge_base(question)
    return retrieved_docs

# Function to combine the retrieved documents into a single context
def combine_contexts(docs):
    # Concatenate or select the most relevant parts of the retrieved documents
    context = concatenate_docs(docs)
    return context

# Function to generate an answer based on the question and context
def generate_answer(question, context):
    # Input the question and combined context into a language generation model
    answer = language_model.generate(text=f"Q: {question}\nContext: {context}\nA:")
    return answer

# Example usage
question = "What is NLP?"
answer = RAG_system(question)
print(answer)


<div id="second"></div>

## 3 Implementation of a RAG System



### 3.1 Craft the RAG System Manually
We will use FLAN-T5 as the backbone generative model and implement each function from the sample code above in turn.

In [9]:
import random, numpy as np, torch

seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)  # if using GPU

In [10]:

# Import required libraries (all pre-installed in Colab)
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    pipeline
)
import torch
from typing import List, Tuple

# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# ============================================
# SETUP: Initialize Components
# ============================================

print("Loading model components... This may take a minute on first run.")

# We'll use FLAN-T5 which is excellent for question answering
# and works well with context augmentation
model_name = "google/flan-t5-base"  # You can also try "google/flan-t5-small" for faster/lighter

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

print("Model loaded successfully!")

# ============================================
# RAG SYSTEM IMPLEMENTATION
# ============================================

def RAG_system(question: str, knowledge_base: List[str] = None) -> str:
    """
    Main RAG function that performs retrieval-augmented generation
    Following the pseudo-code structure exactly

    Args:
        question: The input question to answer
        knowledge_base: Optional list of documents for retrieval

    Returns:
        Generated answer string
    """

    if knowledge_base:
        # Step 1: Retrieve relevant documents from knowledge base
        relevant_docs = retrieve_documents(question, knowledge_base)

        # Step 2: Combine retrieved documents into context
        combined_context = combine_contexts(relevant_docs)

        # Step 3: Generate answer using question and context
        generated_answer = generate_answer(question, combined_context)
    else:
        # Generate answer without retrieval (using model's parametric knowledge)
        generated_answer = generate_answer(question, "")

    # Step 4: Return the final generated answer
    return generated_answer


def retrieve_documents(question: str, knowledge_base: List[str]) -> List[str]:
    """
    Retrieve relevant documents based on the question
    Uses simple word-overlap scoring for demonstration

    Args:
        question: Input question
        knowledge_base: List of document strings

    Returns:
        List of relevant documents (top 3)
    """
    if not knowledge_base:
        return []

    # Simple retrieval using word overlap scoring
    question_words = set(question.lower().split())

    # Score each document
    doc_scores = []
    for doc in knowledge_base:
        doc_words = set(doc.lower().split())
        # Calculate overlap score
        common_words = question_words.intersection(doc_words)
        score = len(common_words)
        # Bonus if any question word longer than 3 characters appears as a substring of the document
        if any(word in doc.lower() for word in question.lower().split() if len(word) > 3):
            score += 1
        doc_scores.append((score, doc))

    # Sort by score and get top 3
    doc_scores.sort(reverse=True, key=lambda x: x[0])

    # Return top documents with non-zero scores
    relevant_docs = [doc for score, doc in doc_scores[:3] if score > 0]

    print(f"Retrieved {len(relevant_docs)} relevant documents")
    return relevant_docs


def combine_contexts(docs: List[str]) -> str:
    """
    Combine the retrieved documents into a single context

    Args:
        docs: List of document strings

    Returns:
        Combined context string
    """
    if not docs:
        return ""

    # Concatenate documents with separator
    context = " ".join(docs)

    # Truncate if too long (to fit in model's input limit)
    max_words = 300  # Keep context reasonable
    words = context.split()
    if len(words) > max_words:
        context = " ".join(words[:max_words])

    return context


def generate_answer(question: str, context: str) -> str:
    """
    Generate an answer based on the question and context
    Using the language model with appropriate prompting

    Args:
        question: Input question
        context: Combined context from retrieved documents

    Returns:
        Generated answer string
    """
    # Create prompt based on whether we have context
    if context:
        # RAG-style prompt with context
        prompt = f"""Answer the question based on the following context.

Context: {context}

Question: {question}

Answer:"""
    else:
        # Direct question answering without context
        prompt = f"Question: {question}\nAnswer:"

    # Tokenize input
    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        max_length=512,
        truncation=True,
        padding=True
    ).to(device)

    # Generate answer
    with torch.no_grad():
        outputs = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_length=150,
            min_length=10,
            temperature=0.7,  # not used when do_sample=False (causes the generation-flag warning in the cell output)
            do_sample=False,  # Deterministic for consistency
            num_beams=3,
            early_stopping=True
        )

    # Decode and return answer
    answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return answer


# ============================================
# HELPER FUNCTIONS FOR DEMONSTRATION
# ============================================

def search_knowledge_base(question: str, kb: List[str]) -> List[str]:
    """
    Wrapper function matching the pseudo-code's search_knowledge_base
    """
    return retrieve_documents(question, kb)


def concatenate_docs(docs: List[str]) -> str:
    """
    Wrapper function matching the pseudo-code's concatenate_docs
    """
    return combine_contexts(docs)


# ============================================
# EXAMPLE KNOWLEDGE BASE
# ============================================

# Create a comprehensive knowledge base for demonstration
default_knowledge_base = [
    "Natural Language Processing (NLP) is a subfield of artificial intelligence that focuses on the interaction between computers and human language. It combines computational linguistics with machine learning and deep learning.",

    "NLP enables computers to understand, interpret, and generate human language in a valuable way. Common applications include machine translation, sentiment analysis, and chatbots.",

    "Key NLP tasks include tokenization, part-of-speech tagging, named entity recognition, sentiment analysis, text classification, and question answering.",

    "Transformers are a neural network architecture introduced in 2017 that revolutionized NLP. They use self-attention mechanisms to process sequential data more effectively than RNNs or LSTMs.",

    "BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model developed by Google that achieves state-of-the-art results on many NLP tasks.",

    "GPT (Generative Pre-trained Transformer) is a series of language models developed by OpenAI that excel at text generation and have been scaled to billions of parameters.",

    "RAG (Retrieval-Augmented Generation) combines information retrieval with text generation, allowing models to access external knowledge bases for more accurate and factual responses.",

    "Word embeddings like Word2Vec and GloVe represent words as dense vectors, capturing semantic relationships between words in a continuous vector space.",

    "Attention mechanisms allow models to focus on relevant parts of the input when producing output, significantly improving performance on tasks like machine translation.",

    "Transfer learning in NLP involves pre-training models on large text corpora and then fine-tuning them for specific downstream tasks, reducing the need for task-specific training data."
]

# ============================================
# EXAMPLE USAGE
# ============================================

print("\n" + "="*50)
print("RAG SYSTEM EXAMPLES")
print("="*50)

# Example 1: Question with retrieval from knowledge base
print("\n--- Example 1: RAG with Knowledge Base ---")
question1 = "What is NLP?"
print(f"Question: {question1}")
answer1 = RAG_system(question1, knowledge_base=default_knowledge_base)
print(f"Answer: {answer1}")

# Example 2: Question about transformers
print("\n--- Example 2: Question about Transformers ---")
question2 = "What are transformers in NLP?"
print(f"Question: {question2}")
answer2 = RAG_system(question2, knowledge_base=default_knowledge_base)
print(f"Answer: {answer2}")

# Example 3: Question about RAG itself
print("\n--- Example 3: Question about RAG ---")
question3 = "What is RAG and how does it work?"
print(f"Question: {question3}")
answer3 = RAG_system(question3, knowledge_base=default_knowledge_base)
print(f"Answer: {answer3}")

# Example 4: Direct question without knowledge base
print("\n--- Example 4: Direct Question (No Retrieval) ---")
question4 = "What is the capital of France?"
print(f"Question: {question4}")
answer4 = RAG_system(question4)  # No knowledge base provided
print(f"Answer: {answer4}")




Using device: cuda
Loading model components... This may take a minute on first run.


The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Model loaded successfully!

RAG SYSTEM EXAMPLES

--- Example 1: RAG with Knowledge Base ---
Question: What is NLP?
Retrieved 3 relevant documents


The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Answer: a subfield of artificial intelligence that focuses on the interaction between computers and human language

--- Example 2: Question about Transformers ---
Question: What are transformers in NLP?
Retrieved 3 relevant documents


The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Answer: a neural network architecture introduced in 2017 that revolutionized

--- Example 3: Question about RAG ---
Question: What is RAG and how does it work?
Retrieved 3 relevant documents


The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Answer: RAG (Retrieval-Augmented Generation) combines information retrieval with text generation

--- Example 4: Direct Question (No Retrieval) ---
Question: What is the capital of France?
Answer: saône-by-seine
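The retriever above ranks documents by raw word overlap, so a common word such as "is" counts as much as "NLP", and punctuation stays attached to tokens ("nlp?" does not match "nlp"). One possible refinement, sketched below on the assumption that a plain-Python TF-IDF score is enough for a small knowledge base, strips punctuation and gives more weight to words that appear in few documents. The names `tokenize` and `tfidf_retrieve` and the toy knowledge base are illustrative and not part of the tutorial code.

```python
import math
import re
from collections import Counter
from typing import List

def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep only alphanumeric word tokens (drops punctuation)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_retrieve(question: str, knowledge_base: List[str], top_k: int = 3) -> List[str]:
    """Rank documents by the summed TF-IDF weight of the question's words."""
    doc_tokens = [tokenize(doc) for doc in knowledge_base]
    n_docs = len(doc_tokens)

    # Document frequency: in how many documents each word occurs
    df = Counter()
    for tokens in doc_tokens:
        df.update(set(tokens))

    question_words = set(tokenize(question))
    scored = []
    for doc, tokens in zip(knowledge_base, doc_tokens):
        tf = Counter(tokens)
        score = 0.0
        for word in question_words:
            if word in tf:
                # Plain IDF: a word found in every document gets weight log(1) = 0
                idf = math.log(n_docs / df[word])
                score += (tf[word] / len(tokens)) * idf
        scored.append((score, doc))

    # Highest score first; drop documents that share no informative word
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]

# Toy knowledge base (made up for this sketch)
kb = [
    "NLP is a subfield of artificial intelligence.",
    "Transformers use self-attention and changed NLP.",
    "Paris is the capital of France.",
]
print(tfidf_retrieve("What is the capital of France?", kb, top_k=1))
```

A function like this could be passed the same `default_knowledge_base` in place of the word-overlap retriever, although the retrieved documents, and therefore the generated answers, may differ from the output shown above.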


### 3.2 Try a Pre-trained RAG Model with Built-in Retriever and Generator
The Hugging Face Hub hosts many pretrained RAG models.

You can browse them [here](https://huggingface.co/models?search=rag).

The following is one example model and its usage.

In [13]:
# ============================================
# STEP 1: Run this installation first
# ============================================
!pip install faiss-cpu -q
print("faiss-cpu installed!")
import faiss; print(faiss.__version__)
!pip install datasets==2.19.0
!pip -q install "numpy<2.0"

faiss-cpu installed!
1.11.0


In [11]:
"""
Facebook RAG Model - Direct Implementation
"""

# ============================================
# STEP 2: After installation, run this
# (If an error occurs, restart the runtime and run this cell again)
# ============================================

from transformers import RagTokenizer, RagTokenForGeneration, RagRetriever
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using: {device}\n")

# Load Facebook's RAG model
model_name = "facebook/rag-token-nq"

print("Loading RAG components...")

# Load tokenizer
tokenizer = RagTokenizer.from_pretrained(model_name)
print("✓ Tokenizer loaded")

# Load retriever
retriever = RagRetriever.from_pretrained(
    model_name,
    index_name="exact",
    use_dummy_dataset=True
)
print("✓ Retriever loaded")

# Load model with retriever
model = RagTokenForGeneration.from_pretrained(
    model_name,
    retriever=retriever
).to(device)
print("✓ RAG model loaded\n")

def ask_rag(question):
    """Ask question using Facebook's RAG model"""
    inputs = tokenizer(question, return_tensors="pt").to(device)

    with torch.no_grad():
        outputs = model.generate(
            input_ids=inputs["input_ids"],
            max_length=1000
        )

    answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return answer

# Test it
print("="*50)
print("TESTING FACEBOOK RAG MODEL")
print("="*50)

questions = [
    "What is the capital of France?",
    "Who invented the telephone?",
    "What is data science",
    "When was the moon landing?",
]

for q in questions:
    print(f"\nQ: {q}")
    answer = ask_rag(q)
    print(f"A: {answer}")

print("\n✅ Facebook RAG model working!")
print("Usage: answer = ask_rag('your question')")

Using: cuda

Loading RAG components...


The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'DPRQuestionEncoderTokenizer'.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'DPRQuestionEncoderTokenizerFast'.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'BartTokenizer'.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called fr

✓ Tokenizer loaded


The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'DPRQuestionEncoderTokenizer'.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'DPRQuestionEncoderTokenizerFast'.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'BartTokenizer'.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called fr

✓ Retriever loaded


Some weights of the model checkpoint at facebook/rag-token-nq were not used when initializing RagTokenForGeneration: ['rag.question_encoder.question_encoder.bert_model.pooler.dense.bias', 'rag.question_encoder.question_encoder.bert_model.pooler.dense.weight']
- This IS expected if you are initializing RagTokenForGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RagTokenForGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


✓ RAG model loaded

TESTING FACEBOOK RAG MODEL

Q: What is the capital of France?
A:  city of paris

Q: Who invented the telephone?
A:  alexander graham bell

Q: What is data science
A:  analysis of data

Q: When was the moon landing?
A:  july 20 , 1969

✅ Facebook RAG model working!
Usage: answer = ask_rag('your question')
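Unlike the word-overlap retriever in Section 3.1, the retriever in this model works on dense vectors: the question and every passage are encoded as embeddings, and the passages with the largest inner product against the question embedding are returned. The sketch below mimics that search with hand-written 3-dimensional vectors. The embeddings, passages, and function names are invented for illustration, whereas the real model obtains its embeddings from DPR encoders and searches them with a FAISS index.

```python
from typing import List, Tuple

def inner_product(a: List[float], b: List[float]) -> float:
    """Similarity score used for dense retrieval in this sketch."""
    return sum(x * y for x, y in zip(a, b))

def dense_retrieve(
    query_vec: List[float],
    passages: List[Tuple[str, List[float]]],
    top_k: int = 2,
) -> List[Tuple[float, str]]:
    """Return the top_k (score, text) pairs with the largest inner product."""
    scored = [(inner_product(query_vec, vec), text) for text, vec in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Hypothetical passage embeddings (a real system would compute these with an encoder)
passages = [
    ("Paris is the capital of France.", [0.9, 0.1, 0.0]),
    ("Alexander Graham Bell patented the telephone.", [0.1, 0.9, 0.1]),
    ("Apollo 11 landed on the Moon in 1969.", [0.0, 0.2, 0.9]),
]

# Pretend embedding of the question "What is the capital of France?"
query_vec = [0.8, 0.2, 0.1]

results = dense_retrieve(query_vec, passages, top_k=2)
for score, text in results:
    print(f"{score:.2f}  {text}")
```

An exhaustive scan like this is fine for a handful of passages. An index becomes important once the collection grows to millions of passages.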
