<a href="https://colab.research.google.com/github/adityaraj1105/RAGs-vs-Fine-tuning/blob/main/RAGs_vs_Fine-tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#1. Setting Up Environment




In [1]:
!pip install faiss-cpu



In [2]:
!pip install transformers datasets




#2. Loading Dataset



In [3]:
from datasets import load_dataset

# Load the SQuAD dataset
dataset = load_dataset("squad")




#3. Fine-Tuning the LLM [GPT-2(137M)]


In [4]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer

# Load SQuAD dataset
dataset = load_dataset("squad")

# Load tokenizer and model
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Add a padding token to the tokenizer
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(model_name)

def preprocess_function(examples):
    inputs = [q + " " + c for q, c in zip(examples['question'], examples['context'])]
    model_inputs = tokenizer(inputs, max_length=256, truncation=True, padding='max_length')
    model_inputs['labels'] = model_inputs['input_ids'].copy()
    return model_inputs

# Tokenize dataset
tokenized_dataset = dataset.map(preprocess_function, batched=True)

# Use a smaller subset for quicker training
train_dataset = tokenized_dataset['train'].select(range(500))  # Further reduced
eval_dataset = tokenized_dataset['validation'].select(range(50))  # Further reduced

# Define training arguments
training_args = TrainingArguments(
    output_dir="./gpt2-small-squad",
    per_device_train_batch_size=1,  # Reduced batch size
    per_device_eval_batch_size=1,   # Reduced batch size
    num_train_epochs=1,
    weight_decay=0.01,
    logging_dir="./logs",
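    logging_steps=20,  # Log training loss every 20 steps
    evaluation_strategy="steps",
    save_steps=1000,  # Exceeds this run's 250 optimizer steps, so no intermediate checkpoint is written
    eval_steps=1000,  # Exceeds this run's 250 optimizer steps, so no mid-training evaluation runs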
    save_total_limit=1,
    gradient_accumulation_steps=2,  # Accumulate gradients
    fp16=True,  # Keep mixed precision
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

# Start training
trainer.train()



TrainOutput(global_step=250, training_loss=1.997335693359375, metrics={'train_runtime': 2282.9714, 'train_samples_per_second': 0.219, 'train_steps_per_second': 0.11, 'total_flos': 65323008000000.0, 'train_loss': 1.997335693359375, 'epoch': 1.0})
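The `preprocess_function` above concatenates each question with its context but never appends the reference answer, so the fine-tuning objective is plain language modeling over question-plus-context text. A minimal sketch of one alternative text format follows, assuming SQuAD's `answers` field is a dict with a `text` list. The helper name `build_training_text` and the `Context:/Question:/Answer:` template are illustrative choices, not part of the original notebook.

```python
def build_training_text(question, context, answers, eos_token="<|endoftext|>"):
    """Format one SQuAD-style record as a single string that ends with the answer.

    `answers` is assumed to look like {"text": [...], "answer_start": [...]}.
    The template itself is a hypothetical choice for illustration.
    """
    answer = answers["text"][0] if answers["text"] else ""
    return f"Context: {context}\nQuestion: {question}\nAnswer: {answer}{eos_token}"


# Toy record written for illustration (not taken from the dataset).
text = build_training_text(
    question="When was the Declaration of Independence signed?",
    context="The Declaration of Independence was signed on July 4, 1776.",
    answers={"text": ["July 4, 1776"], "answer_start": [46]},
)
print(text)
```

With a format like this, the text the model learns to continue after `Answer:` is the reference answer, and the same prefix (without the answer) can be used as the prompt at inference time.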

# 4. Evaluating the Fine-Tuned Model


In [5]:
# Evaluate the fine-tuned model
results = trainer.evaluate()
print("Evaluation results:", results)

Evaluation results: {'eval_loss': 1.6428017616271973, 'eval_runtime': 65.4114, 'eval_samples_per_second': 0.764, 'eval_steps_per_second': 0.764, 'epoch': 1.0}
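Because `preprocess_function` pads every example to 256 tokens and copies `input_ids` into `labels` unchanged, the loss reported above is averaged over padding positions as well as real text. A common remedy is to set padded label positions to `-100`, the index that PyTorch's cross-entropy loss ignores by default. The helper below is a hypothetical sketch on plain Python lists, not code from the original run.

```python
def mask_padding_labels(input_ids, attention_mask, ignore_index=-100):
    """Copy input_ids into labels, replacing padded positions with ignore_index."""
    return [
        [tok if keep == 1 else ignore_index for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]


# Toy batch: 50256 is GPT-2's EOS id, which this notebook reuses as the pad id.
batch_input_ids = [[10, 11, 12, 50256, 50256], [20, 21, 50256, 50256, 50256]]
batch_attention_mask = [[1, 1, 1, 0, 0], [1, 1, 0, 0, 0]]

labels = mask_padding_labels(batch_input_ids, batch_attention_mask)
print(labels)  # [[10, 11, 12, -100, -100], [20, 21, -100, -100, -100]]
```

Inside `preprocess_function`, the same idea would replace the plain `.copy()` with a call such as `mask_padding_labels(model_inputs['input_ids'], model_inputs['attention_mask'])`. Losses computed this way are not directly comparable to the figures recorded in this run.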


#5. Implementing RAG with SQuAD



In [6]:
from transformers import RagTokenizer, RagSequenceForGeneration

# Load RAG model and tokenizer
rag_tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")
rag_model = RagSequenceForGeneration.from_pretrained("facebook/rag-sequence-nq")

# Tokenize SQuAD data for RAG
def rag_preprocess_function(examples):
    return rag_tokenizer(examples['question'], truncation=True, padding='max_length', max_length=256)

rag_tokenized_dataset = dataset.map(rag_preprocess_function, batched=True)

# Use a smaller subset for quicker processing
rag_train_dataset = rag_tokenized_dataset['train'].select(range(1000))
rag_eval_dataset = rag_tokenized_dataset['validation'].select(range(100))


The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'DPRQuestionEncoderTokenizer'.

#6. Tokenize and Generate Answers using RAG

In [12]:
from transformers import pipeline, RagTokenizer, RagRetriever, RagSequenceForGeneration, GPT2Tokenizer, GPT2LMHeadModel

# Load RAG model and tokenizer
rag_tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")
rag_retriever = RagRetriever.from_pretrained("facebook/rag-sequence-nq", index_name="exact", use_dummy_dataset=True)
rag_model = RagSequenceForGeneration.from_pretrained("facebook/rag-sequence-nq", retriever=rag_retriever)

# Load a simple generation model like GPT-2
gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2_model = GPT2LMHeadModel.from_pretrained("gpt2")

# Manually assign a pad_token_id
pad_token_id = gpt2_tokenizer.eos_token_id  # GPT-2's eos_token_id

# Define a sample question
sample_question = "What is the capital of France?"

# Tokenize the input using GPT-2 tokenizer
# NOTE: rag_model expects ids from rag_tokenizer's vocabulary; GPT-2 ids refer to different tokens.
input_ids = gpt2_tokenizer(sample_question, return_tensors='pt').input_ids

# Generate an answer using the RAG model
outputs = rag_model.generate(input_ids=input_ids, pad_token_id=pad_token_id, max_length=50)

# Decode the generated answer
# NOTE: the RAG generator does not emit GPT-2 ids, so decoding with gpt2_tokenizer likely yields unrelated text.
generated_answer = gpt2_tokenizer.decode(outputs[0], skip_special_tokens=True)
print("RAG Answer:", generated_answer)




RAG Answer: # but Yugann%ion hopBegin#
GPT-2 Answer: What is the capital of France? Paris is the capital of France. It is known for its art, fashion, and culture. It is also known for its history. It is also known for its history. It is also known for its history. It is also known for its history. It is also known for its history. It is also known for its history. It is also known for its history. It is also known for its history. It is also known for its history. It is also known for its history. It is also known for its history. It is also known for its history. It is also known for its history. It is also known for its history. It is also known for its history. It is also known
RAG Answer: # but Yugann%ion hopBegin#
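The unreadable `RAG Answer` above is consistent with a tokenizer mismatch: the question was encoded with GPT-2's vocabulary and the generated ids were decoded with it too, while `facebook/rag-sequence-nq` expects ids from its own `RagTokenizer`. A minimal sketch of a matched encode, generate, and decode path follows, assuming the `rag_tokenizer` and `rag_model` objects loaded in the cell above. The wrapper name `rag_answer` is hypothetical. Note also that with `use_dummy_dataset=True` the retriever searches only a small sample index, so even a correctly wired call may return weak answers.

```python
def rag_answer(question, rag_tokenizer, rag_model, max_length=50):
    """Encode, generate, and decode with the same RAG tokenizer.

    `rag_tokenizer` and `rag_model` are assumed to be the RagTokenizer and
    RagSequenceForGeneration (with retriever attached) loaded earlier.
    """
    # Encode the question with the tokenizer that matches the RAG question encoder.
    inputs = rag_tokenizer(question, return_tensors="pt")

    # Retrieval happens inside generate() through the attached retriever.
    generated = rag_model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=max_length,
    )

    # Decode with the same tokenizer object so generator ids map back to text.
    return rag_tokenizer.batch_decode(generated, skip_special_tokens=True)[0]


# Usage in the notebook (requires the objects from the cell above):
# print("RAG Answer:", rag_answer(sample_question, rag_tokenizer, rag_model))
```

Passing the tokenizer and model as arguments keeps the helper independent of notebook globals, which makes it easier to swap in a different checkpoint or index later.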


#7. Comparing Results


In [17]:
# Questions and Contexts from SQuAD Dataset
test_data = [
    {
        "question": "When was the Declaration of Independence signed?",
        "context": "The Declaration of Independence was signed on July 4, 1776, by representatives from the thirteen American colonies."
    },
    {
        "question": "Who is the author of 'The Odyssey'?",
        "context": "Homer, an ancient Greek poet, is traditionally said to be the author of 'The Odyssey' and 'The Iliad.'"
    },
    # Add more questions and contexts as needed
]

# Function to generate answers using fine-tuned GPT-2
def gpt2_generate_answer(question, context):
    inputs = tokenizer.encode_plus(question + " " + context, return_tensors="pt", padding=True, truncation=True, max_length=512)
    input_ids = inputs['input_ids']
    attention_mask = inputs['attention_mask']

    outputs = model.generate(input_ids=input_ids, attention_mask=attention_mask, max_length=150, num_return_sequences=1)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Function to generate answers using RAG
def rag_generate_answer(question, context):
    # Encode question and context as a text pair; passages are still retrieved from rag_retriever's index
    inputs = rag_tokenizer(question, context, return_tensors="pt", padding=True, truncation=True, max_length=512)
    input_ids = inputs['input_ids']
    attention_mask = inputs['attention_mask']

    # Generate outputs
    outputs = rag_model.generate(input_ids=input_ids, attention_mask=attention_mask, max_length=50)
    return rag_tokenizer.decode(outputs[0], skip_special_tokens=True)


# Loop through the test data and generate answers using both models
for item in test_data:
    question = item["question"]
    context = item["context"]

    # Generate answer using fine-tuned GPT-2
    gpt2_answer = gpt2_generate_answer(question, context)

    # Generate answer using RAG
    rag_answer = rag_generate_answer(question, context)

    # Print the results
    print(f"Question: {question}")
    print(f"Context: {context}")
    print(f"GPT-2 Answer: {gpt2_answer}")
    print(f"RAG Answer: {rag_answer}")
    print("-" * 50)
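`gpt2_generate_answer` decodes `outputs[0]` in full, and for a decoder-only model like GPT-2 the sequence returned by `generate` begins with the prompt tokens, so the printed answer repeats the question and context before any new text. One way to show only the continuation is to slice off the prompt length before decoding. The helper name `continuation_ids` below is hypothetical, and the token ids are toy values.

```python
def continuation_ids(output_ids, prompt_length):
    """Return only the tokens that follow the prompt in a generated sequence."""
    return output_ids[prompt_length:]


prompt = [101, 102, 103]        # toy prompt ids
generated = prompt + [7, 8, 9]  # generate() returns the prompt followed by new tokens

print(continuation_ids(generated, len(prompt)))  # [7, 8, 9]
```

Inside `gpt2_generate_answer` this would correspond to decoding `outputs[0][input_ids.shape[1]:]`. Passing `max_new_tokens` instead of `max_length` to `generate` also keeps the prompt from counting against the length budget.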


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Question: When was the Declaration of Independence signed?
Context: The Declaration of Independence was signed on July 4, 1776, by representatives from the thirteen American colonies.
GPT-2 Answer: When was the Declaration of Independence signed? The Declaration of Independence was signed on July 4, 1776, by representatives from the thirteen American colonies. The Declaration of Independence was signed by the president, John Adams, and was signed by the president, James Madison. The United States of America was established in 1776 by the United States Congress. The United States of America was established in 1803 by the United States House of Representatives. The United States of America was established in 1803 by the United States Senate. The United States of America was established in 1803 by the United States House of Representatives.
RAG Answer: 
--------------------------------------------------
Question: Who is the author of 'The Odyssey'?
Context: Homer, an ancient Greek poet, i