# Comparing The Effectiveness Of RAG Between LLMs
This project analyzes how different Large Language Models (LLMs) respond to Retrieval Augmented Generation (RAG). We based our implementation on the paper "RAGAS: Automated Evaluation of Retrieval Augmented Generation" by Shahul Es, Jithin James, Luis Espinosa-Anke, and Steven Schockaert. To evaluate the RAG performance of the different models, we implemented the three different scores mentioned in the paper: faithfulness, answer relevance, and context relevance. For instructions on how to run the project see the README.md file. The final results can be viewed in the last cell after the project has been executed.

### Dependencies

In [1]:
!pip install torch --index-url https://download.pytorch.org/whl/cu128 --quiet 
!pip install python-dotenv ipywidgets sentence-transformers faiss-cpu transformers datasets transformers jupyterlab_widgets pandas accelerate numpy hf_xet tqdm --quiet


[notice] A new release of pip is available: 25.0.1 -> 25.3
[notice] To update, run: python.exe -m pip install --upgrade pip

[notice] A new release of pip is available: 25.0.1 -> 25.3
[notice] To update, run: python.exe -m pip install --upgrade pip


In [None]:
import os
import re
import gc
import torch
import numpy as np
from numpy.linalg import norm
import faiss
import datasets
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from sentence_transformers import SentenceTransformer
from accelerate import Accelerator
from tqdm.notebook import tqdm
from dotenv import load_dotenv

Checks if CUDA is available and sets the accelerator for hardware-agnostic execution:

In [3]:
print(f"CUDA verfügbar: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    
    accelerator = Accelerator()
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")
    print(f"Anzahl GPUs: {torch.cuda.device_count()}")
    print(f"CUDA Version (PyTorch): {torch.version.cuda}")
    print(f"Accelerator device: {accelerator.device}")
else:
    accelerator = Accelerator(cpu=True) 
    print(f"Accelerator device: {accelerator.device}")

CUDA verfügbar: True
GPU Name: NVIDIA GeForce RTX 5060 Ti
Anzahl GPUs: 1
CUDA Version (PyTorch): 12.8
Accelerator device: cuda


### Import the dataset

In [4]:
rag_dataset_1200 = datasets.load_dataset("neural-bridge/rag-dataset-1200", split="all")

documents_rag_1200 = [item["context"] for item in rag_dataset_1200]
len(documents_rag_1200)

1200

In [5]:
documents = documents_rag_1200

def chunk_text(text, chunk_size=500, overlap=100):
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start += chunk_size - overlap
    return chunks

chunks = []
for doc in documents:
    chunks.extend(chunk_text(doc))

len(chunks)

10950

In [6]:
embed_model = SentenceTransformer("all-MiniLM-L6-v2")

embeddings = embed_model.encode(
    chunks,
    show_progress_bar=True
)

embeddings = np.array(embeddings)
embeddings.shape

Batches:   0%|          | 0/343 [00:00<?, ?it/s]

(10950, 384)

Function used to generate all answers from the models

In [7]:
def gnerate_answer(model, tokenizer, prompt: str, context=None, max_tokens=1000, do_sample=True) -> str:
    if context==None:
        rag_instruction = prompt
    else:
        rag_instruction = (
            "Answer using the information from the given context.\n"
            f"context: {context}\n\n"
            f"{prompt}"
        )

    messages = [
        {"role": "user", "content": rag_instruction}
    ]
    
    text_input = tokenizer.apply_chat_template(
        messages, 
        tokenize=False, 
        add_generation_prompt=True,
        #-----------------------------------------------------------
        enable_thinking=False #-> falls kein thinking model löschen!
        #-----------------------------------------------------------
    )
    
    inputs = tokenizer(text_input, return_tensors="pt").to(accelerator.device)

    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            do_sample=do_sample,
            temperature=0.7,
            top_p=0.95,
            eos_token_id=tokenizer.eos_token_id
        )
    generated_ids = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(generated_ids, skip_special_tokens=True).strip()

### The three scores mentioned in the Ragas paper
In this section we implement all three scores mentioned in the Ragas paper.
We will use these to evaluate the models rag performance

Faithfulness: $$F = \frac{|F|}{|V|}$$

In [8]:
def build_statement_prompt(question: str, answer: str) -> str:
    return f"""Given a question and answer, create one or more statements from each sentence in the given answer.
question: {question}
answer: {answer}"""

In [9]:
def extract_statements(model, tokenizer, question: str, answer: str, max_tokens=1000) -> list:
    #Extracts statements out of the model generated answer 
    prompt = build_statement_prompt(question, answer)

    answer = gnerate_answer(model, tokenizer, prompt, max_tokens, do_sample=False)

    statements = [line.strip() for line in answer.split("\n") if len(line.strip()) > 3]
    return statements

We have slightly modified the original prompt from the paper to make it easier for the model to understand.

In [10]:
def build_faithfullness_prompt(statements: list) -> str:
    prompt = f"""Consider the given context and following statements, the determine whether they are supported by the information presente in the context. 
Provide a brief explanation for each statement before arriving at the verdict (Yes/No). 
Answer in the format:
Statement: ...
Explanation: ...
Verdict: (Yes/No)
Do not deviate from the specified format.
These are the statements
"""
    for s in statements:
        prompt += f"Statement: {s}\n"
        
    return prompt

In [11]:
def count_supported(answer: str) -> int:
    yes_matches = re.findall(r'Verdict:\s*(Yes)', answer, re.IGNORECASE)
    yes_count = len(yes_matches)
    return yes_count

In [12]:
def calculate_faithfulness_score(total_statements: int, suported_statements: int) -> float:
    if suported_statements == 0.0:
        return 0.0

    return suported_statements / total_statements

Answer Relevance: $$AR = \frac{1}{n} \sum\limits^n_{i=1} sim(q,q_i)$$

In [13]:
def build_answer_relevance_prompt(answer: str) -> str:
    prompt = f"""Generate a question for the given answer.
    answer: {answer}"""
    return prompt

In [14]:
def extract_questions(text: str) -> list:
    text.strip()
    list = text.split("?")
    return [item + "?" for item in list]

In [15]:
def calculate_question_similarity(original_question:str, generated_questions:list) -> list:
    q_embedding = embed_model.encode([original_question])
    gen_q_embeddings = embed_model.encode(generated_questions)

    q_embedding = np.array(q_embedding).astype('float32')
    gen_q_embeddings = np.array(gen_q_embeddings).astype('float32')

    faiss.normalize_L2(q_embedding)
    faiss.normalize_L2(gen_q_embeddings)

    d = q_embedding.shape[1]
    index = faiss.IndexFlatIP(d)
    
    index.add(gen_q_embeddings)
    
    k = len(generated_questions)
    D, I = index.search(q_embedding, k=k)
    
    scores = D[0]
    return scores

In [16]:
def calculate_answer_relevance_score(similatity: list) -> float:
    return float(np.mean(similatity))

Context Relevance: $$CR = \frac{\text{number of extracted sentences}}{\text{total number of senctences in }c(q)}$$

In [17]:
def build_context_relevance_prompt(question: str) -> str:
    prompt = f"""Please extract relevant sentences from the provided context that can potentially help ansewr the following question. If no relevant sentences are found, or if you believe the question cannot be answered from the given context, return the phares "Insufficient Information".While extracting candidate sentences you're not allowed to make any changes to senctences from the given context.
    Question: {question}"""
    return prompt

In [18]:

def count_sentences(text: str) -> int:
    sentences = re.findall(r'[.!?]+(?:\s|$)', text)
    return len(sentences)

In [19]:
def calculate_context_relevance_score(num_extracted: int, num_sentences: int) -> float:
    if num_sentences == 0.0:
        return 0.0
    
    return num_extracted / num_sentences

### RAG analysis

In [20]:
def analyse_faithfullness(model, tokenizer, context: str, question: str, answer: str, do_sample=True) -> list:
    #extracts statements out of the answer and generates a statements list
    statements = extract_statements(model, tokenizer, question, answer)

    #lets the model evaluate the faithfullness
    faithfullness_prompt = build_faithfullness_prompt(statements)
    #do_sample has to be false
    eval = gnerate_answer(model, tokenizer, faithfullness_prompt, context, do_sample=False)
    num_supported = count_supported(eval)

    return [statements, num_supported]

In [21]:
def analyse_answer_relevance(model, tokenizer, question: str, answer: str, do_sample=True) -> list:
    prompt = build_answer_relevance_prompt(answer)
    questions_generated = gnerate_answer(model, tokenizer, prompt, do_sample=do_sample)
    questions_generated = extract_questions(questions_generated)

    sim = calculate_question_similarity(question, questions_generated)

    return sim

In [22]:
def analyse_context_relevance(model, tokenizer, context: str, question: str) -> list:
    prompt = build_context_relevance_prompt(question)
    answer = gnerate_answer(model, tokenizer, prompt, context, do_sample=False)

    num_extracted = count_sentences(answer)
    num_sentences = count_sentences(context)
    
    return [num_extracted, num_sentences]

In [27]:
def analyse_rag(model, tokenizer, do_sample=True) -> dict:
    scores = {
        "faithfullness score" : 0.0,
        "answer relevance score" : 0.0,
        "context relevance score" : 0.0
    }
    #for faithfullness
    statements = []
    supported_statements = 0

    #for answer relevance
    similatity = []

    #for context relevance
    num_extracted = 0
    num_sentences = 0

    dataset_length = int(len(rag_dataset_1200) / 2)
    for i in tqdm(range(dataset_length)):
        #generates a answer with the context for the current question
        question = rag_dataset_1200[i]["question"]
        context = rag_dataset_1200[i]["context"]
        answer = gnerate_answer(model, tokenizer, question, context,do_sample=do_sample)

        faithfullness = analyse_faithfullness(model, tokenizer, context, question, answer, do_sample)
        statements.extend(faithfullness[0])
        supported_statements += faithfullness[1]

        answer_relevance = analyse_answer_relevance(model, tokenizer, question, answer, do_sample=do_sample)
        similatity.extend(answer_relevance)

        context_relevance = analyse_context_relevance(model, tokenizer, context, question)
        num_extracted += context_relevance[0]
        num_sentences += context_relevance[1]

    scores["faithfullness score"] = calculate_faithfulness_score(len(statements), supported_statements)
    scores["answer relevance score"] = calculate_answer_relevance_score(similatity)
    scores["context relevance score"] = calculate_context_relevance_score(num_extracted, num_sentences)

    return scores

### Load Models and Conclusion

Due to hardware limitations, we were not able to load and run all models at the same time. So we had to imlement all scores one after the orther.

Model Compression for Memory Optimization

In [24]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)

Load Llama

In [25]:
gc.collect()
torch.cuda.empty_cache()

load_dotenv()
hf_token = os.getenv("HF_TOKEN")

model_name_llama = "meta-llama/Llama-3.1-8B-Instruct"

model_llama = AutoModelForCausalLM.from_pretrained(
    model_name_llama,
    quantization_config=bnb_config,
    device_map="auto",
    token=hf_token                  
)

tokenizer_llama = AutoTokenizer.from_pretrained(
    model_name_llama,
    token=hf_token                  
)
tokenizer_llama.pad_token = tokenizer_llama.eos_token

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Calculate scores for Llama:

In [28]:
scores_llama = analyse_rag(model_llama, tokenizer_llama,do_sample=False)
scores_llama

  0%|          | 0/600 [00:00<?, ?it/s]

The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128009 for op

{'faithfullness score': 0.7013245033112583,
 'answer relevance score': 0.4270346164703369,
 'context relevance score': 0.14500347537828157}

Clear GPU VRAM

In [29]:
del model_llama
del tokenizer_llama
gc.collect()
torch.cuda.empty_cache()

Load Qwen

In [30]:
model_name_qwen = "Qwen/Qwen3-8B"

model_qwen = AutoModelForCausalLM.from_pretrained(
    model_name_qwen,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation="sdpa"
)
model_qwen = accelerator.prepare(model_qwen) 
tokenizer_qwen = AutoTokenizer.from_pretrained(model_name_qwen)

Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]

Calculalte scores for Qwen

In [31]:
scores_qwen = analyse_rag(model_qwen, tokenizer_qwen, do_sample=False)
scores_qwen

  0%|          | 0/600 [00:00<?, ?it/s]

The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


{'faithfullness score': 0.8215314944421572,
 'answer relevance score': 0.46684741973876953,
 'context relevance score': 0.1111051702935358}

Clear GPU VRAM

In [32]:
del model_qwen
del tokenizer_qwen
gc.collect()
torch.cuda.empty_cache()

Load Mistral

In [34]:
gc.collect()
torch.cuda.empty_cache()

model_name_mistral = "mistralai/Ministral-8B-Instruct-2410"
model_mistral = AutoModelForCausalLM.from_pretrained(
    model_name_mistral,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation="sdpa"
)
model_mistral = accelerator.prepare(model_mistral)
tokenizer_mistral = AutoTokenizer.from_pretrained(model_name_mistral)

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

'(ReadTimeoutError("HTTPSConnectionPool(host='huggingface.co', port=443): Read timed out. (read timeout=10)"), '(Request ID: 0c27901b-39d4-4b1f-9a90-010e5e48ae69)')' thrown while requesting HEAD https://huggingface.co/mistralai/Ministral-8B-Instruct-2410/resolve/main/generation_config.json
Retrying in 1s [Retry 1/5].
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.


Calculate scores for Mistral:

In [35]:
scores_mistarl = analyse_rag(model_mistral, tokenizer_mistral,do_sample=False)
scores_mistarl

  0%|          | 0/600 [00:00<?, ?it/s]

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for o

{'faithfullness score': 0.7282850779510023,
 'answer relevance score': 0.4447920620441437,
 'context relevance score': 0.06277067850077528}

Clear GPU VRAM

In [36]:
del model_mistral
del tokenizer_mistral
gc.collect()
torch.cuda.empty_cache()

### Final Results

In [37]:
print("Llama-3.1-8B-Instruct:")
print(f"Faithfullness Score: {scores_llama["faithfullness score"]}; Answer Relevance Score: {scores_llama["answer relevance score"]}; Context Relevance Score: {scores_llama["context relevance score"]}")
print("")
print("Qwen3-8B:")
print(f"Faithfullness Score: {scores_qwen["faithfullness score"]}; Answer Relevance Score: {scores_qwen["answer relevance score"]}; Context Relevance Score: {scores_qwen["context relevance score"]}")
print("")
print("Ministral-8B-Instruct-2410:")
print(f"Faithfullness Score: {scores_mistarl["faithfullness score"]}; Answer Relevance Score: {scores_mistarl["answer relevance score"]}; Context Relevance Score: {scores_mistarl["context relevance score"]}")

Llama-3.1-8B-Instruct:
Faithfullness Score: 0.7013245033112583; Answer Relevance Score: 0.4270346164703369; Context Relevance Score: 0.14500347537828157

Qwen3-8B:
Faithfullness Score: 0.8215314944421572; Answer Relevance Score: 0.46684741973876953; Context Relevance Score: 0.1111051702935358

Ministral-8B-Instruct-2410:
Faithfullness Score: 0.7282850779510023; Answer Relevance Score: 0.4447920620441437; Context Relevance Score: 0.06277067850077528
