# RAG Evaluation Using RAGAS

## Introduction to Evaluation
Evaluating a Retrieval-Augmented Generation (RAG) system means assessing both the retrieval component (how well relevant documents are fetched from a knowledge base) and the generation component (how accurate, relevant, and coherent the generated responses are). A RAG system combines a retriever (e.g., a vector store like FAISS) with a language model (e.g., a Gemini chat model) to produce responses grounded in retrieved context, which reduces hallucinations (incorrect or fabricated information).

## Why Do We Use Evaluation?
Evaluation is critical for the following reasons:
- Quality Assurance: Ensures the RAG system delivers accurate, relevant, and trustworthy responses.
- System Improvement: Identifies weaknesses in retrieval (e.g., irrelevant documents) or generation (e.g., unfaithful answers), guiding optimizations like better embeddings or prompt engineering.
- Performance Monitoring: Quantifies system performance to track improvements or regressions over time.
- Stakeholder Confidence: Provides metrics to demonstrate the system's reliability to stakeholders or end-users.

### The RAGAS Framework
The RAGAS framework (Retrieval Augmented Generation Assessment) is used to evaluate RAG systems. It provides metrics such as:
- Faithfulness: Measures if the generated answer is factually grounded in the retrieved context.
- Answer Relevancy: Assesses if the answer directly addresses the user's query.
- Context Precision: Checks if the retrieved context contains relevant information with minimal noise.
- Context Recall: Ensures all necessary information is retrieved (requires ground truth).
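
The metric definitions above can be made concrete with a small sketch. This is a simplified illustration, not the RAGAS implementation: it assumes an LLM judge has already produced per-claim and per-chunk verdicts as booleans, and the function names are hypothetical.

```python
def faithfulness_score(claim_supported):
    """Fraction of answer claims that the retrieved context supports."""
    if not claim_supported:
        return float("nan")  # no claims to judge
    return sum(claim_supported) / len(claim_supported)


def context_precision_score(chunk_relevant):
    """Mean of precision@k, taken at each rank k that holds a relevant chunk."""
    hits = 0
    precisions = []
    for k, relevant in enumerate(chunk_relevant, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0


def context_recall_score(reference_claim_found):
    """Fraction of ground-truth claims that can be attributed to the context."""
    if not reference_claim_found:
        return float("nan")
    return sum(reference_claim_found) / len(reference_claim_found)


# Hypothetical verdicts, in the shape an LLM judge might return them.
print(faithfulness_score([True, True, False, True]))   # 3 of 4 claims supported
print(context_precision_score([True, False, True]))    # relevant chunks at ranks 1 and 3
print(context_recall_score([True, True]))              # every reference claim found
```

Ranking matters for context precision: a relevant chunk at rank 1 contributes more than the same chunk at rank 3, which is why the retriever's ordering affects the score.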

This notebook sets up a RAG system using a Google Gemini chat model, GoogleGenerativeAIEmbeddings, and FAISS, generates a synthetic test dataset, and evaluates the system using RAGAS.

The setup cells below install the dependencies and load environment variables (e.g., API keys) from a .env file for secure configuration.
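
A quick check that the expected variables are present can save a confusing authentication error later. The sketch below is an optional addition, not part of the original notebook; `missing_env_vars` is a hypothetical helper, and `GOOGLE_API_KEY` is assumed to be the variable the Gemini integration reads.

```python
import os


def missing_env_vars(required, environ=os.environ):
    """Return the names in `required` that are unset or empty."""
    return [name for name in required if not environ.get(name)]


# Assumption: the Gemini integration reads its key from GOOGLE_API_KEY.
missing = missing_env_vars(["GOOGLE_API_KEY"])
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Run it after `load_dotenv()` so that values from the .env file are already in the environment.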

In [1]:
!python -m pip install pymupdf faiss-cpu ragas --quiet

In [8]:
!python -m pip install rapidfuzz --quiet



In [2]:
import os
from langchain_community.document_loaders import DirectoryLoader, PyMuPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from dotenv import load_dotenv

load_dotenv()

embedding_model_name = "models/gemini-embedding-001"
model_name = "gemini-2.0-flash"


Initializes the Gemini chat model for response generation and GoogleGenerativeAIEmbeddings for creating document embeddings.

In [3]:
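from langchain.chat_models import init_chat_model
from langchain_google_genai import GoogleGenerativeAIEmbeddings

# Embedding model used to index and query document chunks
embeddings = GoogleGenerativeAIEmbeddings(model=embedding_model_name)
# Chat model used for answer generation, test set generation, and evaluation
llm = init_chat_model(model_name, model_provider="google_genai")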


dir_path = r"datasets/supply_chain"
index_path = r"VectorDB_Chroma/faiss"

## Document Loading
This section loads PDF documents from a directory and splits them into chunks.

In [4]:
def load_documents():
    """
    Load PDF documents from the specified directory using PyMuPDFLoader.
    
    Returns:
        list: A list of loaded documents.
    
    Raises:
        FileNotFoundError: If the directory does not exist.
        Exception: For other loading errors.
    """
    if not os.path.exists(dir_path):
        raise FileNotFoundError(f"Directory not found: {dir_path}")
    try:
        loader = DirectoryLoader(dir_path, loader_cls=PyMuPDFLoader)
        return loader.load()
    except Exception as e:
        raise e

def split_documents(documents):
    """
    Split documents into smaller chunks using RecursiveCharacterTextSplitter.
    
    Args:
        documents (list): List of documents to split.
    
    Returns:
        list: A list of document chunks. Returns empty list if no documents.
    """
    try:
        if not documents:
            return []
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=3000, chunk_overlap=200)
        return text_splitter.split_documents(documents)
    except Exception as e:
        print(f"Error splitting documents: {str(e)}")
        return []

# Load and split documents
documents = load_documents()
documents = split_documents(documents)
print(f"Loaded and split {len(documents)} document chunks.")

Loaded and split 131 document chunks.
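
To see what `chunk_size` and `chunk_overlap` mean, the sketch below cuts a string into fixed-size windows that share characters with their neighbours. It is a deliberately simplified illustration: RecursiveCharacterTextSplitter prefers to break on separators such as paragraphs and sentences, so its real chunks will differ. `sliding_window_chunks` is a hypothetical helper.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut `text` into windows of `chunk_size` that overlap by `chunk_overlap`."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # each chunk repeats the last 2 characters of the previous one
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of storing some text twice.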


## Vector Store Creation

This section embeds the document chunks and creates a FAISS vector store for retrieval, or loads a previously saved index if one exists.

In [5]:
def create_vectorstore(documents):
    """
    Create and save a new FAISS vector store from documents.
    
    Args:
        documents (list): List of document objects to convert to vectors.
    
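    Returns:
        None
    
    Raises:
        RuntimeError: If the vector store cannot be created.
    """
    try:
        os.makedirs(index_path, exist_ok=True)
        vectorstore = FAISS.from_documents(documents, embeddings)
        print("Vector Store created Successfully")
        save_vectorstore(vectorstore)
    except Exception as e:
        # Raise instead of returning the exception so failures are not silently ignored
        raise RuntimeError("Failed to create the vector store") from e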

def save_vectorstore(vectorstore):
    """
    Save the FAISS vector store to the specified path.
    
    Args:
        vectorstore (FAISS): The vector store to save.
    
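    Returns:
        None
    
    Raises:
        RuntimeError: If the vector store cannot be saved.
    """
    try:
        vectorstore.save_local(index_path)
        print("vector Store saved successfully")
    except Exception as e:
        # Raise instead of returning the exception so failures are not silently ignored
        raise RuntimeError(f"Failed to save the vector store to {index_path}") from e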

def load_vectorstore():
    """
    Load an existing FAISS vector store.
    
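    Returns:
        FAISS: The loaded vector store.
    
    Raises:
        RuntimeError: If the index cannot be loaded.
    """
    try:
        print("loading vector Store...")
        # allow_dangerous_deserialization unpickles the saved index; only use it on files you created
        vs = FAISS.load_local(index_path, embeddings=embeddings, allow_dangerous_deserialization=True)
        print("loaded successfully")
        return vs
    except Exception as e:
        # Raise so callers never receive an exception object in place of a vector store
        raise RuntimeError(f"Failed to load the vector store from {index_path}") from e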

# Load or create vector store
if os.path.exists(index_path) and any(os.listdir(index_path)):
    vectorstore = load_vectorstore()
    vectorstore_retriever = vectorstore.as_retriever(search_kwargs={'k': 5})
    print("Vector store loaded successfully.")
else:
    create_vectorstore(documents)
    vectorstore = load_vectorstore()
    print(vectorstore)
    vectorstore_retriever = vectorstore.as_retriever(search_kwargs={'k': 5})
    print("Created and loaded new vector store.")

Vector Store created Successfully
vector Store saved successfully
loading vector Store...
loaded successfully
<langchain_community.vectorstores.faiss.FAISS object at 0x7dfbdfd31210>
Created and loaded new vector store.


### Explanation:

- Document Loading: Uses DirectoryLoader with PyMuPDFLoader to load PDFs from the data directory.
- Document Splitting: Splits documents into chunks (3,000 characters, 200 overlap) for efficient retrieval.
- Vector Store: Creates a FAISS index from document embeddings or loads an existing one from the index directory.
- Retriever: Configures the vector store as a retriever, fetching the top 5 relevant documents for a query.
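
Conceptually, the retriever step ranks stored chunk vectors by similarity to the query vector and keeps the best `k`. The sketch below shows that idea with cosine similarity on toy 2-dimensional vectors. It is an illustration only: the FAISS index may use a different distance measure (such as L2), and `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=5):
    """Return the indices of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy embeddings; real ones have hundreds or thousands of dimensions.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
print(top_k([1.0, 0.1], doc_vecs, k=2))
```

A larger `k` tends to raise context recall but can lower context precision, because more loosely related chunks reach the prompt.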

## RAG Chain Setup
This section defines a RAG chain that validates queries, takes the retrieved documents, and generates answers using the Gemini chat model.

In [6]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

def validate_query(query):
    """
    Validates a user's query by ensuring it is not empty and has at least 15 characters.
    
    Args:
        query (str): The input query.
    
    Returns:
        str: The query if valid, or an error message if invalid.
    """
    try:
        if not query:
            return "Query cannot be empty, enter a valid query."
        elif len(query) < 15:
            return "Query is too short, enter a valid query."
        else:
            return query
    except Exception as e:
        return str(e)

def create_rag_chain(query, relevant_documents):
    """
    Creates and executes a RAG chain to answer a query using retrieved documents.
    
    Args:
        query (str): The user query.
        relevant_documents (list): List of retrieved document chunks.
    
    Returns:
        str: The generated response or an error message.
    """
    try:
        prompt_template = """
        Only based on the provided documents, answer the question in points. Do not mention from which document the answer is derived.
        Question: {query}
        Documents: {docs}
        Note: You are a supply chain assistant. If the query is not related to supply chain or the documents do not provide the necessary information, return "Invalid Query".
        """
        prompt = ChatPromptTemplate.from_template(prompt_template)
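        valid_query = validate_query(query)
        if valid_query != query:
            # Validation failed: return the error message instead of sending it to the model
            return valid_query
        rag_chain = prompt | llm | StrOutputParser()
        return rag_chain.invoke({"query": valid_query, "docs": relevant_documents})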
    except Exception as e:
        return str(e)

# Test the RAG chain
query = "What is Supply Chain?"
relevant_documents = vectorstore_retriever.invoke(query)
response = create_rag_chain(query, relevant_documents)
print("RAG Chain Response:")
print(response)

RAG Chain Response:
*   A network of partners collectively converting a basic commodity (upstream) into a finished product (downstream) valued by end-customers, managing returns at each stage.
*   Each partner is responsible for a process that adds value to a product by transforming inputs into outputs.
*   Supply chain management involves planning and controlling all processes from raw material production to end-user purchase and recycling.
*   It is about planning and controlling all business processes from end-customer to raw material suppliers, linking partners to serve the end-customer's needs.


### Explanation:

- Query Validation: Ensures the query is non-empty and at least 15 characters long.
- RAG Chain: Constructs a prompt that instructs the model to answer in bullet points, using only the retrieved documents, and to return "Invalid Query" if the query is unrelated to supply chain or unsupported by the documents.
- Execution: Combines the prompt, the Gemini chat model, and a string output parser to generate a response.
- Test: Runs a sample query to verify the RAG chain's functionality.
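
The chain above passes the list of retrieved documents straight into the prompt, so the model sees their Python representation, metadata included. One possible refinement, shown here as a sketch and not as part of the original notebook, is to join only the chunk text. `Doc` is a hypothetical stand-in for a LangChain document (only `page_content` is assumed), and `format_docs` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document chunk."""
    page_content: str


def format_docs(docs, separator="\n\n"):
    """Join the non-empty chunk texts into one prompt-ready string."""
    texts = (doc.page_content.strip() for doc in docs)
    return separator.join(text for text in texts if text)


docs = [Doc("  A supply chain is a network of partners. "), Doc(""), Doc("Each partner adds value.")]
print(format_docs(docs))
```

With such a helper, the chain could be invoked with `"docs": format_docs(relevant_documents)`, which keeps file paths and other metadata out of the prompt.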

## Generating Synthetic Test Data with RAGAS
To evaluate the RAG system, we need a test dataset with questions, answers, contexts, and ground truth. RAGAS's TestsetGenerator can create synthetic data from documents.

In [9]:
import random
from ragas.testset import TestsetGenerator

# Configure the test set generator
testset_generator = TestsetGenerator.from_langchain(
    llm=llm,
    embedding_model=embeddings
)

# Randomly sample a subset of documents (here 10 of the 131 chunks) to limit API calls
sample_size = 10  # Adjust based on your needs
random.seed(42)  # For reproducibility
sampled_documents = random.sample(documents, min(sample_size, len(documents)))

# Generate a small test set from the sampled documents (5 samples requested)
testset = testset_generator.generate_with_langchain_docs(sampled_documents, testset_size=5)




In [11]:
testset.samples

[TestsetSample(eval_sample=SingleTurnSample(user_input='Wat iz supply chain manegemnt and how does it impakt end-custmers?', retrieved_contexts=None, reference_contexts=['the conversion of basic commodity into ﬁnished product. At each stage of the\nconversion, there may be returns which could be reject material from the preced-\ning ﬁrm, or waste like the ﬁnished can that needs to be recycled.\nA supply chain is a network of partners who collectively convert a basic commod-\nity (upstream) into a ﬁnished product (downstream) that is valued by end-cus-\ntomers, and who manage returns at each stage.\nEach partner in a supply chain is responsible directly for a process that adds value\nto a product. A process:\nTransforms inputs in the form of materials and information into outputs in the\nform of goods and services.\nIn the case of the cola can, partners carry out processes such as mining, trans-\nportation, reﬁning and hot rolling. The cola can has greater value than the baux-\nite (per

In [12]:
# Convert test dataset to evaluation format
eval_data = {
    "question": [],
    "answer": [],
    "contexts": [],
    "ground_truth": []
}

for testcase in testset.samples:
    relevant_docs = vectorstore_retriever.invoke(testcase.eval_sample.user_input)
    answer = create_rag_chain(testcase.eval_sample.user_input, relevant_docs)
    eval_data["question"].append(testcase.eval_sample.user_input)
    eval_data["answer"].append(answer)
    eval_data["contexts"].append([doc.page_content for doc in relevant_docs])
    eval_data["ground_truth"].append(testcase.eval_sample.reference)

print(f"Generated {len(eval_data['question'])} test cases.")

Generated 3 test cases.


### Explanation:

- TestsetGenerator: Uses the Gemini chat model to generate test cases and the Gemini embedding model for document embeddings.
- Test Data Generation: Creates test cases from a random sample of 10 document chunks (fixed seed for reproducibility).
- Evaluation Dataset: For each test case, retrieves relevant documents, generates an answer using the RAG chain, and collects the question, answer, contexts, and ground truth.
- Output: Stores the data in a dictionary format suitable for RAGAS evaluation.
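
Before handing the dictionary to the evaluator, it can help to confirm that every column is present, that all columns have the same length, and that each row holds a list of context strings. The sketch below is an optional sanity check under those assumptions; `check_eval_data` is a hypothetical helper, not a RAGAS function.

```python
REQUIRED_COLUMNS = ("question", "answer", "contexts", "ground_truth")


def check_eval_data(data):
    """Return a list of human-readable problems; an empty list means the data looks usable."""
    missing = [col for col in REQUIRED_COLUMNS if col not in data]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]

    problems = []
    lengths = {col: len(data[col]) for col in REQUIRED_COLUMNS}
    if len(set(lengths.values())) > 1:
        problems.append(f"columns differ in length: {lengths}")
    for i, contexts in enumerate(data["contexts"]):
        if not isinstance(contexts, list) or not all(isinstance(c, str) for c in contexts):
            problems.append(f"row {i}: contexts must be a list of strings")
    for i, answer in enumerate(data["answer"]):
        if not isinstance(answer, str) or not answer.strip():
            problems.append(f"row {i}: answer is empty")
    return problems


sample = {
    "question": ["What is a supply chain?"],
    "answer": ["A network of partners."],
    "contexts": [["A supply chain is a network of partners."]],
    "ground_truth": ["A network of partners that converts commodities into products."],
}
print(check_eval_data(sample))
```

Empty answers are worth flagging because metrics that analyse the answer text have nothing to score in that case.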

## RAG Evaluation with RAGAS
This section evaluates the RAG system using RAGAS metrics: faithfulness, answer relevancy, context precision, and context recall.

In [13]:
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

# Convert evaluation data to Hugging Face Dataset
eval_dataset = Dataset.from_dict(eval_data)

# Wrap the Gemini chat model and embeddings for RAGAS compatibility
evaluator_llm = LangchainLLMWrapper(llm)
evaluator_embeddings = LangchainEmbeddingsWrapper(embeddings)

# Run evaluation
results = evaluate(
    dataset=eval_dataset,
    metrics=[
        faithfulness,       # Checks if the answer is grounded in the context
        answer_relevancy,   # Checks if the answer addresses the question
        context_precision,  # Checks if retrieved context is relevant
        context_recall      # Checks if all necessary information is retrieved
    ],
    llm=evaluator_llm,
    embeddings=evaluator_embeddings,
    show_progress=True
)

# Print evaluation results
print("RAGAS Evaluation Results:")
print(results)

Evaluating:   0%|          | 0/12 [00:00<?, ?it/s]Exception raised in Job[5]: IndexError(list index out of range)
Exception raised in Job[1]: IndexError(list index out of range)
Exception raised in Job[9]: IndexError(list index out of range)
Evaluating: 100%|██████████| 12/12 [00:41<00:00,  3.47s/it]


RAGAS Evaluation Results:
{'faithfulness': 1.0000, 'answer_relevancy': nan, 'context_precision': 0.9347, 'context_recall': 1.0000}


### Explanation:

- Dataset Conversion: Converts the evaluation data into a Hugging Face Dataset for RAGAS.
- LLM Wrapper: Wraps the Gemini chat model with LangchainLLMWrapper, and the embeddings with LangchainEmbeddingsWrapper, for compatibility with RAGAS.
- Metrics: Evaluates the RAG system on:
    - Faithfulness: Ensures answers are factually consistent with the context.
    - Answer Relevancy: Measures how well answers address the query.
    - Context Precision: Assesses the relevance of retrieved documents.
    - Context Recall: Checks if all necessary information is retrieved (uses ground truth).
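- Results: Outputs scores (0 to 1) for each metric, where higher is better. In this run, answer_relevancy is nan, most likely because the three jobs that raised IndexError (jobs 1, 5, and 9 in the log) were the answer relevancy computations, so that metric has no valid score. With only 3 test cases, the remaining scores are indicative rather than statistically robust.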
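
Because a failed metric shows up as nan, any aggregate over the scores should skip it explicitly; otherwise the average itself becomes nan. The sketch below works on a plain dictionary copied from the printed results, and `summarize_scores` is a hypothetical helper, not part of RAGAS.

```python
import math


def summarize_scores(scores):
    """Average the valid scores and list the metrics that produced nan."""
    valid = {name: value for name, value in scores.items() if not math.isnan(value)}
    failed = sorted(name for name in scores if name not in valid)
    mean = sum(valid.values()) / len(valid) if valid else float("nan")
    return {"mean": mean, "failed": failed}


# Scores copied from the evaluation output above.
scores = {
    "faithfulness": 1.0,
    "answer_relevancy": float("nan"),
    "context_precision": 0.9347,
    "context_recall": 1.0,
}
summary = summarize_scores(scores)
print(f"Mean of valid metrics: {summary['mean']:.4f}")
print(f"Metrics to re-run: {summary['failed']}")
```

Listing the failed metrics separately keeps a missing score from being mistaken for a low one.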