# RAG Evaluation Using RAGAS

## Introduction to Evaluation
Evaluation in the context of Retrieval-Augmented Generation (RAG) systems involves assessing the performance of both the retrieval component (how well relevant documents are fetched from a knowledge base)
and the generation component (how accurate, relevant, and coherent the generated responses are). A RAG system combines a retriever (e.g., a vector store like FAISS) with a language model (e.g., AzureChatOpenAI) to provide contextually informed responses, reducing issues like hallucinations (incorrect or fabricated information).

## Why Do We Use Evaluation?
Evaluation is critical for the following reasons:
- Quality Assurance: Ensures the RAG system delivers accurate, relevant, and trustworthy responses.
- System Improvement: Identifies weaknesses in retrieval (e.g., irrelevant documents) or generation (e.g., unfaithful answers), guiding optimizations like better embeddings or prompt engineering.
- Performance Monitoring: Quantifies system performance to track improvements or regressions over time.
- Stakeholder Confidence: Provides metrics to demonstrate the system's reliability to stakeholders or end-users.

### The RAGAS Framework
RAGAS (Retrieval Augmented Generation Assessment) is a framework for evaluating RAG systems. It provides metrics such as:
- Faithfulness: Measures if the generated answer is factually grounded in the retrieved context.
- Answer Relevancy: Assesses if the answer directly addresses the user's query.
- Context Precision: Checks if the retrieved context contains relevant information with minimal noise.
- Context Recall: Ensures all necessary information is retrieved (requires ground truth).
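
As a rough illustration of what two of these scores mean numerically, the sketch below computes faithfulness and context recall as simple ratios of per-claim verdicts. It assumes the verdicts are already available as booleans; in RAGAS they come from an LLM judge. The helper `ratio_score` and the verdict lists are hypothetical and exist only for this example.

```python
def ratio_score(verdicts):
    """Return the fraction of True verdicts, or None when there is nothing to score."""
    if not verdicts:
        return None
    return sum(1 for v in verdicts if v) / len(verdicts)


# Hypothetical verdicts: is each claim in the generated answer supported by the retrieved context?
answer_claims_supported = [True, True, False, True]
# Hypothetical verdicts: is each claim in the ground truth present in the retrieved context?
reference_claims_found = [True, False]

faithfulness_score = ratio_score(answer_claims_supported)    # 3 of 4 claims supported
context_recall_score = ratio_score(reference_claims_found)   # 1 of 2 claims found

print(f"faithfulness ~ {faithfulness_score:.2f}")
print(f"context recall ~ {context_recall_score:.2f}")
```

Both scores fall between 0 and 1, and a score of 1 means every claim passed its check.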

This notebook sets up a RAG system using AzureChatOpenAI, AzureOpenAIEmbeddings, and FAISS, generates a synthetic test dataset, and evaluates the system using RAGAS.

The setup cells below install the required packages and load environment variables (e.g., API keys) from a .env file for secure configuration.
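
Before creating the Azure clients, it can help to confirm that the expected variables were actually loaded. The following is a minimal sketch. The variable names are assumptions based on what the `langchain_openai` Azure classes conventionally read, and `missing_env_vars` is a hypothetical helper, so adjust both to match your own .env file.

```python
import os

# Assumed variable names; change them to whatever your .env file defines.
REQUIRED_VARS = ("AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT", "OPENAI_API_VERSION")


def missing_env_vars(required=REQUIRED_VARS, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_env_vars()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All expected environment variables are set.")
```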

In [1]:
!python -m pip install pymupdf faiss-cpu ragas --quiet

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
langflow 1.0.14 requires google-search-results<3.0.0,>=2.4.1, which is not installed.
pilothub 0.1.8 requires openai==1.14.1, but you have openai 1.109.1 which is incompatible.
openevals 0.1.2 requires langchain>=0.3.18, but you have langchain 0.2.17 which is incompatible.
openevals 0.1.2 requires langchain-openai>=0.3.6, but you have langchain-openai 0.1.25 which is incompatible.
openevals 0.1.2 requires langsmith>=0.3.32, but you have langsmith 0.1.147 which is incompatible.
langgraph-prebuilt 1.0.4 requires langchain-core>=1.0.0, but you have langchain-core 0.2.43 which is incompatible.
langflow 1.0.14 requires certifi<2025.0.0,>=2023.11.17, but you have certifi 2025.1.31 which is incompatible.
langflow 1.0.14 requires huggingface-hub[inference]<0.23.0,>=0.22.0, but you have huggingface-hub 0.36.0 which is inco

In [2]:
!python -m pip install rapidfuzz --quiet



In [1]:
import os
from langchain_community.document_loaders import DirectoryLoader, PyMuPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import AzureOpenAIEmbeddings
from langchain_openai import AzureChatOpenAI
from dotenv import load_dotenv

load_dotenv()

MODEL_NAME = "gpt-4o-mini"
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"


Initializes AzureChatOpenAI for response generation and AzureOpenAIEmbeddings for creating document embeddings.

In [5]:
llm = AzureChatOpenAI(azure_deployment=MODEL_NAME)

embeddings = AzureOpenAIEmbeddings(model=EMBEDDING_MODEL_NAME)

dir_path = r"C:\Users\Anshu Pandey\Desktop\Client Data\Fidelity\FIL_Generative_AI_NOV25\datasets\Fidelity"
index_path = r"VectorDB_Chroma/faiss2"

## Document Loading
This section loads PDF documents from a directory and splits them into chunks.

In [6]:
def load_documents():
    """
    Load PDF documents from the specified directory using PyMuPDFLoader.
    
    Returns:
        list: A list of loaded documents.
    
    Raises:
        FileNotFoundError: If the directory does not exist.
        Exception: For other loading errors.
    """
    if not os.path.exists(dir_path):
        raise FileNotFoundError(f"Directory not found: {dir_path}")
    # Let any loader error propagate to the caller unchanged.
    loader = DirectoryLoader(dir_path, loader_cls=PyMuPDFLoader)
    return loader.load()

def split_documents(documents):
    """
    Split documents into smaller chunks using RecursiveCharacterTextSplitter.
    
    Args:
        documents (list): List of documents to split.
    
    Returns:
        list: A list of document chunks. Returns empty list if no documents.
    """
    try:
        if not documents:
            return []
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=4000, chunk_overlap=200)
        return text_splitter.split_documents(documents)
    except Exception as e:
        print(f"Error splitting documents: {str(e)}")
        return []

# Load and split documents
documents = load_documents()
documents = split_documents(documents)
print(f"Loaded and split {len(documents)} document chunks.")

Loaded and split 43 document chunks.
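
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a simplified sketch of overlapping chunking. It uses fixed-size sliding windows, which is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and sentences). The function name `sliding_window_chunks` and the tiny sizes are chosen only for illustration.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(repr(chunk))
```

Each chunk starts with the last three characters of the previous one, so text that straddles a boundary still appears whole in at least one chunk. The notebook applies the same idea at a larger scale (4000 characters with a 200-character overlap).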


## Vector Store Creation

This section creates a FAISS vector store from the document chunks, or loads an existing one from disk, and exposes it as a retriever.

In [7]:
def create_vectorstore(documents):
    """
    Create and save a new FAISS vector store from documents.
    
    Args:
        documents (list): List of document objects to convert to vectors.
    
    Returns:
        None
    
    Raises:
        Exception: If embedding the documents or saving the index fails.
    """
    # Let errors propagate instead of returning them, so a failure is not silently ignored.
    os.makedirs(index_path, exist_ok=True)
    vectorstore = FAISS.from_documents(documents, embeddings)
    print("Vector Store created Successfully")
    save_vectorstore(vectorstore)

def save_vectorstore(vectorstore):
    """
    Save the FAISS vector store to the specified path.
    
    Args:
        vectorstore (FAISS): The vector store to save.
    
    Returns:
        None
    
    Raises:
        Exception: If the index cannot be written to index_path.
    """
    # Let errors propagate instead of returning them, so a failed save is visible to the caller.
    vectorstore.save_local(index_path)
    print("vector Store saved successfully")

def load_vectorstore():
    """
    Load an existing FAISS vector store.
    
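    Returns:
        FAISS: The loaded vector store.
    
    Raises:
        Exception: If the index cannot be read from index_path.
    """
    # Let errors propagate: returning the exception object would only fail later,
    # when the caller tries to use it as a vector store.
    print("loading vector Store...")
    # allow_dangerous_deserialization unpickles the saved index; only load files you created yourself.
    vs = FAISS.load_local(index_path, embeddings=embeddings, allow_dangerous_deserialization=True)
    print("loaded successfully")
    return vs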

# Load or create vector store
if os.path.exists(index_path) and any(os.listdir(index_path)):
    vectorstore = load_vectorstore()
    vectorstore_retriever = vectorstore.as_retriever(search_kwargs={'k': 5})
    print("Vector store loaded successfully.")
else:
    create_vectorstore(documents)
    vectorstore = load_vectorstore()
    print(vectorstore)
    vectorstore_retriever = vectorstore.as_retriever(search_kwargs={'k': 5})
    print("Created and loaded new vector store.")

Vector Store created Successfully
vector Store saved successfully
loading vector Store...
loaded successfully
<langchain_community.vectorstores.faiss.FAISS object at 0x00000208E60FECE0>
Created and loaded new vector store.


### Explanation:

- Document Loading: Uses DirectoryLoader with PyMuPDFLoader to load PDFs from the data directory.
- Document Splitting: Splits documents into chunks (4000 characters, 200 overlap) for efficient retrieval.
- Vector Store: Creates a FAISS index from document embeddings or loads an existing one from the index directory.
- Retriever: Configures the vector store as a retriever, fetching the top 5 relevant documents for a query.
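
Conceptually, fetching the top 5 documents is a nearest-neighbour search over embedding vectors. The sketch below shows that idea with a brute-force search in plain Python. It is not how FAISS is implemented, Euclidean distance is assumed as the similarity measure (the actual index may be configured differently), and the 2-D vectors are made up for the example.

```python
import math


def top_k_nearest(query_vec, doc_vecs, k=5):
    """Return the indices of the k document vectors closest to query_vec (Euclidean distance)."""
    distances = [(math.dist(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    distances.sort()  # smallest distance first
    return [idx for _, idx in distances[:k]]


# Made-up 2-D "embeddings"; real embeddings have hundreds or thousands of dimensions.
doc_vecs = [(0.0, 0.0), (1.0, 1.0), (0.9, 1.1), (5.0, 5.0)]
query_vec = (1.0, 1.0)

nearest = top_k_nearest(query_vec, doc_vecs, k=2)
print(nearest)
```

A larger `k` gives the generator more context to work with but also admits more loosely related chunks, which tends to lower context precision.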

## RAG Chain Setup
This section defines a RAG chain that validates queries, retrieves relevant documents, and generates answers using the AzureChatOpenAI model.

In [8]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

def validate_query(query):
    """
    Validates a user's query by ensuring it is not empty and has at least 15 characters.
    
    Args:
        query (str): The input query.
    
    Returns:
        str: The query, if valid.
    
    Raises:
        ValueError: If the query is empty or shorter than 15 characters.
    """
    # Raise instead of returning the error text, so an invalid query is never
    # sent to the LLM as if it were the question. create_rag_chain catches the
    # exception and returns its message.
    if not query:
        raise ValueError("Query cannot be empty, enter a valid query.")
    if len(query) < 15:
        raise ValueError("Query is too short, enter a valid query.")
    return query

def create_rag_chain(query, relevant_documents):
    """
    Creates and executes a RAG chain to answer a query using retrieved documents.
    
    Args:
        query (str): The user query.
        relevant_documents (list): List of retrieved document chunks.
    
    Returns:
        str: The generated response or an error message.
    """
    try:
        prompt_template = """
        Only based on the provided documents, answer the question in points. Do not mention from which document the answer is derived.
        Question: {query}
        Documents: {docs}
        Note: You are a Finance Assistant. If the query is not related to FINANCE or the documents do not provide the necessary information, return "Invalid Query".
        """
        prompt = ChatPromptTemplate.from_template(prompt_template)
        valid_query = validate_query(query)
        rag_chain = prompt | llm | StrOutputParser()
        return rag_chain.invoke({"query": valid_query, "docs": relevant_documents})
    except Exception as e:
        return str(e)

# Test the RAG chain
query = "What is Large cap equity market?"
relevant_documents = vectorstore_retriever.invoke(query)
response = create_rag_chain(query, relevant_documents)
print("RAG Chain Response:")
print(response)

RAG Chain Response:
- Large cap equity market refers to the segment of the stock market that comprises companies with large market capitalizations, typically defined as companies with a market cap of over $10 billion.
- Investments in large cap equities are often considered less volatile compared to mid-cap or small-cap equities, as larger companies tend to be more stable and have established business models.
- Large cap companies are often leaders in their industry, which can provide a degree of safety and reliability for investors.
- The large cap equity market can include a variety of sectors, allowing for diversification within a portfolio.
- Investors in large cap equities often seek long-term growth and may also benefit from dividends, as many large cap companies distribute a portion of their profits to shareholders.


### Explanation:

- Query Validation: Ensures the query is non-empty and at least 15 characters long.
- RAG Chain: Constructs a prompt that instructs the model to answer in bullet points, using only the retrieved documents, and to return "Invalid Query" if the query is unrelated to finance or unsupported by the documents.
- Execution: Combines the prompt, AzureChatOpenAI model, and string output parser to generate a response.
- Test: Runs a sample query to verify the RAG chain's functionality.
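
`create_rag_chain` passes the list of retrieved documents straight into the prompt, so each document is likely rendered with its full string representation, metadata included. One optional refinement is to join only the chunk text before filling the `{docs}` placeholder. The sketch below assumes that approach; `Doc` is a hypothetical stand-in for a LangChain document, `format_docs` is a hypothetical helper, and the sample text is invented.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (page_content plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs, separator="\n\n"):
    """Join the text of each retrieved chunk, dropping metadata and blank chunks."""
    return separator.join(d.page_content.strip() for d in docs if d.page_content.strip())


# Invented sample chunks, for illustration only.
docs = [
    Doc("Large cap funds invest in large companies.", {"source": "a.pdf"}),
    Doc("   ", {"source": "b.pdf"}),
    Doc("Charges reduce returns over time.", {"source": "c.pdf"}),
]

formatted = format_docs(docs)
print(formatted)
```

Keeping file paths and page numbers out of the prompt also makes it easier for the model to follow the instruction not to mention source documents.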

## Generating Synthetic Test Data with RAGAS
To evaluate the RAG system, we need a test dataset with questions, answers, contexts, and ground truth. RAGAS's TestsetGenerator can create synthetic data from documents.

In [9]:
import random
from ragas.testset import TestsetGenerator
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

# Wrap AzureChatOpenAI for RAGAS compatibility
evaluator_llm = LangchainLLMWrapper(llm)
evaluator_embeddings = LangchainEmbeddingsWrapper(embeddings)

# Configure the test set generator
testset_generator = TestsetGenerator(
    llm=evaluator_llm,
    embedding_model=evaluator_embeddings
)

# Randomly sample a subset of documents (e.g., 30 of the 43 chunks)
sample_size = 30  # Adjust based on your needs
random.seed(42)  # For reproducibility
sampled_documents = random.sample(documents, min(sample_size, len(documents)))

# Request 10 test samples (the generator may return fewer)
testset = testset_generator.generate_with_langchain_docs(sampled_documents, 10)




Applying HeadlinesExtractor:   0%|          | 0/15 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/30 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/29 [00:00<?, ?it/s]

Property 'summary' already exists in node 'd09267'. Skipping!
Property 'summary' already exists in node 'da1260'. Skipping!
Property 'summary' already exists in node 'c15e63'. Skipping!
Property 'summary' already exists in node '4a1a07'. Skipping!
Property 'summary' already exists in node '1d8526'. Skipping!
Property 'summary' already exists in node '1a974d'. Skipping!
Property 'summary' already exists in node 'e66609'. Skipping!
Property 'summary' already exists in node '337305'. Skipping!
Property 'summary' already exists in node '516201'. Skipping!
Property 'summary' already exists in node '7b118e'. Skipping!
Property 'summary' already exists in node '42fe63'. Skipping!
Property 'summary' already exists in node 'a5d854'. Skipping!
Property 'summary' already exists in node 'f3cc73'. Skipping!
Property 'summary' already exists in node '7a0e15'. Skipping!


Applying CustomNodeFilter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying EmbeddingExtractor:   0%|          | 0/29 [00:00<?, ?it/s]

Property 'summary_embedding' already exists in node 'da1260'. Skipping!
Property 'summary_embedding' already exists in node 'd09267'. Skipping!
Property 'summary_embedding' already exists in node '1d8526'. Skipping!
Property 'summary_embedding' already exists in node '1a974d'. Skipping!
Property 'summary_embedding' already exists in node 'a5d854'. Skipping!
Property 'summary_embedding' already exists in node 'c15e63'. Skipping!
Property 'summary_embedding' already exists in node '7b118e'. Skipping!
Property 'summary_embedding' already exists in node '337305'. Skipping!
Property 'summary_embedding' already exists in node '42fe63'. Skipping!
Property 'summary_embedding' already exists in node 'f3cc73'. Skipping!
Property 'summary_embedding' already exists in node 'e66609'. Skipping!
Property 'summary_embedding' already exists in node '516201'. Skipping!
Property 'summary_embedding' already exists in node '4a1a07'. Skipping!
Property 'summary_embedding' already exists in node '7a0e15'. Sk

Applying ThemesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying NERExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CosineSimilarityBuilder:   0%|          | 0/1 [00:00<?, ?it/s]

Applying OverlapScoreBuilder:   0%|          | 0/1 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/4 [00:00<?, ?it/s]

In [10]:
testset.samples

[TestsetSample(eval_sample=SingleTurnSample(user_input='What happens if FIL Investment Services (UK) Limited is unable to pay out?', retrieved_contexts=None, reference_contexts=['Key Information Document Fidelity European Trust PLC Ordinary Shares Performance Scenarios Market developments in the future cannot be accurately predicted. The scenarios shown are only an indication of some of the Market developments in the future cannot be accurately predicted. The scenarios shown are only an indication of some of the Market developments in the future cannot be accurately predicted. The scenarios shown are only an indication of some of the Market developments in the future cannot be accurately predicted. The scenarios shown are only an indication of some of the possible outcomes based on recent returns. Actual returns could be lower. possible outcomes based on recent returns. Actual returns could be lower. possible outcomes based on recent returns. Actual returns could be lower. possible out

In [11]:

# Convert test dataset to evaluation format
eval_data = {
    "question": [],
    "answer": [],
    "contexts": [],
    "ground_truth": []
}

for testcase in testset.samples:
    relevant_docs = vectorstore_retriever.invoke(testcase.eval_sample.user_input)
    answer = create_rag_chain(testcase.eval_sample.user_input, relevant_docs)
    eval_data["question"].append(testcase.eval_sample.user_input)
    eval_data["answer"].append(answer)
    eval_data["contexts"].append([doc.page_content for doc in relevant_docs])
    eval_data["ground_truth"].append(testcase.eval_sample.reference)

print(f"Generated {len(eval_data['question'])} test cases.")

Generated 4 test cases.


In [12]:
eval_data

{'question': ['What happens if FIL Investment Services (UK) Limited is unable to pay out?',
  'What happens if FIL Investmnt Servises (UK) Limited is unable to pay out?',
  'How does the Financial Services Compensation Scheme relate to investments in mutual funds?',
  'What are the potential costs over time associated with investing £10,000 in the Fidelity European Trust PLC Ordinary Shares, and how do these costs impact the returns under different performance scenarios as outlined in the Key Information Document?'],
 'answer': ['- If FIL Investment Services (UK) Limited is unable to pay out, it does not impact the settlement of a payment for the sale of shares, as shares of the Company are traded on the stock market independently.\n- Shares in the investment trust company are not directly covered by the Financial Services Compensation Scheme.',
  '- If FIL Investment Services (UK) Limited is unable to pay out, it does not impact the settlement of a payment for the sale of shares, as s

### Explanation:

- TestsetGenerator: Uses the wrapped AzureChatOpenAI model to generate test cases and AzureOpenAIEmbeddings to embed the documents.
- Test Data Generation: Builds test cases from a random sample of up to 30 document chunks. The run above requested 10 test cases and produced 4.
- Evaluation Dataset: For each test case, retrieves relevant documents, generates an answer using the RAG chain, and collects the question, answer, contexts, and ground truth.
- Output: Stores the data in a dictionary format suitable for RAGAS evaluation.
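
Because the evaluation data is a dictionary of parallel lists, a quick structural check before converting it can catch mismatched lengths or empty answers early. The following is a minimal sketch under that assumption. `check_eval_data` is a hypothetical helper, and the toy dictionary is invented to trigger two of the checks.

```python
REQUIRED_COLUMNS = ("question", "answer", "contexts", "ground_truth")


def check_eval_data(eval_data, required=REQUIRED_COLUMNS):
    """Return a list of problems found in a column-oriented evaluation dict (empty if none)."""
    problems = [f"missing column: {c}" for c in required if c not in eval_data]
    lengths = {c: len(eval_data[c]) for c in required if c in eval_data}
    if len(set(lengths.values())) > 1:
        problems.append(f"column lengths differ: {lengths}")
    # Each entry in "contexts" should be a list of strings (one per retrieved chunk).
    for i, ctx in enumerate(eval_data.get("contexts", [])):
        if not isinstance(ctx, list) or not all(isinstance(x, str) for x in ctx):
            problems.append(f"contexts[{i}] is not a list of strings")
    # Each answer should be a non-empty string.
    for i, ans in enumerate(eval_data.get("answer", [])):
        if not isinstance(ans, str) or not ans.strip():
            problems.append(f"answer[{i}] is empty or not a string")
    return problems


# Invented toy data with two deliberate problems.
example = {
    "question": ["q1", "q2"],
    "answer": ["a1", ""],
    "contexts": [["c1"], "not-a-list"],
    "ground_truth": ["g1", "g2"],
}

issues = check_eval_data(example)
for issue in issues:
    print(issue)
```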

## RAG Evaluation with RAGAS
This section evaluates the RAG system using RAGAS metrics: faithfulness, answer relevancy, context precision, and context recall.

In [13]:
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

# Convert evaluation data to Hugging Face Dataset
eval_dataset = Dataset.from_dict(eval_data)

# Wrap AzureChatOpenAI for RAGAS compatibility
evaluator_llm = LangchainLLMWrapper(llm)
evaluator_embeddings = LangchainEmbeddingsWrapper(embeddings)

# Run evaluation
results = evaluate(
    dataset=eval_dataset,
    metrics=[
        faithfulness,       # Checks if the answer is grounded in the context
        answer_relevancy,   # Checks if the answer addresses the question
        context_precision,  # Checks if retrieved context is relevant
        context_recall      # Checks if all necessary information is retrieved
    ],
    llm=evaluator_llm,
    embeddings=evaluator_embeddings,
    show_progress=True
)

# Print evaluation results
print("RAGAS Evaluation Results:")
print(results)



Evaluating:   0%|          | 0/16 [00:00<?, ?it/s]

Exception raised in Job[1]: IndexError(list index out of range)
Exception raised in Job[13]: IndexError(list index out of range)
Exception raised in Job[9]: IndexError(list index out of range)
Exception raised in Job[5]: IndexError(list index out of range)


RAGAS Evaluation Results:
{'faithfulness': 0.5833, 'answer_relevancy': nan, 'context_precision': 0.6708, 'context_recall': 0.7500}


The four metrics compare different pairs of the evaluation inputs: the query, the reference (ground truth), the retrieved context, and the generated answer.

- faithfulness = generated answer vs. retrieved context. Shows how well the RAG prompt and the LLM keep the answer grounded in what was retrieved.
- answer_relevancy = generated answer vs. query. Shows how well the overall RAG pipeline addresses the question.
- context_precision = reference (ground truth) vs. retrieved context. Shows how well the retriever avoids irrelevant documents: the higher the context precision, the fewer irrelevant documents are fetched.
- context_recall = reference (ground truth) vs. retrieved context. Shows how much of the ground-truth information the retriever fetched: the higher the context recall, the more of the relevant information is retrieved.
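
The effect of irrelevant documents on context precision can be seen in a small rank-weighted calculation. The sketch below averages precision@k over the ranks that hold a relevant chunk, which mirrors the general shape of the formula RAGAS documents for this metric. Treat it as an approximation of the idea rather than the library's implementation. The 0/1 relevance flags are made up, whereas in RAGAS they come from an LLM judge.

```python
def context_precision_at_k(relevance):
    """Average precision@k over the ranks k that hold a relevant chunk.

    relevance: list of 0/1 flags for the retrieved chunks, in rank order.
    """
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    score = 0.0
    hits = 0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            score += hits / rank  # precision@rank, counted only at relevant ranks
    return score / total_relevant


relevant_first = context_precision_at_k([1, 0, 1, 0, 0])
relevant_last = context_precision_at_k([0, 0, 1, 0, 1])
print(f"relevant chunks ranked early: {relevant_first:.4f}")
print(f"relevant chunks ranked late:  {relevant_last:.4f}")
```

Both lists contain two relevant chunks out of five, yet the score is higher when those chunks appear near the top. Under this formulation the metric rewards ranking as well as filtering.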

### Explanation:

- Dataset Conversion: Converts the evaluation data into a Hugging Face Dataset for RAGAS.
- LLM Wrapper: Wraps AzureChatOpenAI with LangchainLLMWrapper for compatibility with RAGAS.
- Metrics: Evaluates the RAG system on:
    - Faithfulness: Ensures answers are factually consistent with the context.
    - Answer Relevancy: Measures how well answers address the query.
    - Context Precision: Assesses the relevance of retrieved documents.
    - Context Recall: Checks if all necessary information is retrieved (uses ground truth).
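- Results: Outputs scores (0 to 1) for each metric, where higher is better. In the run above, answer_relevancy is nan: four jobs raised IndexError (one per test case), so no score was computed for that metric.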
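
When a metric fails, its score comes back as NaN, which silently poisons any average taken over the results. A minimal sketch of separating computed metrics from failed ones is shown below. `split_scores` is a hypothetical helper, and the scores are copied by hand from the output above rather than read from the RAGAS result object.

```python
import math


def split_scores(scores):
    """Separate metrics that have a numeric score from those that came back as NaN."""
    computed = {name: value for name, value in scores.items() if not math.isnan(value)}
    failed = [name for name, value in scores.items() if math.isnan(value)]
    return computed, failed


# Scores copied from the evaluation output above.
scores = {
    "faithfulness": 0.5833,
    "answer_relevancy": float("nan"),
    "context_precision": 0.6708,
    "context_recall": 0.7500,
}

computed, failed = split_scores(scores)
print("computed:", computed)
print("not computed:", failed)
```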