<a href="https://mng.bz/8wdg" target="_blank">
    <img src="../../Assets/Images/NewMEAPHeader.png" alt="New MEAP" style="width: 100%;" />
</a>


# Chapter 05 - RAG Evaluation: Accuracy, Relevance, Faithfulness

### Welcome to Chapter 5 of A Simple Guide to Retrieval Augmented Generation.

In this chapter, we will assess the quality of the RAG pipeline we built in Chapters 3 and 4. We will reuse the [knowledge base](../../Assets/Data/) we created from the Wikipedia article, along with the retrieval, augmentation and generation functions we built in Chapter 4.

## Installing Dependencies

All the libraries needed to run this notebook, along with their versions, are listed in the __requirements.txt__ file in the root directory of this repository.

Go to the root directory and run the following command to install them:

```
pip install -r requirements.txt
```

This is the recommended method of installing the dependencies.

___
Alternatively, you can run the command from this notebook. The relative path may vary depending on where the notebook is located.

In [1]:
%pip install -r ../../requirements.txt --quiet


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.0[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


## 1. Re-Load the RAG Pipeline

In Chapter 4, we created the generation pipeline. We will bring it here to use for evaluation.

In Chapter 3, we indexed the Wikipedia page for the 2023 Cricket World Cup. We used OpenAI embeddings to encode the text, FAISS as the vector index to store the embeddings, and saved the FAISS index in a local directory. We will use this index in the RAG pipeline.

Note: You will need an __OpenAI API Key__ which can be obtained from [OpenAI](https://platform.openai.com/api-keys) to reuse the embeddings.

To initialize the __OpenAI client__, we need to pass the API key. There are several ways of doing this.

#### [Option 1] Create a .env file to store the API key and load it from there (Recommended)

Install the __dotenv__ library

_The dotenv library is a popular tool used in various programming languages, including Python and Node.js, to manage environment variables in development and deployment environments. It allows developers to load environment variables from a .env file into their application's environment._

- Create a file named .env in the root directory of your project.
- Inside the .env file, define environment variables in the format VARIABLE_NAME=value.

e.g.

OPENAI_API_KEY=YOUR API KEY

In [2]:
from dotenv import load_dotenv
import os

if load_dotenv():
    print("Success: .env file found with some environment variables")
else:
    print("Caution: No environment variables found. Please create .env file in the root directory or add environment variables in the .env file")

Success: .env file found with some environment variables


#### [Option 2] Alternatively, you can set the API key in code. 
However, this is not recommended since it can leave your key exposed for potential misuse. Uncomment the cell below to use this method.

In [3]:
#import os
# os.environ["OPENAI_API_KEY"] = "sk-proj-******" #Imp : Replace with an OpenAI API Key

We can also test whether the key is valid.

In [3]:
import os

import openai
from openai import OpenAI

# Use .get() so that a missing key reaches the else branch instead of raising KeyError
api_key = os.environ.get("OPENAI_API_KEY")

if api_key:
    client = OpenAI()
    try:
        client.models.list()
        print("OPENAI_API_KEY is set and is valid")
    # Catch the specific subclasses first; APIError is their base class
    except openai.RateLimitError as e:
        print(f"OpenAI API request exceeded rate limit: {e}")
    except openai.APIConnectionError as e:
        print(f"Failed to connect to OpenAI API: {e}")
    except openai.APIError as e:
        print(f"OpenAI API returned an API Error: {e}")
else:
    print("Please set your OpenAI API key as an environment variable OPENAI_API_KEY")



OPENAI_API_KEY is set and is valid


The RAG pipeline takes three inputs:
1. User Query
2. Location of the Vector Index (Knowledge base)
3. Index Name

It generates an answer and returns it along with the retrieved documents.


#### RAG function

In [4]:
import re
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS

# Function to clean text
def clean_text(text):
    # Replace non-breaking space with regular space
    text = text.replace('\xa0', ' ')
    
    # Remove any HTML tags (if any)
    text = re.sub(r'<[^>]+>', '', text)  # Removes HTML tags
    
    # Remove references in brackets (e.g., [7], [39])
    text = re.sub(r'\[.*?\]', '', text)  # Removes references inside square brackets
    
    # Remove extra spaces and newlines
    text = ' '.join(text.split())  # This will remove extra spaces and newline characters
    
    return text

def rag_function(query, db_path, index_name):
    embeddings=OpenAIEmbeddings(model="text-embedding-3-small")

    db=FAISS.load_local(folder_path=db_path, index_name=index_name, embeddings=embeddings, allow_dangerous_deserialization=True)

    retrieved_docs = db.similarity_search(query, k=2)

    retrieved_context=[clean_text(retrieved_docs[0].page_content + retrieved_docs[1].page_content)]


    augmented_prompt=f"""

    Given the context below answer the question.

    Question: {query} 

    Context : {retrieved_context}

    Remember to answer only based on the context provided and not from any other source. 

    If the question cannot be answered based on the provided context, say I don’t know.

    """

    llm = ChatOpenAI(
        model="gpt-4o-mini",
        temperature=0,
        max_tokens=None,
        timeout=None,
        max_retries=2
    )

    messages=[("human",augmented_prompt)]

    ai_msg = llm.invoke(messages)

    response=ai_msg.content

    return retrieved_context, response
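
Before calling the pipeline, it can help to see what the cleaning step does to a piece of text. The block below is a minimal sketch: it restates the same three steps as `clean_text` so that it runs on its own, and both input strings are hypothetical examples, not text taken from the knowledge base.

```python
import re

def clean_text(text):
    # Same steps as the notebook's clean_text, restated so this block is self-contained
    text = text.replace('\xa0', ' ')       # non-breaking space -> regular space
    text = re.sub(r'<[^>]+>', '', text)    # drop HTML tags
    text = re.sub(r'\[.*?\]', '', text)    # drop anything inside square brackets
    return ' '.join(text.split())          # collapse whitespace and newlines

# Hypothetical input that mimics scraped Wikipedia text
raw = "Australia\xa0won the <b>final</b>[7] by six\n wickets."
cleaned = clean_text(raw)
print(cleaned)

# The bracket rule is not limited to citation markers: it removes any [...] span
side_effect = clean_text("scores = [1, 2, 3] and window.RLQ||[]")
print(side_effect)
```

The second call shows a side effect worth keeping in mind: because the pattern removes every bracketed span, list literals and empty brackets in leftover page script are stripped as well.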



Let's try sending our question to this function.

In [5]:
rag_function(query="Who won the world cup?", db_path="../../Assets/Data", index_name="CWC_index")

(['The tournament was contested by ten national teams, maintaining the same format used in 2019 . After six weeks of round-robin matches, India , South Africa , Australia , and New Zealand finished as the top four and qualified for the knockout stage. In the knockout stage, India and Australia beat New Zealand and South Africa, respectively, to advance to the final, played on 19 November at the Narendra Modi Stadium in Ahmedabad . Australia won the final by six wickets, winning their sixth Cricket World Cup title.The host India was the first team to qualify for the semi-finals after their 302-run win against Sri Lanka , their seventh successive win in the World Cup. India secured the top place amongst the semi-finalists after they beat South Africa by 243 runs on 5 November at Eden Gardens in Kolkata .'],
 'Australia won the world cup.')

Let's ask another one.

In [6]:
rag_function(query="What was Virat Kohli's achievement in the Cup?",db_path="../../Assets/Data", index_name="CWC_index")

(['Virat Kohli was named the player of the tournament and also scored the most runs, while Mohammed Shami was the leading wicket-taker. A total of 1,250,307 spectators attended the matches, the highest number in any Cricket World Cup to date. The tournament final set viewership records in India, drawing 518 million viewers, with a peak of 57 million streaming viewers.The ICC announced its team of the tournament on 21 November 2023, with Virat Kohli being named as player of the tournament , and Rohit Sharma as captain of the team.'],
 'Virat Kohli was named the player of the tournament and scored the most runs.')

We can also try asking a question that is outside the scope of our knowledge base.

In [7]:
rag_function(query="What RAG?",db_path="../../Assets/Data", index_name="CWC_index")

(['(RLQ=window.RLQ||).push(function(){mw.config.set({"wgHostname":"mw-web.codfw.main-85db9df4c9-86vj4","wgBackendResponseTime":174,"wgPageParseReport":{"limitreport":{"cputime":"2.102","walltime":"2.387","ppvisitednodes":{"value":29880,"limit":1000000},"postexpandincludesize":{"value":547658,"limit":2097152},"templateargumentsize":{"value":113569,"limit":2097152},"expansiondepth":{"value":13,"limit":100},"expensivefunctioncount":{"value":22,"limit":500},"unstrip-depth":{"value":1,"limit":20},"unstrip-size":{"value":312186,"limit":5000000},"entityaccesscount":{"value":1,"limit":400},"timingprofile":["100.00% 1812.691 1 -total"," 22.76% 412.523 1 Template:Reflist"," 14.91% 270.321 37 Template:Cite_web"," 11.46% 207.704 58 Template:Single-innings_cricket_match"," 11.12% 201.536 1 Template:2023_CWC_and_2025_ICC_CT_sidebar"," 10.94% 198.332 1 Template:Sidebar_with_collapsible_lists"," 7.79% 141.132 96 Template:Cr"," 7.15% 129Background Host selection'],
 'I don’t know.')

For some questions, the response may be "I don't know". This happens when the LLM cannot find an answer in the retrieved context, because the prompt in our augmentation step instructs it to say so. But how good is this system? We need a way to evaluate it.

## 2. RAGAs Framework

[Ragas](https://docs.ragas.io/en/stable/) is a framework that helps you evaluate your Retrieval Augmented Generation (RAG) pipelines. It has been developed by the good folks at [exploding gradients](https://github.com/explodinggradients).

We will look at this evaluation in two parts.

1. Creation of synthetic test data for evaluation.
2. Calculation of evaluation metrics.

### 2.1 Creation of Synthetic Data

Synthetic data generation uses LLMs to generate diverse questions and answers from the documents in the knowledge base. Using those documents as context, LLMs can be prompted to create different types of questions, such as simple, multi-context, conditional and reasoning questions.

<img src="../../Assets/Images/5.1.png" width=70%>

In [8]:
from langchain_community.document_loaders import AsyncHtmlLoader
from langchain_community.document_transformers import Html2TextTransformer

# URL of the Wikipedia page on the 2023 Cricket World Cup
url = "https://en.wikipedia.org/wiki/2023_Cricket_World_Cup"

# Instantiate the AsyncHtmlLoader
loader = AsyncHtmlLoader(url)

# Load the raw HTML of the page
html_data = loader.load()

# Instantiate the Html2TextTransformer
html2text = Html2TextTransformer()

# Convert the HTML documents to plain text
html_data_transformed = html2text.transform_documents(html_data)

USER_AGENT environment variable not set, consider setting it to identify your requests.
Fetching pages: 100%|##########| 1/1 [00:00<00:00,  3.25it/s]


In [13]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper


generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings(model="text-embedding-3-small"))

In [15]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(html_data_transformed, testset_size=10)

Generating personas: 100%|██████████| 1/1 [00:01<00:00,  1.06s/it]                                           
Generating Scenarios: 100%|██████████| 2/2 [00:07<00:00,  3.67s/it]
Generating Samples: 100%|██████████| 10/10 [00:02<00:00,  3.40it/s]


In [16]:
sample_queries = dataset.to_pandas()['user_input'].to_list()

In [17]:
expected_responses=dataset.to_pandas()['reference'].to_list()

In [19]:
dataset_to_eval=[]

for query, reference in zip(sample_queries,expected_responses):
    rag_call_response=rag_function(query=query, db_path="../../Assets/Data/", index_name="CWC_index")
    relevant_docs=rag_call_response[0]
    response=rag_call_response[1]
    dataset_to_eval.append(
        {
            "user_input":query,
            "retrieved_contexts":relevant_docs,
            "response":response,
            "reference":reference
        }
    )
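
The loop above builds one record per test question, using the field names `user_input`, `retrieved_contexts`, `response` and `reference`. The block below is a small sketch of the same pattern that runs without an API key. The helper `build_eval_records`, the stand-in `stub_rag` and the sample strings are all hypothetical and exist only for illustration; in the notebook, the real `rag_function` plays the role of the stub.

```python
def build_eval_records(queries, references, rag_fn):
    """Pair each test query with the pipeline output, using the field names from the loop above."""
    records = []
    for query, reference in zip(queries, references):
        contexts, response = rag_fn(query)
        records.append(
            {
                "user_input": query,
                "retrieved_contexts": contexts,  # a list of strings
                "response": response,
                "reference": reference,
            }
        )
    return records

# Hypothetical stand-in for rag_function so that the sketch needs no API key
def stub_rag(query):
    return ["Australia won the final by six wickets."], "Australia won the world cup."

records = build_eval_records(
    queries=["Who won the world cup?"],
    references=["Australia won the 2023 Cricket World Cup."],  # hypothetical reference answer
    rag_fn=stub_rag,
)
print(records[0])
```

Keeping the retrieved context as a list of strings matters here, which is why `rag_function` wraps the cleaned text in a list before returning it.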


In [21]:
from ragas import EvaluationDataset
evaluation_dataset = EvaluationDataset.from_list(dataset_to_eval)


In [22]:
from ragas import evaluate
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, AnswerCorrectness, ResponseRelevancy

# LLM that Ragas uses as the judge for the metrics below
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), AnswerCorrectness(), ResponseRelevancy(), FactualCorrectness()],
    llm=evaluator_llm,
)


Evaluating: 100%|██████████| 50/50 [00:40<00:00,  1.23it/s]



In [23]:
print(result)

{'context_recall': 0.3867, 'faithfulness': 0.8000, 'answer_correctness': 0.5802, 'answer_relevancy': 0.5674, 'factual_correctness': 0.3810}


___
The results above suggest that the pipeline performs well on __faithfulness__ (0.80), while the other metrics are low: context recall and factual correctness are both below 0.4. How can we improve these metrics? We will look at advanced pre-retrieval, retrieval and post-retrieval strategies in the next chapter.
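
To make two of these numbers more concrete: faithfulness is the share of claims in the response that are supported by the retrieved context, and context recall is the share of claims in the reference answer that are supported by the retrieved context. The sketch below only illustrates that arithmetic. The helper `supported_fraction` and the hand-labelled verdict lists are hypothetical; Ragas obtains such verdicts from the evaluator LLM, and returning 0.0 for an empty list is a choice made for this sketch, not Ragas behaviour.

```python
def supported_fraction(verdicts):
    """Share of claims judged as supported (True) out of all claims."""
    if not verdicts:
        return 0.0  # sketch-only convention for an empty claim list
    return sum(verdicts) / len(verdicts)

# Hypothetical, hand-labelled verdicts for a single sample
response_claims_supported_by_context = [True, True, True, True, False]
reference_claims_supported_by_context = [True, False, False]

faithfulness = supported_fraction(response_claims_supported_by_context)
context_recall = supported_fraction(reference_claims_supported_by_context)

print(f"faithfulness={faithfulness:.4f}, context_recall={context_recall:.4f}")
```

Read this way, a high faithfulness score with a low context recall score would mean that the answers stay close to what was retrieved, but the retrieved text often misses information that the reference answer needs.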

---

<img src="../../Assets/Images/profile_s.png" width=100> 

Hi! I'm Abhinav! I am an entrepreneur and Vice President of Artificial Intelligence at Yarnit. I have spent over 15 years in consulting and leadership roles in data science, machine learning and AI. My current focus is applied Generative AI, solving enterprise needs through contextual intelligence. I'm passionate about AI advancements and constantly explore emerging technologies to push the boundaries and create a positive impact in the world. Let’s build the future, together!

[If you haven't already, please subscribe to the MEAP of A Simple Guide to Retrieval Augmented Generation here](https://mng.bz/8wdg)

<a href="https://mng.bz/8wdg" target="_blank">
    <img src="../../Assets/Images/NewMEAPFooter.png" alt="New MEAP" style="width: 100%;" />
</a>

#### If you'd like to chat, I'd be very happy to connect

[![GitHub followers](https://img.shields.io/badge/Github-000000?style=for-the-badge&logo=github&logoColor=black&color=orange)](https://github.com/abhinav-kimothi)
[![LinkedIn](https://img.shields.io/badge/LinkedIn-000000?style=for-the-badge&logo=linkedin&logoColor=orange&color=black)](https://www.linkedin.com/comm/mynetwork/discovery-see-all?usecase=PEOPLE_FOLLOWS&followMember=abhinav-kimothi)
[![Medium](https://img.shields.io/badge/Medium-000000?style=for-the-badge&logo=medium&logoColor=black&color=orange)](https://medium.com/@abhinavkimothi)
[![Insta](https://img.shields.io/badge/Instagram-000000?style=for-the-badge&logo=instagram&logoColor=orange&color=black)](https://www.instagram.com/akaiworks/)
[![Mail](https://img.shields.io/badge/email-000000?style=for-the-badge&logo=gmail&logoColor=black&color=orange)](mailto:abhinav.kimothi.ds@gmail.com)
[![X](https://img.shields.io/badge/Follow-000000?style=for-the-badge&logo=X&logoColor=orange&color=black)](https://twitter.com/abhinav_kimothi)
[![Linktree](https://img.shields.io/badge/Linktree-000000?style=for-the-badge&logo=linktree&logoColor=black&color=orange)](https://linktr.ee/abhinavkimothi)
[![Gumroad](https://img.shields.io/badge/Gumroad-000000?style=for-the-badge&logo=gumroad&logoColor=orange&color=black)](https://abhinavkimothi.gumroad.com/)

---