[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/VectorInstitute/rag-bootcamp/blob/uv-migration/implementations/rag_evaluation/rag_evaluation_testset_generation.ipynb)

# RAG Evaluation Test Set Generation

> ⚠️ **Important Note**  
> This notebook is **not compatible with the latest version of `ragas`**.  
> We are currently in the process of updating it to support recent changes in the `ragas` API.


This example shows how to use the [Ragas](https://docs.ragas.io/en/stable/) framework (`v0.1.22`) to generate a **test set** for evaluating the quality of a RAG pipeline. We then use the Python [LangChain](https://python.langchain.com/docs/introduction/) library to run the test questions through the pipeline and evaluate the quality of the answers.


### 📝 Requirements

To run this notebook, you will need:

- **OpenAI API key**:  
    - Sign up at [OpenAI](https://platform.openai.com/) and create an API key

## Set up the RAG workflow environment

#### Install libraries (Only in Google Colab)

In [1]:
import os

if 'COLAB_RELEASE_TAG' in os.environ:
    # This is a Google Colab environment
    
    # Check if the notebook is running in a GPU environment and install the appropriate version of faiss
    if 'COLAB_GPU' in os.environ:
        !pip3 install faiss-gpu
    else:
        !pip3 install faiss-cpu

    # Install other dependencies
    !pip3 install datasets langchain langchain-community langchain-openai langchain-huggingface ragas==0.1.22 # aieng-rag-utils

#### Import libraries

In [2]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
import numpy as np
import os

from aieng.rag.utils import get_device_name
from aieng.rag.utils.search import DocumentReader, download_file

from langchain.chains import RetrievalQA
from langchain_community.vectorstores import FAISS
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

from ragas import evaluate
from ragas.metrics import Faithfulness, ContextPrecision, AnswerCorrectness
from ragas.testset import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

#### Load OpenAI env variables

In [5]:
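# Fall back to defaults when the variables are unset, and write them back to
# os.environ because later cells read os.environ[...] directly.
OPENAI_BASE_URL = os.environ.setdefault("OPENAI_BASE_URL", "https://api.openai.com/v1")
OPENAI_API_KEY = os.environ.setdefault("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY")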
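
Because `OPENAI_API_KEY` falls back to a placeholder string when the variable is unset, a missing key would only surface later as an authentication error. The following is a minimal sketch of a fail-fast check. The helper name `require_api_key` is hypothetical and is not part of the notebook or of any library.

```python
def require_api_key(value: str, placeholder: str = "YOUR_OPENAI_API_KEY") -> str:
    """Return the key unchanged, or raise if it is empty or still the placeholder.

    Hypothetical helper for illustration only.
    """
    if not value or value == placeholder:
        raise ValueError(
            "OPENAI_API_KEY is not set; export it before running the notebook."
        )
    return value


# A real-looking key passes through unchanged
checked_key = require_api_key("sk-example-key")
print("Key accepted:", checked_key == "sk-example-key")
```

In the notebook this could be called as `require_api_key(OPENAI_API_KEY)` right after the variables are loaded.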

#### Download source documents

In [3]:
DIRECTORY_PATH = "./source_documents"
DOCUMENT_URL = "https://vectorinstitute.ai/wp-content/uploads/2023/05/vector-institute-2021-22-annual-report_accessible.pdf"

download_file(DOCUMENT_URL, DIRECTORY_PATH)

Downloaded https://vectorinstitute.ai/wp-content/uploads/2023/05/vector-institute-2021-22-annual-report_accessible.pdf to ./source_documents/vector-institute-2021-22-annual-report_accessible.pdf


## Generate a synthetic test set

#### Start by loading in the documents we'll be using to augment our RAG generations

In [None]:
document_reader = DocumentReader(directory_path=DIRECTORY_PATH)
documents, chunks = document_reader.load()

for document in documents:
    document.metadata['file_name'] = document.metadata['source']

#### Now use OpenAI to generate a test set from the data in these documents (This takes about 2-3 minutes)

**Important:** The LLM and embedding model used for test set generation should be more capable than the models being evaluated, so we use OpenAI GPT-4o and OpenAI embeddings for this step.

In [8]:
generator_llm = ChatOpenAI(
    model="gpt-4o",
    base_url=os.environ["OPENAI_BASE_URL"],
    api_key=os.environ["OPENAI_API_KEY"],
)
generator_embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    base_url=os.environ["OPENAI_BASE_URL"],
    api_key=os.environ["OPENAI_API_KEY"],
)

In [None]:
# Create generator with OpenAI model
generator = TestsetGenerator.from_langchain(
    generator_llm=generator_llm,
    critic_llm=generator_llm,
    embeddings=generator_embeddings,
)

# Generate the test set
testset = generator.generate_with_langchain_docs(
    documents=documents, 
    test_size=10,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
)

#### Preview the test dataset so far

In [10]:
testset.to_pandas()

|  | question | contexts | ground_truth | evolution_type | metadata | episode_done |
|---|---|---|---|---|---|---|
| 0 | What is the purpose of the AI for Public Healt... | [33 \n PARTNERSHIPS FOSTER AN AI-FOR-\nHEALTH ... | The AI for Public Health (AI4PH) initiative ai... | simple | [{'source': 'source_documents/vector-institute... | True |
| 1 | What is the purpose of the Vector Faculty Affi... | [19 \nVECTOR FACULTY \nAFFILIATES \nThe Vector... | The Vector Faculty Affiliates Program brings t... | simple | [{'source': 'source_documents/vector-institute... | True |
| 2 | What opportunities does the Digital Talent Hub... | [5 \nAnnual Report 2021–22 Vector Institute\nS... | The Digital Talent Hub offered 3,700+ postings... | simple | [{'source': 'source_documents/vector-institute... | True |
| 3 | How does Vector Institute integrate responsibl... | [35 \nAnnual Report 2021–22 Vector Institute\n... | Vector Institute integrates responsible AI int... | simple | [{'source': 'source_documents/vector-institute... | True |
| 4 | What benefits do scholarship recipients gain f... | [26 \n VECTOR SCHOLARSHIPS IN \nAI ATTRACT TO... | Scholarship recipients gain access to Vector's... | simple | [{'source': 'source_documents/vector-institute... | True |
| 5 | Who backs Vector's AI efforts? | [41 \n \n \n \n \n FINANCIALS \nVector is fu... | Vector's AI efforts are backed by multi-year c... | reasoning | [{'source': 'source_documents/vector-institute... | True |
| 6 | How does the Vector Program boost AI expertise... | [19 \nVECTOR FACULTY \nAFFILIATES \nThe Vector... | The Vector Faculty Affiliates Program boosts A... | reasoning | [{'source': 'source_documents/vector-institute... | True |
| 7 | How does Vector's work with Ontario's AI scene... | [4 \nAnnual Report 2021–22 Vector Institute\n ... | Vector's work with Ontario's AI scene boosts C... | multi_context | [{'source': 'source_documents/vector-institute... | True |
| 8 | How does GEMINI's health data integration with... | [31 \n \n \n NEW DATA SHARING \nAGREEMENTS... | GEMINI's health data integration with Vector's... | multi_context | [{'source': 'source_documents/vector-institute... | True |
| 9 | What is the purpose of the new Chief Data Offi... | [33 \n PARTNERSHIPS FOSTER AN AI-FOR-\nHEALTH ... | The purpose of the new Chief Data Officer role... | simple | [{'source': 'source_documents/vector-institute... | True |
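
To see how closely the generated questions follow the requested `distributions` (`simple: 0.5`, `reasoning: 0.25`, `multi_context: 0.25`), the `evolution_type` column can be tallied. The sketch below hard-codes the ten values from the preview above so that it runs without the test set; on the live object the same list would presumably come from `testset.to_pandas()["evolution_type"]`.

```python
from collections import Counter

# evolution_type values copied from the preview above (rows 0 to 9)
evolution_types = (
    ["simple"] * 5 + ["reasoning"] * 2 + ["multi_context"] * 2 + ["simple"]
)

# Fractions requested in generate_with_langchain_docs(...)
requested = {"simple": 0.5, "reasoning": 0.25, "multi_context": 0.25}

counts = Counter(evolution_types)
total = len(evolution_types)

for name, fraction in requested.items():
    observed = counts[name] / total
    print(f"{name}: requested {fraction:.2f}, observed {observed:.2f} ({counts[name]} of {total})")
```

With only ten questions the requested fractions cannot be met exactly, so small deviations such as these are expected.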


## Now, start the RAG pipeline!

#### Choose the RAG LLM and embedding model
Note: These are different from the OpenAI LLM and embedding model defined above for test set generation.

In [11]:
RAG_LLM_MODEL_NAME = "gpt-4.1"
RAG_EMBEDDING_MODEL_NAME = "BAAI/bge-base-en-v1.5"

#### Generate answers for all the questions in our test set

Go through the embedding, storage and retrieval steps.

In [None]:
print(f"Number of text chunks: {len(chunks)}")

Number of text chunks: 486


In [12]:
device = get_device_name()

# Define the RAG embeddings model (different than the OpenAI embedding model defined above for test set generation)
model_kwargs = {'device': device, 'trust_remote_code': True}
encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity

print("Setting up the RAG embeddings model...")
embeddings = HuggingFaceEmbeddings(
    model_name=RAG_EMBEDDING_MODEL_NAME,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
)

Setting up the RAG embeddings model...
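
The `normalize_embeddings` comment above deserves a short illustration. For unit-length vectors, squared L2 distance and cosine similarity are related by `||a - b||^2 = 2 - 2 * cos(a, b)`, so ranking by smallest L2 distance gives the same order as ranking by largest cosine similarity. The sketch below checks this in pure Python with made-up three-dimensional vectors. It assumes the vector store ranks by L2 distance, which is believed to be the LangChain FAISS default but is not confirmed by this notebook.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a, b):
    """Inner product; equals cosine similarity for unit vectors."""
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Made-up embeddings, for illustration only
query = normalize([1.0, 2.0, 0.5])
docs = {
    "doc_a": normalize([1.0, 1.9, 0.4]),
    "doc_b": normalize([-1.0, 0.2, 3.0]),
    "doc_c": normalize([0.5, 2.0, 2.0]),
}

by_cosine = sorted(docs, key=lambda name: dot(query, docs[name]), reverse=True)
by_l2 = sorted(docs, key=lambda name: squared_l2(query, docs[name]))

print("Ranked by cosine similarity:", by_cosine)
print("Ranked by L2 distance:      ", by_l2)
```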


In [13]:
# Create the vector store and the retriever
vectorstore = FAISS.from_documents(chunks, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

In [14]:
# Define the RAG LLM (different than the OpenAI LLM defined above for test set generation)
print("Setting up the RAG LLM...")
llm = ChatOpenAI(
    model=RAG_LLM_MODEL_NAME,
    temperature=0,
    max_tokens=256,
    base_url=os.environ["OPENAI_BASE_URL"],
    api_key=os.environ["OPENAI_API_KEY"],
)

Setting up the RAG LLM...


Iterate over the questions in our synthetic test set and run each one through the RAG pipeline to see what answers are returned. (This also takes 2-3 minutes.)

In [18]:
dataset = testset.to_dataset()
answers = np.empty(len(dataset), dtype=object)

# Build the RAG pipeline once and reuse it for every question
rag_pipeline = RetrievalQA.from_llm(
    llm=llm,
    retriever=retriever
)

for index, row in enumerate(dataset):
    query = row["question"]

    # Run the query through the RAG pipeline
    answer = rag_pipeline.invoke(input=query)
    answer = answer["result"]
    print(f"Result {index}\nQuestion: {query}\nAnswer: {answer}\n")

    # Store the result
    answers[index] = answer

Result 0
Question: What is the purpose of the AI for Public Health (AI4PH) initiative launched by Vector?
Answer: The purpose of the AI for Public Health (AI4PH) initiative launched by Vector is to equip a new generation of public health practitioners with practical skills in AI for public health.

Result 1
Question: What is the purpose of the Vector Faculty Affiliates Program in expanding expertise in AI and machine learning across Ontario?
Answer: The Vector Faculty Affiliates Program brings together experts in the field of AI and machine learning to expand expertise in these areas across Ontario.

Result 2
Question: What opportunities does the Digital Talent Hub offer for AI-focused jobs and internships?
Answer: According to the context, the Digital Talent Hub offers the following opportunities for AI-focused jobs and internships:

1. Advertisements for AI-related internships and work opportunities from leading industry sponsors.
2. Access to a growing pool of AI-skilled talent.
3. 

Add the list of answers into our original dataset. Now we have a complete test set that is ready for evaluation.

In [19]:
dataset = dataset.add_column("answer", answers)
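
The previews in this notebook suggest that the `contexts` column still holds the passages selected during test set generation, since only an `answer` column is added here. Context-based metrics computed on that column may therefore not reflect what the retriever actually returned. The following is a hedged sketch of collecting the retrieved chunks per question. It assumes a retriever that exposes `invoke(query)` and returns objects with a `page_content` attribute, as LangChain retrievers do. `FakeDocument` and `FakeRetriever` are stand-ins so the sketch runs without a vector store.

```python
from dataclasses import dataclass


def collect_retrieved_contexts(retriever, questions):
    """Return, for each question, the text of the chunks the retriever returns."""
    retrieved = []
    for question in questions:
        docs = retriever.invoke(question)
        retrieved.append([doc.page_content for doc in docs])
    return retrieved


# Stand-ins for illustration only; not part of LangChain or the notebook
@dataclass
class FakeDocument:
    page_content: str


class FakeRetriever:
    def invoke(self, query):
        return [FakeDocument(f"chunk about {query}"), FakeDocument("another chunk")]


contexts = collect_retrieved_contexts(FakeRetriever(), ["funding", "scholarships"])
print(contexts)
```

With the notebook's objects this would be called as `collect_retrieved_contexts(retriever, dataset["question"])`. The result could then replace the `contexts` column before evaluation, for example with `dataset.remove_columns("contexts").add_column("contexts", ...)`, although that step is not tested here.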

## Evaluate the results

#### Preview the final test set

In [20]:
dataset.to_pandas()

|  | question | contexts | ground_truth | evolution_type | metadata | episode_done | answer |
|---|---|---|---|---|---|---|---|
| 0 | What is the purpose of the AI for Public Healt... | [33 \n PARTNERSHIPS FOSTER AN AI-FOR-\nHEALTH ... | The AI for Public Health (AI4PH) initiative ai... | simple | [{'file_name': 'source_documents/vector-instit... | True | The purpose of the AI for Public Health (AI4PH... |
| 1 | What is the purpose of the Vector Faculty Affi... | [19 \nVECTOR FACULTY \nAFFILIATES \nThe Vector... | The Vector Faculty Affiliates Program brings t... | simple | [{'file_name': 'source_documents/vector-instit... | True | The Vector Faculty Affiliates Program brings t... |
| 2 | What opportunities does the Digital Talent Hub... | [5 \nAnnual Report 2021–22 Vector Institute\nS... | The Digital Talent Hub offered 3,700+ postings... | simple | [{'file_name': 'source_documents/vector-instit... | True | According to the context, the Digital Talent H... |
| 3 | How does Vector Institute integrate responsibl... | [35 \nAnnual Report 2021–22 Vector Institute\n... | Vector Institute integrates responsible AI int... | simple | [{'file_name': 'source_documents/vector-instit... | True | According to the provided context, Vector Inst... |
| 4 | What benefits do scholarship recipients gain f... | [26 \n VECTOR SCHOLARSHIPS IN \nAI ATTRACT TO... | Scholarship recipients gain access to Vector's... | simple | [{'file_name': 'source_documents/vector-instit... | True | According to the context, scholarship recipien... |
| 5 | Who backs Vector's AI efforts? | [41 \n \n \n \n \n FINANCIALS \nVector is fu... | Vector's AI efforts are backed by multi-year c... | reasoning | [{'file_name': 'source_documents/vector-instit... | True | Vector's AI efforts are backed by a team that ... |
| 6 | How does the Vector Program boost AI expertise... | [19 \nVECTOR FACULTY \nAFFILIATES \nThe Vector... | The Vector Faculty Affiliates Program boosts A... | reasoning | [{'file_name': 'source_documents/vector-instit... | True | According to the context, the Vector Program b... |
| 7 | How does Vector's work with Ontario's AI scene... | [4 \nAnnual Report 2021–22 Vector Institute\n ... | Vector's work with Ontario's AI scene boosts C... | multi_context | [{'file_name': 'source_documents/vector-instit... | True | According to the provided context, Vector's wo... |
| 8 | How does GEMINI's health data integration with... | [31 \n \n \n NEW DATA SHARING \nAGREEMENTS... | GEMINI's health data integration with Vector's... | multi_context | [{'file_name': 'source_documents/vector-instit... | True | According to the context, GEMINI's stable and ... |
| 9 | What is the purpose of the new Chief Data Offi... | [33 \n PARTNERSHIPS FOSTER AN AI-FOR-\nHEALTH ... | The purpose of the new Chief Data Officer role... | simple | [{'file_name': 'source_documents/vector-instit... | True | According to the context, the purpose of the n... |


Run the evaluation to score the results. We look at the following metrics:
- *[Faithfulness](https://docs.ragas.io/en/v0.1.21/concepts/metrics/faithfulness.html)*: Can every claim made in the answer be inferred from the given context(s)?
- *[Context Precision](https://docs.ragas.io/en/v0.1.21/concepts/metrics/context_precision.html)*: Are the context chunks that are relevant to the question ranked near the top?
- *[Answer Correctness](https://docs.ragas.io/en/v0.1.21/concepts/metrics/answer_correctness.html)*: Is the generated answer accurate and complete when compared with the ground truth?
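
To make the first two metrics more concrete, here is a minimal arithmetic sketch based on the definitions in the Ragas documentation as I understand them. Faithfulness is the fraction of answer claims supported by the context. Context precision averages precision@k over the ranks that hold a relevant chunk. In Ragas the claim extraction and the relevance judgements are made by an LLM; in this sketch they are supplied by hand, and the function names are illustrative rather than Ragas APIs.

```python
def faithfulness(supported_claims: int, total_claims: int) -> float:
    """Fraction of claims in the answer that the context supports."""
    if total_claims == 0:
        return 0.0
    return supported_claims / total_claims


def context_precision(relevance_flags) -> float:
    """Mean of precision@k taken at each rank k that holds a relevant chunk.

    relevance_flags is a ranked list of 0/1 values, best-ranked chunk first.
    """
    hits = 0
    precision_sum = 0.0
    for k, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / hits if hits else 0.0


# Hypothetical example: 6 of 11 extracted claims are supported by the context
print(f"faithfulness: {faithfulness(6, 11):.6f}")

# Hypothetical example: relevant chunks at ranks 1 and 3, irrelevant at rank 2
print(f"context precision: {context_precision([1, 0, 1]):.4f}")
```

A relevant chunk that is ranked lower reduces the score, which is why context precision rewards retrievers that place useful chunks first.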

In [22]:
score = evaluate(
    dataset=dataset,
    metrics=[
        Faithfulness(),
        ContextPrecision(),
        AnswerCorrectness(),
    ],
    llm=generator_llm, # Using OpenAI LLM as the evaluator
    embeddings=generator_embeddings,
)
score.to_pandas()


|  | question | contexts | ground_truth | evolution_type | metadata | episode_done | answer | faithfulness | context_precision | answer_correctness |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | What is the purpose of the AI for Public Healt... | [33 \n PARTNERSHIPS FOSTER AN AI-FOR-\nHEALTH ... | The AI for Public Health (AI4PH) initiative ai... | simple | [{'file_name': 'source_documents/vector-instit... | True | The purpose of the AI for Public Health (AI4PH... | 1.0 | 1.0 | 0.568036 |
| 1 | What is the purpose of the Vector Faculty Affi... | [19 \nVECTOR FACULTY \nAFFILIATES \nThe Vector... | The Vector Faculty Affiliates Program brings t... | simple | [{'file_name': 'source_documents/vector-instit... | True | The Vector Faculty Affiliates Program brings t... | 1.0 | 1.0 | 0.61092 |
| 2 | What opportunities does the Digital Talent Hub... | [5 \nAnnual Report 2021–22 Vector Institute\nS... | The Digital Talent Hub offered 3,700+ postings... | simple | [{'file_name': 'source_documents/vector-instit... | True | According to the context, the Digital Talent H... | 0.1 | 1.0 | 0.316686 |
| 3 | How does Vector Institute integrate responsibl... | [35 \nAnnual Report 2021–22 Vector Institute\n... | Vector Institute integrates responsible AI int... | simple | [{'file_name': 'source_documents/vector-instit... | True | According to the provided context, Vector Inst... | 0.545455 | 1.0 | 0.439291 |
| 4 | What benefits do scholarship recipients gain f... | [26 \n VECTOR SCHOLARSHIPS IN \nAI ATTRACT TO... | Scholarship recipients gain access to Vector's... | simple | [{'file_name': 'source_documents/vector-instit... | True | According to the context, scholarship recipien... | 1.0 | 1.0 | 0.217143 |
| 5 | Who backs Vector's AI efforts? | [41 \n \n \n \n \n FINANCIALS \nVector is fu... | Vector's AI efforts are backed by multi-year c... | reasoning | [{'file_name': 'source_documents/vector-instit... | True | Vector's AI efforts are backed by a team that ... | 0.0 | 1.0 | 0.167371 |
| 6 | How does the Vector Program boost AI expertise... | [19 \nVECTOR FACULTY \nAFFILIATES \nThe Vector... | The Vector Faculty Affiliates Program boosts A... | reasoning | [{'file_name': 'source_documents/vector-instit... | True | According to the context, the Vector Program b... | 0.0 | 1.0 | 0.203814 |
| 7 | How does Vector's work with Ontario's AI scene... | [4 \nAnnual Report 2021–22 Vector Institute\n ... | Vector's work with Ontario's AI scene boosts C... | multi_context | [{'file_name': 'source_documents/vector-instit... | True | According to the provided context, Vector's wo... | 0.6 | 1.0 | 0.412032 |
| 8 | How does GEMINI's health data integration with... | [31 \n \n \n NEW DATA SHARING \nAGREEMENTS... | GEMINI's health data integration with Vector's... | multi_context | [{'file_name': 'source_documents/vector-instit... | True | According to the context, GEMINI's stable and ... | 0.888889 | 1.0 | 0.677368 |
| 9 | What is the purpose of the new Chief Data Offi... | [33 \n PARTNERSHIPS FOSTER AN AI-FOR-\nHEALTH ... | The purpose of the new Chief Data Officer role... | simple | [{'file_name': 'source_documents/vector-instit... | True | According to the context, the purpose of the n... | 1.0 | 1.0 | 0.505192 |
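
Per-row scores are easier to interpret alongside an aggregate. The sketch below averages each metric and lists the rows whose faithfulness falls below an arbitrary threshold of 0.5. The numbers are copied by hand from the score table above so that the example runs on its own; with the live result, the same columns would presumably be read from `score.to_pandas()`.

```python
from statistics import mean

# Scores copied from the table above (rows 0 to 9)
scores = {
    "faithfulness": [1.0, 1.0, 0.1, 0.545455, 1.0, 0.0, 0.0, 0.6, 0.888889, 1.0],
    "context_precision": [1.0] * 10,
    "answer_correctness": [
        0.568036, 0.61092, 0.316686, 0.439291, 0.217143,
        0.167371, 0.203814, 0.412032, 0.677368, 0.505192,
    ],
}

# Average each metric over the ten test questions
averages = {name: mean(values) for name, values in scores.items()}
for name, value in averages.items():
    print(f"{name}: {value:.3f}")

# Rows worth inspecting by hand; the 0.5 cut-off is an arbitrary choice
threshold = 0.5
low_faithfulness_rows = [
    index for index, value in enumerate(scores["faithfulness"]) if value < threshold
]
print("Rows with faithfulness below", threshold, ":", low_faithfulness_rows)
```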
