[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/VectorInstitute/rag-bootcamp/blob/uv-migration/implementations/rag_evaluation/rag_evaluation_basic.ipynb)

# RAG Evaluation Basic Demo

This example shows a basic RAG evaluation pipeline based on the [Ragas](https://docs.ragas.io/en/stable/) framework. It focuses on two basic concepts:

- **Creating a test set**: This is a set of questions and answers that we'll use to evaluate a RAG pipeline.
- **Evaluation metrics**: Which metrics do we use to score a RAG pipeline? In this example, we measure the following:
    - *[Faithfulness](https://docs.ragas.io/en/v0.1.21/concepts/metrics/faithfulness.html)*: Can every claim made in the answer be inferred from the given context(s)?
    - *[Context Precision](https://docs.ragas.io/en/v0.1.21/concepts/metrics/context_precision.html)*: Did the retriever return chunks that are relevant to the question, and did it rank them ahead of irrelevant ones?
    - *[Answer Correctness](https://docs.ragas.io/en/v0.1.21/concepts/metrics/answer_correctness.html)*: Was the generated answer correct? Was it complete?
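
To make the three metrics more concrete, here is a simplified sketch of the arithmetic behind them. It assumes the per-claim and per-chunk verdicts are already known; in Ragas those verdicts come from the evaluator LLM, and answer correctness additionally blends in an embedding-based similarity term that is left out here. The function names are hypothetical and are not part of the Ragas API.

```python
from typing import Sequence


def faithfulness_score(claim_supported: Sequence[bool]) -> float:
    """Fraction of the answer's claims that the retrieved context supports."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


def context_precision_score(chunk_relevant: Sequence[bool]) -> float:
    """Mean of precision@k, taken at every rank k that holds a relevant chunk."""
    hits = 0
    precisions = []
    for k, relevant in enumerate(chunk_relevant, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0


def statement_f1(tp: int, fp: int, fn: int) -> float:
    """F1 over answer statements: the factual part of answer correctness."""
    denominator = tp + 0.5 * (fp + fn)
    return tp / denominator if denominator else 0.0


# One of two claims is supported by the context.
print(faithfulness_score([True, False]))
# Relevant chunks at ranks 1 and 3, an irrelevant one at rank 2.
print(context_precision_score([True, False, True]))
# Two correct statements, one unsupported, one missing.
print(statement_f1(tp=2, fp=1, fn=1))
```

Note how context precision rewards ranking: the same two relevant chunks at ranks 1 and 2 would score higher than at ranks 1 and 3.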

### 📝 Requirements

To run this notebook, you will need:

- **OpenAI API key**:  
    - Sign up at [OpenAI](https://platform.openai.com/) and create an API key

## Set up the evaluation environment

#### Install libraries (Only in Google Colab)

In [1]:
import os

if 'COLAB_RELEASE_TAG' in os.environ:
    # This is a Google Colab environment
    
    # Check if the notebook is running in a GPU environment and install the appropriate version of faiss
    if 'COLAB_GPU' in os.environ:
        !pip3 install faiss-gpu
    else:
        !pip3 install faiss-cpu

    # Install other dependencies
    !pip3 install datasets langchain langchain-openai langchain-huggingface ragas # aieng-rag-utils

#### Import libraries

In [2]:
import warnings
warnings.filterwarnings('ignore')

In [8]:
import os

from aieng.rag.utils import get_device_name

from datasets import Dataset 
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI

from ragas import evaluate
from ragas.metrics import Faithfulness, ContextPrecision, AnswerCorrectness

#### Load OpenAI env variables

In [4]:
OPENAI_BASE_URL = os.getenv("OPENAI_BASE_URL","https://api.openai.com/v1")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY")
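
Because the cell above falls back to a placeholder string when the environment variable is missing, a misconfigured key only surfaces later as an authentication error. A minimal sketch of an early check follows; the helper `api_key_is_configured` is hypothetical and not part of the notebook or of any library.

```python
import os
from typing import Optional

# Same fallback string that the cell above uses when the variable is unset.
PLACEHOLDER_KEY = "YOUR_OPENAI_API_KEY"


def api_key_is_configured(key: Optional[str], placeholder: str = PLACEHOLDER_KEY) -> bool:
    """Return True only for a non-empty key that is not the placeholder."""
    if key is None:
        return False
    key = key.strip()
    return key != "" and key != placeholder


if not api_key_is_configured(os.getenv("OPENAI_API_KEY")):
    print("OPENAI_API_KEY is not set; export it before running the evaluation cells.")
```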

#### Choose and set evaluator LLM and embedding model

**Important:** Use the most capable model available as the evaluator, for example OpenAI's GPT-4o or o1-preview, or Meta's Meta-Llama-3.1-70B-Instruct.

In [5]:
EVALUATOR_MODEL_NAME = "gpt-4.1"
EVALUATOR_EMBEDDING_MODEL_NAME = "BAAI/bge-base-en-v1.5"

In [6]:
llm = ChatOpenAI(
    model=EVALUATOR_MODEL_NAME,
    temperature=0,
    max_tokens=1024,
    base_url=OPENAI_BASE_URL,
    api_key=OPENAI_API_KEY
)

In [9]:
device = get_device_name()

model_kwargs = {'device': device, 'trust_remote_code': True}
encode_kwargs = {'normalize_embeddings': True} # unit-length vectors, so a dot product equals cosine similarity

embeddings = HuggingFaceEmbeddings(
    model_name=EVALUATOR_EMBEDDING_MODEL_NAME,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
)
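
The `normalize_embeddings` setting matters because, for unit-length vectors, a plain dot product gives the same value as cosine similarity. The sketch below illustrates this with two toy vectors in pure Python; the vectors and helper names are made up for illustration and have nothing to do with the actual embedding model.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]

cosine = cosine_similarity(a, b)
dot_of_normalized = dot(l2_normalize(a), l2_normalize(b))

# Both values agree up to floating-point rounding.
print(cosine, dot_of_normalized)
```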

## Create a test set: data samples we'll use to evaluate our RAG pipeline

In the `test_set` structure below, the **answer** key holds the answers that a RAG pipeline might have returned for the questions listed under **question**. Each entry under **contexts** is the list of chunks retrieved for that question, and **ground_truth** holds the reference answer. Try changing the answers to see how that affects the scores in the next section.

In [10]:
rag_answer_1 = "The first superbowl was held on Jan 15, 1967"
rag_answer_2 = "The most super bowls have been won by The New England Patriots"

rag_context_1 = [
    'The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles,'
]
rag_context_2 = [
    'The Green Bay Packers...Green Bay, Wisconsin.',
    'The Packers compete...Football Conference'
]

test_set = {
    'question': [
        'When was the first super bowl?', 
        'Who won the most super bowls?'
    ],
    'answer': [
        rag_answer_1,
        rag_answer_2 
    ],
    'contexts' : [
        rag_context_1, 
        rag_context_2
    ],
    'ground_truth': [
        'The first superbowl was held on January 15, 1967', 
        'The New England Patriots have won the Super Bowl a record six times'
    ]
}
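
When editing the test set by hand, it is easy to end up with columns of different lengths or a context that is a bare string instead of a list. The following is a minimal sketch of a shape check, assuming the four column names used above; `validate_test_set` and the `example` dictionary are hypothetical and only serve as an illustration.

```python
from typing import Any, Dict, List

REQUIRED_COLUMNS = ("question", "answer", "contexts", "ground_truth")


def validate_test_set(test_set: Dict[str, List[Any]]) -> List[str]:
    """Return a list of problems; an empty list means the shape looks right."""
    problems = [f"missing column: {name}" for name in REQUIRED_COLUMNS if name not in test_set]
    if problems:
        return problems

    # Every column needs one entry per sample.
    lengths = {name: len(test_set[name]) for name in REQUIRED_COLUMNS}
    if len(set(lengths.values())) != 1:
        problems.append(f"column lengths differ: {lengths}")

    # Each sample's contexts must be a list of strings, not a bare string.
    for i, ctx in enumerate(test_set["contexts"]):
        if not isinstance(ctx, list) or not all(isinstance(c, str) for c in ctx):
            problems.append(f"contexts[{i}] must be a list of strings")
    return problems


example = {
    "question": ["When was the first super bowl?"],
    "answer": ["The first superbowl was held on Jan 15, 1967"],
    "contexts": ["a bare string instead of a list"],  # deliberately wrong
    "ground_truth": ["The first superbowl was held on January 15, 1967"],
}
print(validate_test_set(example))
```

Calling `validate_test_set(test_set)` on the dictionary defined above should return an empty list.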

## Now evaluate the RAG pipeline

Evaluate the test set on the metrics introduced above: **faithfulness**, **context precision**, and **answer correctness**.

The Ragas framework provides many other metrics as well; see [Ragas metrics](https://docs.ragas.io/en/stable/concepts/metrics/).

Preview our test set before sending it for evaluation:

In [11]:
test_dataset = Dataset.from_dict(test_set)
test_dataset.to_pandas()

|   | question | answer | contexts | ground_truth |
|---|---|---|---|---|
| 0 | When was the first super bowl? | The first superbowl was held on Jan 15, 1967 | [The First AFL–NFL World Championship Game was... | The first superbowl was held on January 15, 1967 |
| 1 | Who won the most super bowls? | The most super bowls have been won by The New ... | [The Green Bay Packers...Green Bay, Wisconsin.... | The New England Patriots have won the Super Bo... |


Evaluation results:

In [12]:
score = evaluate(
    dataset=test_dataset,
    metrics=[
        Faithfulness(),
        ContextPrecision(),
        AnswerCorrectness(),
    ],
    llm=llm,
    embeddings=embeddings,
)
score.to_pandas()


|   | user_input | retrieved_contexts | response | reference | faithfulness | context_precision | answer_correctness |
|---|---|---|---|---|---|---|---|
| 0 | When was the first super bowl? | [The First AFL–NFL World Championship Game was... | The first superbowl was held on Jan 15, 1967 | The first superbowl was held on January 15, 1967 | 0.0 | 1.0 | 0.999617 |
| 1 | Who won the most super bowls? | [The Green Bay Packers...Green Bay, Wisconsin.... | The most super bowls have been won by The New ... | The New England Patriots have won the Super Bo... | 0.0 | 0.0 | 0.718121 |
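
With more than a handful of samples, per-row scores are easier to read once they are summarized. The sketch below averages each metric and lists the sample/metric pairs that fall under a cutoff. The rows are copied by hand from the output above, and the 0.5 threshold is an arbitrary assumption for illustration, not a value recommended by Ragas.

```python
from typing import Dict, List, Tuple

# Scores copied from the evaluation output above.
rows: List[Dict[str, float]] = [
    {"faithfulness": 0.0, "context_precision": 1.0, "answer_correctness": 0.999617},
    {"faithfulness": 0.0, "context_precision": 0.0, "answer_correctness": 0.718121},
]


def mean_scores(rows: List[Dict[str, float]]) -> Dict[str, float]:
    """Average every metric over all samples."""
    metrics = rows[0].keys()
    return {m: sum(r[m] for r in rows) / len(rows) for m in metrics}


def low_scoring(rows: List[Dict[str, float]], threshold: float = 0.5) -> List[Tuple[int, str]]:
    """List (row index, metric) pairs whose score is below the threshold."""
    return [
        (i, metric)
        for i, row in enumerate(rows)
        for metric, value in row.items()
        if value < threshold
    ]


print(mean_scores(rows))
print(low_scoring(rows))
```

For the second question, the low context precision is consistent with the retrieved contexts being about the Green Bay Packers rather than the New England Patriots.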
