# Learning Objectives

- Illustrate RAG evaluation using the `deepeval` package to evaluate RAG outputs using:
    - RAG Triad metrics
    - Ragas metrics


# Setup

In [1]:
!pip install -q openai==1.66.3 \
                tiktoken==0.9.0 \
                langchain==0.3.20 \
                langchain-chroma==0.2.2 \
                langchain-openai==0.3.9 \
                chromadb==0.6.3 \
                deepeval==2.7.1

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m61.0/61.0 kB[0m [31m3.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m67.3/67.3 kB[0m [31m4.4 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m81.3/81.3 kB[0m [31m5.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m567.4/567.4 kB[0m [31m31.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.0/1.0 MB[0m [31m50.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m60.9/60.9 kB[0m [31m4.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m611.1/611.1 kB[0m [31m35.5 MB/s[0m eta [3

In [2]:
import os
import chromadb

from openai import AzureOpenAI

from deepeval import evaluate
from langchain_openai import AzureOpenAIEmbeddings, AzureChatOpenAI
from langchain_chroma import Chroma

from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric, ContextualRelevancyMetric
from deepeval.metrics import ContextualPrecisionMetric, ContextualRecallMetric

from deepeval.models.base_model import DeepEvalBaseLLM

from google.colab import userdata



In [3]:
azure_api_key = userdata.get('azure_api_key')
# Modify the Azure Endpoint and the API Versions as needed
azure_base_url = "https://oait3st.cognitiveservices.azure.com"
azure_api_version = "2024-12-01-preview"

In [4]:
!unzip tesla_db.zip

Archive:  tesla_db.zip
   creating: tesla_db/
   creating: tesla_db/6bcaeb49-ca3f-472e-ba93-578374897301/
  inflating: tesla_db/6bcaeb49-ca3f-472e-ba93-578374897301/header.bin  
  inflating: tesla_db/6bcaeb49-ca3f-472e-ba93-578374897301/link_lists.bin  
  inflating: tesla_db/6bcaeb49-ca3f-472e-ba93-578374897301/data_level0.bin  
  inflating: tesla_db/6bcaeb49-ca3f-472e-ba93-578374897301/index_metadata.pickle  
  inflating: tesla_db/6bcaeb49-ca3f-472e-ba93-578374897301/length.bin  
  inflating: tesla_db/chroma.sqlite3  


# Assembling Test Cases

An important first steps to evaluate RAG responses is to assemble test cases.

A typical test case consists of:
- query representative of expected user questions when the application goes live
- response generated by the RAG application
- (optional) human baseline answer for the same query using the same data sources

While optional, having a small subset of human baselines answers (gold answers) improves the robustness of the evaluation process.

## Composing test case response

Let us begin by setting up the full RAG pipeline that takes in user queries and uses retrieved context from the vector database to answer this question.

In [9]:
qna_system_message = """
You are an assistant to a financial services firm who answers user queries on annual reports.
User input will have the context required by you to answer user queries.
This context will be delimited by: <Context> and </Context>.
The context contains references to specific portions of a document relevant to the user query.

User queries will be delimited by: <Question> and </Question>.

Please answer user queries only using the context provided in the input.
Do not mention anything about the context in your final answer. Your response should only contain the answer to the question.

If the answer is not found in the context, respond "I don't know".
"""

qna_user_message_template = """
<Context>
Here are some documents that are relevant to the question mentioned below.
{context}
</Context>

<Question>
{question}
</Question>
"""

def answer(user_query: str) -> str:
    """Answers user queries using context retrieved from a vector database.

    Retrieves relevant document chunks from a vector database based on the user query,
    formats the context and query into a prompt, and sends it to a large language model
    for answer generation.

    Args:
        user_query: The user's query.

    Returns:
        A tuple containing the generated answer and the list of retrieved context.
    """
    client = AzureOpenAI(
        azure_endpoint="https://gen-ai-teaching-001.openai.azure.com/",
        api_key=azure_api_key,
        api_version="2024-10-21"
    )

    model_name = 'gpt-4o-mini'

    embedding_model = AzureOpenAIEmbeddings(
        api_key=azure_api_key,
        azure_endpoint= azure_base_url,
        api_version="2024-10-21",
        azure_deployment="text-embedding-3-small"
    )

    chromadb_client = chromadb.PersistentClient(
        path="./tesla_db"
    )

    tesla_10k_collection = 'tesla-10k-2019-to-2023'

    vectorstore_persisted = Chroma(
        collection_name=tesla_10k_collection,
        collection_metadata={"hnsw:space": "cosine"},
        embedding_function=embedding_model,
        client=chromadb_client,
        persist_directory="./tesla_db"
    )

    retriever = vectorstore_persisted.as_retriever(
        search_type='similarity',
        search_kwargs={'k': 5}
    )

    relevant_document_chunks = retriever.invoke(user_query)
    context_list = [d.page_content for d in relevant_document_chunks]
    context_for_query = "\n---\n".join(context_list)

    prompt = [
        {'role': 'developer', 'content': qna_system_message},
        {'role': 'user', 'content': qna_user_message_template.format(
            context=context_for_query,
            question=user_query
            )
        }
    ]

    try:
        response = client.chat.completions.create(
            model=model_name,
            messages=prompt,
            temperature=0
        )

        prediction = response.choices[0].message.content.strip()
    except Exception as e:
        prediction = f'Sorry, I encountered the following error: \n {e}'

    return prediction, context_list

There are two use case scenarios to consider for each query in the test set:
- when baseline human responses (gold responses) are available
- when baseline human responses are not available

It is always preferable to have gold responses for a sample of test cases.

## Test cases with human baseline

Let us assemble a test case when golden output is available. This is done using the `LLMTestCase` abstraction from `deepeval`.

In [10]:
test_query = "What was the total revenue of the company in 2022?"

In [11]:
golden_output = '$81.46 billion'

In [12]:
output, retrieved_context = answer(test_query)

In [13]:
test_case_with_golden_output = LLMTestCase(
    input=test_query,
    expected_output=golden_output,
    actual_output=output,
    retrieval_context=retrieved_context
)

## Test cases without human baseline


The `LLMTestCase` abstraction also works when there is no expected output. In this case, the output is evaluated with an LLM-as-a-judge approach using the metric definition.

In [14]:
test_query = "What was the total revenue of the company in 2022?"

In [15]:
output, retrieved_context = answer(test_query)

In [16]:
test_case_without_golden_output = LLMTestCase(
    input=test_query,
    actual_output=output,
    retrieval_context=retrieved_context
)

# Evaluation - RAG Triad

The `deepeval` framework offers ready-to-use implementations of the three metrics comprising the RAG Triad: answer relevance, faithfulness, and contextual relevance.

An essential configuration parameter in this framework is the threshold, which defines the minimum acceptable standard for an answer to pass evaluation against the chosen metric. Lowering this threshold decreases the test's stringency, which can, in some cases, be advantageous. Excessively stringent thresholds may result in a high failure rate, potentially hindering the progression of the application towards deployment. Balancing this parameter is therefore critical to ensure both rigorous evaluation and practical viability.

We recommended to pick a reasonable threshold before starting the evaluation and stick with it as we change the parameters of the RAG system (e.g., chunk size, embedding model).

When using Azure Open AI, `deepeval` requires the creation of a custom instance of the `DeepEvalBaseLM` that [exposes specific methods](https://docs.confident-ai.com/guides/guides-using-custom-llms#azure-openai-example).

In [17]:
class CustomAzureOpenAI(DeepEvalBaseLLM):
    def __init__(
        self,
        model
    ):
        self.model = model

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        chat_model = self.load_model()
        return chat_model.invoke(prompt).content

    async def a_generate(self, prompt: str) -> str:
        chat_model = self.load_model()
        res = await chat_model.ainvoke(prompt)
        return res.content

    def get_model_name(self):
        return "Custom Azure OpenAI Model"


custom_model = AzureChatOpenAI(
    azure_endpoint = azure_base_url,
    api_key=azure_api_key,
    api_version = azure_api_version,
    model='gpt-4o-mini'
)

azure_openai = CustomAzureOpenAI(model=custom_model)

We can now pass in this custom Azure Open AI model twhile instantiating the metrics.

In [18]:
answer_relevancy = AnswerRelevancyMetric(
    threshold=0.7,
    model=azure_openai,
    include_reason=True
)

faithfulness = FaithfulnessMetric(
    threshold=0.7,
    model=azure_openai,
    include_reason=True
)

contextual_relevancy = ContextualRelevancyMetric(
    threshold=0.7,
    model=azure_openai,
    include_reason=True
)

## Single Test Case

With the metrics in place we can now evaluate the performance of the RAG system against the test case using the `evaluate` function.

In [19]:
results = evaluate(
    test_cases=[test_case_with_golden_output],
    metrics=[answer_relevancy, faithfulness, contextual_relevancy]
)

Evaluating 1 test case(s) in parallel: |██████████|100% (1/1) [Time Taken: 00:06,  6.13s/test case]



Metrics Summary

  - ❌ Answer Relevancy (score: 0.0, threshold: 0.7, strict: False, evaluation model: Custom Azure OpenAI Model, reason: The score is 0.00 because the provided output addressed an unrelated technical issue rather than the financial information requested about the company's revenue in 2022., error: None)
  - ✅ Faithfulness (score: 1.0, threshold: 0.7, strict: False, evaluation model: Custom Azure OpenAI Model, reason: The score is 1.00 because there are no contradictions, indicating the actual output aligns perfectly with the retrieval context., error: None)
  - ❌ Contextual Relevancy (score: 0.09090909090909091, threshold: 0.7, strict: False, evaluation model: Custom Azure OpenAI Model, reason: The score is 0.09 because, while the retrieval context includes relevant data about total revenues in 2022, such as 'In 2022, we recognized total revenues of $81.46 billion,' the overwhelming majority of the context is irrelevant to the specific query regarding total revenue fo




As the above output indicates, the RAG system passes answer relevancey and faithfulness but fails the contextual relevancy test. This is because not all the retrieved documents are relevant to the query. The next step in this case would be to improve the retrieval mechanism (e.g., change chunk size, $k$ or the embedding model).

We can also repeat the evaluation for the situation where the gold output is not available.

In [20]:
results = evaluate(
    test_cases=[test_case_without_golden_output],
    metrics=[answer_relevancy, faithfulness, contextual_relevancy]
)

Evaluating 1 test case(s) in parallel: |██████████|100% (1/1) [Time Taken: 00:10, 10.64s/test case]



Metrics Summary

  - ❌ Answer Relevancy (score: 0.0, threshold: 0.7, strict: False, evaluation model: Custom Azure OpenAI Model, reason: The score is 0.00 because the output contained multiple irrelevant statements about access issues and error codes that have no connection to the company's revenue for 2022., error: None)
  - ✅ Faithfulness (score: 1.0, threshold: 0.7, strict: False, evaluation model: Custom Azure OpenAI Model, reason: The score is 1.00 because there are no contradictions present, indicating perfect alignment between the actual output and the retrieval context. Great job!, error: None)
  - ❌ Contextual Relevancy (score: 0.041666666666666664, threshold: 0.7, strict: False, evaluation model: Custom Azure OpenAI Model, reason: The score is 0.04 because the retrieval context overwhelmingly focuses on irrelevant data from 2020, with statements like 'In 2020' dominating the context, while the only relevant information is found in a single statement indicating 'In 2022, we 




## Batch Evaluation

When moving from single test evaluation to evaluating a batch of queries, we need to create an evaluation dataset. This involved looping over a sample of test queries, baseline answers (gold outputs) and adding test cases to the dataset like so:

In [21]:
test_queries = [
    "What was the total revenue of the company in 2022?",
    "What was the company's debt level in 2023?",
    "Present 3 key highlights of the Management Discussion and Analysis section of the 2022 report in 50 words."
]

In [22]:
golden_outputs = [
    "$81.46 billion",
    "$2,061 million ($1,016 million in recourse debt and $1,029 million in non-recourse debt)",
    """
    In 2022, Tesla produced 1.37 million vehicles, recognized revenues of $81.46 billion, and achieved a net income of $12.56 billion.
    The company focused on increasing production capacity, improving battery technologies, and expanding its energy storage and solar energy systems.
    """
]

In [23]:
dataset = EvaluationDataset()

In [24]:
for gold_query, gold_output in zip(test_queries, golden_outputs):
    actual_output, retrieved_context = answer(gold_query)

    dataset.add_test_case(
        LLMTestCase(
            input=gold_query,
            expected_output=gold_output,
            actual_output=actual_output,
            retrieval_context=retrieved_context
        )
    )

Once the dataset is assembled, the evaluation can be executed on the dataset using the RAG Triad metrics we saw earlier.

In [25]:
results = evaluate(
    dataset,
    metrics=[answer_relevancy, faithfulness, contextual_relevancy]
)

Evaluating 3 test case(s) in parallel: |██████████|100% (3/3) [Time Taken: 00:15,  5.21s/test case]



Metrics Summary

  - ❌ Answer Relevancy (score: 0.0, threshold: 0.7, strict: False, evaluation model: Custom Azure OpenAI Model, reason: The score is 0.00 because the output contained statements exclusively about technical errors and access issues, which do not address the request for highlights of the Management Discussion and Analysis section. The presence of multiple irrelevant statements indicates a complete misalignment with the input, resulting in the lowest score., error: None)
  - ✅ Faithfulness (score: 1.0, threshold: 0.7, strict: False, evaluation model: Custom Azure OpenAI Model, reason: The score is 1.00 because there are no contradictions, indicating that the actual output fully aligns with the retrieval context., error: None)
  - ❌ Contextual Relevancy (score: 0.6153846153846154, threshold: 0.7, strict: False, evaluation model: Custom Azure OpenAI Model, reason: The score is 0.62 because the retrieval context lacks specific highlights from the Management Discussion and 




# Evaluation - Ragas

Evaluation using the Ragas metrics can be done using the same set of steps as the RAG Triad. These metrics are available out-of-the-box in `deepeval` and can be instantiated for batch evaluation as we did with RAG Triad metrics.

In [26]:
answer_relevancy = AnswerRelevancyMetric(
    threshold=0.7,
    model=azure_openai,
    include_reason=True
)

faithfulness = FaithfulnessMetric(
    threshold=0.7,
    model=azure_openai,
    include_reason=True
)

contextual_precision = ContextualPrecisionMetric(
    threshold=0.7,
    model=azure_openai,
    include_reason=True
)

contextual_recall = ContextualRecallMetric(
    threshold=0.7,
    model=azure_openai,
    include_reason=True
)

In [27]:
results = evaluate(
    dataset,
    metrics=[answer_relevancy, faithfulness, contextual_precision, contextual_recall]
)

Evaluating 3 test case(s) in parallel: |██████████|100% (3/3) [Time Taken: 00:12,  4.18s/test case]



Metrics Summary

  - ❌ Answer Relevancy (score: 0.0, threshold: 0.7, strict: False, evaluation model: Custom Azure OpenAI Model, reason: The score is 0.00 because all provided statements pertained to error messaging and access issues, which did not address the request for key highlights of the Management Discussion and Analysis section. This indicates a complete lack of relevance to the input's topic., error: None)
  - ✅ Faithfulness (score: 1.0, threshold: 0.7, strict: False, evaluation model: Custom Azure OpenAI Model, reason: The score is 1.00 because there are no contradictions, indicating that the actual output aligns perfectly with the retrieval context., error: None)
  - ✅ Contextual Precision (score: 1.0, threshold: 0.7, strict: False, evaluation model: Custom Azure OpenAI Model, reason: The score is 1.00 because the relevant nodes rank higher, starting with the first node mentioning '1,369,611 consumer vehicles', which is directly related to the output. The subsequent releva


