# RAG Evaluation with DeepEval

## Introduction
---

[**DeepEval**](https://docs.confident-ai.com/) is an open-source framework for evaluating the outputs of large language model (LLM) applications. It provides ready-made metrics, many of which use an LLM as the judge, together with test-case and dataset abstractions, so that evaluations can be run in a structured and reproducible way. In this notebook we use it to evaluate a Retrieval-Augmented Generation (RAG) pipeline.

To work with `DeepEval`, we need to prepare <u>three mandatory components</u>:
1. **Evaluation dataset**: a collection of `LLMTestCase`s and/or `Golden`s, each with an expected answer.
2. **Test case(s)**: a blueprint for unit testing LLM outputs. There are two types of test cases in **DeepEval**: `LLMTestCase` and `ConversationalTestCase`.
3. **Metrics**: the standards of measurement for evaluating an LLM output. You can use the predefined metrics or bring your own, based on your objectives and use cases.

<div class="alert alert-block alert-info">
    <b>Note</b>: We have already set up the <b>evaluation dataset</b> during our prerequisite step.
</div>



## Set up

In [None]:
%pip install -qU --quiet -r requirements.txt

## RAG Evaluation
---
In general, the core components of a RAG pipeline are the **retrieval** and **generation** steps, both of which are influenced by hyperparameters such as the choice of embedding model, the search strategy, the number of chunks/nodes to retrieve, the LLM temperature, and the prompt template.
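
The hyperparameters listed above can be gathered into one object so that every evaluation run records the configuration that produced it. The following is a minimal sketch: `RagConfig` and its field names are hypothetical, and the default values simply mirror the settings used later in this notebook.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class RagConfig:
    """Hypothetical container for the RAG hyperparameters used in this notebook."""
    # Retrieval-side settings
    embedding_model_id: str = "amazon.titan-embed-text-v2:0"
    top_k: int = 3  # number of chunks returned by the retriever
    # Generation-side settings
    generator_model_id: str = "meta.llama3-70b-instruct-v1:0"
    temperature: float = 0.2
    max_tokens: int = 2048


baseline = RagConfig()
# Vary one hyperparameter at a time so that a metric change can be traced to it.
wider_retrieval = RagConfig(top_k=5)
print(asdict(wider_retrieval))
```

Storing the configuration next to the metric scores makes it easier to tell whether a change in a retrieval metric came from a retrieval-side or a generation-side setting.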

**DeepEval** offers metrics for evaluating the retrieval and generation steps separately. This decoupled approach makes it easier for developers to debug the pipeline and pinpoint which component needs improvement.


<img src='./img/DeepEval-RAG_Eval.png' alt="DeepEval RAG Evaluation" style='width: 500px;'/>

### Get the evaluation dataset
---
Load the prerequisite evaluation dataset. For the detailed steps on creating this dataset, please refer to the [prerequisite notebook](../prerequisite-vector-db-and-evaluation-dataset.ipynb).

In [1]:
import pandas as pd

eval_df = pd.read_csv('../_eval_data/eval_dataframe.csv')
eval_df.head(2)

| Unnamed: 0 | input | actual_output | expected_output | context | retrieval_context | n_chunks_per_context | context_length | evolutions | context_quality | synthetic_input_quality | source_file |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Rewritten Input: Explain Amazon's core mission... | | Amazon's core mission is to make customers' li... | ['across Amazon. Y et, I think every one of us... | | 1 | 2361 | ['Reasoning'] | 0.8 | 1.0 | ./_raw_data/AMZN-2023-Shareholder-Letter.pdf |
| 1 | Compare Amazon's approach to empowering builde... | | Amazon's approach to empowering builders and i... | ['across Amazon. Y et, I think every one of us... | | 1 | 2361 | ['Comparative'] | 0.8 | 0.6 | ./_raw_data/AMZN-2023-Shareholder-Letter.pdf |


In [2]:
import ast

eval_df['context'] = eval_df.context.apply(lambda s: list(ast.literal_eval(s)))
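
The `context` column needs this conversion because a CSV file stores each Python list as its string representation. Here is a small standalone illustration of what `ast.literal_eval` does in that step, using a made-up cell value rather than a row from the dataset:

```python
import ast

# A list written to CSV comes back as one string, not as a list.
raw_cell = "['first chunk of text', 'second chunk of text']"

# literal_eval parses Python literals only; it does not execute arbitrary code.
parsed = ast.literal_eval(raw_cell)

print(type(raw_cell).__name__, "->", type(parsed).__name__)
print(len(parsed))
```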

### Connect to existing vector database
---

Connect to the prerequisite vector database. Please refer to the [prerequisite notebook](../prerequisite-vector-db-and-evaluation-dataset.ipynb) for how to set up the vector database.

In [3]:
from langchain_chroma import Chroma
from langchain_aws import BedrockEmbeddings
import boto3


chroma_db_dir = './../_vector_db'
chroma_collection_name = 'amazon-shareholder-letters'
boto_session = boto3.session.Session()
titan_model_id = 'amazon.titan-embed-text-v2:0'
titan_embedding_fn = BedrockEmbeddings(
    model_id=titan_model_id,
    region_name=boto_session.region_name
)

vector_store = Chroma(
    collection_name=chroma_collection_name,
    embedding_function=titan_embedding_fn,
    persist_directory=chroma_db_dir,
)

chroma_retriver = vector_store.as_retriever(
    search_kwargs={'k': 3}
)

We can check whether the connection is successful by calling `vector_store.get()`:

In [4]:
len(vector_store.get().get('ids', []))  # here will output the number of docs or chunks within the Chroma DB

380

### Custom LLM Evaluation
---

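Before we can use the predefined metrics from `DeepEval` with a Bedrock model, we need to define our own model class by subclassing `DeepEvalBaseLLM`. When we define the evaluation **metrics**, we pass an instance of this class through the `model` parameter; otherwise, `DeepEval` uses an **OpenAI** model by default.

Below is the constructor signature of the `ContextualPrecisionMetric` class, which shows the `model` parameter: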

```python
class ContextualPrecisionMetric(BaseMetric):
    def __init__(
        self,
        threshold: float = 0.5,
        model: Optional[Union[str, DeepEvalBaseLLM]] = None,
        include_reason: bool = True,
        async_mode: bool = True,
        strict_mode: bool = False,
        verbose_mode: bool = False,
    ):
        ...
```

In this notebook, we will use **Llama 3 70B** for RAG response generation and **Llama 3.1 70B** as the evaluation model.

In [5]:
import langchain_aws
from langchain_aws import ChatBedrock

llama3_70b_model_id = 'meta.llama3-70b-instruct-v1:0'
llama3_1_70b_model_id = 'meta.llama3-1-70b-instruct-v1:0'

llama3_70b_langchain = ChatBedrock(
    model_id=llama3_70b_model_id,
    region_name=boto_session.region_name,
    model_kwargs={
        'max_tokens': 2048,
        'temperature': 0.2,
    },
)

llama3_1_70b_langchain = ChatBedrock(
    model_id=llama3_1_70b_model_id,
    region_name=boto_session.region_name,
    model_kwargs={
        'max_tokens': 2048,
        'temperature': 0.2,
    },
)

In [6]:
from deepeval.models import DeepEvalBaseLLM


class BedrockTextGenDeepEval(DeepEvalBaseLLM):
    def __init__(
        self,
        model: ChatBedrock  # LangChain chat model backed by Amazon Bedrock
    ):
        self.model = model

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        llm_model = self.load_model()
        return llm_model.invoke(prompt).content

    async def a_generate(self, prompt: str) -> str:
        llm_model = self.load_model()
        res = await llm_model.ainvoke(prompt)
        return res.content

    def get_model_name(self):
        llm_model = self.load_model()
        return llm_model.model_id

    def get_provider(self):
        model_id = self.get_model_name()
        return model_id.split('.')[0]
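
Before spending Bedrock calls, the synchronous and asynchronous paths of such a wrapper can be checked against a stub model. The sketch below is standalone and makes no network calls: `StubChatModel` and `WrapperSketch` are hypothetical names, and `WrapperSketch` repeats the method bodies of the class above without inheriting from `DeepEvalBaseLLM`, so that it runs without `deepeval` installed.

```python
import asyncio
from types import SimpleNamespace


class StubChatModel:
    """Hypothetical stand-in for a LangChain chat model; makes no network calls."""
    model_id = "stub.echo-model-v1"

    def invoke(self, prompt: str):
        # Chat models return a message object whose text lives in `.content`.
        return SimpleNamespace(content=f"echo: {prompt}")

    async def ainvoke(self, prompt: str):
        return self.invoke(prompt)


class WrapperSketch:
    """Same method bodies as the wrapper above, minus the DeepEval base class."""

    def __init__(self, model):
        self.model = model

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        return self.load_model().invoke(prompt).content

    async def a_generate(self, prompt: str) -> str:
        res = await self.load_model().ainvoke(prompt)
        return res.content

    def get_model_name(self) -> str:
        return self.load_model().model_id


wrapper = WrapperSketch(StubChatModel())
sync_out = wrapper.generate("ping")
# In a plain script; inside Jupyter use `await wrapper.a_generate("ping")` instead.
async_out = asyncio.run(wrapper.a_generate("ping"))
print(sync_out, async_out, wrapper.get_model_name())
```

Both paths should return a plain string, since the metrics consume the text of the response rather than the message object.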

In [7]:
llama3_1_70b_deepeval = BedrockTextGenDeepEval(model=llama3_1_70b_langchain)

### Define simple RAG application
---

Let's define a simple `chain` using the `langchain` framework to get the source documents and the LLM response.

In [8]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate


system_prompt = ('''
You are an expert, truthful assistant. You will be provided the task by human.
Use the given context only to respond to the request.

Here is the context: {context}
''')

prompt_template = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    ("human", "{input}"),
])

qna_chain = create_stuff_documents_chain(llama3_70b_langchain, prompt_template)
rag_chain = create_retrieval_chain(chroma_retriver, qna_chain)
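
Conceptually, the "stuff" step places every retrieved chunk into the `{context}` slot of the prompt, and the retrieval chain returns the retrieved chunks alongside the answer. The following standalone sketch imitates that flow in plain Python. `fake_retrieve`, `fake_llm`, `run_rag`, and the `"\n\n"` separator are assumptions made for illustration, not LangChain's implementation.

```python
def fake_retrieve(question: str, k: int = 3) -> list[str]:
    """Hypothetical retriever: returns k canned chunks instead of querying Chroma."""
    corpus = ["chunk A", "chunk B", "chunk C", "chunk D"]
    return corpus[:k]


def fake_llm(prompt: str) -> str:
    """Hypothetical LLM: reports the prompt length instead of generating text."""
    return f"(answer based on a {len(prompt)}-character prompt)"


def run_rag(question: str, k: int = 3) -> dict:
    chunks = fake_retrieve(question, k)
    # "Stuffing": all retrieved chunks are concatenated into one context string.
    context = "\n\n".join(chunks)
    prompt = f"Use the given context only.\n\nContext: {context}\n\nRequest: {question}"
    # Mirror the result keys that this notebook reads later: input, context, answer.
    return {"input": question, "context": chunks, "answer": fake_llm(prompt)}


result = run_rag("What are primitives?")
print(result["context"])
```

Keeping the retrieved chunks in the result matters for evaluation, because the retrieval metrics need the chunks themselves and not only the final answer.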

### Retrieval evaluation
---

In this section, we focus on evaluating the **retrieval** component with three metrics:

- [**Contextual Precision**](https://docs.confident-ai.com/docs/metrics-contextual-precision): this metric measures whether the chunks in `retrieval_context` that are relevant to the given `input` are ranked higher than the irrelevant ones.
- [**Contextual Recall**](https://docs.confident-ai.com/docs/metrics-contextual-recall): this metric measures the quality of the retriever by evaluating the extent to which the `retrieval_context` aligns with the `expected_output`.
- [**Contextual Relevancy**](https://docs.confident-ai.com/docs/metrics-contextual-relevancy): this metric measures the overall relevance (or quality) of the information presented in your `retrieval_context` for a given `input`.
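
Each of these metrics asks the evaluation LLM for per-item verdicts and then reduces them to a score between 0 and 1. The sketch below shows only that reduction step, based on our reading of the formulas in the linked DeepEval documentation. The function names are hypothetical and the verdict lists are made up; in practice the verdicts come from the judge model.

```python
def contextual_precision(relevant_by_rank: list[bool]) -> float:
    """Weighted cumulative precision: rewards relevant chunks that rank early."""
    total_relevant = sum(relevant_by_rank)
    if total_relevant == 0:
        return 0.0
    score = 0.0
    relevant_so_far = 0
    for k, is_relevant in enumerate(relevant_by_rank, start=1):
        if is_relevant:
            relevant_so_far += 1
            score += relevant_so_far / k  # precision at rank k
    return score / total_relevant


def fraction_true(verdicts: list[bool]) -> float:
    """Share of positive verdicts; the reduction used for recall and relevancy."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


# Same chunks, different order: only precision is sensitive to ranking.
well_ranked = contextual_precision([True, True, False])
badly_ranked = contextual_precision([False, True, True])
# Recall: 2 of 3 expected-output sentences are supported by the retrieved chunks.
recall = fraction_true([True, True, False])
print(well_ranked, round(badly_ranked, 2), round(recall, 2))
```

The comparison between `well_ranked` and `badly_ranked` shows why precision is the metric to watch when tuning a reranker, while recall and relevancy respond to what is retrieved rather than to its order.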

#### Define the metrics

In [9]:
from deepeval import evaluate
from deepeval.metrics import (
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric
)

contextual_precision_metric = ContextualPrecisionMetric(
    threshold=0.8,
    model=llama3_1_70b_deepeval,
    include_reason=True
)

contextual_recall_metric = ContextualRecallMetric(
    threshold=0.8,
    model=llama3_1_70b_deepeval,
    include_reason=True
)

contextual_relevancy_metric = ContextualRelevancyMetric(
    threshold=0.8,
    model=llama3_1_70b_deepeval,
    include_reason=True
)

Let's examine one example from the evaluation dataset.

In [10]:
sample_question = eval_df.input[1]
expected_output = eval_df.expected_output[1]
context = eval_df.context[1]
_rag_result = rag_chain.invoke({"input": sample_question})
llm_resp = _rag_result.get('answer', '').strip()
retrieval_context_list = [
    doc.page_content for doc in _rag_result.get('context')
]

#### Construct LLMTestCase

In [11]:
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input=sample_question,
    actual_output=llm_resp,
    expected_output=expected_output,
    retrieval_context=retrieval_context_list,
    context=context
)

#### Get the metric
---
There are two ways to get the metric score and reason:

1. **One-by-one**: by using the metric's `measure` method on the `LLMTestCase`, you can get the metric evaluation one-by-one.
2. **Bulk**: you can pass multiple test cases and multiple evaluation metrics to DeepEval's `evaluate` method.
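
For bulk evaluation, every dataset row has to be turned into the fields that `LLMTestCase` expects. Below is a minimal sketch of that mapping, using stand-in objects so that it runs without a vector store or an LLM. `to_test_case_kwargs` is a hypothetical helper; its output would be passed on as `LLMTestCase(**kwargs)`.

```python
from types import SimpleNamespace


def to_test_case_kwargs(row: dict, rag_result: dict) -> dict:
    """Map one dataset row plus one RAG result onto LLMTestCase field names."""
    return {
        "input": row["input"],
        "actual_output": rag_result.get("answer", "").strip(),
        "expected_output": row["expected_output"],
        "retrieval_context": [
            doc.page_content for doc in rag_result.get("context", [])
        ],
        "context": row["context"],
    }


# Stand-ins for one eval_df row and one rag_chain.invoke(...) result.
row = {"input": "Q1", "expected_output": "A1", "context": ["gold chunk"]}
rag_result = {
    "answer": "  generated answer \n",
    "context": [SimpleNamespace(page_content="retrieved chunk")],
}

kwargs = to_test_case_kwargs(row, rag_result)
print(kwargs["retrieval_context"])
```

With the real objects, the same helper could be applied to each record of `eval_df.to_dict("records")`, and the resulting list of test cases passed to `evaluate` in one call.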

In [12]:
from IPython.display import display, Markdown

contextual_precision_metric.measure(test_case, _show_indicator=False)
display(Markdown(
    "<font color='blue'>'Contextual precision: {}".format(contextual_precision_metric.reason)
))

contextual_recall_metric.measure(test_case, _show_indicator=False)
display(Markdown(
    "<font color='green'>'Contextual recall: {}".format(contextual_recall_metric.reason)
))
contextual_relevancy_metric.measure(test_case, _show_indicator=False)
display(Markdown(
    "<font color='brown'>'Contextual relevancy: {}".format(contextual_relevancy_metric.reason)
))

Event loop is already running. Applying nest_asyncio patch to allow async execution...


<font color='blue'>'Contextual precision: The score is 1.00 because all relevant nodes in the retrieval context are ranked higher than irrelevant nodes, which in this case, there are none. The top-ranked nodes all contain highly relevant information, such as 'discrete, foundational building blocks' (rank 1), 'accelerating builders' ability to innovate' (rank 2), and 'building the right set of primitives' (rank 3), perfectly addressing the input's requirements.

<font color='green'>'Contextual recall: The score is 0.67 because the retrieval context partially supports the expected output, with sentences 1, 2, and 4 being attributed to nodes in the retrieval context, specifically the 1st and 2nd nodes, but sentences 3 and 5 are general statements that do not directly attribute to any specific part of the retrieval context.

<font color='brown'>'Contextual relevancy: The score is 0.60 because while the retrieval context provides some relevant information about Amazon's approach to primitives, such as 'building primitive services' and 'primitives rapidly accelerate builders' ability to innovate', it lacks specific comparisons to traditional monolithic solutions and key differences, as noted in the irrelevant statements, e.g. 'does not specifically compare Amazon's approach to empowering builders and improving customer experiences through primitives vs. traditional monolithic solutions'.

In [13]:
retrieval_report = evaluate(
    test_cases=[test_case], 
    metrics=[contextual_precision_metric, contextual_recall_metric, contextual_relevancy_metric],
    show_indicator=False,
    ignore_errors=True,
    print_results=True
)

Metrics Summary

  - ✅ Contextual Precision (score: 1.0, threshold: 0.8, strict: False, evaluation model: meta.llama3-1-70b-instruct-v1:0, reason: The score is 1.00 because all nodes in the retrieval context are relevant to the input, with the first node clearly explaining the concept of primitives, the second node highlighting their benefits, and the third node providing guidance on building the right set of primitives, all of which are key points in the expected output., error: None)
  - ❌ Contextual Recall (score: 0.6666666666666666, threshold: 0.8, strict: False, evaluation model: meta.llama3-1-70b-instruct-v1:0, reason: The score is 0.67 because some sentences in the expected output can be attributed to specific nodes in the retrieval context, such as sentences 1 and 3 being related to the 1st node's description of primitives, and sentence 2 being related to the 2nd node's mention of primitive

### Generation evaluation
---

The two main evaluation metrics for the **generation** component are Answer Relevancy and Faithfulness. You can add other metrics, such as Bias and Toxicity, to fit your use case.

- [**Answer Relevancy**](https://docs.confident-ai.com/docs/metrics-answer-relevancy): this metric measures how relevant the `actual_output` of your LLM application is compared to the provided `input`.
- [**Faithfulness**](https://docs.confident-ai.com/docs/metrics-faithfulness): this metric measures the generator by evaluating whether the `actual_output` factually <u>aligns with the contents of your `retrieval_context`</u>.
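- [**Bias**](https://docs.confident-ai.com/docs/metrics-bias): this metric measures whether your LLM outputs contain gender, racial, or political bias. A lower score is better, with 0 meaning no bias was detected, so the `threshold` acts as a maximum rather than a minimum.
- [**Toxicity**](https://docs.confident-ai.com/docs/metrics-toxicity): this metric measures whether your LLM outputs contain any toxicity. As with Bias, a lower score is better.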
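
The pass/fail decision depends on the direction of each metric. For Answer Relevancy and Faithfulness a higher score is better, while the Bias and Toxicity results in this notebook report 0.00 for a clean output, so lower is better there. The helper below is a hypothetical sketch of that comparison, assuming the threshold is a minimum for the first group and a maximum for the second. It is not DeepEval's own code.

```python
LOWER_IS_BETTER = {"Bias", "Toxicity"}


def is_successful(metric_name: str, score: float, threshold: float) -> bool:
    """Return True when the score is on the passing side of the threshold."""
    if metric_name in LOWER_IS_BETTER:
        return score <= threshold  # threshold acts as a maximum
    return score >= threshold      # threshold acts as a minimum


print(is_successful("Faithfulness", 1.0, 0.8))
print(is_successful("Bias", 0.8, 0.8))
print(is_successful("Bias", 0.8, 0.5))
```

Under this assumption, a threshold of 0.8 is lenient for Bias and Toxicity, since an output only fails once its score exceeds 0.8. A lower value may be more appropriate for these two metrics.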

#### Define the metrics

In [14]:
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    BiasMetric,
    ToxicityMetric
)

answer_relevancy_metric = AnswerRelevancyMetric(
    threshold=0.8,
    model=llama3_1_70b_deepeval,
    include_reason=True
)

faithfulness_metric = FaithfulnessMetric(
    threshold=0.8,
    model=llama3_1_70b_deepeval,
    include_reason=True
)

bias_metric = BiasMetric(
    threshold=0.8,
    model=llama3_1_70b_deepeval,
    include_reason=True
)


toxicity_metric = ToxicityMetric(
    threshold=0.8,
    model=llama3_1_70b_deepeval,
    include_reason=True
)

#### Construct LLMTestCase
---
Let's reuse the same question and answer for this.

In [15]:
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input=sample_question,
    actual_output=llm_resp,
    expected_output=expected_output,
    retrieval_context=retrieval_context_list,
    context=context
)

#### Get the metric
---
There are two ways to get the metric score and reason, the same as in the **Retrieval** section:

1. **One-by-one**: by using the metric's `measure` method on the `LLMTestCase`, you can get the metric evaluation one-by-one.
2. **Bulk**: you can pass multiple test cases and multiple evaluation metrics to DeepEval's `evaluate` method.

In [16]:
from IPython.display import display, Markdown

answer_relevancy_metric.measure(test_case, _show_indicator=False)
display(Markdown(
    "<font color='blue'>'Answer relevancy: {}".format(answer_relevancy_metric.reason)
))
faithfulness_metric.measure(test_case, _show_indicator=False)
display(Markdown(
    "<font color='green'>'Faithfulness: {}".format(faithfulness_metric.reason)
))
bias_metric.measure(test_case, _show_indicator=False)
display(Markdown(
    "<font color='brown'>'Bias: {}".format(bias_metric.reason)
))
toxicity_metric.measure(test_case, _show_indicator=False)
display(Markdown(
    "<font color='brown'>'Toxicity: {}".format(toxicity_metric.reason)
))

<font color='blue'>'Answer relevancy: The score is 1.00 because the actual output perfectly addresses the input, providing a thorough comparison of Amazon's approach to empowering builders and improving customer experiences through primitives vs. traditional monolithic solutions, with no irrelevant statements.

<font color='green'>'Faithfulness: The score is 1.00 because there are no contradictions, indicating a perfect alignment between the actual output and the retrieval context!

<font color='brown'>'Bias: The score is 0.00 because the actual output appears to be unbiased, as there are no reasons listed to suggest otherwise.

<font color='brown'>'Toxicity: The score is 0.00 because the actual output is completely respectful and does not contain any toxic language.

In [17]:
generator_report = evaluate(
    test_cases=[test_case], 
    metrics=[
        answer_relevancy_metric, faithfulness_metric, bias_metric, toxicity_metric
    ],
    show_indicator=False,
    ignore_errors=True,
    print_results=True
)

Metrics Summary

  - ✅ Answer Relevancy (score: 1.0, threshold: 0.8, strict: False, evaluation model: meta.llama3-1-70b-instruct-v1:0, reason: The score is 1.00 because the actual output perfectly addresses the input, providing a thorough comparison of Amazon's approach to empowering builders and improving customer experiences through primitives vs. traditional monolithic solutions, with no irrelevant statements., error: None)
  - ✅ Faithfulness (score: 1.0, threshold: 0.8, strict: False, evaluation model: meta.llama3-1-70b-instruct-v1:0, reason: The score is 1.00 because there are no contradictions, indicating a perfect alignment between the actual output and the retrieval context!, error: None)
  - ✅ Bias (score: 0.8, threshold: 0.8, strict: False, evaluation model: meta.llama3-1-70b-instruct-v1:0, reason: The score is 0.80 because the actual output reveals a clear bias towards Amazon's prim

## Summary
---

`DeepEval` is an open-source Python library for evaluating LLM applications, including Retrieval-Augmented Generation (RAG) pipelines. In this notebook we used its retrieval metrics (contextual precision, recall, and relevancy) and its generation metrics (answer relevancy, faithfulness, bias, and toxicity). Because the metrics are decoupled into **retrieval** and **generation** components, developers can assess each step separately and pinpoint which component needs to be improved.

`DeepEval` also supports bulk evaluation through `evaluate`, which makes it practical to compare different RAG models or configurations. The library supports custom metrics and custom evaluation models, and each metric returns a score together with a reason, so the results point to specific areas for improvement rather than a single aggregate number.

However, its reliance on a **Large Language Model (LLM)** as the judge introduces potential drawbacks, including bias, additional cost, dependency on the evaluation model, and inconsistent scores between runs. <u>Users should be aware of these limitations and consider supplementary evaluation methods to ensure a well-rounded assessment of their RAG applications</u>.
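
The consistency concern is visible in this notebook: the same test case received a Bias score of 0.00 when measured on its own and 0.80 in the bulk run. One way to quantify such variation is to score the same test case several times and look at the spread. Here is a minimal sketch, where `score_spread` is a hypothetical helper and the input list would come from repeated `measure` calls:

```python
from statistics import mean, pstdev


def score_spread(scores: list[float]) -> dict:
    """Summarize repeated scores of one metric on one test case."""
    return {
        "mean": mean(scores),
        "stdev": pstdev(scores),  # population standard deviation
        "range": max(scores) - min(scores),
    }


# The two Bias scores recorded in this notebook for the same test case.
summary = score_spread([0.0, 0.8])
print(summary)
```

A large range relative to the threshold suggests that a single run is not enough to decide whether the test case passes.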