# DeepEval Evaluation Framework

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and limitations under the License.

Copyright 2024 Amazon Web Services, Inc.

![OVERALL FLOW PROCESS](../images/workflow.png)

In this notebook, we evaluate sample output from an LLM for a RAG use case using DeepEval. DeepEval is a simple-to-use, open-source LLM evaluation framework. It is similar to Pytest but specialized for unit testing LLM outputs. DeepEval incorporates the latest research to evaluate LLM outputs based on metrics such as G-Eval, hallucination, answer relevancy, and RAGAS, which use LLMs and various other NLP models that run locally on your machine for evaluation.

Whether your application is implemented via RAG or fine-tuning, LangChain or LlamaIndex, DeepEval has you covered. With it, you can easily determine the optimal hyperparameters to improve your RAG pipeline, prevent prompt drifting, or even transition from OpenAI to hosting your own Llama2 with confidence.

## Install Requirements

In [1]:
# Install from public library
!pip install deepeval==0.21.32 --quiet
#!pip install -U deepeval --quiet

In [2]:
!pip install langchain_text_splitters --quiet

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
langchain 0.1.9 requires langchain-core<0.2,>=0.1.26, but you have langchain-core 0.3.12 which is incompatible.
langchain-aws 0.1.6 requires boto3<1.35.0,>=1.34.51, but you have boto3 1.33.9 which is incompatible.
langchain-aws 0.1.6 requires langchain-core<0.3,>=0.1.45, but you have langchain-core 0.3.12 which is incompatible.
langchain-community 0.0.38 requires langchain-core<0.2.0,>=0.1.52, but you have langchain-core 0.3.12 which is incompatible.
langchain-openai 0.1.7 requires langchain-core<0.3,>=0.1.46, but you have langchain-core 0.3.12 which is incompatible.

In [3]:
!pip install langchain_aws==0.1.6 --quiet

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
aiobotocore 2.13.1 requires botocore<1.34.132,>=1.34.70, but you have botocore 1.34.162 which is incompatible.
amazon-sagemaker-sql-magic 0.1.3 requires sqlparse==0.5.0, but you have sqlparse 0.5.1 which is incompatible.
autogluon-common 0.8.3 requires pandas<1.6,>=1.4.1, but you have pandas 2.1.4 which is incompatible.
autogluon-core 0.8.3 requires pandas<1.6,>=1.4.1, but you have pandas 2.1.4 which is incompatible.
autogluon-core 0.8.3 requires scikit-learn<1.4.1,>=1.1, but you have scikit-learn 1.4.2 which is incompatible.
autogluon-features 0.8.3 requires pandas<1.6,>=1.4.1, but you have pandas 2.1.4 which is incompatible.
autogluon-features 0.8.3 requires scikit-learn<1.4.1,>=1.1, but you have scikit-learn 1.4.2 which is incompatible.
autogluon-multimodal 0.8.3 requires pandas<1.6,>=1.4.1, but you have p

You can ignore any package-version conflict messages printed during installation.

## Import Libraries

In [4]:
import pandas as pd
import sys
from langchain_aws import ChatBedrock
from datasets import Dataset
from ast import literal_eval
from botocore.client import Config

In [5]:
AWS_REGION = "us-west-2"

In [6]:
# Import DeepEval Libraries
from deepeval.models.base_model import DeepEvalBaseLLM

from deepeval.metrics import AnswerRelevancyMetric
from deepeval.metrics import FaithfulnessMetric
from deepeval.metrics import BiasMetric
from deepeval.metrics import ToxicityMetric

from deepeval import evaluate
from deepeval.dataset import EvaluationDataset




## Load and pre-process the samples for evaluation

#### Load the samples
We use a sample data file stored in '../data/inputs/'. To bring your own data, store it as a .csv file under the '../../outputs/rag/rag_outputs' directory and make sure it contains the exact columns used in this notebook, i.e. question, ground_truths, llm_answer, llm_contexts, and so on.
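
Before evaluating your own file, it can help to confirm that the columns read by the later cells are present. The following is a minimal sketch: `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names introduced here for illustration, and the column list is taken from the columns the cells in this notebook access.

```python
# Columns read by the evaluation cells in this notebook.
REQUIRED_COLUMNS = ("question", "ground_truths", "llm_answer", "llm_contexts")


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]


# Example with a plain list of column names.
example_columns = ["question", "llm_answer", "llm_contexts"]
print(missing_columns(example_columns))
```

Once the CSV is loaded with pandas, the same helper can be called with the DataFrame's `columns` attribute; an empty list means nothing is missing.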

RAG output results for different models (output_openai.csv, output_claude.csv, output_mistral.csv) are ready to be evaluated; they are stored under ../../outputs/rag/rag_outputs.
To run this notebook, set the name of the model whose output you want to evaluate:

In [7]:
# src model
# model_output = "mistral"
# model_output="openai"
model_output = "claude"
result_csv_file = (
    f"../outputs/rag_outputs/output_{model_output}.csv"
)  # change filename to the one that needs to be evaluated
print(result_csv_file)

../outputs/rag_outputs/output_claude.csv


In [8]:
#result_csv_file ='../data/inputs/output_mistral.csv'

#load csv sample file
input_dataset = pd.read_csv(result_csv_file, index_col=0)

input_dataset.head()

financebench_id,doc_name,doc_link,doc_period,question_type,question,ground_truths,evidence_text,page_number,llm_answer,llm_contexts,latency_meta_time,latency_meta_kwd,latency_meta_comb,latency_meta_ans_gen,input_tokens,output_tokens
financebench_id_03029,3M_2018_10K,https://investors.3m.com/financials/sec-filing...,2018,metrics-generated,What is the FY2018 capital expenditure amount ...,['$1577.00'],Table of Contents \n3M Company and Subsidiarie...,60,According to the cash flow statement in the 3M...,['<<Paragraph>> [Source File: 3M_2018_10K] \n ...,0.927055,0.606658,1.448762,2.483714,21147,401
financebench_id_00499,3M_2022_10K,https://investors.3m.com/financials/sec-filing...,2022,domain-relevant,Is 3M a capital-intensive business based on FY...,"['No, the company is managing its CAPEX and Fi...",3M Company and Subsidiaries\n Consolidated Sta...,485052,Based on the financial information provided in...,['<<Paragraph>> [Source File: 3M_2022_10K] \n ...,0.714294,0.503733,1.938756,5.232772,23180,635
financebench_id_00438,ADOBE_2022_10K,https://www.adobe.com/pdf-page.html?pdfTarget=...,2022,domain-relevant,Does Adobe have an improving operating margin ...,['No the operating margins of Adobe have recen...,ADOBE INC.\nCONSOLIDATED STATEMENTS OF INCOME\...,54,Based on the financial information provided in...,['<<Paragraph>> [Source File: ADOBE_2022_10K] ...,0.774023,3.742355,1.554743,4.361795,13324,1038
financebench_id_00591,ADOBE_2022_10K,https://www.adobe.com/pdf-page.html?pdfTarget=...,2022,novel-generated,Does Adobe have an improving Free cashflow con...,"['Yes, the FCF conversion (using net income as...",ADOBE INC.\n CONSOLIDATED STATEMENTS OF CASH F...,57,Based on the financial information provided in...,['<<Paragraph>> [Source File: ADOBE_2022_10K] ...,0.891798,0.576164,1.379593,3.781731,12957,526
financebench_id_03069,AMD_2015_10K,https://ir.amd.com/sec-filings/filter/annual-f...,2015,metrics-generated,Answer the following question as if you are an...,['4.2%'],ITEM 8.\nFINANCIAL STATEMENTS AND SUPPLEMENTAR...,5660,According to the details in the Profit and Los...,['<<Paragraph>> [Source File: AMD_2015_10K] \n...,0.821045,0.420084,1.686061,2.92591,24266,494


### Instantiate Bedrock Class - Inherits DeepEvalBaseLLM

In [9]:
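class AWSBedrock(DeepEvalBaseLLM):
    """Wraps a LangChain Bedrock chat model so DeepEval can use it as the evaluation LLM."""

    def __init__(self, model, max_retries: int = 3):
        self.model = model
        self.max_retries = max_retries  # upper bound on invoke attempts per prompt

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        chat_model = self.load_model()
        # Retry a bounded number of times instead of recursing without limit
        for attempt in range(1, self.max_retries + 1):
            try:
                return chat_model.invoke(prompt).content
            except Exception as err:
                print(f"Issue with Invoke (attempt {attempt}/{self.max_retries}): {err}")
                if attempt == self.max_retries:
                    raise

    async def a_generate(self, prompt: str) -> str:
        chat_model = self.load_model()
        for attempt in range(1, self.max_retries + 1):
            try:
                res = await chat_model.ainvoke(prompt)
                return res.content
            except Exception as err:
                print(f"Issue with Invoke (attempt {attempt}/{self.max_retries}): {err}")
                if attempt == self.max_retries:
                    raise

    def get_model_name(self):
        return "Bedrock Model"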

# Use of Claude 3 Sonnet
custom_model = ChatBedrock(
    #credentials_profile_name="default", # e.g. "default"
    region_name=AWS_REGION, # e.g. "us-east-1"
    endpoint_url=f"https://bedrock-runtime.{AWS_REGION}.amazonaws.com", # e.g. "https://bedrock-runtime.us-east-1.amazonaws.com"
    model_id='anthropic.claude-3-sonnet-20240229-v1:0', # e.g. "anthropic.claude-v2:1"
    model_kwargs={"temperature": 0.4},
)

aws_bedrock = AWSBedrock(model=custom_model)
print(aws_bedrock.generate("Write me a joke"))

Here's a joke for you:

Why can't a bicycle stand up by itself? It's two-tired!


In DeepEval, context is the ideal context, which typically comes from the evaluation dataset, whereas retrieval context is the context actually retrieved at runtime. In other words, context is the "ground truth" context and retrieval context is what the retriever returned.
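
In the sample file, each llm_contexts cell appears to hold a Python list written out as a string (for example `['<<Paragraph>> [Source File: ...`). If you need the individual passages, one option is to parse the string back into a list. The sketch below assumes that cell format; `parse_contexts` is a hypothetical helper, not part of this notebook or of DeepEval.

```python
from ast import literal_eval


def parse_contexts(cell):
    """Turn a stringified list such as "['a', 'b']" into a list of strings.

    Falls back to a single-item list when the cell is not a valid list literal.
    """
    try:
        value = literal_eval(cell)
    except (ValueError, SyntaxError):
        return [str(cell)]
    if isinstance(value, (list, tuple)):
        return [str(item) for item in value]
    return [str(value)]


print(parse_contexts("['first passage', 'second passage']"))
print(parse_contexts("plain text; not a list"))
```

`literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for text read from a file.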

In [10]:
def get_deep_eval(questions, answers, contexts, gts, metrics):
    """Run the given DeepEval metrics over the question/answer samples and return the results."""
    data = {
        'question': questions,
        'answer': answers,
        'contexts': contexts,
        'ground_truths': gts
    }
    # Write the samples to a CSV file so DeepEval can load them as test cases
    df = pd.DataFrame(data)
    df.to_csv("input.csv", index=False)

    dataset = EvaluationDataset()
    dataset.add_test_cases_from_csv_file(
        file_path='input.csv',
        input_col_name='question',
        actual_output_col_name='answer',
        expected_output_col_name='ground_truths',
        context_col_name='contexts',
    )

    # Evaluate every test case against every metric
    result = evaluate(dataset, metrics, print_results=False)
    return result

## Metrics in DeepEval for evaluation

In DeepEval, a metric serves as a standard of measurement for evaluating the performance of an LLM output against specific criteria of interest. Essentially, the metric acts as the ruler, while a test case represents the thing you are trying to measure. DeepEval offers a range of default metrics to help you get started quickly. Here, we pick Answer Relevancy, Faithfulness, Bias, and Toxicity as the metrics for evaluating the question-answering use case.

- Answer Relevancy: The answer relevancy metric measures the quality of your RAG pipeline's generator by evaluating how relevant the actual_output of your LLM application is to the provided input.
- Faithfulness: The faithfulness metric measures the quality of your RAG pipeline's generator by evaluating whether the actual_output factually aligns with the contents of your retrieval_context.
- Toxicity: The toxicity metric is a referenceless metric that evaluates toxicity in your LLM outputs.
- Bias: The bias metric determines whether your LLM output contains gender, racial, or political bias.
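
As described in the DeepEval documentation, each of these four scores is a ratio: relevant statements over all statements for answer relevancy, supported claims over all claims for faithfulness, and biased or toxic opinions over all opinions for bias and toxicity. The sketch below only illustrates that arithmetic. `ratio_score` is a hypothetical helper, the counts are made up, and in DeepEval the counts themselves come from the evaluation LLM.

```python
def ratio_score(flagged: int, total: int, empty_default: float = 0.0) -> float:
    """Share of items that were flagged, e.g. relevant statements / all statements.

    `empty_default` is returned when there is nothing to count (an assumption
    made for this sketch, not DeepEval's documented behaviour).
    """
    if not 0 <= flagged <= total:
        raise ValueError("expected 0 <= flagged <= total")
    if total == 0:
        return empty_default
    return flagged / total


# Made-up counts: 3 of 4 statements judged relevant, 0 of 5 opinions judged biased.
print(ratio_score(3, 4))
print(ratio_score(0, 5))
```

Read this way, higher is better for answer relevancy and faithfulness, while lower is better for bias and toxicity.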

### Define metrics for evaluation and run DeepEval evaluation for the input dataset

Here we use Answer Relevancy, Faithfulness, Bias, and Toxicity as the metrics to evaluate the question-answering use case.


In [11]:
faithfulness_metric = FaithfulnessMetric(model=aws_bedrock)
bias_metric = BiasMetric(model=aws_bedrock)
toxicity_metric = ToxicityMetric(model=aws_bedrock)
answer_relevancy_metric = AnswerRelevancyMetric(model=aws_bedrock)

metrics = [
        answer_relevancy_metric,
        faithfulness_metric, 
        bias_metric, 
        toxicity_metric,
    ]
deep_eval_results = get_deep_eval(input_dataset['question'], input_dataset['llm_answer'], 
                                  input_dataset['llm_contexts'], input_dataset['ground_truths'],
                                  metrics)

Event loop is already running. Applying nest_asyncio patch to allow async execution...







Print the metrics used in the evaluation results:

In [12]:
deep_eval_results[0].metrics

[<deepeval.metrics.answer_relevancy.answer_relevancy.AnswerRelevancyMetric at 0x7f7e75f1e050>,
 <deepeval.metrics.faithfulness.faithfulness.FaithfulnessMetric at 0x7f7e75f1e260>,
 <deepeval.metrics.bias.bias.BiasMetric at 0x7f7e75f1f580>,
 <deepeval.metrics.toxicity.toxicity.ToxicityMetric at 0x7f7e75f1dbd0>]

### Calculate evaluation scores for each metric and append the scores to the input dataset

In [13]:
answer_relevance_scores = []
faithfulness_scores = []
bias_scores = []
toxicity_scores = []
for r in deep_eval_results: 
    answer_relevance_scores += [r.metrics[0].score]
    faithfulness_scores += [r.metrics[1].score]
    bias_scores += [r.metrics[2].score]
    toxicity_scores += [r.metrics[3].score]
input_dataset['answer_relevance'] = answer_relevance_scores
input_dataset['faithfulness'] = faithfulness_scores
input_dataset['bias'] = bias_scores
input_dataset['toxicity'] = toxicity_scores
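
The loop above relies on the metrics appearing in the same order in every result. A possible alternative, sketched here under the assumption that each result exposes `.metrics` and each metric exposes `.score` as in the cell above, is to group scores by metric class name. `scores_by_metric` and the stand-in classes are hypothetical and exist only to make the example runnable without DeepEval.

```python
from collections import defaultdict
from types import SimpleNamespace


def scores_by_metric(results):
    """Group scores by metric class name instead of by list position."""
    scores = defaultdict(list)
    for result in results:
        for metric in result.metrics:
            scores[type(metric).__name__].append(metric.score)
    return dict(scores)


# Stand-in classes that mimic only the attributes used above.
class AnswerRelevancyMetric(SimpleNamespace):
    pass


class BiasMetric(SimpleNamespace):
    pass


fake_results = [
    SimpleNamespace(metrics=[AnswerRelevancyMetric(score=0.75), BiasMetric(score=0.0)]),
    SimpleNamespace(metrics=[BiasMetric(score=0.0), AnswerRelevancyMetric(score=1.0)]),
]
print(scores_by_metric(fake_results))
```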

#### Displaying the input dataset along with its evaluation scores

In [14]:
input_dataset

financebench_id,doc_name,doc_link,doc_period,question_type,question,ground_truths,evidence_text,page_number,llm_answer,llm_contexts,latency_meta_time,latency_meta_kwd,latency_meta_comb,latency_meta_ans_gen,input_tokens,output_tokens,answer_relevance,faithfulness,bias,toxicity
financebench_id_03029,3M_2018_10K,https://investors.3m.com/financials/sec-filing...,2018,metrics-generated,What is the FY2018 capital expenditure amount ...,['$1577.00'],Table of Contents \n3M Company and Subsidiarie...,60,According to the cash flow statement in the 3M...,['<<Paragraph>> [Source File: 3M_2018_10K] \n ...,0.927055,0.606658,1.448762,2.483714,21147,401,0.75,1.0,0.0,0.0
financebench_id_00499,3M_2022_10K,https://investors.3m.com/financials/sec-filing...,2022,domain-relevant,Is 3M a capital-intensive business based on FY...,"['No, the company is managing its CAPEX and Fi...",3M Company and Subsidiaries\n Consolidated Sta...,485052,Based on the financial information provided in...,['<<Paragraph>> [Source File: 3M_2022_10K] \n ...,0.714294,0.503733,1.938756,5.232772,23180,635,1.0,1.0,0.0,0.0
financebench_id_00438,ADOBE_2022_10K,https://www.adobe.com/pdf-page.html?pdfTarget=...,2022,domain-relevant,Does Adobe have an improving operating margin ...,['No the operating margins of Adobe have recen...,ADOBE INC.\nCONSOLIDATED STATEMENTS OF INCOME\...,54,Based on the financial information provided in...,['<<Paragraph>> [Source File: ADOBE_2022_10K] ...,0.774023,3.742355,1.554743,4.361795,13324,1038,0.666667,1.0,0.0,0.0
financebench_id_00591,ADOBE_2022_10K,https://www.adobe.com/pdf-page.html?pdfTarget=...,2022,novel-generated,Does Adobe have an improving Free cashflow con...,"['Yes, the FCF conversion (using net income as...",ADOBE INC.\n CONSOLIDATED STATEMENTS OF CASH F...,57,Based on the financial information provided in...,['<<Paragraph>> [Source File: ADOBE_2022_10K] ...,0.891798,0.576164,1.379593,3.781731,12957,526,0.8,1.0,0.0,0.0
financebench_id_03069,AMD_2015_10K,https://ir.amd.com/sec-filings/filter/annual-f...,2015,metrics-generated,Answer the following question as if you are an...,['4.2%'],ITEM 8.\nFINANCIAL STATEMENTS AND SUPPLEMENTAR...,5660,According to the details in the Profit and Los...,['<<Paragraph>> [Source File: AMD_2015_10K] \n...,0.821045,0.420084,1.686061,2.92591,24266,494,0.8,1.0,0.0,0.0
financebench_id_00563,AMD_2022_10K,https://ir.amd.com/sec-filings/content/0000002...,2022,novel-generated,"From FY21 to FY22, excluding Embedded, in whic...",['Data Center'],"Year Ended\nDecember 31,\n2022\nDecember 25,\n...",48,"Based on the financial information provided, e...",['<<Paragraph>> [Source File: AMD_2022_10K] \n...,0.99407,0.424838,2.337973,3.484417,15279,565,0.833333,1.0,0.0,0.0
financebench_id_01351,AMERICANEXPRESS_2022_10K,https://s26.q4cdn.com/747928648/files/doc_fina...,2022,domain-relevant,How much has the effective tax rate of America...,['The effective tax rate for American Express ...,TABLE 1: SUMMARY OF FINANCIAL PERFORMANCE\nYea...,44,The effective tax rate of American Express dec...,['<<Paragraph>> [Source File: AMERICANEXPRESS_...,0.991014,0.405639,6.266499,2.848354,18580,777,0.8,1.0,0.0,0.0
financebench_id_01964,AMERICANEXPRESS_2022_10K,https://s26.q4cdn.com/747928648/files/doc_fina...,2022,novel-generated,What was the largest liability in American Exp...,['Customer deposits'],CONSOLIDATED BALANCE SHEETS\nDecember 31 (Mill...,98,"According to the financial statements, the lar...",['<<Paragraph>> [Source File: AMERICANEXPRESS_...,0.725279,0.290234,1.082905,2.099063,19017,314,0.8,1.0,0.0,0.0
financebench_id_04417,BESTBUY_2019_10K,https://d18rn0p25nwr6d.cloudfront.net/CIK-0000...,2019,metrics-generated,What is the year end FY2019 total amount of in...,['$5409.00'],Table of Contents\nConsolidated Balance Sheets...,52,According to the Consolidated Balance Sheets i...,['<<Paragraph>> [Source File: BESTBUY_2019_10K...,0.765973,0.43235,1.355954,2.428678,17732,407,1.0,1.0,0.0,0.0
financebench_id_00685,BESTBUY_2023_10K,https://d18rn0p25nwr6d.cloudfront.net/CIK-0000...,2023,domain-relevant,Are Best Buy's gross margins historically cons...,"['Yes, the margins have been consistent, there...",Consolidated Statements of Earnings\n$ and sha...,40,Based on the financial information provided in...,['<<Paragraph>> [Source File: BESTBUY_2023_10K...,1.247835,3.447056,2.005445,5.153986,21451,979,0.75,1.0,0.0,0.0


#### Save the full DeepEval evaluation report in .csv format under '../../outputs/rag/evaluation_reports/deepeval/'

In [15]:
eval_csv_dir = (
    f"../outputs/evaluation_reports/deepeval/deepeval_{model_output}.csv"
)  # path of the evaluation report to be written
print(eval_csv_dir)

../outputs/evaluation_reports/deepeval/deepeval_claude.csv


In [16]:
# Note: index=False leaves the financebench_id index out of the saved CSV
input_dataset.to_csv(eval_csv_dir, index=False)
print(f"Save evaluation results to: {eval_csv_dir}")

Save evaluation results to: ../outputs/evaluation_reports/deepeval/deepeval_claude.csv


#### Get Mean and Median Scores for Metrics

In [17]:
def get_scores(results, metric): 
    print(f"Qna generation results for {metric}")
    print("Mean")
    print(results[metric].mean())
    print("Median")
    print(results[metric].median())
    print()

In [18]:
deep_eval_metric_names = ['faithfulness', 'answer_relevance', 'bias', 'toxicity']
for m in deep_eval_metric_names:
    get_scores(input_dataset, m)

Qna generation results for faithfulness
Mean
1.0
Median
1.0

Qna generation results for answer_relevance
Mean
0.82
Median
0.8

Qna generation results for bias
Mean
0.0
Median
0.0

Qna generation results for toxicity
Mean
0.0
Median
0.0
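
As a sanity check, the answer_relevance summary can be reproduced from the ten scores displayed in the dataset above. This sketch uses Python's standard `statistics` module rather than pandas, and the values are the rounded scores as displayed, so the mean is rounded to two decimals.

```python
from statistics import mean, median

# answer_relevance scores for the ten rows displayed above (as rounded in the display)
answer_relevance = [0.75, 1.0, 0.666667, 0.8, 0.8, 0.833333, 0.8, 0.8, 1.0, 0.75]

print(round(mean(answer_relevance), 2))
print(median(answer_relevance))
```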

