# RAGAS Evaluation Framework

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and limitations under the License.

Copyright 2024 Amazon Web Services, Inc.

![OVERALL FLOW PROCESS](../images/workflow.png)

In this notebook, we evaluate sample LLM output for a RAG use case using RAGAS.

Ragas is a framework that helps you evaluate your Retrieval Augmented Generation (RAG) pipelines. RAG denotes a class of LLM applications that use external data to augment the LLM’s context. Existing tools and frameworks help you build these pipelines, but evaluating them and quantifying their performance can be hard. This is where Ragas (RAG Assessment) comes in.
Ragas aims to create an open standard, providing developers with the tools and techniques to leverage continual learning in their RAG applications. With Ragas, you can:

- Synthetically generate a diverse test dataset that you can use to evaluate your app.

- Use LLM-assisted evaluation metrics designed to help you objectively measure the performance of your application.

- Monitor the quality of your apps in production using smaller, cheaper models that can give actionable insights, for example the number of hallucinations in the generated answer.

- Use these insights to iterate and improve your application.

## Preliminary
1. The ragas-evaluation package is in the "libraries/ragas-evaluation" directory.
2. Run `pip install -e . --quiet` in "libraries/ragas-evaluation"
3. Install LlamaIndex: run `pip install llama-index==0.9.6.post1`


## Load packages and tools

In [1]:
!pip install -e ../libraries/ragas-evaluation --quiet

In [2]:
!pip install llama-index==0.9.6.post1 --quiet

In [3]:
#!pip install --upgrade pydantic --quiet

In [4]:
import os
import sys
import pandas as pd
from datasets import Dataset
from datasets import load_dataset

In [5]:
sys.path.append('../libraries/ragas-evaluation/src/')
from ragas import evaluate

from ragas.metrics.answer_precision import AnswerPrecision, answer_precision
from ragas.metrics.answer_recall import AnswerRecall, answer_recall
from ragas.metrics.answer_correctness import AnswerCorrectness, answer_correctness
from ragas.metrics.answer_relevance import AnswerRelevancy, answer_relevancy
from ragas.metrics.answer_similarity import AnswerSimilarity, answer_similarity
from ragas.metrics.context_precision import (
    ContextPrecision,
    ContextRelevancy,
    context_precision,
)
from ragas.metrics.context_recall import ContextRecall, context_recall
from ragas.metrics.critique import AspectCritique
from ragas.metrics.faithfulness import Faithfulness, faithfulness



You can ignore any package-version error messages printed during installation.

## Load and pre-process the samples for evaluation

#### Load the samples
We use a sample data file stored in '../data/inputs/'. To bring your own data, store it as a .csv file under the '../../outputs/rag/rag_outputs/' directory and make sure its schema contains exactly the expected columns, i.e. question, ground_truths, llm_answer, llm_contexts....
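
Before loading your own file, it can help to confirm that the expected columns are present. The sketch below is one minimal way to do that with the standard library. The `REQUIRED_COLUMNS` tuple and both helper functions are hypothetical names introduced for illustration; the column set is an assumption taken from the `column_map` used later in this notebook, since the full schema is elided in the text above.

```python
import csv
import io

# Assumption: these are the columns referenced by the notebook's column_map.
# The complete required schema is not spelled out in the text.
REQUIRED_COLUMNS = ("question", "ground_truths", "llm_answer", "llm_contexts")


def missing_columns(header):
    """Return the required columns that are absent from a CSV header row."""
    present = set(header)
    return [name for name in REQUIRED_COLUMNS if name not in present]


def check_csv_header(csv_text):
    """Read only the first row of CSV text and report missing columns."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    return missing_columns(header)


# Hypothetical CSV content that lacks the llm_contexts column.
sample = "financebench_id,question,ground_truths,llm_answer\nid_1,Q?,A,B\n"
print(check_csv_header(sample))  # ['llm_contexts']
```

With pandas, the same check could be applied to a loaded frame by passing its column names to `missing_columns`.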

RAG outputs from different models (output_openai.csv, output_claude.csv, output_mistral.csv) are ready to be evaluated; they are stored under ../../outputs/rag/rag_outputs.
To run this notebook, set the name of the model whose output you want to evaluate:

In [6]:
# Source model whose RAG output will be evaluated
# model_output = "mistral"
# model_output = "openai"
model_output = "claude"

# Change the filename to the one that needs to be evaluated
result_csv_file = f"../outputs/rag_outputs/output_{model_output}.csv"
print(result_csv_file)

../outputs/rag_outputs/output_claude.csv


In [7]:
#result_csv_file ='../data/inputs/output_claude.csv'

result_df = pd.read_csv(result_csv_file, index_col=0)

result_df.head()

financebench_id,doc_name,doc_link,doc_period,question_type,question,ground_truths,evidence_text,page_number,llm_answer,llm_contexts,latency_meta_time,latency_meta_kwd,latency_meta_comb,latency_meta_ans_gen,input_tokens,output_tokens
financebench_id_03029,3M_2018_10K,https://investors.3m.com/financials/sec-filing...,2018,metrics-generated,What is the FY2018 capital expenditure amount ...,['$1577.00'],Table of Contents \n3M Company and Subsidiarie...,60,According to the cash flow statement in the 3M...,['<<Paragraph>> [Source File: 3M_2018_10K] \n ...,0.927055,0.606658,1.448762,2.483714,21147,401
financebench_id_00499,3M_2022_10K,https://investors.3m.com/financials/sec-filing...,2022,domain-relevant,Is 3M a capital-intensive business based on FY...,"['No, the company is managing its CAPEX and Fi...",3M Company and Subsidiaries\n Consolidated Sta...,485052,Based on the financial information provided in...,['<<Paragraph>> [Source File: 3M_2022_10K] \n ...,0.714294,0.503733,1.938756,5.232772,23180,635
financebench_id_00438,ADOBE_2022_10K,https://www.adobe.com/pdf-page.html?pdfTarget=...,2022,domain-relevant,Does Adobe have an improving operating margin ...,['No the operating margins of Adobe have recen...,ADOBE INC.\nCONSOLIDATED STATEMENTS OF INCOME\...,54,Based on the financial information provided in...,['<<Paragraph>> [Source File: ADOBE_2022_10K] ...,0.774023,3.742355,1.554743,4.361795,13324,1038
financebench_id_00591,ADOBE_2022_10K,https://www.adobe.com/pdf-page.html?pdfTarget=...,2022,novel-generated,Does Adobe have an improving Free cashflow con...,"['Yes, the FCF conversion (using net income as...",ADOBE INC.\n CONSOLIDATED STATEMENTS OF CASH F...,57,Based on the financial information provided in...,['<<Paragraph>> [Source File: ADOBE_2022_10K] ...,0.891798,0.576164,1.379593,3.781731,12957,526
financebench_id_03069,AMD_2015_10K,https://ir.amd.com/sec-filings/filter/annual-f...,2015,metrics-generated,Answer the following question as if you are an...,['4.2%'],ITEM 8.\nFINANCIAL STATEMENTS AND SUPPLEMENTAR...,5660,According to the details in the Profit and Los...,['<<Paragraph>> [Source File: AMD_2015_10K] \n...,0.821045,0.420084,1.686061,2.92591,24266,494


## Pre-process the samples

This step depends on the format of the input samples.

In [8]:
import ast

# Parse the stringified "llm_contexts" column into a real list
# (ast.literal_eval only accepts Python literals, unlike eval)
result_df['llm_contexts'] = result_df['llm_contexts'].apply(ast.literal_eval)

# Ensure "ground_truths" is the field name for the ground-truth answer
# (a no-op when the column is already named "ground_truths")
result_df.rename(columns={"answer": "ground_truths"}, inplace=True)

# Ensure the "ground_truths" field is a list: each value read from the CSV
# is wrapped, unparsed, in a single-element list
result_df['ground_truths'] = result_df['ground_truths'].apply(lambda x: [x])

# Ensure the "llm_answer" field has no missing values
result_df["llm_answer"] = result_df['llm_answer'].fillna(value="Unfortunately, I cannot answer this question")

result_df.head()

financebench_id,doc_name,doc_link,doc_period,question_type,question,ground_truths,evidence_text,page_number,llm_answer,llm_contexts,latency_meta_time,latency_meta_kwd,latency_meta_comb,latency_meta_ans_gen,input_tokens,output_tokens
financebench_id_03029,3M_2018_10K,https://investors.3m.com/financials/sec-filing...,2018,metrics-generated,What is the FY2018 capital expenditure amount ...,[['$1577.00']],Table of Contents \n3M Company and Subsidiarie...,60,According to the cash flow statement in the 3M...,[<<Paragraph>> [Source File: 3M_2018_10K] \n ...,0.927055,0.606658,1.448762,2.483714,21147,401
financebench_id_00499,3M_2022_10K,https://investors.3m.com/financials/sec-filing...,2022,domain-relevant,Is 3M a capital-intensive business based on FY...,"[['No, the company is managing its CAPEX and F...",3M Company and Subsidiaries\n Consolidated Sta...,485052,Based on the financial information provided in...,[<<Paragraph>> [Source File: 3M_2022_10K] \n ...,0.714294,0.503733,1.938756,5.232772,23180,635
financebench_id_00438,ADOBE_2022_10K,https://www.adobe.com/pdf-page.html?pdfTarget=...,2022,domain-relevant,Does Adobe have an improving operating margin ...,[['No the operating margins of Adobe have rece...,ADOBE INC.\nCONSOLIDATED STATEMENTS OF INCOME\...,54,Based on the financial information provided in...,[<<Paragraph>> [Source File: ADOBE_2022_10K] \...,0.774023,3.742355,1.554743,4.361795,13324,1038
financebench_id_00591,ADOBE_2022_10K,https://www.adobe.com/pdf-page.html?pdfTarget=...,2022,novel-generated,Does Adobe have an improving Free cashflow con...,"[['Yes, the FCF conversion (using net income a...",ADOBE INC.\n CONSOLIDATED STATEMENTS OF CASH F...,57,Based on the financial information provided in...,[<<Paragraph>> [Source File: ADOBE_2022_10K] \...,0.891798,0.576164,1.379593,3.781731,12957,526
financebench_id_03069,AMD_2015_10K,https://ir.amd.com/sec-filings/filter/annual-f...,2015,metrics-generated,Answer the following question as if you are an...,[['4.2%']],ITEM 8.\nFINANCIAL STATEMENTS AND SUPPLEMENTAR...,5660,According to the details in the Profit and Los...,[<<Paragraph>> [Source File: AMD_2015_10K] \n ...,0.821045,0.420084,1.686061,2.92591,24266,494
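
List-valued columns come back from a CSV file as strings, so the pre-processing has to decide whether to parse each value or simply wrap it. The sketch below contrasts the two on a hypothetical cell value (the string is made up for illustration and is not taken from the sample data); `ast.literal_eval` is used because it only evaluates Python literals.

```python
import ast

# Hypothetical cell value, shaped like the stringified lists in the CSV.
raw = "['first passage', 'second passage']"

parsed = ast.literal_eval(raw)   # a list with two string elements
wrapped = [raw]                  # a list with one element: the raw string

print(len(parsed), len(wrapped))  # 2 1
```

Parsing recovers the individual items, while wrapping keeps the whole string as a single item. Which one is appropriate depends on how the input file was written.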



In [9]:
result_ds = Dataset.from_pandas(result_df)

## Run ragas evaluation

#### RAGAS evaluation scores
RAGAS provides a variety of LLM-powered automated metrics for evaluating RAG pipelines, i.e. LLM applications that use external data to augment the LLM context. The metrics below are used to evaluate the RAG use case in this notebook.

- Answer Precision: Measures how many of the claims in the model-generated answer are relevant and correct compared to the ground truth answer.
- Answer Recall: Evaluates the completeness of the answer, i.e., the model's ability to retrieve all correct claims compared to the ground truth answer. High recall indicates that the answer thoroughly covers the necessary details in line with the ground truth.
- Answer Correctness: Gauges the accuracy of the generated answer compared to the ground truth. This evaluation relies on the ground truth and the answer, with scores ranging from 0 to 1. A higher score indicates a closer alignment between the generated answer and the ground truth, signifying better correctness.
- Answer Similarity: Assesses the semantic resemblance between the generated answer and the ground truth. This evaluation is based on the ground truth and the answer, with values in the range of 0 to 1. A higher score signifies a better alignment between the generated answer and the ground truth.
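
To make the precision and recall definitions concrete, the sketch below computes both over claims that have already been extracted and normalized. This is an illustration under stated assumptions, not the implementation in the `ragas-evaluation` package: the function name `claim_precision_recall` is hypothetical, and exact string matching stands in for the LLM-based claim matching the metrics rely on.

```python
def claim_precision_recall(answer_claims, truth_claims):
    """Return (precision, recall) over two collections of claims."""
    answer_set = set(answer_claims)
    truth_set = set(truth_claims)
    matched = answer_set & truth_set
    # Precision: share of the answer's claims that appear in the ground truth.
    precision = len(matched) / len(answer_set) if answer_set else 0.0
    # Recall: share of the ground-truth claims that the answer covers.
    recall = len(matched) / len(truth_set) if truth_set else 0.0
    return precision, recall


# Hypothetical claims: the answer makes five claims, one of which is the
# single claim in the ground truth.
answer = ["claim A", "claim B", "claim C", "claim D", "claim E"]
truth = ["claim A"]
print(claim_precision_recall(answer, truth))  # (0.2, 1.0)
```

A long answer that includes the expected fact among many unsupported claims therefore scores high on recall and low on precision.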




In [10]:
%%time
# NOTE: Comment out any metrics you don't want to use
metrics = [
    answer_precision,
    answer_recall,
    answer_correctness,
    answer_similarity,
    # answer_relevancy,
    # faithfulness,
    # context_precision,
    # context_recall, # currently this metric might trigger timeout error raised by bedrock: ValueError: Error raised by bedrock service: Read timeout on endpoint URL: "https://bedrock-runtime.us-east-1.amazonaws.com/model/anthropic.claude-v2/invoke"
]

column_map = {
        "question": "question",
        "contexts": "llm_contexts",
        "answer": "llm_answer",
        "ground_truths": "ground_truths",
    }


# Evaluate
eval_result = evaluate(result_ds, metrics=metrics, column_map=column_map)

evaluating with [answer_precision]


100%|██████████| 1/1 [02:02<00:00, 122.61s/it]


evaluating with [answer_recall]


100%|██████████| 1/1 [01:02<00:00, 62.46s/it]


evaluating with [answer_correctness]


100%|██████████| 1/1 [02:06<00:00, 126.29s/it]


evaluating with [answer_similarity]


100%|██████████| 1/1 [00:02<00:00,  2.12s/it]

CPU times: user 2.39 s, sys: 1.36 s, total: 3.75 s
Wall time: 5min 13s





## Save the evaluation results with the sample input

In [11]:
print(eval_result)

{'ragas_score': 0.2768, 'answer_precision': 0.1867, 'answer_recall': 0.4000, 'answer_correctness': 0.2680, 'answer_similarity': 0.3493}
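
The `ragas_score` entry is not described elsewhere in this notebook. For this run, its value is consistent with the harmonic mean of the four per-metric scores, which the sketch below checks with the standard library. Treat the harmonic-mean reading as an assumption inferred from the printed numbers, not as a statement about how the package computes the score.

```python
from statistics import harmonic_mean

# Per-metric scores copied from the printed eval_result above.
scores = {
    "answer_precision": 0.1867,
    "answer_recall": 0.4000,
    "answer_correctness": 0.2680,
    "answer_similarity": 0.3493,
}

# The harmonic mean is pulled toward the lowest metric, so one weak
# metric lowers the combined score more than an arithmetic mean would.
combined = harmonic_mean(list(scores.values()))
print(round(combined, 4))  # 0.2768
```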


In [12]:
# Convert the evaluation result to a pandas DataFrame
eval_result_df = eval_result.to_pandas()
eval_result_df.head()

Unnamed: 0,question,contexts,answer,ground_truths,answer_precision,answer_recall,answer_correctness,answer_similarity
0,What is the FY2018 capital expenditure amount ...,[<<Paragraph>> [Source File: 3M_2018_10K] \n ...,According to the cash flow statement in the 3M...,[['$1577.00']],0.0,1.0,0.168175,0.336351
1,Is 3M a capital-intensive business based on FY...,[<<Paragraph>> [Source File: 3M_2022_10K] \n ...,Based on the financial information provided in...,"[['No, the company is managing its CAPEX and F...",0.0,0.0,0.184868,0.369736
2,Does Adobe have an improving operating margin ...,[<<Paragraph>> [Source File: ADOBE_2022_10K] \...,Based on the financial information provided in...,[['No the operating margins of Adobe have rece...,0.2,0.0,0.306414,0.412829
3,Does Adobe have an improving Free cashflow con...,[<<Paragraph>> [Source File: ADOBE_2022_10K] \...,Based on the financial information provided in...,"[['Yes, the FCF conversion (using net income a...",0.0,0.0,0.198792,0.397583
4,Answer the following question as if you are an...,[<<Paragraph>> [Source File: AMD_2015_10K] \n ...,According to the details in the Profit and Los...,[['4.2%']],0.0,0.0,0.146506,0.293012


#### Add the columns of input data to the evaluation result

In [13]:
metrics_keys = ['answer_precision','answer_recall','answer_correctness','answer_similarity']

# Reset the index if necessary
result_df = result_df.reset_index(drop=True)
eval_result_df = eval_result_df.reset_index(drop=True)

# Now perform the merge
eval_result_df_new = result_df.merge(eval_result_df[metrics_keys], 
                                     how='left', left_index=True, right_index=True)

eval_result_df_new.head()


Unnamed: 0,doc_name,doc_link,doc_period,question_type,question,ground_truths,evidence_text,page_number,llm_answer,llm_contexts,latency_meta_time,latency_meta_kwd,latency_meta_comb,latency_meta_ans_gen,input_tokens,output_tokens,answer_precision,answer_recall,answer_correctness,answer_similarity
0,3M_2018_10K,https://investors.3m.com/financials/sec-filing...,2018,metrics-generated,What is the FY2018 capital expenditure amount ...,[['$1577.00']],Table of Contents \n3M Company and Subsidiarie...,60,According to the cash flow statement in the 3M...,[<<Paragraph>> [Source File: 3M_2018_10K] \n ...,0.927055,0.606658,1.448762,2.483714,21147,401,0.0,1.0,0.168175,0.336351
1,3M_2022_10K,https://investors.3m.com/financials/sec-filing...,2022,domain-relevant,Is 3M a capital-intensive business based on FY...,"[['No, the company is managing its CAPEX and F...",3M Company and Subsidiaries\n Consolidated Sta...,485052,Based on the financial information provided in...,[<<Paragraph>> [Source File: 3M_2022_10K] \n ...,0.714294,0.503733,1.938756,5.232772,23180,635,0.0,0.0,0.184868,0.369736
2,ADOBE_2022_10K,https://www.adobe.com/pdf-page.html?pdfTarget=...,2022,domain-relevant,Does Adobe have an improving operating margin ...,[['No the operating margins of Adobe have rece...,ADOBE INC.\nCONSOLIDATED STATEMENTS OF INCOME\...,54,Based on the financial information provided in...,[<<Paragraph>> [Source File: ADOBE_2022_10K] \...,0.774023,3.742355,1.554743,4.361795,13324,1038,0.2,0.0,0.306414,0.412829
3,ADOBE_2022_10K,https://www.adobe.com/pdf-page.html?pdfTarget=...,2022,novel-generated,Does Adobe have an improving Free cashflow con...,"[['Yes, the FCF conversion (using net income a...",ADOBE INC.\n CONSOLIDATED STATEMENTS OF CASH F...,57,Based on the financial information provided in...,[<<Paragraph>> [Source File: ADOBE_2022_10K] \...,0.891798,0.576164,1.379593,3.781731,12957,526,0.0,0.0,0.198792,0.397583
4,AMD_2015_10K,https://ir.amd.com/sec-filings/filter/annual-f...,2015,metrics-generated,Answer the following question as if you are an...,[['4.2%']],ITEM 8.\nFINANCIAL STATEMENTS AND SUPPLEMENTAR...,5660,According to the details in the Profit and Los...,[<<Paragraph>> [Source File: AMD_2015_10K] \n ...,0.821045,0.420084,1.686061,2.92591,24266,494,0.0,0.0,0.146506,0.293012


#### Save the full RAGAS evaluation report in .csv format under '../outputs/evaluation_reports/ragas/'

In [14]:
# Output path for the evaluation report of the selected model
eval_csv_dir = f"../outputs/evaluation_reports/ragas/ragas_eval_{model_output}.csv"
print(eval_csv_dir)

../outputs/evaluation_reports/ragas/ragas_eval_claude.csv


In [15]:
# to_csv returns None when a path is given, so the result is not assigned
eval_result_df_new.to_csv(eval_csv_dir, index=False)
print(f"Save evaluation results to: {eval_csv_dir}")

Save evaluation results to: ../outputs/evaluation_reports/ragas/ragas_eval_claude.csv


#### Get Mean and Median Scores for Metrics

In [16]:
def get_scores(results, metric): 
    print(f"Qna generation results for {metric}")
    print("Mean")
    print(results[metric].mean())
    print("Median")
    print(results[metric].median())
    print()

In [17]:
ragas_eval_metric_names = ['answer_precision','answer_recall','answer_correctness','answer_similarity']
for m in ragas_eval_metric_names:
    get_scores(eval_result_df_new, m)

Qna generation results for answer_precision
Mean
0.18666666666666668
Median
0.0

Qna generation results for answer_recall
Mean
0.4
Median
0.0

Qna generation results for answer_correctness
Mean
0.2680050845940908
Median
0.19182968884706497

Qna generation results for answer_similarity
Mean
0.3493435025215149
Median
0.35304318368434906

