# RAGAS Evaluation Framework

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and limitations under the License.

Copyright 2024 Amazon Web Services, Inc.

![OVERALL FLOW PROCESS](../images/workflow.png)

In this notebook, we evaluate sample LLM outputs for a RAG use case using RAGAS.

Ragas is a framework that helps you evaluate your Retrieval Augmented Generation (RAG) pipelines. RAG denotes a class of LLM applications that use external data to augment the LLM's context. Existing tools and frameworks help you build these pipelines, but evaluating them and quantifying their performance can be hard. This is where Ragas (RAG Assessment) comes in.
Ragas aims to create an open standard, providing developers with the tools and techniques to leverage continual learning in their RAG applications. With Ragas, you can:

- Synthetically generate a diverse test dataset that you can use to evaluate your app.

- Use LLM-assisted evaluation metrics designed to help you objectively measure the performance of your application.

- Monitor the quality of your apps in production using smaller, cheaper models that can give actionable insights. For example, the number of hallucinations in the generated answer.

- Use these insights to iterate and improve your application.

## Preliminary
1. The ragas-evaluation package is in the "libraries/ragas-evaluation" directory.
2. Run `pip install -e . --quiet` in "libraries/ragas-evaluation".
3. Install LlamaIndex by running `pip install llama-index==0.9.6.post1`.


## Load packages and tools

In [1]:
!pip install -e ../libraries/ragas-evaluation --quiet

In [2]:
!pip install llama-index==0.9.6.post1 --quiet

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
sagemaker 2.237.1 requires importlib-metadata<7.0,>=1.4.0, but you have importlib-metadata 7.0.2 which is incompatible.
sphinx 8.1.3 requires docutils<0.22,>=0.20, but you have docutils 0.16 which is incompatible.

In [3]:
#!pip install --upgrade pydantic --quiet

In [1]:
import os
import sys
import pandas as pd
from datasets import Dataset
from datasets import load_dataset

In [2]:
sys.path.append('../libraries/ragas-evaluation/src/')
from ragas import evaluate
from ragas.metrics._answer_precision import AnswerPrecision, answer_precision
from ragas.metrics._answer_recall import AnswerRecall, answer_recall
from ragas.metrics._answer_correctness import AnswerCorrectness, answer_correctness
from ragas.metrics._answer_similarity import AnswerSimilarity, answer_similarity
from ragas.metrics._answer_relevance import AnswerRelevancy, answer_relevancy
from ragas.metrics._context_precision import (
    ContextPrecision,
    context_precision,
)
from ragas.metrics._context_recall import ContextRecall, context_recall
from ragas.metrics._aspect_critic import AspectCritic, SUPPORTED_ASPECTS
from ragas.metrics._faithfulness import Faithfulness, faithfulness

You can ignore any package-version conflict messages reported during installation.
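If you want to see which versions ended up installed after the conflict messages, a minimal sketch using the standard library follows. The helper name `installed_version` is hypothetical, and the package list is simply taken from the pip messages and install commands in this notebook.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Package names taken from the pip output and install commands in this notebook;
# adjust the list to the packages you care about.
for name in ["llama-index", "importlib-metadata", "docutils", "sagemaker"]:
    print(f"{name}: {installed_version(name)}")
```

A value of `None` only means the package is not installed in the current environment.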

## Load and pre-process the samples for evaluation

#### Load the samples
We use a sample data file stored in '../data/inputs/'. To bring your own data, store it as a .csv file under the '../../outputs/rag/rag_outputs/' directory and make sure its schema contains exactly the expected columns, i.e. question, ground_truths, llm_answer, llm_contexts....

RAG output results for different models (output_openai.csv, output_claude.csv, output_mistral.csv) are ready to be evaluated; these RAG output files are stored under ../../outputs/rag/rag_outputs.
To run this notebook, set the name of the model whose output you want to evaluate:

In [15]:
# src model
# model_output = "mistral"
# model_output="openai"
model_output= "claude"

# change the filename to the one that needs to be evaluated
result_csv_file = f"../outputs/rag_outputs/output_{model_output}.csv"
print(result_csv_file)

../outputs/rag_outputs/output_claude.csv


In [16]:
# result_csv_file ='../data/inputs/output_claude.csv'

result_df = pd.read_csv(result_csv_file, index_col=0)

result_df.head()

doc_name,doc_link,doc_period,question_type,question,ground_truths,evidence_text,page_number,llm_answer,llm_contexts,latency_meta_time,latency_meta_kwd,latency_meta_comb,latency_meta_ans_gen,input_tokens,output_tokens
3M_2018_10K,https://investors.3m.com/financials/sec-filing...,2018,metrics-generated,What is the FY2018 capital expenditure amount ...,'$1577.00',Table of Contents \n3M Company and Subsidiarie...,60,According to the cash flow statement in the 3M...,['<<Paragraph>> [Source File: 3M_2018_10K] \n ...,0.927055,0.606658,1.448762,2.483714,21147,401
3M_2022_10K,https://investors.3m.com/financials/sec-filing...,2022,domain-relevant,Is 3M a capital-intensive business based on FY...,"'No, the company is managing its CAPEX and Fix...",3M Company and Subsidiaries\n Consolidated Sta...,485052,Based on the financial information provided in...,['<<Paragraph>> [Source File: 3M_2022_10K] \n ...,0.714294,0.503733,1.938756,5.232772,23180,635
ADOBE_2022_10K,https://www.adobe.com/pdf-page.html?pdfTarget=...,2022,domain-relevant,Does Adobe have an improving operating margin ...,'No the operating margins of Adobe have recent...,ADOBE INC.\nCONSOLIDATED STATEMENTS OF INCOME\...,54,Based on the financial information provided in...,['<<Paragraph>> [Source File: ADOBE_2022_10K] ...,0.774023,3.742355,1.554743,4.361795,13324,1038
ADOBE_2022_10K,https://www.adobe.com/pdf-page.html?pdfTarget=...,2022,novel-generated,Does Adobe have an improving Free cashflow con...,"'Yes, the FCF conversion (using net income as ...",ADOBE INC.\n CONSOLIDATED STATEMENTS OF CASH F...,57,Based on the financial information provided in...,['<<Paragraph>> [Source File: ADOBE_2022_10K] ...,0.891798,0.576164,1.379593,3.781731,12957,526
AMD_2015_10K,https://ir.amd.com/sec-filings/filter/annual-f...,2015,metrics-generated,Answer the following question as if you are an...,'4.2%',ITEM 8.\nFINANCIAL STATEMENTS AND SUPPLEMENTAR...,5660,According to the details in the Profit and Los...,['<<Paragraph>> [Source File: AMD_2015_10K] \n...,0.821045,0.420084,1.686061,2.92591,24266,494
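Because the evaluation depends on specific column names, it can help to check the loaded file before going further. The following is a minimal sketch, not part of the original notebook: the helper names `missing_columns` and `check_schema` are hypothetical, and the required column list is an assumption based on the columns used later in `column_map`.

```python
# Column names assumed from the sample output file and the column_map used later.
REQUIRED_COLUMNS = ["question", "ground_truths", "llm_answer", "llm_contexts"]


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]


def check_schema(columns, required=REQUIRED_COLUMNS):
    """Raise a ValueError that lists every missing required column."""
    missing = missing_columns(columns, required)
    if missing:
        raise ValueError(f"Input file is missing required columns: {missing}")
```

In the notebook this could be called as `check_schema(result_df.columns)` right after loading the CSV.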


#### Pre-process the samples
This step depends on the format of the input samples.

In [17]:
import ast

# Ensure the "llm_contexts" field is a list: parse the stringified list
# with ast.literal_eval, which is safer than eval on file contents
result_df['llm_contexts'] = result_df['llm_contexts'].apply(ast.literal_eval)

# Ensure the ground-truth column is named "ground_truths"
# (this is a no-op when the file has no "answer" column)
result_df.rename(columns={"answer": "ground_truths"}, inplace=True)

# Ensure the "llm_answer" field has no missing values
result_df["llm_answer"] = result_df['llm_answer'].fillna(value="Unfortunately, I cannot answer this question")

result_df.head()

doc_name,doc_link,doc_period,question_type,question,ground_truths,evidence_text,page_number,llm_answer,llm_contexts,latency_meta_time,latency_meta_kwd,latency_meta_comb,latency_meta_ans_gen,input_tokens,output_tokens
3M_2018_10K,https://investors.3m.com/financials/sec-filing...,2018,metrics-generated,What is the FY2018 capital expenditure amount ...,'$1577.00',Table of Contents \n3M Company and Subsidiarie...,60,According to the cash flow statement in the 3M...,[<<Paragraph>> [Source File: 3M_2018_10K] \n ...,0.927055,0.606658,1.448762,2.483714,21147,401
3M_2022_10K,https://investors.3m.com/financials/sec-filing...,2022,domain-relevant,Is 3M a capital-intensive business based on FY...,"'No, the company is managing its CAPEX and Fix...",3M Company and Subsidiaries\n Consolidated Sta...,485052,Based on the financial information provided in...,[<<Paragraph>> [Source File: 3M_2022_10K] \n ...,0.714294,0.503733,1.938756,5.232772,23180,635
ADOBE_2022_10K,https://www.adobe.com/pdf-page.html?pdfTarget=...,2022,domain-relevant,Does Adobe have an improving operating margin ...,'No the operating margins of Adobe have recent...,ADOBE INC.\nCONSOLIDATED STATEMENTS OF INCOME\...,54,Based on the financial information provided in...,[<<Paragraph>> [Source File: ADOBE_2022_10K] \...,0.774023,3.742355,1.554743,4.361795,13324,1038
ADOBE_2022_10K,https://www.adobe.com/pdf-page.html?pdfTarget=...,2022,novel-generated,Does Adobe have an improving Free cashflow con...,"'Yes, the FCF conversion (using net income as ...",ADOBE INC.\n CONSOLIDATED STATEMENTS OF CASH F...,57,Based on the financial information provided in...,[<<Paragraph>> [Source File: ADOBE_2022_10K] \...,0.891798,0.576164,1.379593,3.781731,12957,526
AMD_2015_10K,https://ir.amd.com/sec-filings/filter/annual-f...,2015,metrics-generated,Answer the following question as if you are an...,'4.2%',ITEM 8.\nFINANCIAL STATEMENTS AND SUPPLEMENTAR...,5660,According to the details in the Profit and Los...,[<<Paragraph>> [Source File: AMD_2015_10K] \n ...,0.821045,0.420084,1.686061,2.92591,24266,494
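If your own CSV may contain empty or malformed `llm_contexts` cells, a more defensive parser can be useful. This is a sketch under stated assumptions: the helper name `parse_contexts` is hypothetical, and falling back to an empty list is a design choice made here for illustration, not something the notebook prescribes.

```python
import ast


def parse_contexts(value):
    """Convert a stringified list of contexts into a Python list.

    Lists pass through unchanged; missing or malformed values become [].
    """
    if isinstance(value, list):
        return value
    if not isinstance(value, str):
        return []  # e.g. NaN read from an empty CSV cell
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return []  # the cell did not contain a valid Python literal
    return parsed if isinstance(parsed, list) else []
```

It could be applied with `result_df['llm_contexts'].apply(parse_contexts)`. Whether an empty context list is acceptable for a given metric depends on the metric, so you may prefer to drop such rows instead.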


Convert the pre-processed DataFrame into a `Dataset` object for evaluation.

In [18]:
result_ds = Dataset.from_pandas(result_df)

## Run ragas evaluation

#### RAGAS Evaluation Scores
RAGAS provides a variety of LLM-powered automated evaluation metrics for RAG pipelines. The following metrics are used to evaluate the RAG use case:

- Answer Precision: Measures how accurately the model-generated answer contains relevant and correct claims compared to the ground truth answer.
- Answer Recall: Evaluates the completeness of the answer, i.e., the model's ability to retrieve all correct claims compared to the ground truth answer. High recall indicates that the answer thoroughly covers the necessary details in line with the ground truth.
- Answer Correctness: Gauges the accuracy of the generated answer compared to the ground truth. The evaluation relies on the ground truth and the answer, with scores ranging from 0 to 1. A higher score indicates closer alignment between the generated answer and the ground truth, and therefore better correctness.
- Answer Similarity: Assesses the semantic resemblance between the generated answer and the ground truth. The evaluation is based on the ground truth and the answer, with values in the range 0 to 1. A higher score signifies better alignment between the generated answer and the ground truth. This score is reported as `semantic_similarity` in the results.
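To make the claim-based and similarity-based scores more concrete, here is a minimal illustrative sketch. It assumes the textbook definitions of precision and recall over claim counts and cosine similarity over embedding vectors; the exact formulas inside the ragas-evaluation package may differ, and all function names below are hypothetical.

```python
import math


def claim_precision(true_positives, false_positives):
    """Share of claims in the generated answer that the ground truth supports."""
    total = true_positives + false_positives
    return true_positives / total if total else math.nan


def claim_recall(true_positives, false_negatives):
    """Share of ground-truth claims that the generated answer covers."""
    total = true_positives + false_negatives
    return true_positives / total if total else math.nan


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return math.nan  # undefined for a zero vector
    return dot / (norm_u * norm_v)


# Hypothetical example: the answer makes 3 claims, 2 of them supported,
# and the ground truth contains 4 claims, 2 of them covered by the answer.
print(claim_precision(true_positives=2, false_positives=1))
print(claim_recall(true_positives=2, false_negatives=2))
print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))
```

Under these definitions a long answer that repeats the ground truth but adds unsupported claims scores high on recall and low on precision.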




Set the Bedrock configuration. Skip this step if you are not using Bedrock.

---

In [19]:
import boto3
from botocore.client import Config
from langchain_aws import ChatBedrock
from langchain_community.embeddings import BedrockEmbeddings

In [20]:
# Set Eval Model IDS
eval_modelId = "anthropic.claude-3-5-sonnet-20241022-v2:0"
eval_embedId = "amazon.titan-embed-text-v2:0"

# Initiate LLMs
bedrock_config = Config(connect_timeout=120, read_timeout=120, retries={'max_attempts': 2})
bedrock_client = boto3.client('bedrock-runtime', config=bedrock_config)  # apply the timeout and retry settings defined above
bedrock_model = ChatBedrock(model_id=eval_modelId, client=bedrock_client, model_kwargs={
        "max_tokens": 5000
    })

# init the embeddings
bedrock_embeddings = BedrockEmbeddings(model_id=eval_embedId)



---

In [30]:
%%time
# NOTE: Comment out any metrics you don't want to use
metrics = [
    answer_precision,
    answer_recall,
    answer_correctness,
    answer_similarity,
    answer_relevancy,
    faithfulness,
    # context_precision, # currently this metric might trigger an error
    context_recall, 
]

column_map = {
        "question": "question",
        "contexts": "llm_contexts",
        "answer": "llm_answer",
        "ground_truth": "ground_truths",
    }


# Evaluate
eval_result = evaluate(result_ds, metrics=metrics, llm=bedrock_model, 
                    embeddings=bedrock_embeddings, column_map=column_map)

Evaluating:   0%|          | 0/70 [00:00<?, ?it/s]

Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
Prompt n_l_i_statement_prompt failed to parse output: The output parser failed to parse the output including retries.
Exception raised in Job[8]: RagasOutputParserException(The output parser failed to parse the output including retries.)
Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
Prompt n_l_i_statement_prompt failed to parse output: The output parser failed to pa

CPU times: user 2.58 s, sys: 107 ms, total: 2.69 s
Wall time: 1min 19s


## Save the evaluation results with the sample input

In [31]:
print(eval_result)

{'answer_precision': 0.3220, 'answer_recall': 0.5370, 'answer_correctness': 0.6223, 'semantic_similarity': 0.3737, 'answer_relevancy': 0.8598, 'faithfulness': 0.9559, 'context_recall': 0.7000}


In [32]:
# Convert the evaluation result to a pandas DataFrame with per-sample scores
eval_result_df = eval_result.to_pandas()
eval_result_df.head()

Unnamed: 0,user_input,retrieved_contexts,response,reference,answer_precision,answer_recall,answer_correctness,semantic_similarity,answer_relevancy,faithfulness,context_recall
0,What is the FY2018 capital expenditure amount ...,[<<Paragraph>> [Source File: 3M_2018_10K] \n ...,According to the cash flow statement in the 3M...,'$1577.00',0.0,1.0,0.586305,0.09522,0.819077,1.0,1.0
1,Is 3M a capital-intensive business based on FY...,[<<Paragraph>> [Source File: 3M_2022_10K] \n ...,Based on the financial information provided in...,"'No, the company is managing its CAPEX and Fix...",0.785714,,0.666366,0.332131,0.845723,0.857143,0.5
2,Does Adobe have an improving operating margin ...,[<<Paragraph>> [Source File: ADOBE_2022_10K] \...,Based on the financial information provided in...,'No the operating margins of Adobe have recent...,,0.333333,0.657629,0.784362,0.810163,,0.0
3,Does Adobe have an improving Free cashflow con...,[<<Paragraph>> [Source File: ADOBE_2022_10K] \...,Based on the financial information provided in...,"'Yes, the FCF conversion (using net income as ...",0.2,0.0,0.161393,0.64557,0.731836,1.0,0.0
4,Answer the following question as if you are an...,[<<Paragraph>> [Source File: AMD_2015_10K] \n ...,According to the details in the Profit and Los...,'4.2%',0.0,0.0,0.798554,0.194215,0.858818,1.0,0.5
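Empty cells in the table above are missing scores, which likely correspond to the jobs that raised parser exceptions during evaluation. A small sketch for counting them per metric follows; the helper name `count_missing_scores` is hypothetical, and the sample rows are copied from the first three rows of the table.

```python
import math


def count_missing_scores(rows, metric_names):
    """Count, per metric, how many rows have a missing (None or NaN) score."""
    counts = {name: 0 for name in metric_names}
    for row in rows:
        for name in metric_names:
            value = row.get(name)
            if value is None or (isinstance(value, float) and math.isnan(value)):
                counts[name] += 1
    return counts


# Scores copied from the first three rows of the table above.
rows = [
    {"answer_precision": 0.0, "answer_recall": 1.0, "faithfulness": 1.0},
    {"answer_precision": 0.785714, "answer_recall": math.nan, "faithfulness": 0.857143},
    {"answer_precision": math.nan, "answer_recall": 0.333333, "faithfulness": math.nan},
]
print(count_missing_scores(rows, ["answer_precision", "answer_recall", "faithfulness"]))
```

With the DataFrame at hand, `eval_result_df.to_dict("records")` yields rows in this shape, and `eval_result_df[metric_columns].isna().sum()` gives the same counts directly in pandas (`metric_columns` being your own list of metric names).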


#### Add the columns of the input data to the evaluation result

In [33]:
metrics_keys = ['answer_precision','answer_recall','answer_correctness','semantic_similarity']

# Reset the index if necessary
result_df = result_df.reset_index(drop=True)
eval_result_df = eval_result_df.reset_index(drop=True)

# Now perform the merge
eval_result_df_new = result_df.merge(eval_result_df[metrics_keys], 
                                     how='left', left_index=True, right_index=True)

eval_result_df_new.head()


Unnamed: 0,doc_link,doc_period,question_type,question,ground_truths,evidence_text,page_number,llm_answer,llm_contexts,latency_meta_time,latency_meta_kwd,latency_meta_comb,latency_meta_ans_gen,input_tokens,output_tokens,answer_precision,answer_recall,answer_correctness,semantic_similarity
0,https://investors.3m.com/financials/sec-filing...,2018,metrics-generated,What is the FY2018 capital expenditure amount ...,'$1577.00',Table of Contents \n3M Company and Subsidiarie...,60,According to the cash flow statement in the 3M...,[<<Paragraph>> [Source File: 3M_2018_10K] \n ...,0.927055,0.606658,1.448762,2.483714,21147,401,0.0,1.0,0.586305,0.09522
1,https://investors.3m.com/financials/sec-filing...,2022,domain-relevant,Is 3M a capital-intensive business based on FY...,"'No, the company is managing its CAPEX and Fix...",3M Company and Subsidiaries\n Consolidated Sta...,485052,Based on the financial information provided in...,[<<Paragraph>> [Source File: 3M_2022_10K] \n ...,0.714294,0.503733,1.938756,5.232772,23180,635,0.785714,,0.666366,0.332131
2,https://www.adobe.com/pdf-page.html?pdfTarget=...,2022,domain-relevant,Does Adobe have an improving operating margin ...,'No the operating margins of Adobe have recent...,ADOBE INC.\nCONSOLIDATED STATEMENTS OF INCOME\...,54,Based on the financial information provided in...,[<<Paragraph>> [Source File: ADOBE_2022_10K] \...,0.774023,3.742355,1.554743,4.361795,13324,1038,,0.333333,0.657629,0.784362
3,https://www.adobe.com/pdf-page.html?pdfTarget=...,2022,novel-generated,Does Adobe have an improving Free cashflow con...,"'Yes, the FCF conversion (using net income as ...",ADOBE INC.\n CONSOLIDATED STATEMENTS OF CASH F...,57,Based on the financial information provided in...,[<<Paragraph>> [Source File: ADOBE_2022_10K] \...,0.891798,0.576164,1.379593,3.781731,12957,526,0.2,0.0,0.161393,0.64557
4,https://ir.amd.com/sec-filings/filter/annual-f...,2015,metrics-generated,Answer the following question as if you are an...,'4.2%',ITEM 8.\nFINANCIAL STATEMENTS AND SUPPLEMENTAR...,5660,According to the details in the Profit and Los...,[<<Paragraph>> [Source File: AMD_2015_10K] \n ...,0.821045,0.420084,1.686061,2.92591,24266,494,0.0,0.0,0.798554,0.194215


#### Save the full RAGAS evaluation report in .csv format under '../outputs/evaluation_reports/ragas/'

In [34]:
# path of the evaluation report file for the selected model
eval_csv_dir = f"../outputs/evaluation_reports/ragas/ragas_eval_{model_output}.csv"
print(eval_csv_dir)

../outputs/evaluation_reports/ragas/ragas_eval_claude.csv


In [35]:
# to_csv returns None when a path is given, so there is no result to keep
eval_result_df_new.to_csv(eval_csv_dir, index=False)
print(f"Saved evaluation results to: {eval_csv_dir}")

Saved evaluation results to: ../outputs/evaluation_reports/ragas/ragas_eval_claude.csv


#### Get Mean and Median Scores for Metrics

In [36]:
def get_scores(results, metric): 
    print(f"Qna generation results for {metric}")
    print("Mean")
    print(results[metric].mean())
    print("Median")
    print(results[metric].median())
    print()

In [38]:
ragas_eval_metric_names = ['answer_precision','answer_recall','answer_correctness','semantic_similarity']
for m in ragas_eval_metric_names:
    get_scores(eval_result_df_new, m)

Qna generation results for answer_precision
Mean
0.32204585537918873
Median
0.2

Qna generation results for answer_recall
Mean
0.537037037037037
Median
0.5

Qna generation results for answer_correctness
Mean
0.622294777994382
Median
0.6619975983575948

Qna generation results for semantic_similarity
Mean
0.37368137442096694
Median
0.266861604432434
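Note that pandas `mean()` and `median()` skip missing values by default, so samples whose evaluation failed do not contribute to the aggregates above. The following plain-Python sketch makes that behaviour explicit and also reports how many samples were actually scored. The helper name `summarize` is hypothetical; the example values are the `answer_precision` scores of the first five rows shown earlier, not the full result set.

```python
import math
import statistics


def summarize(scores):
    """Return count, mean and median over the non-missing scores."""
    valid = [s for s in scores if s is not None and not math.isnan(s)]
    if not valid:
        return {"count": 0, "mean": math.nan, "median": math.nan}
    return {
        "count": len(valid),
        "mean": statistics.fmean(valid),
        "median": statistics.median(valid),
    }


# answer_precision for the first five samples; one score is missing.
first_five_precision = [0.0, 0.785714, math.nan, 0.2, 0.0]
print(summarize(first_five_precision))
```

Reporting the count next to the mean helps when comparing models, since a model with many failed evaluations is averaged over fewer samples.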

