# RAG evaluation with RAGAS

## Introduction
---

In this notebook, we'll explore how to evaluate the quality of a [Retrieval-Augmented Generation (RAG) application](https://aws.amazon.com/what-is/retrieval-augmented-generation/) at development time using the open-source [**Ragas**](https://docs.ragas.io/en/stable/) library.


## Set up

### Update and install prerequisite libraries

In [1]:
%pip install -qU --quiet -r requirements.txt


### Load evaluation dataset

In [1]:
import pandas as pd
import ast

eval_df = pd.read_csv('./../_eval_data/eval_dataframe.csv')
eval_df['context'] = eval_df.context.apply(lambda s: list(ast.literal_eval(s)))  # convert str to list
print(eval_df.shape)
eval_df.head(2)

(8, 11)


Unnamed: 0,input,actual_output,expected_output,context,retrieval_context,n_chunks_per_context,context_length,evolutions,context_quality,synthetic_input_quality,source_file
0,Rewritten Input: Explain Amazon's core mission...,,Amazon's core mission is to make customers' li...,"[across Amazon. Y et, I think every one of us ...",,1,2361,['Reasoning'],0.8,1.0,./_raw_data/AMZN-2023-Shareholder-Letter.pdf
1,Compare Amazon's approach to empowering builde...,,Amazon's approach to empowering builders and i...,"[across Amazon. Y et, I think every one of us ...",,1,2361,['Comparative'],0.8,0.6,./_raw_data/AMZN-2023-Shareholder-Letter.pdf
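
The `context` column is stored in the CSV as the string form of a Python list, which is why the cell above parses it with `ast.literal_eval`. Here is a minimal sketch of that conversion on a made-up value (the string below is illustrative and not taken from the dataset):

```python
import ast

# Illustrative value only: a list of chunks serialized as a string, as a CSV stores it.
raw_context = "['first chunk of text', 'second chunk of text']"

# literal_eval only parses Python literals, so it does not execute arbitrary code.
parsed_context = list(ast.literal_eval(raw_context))

print(type(parsed_context).__name__, len(parsed_context))
```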


### Connect to existing vector database

In [2]:
from langchain_chroma import Chroma
from langchain_aws import BedrockEmbeddings
import boto3


chroma_db_dir = './../_vector_db'
chroma_collection_name = 'amazon-shareholder-letters'
boto_session = boto3.session.Session()
titan_model_id = 'amazon.titan-embed-text-v2:0'
titan_embedding_fn = BedrockEmbeddings(
    model_id=titan_model_id,
    region_name=boto_session.region_name
)

vector_store = Chroma(
    collection_name=chroma_collection_name,
    embedding_function=titan_embedding_fn,
    persist_directory=chroma_db_dir,
)

chroma_retriver = vector_store.as_retriever(
    search_kwargs={'k': 3}
)

## RAGAS
---

As shown in the [Ragas API reference](https://docs.ragas.io/en/stable/references/evaluate/), records in a Ragas evaluation dataset typically include:
- The `question/input prompt` that was asked
- The `answer` that the LLM generated
- The `actual text contexts` the answer was based on (i.e., the chunks retrieved from the vector database)
- The `ground truth` answer(s)
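
To make the list above concrete, here is a sketch of what one such record could look like as a plain Python dict. The key names follow the result columns that appear in the evaluation output later in this notebook (`user_input`, `retrieved_contexts`, `response`, `reference`); all values are invented placeholders.

```python
# Placeholder values for illustration; only the key names come from the
# result columns shown later in this notebook.
record = {
    # question / input prompt that was asked
    "user_input": "What is the company's core mission?",
    # answer that the LLM generated
    "response": "placeholder generated answer",
    # text chunks returned by the retriever
    "retrieved_contexts": ["placeholder chunk 1", "placeholder chunk 2"],
    # ground-truth answer
    "reference": "placeholder ground-truth answer",
}

# Retrieved contexts are a list of strings; every other field is a single string.
assert isinstance(record["retrieved_contexts"], list)
print(sorted(record))
```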

Let's quickly build a simple RAG flow to generate the answers and retrieved contexts needed for the evaluation dataset.

In [3]:
import langchain_aws
from langchain_aws import ChatBedrock
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

llama3_1_70b_model_id = 'meta.llama3-1-70b-instruct-v1:0'
llama3_1_70b_langchain = ChatBedrock(
    model_id=llama3_1_70b_model_id,
    region_name=boto_session.region_name,
    model_kwargs={
        'max_tokens': 2048,
        'temperature': 0.01,
    },
)
system_prompt = ('''
You are an expert, truthful assistant. You will be given a task by a human.
Do not mention the context; provide your response directly.

Here is the context: {context}
''')

prompt_template = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    ("human", "{input}"),
])

qna_chain = create_stuff_documents_chain(llama3_1_70b_langchain, prompt_template)
rag_chain = create_retrieval_chain(chroma_retriver, qna_chain)

In [4]:
# Collect one dict per evaluation row, then build the DataFrame once at the end
# (cheaper than concatenating DataFrames inside the loop).
rows_to_eval = []

for _idx, row in eval_df.iterrows():
    # Strip a leading label such as "Rewritten Input:". Split only on the first
    # colon so that any later colons in the question are preserved.
    _input_parts = row['input'].split(':', 1)
    _question = _input_parts[1].strip() \
        if len(_input_parts) > 1 else row['input'].strip()
    _expected_ans = row['expected_output'].strip()
    _expected_context = row['context'][0]  # first reference chunk only
    _rag_resp = rag_chain.invoke({'input': _question})
    _llm_resp = _rag_resp.get('answer', '').strip()
    _llm_context = [doc.page_content for doc in _rag_resp.get('context', [])]
    rows_to_eval.append({
        'input': _question,
        'expected_answer': _expected_ans,
        'llm_answer': _llm_resp,
        'expected_context': _expected_context,
        'llm_context': _llm_context,
    })

df_to_eval = pd.DataFrame(
    rows_to_eval,
    columns=[
        'input', 'expected_answer', 'llm_answer',
        'expected_context', 'llm_context'
    ]
)

In [5]:
df_to_eval.head(3)

Unnamed: 0,input,expected_answer,llm_answer,expected_context,llm_context
0,Explain Amazon's core mission and its approach...,Amazon's core mission is to make customers' li...,"Based on the provided text, Amazon's core miss...","across Amazon. Y et, I think every one of us a...",[Amazon.com you can’t choose two out of three”...
1,Compare Amazon's approach to empowering builde...,Amazon's approach to empowering builders and i...,Here's a comparison of Amazon's approach to em...,"across Amazon. Y et, I think every one of us a...",[The best way we know how to do this is by bui...
2,Summarize AWS's noteworthy advancements in chi...,"In 2022, AWS achieved several noteworthy advan...",Here is a summary of AWS's noteworthy advancem...,past year was also a significant delivery yea...,"[customers, AWS continues to deliver new capab..."
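
Some synthetic questions in `eval_df` carry a label such as `Rewritten Input:` in front of the actual question, which the loop above strips before invoking the chain. Here is a small sketch of that step, written with `str.partition` and a hypothetical helper name `strip_label`:

```python
def strip_label(text: str) -> str:
    """Return the text after the first colon, or the whole text if there is none."""
    _label, sep, rest = text.partition(":")
    # `sep` is empty when no colon was found, so the input is returned as-is.
    return rest.strip() if sep else text.strip()


print(strip_label("Rewritten Input: Explain Amazon's core mission"))
print(strip_label("Compare Amazon's approach to empowering builders"))
```

Like the loop, this treats any text before a colon as a label, so a question that itself contains a colon would lose its leading part.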


Ragas offers a [broad range of metrics](https://docs.ragas.io/en/latest/concepts/metrics/available_metrics/), and you can configure which ones to calculate. The scores, however, **depend on the evaluator LLM (or evaluator embedding model)** used for scoring.

In this example, we set up **Claude 3 Sonnet** as the evaluator LLM and **Amazon Titan Text Embeddings V2** as the evaluator embedding model to demonstrate the full suite of available metrics. Feel free to swap in models that suit your requirements.

Although Ragas defines its own base classes ([`BaseRagasLLM`](https://github.com/explodinggradients/ragas/blob/main/src/ragas/llms/base.py#L47) and [`BaseRagasEmbeddings`](https://github.com/explodinggradients/ragas/blob/main/src/ragas/embeddings/base.py#L22)) for these interfaces, integration is typically done via **LangChain** for simplicity.

### Batch evaluation 
---
You can pass a `Dataset` object and a list of metrics to the `evaluate` API to run a batch evaluation.

In [6]:
import ragas
from datasets import Dataset

ragas_result = ragas.evaluation.evaluate(
    Dataset.from_pandas(df_to_eval.sample(n=2)),
    metrics=[
        ragas.metrics.answer_relevancy,
        ragas.metrics.faithfulness,
        ragas.metrics.context_precision,
        ragas.metrics.context_recall,
        ragas.metrics.answer_similarity,
    ],
    llm=ChatBedrock(
        model_id="anthropic.claude-3-sonnet-20240229-v1:0",
        model_kwargs={
            'temperature': 0,
            'max_tokens': 2048,
        },
    ),
    embeddings=titan_embedding_fn,
    column_map={
        'answer': 'llm_answer',
        'contexts': 'llm_context',
        'ground_truths': 'expected_answer',
        'question': 'input',
        'reference': 'expected_context',
    },
)

print("Overall scores")
print("--------------")
print(ragas_result, end="\n\n")
print("Details")
print("-------")
scores_df = ragas_result.to_pandas()
scores_df

Overall scores
--------------
{'answer_relevancy': 0.6464, 'faithfulness': 0.6500, 'context_precision': 1.0000, 'context_recall': 0.6167, 'semantic_similarity': 0.7389}

Details
-------


Unnamed: 0,user_input,retrieved_contexts,response,reference,answer_relevancy,faithfulness,context_precision,context_recall,semantic_similarity
0,Compare the company's approach to investment d...,"[Because of our emphasis on the long term, we ...","Based on the provided text, here is a comparis...",• We will make bold rather than timid investme...,0.337317,1.0,1.0,0.833333,0.639586
1,Explain primitives' role in enabling rapid inn...,"[Of course, this concept of primitives can be ...",Primitives play a crucial role in enabling rap...,document:\n“Primitives are the raw parts or t...,0.955527,0.3,1.0,0.4,0.838155
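
In the output above, each overall score matches the arithmetic mean of the corresponding per-sample column. Here is a quick check, with the values copied from the details table:

```python
from statistics import mean

# Per-sample scores copied from the details table above.
per_sample = {
    "answer_relevancy": [0.337317, 0.955527],
    "faithfulness": [1.0, 0.3],
    "context_precision": [1.0, 1.0],
    "context_recall": [0.833333, 0.4],
    "semantic_similarity": [0.639586, 0.838155],
}

# Average each metric over the samples and round to four decimals,
# the precision used in the "Overall scores" printout.
overall = {name: round(mean(scores), 4) for name, scores in per_sample.items()}
print(overall)
```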


### Evaluation Sample
---
Alternatively, you can pick individual metrics and score one sample at a time. In that case, you need to pass a `BaseRagasLLM` or a `BaseRagasEmbeddings` instance (whichever the metric requires) to the metric itself.

For this you need to construct an **evaluation sample**: a single structured data instance used to assess and measure the performance of your LLM application. It represents a single unit of interaction or a specific use case. In Ragas, evaluation samples are represented by the `SingleTurnSample` or `MultiTurnSample` classes.

#### SingleTurnSample

`SingleTurnSample` represents a single-turn interaction between a user and an LLM, together with the expected results for evaluation. It is suitable for evaluations that involve a single question-and-answer pair, possibly with additional context or reference information.


#### MultiTurnSample

`MultiTurnSample` represents a multi-turn interaction between a human, an AI and, optionally, a tool, together with the expected results for evaluation. It is suitable for evaluating conversational agents in more complex interactions.


In this example, we will only explore the `SingleTurnSample` use case.

In [7]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

evaluator_llm = LangchainLLMWrapper(ChatBedrock(
    model_id="anthropic.claude-3-sonnet-20240229-v1:0",
    model_kwargs={
        'temperature': 0,
        'max_tokens': 2048,
    },
))
evaluator_embeddings = LangchainEmbeddingsWrapper(
    titan_embedding_fn
)


In [8]:
sample_question_df = df_to_eval.sample(n=1)
input_prompt = sample_question_df['input'].values[0]
expected_ans = sample_question_df['expected_answer'].values[0]
llm_ans = sample_question_df['llm_answer'].values[0]
reference = sample_question_df['expected_context'].values[0]
retrieved_chunks = sample_question_df['llm_context'].values[0]

In [None]:
from ragas import SingleTurnSample 
from ragas.metrics import NoiseSensitivity, SemanticSimilarity, FactualCorrectness

sample_single_turn = SingleTurnSample(
    user_input=input_prompt,
    retrieved_contexts=retrieved_chunks,
    reference_contexts=[reference],
    response=llm_ans,
    reference=expected_ans,
)

ss_scorer = SemanticSimilarity(embeddings=evaluator_embeddings)
rel_noise_scorer = NoiseSensitivity(llm=evaluator_llm, focus='relevant')
factual_scorer = FactualCorrectness(llm=evaluator_llm)

print(
    'semantic similarity score: {}'.format(ss_scorer.single_turn_score(sample_single_turn))
)
print(
    'Relevant noise sensitivity score: {}'.format(rel_noise_scorer.single_turn_score(sample_single_turn))
)
print(
    'Factual correctness score: {}'.format(factual_scorer.single_turn_score(sample_single_turn))
)

semantic similarity score: 0.9433280634525408
Relevant noise sensitivity score: 0.2
Factual correctness score: 0.74
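
As a rough intuition for the semantic similarity score: such scores are commonly computed as the cosine similarity between the embedding of the response and the embedding of the reference. The sketch below assumes that definition and uses tiny hand-made vectors in place of real Titan embeddings; `cosine_similarity` is a hypothetical helper, not a Ragas function.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-ins for embedding vectors (real embeddings have many more dimensions).
response_vec = [1.0, 2.0, 3.0]
reference_vec = [1.0, 2.0, 2.5]

print(round(cosine_similarity(response_vec, reference_vec), 4))
```

Vectors pointing in the same direction score 1.0 and orthogonal vectors score 0.0, so a value such as 0.94 indicates that the response and the reference are close in embedding space.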


## Summary
---

In this notebook, we have seen how to integrate and use **Ragas**, an open-source LLM evaluation toolkit that is easy to use and integrates with the `langchain` and `llama-index` frameworks. AI developers and researchers can use it to improve and optimize their RAG applications.

Even though Ragas provides easy-to-use prebuilt evaluation metrics and supports bring-your-own evaluation metrics, it's important to remember that the included metrics rely on LLM-based evaluation and are therefore potentially subject to the biases of the selected evaluator LLM.
