# RAG evaluation with RAGAS

## Introduction
---

In this notebook, we'll explore how to evaluate the quality of a [Retrieval-Augmented Generation (RAG) application](https://aws.amazon.com/what-is/retrieval-augmented-generation/) during development, using the open-source [**Ragas**](https://docs.ragas.io/en/stable/) library.

You can use **Ragas** both to generate a synthetic evaluation dataset and to evaluate a RAG workflow. This notebook focuses on evaluating a RAG workflow built with **Meta Llama** foundation models.


## Set up

### Update and install prerequisite libraries

In [None]:
%pip install -qU --quiet -r requirements.txt

### Load evaluation dataset

In [1]:
import pandas as pd
import ast

eval_df = pd.read_csv('./../_eval_data/eval_dataframe.csv')
eval_df['source_chunk'] = eval_df.source_chunk.apply(
    lambda s: list(ast.literal_eval(s))
)
print(eval_df.shape)
eval_df.head(2)

(20, 10)


Unnamed: 0,question,compressed_question,ref_answer,source_sentence,source_chunk,source_document,groundedness_rating,groundedness_reason,relevance_rating,relevance_reason
0,What are the names of the chips that were anno...,Which 2nd-gen chipsets are being utilized by A...,Trainium and Inferentia.,announced second versions of our Trainium and ...,[announced second versions of our Trainium and...,{'source': '_raw_data/AMZN-2023-Shareholder-Le...,5.0,The context provides all the necessary informa...,1.0,The question is not relevant to a business ana...
1,What are the company's priorities in terms of ...,What are key spend & cultural priorities for a...,The company's priorities in terms of spending ...,We will work hard to spend wisely and maintain...,"[the present value of future cash flows, we’ll...",{'source': '_raw_data/AMZN-2021-Shareholder-Le...,5.0,The context provides all the necessary informa...,5.0,The question is very relevant to a business an...
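
The `ast.literal_eval` step in the loading cell above is needed because a list stored in a CSV cell comes back as a plain string. The following is a minimal, standard-library-only sketch of that round trip; the column values are toy data, not rows from the real evaluation file.

```python
import ast
import csv
import io

# A list-valued cell is written to CSV as its string representation.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["question", "source_chunk"])
writer.writerow(["toy question", str(["chunk one", "chunk two"])])

# Reading it back yields a plain string, not a list.
buffer.seek(0)
row = list(csv.DictReader(buffer))[0]
raw_cell = row["source_chunk"]

# ast.literal_eval parses Python literals without executing arbitrary code.
parsed_cell = list(ast.literal_eval(raw_cell))

print(type(raw_cell).__name__, "->", type(parsed_cell).__name__)
print(parsed_cell[0])
```

Using `ast.literal_eval` rather than `eval` keeps the parsing restricted to literals, which is the safer choice when the CSV content is not fully trusted.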


### Connect to existing vector database
---

Connect to the existing vector store and set up the retriever (`k=2` for this example).

In [2]:
from langchain_chroma import Chroma
from langchain_aws import BedrockEmbeddings
import boto3


chroma_db_dir = './../vector_db'
chroma_collection_name = 'amazon-shareholder-letters'
boto_session = boto3.session.Session()
titan_model_id = 'amazon.titan-embed-text-v2:0'
titan_embedding_fn = BedrockEmbeddings(
    model_id=titan_model_id,
    region_name=boto_session.region_name
)

vector_store = Chroma(
    collection_name=chroma_collection_name,
    embedding_function=titan_embedding_fn,
    persist_directory=chroma_db_dir,
)

chroma_retriver = vector_store.as_retriever(
    search_kwargs={'k': 2}
)
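
Conceptually, the `k=2` setting above means that only the two best-matching chunks are returned for each query. The sketch below illustrates that idea with hard-coded similarity scores; `top_k_chunks` and the scores are hypothetical and say nothing about how Chroma ranks results internally.

```python
import heapq
from typing import Dict, List


def top_k_chunks(similarity_by_chunk: Dict[str, float], k: int = 2) -> List[str]:
    """Return the k chunk ids with the highest similarity, best first."""
    ranked = heapq.nlargest(
        k, similarity_by_chunk.items(), key=lambda item: item[1]
    )
    return [chunk for chunk, _score in ranked]


# Toy similarity scores between one query and three stored chunks.
toy_scores = {"chunk A": 0.91, "chunk B": 0.42, "chunk C": 0.77}
print(top_k_chunks(toy_scores, k=2))  # ['chunk A', 'chunk C']
```

A larger `k` gives the generator more context to work with, but it also gives the evaluation metrics more potentially irrelevant chunks to penalize.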

## RAGAS
---

As shown in [Ragas' API reference](https://docs.ragas.io/en/stable/references/evaluate/), records in a Ragas evaluation dataset typically include:
- The `question/input prompt` that was asked
- The `answer` that the LLM generated
- The `actual text contexts` the answer was based on (i.e., the chunks retrieved from the vector database)
- The `ground truth` answer(s)
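
To make the shape of such a record concrete, here is a minimal sketch using a plain dataclass. `EvalRecord` and its field names are hypothetical and chosen only to mirror the four items listed above; this is not a Ragas class.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EvalRecord:
    """Hypothetical container mirroring the four fields listed above."""
    question: str                                       # the input prompt that was asked
    answer: str                                         # what the generator LLM produced
    contexts: List[str] = field(default_factory=list)   # retrieved chunks
    ground_truth: str = ""                              # reference answer


record = EvalRecord(
    question="toy question",
    answer="toy answer produced by the generator LLM",
    contexts=["first retrieved chunk", "second retrieved chunk"],
    ground_truth="toy reference answer",
)
print(record)
```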

Let's quickly build a simple RAG flow to generate the answers and contexts for the evaluation dataset. We will use `Llama 3 70B` as the candidate (generator) LLM.

In [4]:
from langchain_aws import ChatBedrock
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

llama3_70b_model_id = 'meta.llama3-70b-instruct-v1:0'
llama3_70b_langchain = ChatBedrock(
    model_id=llama3_70b_model_id,
    region_name=boto_session.region_name,
    model_kwargs={
        'max_tokens': 2048,
        'temperature': 0.01,
    },
)
system_prompt = ('''
You are an expert, truthful assistant. You will be provided the task by human.
No need to mention on the context, provide your respond directly.

Here is the context: {context}
''')

prompt_template = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    ("human", "{input}"),
])

qna_chain = create_stuff_documents_chain(llama3_70b_langchain, prompt_template)
rag_chain = create_retrieval_chain(chroma_retriver, qna_chain)

Let's run the RAG chain on one sample from the evaluation dataset and collect its answer and retrieved contexts.

In [10]:
# Collect one record per sampled question, then build the DataFrame in one go
# (avoids concatenating onto an empty DataFrame inside the loop).
eval_columns = [
    'input', 'expected_answer', 'llm_answer',
    'expected_context', 'llm_context'
]
eval_rows = []

for _idx, row in eval_df.sample(n=1).iterrows():
    _question = row['question'].strip()
    _expected_ans = row['ref_answer'].strip()
    _expected_context = row['source_chunk'][0]  # first source chunk only
    # Run retrieval and generation for this question
    _rag_resp = rag_chain.invoke({'input': _question})
    _llm_resp = _rag_resp.get('answer', '').strip()
    _llm_context = [doc.page_content for doc in _rag_resp.get('context', [])]
    eval_rows.append({
        'input': _question,
        'expected_answer': _expected_ans,
        'llm_answer': _llm_resp,
        'expected_context': _expected_context,
        'llm_context': _llm_context,
    })

df_to_eval = pd.DataFrame(eval_rows, columns=eval_columns)

In [11]:
df_to_eval

Unnamed: 0,input,expected_answer,llm_answer,expected_context,llm_context
0,"How many SKUs are housed in the new, same-day ...","The new, same-day fulfillment facilities in th...","According to the text, the new, same-day fulfi...",constrained by the primitives you’ve built and...,[constrained by the primitives you’ve built an...


Ragas offers a [broad range of metrics](https://docs.ragas.io/en/latest/concepts/metrics/available_metrics/), and you can configure which ones to calculate. Many of them, however, **depend on an evaluator LLM (or an evaluator embedding model)** to produce the score.

In this example, we will set up **Llama 3.2 11B** as the evaluator LLM and **Amazon Titan Text Embeddings V2** as the evaluator embedding model, so that both LLM-based and embedding-based metrics can be demonstrated. Feel free to change these to suit your requirements.

Although Ragas defines its own base classes ([`BaseRagasLLM`](https://github.com/explodinggradients/ragas/blob/main/src/ragas/llms/base.py#L47) and [`BaseRagasEmbeddings`](https://github.com/explodinggradients/ragas/blob/main/src/ragas/embeddings/base.py#L22)) for these interfaces, integration is typically done via **LangChain** wrappers for simplicity.
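
To illustrate why an evaluator model is needed, the sketch below separates the two steps involved in an LLM-judged metric: the evaluator produces per-claim or per-chunk verdicts, and the score is simple arithmetic over those verdicts. This is a simplified reading of how the Ragas documentation describes faithfulness and context precision, not the library's implementation. The verdict lists are hard-coded stand-ins for evaluator LLM judgments, both function names are hypothetical, and returning `0.0` for empty input is an assumption made here.

```python
from typing import List


def faithfulness_score(claim_supported: List[bool]) -> float:
    """Fraction of answer claims judged to be supported by the retrieved context."""
    if not claim_supported:
        return 0.0  # assumption: no claims means nothing to support
    return sum(claim_supported) / len(claim_supported)


def context_precision_score(chunk_relevant: List[bool]) -> float:
    """Mean of precision@k taken over the ranks k that hold a relevant chunk."""
    hits = 0
    weighted_sum = 0.0
    for rank, relevant in enumerate(chunk_relevant, start=1):
        if relevant:
            hits += 1
            weighted_sum += hits / rank  # precision@rank
    return weighted_sum / hits if hits else 0.0


# Stand-in verdicts: 3 of 4 claims supported; only the 2nd of 2 chunks relevant.
print(faithfulness_score([True, True, False, True]))  # 0.75
print(context_precision_score([False, True]))         # 0.5
```

Because the verdicts come from a model, two different evaluator LLMs can yield different scores for the same answer even though the arithmetic is identical.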

### Batch evaluation
---
You can pass a `Dataset` object and a list of metrics to the `evaluate` API to run a batch evaluation.

In [12]:
from langchain_aws import ChatBedrockConverse

llama3_2_11b_langchain = ChatBedrockConverse(
    model_id='us.meta.llama3-2-11b-instruct-v1:0',
    region_name=boto_session.region_name,
    temperature=0.,
    max_tokens=4000,
)

In [13]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

llama3_2_11b_evaluator = LangchainLLMWrapper(
    langchain_llm=llama3_2_11b_langchain,
)

titan_embedding_evaluator = LangchainEmbeddingsWrapper(
    titan_embedding_fn
)


In [14]:
import ragas
from datasets import Dataset
from ragas.evaluation import evaluate
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_precision,
    context_recall,
)

ragas_result = evaluate(
    Dataset.from_pandas(df_to_eval),
    metrics=[
        answer_relevancy,
        faithfulness,
        context_precision,
        context_recall,
    ],
    llm=llama3_2_11b_evaluator,
    embeddings=titan_embedding_evaluator,
    column_map={
        'answer': 'llm_answer',
        'contexts': 'llm_context',
        'ground_truths': 'expected_answer',
        'question': 'input',
        'reference': 'expected_context',
    },
    raise_exceptions=True,
    run_config=ragas.RunConfig(
        max_workers=1,
        timeout=300,
        max_retries=5,
        max_wait=120,
        seed=123,
        log_tenacity=True,
    ),
)

print("Overall scores")
print("--------------")
print(ragas_result, end="\n\n")
print("Details")
print("-------")
scores_df = ragas_result.to_pandas()
scores_df

Evaluating:   0%|          | 0/4 [00:00<?, ?it/s]

Overall scores
--------------
{'answer_relevancy': 0.8389, 'faithfulness': 1.0000, 'context_precision': 1.0000, 'context_recall': 1.0000}

Details
-------


Unnamed: 0,user_input,retrieved_contexts,response,reference,answer_relevancy,faithfulness,context_precision,context_recall
0,"How many SKUs are housed in the new, same-day ...",[constrained by the primitives you’ve built an...,"According to the text, the new, same-day fulfi...",constrained by the primitives you’ve built and...,0.838868,1.0,1.0,1.0
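
The `column_map` argument in the `evaluate` call above is written with the name the metrics expect as the key and the DataFrame column name as the value. The sketch below shows that renaming idea on plain dictionaries. `remap_columns` is a hypothetical helper for illustration only and is not how Ragas performs the mapping internally.

```python
from typing import Any, Dict, List


def remap_columns(
    rows: List[Dict[str, Any]], column_map: Dict[str, str]
) -> List[Dict[str, Any]]:
    """Rename each row's keys from dataset names to the expected names."""
    # Invert the map: dataset column name -> expected name.
    inverse = {source: target for target, source in column_map.items()}
    return [
        {inverse.get(name, name): value for name, value in row.items()}
        for row in rows
    ]


toy_rows = [{
    "input": "toy question",
    "llm_answer": "toy answer",
    "llm_context": ["toy chunk"],
    "expected_context": "toy reference",
}]
toy_column_map = {
    "question": "input",
    "answer": "llm_answer",
    "contexts": "llm_context",
    "reference": "expected_context",
}

remapped = remap_columns(toy_rows, toy_column_map)
print(sorted(remapped[0]))  # ['answer', 'contexts', 'question', 'reference']
```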


### Evaluation Sample
---
Alternatively, you can instantiate individual metrics and score one sample at a time. In that case, you need to pass a `BaseRagasLLM` or a `BaseRagasEmbeddings` instance (whichever the metric requires) to the metric.

To do this, you need to construct an **evaluation sample**: a single structured data instance used to assess and measure the performance of your LLM application. It represents a single unit of interaction or a specific use case. In Ragas, evaluation samples can be represented using the `SingleTurnSample` or `MultiTurnSample` classes.

#### SingleTurnSample

`SingleTurnSample` represents a single-turn interaction between a user and an LLM, together with the expected results for evaluation. It is suitable for evaluations that involve a single question and answer pair, possibly with additional context or reference information.


#### MultiTurnSample

`MultiTurnSample` represents a multi-turn interaction between a human, an AI and, optionally, a tool, together with the expected results for evaluation. It is suitable for evaluating conversational agents in more complex interactions.


In this example, we will only explore the `SingleTurnSample` use case.

In [15]:
input_prompt = df_to_eval['input'].values[0]
expected_ans = df_to_eval['expected_answer'].values[0]
llm_ans = df_to_eval['llm_answer'].values[0]
reference = df_to_eval['expected_context'].values[0]
retrieved_chunks = df_to_eval['llm_context'].values[0]

In [16]:
from ragas import SingleTurnSample
from ragas.metrics import NoiseSensitivity, SemanticSimilarity, FactualCorrectness

sample_single_turn = SingleTurnSample(
    user_input=input_prompt,
    retrieved_contexts=retrieved_chunks,
    reference_contexts=[reference],
    response=llm_ans,
    reference=expected_ans,
)

ss_scorer = SemanticSimilarity(embeddings=titan_embedding_evaluator)
rel_noise_scorer = NoiseSensitivity(llm=llama3_2_11b_evaluator, focus='relevant')
factual_scorer = FactualCorrectness(llm=llama3_2_11b_evaluator)

print(
    'semantic similarity score: {}'.format(ss_scorer.single_turn_score(sample_single_turn))
)
print(
    'Relevant noise sensitivity score: {}'.format(rel_noise_scorer.single_turn_score(sample_single_turn))
)
print(
    'Factual correctness score: {}'.format(factual_scorer.single_turn_score(sample_single_turn))
)

semantic similarity score: 0.9677737957766728
Relevant noise sensitivity score: 0.0
Factual correctness score: 1.0
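
The semantic similarity score above is embedding-based: the response and the reference answer are each embedded, and the two vectors are compared. A minimal sketch of that comparison, assuming cosine similarity and using made-up three-dimensional vectors in place of real Titan embeddings, could look like this:

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


# Toy stand-ins for the embeddings of the response and the reference answer.
response_vec = [0.2, 0.7, 0.1]
reference_vec = [0.25, 0.65, 0.05]
print(round(cosine_similarity(response_vec, reference_vec), 4))
```

Unlike the noise sensitivity and factual correctness scores, this kind of score needs only the embedding model and makes no evaluator LLM call.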


## Summary
---

In this notebook, we have seen how to use **Ragas**, an open-source LLM evaluation toolkit that is easy to use and integrates with the `langchain` framework. AI developers and researchers can use it to improve and optimize their RAG applications.

Although Ragas provides easy-to-use prebuilt metrics and supports bring-your-own metrics, it's important to remember that many of the included metrics are scored by an LLM and are therefore potentially subject to the biases of the selected evaluator LLM.
