# Improve RAG evaluation with Ragas and LlamaIndex

### Prerequisites

Install the necessary packages:


In [None]:
import warnings
warnings.filterwarnings('ignore')

# Install necessary packages
%pip install --force-reinstall -q -r ./requirements.txt

Let's restart the kernel to make sure the Python packages are installed and imported correctly.

In [None]:
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

## Overview

In this notebook, we use the SageMaker FAQ, which consists of 170 questions and answers, to build a RAG (Retrieval-Augmented Generation) application and evaluate it with Ragas and LlamaIndex.

We build the RAG application with LangChain, a Python framework for developing applications powered by language models.

We use Amazon Bedrock with Claude 3 Sonnet as the underlying language model for our RAG application.

In [None]:
import logging
import boto3
import time
import json
import uuid
import pprint
import os

# getting boto3 clients for required AWS services
sts_client = boto3.client('sts')
iam_client = boto3.client('iam')
s3_client = boto3.client('s3')
lambda_client = boto3.client('lambda')
bedrock_agent_client = boto3.client('bedrock-agent')
bedrock_agent_runtime_client = boto3.client('bedrock-agent-runtime')
bedrock_client = boto3.client('bedrock-runtime')

session = boto3.session.Session()
region = session.region_name
account_id = sts_client.get_caller_identity()["Account"]
region, account_id

Set up the source web URL and the model IDs.
Before you start, set up a knowledge base manually and put its ID in the `utils` module.


In [None]:
from botocore.client import Config
from langchain.llms.bedrock import Bedrock
from langchain_aws import ChatBedrock
from langchain.embeddings import BedrockEmbeddings
from langchain.retrievers.bedrock import AmazonKnowledgeBasesRetriever
from langchain.chains import RetrievalQA
import nest_asyncio
nest_asyncio.apply()

#URL to fetch the document
SAGEMAKER_URL="https://aws.amazon.com/sagemaker/faqs/"

#Bedrock parameters
EMBEDDING_MODEL="amazon.titan-embed-text-v2:0"
BEDROCK_MODEL_ID="anthropic.claude-3-sonnet-20240229-v1:0"


bedrock_embeddings = BedrockEmbeddings(model_id=EMBEDDING_MODEL,client=bedrock_client)

model_kwargs = {"temperature": 0, "top_k": 250, "top_p": 1,"stop_sequences": ["\n\nHuman:"]}

llm_bedrock = ChatBedrock(model_id=BEDROCK_MODEL_ID,model_kwargs=model_kwargs)
    

## Split the FAQ document into chunks


1. **Website Scraping and Data Loading**: Load the FAQ data using the `WebBaseLoader` class from LangChain to parse the FAQ website and load it into LangChain document objects.
2. **Document Splitting**: Split the document into chunks of size 2000 with an overlap of 100, matching the values passed to `split_document_from_url` in the cell below.
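
To make the chunk size and overlap parameters concrete, here is a minimal sketch of a sliding-window splitter. It is an illustration only: `split_with_overlap` is a hypothetical helper that counts characters, and the notebook's `split_document_from_url` (defined in `utils`, not shown here) may split differently, for example on separators such as paragraphs or sentences.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that share an overlap.

    Hypothetical helper for illustration; not the notebook's utils function.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Small numbers so the overlap is easy to see by eye.
sample = "".join(str(i % 10) for i in range(25))
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last `chunk_overlap` characters of the previous one, so a sentence that straddles a boundary is still fully visible in at least one chunk when the overlap is large enough.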


In [None]:
from utils import split_document_from_url, get_bedrock_retriever

from botocore.exceptions import ClientError


text_chunks = split_document_from_url(SAGEMAKER_URL, chunck_size= 2000,  chunk_overlap=100)
retriever_db= get_bedrock_retriever(text_chunks, region)

Let's create a retrieval chain using LangChain and Amazon Bedrock.

In [None]:
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain


system_prompt = (
    "Use the given context to answer the question. "
    "If you don't know the answer, say you don't know. "
    "Use three sentence maximum and keep the answer concise and short. "
    "Context: {context}"
    )

prompt_template = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    ("human", "{input}")
    ]
)
question_answer_chain = create_stuff_documents_chain(llm_bedrock, prompt_template)

chain = create_retrieval_chain(retriever_db, question_answer_chain)


In [None]:
query = "What is Amazon SageMaker?"
result=chain.invoke({"input": query})['answer']
print(result)
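
To illustrate what the retrieval chain does between the retriever and the model, the sketch below mimics the "stuff" pattern in plain Python: every retrieved chunk is concatenated into one context string, which is then substituted into the system prompt. The `Doc` class, the `"\n\n"` separator, and the sample texts are assumptions made for illustration; the real chain calls the Bedrock retriever and model instead.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Stand-in for a retrieved document; only the text is modelled."""
    page_content: str


SYSTEM_PROMPT = (
    "Use the given context to answer the question. "
    "If you don't know the answer, say you don't know. "
    "Context: {context}"
)


def stuff_documents(docs: list[Doc], separator: str = "\n\n") -> str:
    """Concatenate every retrieved chunk into a single context string."""
    return separator.join(doc.page_content for doc in docs)


def build_messages(question: str, docs: list[Doc]) -> list[tuple[str, str]]:
    """Fill a system/human message pair like the prompt template above."""
    context = stuff_documents(docs)
    return [
        ("system", SYSTEM_PROMPT.format(context=context)),
        ("human", question),
    ]


# Placeholder chunks, not actual retriever output.
retrieved = [
    Doc("SageMaker is a fully managed ML service."),
    Doc("It supports training and hosting."),
]
messages = build_messages("What is Amazon SageMaker?", retrieved)
for role, content in messages:
    print(f"{role}: {content}")
```

Besides `answer`, the dictionary returned by a chain built with `create_retrieval_chain` also carries the original `input` and the retrieved documents under `context`, which is what makes the retrieved context available for evaluation later.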

## Evaluate RAG with Ragas


In this example we use the Ragas library to evaluate the RAG application with 3 metrics:


1. **Faithfulness**: Measures the factual consistency of the generated answer against the retrieved context. It is calculated from the answer and the retrieved context, which makes it useful for detecting hallucinated responses.

2. **Answer relevancy**: Assesses how pertinent the generated answer is to the given prompt. It is useful for checking whether the response actually answers the query.

3. **Answer correctness**: Measures the accuracy of the generated answer when compared with the ground truth. It is calculated from the ground-truth answer and the generated answer.
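
The arithmetic behind these three scores can be sketched in a few lines. This is a simplified illustration, not the Ragas implementation: in Ragas the claim verdicts, the statement classification, and the generated questions come from the LLM and the embedding model, whereas here they are hard-coded inputs. The function names and the `(0.75, 0.25)` weights are assumptions chosen for the example.

```python
import math


def faithfulness_score(claim_supported: list[bool]) -> float:
    """Fraction of answer claims that the retrieved context supports."""
    if not claim_supported:
        return 0.0  # no claims: this sketch simply returns 0.0
    return sum(claim_supported) / len(claim_supported)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def answer_relevancy_score(question_vec, generated_question_vecs) -> float:
    """Mean similarity between the question and questions derived from the answer."""
    sims = [cosine(question_vec, vec) for vec in generated_question_vecs]
    return sum(sims) / len(sims) if sims else 0.0


def factual_f1(tp: int, fp: int, fn: int) -> float:
    """F1 over statements: tp = in both answer and ground truth,
    fp = only in the answer, fn = only in the ground truth."""
    denominator = tp + 0.5 * (fp + fn)
    return tp / denominator if denominator else 0.0


def answer_correctness_score(f1, similarity, weights=(0.75, 0.25)) -> float:
    """Weighted mix of factual overlap and semantic similarity (weights assumed)."""
    w_fact, w_sim = weights
    return (w_fact * f1 + w_sim * similarity) / (w_fact + w_sim)


# Hard-coded inputs standing in for LLM and embedding outputs.
faith = faithfulness_score([True, True, False])
relevancy = answer_relevancy_score([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
f1 = factual_f1(tp=2, fp=1, fn=1)
correctness = answer_correctness_score(f1, similarity=0.9)
print(faith, relevancy, f1, correctness)
```

All three scores fall between 0 and 1 in this sketch, with higher values indicating a better answer.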


Note: Ragas offers a wide range of metrics to evaluate RAG applications. For more information, please refer to the [Ragas documentation](https://docs.ragas.io/en/latest/concepts/metrics/index.html#).

In [None]:
EVAL_QUESTIONS = [
    "Can I stop a SageMaker Autopilot job manually?",
    "Do I get charged separately for each notebook created and run in SageMaker Studio?",
    "Do I get charged for creating and setting up an SageMaker Studio domain?",
    "Will my data be used or shared to update the base model that is offered to customers using SageMaker JumpStart?",
]

#Defining the ground truth answers for each question
EVAL_ANSWERS = [
    "Yes. You can stop a job at any time. When a SageMaker Autopilot job is stopped, all ongoing trials will be stopped and no new trial will be started.",
    """No. You can create and run multiple notebooks on the same compute instance. 
    You pay only for the compute that you use, not for individual items. 
    You can read more about this in our metering guide.
    In addition to the notebooks, you can also start and run terminals and interactive shells in SageMaker Studio, all on the same compute instance.""",
    "No, you don’t get charged for creating or configuring an SageMaker Studio domain, including adding, updating, and deleting user profiles.",
    "No. Your inference and training data will not be used nor shared to update or train the base model that SageMaker JumpStart surfaces to customers."
]


Let's use LangChain's batch invocation to get an answer for each question in the `EVAL_QUESTIONS` list.

Once we have the answers, Ragas expects a dataset in Hugging Face format. Let's create the dataset before evaluating the RAG application.
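
As a rough picture of the dataset layout, the sketch below collects the questions, ground truths, and chain outputs into column lists. It is an assumption about what a helper such as `build_dataset` might do, since that function lives in `utils` and is not shown; `to_ragas_columns` and `Doc` are hypothetical names, the sample values are placeholders, and the column names (`question`, `answer`, `contexts`, `ground_truth`) follow one common Ragas schema that can vary between Ragas versions.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Stand-in for a retrieved document; only the text is modelled."""
    page_content: str


def to_ragas_columns(questions, ground_truths, chain_outputs) -> dict:
    """Arrange evaluation data as one list per column (hypothetical helper)."""
    if not (len(questions) == len(ground_truths) == len(chain_outputs)):
        raise ValueError("questions, ground_truths and chain_outputs must align")
    return {
        "question": list(questions),
        "answer": [out["answer"] for out in chain_outputs],
        # One list of context strings per question.
        "contexts": [
            [doc.page_content for doc in out["context"]] for out in chain_outputs
        ],
        "ground_truth": list(ground_truths),
    }


# Placeholder chain output shaped like {"input", "context", "answer"}.
fake_outputs = [
    {
        "input": "Can I stop a job?",
        "context": [Doc("You can stop a job at any time.")],
        "answer": "Yes, at any time.",
    }
]
columns = to_ragas_columns(["Can I stop a job?"], ["Yes."], fake_outputs)
print(columns)
```

A column dictionary like this can be turned into a Hugging Face dataset with `datasets.Dataset.from_dict(columns)`.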

In [None]:
from utils import build_dataset
from ragas.metrics import answer_relevancy, faithfulness, answer_correctness
from ragas import evaluate

#Batch invoke and dataset creation
result_batch_questions = chain.batch([{"input": q} for q in EVAL_QUESTIONS])

dataset= build_dataset(EVAL_QUESTIONS,EVAL_ANSWERS,result_batch_questions, text_chunks)

result = evaluate(dataset=dataset, metrics=[answer_relevancy, faithfulness, answer_correctness],llm=llm_bedrock, embeddings=bedrock_embeddings, raise_exceptions=False )
df = result.to_pandas()
df.head()

## Evaluate RAG with LlamaIndex


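LlamaIndex is a data framework for LLM-based applications to ingest, structure, and access private or domain-specific data. It is available in Python and TypeScript. Here we use its evaluation module to score the same dataset for faithfulness, answer relevancy, and correctness.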
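
The cells below rely on `evaluate_llama_index_metric` from `utils`, which is not shown in this notebook. One plausible shape for such a helper is a loop that calls the evaluator's `evaluate` method once per row and collects the score, pass/fail flag, and feedback. The sketch below is an assumption, not the actual helper: `KeywordEvaluator`, `FakeResult`, and `evaluate_rows` are hypothetical stand-ins so the example runs offline without an LLM, and the rows are placeholder data.

```python
from dataclasses import dataclass


@dataclass
class FakeResult:
    """Mimics the fields read from an evaluation result."""
    score: float
    passing: bool
    feedback: str


class KeywordEvaluator:
    """Offline stand-in for an LLM-backed evaluator (hypothetical)."""

    def evaluate(self, query=None, response=None, contexts=None, **kwargs):
        # Passes only when the response text appears verbatim in some context.
        passing = any(response.lower() in ctx.lower() for ctx in contexts or [])
        feedback = "supported by context" if passing else "not found in context"
        return FakeResult(score=1.0 if passing else 0.0, passing=passing, feedback=feedback)


def evaluate_rows(evaluator, rows: list[dict]) -> list[dict]:
    """Run one evaluator over every row and collect flat records."""
    records = []
    for row in rows:
        result = evaluator.evaluate(
            query=row["question"],
            response=row["answer"],
            contexts=row["contexts"],
            reference=row["ground_truth"],  # used by correctness-style evaluators
        )
        records.append({
            "question": row["question"],
            "score": result.score,
            "passing": result.passing,
            "feedback": result.feedback,
        })
    return records


rows = [
    {"question": "Q1", "answer": "Yes, you can stop a job",
     "contexts": ["Yes, you can stop a job at any time."], "ground_truth": "Yes."},
    {"question": "Q2", "answer": "Notebooks are billed individually",
     "contexts": ["You pay only for the compute that you use."], "ground_truth": "No."},
]
records = evaluate_rows(KeywordEvaluator(), rows)
for record in records:
    print(record)
```

A list of records like this can be passed to `pandas.DataFrame` to get a table that supports `.head()`, as the cells below expect.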






In [None]:

from llama_index.llms.bedrock import Bedrock
from llama_index.core.evaluation import (
    AnswerRelevancyEvaluator,
    CorrectnessEvaluator,
    FaithfulnessEvaluator
)

from utils import evaluate_llama_index_metric



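bedrock_llm_llama = Bedrock(model=BEDROCK_MODEL_ID)
# Note: `faithfulness` and `answer_relevancy` below shadow the Ragas metrics imported earlier.
# Re-run the Ragas import cell before re-running the Ragas evaluation.
faithfulness= FaithfulnessEvaluator(llm=bedrock_llm_llama)
answer_relevancy= AnswerRelevancyEvaluator(llm=bedrock_llm_llama)
correctness= CorrectnessEvaluator(llm=bedrock_llm_llama)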

In [None]:

df_faithfulness= evaluate_llama_index_metric(faithfulness, dataset)
df_faithfulness.head()

In [None]:

df_answer_relevancy= evaluate_llama_index_metric(answer_relevancy, dataset)
df_answer_relevancy.head()


In [None]:

df_correctness= evaluate_llama_index_metric(correctness, dataset)
df_correctness.head()