# Chapter 11 - RAG and Model Evaluation: Evaluating Amazon Bedrock Knowledge Bases with RAGAS Framework

## Introduction
This notebook demonstrates how to build and evaluate a Question & Answer application using Amazon Bedrock Knowledge Bases with the Retrieval Augmented Generation Assessment (RAGAS) framework. We'll use Amazon Bedrock's Retrieve API to perform semantic search against a knowledge base, generate responses with Anthropic Claude, and evaluate those responses using the RAGAS evaluation metrics.

## Prerequisites
- An Amazon Bedrock Knowledge Base created and populated with documents
- Knowledge Base ID available from a previous setup step
- Access to Amazon Bedrock foundation models (Claude 3 Haiku and Claude 3 Sonnet)
- Python 3.10 environment

## Setup

### Install Required Dependencies

In [None]:
%pip install --upgrade pip --quiet
%pip install -r requirements.txt --no-deps --quiet
%pip install -r requirements.txt --upgrade --quiet

In [None]:
# Restart the kernel so the newly installed packages are picked up
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

### Initialize AWS Clients and Models

In [None]:
kb_id = "SI12QCDPJO" # Replace with your knowledge base id here.


In [None]:
import boto3
import pprint
from botocore.client import Config
from langchain.llms.bedrock import Bedrock
from langchain_community.chat_models.bedrock import BedrockChat
from langchain.embeddings import BedrockEmbeddings
from langchain.retrievers.bedrock import AmazonKnowledgeBasesRetriever
from langchain.chains import RetrievalQA

pp = pprint.PrettyPrinter(indent=2)
# Configure the agent runtime client with longer timeouts and automatic retries disabled
bedrock_config = Config(connect_timeout=120, read_timeout=120, retries={'max_attempts': 0})
bedrock_client = boto3.client('bedrock-runtime')
bedrock_agent_client = boto3.client("bedrock-agent-runtime", config=bedrock_config)
# Initialize Claude 3 Haiku for text generation
llm_for_text_generation = BedrockChat(model_id="anthropic.claude-3-haiku-20240307-v1:0", client=bedrock_client)
# Initialize Claude 3 Sonnet for evaluation (more powerful model)
llm_for_evaluation = BedrockChat(model_id="anthropic.claude-3-sonnet-20240229-v1:0", client=bedrock_client)
# Initialize Titan embeddings model
bedrock_embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v2:0", client=bedrock_client)

### Create Retriever from Knowledge Base

In [None]:
# Initialize the Amazon Knowledge Bases retriever to fetch top 5 results
retriever = AmazonKnowledgeBasesRetriever(
    knowledge_base_id=kb_id,
    retrieval_config={"vectorSearchConfiguration": {"numberOfResults": 5}},
    # endpoint_url=endpoint_url,
    # region_name="us-east-1",
    # credentials_profile_name="<profile_name>",
)
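
The retriever above wraps the Amazon Bedrock Retrieve API. For readers who want to see the underlying call, here is a minimal sketch of querying a knowledge base directly through a `bedrock-agent-runtime` client. The helper names `retrieve_contexts` and `extract_contexts` are hypothetical, and the request and response field names (`retrievalQuery`, `retrievalResults`, `content.text`, `score`) are assumed to follow the boto3 `retrieve` operation. The client is passed in as an argument so the helper can be exercised with a stub.

```python
from typing import Any


def extract_contexts(response: dict[str, Any]) -> list[dict[str, Any]]:
    """Pull the chunk text and relevance score out of a Retrieve API response."""
    return [
        {"text": item["content"]["text"], "score": item.get("score")}
        for item in response.get("retrievalResults", [])
    ]


def retrieve_contexts(client: Any, kb_id: str, query: str, top_k: int = 5) -> list[dict[str, Any]]:
    """Run a semantic search against a knowledge base and return up to top_k chunks."""
    response = client.retrieve(
        knowledgeBaseId=kb_id,
        retrievalQuery={"text": query},
        retrievalConfiguration={
            "vectorSearchConfiguration": {"numberOfResults": top_k}
        },
    )
    return extract_contexts(response)
```

With the clients created earlier, a call would look like `retrieve_contexts(bedrock_agent_client, kb_id, "your question here")`.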

## Generate Model Response

### Set Up Retrieval QA Chain and Test with a Sample Query

In [None]:
query = "How did Amazon's operating income change in 2023 compared to 2022?"
# Create a RetrievalQA chain that combines retrieval and LLM generation
qa_chain = RetrievalQA.from_chain_type(
    llm=llm_for_text_generation, retriever=retriever, return_source_documents=True
)
# Generate a response to the query
response = qa_chain.invoke(query)
print(response["result"])

## Prepare Evaluation Dataset

In [None]:
from datasets import Dataset
# Define evaluation questions and ground truth answers
questions = [
    "How did Amazon's operating income change in 2023 compared to 2022?",
    "What were the key factors driving Amazon's revenue growth in 2023?",
    "What is the primary revenue mix for Amazon's AWS segment?",
    "How does Amazon describe its approach to primitives in its business strategy?"
]
ground_truths = [
    "Operating income increased from $12.2 billion in 2022 to $36.9 billion in 2023, which represents a 201% improvement.",
    "Key factors driving revenue growth included increased unit sales primarily by third-party sellers, advertising sales, and subscription services, as well as increased customer usage in AWS.",
    "AWS sales primarily come from global sales of compute, storage, database, and other services, with revenue recognized when customers use these services based on quantity of services rendered.",
    "Amazon describes primitives as discrete, foundational building blocks that builders can weave together. They enable innovation and experimentation at high rates, allowing Amazon to rapidly improve customer experiences."
]
# Generate answers and retrieve contexts for each question
answers = []
contexts = []

for query in questions:
    response = qa_chain.invoke(query)
    answers.append(response["result"])
    # Reuse the documents the chain actually retrieved so the contexts match the answer
    contexts.append([doc.page_content for doc in response["source_documents"]])

# Assemble the evaluation records
data = {
    "question": questions,
    "answer": answers,
    "contexts": contexts,
    "ground_truth": ground_truths,  # ragas 0.1.x reads reference answers from a "ground_truth" string column
}

# Convert dict to dataset
dataset = Dataset.from_dict(data)
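
Before spending LLM calls on evaluation, it can help to sanity-check the records. The sketch below is one possible check, not part of RAGAS: the function name `check_eval_columns` is hypothetical, and it assumes the `question`, `answer`, and `contexts` layout built above, where each `contexts` entry is a list of strings.

```python
def check_eval_columns(data: dict[str, list]) -> list[str]:
    """Return a list of problems found in an evaluation dict (empty if none)."""
    problems = []

    # Every column must have one entry per question
    lengths = {name: len(values) for name, values in data.items()}
    if len(set(lengths.values())) > 1:
        problems.append(f"column lengths differ: {lengths}")

    # Each contexts entry should be a non-empty list of strings
    for i, ctx in enumerate(data.get("contexts", [])):
        if not isinstance(ctx, list) or not all(isinstance(c, str) for c in ctx):
            problems.append(f"contexts[{i}] must be a list of strings")
        elif not ctx:
            problems.append(f"contexts[{i}] is empty: nothing was retrieved")

    # Each answer should be a non-blank string
    for i, answer in enumerate(data.get("answer", [])):
        if not isinstance(answer, str) or not answer.strip():
            problems.append(f"answer[{i}] is empty or not a string")

    return problems
```

Calling `check_eval_columns(data)` on the dictionary above should return an empty list when every question has an answer and at least one retrieved chunk.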

## Evaluate with RAGAS Framework

In [None]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    context_entity_recall,
    answer_similarity,
    answer_correctness
)

from ragas.metrics.critique import (
    harmfulness,
    maliciousness,
    coherence,
    correctness,
    conciseness,
)

# Specify the metrics to compute
metrics = [
    faithfulness,
    answer_relevancy,
]
# Run the evaluation
result = evaluate(
    dataset=dataset,
    metrics=metrics,
    llm=llm_for_evaluation,
    embeddings=bedrock_embeddings,
)
# Convert results to pandas DataFrame
df = result.to_pandas()

## Display Evaluation Results

In [None]:
import pandas as pd
pd.options.display.max_colwidth = 800
df
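
When the table has more than a few rows, it is easier to look only at the questions that scored poorly. The following is a minimal sketch of such a filter. The function name `flag_low_scores` and any threshold values are hypothetical choices, and it assumes each record carries a `question` field plus one numeric field per metric, which may be missing or NaN if a metric could not be computed.

```python
import math


def flag_low_scores(rows, thresholds):
    """Return (question, metric, score) triples whose score is missing or below its threshold."""
    flagged = []
    for row in rows:
        for metric, minimum in thresholds.items():
            score = row.get(metric)
            # Treat an absent or NaN score as a failure worth inspecting
            missing = score is None or math.isnan(score)
            if missing or score < minimum:
                flagged.append((row["question"], metric, score))
    return flagged
```

With the DataFrame above, this could be called as `flag_low_scores(df.to_dict("records"), {"faithfulness": 0.8, "answer_relevancy": 0.8})`, where 0.8 is only an illustrative cutoff.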

## Conclusion

The RAGAS framework provides several key metrics for evaluating RAG systems. This notebook computes only the first two; the others are imported above and can be added to the `metrics` list.

- **Faithfulness**: Measures factual consistency between the answer and the retrieved context (0-1, higher is better)
- **Answer Relevancy**: Assesses how pertinent the answer is to the given query
- **Context Precision**: Evaluates whether the relevant context items are ranked higher than irrelevant ones
- **Context Recall**: Measures how well the retrieved context aligns with the ground truth
- **Context Entity Recall**: Evaluates whether entities from the ground truth appear in the retrieved context
- **Answer Semantic Similarity**: Measures semantic resemblance between the answer and the ground truth
- **Answer Correctness**: Gauges the accuracy of the answer compared to the ground truth
- **Aspect Critique**: Assesses submissions on predefined aspects such as harmfulness and correctness
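
To make the first two scores more concrete, here is a rough sketch of the arithmetic behind them as the RAGAS documentation describes it: faithfulness is the fraction of claims in the answer that the retrieved context supports, and answer relevancy is the mean cosine similarity between the original question and questions generated back from the answer. In RAGAS, an LLM extracts and verifies the claims and generates the questions, and an embedding model produces the vectors. In this sketch those steps are replaced by hand-supplied inputs, and the function names are hypothetical.

```python
import math


def faithfulness_score(claim_supported: list[bool]) -> float:
    """Fraction of answer claims judged to be supported by the retrieved context."""
    if not claim_supported:
        return float("nan")  # no claims to verify, so the score is undefined
    return sum(claim_supported) / len(claim_supported)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def answer_relevancy_score(question_vec: list[float], generated_vecs: list[list[float]]) -> float:
    """Mean similarity between the question and the questions generated from the answer."""
    sims = [cosine_similarity(question_vec, vec) for vec in generated_vecs]
    return sum(sims) / len(sims)
```

For example, an answer with four claims of which three are supported would get `faithfulness_score([True, True, False, True])`, which is 3 / 4.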

Note: Based on evaluation results, you may need to optimize your RAG workflow by reviewing your chunking strategy, prompt instructions, or adjusting the number of retrieved results.