# Knowledge Base Evaluation using RAGAS
This notebook evaluates Amazon Bedrock Knowledge Bases using the RAGAS framework.

In [37]:
!pip install ragas
!pip install datasets
!pip install pandas
!pip install boto3
!pip install langchain
!pip install langchain-aws
!pip install nltk


In [84]:
# Import required libraries
import pandas as pd
import boto3
from datetime import datetime
from langchain_aws import ChatBedrockConverse, BedrockEmbeddings
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, SemanticSimilarity
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper


In [85]:
# This knowledge base was created via the AWS console with all defaults. Its S3 data source indexes this
# file: https://docs.aws.amazon.com/pdfs/whitepapers/latest/aws-overview/aws-overview.pdf
BEDROCK_KNOWLEDGE_BASE_ID = "GXHTSVCWZI"
REGION_NAME = "us-east-1"
DEFAULT_LLM_MODEL_ID = 'anthropic.claude-3-haiku-20240307-v1:0'

# The questions we'll be asking the knowledge base and the answers we expect to get back ("ground truth")
test_data_aws_services = [
    {
        "question": "What is AWS Lambda and how does it work?",
        "ground_truth": "AWS Lambda is a serverless compute service that runs code in response to events without managing servers. It automatically scales and only charges for actual compute time used."
    },
    {
        "question": "What is Amazon S3's durability guarantee?",
        "ground_truth": "Amazon S3 provides 99.999999999% (11 9's) durability for objects stored in all S3 storage classes across multiple Availability Zones."
    },
    {
        "question": "How does AWS Direct Connect differ from VPN?",
        "ground_truth": "AWS Direct Connect provides dedicated physical connections to AWS, while VPN creates encrypted tunnels over the public internet. Direct Connect offers more consistent network performance and lower latency."
    },
    {
        "question": "What is Amazon Aurora and its key benefits?",
        "ground_truth": "Amazon Aurora is a MySQL/PostgreSQL-compatible database offering up to 5x performance of MySQL and 3x of PostgreSQL, with automated scaling, backup, and fault tolerance built-in."
    },
    {
        "question": "How does AWS Shield protect against DDoS attacks?",
        "ground_truth": "AWS Shield provides automatic DDoS protection for all AWS customers at the network/transport layer (Standard) and additional protection with advanced monitoring for higher-level attacks (Advanced)."
    },
    {
        "question": "What is Amazon EKS and its primary use case?",
        "ground_truth": "Amazon Elastic Kubernetes Service (EKS) is a managed Kubernetes service for running containerized applications at scale, eliminating the need to manage the Kubernetes control plane."
    },
    {
        "question": "How does AWS CloudFormation enable Infrastructure as Code?",
        "ground_truth": "AWS CloudFormation allows you to define infrastructure using templates (JSON/YAML), enabling automated, version-controlled deployment and management of AWS resources."
    },
    {
        "question": "Which AWS service should I used to store my applicative passwords?",
        "ground_truth": "For storing application passwords securely in AWS use AWS Secrets Manager."
    },
    {
        "question": "How do replace a spare tire?",
        "ground_truth": "Park on flat surface, loosen lug nuts, jack up car, remove flat tire, mount spare, tighten lug nuts in star pattern, lower car, verify lug nut tightness."
    },
    {
        "question": "What is Amazon SageMaker's core functionality?",
        "ground_truth": "Amazon SageMaker generates animations of flying shawarmas using serverless technology."
    }
]

In [86]:
def get_bedrock_llm_and_embeddings_for_ragas(llm_model=DEFAULT_LLM_MODEL_ID):
    """Return the Bedrock LLM and embeddings, wrapped in the LangChain adapters that RAGAS expects."""
    config = {
        "region_name": REGION_NAME,
        "llm": llm_model,
        "embeddings": "amazon.titan-embed-text-v1",
        "temperature": 0.1,
    }

    bedrock_llm = ChatBedrockConverse(
        region_name=config["region_name"],
        model=config["llm"],
        temperature=config["temperature"],
    )

    bedrock_embeddings = BedrockEmbeddings(
        region_name=config["region_name"],
        model_id=config["embeddings"],
    )

    return LangchainLLMWrapper(bedrock_llm), LangchainEmbeddingsWrapper(bedrock_embeddings)

In [87]:
bedrock_runtime = boto3.client(
    service_name = 'bedrock-agent-runtime',
    region_name = REGION_NAME
)

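def query_knowledge_base(question: str, model_arn: str, number_of_results: int):
    """Query the knowledge base; return the generated answer and the cited chunk texts, or None on error."""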
    try:
        response = bedrock_runtime.retrieve_and_generate(
            input={'text': question},
            retrieveAndGenerateConfiguration={
                'type': 'KNOWLEDGE_BASE',
                'knowledgeBaseConfiguration': {
                    'knowledgeBaseId': BEDROCK_KNOWLEDGE_BASE_ID,
                    'modelArn': model_arn,
                    'retrievalConfiguration':{
                        'vectorSearchConfiguration': {
                            'numberOfResults': number_of_results
                        }
                    }
                }
            }
        )
        
        return {
            "output": response["output"]["text"],
            "citations": [ref['content']['text'] for citation in response.get('citations', [])
                         for ref in citation.get('retrievedReferences', [])
                         if ref.get('content', {}).get('text')]
        }
    except Exception as e:
        print(f"Error: {str(e)}")
        return None
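The nested comprehension in `query_knowledge_base` flattens the text of every retrieved reference across all citations and skips references whose text is empty. The sketch below isolates that step so it can be tried without calling Bedrock. The helper name `extract_citation_texts` and the `mock_response` dictionary are hypothetical; the mock is hand-written to mirror only the fields the notebook reads, and is not real service output.

```python
def extract_citation_texts(response: dict) -> list[str]:
    """Flatten every non-empty retrieved-reference text out of a response dict."""
    return [
        ref["content"]["text"]
        for citation in response.get("citations", [])
        for ref in citation.get("retrievedReferences", [])
        if ref.get("content", {}).get("text")
    ]


# Hand-written stand-in for a retrieve_and_generate response (assumption, not real output).
mock_response = {
    "output": {"text": "Example answer."},
    "citations": [
        {
            "retrievedReferences": [
                {"content": {"text": "First chunk."}},
                {"content": {"text": ""}},  # empty text is filtered out
            ]
        },
        {"retrievedReferences": []},  # a citation with no references contributes nothing
        {"retrievedReferences": [{"content": {"text": "Second chunk."}}]},
    ],
}

print(extract_citation_texts(mock_response))
# prints ['First chunk.', 'Second chunk.']
```

A response with no citations at all yields an empty list, which matches the `[]` shown under `retrieved_contexts` for the off-topic spare tire question in the results table.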

In [88]:
query_knowledge_base

<function __main__.query_knowledge_base(question: str, model_arn: str, number_of_results: int)>

In [89]:
def generate_answers(test_data: list[dict], model_arn: str = DEFAULT_LLM_MODEL_ID, number_of_results: int = 3):
    print('Generating answers')
    answers = []
    for item in test_data:
        response = query_knowledge_base(
            question = item["question"], 
            model_arn = model_arn, 
            number_of_results = number_of_results)
                    
        if response:
            answers.append({
                "question": item["question"],
                "answer": response["output"],
                "ground_truth": item["ground_truth"],
                "retrieved_contexts": response["citations"]
            })
    return answers


def evaluate_knowledge_base(answers):
    dataset = Dataset.from_pandas(pd.DataFrame(answers))
    
    metrics = [
        SemanticSimilarity(),
        LLMContextRecall(),
        FactualCorrectness(),
        Faithfulness()
    ]

    llm, embeddings = get_bedrock_llm_and_embeddings_for_ragas()
    print('Evaluating answers')
    results = evaluate(
        dataset=dataset,
        metrics=metrics,
        llm=llm,
        embeddings=embeddings
    )

    return results
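The records built by `generate_answers` use the keys `question`, `answer`, and `ground_truth`, while the results table printed later in this notebook shows `user_input`, `response`, and `reference`. Presumably the installed RAGAS version renames the older keys itself. The sketch below performs that rename explicitly so the dataset does not depend on it. The mapping is inferred from the notebook's own output rather than from RAGAS documentation, and `to_ragas_record` is a hypothetical helper.

```python
# Key mapping inferred from the column names in this notebook's results table (assumption).
COLUMN_MAP = {
    "question": "user_input",
    "answer": "response",
    "ground_truth": "reference",
}


def to_ragas_record(answer: dict) -> dict:
    """Rename legacy keys; keys not in COLUMN_MAP (e.g. retrieved_contexts) pass through."""
    return {COLUMN_MAP.get(key, key): value for key, value in answer.items()}


# Illustrative record with made-up values, shaped like one item from generate_answers.
sample = {
    "question": "What is AWS Lambda?",
    "answer": "A serverless compute service.",
    "ground_truth": "AWS Lambda runs code without managing servers.",
    "retrieved_contexts": ["Lambda lets you run code without provisioning servers."],
}

print(sorted(to_ragas_record(sample)))
```

Applied to the notebook's data, this would be `[to_ragas_record(a) for a in answers]` before the list is turned into a DataFrame.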

In [90]:
answers_aws_services = generate_answers(test_data_aws_services) 
answers_aws_services[0]

Generating answers


{'question': 'What is AWS Lambda and how does it work?',
 'answer': 'AWS Lambda is a serverless computing service provided by Amazon Web Services (AWS). It allows you to run code without provisioning or managing servers. With Lambda, you can run code for virtually any type of application or backend service, and you only pay for the compute time you consume - there is no charge when your code is not running. To use AWS Lambda, you simply upload your code, and Lambda takes care of everything required to run and scale your code with high availability. You can set up your code to automatically run from other AWS services, or you can call it directly from any web or mobile app.',
 'ground_truth': 'AWS Lambda is a serverless compute service that runs code in response to events without managing servers. It automatically scales and only charges for actual compute time used.',
 'retrieved_contexts': ['Amazon ECS has two modes: Fargate launch type and EC2 launch type. With Fargate launch type, a

## Run evaluation

In [91]:
eval_results = evaluate_knowledge_base(answers_aws_services)

Evaluating answers


Evaluating:  40%|████      | 16/40 [00:08<00:22,  1.07it/s]Exception raised in Job[9]: AttributeError('StringIO' object has no attribute 'classifications')
Evaluating:  42%|████▎     | 17/40 [00:09<00:25,  1.11s/it]Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
Prompt n_l_i_statement_prompt failed to parse output: The output parser failed to parse the output including retries.
Exception raised in Job[7]: RagasOutputParserException(The output parser failed to parse the output including retries.)
Evaluating:  70%|███████   | 28/40 [00:15<00:06,  1.88it/s]Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
Prompt fix_output_format failed to p

## Analyze evaluation results
### Show overall results

In [93]:
print(eval_results)

# Convert RAGAS results to DataFrame
df_eval_results = eval_results.to_pandas()
# Show a statistical summary of each metric
display(df_eval_results.describe())

{'semantic_similarity': 0.8131, 'context_recall': 0.8889, 'factual_correctness': 0.5000, 'faithfulness': 0.7500}


| | semantic_similarity | context_recall | factual_correctness | faithfulness |
|---|---|---|---|---|
| count | 10.0 | 9.0 | 6.0 | 5.0 |
| mean | 0.813126 | 0.888889 | 0.5 | 0.75 |
| std | 0.28044 | 0.333333 | 0.34854 | 0.433013 |
| min | 0.09795 | 0.0 | 0.0 | 0.0 |
| 25% | 0.908352 | 1.0 | 0.2525 | 0.75 |
| 50% | 0.920366 | 1.0 | 0.585 | 1.0 |
| 75% | 0.948943 | 1.0 | 0.7675 | 1.0 |
| max | 0.970209 | 1.0 | 0.86 | 1.0 |
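The `count` row differs per metric (10, 9, 6, 5), so each headline score is an average over a different subset of questions. The missing values presumably correspond to the jobs that raised parser exceptions during evaluation. The sketch below reproduces the counts and means from the per-row scores listed in the sorted results table further down, using `None` where that table has no value. The helper `summarize` is hypothetical and uses only the standard library.

```python
from statistics import mean

# Per-row scores in index order 0..9, copied from this notebook's per-row results table.
# None marks a score that is missing there (pandas shows it as NaN).
context_recall = [1.0, 1.0, None, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0]
factual_correctness = [None, 0.5, 0.86, None, 0.67, None, None, 0.8, 0.0, 0.17]
faithfulness = [1.0, None, None, None, None, 1.0, None, 0.75, 0.0, 1.0]


def summarize(scores):
    """Return (count, mean) over the scores that are present."""
    present = [s for s in scores if s is not None]
    return len(present), mean(present)


for name, scores in [
    ("context_recall", context_recall),
    ("factual_correctness", factual_correctness),
    ("faithfulness", faithfulness),
]:
    count, average = summarize(scores)
    print(f"{name}: count={count}, mean={average:.4f}")
```

Because `faithfulness` is averaged over only five rows, a single failed or recovered job moves it noticeably, so comparisons between runs are safer when the counts are reported alongside the means.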


### Show individual results sorted by descending similarity 
The cases with the lowest similarity are the most interesting to examine.

In [95]:
# Display all metrics per row sorted by semantic_similarity score:
display(df_eval_results.sort_values('semantic_similarity', ascending=False))

| | user_input | retrieved_contexts | response | reference | semantic_similarity | context_recall | factual_correctness | faithfulness |
|---|---|---|---|---|---|---|---|---|
| 2 | How does AWS Direct Connect differ from VPN? | [This solution can be time consuming to build ... | AWS Direct Connect allows you to establish a d... | AWS Direct Connect provides dedicated physical... | 0.970209 | NaN | 0.86 | NaN |
| 0 | What is AWS Lambda and how does it work? | [Amazon ECS has two modes: Fargate launch type... | AWS Lambda is a serverless computing service p... | AWS Lambda is a serverless compute service tha... | 0.965721 | 1.0 | NaN | 1.0 |
| 4 | How does AWS Shield protect against DDoS attacks? | [Security Hub has out-of-the-box integrations ... | AWS Shield provides two tiers of DDoS protecti... | AWS Shield provides automatic DDoS protection ... | 0.952671 | 1.0 | 0.67 | NaN |
| 5 | What is Amazon EKS and its primary use case? | [Elastic Kubernetes Service (Amazon EKS) — Ful... | Amazon EKS (Elastic Kubernetes Service) is a f... | Amazon Elastic Kubernetes Service (EKS) is a m... | 0.937758 | 1.0 | NaN | 1.0 |
| 3 | What is Amazon Aurora and its key benefits? | [Amazon Aurora is up to five times faster than... | Amazon Aurora is a fully managed database engi... | Amazon Aurora is a MySQL/PostgreSQL-compatible... | 0.923487 | 1.0 | NaN | NaN |
| 6 | How does AWS CloudFormation enable Infrastruct... | [AWS Chatbot manages the integration between A... | AWS CloudFormation enables Infrastructure as C... | AWS CloudFormation allows you to define infras... | 0.917246 | 1.0 | NaN | NaN |
| 7 | Which AWS service should I used to store my ap... | [For general information, see Security, Identi... | Based on the search results, the AWS service y... | For storing application passwords securely in ... | 0.913954 | 1.0 | 0.8 | 0.75 |
| 1 | What is Amazon S3's durability guarantee? | [Amazon Simple Storage Service Amazon Simp... | Amazon S3 is designed for 99.999999999% (11 9s... | Amazon S3 provides 99.999999999% (11 9's) dura... | 0.906484 | 1.0 | 0.5 | NaN |
| 9 | What is Amazon SageMaker's core functionality? | [You can increase your productivity by using p... | Amazon SageMaker is a fully managed machine le... | Amazon SageMaker generates animations of flyin... | 0.545776 | 0.0 | 0.17 | 1.0 |
| 8 | How do replace a spare tire? | [] | Sorry, I am unable to assist you with this req... | Park on flat surface, loosen lug nuts, jack up... | 0.09795 | 1.0 | 0.0 | 0.0 |
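Rather than reading the sorted table by eye, the rows worth inspecting can be selected with a cutoff. The sketch below does this on the scores from the table above. The threshold of 0.8 and the helper `flag_low_similarity` are hypothetical choices for illustration, not part of the original notebook.

```python
# (row index, user_input, semantic_similarity) copied from the results table above.
rows = [
    (2, "How does AWS Direct Connect differ from VPN?", 0.970209),
    (0, "What is AWS Lambda and how does it work?", 0.965721),
    (4, "How does AWS Shield protect against DDoS attacks?", 0.952671),
    (5, "What is Amazon EKS and its primary use case?", 0.937758),
    (3, "What is Amazon Aurora and its key benefits?", 0.923487),
    (6, "How does AWS CloudFormation enable Infrastruct...", 0.917246),
    (7, "Which AWS service should I used to store my ap...", 0.913954),
    (1, "What is Amazon S3's durability guarantee?", 0.906484),
    (9, "What is Amazon SageMaker's core functionality?", 0.545776),
    (8, "How do replace a spare tire?", 0.09795),
]

SIMILARITY_THRESHOLD = 0.8  # hypothetical cutoff; tune it for your own data


def flag_low_similarity(rows, threshold):
    """Return rows scoring below the threshold, lowest similarity first."""
    flagged = [row for row in rows if row[2] < threshold]
    return sorted(flagged, key=lambda row: row[2])


for index, question, score in flag_low_similarity(rows, SIMILARITY_THRESHOLD):
    print(f"row {index}: {score:.4f}  {question}")
```

With this cutoff only rows 8 and 9 are flagged: the off-topic spare tire question and the SageMaker question whose reference answer does not describe SageMaker. On the DataFrame itself, the same selection is `df_eval_results[df_eval_results["semantic_similarity"] < SIMILARITY_THRESHOLD]`.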


In [97]:
display(df_eval_results.iloc[9])
print('\n')
print(f'Response: {df_eval_results.iloc[9]["response"]}')
print('\n')
print(f'Reference: {df_eval_results.iloc[9]["reference"]}')

user_input                What is Amazon SageMaker's core functionality?
retrieved_contexts     [You can increase your productivity by using p...
response               Amazon SageMaker is a fully managed machine le...
reference              Amazon SageMaker generates animations of flyin...
semantic_similarity                                             0.545776
context_recall                                                       0.0
factual_correctness                                                 0.17
faithfulness                                                         1.0
Name: 9, dtype: object



Response: Amazon SageMaker is a fully managed machine learning service that provides the ability to build, train, and deploy machine learning models quickly. Its core functionality includes: - Providing purpose-built algorithms and pre-trained ML models to speed up model building and training
- Offering built-in visualization tools to explore prediction outputs on an interactive map
- Enabling collaboration across teams on insights and results
- Automating the process of finding the best machine learning model for a given dataset through SageMaker Autopilot
- Providing a visual point-and-click interface through SageMaker Canvas that allows business analysts to generate accurate ML predictions without coding
- Detecting potential bias in data and models, and explaining model predictions through SageMaker Clarify


Reference: Amazon SageMaker generates animations of flying shawarmas using serverless technology.
