## Evaluating Q&A applications using Amazon Bedrock Knowledge Base - Retrieve API, Langchain, and LLaMa Index for Prompt Completion Evaluations - LLaMa-2-70b for role play + Titan for text embeddings - Retrieve Vs. RetreiveAndGenerate API eval

In [5]:
#install knowledge base sdk
%pip install --upgrade pip
%pip install boto3 --force-reinstall
%pip install botocore --force-reinstall
%pip install botocore --force-reinstall
%pip install langchain --force-reinstall --quiet

[0mNote: you may need to restart the kernel to use updated packages.
[0mCollecting boto3
  Using cached boto3-1.33.5-py3-none-any.whl.metadata (6.7 kB)
Collecting botocore<1.34.0,>=1.33.5 (from boto3)
  Using cached botocore-1.33.5-py3-none-any.whl.metadata (6.1 kB)
Collecting jmespath<2.0.0,>=0.7.1 (from boto3)
  Using cached jmespath-1.0.1-py3-none-any.whl (20 kB)
Collecting s3transfer<0.9.0,>=0.8.2 (from boto3)
  Using cached s3transfer-0.8.2-py3-none-any.whl.metadata (1.8 kB)
Collecting python-dateutil<3.0.0,>=2.1 (from botocore<1.34.0,>=1.33.5->boto3)
  Using cached python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)
Collecting urllib3<2.1,>=1.25.4 (from botocore<1.34.0,>=1.33.5->boto3)
  Using cached urllib3-2.0.7-py3-none-any.whl.metadata (6.6 kB)
Collecting six>=1.5 (from python-dateutil<3.0.0,>=2.1->botocore<1.34.0,>=1.33.5->boto3)
  Using cached six-1.16.0-py2.py3-none-any.whl (11 kB)
Using cached boto3-1.33.5-py3-none-any.whl (139 kB)
Using cached botocore-1.33.5-py3-none-

#### Restart the kernel with the updated packages that are installed through the dependencies above

In [2]:
# restart kernel
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

In [3]:
import nest_asyncio
nest_asyncio.apply()

### Follow the steps below to initiate hte bedrock client:

1. Import the necessary libraries, along with langchain for bedrock model selection, llama index to store the service context containing the llm and embedding model instances.

2. Use langchain to import bedrock embeddings and llama index for langchain embeddings

3. Configure the bedrock-runtime and the bedrock-agent-runtime to be able to initiate execution with the knowledge base associated to your account toe perform RAG and model evaluation using llama index.

4. Use the amazon.titan-embed-text-v1 as our embeddings model for chunk enbeddings during the RAG performance on user queries.

5. Initialize 'anthropic.claude-v2' as our large language model to perform query completions on using the RAG with the given knowledge base, once we get all vector searches through the retrieve API.

In [43]:


import boto3
import pprint
from botocore.client import Config
from langchain.llms.bedrock import Bedrock
from llama_index import (
    ServiceContext,
    set_global_service_context
)
from langchain.embeddings.bedrock import BedrockEmbeddings
from llama_index.embeddings import LangchainEmbedding

pp = pprint.PrettyPrinter(indent=2)



bedrock_config = Config(connect_timeout=120, read_timeout=120, retries={'max_attempts': 0})
bedrock_client = boto3.client('bedrock-runtime')
bedrock_agent_client = boto3.client("bedrock-agent-runtime",
                              # endpoint_url=endpoint_url,
                              region_name='us-east-1',
                              config=bedrock_config)
                              # aws_access_key_id=ACCESS_KEY,
                              # aws_secret_access_key=SECRET_KEY)

model_kwargs_claude = {
    "temperature": 0,
    "top_k": 10,
    "max_tokens_to_sample": 3000
}

embed_model = LangchainEmbedding(
    BedrockEmbeddings(model_id="amazon.titan-embed-text-v1")
)

llm = Bedrock(model_id="anthropic.claude-v2",
              model_kwargs=model_kwargs_claude,
              client = bedrock_client,)

service_context = ServiceContext.from_defaults(llm=llm,
                                               embed_model=embed_model)
set_global_service_context(service_context)

### Retrieve API: Process flow  - first evaluating retrieve API chunks 

Define a retrieve function that initializes the agent client to get a query that the user inputs, along with the knowledge base id they configured on the bedrock console, and the number of results they need to retrieve while performing RAG on the Knowledge base 

In [44]:
def retrieve(query, kbId, numberOfResults=3):
    return bedrock_agent_client.retrieve(
        retrievalQuery= {
            'text': query
        },
        knowledgeBaseId=kbId,
        retrievalConfiguration= {
            'vectorSearchConfiguration': {
                'numberOfResults': numberOfResults
            }
        }
    )

#### Initialize your Knowledge base id before querying responses from the initialized LLM

In [45]:
# Ask in-context question. 
kb_id =  'CJAOGIRY8U'# replace it with the Knowledge base id which you created in the first half of the workshop.

#### Have the user query inputed along with the knowledge base id, and '5' that represents the number of responses that the user wants to retrieve using the retrieve API on the knowledge base that is configured.


In [46]:
query = "Amazon Shield"
response = retrieve(query, kb_id, 3)
retrievalResults = response['retrievalResults']
pp.pprint(retrievalResults)

[ { 'content': { 'text': 'There is no software to deploy, or data sources to '
                         'enable and maintain.   Amazon GuardDuty Amazon '
                         'GuardDuty is a threat detection service that '
                         'continuously monitors for malicious or unauthorized '
                         'behavior to help you protect your AWS accounts and '
                         'workloads. It monitors for activity such as unusual '
                         'API calls or potentially unauthorized deployments '
                         'that indicate a possible account compromise. '
                         'GuardDuty also detects potentially compromised '
                         'instances or reconnaissance by attackers.   Enabled '
                         'with a few clicks in the AWS Management Console, '
                         'Amazon GuardDuty can immediately begin analyzing '
                         'billions of events across your AWS accounts for 

### You can view the scores above to double down on which chunk retrieve is the most relevant to the information that you are trying to retreive on your given query!

### Prompt Engineering Phase: Engineer LLaMa-2-70b to personalize responses 

In [48]:
from langchain.prompts import PromptTemplate

PROMPT_TEMPLATE = """
Human: You are an advanced AI system specialized in Amazon Web Services (AWS), capable of providing detailed and accurate information about various AWS services. 
Use the available resources and knowledge to answer the question enclosed in <question> tags. 
If the answer to a question is not within your current scope of knowledge, please indicate that you don't know, and do not attempt to speculate or fabricate a response.
<context>
{context_str}
</context>

<question>
{query_str}
</question>

Your response should be precise, detailed, and include any relevant AWS-specific terminology, features, or concepts. Utilize your extensive knowledge base about AWS services to provide the most accurate and current information available.

Assistant:"""
claude_prompt = PromptTemplate(template=PROMPT_TEMPLATE, 
                               input_variables=["context_str","query_str"])

### Engineer the Model to pick up the most relevant and accurate information 


In [49]:
# fetch context from the response
def get_contexts(retrievalResults):
    contexts = []
    for retrievedResult in reversed(retrievalResults): 
        contexts.append(retrievedResult['content']['text'])
    return contexts

#### View all of the relevant chunks re ordered below from most to least relevant based on the embedding scores generated

In [50]:
contexts = get_contexts(retrievalResults)
pp.pprint(contexts)

[ 'AWS Shield provides you with always-on detection and automatic inline '
  'mitigations that minimize application downtime and latency, so there is no '
  'need to engage AWS Support to benefit from DDoS protection. There are two '
  'tiers of AWS Shield: Standard and Advanced.   All AWS customers benefit '
  'from the automatic protections of AWS Shield Standard, at no additional '
  'charge. AWS Shield Standard defends against most common, frequently '
  'occurring network and transport layer DDoS attacks that target your website '
  'or applications. When you use AWS Shield Standard with Amazon CloudFront '
  'and Amazon Route 53 , you receive comprehensive availability protection '
  'against all known infrastructure (Layer 3 and 4) attacks.   For higher '
  'levels of protection against attacks targeting your applications running on '
  'Amazon Elastic Compute Cloud (Amazon EC2), Elastic Load Balancing (ELB), '
  'Amazon CloudFront, and Amazon Route 53 resources, you can subscri

In [52]:
import json
prompt = claude_prompt.format(context_str=contexts, 
                                 query_str=query)

In [53]:
response = llm(prompt)
pp.pprint(response)

(' Based on the context provided, here are the key points about Amazon '
 'Shield:\n'
 '\n'
 '- AWS Shield comes in two tiers - Standard and Advanced. \n'
 '\n'
 '- AWS Shield Standard provides automatic protection against common, '
 'frequently occurring network and transport layer DDoS attacks for AWS '
 'resources like CloudFront, Route 53 and load balancers at no additional '
 'charge.\n'
 '\n'
 '- AWS Shield Advanced provides additional protection against larger and more '
 'sophisticated attacks on Amazon EC2, ELB, CloudFront and Route 53. It also '
 'provides 24/7 access to AWS DDoS Response Team (DRT) and protection against '
 'spikes in charges due to DDoS attacks.\n'
 '\n'
 '- AWS Shield Advanced is available globally on all CloudFront and Route 53 '
 'edge locations. You can protect web applications hosted anywhere by '
 'deploying CloudFront in front.\n'
 '\n'
 '- AWS Shield Advanced can be enabled directly on Elastic IP or ELB in '
 'certain regions like Virginia, Ohio, Or

## Evaluation Pipeline: Utilizing LLaMaIndex for end-end evaluations on Faithfulness, Correctness, Guidelines given, and Relevancy of answers generated by the model.

- Faithfulness - to measure if the response from the model matches any source nodes. This is useful for measuring if the response was hallucinated.
- Relevancy - to measure if the response + source nodes match the query.This is useful for measuring if the query was actually answered by the response.
- Correctness - to evaluate the relevance and correctness of a generated answer against a reference answer.
- Guidelines - to evaluate a question answer system given user specified guidelines for example, if the response generated is complete, not toxic, or biased or uses facts in the context.

### 1. Faithfulness Evaluation of Prompt Completions: Using LLaMa Index


In [54]:
from llama_index.evaluation import FaithfulnessEvaluator

faithfulness_evaluator = FaithfulnessEvaluator(service_context=service_context)
faith_eval = faithfulness_evaluator.evaluate(query=query,
                                              response=response, 
                                              contexts=contexts)
print(f"Faithful response?: {str(faith_eval.passing)}"  )
pp.pprint(f"Reason: {faith_eval.feedback} ")

Faithful response?: True
('Reason:  YES\n'
 '\n'
 'The context provides details that support all the key points listed about '
 'Amazon Shield in the information. ')


### Above, we can see how the response from the retrieveAPI is faithful when we ask about amazon shield from the kB

### 2. Relevancy Evaluation of Prompt Completions: Using LLaMa Index

In [55]:
from llama_index.evaluation import RelevancyEvaluator

relevancy_evaluator = RelevancyEvaluator(service_context=service_context)
relevant_eval = relevancy_evaluator.evaluate(query=query,
                                              response=response, 
                                              contexts=contexts)
print(f'Relevant response?: {str(relevant_eval.passing)}')
pp.pprint(f"Reason: {relevant_eval.feedback} ")

Relevant response?: True
('Reason:  Yes, the response is in line with the context information provided. '
 'The key points summarized in the response align with the details provided in '
 "the context about AWS Shield's tiers, protections, integration, "
 'availability, and pricing model. The response covers the main aspects of AWS '
 'Shield mentioned in the context in a concise and accurate manner. ')


### Here, we can see whatever we get from the KB is relevant and matches the query ingested by the Kb

## Now, let's test the same for a bunch of questions below for correction checks

In [56]:
eval_question_answer_pair = [
    ("By what percentage did AWS revenue grow year-over-year in 2022?",
     "AWS had a 29% year-over-year ('YoY') revenue growth in 2022 on a $62B revenue base."),

    ("Approximately how many new features and services did AWS launch in 2022?",
     "AWS launched over 3,300 new features and services in 2022."),

    ("Compared to Graviton2 processors, what performance improvement did Graviton3 chips deliver?",
     "In 2022, AWS delivered their Graviton3 chips, providing 25% better performance than the Graviton2 processors."),

    ("Which was the first inference chip launched by AWS?",
     "AWS launched their first inference chips ('Inferentia') in 2019, and they have saved companies like Amazon over a hundred million dollars in capital expense."),

    ("What kind of throughput and latency improvements does the new Inferentia2 chip offer compared to the original Inferentia chip?",
     "Inferentia2 chip, launched by AWS, offers up to four times higher throughput and ten times lower latency than the first Inferentia processor."),

    ("What are some of the key benefits of AWS's Inferentia and Inferentia2 chips?",
     "AWS's Inferentia and Inferentia2 chips are known for their high throughput and low latency, significantly reducing capital expenses for companies using them."),

    ("How has the introduction of Graviton3 chips impacted AWS's computing capabilities?",
     "The introduction of Graviton3 chips has significantly enhanced AWS's computing capabilities, offering a 25% performance improvement over the previous generation Graviton2 processors."),

    ("Can you describe the growth of AWS in terms of new service launches in 2022?",
     "AWS saw considerable growth in 2022, marked by the launch of over 3,300 new features and services.")
]


### 3. Correctness Evaluation of Prompt Completions: Using LLaMa Index


In [57]:
from typing import Tuple, List
import pandas as pd
from llama_index.evaluation import (
    FaithfulnessEvaluator,
    RelevancyEvaluator,
    CorrectnessEvaluator,
)

faithfulness_evaluator = FaithfulnessEvaluator(service_context=service_context)
relevancy_evaluator = RelevancyEvaluator(service_context=service_context)
correctness_evaluator = CorrectnessEvaluator(service_context=service_context)

def run_evals(qa_pairs: List[Tuple[str, str]], topK):
    results_list = []
    for question, reference_answer in qa_pairs:
        # retrieve matching documents
        result = retrieve(question, kb_id, topK)
        retrievalResults = result['retrievalResults']
        contexts = get_contexts(retrievalResults=retrievalResults)
        prompt = claude_prompt.format(context_str=contexts, 
                                 query_str=question)
        response = llm(prompt)
        generated_answer = str(response)
        faithfulness_results = faithfulness_evaluator.evaluate(
            query=question,
            response=generated_answer,
            contexts=contexts
            )
        relevancy_results = relevancy_evaluator.evaluate(
            query=question,
            response=generated_answer,
            contexts=contexts
            )
        cur_result_dict = {
            "query": question,
            "generated_answer": generated_answer,
            "faithfulness": faithfulness_results.passing,
            "faithfulness_feedback": faithfulness_results.feedback,
            "faithfulness_score": faithfulness_results.score,
            "relevancy": relevancy_results.passing,
            "relevancy_feedback": relevancy_results.feedback,
            "relevancy_score": relevancy_results.score
        }
        results_list.append(cur_result_dict)
    evals_df = pd.DataFrame(results_list)
    return evals_df

In [58]:
evaluation_results = run_evals(eval_question_answer_pair, 3)

### Visualize evaluations 


In [59]:
evaluation_results

Unnamed: 0,query,generated_answer,faithfulness,faithfulness_feedback,faithfulness_score,relevancy,relevancy_feedback,relevancy_score
0,By what percentage did AWS revenue grow year-o...,Unfortunately I do not have enough informatio...,False,NO\n\nThe context does not provide any inform...,0.0,False,"No, the response is not in line with the cont...",0.0
1,Approximately how many new features and servic...,"Based on the context provided, I do not have ...",False,NO,0.0,True,"Based on the context provided, I do not have ...",1.0
2,"Compared to Graviton2 processors, what perform...",Unfortunately I do not have enough context to...,False,NO\n\nThe provided context does not contain a...,0.0,False,NO\n\nThe response indicates that there is no...,0.0
3,Which was the first inference chip launched by...,"Based on the provided context, the first infe...",True,"YES\n\nThe context mentions that ""AWS Inferen...",1.0,True,YES\n\nThe response indicates that AWS Infere...,1.0
4,What kind of throughput and latency improvemen...,"Based on the information provided, I do not h...",False,NO\n\nThe context does not provide enough det...,0.0,False,NO\n\nThe response states that there is insuf...,0.0
5,What are some of the key benefits of AWS's Inf...,"Based on the provided context, some key benef...",True,YES\n\nThe context mentions several key benef...,1.0,True,"Yes, the response is in line with the context...",1.0
6,How has the introduction of Graviton3 chips im...,Unfortunately I do not have enough context to...,False,NO\n\nThe provided context does not mention a...,0.0,False,NO\n\nThe response clearly states that there ...,0.0
7,Can you describe the growth of AWS in terms of...,Unfortunately I do not have enough current in...,False,NO\n\nThe context does not contain any inform...,0.0,True,Yes\n\nThe response indicates that the assist...,1.0


### Overall score for faithfulness and relevancy


In [60]:
print(f'Faithfulness score: {evaluation_results.faithfulness.mean()} \nRelevancey score: {evaluation_results.relevancy.mean()}')


Faithfulness score: 0.25 
Relevancey score: 0.5


## Now, LLaMa Index Evaluations with Claude-Instant Prompt Completions

In [83]:
## Setting the embeddings model 

embed_model = LangchainEmbedding(
    BedrockEmbeddings(model_id="amazon.titan-embed-text-v1")
)

## Setting the claude instant model as our llm

instant_llm = Bedrock(model_id="anthropic.claude-instant-v1",
              model_kwargs=model_kwargs_claude,
              client = bedrock_client,)

service_context_new = ServiceContext.from_defaults(llm=instant_llm,
                                               embed_model=embed_model)
set_global_service_context(service_context_new)

In [84]:
response = instant_llm(prompt)
pp.pprint(response)

(' Amazon Shield is a managed Distributed Denial of Service (DDoS) protection '
 'service that safeguards web applications running on AWS. It provides '
 'always-on detection and automatic inline mitigations against volumetric and '
 'complex DDoS attacks. There are two tiers of Amazon Shield:\n'
 '\n'
 '- Amazon Shield Standard provides baseline protection against common network '
 'and transport layer (Layer 3 and 4) DDoS attacks. It is available at no '
 'additional charge to all AWS customers. \n'
 '\n'
 '- Amazon Shield Advanced provides additional protections including detection '
 'and mitigation of larger and more complex DDoS attacks, near real-time '
 'visibility into attacks, and integration with AWS Web Application Firewall '
 '(WAF). It also provides 24x7 access to the AWS DDoS Response Team (DRT) for '
 'response to severe attacks. \n'
 '\n'
 'Amazon Shield can protect applications running on Amazon CloudFront, Amazon '
 'Route 53, Elastic Load Balancing (ELB), and Amazon

### Faithfulness Evaluation - Claude Instant responses on KB

In [85]:
from llama_index.evaluation import FaithfulnessEvaluator

faithfulness_evaluator = FaithfulnessEvaluator(service_context=service_context_new)
faith_eval = faithfulness_evaluator.evaluate(query=query,
                                              response=response, 
                                              contexts=contexts)
print(f"Faithful response?: {str(faith_eval.passing)}"  )
pp.pprint(f"Reason: {faith_eval.feedback} ")

Faithful response?: False
('Reason:  NO\n'
 '\n'
 'The context does not support the information that Amazon Shield tastes bad. '
 'The context discusses Amazon Shield as an AWS service for DDoS protection, '
 'but does not mention anything about its taste. ')


### Relevancy Evaluation - Claude Instant responses on KB

In [86]:
from llama_index.evaluation import RelevancyEvaluator

relevancy_evaluator = RelevancyEvaluator(service_context=service_context_new)
relevant_eval = relevancy_evaluator.evaluate(query=query,
                                              response=response, 
                                              contexts=contexts)
print(f'Relevant response?: {str(relevant_eval.passing)}')
pp.pprint(f"Reason: {relevant_eval.feedback} ")

Relevant response?: True
('Reason:  Yes, the response is in line with the context information '
 'provided.\n'
 '\n'
 'The context provides an overview of AWS Shield, describing the two tiers '
 '(Standard and Advanced), the types of protection they offer, and what '
 'resources can be protected. \n'
 '\n'
 'The response aligns with this by explaining in more detail what AWS Shield '
 'is, the two tiers, the types of DDoS attacks they protect against, the '
 'additional features of Advanced like visibility and integration with WAF, '
 'and what resources it can protect.\n'
 '\n'
 'So the response elaborates on the context and is consistent with the '
 'information provided there. Therefore, my answer is YES. ')


### For Correctness on Claude Instant:

In [87]:
from typing import Tuple, List
import pandas as pd
from llama_index.evaluation import (
    FaithfulnessEvaluator,
    RelevancyEvaluator,
    CorrectnessEvaluator,
)

faithfulness_evaluator = FaithfulnessEvaluator(service_context=service_context_new)
relevancy_evaluator = RelevancyEvaluator(service_context=service_context_new)
correctness_evaluator = CorrectnessEvaluator(service_context=service_context_new)

def run_evals_on_instant(qa_pairs: List[Tuple[str, str]], topK):
    results_list = []
    for question, reference_answer in qa_pairs:
        # retrieve matching documents
        result = retrieve(question, kb_id, topK)
        retrievalResults = result['retrievalResults']
        contexts = get_contexts(retrievalResults=retrievalResults)
        prompt = claude_prompt.format(context_str=contexts, 
                                 query_str=question)
        response = llm(prompt)
        generated_answer = str(response)
        faithfulness_results = faithfulness_evaluator.evaluate(
            query=question,
            response=generated_answer,
            contexts=contexts
            )
        relevancy_results = relevancy_evaluator.evaluate(
            query=question,
            response=generated_answer,
            contexts=contexts
            )
        cur_result_dict = {
            "query": question,
            "generated_answer": generated_answer,
            "faithfulness": faithfulness_results.passing,
            "faithfulness_feedback": faithfulness_results.feedback,
            "faithfulness_score": faithfulness_results.score,
            "relevancy": relevancy_results.passing,
            "relevancy_feedback": relevancy_results.feedback,
            "relevancy_score": relevancy_results.score
        }
        results_list.append(cur_result_dict)
    evals_df = pd.DataFrame(results_list)
    return evals_df


In [88]:
evaluation_results_on_instant = run_evals_on_instant(eval_question_answer_pair, 3)

### Let's look at Claude Instant's performance on these three metrics:

In [89]:
evaluation_results_on_instant

Unnamed: 0,query,generated_answer,faithfulness,faithfulness_feedback,faithfulness_score,relevancy,relevancy_feedback,relevancy_score
0,By what percentage did AWS revenue grow year-o...,Unfortunately I do not have enough informatio...,False,NO\n\nThe context does not provide enough inf...,0.0,False,"No, the response is not in line with the cont...",0.0
1,Approximately how many new features and servic...,"Based on the context provided, I do not have ...",False,NO,0.0,True,"Based on the context provided, I do not have ...",1.0
2,"Compared to Graviton2 processors, what perform...",Unfortunately I do not have enough context to...,False,NO\n\nThe provided context does not contain a...,0.0,False,NO\n\nThe response indicates that there is no...,0.0
3,Which was the first inference chip launched by...,"Based on the provided context, the first infe...",True,"YES\n\nThe context mentions that ""AWS Inferen...",1.0,True,YES\n\nThe response indicates that AWS Infere...,1.0
4,What kind of throughput and latency improvemen...,"Based on the information provided, I do not h...",False,NO\n\nThe context does not provide enough det...,0.0,False,NO\n\nThe response states that there is insuf...,0.0
5,What are some of the key benefits of AWS's Inf...,"Based on the provided context, some key benef...",True,YES\n\nThe context mentions several key benef...,1.0,True,YES\n\nThe response provides details that are...,1.0
6,How has the introduction of Graviton3 chips im...,Unfortunately I do not have enough context to...,False,NO\n\nThe context does not mention anything a...,0.0,False,NO\n\nThe response clearly states that there ...,0.0
7,Can you describe the growth of AWS in terms of...,Unfortunately I do not have enough current in...,False,NO\n\nThe context does not contain any inform...,0.0,True,Yes\n\nThe response indicates that the assist...,1.0


In [90]:
print(f'Faithfulness score: {evaluation_results_on_instant.faithfulness.mean()} \nRelevancey score: {evaluation_results_on_instant.relevancy.mean()}')

Faithfulness score: 0.25 
Relevancey score: 0.5
