## Building and evaluating a Q&A application using Knowledge Bases for Amazon Bedrock - Retrieve API, LangChain, and LlamaIndex for Prompt Completion Evaluations

### Context

In this notebook, we will dive deep into building a Q&A application using the Retrieve API provided by Knowledge Bases for Amazon Bedrock, along with LangChain and LlamaIndex for evaluating the responses. We will query the knowledge base to get the desired number of document chunks based on similarity search, pass the augmented prompt to Amazon Titan Text Lite, and then evaluate the responses using LlamaIndex evaluation metrics such as faithfulness, correctness, relevancy, and guideline-based expectations.

With knowledge bases, you can securely connect foundation models (FMs) in Amazon Bedrock to your company data for Retrieval Augmented Generation (RAG). Access to additional data helps the model generate more relevant, context-specific, and accurate responses without continuously retraining the FM. All information retrieved from knowledge bases comes with source attribution to improve transparency and minimize hallucinations. For more information on creating a knowledge base using the console, please refer to this [documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base.html).

### Pattern

We can implement the solution using the Retrieval Augmented Generation (RAG) pattern. RAG retrieves data from outside the language model (non-parametric) and augments the prompt by adding the relevant retrieved data as context. Here, we perform RAG on the knowledge base created in the previous notebook or through the console.

### Pre-requisite

Before we can answer questions, the documents must be processed and stored in a knowledge base.

1. Load the documents into the knowledge base by connecting your S3 bucket (data source).
2. Ingestion - the knowledge base splits the documents into smaller chunks (based on the strategy selected), generates embeddings, and stores them in the associated vector store.

![data_ingestion.png](./images/data_ingestion.png)
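
To make the chunking step more concrete, here is a minimal sketch of fixed-size chunking with overlap. It is only an illustration: the knowledge base performs chunking for you during ingestion, and this is not its implementation. The function name `chunk_words` and the parameters `chunk_size` and `overlap` are hypothetical, and the sketch counts words rather than tokens.

```python
def chunk_words(text, chunk_size=300, overlap=60):
    """Split text into overlapping chunks of at most `chunk_size` words.

    Illustrative only: the managed ingestion step has its own strategies,
    and this helper counts whitespace-separated words, not tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


# Tiny example: 10 words, windows of 4 words that share 1 word.
for chunk in chunk_words("a b c d e f g h i j", chunk_size=4, overlap=1):
    print(chunk)
```

The overlap means neighbouring chunks share some text, which can help when an answer spans a chunk boundary, at the cost of storing more embeddings.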


#### Notebook Walkthrough



For our notebook, we will use the `Retrieve API` provided by Knowledge Bases for Amazon Bedrock, which converts user queries into embeddings, searches the knowledge base, and returns the relevant results, giving you more control to build custom workflows on top of the semantic search results. The output of the `Retrieve API` includes the `retrieved text chunks`, the `location type` and `URI` of the source data, as well as the relevance `scores` of the retrievals.
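
As a rough sketch of what that output looks like, the snippet below flattens a response into one row per chunk. The `sample_response` dictionary is hand-written with made-up values, and the helper name `summarize_results` is hypothetical. The field names follow the response shape as we understand it for S3 data sources, so treat the `s3Location` lookup as an assumption for other source types.

```python
def summarize_results(retrieve_response):
    """Flatten a Retrieve response into rows of text, source type, URI, score."""
    rows = []
    for item in retrieve_response.get("retrievalResults", []):
        location = item.get("location", {})
        rows.append({
            "text": item["content"]["text"],
            "location_type": location.get("type"),
            # Assumes an S3 data source; other source types use other keys.
            "uri": location.get("s3Location", {}).get("uri"),
            "score": item.get("score"),
        })
    return rows


# Hand-written stand-in for a real response; all values are made up.
sample_response = {
    "retrievalResults": [
        {
            "content": {"text": "Example chunk about AWS."},
            "location": {
                "type": "S3",
                "s3Location": {"uri": "s3://example-bucket/letter.pdf"},
            },
            "score": 0.62,
        }
    ]
}

for row in summarize_results(sample_response):
    print(row["score"], row["location_type"], row["uri"])
```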


We will then take the retrieved text chunks, augment the original prompt with them, and pass it through the `titan-text-lite` model using prompt engineering patterns based on your use case.

Finally, we will evaluate the generated responses using LlamaIndex on metrics such as faithfulness, correctness, guideline adherence, and relevancy. For evaluation, we will use the `Anthropic Claude v2` model.
### Ask question


![retrieveapi.png](./images/retrieveAPI.png)


#### Evaluation
1. Utilize LlamaIndex for answer evaluation on
    1. Faithfulness
    2. Correctness
    3. Relevancy
    4. Guidelines
    

### USE CASE:

#### Dataset

In this example, you will use several years of Amazon's Letter to Shareholders as a text corpus to perform Q&A on. This data is already ingested into the knowledge base. You will need the `knowledge base id` to run this example.
In your specific use case, you can sync different files for different domain topics and run this notebook in the same manner to evaluate model responses using the Retrieve API from knowledge bases.


### Python 3.10

⚠  For this lab we need to run the notebook based on a Python 3.10 runtime. ⚠

### Setup

To run this notebook, you need to install the dependencies: LangChain, LlamaIndex, and the updated boto3 and botocore wheels.


In [None]:
%pip install --upgrade pip
%pip install boto3==1.33.2 --force-reinstall --quiet
%pip install botocore==1.33.2 --force-reinstall --quiet
%pip install langchain==0.0.342 --force-reinstall --quiet
%pip install llama-index==0.9.3.post1 --force-reinstall --quiet

#### Restart the kernel with the updated packages that are installed through the dependencies above

In [None]:
# restart kernel
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

In [None]:
import nest_asyncio
nest_asyncio.apply()
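
After the kernel restarts, it can be worth confirming that the pinned versions are the ones actually loaded. The following is a small sketch using only the standard library; the helper names `installed_version` and `check_versions` are hypothetical, and the expected versions are copied from the install cell above.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Versions pinned in the install cell above.
EXPECTED = {
    "boto3": "1.33.2",
    "botocore": "1.33.2",
    "langchain": "0.0.342",
    "llama-index": "0.9.3.post1",
}


def check_versions(expected):
    """Map each mismatching package to a (expected, installed) tuple."""
    mismatches = {}
    for name, wanted in expected.items():
        found = installed_version(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


# Prints nothing when every pinned package matches.
for name, (wanted, found) in check_versions(EXPECTED).items():
    print(f"{name}: expected {wanted}, found {found}")
```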

### Follow the steps below to initiate the bedrock client:

1. Import the necessary libraries, along with LangChain for Bedrock model selection and LlamaIndex to store the service context containing the LLM and embedding model instances. We will use this service context later in the notebook for evaluating the responses from our Q&A application.

2. Initialize `amazon.titan-text-lite-v1` as our large language model to perform query completions using the RAG pattern with the given knowledge base, once we have retrieved the matching text chunks through the `retrieve` API.

3. For evaluating the responses with LlamaIndex, we will use the `anthropic.claude-v2` model.

In [None]:
import boto3
import pprint
from botocore.client import Config
from langchain.llms.bedrock import Bedrock
from llama_index import (
    ServiceContext,
    set_global_service_context
)
from langchain.embeddings.bedrock import BedrockEmbeddings
from llama_index.embeddings import LangchainEmbedding

pp = pprint.PrettyPrinter(indent=2)

bedrock_config = Config(connect_timeout=120, read_timeout=120, retries={'max_attempts': 0})
bedrock_client = boto3.client('bedrock-runtime')
bedrock_agent_client = boto3.client("bedrock-agent-runtime",
                              config=bedrock_config)

parameters = {
    "maxTokenCount":2000,
    "stopSequences":[],
    "temperature":0,
    "topP":0.9
    }

model_kwargs_claude = {
    "temperature": 0,
    "top_k": 10,
    "max_tokens_to_sample": 3000
}

embed_model = LangchainEmbedding(
    BedrockEmbeddings(model_id="amazon.titan-embed-text-v1")
)

llm = Bedrock(model_id = "amazon.titan-text-lite-v1",
              model_kwargs=parameters,
              client = bedrock_client,)

llm_claude = Bedrock(model_id = "anthropic.claude-v2",
              model_kwargs=model_kwargs_claude,
              client = bedrock_client,)

service_context = ServiceContext.from_defaults(llm=llm_claude,
                                               embed_model=embed_model)
set_global_service_context(service_context)

### Retrieve API: Process flow 

Define a retrieve function that calls the `Retrieve API` provided by Knowledge Bases for Amazon Bedrock, which converts user queries into embeddings, searches the knowledge base, and returns the relevant results, giving you more control to build custom workflows on top of the semantic search results. The output of the `Retrieve API` includes the `retrieved text chunks`, the `location type` and `URI` of the source data, as well as the relevance `scores` of the retrievals.

In [None]:
def retrieve(query, kbId, numberOfResults=5):
    return bedrock_agent_client.retrieve(
        retrievalQuery= {
            'text': query
        },
        knowledgeBaseId=kbId,
        retrievalConfiguration= {
            'vectorSearchConfiguration': {
                'numberOfResults': numberOfResults
            }
        }
    )

#### Initialize your Knowledge base id before querying responses from the initialized LLM

In [None]:
kb_id = "<knowledge_base_id>" # replace it with the Knowledge base id.

Next, we will call the `Retrieve API`, passing the `knowledge base id`, `number of results`, and `query` as parameters.

`score`: Each returned text chunk has an associated score that indicates how closely the chunk matches the query.

In [None]:
query = "What is Amazon doing in the field of generative AI?"
response = retrieve(query, kb_id, 5)
retrievalResults = response['retrievalResults']
pp.pprint(retrievalResults)
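
Because each result carries a relevance score, one option is to drop weak matches before building the prompt. This is a sketch only: the helper name `filter_by_score` is hypothetical, and the `0.4` threshold is an arbitrary placeholder, since the scale of the scores depends on your vector store and embedding model.

```python
def filter_by_score(retrieval_results, min_score=0.4):
    """Keep results scoring at least `min_score`, ordered best first."""
    kept = [r for r in retrieval_results if r.get("score", 0.0) >= min_score]
    return sorted(kept, key=lambda r: r.get("score", 0.0), reverse=True)


# Made-up results that mimic the shape of `retrievalResults`.
sample_results = [
    {"content": {"text": "weak match"}, "score": 0.31},
    {"content": {"text": "strong match"}, "score": 0.72},
    {"content": {"text": "fair match"}, "score": 0.55},
]

for result in filter_by_score(sample_results, min_score=0.4):
    print(result["score"], result["content"]["text"])
```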

### Prompt specific to the model to personalize responses 

Here, we will use the specific prompt below for the model to act as a financial advisor AI system that will provide answers to questions by using fact based and statistical information when possible. We will provide the `Retrieve API` responses from above as a part of the `{context_str}` in the prompt for the model to refer to, along with the user `query`.  

In [None]:
from langchain.prompts import PromptTemplate

PROMPT_TEMPLATE = """
You are a financial advisor AI system that provides answers to questions by using fact-based and statistical information when possible. 
Use the following pieces of information to provide a concise answer to the question enclosed in <question> tags. 
If you don't know the answer, just say that you don't know, don't try to make up an answer.
<context>
{context_str}
</context>

<question>
{query_str}
</question>

The response should be specific and use statistics or numbers when possible.
"""
titan_prompt = PromptTemplate(template=PROMPT_TEMPLATE, 
                               input_variables=["context_str","query_str"])

### Fetch the text chunks from the RetrieveAPI response

In [None]:
# fetch context from the response
def get_contexts(retrievalResults):
    contexts = []
    for retrievedResult in retrievalResults: 
        contexts.append(retrievedResult['content']['text'])
    return contexts

In [None]:
contexts = get_contexts(retrievalResults)
pp.pprint(contexts)

### Initiate the user prompt and response via the LLM

Here, we format our prompt using the context returned by the Retrieve API for our knowledge base, together with the user query, to get the final response. We will then evaluate the generated answers using LlamaIndex.

In [None]:
import json
prompt = titan_prompt.format(context_str=contexts, 
                                 query_str=query)
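
Note that `contexts` is a Python list, so formatting it directly into the template inserts the list's string representation (brackets, quotes, and escaped newlines) into the prompt. A possible alternative is to join the chunks into plain text first. The sketch below uses plain `str.format` and a shortened stand-in template so it runs without LangChain; `format_contexts` and `TEMPLATE` are hypothetical names.

```python
def format_contexts(contexts, separator="\n\n"):
    """Number each chunk and join them into one block of plain text."""
    return separator.join(
        f"[{i}] {text.strip()}" for i, text in enumerate(contexts, start=1)
    )


# Shortened stand-in for the notebook's prompt template.
TEMPLATE = (
    "<context>\n{context_str}\n</context>\n\n"
    "<question>\n{query_str}\n</question>"
)

sample_contexts = ["  First chunk.  ", "Second chunk."]
prompt_text = TEMPLATE.format(
    context_str=format_contexts(sample_contexts),
    query_str="What is in the chunks?",
)
print(prompt_text)
```

In the notebook, the same idea would be `titan_prompt.format(context_str=format_contexts(contexts), query_str=query)`, while the evaluators can still receive the original `contexts` list.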

In [None]:
response = llm(prompt)
pp.pprint(response)

## Evaluation Pipeline: Utilizing LlamaIndex for end-to-end evaluations on Faithfulness, Correctness, Guidelines, and Relevancy of answers generated by the model

- Faithfulness - measures whether the response from the model matches any source nodes. This is useful for detecting whether the response was hallucinated.
- Relevancy - measures whether the response and source nodes match the query. This is useful for checking whether the query was actually answered by the response.
- Correctness - evaluates the relevance and correctness of a generated answer against a reference answer.
- Guidelines - evaluates a question answering system against user-specified guidelines, for example, whether the response is complete, free of toxic or biased language, and grounded in facts from the context.

### 1. Faithfulness Evaluation of Prompt Completions: Using LlamaIndex

This is useful for measuring whether the response was hallucinated. We compare the model's response against the context that was supplied to answer the query and check how well the two agree, in other words, how faithful the prompt completion is to the chunks retrieved for the query.

![faithfullness.png](./images/rag-eval-flow-faithfulness.png)

Here, we check whether the generated response is supported by the corresponding context.


In [None]:
from llama_index.evaluation import FaithfulnessEvaluator

faithfulness_evaluator = FaithfulnessEvaluator(service_context=service_context)
faith_eval = faithfulness_evaluator.evaluate(query=query,
                                              response=response, 
                                              contexts=contexts)
print(f"Faithful response?: {str(faith_eval.passing)}"  )
pp.pprint(f"Reason: {faith_eval.feedback} ")

### 2. Relevancy Evaluation of Prompt Completions: Using LlamaIndex

In this section, we will focus on using the `RelevancyEvaluator` module to measure whether the response and source nodes match the query. This is useful for checking whether the query was actually answered by the response.

![relevancy](./images/rag-eval-flow-relevancy.png)

Here, we use the query, the response generated for that query, and the retrieved context to evaluate whether the response answers the question asked by the user.

In [None]:
from llama_index.evaluation import RelevancyEvaluator

relevancy_evaluator = RelevancyEvaluator(service_context=service_context)
relevant_eval = relevancy_evaluator.evaluate(query=query,
                                              response=response, 
                                              contexts=contexts)
print(f'Relevant response?: {str(relevant_eval.passing)}')
pp.pprint(f"Reason: {relevant_eval.feedback} ")

### 3. Correctness Evaluation of Prompt Completions: Using LlamaIndex

#### The CorrectnessEvaluator is used to evaluate the relevance and correctness of a generated answer against a reference answer.

Here, we provide a reference answer along with the query in order to check the correctness of the response. This process can be viewed from a 'Ground Truth' perspective and suits use cases where correctness is the highest priority.

![correctness](./images/rag-eval-flow-correctness.png)

In this section, we will use a batch of questions and run all three evaluators discussed above (Faithfulness, Relevancy, and Correctness) to evaluate the performance of our RAG application. The results are displayed as a pandas report table:

In [None]:
eval_question_answer_pair = [("How many days has Amazon asked employees to come to work in office?",
                          "Amazon has asked corporate employees to come back to office at least three days a week beginning May 2022."),
                         ("By what percentage did AWS revenue grow year-over-year in 2022?",
                          "AWS had 29% year-over-year ('YoY') revenue growth in 2022 on a $62B revenue base."),
                         ("Compared to Graviton2 processors, what performance improvement did Graviton3 chips deliver according to the passage?",
                          "In 2022, AWS delivered their Graviton3 chips, providing 25% better performance than the Graviton2 processors."),
                         ("Which was the first inference chip launched by AWS according to the passage?",
                          "AWS launched their first inference chips (“Inferentia”) in 2019, and they have saved companies like Amazon over a hundred million dollars in capital expense."),
                         ("According to the context, in what year did Amazon's annual revenue increase from $245B to $434B?",
                          "Amazon's annual revenue increased from $245B in 2019 to $434B in 2022."
                          )
                          ]

In [None]:
from typing import Tuple, List
import pandas as pd
from llama_index.evaluation import (
    FaithfulnessEvaluator,
    RelevancyEvaluator,
    CorrectnessEvaluator,
)

faithfulness_evaluator = FaithfulnessEvaluator(service_context=service_context)
relevancy_evaluator = RelevancyEvaluator(service_context=service_context)
correctness_evaluator = CorrectnessEvaluator(service_context=service_context)

def run_evals(qa_pairs: List[Tuple[str, str]], topK):
    results_list = []
    for question, reference_answer in qa_pairs:
        # retrieve matching documents
        result = retrieve(question, kb_id, topK)
        retrievalResults = result['retrievalResults']
        contexts = get_contexts(retrievalResults=retrievalResults)
        #call LLM with updated context and question.
        prompt = titan_prompt.format(context_str=contexts, 
                                 query_str=question)
        response = llm(prompt)
        generated_answer = str(response)
        # evaluate results.
        correctness_results = correctness_evaluator.evaluate(
            query=question,
            response=generated_answer,
            reference=reference_answer
        )
        faithfulness_results = faithfulness_evaluator.evaluate(
            query=question,
            response=generated_answer,
            contexts=contexts
            )
        relevancy_results = relevancy_evaluator.evaluate(
            query=question,
            response=generated_answer,
            contexts=contexts
            )
        cur_result_dict = {
            "query": question,
            "generated_answer": generated_answer,
            "correctness": correctness_results.passing,
            "correctness_feedback": correctness_results.feedback,
            "correctness_score": correctness_results.score,
            "faithfulness": faithfulness_results.passing,
            "faithfulness_feedback": faithfulness_results.feedback,
            "faithfulness_score": faithfulness_results.score,
            "relevancy": relevancy_results.passing,
            "relevancy_feedback": relevancy_results.feedback,
            "relevancy_score": relevancy_results.score
        }
        results_list.append(cur_result_dict)
    evals_df = pd.DataFrame(results_list)
    return evals_df

In [None]:
# please note the execution of this cell might take 3-5mins. 
evaluation_results = run_evals(eval_question_answer_pair, 3)

### Visualize evaluations - all the questions, generated responses, and their corresponding evaluation metrics


In [None]:
evaluation_results

### Overall score for all 3 metrics


In [None]:
print(f'Correctness score: {evaluation_results.correctness.mean()} \nFaithfulness score: {evaluation_results.faithfulness.mean()} \nRelevancy score: {evaluation_results.relevancy.mean()}')
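
The `correctness`, `faithfulness`, and `relevancy` columns hold the evaluators' pass/fail flags, so the means printed above are the fraction of questions that passed each check rather than averages of the `*_score` columns. The sketch below spells out that calculation in plain Python on made-up rows; `pass_rates` is a hypothetical helper.

```python
def pass_rates(rows, metrics=("correctness", "faithfulness", "relevancy")):
    """Return the fraction of rows whose flag is truthy, per metric.

    Rows where a flag is missing or None are skipped for that metric.
    """
    rates = {}
    for metric in metrics:
        flags = [bool(row[metric]) for row in rows if row.get(metric) is not None]
        rates[metric] = sum(flags) / len(flags) if flags else None
    return rates


# Made-up evaluation rows for illustration.
sample_rows = [
    {"correctness": True, "faithfulness": True, "relevancy": False},
    {"correctness": False, "faithfulness": True, "relevancy": None},
]

for metric, rate in pass_rates(sample_rows).items():
    print(f"{metric}: {rate}")
```

With the notebook's DataFrame, the rows could be obtained with `evaluation_results.to_dict("records")`.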


> Note: The scores above give a relative idea of the performance of your RAG application; use them with caution and not as standalone scores. Also note that we have used only 5 question/answer pairs for evaluation. As a best practice, you should use enough data to cover different aspects of your documents when evaluating the model.

Based on the scores, you can review other components of your RAG workflow to further optimize the results. A few recommended options are to review your chunking strategy, refine your prompt instructions, and increase `numberOfResults` to provide additional context.

### 4. Guideline Evaluation of Prompt Completions: Using LlamaIndex

#### This section will focus on using the GuidelineEvaluator module to evaluate a question answer system given user specified guidelines.

In the code below, we define a set of guidelines to check when evaluating a response to our query. Once these guidelines are set, we can see which ones pass and which fail to make sure the response follows them.

This is useful when a response must satisfy several requirements at once, giving a more holistic evaluation: the response should not only be faithful, relevant, and correct, but should also follow the expected guidelines.

![guidelines.png](./images/rag-eval-flow-guidelines.png)

In [None]:
from llama_index.evaluation import GuidelineEvaluator

GUIDELINES = [
    "The response should fully answer the query.",
    "The response should avoid being vague or ambiguous.",
    "The response should not use toxic or profane language.",
    "The response should not be biased or discriminatory.",
    (
        "The response should be specific and use statistics or numbers when"
        " possible."
    ),
]

evaluators = [
    GuidelineEvaluator(service_context=service_context, guidelines=guideline)
    for guideline in GUIDELINES
]

In [None]:
query = "What is Amazon's generative AI strategy?"
result = retrieve(query, kb_id, 3)
retrievalResults = result['retrievalResults']
contexts = get_contexts(retrievalResults=retrievalResults)
#call LLM with updated context and question.
prompt = titan_prompt.format(context_str=contexts, 
                            query_str=query)
response = llm(prompt)
generated_answer = str(response)
pp.pprint(generated_answer)

#### Shows how specific guidelines are satisfied vs. not satisfied based on the response generated from the model


In [None]:
for guideline, evaluator in zip(GUIDELINES, evaluators):
    eval_result = evaluator.evaluate(
        query=query,
        response=generated_answer,
    )
    print("=====")
    print(f"Guideline: {guideline}")
    print(f"Pass: {eval_result.passing}")
    print(f"Feedback: {eval_result.feedback}")
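
To turn the per-guideline output into a quick summary, the results can be collected into rows and the failing guidelines listed. The sketch below works on any objects that expose `passing` and `feedback` attributes, as the evaluation results printed above do. The stand-in objects and their feedback text are made up, and `summarize_guidelines` is a hypothetical helper.

```python
from types import SimpleNamespace


def summarize_guidelines(guidelines, results):
    """Pair each guideline with its pass flag and list the ones that failed."""
    rows = [
        {"guideline": g, "passed": bool(r.passing), "feedback": r.feedback}
        for g, r in zip(guidelines, results)
    ]
    failed = [row["guideline"] for row in rows if not row["passed"]]
    return rows, failed


# Stand-in results with made-up values, used instead of real evaluator output.
sample_guidelines = ["Fully answer the query.", "Use numbers when possible."]
sample_results = [
    SimpleNamespace(passing=True, feedback="Complete answer."),
    SimpleNamespace(passing=False, feedback="No figures cited."),
]

rows, failed = summarize_guidelines(sample_guidelines, sample_results)
print(f"{len(rows) - len(failed)}/{len(rows)} guidelines passed")
for guideline in failed:
    print(f"Failed: {guideline}")
```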