# Building and Evaluating a Q&A Application with Knowledge Bases for Amazon Bedrock and Arize Phoenix Evals

### Context

In this notebook, we build a Q&A application using the Retrieve API provided by Knowledge Bases for Amazon Bedrock, together with LangChain, and use Arize Phoenix to evaluate the responses. We query the knowledge base to get the desired number of document chunks based on similarity search, pass the query and the retrieved chunks to Anthropic Claude, and then evaluate the responses using Arize Phoenix evaluation metrics such as Q&A Correctness and Context Relevance. Arize Phoenix provides tools and evaluation prompts to compute built-in or custom metrics that can be integrated with Arize Phoenix's tracing and observability platform.

### Knowledge Bases for Amazon Bedrock Introduction

With knowledge bases, you can securely connect foundation models (FMs) in Amazon Bedrock to your company
data for Retrieval Augmented Generation (RAG). Access to additional data helps the model generate more relevant,
context-specific, and accurate responses without continuously retraining the FM. All information retrieved from
knowledge bases comes with source attribution to improve transparency and minimize hallucinations. For more information on creating a knowledge base using the console, please refer to the [documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base.html).

### Pattern

We can implement the solution using the Retrieval Augmented Generation (RAG) pattern. RAG retrieves data from outside the language model (non-parametric) and augments the prompt by adding the relevant retrieved data as context. Here, we perform RAG on the knowledge base created in the previous notebook or through the console.
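To make the "augment the prompt" step concrete, here is a minimal sketch that places retrieved chunks ahead of the user question. The function name `build_augmented_prompt` and the tag layout are illustrative assumptions, not the exact prompt used later in this notebook.

```python
def build_augmented_prompt(question: str, chunks: list[str]) -> str:
    """Combine retrieved chunks and the user question into one prompt string."""
    # Number each chunk so individual passages stay distinguishable.
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Use only the context below to answer the question.\n"
        f"<context>\n{context}\n</context>\n"
        f"<question>\n{question}\n</question>"
    )


# Hypothetical chunks standing in for knowledge base results.
example = build_augmented_prompt(
    "What is RAG?",
    ["RAG retrieves external data.", "Retrieved data is added to the prompt."],
)
print(example)
```

The model never sees the knowledge base directly; it only sees whatever text this step places in the prompt.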

### Pre-requisite

Before being able to answer the questions, the documents must be processed and stored in Knowledge Bases for Amazon Bedrock.

1. Load the documents into the knowledge base by connecting your S3 bucket (data source).
2. Ingestion - The knowledge base splits the documents into smaller chunks (based on the strategy selected), generates embeddings, and stores them in the associated vector store.
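As a rough illustration of the splitting step, the sketch below does fixed-size chunking with overlap on characters. This is a toy under stated assumptions: the managed service handles chunking for you and its actual strategies may differ (for example, by working on tokens), and `chunk_text` is a hypothetical helper, not a Bedrock API.

```python
def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap`."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start_bound = max(len(text) - overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start_bound, step)]


chunks = chunk_text("abcdefghij", chunk_size=4, overlap=1)
print(chunks)
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.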

![data_ingestion.png](./images/data_ingestion.png)


#### Notebook Walkthrough



For our notebook we will use the `Retrieve API` provided by Knowledge Bases for Amazon Bedrock, which converts user queries into
embeddings, searches the knowledge base, and returns the relevant results, giving you more control to build custom
workflows on top of the semantic search results. The output of the `Retrieve API` includes the `retrieved text chunks`, the `location type` and `URI` of the source data, as well as the relevance `scores` of the retrievals.
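For reference, a direct call to the Retrieve API through boto3 takes a knowledge base ID, a query, and a retrieval configuration. The sketch below only builds the request arguments so it can run without AWS credentials; `build_retrieve_request` and the placeholder ID are illustrative assumptions.

```python
def build_retrieve_request(knowledge_base_id: str, query: str, number_of_results: int = 4) -> dict:
    """Build keyword arguments for the bedrock-agent-runtime `retrieve` call."""
    return {
        "knowledgeBaseId": knowledge_base_id,
        "retrievalQuery": {"text": query},
        "retrievalConfiguration": {
            "vectorSearchConfiguration": {"numberOfResults": number_of_results}
        },
    }


# "KB_ID_PLACEHOLDER" is a stand-in; use your own knowledge base ID.
request = build_retrieve_request(
    "KB_ID_PLACEHOLDER",
    "By what percentage did AWS revenue grow year-over-year in 2022?",
)
print(request)

# With AWS credentials configured, the request would be sent like this:
# import boto3
# response = boto3.client("bedrock-agent-runtime").retrieve(**request)
```

Later in this notebook, the LangChain retriever issues this call on our behalf.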


We will then augment the original prompt with the retrieved text chunks and pass it to the `anthropic.claude-3-haiku` model, using prompt engineering patterns suited to your use case.

Finally, we will evaluate the generated responses with Arize Phoenix using metrics such as Q&A Correctness and Context Relevance. For evaluation, we will use `anthropic.claude-3-sonnet`.

### Ask question


![retrieveapi.png](./images/retrieveAPI.png)


#### Evaluation
1. Use Arize Phoenix to evaluate:
    1. **Q&A Correctness:** This evaluates whether the system answered a question correctly based on the retrieved data. Unlike retrieval evals, which check the returned chunks of data, this is a system-level check of the final answer.
    2. **Context Relevance:** This evaluates whether a retrieved chunk contains an answer to the query. It's extremely useful for evaluating retrieval systems.
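Both metrics rely on an LLM acting as a judge and returning a categorical label. A common step in this kind of evaluation is mapping the judge's free-text output onto a fixed set of allowed labels. The sketch below is a simplified illustration of that idea, not Phoenix's implementation; `snap_to_label` and the label names `correct`/`incorrect` are assumptions.

```python
import re
from typing import Optional, Sequence


def snap_to_label(raw_output: str, allowed_labels: Sequence[str]) -> Optional[str]:
    """Return the single allowed label found in the judge output, else None."""
    text = raw_output.lower()
    # Word boundaries keep "correct" from matching inside "incorrect".
    matches = [
        label
        for label in allowed_labels
        if re.search(rf"\b{re.escape(label.lower())}\b", text)
    ]
    # Ambiguous or missing labels are treated as unparseable.
    return matches[0] if len(matches) == 1 else None


QA_LABELS = ("correct", "incorrect")  # assumed label names
print(snap_to_label("The answer is Correct.", QA_LABELS))
print(snap_to_label("I am not sure.", QA_LABELS))
```

Treating ambiguous output as unparseable, instead of guessing, keeps noisy judgments from silently inflating a score.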

### USE CASE:

#### Dataset

In this example, you will use several years of Amazon's Letters to Shareholders as the text corpus for Q&A. This data has already been ingested into the knowledge base. You will need the `knowledge base id` to run this example.
For your own use case, you can sync files covering other domain topics and run this notebook in the same manner to evaluate model responses using the Retrieve API from knowledge bases.


### Python 3.11

⚠  For this lab we need to run the notebook based on a Python 3.11 runtime. ⚠

### Setup

To run this notebook, you need to install the dependencies: langchain, arize-phoenix, and updated versions of boto3 and botocore.


In [None]:
%pip install --upgrade pip
%pip install boto3==1.34.85 --force-reinstall --quiet
%pip install botocore==1.34.85 --force-reinstall --quiet
%pip install langchain==0.1.16 --force-reinstall --quiet
%pip install 'arize-phoenix[evals]'==3.22.0 --force-reinstall --quiet
%pip install tiktoken==0.6.0 --force-reinstall --quiet
%pip install nest-asyncio==1.6.0 --force-reinstall --quiet

#### Restart the kernel with the updated packages that are installed through the dependencies above

In [None]:
# restart kernel
from IPython.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")
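After the restart, it can be worth confirming that the pinned versions are the ones actually loaded. The check below is a small sketch using only the standard library; the `EXPECTED` mapping mirrors the pins in the install cell above, and `installed_versions` is a hypothetical helper.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install cell above.
EXPECTED = {
    "boto3": "1.34.85",
    "botocore": "1.34.85",
    "langchain": "0.1.16",
    "arize-phoenix": "3.22.0",
    "tiktoken": "0.6.0",
    "nest-asyncio": "1.6.0",
}


def installed_versions(packages) -> dict:
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, got in installed_versions(EXPECTED).items():
    status = "ok" if got == EXPECTED[name] else "check"
    print(f"{name}: expected {EXPECTED[name]}, found {got} [{status}]")
```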

### Follow the steps below to set up necessary packages

1. Import the necessary libraries for creating the `bedrock-runtime` client for invoking foundation models and the `bedrock-agent-runtime` client for using the Retrieve API provided by Knowledge Bases for Amazon Bedrock.
2. Import LangChain to:
   1. Initialize the Bedrock model `anthropic.claude-3-haiku` as our large language model to perform query completions using the RAG pattern.
   2. Initialize the LangChain retriever integrated with knowledge bases.
   3. Later in the notebook, combine the LLM and the retriever into a RAG chain for building our Q&A application.

In [None]:
import boto3
import pprint
from botocore.client import Config
from langchain_community.chat_models import BedrockChat
from langchain.embeddings import BedrockEmbeddings
from langchain.retrievers.bedrock import AmazonKnowledgeBasesRetriever

pp = pprint.PrettyPrinter(indent=2)

kb_id = "" # replace it with your Knowledge base id.


bedrock_config = Config(connect_timeout=120, read_timeout=120, retries={'max_attempts': 0})
bedrock_client = boto3.client('bedrock-runtime')
bedrock_agent_client = boto3.client("bedrock-agent-runtime",
                              config=bedrock_config
                              )

model_kwargs_claude = {
    "temperature": 0,
    "top_k": 10,
    "max_tokens": 3000
}

llm_for_text_generation = BedrockChat(model_id="anthropic.claude-3-haiku-20240307-v1:0",
              model_kwargs=model_kwargs_claude,
              streaming=True,
              client = bedrock_client,)

bedrock_embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1",client=bedrock_client)

### Retrieve API: Process flow 

Create an `AmazonKnowledgeBasesRetriever` object from LangChain, which calls the `Retrieve API` provided by Knowledge Bases for Amazon Bedrock. The API converts user queries into
embeddings, searches the knowledge base, and returns the relevant results, giving you more control to build custom
workflows on top of the semantic search results. The output of the `Retrieve API` includes the `retrieved text chunks`, the `location type` and `URI` of the source data, as well as the relevance `scores` of the retrievals.

In [None]:

retriever = AmazonKnowledgeBasesRetriever(
        knowledge_base_id=kb_id,
        retrieval_config={"vectorSearchConfiguration": {"numberOfResults": 4}},
        # endpoint_url=endpoint_url,
        # region_name="us-east-1",
        # credentials_profile_name="<profile_name>",
    )

`score`: Each returned text chunk comes with a score that indicates how closely the chunk matches the query.
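To inspect scores and sources side by side, the sketch below flattens a Retrieve API style response into rows sorted by score. The `sample_response` values (scores, URIs, text) are made up for illustration, and `summarize_results` is a hypothetical helper; only the field names follow the response shape described above.

```python
def summarize_results(response: dict, preview_chars: int = 80) -> list[dict]:
    """Flatten retrieval results into rows sorted by descending score."""
    rows = []
    for result in response.get("retrievalResults", []):
        location = result.get("location", {})
        rows.append(
            {
                "score": result.get("score"),
                "location_type": location.get("type"),
                "uri": location.get("s3Location", {}).get("uri"),
                "preview": result.get("content", {}).get("text", "")[:preview_chars],
            }
        )
    # Missing scores sort last.
    return sorted(rows, key=lambda row: row["score"] or 0.0, reverse=True)


# Fabricated example data, for illustration only.
sample_response = {
    "retrievalResults": [
        {
            "content": {"text": "Example chunk about revenue."},
            "location": {"type": "S3", "s3Location": {"uri": "s3://example-bucket/letter-a.pdf"}},
            "score": 0.42,
        },
        {
            "content": {"text": "Example chunk about chips."},
            "location": {"type": "S3", "s3Location": {"uri": "s3://example-bucket/letter-b.pdf"}},
            "score": 0.71,
        },
    ]
}

for row in summarize_results(sample_response):
    print(row)
```

With the LangChain retriever, the same information should be available on each returned document under `doc.metadata["score"]` and `doc.metadata["location"]`.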

### Prompt specific to the model to personalize responses 

Here, we use the prompt below to have the model act as a financial advisor AI system that answers questions using fact-based and statistical information when possible. We provide the `Retrieve API` responses from above as the `{context}` in the prompt for the model to refer to, along with the user's `{question}`.

In [None]:
from langchain.prompts import ChatPromptTemplate
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser

PROMPT_TEMPLATE = """
    Human: You are a financial advisor AI system, and provides answers to questions by using fact based and statistical information when possible. 
    Use the following pieces of information to provide a concise answer to the question enclosed in <question> tags. 
    If you don't know the answer, just say that you don't know, don't try to make up an answer.
    <context>
    {context}
    </context>

    <question>
    {question}
    </question>

    The response should be specific and use statistics or numbers when possible.

    Assistant:"""

prompt = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)

# Setup RAG pipeline
rag_chain = (
    {"context": retriever,  "question": RunnablePassthrough()} 
    | prompt 
    | llm_for_text_generation
    | StrOutputParser() 
)
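The chain above hands the retriever's list of documents directly to `{context}`, so the prompt appears to receive the documents' string representation, metadata included. A common alternative is to join only the chunk text before it reaches the prompt. The sketch below shows such a formatter; `format_docs` is an illustrative helper, and the stand-in objects only mimic the `page_content` attribute of LangChain documents.

```python
from types import SimpleNamespace


def format_docs(docs) -> str:
    """Join the text of retrieved documents, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


# Stand-ins for LangChain Document objects, for a quick local check.
demo_docs = [
    SimpleNamespace(page_content="first chunk"),
    SimpleNamespace(page_content="second chunk"),
]
print(format_docs(demo_docs))

# Assumed usage inside the chain (not executed here):
# rag_chain = (
#     {"context": retriever | format_docs, "question": RunnablePassthrough()}
#     | prompt
#     | llm_for_text_generation
#     | StrOutputParser()
# )
```

Whether this improves answers depends on the model and prompt, so compare evaluation results before and after making the change.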

## Preparing the Evaluation Data

The Arize Phoenix evaluators used in this notebook judge each response against the retrieved context rather than against a labeled answer, so preparing the evaluation dataset takes little effort. You only need to write the `input` questions; the `output` (generated answer) and `reference` (retrieved chunks) columns are produced through inference as shown below. The `ground_truths` are kept in the dataset so you can compare them with the generated answers by hand; the two evaluators used later do not appear to read that column.

In [None]:
import pandas as pd
input = ["How many days has Amazon asked employees to come to work in office?", 
             "By what percentage did AWS revenue grow year-over-year in 2022?",
             "Compared to Graviton2 processors, what performance improvement did Graviton3 chips deliver according to the passage?",
             "Which was the first inference chip launched by AWS according to the passage?",
             "According to the context, in what year did Amazon's annual revenue increase from $245B to $434B?"
            ]
ground_truths = [["Amazon has asked corporate employees to come back to office at least three days a week beginning May 2022."],
                ["AWS had a 29% year-over-year ('YoY') revenue in 2022 on $62B revenue base."],
                ["In 2022, AWS delivered their Graviton3 chips, providing 25% better performance than the Graviton2 processors."],
                ["AWS launched their first inference chips (“Inferentia”) in 2019, and they have saved companies like Amazon over a hundred million dollars in capital expense."],
                ["Amazon's annual revenue increased from $245B in 2019 to $434B in 2022."]]

output = []
reference = []

for query in input:
  output.append(rag_chain.invoke(query))
  reference.append([docs.page_content for docs in retriever.get_relevant_documents(query)])

# To dict
data = {
    "input": input,
    "output": output,
    "reference": reference,
    "ground_truths": ground_truths
}

# create dataframe from data
df = pd.DataFrame(data)
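Before running the evaluation, a quick sanity check on the prepared columns can catch a failed inference call or a misaligned list. The sketch below assumes the evaluators read the `input`, `output`, and `reference` columns used in this notebook; `check_eval_data` is a hypothetical helper.

```python
def check_eval_data(data: dict, required=("input", "output", "reference")) -> int:
    """Verify required columns exist and have equal length; return the row count."""
    missing = [column for column in required if column not in data]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    lengths = {column: len(data[column]) for column in required}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"column lengths differ: {lengths}")
    return lengths[required[0]]


# Small made-up example; in the notebook you would pass the `data` dict built above.
demo_data = {
    "input": ["q1", "q2"],
    "output": ["a1", "a2"],
    "reference": [["chunk a"], ["chunk b", "chunk c"]],
}
print(check_eval_data(demo_data))
```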

In [None]:
df

## Evaluating the RAG application
First, we will initialize the Bedrock model `anthropic.claude-3-sonnet` as the large language model that performs the RAG evaluation through the Arize Phoenix SDK.



In [None]:
from phoenix.evals import BedrockModel

eval_model = BedrockModel(model_id="anthropic.claude-3-sonnet-20240229-v1:0")

Import the evaluators you want to use from `phoenix.evals`. Then call the `run_evals` function, passing in the evaluators and the prepared dataset.


In [None]:
from phoenix.evals import (
    QAEvaluator,
    RelevanceEvaluator,
)

qa_correctness_evaluator = QAEvaluator(eval_model)
relevance_evaluator = RelevanceEvaluator(eval_model)

In [None]:
import nest_asyncio
from phoenix.evals import (
    run_evals,
)

nest_asyncio.apply()  # needed for concurrency in notebook environments

# run_evals returns one dataframe per evaluator; take the first (and only) one
qa_correctness_eval_df = run_evals(
    dataframe=df,
    evaluators=[qa_correctness_evaluator],
    provide_explanation=True,
)[0]
relevance_eval_df = run_evals(
    dataframe=df,
    evaluators=[relevance_evaluator],
    provide_explanation=True,
)[0]

Below, you can see the resulting evals for the examples:

In [None]:
import pandas as pd
qa_correctness_eval_df

In [None]:
relevance_eval_df
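To get a single number per metric, you can compute the share of each label in the results. The sketch below works on a plain list of labels so it runs on its own; `label_summary` is a hypothetical helper, and the assumption that the eval dataframe exposes a `label` column should be checked against your output.

```python
from collections import Counter


def label_summary(labels) -> dict:
    """Return the fraction of rows carrying each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


# Made-up labels for illustration.
print(label_summary(["correct", "correct", "incorrect", "correct"]))

# Assumed usage with the results above (not executed here):
# label_summary(relevance_eval_df["label"].tolist())
```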

> Note: The scores above give a relative idea of how your RAG application performs; use them with caution and not as standalone scores. Also note that only 5 question/answer pairs were used for evaluation here. As a best practice, use enough data to cover the different aspects of your documents when evaluating the model.

Based on the scores, you can review other components of your RAG workflow to improve them further. A few recommended options are to review your chunking strategy, refine the prompt instructions, and increase `numberOfResults` to supply additional context.