# Retrieval Augmented Generation with Amazon Bedrock - Enhancing Chat Applications with RAG

> *PLEASE NOTE: This notebook should work well with the **`Data Science 3.0`** kernel in SageMaker Studio*

---

## Chat with LLMs Overview

Conversational interfaces such as chatbots and virtual assistants can be used to enhance the user experience for your customers. Chatbots can be used in a variety of applications, such as customer service, sales, and e-commerce, to provide quick and efficient responses to users.

The key technical detail we need to include in our system to enable a chat feature is conversational memory. This way, customers can ask follow-up questions and the LLM will understand what the customer has already said. The image below shows how this is orchestrated at a high level.

![Amazon Bedrock - Conversational Interface](./images/chatbot_bedrock.png)
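To make the idea of conversational memory concrete, here is a minimal sketch in plain Python. The `SimpleChatMemory` class and its method names are hypothetical stand-ins for illustration only; they are not the LangChain implementation used later in this notebook.

```python
class SimpleChatMemory:
    """Hypothetical conversation buffer: stores every turn in order."""

    def __init__(self, human_prefix="Human", ai_prefix="Assistant"):
        self.human_prefix = human_prefix
        self.ai_prefix = ai_prefix
        self.turns = []  # list of (speaker_prefix, text) tuples

    def add_user_message(self, text):
        self.turns.append((self.human_prefix, text))

    def add_ai_message(self, text):
        self.turns.append((self.ai_prefix, text))

    def history(self):
        # Render each stored turn as "Prefix: text", one turn per line.
        return "\n".join(f"{prefix}: {text}" for prefix, text in self.turns)


memory_sketch = SimpleChatMemory()
memory_sketch.add_user_message("How can I check for imbalances in my model?")
memory_sketch.add_ai_message("You can compare metrics across groups.")
memory_sketch.add_user_message("What kind does it detect?")

# The full history is what gets prepended to the next prompt,
# which is how the model can resolve what "it" refers to.
print(memory_sketch.history())
```

The design choice here is the simplest possible one: keep every turn and replay all of them on each request.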

## Extending Chat with RAG

However, in our workshop's situation, we want to enable a customer to ask follow-up questions about documentation we provide through RAG. This means we need to build a system that has both conversational memory and contextual retrieval built into the text generation.

![4](./images/context-aware-chatbot.png)
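As a rough sketch of what "memory plus retrieval" means at the prompt level, the snippet below assembles one prompt from three pieces: the chat history, the retrieved passages, and the new question. The function name `build_rag_chat_prompt`, the `<context>` tags, and the placeholder passages are assumptions made for illustration; the prompt layout actually used in this workshop may differ.

```python
def build_rag_chat_prompt(history, context_docs, question):
    """Combine chat history, retrieved passages, and the new question into one prompt."""
    context = "\n\n".join(context_docs)
    return (
        "You are a helpful conversational assistant.\n"
        "Answer using only the context below.\n\n"
        f"<context>\n{context}\n</context>\n"
        f"{history}\n\n"
        f"Human: {question}\n\n"
        "Assistant:"
    )


example_prompt = build_rag_chat_prompt(
    history="\nHuman: How can I check for imbalances in my model?\nAssistant: (earlier answer)",
    context_docs=["(retrieved FAQ passage 1)", "(retrieved FAQ passage 2)"],
    question="What kind does it detect?",
)
print(example_prompt)
```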

Let's get started!

---

## Setup `boto3` Connection

#### Install the required libraries

In [None]:
%pip install  \
    "langchain>=0.1.11" \
    "transformers>=4.24,<5" \
    "faiss-cpu>=1.7.4,<2" \
    "pypdf>=3.8,<4" \
    pinecone-client==2.2.4 \
    apache-beam==2.52.0 \
    tiktoken==0.5.2 \
    "ipywidgets>=7,<8" \
    matplotlib==3.8.2 \
    anthropic==0.9.0 \
    llama-index==0.9.0

In [1]:
import boto3
import os
from IPython.display import Markdown, display

region = os.environ.get("AWS_REGION")
boto3_bedrock = boto3.client(
    service_name='bedrock-runtime',
    region_name=region,
)

### Use the following chain-of-thought (CoT) prompt to test

In [2]:
prompt = "Human: You are a supply chain inspector. Your job is to assess the risk of a supplier. You measure the supplier risk by evaluating the supplier across three dimensions: country, size, and reputation.\nCountry - north America and west Europe countries are considered low risk. The rest of the world is considered medium risk.\nSize - supplier with over 1000 employees is low risk. Supplier with 50 to 999 employees is medium risk, and a supplier with under 50 employees is high risk.\nReputation - reputation scores are between 1 to 10 where a score of 1 to 3 is low risk, a score of 4 to 7 is medium risk, and a reputation score of 8 to 10 is high risk.\nThe risk formula is to take the maximum risk across the three dimensions.\n##\nExample:\nSupplier: A\nCountry: Chad\nSize: 30\nReputation: 8\nLet's think step by step:\nChad is not in North America or West Europe therefore country risk is medium.\nA size of 30 is below 50 and therefore considered high risk.\nA reputation score of 8 is between 8 to 10 and therefore considered high risk.\nFinal Answer taking the maximum risk among all: Supplier A is at High risk.\n##\nSupplier: B\nCountry: USA\nSize: 40\nReputation: 2\nLet's think step by step: \nAssistant: "


model_output = " Okay, let's evaluate Supplier B:\n\n- Country: USA is in North America, so the country risk is low.\n\n- Size: 40 employees is below 50, so the size risk is high. \n\n- Reputation: A score of 2 is between 1-3, so the reputation risk is low.\n\nTo determine the overall risk, we take the maximum risk across the three dimensions. \n\nThe maximum risk for Supplier B is high due to its small size.\n\nTherefore, my assessment is that Supplier B is at high risk overall."
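The risk rules stated in the prompt are simple enough to check by hand. The sketch below encodes them in plain Python so the model's answer for Supplier B can be compared against a deterministic result. The function names and the `LOW_RISK_COUNTRIES` set are hypothetical (the set is only an illustrative subset), and treating exactly 1000 employees as low risk is an assumption, since the prompt does not cover that case.

```python
RISK_ORDER = {"low": 0, "medium": 1, "high": 2}

# Illustrative subset only; the prompt says North America and West Europe are low risk.
LOW_RISK_COUNTRIES = {"USA", "Canada", "Germany", "France"}


def country_risk(country):
    return "low" if country in LOW_RISK_COUNTRIES else "medium"


def size_risk(employees):
    if employees >= 1000:  # assumption: exactly 1000 counts as low risk
        return "low"
    if employees >= 50:
        return "medium"
    return "high"


def reputation_risk(score):
    # Thresholds copied from the prompt text: 1-3 low, 4-7 medium, 8-10 high.
    if score <= 3:
        return "low"
    if score <= 7:
        return "medium"
    return "high"


def supplier_risk(country, employees, reputation):
    risks = [country_risk(country), size_risk(employees), reputation_risk(reputation)]
    # The overall risk is the maximum risk across the three dimensions.
    return max(risks, key=lambda level: RISK_ORDER[level])


print(supplier_risk("Chad", 30, 8))  # Supplier A from the worked example
print(supplier_risk("USA", 40, 2))   # Supplier B, the case posed to the model
```

Both calls evaluate to `high`, which agrees with the worked example in the prompt and with the model output shown above.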

---
## Using LangChain for Conversation Memory

LangChain's `ConversationBufferMemory` class provides an easy way to capture conversational memory for LLM chat applications. Let's check out an example of Claude retrieving context through conversational memory below.

Similar to the last workshop, we will use both a prompt template and a LangChain LLM for this example. Note that this time our prompt template includes a `{history}` variable, where our chat history will be inserted into the prompt.

In [8]:
from langchain import PromptTemplate

CHAT_PROMPT_TEMPLATE = '''You are a helpful conversational assistant.
{history}

Human: {human_input}

Assistant:
'''
PROMPT = PromptTemplate.from_template(CHAT_PROMPT_TEMPLATE)

In [9]:
from langchain.llms import Bedrock

llm = Bedrock(
    client=boto3_bedrock,
    model_id="anthropic.claude-instant-v1",
    model_kwargs={
        "max_tokens_to_sample": 500,
        "temperature": 0.9,
    },
)

The `ConversationBufferMemory` class is instantiated here, and you will notice that we use Claude-specific human and assistant prefixes. When we initialize the memory, the history is blank.

In [10]:
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(human_prefix="\nHuman", ai_prefix="\nAssistant")
history = memory.load_memory_variables({})['history']
print(history)




We now ask Claude a simple question: "How can I check for imbalances in my model?". The LLM responds, and we use the `add_user_message` and `add_ai_message` functions to save the input and output into memory. We can then retrieve the entire conversation history and print it. At this point the model still answers from the data it was trained on. Later we will examine how to get a curated answer using our own FAQs.

In [11]:
human_input = 'How can I check for imbalances in my model?'

prompt_data = PROMPT.format(human_input=human_input, history=history)
ai_output = llm(prompt_data)

memory.chat_memory.add_user_message(human_input)
memory.chat_memory.add_ai_message(ai_output.strip())
history = memory.load_memory_variables({})['history']
display(Markdown(f'{history}'))


Human: How can I check for imbalances in my model?

Assistant: Here are a few things you can do to check for imbalances in your model:

- Evaluate model performance on different demographic groups. Look for significant differences in accuracy, recall, precision, etc. across groups. Large gaps may indicate the model has learned biases.

- Check the data distribution of sensitive attributes like gender, race, age in the training data compared to the overall population. Imbalances here could influence the model. 

- Look at the model's predictions on individual examples and check if they align with your sense of fairness. Manual inspection can catch subtle biases automated metrics may miss.

- Run the model on synthetic edge case data you construct to test specific fairness hypotheses. For example, all male profiles with an otherwise identical profile. 

- Use bias quantification metrics like disparate impact, statistical parity difference, equal opportunity difference to measure fairness numerically across groups. 

- Profile which features/variables the model relies most heavily on for predictions. If these correlate too much with a sensitive attribute, it could indicate unfair biases.

- Perform causal analysis to better understand how sensitive attributes may directly or indirectly influence predictions.

Regularly checking models for potential biases through multiple techniques helps catch issues early and build more equitable systems. Oversight helps ensure models act with fairness and for the benefit of all groups.

Now we will ask a follow-up question about what kind of imbalances it detects, and save the input and output again. Notice that because the model has access to the chat history, it understands what the human means by "it" and can work out what the user is asking about.

In [12]:
human_input = 'What kind does it detect?'

prompt_data = PROMPT.format(human_input=human_input, history=history)
ai_output = llm(prompt_data)

memory.chat_memory.add_user_message(human_input)
memory.chat_memory.add_ai_message(ai_output.strip())
#display(Markdown(f'{history}'))
display(Markdown(f'{ai_output}'))

 When checking for imbalances or biases in a model, some common types of potential issues that can be detected include:

- Demographic biases - Where the model performs significantly differently based on attributes like gender, race, age, location, etc. This includes things like lower accuracy for minority groups.

- Historical biases in data - If the training data reflects past real-world biases or lack of representation, it can perpetuate those through the model predictions. 

- Redlining/discrimination - When a model systematically disadvantages or denies opportunities/resources to certain segments of the population, like specific neighborhoods.

- Stereotypical biases - Relying too heavily on attributes that correlate with protected groups, like associating career traits with gender roles.  

- Proxy variables - Variables in the data that correlate with and act as a proxy for sensitive attributes, allowing discrimination through proxies.

- Disparate impact - When a model negatively impacts or has worse outcomes for protected groups, even if not explicitly using sensitive attributes.

- Lack of counterfactual fairness - The model's predictions are not robust and may change drastically under small perturbations to inputs that have little causal influence.

- Unfair treatment of outliers/edge cases - Certain individuals or profiles may be systemically disadvantaged due to their unique attributes not being well represented.

Regular bias detection aims to uncover these types of issues to ensure all groups are treated fairly by a model and have equal access to opportunities.

---
## Creating a class to help facilitate conversation

To help create some structure around these conversations, we create a custom `Conversation` class below. This class will hold a stateful conversational memory and be the base for conversational RAG later.

In [13]:
class Conversation:
    def __init__(self, client, model_id: str="anthropic.claude-instant-v1") -> None:
        """instantiates a new rag based conversation

        Args:
            client: boto3 `bedrock-runtime` client used to invoke the model.
            model_id (str, optional): which bedrock model to use for the conversational agent. Defaults to "anthropic.claude-instant-v1".
        """

        # instantiate memory
        self.memory = ConversationBufferMemory(human_prefix="\nHuman", ai_prefix="\nAssistant")

        # instantiate LLM connection
        self.llm = Bedrock(
            client=client,
            model_id=model_id,
            model_kwargs={
                "max_tokens_to_sample": 500,
                "temperature": 0.9,
            },
        )

    def ai_respond(self, user_input: str=None):
        """responds to the user input in the conversation with context used

        Args:
            user_input (str, optional): user input. Defaults to None.

        Returns:
            ai_output (str): response from AI chatbot
        """

        # format the prompt with chat history and user input
        history = self.memory.load_memory_variables({})['history']
        llm_input = PROMPT.format(history=history, human_input=user_input)

        # respond to the user with the LLM
        ai_output = self.llm(llm_input).strip()

        # store the input and output
        self.memory.chat_memory.add_user_message(user_input)
        self.memory.chat_memory.add_ai_message(ai_output.strip())

        return ai_output

Let's see the class in action with two contextual questions. Again, notice the model is able to correctly interpret the context because it has memory of the conversation.

In [14]:
chat = Conversation(client=boto3_bedrock)

In [15]:
output = chat.ai_respond('How can I check for imbalances in my model?')
display(Markdown(f'{output}'))

Here are a few things you can do to check for and address imbalances in your machine learning model:

- Check the distribution of your target variable(s) across different groups. Look for significant skews that could indicate imbalance. For example, if one class is 90% of your data.

- Look at evaluation metrics like precision, recall, F1 score separately for each group. Imbalances may show up as much lower scores for minority groups. 

- Stratify your training, validation and test sets to ensure each group is properly represented in each set. Random splits can exacerbate imbalances.

- Use resampling techniques like oversampling minority classes or undersampling majority classes to balance your training data. 

- Weight your loss function to place more emphasis on correct predictions for minority classes. This helps address skews during model training.

- Evaluate calibration and ensure predicted probabilities are well-calibrated across groups. Imbalanced data can lead to poorly calibrated predictions.

- For classification, check the confusion matrix to see if some groups are being disproportionately misclassified compared to others.

- Analyze feature importances to check if the model is relying too heavily on features strongly correlated with the majority class.

Regular monitoring and metrics by group can help spot model issues stemming from data or label imbalances early. Resampling and loss weighting are also commonly used to directly address imbalances.

In [17]:
output = chat.ai_respond('What kind does it detect?')
display(Markdown(f'{output}'))

Here are the main types of imbalances that can be detected when checking a machine learning model:

- Class imbalance - When the distribution of target classes is uneven, with some classes under-represented. This is common with rare event prediction.

- Sample size imbalance - Unequal number of samples from different subgroups defined by variables like gender, location, etc. 

- Feature distribution imbalance - Differences in how features are distributed between subgroups. For example, an important predictor only present for one group.

- Missing data imbalance - Certain subgroups more likely to have missing or incomplete data for specific features.

- Label noise imbalance - Uneven label accuracy between subgroups, such as different error rates during manual labeling. 

- Evaluation metric imbalance - Subgroups show very different performance on common metrics like accuracy, F1 score, AUC. 

- Prediction calibration imbalance - Predicted probabilities are not well-calibrated across all subgroups.

- Feature importance imbalance - Model overly relies on features strongly correlated with majority subgroups.

- Outcome imbalance - Subgroups see disproportionate outcomes even with balanced predictions, revealing hidden biases.

The key things checked are imbalances in: class/target distributions, sample sizes, feature distributions, missingness, label quality, model performance metrics, prediction reliability, feature effects, and actual outcome distributions between defined subgroups. Identifying these skews helps diagnose potential fairness, representation or bias issues.

---
## Combining RAG with Conversation

Now that we have a conversational system built, let's incorporate the RAG system we built in notebook 02 into the chat paradigm.

First, we will create the same vector store with LangChain and FAISS from the last notebook.

Our goal is to create a curated response from the model that uses only the FAQs we have provided.

In [19]:
from langchain.embeddings import BedrockEmbeddings
from langchain.vectorstores import FAISS


# create instantiation to embedding model
embedding_model = BedrockEmbeddings(
    client=boto3_bedrock,
    model_id="amazon.titan-embed-text-v1"
)

# create vector store
vs = FAISS.load_local('./faiss-index/langchain/', embedding_model, allow_dangerous_deserialization=True)

### Visualize Semantic Search 

⚠️ ⚠️ ⚠️ This section is for Advanced Practitioners. Please feel free to run through these cells and come back later to re-examine the concepts ⚠️ ⚠️ ⚠️

Let's see how the semantic search works:
1. First we calculate the embeddings vector for the query, and
2. then we use this vector to do a similarity search on the store
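The two steps above can be sketched without any vector database at all. The snippet below is a brute-force illustration in plain Python: the three-dimensional vectors and document labels are made up, and a real store such as FAISS uses optimized index structures rather than a full scan.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search_by_vector(query_vec, store, k=2):
    """store is a list of (text, vector) pairs; return the k closest as (text, distance)."""
    scored = [(text, squared_l2(query_vec, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Step 1 (assumed already done): every document and the query have an embedding vector.
toy_store = [
    ("doc about bias", [0.9, 0.1, 0.0]),
    ("doc about canvas", [0.1, 0.9, 0.0]),
    ("doc about tuning", [0.0, 0.2, 0.9]),
]
toy_query = [1.0, 0.0, 0.0]

# Step 2: rank the stored vectors by distance to the query vector.
hits = search_by_vector(toy_query, toy_store, k=2)
for text, distance in hits:
    print(text, round(distance, 4))
```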


##### Citation
We will also be able to get the `citation`, that is, the underlying documents that our vector store matched to our query. This is useful for debugging and for measuring the quality of the vector store. Let us look at how the underlying vector store calculates the matches.

##### Vector DB Indexes
One of the key capabilities of a vector DB is retrieving documents that match the query with accuracy and speed. There are multiple indexing algorithms for this, and some examples can be [read here](https://thedataquarry.com/posts/vector-db-3/).

In [20]:
from IPython.display import HTML, display
import warnings
warnings.filterwarnings('ignore')
#- helpful function to display in tabular format

def display_table(data):
    html = "<table>"
    for row in data:
        html += "<tr>"
        for field in row:
            html += "<td>%s</td>"%(field)
        html += "</tr>"
    html += "</table>"
    display(HTML(html))

In [21]:

v = embedding_model.embed_query("How can I check for imbalances in my model?")
print(v[0:10])
results = vs.similarity_search_by_vector(v, k=2)
display(Markdown('Let us look at the documents which had the relevant information pertaining to our query'))
for r in results:
    display(Markdown(f'{r.page_content}'))
    display(Markdown(f'------------------------------------'))

[-0.14746094, 0.77734375, 0.26953125, -0.55859375, 0.047851562, -0.43554688, -0.057617188, -0.00030326843, -0.5703125, -0.33789062]


Let us look at the documents which had the relevant information pertaining to our query

What kind of bias does SageMaker Clarify detect?," Measuring bias in ML models is a first step to mitigating bias. Bias may be measured before training and after training, as well as for inference for a deployed model. Each measure of bias corresponds to a different notion of fairness. Even considering simple notions of fairness leads to many different measures applicable in various contexts. You must choose bias notions and metrics that are valid for the application and the situation under investigation. SageMaker currently supports the computation of different bias metrics for training data (as part of SageMaker data preparation), for the trained model (as part of Amazon SageMaker Experiments), and for inference for a deployed model (as part of Amazon SageMaker Model Monitor). For example, before training, we provide metrics for checking whether the training data is representative (that is, whether one group is underrepresented) and whether there are differences in the label distribution across groups. After training or during deployment, metrics can be helpful to measure whether (and by how much) the performance of the model differs across groups. For example, start by comparing the error rates (how likely a model's prediction is to differ from the true label) or break further down into precision (how likely a positive prediction is to be correct) and recall (how likely the model will correctly label a positive example)."

------------------------------------

How do I build an ML model to generate accurate predictions in SageMaker Canvas?," Once you have connected sources, selected a dataset, and prepared your data, you can select the target column that you want to predict to initiate a model creation job. SageMaker Canvas will automatically identify the problem type, generate new relevant features, test a comprehensive set of prediction models using ML techniques such as linear regression, logistic regression, deep learning, time-series forecasting, and gradient boosting, and build the model that makes accurate predictions based on your dataset."

------------------------------------

#### Similarity Search

##### Distance scoring in vector databases
[Distance scores](https://weaviate.io/blog/distance-metrics-in-vector-search) are key to vector search. Here are some FAISS-specific methods. One of them is `similarity_search_with_score`, which returns not only the documents but also the distance between the query and each document. The returned score is the L2 distance (squared Euclidean), so a lower score is better. FAISS offers both `similarity_search_with_score` (ranked by distance, low to high) and `similarity_search_with_relevance_scores` (ranked by relevance, high to low), and both use the same distance strategy. `similarity_search_with_relevance_scores` derives the relevance score from the distance: the idea is `1 - score`, although LangChain's default for Euclidean distance first scales the distance by `sqrt(2)` and assumes unit-normalized embeddings. For more details on the various distance scores, [read here](https://milvus.io/docs/metric.md).
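A small worked example may help relate the two common scores. For unit-length vectors, the squared L2 distance and the cosine similarity satisfy d² = 2 - 2·cos, so ranking by lowest distance and ranking by highest cosine similarity give the same order. The vectors below are made up for illustration. Note that this identity only holds after normalization: the distances in this notebook (for example 130.9) exceed 4, the largest possible squared distance between two unit vectors, so the stored embeddings are evidently not unit length.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


vec_a = normalize([3.0, 4.0])
vec_b = normalize([4.0, 3.0])

cos_sim = dot(vec_a, vec_b)        # cosine similarity of unit vectors is their dot product
dist_sq = squared_l2(vec_a, vec_b)

print("cosine similarity:", cos_sim)
print("squared L2:", dist_sq)
print("2 - 2*cos:", 2 - 2 * cos_sim)
```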


In [22]:
display(Markdown(f"##### Let us look at the documents based on {vs.distance_strategy.name} which will be used to answer our question 'What kind of bias does Clarify detect ?'"))

context = vs.similarity_search('What kind of bias does Clarify detect ?', k=2)
#-  langchain.schema.document.Document
display(Markdown(f'------------------------------------'))
list_context = [[doc.page_content, doc.metadata] for doc in context]
list_context.insert(0, ['Documents', 'Meta-data'])
display_table(list_context)

##### Let us look at the documents based on EUCLIDEAN_DISTANCE which will be used to answer our question 'What kind of bias does Clarify detect ?'

------------------------------------

0,1
Documents,Meta-data
"What kind of bias does SageMaker Clarify detect?,"" Measuring bias in ML models is a first step to mitigating bias. Bias may be measured before training and after training, as well as for inference for a deployed model. Each measure of bias corresponds to a different notion of fairness. Even considering simple notions of fairness leads to many different measures applicable in various contexts. You must choose bias notions and metrics that are valid for the application and the situation under investigation. SageMaker currently supports the computation of different bias metrics for training data (as part of SageMaker data preparation), for the trained model (as part of Amazon SageMaker Experiments), and for inference for a deployed model (as part of Amazon SageMaker Model Monitor). For example, before training, we provide metrics for checking whether the training data is representative (that is, whether one group is underrepresented) and whether there are differences in the label distribution across groups. After training or during deployment, metrics can be helpful to measure whether (and by how much) the performance of the model differs across groups. For example, start by comparing the error rates (how likely a model's prediction is to differ from the true label) or break further down into precision (how likely a positive prediction is to be correct) and recall (how likely the model will correctly label a positive example).""",{}
"How does SageMaker Clarify improve model explainability?, SageMaker Clarify is integrated with SageMaker Experiments to provide a feature importance graph detailing the importance of each input for your model’s overall decision-making process after the model has been trained. These details can help determine if a particular model input has more influence than it should on overall model behavior. SageMaker Clarify also makes explanations for individual predictions available through an API.",{}


We first looked at the page content and the metadata associated with the documents. Now let us look at the L2 distance scores explained above, where a lower score is better. Note that the table column below is labelled "Relevancy Score", but the values are distances.

In [23]:
#- relevancy of the documents
results = vs.similarity_search_with_score("What kind of bias does Clarify detect ?", k=2, fetch_k=3)
display(Markdown(f'##### Similarity Search Table with relevancy score.'))
display(Markdown(f'------------------------------------'))   
results.insert(0,['Documents', 'Relevancy Score'])
display_table(results)

##### Similarity Search Table with relevancy score.

------------------------------------

0,1
Documents,Relevancy Score
"page_content='What kind of bias does SageMaker Clarify detect?,"" Measuring bias in ML models is a first step to mitigating bias. Bias may be measured before training and after training, as well as for inference for a deployed model. Each measure of bias corresponds to a different notion of fairness. Even considering simple notions of fairness leads to many different measures applicable in various contexts. You must choose bias notions and metrics that are valid for the application and the situation under investigation. SageMaker currently supports the computation of different bias metrics for training data (as part of SageMaker data preparation), for the trained model (as part of Amazon SageMaker Experiments), and for inference for a deployed model (as part of Amazon SageMaker Model Monitor). For example, before training, we provide metrics for checking whether the training data is representative (that is, whether one group is underrepresented) and whether there are differences in the label distribution across groups. After training or during deployment, metrics can be helpful to measure whether (and by how much) the performance of the model differs across groups. For example, start by comparing the error rates (how likely a model\'s prediction is to differ from the true label) or break further down into precision (how likely a positive prediction is to be correct) and recall (how likely the model will correctly label a positive example).""'",130.9253
"page_content='How does SageMaker Clarify improve model explainability?, SageMaker Clarify is integrated with SageMaker Experiments to provide a feature importance graph detailing the importance of each input for your model’s overall decision-making process after the model has been trained. These details can help determine if a particular model input has more influence than it should on overall model behavior. SageMaker Clarify also makes explanations for individual predictions available through an API.'",188.70457


#### Maximal Marginal Relevance (MMR)

Maximal Marginal Relevance (MMR) was introduced in the paper [The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries](https://www.cs.cmu.edu/~jgc/publication/The_Use_MMR_Diversity_Based_LTMIR_1998.pdf). MMR tries to reduce redundancy among results while maintaining their relevance to the query: each new result is chosen to be relevant to the query but dissimilar from the results already selected. Because our data set is very small, MMR may make little difference in the results below, but for larger data sets it should return a more diverse set of documents while preserving overall relevance.
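The greedy selection at the heart of MMR can be sketched in a few lines. This is a simplified illustration in plain Python with made-up two-dimensional vectors and labels; the `lambda_mult` name mirrors a common parameter name, but LangChain's and FAISS's own implementations may differ in detail.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def mmr_select(query_vec, candidates, k=2, lambda_mult=0.5):
    """Greedy MMR over (label, vector) candidates; returns labels in pick order."""
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < k:

        def mmr_score(item):
            _, vec = item
            relevance = cosine(query_vec, vec)
            # Redundancy is the highest similarity to anything already picked.
            redundancy = max((cosine(vec, s_vec) for _, s_vec in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return [label for label, _ in selected]


mmr_query = [1.0, 0.0]
mmr_candidates = [
    ("faq_a", [1.0, 0.1]),
    ("faq_a_near_duplicate", [1.0, 0.12]),
    ("faq_b", [0.6, -0.8]),
]

print(mmr_select(mmr_query, mmr_candidates, k=2, lambda_mult=0.5))
print(mmr_select(mmr_query, mmr_candidates, k=2, lambda_mult=1.0))
```

With `lambda_mult=0.5` the near-duplicate is skipped in favor of the less similar `faq_b`; with `lambda_mult=1.0` the redundancy term vanishes and the selection falls back to pure relevance ranking.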

In [24]:
#- maximal marginal relevance (MMR) search
display(Markdown('##### Let us look at MMR scores'))
results = vs.max_marginal_relevance_search_with_score_by_vector(embedding_model.embed_query("What kind of bias does Clarify detect ?"), k=3)
results.insert(0, ["Document", "MMR Score"])
display_table(results)


##### Let us look at MMR scores

0,1
Document,MMR Score
"page_content='What kind of bias does SageMaker Clarify detect?,"" Measuring bias in ML models is a first step to mitigating bias. Bias may be measured before training and after training, as well as for inference for a deployed model. Each measure of bias corresponds to a different notion of fairness. Even considering simple notions of fairness leads to many different measures applicable in various contexts. You must choose bias notions and metrics that are valid for the application and the situation under investigation. SageMaker currently supports the computation of different bias metrics for training data (as part of SageMaker data preparation), for the trained model (as part of Amazon SageMaker Experiments), and for inference for a deployed model (as part of Amazon SageMaker Model Monitor). For example, before training, we provide metrics for checking whether the training data is representative (that is, whether one group is underrepresented) and whether there are differences in the label distribution across groups. After training or during deployment, metrics can be helpful to measure whether (and by how much) the performance of the model differs across groups. For example, start by comparing the error rates (how likely a model\'s prediction is to differ from the true label) or break further down into precision (how likely a positive prediction is to be correct) and recall (how likely the model will correctly label a positive example).""'",130.9253
"page_content='How does SageMaker Clarify improve model explainability?, SageMaker Clarify is integrated with SageMaker Experiments to provide a feature importance graph detailing the importance of each input for your model’s overall decision-making process after the model has been trained. These details can help determine if a particular model input has more influence than it should on overall model behavior. SageMaker Clarify also makes explanations for individual predictions available through an API.'",188.70457
"page_content='What is the underlying tuning algorithm for Automatic Model Tuning?,"" Currently, the algorithm for tuning hyperparameters is a customized implementation of Bayesian Optimization. It aims to optimize a customer-specified objective metric throughout the tuning process. Specifically, it checks the object metric of completed training jobs, and uses the knowledge to infer the hyperparameter combination for the next training job.""\nDoes Automatic Model Tuning recommend specific hyperparameters for tuning?,"" No. How certain hyperparameters impact the model performance depends on various factors, and it is hard to definitively say one hyperparameter is more important than the others and thus needs to be tuned. For built-in algorithms within SageMaker, we do call out whether or not a hyperparameter is tunable.""'",281.98907


#### Update embeddings of the Vector Databases

Documents are updated all the time, and we often have multiple versions of the same document. This means we also need to consider how to update the embeddings in our vector database. Fortunately, we can leverage the metadata to do this.

The key steps are:
1. Load the new embeddings and add metadata stating the version as 2
2. Merge them into the existing vector database
3. Run the query using the filter to search only the new index and get the latest documents for the same query
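The three steps can be illustrated with a toy in-memory store before running them against FAISS. Everything in this sketch is hypothetical (plain dictionaries, placeholder texts, and the helper names `tag_version` and `filter_by_metadata`); it only shows how version metadata lets a merged collection be narrowed to the newest documents.

```python
def tag_version(texts, version):
    """Step 1: wrap each text in a document dict carrying version metadata."""
    return [{"page_content": text, "metadata": {"page": version}} for text in texts]


def filter_by_metadata(docs, wanted):
    """Keep only documents whose metadata contains every wanted key/value pair."""
    return [
        doc for doc in docs
        if all(doc["metadata"].get(key) == value for key, value in wanted.items())
    ]


# Existing documents have no version metadata, like the original index in this notebook.
store_v1 = [{"page_content": "(version 1 answer)", "metadata": {}}]
store_v2 = tag_version(["(version 2 answer)"], "v2")

# Before the merge, filtering the old store for v2 finds nothing.
print(filter_by_metadata(store_v1, {"page": "v2"}))

# Step 2: merge the new documents into the existing collection.
merged_store = store_v1 + store_v2

# Step 3: filter on the metadata to get only the latest version.
print(filter_by_metadata(merged_store, {"page": "v2"}))
```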


In [28]:
# create vector store
from langchain.document_loaders import CSVLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.schema import Document
loader = CSVLoader(
    file_path="./data/sagemaker/sm_faq_v2.csv",
    csv_args={
        "delimiter": ",",
        "quotechar": '"',
        "fieldnames": ["Question", "Answer"],
    },
)

#docs_split = loader.load()
docs_split = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0, separator=",").split_documents(loader.load())
list_of_documents = [Document(page_content=doc.page_content, metadata=dict(page='v2')) for idx, doc in enumerate(docs_split)]
print(f"Number of split docs={len(docs_split)}")
db = FAISS.from_documents(list_of_documents, embedding_model)

Number of split docs=6


#### Run a query against version 2 of the documents
Let us run the query against our existing vector database. Without a filter, we would see the existing (version 1) documents coming back. If we run with the filter, we get an empty list back, because version 2 documents do not exist in our vector database yet.


In [29]:
# Run the query with requesting data from version 2 which does not exist
vs = FAISS.load_local('./faiss-index/langchain/', embedding_model, allow_dangerous_deserialization=True)
search_query = "How can I check for imbalances in my model?"
#print(f"Running with v1 of the documents we get response of {vs.similarity_search_with_score(query=search_query, k=1, fetch_k=4)}")
print("------\n")
print(f"Running the query with V2 of the document we get {vs.similarity_search_with_score(query=search_query, filter=dict(page='v2'), k=1)}:")


------

Running the query with V2 of the document we get []:


#### Add a new version of the document
We will create version 2 of the documents and use metadata to tag them before adding them to our original index. Once done, we will apply a filter in our query that returns only the newly added documents. Run the query again after adding the new version of the documents.

We will also examine a way to speed up and narrow our searches: the `fetch_k` parameter, which can be passed when calling `similarity_search` with filters. Usually you want `fetch_k` to be larger than `k`, because `fetch_k` is the number of documents fetched before filtering. If you set `fetch_k` too low, you might not get enough documents to filter from.
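
To see why `fetch_k` should usually exceed `k`, the sketch below mimics the fetch-then-filter order in plain Python. It is a simplified model of the behaviour described above, not the LangChain implementation; `filtered_search` and the sample data are hypothetical.

```python
# Simplified model: take the fetch_k nearest candidates, then filter, then keep k.

def filtered_search(ranked_docs, k, fetch_k, wanted_page):
    """ranked_docs is assumed to be sorted from nearest to farthest."""
    fetched = ranked_docs[:fetch_k]  # fetch before filtering
    kept = [d for d in fetched if d["page"] == wanted_page]
    return kept[:k]

# Nearest-first ranking in which v1 documents happen to crowd the top.
ranked = [
    {"id": "a", "page": "v1"},
    {"id": "b", "page": "v1"},
    {"id": "c", "page": "v2"},
    {"id": "d", "page": "v2"},
]

too_small = filtered_search(ranked, k=2, fetch_k=2, wanted_page="v2")
large_enough = filtered_search(ranked, k=2, fetch_k=4, wanted_page="v2")

print(len(too_small))                   # both fetched candidates were v1
print([d["id"] for d in large_enough])  # the two v2 documents
```

With `fetch_k=2` the filter has nothing left to return, even though matching documents exist further down the ranking.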

In [30]:
# - now let us add version 2 of the data set and run query from that

vs.merge_from(db)

#### Query the complete merged database with no filters
Run the query against the fully merged database without any metadata filter. It returns the top results from the new v2 data as well as the top results from the v1 data. In other words, it matches and returns whichever documents are closest to the query, regardless of version.

In [31]:
# - run the query again
search_query_v2 = "How can I check for imbalances in my model?"
results_with_scores = vs.similarity_search_with_score(search_query_v2, k=2, fetch_k=3)
results_with_scores = [[doc.page_content, doc.metadata, score] for doc, score in results_with_scores]
results_with_scores.insert(0, ['Document', 'Meta-Data', 'Score'])
display_table(results_with_scores)

0,1,2
Document,Meta-Data,Score
"Question: How can I check for imbalances in my model? Answer: Amazon SageMaker Clarify Version 2 will helps improve model transparency. SageMaker Clarify checks for imbalances during data preparation, after training, and ongoing over time",{'page': 'v2'},154.83801
"What kind of bias does SageMaker Clarify detect?,"" Measuring bias in ML models is a first step to mitigating bias. Bias may be measured before training and after training, as well as for inference for a deployed model. Each measure of bias corresponds to a different notion of fairness. Even considering simple notions of fairness leads to many different measures applicable in various contexts. You must choose bias notions and metrics that are valid for the application and the situation under investigation. SageMaker currently supports the computation of different bias metrics for training data (as part of SageMaker data preparation), for the trained model (as part of Amazon SageMaker Experiments), and for inference for a deployed model (as part of Amazon SageMaker Model Monitor). For example, before training, we provide metrics for checking whether the training data is representative (that is, whether one group is underrepresented) and whether there are differences in the label distribution across groups. After training or during deployment, metrics can be helpful to measure whether (and by how much) the performance of the model differs across groups. For example, start by comparing the error rates (how likely a model's prediction is to differ from the true label) or break further down into precision (how likely a positive prediction is to be correct) and recall (how likely the model will correctly label a positive example).""",{},229.07718


#### Query with Filter
Now we will search only against version 2 of the data by applying a metadata filter.

In [32]:
# - run the query again
search_query_v2 = "How can I check for imbalances in my model?"
results_with_scores = vs.similarity_search_with_score(search_query_v2, filter=dict(page='v2'), k=2, fetch_k=3)
results_with_scores = [[doc.page_content, doc.metadata, score] for doc, score in results_with_scores]
results_with_scores.insert(0, ['Document', 'Meta-Data', 'Score'])
display_table(results_with_scores)

0,1,2
Document,Meta-Data,Score
"Question: How can I check for imbalances in my model? Answer: Amazon SageMaker Clarify Version 2 will helps improve model transparency. SageMaker Clarify checks for imbalances during data preparation, after training, and ongoing over time",{'page': 'v2'},154.83801
Question: What kind of bias does SageMaker Clarify detect? Answer: Measuring bias in ML models is a first step to mitigating bias.,{'page': 'v2'},268.96283
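
The `Score` column in the tables above is a distance, so smaller values indicate closer matches (LangChain's FAISS wrapper uses L2 distance by default). A minimal sketch of that ranking rule, assuming squared Euclidean distance and made-up two-dimensional vectors:

```python
# Toy vectors; real embeddings have many more dimensions.

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

query = [1.0, 0.0]
docs = {"close": [0.9, 0.1], "far": [-1.0, 2.0]}

# Sort ascending: the smallest distance is the best match.
ranked = sorted(docs, key=lambda name: squared_l2(query, docs[name]))
print(ranked)
```

This is why the exact-match v2 answer, with the lowest score, appears first in both tables.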


#### Query for new data
Now let us ask a question that is answered only in version 2 of the documents.

In [33]:
# - now let us ask a question which ONLY exists in version 2 of the document
search_query_v2 = "Can i use Quantum computing?"
results_with_scores = vs.similarity_search_with_score(query=search_query_v2, filter=dict(page='v2'), k=1, fetch_k=3)
results_with_scores = [[doc.page_content, doc.metadata, score] for doc, score in results_with_scores]
results_with_scores.insert(0, ['Document', 'Meta-Data', 'Score'])
display_table(results_with_scores)

0,1,2
Document,Meta-Data,Score
Question: Can i use Quantum computing? Answer: Yes SageMaker version sometime in future will let you run quantum computing,{'page': 'v2'},97.10314


### Let us continue to build our chatbot

The prompt template is now altered to include both the retrieved context and the chat history as inputs, along with the human input. Notice how the prompt also instructs Claude not to answer questions for which it does not have the context. This helps reduce hallucinations, which is extremely important when creating end-user-facing applications that need to be factual.

In [36]:
# re-create vector store and continue
vs = FAISS.load_local('./faiss-index/langchain/', embedding_model, allow_dangerous_deserialization=True)

In [37]:
RAG_TEMPLATE = """You are a helpful conversational assistant.

If you are unsure about the answer OR the answer does not exist in the context, respond with
"Sorry but I do not understand your request. I am still learning so I appreciate your patience! 😊"
NEVER make up the answer.

If the human greets you, simply introduce yourself.

The context will be placed in <context></context> XML tags. 

<context>{context}</context>

Do not include any xml tags in your response.

{history}

Human: {input}

Assistant:
"""
PROMPT = PromptTemplate.from_template(RAG_TEMPLATE)

The new `ConversationWithRetrieval` class now includes a `get_context` function which searches our vector database based on the human input and combines it into the base prompt.

In [38]:
class ConversationWithRetrieval:
    def __init__(self, client, vector_store: FAISS=None, model_id: str="anthropic.claude-instant-v1") -> None:
        """instantiates a new rag based conversation

        Args:
            client: boto3 Bedrock runtime client used to invoke the model.
            vector_store (FAISS, optional): pre-populated vector store for searching context. Defaults to None.
            model_id (str, optional): which bedrock model to use for the conversational agent. Defaults to "anthropic.claude-instant-v1".
        """

        # store vector store
        self.vector_store = vector_store
        
        # instantiate memory
        self.memory = ConversationBufferMemory(human_prefix="Human", ai_prefix="Assistant")

        # instantiate LLM connection
        self.llm = Bedrock(
            client=client,
            model_id=model_id,
            model_kwargs={
                "max_tokens_to_sample": 500,
                "temperature": 0.0,
            },
        )

    def ai_respond(self, user_input: str=None):
        """responds to the user input in the conversation with context used

        Args:
            user_input (str, optional): user input. Defaults to None.

        Returns:
            ai_output (str): response from AI chatbot
            search_results (list): context used in the completion
        """

        # format the prompt with chat history and user input
        context_string, search_results = self.get_context(user_input)
        history = self.memory.load_memory_variables({})['history']
        llm_input = PROMPT.format(history=history, input=user_input, context=context_string)

        # respond to the user with the LLM
        ai_output = self.llm(llm_input).strip()

        # store the input and output
        self.memory.chat_memory.add_user_message(user_input)
        self.memory.chat_memory.add_ai_message(ai_output.strip())

        return ai_output, search_results

    def get_context(self, user_input, k=5):
        """returns context used in the completion

        Args:
            user_input (str): user input as a string
            k (int, optional): number of results to return. Defaults to 5.

        Returns:
            context_string (str): context used in the completion as a string
            search_results (list): context used in the completion as a list of Document objects
        """
        search_results = self.vector_store.similarity_search(
            user_input, k=k
        )
        context_string = '\n\n'.join([f'Document {ind+1}: ' + i.page_content for ind, i in enumerate(search_results)])
        return context_string, search_results
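
To make the prompt assembly concrete, the sketch below reproduces the context-joining step from `get_context` and fills a shortened template with `str.format`. `FakeDoc` and `SHORT_TEMPLATE` are hypothetical stand-ins for a LangChain `Document` and the full `RAG_TEMPLATE`, so the snippet runs without Bedrock or a vector store.

```python
from dataclasses import dataclass

@dataclass
class FakeDoc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str

# Shortened, illustrative version of the RAG template.
SHORT_TEMPLATE = "<context>{context}</context>\n\n{history}\n\nHuman: {input}\n\nAssistant:"

search_results = [FakeDoc("Clarify detects bias."), FakeDoc("Autopilot automates ML.")]

# Same joining logic as get_context: number each document, separate with blank lines.
context_string = "\n\n".join(
    f"Document {ind + 1}: {doc.page_content}" for ind, doc in enumerate(search_results)
)

llm_input = SHORT_TEMPLATE.format(
    context=context_string,
    history="Human: Hello\nAssistant: Hi!",
    input="What does Clarify detect?",
)
print(llm_input)
```

Printing `llm_input` is a quick way to check exactly what the model sees on each turn.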

Now the model can answer some specific domain questions based on our document database!

In [39]:
chat = ConversationWithRetrieval(boto3_bedrock, vs)

In [40]:
output, context = chat.ai_respond('How can I check for imbalances in my model?')
display(Markdown(f'{output}'))

SageMaker Clarify allows you to detect various types of biases in ML models, including disparities in performance across demographic groups. Some common metrics it supports measuring include differences in error rates, precision and recall across groups. Checking for these types of imbalances in your model is recommended before and after training to help identify and mitigate any unfair treatment of certain populations in your predictions.

In [41]:
output, context = chat.ai_respond('What kind does it detect?')
display(Markdown(f'** Ai Assistant Answer: ** \n{output}'))
display(Markdown(f'\n\n** Relevant Documentation: ** \n{context}'))

** Ai Assistant Answer: ** 
SageMaker Clarify can detect various types of biases in ML models, including:

- Representation bias, which refers to imbalances in the representation of different groups in the training data. This could involve checking if one group is underrepresented. 

- Statistical bias, which involves detecting differences in the label distributions across groups in the training data. 

- Prediction bias, which measures whether the performance of a trained model differs across groups. This includes comparing error rates, precision, recall and other metrics between groups to identify disparities in how accurately the model predicts outcomes for different populations.

- Breaking these metrics down further can provide more detailed insight, such as comparing precision and recall separately between groups.

So in summary, SageMaker Clarify detects biases related to data representation and label distributions prior to training, as well as performance differences between groups for the trained model, to help identify and mitigate unfair treatment of certain populations.



** Relevant Documentation: ** 
[Document(page_content='What is Amazon SageMaker Autopilot?," SageMaker Autopilot is the industry’s first automated machine learning capability that gives you complete control and visibility into your ML models. SageMaker Autopilot automatically inspects raw data, applies feature processors, picks the best set of algorithms, trains and tunes multiple models, tracks their performance, and then ranks the models based on performance, all with just a few clicks. The result is the best-performing model that you can deploy at a fraction of the time normally required to train the model. You get full visibility into how the model was created and what’s in it, and SageMaker Autopilot integrates with SageMaker Studio. You can explore up to 50 different models generated by SageMaker Autopilot inside SageMaker Studio so it’s easy to pick the best model for your use case. SageMaker Autopilot can be used by people without ML experience to easily produce a model, or it can be used by experienced developers to quickly develop a baseline model on which teams can further iterate."'), Document(page_content='What is Amazon SageMaker Studio Lab?," SageMaker Studio Lab is a free ML development environment that provides the compute, storage (up to 15 GB), and security—all at no cost—for anyone to learn and experiment with ML. All you need to get started is a valid email ID; you don’t need to configure infrastructure or manage identity and access or even sign up for an AWS account. SageMaker Studio Lab accelerates model building through GitHub integration, and it comes preconfigured with the most popular ML tools, frameworks, and libraries to get you started immediately. SageMaker Studio Lab automatically saves your work so you don’t need to restart between sessions. It’s as easy as closing your laptop and coming back later."'), Document(page_content='What kind of bias does SageMaker Clarify detect?," Measuring bias in ML models is a first step to mitigating bias. 
Bias may be measured before training and after training, as well as for inference for a deployed model. Each measure of bias corresponds to a different notion of fairness. Even considering simple notions of fairness leads to many different measures applicable in various contexts. You must choose bias notions and metrics that are valid for the application and the situation under investigation. SageMaker currently supports the computation of different bias metrics for training data (as part of SageMaker data preparation), for the trained model (as part of Amazon SageMaker Experiments), and for inference for a deployed model (as part of Amazon SageMaker Model Monitor). For example, before training, we provide metrics for checking whether the training data is representative (that is, whether one group is underrepresented) and whether there are differences in the label distribution across groups. After training or during deployment, metrics can be helpful to measure whether (and by how much) the performance of the model differs across groups. For example, start by comparing the error rates (how likely a model\'s prediction is to differ from the true label) or break further down into precision (how likely a positive prediction is to be correct) and recall (how likely the model will correctly label a positive example)."'), Document(page_content='How can I reproduce a feature from a given moment in time?, SageMaker Feature Store maintains time stamps for all features at every instance of time. This helps you retrieve features at any period of time for business or compliance requirements. You can easily explain model features and their values from when they were first created to the present time by reproducing the model from a given moment in time.\nWhat are offline features?," Offline features are used for training because you need access to very large volumes over a long period of time. 
These features are served from a high-throughput, high-bandwidth repository."\nWhat are online features?, Online features are used in applications required to make real-time predictions. Online features are served from a high-throughput repository with single-digit millisecond latency for fast predictions.'), Document(page_content='Why should I use SageMaker for shadow testing?," SageMaker simplifies the process of setting up and monitoring shadow variants so you can evaluate the performance of the new ML model on live production traffic. SageMaker eliminates the need for you to orchestrate infrastructure for shadow testing. It lets you control testing parameters such as the percentage of traffic mirrored to the shadow variant and the duration of the test. As a result, you can start small and increase the inference requests to the new model after you gain confidence in model performance. SageMaker creates a live dashboard displaying performance differences across key metrics, so you can easily compare model performance to evaluate how the new model differs from the production model."')]

--- 
## Using LangChain for Orchestration of RAG

Beyond the primitive classes for prompt handling and conversational memory management, LangChain also provides a framework for [orchestrating RAG flows](https://python.langchain.com/docs/expression_language/cookbook/retrieval) with purpose-built "chains". In this section, we will see how to build a retrieval chain with LangChain that is more comprehensive and robust than the original retrieval system we built above.

The workflow we used above follows this process:

1. User input is received.
2. User input is queried against the vector database to retrieve relevant documents.
3. Relevant documents and chat memory are inserted into a new prompt to respond to the user input.
4. Return to step 1.

However, more complex methods of handling the user input can generate more accurate results in RAG architectures. One popular mechanism for increasing the accuracy of these retrieval systems is to use more than one LLM call, so that the user input is first reformatted into a more effective search query for your vector database. A better workflow than the one we already built is described below:

1. User input is received.
2. An LLM is used to reword the user input to be a better search query for the vector database based on the chat history and other instructions. This could include things like condensing, rewording, addition of chat context, or stylistic changes.
3. Reformatted user input is queried against the vector database to retrieve relevant documents.
4. The reformatted user input and relevant documents are inserted into a new prompt in order to answer the user question.
5. Return to step 1.
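
The improved loop above can be sketched in a framework-free way. Everything here is a hypothetical stub: `condense`, `retrieve`, and `answer` stand in for the two LLM calls and the vector search, and `rag_turn` is an illustrative name rather than a LangChain function.

```python
# Framework-free sketch of the five-step loop; all callables are hypothetical stubs.

def rag_turn(user_input, history, condense, retrieve, answer):
    """Run one conversational turn and return (reply, standalone_question)."""
    standalone = condense(history, user_input)  # step 2: rewrite the input
    documents = retrieve(standalone)            # step 3: vector search
    reply = answer(standalone, documents)       # step 4: answer from context
    history.append((user_input, reply))         # remember the turn before looping
    return reply, standalone

def condense(history, user_input):
    """Stub: append the previous question so the query stands alone."""
    topic = history[-1][0] if history else ""
    return f"{user_input} (regarding: {topic})" if topic else user_input

def retrieve(query):
    """Stub: keyword match in place of an embedding search."""
    return ["SageMaker Clarify detects bias."] if "imbalances" in query else []

def answer(question, documents):
    """Stub: echo the first document or admit the answer is unknown."""
    return documents[0] if documents else "Sorry, I don't know."

history = []
first = rag_turn("How can I check for imbalances?", history, condense, retrieve, answer)
second = rag_turn("What kind does it detect?", history, condense, retrieve, answer)
print(first)
print(second)
```

On its own, the follow-up "What kind does it detect?" retrieves nothing from the stub store; only the condensed, standalone version carries enough context to find the relevant document.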

Let's now build out this second workflow using LangChain below.

First we need to make a prompt that reformats the user input so that it works better as a search query against the vector database. We do this by providing the chat history along with some basic instructions to Claude and asking it to condense the input into a single standalone question.

In [42]:
condense_prompt = PromptTemplate.from_template("""\
<chat-history>
{chat_history}
</chat-history>

<follow-up-message>
{question}
</follow-up-message>

Human: Given the conversation above (between Human and Assistant) and the follow up message from Human, \
rewrite the follow up message to be a standalone question that captures all relevant context \
from the conversation. Answer only with the new question and nothing else.

Assistant: Standalone Question:""")
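
To inspect what the model receives, the sketch below fills a shortened copy of the condense template using plain `str.format`. The `Role: text` rendering of earlier turns is an assumption made for illustration; the chain may format the history differently.

```python
# Shortened, illustrative copy of the condense template.
CONDENSE_TEMPLATE = """\
<chat-history>
{chat_history}
</chat-history>

<follow-up-message>
{question}
</follow-up-message>

H: Rewrite the follow up message as a standalone question.

A: Standalone Question:"""

# Assumed rendering of prior turns as "Role: text" lines.
turns = [
    ("Human", "How can I check for imbalances in my model?"),
    ("Assistant", "Use SageMaker Clarify."),
]
chat_history = "\n".join(f"{role}: {text}" for role, text in turns)

rendered = CONDENSE_TEMPLATE.format(
    chat_history=chat_history,
    question="What kind does it detect?",
)
print(rendered)
```

Ending the prompt with `Standalone Question:` nudges the model to reply with only the rewritten question.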

The next prompt we need is the prompt which will answer the user's question based on the retrieved information. In this case, we provide specific instructions about how to answer the question as well as provide the context retrieved from the vector database.

In [43]:
respond_prompt = PromptTemplate.from_template("""\
<context>
{context}
</context>

Human: Given the context above, answer the question inside the <q></q> XML tags.

<q>{question}</q>

If the answer is not in the context say "Sorry, I don't know as the answer was not found in the context". Do not use any XML tags in the answer.

Assistant:""")

Now that we have our prompts set up, let's set up the conversational memory buffer just like we did earlier in the notebook. Notice how we inject an example human and assistant message in order to help guide our AI assistant on what its job is.

In [44]:
llm = Bedrock(
    client=boto3_bedrock,
    model_id="anthropic.claude-instant-v1",
    model_kwargs={"max_tokens_to_sample": 500, "temperature": 0.9}
)
memory_chain = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True,
    human_prefix="Human",
    ai_prefix="Assistant"
)
memory_chain.chat_memory.add_user_message(
    'Hello, what are you able to do?'
)
memory_chain.chat_memory.add_ai_message(
    'Hi! I am a help chat assistant which can answer questions about Amazon SageMaker.'
)

Lastly, we will use the `ConversationalRetrievalChain` from LangChain to orchestrate this whole system. If you would like to see more logs about what is happening in the orchestration, and not just the final output, change the `verbose` argument to `True`.

In [45]:
from langchain.chains import ConversationalRetrievalChain
qa = ConversationalRetrievalChain.from_llm(
    llm=llm, # this is our claude model
    retriever=vs.as_retriever(), # this is our FAISS vector database
    memory=memory_chain, # this is the conversational memory storage class
    condense_question_prompt=condense_prompt, # this is the prompt for condensing user inputs
    verbose=False, # change this to True in order to see the logs working in the background
)
qa.combine_docs_chain.llm_chain.prompt = respond_prompt # this is the prompt in order to respond to condensed questions

Let's go ahead and generate some responses from our RAG solution!

In [46]:
display(Markdown(f"{qa.run({'question': 'How can I check for imbalances in my model?'})}"))

 You can check for imbalances in your model when using Amazon SageMaker in several ways:

Amazon SageMaker Clarify helps improve model transparency by detecting statistical bias across the entire ML workflow. SageMaker Clarify checks for imbalances during data preparation, after training, and ongoing over time, and also includes tools to help explain ML models and their predictions. Findings can be shared through explainability reports.

In [47]:
display(Markdown(f"{qa.run({'question': 'What kind does it detect?' })}"))

 SageMaker Clarify checks for imbalances in the following types:

Representativeness - Whether one group is underrepresented in the training data. 
Label distribution - Whether there are differences in the label distribution across groups.
Performance - Whether the performance of the model differs across groups, such as differences in error rates, precision, or recall across groups.

In [48]:
display(Markdown(f"{qa.run({'question': 'How does this improve model explainability?' })}"))

 Detecting imbalances in representativeness, label distribution, and performance across groups with Amazon SageMaker Clarify can help explain predictions from a model by highlighting potential unfairness or biases in the model's behavior. Identifying differences in how the model performs on different groups is an important part of understanding a model's limitations and ensuring it does not discriminate.

## Let us use an LLM to validate whether the response was factual

#### We first create a sanity prompt, which uses the vector DB results and asks the LLM to validate whether the response given was accurate

In [50]:
# create sanity check prompt
from langchain.chains.question_answering import load_qa_chain
from langchain import PromptTemplate

def create_sanity_prompt(instruction_start: str = None, instruction_end: str = None) -> PromptTemplate:
    """
    Create a prompt template for LLM sanity check

    Parameters
    ----------
    instruction_start : str, optional
        Instruction at the beginning of the prompt, by default None
    instruction_end : str, optional
        Instruction at the end of the prompt, by default None

    Returns
    -------
    PromptTemplate
        Prompt template in the LangChain format
    """

    # first instruction
    prompt_template_build = instruction_start + "\n"

    # add context
    prompt_template_build += "Context: {context}" + "\n"

    # add statement
    prompt_template_build += "Statement: {statement}" + "\n"

    # add instruction
    prompt_template_build += "Question: " + instruction_end + "\n"

    # add answer placeholder
    prompt_template_build += "Answer:"

    print(prompt_template_build)
    # build the template
    llm_prompt = PromptTemplate(
        template=str(prompt_template_build),
        input_variables=["context", "statement"],
    )
    return llm_prompt

sanity_prompt = create_sanity_prompt(
    instruction_start="""The following is a conversation between a highly knowledgeable and intelligent AI assistant, called Falcon, and a human user asking Questions. In the following interactions, Falcon will converse in natural language, and Falcon will answer the questions based only on the provided Context. Falcon will provide accurate, short and direct answers to the questions.""",
    instruction_end="Is the above statement based directly on the provided context? Answer with yes or no.",
)

docs = vs.similarity_search_with_score('How can I check for imbalances in my model?')
contexts = []
source = []
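for doc, score in docs:
    print(f"Content: {doc.page_content}, Metadata: {doc.metadata}, Score: {score}")
    # NOTE: the score is a distance (lower is closer). The scores printed earlier in this
    # notebook are in the hundreds, so a 0.9 cutoff may exclude every document; tune it for your embeddings.
    if score <= 0.9:
        contexts.append(doc)
        # use .get() because some documents in this index carry no 'source' metadata
        source.append(doc.metadata.get('source'))
        print(f"\n INPUT CONTEXT:{contexts}")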

sanity_chain = load_qa_chain(llm=llm, prompt=sanity_prompt)
sanity_check = sanity_chain({"input_documents": contexts, "statement": output},return_only_outputs=True)['output_text']

sanity_check

The following is a conversation between a highly knowledgeable and intelligent AI assistant, called Falcon, and a human user asking Questions. In the following interactions, Falcon will converse in natural language, and Falcon will answer the questions based only on the provided Context. Falcon will provide accurate, short and direct answers to the questions.
Context: {context}
Statement: {statement}
Question: Is the above statement based directly on the provided context? Answer with yes or no.
Answer:
Content: What kind of bias does SageMaker Clarify detect?," Measuring bias in ML models is a first step to mitigating bias. Bias may be measured before training and after training, as well as for inference for a deployed model. Each measure of bias corresponds to a different notion of fairness. Even considering simple notions of fairness leads to many different measures applicable in various contexts. You must choose bias notions and metrics that are valid for the application and the sit

' Yes'

#### We can see the Vector database responded accurately to our query

In [51]:
sanity_check

' Yes'
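
Because the model returns free text such as `' Yes'`, it can help to normalize the verdict before acting on it. The helper below is a minimal sketch; `parse_yes_no` is a hypothetical name, and treating anything other than a leading "yes" or "no" as undecided is a design assumption.

```python
def parse_yes_no(raw: str):
    """Map a free-text LLM verdict to True/False, or None when it is unclear."""
    words = raw.strip().lower().split()
    # Look only at the first word and drop surrounding punctuation.
    first = words[0].strip(".,!:;\"'") if words else ""
    if first == "yes":
        return True
    if first == "no":
        return False
    return None

print(parse_yes_no(" Yes"))
print(parse_yes_no("No, it is not."))
print(parse_yes_no("Maybe"))
```

Returning `None` for unclear verdicts lets the caller decide whether to retry, fall back, or flag the response for review.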

--- 
## Using LlamaIndex for Orchestration of RAG

Another popular open source framework for orchestrating RAG is [LlamaIndex](https://gpt-index.readthedocs.io/en/latest/index.html). Let's take a look below at how to use our SageMaker FAQ vector index to have a conversational RAG application with LlamaIndex.

In [52]:
from IPython.display import Markdown, display
from langchain.embeddings.bedrock import BedrockEmbeddings
from langchain.llms.bedrock import Bedrock

from llama_index import ServiceContext

First we need to set up the service context, which defines the embedding model and the LLM. Again, we will be using Titan and Claude respectively.

In [53]:
embed_model = BedrockEmbeddings(client=boto3_bedrock, model_id="amazon.titan-embed-text-v1")
llm = Bedrock(
    client=boto3_bedrock,
    model_id="anthropic.claude-instant-v1",
    model_kwargs={
        "max_tokens_to_sample": 500,
        "temperature": 0.9,
    },
)
service_context = ServiceContext.from_defaults(
    llm=llm, embed_model=embed_model, chunk_size=512
)

The next step would be to create a FAISS index from our document base. In this lab, this is already done for you and stored in the [faiss-index/llama-index/](../faiss-index/llama-index/) folder.

If you are interested in how this was accomplished, follow [this tutorial](https://gpt-index.readthedocs.io/en/latest/examples/vector_stores/FaissIndexDemo.html) from LlamaIndex. The code below shows the basics of how this was accomplished as well.

```python
# faiss_index = faiss.IndexFlatL2(1536)
# vector_store = FaissVectorStore(faiss_index=faiss_index)
# documents = SimpleDirectoryReader("./../data/sagemaker").load_data()
# storage_context = StorageContext.from_defaults(vector_store=vector_store)
# index = VectorStoreIndex.from_documents(
#     documents, storage_context=storage_context, service_context=service_context
# )
# index.storage_context.persist('../faiss-index/llama-index/')
```

Once the index is created, we can load the persisted files into a `FaissVectorStore` object and create a `query_engine` from the vector index. To learn more about indices in LlamaIndex, read more [here](https://gpt-index.readthedocs.io/en/latest/understanding/indexing/indexing.html).

In [54]:
from llama_index import load_index_from_storage, StorageContext
from llama_index.vector_stores.faiss import FaissVectorStore

vector_store = FaissVectorStore.from_persist_dir("./faiss-index/llama-index")
storage_context = StorageContext.from_defaults(
    vector_store=vector_store, persist_dir="./faiss-index/llama-index"
)
index = load_index_from_storage(storage_context=storage_context, service_context=service_context)
query_engine = index.as_query_engine()

Now let's set up a retrieval based chat application similar to LangChain. We will use the same condensing question strategy as before and can reuse the same prompt to condense the question for vector searching. Notice how we include some custom chat history to inject context into the prompt for the model to understand what we are asking questions about. The resulting `chat_engine` object is now fully ready to chat about our documents.

In [55]:
from llama_index.prompts  import PromptTemplate
from llama_index.llms import ChatMessage, MessageRole
from llama_index.chat_engine.condense_question import CondenseQuestionChatEngine

custom_prompt = PromptTemplate("""\
<chat-history>
{chat_history}
</chat-history>

<follow-up-message>
{question}
</follow-up-message>

Human: Given the conversation above (between Human and Assistant) and the follow up message from Human, \
rewrite the message to be a standalone question that captures all relevant context \
from the conversation. Answer only with the new question and nothing else.

Assistant: Standalone Question:""")

custom_chat_history = [
    ChatMessage(
        role=MessageRole.USER,
        content='Hello assistant, I have some questions about using Amazon SageMaker today.'
    ),
    ChatMessage(
        role=MessageRole.ASSISTANT,
        content='Okay, sounds good.'
    )
]

query_engine = index.as_query_engine()
chat_engine = CondenseQuestionChatEngine.from_defaults(
    query_engine=query_engine,
    condense_question_prompt=custom_prompt,
    chat_history=custom_chat_history,
    service_context=service_context,
    verbose=True
)

Let's go ahead and ask our first question. Notice that the verbose `chat_engine` will print out the condensed question as well.

In [56]:
response = chat_engine.chat("How can I check for imbalances in my model?")

Querying with:  When using Amazon SageMaker, how can I check for imbalances in the model I want to train during our previous discussion?


In [57]:
display(Markdown(f"{response}"))

 Based on the context provided, here is how you can check for imbalances in the model you want to train when using Amazon SageMaker:

Amazon SageMaker Clarify helps improve model transparency by detecting statistical bias across the entire ML workflow. With SageMaker Clarify, you can check for imbalances during data preparation, after training your model, and ongoing over time. 

Specifically, SageMaker Clarify provides the following capabilities to detect and help mitigate bias:

- It checks whether the training data is representative and whether there are differences in the label distribution across groups, before training the model. This helps identify potential data imbalances.

- After training your model, it measures whether and by how much the model's performance (e.g. error rates, precision, recall) differs across groups. This identifies if the trained model exhibits statistical biases. 

- During deployment, it continues monitoring for biases or differences in model performance across groups over time.

SageMaker Clarify integrates with SageMaker Experiments to provide detailed feature importance graphs and enable explanations of individual predictions through an API. This improves the explainability of any biases detected.

So in summary, to check for imbalances in your SageMaker model, you can leverage the bias detection and explainability capabilities of Amazon SageMaker Clarify.

Now we can ask follow-up questions, and the chat engine will answer them with the conversational context in mind!

In [58]:
response = chat_engine.chat("How does this improve model explainability?")

Querying with:  How does Amazon SageMaker Clarify's capabilities for detecting biases in models through feature importance graphs and explanations of individual predictions through an API improve the explainability of any biases detected in a SageMaker model as discussed?


In [59]:
display(Markdown(f"{response}"))

 Based on the context provided, SageMaker Clarify improves model explainability by helping detect potential biases in models and providing explanations for those biases:

- SageMaker Clarify integrates with SageMaker Experiments to provide feature importance graphs after model training. These graphs detail the importance of each input feature for the model's overall predictions. This allows developers to determine if any features have more influence than they should, which could indicate potential biases. 

- For example, the graphs may show that a feature like gender has a high influence on the model's predictions, even when it should not. This helps detect potential biases.

- SageMaker Clarify also provides explanations for individual predictions through an API. These explanations shed light on why the model made a particular prediction for a given example. 

- By examining these individual explanations, developers can further analyze any potential biases detected in the feature importance graphs. The explanations help validate and provide context around those initial bias findings.

- Together, the feature importance graphs and individual prediction explanations improve model explainability by allowing developers to both detect potential biases and then understand the reasons behind them, through examining feature influences and prediction rationales. This helps address and remedy any issues to ensure the model is fair and unbiased.
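
The two turns above follow the same loop: condense the new message against the history, query with the condensed question, then record the exchange so the next turn can build on it. The toy class below sketches that loop under stated assumptions. `ToyCondenseChatEngine`, `fake_condense`, and `fake_query` are hypothetical stand-ins for the LLM and the query engine, not the llama-index implementation.

```python
from typing import Callable, List, Tuple

Message = Tuple[str, str]  # (role, content)


class ToyCondenseChatEngine:
    """Illustrative stand-in for a condense-question chat loop."""

    def __init__(
        self,
        condense: Callable[[List[Message], str], str],
        query: Callable[[str], str],
    ) -> None:
        self.condense = condense
        self.query = query
        self.history: List[Message] = []

    def chat(self, message: str) -> str:
        # 1. Rewrite the message as a standalone question using the history.
        standalone = self.condense(self.history, message)
        # 2. Answer the standalone question (retrieval + generation in a real system).
        answer = self.query(standalone)
        # 3. Record the turn so later questions can refer back to it.
        self.history.append(("user", message))
        self.history.append(("assistant", answer))
        return answer


def fake_condense(history: List[Message], message: str) -> str:
    # Stub: a real engine would call an LLM here.
    if not history:
        return message
    return f"{message} (context: {history[-1][1]})"


def fake_query(question: str) -> str:
    # Stub: a real engine would retrieve documents and generate an answer.
    return f"answer to [{question}]"


toy_engine = ToyCondenseChatEngine(fake_condense, fake_query)
first_answer = toy_engine.chat("How can I check for imbalances in my model?")
second_answer = toy_engine.chat("How does this improve model explainability?")
print(len(toy_engine.history))
```

Because the second call sees the first exchange in `history`, its condensed question carries the earlier answer as context, which is the behavior the `Querying with:` output shows for the real engine.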

---
## Next steps

Now that we have a working RAG application with vector search retrieval, we will explore a new type of retrieval. In the next notebook, we will see how to use LLM agents to automatically retrieve information from APIs.