<center><img src="images/2024_reInvent_Logo_wDate_Black_V3.png" alt="drawing" width="400" style="background-color:white; padding:1em;" /></center> <br/>

# <a name="0">re:Invent 2024 | Lab 1: Build your RAG powered chatbot  </a>
## <a name="0">Build a chatbot with Knowledge Bases and Guardrails to detect and remediate hallucinations </a>

## Lab Overview
In this lab, you will:
1. Take a deeper look at which LLM parameters influence or control for model hallucinations
2. Set up Retrieval Augmented Generation and understand how it can control for hallucinations
3. Apply contextual grounding in Amazon Bedrock Guardrails to intervene when a model hallucinates
4. Use RAGAS evaluation and understand which metrics help us measure hallucinations

## Dataset
For this workshop, we will use the [Bedrock User Guide](https://docs.aws.amazon.com/pdfs/bedrock/latest/userguide/bedrock-ug.pdf) available as a PDF file.

## Use-Case Overview
In this lab, we want to develop a chatbot that can answer questions about Amazon Bedrock as factually as possible. We will set up Retrieval Augmented Generation using [Amazon Bedrock Knowledge Bases](https://aws.amazon.com/bedrock/knowledge-bases/) and apply [Amazon Bedrock Guardrails](https://aws.amazon.com/bedrock/guardrails/) to intervene when hallucinations are detected.


#### Lab Sections

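This lab notebook has the following sections:

1. Environment Setup
2. Chat with Anthropic Claude 3 Sonnet through Bedrock
3. Retrieval Augmented Generation
4. Contextual Grounding with Amazon Bedrock Guardrails
5. Evaluating RAG with RAGAS

Please work through this notebook from top to bottom and don't skip sections, because later cells depend on variables and functions defined in earlier ones.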


----

# Star the GitHub repository for future reference

In [None]:
%%html

<a class="github-button" href="https://github.com/aws-samples/responsible_ai_aim325_reduce_hallucinations_for_genai_apps" data-color-scheme="no-preference: light; light: light; dark: dark;" data-icon="octicon-star" data-size="large" data-show-count="true" aria-label="Star Reduce Hallucinations workshop on GitHub">Star</a>
<script async defer src="https://buttons.github.io/buttons.js"></script>

# Environment Setup

In [2]:
# %pip install --upgrade --quiet pip sagemaker boto3 ragas==0.1.7 pydantic==2.6.1 langchain-core==0.1.40 langchain langchain-aws

In [3]:
%%capture
!pip3 install -r requirements.txt --quiet

In [4]:
# restart kernel
# from IPython.core.display import HTML
# HTML("<script>Jupyter.notebook.kernel.restart()</script>")

In [None]:
import time
import os
import json
import boto3
from time import gmtime, strftime, sleep
import pprint
import random
import zipfile

# from retrying import retry
from rag_setup.create_kb_utils import *
import warnings

warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import sagemaker
from botocore.exceptions import ClientError

(sagemaker.__version__, boto3.__version__)

## Set constants

In [None]:
# Get some variables you need to interact with the SageMaker service
boto_session = boto3.Session()
region = boto_session.region_name
# Create the SageMaker session once and reuse it
sm_session = sagemaker.Session()
bucket_name = sm_session.default_bucket()
bucket_prefix = "reduce-hallucinations-in-genai-apps"
sm_client = boto_session.client("sagemaker")
sm_role = sagemaker.get_execution_role()

initialized = True

print(sm_role)
print(bucket_name)

In [7]:
embedding_model_id = "amazon.titan-embed-text-v2:0"
llm_model_id = "anthropic.claude-3-sonnet-20240229-v1:0"

In [None]:
# Store some variables to keep the value between the notebooks
%store bucket_name
%store bucket_prefix
%store sm_role
%store region
%store initialized

In [None]:
# test if bedrock model access has been enabled
input_prompt = "Who was the first person to land on the sun?"
test_llm_call(input_prompt)

# 1. Chat with Anthropic Claude 3 Sonnet through Bedrock

In [9]:
bedrock_runtime = boto3.client(service_name="bedrock-runtime")


def generate_message_claude(
    query,
    system_prompt="",
    max_tokens=1000,
    model_id="anthropic.claude-3-sonnet-20240229-v1:0",
    temperature=0.9,
    top_p=0.99,
    top_k=100,
):
    # Prompt with user turn only.
    user_message = {"role": "user", "content": query}
    messages = [user_message]
    body = json.dumps(
        {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": max_tokens,
            "system": system_prompt,
            "messages": messages,
            "temperature": temperature,
            "top_p": top_p,
            "top_k": top_k,
        }
    )

    response = bedrock_runtime.invoke_model(body=body, modelId=model_id)
    response_body = json.loads(response.get("body").read())
    return response_body

In [None]:
query = "How do Amazon Bedrock Guardrails work?"

response = generate_message_claude(query)
print("User turn only.")
print(json.dumps(response, indent=4))

## 1.1 Apply System Prompt

In [None]:
query = "Is it possible to purchase provisioned throughput for Anthropic Claude models on Amazon Bedrock?"
system_prompt = "You are a helpful AI assistant. You try to answer the user queries to the best of your knowledge. If you are unsure of the answer, do not make up any information."

response = generate_message_claude(query, system_prompt)
print("User turn with system prompt.")
print(json.dumps(response, indent=4))

In [None]:
query = "How do Amazon Bedrock Guardrails work?"
system_prompt = "You are a helpful AI assistant. You try to answer the user queries to the best of your knowledge. If you are unsure of the answer, do not make up any information."

response = generate_message_claude(query, system_prompt)
print("User turn with system prompt.")
print(json.dumps(response, indent=4))

## 1.2 Understanding LLM generation parameters
### 1. Temperature: The amount of randomness injected into the response.
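To build intuition for what temperature does, here is a minimal sketch of temperature-scaled softmax over a few made-up token scores. The scores and the function name are illustrative assumptions; this is the textbook formulation, not a description of how Claude is implemented internally.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Temperature 0 is usually treated as greedy decoding; not modeled here.
        raise ValueError("temperature must be > 0 in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.1]
low_t = softmax_with_temperature(logits, 0.2)
high_t = softmax_with_temperature(logits, 1.0)
print([round(p, 3) for p in low_t])
print([round(p, 3) for p in high_t])
```

A lower temperature concentrates probability on the top-scoring token, which makes outputs more repeatable; a higher temperature spreads probability across more tokens.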

In [None]:
query = "What is Amazon Bedrock?"
system_prompt = "You are a helpful AI assistant. You try to answer the user queries to the best of your knowledge. If you are unsure of the answer, do not make up any information."

response = generate_message_claude(query, system_prompt, temperature=1)
print("User turn with system prompt.")
print(json.dumps(response, indent=4))

In [None]:
query = "What is Amazon Bedrock?"
system_prompt = "You are a helpful AI assistant. You try to answer the user queries to the best of your knowledge. If you are unsure of the answer, do not make up any information."

response = generate_message_claude(query, system_prompt, temperature=0)
print("User turn with system prompt.")
print(json.dumps(response, indent=4))

### 2. top_p: Use nucleus sampling.

In nucleus sampling, Anthropic Claude computes the cumulative distribution over all the options for each subsequent token in decreasing probability order and cuts it off once it reaches a particular probability specified by top_p. You should alter either temperature or top_p, but not both.
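The cut-off described above can be sketched as follows. The token probabilities and the helper name `top_p_filter` are hypothetical, chosen only to illustrate the idea; the real model works over its full vocabulary.

```python
def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1.
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Hypothetical next-token distribution.
probs = {"service": 0.5, "platform": 0.3, "database": 0.15, "planet": 0.05}
nucleus = top_p_filter(probs, top_p=0.9)
print(nucleus)
```

With `top_p=0.9`, the unlikely token `planet` is dropped before sampling, while `top_p=1` keeps every token.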

In [None]:
query = "What is Amazon Bedrock?"
system_prompt = "You are a helpful AI assistant. You try to answer the user queries to the best of your knowledge. If you are unsure of the answer, do not make up any information."

response = generate_message_claude(query, system_prompt, temperature=1, top_p=1)
print("User turn with system prompt.")
print(json.dumps(response, indent=4))

### 3. top_k: Only sample from the top K options for each subsequent token.

Use top_k to remove long-tail, low-probability responses.
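As a rough illustration, top_k keeps a fixed number of candidates regardless of how much probability mass they cover. The distribution and the helper name `top_k_filter` below are assumptions made for this sketch.

```python
def top_k_filter(probs, top_k):
    """Keep only the top_k most likely tokens and renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    return {token: p / total for token, p in ranked}


# Hypothetical next-token distribution.
probs = {"service": 0.5, "platform": 0.3, "database": 0.15, "planet": 0.05}
top_two = top_k_filter(probs, top_k=2)
print(top_two)
```

Unlike top_p, the number of surviving tokens here does not adapt to the shape of the distribution.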

In [None]:
query = "What is Amazon Bedrock?"
system_prompt = "You are a helpful AI assistant. You try to answer the user queries to the best of your knowledge. If you are unsure of the answer, do not make up any information."

response = generate_message_claude(
    query, system_prompt, temperature=0, top_p=1, top_k=100
)
print("User turn with system prompt.")
print(json.dumps(response, indent=4))

# Retrieval Augmented Generation
We are using the Retrieval Augmented Generation (RAG) technique with Amazon Bedrock. A RAG implementation consists of two parts:

1. A data pipeline that ingests data from documents (typically stored in Amazon S3) into a Knowledge Base, i.e. a vector database such as Amazon OpenSearch Serverless (AOSS), so that it is available for lookup when a question is received.

The data pipeline is undifferentiated heavy lifting that Amazon Bedrock Knowledge Bases can handle for you. You connect an S3 bucket to a vector database such as AOSS, and Bedrock Knowledge Bases reads the objects (HTML, PDF, text, etc.), chunks them, converts the chunks into embeddings using the Amazon Titan Embeddings model, and stores the embeddings in AOSS, all without you having to build, deploy, and manage the pipeline.

<center><img src="images/fully_managed_ingestion.png" alt="This image shows how Amazon Bedrock Knowledge Bases ingests objects in an S3 bucket into the Knowledge Base for use in a RAG setup. The objects are chunked, embedded, and then stored in a vector index." height="700" width="700" style="background-color:white; padding:1em;" /></center> <br/>
    

2. An application that receives a question from the user, looks up the knowledge base for relevant pieces of information (context), creates a prompt that includes the question and the context, and provides it to an LLM for generating a response.






Once the data is available in the Bedrock knowledge base, then user questions can be answered using the following system design:

<center><img src="images/retrieveAndGenerate.png" alt="This image shows the retrieval augmented generation (RAG) system design with knowledge bases, S3, and AOSS. The knowledge corpus is ingested into a vector database using Amazon Bedrock Knowledge Bases, and the RAG approach is then used for question answering. The question is converted into embeddings, followed by a semantic similarity search to get similar documents. With the user prompt augmented by the search results, the LLM is invoked to get the final response for the user." height="700" width="700" style="background-color:white; padding:1em;" /></center> <br/>
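The query-time flow in the diagram (embed the question, find similar chunks, augment the prompt) can be sketched in a few lines. Everything here is a simplified stand-in: the bag-of-words `embed` function replaces a real embedding model such as Amazon Titan Embeddings, the in-memory list replaces the vector index, and the sample chunks are illustrative text rather than excerpts from the user guide.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(question, context_chunks):
    """Augment the user question with the retrieved context."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Illustrative chunks; a real knowledge base would hold chunks of the PDF.
chunks = [
    "Amazon Bedrock is a fully managed service that offers foundation models through an API.",
    "Guardrails can filter harmful content and check contextual grounding.",
    "Provisioned Throughput is purchased per model in model units.",
]
question = "What is Amazon Bedrock?"
top_chunks = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

In the managed setup used in this lab, these steps are handled for you by the knowledge base and the model invocation.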


# Data
Let's use the publicly available [Bedrock user guide](https://docs.aws.amazon.com/pdfs/bedrock/latest/userguide/bedrock-ug.pdf) as the knowledge source for the model.

In [None]:
!wget -P data/ -N https://docs.aws.amazon.com/pdfs/bedrock/latest/userguide/bedrock-ug.pdf --no-check-certificate

In [None]:
# Upload data to S3
dataset_file_local_path = "data/bedrock-ug.pdf"
input_s3_url = sagemaker.Session().upload_data(
    path=dataset_file_local_path, bucket=bucket_name
)
print(f"Uploaded the dataset to {input_s3_url}")

%store input_s3_url

# Steps

1. Create an Amazon Bedrock Knowledge Base execution role with the necessary policies for accessing data from S3 and writing embeddings into AOSS.
2. Create an empty Amazon OpenSearch Serverless index.
3. Create an Amazon Bedrock knowledge base.
4. Create a data source within the knowledge base which will connect to Amazon S3.
5. Start an ingestion job using the KB APIs, which will read data from S3, chunk it, convert the chunks into embeddings using the Amazon Titan Embeddings model, and then store these embeddings in AOSS.
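The helper used later in this notebook wraps these steps. As a rough sketch of what step 5 can look like on its own, the function below starts an ingestion job and polls until it finishes, using the `start_ingestion_job` and `get_ingestion_job` operations of the boto3 `bedrock-agent` client. The function name, the polling interval, and the idea of passing the client in as an argument are assumptions made for this example, not the workshop's actual implementation.

```python
import time


def run_ingestion_job(bedrock_agent_client, kb_id, data_source_id, poll_seconds=10):
    """Start an ingestion job and poll until it reaches a terminal status."""
    job = bedrock_agent_client.start_ingestion_job(
        knowledgeBaseId=kb_id, dataSourceId=data_source_id
    )["ingestionJob"]
    while job["status"] not in ("COMPLETE", "FAILED", "STOPPED"):
        time.sleep(poll_seconds)
        job = bedrock_agent_client.get_ingestion_job(
            knowledgeBaseId=kb_id,
            dataSourceId=data_source_id,
            ingestionJobId=job["ingestionJobId"],
        )["ingestionJob"]
    return job


# Example usage (requires AWS credentials and existing resources):
# import boto3
# client = boto3.client("bedrock-agent")
# job = run_ingestion_job(client, kb_id, data_source_id)
# print(job["status"])
```

Polling on the job status avoids guessing how long ingestion will take.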

In [21]:
# `!export` runs in a separate subshell and does not affect this kernel,
# so extend the import path from Python instead.
import sys

sys.path.insert(0, "./lab1/")

In [22]:
kb_db_file_uri = "data"

# if a kb already exists we can use the same, else the infra setup code will create one by itself using the bedrock user guide.
use_existing_kb = False
existing_kb_id = None

In [23]:
%load_ext autoreload
%autoreload 2
from rag_setup.create_kb_utils import *

In [None]:
%%time

# For a new KB, this setup takes about 6 minutes to complete on a t2.medium instance.
infra_response = setup_knowledge_base(
    bucket_name, kb_db_file_uri, use_existing_kb, existing_kb_id
)
infra_response

In [None]:
kb_id = infra_response["knowledge_base_db_id"]
random_id = infra_response["prefix_infra"]
# keep the kb_id for invocation later in the invoke request
%store kb_id
%store bucket_name

In [None]:
kb_id

In [27]:
# allow time for KB to be ready
time.sleep(180)

# Chat with the model using the knowledge base by providing the generated KB_ID
### Using RetrieveAndGenerate API
Behind the scenes, the RetrieveAndGenerate API converts the query into embeddings, searches the knowledge base, augments the foundation model prompt with the search results as context, and returns the FM-generated response to the question. For multi-turn conversations, Knowledge Bases manages short-term memory of the conversation to provide more contextual results. The output of the RetrieveAndGenerate API includes the generated response, source attribution, and the retrieved text chunks.
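Since the response bundles the answer with its citations, a small helper can pull both out in one place. The sketch below assumes the response shape used in this notebook (`output.text` and `citations[].retrievedReferences[].content.text`); the helper name and the sample response are hypothetical and written by hand for illustration.

```python
def extract_answer_and_contexts(response):
    """Return the generated text and the de-duplicated retrieved chunks."""
    answer = response["output"]["text"]
    contexts = []
    for citation in response.get("citations", []):
        for reference in citation.get("retrievedReferences", []):
            text = reference["content"]["text"]
            if text not in contexts:
                contexts.append(text)
    return answer, contexts


# Hand-written response in the expected shape, for illustration only.
sample_response = {
    "output": {"text": "Amazon Bedrock is a fully managed service."},
    "citations": [
        {"retrievedReferences": [{"content": {"text": "chunk A"}}]},
        {"retrievedReferences": []},
        {
            "retrievedReferences": [
                {"content": {"text": "chunk A"}},
                {"content": {"text": "chunk B"}},
            ]
        },
    ],
}
answer, contexts = extract_answer_and_contexts(sample_response)
print(answer)
print(contexts)
```

Iterating over every citation, rather than only the first, also handles responses where a citation has no retrieved references.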

In [28]:
pp = pprint.PrettyPrinter(indent=2)

In [None]:
kb_id

In [30]:
bedrock_agent_runtime_client = boto3.client("bedrock-agent-runtime", region_name=region)


def ask_bedrock_llm_with_knowledge_base(
    query,
    kb_id=kb_id,
    model_arn=llm_model_id,
) -> dict:
    response = bedrock_agent_runtime_client.retrieve_and_generate(
        input={"text": query},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": kb_id,
                "modelArn": model_arn,
            },
        },
    )

    return response

In [None]:
query = "What is Amazon Bedrock?"

response = ask_bedrock_llm_with_knowledge_base(query, kb_id)
generated_text = response["output"]["text"]
citations = response["citations"]
contexts = []
for citation in citations:
    retrievedReferences = citation["retrievedReferences"]
    for reference in retrievedReferences:
        contexts.append(reference["content"]["text"])
print(f"---------- Generated using Anthropic Claude 3 Sonnet:")
pp.pprint(generated_text)
print(f"---------- The citations for the response:")
pp.pprint(contexts)
print(kb_id)

In [None]:
query = "Is it possible to purchase provisioned throughput for Anthropic Claude Sonnet on Amazon Bedrock?"

response = ask_bedrock_llm_with_knowledge_base(query, kb_id)
generated_text = response["output"]["text"]
citations = response["citations"]
contexts = []
for citation in citations:
    retrievedReferences = citation["retrievedReferences"]
    for reference in retrievedReferences:
        contexts.append(reference["content"]["text"])
print(f"---------- Generated using Anthropic Claude 3 Sonnet:")
pp.pprint(generated_text)
print(f"---------- The citations for the response:")
pp.pprint(contexts)
print()

# Contextual Grounding with Amazon Bedrock Guardrails

In [None]:
# Create guardrail
bedrock_client = boto3.client("bedrock")
guardrail_name = f"bedrock-rag-grounding-guardrail-{random_id}"
print(guardrail_name)
guardrail_response = bedrock_client.create_guardrail(
    name=guardrail_name,
    description="Guardrail for ensuring relevance and grounding of model responses in RAG powered chatbot",
    contextualGroundingPolicyConfig={
        "filtersConfig": [
            {"type": "GROUNDING", "threshold": 0.5},
            {"type": "RELEVANCE", "threshold": 0.5},
        ]
    },
    blockedInputMessaging="Can you please rephrase your question?",
    blockedOutputsMessaging="Sorry, I am not able to find the correct answer to your query - Can you try reframing your query to be more specific",
)

In [None]:
guardrailId = guardrail_response["guardrailId"]
guardrail_response

In [None]:
guardrail_version = bedrock_client.create_guardrail_version(
    guardrailIdentifier=guardrail_response["guardrailId"],
    description="Working version of RAG app guardrail with higher thresholds for contextual grounding",
)
print(guardrail_version)
# Use the version that was just created; guardrail_response["version"] refers to the working draft.
guardrailVersion = guardrail_version["version"]
print(guardrailId)
%store guardrailId

In [36]:
# Retrieve and Generate using Guardrail

bedrock_agent_runtime_client = boto3.client("bedrock-agent-runtime", region_name=region)


def retrieve_and_generate_with_guardrail(
    query, kb_id, model_arn=llm_model_id, session_id=None
):

    # $query$ and $search_results$ are placeholders that Knowledge Bases fills in at request time.
    prompt_template = (
        "You are a helpful AI assistant that answers user questions about Amazon Bedrock. "
        "Answer the user query based on the context retrieved. If you don't know the answer, don't make up anything. "
        "Only answer based on what you know from the provided context. "
        "You can ask the user clarifying questions if anything is unclear, "
        "but generate an answer only when you are confident about it and it is based on the provided context.\n"
        "User Query: $query$\n"
        "Context: $search_results$"
    )

    request = {
        "input": {"text": query},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "generationConfiguration": {
                    "guardrailConfiguration": {
                        "guardrailId": guardrailId,
                        "guardrailVersion": guardrailVersion,
                    },
                    "inferenceConfig": {
                        "textInferenceConfig": {"temperature": 0.7, "topP": 0.25}
                    },
                    "promptTemplate": {"textPromptTemplate": prompt_template},
                },
                "knowledgeBaseId": kb_id,
                "modelArn": model_arn,
                "retrievalConfiguration": {
                    "vectorSearchConfiguration": {"overrideSearchType": "SEMANTIC"}
                },
            },
        },
    }
    # Pass the session ID only for follow-up turns; the first call omits it.
    if session_id:
        request["sessionId"] = session_id

    response = bedrock_agent_runtime_client.retrieve_and_generate(**request)
    return response

In [None]:
# Query the knowledge base (kb_id) with the guardrail applied

query = "What is Amazon Bedrock?"
# query = "Is it possible to purchase provisioned throughput for Anthropic Claude Sonnet on Amazon Bedrock?"

model_response = retrieve_and_generate_with_guardrail(query, kb_id)

print(model_response)

# Evaluating RAG with RAGAS

In [38]:
import boto3
import pprint
from botocore.client import Config
from langchain.llms.bedrock import Bedrock
from langchain_community.chat_models.bedrock import BedrockChat
from langchain.embeddings import BedrockEmbeddings
from langchain.retrievers.bedrock import AmazonKnowledgeBasesRetriever
from langchain.chains import RetrievalQA

pp = pprint.PrettyPrinter(indent=2)

bedrock_config = Config(
    connect_timeout=120, read_timeout=120, retries={"max_attempts": 0}
)
# Use a distinct name so the control-plane `bedrock_client` created earlier is not overwritten.
bedrock_runtime_client = boto3.client("bedrock-runtime")
bedrock_agent_client = boto3.client("bedrock-agent-runtime", config=bedrock_config)

llm_for_text_generation = BedrockChat(
    model_id=llm_model_id, client=bedrock_runtime_client
)

llm_for_evaluation = BedrockChat(model_id=llm_model_id, client=bedrock_runtime_client)

bedrock_embeddings = BedrockEmbeddings(
    model_id=embedding_model_id, client=bedrock_runtime_client
)

In [None]:
import pandas as pd

test = pd.read_csv("data/bedrock-user-guide-test.csv")
test = test.dropna()
test.style.set_properties(**{"text-align": "left", "border": "1px solid black"})
test.to_string(justify="left", index=False)
with pd.option_context("display.max_colwidth", None):
    pretty_print(test)

In [40]:
from datasets import Dataset

questions = test["Question/prompt"].tolist()
ground_truths = [[gt] for gt in test["Correct answer"].tolist()]

answers = []
contexts = []

for query in questions:
    response = ask_bedrock_llm_with_knowledge_base(query, kb_id)
    generatedResult = response["output"]["text"]
    answers.append(generatedResult)
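    # Collect references from every citation; indexing only citations[0]
    # misses later citations and fails when no citation is returned.
    contexts.append(
        [
            ref["content"]["text"]
            for citation in response["citations"]
            for ref in citation["retrievedReferences"]
        ]
    )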

# To dict
data = {
    "question": questions,
    "answer": answers,
    "contexts": contexts,
    "ground_truths": ground_truths,
}

# Convert dict to dataset
dataset = Dataset.from_dict(data)

In [None]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    context_entity_recall,
    answer_similarity,
    answer_correctness,
)

from ragas.metrics.critique import correctness

# specify the metrics here, kept one for now, we can add more.
metrics = [answer_relevancy]

result = evaluate(
    dataset=dataset,
    metrics=metrics,
    llm=llm_for_evaluation,
    embeddings=bedrock_embeddings,
)

ragas_df = result.to_pandas()

In [None]:
ragas_df.style.set_properties(**{"text-align": "left", "border": "1px solid black"})
ragas_df.to_string(justify="left", index=False)
with pd.option_context("display.max_colwidth", None):
    pretty_print(ragas_df)

### <a >Challenge Exercise :: Try it Yourself! </a>


<div style="border: 4px solid coral; text-align: left; margin: auto;">
    <br>
    <p style="text-align: center; margin: auto;"><b>Try the following exercises on this lab and note the observations.</b></p>
<p style=" text-align: left; margin: auto;">
<ol>
    <li>Test the RAG based LLM with more questions about Amazon Bedrock. </li>
<li>Look at the citations or retrieved references and see if the answer generated by the RAG chatbot aligns with these retrieved contexts. What response do you get when the retrieved context comes up empty? </li>
<li>Apply system prompts to RAG as well as Amazon Bedrock Guardrails and test which is more consistent in blocking responses when the model response is hallucinated. </li>
<li>Run the tutorial for RAG Checker and compare the difference with RAGAS evaluation framework: https://github.com/amazon-science/RAGChecker/blob/main/tutorial/ragchecker_tutorial_en.md </li>
</ol>
<br>
</p>
</div>


## Conclusion
We now have an understanding of the parameters that influence hallucinations in Large Language Models. We learnt how to set up Retrieval Augmented Generation to provide context to the model while answering.
We used contextual grounding in Amazon Bedrock Guardrails to intervene when hallucinations are detected.
Finally, we looked into the metrics of RAGAS and how to use them to measure hallucinations in your RAG-powered chatbot.

In the next lab, we will:
1. Build a custom hallucination detector
2. Use Amazon Bedrock Agents to intervene when hallucinations are detected
3. Call a human for support when the LLM hallucinates
