<center><img src="images/2024_reInvent_Logo_wDate_Black_V3.png" alt="drawing" width="400" style="background-color:white; padding:1em;" /></center> <br/>

# <a name="0">re:Invent 2024 | Lab 1: Build your RAG powered chatbot  </a>
## <a name="0">Build a chatbot with Knowledge Bases and Guardrails to detect and remediate hallucinations </a>

## Lab Overview
In this lab, you will:
1. Take a deeper look at which LLM parameters influence or control for model hallucinations
2. Understand how Retrieval Augmented Generation can control for hallucinations
3. Apply contextual grounding in Amazon Bedrock Guardrails to intervene when a model hallucinates
4. Use RAGAS evaluation and understand which metrics help us measure hallucinations

## Dataset
For this workshop, we will use the [Bedrock User Guide](https://docs.aws.amazon.com/pdfs/bedrock/latest/userguide/bedrock-ug.pdf) available as a PDF file.
## Use-Case Overview
In this lab, we want to develop a chatbot that can answer questions about Amazon Bedrock as factually as possible. We will work with Retrieval Augmented Generation using [Amazon Bedrock Knowledge Bases](https://aws.amazon.com/bedrock/knowledge-bases/) and apply [Amazon Bedrock Guardrails](https://aws.amazon.com/bedrock/guardrails/) to intervene when hallucinations are detected.


#### Lab Sections

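This lab notebook has the following sections:

1. Chat with Anthropic Claude 3 Sonnet through Bedrock
2. Retrieval Augmented Generation
3. Contextual Grounding Check with Amazon Bedrock Guardrails
4. Evaluating RAG with RAGAS

Please work through this notebook from top to bottom and don't skip sections, as later cells depend on code defined in earlier ones.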


----

# Star the GitHub repository for future reference

In [None]:
%%html

<a class="github-button" href="https://github.com/aws-samples/responsible_ai_aim325_reduce_hallucinations_for_genai_apps" data-color-scheme="no-preference: light; light: light; dark: dark;" data-icon="octicon-star" data-size="large" data-show-count="true" aria-label="Star Reduce Hallucinations workshop on GitHub">Star</a>
<script async defer src="https://buttons.github.io/buttons.js"></script>

# Environment Setup

In [None]:
%%capture
%pip install -r ../requirements.txt --quiet

In [None]:
# restart kernel
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

In [None]:
import time
import os
import json
import boto3
from time import gmtime, strftime, sleep
import random
import zipfile
import uuid
from rag_setup.create_kb_utils import *
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)
from botocore.config import Config

import numpy as np  
import pandas as pd 
import sagemaker
from botocore.exceptions import ClientError

import pprint
pp = pprint.PrettyPrinter(indent=4)

## Set constants

In [None]:
# Get the session and region needed to interact with AWS services
boto_session = boto3.Session()
region = boto_session.region_name

In [None]:
embedding_model_id="amazon.titan-embed-text-v2:0"
llm_model_id="anthropic.claude-3-sonnet-20240229-v1:0"

# 1. Chat with Anthropic Claude 3 Sonnet through Bedrock

In [None]:
RETRY_CONFIG = Config(
    retries={
        'max_attempts': 5,            # Maximum number of retry attempts
        'mode': 'adaptive'            # Adaptive mode adjusts based on request limits
    },
    read_timeout=1000,
    connect_timeout=1000
)

bedrock_runtime = boto3.client(
    service_name='bedrock-runtime',
    region_name=region,
    config=RETRY_CONFIG)

def generate_message_claude(
    query, 
    system_prompt="", 
    max_tokens=1000,
    model_id='anthropic.claude-3-sonnet-20240229-v1:0',
    temperature=1,
    top_p=0.999,
    top_k=250
):
    # Prompt with user turn only.
    user_message = {"role": "user", "content": query}
    messages = [user_message]
    body = json.dumps(
        {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": max_tokens,
            "system": system_prompt,
            "messages": messages,
            "temperature": temperature,
            "top_p": top_p,
            "top_k": top_k
        }
    )

    response = bedrock_runtime.invoke_model(body=body, modelId=model_id)
    response_body = json.loads(response.get('body').read())
    return response_body['content'][0]['text']

In [None]:
query = 'How does Amazon Bedrock Guardrails work?'

response = generate_message_claude(query)
pp.pprint(response)

##### The LLM might hallucinate on what Amazon Bedrock Guardrails are if it does not have the right context to produce a factual answer. [Amazon Bedrock Guardrails](https://aws.amazon.com/bedrock/guardrails/) helps to implement safeguards customized to your application requirements and your responsible AI policies.

<div style="border: 4px solid coral; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px">
    <h4>If the LLM call to Bedrock did not work, enable model access on the Amazon Bedrock console</h4>
</div>
<br/>

## 1.1 Apply System Prompt

In [None]:
# Adjacent string literals are concatenated; the trailing space keeps the two sentences apart.
system_prompt = ('You are a helpful AI assistant. You try to answer the user queries to the best of your knowledge. '
                 'If you are unsure of the answer, do not make up any information.')

In [None]:
query = 'Is it possible to purchase provisioned throughput for Anthropic Claude models on Amazon Bedrock?'

response = generate_message_claude(query, system_prompt)
pp.pprint(response)

In [None]:
query = 'How do Amazon Bedrock Guardrails work?'

response = generate_message_claude(query, system_prompt)
pp.pprint(response)

## 1.2 Understanding LLM generation parameters
### 1. Temperature

Affects the shape of the probability distribution for the predicted output and influences the likelihood of the model selecting lower-probability outputs.

- Choose a lower value to influence the model to select higher-probability outputs.

- Choose a higher value to influence the model to select lower-probability outputs.
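To make the effect of temperature concrete, here is a minimal sketch of temperature-scaled softmax over a few made-up next-token scores. The `logits` values and the helper name `softmax_with_temperature` are illustrative assumptions, not part of the Bedrock API, and the sketch only handles strictly positive temperatures.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("this sketch only supports temperature > 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate tokens.
logits = [2.0, 1.0, 0.1]
low_t = softmax_with_temperature(logits, 0.1)   # sharper distribution
high_t = softmax_with_temperature(logits, 0.9)  # flatter distribution
print([round(p, 3) for p in low_t])
print([round(p, 3) for p in high_t])
```

With the lower temperature almost all probability mass sits on the top-scoring token, while the higher temperature leaves more mass on the less likely tokens.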

In [None]:
query = 'Create a haiku about a unicorn'

response = generate_message_claude(query, temperature=0.9)
pp.pprint(response)

In [None]:
query = 'Create a haiku about a unicorn'

response = generate_message_claude(query, temperature=0.1)
pp.pprint(response)

### 2. top_p – Use nucleus sampling.

The percentage of most-likely candidates that the model considers for the next token.

- Choose a lower value to decrease the size of the pool and limit the options to more likely outputs.

- Choose a higher value to increase the size of the pool and allow the model to consider less likely outputs.
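As a rough illustration of nucleus sampling, the sketch below keeps the smallest set of most-likely tokens whose cumulative probability reaches `top_p` and renormalizes it. The token probabilities and the helper `top_p_filter` are hypothetical; the model's actual implementation is not exposed.

```python
def top_p_filter(probs, top_p):
    """Keep the most-likely tokens until their cumulative probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = {}, 0.0
    for token, p in ranked:
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    # Renormalize so the surviving candidates sum to 1.
    return {token: p / total for token, p in kept.items()}

# Hypothetical next-token distribution.
probs = {"dog": 0.6, "cat": 0.25, "horse": 0.1, "parrot": 0.05}
narrow = top_p_filter(probs, 0.1)  # only the single most likely token survives
wide = top_p_filter(probs, 0.9)    # less likely tokens stay in the pool
print(narrow)
print(wide)
```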

In [None]:
query = "Who is man's best friend?"

response = generate_message_claude(query, top_p=0.1)
pp.pprint(response)

In [None]:
query = "Who is man's best friend?"

response = generate_message_claude(query, top_p=0.9)
pp.pprint(response)

### 3. top_k: Only sample from the top K options for each subsequent token.

The number of most-likely candidates that the model considers for the next token.

- Choose a lower value to decrease the size of the pool and limit the options to more likely outputs.

- Choose a higher value to increase the size of the pool and allow the model to consider less likely outputs.
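The same idea can be sketched for top_k: instead of a probability budget, the pool is capped at a fixed number of candidates. The distribution and the helper `top_k_filter` below are illustrative assumptions only.

```python
def top_k_filter(probs, top_k):
    """Keep only the top_k most-likely tokens and renormalize their probabilities."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    return {token: p / total for token, p in ranked}

# Hypothetical next-token distribution.
probs = {"space": 0.5, "everything": 0.3, "cosmos": 0.15, "banana": 0.05}
small_pool = top_k_filter(probs, 3)    # drops the least likely token
large_pool = top_k_filter(probs, 100)  # larger than the vocabulary here, so nothing is dropped
print(small_pool)
print(large_pool)
```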

In [None]:
query = 'What is the universe'

response = generate_message_claude(query, top_k=3)
pp.pprint(response)

In [None]:
query = 'What is the universe'

response = generate_message_claude(query, top_k=100)
pp.pprint(response)

# Retrieval Augmented Generation
We are using the Retrieval Augmented Generation (RAG) technique with Amazon Bedrock. A RAG implementation consists of two parts:

1. A data pipeline that ingests data from documents (typically stored in Amazon S3) into a Knowledge Base, i.e. a vector database such as Amazon OpenSearch Service Serverless (AOSS), so that it is available for lookup when a question is received.

The data pipeline represents undifferentiated heavy lifting and can be implemented using Amazon Bedrock Knowledge Bases. We can connect an S3 bucket to a vector database such as AOSS and have Bedrock Knowledge Bases read the objects (html, pdf, text etc.), chunk them, convert these chunks into embeddings using the Amazon Titan Embeddings model, and then store these embeddings in AOSS. All of this happens without having to build, deploy, and manage the data pipeline.

<center><img src="images/fully_managed_ingestion.png" alt="This image shows how Amazon Bedrock Knowledge Bases ingests objects in an S3 bucket into the Knowledge Base for use in a RAG setup. The objects are chunked, embedded, and then stored in a vector index." height="700" width="700" style="background-color:white; padding:1em;" /></center> <br/>

2. An application that receives a question from the user, looks up the knowledge base for relevant pieces of information (context), creates a prompt that includes the question and the context, and provides it to an LLM for generating a response.






Once the data is available in the Bedrock knowledge base, then user questions can be answered using the following system design:

<center><img src="images/retrieveAndGenerate.png" alt="This image shows the retrieval augmented generation (RAG) system design setup with knowledge bases, S3, and AOSS. The knowledge corpus is ingested into a vector database using Amazon Bedrock Knowledge Base Agent, and the RAG approach is then used for question answering. The question is converted into embeddings, followed by a semantic similarity search to get similar documents. With the user prompt augmented by the RAG search response, the LLM is invoked to get the final raw response for the user." height="700" width="700" style="background-color:white; padding:1em;" /></center> <br/>
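The two parts of a RAG implementation can be sketched end to end in plain Python. This is a toy illustration under loud assumptions: the bag-of-words `embed` function stands in for Amazon Titan Embeddings, a Python list stands in for the AOSS vector index, and all helper names are hypothetical. Amazon Bedrock Knowledge Bases performs these steps for you.

```python
import math
from collections import Counter

def chunk_text(text, chunk_size=12, overlap=4):
    """Split text into overlapping word windows (stand-in for Knowledge Base chunking)."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

def embed(text):
    """Toy bag-of-words 'embedding'; the lab uses a real embedding model instead."""
    return Counter(w.strip(".,?!").lower() for w in text.split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, index, top_n=1):
    """Return the top_n chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]

def build_prompt(question, context_chunks):
    """Augment the user question with the retrieved context."""
    context = "\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Part 1: ingestion (chunk, embed, store).
document = (
    "Guardrails help you implement safeguards for generative AI applications. "
    "Knowledge Bases ingest documents from an S3 bucket, chunk them, and store embeddings in a vector index."
)
index = [(chunk, embed(chunk)) for chunk in chunk_text(document)]

# Part 2: retrieval and prompt augmentation.
question = "Where do Knowledge Bases store embeddings?"
top_chunks = retrieve(question, index)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

The resulting prompt is what would be passed to the LLM; in this lab that final step is handled by the RetrieveAndGenerate API.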


# Data
Let's use the publicly available [Bedrock user guide](https://docs.aws.amazon.com/pdfs/bedrock/latest/userguide/bedrock-ug.pdf) to inform the model. If you are running this workshop in an AWS-led event, this dataset is already uploaded to the S3 path (s3://bedrock-rag-us-west-2-<account_id>/kb-docs/bedrock-ug.pdf) and connected as a data source to the Amazon Bedrock Knowledge Base.

In [None]:
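# Control-plane client for Knowledge Bases. It is (re)created here so that this cell
# does not depend on the star import above having defined it.
bedrock_agent_client = boto3.client('bedrock-agent', region_name=region)

# Look up the pre-provisioned knowledge base by name.
kb_id = None
kb_list = bedrock_agent_client.list_knowledge_bases()['knowledgeBaseSummaries']
for kb in kb_list:
    if kb['name'] == 'bedrock_user_guide_kb':
        kb_id = kb['knowledgeBaseId']
        break

if kb_id is None:
    print("Please navigate to Amazon Bedrock > Builder Tools > Knowledge Bases. "
          "Click on the 'bedrock_user_guide_kb' KB. Go to the Data source section and click the `Sync` button. "
          "Please wait for it to finish, then re-run this cell.")
print(kb_id)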

In [None]:
# keep the kb_id for invocation later in the invoke request
%store kb_id

# Chat with the model using the knowledge base by providing the generated KB_ID
### Using RetrieveAndGenerate API
Behind the scenes, the RetrieveAndGenerate API converts queries into embeddings, searches the knowledge base, augments the foundation model prompt with the search results as context information, and returns the FM-generated response to the question. For multi-turn conversations, Knowledge Bases manage short-term memory of the conversation to provide more contextual results. The output of the RetrieveAndGenerate API includes the generated response, source attribution, and the retrieved text chunks.
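As a small sketch of how that output can be unpacked, the helper below pulls the generated answer and the retrieved text chunks out of a response dictionary. The `mock_response` is a hand-written stand-in that only mirrors the fields this notebook reads later (`output.text` and `citations[].retrievedReferences[].content.text`); the helper name is hypothetical.

```python
def summarize_rag_response(response):
    """Return the generated answer and the list of retrieved text chunks."""
    answer = response["output"]["text"]
    chunks = [
        reference["content"]["text"]
        for citation in response.get("citations", [])
        for reference in citation.get("retrievedReferences", [])
    ]
    return answer, chunks

# Hand-written stand-in for an API response; the values are illustrative only.
mock_response = {
    "output": {"text": "Amazon Bedrock is a fully managed service."},
    "citations": [
        {"retrievedReferences": [{"content": {"text": "Amazon Bedrock is a fully managed service."}}]},
        {"retrievedReferences": []},  # a citation without supporting references
    ],
}

answer, chunks = summarize_rag_response(mock_response)
print(answer)
print(len(chunks), "retrieved chunk(s)")
```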

In [None]:
bedrock_agent_runtime_client = boto3.client("bedrock-agent-runtime", region_name=region, config=RETRY_CONFIG)

def ask_bedrock_llm_with_knowledge_base(query,
                                        kb_id=kb_id,
                                        model_arn=llm_model_id,
                                        temperature=0,
                                        top_p=1,
                                        ) -> dict:
    response = bedrock_agent_runtime_client.retrieve_and_generate(
        input={
            'text': query
        },
        retrieveAndGenerateConfiguration={
            'type': 'KNOWLEDGE_BASE',
            'knowledgeBaseConfiguration': {
                'knowledgeBaseId': kb_id,
                'modelArn': model_arn,
                'generationConfiguration': {
                    'inferenceConfig': {
                        'textInferenceConfig': {
                            'maxTokens': 2048,
                            'temperature': temperature,
                            'topP': top_p,
                        }
                    },
                    'promptTemplate': {
                        # Adjacent string literals are concatenated, which avoids embedding
                        # the source indentation into the prompt.
                        'textPromptTemplate': (
                            'You are a helpful AI assistant. You try to answer the user queries based on the provided context. '
                            'If you are unsure of the answer, do not make up any information. '
                            'Context to the user query is $search_results$ '
                            '$output_format_instructions$'
                        )
                    }
                },
            },
        }
    )

    return response

In [None]:
query = "What is Amazon Bedrock?"

response = ask_bedrock_llm_with_knowledge_base(query, kb_id, temperature=0)
pretty_display_rag_response(response)

In [None]:
pretty_display_rag_citations(response)

### Change the temperature to choose a different amount of randomness in the model response
- Choose a lower value to influence the model to select higher-probability outputs.

- Choose a higher value to influence the model to select lower-probability outputs.

In [None]:
query = "What is Amazon Bedrock?"

response = ask_bedrock_llm_with_knowledge_base(query, kb_id, temperature=0.8)
pretty_display_rag_response(response)

### Test another query!

In [None]:
query = "Is it possible to purchase provisioned throughput for Anthropic Claude Sonnet on Amazon Bedrock?"

response = ask_bedrock_llm_with_knowledge_base(query, kb_id)
pretty_display_rag_response(response)

# Contextual Grounding Check with Amazon Bedrock Guardrails
Contextual grounding check evaluates for hallucinations across two paradigms:

- Grounding – This checks if the model response is factually accurate based on the source and is grounded in the source. Any new information introduced in the response will be considered un-grounded.

- Relevance – This checks if the model response is relevant to the user query.
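The thresholding behind these two checks can be sketched as follows: each check yields a confidence score, and a response whose score falls below the configured threshold is blocked. The scores below are made up for illustration and the helper `failed_grounding_checks` is hypothetical; the real scoring happens inside Amazon Bedrock Guardrails.

```python
def failed_grounding_checks(scores, thresholds):
    """Return the names of the checks whose score is below the configured threshold."""
    return [name for name, threshold in thresholds.items() if scores[name] < threshold]

# Thresholds matching the guardrail configuration used in this lab.
thresholds = {"GROUNDING": 0.5, "RELEVANCE": 0.5}

# Hypothetical scores for one model response.
scores = {"GROUNDING": 0.32, "RELEVANCE": 0.91}

failed = failed_grounding_checks(scores, thresholds)
action = "INTERVENED" if failed else "NONE"
print(failed, action)
```

Raising a threshold makes the guardrail stricter, since more responses will score below it and be blocked.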

In [None]:
# Create guardrail

random_id_suffix = str(uuid.uuid1())[:6] # get first 6 characters of uuid string to generate guardrail name suffix

bedrock_client = boto3.client('bedrock')
guardrail_name = f"bedrock-rag-grounding-guardrail-{random_id_suffix}"
print(guardrail_name)

guardrail_response = bedrock_client.create_guardrail(
    name=guardrail_name,
    description='Guardrail for ensuring relevance and grounding of model responses in RAG powered chatbot',
    contextualGroundingPolicyConfig={
        'filtersConfig': [
            {
                'type': 'GROUNDING',
                'threshold': 0.5
            },
            {
                'type': 'RELEVANCE',
                'threshold': 0.5
            },
        ]
    },
    blockedInputMessaging='Can you please rephrase your question?',
    blockedOutputsMessaging='Sorry, I am not able to find the correct answer to your query - Can you try reframing your query to be more specific'
)
guardrailId = guardrail_response['guardrailId']

In [None]:
guardrail_version = bedrock_client.create_guardrail_version(
    guardrailIdentifier=guardrail_response['guardrailId'],
    description='Working version of RAG app guardrail with higher thresholds for contextual grounding'
)

# Use the numbered version just created rather than the DRAFT returned by create_guardrail
guardrailVersion = guardrail_version['version']

%store guardrailId

In [None]:
# Retrieve and Generate using Guardrail

def retrieve_and_generate_with_guardrail(
    query,
    kb_id,
    model_arn=llm_model_id,
    session_id=None
):

    # Adjacent string literals are concatenated, which keeps source indentation out of the prompt.
    prompt_template = (
        'You are a helpful AI assistant to help users understand Amazon Bedrock. '
        "Answer the user query based on the context retrieved. If you don't know the answer, don't make up anything. "
        'Only answer based on what you know from the provided context. '
        'You can ask the user clarifying questions if anything is unclear, '
        'but generate an answer only when you are confident about it and based on the provided context.\n'
        'User Query: $query$\n'
        'Context: $search_results$\n'
        '$output_format_instructions$'
    )

    response = bedrock_agent_runtime_client.retrieve_and_generate(
        input={
            'text': query
        },
        retrieveAndGenerateConfiguration={
            'type': 'KNOWLEDGE_BASE',
            'knowledgeBaseConfiguration': {
                'generationConfiguration': {
                    'guardrailConfiguration': {
                        'guardrailId': guardrailId,
                        'guardrailVersion': guardrailVersion
                    },
                    'inferenceConfig': {
                        'textInferenceConfig': {
                            'temperature': 0.1,
                            'topP': 0.25
                        }
                    },
                    'promptTemplate': {
                        'textPromptTemplate': prompt_template
                    }
                },
                'knowledgeBaseId': kb_id,
                'modelArn': model_arn,
                'retrievalConfiguration': {
                    'vectorSearchConfiguration': {
                        'overrideSearchType': 'SEMANTIC'
                    }
                }
            }
        }
    )
    return response

In [None]:
query = 'What is Generative AI?'

model_response = retrieve_and_generate_with_guardrail(query, kb_id)

pp.pprint(model_response)

<div style="border: 2px solid coral; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px">
    <h4>The Guardrail intervenes when the generated model response is not grounded in the retrieved context</h4>
</div>
<br/>
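Rather than reading the full printed response, one way to check for an intervention programmatically is to inspect the `guardrailAction` field of the RetrieveAndGenerate response. The sketch below uses hand-written stand-in dictionaries and a hypothetical helper name, so treat it as an illustration of the check rather than captured output.

```python
def guardrail_intervened(response):
    """Return True when the response reports that the guardrail blocked the output."""
    return response.get("guardrailAction") == "INTERVENED"

# Hand-written stand-ins for two API responses; the values are illustrative only.
blocked_response = {
    "output": {"text": "Sorry, I am not able to find the correct answer to your query"},
    "guardrailAction": "INTERVENED",
    "citations": [],
}
passed_response = {
    "output": {"text": "Amazon Bedrock is a fully managed service."},
    "guardrailAction": "NONE",
    "citations": [],
}

for name, resp in [("blocked", blocked_response), ("passed", passed_response)]:
    print(name, guardrail_intervened(resp))
```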

# Evaluating RAG with RAGAS

In [None]:
from botocore.client import Config
from langchain.llms.bedrock import Bedrock
from langchain_community.chat_models.bedrock import BedrockChat
from langchain.embeddings import BedrockEmbeddings
from langchain.retrievers.bedrock import AmazonKnowledgeBasesRetriever
from langchain.chains import RetrievalQA
from langchain_core.globals import set_verbose, set_debug

# Disable verbose logging
set_verbose(False)

# Disable debug logging
set_debug(False)

bedrock_config = Config(connect_timeout=120, read_timeout=120, retries={'max_attempts': 0})
bedrock_client = boto3.client('bedrock-runtime')
bedrock_agent_client = boto3.client("bedrock-agent-runtime",
                              config=bedrock_config
                              )

llm_for_text_generation = BedrockChat(model_id=llm_model_id, client=bedrock_client)

llm_for_evaluation = BedrockChat(model_id=llm_model_id, client=bedrock_client)

bedrock_embeddings = BedrockEmbeddings(model_id=embedding_model_id,client=bedrock_client)

In [None]:
import pandas as pd

test = pd.read_csv('data/bedrock-user-guide-test.csv').dropna()
test.style.set_properties(**{'text-align': 'left', 'border': '1px solid black'})
test.to_string(justify='left', index=False)
with pd.option_context("display.max_colwidth", None):
    display(pd.DataFrame(test))

In [None]:
from datasets import Dataset

questions = test['Question/prompt'].tolist()
ground_truth = [gt for gt in test['Correct answer'].tolist()]

answers = []
contexts = []

for query in questions:
    response = ask_bedrock_llm_with_knowledge_base(query, kb_id)
    generatedResult = response['output']['text']
    answers.append(generatedResult)

    context = []
    citations = response["citations"]
    for citation in citations:
        retrievedReferences = citation["retrievedReferences"]
        for reference in retrievedReferences:
            context.append(reference["content"]["text"])
    contexts.append(context)

# To dict
data = {
    "question": questions,
    "answer": answers,
    "contexts": contexts,
    "ground_truth": ground_truth
}

# Convert dict to dataset
dataset = Dataset.from_dict(data)

### Let us deep dive into the two RAGAS metrics that we will use in the next lab

#### 1. answer_relevancy

The Answer Relevancy metric assesses how pertinent the generated answer is to the given prompt. Answers that are incomplete or contain redundant information receive lower scores, and higher scores indicate better relevancy. This metric is computed using the user_input, the retrieved_contexts and the response.
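As a rough sketch of the idea, and assuming the approach described in the RAGAS documentation, the evaluator LLM generates questions from the answer, and the score is the mean cosine similarity between the embedding of the original question and the embeddings of those generated questions. The two-dimensional vectors and the helper names below are toy assumptions chosen to keep the arithmetic easy to follow.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def answer_relevancy_sketch(question_vec, generated_question_vecs):
    """Mean similarity between the original question and questions derived from the answer."""
    sims = [cosine(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims)

# Toy embeddings: one generated question matches the original, one is unrelated.
question_vec = [1.0, 0.0]
generated_question_vecs = [[1.0, 0.0], [0.0, 1.0]]

relevancy = answer_relevancy_sketch(question_vec, generated_question_vecs)
print(relevancy)
```

An answer that drifts off topic yields generated questions that look unlike the original one, which pulls the mean similarity down.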


In [None]:
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy
)

#specify the metrics here, kept one for now, we can add more.
metrics_ar = [
        answer_relevancy
    ]

result_ar = evaluate(
    dataset = dataset, 
    metrics=metrics_ar,
    llm=llm_for_evaluation,
    embeddings=bedrock_embeddings,
    raise_exceptions=False
)

ragas_df_ar= result_ar.to_pandas()

In [None]:
ragas_df_ar.style.set_properties(**{'text-align': 'left', 'border': '1px solid black'})
ragas_df_ar.to_string(justify='left', index=False)
with pd.option_context("display.max_colwidth", None):
    display(pd.DataFrame(ragas_df_ar))

#### 2. answer_correctness

Answer Correctness gauges the accuracy of the generated answer when compared to the ground truth. This evaluation relies on the ground truth and the answer, with scores ranging from 0 to 1. A higher score indicates a closer alignment between the generated answer and the ground truth, signifying better correctness. Answer correctness encompasses two critical aspects: semantic similarity between the generated answer and the ground truth, as well as factual similarity.
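The combination of the two aspects can be sketched as a weighted average of a factual F1 score and a semantic similarity score. The statement counts, the similarity value, and the 0.75/0.25 weights below are assumptions for illustration (the weights are believed to match the RAGAS defaults, but check the version you have installed); the helper names are hypothetical.

```python
def factual_f1(tp, fp, fn):
    """F1 over statements: tp = supported by ground truth, fp = unsupported, fn = missing."""
    denominator = tp + 0.5 * (fp + fn)
    return tp / denominator if denominator else 0.0

def answer_correctness_sketch(tp, fp, fn, semantic_similarity, weights=(0.75, 0.25)):
    """Weighted average of factual F1 and semantic similarity (weights are an assumption)."""
    w_factual, w_semantic = weights
    weighted = w_factual * factual_f1(tp, fp, fn) + w_semantic * semantic_similarity
    return weighted / (w_factual + w_semantic)

# Hypothetical example: 2 correct statements, 1 unsupported, 1 missing, similarity 0.8.
correctness = answer_correctness_sketch(tp=2, fp=1, fn=1, semantic_similarity=0.8)
print(round(correctness, 3))
```

Unsupported statements in the answer count as false positives here, which is why this metric is useful for spotting hallucinated content.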

In [None]:
from ragas import evaluate
from ragas.metrics import (
    answer_correctness
)

metrics_ac = [
        answer_correctness
    ]

result_ac = evaluate(
    dataset = dataset, 
    metrics=metrics_ac,
    llm=llm_for_evaluation,
    embeddings=bedrock_embeddings,
    raise_exceptions=False
)

ragas_df_ac = result_ac.to_pandas()

In [None]:
ragas_df_ac.style.set_properties(**{'text-align': 'left', 'border': '1px solid black'})
ragas_df_ac.to_string(justify='left', index=False)
with pd.option_context("display.max_colwidth", None):
    display(pd.DataFrame(ragas_df_ac))

### <a >Challenge Exercise :: Try it Yourself! </a>


<div style="border: 4px solid coral; text-align: left; margin: auto;">
    <br>
    <p style="text-align: center; margin: auto;"><b>Try the following exercises in this lab and note the observations.</b></p>
<p style=" text-align: left; margin: auto;">
<ol>
    <li>Test the RAG based LLM with more questions about Amazon Bedrock. </li>
<li>Look at the citations or retrieved references and see if the answer generated by the RAG chatbot aligns with these retrieved contexts. What response do you get when the retrieved context comes up empty? </li>
<li>Apply system prompts to RAG as well as Amazon Bedrock Guardrails and test which is more consistent in blocking responses when the model response is hallucinated. </li>
<li>Run the tutorial for RAG Checker and compare the difference with RAGAS evaluation framework: https://github.com/amazon-science/RAGChecker/blob/main/tutorial/ragchecker_tutorial_en.md </li>
</ol>
<br>
</p>
</div>


## Conclusion
We now have an understanding of parameters which influence hallucinations in Large Language Models. We learnt how to set up Retrieval Augmented Generation to provide a context to the model while answering.
We used contextual grounding in Amazon Bedrock Guardrails to intervene when hallucinations are detected.
Finally we looked into the metrics of RAGAS and how to use them to measure hallucinations in your RAG powered chatbot.

In the next lab, we will:
1. Build a custom hallucination detector
2. Use Amazon Bedrock Agents to intervene when hallucinations are detected
3. Call a human for support when the LLM hallucinates
