# Amazon Bedrock Knowledge Bases with Guardrails to reduce hallucinations

> **As featured in AWS re:Invent 2024 session AIM325**

In this notebook, you will:
1. Explore some LLM inference parameters that influence (and can help mitigate) model hallucinations.
2. Understand how Retrieval Augmented Generation (RAG) helps reduce hallucinations, and can be implemented with [Amazon Bedrock Knowledge Bases](https://aws.amazon.com/bedrock/knowledge-bases/).
3. Add real-time contextual grounding checks with [Amazon Bedrock Guardrails](https://aws.amazon.com/bedrock/guardrails/), to detect and intervene when a RAG pipeline could still be generating hallucinatory or irrelevant answers.
4. Learn how the open-source [Ragas](https://docs.ragas.io/en/stable/) framework can help evaluate RAG pipelines with a range of relevant metrics.

**About the dataset:** To demonstrate knowledge search, this workshop will use a simple example corpus including only (an outdated version of) the [Amazon Bedrock User Guide](https://docs.aws.amazon.com/pdfs/bedrock/latest/userguide/bedrock-ug.pdf) in PDF format. A set of example questions and expected answers based on this document are provided in [data/bedrock-user-guide-test.csv](data/bedrock-user-guide-test.csv).

## Prerequisites and setup

**Permissions:**
- *This notebook* (i.e. your AWS CLI credentials, or your SageMaker Execution Role if running in SageMaker AI) needs access to invoke models, query Knowledge Bases, and create+invoke Guardrails in Amazon Bedrock.
- *You* (i.e. your AWS Console user) will need access to deploy an OpenSearch Serverless-backed Bedrock Knowledge Base, including creating IAM Roles, which we'll set up via AWS CloudFormation.

**Kernel and libraries:**

This notebook uses the same libraries as defined in the top-level [pyproject.toml](../../pyproject.toml) in this sample repository. Run the cells below to install those and then restart your notebook kernel, if you don't have them already:

In [None]:
%pip install -e ../..

In [None]:
# restart kernel
from IPython.core.display import HTML

HTML("<script>Jupyter.notebook.kernel.restart()</script>")

With the relevant libraries installed, we're ready to load them up and connect to the AWS services that'll be used in the rest of the notebook:

In [None]:
# Python Built-Ins:
import json
import logging
import pprint
import uuid
import warnings

# External Dependencies:
import boto3
from botocore.config import Config as BotoConfig
import pandas as pd

# Connect to AWS Services:
boto_session = boto3.Session()  # Can optionally override 'region_name' here if wanted
bedrock_config = BotoConfig(
    # Override default retry & timeout config for (maybe long-running/throttled) FM invocations
    retries={"max_attempts": 5, "mode": "adaptive"},
    read_timeout=1000,
    connect_timeout=1000,
)
bedrock_client = boto_session.client("bedrock")
bedrock_runtime = boto_session.client("bedrock-runtime", config=bedrock_config)
bedrock_agent_client = boto_session.client("bedrock-agent")
bedrock_agent_runtime = boto_session.client(
    "bedrock-agent-runtime", config=bedrock_config
)

# Setup for logging/printing outputs:
logging.basicConfig(
    format="%(asctime)s {%(filename)s:%(lineno)d} %(levelname)s: %(message)s",
    level=logging.INFO,
)
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)
pp = pprint.PrettyPrinter(indent=4)

This notebook has been tested with Anthropic Claude 3 Sonnet for text generation and Amazon Titan Text Embeddings v2 for embeddings, as configured below.

It should be possible to run with other models instead if needed (e.g. due to regional availability or updates over time), but this may impact the observed behaviour and metrics:

In [None]:
embedding_model_id = "amazon.titan-embed-text-v2:0"
llm_model_id = "anthropic.claude-3-sonnet-20240229-v1:0"

## 1. Basic LLM invocation on Bedrock and hallucination risks

To get started, let's test out the basics of asking a question to a text generation Foundation Model on Amazon Bedrock. We'll set up a utility function for this to simplify the code later:

In [None]:
def generate_message_claude(
    query,
    system_prompt="",
    max_tokens=1000,
    model_id=llm_model_id,
    temperature=1.0,
    top_p=0.999,
    top_k=250,
):
    """Utility to simplify invoking Claude with a text prompt and getting a text response"""
    user_message = {"role": "user", "content": query}
    messages = [user_message]
    body = json.dumps(
        {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": max_tokens,
            "system": system_prompt,
            "messages": messages,
            "temperature": temperature,
            "top_p": top_p,
            "top_k": top_k,
        }
    )

    response = bedrock_runtime.invoke_model(body=body, modelId=model_id)
    response_body = json.loads(response["body"].read())
    return response_body["content"][0]["text"]

In this example we're relying **only** on whatever information the model remembers from its initial training, and whatever information is available in the prompt: there's no web search or other knowledge lookup enabled.

...So if we ask a specific, factual question on a specialized topic, there's a high chance the model may "hallucinate" a confident-sounding but factually incorrect answer:

In [None]:
query = "How does Amazon Bedrock Guardrails work?"

response = generate_message_claude(query)
pp.pprint(response)

<div style="border: 4px solid coral; text-align: left; padding: 20px;">
    <strong>Note: If the LLM call to Bedrock did not work, enable model access on Amazon Bedrock console</strong>
</div>

### 1.1 Apply a system prompt

Perhaps the simplest way to steer the model away from this behaviour is to just give guidance in the system prompt - encouraging the model to consider whether it's confident and decline to answer if not.

This approach isn't perfect by itself, but can help to avoid hallucinations in some cases:

In [None]:
system_prompt = (
    # Adjacent string literals are concatenated; note the trailing space between sentences
    "You are a helpful AI assistant. You try to answer the user queries to the best of your knowledge. "
    "If you are unsure of the answer, do not make up any information."
)

query = "Is it possible to purchase provisioned throughput for Anthropic Claude models on Amazon Bedrock?"

response = generate_message_claude(query, system_prompt)
pp.pprint(response)

In [None]:
query = "How do Amazon Bedrock Guardrails work?"

response = generate_message_claude(query, system_prompt)
pp.pprint(response)

### 1.2 Understanding LLM generation parameters

There are also a few standard parameters controlling the response generation process, which can be adjusted to influence hallucination likelihood:

**Temperature** affects the shape of the probability distribution for the predicted output and influences the likelihood of the model selecting lower-probability outputs.

- Choose a lower value to influence the model to select higher-probability outputs.
- Choose a higher value to influence the model to select lower-probability outputs.
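
To make the effect of temperature concrete, here is a minimal sketch. It assumes the common formulation in which token scores (logits) are divided by the temperature before a softmax. The logits are made-up values for illustration, and the models hosted on Bedrock may apply temperature differently internally.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after scaling them by 1/temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - max_scaled) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate next tokens
example_logits = [2.0, 1.0, 0.1]
low_temp_probs = softmax_with_temperature(example_logits, 0.1)
high_temp_probs = softmax_with_temperature(example_logits, 1.0)

print("temperature=0.1:", [round(p, 3) for p in low_temp_probs])
print("temperature=1.0:", [round(p, 3) for p in high_temp_probs])
```

At the lower temperature almost all of the probability mass sits on the top-scoring token, while at the higher temperature the less likely tokens keep a meaningful chance of being sampled.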

In [None]:
query = "Create a haiku about a unicorn"

response = generate_message_claude(query, temperature=0.9)
pp.pprint(response)

In [None]:
query = "Create a haiku about a unicorn"

response = generate_message_claude(query, temperature=0.1)
pp.pprint(response)

**top_k** limits the number of most-likely options that the model considers when choosing each token, in a process called top-k sampling.

- Choose a lower value to decrease the size of the pool and limit the options to more likely outputs.
- Choose a higher value to increase the size of the pool and allow the model to consider less likely outputs.
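
As a rough illustration of this pooling step, the sketch below keeps only the `k` most likely tokens and renormalizes their probabilities. The token probabilities are invented for the example, and `top_k_filter` is a hypothetical helper rather than part of any Bedrock API.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize their probabilities."""
    if k < 1:
        raise ValueError("k must be at least 1")
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in ranked)
    return {token: p / total for token, p in ranked}


# Hypothetical next-token distribution
example_probs = {"cosmos": 0.5, "space": 0.3, "everything": 0.15, "banana": 0.05}

top2 = top_k_filter(example_probs, 2)
print(top2)
```

With `k=2`, the unlikely "banana" token can never be sampled, whatever the temperature.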

In [None]:
query = "What is the universe"

response = generate_message_claude(query, top_k=3)
pp.pprint(response)

In [None]:
query = "What is the universe"

response = generate_message_claude(query, top_k=100)
pp.pprint(response)

**top_p**, like top_k, limits candidates that the model considers for choosing each token - but by their total probability score rather than the number of options. As a result, the pool will automatically be larger when the model is less certain, or smaller when just one or two tokens dominate the scores.

- Choose a lower value to decrease the size of the pool and limit the options to more likely outputs.
- Choose a higher value to increase the size of the pool and allow the model to consider less likely outputs.
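
The sketch below illustrates why the top_p pool adapts to the model's certainty: it keeps the smallest set of most likely tokens whose cumulative probability reaches `p`. The distributions are made up, and `top_p_filter` is a hypothetical helper written for this illustration.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the range (0, 1]")
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


# Hypothetical distributions: one where the model is confident, one where it is not
confident = {"dog": 0.9, "cat": 0.06, "horse": 0.04}
uncertain = {"dog": 0.4, "cat": 0.35, "horse": 0.25}

confident_pool = top_p_filter(confident, 0.8)
uncertain_pool = top_p_filter(uncertain, 0.8)
print(len(confident_pool), len(uncertain_pool))
```

The same `p` yields a one-token pool when a single option dominates and a three-token pool when the probabilities are spread out.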

In [None]:
query = "Who is man's best friend?"

response = generate_message_claude(query, top_p=0.1)
pp.pprint(response)

In [None]:
query = "Who is man's best friend?"

response = generate_message_claude(query, top_p=0.9)
pp.pprint(response)

## Retrieval-Augmented Generation

We are using the Retrieval Augmented Generation (RAG) technique with Amazon Bedrock. A RAG implementation consists of two parts:

1. A data pipeline that ingests data from documents (typically stored in Amazon S3) into a Knowledge Base, i.e. a vector database such as Amazon OpenSearch Serverless (AOSS), so that it is available for lookup when a question is received.

The data pipeline is undifferentiated heavy lifting, and can be implemented using Amazon Bedrock Knowledge Bases. We can connect an S3 bucket to a vector database such as AOSS and have Bedrock Knowledge Bases read the objects (HTML, PDF, text etc.), chunk them, convert the chunks into embeddings using an Amazon Titan Embeddings model, and store the embeddings in AOSS, all without having to build, deploy, and manage the data pipeline ourselves.

![](images/fully_managed_ingestion.png "This image shows how Amazon Bedrock Knowledge Bases ingests objects in an S3 bucket into the Knowledge Base for use in a RAG setup. The objects are chunked, embedded and then stored in a vector index.")

2. An application that receives a question from the user, searches the knowledge base for relevant pieces of information (context), creates a prompt that includes the question and the context, and provides it to an LLM to generate a response.


Once the data is available in the Bedrock knowledge base, then user questions can be answered using the following system design:

![](images/retrieveAndGenerate.png "This image shows the retrieval augmented generation (RAG) system design with knowledge bases, S3, and AOSS. The knowledge corpus is ingested into a vector database using Amazon Bedrock Knowledge Bases, and the RAG approach is then used for question answering. The question is converted into embeddings, followed by a semantic similarity search to find similar documents. With the user prompt augmented by the search results, the LLM is invoked to generate the final response for the user.")
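
The retrieval and prompt-augmentation steps described above can be sketched in a few lines of plain Python. This is a toy illustration only: the three-dimensional vectors stand in for real embeddings, the chunk texts are invented, and the helper names are hypothetical. In this notebook, Bedrock Knowledge Bases performs these steps for us.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_chunks(query_vector, indexed_chunks, top_n=1):
    """Return the texts of the top_n chunks most similar to the query vector."""
    ranked = sorted(
        indexed_chunks,
        key=lambda chunk: cosine_similarity(query_vector, chunk["vector"]),
        reverse=True,
    )
    return [chunk["text"] for chunk in ranked[:top_n]]


def build_rag_prompt(question, contexts):
    """Combine the retrieved context and the user question into one prompt."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return f"Answer using only this context:\n{context_block}\n\nQuestion: {question}"


# Toy "vector index": made-up chunks with made-up embeddings
toy_index = [
    {"text": "Guardrails can filter model inputs and outputs.", "vector": [0.9, 0.1, 0.0]},
    {"text": "Knowledge Bases store document embeddings.", "vector": [0.1, 0.9, 0.1]},
]
toy_query_vector = [0.2, 0.8, 0.0]  # stand-in for the embedded user question

top_chunks = retrieve_top_chunks(toy_query_vector, toy_index)
rag_prompt = build_rag_prompt("Where are embeddings stored?", top_chunks)
print(rag_prompt)
```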

### Set up the Knowledge Base

In this example we'll use (an [outdated version](https://ws-assets-prod-iad-r-iad-ed304a55c2ca1aee.s3.us-east-1.amazonaws.com/1fa309f2-c771-42d5-87bc-e8f919e7bcc9/bedrock-ug.pdf) of) the publicly available [Amazon Bedrock User Guide](https://docs.aws.amazon.com/pdfs/bedrock/latest/userguide/bedrock-ug.pdf) as an example document to inform the model.

The [Bedrock Knowledge Bases Console](https://console.aws.amazon.com/bedrock/home?#/knowledge-bases) provides a UI workflow to guide you through the multiple steps of creating and configuring a knowledge base, connecting a data source, and triggering a "sync" to index the data for search.

For this sample though, we've provided an [AWS CloudFormation](https://aws.amazon.com/cloudformation/resources/templates/) template to make this multi-step setup as simple as possible - in [/infra/Bedrock-Knowledge-Base.yaml](../../infra/Bedrock-Knowledge-Base.yaml). You could upload this template yourself through the [CloudFormation Console](https://console.aws.amazon.com/cloudformation/home?#/stacks/create) - or click the button below to get started with a version we've already published to Amazon S3:

[![Launch Stack](https://s3.amazonaws.com/cloudformation-examples/cloudformation-launch-stack.png)](https://console.aws.amazon.com/cloudformation/home?#/stacks/create/review?templateURL=https://s3.amazonaws.com/ws-assets-prod-iad-r-iad-ed304a55c2ca1aee/1fa309f2-c771-42d5-87bc-e8f919e7bcc9/Bedrock-Knowledge-Base.yaml&stackName=HallucinationKBDemo "Launch Stack")

<div style="border: 4px solid coral; text-align: left; margin: auto; padding: 15px;">
    <strong>⏰ This deployment can take ~10-15 minutes to complete</strong>
</div>
<br/>

Once the stack is created successfully, you can check in the [Amazon Bedrock Console](https://console.aws.amazon.com/bedrock/home?#/knowledge-bases) that your knowledge base is deployed successfully, and run the cell below to look up its unique ID automatically:

In [None]:
kb_name = "bedrock-userguide-demo"

kb_id = None
kb_list = bedrock_agent_client.list_knowledge_bases()["knowledgeBaseSummaries"]
for kb in kb_list:
    if kb["name"] == kb_name:
        kb_id = kb["knowledgeBaseId"]

if kb_id is None:
    raise ValueError(
        "Couldn't find pre-created Bedrock Knowledge Base. Please follow the instructions "
        "above to deploy the sample knowledge base, double-check its name matches '%s' configured "
        "above, then re-run this cell." % kb_name
    )
print(f"Using existing Bedrock Knowledge Base with ID: {kb_id}")
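
If your account contains many knowledge bases, a single `list_knowledge_bases` call may not return all of them. The sketch below is one way to search across pages using a boto3 paginator. `find_knowledge_base_id` is a hypothetical helper, and unlike the loop above it returns the first match it finds.

```python
def find_knowledge_base_id(agent_client, name):
    """Return the ID of the first knowledge base with the given name, or None.

    agent_client is expected to be a boto3 "bedrock-agent" client.
    """
    paginator = agent_client.get_paginator("list_knowledge_bases")
    for page in paginator.paginate():
        for summary in page.get("knowledgeBaseSummaries", []):
            if summary["name"] == name:
                return summary["knowledgeBaseId"]
    return None
```

In this notebook it could be called as `find_knowledge_base_id(bedrock_agent_client, kb_name)`.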

### Query the Knowledge Base

Behind the scenes, the RetrieveAndGenerate API converts queries into embeddings, searches the knowledge base, augments the foundation model prompt with the search results as context, and returns the FM-generated response to the question. For multi-turn conversations, Knowledge Bases manages short-term memory of the conversation to provide more contextual results. The output of the RetrieveAndGenerate API includes the generated response, source attribution, and the retrieved text chunks.
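
For the multi-turn case, the response from RetrieveAndGenerate carries a `sessionId` that can be passed back on the next call to continue the same conversation. Here is a minimal sketch, assuming a boto3 `bedrock-agent-runtime` client. `ask_in_session` is a hypothetical helper that omits the prompt template and inference settings for brevity.

```python
def ask_in_session(agent_runtime_client, query, kb_id, model_arn, session_id=None):
    """Ask a question, optionally continuing an earlier conversation.

    Returns the answer text and the session ID to pass to the next call.
    """
    request = {
        "input": {"text": query},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": kb_id,
                "modelArn": model_arn,
            },
        },
    }
    if session_id is not None:
        # Only send sessionId when continuing an existing conversation
        request["sessionId"] = session_id
    response = agent_runtime_client.retrieve_and_generate(**request)
    return response["output"]["text"], response.get("sessionId")
```

A follow-up question would then reuse the returned ID, for example `ask_in_session(bedrock_agent_runtime, "Which models does it support?", kb_id, llm_model_id, session_id=previous_session_id)`, where `previous_session_id` is the value returned by the first call.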

In [None]:
def ask_bedrock_llm_with_knowledge_base(
    query,
    kb_id=kb_id,
    model_arn=llm_model_id,
    temperature=0,
    top_p=1,
):
    response = bedrock_agent_runtime.retrieve_and_generate(
        input={"text": query},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": kb_id,
                "modelArn": model_arn,
                "generationConfiguration": {
                    "inferenceConfig": {
                        "textInferenceConfig": {
                            "maxTokens": 2048,
                            "temperature": temperature,
                            "topP": top_p,
                        }
                    },
                    "promptTemplate": {
                        "textPromptTemplate": "You are a helpful AI assistant. You try to answer the user queries based on the provided context.\
                        If you are unsure of the answer, do not make up any information. Context to the user query is $search_results$ \
                        $output_format_instructions$"
                    },
                },
            },
        },
    )

    return response


def pretty_display_rag_citations(response):
    citations = response["citations"]
    contexts = []
    for citation in citations:
        retrievedReferences = citation["retrievedReferences"]
        for reference in retrievedReferences:
            contexts.append(reference["content"]["text"])
    print("---------- The citations for the response:")
    pp.pprint(contexts)

In [None]:
query = "What is Amazon Bedrock?"

response = ask_bedrock_llm_with_knowledge_base(query, kb_id, temperature=0)
pp.pprint(response["output"]["text"])

In [None]:
pretty_display_rag_citations(response)

### Change the temperature to choose a different amount of randomness in the model response
- Choose a lower value to influence the model to select higher-probability outputs.

- Choose a higher value to influence the model to select lower-probability outputs.

In [None]:
query = "What is Amazon Bedrock?"

response = ask_bedrock_llm_with_knowledge_base(query, kb_id, temperature=0.8)
pp.pprint(response["output"]["text"])

### Test another query!

In [None]:
query = "Is it possible to purchase provisioned throughput for Anthropic Claude Sonnet on Amazon Bedrock?"

response = ask_bedrock_llm_with_knowledge_base(query, kb_id)
pp.pprint(response["output"]["text"])

## Extra protection with Amazon Bedrock Guardrails

The contextual grounding check in Amazon Bedrock Guardrails evaluates model responses for hallucinations across two paradigms:

- Grounding – This checks if the model response is factually accurate based on the source and is grounded in the source. Any new information introduced in the response will be considered un-grounded.

- Relevance – This checks if the model response is relevant to the user query.
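
Each check produces a confidence score, and a response is blocked when a score falls below the threshold configured for that check. The sketch below mimics that decision logic with hypothetical scores. It does not call Bedrock, and `contextual_grounding_decision` is an illustrative helper, not a Guardrails API.

```python
def contextual_grounding_decision(scores, thresholds):
    """Return ("BLOCKED" or "NONE", list of failed checks) for the given scores."""
    failed = [
        check for check, threshold in thresholds.items() if scores[check] < threshold
    ]
    action = "BLOCKED" if failed else "NONE"
    return action, failed


example_thresholds = {"GROUNDING": 0.6, "RELEVANCE": 0.6}

# Hypothetical scores: relevant to the query, but poorly supported by the source
ungrounded = contextual_grounding_decision(
    {"GROUNDING": 0.45, "RELEVANCE": 0.9}, example_thresholds
)
# Hypothetical scores: both checks pass
grounded = contextual_grounding_decision(
    {"GROUNDING": 0.85, "RELEVANCE": 0.9}, example_thresholds
)
print(ungrounded, grounded)
```

Raising a threshold makes the guardrail stricter: more responses are blocked, at the risk of rejecting some acceptable answers.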


### Create the Guardrail

In [None]:
# Create guardrail
# (get first 6 characters of uuid string to generate guardrail name suffix)
random_id_suffix = str(uuid.uuid1())[:6]
guardrail_name = f"bedrock-rag-grounding-guardrail-{random_id_suffix}"
print(guardrail_name)

guardrail_response = bedrock_client.create_guardrail(
    name=guardrail_name,
    description="Guardrail for ensuring relevance and grounding of model responses in RAG powered chatbot",
    contextualGroundingPolicyConfig={
        "filtersConfig": [
            {"type": "GROUNDING", "threshold": 0.6},
            {"type": "RELEVANCE", "threshold": 0.6},
        ]
    },
    blockedInputMessaging="Can you please rephrase your question?",
    blockedOutputsMessaging="Sorry, I am not able to find the correct answer to your query - Can you try reframing your query to be more specific",
)
guardrailId = guardrail_response["guardrailId"]

In [None]:
guardrail_version = bedrock_client.create_guardrail_version(
    guardrailIdentifier=guardrail_response["guardrailId"],
    description="Working version of RAG app guardrail with higher thresholds for contextual grounding",
)

# Use the newly published version rather than the "DRAFT" version returned by create_guardrail
guardrailVersion = guardrail_version["version"]

### Query the Knowledge Base with Guardrail enabled

In [None]:
def retrieve_and_generate_with_guardrail(
    query,
    kb_id,
    model_arn=llm_model_id,
):
    prompt_template = (
        "You are a helpful AI assistant to help users understand documented risks in various projects. \
    Answer the user query based on the context retrieved. If you don't know the answer, don't make up anything. \
    Only answer based on what you know from the provided context. You can ask the user clarifying questions if anything is unclear. \
    But generate an answer only when you are confident about it and based on the provided context. \
    User Query: $query$ \
    Context: $search_results$ \
    $output_format_instructions$"
    )

    response = bedrock_agent_runtime.retrieve_and_generate(
        input={"text": query},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "generationConfiguration": {
                    "guardrailConfiguration": {
                        "guardrailId": guardrailId,
                        "guardrailVersion": guardrailVersion,
                    },
                    "inferenceConfig": {
                        "textInferenceConfig": {"temperature": 0.1, "topP": 0.25}
                    },
                    "promptTemplate": {"textPromptTemplate": prompt_template},
                },
                "knowledgeBaseId": kb_id,
                "modelArn": model_arn,
                "retrievalConfiguration": {
                    "vectorSearchConfiguration": {"overrideSearchType": "SEMANTIC"}
                },
            },
        },
    )
    return response

In [None]:
query = "What is Generative AI?"

model_response = retrieve_and_generate_with_guardrail(query, kb_id)

pp.pprint(model_response)
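
Rather than reading the whole response by eye, you can check whether the guardrail stepped in. The RetrieveAndGenerate response includes a `guardrailAction` field for this purpose. A minimal sketch follows, in which `guardrail_intervened` is a hypothetical helper and the two example responses are hand-written stand-ins rather than real API output.

```python
def guardrail_intervened(rag_response):
    """Return True if the guardrail blocked or replaced the model's answer."""
    return rag_response.get("guardrailAction") == "INTERVENED"


# Hand-written stand-ins for RetrieveAndGenerate responses
blocked_example = {"output": {"text": "(blocked message)"}, "guardrailAction": "INTERVENED"}
passed_example = {"output": {"text": "(model answer)"}, "guardrailAction": "NONE"}

print(guardrail_intervened(blocked_example), guardrail_intervened(passed_example))
```

With the response from the cell above, this would be `guardrail_intervened(model_response)`.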

<div style="border: 2px solid coral; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px">
    <h4>The Guardrail intervenes when the generated model response is not grounded in the retrieved context</h4>
</div>
<br/>

## Evaluating RAG with Ragas

In [None]:
# External Dependencies:
from datasets import Dataset
from langchain_aws import BedrockEmbeddings, ChatBedrockConverse
from langchain_core.globals import set_verbose, set_debug
import pandas as pd
from ragas import evaluate
from ragas.metrics import answer_relevancy, answer_correctness

# Disable verbose logging
set_verbose(False)

# Disable debug logging
set_debug(False)


llm_for_evaluation = ChatBedrockConverse(model=llm_model_id, client=bedrock_runtime)

bedrock_embeddings = BedrockEmbeddings(
    model_id=embedding_model_id, client=bedrock_runtime
)

In [None]:
test = pd.read_csv("data/bedrock-user-guide-test.csv").dropna()
# Show full cell contents instead of truncating long questions and answers
with pd.option_context("display.max_colwidth", None):
    display(test)

In [None]:
questions = test["Question/prompt"].tolist()
ground_truth = test["Correct answer"].tolist()

answers = []
contexts = []

for query in questions:
    response = ask_bedrock_llm_with_knowledge_base(query, kb_id)
    generatedResult = response["output"]["text"]
    answers.append(generatedResult)

    context = []
    citations = response["citations"]
    for citation in citations:
        retrievedReferences = citation["retrievedReferences"]
        for reference in retrievedReferences:
            context.append(reference["content"]["text"])
    contexts.append(context)

# To dict
data = {
    "question": questions,
    "answer": answers,
    "contexts": contexts,
    "ground_truth": ground_truth,
}

# Convert dict to dataset
dataset = Dataset.from_dict(data)

Let's explore two particular Ragas metrics that we'll use in the next lab:

### answer_relevancy metric

The Answer Relevancy metric assesses how pertinent the generated answer is to the given prompt. Lower scores are assigned to answers that are incomplete or contain redundant information, and higher scores indicate better relevancy. This metric is computed using the user_input, the retrieved_contexts and the response.
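
As we understand the Ragas documentation, the score is obtained by having an LLM generate candidate questions from the answer, then averaging the cosine similarity between the embedding of the original question and the embeddings of those generated questions. The sketch below shows only that final averaging step, using made-up two-dimensional vectors in place of real embeddings. It is an illustration of the idea, not the Ragas implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def answer_relevancy_sketch(question_vector, generated_question_vectors):
    """Mean similarity between the real question and questions implied by the answer."""
    similarities = [
        cosine_similarity(question_vector, v) for v in generated_question_vectors
    ]
    return sum(similarities) / len(similarities)


# Made-up embeddings: the original question, and questions generated from two answers
question_vec = [1.0, 0.0]
on_topic_score = answer_relevancy_sketch(question_vec, [[1.0, 0.0], [0.8, 0.6]])
off_topic_score = answer_relevancy_sketch(question_vec, [[0.0, 1.0], [0.6, 0.8]])
print(round(on_topic_score, 2), round(off_topic_score, 2))
```

An answer that stays on topic implies questions close to the original one, so its score is higher.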

In [None]:
metrics_ar = [answer_relevancy]

result_ar = evaluate(
    dataset=dataset,
    metrics=metrics_ar,
    llm=llm_for_evaluation,
    embeddings=bedrock_embeddings,
    raise_exceptions=False,
)

ragas_df_ar = result_ar.to_pandas()

In [None]:
# Show full cell contents instead of truncating long text columns
with pd.option_context("display.max_colwidth", None):
    display(ragas_df_ar)

### answer_correctness metric

The assessment of Answer Correctness involves gauging the accuracy of the generated answer when compared to the ground truth. This evaluation relies on the ground truth and the answer, with scores ranging from 0 to 1. A higher score indicates a closer alignment between the generated answer and the ground truth, signifying better correctness. Answer correctness encompasses two critical aspects: semantic similarity between the generated answer and the ground truth, as well as factual similarity. 
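
To illustrate how the two aspects might combine, the sketch below computes an F1-style factual score from counts of statements that are supported (TP), unsupported (FP), or missing (FN), then blends it with a semantic similarity score. The 0.75 / 0.25 weighting reflects our understanding of the Ragas defaults and should be treated as an assumption, as should the statement counts and the similarity value, which are invented here.

```python
def factual_f1(true_positives, false_positives, false_negatives):
    """F1-style score over statements in the answer versus the ground truth."""
    denominator = true_positives + 0.5 * (false_positives + false_negatives)
    if denominator == 0:
        return 0.0  # assumption for this sketch: no statements at all scores zero
    return true_positives / denominator


def answer_correctness_sketch(f1, semantic_similarity, weights=(0.75, 0.25)):
    """Weighted blend of factual overlap and semantic similarity (assumed weights)."""
    factual_weight, semantic_weight = weights
    return factual_weight * f1 + semantic_weight * semantic_similarity


# Invented example: 3 supported statements, 1 unsupported, 1 missing from the answer
example_f1 = factual_f1(true_positives=3, false_positives=1, false_negatives=1)
example_score = answer_correctness_sketch(example_f1, semantic_similarity=0.9)
print(round(example_f1, 4), round(example_score, 4))
```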

In [None]:
metrics_ac = [answer_correctness]

result_ac = evaluate(
    dataset=dataset,
    metrics=metrics_ac,
    llm=llm_for_evaluation,
    embeddings=bedrock_embeddings,
    raise_exceptions=False,
)

ragas_df_ac = result_ac.to_pandas()

In [None]:
# Show full cell contents instead of truncating long text columns
with pd.option_context("display.max_colwidth", None):
    display(ragas_df_ac)

### Challenge Exercise :: Try it Yourself!


<div style="border: 4px solid coral; text-align: left; margin: auto;">
    <br>
    <p style="text-align: center; margin: auto;"><b>Try the following exercises in this lab and note the observations.</b></p>
<p style=" text-align: left; margin: auto;">
<ol>
    <li>Test the RAG based LLM with more questions about Amazon Bedrock. </li>
<li>Look at the citations or retrieved references and see if the answer generated by the RAG chatbot aligns with these retrieved contexts. What response do you get when the retrieved context comes up empty? </li>
<li>Apply system prompts to RAG as well as Amazon Bedrock Guardrails, and test which is more consistent in blocking responses when the model response is hallucinated. </li>
<li>Run the tutorial for RAG Checker and compare the difference with RAGAS evaluation framework: https://github.com/amazon-science/RAGChecker/blob/main/tutorial/ragchecker_tutorial_en.md </li>
</ol>
<br>
</p>
</div>


## Conclusion

We now have an understanding of parameters which influence hallucinations in Large Language Models. We learnt how to set up Retrieval Augmented Generation to provide context to the model while answering.
We used contextual grounding checks in Amazon Bedrock Guardrails to intervene when hallucinations are detected.
Finally, we looked into Ragas metrics and how to use them to measure hallucinations in your RAG-powered chatbot.

To explore further, check out the [../bedrock-agent-self-reflection](../bedrock-agent-self-reflection/) sample in which we'll:
1. Build a custom hallucination detector
2. Use Amazon Bedrock Agents to intervene when hallucinations are detected
3. Call a human for support when the LLM hallucinates

## Clean-up

Once you're done experimenting, remember to clean up created AWS resources in order to avoid ongoing costs.

⚠️ **Note:** The following will be re-used in the `bedrock-agent-self-reflection` sample, so don't clean them up just yet if you're about to explore that!

1. The Knowledge Base we deployed via AWS CloudFormation can be deleted by deleting the stack you provisioned from the [CloudFormation Console](https://console.aws.amazon.com/cloudformation/home?#/stacks).
2. The Amazon Bedrock Guardrail we created can be deleted either from the [Bedrock Guardrails Console](https://console.aws.amazon.com/bedrock/home?#/guardrails) or by un-commenting and running the code cell below:

In [None]:
# bedrock_client.delete_guardrail(guardrailIdentifier=guardrailId)