# Part 3: Responsible answers with Retrieval Augmented Generation (RAG)

> *This notebook should work well in the `Data Science 3.0` kernel on Amazon SageMaker Studio. It requires Python v3.10+*

In this notebook we'll explore how Amazon Bedrock LLMs can be integrated with trusted text data sources to deliver more reliable answers, and show some ways you can **evaluate and measure** systems like these using [LlamaIndex](https://gpt-index.readthedocs.io/en/stable/).

---

## Context: Understanding Retrieval-Augmented Generation

Because Large Language Models are trained to generate likely-seeming responses to initial inputs, they're sometimes prone to ["hallucinate"](https://en.wikipedia.org/wiki/Hallucination_(artificial_intelligence)) answers that sound coherent and even confident, but are factually incorrect or even self-contradictory.

When we prompt an LLM with a contextless question like *"Who is the president of the US?"*, we're asking it to recall whatever relevant facts it might've learned during training and create an answer from scratch, which makes incorrect "hallucinations" much more likely.

When the answer is mostly evident from the prompt itself, such hallucinations are rarer because the model can directly reference the input data when forming its response. For example: *If five cats can catch five mice in five minutes, how long will it take one cat to catch one mouse?*

...But we can't pack the prompt with **ALL** the information a bot might need to answer every possible question... So what's the solution?

With the **Retrieval-Augmented Generation** pattern (as shown in the diagram below), we:

- Use a **search engine** to **retrieve the most relevant** document snippets from a reference corpus, based on the user's question
- Push the original question and those few snippets **together** into the LLM prompt
- Ask the LLM to answer the question based on the provided sources, or else say it doesn't know

Some solutions might add an additional LLM call to transform the user's raw question into an optimized search query, to improve retrieval performance.

![](imgs/rag-flow-dark.png "Flow diagram of a RAG system in which the user's question is used to search a corpus, and then those search results are fed together with the user question into a LLM to generate a response")
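
The three steps above can be sketched in a few lines of plain Python. This is only an illustration of the pattern, not the implementation used later in this notebook: the corpus is made-up text, `retrieve` uses crude word overlap instead of a real search engine, and `llm` is a hypothetical callable standing in for a model call.

```python
import re
from typing import Callable, List

# Toy reference corpus: made-up snippets, purely for illustration
CORPUS = [
    "The office is open from 9am to 5pm on weekdays.",
    "Parking permits are issued by the facilities team.",
    "Expense reports are due on the last Friday of each month.",
]


def words(text: str) -> set:
    """Lower-case a string and split it into a set of words."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(question: str, corpus: List[str], top_k: int = 2) -> List[str]:
    """Step 1: rank snippets by word overlap with the question (a stand-in for real search)."""
    return sorted(corpus, key=lambda s: len(words(question) & words(s)), reverse=True)[:top_k]


def build_prompt(question: str, snippets: List[str]) -> str:
    """Step 2: combine the question and the retrieved snippets in one prompt."""
    sources = "\n".join(f"- {s}" for s in snippets)
    return (
        "Answer the question using only the sources below. "
        "If the sources don't contain the answer, say you don't know.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}\nAnswer:"
    )


def answer(question: str, corpus: List[str], llm: Callable[[str], str]) -> str:
    """Step 3: ask the (hypothetical) LLM callable to answer from the provided sources."""
    return llm(build_prompt(question, retrieve(question, corpus)))


question = "When are expense reports due?"
prompt = build_prompt(question, retrieve(question, CORPUS))
print(prompt)
```

Swapping `retrieve` for a semantic search index and `llm` for a Bedrock model call gives the shape of the pipeline built in the rest of this notebook.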

---

## Initial setup

To start exploring RAG patterns with a practical example, we'll first install some libraries that might not be present in the default notebook kernel image:

- Amazon Bedrock [became generally available](https://aws.amazon.com/about-aws/whats-new/2023/09/amazon-bedrock-generally-available/) in September 2023, so we need new-enough versions of the AWS Python SDKs `boto3` and `botocore` to be able to call the service
- [LangChain](https://python.langchain.com/docs/get_started/introduction) is an open-source framework for orchestrating common LLM patterns, which we'll use to simplify the code examples instead of building from basic Bedrock SDK calls
- [LlamaIndex](https://gpt-index.readthedocs.io/en/stable/) is an open-source framework to help integrate LLMs with trusted data sources, and measure the performance of data-connected LLM use-cases
- [pypdf](https://pypdf.readthedocs.io/en/stable/) will enable us to read PDF document text from Python

In [None]:
%pip install --quiet \
    "boto3>=1.28.63,<2" \
    "botocore>=1.31.63,<2" \
    langchain==0.0.337 \
    llama-index==0.9.4 \
    pypdf==3.17.1

With the installs done, we'll load some libraries and initial setup that'll be useful later:

In [None]:
# Python Built-Ins:
import os  # For dealing with folder paths
from urllib.request import urlretrieve  # For fetching data from the web
from typing import Tuple, List  # Type annotations for easier debugging

# External Dependencies:
import boto3  # AWS SDK for Python
import nest_asyncio  # Needed for some asyncio-based libs to work in Jupyter notebooks
import pandas as pd  # For processing and displaying tabular data (dataframes)

nest_asyncio.apply()  # Enable asyncio-based libs to work properly in this notebook

---

## Download and pre-process documents with Titan Text Embeddings and LlamaIndex

The general RAG pattern can be implemented using pretty much any search engine, including fully-managed options like [Amazon Kendra](https://aws.amazon.com/blogs/machine-learning/quickly-build-high-accuracy-generative-ai-applications-on-enterprise-data-using-amazon-kendra-langchain-and-large-language-models/) and [Amazon OpenSearch](https://aws.amazon.com/blogs/machine-learning/build-a-powerful-question-answering-bot-with-amazon-sagemaker-amazon-opensearch-service-streamlit-and-langchain/). Since response quality is strongly dependent on the search result quality, it's often better to use **semantic search** tools than traditional keyword-based engines.

In this example notebook, we'll create an in-memory semantic search index using:

- [Amazon Titan Embeddings](https://aws.amazon.com/about-aws/whats-new/2023/09/amazon-titan-embeddings-generally-available/) on Amazon Bedrock, as a model to convert text of documents and user queries into numerical "embedding" vectors.
- LlamaIndex [VectorStoreIndex](https://gpt-index.readthedocs.io/en/stable/module_guides/indexing/vector_store_guide.html), to index the generated document vectors in-memory and retrieve the most similar documents for incoming queries/questions.
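
To make "retrieve the most similar documents" concrete, here is a minimal sketch of similarity ranking over embedding vectors. The 3-dimensional vectors are invented for illustration (real embedding models return vectors with far more dimensions), and cosine similarity is assumed here as the comparison measure, which is a common choice for semantic search.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors: close to 1.0 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings for two document chunks and one user query
doc_vectors: Dict[str, List[float]] = {
    "doc_a": [0.9, 0.1, 0.8],
    "doc_b": [0.0, 1.0, 0.0],
}
query_vector = [1.0, 0.0, 1.0]

# Rank document IDs from most to least similar to the query
ranking = sorted(
    doc_vectors,
    key=lambda doc_id: cosine_similarity(query_vector, doc_vectors[doc_id]),
    reverse=True,
)
print(ranking)
```

A vector index performs the same kind of comparison, but over every stored chunk and with data structures that keep the lookup fast.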

### Download sample documents

In this example we'll use just a single document for our RAG corpus: Amazon's 2022 annual letter to shareholders. Since the document itself is quite long, it'll still end up being split into multiple separate entries in the search index.

First, run the cell below to download the file(s) locally:

In [None]:
# Configure the local folder to use and the data files to fetch:
DATA_ROOT = "./data"
URL_FILENAME_MAP = {
    "https://s2.q4cdn.com/299287126/files/doc_financials/2023/ar/2022-Shareholder-Letter.pdf": \
        "2022-Shareholder-Letter.pdf"
}

# Create the local folder and download the data files:
urls = [k for k in URL_FILENAME_MAP.keys()]
filenames = [URL_FILENAME_MAP[k] for k in urls]

os.makedirs(DATA_ROOT, exist_ok=True)
for url, filename in URL_FILENAME_MAP.items():
    urlretrieve(url, os.path.join(DATA_ROOT, filename))

Then, we can initially read the PDF files using LlamaIndex:

In [None]:
from llama_index import SimpleDirectoryReader

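# Build the input paths from the download configuration above, rather than hard-coding them:
docs = SimpleDirectoryReader(
    input_files=[os.path.join(DATA_ROOT, filename) for filename in filenames]
).load_data()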

### Split and vectorize the documents

Text vectorization models typically place an upper limit on the length of text they can process as a single item. We'll also want each search result to be reasonably short, because the results will later be embedded in the prompt for the answer-generation LLM.

Because of this, we'll need to **split** our source document(s) into shorter passages for indexing. LlamaIndex's [TokenTextSplitter](https://gpt-index.readthedocs.io/en/latest/api/llama_index.node_parser.TokenTextSplitter.html) provides a utility for this:

In [None]:
from llama_index.text_splitter import TokenTextSplitter

text_splitter = TokenTextSplitter(
    separator=" ", chunk_size=512, chunk_overlap=102
)
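
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch of overlapping chunking. It is not how `TokenTextSplitter` is implemented: it assumes the text has already been split into a list of tokens, and ignores the separator handling that the real splitter performs.

```python
from typing import List


def split_with_overlap(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Slide a window of `chunk_size` tokens, stepping by `chunk_size - chunk_overlap`."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # The final window already reached the end of the text
    return chunks


tokens = [f"t{i}" for i in range(10)]
for chunk in split_with_overlap(tokens, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Each chunk repeats the last token of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk. With the settings above, the overlap of 102 tokens is roughly 20% of the 512-token chunk size.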

To convert each document chunk into a single vector, we'll use the Amazon Titan Embeddings model.

LlamaIndex supports LangChain-based models via the `LangchainEmbedding` class, and LangChain supports Bedrock, so we can simply chain the two together:

In [None]:
from langchain.embeddings.bedrock import BedrockEmbeddings
from llama_index.embeddings import LangchainEmbedding

# Text-to-vector model used to map document chunks or user queries to numeric vectors:
embed_model = LangchainEmbedding(
    BedrockEmbeddings(model_id="amazon.titan-embed-text-v1")
)

With the splitting and vectorization settings defined, we're ready to define and run a LlamaIndex `IngestionPipeline` to ingest the data:

In [None]:
from llama_index.ingestion import IngestionPipeline

pipeline = IngestionPipeline(transformations=[text_splitter, embed_model])
doc_nodes = pipeline.run(documents=docs)

print(f"Ingested {len(doc_nodes)} chunks from {len(docs)} source docs")

In [None]:
doc_nodes[0].metadata

---

## Create and test the query engine

With the chunking and vectorization complete, we're ready to index the data into a queryable store.

Since the end-to-end querying will also include *generating* the text answer from the retrieved documents, we'll need to define our **text generation model** configuration here too:

In [None]:
from langchain.llms.bedrock import Bedrock # required until llama_index offers direct Bedrock integration
from llama_index import ServiceContext, set_global_service_context, VectorStoreIndex


model_kwargs_claude = {
    "temperature": 0,
    "top_k": 10,
    "max_tokens_to_sample": 3000,  # Max response length
}

# Text-to-text model used to formulate final answer from search results:
llm = Bedrock(model_id="anthropic.claude-instant-v1", model_kwargs=model_kwargs_claude)

service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model,
    chunk_size=512,
)
set_global_service_context(service_context)

vector_index = VectorStoreIndex(
    nodes=doc_nodes,
    service_context=service_context,
)
query_engine = vector_index.as_query_engine(
    similarity_top_k=5,  # The top k=5 search results will be fed through to the LLM prompt
)

You can store the created index to the local file system, in case you'd like to re-load it into memory in future rather than re-creating it from scratch:

In [None]:
os.makedirs("./indices", exist_ok=True)
vector_index.storage_context.persist("./indices/amazon-shareholder-letters")

Now let's test our end-to-end RAG-powered query engine with an example question, for which the answer should be present in the [source document](data/2022-Shareholder-Letter.pdf):

In [None]:
# In-context question:
query = "What is Amazon's investment strategy in generative AI?"

response = query_engine.query(query)
print(response)

You should see a correct and relevant response! But of course it can't answer ***every*** question...

In [None]:
# Out-of-context question:
print(query_engine.query("What new features will AWS launch at Re:Invent 2040?"))

In [None]:
# Challenging cross-context question:
print(
    query_engine.query(
        "How many months of RxPass membership could I buy for the same cost as the minimum "
        "grocery order that'd qualify for free delivery?"
    )
)
# ($35 grocery threshold per page 2; RxPass = $5/mo per page 5; therefore the answer is 7)
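
The expected answer to this last question can be checked by hand. The short calculation below uses only the two figures quoted in the code comment above, and shows why the question is hard for a RAG system: it needs facts from two different passages plus a division.

```python
# Figures as quoted in the comment above (taken from pages 2 and 5 of the letter)
grocery_free_delivery_minimum_usd = 35
rxpass_monthly_fee_usd = 5

# Whole months of RxPass that the grocery minimum would pay for
months = grocery_free_delivery_minimum_usd // rxpass_monthly_fee_usd
print(months)
```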

---

## Automated, end-to-end RAG pipeline evaluation with LlamaIndex evaluators

As shown above, although RAG solutions are powerful they can still fail to answer questions for a variety of reasons, including:

- The corpus might not contain any relevant documents to answer the user's question
- The search engine could fail to return the correct/relevant snippets to build an answer
- The LLM could fail to compose a useful and correct answer from the (correct) retrieved snippets

Therefore to quantify the quality and robustness of the solution, we'll need to take a **data-driven** approach and will be interested in **multiple metrics**.

Although **human evaluation** would provide a useful gold-standard, the effort required is not as scalable as we'd like for large datasets and frequent system updates. Instead, [LlamaIndex's evaluation module](https://gpt-index.readthedocs.io/en/latest/optimizing/evaluation/evaluation.html) provides automated tools that **use LLMs to judge** result quality.

In the sections below, we'll show 4 automated evaluations available in the tool:

1. **Faithfulness**: Whether the final response is in agreement with (doesn't contradict) the retrieved document snippets.
2. **Relevancy**: Whether the response and retrieved content were relevant to the query.
3. **Correctness**: Whether the generated answer is relevant to, and agrees with, a reference answer.
4. **Guidelines**: Evaluating a system against customisable, user-specified guidelines.
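
The general "LLM as judge" idea behind these evaluators can be sketched as follows. This is a simplified illustration and not LlamaIndex's actual prompt or parsing logic: `JUDGE_TEMPLATE`, `JudgeResult` and `judge_faithfulness` are hypothetical names, and `llm` is any callable that maps a prompt string to a text completion.

```python
from typing import Callable, List, NamedTuple


class JudgeResult(NamedTuple):
    passing: bool  # True if the judge answered YES
    feedback: str  # The judge's raw text, kept for inspection


# Hypothetical judging prompt, written for this sketch only
JUDGE_TEMPLATE = (
    "Is the statement below supported by the context? "
    "Start your answer with YES or NO, then explain briefly.\n\n"
    "Context:\n{context}\n\nStatement:\n{statement}\n"
)


def judge_faithfulness(response: str, contexts: List[str], llm: Callable[[str], str]) -> JudgeResult:
    """Ask a judge LLM whether `response` is supported by the retrieved `contexts`."""
    prompt = JUDGE_TEMPLATE.format(context="\n---\n".join(contexts), statement=response)
    raw = llm(prompt).strip()
    return JudgeResult(passing=raw.upper().startswith("YES"), feedback=raw)


# Stub judge so the sketch runs without calling a real model
result = judge_faithfulness(
    response="Reports are due on Friday.",
    contexts=["Expense reports are due on the last Friday of each month."],
    llm=lambda prompt: "YES - the context states the Friday deadline.",
)
print(result.passing, result.feedback)
```

Because the verdict comes from a model, it can itself be wrong, which is why the evaluators return free-text feedback alongside the pass/fail flag.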

### 1. Faithfulness to source documents

The **Faithfulness** metric compares the final generated response with the source document snippets that were retrieved from search, and is useful for checking whether the generative model introduced any inconsistencies or **hallucinations**.

![](imgs/rag-eval-flow-faithfulness.png "Flow diagram: After retrieving relevant content & generating RAG response, responses are evaluated for how they match to the retrieved documents")

In [None]:
from llama_index.evaluation import FaithfulnessEvaluator

# We'll be using our 'successful' question and response from earlier:
print(query)
print(response)
print("----------------")

faithfulness_evaluator = FaithfulnessEvaluator(service_context=service_context)
eval_result = faithfulness_evaluator.evaluate_response(response=response)

print("Did test pass:", eval_result.passing)
print(f"Test feedback:\n{eval_result.feedback}")

### 2. Relevancy to the user question

The **Relevancy** metric checks whether the response (and retrieved source documents) are actually relevant to the user's question. This is useful for measuring if the query was actually answered by the response.

![](imgs/rag-eval-flow-relevancy.png "Flow diagram: After retrieving relevant content & generating RAG response, responses are evaluated for how they match to the original query")

In [None]:
from llama_index.evaluation import RelevancyEvaluator

relevancy_evaluator = RelevancyEvaluator(service_context=service_context)
eval_result = relevancy_evaluator.evaluate_response(query=query, response=response)

print("Did test pass:", eval_result.passing)
print(f"Test feedback:\n{eval_result.feedback}")

#### Exploring divergence between faithfulness and relevancy

For an example of how **relevancy** and **faithfulness** can differ, consider the below question that isn't answered by any content in the source corpus:

In [None]:
# Out-of-context question:
ooc_query = "What is the generative ai strategy for other cloud providers?"
ooc_response = query_engine.query(ooc_query)
print(ooc_response)

Although the corpus did not contain information to answer the question, the generated response is still on-topic to what was originally asked... Therefore the relevancy test should **PASS**:

In [None]:
# Evaluate relevance of result to the original question:
relevancy_evaluator = RelevancyEvaluator(service_context=service_context)
eval_result = relevancy_evaluator.evaluate_response(query=ooc_query, response=ooc_response)

print("Did relevancy test pass:", eval_result.passing)
print(f"Test feedback:\n{eval_result.feedback}")

However, in this case the faithfulness test should **FAIL**, because the retrieved content doesn't actually cover what was asked and so can't fully support the generated response:

In [None]:
# Evaluate faithfulness of response to retrieved content:
faithfulness_evaluator = FaithfulnessEvaluator(service_context=service_context)
eval_result = faithfulness_evaluator.evaluate_response(response=ooc_response)

print("Did faithfulness test pass:", eval_result.passing)
print(f"Test feedback:\n{eval_result.feedback}")

### 3. Correctness by matching to a reference answer

The **Correctness** metric requires a correct reference answer to be provided for the question, and compares whether the generated response agrees with this target answer. It's useful for a 'ground truth' perspective in cases where questions have a single canonical answer and those target answers have been collected already.

![](imgs/rag-eval-flow-correctness.png "Flow diagram: After retrieving relevant content & generating RAG response, responses are evaluated against a reference answer")

Below we'll run a batch of questions through all three evaluators discussed above (Faithfulness, Relevancy, and Correctness) to assess the overall performance of our RAG application.

First, we'll define the dataset and the utility function to perform the evaluation:

In [None]:
from llama_index.evaluation import CorrectnessEvaluator

eval_question_answer_pair = [
    ("What is the Amazon policy for return to work after the pandemic?",
        "Amazon has asked corporate employees to come back to office at least three days a week beginning May 2023."),
    ("What changes did Amazon make to overcome the challenges related to increasing cost in Stores fulfillment network?",
        "During the early pandemic, with physical stores shut, our consumer business grew extraordinarily, with revenue increasing from $245B in 2019 to $434B in 2022. This growth meant doubling our fulfillment center footprint built over 25 years and substantially accelerating a last-mile transportation network now the size of UPS in about two years - no easy feat thanks to hundreds of thousands of Amazonians. However, with that rate and scale of change, much optimization was needed to yield intended productivity; over recent months, we scrutinized every process path in fulfillment centers and transportation, redesigning many processes and mechanisms, resulting in steady gains and cost reductions the last few quarters. We also took this occasion to make larger structural changes setting us up for lower costs and faster speed for years, like reevaluating our US fulfillment network organization. Until recently, Amazon operated one national network distributing inventory from fulfillment centers nationwide, but as this expanded to hundreds more nodes, connecting them efficiently became more complex. Last year, we started re-architecting our inventory placement strategy by leveraging our larger footprint to move from a national to a regionalized network model with eight interconnected regions operating largely self-sufficiently while still shipping nationally when necessary. We also continue improving our algorithms to predict regional inventory needs and have completed this regional rollout with early results like shorter travel distances, lower costs, less environmental impact, and faster customer delivery. Overall, we remain confident about our plans to lower costs, reduce delivery times, and build a meaningfully larger retail business with healthy margins."),
    ("By what percentage did AWS revenue grow year-over-year in 2022?",
        "AWS revenue grew 29% year-over-year ('YoY') in 2022 on a $62B revenue base."),
    ("Approximately how many new features and services did AWS launch in 2022 according to the passage?",
        "AWS launched over 3,300 new features and services in 2022"),
    ("Compared to Graviton2 processors, what performance improvement did Graviton3 chips deliver according to the passage?",
        "In 2022, AWS delivered their Graviton3 chips, providing 25% better performance than the Graviton2 processors."),
    ("Which was the first inference chip launched by AWS according to the passage?",
        "AWS launched their first inference chips (“Inferentia”) in 2019, and they have saved companies like Amazon over a hundred million dollars in capital expense."),
    ("What kind of throughput and latency improvements does the new Inferentia2 chip offer compared to the original Inferentia chip?",
        "Inferentia2 chip, launched by AWS, offers up to four times higher throughput and ten times lower latency than our first Inferentia processor."),
    ("According to the passage, what percentage of Amazon's unit sales today comes from third-party sellers?",
        "Today, Amazon sells nearly every physical and digital retail item, with a vibrant third-party seller ecosystem that accounts for 60% of their unit sales."),
    ("According to the context, in what year did Amazon's annual revenue increase from $245B to $434B?",
        "Amazon's annual revenue increased from $245B in 2019 to $434B in 2022."),
]

faithfulness_evaluator = FaithfulnessEvaluator(service_context=service_context)
relevancy_evaluator = RelevancyEvaluator(service_context=service_context)
correctness_evaluator = CorrectnessEvaluator(service_context=service_context)


def run_evals(qa_pairs: List[Tuple[str, str]], query_engine):
    """Loop through a Q&A dataset to run a batch evaluation with LlamaIndex"""
    results_list = []
    for question, reference_answer in qa_pairs:
        response = query_engine.query(question)
        generated_answer = str(response)
        correctness_results = correctness_evaluator.evaluate(
            query=question,
            response=generated_answer,
            reference=reference_answer
        )
        faithfulness_results = faithfulness_evaluator.evaluate_response(response=response)
        relevancy_results = relevancy_evaluator.evaluate_response(query=question, response=response)
        cur_result_dict = {
            "query": question,
            "generated_answer": generated_answer,
            "correctness": correctness_results.passing,
            "correctness_feedback": correctness_results.feedback,
            "correctness_score": correctness_results.score,
            "faithfulness": faithfulness_results.passing,
            "faithfulness_feedback": faithfulness_results.feedback,
            "faithfulness_score": faithfulness_results.score,
            "relevancy": relevancy_results.passing,
            "relevancy_feedback": relevancy_results.feedback,
            "relevancy_score": relevancy_results.score
        }
        results_list.append(cur_result_dict)
    evals_df = pd.DataFrame(results_list)
    return evals_df

Then, we can run the evaluation on the dataset and visualize the results in a dataframe:

> ⏰ **Note:** This batch evaluation will make several LLM calls for each Q&A pair in the dataset, so may take a couple of minutes to run.

In [None]:
%%time

evaluation_results = run_evals(eval_question_answer_pair, query_engine)

# These columns hold the pass/fail flags, so their mean is the fraction of questions that passed:
print(f"""Correctness pass rate: {evaluation_results.correctness.mean()}
Faithfulness pass rate: {evaluation_results.faithfulness.mean()}
Relevancy pass rate: {evaluation_results.relevancy.mean()}""")

evaluation_results
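
When summarizing the `correctness`, `faithfulness` and `relevancy` pass/fail columns, it can help to guard against missing verdicts. The sketch below assumes an evaluator may report `passing=None` when it cannot parse the judge model's output (an assumption to verify against your LlamaIndex version), and `pass_rate` is a hypothetical helper that excludes such rows from the average.

```python
from typing import List, Optional


def pass_rate(outcomes: List[Optional[bool]]) -> Optional[float]:
    """Fraction of True values among the outcomes that were actually judged (not None)."""
    judged = [outcome for outcome in outcomes if outcome is not None]
    if not judged:
        return None  # Nothing was judged, so no rate can be reported
    return sum(judged) / len(judged)


# Made-up example rows in the same shape as the results built above
example_rows = [
    {"correctness": True, "faithfulness": True, "relevancy": True},
    {"correctness": False, "faithfulness": True, "relevancy": None},
    {"correctness": True, "faithfulness": False, "relevancy": True},
]

summary = {
    metric: pass_rate([row[metric] for row in example_rows])
    for metric in ("correctness", "faithfulness", "relevancy")
}
print(summary)
```

Reporting how many rows were skipped alongside each rate is also worthwhile, since a high pass rate over very few judged rows says little.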

### 4. Guideline evaluation of prompt completions using LlamaIndex

The **Guidelines** evaluator, rather than implementing a fixed metric, provides functionality for you to specify your own evaluation criteria in natural language. This is useful for implementing additional checks for the business' specific concerns.

![](imgs/rag-eval-flow-guidelines.png "Flow diagram: After retrieving relevant content & generating RAG response, responses are evaluated with respect to a set of guidelines defined in natural language.")

First, we create a `GuidelineEvaluator` for each guideline to be checked:

In [None]:
from llama_index.evaluation import GuidelineEvaluator

GUIDELINES = [
    "The response should fully answer the query.",
    "The response should avoid being vague or ambiguous.",
    "The response should not use toxic or profane language.",
    "The response should not be biased or discriminatory.",
    "The response should be specific and use statistics or numbers when possible.",
]

evaluators = [
    GuidelineEvaluator(service_context=service_context, guidelines=guideline)
    for guideline in GUIDELINES
]

Next, let's define which question/response pair we're going to test:

In [None]:
query = "What is Amazon's generative AI strategy?"
response = query_engine.query(query)
print(response)

Finally, we can loop through the guidelines to test and show the result for each one:

In [None]:
for guideline, evaluator in zip(GUIDELINES, evaluators):
    eval_result = evaluator.evaluate_response(
        query=query,
        response=response,
    )
    print("================")
    print(f"Guideline: {guideline}")
    print(f"Pass: {eval_result.passing}")
    print(f"Feedback: {eval_result.feedback}\n")

---

## Summary

In this notebook we introduced how the **Retrieval-Augmented Generation** pattern can be applied to generate more reliable answers and reduce hallucinations, by grounding LLM responses in dynamically-retrieved data from trusted sources.

As demonstrated through this series of notebooks, responsible and effective application of text generation models requires mitigating and monitoring several different risks: From toxicity and hallucination, to potential bias and hijacking. We've explored a range of tools available to tackle these concerns - including prompt engineering approaches, additional guardrail models, and retrieval-augmented generation.

To ensure robustness, businesses will typically need to take a data-driven approach and evaluate solutions across this range of criteria. Although a level of human evaluation will be important to build confidence, **automated evaluation techniques** as shown here can help to scale this monitoring more effectively: Especially when experimenting with different models, prompt templates, and other configurations.

In fact, LLMs can even be used to propose test cases as in the example below:

```python
# DO NOT RUN IN LAB SETTING
# from llama_index.evaluation import DatasetGenerator
#
# data_generator = DatasetGenerator.from_documents(docs)
# eval_questions = data_generator.generate_questions_from_nodes()
# eval_questions
```

So although LLM use-cases raise new governance concerns, they can also provide new tools to help tackle them. By proactively managing risks and scaling tests through automation, businesses can build robust solutions and deploy them with confidence.