<p style="padding: 10px; border: 1px solid black;">
<img src="images/MLU-NEW-logo.png" alt="drawing" width="400"/> <br/>
</p>


# <a name="0">MLU Advanced Prompt Engineering for LLMs</a>
## <a name="0">Lab 3: Retrieval Augmented Generation (RAG)</a>

This notebook introduces a prompting strategy that helps address both LLM hallucinations and outdated training data. Generative AI models are limited by their training data and often fail to generate accurate output about recent trends or events. If the LLM lacks factual knowledge, its responses will be mixed at best and problematic at worst, potentially giving rise to hallucinations.

Retrieval augmented generation (RAG) can be a solution. RAG is a strategy that pairs information retrieval with designed system prompts to anchor LLMs on precise, up-to-date, and pertinent information retrieved from an external knowledge store. Prompting LLMs with contextual knowledge makes it possible to create domain-specific applications that require an evolving understanding of facts, despite LLM training data remaining static.

1. <a href="#1">Import libraries</a>
2. <a href="#2">Set up Bedrock for inference</a>
3. <a href="#3">Retrieval Augmented Generation (RAG)</a>
    - <a href="#31">Prompting without retriever</a>
    - <a href="#32">Load relevant documents</a>
    - <a href="#33">Split documents into chunks</a>
    - <a href="#34">Embeddings and vector databases</a>
    - <a href="#35">Prompting with RAG</a>
4. <a href="#4">Evaluation of results</a>

Please work through this notebook from top to bottom and don't skip sections, as this could lead to error messages due to missing code.

---

<br/>
You will be presented with coding activities to check your knowledge and understanding throughout the notebook whenever you see the MLU robot:

<img style="display: block; margin-left: auto; margin-right: auto;" src="./images/activity.png" alt="Activity" width="125"/>

### <a name="1">1. Import libraries</a>
(<a href="#0">Go to top</a>)

Let's start by installing all required packages as specified in the `requirements.txt` file and importing several libraries.

In [2]:
!pip install -q -U pip --root-user-action=ignore
!pip3 install -q -r requirements.txt --root-user-action=ignore

In [3]:
import re
import boto3
import json
import random
import warnings

import pandas as pd
from sklearn.metrics import accuracy_score

from IPython.display import Markdown


### <a name="2">2. Set up Bedrock for inference</a>
(<a href="#0">Go to top</a>)

To get started, set up Bedrock and instantiate a `bedrock-runtime` client to query LLMs.

<div style="border: 4px solid coral; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px">
    <h4>A note on Bedrock API invocation</h4>

Amazon Bedrock is generally available for all AWS customers. The provided lab for this course currently invokes Bedrock's external endpoint, which has access to Anthropic and AI21 models.

For training and learning purposes, Amazon Bedrock is also available through an internal endpoint that allows accessing Amazon's Titan models. To use that instead, please follow the instructions in the <a href="https://w.amazon.com/bin/view/AmazonBedrock/Products/GetStarted/">Amazon Bedrock Get Started wiki page</a>.
</div>
<br/>

In this example we will use **[Claude Instant](https://aws.amazon.com/bedrock/claude/)**, Anthropic's faster, lower-priced yet very capable text generation model.

Notice that we can set parameters for the inference, such as: 
 - `max_tokens_to_sample`: controls the maximum number of tokens in the generated response.
 - `temperature`: controls the randomness of the generated output. This parameter is between zero and one. When set closer to zero, the model tends to select higher probability words. When set further away from zero, the model may select lower-probability words.
 - `stop_sequences`: sequences of characters that signal the model to stop generating output.

We set the temperature to 0 to minimize randomness in the generated output.
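
To make the effect of `temperature` more concrete, here is a minimal sketch of the common formulation in which token scores are divided by the temperature before a softmax turns them into probabilities. The function name `softmax_with_temperature` and the three scores are hypothetical, and this is not a description of how Bedrock or Claude implement sampling internally. In this formulation a temperature of exactly 0 cannot be divided by, so it is usually treated as "always pick the most likely token"; the sketch only accepts positive values.

```python
import math


def softmax_with_temperature(scores, temperature):
    """Turn raw token scores into probabilities, scaled by temperature."""
    if temperature <= 0:
        raise ValueError("this sketch only supports temperature > 0")
    scaled = [s / temperature for s in scores]
    # Subtract the maximum before exponentiating for numerical stability
    highest = max(scaled)
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate tokens
scores = [2.0, 1.0, 0.5]

for temperature in (0.1, 0.5, 1.0):
    probs = softmax_with_temperature(scores, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At low temperature almost all probability mass sits on the highest-scoring token, while at temperature 1.0 the lower-scoring tokens keep a noticeable share.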

In [3]:
from langchain.llms.bedrock import Bedrock

# Define the bedrock-runtime client that will be used for inference
bedrock_runtime = boto3.client(service_name="bedrock-runtime")

# Define the model name and parameters for inference
bedrock_model_id = "anthropic.claude-instant-v1"

# each model has a different set of inference parameters
inference_modifier = {
    "max_tokens_to_sample": 512, 
    "temperature": 0.0,
    # Use real newline characters so the stop sequence matches the "\n\nHuman:" prompt format
    "stop_sequences": [
      "\n\nHuman:"
    ]
}

# Load the Bedrock langchain module with the selected bedrock model
llm = Bedrock(
    model_id=bedrock_model_id, client=bedrock_runtime, model_kwargs=inference_modifier
)

Next, use Bedrock for inference to test that everything works as expected:

In [4]:
Markdown(llm("\n\nHuman: Are you ready to answer some questions?\n\nAssistant:"))

 Yes, I'll do my best to answer your questions!

### <a name="3">3. Retrieval Augmented Generation (RAG)</a>
(<a href="#0">Go to top</a>)

Retrieval Augmented Generation (RAG) is an advanced prompting technique that combines text generation with **information retrieval to enhance the performance and contextual understanding** of large language models. RAG doesn't rely solely on the pre-trained knowledge of the underlying LLM; instead, it leverages a retriever to fetch relevant information from a corpus of text. The retrieved information is used to augment the generation process, producing informative and factual responses.

For this example we will use a set of compiled questions and answers about re:Invent 2022. AWS re:Invent is a learning conference hosted by AWS for the global cloud computing community. The event features keynote announcements, training and certification opportunities, access to more than 2,000 technical sessions, a partner expo, after-hours events, and more. 

You will see how to use a Bedrock-hosted LLM to answer questions about re:Invent 2022. Then you will learn how to incorporate a retriever to enhance the information that the LLM is initially not able to provide.

We first load a file that contains questions and answers about re:Invent 2022.

In [5]:
df = pd.read_csv("data/reinvent_qa.csv", sep=";")

with pd.option_context("display.max_rows", None):
    with pd.option_context("display.max_colwidth", None):
        display(df.head())

Unnamed: 0,Question,Answer
0,What city was AWS re:Invent 2022 held in?,Las Vegas
1,When did AWS re:Invent 2022 take place?,"November 28 to December 2, 2022"
2,How many years has AWS re:Invent been running?,11 years
3,How many people attended re:Invent 2022 in person?,"Over 51,000"
4,How many keynotes were featured at re:Invent 2022?,5 keynotes


#### <a name="31">3.1. Prompting without retriever</a>
(<a href="#0">Go to top</a>)

We now prompt Claude Instant with a template for question answering and apply it to all questions in our dataframe. The model does not know the answer to most questions.

In [6]:
from langchain.prompts import PromptTemplate

template = """

Human: Answer the question below.
Keep your response as precise as possible and limit it to a few words. 
If you don't know the answer, respond "I don't know".

Here is the question: 
{question}

Assistant:"""


def answer_question_claude(row):
    prompt_message = PromptTemplate.from_template(template).format(
        question=row.Question
    )
    answer = llm(prompt_message)
    return answer.strip()


df["claude_answer"] = df.apply(answer_question_claude, axis=1)

with pd.option_context("display.max_rows", None):
    with pd.option_context("display.max_colwidth", None):
        display(df.head())

Unnamed: 0,Question,Answer,claude_answer
0,What city was AWS re:Invent 2022 held in?,Las Vegas,Las Vegas
1,When did AWS re:Invent 2022 take place?,"November 28 to December 2, 2022","November 28-December 2, 2022."
2,How many years has AWS re:Invent been running?,11 years,17 years
3,How many people attended re:Invent 2022 in person?,"Over 51,000","Over 40,000"
4,How many keynotes were featured at re:Invent 2022?,5 keynotes,Three


#### <a name="32">3.2. Load relevant documents</a>
(<a href="#0">Go to top</a>)


A way to incorporate current information into the model without fine-tuning it is to enrich the answer generation process with information from related documents. LangChain provides several [Document loaders](https://python.langchain.com/docs/modules/data_connection/document_loaders/) to achieve this. Document loaders expose a `load` method for loading data as documents from a configured source. After re:Invent 2022, AWS released information about the conference in the form of Internet articles, which we can easily load using LangChain's [`UnstructuredURLLoader`](https://api.python.langchain.com/en/latest/document_loaders/langchain.document_loaders.url.UnstructuredURLLoader.html). 


<div>
<br/>
<img src="attachment:reInvent_2022.png" width="800"/>
</div>

The following code snippet shows how to load data from a URL. The `page_content` attribute of each loaded document contains the fetched text body. Here we display the start of the text (beginning at character 214 to skip the title, authors, and tags).

In [7]:
from langchain.document_loaders import UnstructuredURLLoader

# List of URLs for the loader. We will only use one in this example.
urls = [
    "https://aws.amazon.com/blogs/security/three-key-security-themes-from-aws-reinvent-2022/",
]

# Define the URL Loader
loader = UnstructuredURLLoader(urls=urls)

# Load the data
data = loader.load()

# Pre-process the content for prettier display
data[0].page_content = re.sub("\n{3,}", "\n", data[0].page_content)
data[0].page_content = re.sub(" {2,}", " ", data[0].page_content)

print(data[0].page_content[214:1200])
print()




AWS re:Invent returned to Las Vegas, Nevada, November 28 to December 2, 2022. After a virtual event in 2020 and a hybrid 2021 edition, spirits were high as over 51,000 in-person attendees returned to network and learn about the latest AWS innovations.

Now in its 11th year, the conference featured 5 keynotes, 22 leadership sessions, and more than 2,200 breakout sessions and hands-on labs at 6 venues over 5 days.

With well over 100 service and feature announcements—and innumerable best practices shared by AWS executives, customers, and partners—distilling highlights is a challenge. From a security perspective, three key themes emerged.

Turn data into actionable insights

Security teams are always looking for ways to increase visibility into their security posture and uncover patterns to make more informed decisions. However, as AWS Vice President of Data and Machine Learning, Swami Sivasubramanian, pointed out during his keynote, data often exists in silos; it isn’t alw



---

<div style="border: 4px solid coral; text-align: center; margin: auto; padding-left: 50px; padding-right: 50px">
    <h2><i>Try it Yourself!</i></h2>
    <p style="text-align:center;margin:auto;"><img src="./images/activity.png" alt="Activity" width="100" /> </p>
    <p style=" text-align: center; margin: auto;"><b>Try loading other web pages.</b><br/> For example, you could try searching for more web pages containing information about re:Invent 2022. <br/>Try loading more than one URL and inspect the results loaded in <code>page_content</code>.
</p>
    <br>
</div>

In [8]:
############## CODE HERE ####################



############## END OF CODE ##################

#### <a name="33">3.3. Split documents into chunks</a>
(<a href="#0">Go to top</a>)


Large documents may pose a challenge for RAG because they might not fit into the model's context window. Splitting them into smaller chunks addresses this and also allows the retriever to select only the most relevant chunks instead of feeding the entire document to the LLM. In this section we use the [`RecursiveCharacterTextSplitter`](https://api.python.langchain.com/en/latest/text_splitter/langchain.text_splitter.RecursiveCharacterTextSplitter.html), which is LangChain's default [text splitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/#text-splitters). It takes a list of separators, splits on the first one, and moves to the next separator for any chunk that is still too large.
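
The recursive idea can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the function name `recursive_split` is hypothetical, separators are treated as plain strings rather than regular expressions, and chunk overlap is left out.

```python
def recursive_split(text, chunk_size, separators):
    """Split text into chunks of at most chunk_size characters.

    Pieces produced by the first separator are merged greedily up to
    chunk_size; a piece that is still too long is split again with the
    remaining separators. Simplified sketch without chunk overlap.
    """
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # No usable separator left: fall back to hard cuts
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, remaining = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: recurse with finer separators
            chunks.extend(recursive_split(piece, chunk_size, remaining))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "Alpha beta.\n\nGamma delta epsilon zeta.\n\nEta."
chunks = recursive_split(sample, chunk_size=20, separators=["\n\n", " ", ""])
for chunk in chunks:
    print(repr(chunk))
```

The second paragraph of the sample is longer than 20 characters, so it is the only one that gets split again, this time on spaces.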

In [9]:
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    CharacterTextSplitter,
)

# Use the recursive character splitter
recur_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=200,
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""],
    is_separator_regex=True,
)

# Perform the splits using the splitter
data_splits = recur_splitter.split_documents(data)

# Print a random chunk
print(random.choice(data_splits).page_content)

Data protection and privacy

The first step to protecting data is to find it. Amazon Macie now automatically discovers sensitive data, providing continual, cost-effective, organization-wide visibility into where sensitive data resides across your Amazon Simple Storage Service (Amazon S3) estate. With this new capability, Macie automatically and intelligently samples and analyzes objects across your S3 buckets, inspecting them for sensitive data such as personally identifiable information (PII), financial data, and AWS credentials. Macie then builds and maintains an interactive data map of your sensitive data in S3 across your accounts and Regions, and provides a sensitivity score for each bucket. This helps you identify and remediate data security risks without manual configuration and reduce monitoring and remediation costs.

Encryption is a critical tool for protecting data and building customer trust. The launch of the end-to-end encrypted enterprise communication service AWS Wickr 

#### <a name="34">3.4. Embeddings and vector databases</a>
(<a href="#0">Go to top</a>)

For RAG to be successful, we need a way of doing a semantic search to **retrieve the documents that contain the most relevant information to be used in the answer generation process**. At this stage, the concept of **embedding** comes into play. This is the transformation of the previously extracted and chunked text into a vector in a high-dimensional space that represents the semantic meaning.

In this example we will use Amazon's **[Titan Embeddings model](https://aws.amazon.com/bedrock/titan/)** to generate the embeddings.

In [10]:
from langchain.embeddings import BedrockEmbeddings

bedrock_embeddings = BedrockEmbeddings(
    model_id="amazon.titan-embed-text-v1", client=bedrock_runtime
)

We also need a place to store the documents' vector representation efficiently, allowing for quick retrieval. This is done by **vector databases**, which are designed specifically to handle and index high-dimensional vectors. In this example we will use [FAISS](https://faiss.ai/index.html) (Facebook AI Similarity Search), a library for efficient similarity search and clustering of dense vectors.

You can read more about [vector databases](https://python.langchain.com/docs/modules/data_connection/vectorstores/) in LangChain. Examples of other vector DBs that could potentially be used as retrievers in a RAG use case are [Pinecone](https://python.langchain.com/docs/integrations/vectorstores/pinecone) and [Chroma](https://python.langchain.com/docs/integrations/vectorstores/chroma).

In [11]:
from langchain.vectorstores import FAISS
from langchain.indexes import VectorstoreIndexCreator
from langchain.indexes.vectorstore import VectorStoreIndexWrapper

# Create a vector DB from documents retrieved from the URL and split with the RecursiveCharacterTextSplitter
vectorstore_faiss = FAISS.from_documents(
    data_splits,
    bedrock_embeddings,
)

<div style="border: 4px solid coral; text-align: center; margin: auto; padding-left: 50px; padding-right: 50px">
    <h2><i>Try it Yourself!</i></h2>
    <p style="text-align:center;margin:auto;"><img src="./images/activity.png" alt="Activity" width="100" /> </p>
    <p style=" text-align: center; margin: auto;">You can try different models for the embeddings, for instance <a href="https://api.python.langchain.com/en/latest/embeddings/langchain.embeddings.huggingface.HuggingFaceEmbeddings.html">HuggingFaceEmbeddings</a>, as well as different vector databases for the retriever, for instance <a href="https://python.langchain.com/docs/integrations/vectorstores/chroma">Chroma</a> or <a href="https://python.langchain.com/docs/integrations/vectorstores/pinecone">Pinecone</a>.</p>
    <p style=" text-align: center; margin: auto;"><b>Try replacing the embeddings and Vector DB above with other options and check whether you can get improved results.</b></p>
    <br>
</div>


In [12]:
############## CODE HERE ####################



############## END OF CODE ##################

#### <a name="35">3.5. Prompting with RAG</a>
(<a href="#0">Go to top</a>)

Let's finally assemble the text generation with the LLM and the retriever. The query to the model is converted into a vector using the embedding model. This query vector represents the semantic meaning of the user's query. To find the most relevant documents to the user's query, we use a process called "vector similarity search". In essence, this process compares the query vector with all the document vectors in the database, finding the ones most similar to the query vector. The similarity between vectors is typically measured using the "cosine similarity", which captures the angle between the vectors in a multidimensional space. The documents corresponding to the most similar vectors are then returned as the search results.
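
As an illustration of the similarity search described above, here is a minimal sketch that ranks a few documents by cosine similarity to a query vector. The three-dimensional vectors and the names `doc_vectors` and `top_k` are made up for the example; real embeddings have far more dimensions, and a vector store may be configured with a different distance measure.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, doc_vectors, k):
    """Return the names of the k documents most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), name)
        for name, vector in doc_vectors.items()
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Hypothetical toy embeddings for three chunks
doc_vectors = {
    "keynotes": [0.9, 0.1, 0.0],
    "security": [0.1, 0.9, 0.2],
    "venues": [0.7, 0.2, 0.1],
}
query_vector = [1.0, 0.0, 0.0]

print(top_k(query_vector, doc_vectors, k=2))
```

A brute-force comparison like this scans every stored vector; libraries such as FAISS exist to make the same kind of lookup efficient for large collections.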

The code below uses LangChain's [RetrievalQA](https://api.python.langchain.com/en/latest/chains/langchain.chains.retrieval_qa.base.RetrievalQA.html) chain, which implements the RAG steps outlined above.

In [13]:
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# Suppress warnings
warnings.filterwarnings("ignore")

context_template = """

Human: Answer the question below.
Use the given context to answer the question. 
If you don't know the answer, respond "I don't know".
Keep your response as precise as possible and limit it to a few words. 

Here is the context:
{context}

Here is the question: 
{question}

Assistant:"""

# Define the prompt template for Q&A
context_prompt_template = PromptTemplate.from_template(context_template)

# Define the RetrievalQ&A chain
# We pass the llm and the FAISS vector store, retrieving the k most relevant documents
rag_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectorstore_faiss.as_retriever(
        search_type="similarity", search_kwargs={"k": 3}
    ),
    return_source_documents=True,
    chain_type="stuff",
    chain_type_kwargs={"prompt": context_prompt_template},
)

# Perform RAG using the RetrievalQA chain with FAISS as retriever
df["claude_rag_answer"] = df.Question.apply(
    lambda question: rag_chain({"query": question})["result"].strip()
)

with pd.option_context("display.max_rows", None):
    with pd.option_context("display.max_colwidth", None):
        display(df.head())

Unnamed: 0,Question,Answer,claude_answer,claude_rag_answer
0,What city was AWS re:Invent 2022 held in?,Las Vegas,Las Vegas,"Las Vegas, Nevada"
1,When did AWS re:Invent 2022 take place?,"November 28 to December 2, 2022","November 28-December 2, 2022.","November 28 to December 2, 2022."
2,How many years has AWS re:Invent been running?,11 years,17 years,11 years
3,How many people attended re:Invent 2022 in person?,"Over 51,000","Over 40,000","Over 51,000"
4,How many keynotes were featured at re:Invent 2022?,5 keynotes,Three,5 keynotes


### <a name="4">4. Evaluation of results</a>
(<a href="#0">Go to top</a>)


A key aspect of LLM usage is the evaluation of its results. In this example we will use a second LLM, Anthropic's Claude 2 model, to assess whether the provided answer from Claude Instant is correct. Given that similar answers, such as "Las Vegas" and "Las Vegas, Nevada", should be counted as correct, we ask the evaluator LLM to assess the correctness of the answers by outputting 0 or 1. 

In [14]:
llm_eval = Bedrock(
    model_id="anthropic.claude-v2",
    client=bedrock_runtime,
    model_kwargs=inference_modifier,
)

template_eval = """

Human: You are an evaluator and need to assess whether the two text inputs below match.
Output 1 if they contain matching information.
Output 0 if they contain different or incompatible information.
Your answer must be only 0 or 1.

Here is input 1:
{input1}

Here is input 2:
{input2}

Assistant:"""


def eval_llm_output(row, col_to_eval):
    prompt_message = PromptTemplate.from_template(template_eval).format(
        input1=row.Answer, input2=row[col_to_eval]
    )
    answer = llm_eval(prompt_message)  # use the Claude 2 evaluator, not the model under test
    return int(answer.strip())


df["claude_eval"] = df.apply(lambda r: eval_llm_output(r, "claude_answer"), axis=1)
df["claude_rag_eval"] = df.apply(
    lambda r: eval_llm_output(r, "claude_rag_answer"), axis=1
)

with pd.option_context("display.max_rows", None):
    with pd.option_context("display.max_colwidth", None):
        display(df)

Unnamed: 0,Question,Answer,claude_answer,claude_rag_answer,claude_eval,claude_rag_eval
0,What city was AWS re:Invent 2022 held in?,Las Vegas,Las Vegas,"Las Vegas, Nevada",1,1
1,When did AWS re:Invent 2022 take place?,"November 28 to December 2, 2022","November 28-December 2, 2022.","November 28 to December 2, 2022.",1,1
2,How many years has AWS re:Invent been running?,11 years,17 years,11 years,0,1
3,How many people attended re:Invent 2022 in person?,"Over 51,000","Over 40,000","Over 51,000",0,1
4,How many keynotes were featured at re:Invent 2022?,5 keynotes,Three,5 keynotes,0,1
5,Did Swami Sivasubramanian gave a keynote at re:Invent 2022?,Yes,Yes,Yes,1,1
6,What Amazon service featured in re:Invent 2022 brings security data into a data lake?,Amazon Security Lake,Lake Formation,Amazon Security Lake,0,1
7,How many AWS security partners have announced integrations with Amazon Security Lake?,More than 37,Three,More than 37.,0,1
8,What does OCSF stand for?,Open Cybersecurity Schema Framework,I don't know.,Open Cybersecurity Schema Framework,0,1
9,What percentage of C-Level executives will have performance requirements related to cybersecurity risk by 2026 according to Gartner?,At least 50%,90%,50%,0,1
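
The evaluator is instructed to reply with only 0 or 1, but a model may still add whitespace, punctuation, or a short phrase, in which case a bare `int()` conversion raises an error. A more tolerant parser could look like the sketch below. The helper name `parse_binary_verdict` is hypothetical and not part of the lab code, and returning `None` for unparseable replies is one possible design choice.

```python
import re


def parse_binary_verdict(reply):
    """Extract a standalone 0 or 1 from an evaluator reply, or return None."""
    match = re.search(r"\b([01])\b", reply)
    return int(match.group(1)) if match else None


print(parse_binary_verdict(" 1"))
print(parse_binary_verdict("0."))
print(parse_binary_verdict("The inputs match, so the answer is 1"))
print(parse_binary_verdict("I don't know"))
```

Rows for which the parser returns `None` can then be inspected by hand instead of stopping the whole evaluation run.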


Finally, we can compute the accuracy of both approaches:

* Claude Instant without retrieval
* Claude Instant with RAG based on data retrieved from the Internet

In [15]:
# EVALUATION

acc_orig = df.claude_eval.sum() / len(df)
acc_rag = df.claude_rag_eval.sum() / len(df)

print("---")
print(f"Accuracy with Vanilla LLM (Claude Instant)\t{acc_orig*100:.1f}%")
print(f"Accuracy with RAG on LLM (Claude Instant)\t{acc_rag*100:.1f}%")
print("---")

---
Accuracy with Vanilla LLM (Claude Instant)	28.6%
Accuracy with RAG on LLM (Claude Instant)	95.2%
---


### Conclusion

We have demonstrated how to implement a retrieval strategy to augment the context given to an LLM when prompting it to answer specific questions. Retrieval Augmented Generation (RAG) enhances the performance of the model by enabling it to generate factually correct responses that standard prompting was not able to produce.

<div style="border: 4px solid coral; text-align: center; margin: auto; padding-left: 50px; padding-right: 50px">
    <h2><i>Try it Yourself!</i></h2>
    <p style="text-align:center;margin:auto;"><img src="./images/activity.png" alt="Activity" width="100" /> </p>
    <p style=" text-align: center; margin: auto;">You can try <b>other models available in Bedrock, such as <a href="https://aws.amazon.com/bedrock/jurassic/">Jurassic</a> or <a href="https://aws.amazon.com/bedrock/cohere-command/">Cohere Command</a></b>, and study their performance without and with retrieval augmented generation.</p>
    <br>
</div>


In [16]:
############## CODE HERE ####################



############## END OF CODE ##################


# Thank you!

