# Retrieval Augmented Question & Answering with Amazon Bedrock using LangChain & Amazon OpenSearch

> *This notebook should work well with the **`Data Science 3.0`** kernel in SageMaker Studio*

---

Previously, we used the Anthropic Claude model in Amazon Bedrock to demonstrate a basic Question Answering (QA) system, and learned the value of grounding a model with additional context before generating a response. In the previous notebook, we had to manually provide the model with relevant data and context ourselves. However, this approach is not fit for enterprise-level QA systems where there could be hundreds of thousands of large documents.

## Retrieval Augmented Generation (RAG)

We can improve upon this process by implementing an architecture called retrieval augmented generation (RAG). RAG retrieves data from outside the LLM's training data sources and augments the prompts by adding the relevant retrieved data as context. RAG extends the already powerful capabilities of LLMs to specific domains or an organization's internal knowledge base, without needing to retrain the model. It is a cost-effective approach to improving LLM output so it remains relevant, accurate, and useful in various contexts.

## Solution

In this notebook, we aid LLM responses to user queries by implementing RAG using context from external documents. First, we process documents and store them in a vector store. Next, we search the vector store using the user's question, and return relevant data as external context to the LLM. Finally, the LLM generates an answer to the user's question based on the new context provided.

We will walk through implementing the following two patterns: Question Answering (QA) and Conversational AI with conversation memory. 

Let’s break down the solution a little further. 

### Prepare documents for search
![Documents](./images/embeddings_lang.png)

First, the documents must be processed and then indexed in a vector store.
- Load the documents from our directory
- Process the documents by splitting them into smaller chunks
- Create a numerical vector representation of each chunk using an embeddings model
- Create an index using the chunks and the corresponding embeddings

### Respond to the user’s question
![Question](./images/chatbot_lang.png)

Once the vector store is indexed with documents and embeddings, we can search for text relevant to the question being asked. The relevant chunks are sent to the model as additional context, and the model then generates the answer.
- Create an embedding of the input question
- Compare the question embedding with the embeddings in the index
- Fetch the (top N) relevant document chunks
- Add those chunks as part of the context in the prompt
- Send the prompt to the model under Amazon Bedrock
- Get the contextual answer based on the documents retrieved
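
The two phases above can be sketched end to end in plain Python. This is only a toy illustration under stated assumptions: `embed` is a hypothetical bag-of-words stand-in for a real embeddings model, the in-memory `index` list stands in for the vector store, and the final call to the model is left out. The rest of this notebook uses Amazon Bedrock and Amazon OpenSearch Serverless for these roles.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embeddings model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_n(question: str, index: list, n: int = 2) -> list:
    """Embed the question, compare it with every indexed chunk, keep the top n."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:n]]


def build_prompt(question: str, chunks: list) -> str:
    """Add the retrieved chunks to the prompt as context."""
    context = "\n\n".join(chunks)
    return f"Use the context to answer.\n\n{context}\n\nQuestion: {question}"


# Prepare documents for search: embed each chunk and keep (chunk, vector) pairs.
chunks = [
    "the vector store holds document embeddings",
    "bedrock hosts foundation models",
    "pdf files are split into chunks",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Respond to the question: retrieve the best chunk and build the prompt.
question = "where are document embeddings stored"
relevant = top_n(question, index, n=1)
prompt = build_prompt(question, relevant)
print(prompt)
```

In a real system the prompt would then be sent to the model, which generates the answer from the retrieved context.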

Let's get started!

## Setup

Before running the rest of this notebook, you'll need to run the cells below to ensure necessary libraries are installed and to connect to Amazon Bedrock.

⚠️ For more details on how the setup works and **whether you might need to make any changes**, refer to the [Amazon Bedrock boto3 setup notebook](../../02_prompt_engineering/1_setup.ipynb).

In this notebook, we'll also need some extra dependencies:

- [OpenSearch Python Client](https://pypi.org/project/opensearch-py/) to store vector embeddings
- [PyPDF](https://pypi.org/project/pypdf/) for handling PDF files

In [None]:
%pip install -r ../requirements.txt

In [None]:
%pip install --no-build-isolation --force-reinstall \
    "boto3>=1.34.66" \
    "awscli>=1.32.66" \
    "botocore>=1.34.66"

In [None]:
%pip install -U opensearch-py==2.3.1 langchain==0.1.12 "pypdf>=3.8,<4" \
    apache-beam \
    datasets \
    tiktoken \
    rich

In [None]:
import json
import os
import sys
import warnings
from rich import print as rprint
warnings.filterwarnings('ignore')

module_path = "../../"
sys.path.append(os.path.abspath(module_path))
from utils import bedrock

boto3_bedrock = bedrock.get_bedrock_client(
    assumed_role=os.environ.get("BEDROCK_ASSUME_ROLE", None),
    region=os.environ.get("AWS_DEFAULT_REGION", None)
)

## Configure LangChain

LangChain provides convenient integrations with Amazon Bedrock and other services like vector stores and retrievers. We begin with instantiating the large language model (LLM) and the embeddings model. We are using Anthropic Claude version 2 for text generation and Amazon Titan Embeddings G1 - Text for text embedding.

Note: Amazon Bedrock offers a choice of high-performing foundation models (FMs). You can replace the value for `model_id` with one of the available [model IDs](https://docs.aws.amazon.com/bedrock/latest/userguide/model-ids.html) as follows. Some models have different requirements for inputs such as prompt format. As of this writing, all models are supported in the US West (Oregon, us-west-2) Region. If you are using another AWS Region, check the latest [model support by AWS Region](https://docs.aws.amazon.com/bedrock/latest/userguide/models-regions.html).

```python
llm = BedrockChat(model_id="anthropic.claude-3-haiku-20240307-v1:0", ...)
```


In [None]:
from langchain.embeddings import BedrockEmbeddings
from langchain_community.chat_models import BedrockChat
from langchain.load.dump import dumps

# Instantiate the Anthropic Claude v2 model (see the note above for switching to another model ID)
llm = BedrockChat(
    model_id="anthropic.claude-v2",
    model_kwargs={"max_tokens": 200}
)

# Instantiate the Amazon Titan Embeddings G1 - Text embeddings model
bedrock_embeddings = BedrockEmbeddings(
    client=boto3_bedrock,
    model_id="amazon.titan-embed-text-v1" # change this model ID to use another embeddings model
)

### [Optional] Explore using different embeddings models
Below are optional embeddings models from Jina AI and MPNet that are made available through Hugging Face. You can load these models into the local notebook for quick experimentation. Local embedding models require the [sentence-transformers](https://www.sbert.net/) library, which in turn requires Hugging Face's `transformers` library as well as PyTorch. If you want to experiment with these, it is recommended that you use a larger instance type and a PyTorch kernel.

Note: If you are using the Jina AI embeddings model, you may need an instance type equivalent to `ml.m5.large` or larger.

In [None]:
# set to True to load local models but see the note above first
LOAD_LOCAL_MODELS = False

if LOAD_LOCAL_MODELS:
    %pip install -U sentence-transformers
    from langchain.embeddings import HuggingFaceEmbeddings
    jina_embeddings = HuggingFaceEmbeddings(model_name="jinaai/jina-embeddings-v2-base-en") # Jina AI embeddings model
    mpnet_embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2") # MPNet embeddings model

## Data Preparation
We will load the documents with the help of [PyPDF in LangChain](https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf). For this lab, we will be using the files in the data folder.

Once you've loaded documents, you'll often want to transform them to better suit your application. The simplest example is splitting a long document into smaller chunks that fit into your model's context window. LangChain has a number of built-in document transformers that make it easy to split, combine, filter, and otherwise manipulate documents. For this use case, we create chunks of roughly 2000 characters with an overlap of 200 characters using [RecursiveCharacterTextSplitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/recursive_text_splitter). This text splitter is recommended for generic text.

Note: The retrieved document/text should be large enough to contain enough information to answer a question, but small enough to fit into the model's context window, including the prompt. The embeddings model we are using also has an input limit of 8k tokens, which is roughly 32,000 characters.
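
To make chunk size and overlap concrete, here is a minimal sketch of a fixed-size sliding-window splitter. It is deliberately simpler than `RecursiveCharacterTextSplitter`, which also tries to break on separators such as paragraphs and sentences. The function name `split_with_overlap` and the 4-characters-per-token ratio are illustrative assumptions; the ratio is only the rough rule of thumb implied by the note above.

```python
def split_with_overlap(
    text: str, chunk_size: int = 2000, chunk_overlap: int = 200
) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


CHARS_PER_TOKEN = 4  # assumption: rough average, 8k tokens ~ 32,000 characters

sample = "".join(str(i % 10) for i in range(5000))  # 5000-character dummy text
chunks = split_with_overlap(sample)

for chunk in chunks:
    print(len(chunk), "characters, about", len(chunk) // CHARS_PER_TOKEN, "tokens")
```

With these settings a 2000-character chunk is on the order of 500 tokens, well under the embeddings model's input limit.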

In [None]:
import numpy as np
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFDirectoryLoader

# Load the documents from the data folder/directory
loader = PyPDFDirectoryLoader("../data/")
documents = loader.load()

# Split the documents recursively by character
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=200,
)

docs = text_splitter.split_documents(documents)

Let's compare the split documents to the original documents.

In [None]:
def avg_doc_length(documents):
    """Average page_content length, in characters, across a list of documents."""
    return sum(len(doc.page_content) for doc in documents) // len(documents)

avg_char_count_pre = avg_doc_length(documents)
avg_char_count_post = avg_doc_length(docs)

print(f"Average length among {len(documents)} documents loaded is {avg_char_count_pre} characters.")
print(f"After the split we have {len(docs)} documents compared to the original {len(documents)}.")
print(f"Average length among {len(docs)} documents (after split) is {avg_char_count_post} characters.")

Now let's test our embedding model on a single document chunk to see what an embedding looks like. These embeddings could be generated for the entire corpus of documents and stored in a vector store for easy retrieval.

In [None]:
try:
    sample_embedding = np.array(bedrock_embeddings.embed_query(docs[0].page_content))
    modelId = bedrock_embeddings.model_id
    print("Embedding model Id :", modelId)
    print("Sample embedding of a document chunk: ", sample_embedding)
    print("Size of the embedding: ", sample_embedding.shape)

except ValueError as error:
    if  "AccessDeniedException" in str(error):
        print(f"\x1b[41m{error}\
        \nTo troubleshoot this issue please refer to the following resources.\
        \nhttps://docs.aws.amazon.com/bedrock/latest/userguide/setting-up.html\
        \nhttps://docs.aws.amazon.com/bedrock/latest/userguide/security-iam.html\
        \nhttps://docs.aws.amazon.com/IAM/latest/UserGuide/troubleshoot_access-denied.html\
              \x1b[0m")
        class StopExecution(ValueError):
            def _render_traceback_(self):
                pass
        raise StopExecution        
    else:
        raise error

## Create the vector store
In this workshop we will use the ***vector engine for Amazon OpenSearch Serverless.***

Amazon OpenSearch Serverless is a serverless option in Amazon OpenSearch Service. As a developer, you can use OpenSearch Serverless to run petabyte-scale workloads without configuring, managing, and scaling OpenSearch clusters. You get the same interactive millisecond response times as OpenSearch Service with the simplicity of a serverless environment. Pay only for what you use by automatically scaling resources to provide the right amount of capacity for your application — without impacting data ingestion.

In [None]:
import boto3
import time

vector_store_name = "bedrock-workshop-rag"
index_name = "bedrock-workshop-rag-index"
encryption_policy_name = "bedrock-workshop-rag-sp"
network_policy_name = "bedrock-workshop-rag-np"
access_policy_name = "bedrock-workshop-rag-ap"
user_identity = boto3.client("sts").get_caller_identity()["Arn"]
user_account = boto3.client("sts").get_caller_identity()["Account"]
sagemaker_notebook_role = (
    "arn:aws:iam::"
    + user_account
    + ":role/aws-service-role/sagemaker.amazonaws.com/AWSServiceRoleForAmazonSageMakerNotebooks"
)

region = os.environ.get("AWS_DEFAULT_REGION", boto3.session.Session().region_name)

aoss_client = boto3.client(
    "opensearchserverless"
)  # Create the Amazon OpenSearch Serverless client

print("Creating a security policy for AOSS collection..")
security_policy = aoss_client.create_security_policy(
    name=encryption_policy_name,
    policy=json.dumps(
        {
            "Rules": [
                {
                    "Resource": ["collection/" + vector_store_name],
                    "ResourceType": "collection",
                }
            ],
            "AWSOwnedKey": True,
        }
    ),
    type="encryption",
)

print("Creating a network policy for AOSS collection..")
network_policy = aoss_client.create_security_policy(
    name=network_policy_name,
    policy=json.dumps(
        [
            {
                "Rules": [
                    {
                        "Resource": ["collection/" + vector_store_name],
                        "ResourceType": "collection",
                    }
                ],
                "AllowFromPublic": True,
            }
        ]
    ),
    type="network",
)

print("Creating an AOSS collection..")
collection = aoss_client.create_collection(name=vector_store_name, type="VECTORSEARCH")

print("Waiting for an AOSS collection to be created..")
while True:
    status = aoss_client.list_collections(
        collectionFilters={"name": vector_store_name}
    )["collectionSummaries"][0]["status"]
    if status in ("ACTIVE", "FAILED"):
        break
    print(".")
    time.sleep(10)

print("Creating an access policy for the AOSS collection..")
access_policy = aoss_client.create_access_policy(
    name=access_policy_name,
    policy=json.dumps(
        [
            {
                "Rules": [
                    {
                        "Resource": ["collection/" + vector_store_name],
                        "Permission": [
                            "aoss:CreateCollectionItems",
                            "aoss:DeleteCollectionItems",
                            "aoss:UpdateCollectionItems",
                            "aoss:DescribeCollectionItems",
                        ],
                        "ResourceType": "collection",
                    },
                    {
                        "Resource": ["index/" + vector_store_name + "/*"],
                        "Permission": [
                            "aoss:CreateIndex",
                            "aoss:DeleteIndex",
                            "aoss:UpdateIndex",
                            "aoss:DescribeIndex",
                            "aoss:ReadDocument",
                            "aoss:WriteDocument",
                        ],
                        "ResourceType": "index",
                    },
                ],
                "Principal": [user_identity, sagemaker_notebook_role],
                "Description": "Easy data policy",
            }
        ]
    ),
    type="data",
)

host = (
    collection["createCollectionDetail"]["id"]
    + "."
    + region
    + ".aoss.amazonaws.com:443"
)

print("AOSS host: " + host)
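
The polling loop in the cell above stops on either `ACTIVE` or `FAILED` but does not distinguish between them, and it has no upper time bound. A minimal sketch of a more defensive polling helper follows. The name `wait_for_status` and its parameters are hypothetical, and the status source is injected as a callable so the sketch runs without any AWS access.

```python
import time
from typing import Callable


def wait_for_status(
    get_status: Callable[[], str],
    done: tuple = ("ACTIVE",),
    failed: tuple = ("FAILED",),
    interval_s: float = 10.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status until it reports a done status; raise on failure or timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status in done:
            return status
        if status in failed:
            raise RuntimeError(f"resource entered terminal status {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout_s} seconds")
        sleep(interval_s)


# Demo with a fake status source instead of a real API call.
statuses = iter(["CREATING", "CREATING", "ACTIVE"])
final = wait_for_status(lambda: next(statuses), interval_s=0, sleep=lambda s: None)
print(final)
```

In this notebook, the callable passed as `get_status` could wrap the same `aoss_client.list_collections(...)` lookup that the loop above performs.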

Now we are ready to ingest the documents into the vector store. This can be done easily using the [LangChain OpenSearch integration](https://python.langchain.com/docs/integrations/vectorstores/opensearch) which takes in the embeddings model and the documents to create the entire vector store.

⚠️ Note: If you are using the Jina AI embeddings model, be sure you are using an instance type equivalent to `ml.m5.large` or larger before running the code below. If you run into a user permission error below, wait a minute and try running the code again.

In [None]:
from opensearchpy import RequestsHttpConnection, AWSV4SignerAuth
from langchain.vectorstores import OpenSearchVectorSearch

service = 'aoss'
credentials = boto3.Session().get_credentials()
auth = AWSV4SignerAuth(credentials, region, service)

docsearch = OpenSearchVectorSearch.from_documents(
    docs,
    bedrock_embeddings, # Use 'jina_embeddings' if using Jina AI or 'mpnet_embeddings' if using MPNet embeddings
    opensearch_url=host,
    http_auth=auth,
    timeout = 100,
    use_ssl = True,
    verify_certs = True,
    connection_class = RequestsHttpConnection,
    index_name=index_name,
    engine="faiss",
)

## Searching the vector store
We can make a query to the vector store and return the relevant chunks of text.

It takes a few seconds for the documents to become available in the collection. If you get an empty output in the next cell, wait a little and retry.

### Semantic search methods
[Semantic search](https://opensearch.org/docs/latest/search-plugins/semantic-search/) considers the context and intent of a query. In OpenSearch, semantic search is facilitated by neural search with text embedding models. Semantic search creates a dense vector (a list of floats) and ingests data into a k-NN index.

Short for k-nearest neighbors, the k-NN plugin enables users to search for the k-nearest neighbors to a query point across an index of vectors.  To determine the neighbors, you can specify the space (the distance function) you want to use to measure the distance between points. We will explore some of the distance functions later in this lab.

The k-NN plugin supports three different methods for obtaining the k-nearest neighbors from an index of vectors:
- Approximate k-NN
- Script Score k-NN
- Painless extensions

#### Approximate k-NN search
Standard k-NN search methods compute similarity using a brute-force approach that measures the nearest distance between a query and a number of points, which produces exact results. This works well in many applications. However, in the case of extremely large datasets with high dimensionality, this creates a scaling problem that reduces the efficiency of the search. [Approximate k-NN search](https://opensearch.org/docs/latest/search-plugins/knn/approximate-knn/) methods can overcome this by employing tools that restructure indexes more efficiently and reduce the dimensionality of searchable vectors. Using this approach requires a sacrifice in accuracy but increases search processing speeds appreciably.

Of the three search methods the k-NN plugin provides, this method offers the best search scalability for large datasets. This approach is the preferred method when a dataset reaches hundreds of thousands of vectors.

In [None]:
# Search query
query = "What are Fidelity's digital asset offerings in 2022?"

# Search for the 3 most relevant documents
results = docsearch.similarity_search(query, k=3)

rprint(dumps(results, pretty=True))

#### Exact k-NN with scoring script
The k-NN plugin implements the OpenSearch [score script](https://opensearch.org/docs/latest/search-plugins/knn/knn-score-script/) plugin that you can use to find the exact k-nearest neighbors to a given query point.

Because the score script approach executes a brute force search, it doesn’t scale as well as the approximate approach. This approach is preferred for searches over smaller bodies of documents or when a pre-filter is needed. Using this approach on large indexes may lead to high latencies.
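
To see why exact search scales linearly with index size, here is a minimal brute-force sketch in plain Python. It is not how the k-NN plugin is implemented; the function names and the optional `allowed` set (a stand-in for a pre-filter that narrows the candidates before scoring) are illustrative assumptions.

```python
import heapq
import math


def l2(a, b) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_knn(query, vectors, k, allowed=None):
    """Score every candidate vector and return the k closest as (index, distance).

    allowed: optional set of indices; when given, only those vectors are scored.
    """
    scored = (
        (l2(query, vector), i)
        for i, vector in enumerate(vectors)
        if allowed is None or i in allowed
    )
    return [(i, distance) for distance, i in heapq.nsmallest(k, scored)]


vectors = [(0.0, 0.0), (1.0, 1.0), (0.2, 0.1), (5.0, 5.0)]

neighbors = exact_knn((0.0, 0.0), vectors, k=2)
filtered = exact_knn((0.0, 0.0), vectors, k=2, allowed={1, 3})
print(neighbors)
print(filtered)
```

Every query computes one distance per candidate, so latency grows with the number of vectors unless a filter shrinks the candidate set first.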

In [None]:
results = docsearch.similarity_search(
    query, 
    k=2,
    is_appx_search=False,
    search_type="script_scoring"
)

rprint(dumps(results, pretty=True))

#### k-NN Painless Scripting extensions (exact)
Currently, the vector engine for Amazon OpenSearch Serverless supports the approximate k-NN and scoring script search methods. Below is an example of the [painless scripting](https://opensearch.org/docs/latest/search-plugins/knn/painless-functions/) search method on Amazon OpenSearch Service for your reference.

Similar to the k-NN Script Score, you can use this method to perform a brute force, exact k-NN search across an index. This approach has slightly slower query performance compared to the k-NN scoring script. If your use case requires more customization over the final score, you should use this approach over k-NN scoring script.

```python
results = docsearch.similarity_search(
    query, 
    is_appx_search=False,
    search_type="painless_scripting"
)

print(dumps(results, pretty=True))
```

#### Exact k-NN search with filters
You can apply [k-NN search with filters](https://opensearch.org/docs/latest/search-plugins/knn/filter-search-knn/) with either the scoring script or the painless extension search method. Filters can greatly reduce the number of vectors to be searched. Using the k-NN score script, you can apply a filter on an index before executing the nearest neighbor search (sometimes referred to as a pre-filter search). This is useful for dynamic search cases where the index body may vary based on other conditions.

In [None]:
query = "What are Fidelity's digital asset offerings in 2022?"
# Named text_filter to avoid shadowing Python's built-in filter()
text_filter = {"bool": {"filter": {"term": {"text": "crypto"}}}}

# Pre-filter results
results = docsearch.similarity_search(
    query, 
    k=2,
    is_appx_search=False,
    search_type="script_scoring",
    pre_filter=text_filter
)

rprint(dumps(results, pretty=True))

You can also apply metadata filters to only include results from a specific document or a specific page of a document as examples.

In [None]:
query = "What are Fidelity's digital asset offerings in 2022?"

# Metadata filter to only include results from 2022-Fidelity-Investments-Annual-Report.pdf
metadata_filter = {"term": {"metadata.source.keyword": "../data/2022-Fidelity-Investments-Annual-Report.pdf"}}

# Metadata filter to only include results from page 6 of 2022-Fidelity-Investments-Annual-Report.pdf
# (not used below; pass it as boolean_filter instead of metadata_filter to try it)
metadata_page_filter = {"bool": {"filter": [{"term": {"metadata.page": 6}}, {"term": {"metadata.source.keyword": "../data/2022-Fidelity-Investments-Annual-Report.pdf"}}]}}

# Search query with metadata filter
results = docsearch.similarity_search(query, k=3, boolean_filter=metadata_filter) 

rprint(dumps(results, pretty=True))

#### Spaces - similarity or distance measures

Approximate Search through OpenSearch supports the following similarity or distance measures:

**Euclidean distance** – The straight-line distance between points.

**L1 (Manhattan) distance** – The sum of the absolute differences of all of the vector components. L1 distance measures how many orthogonal city blocks you need to traverse from point A to point B.

**L-infinity (chessboard) distance** – The number of moves a King would make on an n-dimensional chessboard, that is, the largest absolute difference across the vector components. It differs from Euclidean distance on the diagonals: a diagonal step on a 2-dimensional chessboard is 1.41 Euclidean units away, but only 1 L-infinity unit away.

**Inner product** – The product of the magnitudes of two vectors and the cosine of the angle between them. Usually used for natural language processing (NLP) vector similarity.

**Cosine similarity** – The cosine of the angle between two vectors in a vector space.
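
The measures above can be written out as plain formulas. The sketch below is for intuition only: the function names are illustrative, and OpenSearch converts these raw values into relevance scores rather than returning them directly. Using a single diagonal step from `(0, 0)` to `(1, 1)` as the example, the Euclidean distance is about 1.41, the L1 distance is 2, and the L-infinity distance is 1.

```python
import math


def l2(a, b) -> float:
    """Euclidean distance: straight-line distance between the points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def l1(a, b) -> float:
    """Manhattan distance: sum of absolute component differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def linf(a, b) -> float:
    """Chessboard distance: largest absolute component difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def inner_product(a, b) -> float:
    """Sum of component-wise products."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b) -> float:
    """Inner product divided by the product of the vector magnitudes."""
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )


origin, diagonal = (0.0, 0.0), (1.0, 1.0)
print("l2  :", round(l2(origin, diagonal), 2))    # 1.41
print("l1  :", l1(origin, diagonal))              # 2.0
print("linf:", linf(origin, diagonal))            # 1.0
print("cos :", round(cosine_similarity((1.0, 0.0), diagonal), 4))  # 0.7071
```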

We can specify the distance measure in the `space_type` parameter when we load our documents as seen below.

```python
docsearch = OpenSearchVectorSearch.from_documents(
    docs,
    bedrock_embeddings,
    opensearch_url=host,
    http_auth=auth,
    timeout = 100,
    use_ssl = True,
    verify_certs = True,
    connection_class = RequestsHttpConnection,
    index_name=index_name,
    engine="faiss",
    space_type="l2", # Options are: "l2", "l1", "cosinesimil", "linf", "innerproduct"; default: "l2"
)
```

#### [Optional] Engines and algorithms
The Approximate k-NN search methods leveraged by OpenSearch use approximate nearest neighbor (ANN) algorithms from the [NMSLIB](https://github.com/nmslib/nmslib), [FAISS](https://github.com/facebookresearch/faiss), and [Lucene](https://lucene.apache.org/) libraries to power k-NN search.

The engine details are as follows:

- Non-Metric Space Library (NMSLIB) – NMSLIB implements the HNSW ANN algorithm
- Facebook AI Similarity Search (FAISS) – FAISS implements both HNSW and IVF ANN algorithms
- Lucene – Lucene implements the HNSW algorithm

Each of the three engines used for approximate k-NN search has its own attributes that make one more sensible to use than the others in a given situation. In general, NMSLIB and FAISS should be selected for large-scale use cases. Lucene is a good option for smaller deployments, and it offers benefits like smart filtering, where the optimal filtering strategy (pre-filtering, post-filtering, or exact k-NN) is automatically applied depending on the situation.

We can specify the engine as shown below.

Note: As of this writing, Amazon OpenSearch Serverless vector search collections don't support the Apache Lucene ANN engine. Vector search collections only support the HNSW algorithm with FAISS and do not support IVF and IVFQ. Please check the updated [limitations](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-vector-search.html#serverless-vector-limitations). 

```python
docsearch = OpenSearchVectorSearch.from_documents(
    docs,
    bedrock_embeddings,
    opensearch_url=host,
    http_auth=auth,
    timeout = 100,
    use_ssl = True,
    verify_certs = True,
    connection_class = RequestsHttpConnection,
    index_name=index_name,
    engine="faiss", # Options are: "nmslib", "faiss", "lucene"; default: "nmslib"
)
```

**For HNSW, we can tune the m, ef_construction, and ef_search parameters to achieve our desired trade-off:**

**m** – Controls the maximum number of edges a node in a graph can have. Because each node has to store all of its edges, increasing this value will increase the memory footprint, but also increase the connectivity of the graph, which will improve recall.

**ef_construction** – Controls the size of the candidate queue for edges when adding a node to the graph. Increasing this value will increase the number of candidates to consider, which will increase the index latency. However, because more candidates will be considered, the quality of the graph will be better, leading to better recall during search.

**ef_search** – Similar to ef_construction, it controls the size of the candidate queue for graph traversal during search. Increasing this value will increase the search latency, but will also improve the recall.
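
To get a feel for how `m` affects memory, the sketch below applies the rule-of-thumb estimate given in the OpenSearch k-NN documentation for HNSW indexes, `1.1 * (4 * dimension + 8 * m)` bytes per vector. Treat it as an approximation, not a sizing guarantee. The function name is illustrative, and the dimension of 1536 is an assumption about the output size of the Titan embeddings model used in this notebook; check `sample_embedding.shape` from the earlier cell for the actual value.

```python
def hnsw_memory_bytes(num_vectors: int, dimension: int, m: int) -> float:
    """Approximate native memory for an HNSW graph: 1.1 * (4 * d + 8 * m) bytes per vector."""
    return 1.1 * (4 * dimension + 8 * m) * num_vectors


NUM_VECTORS = 1_000_000
DIMENSION = 1536  # assumption: embedding size of the model in use

for m in (16, 32, 64):
    gib = hnsw_memory_bytes(NUM_VECTORS, DIMENSION, m) / 2**30
    print(f"m={m:>2}: about {gib:.2f} GiB for {NUM_VECTORS:,} vectors")
```

For high-dimensional embeddings the vector data itself dominates, so raising `m` adds comparatively little memory while still improving graph connectivity.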

In general, we chose configurations that gradually increase the parameters, as detailed in the following table.

![](./images/hnsw_parameters.png)

Below is an example using the **FAISS** engine and providing a parameter configuration that provides a balance between latency, memory, and recall (see table above).

```python
docsearch = OpenSearchVectorSearch.from_documents(
    docs,
    bedrock_embeddings,
    opensearch_url=host,
    http_auth=auth,
    timeout = 100,
    use_ssl = True,
    verify_certs = True,
    connection_class = RequestsHttpConnection,
    index_name=index_name,
    engine="faiss",
    space_type="l2",
    m=16, # maximum number of edges
    ef_construction=128, # size of the candidate queue for edges
    ef_search=128 # size of the candidate queue for graph traversal
)
```

### Maximal marginal relevance search (MMR)
If you’d like to look up similar documents but also receive diverse results, MMR is a method you should consider. Maximal marginal relevance optimizes for similarity to the query AND diversity among the selected documents. It does this by finding the examples whose embeddings have the greatest cosine similarity with the input, and then iteratively adding them while penalizing them for closeness to already selected examples.
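
The selection loop can be sketched in a few lines of plain Python. This is an illustration of the general MMR idea, not LangChain's implementation; the function names and the toy 2-dimensional vectors are assumptions. Each step picks the candidate that maximizes `lambda_param * relevance - (1 - lambda_param) * redundancy`, where redundancy is the highest similarity to anything already selected.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query, candidates, k, lambda_param=0.5):
    """Return the indices of k candidates, trading off relevance against redundancy."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i):
            relevance = cosine(query, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_param * relevance - (1 - lambda_param) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = (1.0, 0.0)
candidates = [(1.0, 0.1), (1.0, 0.12), (1.0, -0.5)]  # 0 and 1 are near-duplicates

balanced = mmr(query, candidates, k=2, lambda_param=0.5)
similarity_only = mmr(query, candidates, k=2, lambda_param=1.0)
print(balanced)         # [0, 2]
print(similarity_only)  # [0, 1]
```

With `lambda_param=1.0` the penalty disappears and the result is a plain similarity ranking. Lower values favor diversity. In the cell below, `fetch_k` controls how many candidates are fetched from the vector store before this kind of re-ranking selects the final `k`.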

In [None]:
results = docsearch.max_marginal_relevance_search(
    query, 
    k=2,
    fetch_k=10,
    lambda_param=0.5
)

rprint(dumps(results, pretty=True))

## Orchestrating RAG using LangChain
Now that we can query our vector database for documents, we can retrieve data from outside of a large language model's training data sources and augment our prompts by adding the relevant retrieved data in context.

We can use LangChain to build applications that read data from stored internal documents and summarize them into conversational responses. We can create a Retrieval Augmented Generation (RAG) workflow that introduces new information to the language model during prompting. Implementing context-aware workflows like RAG reduces model hallucination and improves response accuracy.

### Generative Question Answering
Generative Question Answering (GQA) is a paradigm shift from traditional models that use exact keyword matches. GQA uses large language models (LLMs) to generate human-like, novel responses to user queries. 

We instruct the model to base the answer to our question on the information returned from our knowledge base, Amazon OpenSearch Serverless. We can do this easily using the [RetrievalQA](https://api.python.langchain.com/en/stable/chains/langchain.chains.retrieval_qa.base.RetrievalQA.html#langchain.chains.retrieval_qa.base.RetrievalQA) chain in LangChain. This chain first does a retrieval step to fetch relevant documents, then passes those documents into an LLM to generate a response.

In [None]:
from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(
    llm=llm, # this is our foundation model
    chain_type="stuff", # uses all of the text from the documents in the prompt
    retriever=docsearch.as_retriever() # this is our vector store
)

Let’s try this with our earlier query:

In [None]:
query = "What are Fidelity's digital asset offerings in 2022?"

response = qa.invoke(query)
rprint(response['result'])

We’re still not entirely protected from convincing yet false information, called hallucinations, by the model. They can happen, and it’s unlikely that we can eliminate the problem completely. However, we can do more to improve our trust in the answers provided.

An effective way of doing this is by adding citations to the response, allowing a user to see where the information is coming from. We can do this using a slightly different version of the RetrievalQA chain called [RetrievalQAWithSourcesChain](https://api.python.langchain.com/en/stable/chains/langchain.chains.qa_with_sources.retrieval.RetrievalQAWithSourcesChain.html#langchain.chains.qa_with_sources.retrieval.RetrievalQAWithSourcesChain). RetrievalQAWithSourcesChain does question answering over retrieved documents as before, but it also cites the sources in the text response.
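
Conceptually, citing sources means that each retrieved chunk is placed in the prompt together with a label the model can refer back to. The sketch below shows one way this could look. It is an assumption-laden illustration, not the prompt that `RetrievalQAWithSourcesChain` actually uses; the `Chunk` class, the prompt wording, and the file names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical retrieved chunk with the metadata needed for a citation."""
    text: str
    source: str
    page: int


def stuff_with_sources(question: str, chunks: list) -> str:
    """Place every chunk in one prompt, each labeled with its source."""
    blocks = [
        f"Content: {chunk.text}\nSource: {chunk.source} (page {chunk.page})"
        for chunk in chunks
    ]
    return (
        "Answer the question using only the extracts below "
        "and list the sources you used.\n\n"
        + "\n\n".join(blocks)
        + f"\n\nQuestion: {question}\nAnswer:"
    )


def cited_sources(chunks: list) -> list:
    """Unique sources in first-seen order, for display next to the answer."""
    seen = []
    for chunk in chunks:
        if chunk.source not in seen:
            seen.append(chunk.source)
    return seen


retrieved = [
    Chunk("Example sentence from the first file.", "example-report.pdf", 3),
    Chunk("Another sentence from the same file.", "example-report.pdf", 7),
    Chunk("A sentence from a second file.", "example-notes.pdf", 1),
]

prompt = stuff_with_sources("What does the report say?", retrieved)
print(prompt)
print(cited_sources(retrieved))
```

Because the labels travel with the text, a reader can check each cited passage against the returned source documents.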

In [None]:
from langchain.chains import RetrievalQAWithSourcesChain

qa_with_sources = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=docsearch.as_retriever(search_kwargs={'k': 3}),
    return_source_documents=True
)

response = qa_with_sources.invoke(query)
rprint(response['answer'])

The LLM provided the relevant text directly from the documents. We can see the full result below, and you will notice that you can match the cited text from the result to the text in the source documents provided as context to the model.

In [None]:
rprint(dumps(response, pretty=True))

#### Customizing the GQA prompt
We’ve learned how to ground Large Language Models with source knowledge by using a vector database as our knowledge base. Using this, we can encourage accuracy in our LLM’s responses, keep source knowledge up to date, and improve trust in our system by providing citations with every answer.

In the above scenario you explored the quick and easy way to get a context-aware answer to your question. Now let's have a look at a more customizable option with the help of RetrievalQA where you can customize how the documents fetched should be added to prompt using the `chain_type` parameter. 

To control how many relevant documents are retrieved, change the `k` parameter in the cell below and compare the outputs. In many scenarios you might also want to know which source documents the LLM used to generate the answer. You can get those documents in the output using the `return_source_documents` parameter, which returns the documents that are added to the context of the LLM prompt.

RetrievalQA also allows you to provide a custom [prompt template](https://python.langchain.com/docs/modules/model_io/prompts/) which can be specific to the model.

Note: In this example we are using Anthropic Claude as the LLM under Amazon Bedrock. This particular model performs best if the inputs are provided under `Human:` and the model is asked to generate an output after `Assistant:`. The cell below shows an example of how to control the prompt so that the LLM stays grounded and doesn't answer outside the context.

In [None]:
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

prompt_template = """
Human: Use the following pieces of context to provide a concise answer in Italian to the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Assistant:"""

PROMPT = PromptTemplate(
    template=prompt_template, 
    input_variables=["context", "question"]
)

qa_prompt = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=docsearch.as_retriever(
        search_kwargs={'k': 3}
    ),
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT},
)

Let's run our query with the new prompt from our prompt template:

In [None]:
query = "What are Fidelity's best achievements in 2022?"

response = qa_prompt.invoke({"query": query})
rprint(response["result"])

We can print the full result to see the original query, the LLM's response (`result`), and the source documents.

In [None]:
rprint(dumps(response, pretty=True))

## Conversational AI with retrieval

Conversational AI is a type of artificial intelligence (AI) that can simulate human conversation. Using our knowledge base as before, we can add in conversation history to have a conversation based on retrieved documents.

We can add conversation history along with our query and context with the assistance of the [ConversationalRetrievalChain](https://api.python.langchain.com/en/stable/chains/langchain.chains.conversational_retrieval.base.ConversationalRetrievalChain.html#langchain.chains.conversational_retrieval.base.ConversationalRetrievalChain) LangChain chain. This chain can be used to have conversations with a document. It takes in a question and (optional) previous conversation history.

The algorithm for this chain consists of three parts:
1. Use the chat history and the user's question to create a “standalone question”. This is done so that this question can be passed into the retrieval step to fetch relevant documents. If only the new question were passed in, relevant context might be lacking. If the whole conversation were passed into retrieval, it might contain unnecessary information that would distract from retrieval.
2. This new rephrased question is passed to the retriever and the relevant documents are returned.
3. The retrieved documents are passed to an LLM along with the new question to generate a final response.
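
The three steps above can be sketched in plain Python with stand-in functions in place of the LLM and the vector store. None of the names below are LangChain APIs. The keyword lookup, the toy corpus, and the choice to skip the rewrite when the history is empty are simplifying assumptions.

```python
# Minimal sketch of the three-step flow with hypothetical stand-ins.

CORPUS = {
    "digital": "The firm expanded its digital asset offerings.",
    "retirement": "Retirement plan participation grew.",
}


def fake_condense(chat_history, question):
    """Step 1 stand-in: fold the last user turn into the new question."""
    last_user_turn = chat_history[-1][0]
    return f"{question} (regarding: {last_user_turn})"


def fake_retrieve(query):
    """Step 2 stand-in: keyword match instead of a vector search."""
    return [text for keyword, text in CORPUS.items() if keyword in query.lower()]


def fake_answer(question, documents):
    """Step 3 stand-in: echo the first document instead of calling an LLM."""
    return documents[0] if documents else "Sorry, I don't know."


def conversational_retrieval(question, chat_history):
    # Step 1: rewrite the follow-up as a standalone question
    # (skipped here when there is no history to draw on).
    standalone = fake_condense(chat_history, question) if chat_history else question
    # Step 2: retrieve documents for the standalone question.
    documents = fake_retrieve(standalone)
    # Step 3: answer from the standalone question plus the retrieved documents.
    return {"standalone_question": standalone, "answer": fake_answer(standalone, documents)}


history = [("What were the digital asset offerings in 2022?", "Several new products.")]
with_history = conversational_retrieval("Why did they expand these?", history)
without_history = conversational_retrieval("Why did they expand these?", [])
print(with_history["answer"])
print(without_history["answer"])
```

The second call shows why step 1 matters: the follow-up on its own carries no searchable subject, so retrieval comes back empty.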


### Set up our prompt templates
First, we will create a prompt template that reformats the user input so that it is better suited to searching the vector database. The prompt instructs the LLM to rephrase the user input and chat history into a new standalone question that includes only the relevant information from the conversation.

Note that this time our prompt template includes a `{chat_history}` variable where our chat history will be included in the prompt.

In [None]:
condense_prompt = PromptTemplate.from_template("""\
<chat-history>
{chat_history}
</chat-history>

<follow-up-message>
{question}
</follow-up-message>

Human: Given the conversation above (between Human and Assistant) and the follow up message from Human, \
rewrite the follow up message to be a standalone question that captures all relevant context \
from the conversation. Answer only with the new question and nothing else.

Assistant: Standalone Question:""")

The next prompt passes the rephrased standalone question, together with the relevant documents returned by the retriever as context, to the LLM so that it can generate an answer. In this case, we also provide specific instructions about how to answer the question.

Note that this time our prompt template includes a `{context}` variable where context retrieved from the vector database will be included in the prompt.

In [None]:
respond_prompt = PromptTemplate.from_template("""\
<context>
{context}
</context>

Human: Given the context above, answer the question inside the <q></q> XML tags.

<q>{question}</q>

If the answer is not in the context say "Sorry, I don't know as the answer was not found in the context". Do not use any XML tags in the answer.

Assistant:""")

### Set up conversation memory

Now that we have our prompts ready, we need to facilitate conversation memory. We will use LangChain's [ConversationBufferMemory](https://api.python.langchain.com/en/stable/memory/langchain.memory.buffer.ConversationBufferMemory.html#langchain.memory.buffer.ConversationBufferMemory) class which provides an easy way to capture conversational memory for LLM chat applications.

In [None]:
from langchain.memory import ConversationBufferMemory

memory_chain = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True,
    human_prefix="Human",
    ai_prefix="Assistant"
)

memory_chain.chat_memory.add_user_message(
    'Hello, what are you able to do?'
)

memory_chain.chat_memory.add_ai_message(
    "Hi! I am a helpful chat assistant which can answer questions about annual financial reports for 2022 and 2023."
)
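
To make the idea of a buffer memory concrete, here is a minimal plain-Python sketch. It is not LangChain's implementation. The `SimpleBufferMemory` class and its `as_transcript` method are hypothetical names, and the transcript layout is only one plausible way to render the history for a `{chat_history}` placeholder.

```python
# Hypothetical stand-in for a conversation buffer: it keeps every message
# in order and renders them with the configured role prefixes.

class SimpleBufferMemory:
    def __init__(self, human_prefix="Human", ai_prefix="Assistant"):
        self.human_prefix = human_prefix
        self.ai_prefix = ai_prefix
        self.messages = []  # list of (role, text) tuples, oldest first

    def add_user_message(self, text):
        self.messages.append(("human", text))

    def add_ai_message(self, text):
        self.messages.append(("ai", text))

    def as_transcript(self):
        """Render the whole history as one string, one turn per line."""
        prefixes = {"human": self.human_prefix, "ai": self.ai_prefix}
        return "\n".join(f"{prefixes[role]}: {text}" for role, text in self.messages)


sketch_memory = SimpleBufferMemory()
sketch_memory.add_user_message("Hello, what are you able to do?")
sketch_memory.add_ai_message("Hi! I can answer questions about annual reports.")

transcript = sketch_memory.as_transcript()
print(transcript)
```

A buffer like this grows with every turn, so a long conversation will eventually need trimming or summarizing to stay within the model's context window.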

### Using LangChain for Conversational Retrieval
Below is an example of Anthropic Claude answering with retrieved context while keeping track of the conversation through memory.

In [None]:
from langchain.chains import ConversationalRetrievalChain

conversational_qa = ConversationalRetrievalChain.from_llm(
    llm=llm, # this is our foundation model
    retriever=docsearch.as_retriever(), # this is our Amazon OpenSearch vector database
    memory=memory_chain, # this is the conversational memory storage class
    condense_question_prompt=condense_prompt, # this is the prompt for condensing user inputs
    verbose=False, # change this to True in order to see the logs working in the background
)

conversational_qa.combine_docs_chain.llm_chain.prompt = respond_prompt # this is the prompt used to respond to condensed questions

response = conversational_qa.invoke({'question': "What were Fidelity's digital asset offerings in 2022?"})
rprint(response['answer'])

Let's take a look at the full response, which contains the question asked, the chat history, and the answer from our model.

In [None]:
rprint(dumps(response, pretty=True))

We can ask follow-up questions, and the model will answer based on the context from the previous question.

In [None]:
response = conversational_qa.invoke({'question': 'Why did they expand these offerings?'})
rprint(response['answer'])

You may continue the conversation by asking follow-up questions.

```python
response = conversational_qa.invoke({'question': 'Add additional question here'})
rprint(response['answer'])
```

### Conversational Retrieval with source documents

To return the source documents, we need to specify the `input_key` and `output_key` parameters in our memory buffer.
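
One way to see why these keys matter: once the chain returns both an `answer` and `source_documents`, a memory object that expects a single output cannot tell which value to store. The sketch below imitates that selection logic in plain Python. `pick_output` is a hypothetical helper, and the exact checks inside LangChain may differ between versions.

```python
# Hypothetical illustration of choosing which chain output to save to memory.

def pick_output(outputs, output_key=None):
    """Return the value to store in memory, or raise if the choice is ambiguous."""
    if output_key is not None:
        return outputs[output_key]
    if len(outputs) == 1:
        return next(iter(outputs.values()))
    raise ValueError(f"One output key expected, got {sorted(outputs)}")


chain_outputs = {
    "answer": "An example answer.",             # made-up value
    "source_documents": ["doc-1", "doc-2"],     # made-up values
}

# With an explicit key the choice is unambiguous.
saved = pick_output(chain_outputs, output_key="answer")

# Without one, two outputs cannot be resolved.
try:
    pick_output(chain_outputs)
    ambiguous = False
except ValueError:
    ambiguous = True

print(saved, ambiguous)
```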

In [None]:
memory_chain_with_sources = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True,
    input_key="question", # specify the input_key to return source documents
    output_key="answer", # specify the output_key to return source documents
    human_prefix="Human",
    ai_prefix="Assistant"
)

memory_chain_with_sources.chat_memory.add_user_message(
    'Hello, what are you able to do?'
)

memory_chain_with_sources.chat_memory.add_ai_message(
    "Hi! I am a helpful chat assistant which can answer questions about annual financial reports for 2022 and 2023."
)

We will reuse our previous `condense_prompt` and `respond_prompt` templates, and add the `return_source_documents` parameter to our ConversationalRetrievalChain chain.

In [None]:
conversational_qa_with_sources = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=docsearch.as_retriever(),
    memory=memory_chain_with_sources,
    condense_question_prompt=condense_prompt,
    return_source_documents=True, # this is to return the source documents
    verbose=False,
)

conversational_qa_with_sources.combine_docs_chain.llm_chain.prompt = respond_prompt # this is the prompt used to respond to condensed questions

result = conversational_qa_with_sources.invoke({'question': "What were Fidelity's digital asset offerings in 2022?"})
rprint(result['answer'])

We can see the source documents below. You can cite these sources in your application to give users confidence in the responses during the conversation.

In [None]:
rprint(dumps(result['source_documents'], pretty=True))
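
As a hedged sketch of how an application might turn source documents into short citations, the helper below works on any objects that expose `page_content` and `metadata` attributes. `FakeDocument` and `format_citations` are hypothetical names, and the `source` and `page` metadata keys are assumptions: the keys actually present depend on the document loader used earlier in the notebook.

```python
from dataclasses import dataclass, field

# Stand-in for a retrieved document so the sketch runs without LangChain.
@dataclass
class FakeDocument:
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_citations(documents, max_chars=60):
    """Build one numbered citation line per document."""
    lines = []
    for number, doc in enumerate(documents, start=1):
        source = doc.metadata.get("source", "unknown source")  # assumed key
        page = doc.metadata.get("page")                        # assumed key
        location = source if page is None else f"{source}, page {page}"
        snippet = " ".join(doc.page_content.split())[:max_chars]
        lines.append(f"[{number}] {location}: {snippet}")
    return lines


sample_documents = [
    FakeDocument("Digital asset offerings\nexpanded.", {"source": "report-2022.pdf", "page": 4}),
    FakeDocument("Second chunk."),
]

citations = format_citations(sample_documents)
print("\n".join(citations))
```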

### Clean up
You have reached the end of this workshop. The following cell deletes all the resources created in this notebook.


In [None]:
aoss_client.delete_collection(id=collection['createCollectionDetail']['id'])
aoss_client.delete_access_policy(name=access_policy_name, type='data')
aoss_client.delete_security_policy(name=encryption_policy_name, type='encryption')
aoss_client.delete_security_policy(name=network_policy_name, type='network')
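
If one delete call fails (for example, because a resource was already removed), the remaining calls in the cell never run. One possible pattern, sketched below with stand-in functions instead of real AWS calls, is to run each step independently and collect the failures. `run_cleanup` and the fake steps are hypothetical. In the notebook, each step would wrap one of the `aoss_client` calls above.

```python
# Hypothetical cleanup runner: every step is attempted even if an earlier
# one raises, and failures are reported at the end.

def run_cleanup(steps):
    """Run (label, callable) pairs in order and return a list of failures."""
    failures = []
    for label, action in steps:
        try:
            action()
        except Exception as error:  # keep going so later resources are still deleted
            failures.append((label, str(error)))
    return failures


deleted = []  # records which stand-in steps succeeded


def delete_collection():
    deleted.append("collection")


def delete_access_policy():
    raise RuntimeError("policy not found")  # simulated failure


def delete_network_policy():
    deleted.append("network policy")


failures = run_cleanup([
    ("collection", delete_collection),
    ("access policy", delete_access_policy),
    ("network policy", delete_network_policy),
])
print(deleted, failures)
```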

## Conclusion
In the above implementation of RAG-based Question Answering and Conversational AI, we explored the following concepts and how to implement them using the LangChain integrations for Amazon Bedrock and Amazon OpenSearch Serverless:

- Loading documents and processing them into smaller chunks
- Creating a vector store using vector engine for Amazon OpenSearch Serverless
- Generating embeddings with an embeddings model
- Searching the vector store to retrieve context relevant to the question
- Performing Generative Question Answering using foundation models
- Improving trust in our system by providing citations with every answer
- Preparing prompt templates to use as input to the LLM
- Storing conversation memory and providing the history as context to the LLM

### Next steps
- Experiment with different vector stores
- Leverage various text and embedding models available through Amazon Bedrock to see alternate outputs
- Explore options such as persistent storage of embeddings and document chunks
- Use Amazon Bedrock Knowledge Bases, a fully managed RAG capability with built-in session context management

# Thank You