# Exploring LangChain Retrieval Algorithms for improved RAG with Cohere models on Amazon Bedrock



In this notebook, we demonstrate how to use Cohere's [Command](https://docs.cohere.com/docs/command-beta) model and [Embed-english](https://cohere.com/embed) embeddings model to build a Retrieval Augmented Generation (RAG) question-answering system on a SageMaker Notebook running on an `ml.t3.medium` instance. These models are called through the Amazon Bedrock API, and we use [LangChain](https://www.langchain.com/) to build, experiment with, and tune the RAG application for improved retrieval. Additionally, we show how [FAISS](https://github.com/facebookresearch/faiss) and ChromaDB can be used to store and retrieve embeddings as part of your RAG workflow.

## What are LangChain Retrievers?

A retriever is an interface that returns documents given an unstructured query. It is more general than a vector store. A retriever does not need to be able to store documents, only to return (or retrieve) them. Vector stores can be used as the backbone of a retriever, but there are other types of retrievers as well.
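To make the point that a retriever only needs to return documents, here is a minimal sketch in plain Python. `KeywordRetriever` and its `invoke` method are hypothetical names chosen for illustration; this is not LangChain's retriever base class, and ranking by word overlap is an assumption made to keep the example free of embeddings.

```python
class KeywordRetriever:
    """Toy retriever: ranks documents by how many query words they contain."""

    def __init__(self, documents, k=2):
        self.documents = documents  # plain strings; no embeddings, no vector store
        self.k = k

    def invoke(self, query):
        query_words = set(query.lower().split())
        scored = []
        for doc in self.documents:
            overlap = len(query_words & set(doc.lower().split()))
            if overlap > 0:
                scored.append((overlap, doc))
        # Highest overlap first; the sort is stable, so ties keep their original order
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in scored[: self.k]]


corpus = [
    "AWS revenue increased in 2023",
    "Prime members enjoy fast delivery",
    "Generative AI is a new primitive for AWS",
]
keyword_retriever = KeywordRetriever(corpus, k=2)
print(keyword_retriever.invoke("generative AI on AWS"))
```

The object stores nothing but the raw strings it was given, yet it satisfies the "query in, documents out" contract described above.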

LangChain's retrieval algorithms can improve how LLMs process, understand, and generate human-like text over your own data. As documents grow in size and complexity, representing every facet of a document in a single embedding leads to a loss of specificity. Although it is essential to capture the general essence of a document, it is equally crucial to recognize and represent the varied sub-contexts within it. This is a common challenge when working with larger documents.

Another challenge with RAG is that, at ingestion time, you do not know which queries your document store will need to serve. As a result, the information most relevant to a query may be buried under unrelated text.

For customers in industries such as healthcare, telecommunications, and financial services who want to implement RAG in their applications, the regular retriever chain is limited in precision, in avoiding redundancy, and in compressing information effectively. This makes it less suited to these needs than some of the advanced retrievers discussed in this notebook. These techniques can distill large amounts of information into the concise insights that you need, while helping improve price-performance.

### Local setup (Optional):
---

To run this Jupyter notebook on a local machine, follow these steps:

1. **Configure AWS CLI**: Configure [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) with your AWS credentials. Run `aws configure` and enter your AWS Access Key ID, AWS Secret Access Key, AWS Region, and default output format.

2. **Install required libraries**: Install the necessary Python libraries for working with SageMaker, such as [sagemaker](https://github.com/aws/sagemaker-python-sdk/), [boto3](https://github.com/boto/boto3), and others. You can use a Python environment manager like [conda](https://docs.conda.io/en/latest/) or [virtualenv](https://virtualenv.pypa.io/en/latest/) to manage your Python packages in your preferred IDE (e.g. [Visual Studio Code](https://code.visualstudio.com/)).

3. **Create an IAM role for SageMaker**: Create an AWS Identity and Access Management (IAM) role that grants your user [SageMaker permissions](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html). 


## Contents
---

1. [Requirements](#1.-Requirements)
1. [Data Preparation](#2.-Data-Preparation)
1. [Vector store backed retriever](#3.-Vector-store-backed-retriever)
1. [RetrievalQA Chain](#4.-RetrievalQA-Chain)
1. [Exploring some popular LangChain Retrievers](#5.-Exploring-some-popular-LangChain-Retrievers)
1. [Conclusion](#6.-Conclusion)

## 1. Requirements
---

1. Create an Amazon SageMaker Notebook Instance - [Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-setup-working-env.html)
    - For Notebook Instance type, choose ml.t3.medium.
2. For Select Kernel, choose [conda_python3](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-prepare.html).
3. Install the required packages.

<div class="alert alert-block alert-info"> 

<b>NOTE:</b>

- For <a href="https://aws.amazon.com/sagemaker/studio/" target="_blank">Amazon SageMaker Studio</a>, select Kernel "<span style="color:green;">Python 3 (ipykernel)</span>".

- For <a href="https://docs.aws.amazon.com/sagemaker/latest/dg/studio.html" target="_blank">Amazon SageMaker Studio Classic</a>, select Image "<span style="color:green;">Base Python 3.0</span>" and Kernel "<span style="color:green;">Python 3</span>".

</div>

To run this notebook, install the following dependencies:

In [1]:
%%writefile requirements.txt
langchain-community==0.2.6
langchain==0.2.6
boto3==1.34.134
pypdf==4.1.0
faiss-cpu==1.8.0
sqlalchemy==2.0.31
langchain-aws==0.1.8
transformers

Overwriting requirements.txt


In [2]:
!pip install -U -r requirements.txt --quiet

<div class="alert alert-block alert-warning"> 

<b>NOTE:</b>

Before proceeding, restart the kernel: go to the "Kernel" menu and select "Restart Kernel".

</div>

Import all the necessary libraries

In [3]:
import sqlalchemy
print(sqlalchemy.__version__)

2.0.31


In [4]:
import langchain
print(langchain.__version__)

0.2.6


In [5]:
import boto3
print(boto3.__version__)

1.34.134


In [6]:
import boto3
from boto3 import client
from botocore.config import Config
import glob
import json
from langchain_aws import BedrockLLM
from langchain.chains import ConversationChain
from langchain.embeddings import BedrockEmbeddings
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain.memory import ConversationBufferMemory
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
#from langchain_aws import ChatBedrock
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import PromptTemplate
import numpy as np
from pypdf import PdfReader, PdfWriter
from urllib.request import urlretrieve

Create the Bedrock client

In [7]:
config = Config(read_timeout=2000)

bedrock = boto3.client(service_name='bedrock-runtime',
                       region_name='us-east-1',
                       config=config)

<div class="alert alert-block alert-warning"> 

<b>NOTE:</b>

Currently, only the Command and Command Light text generation models from Cohere work with LangChain's [BedrockLLM](https://python.langchain.com/v0.2/docs/integrations/platforms/aws/) class. LangChain's [ChatBedrock](https://python.langchain.com/v0.2/docs/integrations/chat/bedrock/) class, which is the suggested way to perform chat completion tasks with LangChain, does not yet support the Command R and Command R+ models. This notebook will be updated once that support is available.

</div>

In [102]:
# Set the desired Cohere model as the default model.
# Currently only the Command and Command Light text generation models from Cohere work with LangChain's BedrockLLM class.
# The ChatBedrock class does not yet support Command R and Command R+; this notebook will be updated once it does.

command_light = "cohere.command-light-text-v14"
command_text = "cohere.command-text-v14"

DEFAULT_MODEL = command_text

In [103]:
# Initialize the LLM with the selected Cohere model
llm = BedrockLLM(
    model_id=DEFAULT_MODEL,
    model_kwargs={
        "max_tokens": 2048,  # maximum number of tokens to generate
        "temperature": 0.5,  # sampling temperature
        "p": 1,              # nucleus (top-p) sampling cutoff
    },
    client=bedrock,
)

In [104]:
# Initialize a conversation chain with the Cohere model on Bedrock.
# verbose=False suppresses the chain's logs during execution; set it to True when
# debugging the chain or trying to understand how it works under the hood.
conversation = ConversationChain(
    llm=llm, verbose=False, memory=ConversationBufferMemory()
)

conversation.predict(input="Hi there!")

' Hi Human! I am excited to talk to you. How can I help you today? What would you like to know or discuss?'

In [11]:
from langchain.embeddings import BedrockEmbeddings

# Initialize the Cohere Embed model through BedrockEmbeddings
bedrock_embeddings = BedrockEmbeddings(model_id="cohere.embed-english-v3",
                                       client=bedrock)

## 2. Data Preparation
---

Let's first build out our document store.

In this example, we use several years of Amazon's Letters to Shareholders as the text corpus for Q&A.

In [12]:
!mkdir -p ./data

from urllib.request import urlretrieve
urls = [
    'https://s2.q4cdn.com/299287126/files/doc_financials/2024/ar/Amazon-com-Inc-2023-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2023/ar/2022-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2022/ar/2021-Shareholder-Letter.pdf'
]

filenames = [
    'AMZN-2023-Shareholder-Letter.pdf',
    'AMZN-2022-Shareholder-Letter.pdf',
    'AMZN-2021-Shareholder-Letter.pdf',
]

metadata = [
    dict(year=2023, source=filenames[0]),
    dict(year=2022, source=filenames[1]),
    dict(year=2021, source=filenames[2])]

data_root = "./data/"

for idx, url in enumerate(urls):
    file_path = data_root + filenames[idx]
    urlretrieve(url, file_path)

As part of Amazon's culture, the CEO always includes a copy of the 1997 Letter to Shareholders with every new release. Keeping it would cause repetition, increase the time needed to generate embeddings, and could skew your results. In the next cell, you take the downloaded files, trim the 1997 letter (the last 3 pages), and overwrite each file with its processed version.

In [13]:


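local_pdfs = glob.glob(data_root + '*.pdf')

# Note: the files are overwritten in place, so re-running this cell trims 3 more pages each time.
for local_pdf in local_pdfs:
    pdf_reader = PdfReader(local_pdf)
    pdf_writer = PdfWriter()
    # Copy every page except the last 3 (the appended 1997 letter)
    for pagenum in range(len(pdf_reader.pages) - 3):
        page = pdf_reader.pages[pagenum]
        pdf_writer.add_page(page)

    # Opening with 'wb' already truncates the file, so no seek/truncate is needed
    with open(local_pdf, 'wb') as new_file:
        pdf_writer.write(new_file)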

After downloading, we can load the documents with the help of the [PyPDFLoader available in LangChain](https://python.langchain.com/en/latest/reference/modules/document_loaders.html) and split them into smaller chunks.

Note: The retrieved text should be large enough to contain the information needed to answer a question, but small enough to fit into the LLM prompt. The embeddings model also has an input limit of 512 tokens, which roughly translates to ~2000 characters. For this use case, we create chunks of roughly 1000 characters with an overlap of 100 characters using [RecursiveCharacterTextSplitter](https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/recursive_text_splitter.html).

In [15]:

documents = []

for idx, file in enumerate(filenames):
    loader = PyPDFLoader(data_root + file)
    document = loader.load()
    for document_fragment in document:
        document_fragment.metadata = metadata[idx]

    documents += document

# In our testing, character-based splitting works better with this PDF data set.
# Chunks are at most 1000 characters long, with a 100-character overlap between neighbors.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
)

docs = text_splitter.split_documents(documents)
print(docs[0])

page_content='Dear Shareholders:\nLast year at this time, I shared my enthusiasm and optimism for Amazon’s future. Today, I have even more.\nThe reasons are many, but start with the progress we’ve made in our financial results and customerexperiences, and extend to our continued innovation and the remarkable opportunities in front of us.\nIn 2023, Amazon’s total revenue grew 12% year-over-year (“Y oY”) from $514B to $575B. By segment, North\nAmerica revenue increased 12% Y oY from $316B to $353B, International revenue grew 11% Y oY from$118B to $131B, and AWS revenue increased 13% Y oY from $80B to $91B.\nFurther, Amazon’s operating income and Free Cash Flow (“FCF”) dramatically improved. Operating\nincome in 2023 improved 201% Y oY from $12.2B (an operating margin of 2.4%) to $36.9B (an operatingmargin of 6.4%). Trailing Twelve Month FCF adjusted for equipment finance leases improved from -$12.8Bin 2022 to $35.5B (up $48.3B).' metadata={'year': 2023, 'source': 'AMZN-2023-Shareholder-L
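To make the roles of `chunk_size` and `chunk_overlap` concrete, the following is a simplified sketch of overlapping chunking using fixed-width character windows. The function `split_with_overlap` is a hypothetical helper, not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to break on separators such as paragraph breaks, line breaks, and spaces and therefore usually produces chunks shorter than the maximum.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=100):
    """Fixed-width windows; each chunk starts chunk_overlap characters before the previous one ends."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample_text = "x" * 2500
chunks = split_with_overlap(sample_text)
print([len(c) for c in chunks])
```

The overlap means that a sentence cut at a chunk boundary still appears whole in at least one of the two neighboring chunks, as long as it is shorter than the overlap.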

Before proceeding, let's look at some statistics about the document preprocessing we just performed:

In [16]:
def avg_doc_length(documents):
    # Integer average of the character length of each document's text
    return sum(len(doc.page_content) for doc in documents) // len(documents)

print(f'Average length among {len(documents)} documents loaded is {avg_doc_length(documents)} characters.')
print(f'After the split we have {len(docs)} documents as opposed to the original {len(documents)}.')
print(f'Average length among {len(docs)} documents (after split) is {avg_doc_length(docs)} characters.')

Average length among 21 documents loaded is 4515 characters.
After the split we have 141 documents as opposed to the original 21.
Average length among 141 documents (after split) is 689 characters.


We started with 3 PDF documents, loaded as 21 pages, which have been split into 141 smaller chunks averaging roughly 700 characters each.

Now let's see what a sample embedding from the Cohere Embed model looks like for one of those chunks.

In [17]:
sample_embedding = np.array(bedrock_embeddings.embed_query(docs[0].page_content))
print("Sample embedding of a document chunk: ", sample_embedding)
print("Size of the embedding: ", sample_embedding.shape)

Sample embedding of a document chunk:  [-0.00990295 -0.00534058 -0.05657959 ...  0.04312134 -0.05709839
  0.01496124]
Size of the embedding:  (1024,)


Next, we store the chunk embeddings in a vector store. This can be done easily using the [FAISS](https://github.com/facebookresearch/faiss) implementation inside [LangChain](https://python.langchain.com/en/latest/modules/indexes/vectorstores/examples/faiss.html), which takes the embeddings model and the documents as input and creates the entire vector store.

In [18]:
vectorstore_faiss = FAISS.from_documents(
    docs,
    bedrock_embeddings,
)
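Conceptually, a vector store keeps one vector per chunk and answers a query by ranking those vectors against the query vector. The sketch below shows that idea as a brute-force search in plain Python. The helper names, the 3-dimensional toy vectors, and the use of cosine similarity are assumptions for illustration; FAISS uses optimized index structures, and the index LangChain builds may use a different metric such as Euclidean distance.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=2):
    """indexed: list of (text, vector). Returns the k texts most similar to query_vec."""
    ranked = sorted(
        indexed,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


toy_index = [
    ("chunk about AWS", [1.0, 0.0, 0.0]),
    ("chunk about Prime", [0.0, 1.0, 0.0]),
    ("chunk about AWS and AI", [0.8, 0.0, 0.6]),
]
print(top_k([0.9, 0.0, 0.1], toy_index, k=2))
```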

## 3. Vector store backed retriever
---

A vector store retriever is a retriever that uses a vector store to retrieve documents. It is a lightweight wrapper around the vector store class to make it conform to the retriever interface. It uses the search methods implemented by a vector store, like similarity search and MMR, to query the texts in the vector store. This is the simplest method and the one that is easiest to get started with. It creates embeddings for each piece of text.

Once you construct a vector store, it's very easy to construct a retriever. Let's walk through an example.

In [21]:
retriever = vectorstore_faiss.as_retriever()
vector_store_answer = retriever.invoke("Generative AI")
print(vector_store_answer)

[Document(page_content='Sometimes, people ask us “what’s your next pillar? Y ou have Marketplace, Prime, and AWS, what’s next?”\nThis, of course, is a thought-provoking question. However, a question people never ask, and might be evenmore interesting is what’s the next set of primitives you’re building that enables breakthrough customer\nexperiences? If you asked me today, I’d lead with Generative AI (“GenAI”).\nMuch of the early public attention has focused on GenAI applications , with the remarkable 2022 launch of\nChatGPT. But, to our “primitive” way of thinking, there are three distinct layers in the GenAI stack, each ofwhich is gigantic, and each of which we’re deeply investing.\nThebottom layer is for developers and companies wanting to build foundation models (“FMs”). The', metadata={'year': 2023, 'source': 'AMZN-2023-Shareholder-Letter.pdf'}), Document(page_content='only been the last five to ten years that it’s started to be used more pervasively by companies. This shift wasdr

<b>Maximum marginal relevance retrieval</b>

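By default, the vector store retriever uses similarity search. If the underlying vector store supports maximum marginal relevance (MMR) search, you can specify that as the search type. MMR trades off relevance to the query against similarity to the documents already selected, which reduces redundancy in the results.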

In [22]:
retriever = vectorstore_faiss.as_retriever(search_type="mmr")
vector_store_mmr = retriever.invoke("Generative AI")
print(vector_store_mmr)

[Document(page_content='Sometimes, people ask us “what’s your next pillar? Y ou have Marketplace, Prime, and AWS, what’s next?”\nThis, of course, is a thought-provoking question. However, a question people never ask, and might be evenmore interesting is what’s the next set of primitives you’re building that enables breakthrough customer\nexperiences? If you asked me today, I’d lead with Generative AI (“GenAI”).\nMuch of the early public attention has focused on GenAI applications , with the remarkable 2022 launch of\nChatGPT. But, to our “primitive” way of thinking, there are three distinct layers in the GenAI stack, each ofwhich is gigantic, and each of which we’re deeply investing.\nThebottom layer is for developers and companies wanting to build foundation models (“FMs”). The', metadata={'year': 2023, 'source': 'AMZN-2023-Shareholder-Letter.pdf'}), Document(page_content='In the early days of AWS, people sometimes asked us why compute wouldn’t just be an undifferentiated', metadata={
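The following is a minimal sketch of the greedy selection behind maximum marginal relevance, written in plain Python. The function `mmr`, its parameter names, and the 2-dimensional toy vectors are assumptions for illustration; the vector store's own implementation may differ in details such as how many candidates it fetches before re-ranking.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, candidates, k=2, lambda_mult=0.5):
    """candidates: list of (text, vector). Greedily picks k items that balance
    relevance to the query against similarity to items already picked."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def mmr_score(item):
            relevance = cosine(query_vec, item[1])
            # Redundancy is the highest similarity to anything already selected
            redundancy = max((cosine(item[1], s[1]) for s in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return [text for text, _ in selected]


query = [1.0, 0.2]
candidates = [
    ("GenAI paragraph", [1.0, 0.1]),
    ("near-duplicate GenAI paragraph", [1.0, 0.0]),
    ("AWS compute paragraph", [0.3, 1.0]),
]
print(mmr(query, candidates, k=2))                   # diversity-aware selection
print(mmr(query, candidates, k=2, lambda_mult=1.0))  # pure relevance ranking
```

With `lambda_mult=1.0` the redundancy term vanishes and the selection reduces to plain similarity ranking, so the near-duplicate is returned. With the balanced setting, the second pick is the less similar but non-redundant paragraph.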

<b>Similarity score threshold retrieval</b>

You can also set a retrieval method that sets a similarity score threshold and only returns documents with a score above that threshold.

In [23]:
retriever = vectorstore_faiss.as_retriever(search_type="similarity_score_threshold", search_kwargs={"score_threshold": 0.2})
vector_store_sst = retriever.invoke("Generative AI")
print(vector_store_sst)

[Document(page_content='Sometimes, people ask us “what’s your next pillar? Y ou have Marketplace, Prime, and AWS, what’s next?”\nThis, of course, is a thought-provoking question. However, a question people never ask, and might be evenmore interesting is what’s the next set of primitives you’re building that enables breakthrough customer\nexperiences? If you asked me today, I’d lead with Generative AI (“GenAI”).\nMuch of the early public attention has focused on GenAI applications , with the remarkable 2022 launch of\nChatGPT. But, to our “primitive” way of thinking, there are three distinct layers in the GenAI stack, each ofwhich is gigantic, and each of which we’re deeply investing.\nThebottom layer is for developers and companies wanting to build foundation models (“FMs”). The', metadata={'year': 2023, 'source': 'AMZN-2023-Shareholder-Letter.pdf'})]
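Threshold-based retrieval can be pictured as a filter applied to scored hits. The sketch below assumes one common convention for unit-length vectors, in which a Euclidean distance `d` is mapped to a relevance score of `1 - d / sqrt(2)`; both helper functions and the sample distances are hypothetical, and the scoring used by your vector store may differ.

```python
import math


def relevance_from_distance(distance):
    # Assumed convention for unit-length vectors: distance 0 -> 1.0, distance sqrt(2) -> 0.0
    return 1.0 - distance / math.sqrt(2)


def filter_by_threshold(hits, score_threshold):
    """hits: list of (text, euclidean_distance). Keeps hits whose relevance meets the threshold."""
    kept = []
    for text, distance in hits:
        score = relevance_from_distance(distance)
        if score >= score_threshold:
            kept.append((text, round(score, 3)))
    return kept


hits = [("close chunk", 0.5), ("borderline chunk", 1.1), ("far chunk", 1.4)]
print(filter_by_threshold(hits, score_threshold=0.2))
```

Unlike a fixed `k`, a threshold can return fewer results than requested, or none at all, which is why the cell above returned a single document.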


---
### Question Answering with VectorStoreIndexWrapper

Using the index wrapper, we can abstract away most of the heavy lifting, such as creating the prompt, getting the embedding of the query, retrieving the relevant documents, and calling the LLM. [VectorStoreIndexWrapper](https://python.langchain.com/en/latest/modules/indexes/getting_started.html#one-line-index-creation) helps us with that.

In [24]:
wrapper_store_faiss = VectorStoreIndexWrapper(vectorstore=vectorstore_faiss)

We use the wrapper provided by LangChain, which wraps around the vector store and takes the LLM as input. This wrapper performs the following steps behind the scenes:

- Take the question as input
- Create the question embedding
- Fetch the relevant documents
- Stuff the documents and the question into a prompt
- Invoke the model with the prompt and generate the answer in a human-readable manner
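The "stuff" step in the list above can be sketched in a few lines of plain Python. The function `build_stuffed_prompt` and the prompt wording are hypothetical and are not the prompt LangChain uses internally; the point is only that all retrieved chunks are concatenated into a single context block ahead of the question.

```python
def build_stuffed_prompt(question, retrieved_chunks):
    """Concatenate the retrieved chunks into one context block and append the question."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


retrieved = ["2023 revenue grew 12% year-over-year.", "AWS revenue increased 13%."]
stuffed_prompt = build_stuffed_prompt("How did revenue change in 2023?", retrieved)
print(stuffed_prompt)
```

Because every retrieved chunk goes into one prompt, the number and size of chunks must stay within the model's context window.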

*Note: In this example we are using `Cohere Command` as the underlying model. This particular model performs best if the inputs are provided in the form `"""\"prompt\":\"{query}\" """`. The cell below shows an example of how to control the prompt so that the LLM stays grounded and does not answer outside the context.*

In [50]:
from langchain_core.prompts import PromptTemplate

prompt_template = """\"prompt\":\"{query}\" """

# The template's only placeholder is {query}, so that is the input variable
VSPROMPT = PromptTemplate(
    template=prompt_template, input_variables=["query"]
)
chain_type_kwargs = {"prompt": VSPROMPT}

In [119]:
vs_query = "Provide me some highlights of the business in the year 2023"

In [120]:
answer = wrapper_store_faiss.query(question=VSPROMPT.format(query=vs_query), llm=llm)
print(answer)

 I don't have the specific information regarding the business performance in 2023, however, the provided text does highlight several accomplishments achieved by the company in that year. 

Some of the notable highlights mentioned in the text include:

- Strong overall performance in 2023, despite the challenging macroeconomic conditions of 2022. 
- Growth in demand and innovation across the company's largest businesses. 
- Improvements made to enhance the customer experience both in the short and long term. 
- A focus on invention, collaboration, discipline, execution, and reimagination. 

It seems that the company had a successful year in terms of adapting and growing, despite the challenges faced. 

Would you like me to elaborate on any of the mentioned points? 


## 4. RetrievalQA Chain
---
In the scenario above, you explored some quick and easy ways to get context-aware answers to your questions. Now let's look at a more customizable option, [RetrievalQA](https://docs.smith.langchain.com/cookbook/hub-examples/retrieval-qa-chain), where you can customize how the fetched documents are added to the prompt using the `chain_type` parameter. To control how many relevant documents are retrieved, change the `k` parameter in the cell below and compare the outputs. In many scenarios you might want to know which source documents the LLM used to generate the answer. You can get them in the output using `return_source_documents`, which returns the documents that are added to the context of the LLM prompt. `RetrievalQA` also allows you to provide a custom [prompt template](https://python.langchain.com/docs/modules/model_io/prompts/quick_start/), which can be specific to the model.

In [28]:
from langchain.chains import RetrievalQA

prompt_template = """Text: {context}
    Question: {question}
    you are a chatbot designed to assist the users.
    Answer only the questions based on the text provided. If the text doesn't contain the answer,
    reply that the answer is not available.
    keep the answers precise to the question"""

PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore_faiss.as_retriever( #default retriever with vectorstore
        search_type="similarity", search_kwargs={"k": 3}
    ),
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)

Let's start asking questions:

In [121]:
query = "Provide me some highlights of the business in the year 2023"
result = qa.invoke({"query": query})
print(result['result'])

print(f"\n{result['source_documents']}")

 The text highlights the success of Amazon's business in 2023, but does not mention specific events or achievements from that year.

Here are some ways the text indirectly supports the claim:

1. Success in business - The text states that 2023 was a strong year for Amazon, indicating that the Amazon Business investment has been successful.

2. Investment in ecommerce - The text mentions Amazon Business as an investment where Amazon's ecommerce and logistics capabilities have been leveraged, suggesting that Amazon is focusing on expanding its online retail presence.

3. Teamwork and execution - The author expresses gratitude towards their teams for their hard work and delivery, implying that Amazon's success is due to effective collaboration and execution.

Can I help you with anything else regarding Amazon's business or the year 2023? 

[Document(page_content='Overall, 2023 was a strong year, and I’m grateful to our collective teams who delivered on behalf of\ncustomers. These results 

In the following sections, we initialize the RetrievalQA chain with each of the retrievers introduced below. Previously, we used `vectorstore_faiss` as the retriever. Although this is the easiest way to get started, it is not very efficient.

---

## 5. Exploring some popular LangChain Retrievers

In this section, we go over some popular LangChain retrieval algorithms (retrievers). As noted earlier, a retriever does not need to be able to store documents, only to return (or retrieve) them. Vector stores can be used as the backbone of a retriever, but there are other types of retrievers as well. Retrievers accept a string `query` as input and return a list of `Document`s as output.

### 5.1 Multi-vector

It can often be beneficial to store multiple vectors per document. LangChain has a base `MultiVectorRetriever` which makes querying this type of setup easy. A lot of the complexity lies in how to create the multiple vectors per document. The multi-vector retriever is best used when you are able to extract information from documents that you think is more relevant to index than the text itself. In this example, we show how to create multiple vectors by splitting a document into smaller chunks and embedding those chunks. The next section covers the [Parent Document Retriever](#5.2-Parent-Document-Retriever-Chain), which goes into more detail about this kind of retrieval.
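The lookup idea behind this setup can be sketched in plain Python: several indexed vectors point at the same document through a shared ID, and retrieval maps the hits back to full documents without returning the same document twice. The function `retrieve_parents`, the dictionary-based docstore, and the sample hits are hypothetical stand-ins, not the `MultiVectorRetriever` implementation.

```python
def retrieve_parents(hit_sub_docs, docstore, id_key="doc_id"):
    """Map sub-document hits back to their full documents, keeping hit order and dropping duplicates."""
    seen_ids = []
    for sub_doc in hit_sub_docs:
        doc_id = sub_doc["metadata"].get(id_key)
        if doc_id is not None and doc_id not in seen_ids:
            seen_ids.append(doc_id)
    return [docstore[doc_id] for doc_id in seen_ids if doc_id in docstore]


toy_docstore = {"a": "full text of document A", "b": "full text of document B"}
toy_hits = [
    {"text": "chunk A-2", "metadata": {"doc_id": "a"}},
    {"text": "chunk B-1", "metadata": {"doc_id": "b"}},
    {"text": "chunk A-1", "metadata": {"doc_id": "a"}},
]
print(retrieve_parents(toy_hits, toy_docstore))
```

Two of the three hits belong to document A, but it is returned only once, in the position of its best-ranked chunk.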


In [31]:
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryByteStore
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [32]:
import uuid

# The vectorstore used to index the child chunks is the existing `vectorstore_faiss`

# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"

# The retriever (its docstore is empty to start)
mv_retriever = MultiVectorRetriever(
    vectorstore=vectorstore_faiss,
    byte_store=store,
    id_key=id_key,
)

# One unique ID per parent document (here, each chunk in `docs`)
doc_ids = [str(uuid.uuid4()) for _ in docs]

In [33]:
# The splitter to use to create smaller chunks
child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

In [34]:
sub_docs = []
for i, doc in enumerate(docs):
    _id = doc_ids[i]
    _sub_docs = child_text_splitter.split_documents([doc])
    for _doc in _sub_docs:
        _doc.metadata[id_key] = _id
    sub_docs.extend(_sub_docs)

In [35]:
mv_retriever.vectorstore.add_documents(sub_docs)
mv_retriever.docstore.mset(list(zip(doc_ids, docs)))

In [36]:
# Vectorstore alone retrieves the small chunks
mv_retriever.vectorstore.similarity_search("Generative AI")[0]

Document(page_content='More recently, a newer form of machine learning,called Generative AI, has burst onto the scene and promises to significantly accelerate machine learningadoption. Generative AI is based on very Large Language Models (trained on up to hundreds of billionsof parameters, and growing), across expansive datasets, and has radically general and broad recall andlearning capabilities. We have been working', metadata={'year': 2022, 'source': 'AMZN-2022-Shareholder-Letter.pdf', 'doc_id': '58672aa6-6533-4af7-921a-45f461933226'})

In [37]:
from langchain.retrievers.multi_vector import SearchType

mv_retriever.search_type = SearchType.mmr

mv_retriever.invoke("Generative AI")[0].page_content

'only been the last five to ten years that it’s started to be used more pervasively by companies. This shift wasdriven by several factors, including access to higher volumes of compute capacity at lower prices than was everavailable. Amazon has been using machine learning extensively for 25 years, employing it in everythingfrom personalized ecommerce recommendations, to fulfillment center pick paths, to drones for Prime Air,to Alexa, to the many machine learning services AWS offers (where AWS has the broadest machine learningfunctionality and customer base of any cloud provider). More recently, a newer form of machine learning,called Generative AI, has burst onto the scene and promises to significantly accelerate machine learningadoption. Generative AI is based on very Large Language Models (trained on up to hundreds of billionsof parameters, and growing), across expansive datasets, and has radically general and broad recall andlearning capabilities. We have been working on our own'

Now, let's initialize the chain using the `MultiVectorRetriever`. We will pass the prompt in via the `chain_type_kwargs` argument.

In [58]:
mv_qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=mv_retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)

In [123]:
query = "Based on the latest shareholder letter, provide me some highlights of the business in the year 2023"
result = mv_qa.invoke({"query": query})
print(result['result'])

print(f"\n{result['source_documents']}")

 
Some of the highlights of the business based on the information provided from the shareholder letter are:
- Optimism and energy going into the new year, despite a challenging macroeconomic backdrop in 2022
- Growth in demand, even after experiencing unprecedented growth during the first half of the pandemic
- Innovation across all business areas that improved the customer experience in the short and long term
- Adjustments to investment strategies while preserving long-term investments that can drive change for customers, shareholders, and employees 

On AWS, they have made optimizations and improvements, such as leveraging better technology like Graviton chips, that while having a short-term negative impact on revenue, was best for their customers in the long run and should pay off in the future.

Additionally, they had a significant delivery year and made various advancements and announcements of their next-generation generalized offerings. 

Would you like to know more about any o

---

### 5.2 Parent Document Retriever Chain


In this scenario, let's look at a more advanced RAG option with the help of [ParentDocumentRetriever](https://python.langchain.com/docs/modules/data_connection/retrievers/parent_document_retriever). When working with document retrieval, you may encounter a trade-off between storing small chunks of a document for accurate embeddings and larger documents to preserve more context. The `ParentDocumentRetriever` strikes that balance by indexing small chunks while returning the larger documents they came from.

First, a `parent_splitter` is used to divide the original documents into larger chunks called `parent documents`. These parent documents preserve a reasonable amount of context for the LLM.

Next, a `child_splitter` is applied to create smaller `child documents` from the parent documents. These child documents allow the embeddings to more accurately reflect their meaning.

The child documents are then indexed in a vectorstore using embeddings. This enables efficient retrieval of relevant child documents based on similarity.

To retrieve relevant information, the `ParentDocumentRetriever` first fetches the child documents from the vectorstore. It then looks up the parent IDs for those child documents and returns the corresponding larger parent documents.
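The two-level scheme described above can be sketched in plain Python. All names here are hypothetical, fixed-width slicing stands in for the text splitters, and substring matching stands in for embedding similarity; the sketch only illustrates that the match happens on a small child chunk while the text returned is its larger parent.

```python
def fixed_chunks(text, size):
    return [text[i:i + size] for i in range(0, len(text), size)]


def build_parent_child_index(document, parent_size, child_size):
    parents = {}      # parent_id -> parent text
    child_index = []  # (child text, parent_id)
    for parent_id, parent_text in enumerate(fixed_chunks(document, parent_size)):
        parents[parent_id] = parent_text
        for child_text in fixed_chunks(parent_text, child_size):
            child_index.append((child_text, parent_id))
    return parents, child_index


def retrieve_parent(query, parents, child_index):
    """Toy matching by substring instead of embeddings, for illustration only."""
    for child_text, parent_id in child_index:
        if query in child_text:
            return child_text, parents[parent_id]
    return None, None


toy_document = "alpha " * 10 + "generative ai " + "omega " * 10
toy_parents, toy_children = build_parent_child_index(toy_document, parent_size=80, child_size=20)
matched_child, returned_parent = retrieve_parent("generative", toy_parents, toy_children)
print(len(matched_child), len(returned_parent))
```

The small child gives a precise match, and the parent supplies the surrounding context that is passed on to the LLM.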

The `ParentDocumentRetriever` uses an [InMemoryStore](https://api.python.langchain.com/en/v0.1.4/storage/langchain.storage.in_memory.InMemoryBaseStore.html) to store and manage the parent documents. By working with both parent and child documents, this approach aims to balance accurate embeddings with contextual information, providing more meaningful and relevant retrieval results.

In [53]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore

Sometimes, the full documents are too big to retrieve as is. In that case, what we really want to do is first split the raw documents into larger chunks, and then split those into smaller chunks. We then index the smaller chunks, but on retrieval we return the larger chunks (but still not the full documents).

In [54]:
# This text splitter is used to create the parent documents
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)

# This text splitter is used to create the child documents
# It should create documents smaller than the parent
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

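# The vectorstore to use to index the child chunks
# Note: the chunks indexed here carry no parent ID; add_documents() below
# indexes the child chunks again, this time tagged with their parent IDs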
vectorstore_faiss = FAISS.from_documents(
    child_splitter.split_documents(documents),
    bedrock_embeddings,
)

# The storage layer for the parent documents
store = InMemoryStore()

In [55]:
pdr_retriever = ParentDocumentRetriever(
    vectorstore=vectorstore_faiss,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

In [56]:
pdr_retriever.add_documents(documents, ids=None)

Let’s now call the vector store search functionality - we should see that it returns small chunks (since we’re storing the small chunks).

In [61]:
sub_docs = vectorstore_faiss.similarity_search("Generative AI")

In [62]:
len(sub_docs[0].page_content)

106

In [63]:
print(sub_docs[0].page_content)

and Generative AI . Machine learning has been a technology with high promise for several decades, but it’s


Let’s now retrieve from the overall retriever. This should return large documents - since it returns the documents where the smaller chunks are located.

In [64]:
pdr_retrieved_docs = pdr_retriever.invoke("Generative AI")

In [65]:
len(pdr_retrieved_docs[0].page_content)

1879

In [66]:
print(pdr_retrieved_docs[0].page_content)

and Generative AI . Machine learning has been a technology with high promise for several decades, but it’s
only been the last five to ten years that it’s started to be used more pervasively by companies. This shift wasdriven by several factors, including access to higher volumes of compute capacity at lower prices than was everavailable. Amazon has been using machine learning extensively for 25 years, employing it in everythingfrom personalized ecommerce recommendations, to fulfillment center pick paths, to drones for Prime Air,to Alexa, to the many machine learning services AWS offers (where AWS has the broadest machine learningfunctionality and customer base of any cloud provider). More recently, a newer form of machine learning,called Generative AI, has burst onto the scene and promises to significantly accelerate machine learningadoption. Generative AI is based on very Large Language Models (trained on up to hundreds of billionsof parameters, and growing), across expansive datasets

Now, let's initialize the chain using the `ParentDocumentRetriever`. We will pass the prompt in via the `chain_type_kwargs` argument.

In [124]:
pdr_qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=pdr_retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)
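
As a rough illustration of what `chain_type="stuff"` means: all retrieved documents are concatenated ("stuffed") into a single prompt alongside the question. The sketch below assumes a hypothetical template with `{context}` and `{question}` placeholders; the notebook's actual `PROMPT` is defined in an earlier cell and may differ, and `build_stuff_prompt` is an illustrative name rather than a LangChain function.

```python
# Hypothetical template; the notebook's actual PROMPT is defined in an earlier cell.
TEMPLATE = (
    "Use the following context to answer the question.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_stuff_prompt(documents, question, template=TEMPLATE, separator="\n\n"):
    """Join all retrieved texts into one context string and fill the template."""
    context = separator.join(documents)
    return template.format(context=context, question=question)
```

Because every retrieved document lands in one prompt, returning large parent documents increases the prompt size passed to the LLM.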

Let's start asking questions:

In [125]:
query = "Provide me some highlights of the business in the year 2023"
result = pdr_qa.invoke({"query": query})
print(result['result'])

print(f"\n{result['source_documents']}")

 The following are some key highlights of Amazon's business in 2023:


- Launch of two end-to-end prototype satellites as part of Project Kuiper, with all key systems validated, a significant milestone in the company's journey to provide broadband connectivity to underserved areas
- On track to launch the first production satellites in 2024 as part of Project Kuiper
- Expansion of Amazon's grocery business, including the Whole Foods Market and Amazon Fresh, to meet the needs of customers who prefer to shop for groceries in physical stores
- Investment in Amazon Business, leveraging the company's e-commerce and logistics capabilities to serve businesses and governments seeking better connectivity and performance


Overall, Amazon had strong results in 2023, driven by the collective efforts of its teams to deliver for customers, and is encouraged by its progress and innovation. 

Can I help you with anything else? 

[Document(page_content='In October, we hit a major milestone in our jour

### 5.3 Contextual Compression Chain
---

Contextual Compression is the final Retriever we will be looking at. One challenge with retrieval is that usually you don’t know the specific queries your document storage system will face when you ingest data into the system. This means that the information most relevant to a query may be buried in a document with a lot of irrelevant text. Passing that full document through your application can lead to more expensive LLM calls and poorer responses.

`Contextual compression` is meant to fix this. The idea is simple: instead of immediately returning retrieved documents as-is, you can compress them using the context of the given query, so that only the relevant information is returned. “Compressing” here refers to both compressing the contents of an individual document and filtering out documents wholesale.

To use the `ContextualCompressionRetriever`, you'll need:

- **A base retriever**: This is the initial retriever that fetches documents from the storage system based on the query.
- **A Document Compressor**: This component takes the initially retrieved documents and shortens them by reducing the contents of individual documents or dropping irrelevant documents altogether, using the query context to determine relevance.

The workflow is as follows: The query is passed to the base retriever, which fetches a set of potentially relevant documents. These documents are then fed into the Document Compressor, which compresses and filters them based on the query context. The resulting compressed and filtered documents, containing only the most relevant information, are then returned for further processing or use in downstream applications.

By employing contextual compression, the `ContextualCompressionRetriever` can improve the quality of responses, reduce the cost of LLM calls, and enhance the overall efficiency of the retrieval process.
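
The workflow described above can be sketched without any LLM. In this minimal illustration, word overlap stands in for both the base retriever's similarity search and the compressor's relevance judgment; the function names are hypothetical and this is not how LangChain implements the retriever.

```python
import re


def tokenize(text):
    """Lowercased set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def base_retriever(query, documents, k=3):
    """Rank documents by the number of words shared with the query; return the top k."""
    query_words = tokenize(query)
    ranked = sorted(documents, key=lambda d: len(query_words & tokenize(d)), reverse=True)
    return ranked[:k]


def compress(query, document):
    """Keep only the sentences that share at least one word with the query."""
    query_words = tokenize(query)
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    return " ".join(s for s in sentences if query_words & tokenize(s))


def compression_retriever(query, documents, k=3):
    """Retrieve, compress each document, and drop documents with nothing relevant left."""
    compressed = (compress(query, d) for d in base_retriever(query, documents, k))
    return [text for text in compressed if text]
```

Both forms of compression appear here: `compress` shortens individual documents, and the final list comprehension filters out documents wholesale.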

#### Adding contextual compression with an LLMChainExtractor
---

Now let’s wrap our base retriever with a `ContextualCompressionRetriever`. We’ll add an [LLMChainExtractor](https://api.python.langchain.com/en/latest/retrievers/langchain.retrievers.document_compressors.chain_extract.LLMChainExtractor.html), which will iterate over the initially returned documents and extract from each only the content that is relevant to the query.

In [70]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

text_splitter = RecursiveCharacterTextSplitter(
    # Chunk size and overlap are measured in characters
    chunk_size=1000,
    chunk_overlap=100,
)

docs = text_splitter.split_documents(documents)
retriever = FAISS.from_documents(
    docs,
    bedrock_embeddings,
).as_retriever()

compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)

compressed_docs = compression_retriever.invoke(
    " How has generative AI impacted AWS?"
)
print(compressed_docs)

[Document(page_content="Amazon has been using ML for 25 years and offers many machine learning services (like generative AI), and so has broad ML functionality and customer base. \n(None of the other info was relevant to the question, so I didn't extract it as per instructions.)", metadata={'year': 2022, 'source': 'AMZN-2022-Shareholder-Letter.pdf'}), Document(page_content='I’m not sure how to answer your question. I am an AI chatbot and do not have real-time access to information on the internet. However, I can extract relevant information from the given context to provide you with some information. \n\nIs there anything else I can help you with?', metadata={'year': 2023, 'source': 'AMZN-2023-Shareholder-Letter.pdf'}), Document(page_content="I'm not sure which parts of the context are most relevant, but here are some possibilities:\n\n- I could write an entire letter about LLMs and Generative AI, so they must be important topics for Amazon.\n- LLMs and Generative AI will have a signif
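
The raw list above is hard to scan, and some extracted passages contain conversational filler from the model rather than document text. A small helper like the following sketch can make such cases easier to spot. It only assumes that each document exposes `page_content` and `metadata`, as in the output above; `format_docs` is a hypothetical name, not a LangChain function.

```python
def format_docs(docs, max_chars=200):
    """Return one readable block per document: index, source, length, and a text preview."""
    blocks = []
    for i, doc in enumerate(docs, start=1):
        source = doc.metadata.get("source", "unknown")
        text = doc.page_content.strip()
        preview = text if len(text) <= max_chars else text[:max_chars] + "..."
        blocks.append(f"Document {i} ({source}, {len(text)} chars)\n{preview}")
    # Separate the blocks with a horizontal rule
    return ("\n" + "-" * 80 + "\n").join(blocks)
```

Calling `print(format_docs(compressed_docs))` would then show each compressed document with its source file and length.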

Now, let's initialize the chain using the `ContextualCompressionRetriever` with an `LLMChainExtractor`. We will pass the prompt in via the `chain_type_kwargs` argument.

In [88]:
cc_qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=compression_retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)

Let's start asking questions:

In [115]:
query = "Based on the latest shareholder letter, provide me some highlights of the business in the year 2023"
result = cc_qa.invoke({"query": query})
print(result['result'])

print(f"\n{result['source_documents']}")

 Here are some of the key highlights of Amazon's business in the year 2023, based on the shareholder letter:


1. Total revenue increased by 12% compared to 2022.
2. The North America segment, which generates the majority of Amazon's revenue, grew 12% year-over-year.
3. Amazon's international segment saw an 11% revenue growth.
4. AWS (Amazon Web Services) revenue increased by 13% compared to 2022.
5. Amazon experienced a dramatic improvement in operating income and free cash flow.

The answer to the question is not available. 

[Document(page_content='Dear shareholders:\nAs I sit down to write my second annual shareholder letter as CEO, I find myself optimistic and energized\nby what lies ahead for Amazon. Despite 2022 being one of the harder macroeconomic years in recent memory,and with some of our own operating challenges to boot, we still found a way to grow demand (on top ofthe unprecedented growth we experienced in the first half of the pandemic). We innovated in our largestbusine

#### More built-in compressors: filters
---

##### LLMChainFilter
---

The [LLMChainFilter](https://api.python.langchain.com/en/latest/retrievers/langchain.retrievers.document_compressors.chain_filter.LLMChainFilter.html) is a slightly simpler but more robust compressor that uses an LLM chain to decide which of the initially retrieved documents to filter out and which ones to return, without manipulating the document contents.

In [78]:
from langchain.retrievers.document_compressors import LLMChainFilter

_filter = LLMChainFilter.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=_filter, base_retriever=retriever
)

compressed_docs = compression_retriever.invoke(
    "Generative AI"
)
print(compressed_docs)

[Document(page_content='Sometimes, people ask us “what’s your next pillar? Y ou have Marketplace, Prime, and AWS, what’s next?”\nThis, of course, is a thought-provoking question. However, a question people never ask, and might be evenmore interesting is what’s the next set of primitives you’re building that enables breakthrough customer\nexperiences? If you asked me today, I’d lead with Generative AI (“GenAI”).\nMuch of the early public attention has focused on GenAI applications , with the remarkable 2022 launch of\nChatGPT. But, to our “primitive” way of thinking, there are three distinct layers in the GenAI stack, each ofwhich is gigantic, and each of which we’re deeply investing.\nThebottom layer is for developers and companies wanting to build foundation models (“FMs”). The', metadata={'year': 2023, 'source': 'AMZN-2023-Shareholder-Letter.pdf'}), Document(page_content='only been the last five to ten years that it’s started to be used more pervasively by companies. This shift wasdr
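
To make the contrast with the extractor concrete, here is a minimal sketch of the filtering idea: each document is either kept whole or dropped, and its text is never rewritten. The `keyword_judge` function is a hypothetical stand-in for the LLM's yes/no relevance decision, and neither function is part of LangChain.

```python
import re


def keyword_judge(query, document):
    """Stand-in for the LLM's yes/no decision: does any query word occur in the document?"""
    doc_words = set(re.findall(r"[a-z0-9]+", document.lower()))
    return any(word in doc_words for word in re.findall(r"[a-z0-9]+", query.lower()))


def filter_documents(query, documents, is_relevant=keyword_judge):
    """Return the documents judged relevant, with their text left unmodified."""
    return [doc for doc in documents if is_relevant(query, doc)]
```

Because the kept documents pass through unchanged, irrelevant sentences inside a relevant document still reach the LLM, as the full passages in the output above show.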

Now, let's initialize the chain using the `ContextualCompressionRetriever` with an `LLMChainFilter`. We will pass the prompt in via the `chain_type_kwargs` argument.

In [79]:
filter_qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=compression_retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)

Let's start asking questions:

In [116]:
query = "Based on the latest shareholder letter, provide me some highlights of the business in the year 2023"
result = filter_qa.invoke({"query": query})
print(result['result'])

print(f"\n{result['source_documents']}")

 Here are some of the most important highlights from the letter for the year 2023:

1. Total revenue increased 12% year-over-year, reaching $575B. This growth was driven by all segments of the business, with a notable increase in revenue from North America (12% increase) and AWS (13% increase).

2. Operating income improved significantly, with a margin of 6.4% in 2023 compared to 2.4% in 2022. 

3. The company's Free Cash Flow (FCF) improved dramatically, going from -$12.8B in 2022 to $35.5B in 2023. 

These financial metrics suggest that Amazon's business is growing stronger and more profitable, and that the company is well-positioned for continued success in the future.

Would you like to know anything else from the letter? 

[Document(page_content='Dear shareholders:\nOver the past 25 years at Amazon, I’ve had the opportunity to write many narratives, emails, letters, and\nkeynotes for employees, customers, and partners. But, this is the first time I’ve had the honor of writing oura

### Observation

Overall, we notice that the responses lack some detail. This is due to the use of LangChain's `BedrockLLM` class for chat completion tasks. Cohere Command R and R+ models are being added to LangChain's `ChatBedrock` class, and this notebook will be updated accordingly in the near future.

## 6. Conclusion
---

In this notebook, we presented a solution that uses Cohere's Command models and their state-of-the-art embeddings model to apply different LangChain retrieval algorithms in your retrieval chains, enhancing the models' ability to process and generate information. We also explored using persistent storage for embeddings and document chunks and integration with enterprise data stores. Overall, we notice that the advanced retrievers provide more detailed responses than the vector store-backed retriever and wrapper, whose answers do not include explicit examples from the documents. The retrievers we used not only refine the way LLMs access and incorporate external knowledge, but can also improve the quality, relevance, and efficiency of their outputs. By combining retrieval from large text corpora with language generation capabilities, these advanced RAG techniques help LLMs produce more factual, coherent, and context-appropriate responses, enhancing their performance across various natural language processing tasks.

### Take-aways
---
- Experiment with different retrieval techniques
- Leverage the `Cohere Command` and `Cohere-Embed-english` models available through Amazon Bedrock
- Explore options such as persistent storage of embeddings and document chunks
- Integrate with enterprise data stores