# Retrieval

Retrieval Augmented Generation (RAG) is the process of including data in the prompt to the LLM that was not part of the language model's training data. With a vector store, the overall process looks like this:

![flow](https://python.langchain.com/assets/images/data_connection-95ff2033a8faa5f3ba41376c0f6dd32a.jpg)

The components of the RAG process:
1. Document Loaders
2. Document Transformers
3. Text Embedding Models
4. Vector Stores
5. Retrievers

## Document Loaders

Document loaders load documents from many different sources. The simplest type of loader is `TextLoader`, which loads an entire file as a single document. The rest are loaders you will commonly need:

In [4]:
from langchain.document_loaders import (
  TextLoader, 
  DirectoryLoader, 
  UnstructuredHTMLLoader, 
  JSONLoader,
  UnstructuredMarkdownLoader,
  PyPDFLoader,
  AsyncHtmlLoader,
  WebBaseLoader
)
from langchain.document_loaders.csv_loader import CSVLoader

In [None]:
TextLoader("./data/data.txt").load()

In [None]:
CSVLoader("./data/data.csv").load()

In [None]:
DirectoryLoader("./data/", glob="*.txt", loader_cls=TextLoader).load() # you can pick the loader class

In [None]:
UnstructuredHTMLLoader("./data/data.html").load() # will strip markup and load just the text

In [None]:
JSONLoader(file_path="./data/data.json", jq_schema=".data[].name").load() # returns text not json

In [None]:
UnstructuredMarkdownLoader("./data/data.md").load()

In [None]:
PyPDFLoader("./data/data.pdf").load()

In [None]:
PyPDFLoader("./data/data_long.pdf").load_and_split() # uses RecursiveCharacterTextSplitter to split the document

In [None]:
url = "https://shop.deere.com/us/product/Sherpa-Full-Zip-Jacket/p/SCUALC0593"
loader = AsyncHtmlLoader([url])
loader.load()

In [None]:
loader = WebBaseLoader(url) # roughly a combination of AsyncHtmlLoader and Html2TextTransformer
docs = loader.load() # there is also a loader.aload() for asynchronous loading
print(docs[0].metadata)
print(docs[0])

## Document Transformers

After loading your documents, you will often want to transform them. The most common transformation is splitting the document into smaller chunks. Here are some common transformations:

1. Text splitting
2. Content Transformation
3. Extract Metadata

In [6]:
from typing import Any, Sequence
from doctran import Doctran
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
from langchain.document_transformers.openai_functions import create_metadata_tagger
from langchain.schema import Document
from langchain.llms.openai import OpenAI
from langchain.chat_models import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain.chains.llm import LLMChain
from langchain.prompts import PromptTemplate
from langchain.chains.combine_documents.stuff import StuffDocumentsChain
from langchain.chains import AnalyzeDocumentChain
from langchain_core.documents import BaseDocumentTransformer
from langchain.document_transformers import (
  Html2TextTransformer, 
  BeautifulSoupTransformer, 
  LongContextReorder
)

### Text Splitting

The most common text splitter is `RecursiveCharacterTextSplitter` which splits larger documents into smaller documents:

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
  chunk_size=100, # maximum number of characters in a chunk (default: 4000)
  chunk_overlap=20 # overlap between chunks to maintain context in each chunk (default: 200)
)

In [None]:
text_splitter.split_documents(TextLoader("./data/sotu.txt").load())[:4]

In [None]:
# can also be used directly on text
RecursiveCharacterTextSplitter(chunk_size=3, chunk_overlap=1).split_text("this is some text 1 2 3 4")
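
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries a list of separators so chunks break at natural boundaries). The `sliding_chunks` helper is a hypothetical name used only for illustration.

```python
from typing import List

def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, which is the overlap that helps a chunk keep some of its surrounding context.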

`CharacterTextSplitter` is a simpler splitter that splits the text on a single separator instead of trying several:

In [None]:
CharacterTextSplitter(
  separator="#ENTRY",
  chunk_size=10, # you will notice that the entries are still in their own document 
  chunk_overlap=5
).split_documents(TextLoader("./data/data_split.txt").load())

### Content Transformation

#### HTML to Text

In [None]:
url = "https://shop.deere.com/us/product/Sherpa-Full-Zip-Jacket/p/SCUALC0593"
loader = AsyncHtmlLoader(url)
docs = loader.load()
Html2TextTransformer().transform_documents(docs)

In [None]:
loader = AsyncHtmlLoader(url)
docs = loader.load()
BeautifulSoupTransformer().transform_documents(
  docs,
  tags_to_extract=["main"]
)

#### Summarize

There are a few ways to summarize documents:

1. `stuff`: All the documents will be put together and summarized by the LLM.
2. `map_reduce`: Each document will be summarized by the LLM, and then the LLM will summarize the summaries.
3. `refine`: Will iterate over each document refining the summary until there are no more documents.
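
The three strategies above can be sketched in plain Python to show how they differ in control flow and in the number of LLM calls. This is only an illustration under stated assumptions: `fake_llm` is a stand-in that keeps the first sentence of its input, and the function names are hypothetical, not LangChain APIs.

```python
from typing import Callable, List

Summarizer = Callable[[str], str]

def stuff(docs: List[str], summarize: Summarizer) -> str:
    # one call: all documents are concatenated into a single prompt
    return summarize("\n\n".join(docs))

def map_reduce(docs: List[str], summarize: Summarizer) -> str:
    # map: summarize each document on its own; reduce: summarize the summaries
    partial = [summarize(d) for d in docs]
    return summarize("\n\n".join(partial))

def refine(docs: List[str], summarize: Summarizer) -> str:
    # carry a running summary forward and update it with each new document
    # (assumes docs is not empty)
    summary = summarize(docs[0])
    for d in docs[1:]:
        summary = summarize(summary + "\n\n" + d)
    return summary

docs = ["Apples are red. They grow on trees.", "Oranges are orange. They are citrus."]

def count_calls(strategy: Callable[[List[str], Summarizer], str]) -> int:
    """Run a strategy with a fake LLM and report how many times it was called."""
    calls = []
    def fake_llm(text: str) -> str:
        calls.append(text)
        return text.split(".")[0].strip() + "."  # keep only the first sentence
    strategy(docs, fake_llm)
    return len(calls)

for strategy in (stuff, map_reduce, refine):
    print(strategy.__name__, count_calls(strategy))
```

With two documents, `stuff` makes one call, `map_reduce` makes three, and `refine` makes two. The trade-off is that `stuff` needs everything to fit in one prompt, while the other two cost more calls but handle larger inputs.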

In [None]:
sherpa = "https://shop.deere.com/us/product/Sherpa-Full-Zip-Jacket/p/SCUALC0593"
puffer = "https://shop.deere.com/us/product/Puffer-Jacket/p/SCUFLC0082"
loader = AsyncHtmlLoader([sherpa, puffer])
docs = loader.load()
transformed_docs = BeautifulSoupTransformer().transform_documents(
  docs,
  tags_to_extract=["main"]
)
llm = ChatOpenAI(temperature=0)

In [None]:
chain = load_summarize_chain(llm, chain_type="stuff")
chain.run(transformed_docs)

In [None]:
chain = load_summarize_chain(llm, chain_type="map_reduce")
chain.run(transformed_docs)

In [None]:
chain = load_summarize_chain(llm, chain_type="refine")
chain.run(transformed_docs)

If you need more control over the summary prompt, you can provide your own like this:

In [None]:
summary_template = """
Write a concise summary of the following, but DO NOT explicitly say it is a summary:
"{text}"

SUMMARY:
"""
prompt = PromptTemplate.from_template(summary_template)
stuff_chain = load_summarize_chain(llm, "stuff", prompt=prompt)
stuff_chain.run(transformed_docs)

Or you can build the `stuff` chain yourself:

In [None]:
summary_template = """
Write a concise summary of the following, but DO NOT explicitly say it is a summary:
"{text}"

SUMMARY:
"""
prompt = PromptTemplate.from_template(summary_template)
summary_chain = LLMChain(llm=llm, prompt=prompt)
stuff_chain = StuffDocumentsChain(llm_chain=summary_chain, document_variable_name="text")
stuff_chain.run(transformed_docs)

There is also a special chain that can summarize some arbitrary text:

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=100,chunk_overlap=20)
chain = load_summarize_chain(llm, chain_type="stuff")
analyze_chain = AnalyzeDocumentChain(combine_docs_chain=chain, text_splitter=text_splitter)
analyze_chain.run(transformed_docs[0].page_content)

There is also a library called `doctran` (short for Document Transformations) that uses OpenAI to analyze documents and perform special tasks, one of which is summarization. Currently, LangChain does not offer a wrapper for this `doctran` task, so here is how to call it directly:

In [22]:
llm = OpenAI(max_tokens=1000)
output = llm("Provide details about George Washington's life.")
len(output)

2394

In [23]:
doctran = Doctran()
doc = doctran.parse(content=output)
doc_summary = doc.summarize(token_limit=500).execute()

In [24]:
summary = doc_summary.transformed_content
len(summary)

1004

We can also create our own custom document transformer that wraps `doctran`:

In [28]:
class SummarizeTransformer(BaseDocumentTransformer):
  def transform_documents(
    self, 
    documents: Sequence[Document], 
    token_limit: int = 200, 
    **kwargs: Any
  ) -> Sequence[Document]:
    doctran = Doctran()
    transformed_docs = []
    for doc in documents:
      doc_summary = doctran.parse(content=doc.page_content).summarize(token_limit=token_limit).execute()
      new_doc = Document(page_content=doc_summary.transformed_content)
      transformed_docs.append(new_doc)
    return transformed_docs

In [None]:
SummarizeTransformer().transform_documents([Document(page_content=output)])

### Reordering (Lost-in-the-Middle)

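If there is too much context, some of it can get lost during inference, especially the part in the middle. There are a couple of ways to deal with that. One is to return less context. The other is to reorder the context so that the most relevant documents sit at the beginning and the end, which is what `LongContextReorder` does: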

In [72]:
texts = [
  "Call of the Wild (COTW) is a great, vast hunting game.",
  "Way of the Hunter (WOTH) is a modern and good looking hunting game.",
  "Metallica is a great heavy metal band.",
  "VSCode is the best text editor for programming.",
  "God of War is a great single player game."
]

docs = [Document(page_content=x) for x in texts]

In [73]:
docs

[Document(page_content='Call of the Wild (COTW) is a great, vast hunting game.'),
 Document(page_content='Way of the Hunter (WOTH) is a modern and good looking hunting game.'),
 Document(page_content='Metallica is a great heavy metal band.'),
 Document(page_content='VSCode is the best text editor for programming.'),
 Document(page_content='God of War is a great single player game.')]

In [74]:
LongContextReorder().transform_documents(docs)

[Document(page_content='Call of the Wild (COTW) is a great, vast hunting game.'),
 Document(page_content='Metallica is a great heavy metal band.'),
 Document(page_content='God of War is a great single player game.'),
 Document(page_content='VSCode is the best text editor for programming.'),
 Document(page_content='Way of the Hunter (WOTH) is a modern and good looking hunting game.')]

### Extract Metadata

Searching for relevant documents can be greatly improved when the proper metadata is associated with the text itself. There are a couple of ways that you can add metadata with LangChain.

The first way is by using OpenAI function calling, which tries to extract the metadata for you:

In [32]:
# you can define the schema with a dict or a pydantic model
schema = {
  "properties": {
      "category": {
        "type": "string",
        "description": "the high-level category of the food being discussed like 'fruit' or 'vegetable' or snack"
      },
      "tone": {
        "type": "string",
        "enum": ["positive", "negative"],
        "description": "the tone of the text like 'positive' or 'negative'"
      }
  },
  "required": ["category", "tone"],
}
llm = ChatOpenAI()
document_transformer = create_metadata_tagger(llm=llm, metadata_schema=schema) # can provide prompt if you want

In [33]:
org_documents = [
  Document(page_content="The apples I ate were delicious."),
  Document(page_content="The brocolli was bad."),
]
enhanced_documents = document_transformer.transform_documents(org_documents)

In [34]:
enhanced_documents

[Document(page_content='The apples I ate were delicious.', metadata={'category': 'fruit', 'tone': 'positive'}),
 Document(page_content='The brocolli was bad.', metadata={'category': 'vegetable', 'tone': 'negative'})]
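
Once documents carry metadata like this, a search can be narrowed before any similarity ranking happens. Here is a small sketch of that idea in plain Python. Plain dicts stand in for `Document` objects, the sample texts are made up, and `filter_by_metadata` is a hypothetical helper, not a LangChain function.

```python
from typing import Any, Dict, List

# plain dicts stand in for Document objects here
tagged_docs: List[Dict[str, Any]] = [
    {"page_content": "The apples I ate were delicious.",
     "metadata": {"category": "fruit", "tone": "positive"}},
    {"page_content": "The carrots were bland.",
     "metadata": {"category": "vegetable", "tone": "negative"}},
    {"page_content": "The pears were mushy.",
     "metadata": {"category": "fruit", "tone": "negative"}},
]

def filter_by_metadata(docs: List[Dict[str, Any]], **required: Any) -> List[Dict[str, Any]]:
    """Keep only the documents whose metadata matches every required key/value pair."""
    return [
        d for d in docs
        if all(d["metadata"].get(key) == value for key, value in required.items())
    ]

negative_fruit = filter_by_metadata(tagged_docs, category="fruit", tone="negative")
print([d["page_content"] for d in negative_fruit])  # ['The pears were mushy.']
```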

## Vector Store of Embeddings

It is very common for unstructured text to be stored as embeddings in a vector database, since it makes saving, searching, and retrieving text much easier. This vector store then becomes the primary data source for RAG. It is not the only source of data, but it is a common one.

### Text Embedding Models

These models produce numerical embeddings of text. Embeddings make retrieval fast because similarity can be computed in a much simpler vector space, which is a compressed representation of the text. There are a number of text embedding models that can be used. Here are a few:

In [11]:
from langchain.embeddings import OpenAIEmbeddings, HuggingFaceBgeEmbeddings
from langchain.vectorstores.chroma import Chroma

In [None]:
openai_model = OpenAIEmbeddings()
hug_model = HuggingFaceBgeEmbeddings(
  model_name="BAAI/bge-small-en",
  model_kwargs={"device": "cpu"},
  encode_kwargs={"normalize_embeddings": True}
)

In [13]:
embeddings = openai_model.embed_documents([
  "this is a test",
  "another test"
])
print(len(embeddings), len(embeddings[0]))

2 1536


In [20]:
embedding = openai_model.embed_query("this is a test") # single string
len(embedding)

1536

In [19]:
embedding = hug_model.embed_query("this is a test")
len(embedding)

384
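
Retrieval then comes down to comparing vectors like these. A common measure is cosine similarity, sketched below in plain Python. The three-dimensional vectors are made-up stand-ins for real embeddings, which have hundreds of dimensions as the outputs above show.

```python
import math
from typing import List

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 is the same direction, 0.0 is unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# hypothetical 3-dimensional "embeddings"
query = [1.0, 0.0, 1.0]
doc_close = [0.9, 0.1, 0.8]  # points in nearly the same direction as the query
doc_far = [0.0, 1.0, 0.0]    # orthogonal to the query

print(cosine_similarity(query, doc_close))
print(cosine_similarity(query, doc_far))
```

The first score is close to 1 and the second is 0, so a similarity search would return `doc_close` for this query.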

### Vector Stores

There are a number of possible vector stores, but a few common local ones are `Chroma` and `FAISS`. A common hosted option is `Pinecone`. All of them implement the `VectorStore` interface. Let's walk through an example using `Chroma`:

In [40]:
docs = [Document(page_content="Michael Jordan is the best basketball player of all time.")]
chroma = Chroma(embedding_function=openai_model)

In [41]:
record_ids = chroma.add_documents(docs)

In [42]:
chroma.get(record_ids) # has embedding but not returned by default

{'ids': ['816fd5d2-9abd-11ee-aa64-825fc74ed04f'],
 'embeddings': None,
 'metadatas': [None],
 'documents': ['Michael Jordan is the best basketball player of all time.'],
 'uris': None,
 'data': None}

In [43]:
chroma.similarity_search("jordan", k=1)

[Document(page_content='Michael Jordan is the best basketball player of all time.')]

In [44]:
chroma.similarity_search_with_relevance_scores("jordan", k=1)

[(Document(page_content='Michael Jordan is the best basketball player of all time.'),
  0.7929736264432757)]

In [48]:
# can update one or more documents
# for some reason requires metadata in document
chroma.update_document(
  record_ids[0], 
  Document(page_content="Michael Jordan is one of the best basketball player of all time.", metadata={"type": "quote"})
)
chroma.get(record_ids[0])

{'ids': ['816fd5d2-9abd-11ee-aa64-825fc74ed04f'],
 'embeddings': None,
 'metadatas': [{'type': 'quote'}],
 'documents': ['Michael Jordan is one of the best basketball player of all time.'],
 'uris': None,
 'data': None}

In [52]:
# maximal marginal relevance (MMR) optimizes similarity and diversity
chroma.max_marginal_relevance_search("jordan", k=1)

Number of requested results 20 is greater than number of elements in index 1, updating n_results = 1


[Document(page_content='Michael Jordan is one of the best basketball player of all time.', metadata={'type': 'quote'})]

In [None]:
# it is also possible to create the VectorStore client and add documents at the same time
Chroma.from_texts(["this is a test"], openai_model) # from_documents also available

In [54]:
chroma.delete_collection()
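
With a single document in the store, the MMR search in this section cannot show its effect. The sketch below shows the idea on made-up two-dimensional vectors: each step greedily picks the candidate that is relevant to the query but not too similar to what was already picked. The `mmr` function and its `lambda_mult` parameter are written for illustration under those assumptions and are not Chroma's implementation.

```python
import math
from typing import List

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr(query: List[float], candidates: List[List[float]], k: int,
        lambda_mult: float = 0.5) -> List[int]:
    """Return the indices of k candidates, trading relevance against redundancy.

    lambda_mult=1.0 ranks purely by relevance; lower values favor diversity.
    """
    selected: List[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i: int) -> float:
            relevance = cosine(query, candidates[i])
            # similarity to the closest already-selected candidate (0 if none yet)
            redundancy = max((cosine(candidates[i], candidates[j]) for j in selected),
                             default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

query = [1.0, 0.3]
candidates = [
    [1.0, 0.0],    # 0: relevant
    [1.0, 0.05],   # 1: most relevant, near-duplicate of 0
    [0.5, 0.85],   # 2: less relevant, but different
]
print(mmr(query, candidates, k=2))                   # [1, 2]
print(mmr(query, candidates, k=2, lambda_mult=1.0))  # [1, 0]
```

With the default weighting the second pick is the different document rather than the near-duplicate. With `lambda_mult=1.0` the result is the same as a plain similarity search.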

## Retrievers

Retrievers are able to query and retrieve documents. Often a retriever uses a VectorStore, but not always. Moreover, retrievers are `Runnable`, which means they can be used as part of a `Chain`.

Once we have the data loaded, transformed, and saved into a VectorStore, we can retrieve relevant documents:

In [10]:
import logging
from dotenv import load_dotenv
from langchain_core.runnables import RunnablePassthrough
from langchain.prompts import ChatPromptTemplate
from langchain.retrievers.document_compressors import (
  LLMChainExtractor, 
  LLMChainFilter, 
  EmbeddingsFilter,
  CohereRerank
)
from langchain.retrievers import (
  BM25Retriever, 
  EnsembleRetriever, 
  MultiQueryRetriever,
  ContextualCompressionRetriever
)

In [None]:
load_dotenv()

In [14]:
# load
docs = TextLoader("./data/sotu.txt").load()
# transform
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=20)
docs = splitter.split_documents(docs)
# embed and save
chroma = Chroma.from_documents(docs, openai_model)

In [84]:
# retrieve directly
retriever = chroma.as_retriever(search_kwargs = {"k": 2}) # top 2 sources
matching_docs = retriever.invoke("What did the president say about the economy?")

We can retrieve the relevant documents as part of a `chain` injecting the context into the prompt:

In [None]:
# retrieve as part of a chain
prompt = ChatPromptTemplate.from_template("""
Answer the question based only on the following context:

{context}                                         

Question: {question}                                          
""")

def format_docs(docs: Sequence[Document]) -> str:
  return "\n\n".join([x.page_content for x in docs])

llm = ChatOpenAI()
chain = (
  {"context": retriever | format_docs, "question": RunnablePassthrough()}
  | prompt | llm
)
chain.invoke("What did the president say about the economy?")

### Multi-query Retriever

Instead of relying directly on the original query to retrieve the results, it can be beneficial to ask slightly different queries. The `MultiQueryRetriever` does this by using the LLM to generate variations of the query, running a retrieval call for each one, and joining all the results together.

In [96]:
llm = ChatOpenAI()
chroma = Chroma(embedding_function=openai_model).as_retriever()
retriever = MultiQueryRetriever.from_llm(chroma, llm)

In [None]:
logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)
retriever.get_relevant_documents("What did the president say about the economy?")

Original Query: "What did the president say about the economy?"

Variations:
1. Can you provide any information on the president's statements regarding the economy?
2. What are the president's comments about the current state of the economy?
3. I'm interested in knowing the president's views on the economy. Could you share any details?
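
The joining step can be sketched in plain Python: run each variation through a retriever and keep the de-duplicated union of the results. In this sketch the variations are hard-coded instead of generated by an LLM, and `keyword_retrieve` is a toy keyword matcher over a made-up corpus, so every name here is hypothetical.

```python
from typing import Callable, List

def multi_query_retrieve(queries: List[str],
                         retrieve: Callable[[str], List[str]]) -> List[str]:
    """Run every query variation and return the unique union in first-seen order."""
    seen = set()
    merged = []
    for q in queries:
        for doc in retrieve(q):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

corpus = ["The economy added jobs.", "Inflation is falling.", "We will fund the police."]

def keyword_retrieve(query: str) -> List[str]:
    # toy retriever: a document matches if it shares at least one word with the query
    words = set(query.lower().split())
    return [d for d in corpus if words & set(d.lower().rstrip(".").split())]

variations = ["economy jobs", "inflation economy"]
merged = multi_query_retrieve(variations, keyword_retrieve)
print(merged)  # ['The economy added jobs.', 'Inflation is falling.']
```

The second variation finds a document the first one missed, and the document both of them return appears only once.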

### Contextual Compression

Sometimes the retrieved documents from a vector store may contain unneeded information that is unrelated to the query, or the document itself may just not be relevant to the query. Compression refers to the process of keeping only the relevant documents and the relevant parts of each document.

First, let's look at a common compressor, `LLMChainExtractor`, which extracts the important parts of each document:

In [7]:
content = [
  "We’ll also cut costs and keep the economy going strong by giving workers a fair shot, provide more training and apprenticeships, hire them based on their skills not degrees.",
   "Let’s pass the Paycheck Fairness Act and paid leave."
]
docs = [Document(page_content=x) for x in content]

In [None]:
compressor = LLMChainExtractor.from_llm(llm)
compressed_docs = compressor.compress_documents(docs, "fairness act")

In [108]:
compressed_docs

[Document(page_content='giving workers a fair shot'),
 Document(page_content='Let’s pass the Paycheck Fairness Act')]

A slightly different version of the compressor is the `LLMChainFilter`, which will remove the documents that are not important to the query:

In [None]:
compressor = LLMChainFilter.from_llm(llm)
compressed_docs = compressor.compress_documents(docs, "fairness act")
len(compressed_docs)

Thus far we have seen compressors that rely upon the LLM to perform the work. There is also the `EmbeddingsFilter` which will use the embeddings to compare the documents and remove those that fall under a threshold. This approach is much faster than using the LLM:

In [115]:
compressor = EmbeddingsFilter(embeddings=openai_model, similarity_threshold=0.80)
compressed_docs = compressor.compress_documents(docs, "fairness act")
len(compressed_docs)

1

These methods work, but the best way to compress is usually to rerank the documents using a dedicated reranking model, like the one provided by `Cohere` or an open source model. It is faster than using an LLM, and it maintains full semantic context, unlike embeddings, which lose fidelity during the embedding process. Here is how to do it with Cohere:

In [None]:
rerank = CohereRerank()
rerank.compress_documents(docs, "fairness act")

Now, let's combine the compressor with a retriever to see how it works. This is done using the `ContextualCompressionRetriever`:

In [None]:
chroma = Chroma(embedding_function=openai_model).as_retriever()
retriever = ContextualCompressionRetriever(base_compressor=rerank, base_retriever=chroma)
retriever.get_relevant_documents("What did the president say about the economy?")

### Ensemble Retriever

At a high level, this retriever simply uses multiple retrievers, combines their results, reranks them using Reciprocal Rank Fusion, and returns a single list of documents. In practice, this approach is normally used with diverse types of retrievers. For example, one can use a sparse database like OpenSearch or any BM25 algorithm together with a dense database like the vector stores we have been using. This type of strategy is called a "hybrid search". We will use BM25 and Chroma as our retrievers:

In [22]:
docs = [
  "I like apples",
  "I like oranges",
  "Apples and oranges are fruits"
]
bm25 = BM25Retriever.from_texts(docs)
bm25.k = 2
chroma = Chroma.from_texts(docs, openai_model).as_retriever(search_kwargs = {"k": 2})

In [23]:
retriever = EnsembleRetriever(
  retrievers=[bm25, chroma],
  weights=[0.5, 0.5]
)

In [24]:
retriever.get_relevant_documents("apples")

[Document(page_content='I like apples'),
 Document(page_content='Apples and oranges are fruits')]
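
The fusion step can be sketched in a few lines. In weighted Reciprocal Rank Fusion, every document earns `weight / (rank + c)` from each ranked list it appears in, and the totals decide the final order. The rankings below are made up for illustration, `weighted_rrf` is a hypothetical helper, and `c = 60` is the constant commonly used for RRF. This is a sketch of the technique, not the `EnsembleRetriever` source.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

def weighted_rrf(rankings: Sequence[List[str]], weights: Sequence[float],
                 c: int = 60) -> List[str]:
    """Fuse several ranked lists into one using weighted Reciprocal Rank Fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += weight / (rank + c)  # higher ranks contribute more
    return sorted(scores, key=scores.get, reverse=True)

# hypothetical results from a sparse and a dense retriever
sparse = ["I like apples", "Apples and oranges are fruits"]
dense = ["Apples and oranges are fruits", "I like oranges"]

fused = weighted_rrf([sparse, dense], weights=[0.5, 0.5])
print(fused)
# ['Apples and oranges are fruits', 'I like apples', 'I like oranges']
```

A document that shows up in both lists collects a score from each, which is why it outranks documents that only one retriever returned.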