# Retrieval

Retrieval Augmented Generation (RAG) is the process of including data in the LLM prompt that was not part of the language model's training data. The overall process looks like:

![flow](https://python.langchain.com/assets/images/data_connection-95ff2033a8faa5f3ba41376c0f6dd32a.jpg)

The components of the RAG process:
1. Document Loaders
2. Document Transformers
3. Text Embedding Models
4. Vector Stores
5. Retrievers
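
To make the five stages concrete, here is a toy, pure-Python sketch of the whole flow. It does not use LangChain: every function name is hypothetical, the "embedding" is just a bag-of-words count, and the "vector store" is a plain list. It only illustrates how the stages hand data to one another.

```python
import math
import re
from collections import Counter

def load(texts):
    """Loader stand-in: wrap raw strings as documents with metadata."""
    return [{"page_content": t, "metadata": {"source": f"doc-{i}"}}
            for i, t in enumerate(texts)]

def split(docs, chunk_size=60):
    """Transformer stand-in: cut each document into fixed-size chunks."""
    chunks = []
    for doc in docs:
        text = doc["page_content"]
        for start in range(0, len(text), chunk_size):
            chunks.append({"page_content": text[start:start + chunk_size],
                           "metadata": doc["metadata"]})
    return chunks

def embed(text):
    """Embedding stand-in: a bag-of-words count vector (not a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def build_store(chunks):
    """Vector store stand-in: a list of (vector, chunk) pairs."""
    return [(embed(c["page_content"]), c) for c in chunks]

def retrieve(store, query, k=1):
    """Retriever stand-in: return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda pair: cosine(q, pair[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]

# Made-up sample texts for illustration.
docs = load([
    "The sherpa jacket is warm and has a full zip.",
    "The tractor manual covers oil changes.",
])
store = build_store(split(docs))
question = "How do I change the oil?"
top = retrieve(store, question)

# The retrieved chunk is placed in the prompt that would go to the LLM.
prompt = f"Answer using this context:\n{top[0]['page_content']}\n\nQuestion: {question}"
```

The last line is the "augmented" part of RAG: the retrieved text is inserted into the prompt alongside the question.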

## Document Loaders

These load documents from many different sources. The simplest loader is `TextLoader`, which loads an entire file as a single document. The others below cover the formats you will need most often:

In [None]:
from langchain.document_loaders import (
  TextLoader, 
  DirectoryLoader, 
  UnstructuredHTMLLoader, 
  JSONLoader,
  UnstructuredMarkdownLoader,
  PyPDFLoader,
  AsyncHtmlLoader,
  WebBaseLoader
)
from langchain.document_loaders.csv_loader import CSVLoader

In [None]:
TextLoader("./data/data.txt").load()
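
As a rough mental model of "an entire file as a single document", the sketch below reads a file and returns a one-element list holding the text and a `source` entry in the metadata. `SimpleDocument` and `load_text_file` are hypothetical names used only for illustration; this is not LangChain's implementation.

```python
import os
import tempfile
from dataclasses import dataclass, field

@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def load_text_file(path, encoding="utf-8"):
    """Read the whole file and return it as a one-element list of documents."""
    with open(path, encoding=encoding) as f:
        return [SimpleDocument(page_content=f.read(), metadata={"source": path})]

# Usage with a throwaway file so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "data.txt")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("first line\nsecond line\n")
    loaded = load_text_file(sample_path)
```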

In [None]:
CSVLoader("./data/data.csv").load()

In [None]:
DirectoryLoader("./data/", glob="*.txt", loader_cls=TextLoader).load() # you can pick the loader class

In [None]:
UnstructuredHTMLLoader("./data/data.html").load() # will strip markup and load just the text

In [None]:
JSONLoader(file_path="./data/data.json", jq_schema=".data[].name").load() # returns text not json
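
The `jq_schema` expression `.data[].name` walks the `data` array and pulls out each `name`. A plain-Python equivalent, using a made-up sample payload, might look like the sketch below. The `seq_num` metadata key is an assumption about how each extracted value could be labeled.

```python
import json

# Made-up sample payload with the shape the jq expression expects.
raw = '{"data": [{"name": "apple", "kind": "fruit"}, {"name": "carrot", "kind": "vegetable"}]}'
payload = json.loads(raw)

# Equivalent of the jq expression .data[].name:
# iterate over the "data" array and take the "name" of each item.
names = [item["name"] for item in payload["data"]]

# One document-like record per extracted value; the content is plain text, not JSON.
records = [
    {"page_content": str(name), "metadata": {"seq_num": i}}
    for i, name in enumerate(names, start=1)
]
```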

In [None]:
UnstructuredMarkdownLoader("./data/data.md").load()

In [None]:
PyPDFLoader("./data/data.pdf").load()

In [None]:
PyPDFLoader("./data/data_long.pdf").load_and_split() # uses RecursiveCharacterTextSplitter to split the document

In [None]:
url = "https://shop.deere.com/us/product/Sherpa-Full-Zip-Jacket/p/SCUALC0593"
loader = AsyncHtmlLoader([url])
loader.load()

In [None]:
loader = WebBaseLoader(url) # fetches the page and parses the HTML into text with BeautifulSoup
docs = loader.load() # there is also a loader.aload() for asynchronous loading
print(docs[0].metadata)
print(docs[0])

## Document Transformers

After loading your documents, you will often want to transform them. The most common transformation is splitting the document into smaller chunks. Here are some common transformations:

1. Text splitting
2. Content Transformation
3. Extract Metadata

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
from langchain.document_transformers.openai_functions import create_metadata_tagger
from langchain.document_transformers import Html2TextTransformer, BeautifulSoupTransformer
from langchain.schema import Document
from langchain.chat_models import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain.chains.llm import LLMChain
from langchain.prompts import PromptTemplate
from langchain.chains.combine_documents.stuff import StuffDocumentsChain
from langchain.chains import AnalyzeDocumentChain

### Text Splitting

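The most common text splitter is `RecursiveCharacterTextSplitter`, which splits large documents into smaller ones by trying a list of separators in order (by default paragraph breaks, then line breaks, then spaces) until the chunks are small enough: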

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
  chunk_size=100, # maximum number of characters in a chunk (default: 4000)
  chunk_overlap=20 # overlap between chunks to maintain context in each chunk (default: 200)
)

In [None]:
text_splitter.split_documents(TextLoader("./data/sotu.txt").load())[:4]

In [None]:
# can also be used directly on text
RecursiveCharacterTextSplitter(chunk_size=3, chunk_overlap=1).split_text("this is some text 1 2 3 4")
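
To see how `chunk_size` and `chunk_overlap` interact, here is a deliberately simplified sliding-window sketch. Unlike `RecursiveCharacterTextSplitter`, it ignores separators and cuts at fixed character offsets; `chunk_with_overlap` is a hypothetical helper, not a LangChain function.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

chunks = chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
# → ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the context carried over by the overlap.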

The simpler `CharacterTextSplitter` splits on a single separator instead of trying several:

In [None]:
CharacterTextSplitter(
  separator="#ENTRY",
  chunk_size=10, # entries longer than chunk_size are not split further, so each entry stays in its own document
  chunk_overlap=5
).split_documents(TextLoader("./data/data_split.txt").load())
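
The behavior above (entries stay whole even when they exceed `chunk_size`) follows from splitting only on the separator. A minimal sketch of that idea, assuming no overlap handling and using the hypothetical helper `split_on_separator`:

```python
def split_on_separator(text, separator, chunk_size):
    """Split on one separator, then merge neighbors while they fit in chunk_size."""
    pieces = [p.strip() for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current}{separator}{piece}" if current else piece
        if current and len(candidate) > chunk_size:
            # Merging would exceed the limit: close the current chunk.
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

# Made-up sample text for illustration.
sample = "#ENTRY first entry text #ENTRY second entry text #ENTRY third"

# With a small chunk_size, each entry becomes its own chunk,
# even though the first two are longer than 10 characters.
entries = split_on_separator(sample, "#ENTRY", chunk_size=10)

# With a large chunk_size, the entries are merged back into one chunk.
merged = split_on_separator(sample, "#ENTRY", chunk_size=1000)
```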

### Content Transformation

#### HTML to Text

In [None]:
url = "https://shop.deere.com/us/product/Sherpa-Full-Zip-Jacket/p/SCUALC0593"
loader = AsyncHtmlLoader(url)
docs = loader.load()
Html2TextTransformer().transform_documents(docs)

In [None]:
loader = AsyncHtmlLoader(url)
docs = loader.load()
BeautifulSoupTransformer().transform_documents(
  docs,
  tags_to_extract=["main"]
)
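
To illustrate what restricting extraction to a tag such as `main` means, the sketch below uses only the standard library's `html.parser` to keep text found inside the chosen tags. `TagTextExtractor` is a hypothetical class and the HTML is a made-up sample; `BeautifulSoupTransformer` itself relies on BeautifulSoup and may clean the text differently.

```python
from html.parser import HTMLParser

class TagTextExtractor(HTMLParser):
    """Collect text that appears inside any of the requested tags."""

    def __init__(self, tags_to_extract):
        super().__init__()
        self.tags_to_extract = set(tags_to_extract)
        self.depth = 0   # how many requested tags are currently open
        self.parts = []  # text fragments collected so far

    def handle_starttag(self, tag, attrs):
        if tag in self.tags_to_extract:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.tags_to_extract and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.parts.append(data.strip())

html = (
    "<html><body><nav>Menu</nav>"
    "<main><h1>Jacket</h1><p>Warm<br>and soft.</p></main>"
    "<footer>Legal</footer></body></html>"
)
parser = TagTextExtractor(["main"])
parser.feed(html)
parser.close()
main_text = " ".join(parser.parts)
# → 'Jacket Warm and soft.'
```

The navigation and footer text are dropped because they sit outside `<main>`.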

#### Summarize

There are a few ways to summarize documents:

1. `stuff`: All the documents will be put together and summarized by the LLM.
2. `map_reduce`: Each document will be summarized by the LLM, and then the LLM will summarize the summaries.
3. `refine`: Will iterate over each document refining the summary until there are no more documents.
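
The three strategies differ mainly in how many LLM calls they make and what each call sees. The sketch below mimics them with a fake LLM that records its prompts. All function names are hypothetical, and the real chains use their own prompts and may add steps (for example, collapsing intermediate summaries) that are not modeled here.

```python
def make_recording_llm():
    """Return a fake LLM plus the list of prompts it has received."""
    calls = []

    def llm(prompt):
        calls.append(prompt)
        return f"<summary of {len(prompt)} chars>"

    return llm, calls

def summarize_stuff(docs, llm):
    """One call over all documents joined together."""
    return llm("\n\n".join(docs))

def summarize_map_reduce(docs, llm):
    """One call per document, then one more call over the partial summaries."""
    partials = [llm(doc) for doc in docs]
    return llm("\n\n".join(partials))

def summarize_refine(docs, llm):
    """Summarize the first document, then fold each later document into the summary."""
    if not docs:
        raise ValueError("refine needs at least one document")
    summary = llm(docs[0])
    for doc in docs[1:]:
        summary = llm(f"{summary}\n\n{doc}")
    return summary

docs = ["doc one text", "doc two text", "doc three text"]
call_counts = {}
for name, strategy in [
    ("stuff", summarize_stuff),
    ("map_reduce", summarize_map_reduce),
    ("refine", summarize_refine),
]:
    llm, calls = make_recording_llm()
    strategy(docs, llm)
    call_counts[name] = len(calls)
# → {'stuff': 1, 'map_reduce': 4, 'refine': 3}
```

`stuff` is cheapest but must fit everything into one prompt; `map_reduce` calls can run independently per document; `refine` is sequential because each step depends on the previous summary.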

In [None]:
sherpa = "https://shop.deere.com/us/product/Sherpa-Full-Zip-Jacket/p/SCUALC0593"
puffer = "https://shop.deere.com/us/product/Puffer-Jacket/p/SCUFLC0082"
loader = AsyncHtmlLoader([sherpa, puffer])
docs = loader.load()
transformed_docs = BeautifulSoupTransformer().transform_documents(
  docs,
  tags_to_extract=["main"]
)
llm = ChatOpenAI(temperature=0)

In [None]:
chain = load_summarize_chain(llm, chain_type="stuff")
chain.run(transformed_docs)

In [None]:
chain = load_summarize_chain(llm, chain_type="map_reduce")
chain.run(transformed_docs)

In [None]:
chain = load_summarize_chain(llm, chain_type="refine")
chain.run(transformed_docs)

If you need more control over the summary prompt, you can provide your own like this:

In [None]:
summary_template = """
Write a concise summary of the following, but DO NOT explicitly say it is a summary:
"{text}"

SUMMARY:
"""
prompt = PromptTemplate.from_template(summary_template)
stuff_chain = load_summarize_chain(llm, "stuff", prompt=prompt)
stuff_chain.run(transformed_docs)

Or build the same chain explicitly from an `LLMChain` and a `StuffDocumentsChain`:

In [None]:
summary_template = """
Write a concise summary of the following, but DO NOT explicitly say it is a summary:
"{text}"

SUMMARY:
"""
prompt = PromptTemplate.from_template(summary_template)
summary_chain = LLMChain(llm=llm, prompt=prompt)
stuff_chain = StuffDocumentsChain(llm_chain=summary_chain, document_variable_name="text")
stuff_chain.run(transformed_docs)
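
Conceptually, the `{text}` placeholder named by `document_variable_name` is filled with the contents of all documents joined together. A minimal sketch of that substitution, assuming a blank-line separator and using the hypothetical helper `build_stuff_prompt`:

```python
summary_template = """
Write a concise summary of the following, but DO NOT explicitly say it is a summary:
"{text}"

SUMMARY:
"""

def build_stuff_prompt(page_contents, template, separator="\n\n"):
    """Join all page contents and substitute them for {text} in the template."""
    return template.format(text=separator.join(page_contents))

# Made-up page contents for illustration.
filled = build_stuff_prompt(["First page.", "Second page."], summary_template)
print(filled)
```

The resulting string is what would be sent to the LLM in a single call.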

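There is also a special chain, `AnalyzeDocumentChain`, that takes arbitrary text rather than documents: it splits the text with the given splitter and passes the pieces to the summarize chain: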

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=100,chunk_overlap=20)
chain = load_summarize_chain(llm, chain_type="stuff")
analyze_chain = AnalyzeDocumentChain(combine_docs_chain=chain, text_splitter=text_splitter)
analyze_chain.run(transformed_docs[0].page_content)

### Extract Metadata

Searching for relevant documents can be greatly improved when the proper metadata is attached to the text. There are a couple of ways to add metadata with LangChain.

The first way is to use OpenAI function calling, which asks the model to extract the metadata for you according to a schema:

In [None]:
# you can define the schema with a dict or a pydantic model
schema = {
  "properties": {
      "category": {
        "type": "string",
        "description": "the high-level category of the food being discussed, like 'fruit', 'vegetable', or 'snack'"
      },
      "tone": {
        "type": "string",
        "enum": ["positive", "negative"],
        "description": "the tone of the text like 'positive' or 'negative'"
      }
  },
  "required": ["category", "tone"],
}
llm = ChatOpenAI()
document_transformer = create_metadata_tagger(llm=llm, metadata_schema=schema) # can provide prompt if you want

In [None]:
org_documents = [
  Document(page_content="The apples I ate were delicious."),
  Document(page_content="The pizza tasted like crap."),
]
enhanced_documents = document_transformer.transform_documents(org_documents)

In [None]:
enhanced_documents
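
Because the metadata is produced by a model, it can be worth checking it against the schema before relying on it for search. The sketch below is a hand-rolled check of the `required` and `enum` constraints from the schema above; `validate_metadata` is a hypothetical helper and covers only the parts of JSON Schema used in this example.

```python
tagging_schema = {
    "properties": {
        "category": {"type": "string"},
        "tone": {"type": "string", "enum": ["positive", "negative"]},
    },
    "required": ["category", "tone"],
}

def validate_metadata(metadata, schema):
    """Return a list of problems found in metadata; an empty list means it passes."""
    errors = []
    for key in schema.get("required", []):
        if key not in metadata:
            errors.append(f"missing required field: {key}")
    for key, value in metadata.items():
        spec = schema["properties"].get(key)
        if spec is None:
            continue  # fields outside the schema are ignored in this sketch
        if spec.get("type") == "string" and not isinstance(value, str):
            errors.append(f"{key} should be a string")
        allowed = spec.get("enum")
        if allowed is not None and value not in allowed:
            errors.append(f"{key} must be one of {allowed}")
    return errors

ok = validate_metadata({"category": "fruit", "tone": "positive"}, tagging_schema)
bad = validate_metadata({"tone": "angry"}, tagging_schema)
```

Here `ok` is empty, while `bad` reports the missing `category` and the out-of-range `tone`.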