Retrieval Augmented Generation (RAG)
- Allows pretrained models to access external knowledge bases
- Uses the user's query to retrieve relevant documents
- Uses embeddings to find relevant information and integrates it into the prompt

RAG development steps
- Document Loader -> Splitting -> Storage + Retrieval

<image src="./images/rag_workflow.png" alt="RAG Workflow" width="600">

----

Document Loaders
- Classes designed to load and configure documents for system integration
- Document loaders for common file types: .pdf, .csv
- 3rd party loaders: S3, .ipynb, .wav
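
To make the idea of a loader concrete, here is a minimal standard-library sketch of what a CSV loader roughly does: read a source and turn each row into a document with text and metadata. The names `SimpleDocument` and `load_csv_rows` and the sample rows are hypothetical, invented for illustration; this is not how LangChain's loaders are implemented.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_csv_rows(text: str, source: str) -> list[SimpleDocument]:
    """Turn each CSV row into one document and record where it came from."""
    reader = csv.DictReader(io.StringIO(text))
    documents = []
    for row_number, row in enumerate(reader):
        # One "column: value" line per field in the row
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        documents.append(
            SimpleDocument(content, {"source": source, "row": row_number})
        )
    return documents


# Invented sample data, inlined so the sketch runs without a file on disk
sample = "name,role\nAda,engineer\nGrace,admiral\n"
loaded = load_csv_rows(sample, source="sample.csv")
print(loaded[0].page_content)
print(loaded[0].metadata)
```

Keeping the source and row number in the metadata makes it possible to trace a retrieved chunk back to where it came from.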

In [None]:
! pip install pypdf
! pip install unstructured

In [None]:
# PDF document loader
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("path/to/file/attention_is_all_you_need.pdf")

data = loader.load()
print(data[0])



# CSV document loader
from langchain_community.document_loaders import CSVLoader
loader = CSVLoader("fifa_countries_audience.csv")

data = loader.load()
print(data[0])



# HTML document loader
from langchain_community.document_loaders import UnstructuredHTMLLoader

loader = UnstructuredHTMLLoader("white_house_executive_order_nov_2023.html")
data = loader.load()
print(data[0].metadata)

Document Splitting
- Splits documents into smaller chunks that fit within an LLM's context window
- Chunk overlap repeats a little text between neighboring chunks so that context spanning a split point is not lost
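
The effect of chunk overlap can be sketched with a naive fixed-width window over characters. The function `chunk_with_overlap` is a hypothetical helper written for this note. It ignores separators, so it is not how `CharacterTextSplitter` or `RecursiveCharacterTextSplitter` choose their split points; it only shows how consecutive chunks come to share text.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Slice text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


chunks = chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, so text that straddles a boundary appears in both chunks.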

In [10]:
from langchain_text_splitters import CharacterTextSplitter

quote = """One machine can do the work of fifty ordinary humans.\nNo machine can do the work of one extraordinary human."""

chunk_size=24
chunk_overlap=3

ct_splitter = CharacterTextSplitter(
  separator=".",
  chunk_size=chunk_size,
  chunk_overlap=chunk_overlap
)

docs = ct_splitter.split_text(quote)
print(docs)
print([len(doc) for doc in docs])

Created a chunk of size 52, which is longer than the specified 24


['One machine can do the work of fifty ordinary humans', 'No machine can do the work of one extraordinary human']
[52, 53]


In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# It works better with a larger document, but this is just an example.
rc_splitter = RecursiveCharacterTextSplitter(
  separators=["\n\n", "\n", " ", ""],
  chunk_size=chunk_size,
  chunk_overlap=chunk_overlap
)

docs = rc_splitter.split_text(quote)
print(docs)

['One machine can do the', 'work of fifty ordinary', 'humans.', 'No machine can do the', 'work of one', 'extraordinary human.']


Storage + Retrieval
- Vector databases are used to store and retrieve document chunks
  - Text documents are embedded as vectors that capture their semantic meaning
  - The user query is embedded, the most similar documents are retrieved from the database, and they are inserted into the prompt
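
The retrieval step above can be sketched with plain cosine similarity. The three-dimensional vectors below are hand-written toy values, not real embeddings, and `cosine_similarity` and `top_k` are hypothetical helpers. A real vector database would use an embedding model and an index instead of a full sort.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector: list[float], store: list[tuple[str, list[float]]], k: int) -> list[str]:
    """Return the texts of the k stored vectors most similar to the query."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy "vector store": (text, vector) pairs with made-up vectors
store = [
    ("capitalization guideline", [0.9, 0.1, 0.0]),
    ("user naming guideline", [0.1, 0.9, 0.0]),
    ("unrelated note", [0.0, 0.0, 1.0]),
]
query_vector = [0.8, 0.2, 0.0]  # stands in for the embedded user query

retrieved = top_k(query_vector, store, k=2)
prompt = "Guidelines:\n" + "\n".join(retrieved) + "\n\nCopy:\n<user copy here>"
print(prompt)
```

Only the two most similar texts reach the prompt; the unrelated note scores zero and is left out.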

In [None]:
from langchain_core.documents import Document

docs = [
  Document(
    page_content="In all marketing copy, TechStack should always be written with the T and S capitalized. Incorrect: techstack, Techstack, etc. ",
    metadata={"guideline": "brand-capitalization"}
  ),
  Document(
    page_content="Our users should be referred to as techies in both internal and external communications.",
    metadata={"guideline": "referring-to-users"}
  )
]

Setting up Chroma Vector Database

In [None]:
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# Assumes openai_api_key was defined earlier (for example, read from an environment variable)
embedding_function = OpenAIEmbeddings(api_key=openai_api_key, model='text-embedding-3-small')

vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=embedding_function,
    persist_directory="path/to/vectorstore_directory"  # Specify your directory here
)

retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 2}  # Number of documents to retrieve
)


# --- Building a prompt template ---
from langchain_core.prompts import ChatPromptTemplate

message = """
Review and fix the following TechStack marketing copy with the following guidelines in consideration:

Guidelines:
{guidelines}

Copy:
{copy}

Fixed Copy:
"""

# Each message is a (role, template) tuple
prompt_template = ChatPromptTemplate.from_messages([("human", message)])


# --- Chaining it all together ---
from langchain_core.runnables import RunnablePassthrough

# Assumes llm is a chat model defined earlier (for example, ChatOpenAI from langchain_openai)
# The dict feeds the prompt: retrieved guidelines plus the unchanged input copy
rag_chain = (
    {"guidelines": retriever, "copy": RunnablePassthrough()}
    | prompt_template
    | llm
)

response = rag_chain.invoke("Here at techstack, our users are the best in the world!")
print(response.content)