
# RAG with Metadata.

## Load Packages.

In [10]:
!pip --quiet install faiss-cpu langchain langchain_community langchain_mistralai
!pip --quiet install pypdf sentence_transformers

from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import ChatPromptTemplate
from langchain_mistralai.chat_models import ChatMistralAI
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

## Data Preprocessing.

In [11]:
# Load the .pdf documents. The loader numbers pages from 0, so shift the
# "page" metadata by one to get 1-based page numbers.
loader = PyPDFDirectoryLoader("/content/", glob="*.pdf")
pages = loader.load()
for page in pages:
    page.metadata["page"] += 1

# Choose the chunk size.
# [The embedding model truncates input beyond 256 word pieces, which is
# treated here as roughly 256 words. An English word has 4.7 characters on
# average, so a chunk should have at most about 1000 characters. Further, a
# page has around 700 words. Altogether, we end up with at least 3 chunks per
# page at a chunk_size of 1000.]
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    separators=["\n\n", "\n", ".", ",", " ", ""])

# Split each page into chunks and keep the page's metadata alongside them.
chunks = [{"chunk_content": text_splitter.split_text(page.page_content),
           "metadata": page.metadata} for page in pages]

# Create LangChain Documents. Each chunk gets its own copy of the metadata.
Documents = [Document(page_content=text, metadata=dict(chunk["metadata"]))
             for chunk in chunks for text in chunk["chunk_content"]]

# Check the metadata of the first and the last 15 chunks.
for doc in Documents[:15] + Documents[-15:]:
    print(doc.metadata)

# Choose the embedding model.
# [I tried to find the best embedding model via the MTEB leaderboard at
# huggingface.co:
# 1) nvidia/NV-Embed-v2 (not found on NVIDIA website),
# 2) BAAI/bge-en-icl (runs forever),
# ...
# 10) nvidia/NV-Embed-v1 (packages incompatible).
# Hence, I ended up with the following standard model. Its drawback is that
# it truncates input beyond 256 word pieces, roughly 1000 characters.]
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Add the Chunks and the Metadata to the Vector Database.
vectorstore = FAISS.from_documents(documents=Documents,
                                   embedding=embedding_model)

{'source': '/content/The_Great_Gatsby_Part_1.pdf', 'page': 1}
{'source': '/content/The_Great_Gatsby_Part_1.pdf', 'page': 1}
{'source': '/content/The_Great_Gatsby_Part_1.pdf', 'page': 1}
{'source': '/content/The_Great_Gatsby_Part_1.pdf', 'page': 1}
{'source': '/content/The_Great_Gatsby_Part_1.pdf', 'page': 1}
{'source': '/content/The_Great_Gatsby_Part_1.pdf', 'page': 2}
{'source': '/content/The_Great_Gatsby_Part_1.pdf', 'page': 2}
{'source': '/content/The_Great_Gatsby_Part_1.pdf', 'page': 2}
{'source': '/content/The_Great_Gatsby_Part_1.pdf', 'page': 2}
{'source': '/content/The_Great_Gatsby_Part_1.pdf', 'page': 2}
{'source': '/content/The_Great_Gatsby_Part_1.pdf', 'page': 3}
{'source': '/content/The_Great_Gatsby_Part_1.pdf', 'page': 3}
{'source': '/content/The_Great_Gatsby_Part_1.pdf', 'page': 3}
{'source': '/content/The_Great_Gatsby_Part_1.pdf', 'page': 3}
{'source': '/content/The_Great_Gatsby_Part_1.pdf', 'page': 4}
{'source': '/content/The_Great_Gatsby_Part_2.pdf', 'page': 46}
{'sourc
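
The chunk-size reasoning in the cell above can be checked with a few lines of arithmetic. This is a rough sketch that uses only the figures from the comment (256 words per chunk, 4.7 characters per word, 700 words per page). The constant names are illustrative, and spaces and the chunk overlap are ignored.

```python
import math

# Figures taken from the chunk-size comment; the names are illustrative.
MAX_WORDS_PER_CHUNK = 256   # embedding model limit, treated as words here
AVG_CHARS_PER_WORD = 4.7    # average English word length, spaces not counted
WORDS_PER_PAGE = 700        # rough page length
CHUNK_SIZE = 1000           # chunk_size passed to the text splitter

# Upper bound on the characters that fit into one embedded chunk.
max_chars_per_chunk = MAX_WORDS_PER_CHUNK * AVG_CHARS_PER_WORD

# Characters on a typical page and the number of full chunks they fill.
chars_per_page = WORDS_PER_PAGE * AVG_CHARS_PER_WORD
full_chunks_per_page = math.floor(chars_per_page / CHUNK_SIZE)

print(f"max characters per chunk: {max_chars_per_chunk:.0f}")
print(f"characters per page:      {chars_per_page:.0f}")
print(f"full chunks per page:     {full_chunks_per_page}")
```

The bound of about 1200 characters sits above the chosen chunk_size of 1000, which leaves some headroom for spaces and punctuation. Counting spaces and the overlap of 100 raises the real number of chunks per page above this lower bound, in line with the 4 to 5 metadata lines per page in the output above.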



## Main Components of RAG.

In [12]:
# Retriever. By default it runs a similarity search and returns the 4 most
# similar chunks.
# [Alternative, similarity search with a threshold:
# search_type="similarity_score_threshold",
# search_kwargs={"score_threshold": 0.05}]
retriever = vectorstore.as_retriever()

# Prompt template.
template = """
You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know.
Use three sentences maximum and keep the answer concise.
Question: {question}
Context:  {context}
Answer:
"""
prompt = ChatPromptTemplate.from_template(template)

# Set the LLM. Replace YOUR_API_KEY with your Mistral API key (a string).
llm = ChatMistralAI(mistral_api_key=YOUR_API_KEY)

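# Create the pipeline (retrieve, augment, generate). The retrieved Documents
# are inserted into the prompt in their string form, which includes the
# metadata (source and page).
RAG_chain = ({"context": retriever, "question": RunnablePassthrough()}
             | prompt
             | llm
             | StrOutputParser())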
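
The chain above hands the retriever output to the prompt as it is, so the model reads the raw string form of each Document, metadata included. A minimal sketch of a more explicit alternative follows: a helper, here called `format_docs` (a hypothetical name, not part of the notebook), labels every chunk with its file name and page. `FakeDocument` is only a stand-in for the LangChain `Document` class so that the example runs without any packages, and the chunk texts are placeholders.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass
class FakeDocument:
    """Stand-in for langchain_core.documents.Document (illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    """Render each chunk with its source file name and page number."""
    blocks = []
    for doc in docs:
        source = PurePosixPath(doc.metadata.get("source", "unknown")).name
        page = doc.metadata.get("page", "?")
        blocks.append(f"[source: {source}, page: {page}]\n{doc.page_content}")
    return "\n\n".join(blocks)


# Placeholder chunks; only the metadata layout mirrors the notebook.
docs = [
    FakeDocument("First chunk text.",
                 {"source": "/content/The_Great_Gatsby_Part_1.pdf", "page": 1}),
    FakeDocument("Second chunk text.",
                 {"source": "/content/The_Great_Gatsby_Part_2.pdf", "page": 46}),
]
context = format_docs(docs)
print(context)
```

In the notebook, the helper could probably be plugged in as `"context": retriever | format_docs`, since the chain syntax accepts plain functions. The prompt then carries a compact label per chunk, which may make it easier for the model to cite pages.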

## Q&A.

In [13]:
# Generate
query = """Who was invited to the parties? On which pages are they listed?"""
RAG_chain.invoke(query)

"The individuals who were invited to Gatsby's parties include Nick Carraway, as mentioned on page 21 of the first document. Other people were not specifically invited and simply forced their way in, as stated on page 24 of the first document. A woman named Roosevelt also brought someone to Gatsby's house, as mentioned on page 24 of the first document."