<a href="https://colab.research.google.com/github/elektromusik/RAG/blob/main/RAG_with_Metadata.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG with Metadata.

## Load Packages.

In [56]:
!pip --quiet install faiss-cpu langchain langchain_community langchain_mistralai
!pip --quiet install pypdf sentence_transformers

from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import ChatPromptTemplate
from langchain_mistralai.chat_models import ChatMistralAI
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from os import getenv

## Data Preprocessing.

In [57]:
# Load all .pdf Documents in the Folder /content/.
loader = PyPDFDirectoryLoader("/content/", glob="*.pdf")
pages = loader.load()

# Choose the Chunk Size.
# [1 page ~ 700 words. The embedding model truncates its input after 256
# word pieces (tokens), i.e. after at most 256 words. With 1 word ~ 4.7
# characters, a chunk_size of 1000 characters stays roughly within that
# limit. Altogether, we end up with at least 3 chunks per page at a
# chunk_size of 1000.]
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    separators=["\n\n", "\n", ".", ",", " ", ""])

chunks = [{"page_content": text_splitter.split_text(page.page_content),
           "metadata": page.metadata} for page in pages]

# Create a List of LangChain Documents.
# [They include content and metadata, i.e. 'source' and 'page'.]
Documents = [Document(page_content=j, metadata=chunk['metadata']) for chunk in
             chunks for j in chunk['page_content']]
# Alternative:
# Documents=[]
# for i in range(len(pages)):
#   for j in chunks[i]['page_content']:
#     Documents.append(Document(page_content=j, metadata=chunks[i]['metadata']))

print(Documents[5])

# Choose the Embedding Model.
# [I tried to find the best embedding model via the MTEB leaderboard at
# huggingface.co:
# 1) nvidia/NV-Embed-v2 (not found on NVIDIA website),
# 2) BAAI/bge-en-icl (runs forever),
# ...
# 10) nvidia/NV-Embed-v1 (needed packages incompatible).
# Hence, I ended up with the following standard model. Its drawback is that
# it truncates input after 256 word pieces (tokens), i.e. after roughly
# 1000 characters.]
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Add the Chunks and the Metadata to the Vector Database.
vectorstore = FAISS.from_documents(documents=Documents,
                                   embedding=embedding_model)

page_content='very grave, hesitant faces. Father agreed to finance me for a year and after various delays I came 
east, permanently, I thought, in the spring of twenty-two.
The practical thing was to find rooms in the city but it was a warm season and I had just left a 
country of wide lawns and friendly trees, so when a young man at the office suggested that we take 
a house together in a commuting town it sounded like a great idea. He found the house, a weather 
beaten cardboard bungalow at eighty a month, but at the last minute the firm ordered him to 
Washington and I went out to the country alone. I had a dog, at least I had him for a few days until 
he ran away, and an old Dodge and a Finnish woman who made my bed and cooked breakfast and 
muttered Finnish wisdom to herself over the electric stove.
It was lonely for a day or so until one morning some man, more recently arrived than I, stopped me 
on the road.
"How do you get to West Egg village?" he asked helplessly.' metadata={'



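The chunk-size estimate in the comments above can be checked with a little arithmetic. The sketch below is a rough estimate only: it takes the figures quoted in the notebook (700 words per page, 4.7 characters per word), assumes one separator character after every word, and treats the split as a plain sliding window, which ignores where the splitter actually finds its separators. The function name `estimate_chunks_per_page` is hypothetical.

```python
import math


def estimate_chunks_per_page(words_per_page=700, chars_per_word=4.7,
                             chunk_size=1000, chunk_overlap=100):
    """Rough number of chunks one page yields under a sliding-window split."""
    # Assumption: one space (or newline) follows every word.
    chars_per_page = words_per_page * (chars_per_word + 1)
    if chars_per_page <= chunk_size:
        return 1
    # Every chunk after the first advances by chunk_size - chunk_overlap.
    step = chunk_size - chunk_overlap
    return 1 + math.ceil((chars_per_page - chunk_size) / step)


chunks_per_page = estimate_chunks_per_page()
print(chunks_per_page)
```

Under these assumptions the estimate is consistent with the "at least 3 chunks per page" stated in the comment; the real count depends on the actual page text.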
## Main Components of RAG.

In [61]:
# Retriever.
# [Similarity search with a threshold: search_type="similarity_score_threshold",
# search_kwargs={"score_threshold": 0.05}]
retriever = vectorstore.as_retriever()

# Prompt Template.
template = """
You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know.
Use three sentences maximum and keep the answer concise.
Question: {question}
Context:  {context}
Answer:
"""
prompt = ChatPromptTemplate.from_template(template)

# Set LLM.
# [The API key is read from the environment variable MISTRAL_API_KEY.]
llm = ChatMistralAI(mistral_api_key=getenv("MISTRAL_API_KEY"))

# Create Pipeline (Retrieve-Augment-Generate)
RAG_chain = ({"context": retriever,  "question": RunnablePassthrough()}
              | prompt
              | llm
              | StrOutputParser())
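
In the chain above, the retriever's output is passed to the prompt as is, so the model sees the raw string form of each `Document`, metadata included. One possible refinement is to label each chunk with its source before it reaches the prompt. The sketch below illustrates the idea in plain Python: `RetrievedChunk` is a hypothetical stand-in for LangChain's `Document`, `format_docs` is a hypothetical helper, and the file name is made up. It also assumes that the `page` value is 0-based.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass
class RetrievedChunk:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    """Join chunks into one context string, each labelled with its source."""
    blocks = []
    for doc in docs:
        source = PurePosixPath(doc.metadata.get("source", "unknown")).name
        page = doc.metadata.get("page")
        # Assumption: 'page' is 0-based, so add 1 for a human-readable number.
        if page is not None:
            label = f"[{source}, page {page + 1}]"
        else:
            label = f"[{source}]"
        blocks.append(f"{label}\n{doc.page_content}")
    return "\n\n".join(blocks)


docs = [
    RetrievedChunk("First chunk.", {"source": "/content/example.pdf", "page": 0}),
    RetrievedChunk("Second chunk.", {"source": "/content/example.pdf", "page": 4}),
]
context = format_docs(docs)
print(context)
```

In the notebook, a helper like this could be placed behind the retriever, for example as `retriever | format_docs` for the `"context"` entry, so that the page numbers in the answer match the numbering a reader sees in the PDF.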

## Q&A.

In [62]:
# Generate
query = """Who was invited to the parties? On which pages are they listed?"""
RAG_chain.invoke(query)

"The individuals invited to Gatsby's parties include Nick Carraway, as mentioned on page 20 of the first document. Uninvited guests also attended, as stated on page 23 of the first document and page 6 of the second document. The list of attendees is not explicitly provided in the context."