# RAG application for Q&A with PDF documents

RAG is short for Retrieval-Augmented Generation. The term was coined in the paper [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/pdf/2005.11401.pdf).

The architecture of the proposed model was:

<img src="images/rag-architecture.png" width="1200">

For an overview of RAG models, see the survey paper [Retrieval-Augmented Generation for Large Language Models: A Survey](https://arxiv.org/abs/2312.10997).

Martin Fowler recently published a blog post [Emerging Patterns in Building GenAI Products](https://martinfowler.com/articles/gen-ai-patterns/) that contains a lot of valuable information.

## Model overview

<img src="images/overview.jpg" width="1200">

## Real-world use case of RAG

<img src="images/kol.jpg" width="1200">

## Steps required to build a RAG application

1. __PDF Extraction and Preprocessing__  

   • Extract Text: Use tools such as LangChain's PyPDFLoader, PyPDF2, or pdfplumber to extract the text content from the PDF file.  
   • Clean and Preprocess: Remove unnecessary formatting, fix encoding issues, and possibly normalize the text (e.g., lowercasing, punctuation handling).  
   • Document Segmentation: Depending on your PDF’s structure, you might want to segment it by chapters, sections, or pages if needed.

2. __Chunking the Documents__  

   • Define Chunk Size: Split the extracted text into manageable chunks (e.g., paragraphs or fixed-size windows) so that each piece can be meaningfully processed.  
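   • Overlap Chunks: Let consecutive chunks share some text, so that a sentence or idea cut at a chunk boundary still appears intact in at least one chunk.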

3. __Creating Embeddings for the Text Chunks__  

   • Choose an Embedding Model: Use a state-of-the-art embedding model (e.g., OpenAI’s embedding APIs, Sentence Transformers, etc.) that maps text chunks to high-dimensional vectors.  
   • Generate Embeddings: Iterate over the chunks and compute their embeddings. This turns each text snippet into a vector which captures semantic meaning.

4. __Building a Vector Index__

   • Select a Vector Store: Use libraries such as Qdrant, Chroma, Faiss or Pinecone to store and index your embeddings.  
   • Insert Embeddings: Store each vector along with metadata (like the chunk text, source page, or document section) for quick retrieval later on.

5. __Setting Up the Retrieval Mechanism__  

   • Query Embedding: When a user submits a question, embed the question using the same embedding model.  
   • Similarity Search: Query the vector index to retrieve the top-n most similar text chunks based on the question’s embedding.  

6. __Constructing the RAG Pipeline__  

   • Context Combination: Concatenate the retrieved chunks into a context prompt or pass them as additional inputs to the LLM.  
   • Prompt Engineering: Craft a prompt that combines the user’s question with the retrieved context. Ensure the prompt instructs the LLM to use the provided evidence to answer the query.  
   • LLM Query: Use an LLM (like GPT-4) to generate the final answer based on both the question and the supporting context from the PDF.
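
The context combination and prompt engineering in step 6 come down to string assembly. Here is a minimal sketch in plain Python, without any framework. The `RetrievedChunk` class, the `build_prompt` function and the template wording are hypothetical names and text chosen for illustration, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class RetrievedChunk:
    """Hypothetical container for a retrieved chunk and its source page."""

    text: str
    page: int


# Hypothetical template; real projects usually tune this wording.
PROMPT_TEMPLATE = (
    "Answer the question using only the context below. "
    "If the context does not contain the answer, say that you don't know.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_prompt(question: str, chunks: list[RetrievedChunk]) -> str:
    """Concatenate the retrieved chunks and combine them with the question."""
    context = "\n\n".join(f"[page {c.page}] {c.text}" for c in chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)


chunks = [
    RetrievedChunk("We trained our models on one machine with 8 NVIDIA P100 GPUs.", 7),
    RetrievedChunk("We used the Adam optimizer.", 7),
]
print(build_prompt("How many GPUs were the models trained on?", chunks))
```

Prefixing each chunk with its page number is a design choice: it lets the LLM (and the reader of the answer) point back to where the evidence came from.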

The following is based on [LangChain](https://www.langchain.com/), a composable framework for building applications with LLMs.

In [44]:
import os
import pprint

from langchain import hub
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables import RunnablePassthrough
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_ollama import ChatOllama
from langchain_openai import ChatOpenAI
from langchain_qdrant import QdrantVectorStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

In [45]:
openai_api_key = os.environ.get("OPENAI_API_KEY")

## Baseline

Let's explore a couple of LLMs before we build the RAG application.

We will be using the famous "Attention Is All You Need" paper as the source document.

In [46]:
# Download the paper
# !wget https://arxiv.org/pdf/1706.03762
# !mv 1706.03762 PDFs/attention.pdf

In [47]:
prefix = "I am reading the 'Attention is all you need' paper! "

# Text
query = prefix + "How many GPU's were the models trained on?"
# Answer: 8 GPUs - page 7

# Table
# query = (
#     prefix
#     + "What is the BLEU score of the MoE model for English-to-German translation?"
# )
# Answer: 26.03 - page 8

# Image
# query = prefix + "What does the sentence in figure 5 say?"
# Answer: - page 15
# The Law will never be perfect, but its application should be just.
# This is what we are missing, in my opinion.

In [48]:
prompt = ChatPromptTemplate.from_messages(
    [
        SystemMessage(
            content="You are a helpful assistant. Answer all questions to the best of your ability."
        ),
        MessagesPlaceholder(variable_name="messages"),
    ]
)

In [49]:
openai = ChatOpenAI(model="o1")

openai_chain = prompt | openai

response_openai = openai_chain.invoke(
    {
        "messages": [
            HumanMessage(
                content=query,
            ),
        ],
    }
)

print(response_openai.content)

According to the original paper (“Attention Is All You Need,” Vaswani et al., 2017), the authors trained their Transformer models on 8 NVIDIA P100 GPUs. Specifically, in Section 5.3 (Training), they mention that the “base” Transformer model trains for about 12 hours on 8 P100 GPUs, while the larger model (“Transformer big”) takes around 3.5 days on the same setup.


In [50]:
deepseek = ChatOllama(
    model="deepseek-r1:1.5b", temperature=0, base_url="http://localhost:11434"
)

chain_deepseek = prompt | deepseek

response_deepseek = chain_deepseek.invoke(
    {
        "messages": [
            HumanMessage(
                content=query,
            ),
        ],
    }
)

print(response_deepseek.content)

<think>
Okay, so I'm trying to figure out how many GPUs the models from the "Attention is All You Need" paper were trained on. I remember that this paper introduced a transformer-based model for machine translation, and it was pretty influential. But I'm not exactly sure about the specifics of hardware requirements.

First, I think about what the original setup might have been. The authors probably used some kind of cluster or supercomputer because training large models like this would require a lot of computational power. I recall that Google did some work with GPUs back then, but maybe they were using more than one GPU per machine.

I also remember that each machine in the cluster had multiple GPUs. For example, if there were 8 GPUs on a single machine, that's a common setup for multi-GPU training. So, if each of the 16 machines had 8 GPUs, that would be 128 GPUs in total. That makes sense because having more GPUs per machine allows for parallel training, which is essential for handl

## Building the knowledge base
### Download source document
#### Step 1 Extract text from the PDF

Load document

In [51]:
loader = PyPDFLoader("./PDFs/attention.pdf")
documents = loader.load()
print(len(documents))

15


Each `document` corresponds to one page in the PDF file. Let us explore the content of the first document:

In [52]:
print(f"{documents[0].page_content[:500]}")

Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.com
Noam Shazeer∗
Google Brain
noam@google.com
Niki Parmar∗
Google Research
nikip@google.com
Jakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Google Research
llion@google.com
Aidan N. Gomez∗ †
University of Toronto
aidan@cs.toronto.edu
Łukasz 


And here is the corresponding metadata:

In [53]:
pprint.pp(documents[0].metadata)

{'producer': 'pdfTeX-1.40.25',
 'creator': 'LaTeX with hyperref',
 'creationdate': '2024-04-10T21:11:43+00:00',
 'author': '',
 'keywords': '',
 'moddate': '2024-04-10T21:11:43+00:00',
 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live '
                    '2023) kpathsea version 6.3.5',
 'subject': '',
 'title': '',
 'trapped': '/False',
 'source': './PDFs/attention.pdf',
 'total_pages': 15,
 'page': 0,
 'page_label': '1'}


Extracting text can be as easy as this simple example or as complex as you want to make it. It is much harder if you want to preserve structural metadata such as chapters and sections. Extracting text from tables is not easy, and figures and images are harder still.
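
The notebook skips the cleaning part of step 1, so here is a minimal sketch of what it could look like for raw page text. The function name `clean_page_text` and the three heuristics are assumptions chosen for illustration. They use only the Python standard library and are not part of PyPDFLoader.

```python
import re
import unicodedata


def clean_page_text(text: str) -> str:
    """Normalize text extracted from one PDF page (hypothetical helper)."""
    # Fold compatibility characters such as the "fi" ligature into plain letters.
    text = unicodedata.normalize("NFKC", text)
    # Rejoin words hyphenated across a line break: "Trans-\nformer" -> "Transformer".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Turn single line breaks into spaces but keep blank lines as paragraph breaks.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


raw = "The Trans-\nformer uses self-attention.\nIt is ef\ufb01cient.\n\nNext   paragraph."
print(clean_page_text(raw))
```

These heuristics are lossy. A genuine compound that happens to break at a line end loses its hyphen, and joining single line breaks flattens layouts such as the author block on the first page. Whether that trade-off is acceptable depends on the documents at hand.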

## Step 2
### Chunking the document
#### Text splitter

Split each page into smaller chunks. 

Set `add_start_index=True` to record each chunk's character offset within its source page as `start_index` metadata.

In [54]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)

In [55]:
chunks = text_splitter.split_documents(documents)
len(chunks)

52

In [56]:
chunks[12]

Document(metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'author': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': './PDFs/attention.pdf', 'total_pages': 15, 'page': 2, 'page_label': '3', 'start_index': 1610}, page_content='3.2 Attention\nAn attention function can be described as mapping a query and a set of key-value pairs to an output,\nwhere the query, keys, values, and output are all vectors. The output is computed as a weighted sum\n3')
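
To see what chunk size, overlap and `start_index` mean, here is a simplified sketch that cuts text into fixed-size character windows. It is not how `RecursiveCharacterTextSplitter` works internally: that splitter prefers to break at separators such as paragraph and line breaks and treats `chunk_size` as an upper bound. The function name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[dict]:
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    # Each new window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start : start + chunk_size]})
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
for chunk in chunk_with_overlap(text, chunk_size=10, chunk_overlap=3):
    print(chunk)
```

With a chunk size of 10 and an overlap of 3, the last three characters of each window are repeated at the start of the next one. A sentence that straddles a boundary therefore has a better chance of appearing whole in one of the two chunks.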

## Step 3
### Create Embeddings

#### Embeddings visualized

<img src="images/embedding_vector_space.png" width="1200">

We load an embedding model

In [57]:
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")



In [58]:
embedding_vector = embeddings.embed_query(chunks[0].page_content)

embedding_vector_length = len(embedding_vector)

print(f"Generated vectors of length {embedding_vector_length}\n")
print(f"first 5 elements in embedding vector: \n{embedding_vector[:5]}")

Generated vectors of length 768

first 5 elements in embedding vector: 
[0.00345193431712687, 0.01597711443901062, -0.013028663583099842, 0.0009539231541566551, -0.051165636628866196]
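
Embedding vectors become useful once we compare them. A common measure is cosine similarity: the dot product of two vectors divided by the product of their lengths. Here is a minimal sketch with toy 3-dimensional vectors standing in for the 768-dimensional ones above. The numbers are made up for illustration and do not come from the embedding model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up vectors: the first two point in a similar direction, the third does not.
gpu_question = [0.9, 0.1, 0.0]
gpu_passage = [0.8, 0.2, 0.1]
optimizer_passage = [0.0, 0.2, 0.9]

print(cosine_similarity(gpu_question, gpu_passage))
print(cosine_similarity(gpu_question, optimizer_passage))
```

The first score is close to 1 and the second is close to 0, which is the behaviour retrieval relies on: texts with related meaning should end up with vectors that point in similar directions.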


## Step 4
### Create Vector Index

Here we will be using the vector database [Qdrant](https://qdrant.tech/) running locally in Docker.

Note that the collection's vector size is set to the length of the vectors produced by the embedding model, i.e. 768.

In [59]:
collection_name = "attention"

client = QdrantClient("http://localhost:6333")

collection_exists = client.collection_exists(collection_name=collection_name)

if not collection_exists:
    client.create_collection(
        collection_name=collection_name,
        vectors_config=VectorParams(
            size=embedding_vector_length, distance=Distance.COSINE
        ),
    )

vector_store = QdrantVectorStore(
    client=client,
    collection_name=collection_name,
    embedding=embeddings,
)

### Add embeddings to vector database

In [60]:
if not collection_exists:
    print("Adding documents to the collection")
    ids = vector_store.add_documents(documents=chunks)

## Step 5
### Search vector database

In [61]:
results = vector_store.similarity_search(query)

print(results[0].page_content)

We trained our models on one machine with 8 NVIDIA P100 GPUs. For our base models using
the hyperparameters described throughout the paper, each training step took about 0.4 seconds. We
trained the base models for a total of 100,000 steps or 12 hours. For our big models,(described on the
bottom line of table 3), step time was 1.0 seconds. The big models were trained for 300,000 steps
(3.5 days).
5.3 Optimizer
We used the Adam optimizer [20] with β1 = 0.9, β2 = 0.98 and ϵ = 10−9. We varied the learning
rate over the course of training, according to the formula:
lrate = d−0.5
model · min(step_num−0.5, step_num · warmup_steps−1.5) (3)
This corresponds to increasing the learning rate linearly for the first warmup_steps training steps,
and decreasing it thereafter proportionally to the inverse square root of the step number. We used
warmup_steps = 4000.
5.4 Regularization
We employ three types of regularization during training:
7
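
Conceptually, steps 4 and 5 store vectors together with a payload and return the stored entries closest to the query vector. Here is a brute-force sketch of that idea. `TinyVectorIndex` is a hypothetical in-memory stand-in that compares the query with every stored vector. It is not how Qdrant is implemented, and the vectors and payloads are made up.

```python
import heapq
import math
from dataclasses import dataclass, field


def normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length so that dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


@dataclass
class TinyVectorIndex:
    """Hypothetical in-memory index that compares the query with every point."""

    size: int
    points: list[tuple[list[float], dict]] = field(default_factory=list)

    def add(self, vector: list[float], payload: dict) -> None:
        # Same reason the collection size must match the embedding model.
        if len(vector) != self.size:
            raise ValueError(f"expected {self.size} dimensions, got {len(vector)}")
        self.points.append((normalize(vector), payload))

    def search(self, query: list[float], k: int = 4) -> list[tuple[float, dict]]:
        if len(query) != self.size:
            raise ValueError(f"expected {self.size} dimensions, got {len(query)}")
        q = normalize(query)
        scored = (
            (sum(a * b for a, b in zip(q, vector)), payload)
            for vector, payload in self.points
        )
        # Keep the k highest-scoring points, best first.
        return heapq.nlargest(k, scored, key=lambda pair: pair[0])


index = TinyVectorIndex(size=3)
index.add([0.8, 0.2, 0.1], {"page": 7, "text": "We trained our models on 8 GPUs."})
index.add([0.0, 0.2, 0.9], {"page": 7, "text": "We used the Adam optimizer."})
index.add([0.1, 0.9, 0.1], {"page": 3, "text": "An attention function maps a query to an output."})

for score, payload in index.search([0.9, 0.1, 0.0], k=2):
    print(f"{score:.3f}  page {payload['page']}: {payload['text']}")
```

Normalizing vectors when they are added means the search only needs a dot product per point. Comparing against every point is fine for a handful of chunks, while dedicated vector databases use index structures to avoid scanning everything.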


## Step 6
### Constructing the RAG pipeline

Set up Q&A with the LLM, using chunks retrieved from the vector database as context.

In [62]:
# See full prompt at https://smith.langchain.com/hub/rlm/rag-prompt
prompt = hub.pull("rlm/rag-prompt")


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


qa_chain = (
    {
        "context": vector_store.as_retriever() | format_docs,
        "question": RunnablePassthrough(),
    }
    | prompt
    | deepseek
    | StrOutputParser()
)

response = qa_chain.invoke(query)
print(response)



<think>
Okay, so I'm trying to figure out how many GPUs were used in the 'Attention is all you need' paper. The user provided a context where it says they trained on one machine with 8 NVIDIA P100 GPUs. That means each GPU was part of that single machine setup. So, even though there are eight GPUs, they're all working together on the same computer. I don't think the number is asking for how many were used in total but rather how many were available or part of the training environment. So the answer should be 8 P100 GPUs.
</think>

The paper was trained on one machine with 8 NVIDIA P100 GPUs, meaning all eight were part of the same setup.
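
The DeepSeek-R1 response above includes the model's reasoning inside `<think>` tags. If only the final answer should reach the user, the block can be removed with a small post-processing step. Here is a minimal sketch using a regular expression. The helper name `strip_reasoning` is hypothetical.

```python
import re

# Non-greedy match so that each <think>...</think> block is removed separately.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_reasoning(text: str) -> str:
    """Remove complete <think>...</think> blocks and surrounding whitespace."""
    return THINK_BLOCK.sub("", text).strip()


raw = (
    "<think>\nThe context says one machine with 8 GPUs.\n</think>\n\n"
    "The models were trained on 8 NVIDIA P100 GPUs."
)
print(strip_reasoning(raw))
```

A plain function like this could presumably be appended to the chain after `StrOutputParser()`, in the same way `format_docs` is piped after the retriever. Note that a response cut off before `</think>` is left unchanged, because the pattern only matches complete blocks.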


## Final remarks

Simple RAG applications are "just" advanced prompt engineering!

Be aware of the size of the context window when injecting context from the vector database.
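
One simple way to respect the context window is to stop adding retrieved chunks once an estimated token budget is used up. Here is a rough sketch. The function `fit_to_budget` is hypothetical, and the default of 4 characters per token is only a rule of thumb for English text. A real tokenizer gives exact counts.

```python
def fit_to_budget(
    chunks: list[str], max_tokens: int, chars_per_token: float = 4.0
) -> list[str]:
    """Keep the highest-ranked chunks whose estimated size fits the budget."""
    selected = []
    used = 0.0
    for chunk in chunks:  # assumed to be ordered from most to least relevant
        cost = len(chunk) / chars_per_token  # crude token estimate (assumption)
        if used + cost > max_tokens:
            break
        selected.append(chunk)
        used += cost
    return selected


retrieved = ["a" * 400, "b" * 400, "c" * 400]
kept = fit_to_budget(retrieved, max_tokens=250)
print(len(kept))
```

Each toy chunk is estimated at 100 tokens, so only the first two fit into a budget of 250. Stopping at the first chunk that does not fit keeps the ranking intact instead of skipping ahead to smaller, less relevant chunks.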

* Multimodal RAG
* More advanced techniques
  * Hybrid search
  * Re-ranking
  * Metadata filtering
* CAG (Cache-Augmented Generation)

## What did we not cover?

* Evaluation of RAG models