# RAG application for Q&A with PDF documents

RAG is short for Retrieval-Augmented Generation. The term was coined in the paper [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/pdf/2005.11401.pdf).

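The architecture of the model proposed in the paper is shown in the figure below. Applied to Q&A over a PDF document, a RAG application typically involves the numbered steps that follow the figure.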

<img src="images/rag-architecture.png" width="800">

1. __PDF Extraction and Preprocessing__  
   • Extract Text: Use a tool such as LangChain's PyPDFLoader, PyPDF2, or pdfplumber to extract the text content from the PDF file.  
   • Clean and Preprocess: Remove unnecessary formatting, fix encoding issues, and possibly normalize the text (e.g., lowercasing, punctuation handling).  
   • Document Segmentation: Depending on your PDF’s structure, you might want to segment it by chapters, sections, or pages if needed.

2. __Chunking the Document__  
   • Define Chunk Size: Split the extracted text into manageable chunks (e.g., paragraphs or fixed-size windows) so that each piece can be meaningfully processed.  
   • Overlap Chunks: Optionally use overlapping windows to ensure smooth context transitions between chunks, which helps when a concept spans multiple chunks.

3. __Creating Embeddings for the Text Chunks__  
   • Choose an Embedding Model: Use a state-of-the-art embedding model (e.g., OpenAI’s embedding APIs, Sentence Transformers, etc.) that maps text chunks to high-dimensional vectors.  
   • Generate Embeddings: Iterate over the chunks and compute their embeddings. This turns each text snippet into a vector which captures semantic meaning.

4. __Building a Vector Index__  
   • Select a Vector Store: Use a vector store such as Qdrant, Chroma, FAISS, or Pinecone to store and index your embeddings.  
   • Insert Embeddings: Store each vector along with metadata (like the chunk text, source page, or document section) for quick retrieval later on.

5. __Setting Up the Retrieval Mechanism__  
   • Query Embedding: When a user submits a question, embed the question using the same embedding model.  
   • Similarity Search: Query the vector index to retrieve the top-n most similar text chunks based on the question’s embedding.  
   • Relevance Ranking: Optionally rerank or verify retrieved passages to ensure they are the most contextually appropriate.

6. __Constructing the RAG Pipeline__  
   • Context Combination: Concatenate the retrieved chunks into a context prompt or pass them as additional inputs to the LLM.  
   • Prompt Engineering: Craft a prompt that combines the user’s question with the retrieved context. Ensure the prompt instructs the LLM to use the provided evidence to answer the query.  
   • LLM Query: Use an LLM (like GPT-4) to generate the final answer based on both the question and the supporting context from the PDF.
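
Steps 3 to 6 above can be sketched end to end in plain Python. This is only a toy illustration under loud assumptions: the "embedding" is a bag-of-words count vector rather than a learned model, the "vector index" is a Python list, and the helper names (`embed`, `cosine`, `retrieve`, `build_prompt`) and the three sample chunks are made up for this example. No LLM is called; the sketch stops at the prompt that would be sent to one.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, index: list[tuple[str, Counter]], top_n: int = 2) -> list[str]:
    """Embed the question with the same function and return the top_n most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Combine the retrieved chunks and the question into one prompt string."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# Hypothetical, hand-written chunks standing in for the split PDF text.
chunks = [
    "We trained our models on one machine with 8 NVIDIA P100 GPUs.",
    "We used the Adam optimizer with a warmup schedule.",
    "An attention function maps a query and key-value pairs to an output.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]  # the toy "vector index"

question = "How many GPUs were the models trained on?"
top_chunks = retrieve(question, index, top_n=1)
toy_prompt = build_prompt(question, top_chunks)
print(toy_prompt)
```

A real embedding model replaces `embed`, and a vector store replaces the list and the linear scan in `retrieve`; the overall flow stays the same.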

The rest of this notebook is based on [LangChain](https://www.langchain.com/), a composable framework for building applications with LLMs.

In [58]:
import os

from langchain import hub
from langchain_community.document_loaders import PyPDFDirectoryLoader, PyPDFLoader
from langchain_core.messages import AIMessage, HumanMessage, SystemMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables import RunnablePassthrough
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_ollama import ChatOllama
from langchain_openai import ChatOpenAI
from langchain_qdrant import QdrantVectorStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

## Download source document

We will use the famous "Attention Is All You Need" paper as the source document.


In [59]:
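# Note: ChatOpenAI picks up OPENAI_API_KEY from the environment on its own;
# this variable is not passed explicitly below.
openai_api_key = os.environ.get("OPENAI_API_KEY")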

In [None]:
# !wget https://arxiv.org/pdf/1706.03762
# !mv 1706.03762 PDFs/attention.pdf

In [None]:
prefix = "I am reading the paper 'Attention Is All You Need'. "
query1 = prefix + "What does the sentence in figure 5 say?"  # Image
query2 = prefix + "What is the BLEU score of the big transformer model?"  # Table
query3 = prefix + "How many GPUs were the models trained on?"  # Text

In [None]:
prompt = ChatPromptTemplate.from_messages(
    [
        SystemMessage(
            content="You are a helpful assistant. Answer all questions to the best of your ability."
        ),
        MessagesPlaceholder(variable_name="messages"),
    ]
)

In [None]:
llm_openai = ChatOpenAI(model="o1")

chain_openai = prompt | llm_openai

response_openai = chain_openai.invoke(
    {
        "messages": [
            HumanMessage(
                content=query3,
            ),
        ],
    }
)

print(response_openai.content)

According to section 5.3 of the original "Attention Is All You Need" paper, the authors trained their Transformer models on a single machine equipped with 8 NVIDIA P100 GPUs.


In [67]:
llm_deepseek = ChatOllama(
    model="deepseek-r1:1.5b", temperature=0, base_url="http://localhost:11434"
)

chain_deepseek = prompt | llm_deepseek

response_deepseek = chain_deepseek.invoke(
    {
        "messages": [
            HumanMessage(
                content=query3,
            ),
        ],
    }
)

print(response_deepseek.content)


<think>
Okay, so I'm trying to figure out how many GPUs were used when training the models in "Attention is All You Need." I remember that the paper was quite influential and it introduced the Transformer model. But I'm not exactly sure about the specifics of hardware usage.

First, I think the original setup involved multiple GPUs because training large models can be computationally intensive. The idea must have been to distribute the workload across several machines to speed things up. I recall that each GPU might have had a certain number of chips or cores. Maybe around 8-16? That seems plausible for modern hardware.

I also remember something about data distribution. The text wasn't just training one model but multiple models simultaneously. So, if there were four GPUs in total, each would handle two different tasks. This parallel processing should have made the training faster by utilizing all available resources effectively.

Wait, I think the original setup used four GPUs. Each 

## Load document

In [68]:
# loader = PyPDFDirectoryLoader("PDFs/")
loader = PyPDFLoader("./PDFs/attention.pdf")
documents = loader.load()
print(len(documents))

15


Each document corresponds to one page of the PDF file. Let us look at the content of the first document:

In [69]:
print(f"{documents[0].page_content[:500]}")


Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.com
Noam Shazeer∗
Google Brain
noam@google.com
Niki Parmar∗
Google Research
nikip@google.com
Jakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Google Research
llion@google.com
Aidan N. Gomez∗ †
University of Toronto
aidan@cs.toronto.edu
Łukasz 


Each document also carries metadata:

In [70]:
print(documents[0].metadata)

{'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'author': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': './PDFs/attention.pdf', 'total_pages': 15, 'page': 0, 'page_label': '1'}


Text extraction can be as simple as this example or arbitrarily complex. It gets much harder if you want to preserve structure such as chapters and sections as metadata. Extracting text from tables is difficult, and figures and images are harder still.
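
As a rough illustration of the cleanup that often follows extraction, the sketch below rejoins words hyphenated across line breaks, unwraps hard line breaks inside a paragraph, and collapses repeated spaces. The function name `clean_extracted_text` and the sample string are made up for this example. It uses only the standard library and is not part of PyPDFLoader or LangChain.

```python
import re


def clean_extracted_text(text: str) -> str:
    """Light cleanup for text extracted from a PDF page."""
    # Rejoin words hyphenated across a line break: "atten-\ntion" -> "attention".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Turn single newlines inside a paragraph into spaces; keep blank lines as paragraph breaks.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs into one space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


# Made-up sample imitating typical extraction output.
raw = (
    "The Transformer relies on atten-\ntion mechanisms,\n"
    "dispensing with   recurrence.\n\nNext paragraph."
)
cleaned = clean_extracted_text(raw)
print(cleaned)
```

Note that blindly removing a hyphen at a line break also merges genuine compounds such as "key-value" when they happen to wrap, so a real pipeline usually needs something more careful.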

## Text splitter

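Split each page into smaller chunks of at most 1000 characters, with an overlap of 200 characters between consecutive chunks.

`add_start_index=True` records the character offset at which each chunk starts within its source page as the `start_index` metadata field. The other page metadata is carried over to each chunk regardless of this setting.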

In [71]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)

# Alternative: measure chunk size in tokens instead of characters, see
# https://python.langchain.com/docs/how_to/split_by_token/

# from langchain_text_splitters import RecursiveCharacterTextSplitter

# text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
#     model_name="gpt-4",
#     chunk_size=1000,
#     chunk_overlap=200,
#     add_start_index=True
# )

all_splits = text_splitter.split_documents(documents)

len(all_splits)

52

In [72]:
all_splits[12]

Document(metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'author': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': './PDFs/attention.pdf', 'total_pages': 15, 'page': 2, 'page_label': '3', 'start_index': 1610}, page_content='3.2 Attention\nAn attention function can be described as mapping a query and a set of key-value pairs to an output,\nwhere the query, keys, values, and output are all vectors. The output is computed as a weighted sum\n3')
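
To make `chunk_size`, `chunk_overlap`, and `start_index` concrete, here is a minimal fixed-window splitter in plain Python. It is a simplified stand-in and not LangChain's algorithm: `RecursiveCharacterTextSplitter` prefers to cut at separators such as paragraph and line breaks, whereas this hypothetical `split_with_overlap` helper cuts at fixed character offsets.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[dict]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for piece in toy_chunks:
    print(piece["start_index"], piece["text"])
```

Each window repeats the last four characters of the previous one, which is the role the 200-character overlap plays in the splitter configured above.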

## Embeddings

In [73]:
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")



In [None]:
vector = embeddings.embed_query(all_splits[0].page_content)

print(f"Generated vectors of length {len(vector)}\n")
print(f"first 5 elements in embedding vector: \n{vector[:5]}")

Generated vectors of length 768

first 5 elements in embedding vector: 
[0.00345193431712687, 0.01597711443901062, -0.013028663583099842, 0.0009539231541566551, -0.051165636628866196]


In [75]:
collection_name = "attention"

client = QdrantClient("http://localhost:6333")

collection_exists = client.collection_exists(collection_name=collection_name)

if not collection_exists:
    client.create_collection(
        collection_name=collection_name,
        # 768 is the output dimension of all-mpnet-base-v2 (the vector length printed above)
        vectors_config=VectorParams(size=768, distance=Distance.COSINE),
    )

vector_store = QdrantVectorStore(
    client=client,
    collection_name=collection_name,
    embedding=embeddings,
)


### Add embeddings to vector database

In [76]:
if not collection_exists:
    print("Adding documents to the collection")
    ids = vector_store.add_documents(documents=all_splits)

### Search vector database

In [77]:
results = vector_store.similarity_search(query3)

print(results[0])

page_content='We trained our models on one machine with 8 NVIDIA P100 GPUs. For our base models using
the hyperparameters described throughout the paper, each training step took about 0.4 seconds. We
trained the base models for a total of 100,000 steps or 12 hours. For our big models,(described on the
bottom line of table 3), step time was 1.0 seconds. The big models were trained for 300,000 steps
(3.5 days).
5.3 Optimizer
We used the Adam optimizer [20] with β1 = 0.9, β2 = 0.98 and ϵ = 10−9. We varied the learning
rate over the course of training, according to the formula:
lrate = d−0.5
model · min(step_num−0.5, step_num · warmup_steps−1.5) (3)
This corresponds to increasing the learning rate linearly for the first warmup_steps training steps,
and decreasing it thereafter proportionally to the inverse square root of the step number. We used
warmup_steps = 4000.
5.4 Regularization
We employ three types of regularization during training:
7' metadata={'producer': 'pdfTeX-1.40.25', 'creat

## Set up Q&A with an LLM using the vector database as context

In [None]:
# See full prompt at https://smith.langchain.com/hub/rlm/rag-prompt
prompt = hub.pull("rlm/rag-prompt")


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


qa_chain = (
    {
        "context": vector_store.as_retriever() | format_docs,
        "question": RunnablePassthrough(),
    }
    | prompt
    | llm_deepseek
    | StrOutputParser()
)

result = qa_chain.invoke(query3)
print(result)



<think>
Okay, so I'm trying to figure out how many GPUs were used in the models described in the 'Attention is all you need' paper. The user provided a context with some information about the setup. Let me read through it again.

The context says that they trained their models on one machine with 8 NVIDIA P100 GPUs. So, each model was run on 8 of those GPUs. That makes sense because when I look at other parts of the text, like the optimizer and regularization sections, they mention using Adam optimizer and specific learning rate schedules. But for the number of GPUs used per model, it's clearly stated as 8.

I don't see any conflicting information here. The context is pretty straightforward. It just mentions that each model was trained on one machine with 8 P100 GPUs. So, I think the answer is simply that they used 8 GPUs per model.
</think>

The models were trained on one machine with 8 NVIDIA P100 GPUs.
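
The DeepSeek-R1 output above includes its reasoning inside `<think>...</think>` tags ahead of the final answer. If only the answer is wanted, one option is to strip that block with a regular expression. This is a sketch that assumes the tags are well formed and closed; the helper name `strip_reasoning` and the shortened sample string are made up for this example.

```python
import re

# Non-greedy match across newlines, so each <think>...</think> block is removed separately.
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_reasoning(text: str) -> str:
    """Remove <think>...</think> blocks and surrounding whitespace from model output."""
    return THINK_BLOCK.sub("", text).strip()


# Shortened, hand-written stand-in for the model output shown above.
raw_output = (
    "<think>\nThe context says 8 NVIDIA P100 GPUs.\n</think>\n\n"
    "The models were trained on one machine with 8 NVIDIA P100 GPUs."
)
answer = strip_reasoning(raw_output)
print(answer)
```

A plain function like this can be appended to the chain after `StrOutputParser()` with the `|` operator. An output that is cut off before the closing `</think>` tag is left unchanged by this pattern.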
