# RAG system for Q&A with PDF documents

1. __PDF Extraction and Preprocessing__  
   • Extract Text: Use libraries like PyPDFLoader, PyPDF2, pdfplumber, or similar tools to extract the text content from the PDF file.  
   • Clean and Preprocess: Remove unnecessary formatting, fix encoding issues, and possibly normalize the text (e.g., lowercasing, punctuation handling).  
   • Document Segmentation: Depending on your PDF’s structure, you might want to segment it by chapters, sections, or pages if needed.

2. __Chunking the Document__  
   • Define Chunk Size: Split the extracted text into manageable chunks (e.g., paragraphs or fixed-size windows) so that each piece can be meaningfully processed.  
   • Overlap Chunks: Optionally use overlapping windows to ensure smooth context transitions between chunks, which helps when a concept spans multiple chunks.

3. __Creating Embeddings for the Text Chunks__  
   • Choose an Embedding Model: Use a state-of-the-art embedding model (e.g., OpenAI’s embedding APIs, Sentence Transformers, etc.) that maps text chunks to high-dimensional vectors.  
   • Generate Embeddings: Iterate over the chunks and compute their embeddings. This turns each text snippet into a vector which captures semantic meaning.

4. __Building a Vector Index__
   • Select a Vector Store: Use libraries such as Qdrant, Chroma, Faiss or Pinecone to store and index your embeddings.  
   • Insert Embeddings: Store each vector along with metadata (like the chunk text, source page, or document section) for quick retrieval later on.

5. __Setting Up the Retrieval Mechanism__  
   • Query Embedding: When a user submits a question, embed the question using the same embedding model.  
   • Similarity Search: Query the vector index to retrieve the top-n most similar text chunks based on the question’s embedding.  
   • Relevance Ranking: Optionally rerank or verify retrieved passages to ensure they are the most contextually appropriate.

6. __Constructing the RAG Pipeline__  
   • Context Combination: Concatenate the retrieved chunks into a context prompt or pass them as additional inputs to the LLM.  
   • Prompt Engineering: Craft a prompt that combines the user’s question with the retrieved context. Ensure the prompt instructs the LLM to use the provided evidence to answer the query.  
   • LLM Query: Use an LLM (like GPT-4) to generate the final answer based on both the question and the supporting context from the PDF.
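The six steps above can be sketched end to end without any external services. The following is a toy illustration only: `embed` is a bag-of-words stand-in for a real embedding model, the sample chunks are made-up strings, and `retrieve` and `build_prompt` are hypothetical helper names, not part of LangChain.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question, context_chunks):
    """Combine retrieved chunks and the question into one prompt string."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# Made-up sample chunks, for illustration only.
chunks = [
    "An attention function maps a query and a set of key-value pairs to an output.",
    "The weather station logs temperature every ten minutes.",
    "Bread dough should rest before baking.",
]
question = "What does an attention function map?"
top = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

The resulting `prompt` string is what would be sent to the LLM in step 6. The rest of this notebook replaces each toy piece with a real component.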

The following is based on [LangChain](https://www.langchain.com/), a composable framework for building applications with LLMs.

In [60]:
from langchain import hub
from langchain_community.document_loaders import PyPDFDirectoryLoader, PyPDFLoader
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_ollama import ChatOllama
from langchain_qdrant import QdrantVectorStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams
from langchain_openai import ChatOpenAI

import os


In [61]:
# Only needed if the optional ChatOpenAI model is used later on
openai_api_key = os.environ.get("OPENAI_API_KEY")

## Load document

In [None]:
# !mkdir -p PDFs
# !wget https://arxiv.org/pdf/1706.03762
# !mv 1706.03762 PDFs/attention.pdf

In [63]:
# loader = PyPDFDirectoryLoader("PDFs/")
loader = PyPDFLoader("./PDFs/attention.pdf")
documents = loader.load()
print(len(documents))

15


Each document corresponds to one page of the PDF file. Let us look at the content of the first document:

In [77]:
print(f"{documents[0].page_content[:500]}")


Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.com
Noam Shazeer∗
Google Brain
noam@google.com
Niki Parmar∗
Google Research
nikip@google.com
Jakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Google Research
llion@google.com
Aidan N. Gomez∗ †
University of Toronto
aidan@cs.toronto.edu
Łukasz 


And here is the corresponding metadata:

In [78]:
print(documents[0].metadata)

{'source': './PDFs/attention.pdf', 'page': 0, 'page_label': '1'}
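The metadata shows that `page` is a zero-based index while `page_label` is the label printed on the page. A minimal sketch of turning this metadata into a short citation follows; `format_source` is a hypothetical helper name, and the fallback of `page + 1` assumes the zero-based convention seen in the output above.

```python
from pathlib import Path


def format_source(metadata):
    """Build a short human-readable citation from PyPDFLoader-style metadata."""
    name = Path(metadata.get("source", "unknown")).name
    # Prefer the printed page label; fall back to the zero-based index + 1.
    label = metadata.get("page_label")
    if label is None and "page" in metadata:
        label = str(metadata["page"] + 1)
    return f"{name}, p. {label}" if label is not None else name


meta = {"source": "./PDFs/attention.pdf", "page": 0, "page_label": "1"}
print(format_source(meta))
```

Such a string could be attached to each retrieved chunk so that an answer can point back to the page it came from.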


## Text splitter

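Split each page into smaller chunks of at most 1000 characters, with up to 200 characters of overlap between consecutive chunks.

`add_start_index=True` stores each chunk's character offset within its source page in the metadata under `start_index`.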

In [66]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
# https://python.langchain.com/docs/how_to/split_by_token/

# from langchain_text_splitters import RecursiveCharacterTextSplitter

# text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
#     model_name="gpt-4",
#     chunk_size=1000,
#     chunk_overlap=200,
#     add_start_index=True
# )

all_splits = text_splitter.split_documents(documents)

len(all_splits)

52

In [67]:
all_splits[12]

Document(metadata={'source': './PDFs/attention.pdf', 'page': 2, 'page_label': '3', 'start_index': 1610}, page_content='3.2 Attention\nAn attention function can be described as mapping a query and a set of key-value pairs to an output,\nwhere the query, keys, values, and output are all vectors. The output is computed as a weighted sum\n3')
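To make the roles of `chunk_size`, `chunk_overlap`, and `start_index` concrete, here is a simplified sketch using fixed-size character windows. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to break on separators such as paragraph and line breaks; `split_with_overlap` is a hypothetical helper written only for illustration.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Fixed-size character windows; each chunk records its start offset."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


page = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(page, chunk_size=10, chunk_overlap=4)
for p in pieces:
    print(p["start_index"], p["text"])
```

Each chunk shares its last four characters with the start of the next one, and slicing the page at `start_index` recovers the chunk text, which is what makes the offset useful for locating a chunk in its source page.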

## Embeddings

In [68]:
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

In [80]:
vector_1 = embeddings.embed_query(all_splits[0].page_content)
vector_2 = embeddings.embed_query(all_splits[1].page_content)

assert len(vector_1) == len(vector_2)
print(f"Generated vectors of length {len(vector_1)}\n")
print(f"first 5 elements in embedding vector: \n{vector_1[:5]}")

Generated vectors of length 768

first 5 elements in embedding vector: 
[0.00345193431712687, 0.01597711443901062, -0.013028663583099842, 0.0009539231541566551, -0.051165636628866196]


In [70]:
collection_name = "attention"

client = QdrantClient("http://localhost:6333")

collection_exists = client.collection_exists(collection_name=collection_name)

if not collection_exists:
    client.create_collection(
        collection_name=collection_name,
        # size must match the embedding dimension (768 for all-mpnet-base-v2)
        vectors_config=VectorParams(size=768, distance=Distance.COSINE),
    )

vector_store = QdrantVectorStore(
    client=client,
    collection_name=collection_name,
    embedding=embeddings,
)
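The collection above is configured with `Distance.COSINE`. As a reminder of what that metric measures, here is a plain-Python sketch of cosine similarity between two dense vectors; `cosine_similarity` is a hypothetical helper, and the vector database computes this internally with its own optimized implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # orthogonal
```

In the notebook, calling `cosine_similarity(vector_1, vector_2)` would compare the embeddings of the first two chunks. Because the measure ignores vector length, only the direction of the embeddings matters.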


### Embed chunks and add them to the vector database

In [71]:
ids = vector_store.add_documents(documents=all_splits)
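One practical caveat: if the store assigns fresh random IDs on every call, which appears to be the default when no IDs are passed, re-running this cell would insert the same chunks a second time. A possible remedy is to derive a deterministic ID from each chunk's metadata. The sketch below assumes that `add_documents` accepts an `ids` argument and that the store accepts UUID strings as point IDs; both should be checked against the installed versions. `chunk_id` is a hypothetical helper.

```python
import uuid


def chunk_id(metadata):
    """Deterministic UUID derived from a chunk's source, page, and start offset."""
    key = f"{metadata.get('source')}|{metadata.get('page')}|{metadata.get('start_index')}"
    return str(uuid.uuid5(uuid.NAMESPACE_URL, key))


example = {"source": "./PDFs/attention.pdf", "page": 2, "start_index": 1610}
print(chunk_id(example))

# Possible use in the notebook (assumption, not executed here):
# ids = [chunk_id(d.metadata) for d in all_splits]
# vector_store.add_documents(documents=all_splits, ids=ids)
```

With stable IDs, inserting the same chunk again would overwrite the existing point instead of creating a duplicate, provided the store treats a repeated ID as an update.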

### Search vector database

In [72]:
query = "What does the sentence in figure 5 say?"
results = vector_store.similarity_search(query)

print(results[0])

page_content='Attention Visualizations
Input-Input Layer5
It is in this spirit that a majority of American governments have passed new laws since 2009 making the registration or voting process more difficult . <EOS> <pad> <pad> <pad> <pad> <pad> <pad>
It is in this spirit that a majority of American governments have passed new laws since 2009 making the registration or voting process more difficult . <EOS> <pad> <pad> <pad> <pad> <pad> <pad>
Figure 3: An example of the attention mechanism following long-distance dependencies in the
encoder self-attention in layer 5 of 6. Many of the attention heads attend to a distant dependency of
the verb ‘making’, completing the phrase ‘making...more difficult’. Attentions here shown only for
the word ‘making’. Different colors represent different heads. Best viewed in color.
13' metadata={'source': './PDFs/attention.pdf', 'page': 12, 'page_label': '13', 'start_index': 0, '_id': 'e92fae7f-ecbc-4628-87e8-9888b3323b55', '_collection_name': 'attention'}
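Note that the top hit is the page containing Figure 3, although the query asked about Figure 5: embedding similarity alone does not guarantee an exact match on a figure number. One simple way to compensate, in the spirit of the optional relevance-ranking step, is to move chunks that contain the wanted caption to the front. The sketch below uses `SimpleNamespace` objects as stand-ins for LangChain documents and placeholder caption text; `mentions_figure` and `prefer_figure` are hypothetical helpers.

```python
import re
from types import SimpleNamespace


def mentions_figure(doc, number):
    """True if the text contains a caption such as 'Figure 5:'."""
    return re.search(rf"\bFigure\s+{number}\s*:", doc.page_content) is not None


def prefer_figure(docs, number):
    """Stable reorder: chunks with the wanted caption first, the rest after."""
    return sorted(docs, key=lambda d: not mentions_figure(d, number))


# Stand-in search results with placeholder text, for illustration only.
hits = [
    SimpleNamespace(page_content="Figure 3: placeholder caption", metadata={"id": "a"}),
    SimpleNamespace(page_content="Figure 5: placeholder caption", metadata={"id": "b"}),
]
reranked = prefer_figure(hits, 5)
print(reranked[0].metadata["id"])
```

In the notebook, `prefer_figure(results, 5)` could be applied to the list returned by `similarity_search`. This only helps if a chunk with the Figure 5 caption is among the retrieved results, so requesting more results may also be needed.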

## Set up Q&A with an LLM using the vector database as context

In [75]:
llm = ChatOllama(
    model="deepseek-r1:1.5b", temperature=0, base_url="http://localhost:11434"
)

# llm = ChatOpenAI(model="o1")

In [76]:
# See full prompt at https://smith.langchain.com/hub/rlm/rag-prompt
prompt = hub.pull("rlm/rag-prompt")


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


qa_chain = (
    {
        "context": vector_store.as_retriever() | format_docs,
        "question": RunnablePassthrough(),
    }
    | prompt
    | llm
    | StrOutputParser()
)

qa_chain.invoke(query)



'<think>\nOkay, so I need to figure out what the sentence in Figure 5 says. The user provided some context with figures and sentences. Let me look through the information given.\n\nFirst, there\'s a section about "Attention Visualizations" which seems to be related to attention mechanisms in neural networks, specifically layer 5 of 6. It mentions an example of long-distance dependencies using self-attention. But I\'m not sure how this directly relates to Figure 5 unless it\'s part of the same context.\n\nThen there are some sentences that seem repetitive about "The Law will never be perfect," but they\'re repeated multiple times with slight variations. The structure is similar, so maybe these are examples or references from different figures or sections in the text.\n\nLooking further down, I see a figure labeled Figure 4 with attention heads involved in anaphora resolution. It mentions that two attention heads are focused on "its" and shows very sharp attentions for this word. This su
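The raw output begins with a `<think>` block containing the model's reasoning before any answer. If only the final answer is wanted, the block can be removed with a small post-processing step. The following is a minimal sketch: `strip_think` is a hypothetical helper, the sample string is made up, and the function assumes the reasoning is delimited by `<think>` and `</think>` tags as in the output above.

```python
import re


def strip_think(text):
    """Remove <think>...</think> blocks; also drop an unterminated trailing block."""
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    text = re.sub(r"<think>.*\Z", "", text, flags=re.DOTALL)
    return text.strip()


# Made-up model output, for illustration only.
raw = "<think>\nOkay, so I need to figure out...\n</think>\n\nThe sentence says X."
print(strip_think(raw))
```

Since plain functions can be piped into a chain, as `format_docs` is above, `qa_chain | strip_think` would be one way to apply this step to every answer.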