# RAG application for Q&A with PDF documents

RAG is short for Retrieval-Augmented Generation, a term coined in the paper [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/pdf/2005.11401.pdf).

The paper proposed the following architecture:

<img src="images/rag-architecture.png" width="800">

For an overview of RAG models, see the paper [Retrieval-Augmented Generation for Large Language Models: A Survey](https://arxiv.org/abs/2312.10997).

Martin Fowler recently published a blog post [Emerging Patterns in Building GenAI Products](https://martinfowler.com/articles/gen-ai-patterns/) that contains a lot of valuable information.


## Steps required to build a RAG application

1. __PDF Extraction and Preprocessing__  

   • Extract Text: Use a tool such as LangChain's PyPDFLoader, PyPDF2, or pdfplumber to extract the text content from the PDF file.  
   • Clean and Preprocess: Remove unnecessary formatting, fix encoding issues, and possibly normalize the text (e.g., lowercasing, punctuation handling).  
   • Document Segmentation: Depending on your PDF’s structure, you might want to segment it by chapters, sections, or pages if needed.

2. __Chunking the Documents__  

   • Define Chunk Size: Split the extracted text into manageable chunks (e.g., paragraphs or fixed-size windows) so that each piece can be meaningfully processed.  
   • Overlap Chunks: Optionally use overlapping windows to ensure smooth context transitions between chunks, which helps when a concept spans multiple chunks.

3. __Creating Embeddings for the Text Chunks__  

   • Choose an Embedding Model: Use a state-of-the-art embedding model (e.g., OpenAI’s embedding APIs, Sentence Transformers, etc.) that maps text chunks to high-dimensional vectors.  
   • Generate Embeddings: Iterate over the chunks and compute their embeddings. This turns each text snippet into a vector which captures semantic meaning.

4. __Building a Vector Index__

   • Select a Vector Store: Use a vector database or similarity-search library such as Qdrant, Chroma, FAISS, or Pinecone to store and index your embeddings.  
   • Insert Embeddings: Store each vector along with metadata (like the chunk text, source page, or document section) for quick retrieval later on.

5. __Setting Up the Retrieval Mechanism__  

   • Query Embedding: When a user submits a question, embed the question using the same embedding model.  
   • Similarity Search: Query the vector index to retrieve the top-n most similar text chunks based on the question’s embedding.  

6. __Constructing the RAG Pipeline__  

   • Context Combination: Concatenate the retrieved chunks into a context prompt or pass them as additional inputs to the LLM.  
   • Prompt Engineering: Craft a prompt that combines the user’s question with the retrieved context. Ensure the prompt instructs the LLM to use the provided evidence to answer the query.  
   • LLM Query: Use an LLM (like GPT-4) to generate the final answer based on both the question and the supporting context from the PDF.
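
The six steps can be traced end to end in a few lines of plain Python. This is a toy sketch, not the LangChain pipeline used in the rest of this notebook: the "embedding" is just the set of words in a chunk, similarity is Jaccard word overlap rather than a distance between dense vectors, and the final LLM call is left out. All function names and the sample text (short sentences adapted from the paper) are assumptions made for illustration.

```python
import re


def tokenize(text: str) -> set[str]:
    """Toy 'embedding' (assumption): the set of lowercase words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Toy similarity: number of shared words divided by number of distinct words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_index(text: str) -> list[tuple[str, set[str]]]:
    """Steps 2-4: split into paragraph chunks and store (chunk, embedding) pairs."""
    chunks = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [(chunk, tokenize(chunk)) for chunk in chunks]


def retrieve(index: list[tuple[str, set[str]]], question: str, top_n: int = 2) -> list[str]:
    """Step 5: embed the question the same way and return the top-n chunks."""
    q = tokenize(question)
    ranked = sorted(index, key=lambda item: jaccard(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Step 6: combine the retrieved context and the question into one prompt."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Step 1 stand-in: text that would normally come from the PDF extraction.
sample_text = (
    "Training took 3.5 days on 8 P100 GPUs.\n\n"
    "We employed label smoothing during training.\n\n"
    "The big model achieves a BLEU score of 41.0 on English-to-French."
)

index = build_index(sample_text)
question = "How many GPUs were the models trained on?"
top_chunks = retrieve(index, question, top_n=1)
prompt_text = build_prompt(question, top_chunks)
print(prompt_text)
```

The printed prompt is what would be sent to the LLM in the last step. Swapping the toy pieces for a real embedding model, a vector store, and a chat model gives the structure built in the rest of this notebook.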

The rest of this notebook is based on [LangChain](https://www.langchain.com/), a composable framework for building applications with LLMs.

In [None]:
import os
import pprint

from langchain import hub
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables import RunnablePassthrough
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_ollama import ChatOllama
from langchain_openai import ChatOpenAI
from langchain_qdrant import QdrantVectorStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

In [105]:
openai_api_key = os.environ.get("OPENAI_API_KEY")

## Baseline

Let's try a couple of LLMs without retrieval before we build the RAG application.

We will use the well-known "Attention Is All You Need" paper as the source document.

In [106]:
# Download the paper
# !wget https://arxiv.org/pdf/1706.03762
# !mv 1706.03762 PDFs/attention.pdf

In [None]:
prefix = "I am reading the 'Attention is all you need' paper! "

# Text
query = prefix + "How many GPUs were the models trained on?"
# Answer: 8 GPUs

# Table
# query = (
#     prefix
#     + "What is the BLEU score of the MoE model for English-to-German translation?"
# )
# Answer: 26.03

# Image
# query = prefix + "What does the sentence in figure 5 say?"
# Answer:
# The Law will never be perfect, but its application should be just.
# This is what we are missing, in my opinion.

In [108]:
prompt = ChatPromptTemplate.from_messages(
    [
        SystemMessage(
            content="You are a helpful assistant. Answer all questions to the best of your ability."
        ),
        MessagesPlaceholder(variable_name="messages"),
    ]
)

In [109]:
llm_openai = ChatOpenAI(model="o1")

chain_openai = prompt | llm_openai

response_openai = chain_openai.invoke(
    {
        "messages": [
            HumanMessage(
                content=query,
            ),
        ],
    }
)

print(response_openai.content)

The original “Attention Is All You Need” paper (Vaswani et al., 2017) does not include a Mixture-of-Experts (MoE) variant. In that paper, the authors report BLEU scores for what they call “Transformer (base)” and “Transformer (big)” models on WMT 2014 English→German and English→French, but there is no MoE experiment.

If you are looking for a Mixture-of-Experts approach by some of the same researchers, you may be thinking of separate follow-up work on MoE layers (e.g. “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer” by Shazeer et al., 2017) or subsequent papers like Switch Transformers and GShard. However, those MoE results do not appear in the original “Attention Is All You Need” paper.


In [110]:
deepseek = ChatOllama(
    model="deepseek-r1:1.5b", temperature=0, base_url="http://localhost:11434"
)

chain_deepseek = prompt | deepseek

response_deepseek = chain_deepseek.invoke(
    {
        "messages": [
            HumanMessage(
                content=query,
            ),
        ],
    }
)

print(response_deepseek.content)

<think>
Okay, so I'm trying to figure out the BLEU score for the MoE model in the English-to-German translation task from the 'Attention is All You Need' paper. I remember that BLEU is a common metric used to evaluate machine translation models, but I'm not exactly sure how it's calculated or what factors influence its value.

First, I think about what BLEU measures. It stands for Bilingual Evaluation Underlying Small Trees, and it's based on the number of exact matches between the predicted and actual sequences. So, higher scores mean better performance because there are more exact matches. But wait, isn't that not always the case? Maybe sometimes models can have multiple near-matches without exact matches, which could affect the score.

I also recall that BLEU is sensitive to the diversity of the outputs. If a model produces many similar but incorrect translations, it might still perform well because there are enough exact matches elsewhere. But if all the outputs are very similar an

## Building the knowledge base
### Download source document
#### Step 1: Extract text from the PDF

Load document

In [111]:
loader = PyPDFLoader("./PDFs/attention.pdf")
documents = loader.load()
print(len(documents))

15


Each `document` corresponds to one page in the PDF file. Let us explore the content of the first document:

In [112]:
print(f"{documents[0].page_content[:500]}")

Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.com
Noam Shazeer∗
Google Brain
noam@google.com
Niki Parmar∗
Google Research
nikip@google.com
Jakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Google Research
llion@google.com
Aidan N. Gomez∗ †
University of Toronto
aidan@cs.toronto.edu
Łukasz 


and the corresponding metadata

In [113]:
pprint.pp(documents[0].metadata)

{'producer': 'pdfTeX-1.40.25',
 'creator': 'LaTeX with hyperref',
 'creationdate': '2024-04-10T21:11:43+00:00',
 'author': '',
 'keywords': '',
 'moddate': '2024-04-10T21:11:43+00:00',
 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live '
                    '2023) kpathsea version 6.3.5',
 'subject': '',
 'title': '',
 'trapped': '/False',
 'source': './PDFs/attention.pdf',
 'total_pages': 15,
 'page': 0,
 'page_label': '1'}


Text extraction can be as simple as this example or as complex as you need it to be. It gets much harder if you want to preserve structural metadata such as chapters and sections. Extracting text from tables is not easy, and figures and images are harder still.
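
As one illustration of the cleaning step, extracted pages usually contain line breaks in the middle of sentences and words hyphenated across lines. The sketch below is a minimal, assumption-laden cleaner using only the standard library; the function name `clean_page_text` and the choice of rules are illustrative and not part of this notebook's pipeline.

```python
import re
import unicodedata


def clean_page_text(text: str) -> str:
    """Normalize the text extracted from a single PDF page."""
    # Fold compatibility characters, e.g. the single-glyph "fi" ligature, into plain letters.
    text = unicodedata.normalize("NFKC", text)
    # Re-join words hyphenated across a line break: "atten-\ntion" -> "attention".
    # Limitation: a genuine compound split at its hyphen is joined as well.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Turn single line breaks into spaces; blank lines stay as paragraph breaks.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs into one space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


raw = "An atten-\ntion func-\ntion  maps a\nquery.\n\nNext  para"
print(clean_page_text(raw))
```

With the loader used here, such a function could be applied to each `page_content` before chunking. Whether that helps depends on the PDF, so it is worth comparing retrieval results with and without it.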

## Step 2
### Chunking the document
#### Text splitter

Split each page into smaller chunks.

Set `add_start_index=True` to store each chunk's character offset within its page as `start_index` in the metadata.

In [114]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)

In [115]:
chunks = text_splitter.split_documents(documents)
len(chunks)

52

In [116]:
chunks[12]

Document(metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'author': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': './PDFs/attention.pdf', 'total_pages': 15, 'page': 2, 'page_label': '3', 'start_index': 1610}, page_content='3.2 Attention\nAn attention function can be described as mapping a query and a set of key-value pairs to an output,\nwhere the query, keys, values, and output are all vectors. The output is computed as a weighted sum\n3')
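
To see what `chunk_size`, `chunk_overlap`, and `start_index` mean, here is a plain sliding-window splitter. It is a simplified sketch and not how `RecursiveCharacterTextSplitter` works internally: the LangChain splitter prefers to break on separators such as paragraph and line breaks, so its chunks are often shorter than `chunk_size`. The function name and the tiny example string are assumptions chosen for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[dict]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start : start + chunk_size]})
        # Stop once a window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


for chunk in sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2):
    print(chunk)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last `chunk_overlap` characters of a chunk are repeated at the start of the next.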

## Step 3
### Create Embeddings

We load an embedding model

In [117]:
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")



In [118]:
embedding_vector = embeddings.embed_query(chunks[0].page_content)

embedding_vector_length = len(embedding_vector)

print(f"Generated vectors of length {embedding_vector_length}\n")
print(f"first 5 elements in embedding vector: \n{embedding_vector[:5]}")

Generated vectors of length 768

first 5 elements in embedding vector: 
[0.00345193431712687, 0.01597711443901062, -0.013028663583099842, 0.0009539231541566551, -0.051165636628866196]


## Step 4
### Create Vector Index

Here we will be using the vector database [Qdrant](https://qdrant.tech/) running locally in Docker.

Note that the vector size in the collection config must match the length of the vectors produced by the embedding model, i.e. 768.

In [119]:
collection_name = "attention"

client = QdrantClient("http://localhost:6333")

collection_exists = client.collection_exists(collection_name=collection_name)

if not collection_exists:
    client.create_collection(
        collection_name=collection_name,
        vectors_config=VectorParams(
            size=embedding_vector_length, distance=Distance.COSINE
        ),
    )

vector_store = QdrantVectorStore(
    client=client,
    collection_name=collection_name,
    embedding=embeddings,
)

### Add embeddings to vector database

In [120]:
if not collection_exists:
    print("Adding documents to the collection")
    ids = vector_store.add_documents(documents=chunks)
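
The `collection_exists` guard avoids inserting the chunks twice, but it also skips ingestion whenever the collection already exists, even if it is empty or only partially filled. One possible alternative, sketched here as an assumption rather than something this notebook does, is to derive a deterministic ID for each chunk from its metadata so that re-adding a chunk should overwrite the existing point instead of creating a duplicate. The helper name `chunk_id` and the key format are hypothetical.

```python
import uuid


def chunk_id(metadata: dict) -> str:
    """Derive a stable UUID string from a chunk's source, page, and start offset."""
    key = f"{metadata['source']}#page={metadata['page']}&start={metadata['start_index']}"
    # uuid5 hashes the key, so the same chunk always maps to the same ID.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, key))


# Metadata values copied from the chunk printed earlier in this notebook.
example_metadata = {"source": "./PDFs/attention.pdf", "page": 2, "start_index": 1610}
print(chunk_id(example_metadata))

# Hypothetical usage with the objects defined in this notebook (not executed here,
# and assuming the vector store accepts caller-supplied IDs):
# ids = [chunk_id(chunk.metadata) for chunk in chunks]
# vector_store.add_documents(documents=chunks, ids=ids)
```

This relies on `add_start_index=True`, since `start_index` is what distinguishes several chunks from the same page.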

## Step 5
### Search vector database

In [121]:
results = vector_store.similarity_search(query)

print(results[1].page_content)

positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of
Pdrop = 0.1.
Label Smoothing During training, we employed label smoothing of value ϵls = 0.1 [36]. This
hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.
6 Results
6.1 Machine Translation
On the WMT 2014 English-to-German translation task, the big transformer model (Transformer (big)
in Table 2) outperforms the best previously reported models (including ensembles) by more than 2.0
BLEU, establishing a new state-of-the-art BLEU score of 28.4. The configuration of this model is
listed in the bottom line of Table 3. Training took 3.5 days on 8 P100 GPUs. Even our base model
surpasses all previously published models and ensembles, at a fraction of the training cost of any of
the competitive models.
On the WMT 2014 English-to-French translation task, our big model achieves a BLEU score of 41.0,
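
Conceptually, `similarity_search` embeds the query and returns the stored chunks whose vectors are closest to it under the collection's distance, here cosine. The sketch below shows that idea as a brute-force scan over an in-memory list, including the vector-size check that the collection config enforces. It is a teaching aid under stated assumptions: the class `InMemoryIndex`, the 3-dimensional vectors, and the payload strings are made up, and a vector database uses an index structure to avoid scanning every point.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class InMemoryIndex:
    """Hypothetical stand-in for a vector collection with a fixed vector size."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.points: list[tuple[list[float], str]] = []  # (vector, payload)

    def add(self, vector: list[float], payload: str) -> None:
        if len(vector) != self.size:
            raise ValueError(f"expected vector of length {self.size}, got {len(vector)}")
        self.points.append((list(vector), payload))

    def search(self, query_vector: list[float], k: int = 2) -> list[tuple[float, str]]:
        """Score every stored vector against the query and return the k best."""
        if len(query_vector) != self.size:
            raise ValueError(f"expected vector of length {self.size}, got {len(query_vector)}")
        scored = [(cosine_similarity(query_vector, v), p) for v, p in self.points]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


toy_index = InMemoryIndex(size=3)
toy_index.add([1.0, 0.0, 0.0], "chunk about GPUs")
toy_index.add([0.0, 1.0, 0.0], "chunk about BLEU")
toy_index.add([0.9, 0.1, 0.0], "chunk about training time")

hits = toy_index.search([1.0, 0.0, 0.0], k=2)
for score, payload in hits:
    print(f"{score:.3f}  {payload}")
```

Adding a vector of the wrong length raises an error, which mirrors why the collection size has to match the embedding model's output length.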


## Step 6
### Constructing the RAG pipeline

Set up Q&A with an LLM, using the chunks retrieved from the vector database as context.

In [122]:
# See full prompt at https://smith.langchain.com/hub/rlm/rag-prompt
prompt = hub.pull("rlm/rag-prompt")


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


qa_chain = (
    {
        "context": vector_store.as_retriever() | format_docs,
        "question": RunnablePassthrough(),
    }
    | prompt
    | deepseek
    | StrOutputParser()
)

response = qa_chain.invoke(query)
print(response)



<think>
Okay, so I'm trying to figure out the BLEU score for the MoE model in English-to-German translation. The user mentioned that they're reading the 'Attention is all you need' paper and specifically wants to know about the MoE model's BLEU score.

Looking at the context provided, it seems like there are several sections discussing different models, including Transformers and their configurations. In one part, under "6 Results," there's a section titled "Machine Translation" where they talk about the big Transformer model. It mentions that on the English-to-German task, this model achieved a BLEU score of 28.4, which is higher than all previously published models and ensembles.

Additionally, it notes that for the base model (which I assume refers to the configuration used in the experiments), they use a dropout rate Pdrop = 0.1 and label smoothing ϵls = 0.1. This might be relevant if there's another MoE variant mentioned later on, but based on the context given, it seems like the 

## Final remarks

Simple RAG applications are "just" advanced prompt engineering!

Be aware of the size of the context window when injecting context from the vector database.
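
One simple way to respect the context window is to keep retrieved chunks, in ranked order, only until a token budget is used up. The sketch below assumes roughly four characters per token, which is a crude rule of thumb for English text and not a real token count; an accurate count requires the tokenizer of the model in use. The function name `fit_to_budget` and its parameters are hypothetical.

```python
def fit_to_budget(chunks: list[str], max_tokens: int, chars_per_token: int = 4) -> list[str]:
    """Keep chunks in order until the estimated token budget would be exceeded."""
    selected = []
    used = 0
    for chunk in chunks:
        # Estimated tokens for this chunk, rounded up (assumption: ~4 chars per token).
        cost = -(-len(chunk) // chars_per_token)
        if used + cost > max_tokens:
            break
        selected.append(chunk)
        used += cost
    return selected


ranked_chunks = ["a" * 40, "b" * 40, "c" * 40]  # best match first
kept = fit_to_budget(ranked_chunks, max_tokens=25)
print(len(kept))
```

Because the chunks arrive sorted by similarity, stopping at the first chunk that does not fit drops the least relevant context first.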

* Multimodal RAG
* More advanced techniques
  * Hybrid search
  * Re-ranking
  * Metadata filtering
* CAG

## What did we not cover?

* Evaluation of RAG models