
# Create and run a local RAG pipeline from scratch

## What is RAG?

RAG (Retrieval-Augmented Generation) is an approach that combines information retrieval with generative models. In a RAG pipeline, a retriever first searches a large collection of documents to find relevant context for a given query. Then, a generator (such as a language model) uses both the query and the retrieved context to generate a more accurate and informed response. This method enhances the quality of generated answers by grounding them in external knowledge sources.

**The 3 main steps in a RAG system are:**
1. **Retrieval:** Search a knowledge base or document collection to find passages relevant to the input query.
2. **Augmentation:** Combine the retrieved passages with the original query to provide enriched context.
3. **Generation:** Use a generative model to produce a response based on the augmented input, leveraging both the query and the retrieved information.

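Reference: Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (2020), https://arxiv.org/abs/2005.11401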
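The three steps above can be sketched end to end in plain Python. This is a toy illustration, not the pipeline built in this notebook: retrieval uses simple word overlap instead of embeddings, and `generate` is a hypothetical stand-in for a language model call. All names (`DOCS`, `tokenize`, `retrieve`, `augment`, `generate`) and the sample sentences are assumptions made for the example.

```python
import re

# Toy knowledge base (made-up sentences, for illustration only)
DOCS = [
    "Vitamin C is found in citrus fruits.",
    "Calcium supports bone health.",
    "Fluoride helps reduce the incidence of cavities.",
]

def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Step 1 (Retrieval): rank documents by word overlap with the query."""
    q = tokenize(query)
    ranked = sorted(docs, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]

def augment(query: str, passages: list[str]) -> str:
    """Step 2 (Augmentation): combine retrieved passages and the query into one prompt."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

def generate(prompt: str) -> str:
    """Step 3 (Generation): placeholder; a real pipeline would call a language model here."""
    return f"[model output for a prompt of {len(prompt)} characters]"

query = "Which foods contain vitamin C?"
passages = retrieve(query, DOCS)
prompt = augment(query, passages)
print(prompt)
print(generate(prompt))
```

In the rest of this notebook, the word-overlap retrieval is replaced by embeddings computed from the chunks of a PDF.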


In [1]:
import torch
import numpy as np
import pandas as pd

if torch.cuda.is_available():
    print("CUDA is available!")
    print("GPU Name:", torch.cuda.get_device_name(0))
    print("Number of GPUs:", torch.cuda.device_count())
    print("CUDA version:", torch.version.cuda)
else:
    print("CUDA is not available. Running on CPU.")


CUDA is available!
GPU Name: NVIDIA GeForce MX450
Number of GPUs: 1
CUDA version: 12.1


## 1 Document processing and embedding creation

### 1.1 Import the PDF

In [2]:
import os 
import requests

pdf_path = "human-nutrition-text.pdf"

# Download the PDF file if it does not exist locally
if not os.path.exists(pdf_path):
    url = "https://pressbooks.oer.hawaii.edu/humannutrition2/open/download?type=pdf" # URL of the PDF file
    response = requests.get(url, timeout=60)  # avoid hanging forever on a stalled connection
    if response.status_code != 200:
        raise Exception(f"Failed to download PDF. Status code: {response.status_code}")
    with open(pdf_path, "wb") as f:
        f.write(response.content)
    print(f"Downloaded {pdf_path} from {url}")

In [3]:
import fitz # PyMuPDF
from tqdm import tqdm

def text_formatter(text: str) -> str:
    """
    Replace newlines with spaces and trim whitespace at the end of the text.
    """
    return text.replace('\n', ' ').rstrip()

def open_and_read_pdf(pdf_path: str) -> list[dict]:
    """
    Open a PDF file and return one dict per page containing the page number,
    basic statistics (character, word, sentence and estimated token counts)
    and the page text.
    """
    doc = fitz.open(pdf_path)
    pages = []
    for page_num, page in enumerate(tqdm(doc, desc="Reading PDF pages")):
        page_text = text_formatter(text=page.get_text())
        pages.append({"page_num": page_num - 41,  # offset so page_num matches the page numbers printed in the book
                      "page_char_count": len(page_text),
                      "page_word_count": len(page_text.split()),
                      "page_sent_count": len(page_text.split('. ')),  # rough count: split on ". "
                      "page_token_count": len(page_text) / 4,  # rough estimate: ~4 characters per token
                      "text": page_text})
    doc.close()
    return pages

In [4]:
pages = open_and_read_pdf(pdf_path)

Reading PDF pages: 100%|██████████| 1208/1208 [00:02<00:00, 494.99it/s]


In [5]:
import random

random.sample(pages, k=3)  # Display 3 random pages from the PDF

[{'page_num': 697,
  'page_char_count': 1325,
  'page_word_count': 202,
  'page_sent_count': 12,
  'page_token_count': 331.25,
  'text': 'Dietary Reference Intake  The IOM has given Adequate Intakes (AI) for fluoride, but has not yet  developed RDAs. The AIs are based on the doses of fluoride shown  to reduce the incidence of cavities, but not cause dental fluorosis.  From infancy to adolescence, the AIs for fluoride increase from 0.01  milligrams per day for ages less than six months to 2 milligrams  per day for those between the ages of fourteen and eighteen. In  adulthood, the AI for males is 4 milligrams per day and for females is  3 milligrams per day. The UL for young children is set at 1.3 and 2.2  milligrams per day for girls and boys, respectively. For adults, the UL  is set at 10 milligrams per day.  Table 11.10 Dietary Reference Intakes for Fluoride  Age Group  AI (mg/day) UL (mg/day)  Infants (0–6 months)  0.01  0.7  Infants (6–12 months)  0.50  0.9  Children (1–3 years)  0

In [6]:
import pandas as pd

# Create a DataFrame from the list of page dictionaries
df_pages = pd.DataFrame(pages)

# Display the first five rows
df_pages.head()

| | page_num | page_char_count | page_word_count | page_sent_count | page_token_count | text |
|---|---|---|---|---|---|---|
| 0 | -41 | 29 | 4 | 1 | 7.25 | Human Nutrition: 2020 Edition |
| 1 | -40 | 0 | 0 | 1 | 0.0 | |
| 2 | -39 | 320 | 42 | 1 | 80.0 | Human Nutrition: 2020 Edition UNIVERSITY OF ... |
| 3 | -38 | 212 | 30 | 1 | 53.0 | Human Nutrition: 2020 Edition by University of... |
| 4 | -37 | 797 | 116 | 3 | 199.25 | Contents Preface University of Hawai‘i at Mā... |


In [7]:
df_pages.describe()

| | page_num | page_char_count | page_word_count | page_sent_count | page_token_count |
|---|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 562.5 | 1148.016556 | 171.96606 | 10.519868 | 287.004139 |
| std | 348.86387 | 560.368736 | 86.491465 | 6.548495 | 140.092184 |
| min | -41.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 25% | 260.75 | 762.0 | 109.0 | 5.0 | 190.5 |
| 50% | 562.5 | 1231.5 | 183.0 | 10.0 | 307.875 |
| 75% | 864.25 | 1603.5 | 239.0 | 15.0 | 400.875 |
| max | 1166.0 | 2308.0 | 393.0 | 39.0 | 577.0 |


### 1.2 Splitting pages into sentences

There are two main options for splitting the text in each page into sentences:

- **Option 1:** Split by the period character `.` (simple string split).
- **Option 2:** Use an NLP library such as spaCy or NLTK for more accurate sentence segmentation.
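
To see why Option 1 is fragile, the sketch below compares a plain `". "` split with a slightly more careful regular expression, using only the standard library. The sample text and the helper name `regex_split` are made up for the illustration. The regex is still a heuristic (it also breaks after "Dr.") and is not a substitute for a dedicated sentence segmenter.

```python
import re

text = "The UL is set at 2.2 mg per day. Dr. Smith agrees. Is that safe? Yes."

# Option 1: naive split on ". "
# It drops the periods, does not split on "?", and breaks after "Dr."
naive = text.split(". ")
print(naive)

def regex_split(text: str) -> list[str]:
    """Split after '.', '!' or '?' when followed by whitespace and an uppercase letter."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)

# Keeps the punctuation and handles "?", but still breaks after the abbreviation "Dr."
print(regex_split(text))
```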

In [8]:
from spacy.lang.en import English

nlp = English()

nlp.add_pipe("sentencizer")

doc = nlp("This is the first sentence. This is the second sentence. And this is the third.")

list(doc.sents)  # List of sentences in the document


[This is the first sentence.,
 This is the second sentence.,
 And this is the third.]

In [9]:
for item in tqdm(pages, desc="Processing pages"):
    item["sentences"] = list(nlp(item["text"]).sents)
    item["sentences"] = [s.text for s in item["sentences"]]
    item["sent_count"] = len(item["sentences"])

Processing pages: 100%|██████████| 1208/1208 [00:02<00:00, 497.20it/s]


In [10]:
random.sample(pages, k=1)  # Display 1 random page with its sentences

[{'page_num': 630,
  'page_char_count': 75,
  'page_word_count': 5,
  'page_sent_count': 1,
  'page_token_count': 18.75,
  'text': 'http://pressbooks.oer.hawaii.edu/ humannutrition2/?p=364    630  |  Calcium',
  'sentences': ['http://pressbooks.oer.hawaii.edu/ humannutrition2/?p=364    630  |  Calcium'],
  'sent_count': 1}]

In [11]:
df_pages = pd.DataFrame(pages)
df_pages.head()

| | page_num | page_char_count | page_word_count | page_sent_count | page_token_count | text | sentences | sent_count |
|---|---|---|---|---|---|---|---|---|
| 0 | -41 | 29 | 4 | 1 | 7.25 | Human Nutrition: 2020 Edition | [Human Nutrition: 2020 Edition] | 1 |
| 1 | -40 | 0 | 0 | 1 | 0.0 | | [] | 0 |
| 2 | -39 | 320 | 42 | 1 | 80.0 | Human Nutrition: 2020 Edition UNIVERSITY OF ... | [Human Nutrition: 2020 Edition UNIVERSITY OF... | 1 |
| 3 | -38 | 212 | 30 | 1 | 53.0 | Human Nutrition: 2020 Edition by University of... | [Human Nutrition: 2020 Edition by University o... | 1 |
| 4 | -37 | 797 | 116 | 3 | 199.25 | Contents Preface University of Hawai‘i at Mā... | [Contents Preface University of Hawai‘i at M... | 2 |


In [12]:
df_pages.describe().round(2)  # Display summary statistics of the DataFrame rounded to 2 decimal places

| | page_num | page_char_count | page_word_count | page_sent_count | page_token_count | sent_count |
|---|---|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 562.5 | 1148.02 | 171.97 | 10.52 | 287.0 | 10.32 |
| std | 348.86 | 560.37 | 86.49 | 6.55 | 140.09 | 6.3 |
| min | -41.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 25% | 260.75 | 762.0 | 109.0 | 5.0 | 190.5 | 5.0 |
| 50% | 562.5 | 1231.5 | 183.0 | 10.0 | 307.88 | 10.0 |
| 75% | 864.25 | 1603.5 | 239.0 | 15.0 | 400.88 | 15.0 |
| max | 1166.0 | 2308.0 | 393.0 | 39.0 | 577.0 | 28.0 |


### 1.3 Chunking sentences together

In [13]:
chunck_size = 10    # Number of sentences to chunk together


def chunk_sentences(sentences: list[str], chunk_size: int) -> list[list[str]]:
    """
    Chunk sentences into groups of a specified size.
    The last group may contain fewer than chunk_size sentences.
    """
    return [sentences[i:i + chunk_size] for i in range(0, len(sentences), chunk_size)]


In [14]:
test_list = list(range(35))
chunk_sentences(test_list, chunck_size)  # Test the chunking function with a list of numbers

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
 [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
 [30, 31, 32, 33, 34]]
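
As the test above suggests, a list of `n` sentences yields `ceil(n / chunk_size)` chunks, and only the last chunk can be shorter than `chunk_size`. The following sketch checks this with the same slicing logic. The helper name `expected_chunk_count` is introduced here for illustration only.

```python
import math

def chunk_sentences(sentences: list, chunk_size: int) -> list[list]:
    """Same slicing logic as in the notebook cell above."""
    return [sentences[i:i + chunk_size] for i in range(0, len(sentences), chunk_size)]

def expected_chunk_count(n_sentences: int, chunk_size: int) -> int:
    """Number of chunks produced for n_sentences items."""
    return math.ceil(n_sentences / chunk_size)

# Compare the actual number of chunks with the formula for a few list lengths
for n in (0, 5, 10, 28, 35):
    chunks = chunk_sentences(list(range(n)), 10)
    assert len(chunks) == expected_chunk_count(n, 10)
    print(n, "sentences ->", len(chunks), "chunks, sizes", [len(c) for c in chunks])
```

A short page therefore produces a single small chunk, which is why very short chunks appear later and are worth filtering.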

In [15]:
for item in tqdm(pages, desc="Chunking sentences"):
    item["chunks"] = chunk_sentences(item["sentences"], chunck_size)
    item["page_chunck_count"] = len(item["chunks"])


df_pages = pd.DataFrame(pages)
df_pages.describe().round(2)  # Display summary statistics of the DataFrame rounded to 2 decimal places

Chunking sentences: 100%|██████████| 1208/1208 [00:00<00:00, 533563.52it/s]


| | page_num | page_char_count | page_word_count | page_sent_count | page_token_count | sent_count | page_chunck_count |
|---|---|---|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 562.5 | 1148.02 | 171.97 | 10.52 | 287.0 | 10.32 | 1.53 |
| std | 348.86 | 560.37 | 86.49 | 6.55 | 140.09 | 6.3 | 0.64 |
| min | -41.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 25% | 260.75 | 762.0 | 109.0 | 5.0 | 190.5 | 5.0 | 1.0 |
| 50% | 562.5 | 1231.5 | 183.0 | 10.0 | 307.88 | 10.0 | 1.0 |
| 75% | 864.25 | 1603.5 | 239.0 | 15.0 | 400.88 | 15.0 | 2.0 |
| max | 1166.0 | 2308.0 | 393.0 | 39.0 | 577.0 | 28.0 | 3.0 |


### 1.4 Splitting each chunk into its own item

In [16]:
def split_chunks_into_items(pages: list[dict]) -> list[dict]:
    """
    Split each chunk in the pages into separate items.
    Each item will contain the chunk text and metadata from the original page.
    """
    items = []
    for page in tqdm(pages, desc="Splitting chunks into items"):
        page_num = page["page_num"]
        for chunk_idx, chunk in enumerate(page["chunks"]):
            chunk_text = " ".join(chunk).strip()
            chunk_metadata = {
                "page_num": page_num,
                "chunk_idx": chunk_idx,
                "chunk_size": len(chunk),
                "chunk_char_count": len(chunk_text),
                "chunk_word_count": len(chunk_text.split()),
                "chunk_sent_count": len(chunk),
                "chunk_token_count": len(chunk_text) / 4,
                "text": chunk_text,
                "sentences": chunk
            }
            items.append(chunk_metadata)
    return items

# Example usage
chunk_items = split_chunks_into_items(pages)
print(f"Total number of chunks created: {len(chunk_items)}")


Splitting chunks into items: 100%|██████████| 1208/1208 [00:00<00:00, 51048.52it/s]

Total number of chunks created: 1843





In [17]:
# Display a few sample chunks
import random
random.sample(chunk_items, k=3)

[{'page_num': 645,
  'chunk_idx': 0,
  'chunk_size': 1,
  'chunk_char_count': 206,
  'chunk_word_count': 33,
  'chunk_sent_count': 1,
  'chunk_token_count': 51.5,
  'text': 'Summary of Major Minerals  UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN  NUTRITION PROGRAM AND HUMAN NUTRITION PROGRAM  Table 10.8 A Summary of the Major Minerals  Summary of Major Minerals  |  645',
  'sentences': ['Summary of Major Minerals  UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN  NUTRITION PROGRAM AND HUMAN NUTRITION PROGRAM  Table 10.8 A Summary of the Major Minerals  Summary of Major Minerals  |  645']},
 {'page_num': 506,
  'chunk_idx': 1,
  'chunk_size': 7,
  'chunk_char_count': 350,
  'chunk_word_count': 35,
  'chunk_sent_count': 7,
  'chunk_token_count': 87.5,
  'text': 'http://www.health.gov/paguidelines/guidelines/ chapter2.aspx. Published 2008. Accessed September 22,  2017.  8. Source: 2008 Physical Activity Guidelines for Americans.  US Department of Health and Human Services.  

In [18]:
df_chunks = pd.DataFrame(chunk_items)
df_chunks.describe().round(2)

| | page_num | chunk_idx | chunk_size | chunk_char_count | chunk_word_count | chunk_sent_count | chunk_token_count |
|---|---|---|---|---|---|---|---|
| count | 1843.0 | 1843.0 | 1843.0 | 1843.0 | 1843.0 | 1843.0 | 1843.0 |
| mean | 583.38 | 0.4 | 6.76 | 752.12 | 112.85 | 6.76 | 188.03 |
| std | 347.79 | 0.56 | 3.3 | 456.24 | 71.28 | 3.3 | 114.06 |
| min | -41.0 | 0.0 | 1.0 | 14.0 | 3.0 | 1.0 | 3.5 |
| 25% | 280.5 | 0.0 | 4.0 | 322.5 | 45.0 | 4.0 | 80.62 |
| 50% | 586.0 | 0.0 | 8.0 | 765.0 | 115.0 | 8.0 | 191.25 |
| 75% | 890.0 | 1.0 | 10.0 | 1139.5 | 173.0 | 10.0 | 284.88 |
| max | 1166.0 | 2.0 | 10.0 | 1871.0 | 298.0 | 10.0 | 467.75 |


In [19]:
# Inspect chunks with fewer than 30 estimated tokens (candidates for removal)
token_count_threshold = 30
rows_with_few_tokens = df_chunks[df_chunks['chunk_token_count'] < token_count_threshold]
rows_with_few_tokens.sample(5)[["chunk_token_count", "chunk_sent_count", "chunk_char_count", "text"]]


| | chunk_token_count | chunk_sent_count | chunk_char_count | text |
|---|---|---|---|---|
| 250 | 18.25 | 3 | 73 | Published August 2011. Accessed September 22, ... |
| 462 | 17.25 | 1 | 69 | Table 4.6 Sweeteners Carbohydrates and Person... |
| 252 | 25.5 | 1 | 102 | view it online here: http://pressbooks.oer.ha... |
| 1540 | 20.0 | 1 | 80 | http://pressbooks.oer.hawaii.edu/ humannutriti... |
| 1613 | 16.75 | 2 | 67 | Accessed January 20, 2018. The Effect of New ... |


In [20]:
# Filter out the chunk dictionaries with token count less than the threshold
filtered_chunk_items = [item for item in chunk_items if item['chunk_token_count'] >= token_count_threshold]
print(f"Number of chunks after filtering: {len(filtered_chunk_items)}")


Number of chunks after filtering: 1687


In [21]:
df_chunks_filtered = df_chunks[df_chunks['chunk_token_count'] >= token_count_threshold]
df_chunks_filtered.describe().round(2)

| | page_num | chunk_idx | chunk_size | chunk_char_count | chunk_word_count | chunk_sent_count | chunk_token_count |
|---|---|---|---|---|---|---|---|
| count | 1687.0 | 1687.0 | 1687.0 | 1687.0 | 1687.0 | 1687.0 | 1687.0 |
| mean | 580.14 | 0.35 | 7.24 | 815.16 | 122.4 | 7.24 | 203.79 |
| std | 350.06 | 0.52 | 3.03 | 424.68 | 66.86 | 3.03 | 106.17 |
| min | -39.0 | 0.0 | 1.0 | 120.0 | 10.0 | 1.0 | 30.0 |
| 25% | 276.5 | 0.0 | 5.0 | 426.5 | 62.0 | 5.0 | 106.62 |
| 50% | 579.0 | 0.0 | 8.0 | 836.0 | 125.0 | 8.0 | 209.0 |
| 75% | 888.5 | 1.0 | 10.0 | 1166.5 | 177.5 | 10.0 | 291.62 |
| max | 1166.0 | 2.0 | 10.0 | 1871.0 | 298.0 | 10.0 | 467.75 |
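
The token counts used for filtering are estimates based on the rule of thumb of about 4 characters per token, so a threshold of 30 tokens corresponds to 120 characters, which matches the minimum `chunk_char_count` in the table above. A minimal sketch of the estimate and the filter on made-up chunks (the helper name `estimate_tokens` and the sample texts are assumptions for the example):

```python
def estimate_tokens(text: str) -> float:
    """Rough token estimate: about 4 characters per token."""
    return len(text) / 4

token_count_threshold = 30  # i.e. about 120 characters

sample_chunks = [
    "Accessed January 20, 2018.",                               # short citation residue
    "Table 4.6 Sweeteners",                                     # table caption only
    "Calcium is the most abundant mineral in the body. " * 3,   # longer made-up passage
]

# Keep only the chunks whose estimated token count reaches the threshold
kept = [c for c in sample_chunks if estimate_tokens(c) >= token_count_threshold]
for c in sample_chunks:
    print(f"{estimate_tokens(c):6.2f} tokens | kept={c in kept} | {c[:40]!r}")
```

A tokenizer-based count would be more precise, but the character-based estimate is enough to discard fragments such as page footers and lone citations.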


### 1.5 Converting the dictionary of chunks into numerical embeddings

In [22]:
from sentence_transformers import SentenceTransformer
embedder = SentenceTransformer(model_name_or_path="all-mpnet-base-v2", device="cpu")

# test the embedder on a sample of 3 chunks
sample_size = min(3, len(filtered_chunk_items))
sample_chunks = random.sample(filtered_chunk_items, k=sample_size)
sample_texts = [item['text'] for item in sample_chunks]

# Compute embeddings for the sample texts
sample_embeddings = embedder.encode(sample_texts, show_progress_bar=True, convert_to_numpy=True)

print(f"Computed embeddings for {len(sample_embeddings)} chunks. Each embedding has shape: {sample_embeddings.shape[1]}")


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Computed embeddings for 3 chunks. Each embedding has shape: 768


In [23]:
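%%time
chunk_texts = [item['text'] for item in filtered_chunk_items]

# Move the model to the GPU when one is available, otherwise stay on the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
embedder.to(device)
embeddings = embedder.encode(chunk_texts, show_progress_bar=True, convert_to_tensor=True)

# Print the shape of the embeddings: (number of chunks, embedding dimension)
print(embeddings.shape)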

Batches:   0%|          | 0/53 [00:00<?, ?it/s]

CPU times: total: 22.7 s
Wall time: 2min 22s


KeyboardInterrupt: 

In [None]:
# Save the embeddings tensor to a file using torch
output_embeddings_path = "embeddings.pt"
torch.save(embeddings, output_embeddings_path)
print(f"Embeddings saved to {output_embeddings_path}")


In [None]:
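# Attach each embedding to its chunk as a NumPy array
for chunk_idx, item in enumerate(filtered_chunk_items):
    item["embedding"] = embeddings[chunk_idx].cpu().numpy()

df_chunks_filtered = pd.DataFrame(filtered_chunk_items)
df_chunks_filtered.head()

# Note: CSV stores each embedding array as text, so it has to be parsed back into numbers when the file is reloaded
df_chunks_filtered.to_csv("df_chunks_filtered.csv", index=False)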

In [24]:
from pdf_ragger import create_embedding_from_pdf

# Run the steps above (read, split, chunk, filter, embed) in a single call
pdf_chunk = create_embedding_from_pdf(pdf_path, chunk_size=10, token_count_threshold=30)

Reading PDF pages: 100%|██████████| 1208/1208 [00:01<00:00, 608.90it/s]
Processing pages: 100%|██████████| 1208/1208 [00:03<00:00, 352.48it/s]
Chunking sentences: 100%|██████████| 1208/1208 [00:00<00:00, 411995.38it/s]
Splitting chunks into items: 100%|██████████| 1208/1208 [00:00<00:00, 30336.91it/s]


Creating embedding arrays ...


Batches:   0%|          | 0/53 [00:00<?, ?it/s]

Adding embeddings to dictionnary: 1687it [00:00, 8132.02it/s]



------- Time taken: 00:07:08.716


If the set of embeddings is large, a better option is to store it in a vector database (to be explored later).
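
To make the idea concrete, here is a minimal sketch of what a vector store does: keep embeddings next to their text, and return the entries most similar to a query vector. It is a toy in-memory version in plain Python with made-up 3-dimensional vectors. The class name `TinyVectorStore` and its methods are hypothetical and do not correspond to any specific vector database API.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

class TinyVectorStore:
    """Hypothetical in-memory store: a list of (text, embedding) pairs."""

    def __init__(self) -> None:
        self.items: list[tuple[str, list[float]]] = []

    def add(self, text: str, embedding: list[float]) -> None:
        """Store a text together with its embedding."""
        self.items.append((text, embedding))

    def search(self, query_embedding: list[float], k: int = 1) -> list[tuple[str, float]]:
        """Return the k stored texts most similar to the query embedding."""
        scored = [(text, cosine_similarity(query_embedding, emb)) for text, emb in self.items]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

store = TinyVectorStore()
store.add("chunk about calcium", [0.9, 0.1, 0.0])
store.add("chunk about fluoride", [0.1, 0.9, 0.0])
store.add("chunk about vitamins", [0.0, 0.2, 0.9])

# The query vector is closest to the "calcium" embedding, then to "fluoride"
print(store.search([0.8, 0.2, 0.0], k=2))
```

This version scans every stored vector on each query. A real vector database adds persistence and indexing so that search stays fast as the number of embeddings grows.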