# RAG Pipeline from Scratch

RAG (Retrieval Augmented Generation) aims to retrieve relevant information and pass it to a Large Language Model (LLM) so it can generate outputs based on that information.

* **Retrieval**: Find relevant information given a user query. E.g. "What are the macronutrients and what do they do?" -> retrieves any passages of text related to the macronutrients from a nutrition textbook.
* **Augmented**: Take the relevant information from our data and augment our input (prompt) to an LLM with that information.
* **Generation**: Take the results of the first two steps and pass them to an LLM to generate an output.
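
To make the three stages concrete, here is a minimal toy sketch. It is an illustration under loud assumptions, not the pipeline built later in this notebook: `retrieve` ranks passages by simple word overlap instead of embeddings, and `generate` is a placeholder that does not call any LLM. All function names and the sample passages are hypothetical.

```python
import re


def words(text: str) -> set[str]:
    """Lowercase a text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    """Toy retrieval: rank passages by how many query words they share."""
    query_words = words(query)
    ranked = sorted(passages, key=lambda p: len(query_words & words(p)), reverse=True)
    return ranked[:k]


def augment(query: str, context: list[str]) -> str:
    """Build a prompt that places the retrieved passages before the question."""
    context_block = "\n".join(f"- {passage}" for passage in context)
    return f"Answer using only this context:\n{context_block}\n\nQuestion: {query}"


def generate(prompt: str) -> str:
    """Placeholder for an LLM call (assumption: no real model is used here)."""
    return f"[An LLM would answer here, given a prompt of {len(prompt)} characters]"


passages = [
    "Macronutrients are carbohydrates, lipids and proteins.",
    "Water makes up more than 60 percent of body weight.",
    "Vitamins and minerals are micronutrients.",
]
query = "What are the macronutrients?"
top_passages = retrieve(query, passages, k=1)
prompt = augment(query, top_passages)
print(prompt)
print(generate(prompt))
```

Swapping the word-overlap ranking for an embedding-based search and the placeholder for a real model gives the structure of the full pipeline.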

Why RAG? 
The main goal of RAG is to improve the generation output of LLMs.
1. Reduce hallucinations - LLMs are good at generating plausible-looking text, but that text may not be factual. RAG can help LLMs generate text that is grounded in factual sources.
2. Use custom data - Many LLMs are trained on internet data, so they have a good understanding of language in general. RAG allows us to add our own data on top of that. For example, with customer support Q&A data we can retrieve relevant snippets of text and then use an LLM to craft an answer from those snippets.
3. Why run it locally? Speed - we do not have to wait for data to be transferred to and from a remote API. Cost - if we own our hardware, we can save a large amount in usage fees. No vendor lock-in - because we run our own software and hardware, the business keeps running even if OpenAI or another large provider shuts down. Privacy - if you have sensitive documentation, you may not want to send it to an API; instead you can set up an LLM and run it on your own hardware.

## What are we going to build?
https://whimsical.com/simple-local-rag-workflow-39kToR3yNf7E8kY4sS2tjV

1. Open a PDF document.
2. Format the text of the PDF textbook so it is ready for an embedding model.
3. Embed all the chunks of text in the textbook, turning them into numerical representations (embeddings) that we can store for later.
4. Build a retrieval system that uses vector search to find the relevant chunks of text for a query.
5. Create a prompt that incorporates the retrieved pieces of text.
6. Generate an answer to the query with an LLM, based on the retrieved passages of the textbook.


## 1. Document pre-processing and embedding creation 

Ingredients: a PDF document of choice (could be any kind of document) and an embedding model of choice.
1. Import the PDF document.
2. Process the text for embedding (splitting it into chunks of sentences).
3. Embed the text chunks with the embedding model.
4. Save the embeddings to a file for later (embeddings stored on disk stay usable for as long as you keep the file).

In [5]:
# Programmatically get the PDF document
import os 
import requests 

# Get PDF document:
pdf_path = "./data/human-nutrition-text.pdf"

# Download the PDF:
if not os.path.exists(pdf_path):
    print("[INFO] File does not exist, downloading....")

    # Enter the URL of the PDF: 
    url = "https://pressbooks.oer.hawaii.edu/humannutrition2/open/download?type=pdf"

    # Local Filename to save the file:
    filename = pdf_path

    # Send a GET request:
    response = requests.get(url=url)

    # Check if the request was successful:
    if response.status_code == 200:
        # Open file and save it (wb = write binary)
        with open(filename, "wb") as file:
            file.write(response.content)
        print(f"[INFO] The file has been downloaded and saved as {filename}")
    else:
        print(f"[INFO] Failed to download the file. Status code: {response.status_code}")
else:
    print("[INFO] The file already exists")


[INFO] The file already exists


Now that we have the PDF, we can open it. The code below uses pdfplumber for text extraction; PyMuPDF is another option that seems to give very good text formatting when reading PDFs.

In [8]:
!pip install nltk

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




In [6]:
import pdfplumber # MIT Licence 
from tqdm.auto import tqdm
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# Download NLTK data if not already present
def download_nltk_data():
    try:
        nltk.data.find('tokenizers/punkt')
    except LookupError:
        print("Downloading NLTK punkt data...")
        nltk.download('punkt', quiet=True)

# Call the function to download NLTK data
download_nltk_data()

def text_formatter(text: str) -> str:
    """Performs basic formatting on text."""
    # Replace newlines with spaces and trim surrounding whitespace
    cleaned_text = text.replace('\n', ' ').strip()
    return cleaned_text

def open_and_read_pdf(pdf_path: str) -> list[dict]:
    """
    Opens a PDF file, reads its text content page by page, and collects statistics using NLTK.
    """
    reader = pdfplumber.open(pdf_path)
    pages_and_texts = []
    for page_number, page in tqdm(enumerate(reader.pages)):
        text = page.extract_text() or ""  # Fall back to "" if a page yields no text
        text = text_formatter(text)
        
        # Use NLTK for tokenization (Tokenize words and sentences)
        words = word_tokenize(text)
        sentences = sent_tokenize(text)
        
        pages_and_texts.append({
            "page_number": page_number + 1,
            "page_char_count": len(text),
            "page_word_count": len(words),
            "page_sentence_count": len(sentences),
            "page_token_count": len(text) // 4,  # Rough estimate: 1 token ~= 4 characters of English text
            "text": text
        })
    return pages_and_texts

# Usage
pages_and_texts = open_and_read_pdf(pdf_path=pdf_path)


0it [00:00, ?it/s]

In [7]:
import random
random.sample(pages_and_texts, k=3)

[{'page_number': 171,
  'page_char_count': 1765,
  'page_word_count': 319,
  'page_sentence_count': 12,
  'page_token_count': 441,
  'text': 'The Immune System UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN NUTRITION PROGRAM AND HUMAN NUTRITION PROGRAM The immune system comprises several types of white blood cells that circulate in the blood and lymph. Their jobs are to seek, recruit, attack, and destroy foreign invaders, such as bacteria and viruses. Other less realized components of the immune system are the skin (which acts as a barricade), mucus (which traps and entangles microorganisms), and even the bacteria in the large intestine (which prevent the colonization of bad bacteria in the gut). Immune system functions are completely dependent on dietary nutrients. In fact, malnutrition is the leading cause of immune-system deficiency worldwide. When immune system functions are inadequate there is a marked increase in the chance of getting an infection. Children in many poor, d

In [8]:
import pandas as pd 

df = pd.DataFrame(pages_and_texts)
df.head()

|   | page_number | page_char_count | page_word_count | page_sentence_count | page_token_count | text |
|---|---|---|---|---|---|---|
| 0 | 1 | 29 | 5 | 1 | 7 | Human Nutrition: 2020 Edition |
| 1 | 2 | 0 | 0 | 0 | 0 | |
| 2 | 3 | 308 | 55 | 1 | 77 | Human Nutrition: 2020 Edition UNIVERSITY OF HA... |
| 3 | 4 | 210 | 35 | 1 | 52 | Human Nutrition: 2020 Edition by University of... |
| 4 | 5 | 766 | 130 | 3 | 191 | Contents Preface xxv University of Hawai‘i at ... |


In [9]:
df.describe().round(2)

|   | page_number | page_char_count | page_word_count | page_sentence_count | page_token_count |
|---|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 604.5 | 1121.18 | 201.3 | 10.55 | 279.92 |
| std | 348.86 | 552.32 | 100.37 | 6.58 | 138.08 |
| min | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 302.75 | 741.5 | 130.0 | 5.0 | 185.0 |
| 50% | 604.5 | 1191.5 | 214.0 | 10.0 | 297.5 |
| 75% | 906.25 | 1572.5 | 282.0 | 15.0 | 393.0 |
| max | 1208.0 | 2271.0 | 441.0 | 30.0 | 567.0 |


The token count matters because:
1. Embedding models cannot handle an unlimited number of tokens.
2. LLMs cannot handle an unlimited number of tokens either.

For example, an embedding model may have been trained to embed sequences of up to 384 tokens. We will use 'all-mpnet-base-v2' to start off.

As for LLMs, their context window can only accept a limited number of tokens.
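
As a rough illustration of why this matters, the sketch below checks whether a text is likely to fit within a token limit. It assumes the same heuristic used in the page statistics above (about 4 characters per token) and the 384-token figure mentioned in the text; `estimate_tokens` and `fits_in_model` are hypothetical helper names, and a real tokenizer would give different counts.

```python
def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Estimate the token count with the ~4 characters per token heuristic."""
    return len(text) // chars_per_token


def fits_in_model(text: str, max_tokens: int = 384) -> bool:
    """Return True if the estimated token count is within the model limit."""
    return estimate_tokens(text) <= max_tokens


# Stand-in texts with the mean and maximum page lengths (in characters)
# reported in the statistics above.
average_page = "a" * 1121
longest_page = "a" * 2271

print(estimate_tokens(average_page), fits_in_model(average_page))
print(estimate_tokens(longest_page), fits_in_model(longest_page))
```

By this estimate an average page fits within the limit while the longest page does not, which is one reason to split pages into smaller chunks before embedding them.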

### Further Text Processing (Splitting pages into sentences)

We can split each page's text into sentences and then group those sentences, for example into groups of ten. We can do this using an NLP library (spaCy or NLTK).

In [10]:
from spacy.lang.en import English 
# We create a pipeline here: 
nlp = English()

# Add a sentencizer component to the pipeline (it splits text into sentences)
# See https://spacy.io/api/sentencizer
nlp.add_pipe("sentencizer")

# Create a documents instance:
doc = nlp("This is a sentence. This is another sentence. I like elephants.")
assert(len(list(doc.sents))==3)
list(doc.sents)

[This is a sentence., This is another sentence., I like elephants.]

In [11]:
for item in tqdm(pages_and_texts): # is a dict
    item["sentences"] = list(nlp(item["text"]).sents)

    # Make sure all the sentences are strings (the default type is a spaCy datatype)
    item["sentences"] = [str(sentence) for sentence in item["sentences"]]

    # Count the sentences 
    item["page_sentence_count_spacy"] = len(item["sentences"])

  0%|          | 0/1208 [00:00<?, ?it/s]

In [12]:
random.sample(pages_and_texts, k=1)

[{'page_number': 1197,
  'page_char_count': 1593,
  'page_word_count': 286,
  'page_sentence_count': 10,
  'page_token_count': 398,
  'text': 'Increased blood pressure in salt-sensitive Sodium (mg) 1500* 2400 2300 – individuals (when consumed as sodium chloride) Vanadium Gastrointestinal – – 1.8 zero (mg) irritation; fatigue Impaired immune Zinc (mg) 11 / 8 15 40 25 function, low HDL-cholesterol aFood and Nutrition Board, Institute of Medicine (U.S.). Dietary Reference Intakes Tables. b(RDA) = Recommended Dietary Allowance, AI = Adequate Intake, indicated with * cUL = Tolerable Upper Intake Level (from food & supplements combined) dSUL = Safe Upper Levels; SULs and Guidance Levels (indicated by **) set by the Expert Group on Vitamins and Minerals of the Food Standards Agency, United Kingdom. These are intended to be levels of daily intake of nutrients in dietary supplements that potentially susceptible individuals could take daily on a life-long basis without medical supervision in rea

In [13]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

|   | page_number | page_char_count | page_word_count | page_sentence_count | page_token_count | page_sentence_count_spacy |
|---|---|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 604.5 | 1121.18 | 201.3 | 10.55 | 279.92 | 10.58 |
| std | 348.86 | 552.32 | 100.37 | 6.58 | 138.08 | 6.6 |
| min | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 302.75 | 741.5 | 130.0 | 5.0 | 185.0 | 5.0 |
| 50% | 604.5 | 1191.5 | 214.0 | 10.0 | 297.5 | 10.0 |
| 75% | 906.25 | 1572.5 | 282.0 | 15.0 | 393.0 | 15.0 |
| max | 1208.0 | 2271.0 | 441.0 | 30.0 | 567.0 | 30.0 |


### Chunking our sentences together:

The concept of splitting larger pieces of text into smaller ones is often referred to as text splitting or chunking. There is no single correct way of doing this. We may also want a certain overlap between our chunks. There are libraries that help us do this.

We chunk our text for three reasons:

1. It makes the text easier to filter (smaller groups of text can be easier to inspect than larger ones).
2. So our text chunks can fit into the embedding model. 
3. So our context passed to an LLM can be more specific and focused.
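
The overlap mentioned above is not used in the cells that follow, which split pages into non-overlapping groups. As a minimal sketch of what overlapping chunks could look like, the hypothetical helper below (`split_list_with_overlap` is an assumed name, not part of this notebook) repeats the last `overlap` items of each chunk at the start of the next one.

```python
def split_list_with_overlap(items: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split items into chunks of chunk_size that share `overlap` items with their neighbour."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(items), step):
        chunks.append(items[start:start + chunk_size])
        # Stop once a chunk has reached the end of the list,
        # so we do not emit a trailing chunk made only of repeated items.
        if start + chunk_size >= len(items):
            break
    return chunks


sentences = [f"s{i}" for i in range(10)]
overlapping_chunks = split_list_with_overlap(sentences, chunk_size=4, overlap=1)
print(overlapping_chunks)
```

Overlap trades some duplicated storage for a lower chance that a relevant passage is cut in half at a chunk boundary.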

In [14]:
# Define the split size used to turn groups of sentences into chunks
num_sentence_chunk_size = 10 

# Split a list into consecutive slices of at most slice_size items, e.g. 20 -> [10, 10], 25 -> [10, 10, 5]
def split_list(input_list: list[str], slice_size: int) -> list[list[str]]:
    return [input_list[i:i+slice_size] for i in range(0, len(input_list), slice_size)]

test_list = list(range(25))
split_list(test_list, num_sentence_chunk_size)

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
 [20, 21, 22, 23, 24]]

In [15]:
# Loop through pages and text & split sentences into chunks: 
for item in tqdm(pages_and_texts):
    item["sentence_chunks"] = split_list(input_list=item["sentences"],
                                         slice_size=num_sentence_chunk_size)
    item["num_chunks"] = len(item["sentence_chunks"])

  0%|          | 0/1208 [00:00<?, ?it/s]

In [16]:
random.sample(pages_and_texts, k=1)

[{'page_number': 226,
  'page_char_count': 1545,
  'page_word_count': 289,
  'page_sentence_count': 11,
  'page_token_count': 386,
  'text': 'The Nutrition Facts panel displays the amount of sodium (in milligrams) per serving of the food in question (Figure 3.10 “Nutrition Label” ). Food additives are often high in sodium, for example, monosodium glutamate (MSG) contains 12 percent sodium. Additionally, baking soda, baking powder, disodium phosphate, sodium alginate, and sodium nitrate or nitrite contain a significant proportion of sodium as well. When you see a food’s Nutrition Facts label, you can check the ingredients list to identify the source of the added sodium. Various claims about the sodium content in foods must be in accordance with Food and Drug Administration (FDA) regulations (Table 3.4 “Food Packaging Claims Regarding Sodium”). Table 3.4 Food Packaging Claims Regarding Sodium Claim Meaning Sodium is reduced by at “Light in Sodium” or “Low in Sodium” least 50 percent No s

In [17]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

|   | page_number | page_char_count | page_word_count | page_sentence_count | page_token_count | page_sentence_count_spacy | num_chunks |
|---|---|---|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 604.5 | 1121.18 | 201.3 | 10.55 | 279.92 | 10.58 | 1.56 |
| std | 348.86 | 552.32 | 100.37 | 6.58 | 138.08 | 6.6 | 0.68 |
| min | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 302.75 | 741.5 | 130.0 | 5.0 | 185.0 | 5.0 | 1.0 |
| 50% | 604.5 | 1191.5 | 214.0 | 10.0 | 297.5 | 10.0 | 1.0 |
| 75% | 906.25 | 1572.5 | 282.0 | 15.0 | 393.0 | 15.0 | 2.0 |
| max | 1208.0 | 2271.0 | 441.0 | 30.0 | 567.0 | 30.0 | 3.0 |


### Splitting each chunk into its own item:

We'd like to embed each chunk of sentences into its own numerical representation. This gives us a good level of granularity: we can trace an answer back to the specific text sample that was retrieved for it.

In [18]:
import re 

pages_and_chunks = []
for item in tqdm(pages_and_texts):
    for sentence_chunk in item["sentence_chunks"]:
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]

        # Join the sentences together into a paragraph-like structure => 1 paragraph
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk) # ".A" => ". A" 


        chunk_dict["sentence_chunk"] = joined_sentence_chunk

        # Get some stats:
        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len(joined_sentence_chunk.split(" "))
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4  # Rough estimate: 1 token ~= 4 characters
        pages_and_chunks.append(chunk_dict)

len(pages_and_chunks)

  0%|          | 0/1208 [00:00<?, ?it/s]

1880

In [19]:
random.sample(pages_and_chunks, k=1)

[{'page_number': 875,
  'sentence_chunk': 'However, in developing nations where HIV infection rates are high and acceptable infant formula can be difficult to find, many newborns would be deprived of the nutrients they need to develop and grow. Also, inappropriate or contaminated infant formulas cause 1.5 million infant deaths each year. As a result, the WHO recommends that women infected with HIV in the developing world should nurse Infancy | 833',
  'chunk_char_count': 408,
  'chunk_word_count': 66,
  'chunk_token_count': 102.0}]

In [20]:
df = pd.DataFrame(pages_and_chunks)
df.describe().round(2)

|   | page_number | chunk_char_count | chunk_word_count | chunk_token_count |
|---|---|---|---|---|
| count | 1880.0 | 1880.0 | 1880.0 | 1880.0 |
| mean | 631.35 | 719.29 | 110.01 | 179.82 |
| std | 348.81 | 437.6 | 70.02 | 109.4 |
| min | 1.0 | 12.0 | 3.0 | 3.0 |
| 25% | 325.0 | 315.0 | 43.75 | 78.75 |
| 50% | 640.0 | 728.5 | 111.0 | 182.12 |
| 75% | 939.0 | 1089.25 | 169.0 | 272.31 |
| max | 1208.0 | 1830.0 | 297.0 | 457.5 |


In [21]:
# Experimental: remove chunks of 30 tokens or fewer.
# Such short chunks (e.g. headers or page footers) are unlikely to contain useful information:
min_token_length = 30
pages_and_chunks_over_min_token_len = df[df["chunk_token_count"] > min_token_length].to_dict(orient="records")
pages_and_chunks_over_min_token_len[:5]

[{'page_number': 3,
  'sentence_chunk': 'Human Nutrition: 2020 Edition UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN NUTRITION PROGRAM ALAN TITCHENAL, SKYLAR HARA, NOEMI ARCEO CAACBAY, WILLIAM MEINKE-LAU, YA-YUN YANG, MARIE KAINOA FIALKOWSKI REVILLA, JENNIFER DRAPER, GEMADY LANGFELDER, CHERYL GIBBY, CHYNA NICOLE CHUN, AND ALLISON CALABRESE',
  'chunk_char_count': 308,
  'chunk_word_count': 42,
  'chunk_token_count': 77.0},
 {'page_number': 4,
  'sentence_chunk': 'Human Nutrition: 2020 Edition by University of Hawai‘i at Mānoa Food Science and Human Nutrition Program is licensed under a Creative Commons Attribution 4.0 International License, except where otherwise noted.',
  'chunk_char_count': 210,
  'chunk_word_count': 30,
  'chunk_token_count': 52.5},
 {'page_number': 5,
  'sentence_chunk': 'Contents Preface xxv University of Hawai‘i at Mānoa Food Science and Human Nutrition Program and Human Nutrition Program About the Contributors xxvi University of Hawai‘i at Mānoa Food Sc

### Embedding our Chunks:

Machines understand numbers, so we want to turn our text chunks into a useful numerical representation: embeddings. The best part of an embedding is that it is a learned representation.


In [68]:
!pip install sentence-transformers




In [22]:
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2")

# Create a list of sentences:
sentences = ["The sentence Transformer library provides an easy way to create embeddings.", "Sentences can be embedded one by one or in a list.",
             "I like horses."]

# Sentences are encoded/embedded with model.encode()
embeddings = embedding_model.encode(sentences=sentences)
embedding_dict = dict(zip(sentences, embeddings))

# See the embeddings: 
for sentence, embedding in embedding_dict.items():
    print(f"Sentence: {sentence}")
    print(f"Embedding:{embedding}")
    print("")




Sentence: The sentence Transformer library provides an easy way to create embeddings.
Embedding:[-3.44285630e-02  2.95328554e-02 -2.33643241e-02  5.57257086e-02
 -2.19098385e-02 -6.47061085e-03  1.02849910e-02 -6.57804683e-02
  2.29717176e-02 -2.61120796e-02  3.80421393e-02  5.61401844e-02
 -3.68746854e-02  1.52788078e-02  4.37021032e-02 -5.19723073e-02
  4.89479080e-02  3.58103076e-03 -1.29750865e-02  3.54383606e-03
  4.23261933e-02  3.52607071e-02  2.49401405e-02  2.99177393e-02
 -1.99381579e-02 -2.39752904e-02 -3.33368522e-03 -4.30449843e-02
  5.72014228e-02 -1.32517796e-02 -3.54477838e-02 -1.13936039e-02
  5.55561744e-02  3.61100724e-03  8.88526984e-07  1.14026805e-02
 -3.82230096e-02 -2.43551331e-03  1.51313953e-02 -1.32638845e-04
  5.00659458e-02 -5.50877005e-02  1.73444860e-02  5.00958674e-02
 -3.75959426e-02 -1.04463352e-02  5.08322828e-02  1.24861319e-02
  8.67376179e-02  4.64143679e-02 -2.10690070e-02 -3.90251353e-02
  1.99691602e-03 -1.42346118e-02 -1.86795220e-02  2.8266970

In [23]:
embeddings[0].shape

(768,)

In [24]:
embedding = embedding_model.encode("My favorite animal is the cow")

In [29]:
from tqdm.auto import tqdm
embedding_model.to("mps") # to cuda if on gpu

for item in tqdm(pages_and_chunks_over_min_token_len):
    item["embedding"] = embedding_model.encode([item["sentence_chunk"]])

  0%|          | 0/1710 [00:00<?, ?it/s]

In [31]:
# We may now use batch mode:
text_chunks = [item["sentence_chunk"] for item in pages_and_chunks_over_min_token_len]
text_chunks[419]

print(len(text_chunks))

1710


In [32]:
%%time

# Embed all text chunks in batches
text_chunk_embeddings = embedding_model.encode(text_chunks,
                                               batch_size=32,
                                               convert_to_tensor=True)

text_chunk_embeddings

CPU times: user 1min 14s, sys: 13.6 s, total: 1min 27s
Wall time: 42.4 s


tensor([[ 0.0674,  0.0902, -0.0051,  ..., -0.0221, -0.0232,  0.0126],
        [ 0.0552,  0.0592, -0.0166,  ..., -0.0120, -0.0103,  0.0227],
        [ 0.0268,  0.0327, -0.0179,  ..., -0.0062,  0.0247,  0.0328],
        ...,
        [ 0.0771,  0.0098, -0.0122,  ..., -0.0409, -0.0752, -0.0241],
        [ 0.1030, -0.0165,  0.0083,  ..., -0.0574, -0.0283, -0.0295],
        [ 0.0864, -0.0125, -0.0113,  ..., -0.0522, -0.0337, -0.0299]],
       device='mps:0')

In [33]:
## Save the embedding to a file:
text_chunks_and_embeddings_df = pd.DataFrame(pages_and_chunks_over_min_token_len)
embeddings_df_save_path = "text_chunk_embeddings_df.csv"
text_chunks_and_embeddings_df.to_csv(embeddings_df_save_path, index=False)

In [54]:
# Import the saved file and view it
text_chunks_and_embeddings_df_load = pd.read_csv(embeddings_df_save_path)
text_chunks_and_embeddings_df_load.head()

|   | page_number | sentence_chunk | chunk_char_count | chunk_word_count | chunk_token_count | embedding |
|---|---|---|---|---|---|---|
| 0 | 3 | Human Nutrition: 2020 Edition UNIVERSITY OF HA... | 308 | 42 | 77.0 | [[ 6.74242601e-02 9.02282000e-02 -5.09547163e... |
| 1 | 4 | Human Nutrition: 2020 Edition by University of... | 210 | 30 | 52.5 | [[ 5.52156009e-02 5.92139773e-02 -1.66167282e... |
| 2 | 5 | Contents Preface xxv University of Hawai‘i at ... | 766 | 116 | 191.5 | [[ 2.67928392e-02 3.26878279e-02 -1.79278739e... |
| 3 | 6 | Lifestyles and Nutrition 21 University of Hawa... | 941 | 144 | 235.25 | [[ 6.98495582e-02 3.41225788e-02 -9.41189937e... |
| 4 | 7 | The Cardiovascular System 82 University of Haw... | 998 | 152 | 249.5 | [[ 3.05138230e-02 -7.97409657e-03 1.29058948e... |


If your embedding database is really large (100k-1M+ embeddings), you might want to use a vector database for storage.

## 2. RAG - Search and Answer 

Goal: retrieve passages relevant to a query and use them to augment the input to an LLM, so it can generate an output based on those passages.

### Similarity Search 
Embeddings can be used for almost any type of data: images, sound, text and so on can all be turned into embeddings. Searching by comparing embeddings with each other is called semantic search. In our case we want to query our nutrition textbook passages based on semantics (or "vibe") rather than exact wording: a query should retrieve the relevant passages even if they do not contain the exact words of the query.

In [66]:
import random

import torch
import numpy as np 
import pandas as pd

device = "mps"

# Import texts and embedding df
text_chunks_and_embedding_df = pd.read_csv("text_chunk_embeddings_df.csv")

# Convert embedding column back to np.array (it got converted to string when it got saved to CSV)
text_chunks_and_embedding_df["embedding"] = text_chunks_and_embedding_df["embedding"].apply(lambda x: np.fromstring(x.strip("[]"), sep=" "))

# Convert texts and embedding df to list of dicts
pages_and_chunks = text_chunks_and_embedding_df.to_dict(orient="records")

# Convert embeddings to torch tensor and send to device (note: NumPy arrays are float64, torch tensors are float32 by default)
embeddings = torch.tensor(np.array(text_chunks_and_embedding_df["embedding"].tolist()), dtype=torch.float32).to(device)
embeddings.shape

torch.Size([1710, 768])

In [67]:
# Create model 
from sentence_transformers import util, SentenceTransformer

embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2",device="mps")



Let's create a semantic search pipeline: 
We want to search for macronutrient functions and get back the relevant passages from our textbook.

1. Define a query string.
2. Turn the query string into an embedding.
3. Compute the dot product or cosine similarity between the text embeddings and the query embedding.
4. Sort the results from step 3 in descending order.
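
Before running this with real embeddings, the four steps can be sketched in plain Python on hand-written 2D vectors. This is only an illustration: the vectors are made up, and `dot` and `top_k` are hypothetical helpers rather than functions from any library used in this notebook.

```python
def dot(a: list[float], b: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int) -> list[tuple[float, int]]:
    """Score every document against the query and return the k best (score, index) pairs."""
    scores = [(dot(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scores.sort(key=lambda pair: pair[0], reverse=True)  # Highest score first
    return scores[:k]


# Made-up "embeddings" for three documents and one query.
doc_vecs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
query_vec = [0.8, 0.6]

results = top_k(query_vec, doc_vecs, k=2)
for score, idx in results:
    print(f"index={idx} score={score:.2f}")
```

The returned indices point back into the list of documents, which is how the scores are mapped to text chunks.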

In [76]:
# 1. Define the query: 
query = "Macronutrient functions"
print(f"Query: {query}")

# 2. Embed the query 
# Note: embed the query with the same model we used to embed the documents:
query_embedding = embedding_model.encode(query, convert_to_tensor=True)

# 3. Get similarity scores with the dot product (use cosine similarity if the model outputs are not normalised):
from time import perf_counter as timer

start_time = timer()
dot_scores = util.dot_score(a=query_embedding, b=embeddings)[0]
end_time = timer()
print(f"[INFO] Time taken to get scores on {len(embeddings)} embeddings: {end_time-start_time:.5f} seconds")

# 4. Get the top-k results:
top_results_dot_product = torch.topk(dot_scores, k=5)
top_results_dot_product

Query: Macronutrient functions
[INFO] Time taken to get scores on 1710 embeddings: 0.00022 seconds


torch.return_types.topk(
values=tensor([0.6843, 0.6717, 0.6484, 0.6478, 0.6377], device='mps:0'),
indices=tensor([42, 47, 51, 41, 52], device='mps:0'))

The top-k result gives us indices, not text. We use those indices to look up the corresponding text chunks in `pages_and_chunks`.

In [77]:
pages_and_chunks[47]['sentence_chunk']

'Water There is one other nutrient that we must have in large quantities: water. Water does not contain carbon, but is composed of two hydrogens and one oxygen per molecule of water. More than 60 percent of your total body weight is water. Without it, nothing could be transported in or out of the body, chemical reactions would not occur, organs would not be cushioned, and body temperature would fluctuate widely. On average, an adult consumes just over two liters of water per day from food and drink combined. Since water is so critical for life’s basic processes, the amount of water input and output is supremely important, a topic we will explore in detail in Chapter 4. Micronutrients Micronutrients are nutrients required by the body in lesser amounts, but are still essential for carrying out bodily functions. Micronutrients include all the essential minerals and vitamins. There are sixteen essential minerals and thirteen vitamins (See Table 1.1 “Minerals and Their Major Functions” and 

We can see that searching over embeddings is very fast, even with an exhaustive search. But with 10M+ embeddings, you would likely want to create an index. An index works like the letters in a dictionary: to look up "duck", you start at "d", then move to the words beginning with "du", and so on, which narrows down the search. A popular indexing library for vector search is **faiss**.


At that scale we would start using approximate nearest neighbour search.
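
To illustrate the index idea in miniature, the toy sketch below groups vectors into buckets around a few centroids and, at query time, scores only the bucket closest to the query. This is an assumption-heavy illustration and not the faiss API: the centroids are picked by hand (real libraries typically learn them from the data), and all function names are hypothetical.

```python
def dot(a: list[float], b: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def nearest_centroid(vec: list[float], centroids: list[list[float]]) -> int:
    """Index of the centroid with the highest dot product to vec."""
    return max(range(len(centroids)), key=lambda i: dot(vec, centroids[i]))


def build_index(doc_vecs: list[list[float]], centroids: list[list[float]]) -> dict[int, list[int]]:
    """Group document ids into one bucket per centroid."""
    buckets: dict[int, list[int]] = {i: [] for i in range(len(centroids))}
    for doc_id, vec in enumerate(doc_vecs):
        buckets[nearest_centroid(vec, centroids)].append(doc_id)
    return buckets


def search(query_vec, doc_vecs, centroids, buckets, k: int) -> list[int]:
    """Approximate search: score only the documents in the query's nearest bucket."""
    candidates = buckets[nearest_centroid(query_vec, centroids)]
    ranked = sorted(candidates, key=lambda i: dot(query_vec, doc_vecs[i]), reverse=True)
    return ranked[:k]


centroids = [[1.0, 0.0], [0.0, 1.0]]  # Hand-picked for illustration
doc_vecs = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.9], [0.1, 1.0]]

buckets = build_index(doc_vecs, centroids)
hits = search([1.0, 0.2], doc_vecs, centroids, buckets, k=2)
print(buckets)
print(hits)
```

The search is approximate because a good match that landed in a different bucket is never scored; that is the trade-off accepted in exchange for comparing against far fewer vectors.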

In [84]:
import textwrap

def print_wrapped(text: str, wrap_length: int = 80) -> None:
    """Prints text wrapped to at most wrap_length characters per line."""
    wrapped_text = textwrap.fill(text, wrap_length)
    print(wrapped_text)

In [85]:
print(f"Query: '{query}'\n")
print("Results:")
# Loop through the zipped-together scores and indices from torch.topk
for score, idx in zip(top_results_dot_product[0], top_results_dot_product[1]):
    print(f"Score: {score:.4f}")
    print("Text:")
    print_wrapped(pages_and_chunks[idx]["sentence_chunk"])
    print(f"Page Number: {pages_and_chunks[idx]['page_number']}")
    print("\n")
          

Query: 'Macronutrient functions'

Results:
Score: 0.6843
Text:
Macronutrients Nutrients that are needed in large amounts are called
macronutrients. There are three classes of macronutrients: carbohydrates,
lipids, and proteins. These can be metabolically processed into cellular energy.
The energy from macronutrients comes from their chemical bonds. This chemical
energy is converted into cellular energy that is then utilized to perform work,
allowing our bodies to conduct their basic functions. A unit of measurement of
food energy is the calorie. On nutrition food labels the amount given for
“calories” is actually equivalent to each calorie multiplied by one thousand. A
kilocalorie (one thousand calories, denoted with a small “c”) is synonymous with
the “Calorie” (with a capital “C”) on nutrition food labels. Water is also a
macronutrient in the sense that you require a large amount of it, but unlike the
other macronutrients, it does not yield calories. Carbohydrates Carbohydrates
are m

Note: We could potentially improve the order of these results with a re-ranking model: a model that takes the top semantic results (for example the top 25) and ranks them from most to least relevant, e.g. mxbai-rerank-large-v1.
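
The two-stage idea can be sketched as follows. The scoring function here is a deliberately crude, hypothetical stand-in (the fraction of passage words that appear in the query); a real re-ranking model would replace `toy_relevance`, and nothing below reflects the actual interface of mxbai-rerank-large-v1.

```python
from typing import Callable


def rerank(query: str, candidates: list[str], score_fn: Callable[[str, str], float], top_n: int) -> list[str]:
    """Re-order first-stage candidates by a second, more careful relevance score."""
    ranked = sorted(candidates, key=lambda passage: score_fn(query, passage), reverse=True)
    return ranked[:top_n]


def toy_relevance(query: str, passage: str) -> float:
    """Hypothetical stand-in for a re-ranking model: share of passage words found in the query."""
    query_words = set(query.lower().split())
    passage_words = passage.lower().split()
    return sum(word in query_words for word in passage_words) / max(len(passage_words), 1)


# Pretend these came back from the first-stage semantic search.
candidates = [
    "Water is essential for life",
    "Macronutrient functions include providing energy",
    "Functions of vitamins",
]
reranked = rerank("macronutrient functions", candidates, toy_relevance, top_n=2)
print(reranked)
```

The first stage stays cheap because it only compares embeddings, while the more expensive scorer only has to look at a handful of candidates.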

In [97]:
import pdfplumber
from PIL import Image, ImageEnhance, ImageOps

# Open the PDF and select a page (pages are 0-indexed, so index 47 - 1 is page 47)
pdf_path = "/Users/dennis/Documents/GitHub/rag/data/human-nutrition-text.pdf"
with pdfplumber.open(pdf_path) as reader:
    page = reader.pages[47 - 1]
    
    # Render page as image with higher resolution
    page_image = page.to_image(resolution=300).original
    
    # Enhance the image
    # Convert to grayscale to improve readability
    page_image = page_image.convert("L")
    # Increase contrast
    enhancer = ImageEnhance.Contrast(page_image)
    page_image = enhancer.enhance(1.5)
    # Add a border to make it visually distinct
    page_image = ImageOps.expand(page_image, border=20, fill="black")
    
    # Display the enhanced image
    page_image.show()




### Similarity Measures

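Two common similarity measures are the dot product and cosine similarity. With both, closer vectors get higher scores and vectors that are further apart get lower scores. Cosine similarity is the dot product of the two vectors after each has been normalised to unit length, so it depends only on direction and not on magnitude.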
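
The difference between the two measures can be seen in a small plain-Python sketch. The vectors are made up for illustration and the helper names are hypothetical; with real embeddings you would use tensor operations instead.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    """Dot product: sensitive to both direction and magnitude."""
    return sum(x * y for x, y in zip(a, b))


def norm(a: list[float]) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of the two vectors after scaling each to unit length."""
    return dot(a, b) / (norm(a) * norm(b))


a = [3.0, 4.0]
b = [6.0, 8.0]   # Same direction as a, twice the length
c = [4.0, -3.0]  # Perpendicular to a

print(dot(a, b), cosine_similarity(a, b))
print(dot(a, c), cosine_similarity(a, c))
```

Doubling the length of a vector doubles its dot product with `a` but leaves the cosine similarity unchanged. If the embeddings are already normalised to unit length, the two measures give the same scores, and the dot product is cheaper to compute.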