# RAG Pipeline from Scratch

Retrieval Augmented Generation (RAG) aims to retrieve relevant information and pass it to a Large Language Model (LLM) so that it can generate outputs based on that information.

* **Retrieval**: Find relevant information given a user query. For example, the query "What are the macronutrients and what do they do?" retrieves passages of text related to macronutrients from a nutrition textbook.
* **Augmented**: Take the relevant information from our data and augment our input (the prompt) to an LLM with it.
* **Generation**: Pass the query together with the retrieved context to an LLM so that it generates an output grounded in that context.
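
The three steps can be sketched in a few lines of plain Python. This is only a toy illustration under loud assumptions: `retrieve`, `augment` and `generate` are hypothetical names, retrieval is done by simple word overlap instead of vector search, and `generate` is a placeholder where a real LLM call would go.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase a text and return its set of words (punctuation removed)."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query: str, passages: list[str], top_k: int = 2) -> list[str]:
    """Rank passages by shared words with the query (stand-in for vector search)."""
    query_words = tokenize(query)
    ranked = sorted(
        passages,
        key=lambda passage: len(query_words & tokenize(passage)),
        reverse=True,
    )
    return ranked[:top_k]

def augment(query: str, context: list[str]) -> str:
    """Build a prompt that places the retrieved passages in front of the query."""
    context_block = "\n".join(f"- {passage}" for passage in context)
    return f"Answer using only this context:\n{context_block}\n\nQuestion: {query}"

def generate(prompt: str) -> str:
    """Placeholder for an LLM call (assumption: no real model is used here)."""
    return f"[an LLM would answer here, given a prompt of {len(prompt)} characters]"

# Made-up passages for illustration only
passages = [
    "Macronutrients are carbohydrates, lipids and proteins.",
    "Water makes up a large part of body weight.",
    "Proteins are macronutrients built from amino acids.",
]
query = "What are the macronutrients?"

top_passages = retrieve(query, passages)
prompt = augment(query, top_passages)
print(prompt)
print(generate(prompt))
```

The rest of this notebook replaces the word-overlap ranking with embeddings and the placeholder with a locally running LLM.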

Why RAG?
The main goal of RAG is to improve the generation output of LLMs.
1. Reduce hallucinations: LLMs are good at generating plausible-looking text, but that text may not be factual. RAG helps LLMs generate text that is grounded in factual source text.
2. Work with custom data: Many LLMs are trained on internet data, so they have a good understanding of language but do not know our own documents. RAG allows us to use custom data. For example, we can retrieve relevant snippets from customer support Q&A and then use an LLM to craft an answer from these snippets.
3. Why run it locally? We do not have to wait for data transfers to a remote service. Cost is another big factor: if we own the hardware, we can save a large amount of money. There is also no vendor lock-in when we run our own software and hardware; if OpenAI or another large internet company shuts down, we can still run the business. Finally, privacy: if you have sensitive documentation, you may not want to send it to an API, so you set up an LLM and run it on your own hardware.

## What are we going to build?
https://whimsical.com/simple-local-rag-workflow-39kToR3yNf7E8kY4sS2tjV

1. Open a PDF document.
2. Format the text of the PDF textbook so it is ready for an embedding model.
3. Embed all the chunks of text in the textbook, turning them into numerical representations (embeddings) that we can store for later.
4. Build a retrieval system that uses vector search to find the relevant chunks of text based on a query.
5. Create a prompt that incorporates the retrieved pieces of text.
6. Generate an answer to the query with an LLM, based on the retrieved passages of the textbook.


## 1. Document pre-processing and embedding creation 

Ingredients: a PDF document of choice (it could be any kind of document) and an embedding model of choice.
1. Import the PDF document.
2. Process the text for embedding (splitting it into chunks of sentences).
3. Embed the text chunks with the embedding model.
4. Save the embeddings to file for later (embeddings stored on disk stay usable for years, until you lose the file).

In [1]:
# Programmatically get the PDF document
import os 
import requests 

# Get PDF document:
pdf_path = "./data/human-nutrition-text.pdf"

# Download the PDF:
if not os.path.exists(pdf_path):
    print("[INFO] File does not exist, downloading....")

    # Enter the URL of the PDF: 
    url = "https://pressbooks.oer.hawaii.edu/humannutrition2/open/download?type=pdf"

    # Local Filename to save the file:
    filename = pdf_path

    # Send a GET request:
    response = requests.get(url=url)

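    # Check if the request was successful:
    if response.status_code == 200:
        # Make sure the target directory (e.g. ./data) exists before writing
        os.makedirs(os.path.dirname(filename) or ".", exist_ok=True)
        # Open file and save it (wb = write binary)
        with open(filename, "wb") as file:
            file.write(response.content)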
        print(f"[INFO] The file has been downloaded and saved as {filename}")
    else:
        print(f"[INFO] Failed to download the file. Status code: {response.status_code}")
else:
    print("[INFO] The file already exists")


[INFO] The file already exists




Now that we have the PDF, we can open it. PyMuPDF is often recommended for PDF reading because of its text formatting; the code below uses pdfplumber (MIT licence) instead, together with NLTK for word and sentence counts.

In [2]:
!pip install nltk



In [20]:
import pdfplumber # MIT Licence 
from tqdm.auto import tqdm
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# Download NLTK data if not already present
def download_nltk_data():
    try:
        nltk.data.find('tokenizers/punkt')
    except LookupError:
        print("Downloading NLTK punkt data...")
        nltk.download('punkt', quiet=True)

# Call the function to download NLTK data
download_nltk_data()

def text_formatter(text: str) -> str:
    """Performs basic formatting on text."""
    # Replace newlines with spaces and strip leading/trailing whitespace
    cleaned_text = text.replace('\n', ' ').strip()
    return cleaned_text

def open_and_read_pdf(pdf_path: str) -> list[dict]:
    """
    Opens a PDF file, reads its text content page by page, and collects statistics using NLTK.
    """
    pages_and_texts = []
    # Use a context manager so the PDF file handle is closed when we are done
    with pdfplumber.open(pdf_path) as reader:
        for page_number, page in tqdm(enumerate(reader.pages)):
            # extract_text() may return None for pages without text in some
            # pdfplumber versions, so fall back to an empty string
            text = page.extract_text() or ""
            text = text_formatter(text)

            # Use NLTK for tokenization (tokenize words and sentences)
            words = word_tokenize(text)
            sentences = sent_tokenize(text)

            pages_and_texts.append({
                "page_number": page_number + 1,
                "page_char_count": len(text),
                "page_word_count": len(words),
                "page_sentence_count": len(sentences),
                "page_token_count": len(text) // 4,  # Rough approximation: 1 token ~ 4 characters in English
                "text": text
            })
    return pages_and_texts

# Usage
pages_and_texts = open_and_read_pdf(pdf_path=pdf_path)



In [21]:
import random
random.sample(pages_and_texts, k=3)

[{'page_number': 493,
  'page_char_count': 51,
  'page_word_count': 12,
  'page_sentence_count': 3,
  'page_token_count': 12,
  'text': 'PART VIII CHAPTER 8. ENERGY Chapter 8. Energy | 451'},
 {'page_number': 816,
  'page_char_count': 1067,
  'page_word_count': 169,
  'page_sentence_count': 7,
  'page_token_count': 266,
  'text': 'Instead of… Replace with… Sweetened fruit Plain fat-free yogurt with fresh fruit yogurt Whole milk Low-fat or fat-free milk Cheese Low-fat or reduced-fat cheese Bacon or sausage Canadian bacon or lean ham Sweetened Minimally sweetened cereals with fresh fruit cereals Apple or berry Fresh apple or berries pie Deep-fried Oven-baked French fries or sweet potato baked fries French fries Fried vegetables Steamed or roasted vegetables Sugary sweetened soft Seltzer mixed with 100 percent fruit juice drinks Recipes that call Experiment with reducing amount of sugar and for sugar adding spices (cinnamon, nutmeg, etc…) Source: Food Groups. US Department of Agriculture.

In [22]:
import pandas as pd 

df = pd.DataFrame(pages_and_texts)
df.head()

| | page_number | page_char_count | page_word_count | page_sentence_count | page_token_count | text |
|---|---|---|---|---|---|---|
| 0 | 1 | 29 | 5 | 1 | 7 | Human Nutrition: 2020 Edition |
| 1 | 2 | 0 | 0 | 0 | 0 | |
| 2 | 3 | 308 | 55 | 1 | 77 | Human Nutrition: 2020 Edition UNIVERSITY OF HA... |
| 3 | 4 | 210 | 35 | 1 | 52 | Human Nutrition: 2020 Edition by University of... |
| 4 | 5 | 766 | 130 | 3 | 191 | Contents Preface xxv University of Hawai‘i at ... |


In [23]:
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count | page_token_count |
|---|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 604.5 | 1121.18 | 201.3 | 10.55 | 279.92 |
| std | 348.86 | 552.32 | 100.37 | 6.58 | 138.08 |
| min | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 302.75 | 741.5 | 130.0 | 5.0 | 185.0 |
| 50% | 604.5 | 1191.5 | 214.0 | 10.0 | 297.5 |
| 75% | 906.25 | 1572.5 | 282.0 | 15.0 | 393.0 |
| max | 1208.0 | 2271.0 | 441.0 | 30.0 | 567.0 |


The token count matters because:
1. Embedding models cannot handle an unlimited number of tokens.
2. LLMs cannot handle an unlimited number of tokens either.

For example, an embedding model may have been trained to embed sequences of up to 384 tokens. We will use 'all-mpnet-base-v2' to start off.

As for LLMs, their context window only accepts a limited number of tokens.
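
To make the limit concrete, here is a minimal sketch that flags texts whose estimated token count exceeds a budget. It reuses the rough rule of thumb from the page statistics (about 4 characters per token); `estimate_tokens`, `over_budget` and the sample texts are hypothetical, and the 384-token budget is the example limit mentioned above rather than a measured value.

```python
def estimate_tokens(text: str) -> int:
    """Estimate the token count with the ~4 characters per token rule of thumb."""
    return len(text) // 4

def over_budget(texts: list[str], max_tokens: int = 384) -> list[int]:
    """Return the indices of texts whose estimated token count exceeds max_tokens."""
    return [i for i, text in enumerate(texts) if estimate_tokens(text) > max_tokens]

# Made-up texts: a short one, one above the budget, one exactly at the budget
sample_texts = ["short page", "x" * 1600, "y" * 1536]
flagged = over_budget(sample_texts)
print(flagged)
```

In the statistics above, the largest page has an estimated 567 tokens, so whole pages would not always fit within such a budget. That is one motivation for splitting pages into smaller chunks.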

### Further Text Processing (Splitting pages into sentences)

We can split the text of each page into sentences and later group them, for example into groups of ten sentences. We can do this using an NLP library (spaCy or NLTK).

In [33]:
from spacy.lang.en import English 
# We create a pipeline here: 
nlp = English()

# Add a sentencizer pipeline: (Turns text into sentences)
# https://spacy.io/api/sentencizer
nlp.add_pipe("sentencizer")

# Create a document instance:
doc = nlp("This is a sentence. This is another sentence. I like elephants.")
assert len(list(doc.sents)) == 3
list(doc.sents)

[This is a sentence., This is another sentence., I like elephants.]

In [34]:
for item in tqdm(pages_and_texts): # is a dict
    item["sentences"] = list(nlp(item["text"]).sents)

    # Make sure all the sentences are strings (the default type is a spaCy Span)
    item["sentences"] = [str(sentence) for sentence in item["sentences"]]

    # Count the sentences 
    item["page_sentence_count_spacy"] = len(item["sentences"])


In [35]:
random.sample(pages_and_texts, k=1)

[{'page_number': 417,
  'page_char_count': 1070,
  'page_word_count': 188,
  'page_sentence_count': 10,
  'page_token_count': 267,
  'text': 'Image by Annie Spratt on unspash.com / CC0 Protein Denaturation: Unraveling the Fold When a cake is baked, the proteins are denatured. Denaturation refers to the physical changes that take place in a protein exposed to abnormal conditions in the environment. Heat, acid, high salt concentrations, alcohol, and mechanical agitation can cause proteins to denature. When a protein denatures, its complicated folded structure unravels, and it becomes just a long strand of amino acids again. Weak chemical forces that hold tertiary and secondary protein structures together are broken when a protein is exposed to unnatural conditions. Because proteins’ function is dependent on their shape, denatured proteins are no longer functional. During cooking the applied heat causes proteins to vibrate. This destroys the weak bonds holding proteins in their complex sh

In [36]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count | page_token_count | page_sentence_count_spacy |
|---|---|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 604.5 | 1121.18 | 201.3 | 10.55 | 279.92 | 10.58 |
| std | 348.86 | 552.32 | 100.37 | 6.58 | 138.08 | 6.6 |
| min | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 302.75 | 741.5 | 130.0 | 5.0 | 185.0 | 5.0 |
| 50% | 604.5 | 1191.5 | 214.0 | 10.0 | 297.5 | 10.0 |
| 75% | 906.25 | 1572.5 | 282.0 | 15.0 | 393.0 | 15.0 |
| max | 1208.0 | 2271.0 | 441.0 | 30.0 | 567.0 | 30.0 |


### Chunking our sentences together:

The concept of splitting larger pieces of text into smaller ones is often referred to as text splitting or chunking. There is no single correct way of doing this. We may also want a certain overlap between adjacent chunks. There are libraries that help us do this.

We chunk our text for three reasons:

1. It makes the text easier to filter (smaller groups of text are easier to inspect than larger ones).
2. Our text chunks can fit into the embedding model.
3. The context passed to an LLM can be more specific and focused.
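
Overlap is mentioned above but not used in this notebook, so here is a minimal sketch of what it could look like. `split_list_with_overlap` is a hypothetical helper, not part of any library: each chunk repeats the last `overlap` items of the previous one, and with `overlap=0` it reduces to plain fixed-size slicing.

```python
def split_list_with_overlap(input_list: list, slice_size: int, overlap: int = 0) -> list[list]:
    """Split a list into slices of at most slice_size items, where each slice
    starts with the last `overlap` items of the previous slice."""
    if slice_size <= 0:
        raise ValueError("slice_size must be positive")
    if not 0 <= overlap < slice_size:
        raise ValueError("overlap must be >= 0 and smaller than slice_size")

    step = slice_size - overlap
    chunks = []
    for start in range(0, len(input_list), step):
        chunks.append(input_list[start:start + slice_size])
        # Stop once a slice has reached the end, to avoid a trailing
        # chunk that only repeats already covered items
        if start + slice_size >= len(input_list):
            break
    return chunks

overlapping = split_list_with_overlap(list(range(25)), slice_size=10, overlap=2)
print(overlapping)
```

With 25 items, a slice size of 10 and an overlap of 2, the slices start at positions 0, 8 and 16. Overlap trades some duplicated text for a lower chance that a relevant passage is cut in half at a chunk boundary.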

In [43]:
# Define split size to turn groups of sentences into chunks
num_sentence_chunk_size = 10

# Split a list into consecutive slices of at most slice_size items, e.g. 20 -> [10, 10], 25 -> [10, 10, 5]
def split_list(input_list: list[str], slice_size: int) -> list[list[str]]:
    return [input_list[i:i+slice_size] for i in range(0, len(input_list), slice_size)]

test_list = list(range(25))
split_list(test_list, num_sentence_chunk_size)

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
 [20, 21, 22, 23, 24]]

In [44]:
# Loop through pages and text & split sentences into chunks: 
for item in tqdm(pages_and_texts):
    item["sentence_chunks"] = split_list(input_list=item["sentences"],
                                         slice_size=num_sentence_chunk_size)
    item["num_chunks"] = len(item["sentence_chunks"])


In [47]:
random.sample(pages_and_texts, k=1)

[{'page_number': 416,
  'page_char_count': 1152,
  'page_word_count': 205,
  'page_sentence_count': 12,
  'page_token_count': 288,
  'text': 'The Role of Proteins in Foods: Cooking and Denaturation UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN NUTRITION PROGRAM AND HUMAN NUTRITION PROGRAM In addition to having many vital functions within the body, proteins perform different roles in our foods by adding certain functional qualities to them. Protein provides food with structure and texture and enables water retention. For example, proteins foam when agitated. (Picture whisking egg whites to make angel food cake. The foam bubbles are what give the angel food cake its airy texture.) Yogurt is another good example of proteins providing texture. Milk proteins called caseins coagulate, increasing yogurt’s thickness. Cooked proteins add some color and flavor to foods as the amino group binds with carbohydrates and produces a brown pigment and aroma. Eggs are between 10 and 15 percent p

In [49]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count | page_token_count | page_sentence_count_spacy | num_chunks |
|---|---|---|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 604.5 | 1121.18 | 201.3 | 10.55 | 279.92 | 10.58 | 1.56 |
| std | 348.86 | 552.32 | 100.37 | 6.58 | 138.08 | 6.6 | 0.68 |
| min | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 302.75 | 741.5 | 130.0 | 5.0 | 185.0 | 5.0 | 1.0 |
| 50% | 604.5 | 1191.5 | 214.0 | 10.0 | 297.5 | 10.0 | 1.0 |
| 75% | 906.25 | 1572.5 | 282.0 | 15.0 | 393.0 | 15.0 | 2.0 |
| max | 1208.0 | 2271.0 | 441.0 | 30.0 | 567.0 | 30.0 | 3.0 |


### Splitting each chunk into its own item:

We'd like to embed each chunk of sentences into its own numerical representation. This gives us a good level of granularity: we can trace a result back to the specific text sample that was used by our model.

In [54]:
import re 

pages_and_chunks = []
for item in tqdm(pages_and_texts):
    for sentence_chunk in item["sentence_chunks"]:
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]

        # Join the sentences together into a single paragraph-like string
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()
        # Re-insert the space lost between joined sentences: ".A" => ". A"
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk)


        chunk_dict["sentence_chunk"] = joined_sentence_chunk

        # Get some stats:
        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len(joined_sentence_chunk.split(" "))
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4  # ~4 characters per token
        pages_and_chunks.append(chunk_dict)

len(pages_and_chunks)


1880
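
The join-and-repair step in the cell above is easy to check in isolation. The sketch below applies the same `"".join(...)` and regex to a few made-up sentences; `join_sentences` is a hypothetical wrapper used only for this illustration.

```python
import re

def join_sentences(sentences: list[str]) -> str:
    """Join sentences without a separator, then re-insert a space after any
    full stop that is directly followed by a capital letter."""
    joined = "".join(sentences).replace("  ", " ").strip()
    return re.sub(r"\.([A-Z])", r". \1", joined)

# Made-up sentences; the last one contains an abbreviation
example = [
    "Proteins foam when agitated.",
    "Yogurt is another example.",
    "See the U.S.A. guidelines.",
]
repaired = join_sentences(example)
print(repaired)
# Proteins foam when agitated. Yogurt is another example. See the U. S. A. guidelines.

# Alternative for comparison: join with a single space instead
spaced = " ".join(example)
print(spaced)
# Proteins foam when agitated. Yogurt is another example. See the U.S.A. guidelines.
```

The regex restores the sentence boundaries, but it also splits abbreviations such as "U.S.A.". Joining with a space avoids this, assuming the sentence strings carry no trailing whitespace of their own; note that it would change the chunk character counts reported in this notebook.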

In [62]:
random.sample(pages_and_chunks, k=1)

[{'page_number': 970,
  'sentence_chunk': 'Rivlin, RS. (2007). Keeping the Young-Elderly Healthy: Is It Too Late to Improve Our Health through Nutrition?. American Journal of Clinical Nutrition, 86, 1572S–6S. 928 | Older Adulthood: The Golden Years',
  'chunk_char_count': 205,
  'chunk_word_count': 31,
  'chunk_token_count': 51.25}]

In [58]:
df = pd.DataFrame(pages_and_chunks)
df.describe().round(2)

| | page_number | chunk_char_count | chunk_word_count | chunk_token_count |
|---|---|---|---|---|
| count | 1880.0 | 1880.0 | 1880.0 | 1880.0 |
| mean | 631.35 | 719.29 | 110.01 | 179.82 |
| std | 348.81 | 437.6 | 70.02 | 109.4 |
| min | 1.0 | 12.0 | 3.0 | 3.0 |
| 25% | 325.0 | 315.0 | 43.75 | 78.75 |
| 50% | 640.0 | 728.5 | 111.0 | 182.12 |
| 75% | 939.0 | 1089.25 | 169.0 | 272.31 |
| max | 1208.0 | 1830.0 | 297.0 | 457.5 |


In [67]:
# Experimental: remove chunks with 30 tokens or fewer.
# Such short chunks are unlikely to carry useful information.
min_token_length = 30
pages_and_chunks_over_min_token_len = df[df["chunk_token_count"] > min_token_length].to_dict(orient="records")
pages_and_chunks_over_min_token_len[:5]

[{'page_number': 3,
  'sentence_chunk': 'Human Nutrition: 2020 Edition UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN NUTRITION PROGRAM ALAN TITCHENAL, SKYLAR HARA, NOEMI ARCEO CAACBAY, WILLIAM MEINKE-LAU, YA-YUN YANG, MARIE KAINOA FIALKOWSKI REVILLA, JENNIFER DRAPER, GEMADY LANGFELDER, CHERYL GIBBY, CHYNA NICOLE CHUN, AND ALLISON CALABRESE',
  'chunk_char_count': 308,
  'chunk_word_count': 42,
  'chunk_token_count': 77.0},
 {'page_number': 4,
  'sentence_chunk': 'Human Nutrition: 2020 Edition by University of Hawai‘i at Mānoa Food Science and Human Nutrition Program is licensed under a Creative Commons Attribution 4.0 International License, except where otherwise noted.',
  'chunk_char_count': 210,
  'chunk_word_count': 30,
  'chunk_token_count': 52.5},
 {'page_number': 5,
  'sentence_chunk': 'Contents Preface xxv University of Hawai‘i at Mānoa Food Science and Human Nutrition Program and Human Nutrition Program About the Contributors xxvi University of Hawai‘i at Mānoa Food Sc
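
Before discarding short chunks, it can help to look at what would be dropped. The following is a minimal sketch in plain Python rather than pandas; `split_by_min_tokens` is a hypothetical helper and the three chunks are made-up stand-ins for entries of `pages_and_chunks`.

```python
def split_by_min_tokens(chunks: list[dict], min_tokens: float) -> tuple[list[dict], list[dict]]:
    """Separate chunks into those above the token threshold (kept)
    and those at or below it (dropped)."""
    kept = [chunk for chunk in chunks if chunk["chunk_token_count"] > min_tokens]
    dropped = [chunk for chunk in chunks if chunk["chunk_token_count"] <= min_tokens]
    return kept, dropped

# Made-up chunks for illustration only
toy_chunks = [
    {"sentence_chunk": "928 | Older Adulthood", "chunk_token_count": 5.25},
    {"sentence_chunk": "A longer passage about proteins.", "chunk_token_count": 51.25},
    {"sentence_chunk": "A chunk exactly at the threshold.", "chunk_token_count": 30.0},
]

kept, dropped = split_by_min_tokens(toy_chunks, min_tokens=30)
for chunk in dropped:
    print(f"dropped ({chunk['chunk_token_count']} tokens): {chunk['sentence_chunk']}")
```

Because the comparison is strict, a chunk with exactly 30 estimated tokens is dropped as well. Reading a sample of the dropped chunks is a cheap way to judge whether the threshold is too aggressive for a given document.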

### Embedding our Chunks:

