# RAG Pipeline from Scratch

RAG (Retrieval-Augmented Generation) aims to retrieve relevant information and pass it to a Large Language Model (LLM) so that it can generate outputs based on that information.

* **Retrieval**: Find relevant information given a user query. E.g. "What are the macronutrients and what do they do?" -> retrieves passages of text related to macronutrients from a nutrition textbook.
* **Augmented**: Take the relevant information from our data and augment our input (prompt) to an LLM with it.
* **Generation**: Pass the augmented prompt (the result of the first two steps) to an LLM, which generates the output.
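
The three steps can be sketched in a few lines of plain Python. This is a toy illustration only, not the pipeline built later in this notebook: the passages, the `tokenize`, `retrieve` and `build_prompt` helpers, and the word-overlap scoring are all assumptions made for the example. A real system would use embeddings for retrieval and an actual LLM call for the generation step.

```python
import re

# Hypothetical mini "knowledge base"; real passages would come from the textbook.
PASSAGES = [
    "Macronutrients are carbohydrates, proteins, and fats.",
    "Water makes up a large part of body weight.",
    "Vitamins and minerals are micronutrients.",
]

def tokenize(text: str) -> set[str]:
    """Lowercase a string and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(query: str, passages: list[str], k: int = 1) -> list[str]:
    """Retrieval: rank passages by how many words they share with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        passages,
        key=lambda passage: len(query_words & tokenize(passage)),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query: str, context: list[str]) -> str:
    """Augmentation: place the retrieved passages in front of the question."""
    context_block = "\n".join(f"- {item}" for item in context)
    return f"Answer using only this context:\n{context_block}\n\nQuestion: {query}"

query = "What are the macronutrients?"
context = retrieve(query, PASSAGES, k=1)
prompt = build_prompt(query, context)

# Generation: `prompt` would now be sent to an LLM (not shown here).
print(prompt)
```

Word overlap is used here only because it needs no extra libraries; it stands in for the vector search described later.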

Why RAG? 
The main goal of RAG is to improve the generation output of LLMs.
1. Prevent hallucinations - LLMs are good at generating good-looking text, but that text may not be factual. RAG can help LLMs generate text that is grounded in factual sources.
2. Use custom data - Many LLMs are trained on internet data, so they have a good understanding of language in general. RAG lets us bring our own data on top of that. For example, we can retrieve relevant snippets from customer support Q&A documents and then use an LLM to craft an answer from those snippets.
3. Why run it locally? Speed: we do not have to wait for data transfers to an external API. Cost: if we own the hardware, we can save a large amount of money. No vendor lock-in: when we run our own software and hardware, we can keep the business running even if OpenAI or another large provider shuts down. Privacy: if you have sensitive documentation, you may not want to send it to an API, so you set up an LLM and run it on your own hardware.

## What are we going to build?
https://whimsical.com/simple-local-rag-workflow-39kToR3yNf7E8kY4sS2tjV

1. Open a PDF document.
2. Format the text of the PDF textbook so it is ready for an embedding model.
3. Embed all the chunks of text in the textbook, turning them into numerical representations (embeddings) that we can store for later.
4. Build a retrieval system that uses vector search to find the relevant chunks of text based on a query.
5. Create a prompt that incorporates the retrieved pieces of text.
6. Generate an answer to the query with an LLM, based on the retrieved passages of the textbook.
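
To make the vector search in step 4 more concrete, here is a minimal sketch using cosine similarity in plain Python. The 3-dimensional vectors are made up for illustration (real embedding models produce vectors with hundreds of dimensions), and the `cosine_similarity` and `top_k` helpers are hypothetical names, not part of any library used in this notebook.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Avoid division by zero for all-zero vectors
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 2) -> list[tuple[int, float]]:
    """Return (chunk_index, score) pairs for the k most similar chunks."""
    scores = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(chunk_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]

# Made-up "embeddings" for three text chunks and one query (assumption for the example).
chunk_embeddings = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.3, 0.1],
]
query_embedding = [1.0, 0.0, 0.0]

results = top_k(query_embedding, chunk_embeddings, k=2)
for index, score in results:
    print(f"chunk {index}: similarity {score:.3f}")
```

The chunks whose indices come back from `top_k` are the ones that would be inserted into the prompt in step 5.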


## 1. Document pre-processing and embedding creation 

Ingredients: a PDF document of choice (it could be any kind of document) and an embedding model of choice. 
1. Import the PDF document.
2. Process the text for embedding (splitting it into chunks of sentences).
3. Embed the text chunks with the embedding model.
4. Save the embeddings to a file for later (embeddings stay usable on disk for as long as the file is kept).
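
As a rough sketch of step 2, splitting a page's sentences into fixed-size chunks could look like the following. The `split_into_chunks` helper and the chunk size of 10 sentences are assumptions chosen for illustration; a suitable size depends on the token limit of the embedding model you pick.

```python
def split_into_chunks(sentences: list[str], chunk_size: int) -> list[list[str]]:
    """Group a list of sentences into consecutive chunks of at most chunk_size."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [sentences[i:i + chunk_size] for i in range(0, len(sentences), chunk_size)]

# Placeholder sentences; in the notebook these would come from sentence tokenization.
sentences = [f"Sentence {n}." for n in range(1, 26)]

chunks = split_into_chunks(sentences, chunk_size=10)

# Join each group back into one string, ready to be passed to an embedding model.
chunk_texts = [" ".join(chunk) for chunk in chunks]

print([len(chunk) for chunk in chunks])  # [10, 10, 5]
```

The last chunk is simply shorter when the number of sentences is not a multiple of the chunk size.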

In [20]:
# Programmatically get the PDF document
import os 
import requests 

# Get PDF document:
pdf_path = "./data/human-nutrition-text.pdf"

# Download the PDF:
if not os.path.exists(pdf_path):
    print("[INFO] File does not exist, downloading....")

    # Enter the URL of the PDF: 
    url = "https://pressbooks.oer.hawaii.edu/humannutrition2/open/download?type=pdf"

    # Local Filename to save the file:
    filename = pdf_path

    # Send a GET request:
    response = requests.get(url=url)

    # Check if the request was successful:
    if response.status_code == 200:
        # Make sure the target directory exists, then save the file (wb = write binary)
        os.makedirs(os.path.dirname(filename), exist_ok=True)
        with open(filename, "wb") as file:
            file.write(response.content)
        print(f"[INFO] The file has been downloaded and saved as {filename}")
    else:
        print(f"[INFO] Failed to download the file. Status code: {response.status_code}")
else:
    print("[INFO] The file already exists")


[INFO] The file already exists


Now that we have the PDF, we can open it. PyMuPDF is a popular choice for PDF reading and tends to give well-formatted text; the code below uses pdfplumber to extract the text and NLTK to count words and sentences.

In [25]:
!pip install nltk

Collecting nltk
  Downloading nltk-3.9.1-py3-none-any.whl.metadata (2.9 kB)
Downloading nltk-3.9.1-py3-none-any.whl (1.5 MB)
Installing collected packages: nltk
Successfully installed nltk-3.9.1


In [30]:
import pdfplumber
from tqdm.auto import tqdm
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# Download NLTK data if not already present
def download_nltk_data():
    try:
        nltk.data.find('tokenizers/punkt')
        nltk.data.find('tokenizers/punkt_tab')
    except LookupError:
        print("Downloading NLTK punkt data...")
        nltk.download('punkt', quiet=True)
        print("Downloading NLTK punkt_tab data...")
        nltk.download('punkt_tab', quiet=True)

# Call the function to download NLTK data
download_nltk_data()

def text_formatter(text: str) -> str:
    """Performs basic formatting on text."""
    # Replace newlines and tabs with spaces
    text = text.replace('\n', ' ').replace('\t', ' ')
    
    # Strip leading/trailing whitespace
    text = text.strip()
    
    return text

def open_and_read_pdf(pdf_path: str) -> list[dict]:
    """
    Opens a PDF file, reads its text content page by page, and collects statistics using NLTK.
    """
    pages_and_texts = []
    # Use a context manager so the PDF file handle is closed when we are done
    with pdfplumber.open(pdf_path) as reader:
        for page_number, page in tqdm(enumerate(reader.pages)):
            # extract_text() can return None for pages without text in some
            # pdfplumber versions, so fall back to an empty string
            text = page.extract_text() or ""
            text = text_formatter(text)

            # Use NLTK for tokenization
            words = word_tokenize(text)
            sentences = sent_tokenize(text)

            pages_and_texts.append({
                "page_number": page_number,
                "page_char_count": len(text),
                "page_word_count": len(words),
                "page_sentence_count": len(sentences),
                "page_token_count": len(text) // 4,  # Rough approximation: 1 token ~ 4 characters in English
                "text": text
            })
    return pages_and_texts

# Usage
pages_and_texts = open_and_read_pdf(pdf_path=pdf_path)


Downloading NLTK punkt data...
Downloading NLTK punkt_tab data...



[{'page_number': 307, 'page_char_count': 1627, 'page_word_count': 271, 'page_sentence_count': 14, 'page_token_count': 406, 'text': 'Journal concluded that all diets, (independent of carbohydrate, fat, and protein content) that incorporated an exercise regimen significantly decreased weight and waist circumference in obese 6 women. Some studies do provide evidence that in comparison to other diets, low-carbohydrate diets improve insulin levels and other risk factors for Type 2 diabetes and cardiovascular disease. The overall scientific consensus is that consuming fewer calories in a balanced diet will promote health and stimulate weight loss, with significantly better results achieved when combined with regular exercise. Health Benefits of Whole Grains in the Diet While excessive consumption of simple carbohydrates is potentially bad for your health, consuming more complex carbohydrates is extremely beneficial to health. There is a wealth of scientific evidence supporting that replacing

In [32]:
import random
random.sample(pages_and_texts, k=3)

[{'page_number': 825,
  'page_char_count': 185,
  'page_word_count': 31,
  'page_sentence_count': 2,
  'page_token_count': 46,
  'text': 'An interactive or media element has been excluded from this version of the text. You can view it online here: http://pressbooks.oer.hawaii.edu/ humannutrition2/?p=440 784 | Introduction'},
 {'page_number': 157,
  'page_char_count': 974,
  'page_word_count': 166,
  'page_sentence_count': 8,
  'page_token_count': 243,
  'text': 'water. Three electrolytes are more closely regulated than others: Na+, Ca++, and K+. The kidneys share pH regulation with the lungs and plasma buffers, so that proteins can preserve their three- dimensional conformation and thus their function. Learning Activities Technology Note: The second edition of the Human Nutrition Open Educational Resource (OER) textbook features interactive learning activities. These activities are available in the web-based textbook and not available in the downloadable versions (EPUB, Digital PDF, Pr

In [34]:
import pandas as pd 

df = pd.DataFrame(pages_and_texts)
df

| | page_number | page_char_count | page_word_count | page_sentence_count | page_token_count | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 29 | 5 | 1 | 7 | Human Nutrition: 2020 Edition |
| 1 | 1 | 0 | 0 | 0 | 0 | |
| 2 | 2 | 308 | 55 | 1 | 77 | Human Nutrition: 2020 Edition UNIVERSITY OF HA... |
| 3 | 3 | 210 | 35 | 1 | 52 | Human Nutrition: 2020 Edition by University of... |
| 4 | 4 | 766 | 130 | 3 | 191 | Contents Preface xxv University of Hawai‘i at ... |
| ... | ... | ... | ... | ... | ... | ... |
| 1203 | 1203 | 1649 | 298 | 18 | 412 | 39. Exercise 10.2 & 11.3 reused “Egg Oval Food... |
| 1204 | 1204 | 1585 | 296 | 10 | 396 | Images / Pixabay License; “Pumpkin Cartoon Ora... |
| 1205 | 1205 | 1679 | 310 | 13 | 419 | Flashcard Images Note: Most images in the flas... |
| 1206 | 1206 | 1696 | 306 | 13 | 424 | ShareAlike 11. Organs reused “Pancreas Organ A... |


In [35]:
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count | page_token_count |
|---|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 603.5 | 1121.18 | 201.3 | 10.55 | 279.92 |
| std | 348.86 | 552.32 | 100.37 | 6.58 | 138.08 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 301.75 | 741.5 | 130.0 | 5.0 | 185.0 |
| 50% | 603.5 | 1191.5 | 214.0 | 10.0 | 297.5 |
| 75% | 905.25 | 1572.5 | 282.0 | 15.0 | 393.0 |
| max | 1207.0 | 2271.0 | 441.0 | 30.0 | 567.0 |


The token count matters because:
1. Embedding models can only process a limited number of tokens per input.
2. LLMs have a limited context window, so they cannot accept an unlimited number of tokens either.
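
A quick way to act on this is to flag pages whose estimated token count exceeds a model's limit. The sketch below reuses the "1 token is roughly 4 characters" approximation from the code above; the limit of 384 tokens and the `estimate_tokens` helper are assumptions for illustration, so check the documentation of the embedding model you actually use. The two sample pages mimic the shortest title page (29 characters) and the longest page (2271 characters) from the statistics above.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: about 4 characters per token for English text."""
    return len(text) // 4

# Assumed limit for illustration only; real limits depend on the chosen model.
MAX_TOKENS = 384

# Stand-in pages: the title page text and a dummy page as long as the longest page.
pages = [
    {"page_number": 0, "text": "Human Nutrition: 2020 Edition"},
    {"page_number": 1, "text": "x" * 2271},
]

over_limit = [
    page["page_number"]
    for page in pages
    if estimate_tokens(page["text"]) > MAX_TOKENS
]

print(f"Pages over the {MAX_TOKENS}-token limit: {over_limit}")
```

Pages that end up in `over_limit` would need to be split into smaller chunks before embedding.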
