## RAG pipeline from Scratch

#### The goal of RAG is to retrieve relevant information and pass it to an LLM so that it can generate output based on that information

* Retrieval - Find information relevant to a query
* Augmented - Augment the input (prompt) to the LLM with that information
* Generation - Have the LLM generate output from the augmented input

## STEPS

1. Open a PDF document (or a collection of PDFs).
2. Format the text of the PDF textbook so it is ready for an embedding model.
3. Embed all of the chunks of text in the textbook, turning them into numerical embeddings that can be stored for later.
4. Build a retrieval system that uses vector search to find relevant chunks of text based on a query.
5. Create a prompt that incorporates the relevant pieces of text.
6. Generate an answer to the query with an LLM, based on the retrieved passages of the textbook.
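
To make the steps above concrete before working with the real PDF, here is a minimal toy sketch of the retrieve, augment, generate flow. It is not the notebook's implementation: the word-count "embedding", the helper names (`embed`, `cosine_similarity`, `retrieve`, `build_prompt`) and the three example chunks are all made up for illustration, and the final LLM call is left out.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 1) -> list[str]:
    """Retrieval: rank chunks by similarity to the query and keep the best top_k."""
    query_vec = embed(query)
    ranked = sorted(chunks,
                    key=lambda chunk: cosine_similarity(query_vec, embed(chunk)),
                    reverse=True)
    return ranked[:top_k]


def build_prompt(query: str, context: list[str]) -> str:
    """Augmentation: put the retrieved chunks into the prompt next to the query."""
    context_block = "\n".join(f"- {chunk}" for chunk in context)
    return f"Answer the query using only this context:\n{context_block}\n\nQuery: {query}"


# Made-up example chunks (not taken from the textbook).
chunks = [
    "Iron deficiency is a common cause of anemia.",
    "Vitamin C is found in citrus fruits.",
    "Older adults often need fewer calories per day.",
]
query = "What causes anemia?"

top_chunks = retrieve(query, chunks)
prompt = build_prompt(query, top_chunks)
print(prompt)
# Generation: this prompt would then be sent to an LLM (omitted here).
```

A real pipeline would swap `embed` for an embedding model and the sorted list for a vector search, but the three stages stay the same.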

In [2]:
# downloading requirements
!pip install PyMuPDF
!pip install tqdm

Collecting PyMuPDF
  Downloading PyMuPDF-1.24.5-cp310-none-manylinux2014_x86_64.whl.metadata (3.4 kB)
Collecting PyMuPDFb==1.24.3 (from PyMuPDF)
  Downloading PyMuPDFb-1.24.3-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.4 kB)
Downloading PyMuPDF-1.24.5-cp310-none-manylinux2014_x86_64.whl (3.5 MB)
Downloading PyMuPDFb-1.24.3-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (15.8 MB)
Installing collected packages: PyMuPDFb, PyMuPDF
Successfully installed PyMuPDF-1.24.5 PyMuPDFb-1.24.3


In [3]:
# Step 1
import os

# get pdf path
pdf_path = "/kaggle/input/nutrition-rag/human-nutrition-text.pdf"

In [4]:
# Step 2

import fitz
from tqdm.auto import tqdm 

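def text_formatter(text: str) -> str:
    """Perform minor formatting: replace newlines with spaces and strip the ends."""
    cleaned_text = text.replace("\n", " ").strip()

    return cleaned_text

def open_and_read_pdf(pdf_path: str) -> list[dict]:
    """Read a PDF page by page and collect the text plus basic statistics per page."""
    doc = fitz.open(pdf_path)
    pages_and_texts = []

    for page_number, page in tqdm(enumerate(doc)):
        text = page.get_text()
        text = text_formatter(text=text)
        # Shift the index so that PDF index 41 becomes page 0
        # (presumably to line up with the book's own page numbering)
        pages_and_texts.append({"page_number": page_number - 41,
                                "page_char_count": len(text),
                                "page_word_count": len(text.split(" ")),  # an empty page still counts as 1
                                "page_sentence_count_raw": len(text.split(".")),  # rough count: splits on every "."
                                "page_token_count": len(text) / 4,  # rough estimate: 1 token is ~4 characters
                                "text": text})

    return pages_and_texts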
    
pages_and_texts = open_and_read_pdf(pdf_path=pdf_path)
pages_and_texts[:2]

0it [00:00, ?it/s]

[{'page_number': -41,
  'page_char_count': 29,
  'page_word_count': 4,
  'page_sentence_count_raw': 1,
  'page_token_count': 7.25,
  'text': 'Human Nutrition: 2020 Edition'},
 {'page_number': -40,
  'page_char_count': 0,
  'page_word_count': 1,
  'page_sentence_count_raw': 1,
  'page_token_count': 0.0,
  'text': ''}]
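
Note that page -40 above is empty but still reports `page_word_count: 1` and `page_sentence_count_raw: 1`. That follows from how `str.split` behaves with an explicit separator. The short check below illustrates this; the `sample` string is a made-up example, not text from the PDF.

```python
empty_page = ""

# With an explicit separator, splitting an empty string returns [''], so the length is 1.
words_with_separator = len(empty_page.split(" "))
sentences_raw = len(empty_page.split("."))

# With no argument, split() splits on any whitespace and drops empty strings.
words_whitespace = len(empty_page.split())

print(words_with_separator, sentences_raw, words_whitespace)

# Splitting on "." also breaks at decimal points and counts the empty tail after the last ".".
sample = "The label lists 3.5 g of fat. That is low."
raw_sentence_count = len(sample.split("."))
print(raw_sentence_count)  # 4 pieces for what is really 2 sentences
```

This is why the raw counts are best read as rough estimates, and why a proper sentence splitter is useful later on.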

In [5]:
import pandas as pd

df = pd.DataFrame(pages_and_texts)
df.head()

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | text |
|---|---|---|---|---|---|---|
| 0 | -41 | 29 | 4 | 1 | 7.25 | Human Nutrition: 2020 Edition |
| 1 | -40 | 0 | 1 | 1 | 0.0 | |
| 2 | -39 | 320 | 54 | 1 | 80.0 | Human Nutrition: 2020 Edition UNIVERSITY OF ... |
| 3 | -38 | 212 | 32 | 3 | 53.0 | Human Nutrition: 2020 Edition by University of... |
| 4 | -37 | 797 | 145 | 3 | 199.25 | Contents Preface University of Hawai‘i at Mā... |


In [6]:
df.describe()

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count |
|---|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 562.5 | 1148.594371 | 198.889901 | 14.180464 | 287.148593 |
| std | 348.86387 | 560.441673 | 95.747365 | 9.544587 | 140.110418 |
| min | -41.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 25% | 260.75 | 762.75 | 134.0 | 8.0 | 190.6875 |
| 50% | 562.5 | 1232.5 | 215.0 | 13.0 | 308.125 |
| 75% | 864.25 | 1605.25 | 271.25 | 19.0 | 401.3125 |
| max | 1166.0 | 2308.0 | 429.0 | 82.0 | 577.0 |
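
These statistics help decide whether whole pages can be embedded directly. The sketch below compares the `page_token_count` figures from the `describe()` output above against a token limit. The limit of 384 is only an assumed placeholder for illustration; the real value depends on whichever embedding model is used.

```python
# Hypothetical context limit; check the documentation of your embedding model.
MAX_TOKENS = 384

# page_token_count statistics copied from the describe() output above.
page_token_stats = {
    "mean": 287.148593,
    "50%": 308.125,
    "75%": 401.3125,
    "max": 577.0,
}

# Flag which statistics exceed the assumed limit.
over_limit = {name: tokens > MAX_TOKENS for name, tokens in page_token_stats.items()}

for name, is_over in over_limit.items():
    status = "exceeds" if is_over else "fits within"
    print(f"{name}: {page_token_stats[name]:.1f} tokens {status} {MAX_TOKENS}")
```

Under that assumption the 75th percentile is already above the limit, so at least a quarter of the pages would be too long to embed whole. That is one motivation for splitting pages into smaller chunks.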


In [10]:
# Split each page's text into sentences using spaCy's rule-based sentencizer
from spacy.lang.en import English

nlp = English()
nlp.add_pipe("sentencizer")

for item in tqdm(pages_and_texts):
    item["sentences"] = list(nlp(item['text']).sents)
    
    # Make sure all sentences are strings
    item["sentences"] = [str(sentence) for sentence in item["sentences"]]
    
    # Count the sentences
    item["page_sentence_count_spacy"] = len(item["sentences"])

  0%|          | 0/1208 [00:00<?, ?it/s]

In [12]:
import random

random.sample(pages_and_texts, k=1)

[{'page_number': 664,
  'page_char_count': 974,
  'page_word_count': 137,
  'page_sentence_count_raw': 16,
  'page_token_count': 243.5,
  'text': 'occur in Africa and Southeast Asia. The World Bank states five key  interventions to combat anemia:4  • Provide at-risk groups with iron supplements.  • Fortify staple foods with iron and other micronutrients whose  deficiencies are linked with anemia.  • Prevent the spread of malaria and treat the hundreds of  millions with the disease.  • Provide insecticide-treated bed netting to prevent parasitic  infections.  • Treat parasitic-worm infestations in high-risk populations.  Also, there is ongoing investigation as to whether supplying iron  cookware to at-risk populations is effective in preventing and  treating iron-deficiency anemia.  Learning Activities  Technology Note: The second edition of the Human  4.\xa0Anemia. The World Bank.\xa0http:/ /web.worldbank.org/ WBSITE/EXTERNAL/TOPICS/ EXTHEALTHNUTRITIONANDPOPULATION/ EXTPHAAG/ 0,,conten

In [18]:
# STEP 3

# Chunking our sentences together
# Chunking - grouping sentences into groups of a fixed size

# We do this so that each text chunk fits into the embedding model's context window

num_sentence_chunk_size = 10

# e.g. a list of 25 sentences => chunks of sizes [10, 10, 5]

def split_list(input_list: list[str],
               slice_size: int = num_sentence_chunk_size) -> list[list[str]]:
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

# Testing
test_list = list(range(25))
split_list(test_list)

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
 [20, 21, 22, 23, 24]]
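
As a sanity check on the chunking logic, the number of chunks should always be the ceiling of `len(input_list) / slice_size`, and no item should be lost or reordered. The sketch below repeats `split_list` so it runs on its own; `expected_num_chunks` is a helper name introduced here for illustration.

```python
import math


def split_list(input_list: list, slice_size: int = 10) -> list[list]:
    """Same slicing logic as the notebook's split_list."""
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]


def expected_num_chunks(num_items: int, slice_size: int) -> int:
    """Number of chunks needed to hold num_items when each chunk holds slice_size."""
    return math.ceil(num_items / slice_size)


for num_items in (0, 1, 10, 11, 25, 82):
    items = list(range(num_items))
    chunks = split_list(items, slice_size=10)
    # The chunk count matches the ceiling formula.
    assert len(chunks) == expected_num_chunks(num_items, 10)
    # Flattening the chunks gives back the original list.
    assert [x for chunk in chunks for x in chunk] == items

# The final chunk can be very short, e.g. 11 items => chunk sizes [10, 1].
chunk_sizes = [len(chunk) for chunk in split_list(list(range(11)))]
print(chunk_sizes)
```

The short trailing chunk is worth keeping in mind: a page with 11 sentences yields one chunk holding a single sentence, which may carry little context on its own.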

In [20]:
# Loop through pages and texts and split sentences into chunks
for item in tqdm(pages_and_texts):
    item["sentence_chunks"] = split_list(input_list=item["sentences"],
                                         slice_size=num_sentence_chunk_size)
    item["num_chunks"] = len(item["sentence_chunks"])

  0%|          | 0/1208 [00:00<?, ?it/s]

In [21]:
random.sample(pages_and_texts, k=1)

[{'page_number': 925,
  'page_char_count': 1823,
  'page_word_count': 306,
  'page_sentence_count_raw': 26,
  'page_token_count': 455.75,
  'text': 'The Anorexia of Aging  In addition to concerns about obesity among senior citizens, being  underweight can be a major problem. A condition known as the  anorexia of aging is characterized by poor food intake, which results  in dangerous weight loss. This major health problem among the  elderly leads to a higher risk for immune deficiency, frequent falls,  muscle loss, and cognitive deficits. Reduced muscle mass and  physical activity mean that older adults need fewer calories per  day to maintain a normal weight. It is important for health care  providers to examine the causes for anorexia of aging among their  patients, which can vary from one individual to another.  Understanding why some elderly people eat less as they age can  help healthcare professionals assess the risk factors associated with  this condition. Decreased intake may be