
# Create and run a RAG pipeline from scratch






## Import the PDF document

In [None]:
import os
import requests

path = "human-nutrition-text.pdf"

# Download the PDF if it is not already on disk
if not os.path.exists(path):
  print("[INFO] File does not exist, downloading...")

  # URL of the PDF
  url = "https://pressbooks.oer.hawaii.edu/humannutrition2e22/open/download?type=pdf"

  # Local file name to save the PDF under
  file_name = path

  # Send a GET request to the URL
  response = requests.get(url)

  # Check whether the request was successful
  if response.status_code == 200:
    # Write the response body to disk in binary mode
    with open(file_name, "wb") as file:
      file.write(response.content)
    print(f"The file has been downloaded and saved as {file_name}")

  else:
    print(f"Failed to download the file, status code: {response.status_code}")

else:
  print(f"file {path} exists")

file human-nutrition-text.pdf exists


In [None]:
import fitz
from tqdm.auto import tqdm

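def text_formatter(text: str) -> str:
  """Replace newlines with spaces and strip leading/trailing whitespace."""
  cleaned_text = text.replace("\n", " ").strip()
  return cleaned_text

def open_and_read_pdf(path: str) -> list[dict]:
  """Open a PDF and return one dict of text and rough statistics per page."""
  doc = fitz.open(path)
  pages_and_text = []
  for page_number, page in tqdm(enumerate(doc)):
    text = page.get_text()
    text = text_formatter(text=text)
    pages_and_text.append(
        {
            # Offset so the first chapter page (PDF index 34) is numbered 0;
            # note the printed footer on that page reads 1
            "page number": page_number - 34,
            "page_char_count": len(text),
            "page_word_count": len(text.split(" ")),
            "page_sentence_count_raw": len(text.split(". ")),
            # Rough estimate: 1 token is about 4 characters
            "page_token_count": len(text) / 4,
            "text": text,
        }
    )
  return pages_and_text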

pages_and_text = open_and_read_pdf(path=path)
pages_and_text[34:35]


[{'page number': 0,
  'page_char_count': 85,
  'page_word_count': 18,
  'page_sentence_count_raw': 3,
  'page_token_count': 21.25,
  'text': 'CHAPTER 1. BASIC  CONCEPTS IN NUTRITION  Chapter 1. Basic Concepts in Nutrition  |  1'}]
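
The per-page statistics can be reproduced by hand on the sample page text shown above. The following is a minimal sketch, assuming the same heuristics as `open_and_read_pdf` (split on single spaces, split on ". ", and roughly 4 characters per token). The helper name `page_stats` is hypothetical and not part of the notebook.

```python
def page_stats(text: str) -> dict:
    """Compute the same rough statistics the notebook stores for each page."""
    return {
        "page_char_count": len(text),
        # Splitting on a single space counts empty strings between double spaces
        "page_word_count": len(text.split(" ")),
        # Splitting on ". " is only a rough proxy for sentence boundaries
        "page_sentence_count_raw": len(text.split(". ")),
        # Rough estimate: 1 token is about 4 characters
        "page_token_count": len(text) / 4,
    }

# Text of the first chapter page, copied from the output above (double spaces kept)
sample = "CHAPTER 1. BASIC  CONCEPTS IN NUTRITION  Chapter 1. Basic Concepts in Nutrition  |  1"

stats = page_stats(sample)
print(stats)

# Splitting on any run of whitespace gives a stricter word count
print(len(sample.split()))
```

Because the text contains double spaces, `split(" ")` reports more "words" than `split()` does, so the stored word counts are best read as rough upper bounds.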

In [None]:
import pandas as pd

df = pd.DataFrame(pages_and_text)
df.head()

| | page number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | text |
|---|---|---|---|---|---|---|
| 0 | -34 | 18 | 3 | 1 | 4.5 | Human Nutrition 2e |
| 1 | -33 | 0 | 1 | 1 | 0.0 | |
| 2 | -32 | 336 | 57 | 1 | 84.0 | Human Nutrition 2e UNIVERSITY OF HAWAI‘I AT M... |
| 3 | -31 | 227 | 32 | 1 | 56.75 | Human Nutrition 2e by University of Hawai‘i at... |
| 4 | -30 | 593 | 123 | 3 | 148.25 | Contents Preface xiii About the Contributor... |


In [None]:
df.describe()

| | page number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count |
|---|---|---|---|---|---|
| count | 1198.0 | 1198.0 | 1198.0 | 1198.0 | 1198.0 |
| mean | 564.5 | 1138.198664 | 198.276294 | 10.583472 | 284.549666 |
| std | 345.977119 | 563.65834 | 96.520525 | 6.52552 | 140.914585 |
| min | -34.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 25% | 265.25 | 737.25 | 132.0 | 5.0 | 184.3125 |
| 50% | 564.5 | 1222.0 | 215.0 | 10.0 | 305.5 |
| 75% | 863.75 | 1596.75 | 272.0 | 15.0 | 399.1875 |
| max | 1163.0 | 2310.0 | 432.0 | 39.0 | 577.5 |


## Further text preprocessing (splitting pages into sentences)


1. So far we have done this by splitting on ". ", which is only a rough approximation.
2. We can also do this with an NLP library such as spaCy.
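
To see why splitting on ". " is only approximate, here is a small sketch on a made-up sentence (the example text is an assumption, not taken from the PDF). It compares the naive split with a slightly broader regular expression; both miscount, which is the motivation for reaching for a dedicated library.

```python
import re

# Made-up example with an abbreviation, a question and an exclamation (4 real sentences)
text = "Dr. Smith studies vitamin C. Is it essential? Yes! It supports immune function."

# Naive approach: split on a period followed by a space
naive = text.split(". ")

# Slightly broader sketch: split on whitespace that follows ., ? or !
regex_split = re.split(r"(?<=[.?!])\s+", text)

print(len(naive), naive)
print(len(regex_split), regex_split)
```

The naive split breaks after "Dr" and misses the "?" and "!" boundaries, while the regular expression catches those boundaries but still breaks after "Dr.".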



In [None]:
from spacy.lang.en import English

nlp = English()

nlp.add_pipe("sentencizer")

# Create a document instance as an example
doc = nlp("this is a sentence. another sentence. i like elephants")
assert len(list(doc.sents)) == 3

list(doc.sents)

[this is a sentence., another sentence., i like elephants]

In [None]:
for item in tqdm(pages_and_text):
  item["sentences"] = list(nlp(item["text"]).sents)

  # Make sure all sentences are strings (by default they are spaCy Span objects)
  item["sentences"] = [str(sentence) for sentence in item["sentences"]]

  # Count the sentences
  item["page_sentence_count_spacy"] = len(item["sentences"])


In [None]:
pages_and_text[123:124]

[{'page number': 89,
  'page_char_count': 802,
  'page_word_count': 143,
  'page_sentence_count_raw': 7,
  'page_token_count': 200.5,
  'text': 'prostate-specific antigen. The results of a blood test give the  concentrations of substances in a person’s blood and display the  normal ranges for a certain population group. Many factors, such as  physical activity level, diet, alcohol intake, and medicine intake can  influence a person’s blood-test levels and cause them to fall outside  the normal range, so results of blood tests outside the “normal”  range are not always indicative of health problems. The assessment  of multiple blood parameters aids in the diagnosis of disease risk  and is indicative of overall health status. See Table 2.2 “Blood Tests”  for a partial list of substances measured in a typical blood test. This  table notes only a few of the things that their levels tell us about  health.  90  |  The Cardiovascular System',
  'sentences': ['prostate-specific antigen.',
   '

In [None]:
df = pd.DataFrame(pages_and_text)
df.describe()

| | page number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | page_sentence_count_spacy |
|---|---|---|---|---|---|---|
| count | 1198.0 | 1198.0 | 1198.0 | 1198.0 | 1198.0 | 1198.0 |
| mean | 564.5 | 1138.198664 | 198.276294 | 10.583472 | 284.549666 | 10.380634 |
| std | 345.977119 | 563.65834 | 96.520525 | 6.52552 | 140.914585 | 6.275653 |
| min | -34.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 25% | 265.25 | 737.25 | 132.0 | 5.0 | 184.3125 | 5.0 |
| 50% | 564.5 | 1222.0 | 215.0 | 10.0 | 305.5 | 10.0 |
| 75% | 863.75 | 1596.75 | 272.0 | 15.0 | 399.1875 | 15.0 |
| max | 1163.0 | 2310.0 | 432.0 | 39.0 | 577.5 | 27.0 |


## Chunking our sentences together

Splitting large pieces of text into smaller ones is often referred to as text splitting or chunking.

There is no single correct way to do this.

We'll keep it simple and split each page into groups of 10 sentences.

Why do we do this?


1. So our texts are easier to filter (smaller groups of text are easier to inspect).
2. So our text chunks fit into the embedding model's context window.
3. So the context passed to an LLM is more specific and focused.
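
The second reason can be sketched with the rough token estimate used earlier (about 4 characters per token). The limit of 384 tokens below is an assumption chosen for illustration, since the actual limit depends on the embedding model used; `estimate_tokens` and `fits_context_window` are hypothetical helper names.

```python
# Assumed context window size, for illustration only (not from the notebook)
ASSUMED_MAX_TOKENS = 384

def estimate_tokens(text: str) -> float:
    """Rough estimate: 1 token is about 4 characters."""
    return len(text) / 4

def fits_context_window(text: str, max_tokens: int = ASSUMED_MAX_TOKENS) -> bool:
    """Return True if the estimated token count is within the assumed limit."""
    return estimate_tokens(text) <= max_tokens

# Stand-in strings with the median and maximum page lengths from the summary table
median_page = "a" * 1222
longest_page = "a" * 2310

print(estimate_tokens(median_page), fits_context_window(median_page))
print(estimate_tokens(longest_page), fits_context_window(longest_page))
```

Under this assumed limit a median-length page would fit but the longest page would not, which is one reason to embed smaller chunks rather than whole pages.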



In [None]:
# Define the split size used to turn groups of sentences into chunks
num_sentence_chunk_size = 10

# Split a list into consecutive sublists of at most slice_size items
def split_list(input_list: list[str], slice_size: int = num_sentence_chunk_size) -> list[list[str]]:
  return [input_list[i:i + slice_size]
          for i in range(0, len(input_list), slice_size)]
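
As a quick check of how this slicing behaves, the sketch below runs a standalone copy of `split_list` on 27 placeholder sentences (27 is the largest spaCy sentence count on any page in the summary table). The placeholder sentences are made up for illustration.

```python
import math

def split_list(input_list: list[str], slice_size: int = 10) -> list[list[str]]:
    """Split a list into consecutive sublists of at most slice_size items."""
    return [input_list[i:i + slice_size]
            for i in range(0, len(input_list), slice_size)]

# 27 placeholder sentences
sentences = [f"Sentence {n}." for n in range(27)]

chunks = split_list(sentences, slice_size=10)
chunk_sizes = [len(chunk) for chunk in chunks]
print(chunk_sizes)

# The number of chunks is the sentence count divided by the chunk size, rounded up
print(len(chunks) == math.ceil(len(sentences) / 10))

# An empty page produces no chunks at all
print(split_list([]))
```

Only the last chunk of a page can be shorter than the chunk size, and pages with no sentences contribute zero chunks.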



In [None]:
#loop through pages and texts and split sentences into chunks

for item in tqdm(pages_and_text):
  item["sentence_chunks"] = split_list(input_list=item["sentences"],
                                       slice_size=num_sentence_chunk_size)
  item["num_chunks"] = len(item["sentence_chunks"])


In [None]:
pages_and_text[900:901]

[{'page number': 866,
  'page_char_count': 977,
  'page_word_count': 168,
  'page_sentence_count_raw': 8,
  'page_token_count': 244.25,
  'text': 'often mimic their behavior and eating habits. Parents must continue  to help their school-aged children and adolescents establish healthy  eating habits and attitudes toward food. Their primary role is to  bring a wide variety of health-promoting foods into the home, so  that their children can make good choices.    Learning Activities  Technology Note: The second edition of the Human  Nutrition Open Educational Resource (OER) textbook  features interactive learning activities.  These activities are  available in the web-based textbook and not available in the  downloadable versions (EPUB, Digital PDF, Print_PDF, or  Open Document).  Learning activities may be used across various mobile  devices, however, for the best user experience it is strongly  recommended that users complete these activities using a  desktop or laptop computer.    An i

In [None]:
df = pd.DataFrame(pages_and_text)
df.describe().round(2)

| | page number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | page_sentence_count_spacy | num_chunks |
|---|---|---|---|---|---|---|---|
| count | 1198.0 | 1198.0 | 1198.0 | 1198.0 | 1198.0 | 1198.0 | 1198.0 |
| mean | 564.5 | 1138.2 | 198.28 | 10.58 | 284.55 | 10.38 | 1.52 |
| std | 345.98 | 563.66 | 96.52 | 6.53 | 140.91 | 6.28 | 0.64 |
| min | -34.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 25% | 265.25 | 737.25 | 132.0 | 5.0 | 184.31 | 5.0 | 1.0 |
| 50% | 564.5 | 1222.0 | 215.0 | 10.0 | 305.5 | 10.0 | 1.0 |
| 75% | 863.75 | 1596.75 | 272.0 | 15.0 | 399.19 | 15.0 | 2.0 |
| max | 1163.0 | 2310.0 | 432.0 | 39.0 | 577.5 | 27.0 | 3.0 |


## Splitting each chunk into its own item

We'd like to embed each chunk of sentences into its own numerical representation.

That gives us a good level of granularity, meaning we can trace back to the specific text sample that was used by our model.

In [None]:
import re

# Split each chunk into its own item
pages_and_chunks = []
for item in tqdm(pages_and_text):
  for sentence_chunk in item["sentence_chunks"]:
    chunk_dict = {}
    chunk_dict["page_number"] = item["page number"]

    # Join the list of sentences into one paragraph-like string
    joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()
    # Re-insert the space after a sentence-ending period, e.g. ".A" -> ". A"
    joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk)
    chunk_dict["sentence_chunk"] = joined_sentence_chunk

    # Rough statistics for the chunk (1 token is about 4 characters)
    chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
    chunk_dict["chunk_word_count"] = len(joined_sentence_chunk.split(" "))
    chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4

    pages_and_chunks.append(chunk_dict)

len(pages_and_chunks)


1820
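
The join-and-repair step in the loop above can be seen in isolation on a tiny chunk. This is a minimal sketch with made-up sentences (an assumption, not text from the PDF), using the same `"".join`, `replace`, and `re.sub` calls as the cell above.

```python
import re

# Made-up chunk of sentences; note the double space in the second sentence
sentence_chunk = [
    "Glucose is a sugar.",
    "Cells break it down  for energy.",
    "This is called glycolysis.",
]

# Joining with an empty string leaves no space between sentences
joined = "".join(sentence_chunk).replace("  ", " ").strip()
print(joined)

# Insert a space wherever a period is directly followed by a capital letter
fixed = re.sub(r"\.([A-Z])", r". \1", joined)
print(fixed)

# Caveat: the same pattern also splits abbreviations written with periods
print(re.sub(r"\.([A-Z])", r". \1", "U.S. diet"))
```

The pattern is a simple heuristic: it restores sentence spacing but will also insert spaces inside dotted abbreviations, so joining with `" ".join(...)` could be considered as an alternative.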

In [None]:
pages_and_chunks[400:401]

[{'page_number': 252,
  'sentence_chunk': 'Cells in our bodies break these bonds and capture the energy to perform cellular respiration. Cellular respiration is basically a controlled burning of glucose versus an uncontrolled burning. A cell uses many chemical reactions in multiple enzymatic steps to slow the release of energy (no explosion) and more efficiently capture the energy held within the chemical bonds in glucose. The first stage in the breakdown of glucose is called glycolysis. Glycolysis, or the splitting of glucose, occurs in an intricate series The Functions of Carbohydrates in the Body | 253',
  'chunk_char_count': 569,
  'chunk_word_count': 90,
  'chunk_token_count': 142.25}]