# Create and run a local RAG pipeline from scratch

`RAG` stands for Retrieval Augmented Generation.

The goal of RAG is to take information and pass it to an LLM so it can generate output based on that information.
* Retrieval - Find relevant information given a query.
* Augmented - Take the relevant information and use it to augment our input (prompt) to an LLM.
* Generation - Take the first two steps and pass them to an LLM for generative outputs.
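
The three steps above can be sketched as a toy pipeline. This is only an illustration under loud assumptions: the passages are made up, retrieval is plain word overlap rather than embeddings, and `generate` is a hypothetical stub standing in for a real LLM call.

```python
import string
from typing import List, Set

# Hypothetical mini knowledge base (not taken from the PDF used later)
passages = [
    "Vitamin C is a water-soluble vitamin found in citrus fruits.",
    "Peristalsis moves the food bolus down the esophagus.",
    "Caffeine is a widely consumed psychoactive substance.",
]

def tokenize(text: str) -> Set[str]:
    """Lowercase, strip punctuation, and split into a set of words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return set(cleaned.split())

def retrieve(query: str, passages: List[str], k: int = 1) -> List[str]:
    """Retrieval: rank passages by how many words they share with the query."""
    query_words = tokenize(query)
    ranked = sorted(passages,
                    key=lambda p: len(query_words & tokenize(p)),
                    reverse=True)
    return ranked[:k]

def augment(query: str, context: List[str]) -> str:
    """Augmentation: place the retrieved passages into the prompt."""
    context_block = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{context_block}\n\nQuestion: {query}"

def generate(prompt: str) -> str:
    """Generation: hypothetical stub, a real pipeline would call an LLM here."""
    return f"[LLM output for a prompt of {len(prompt)} characters]"

query = "Which fruits contain vitamin C?"
context = retrieve(query, passages)
prompt = augment(query, context)
answer = generate(prompt)
print(prompt)
```

The rest of this notebook replaces the word-overlap scoring with embedding similarity, which can match passages by meaning rather than by exact words.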

ChatGPT, when it retrieves documents or search results before answering, is a well-known example of RAG-style software.

## Why RAG?

The main goal of RAG is to improve the generation outputs of LLMs

Two primary improvements can be seen as:

1. **Reduce hallucinations** - LLMs are incredibly good at generating good-*looking* text, but that doesn't mean the text is factual. RAG can help LLMs generate information based on relevant passages that are factual.
2. **Work with custom data** - Many base LLMs are trained with internet-scale text data. This means they have a great ability to model language, but they often lack specific knowledge. RAG systems can provide LLMs with domain-specific data such as medical information or company documentation and thus customize their outputs to suit specific use cases.

The authors of the original RAG paper outlined these two points in their discussion:

> This work offers several positive societal benefits over previous work: the fact that it is more strongly grounded in real factual knowledge (in this case Wikipedia) makes it “hallucinate” less with generations that are more factual, and offers more control and interpretability. RAG could be employed in a wide variety of scenarios with direct benefit to society, for example by endowing it with a medical index and asking it open-domain questions on that topic, or by helping people be more effective at their jobs.


RAG can also be a much quicker solution to implement than fine-tuning an LLM on specific data.

In [1]:
# # Perform Google Colab installs (if running in Google Colab)
# import os

# if "COLAB_GPU" in os.environ:
#     print("[INFO] Running in Google Colab, installing requirements.")
#     !pip install -U torch # requires torch 2.1.1+ (for efficient sdpa implementation)
#     !pip install PyMuPDF # for reading PDFs with Python
#     !pip install tqdm # for progress bars
#     !pip install sentence-transformers # for embedding models
#     !pip install accelerate # for quantization model loading
#     !pip install bitsandbytes # for quantizing models (less storage space)
#     !pip install flash-attn --no-build-isolation # for faster attention mechanism = faster LLM inference

In [2]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


## Text processing and embedding creation

Ingredients:
* PDF document of choice (note: this could be almost any kind of document)
* Embedding model of choice

Steps:
1. Import PDF document
2. Process text for embedding
3. Embed text chunks with embedding model
4. Save embeddings to file for later

### Import PDF document from the web




In [3]:
import os
import requests

# Define the path where the PDF should be saved and the filename
pdf_directory = '/content/drive/MyDrive/rag/'
filename = "human-nutrition-text.pdf"
pdf_path = os.path.join(pdf_directory, filename)

# Function to download the PDF
def download_pdf(url, save_path):
    response = requests.get(url)
    if response.status_code == 200:
        with open(save_path, "wb") as file:
            file.write(response.content)
        print(f"The file has been downloaded and saved as {save_path}.")
    else:
        print(f"Failed to download the file. Status code: {response.status_code}")

# Check if the directory exists, if not, create it
if not os.path.exists(pdf_directory):
    os.makedirs(pdf_directory)

# Check if the file already exists in the specified directory
if not os.path.exists(pdf_path):
    print("File doesn't exist, downloading...")
    # URL of the PDF
    url = "https://pressbooks.oer.hawaii.edu/humannutrition2/open/download?type=pdf"
    download_pdf(url, pdf_path)
else:
    print(f"{filename} already exists in the specified directory.")


human-nutrition-text.pdf already exists in the specified directory.


In [4]:
!pip install PyMuPDF

Collecting PyMuPDF
  Downloading PyMuPDF-1.24.4-cp310-none-manylinux2014_x86_64.whl (3.5 MB)
Collecting PyMuPDFb==1.24.3 (from PyMuPDF)
  Downloading PyMuPDFb-1.24.3-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (15.8 MB)
Installing collected packages: PyMuPDFb, PyMuPDF
Successfully installed PyMuPDF-1.24.4 PyMuPDFb-1.24.3


In [5]:
import fitz  # PyMuPDF

from tqdm.auto import tqdm
from typing import List, Dict

def text_formatter(text: str) -> str:
    """ Perform formatting on text """
    text = text.replace("\n", " ").strip()
    return text

def open_and_read_pdf(pdf_path: str) -> List[Dict]:
    """ Open a PDF and return one dict of text and simple stats per page """
    doc = fitz.open(pdf_path)
    pages_and_text = []
    for page_num, page in tqdm(enumerate(doc), total=len(doc)):
        text = page.get_text()
        text = text_formatter(text)
        words = text.split()
        pages_and_text.append({
            # Offset so page_number lines up with the page numbers printed in the book
            "page_number": page_num - 41,
            "page_char_count": len(text),
            "page_word_count": len(words),
            # Rough simplification: 1 word is counted as 1 token here
            "page_token_count": len(words),
            "text": text
        })
    return pages_and_text

# Replace 'pdf_path' with the actual path to your PDF file
pdf_path = "/content/drive/MyDrive/rag/human-nutrition-text.pdf"
pages_and_text = open_and_read_pdf(pdf_path=pdf_path)
pages_and_text[:3]

  0%|          | 0/1208 [00:00<?, ?it/s]

[{'page_number': -41,
  'page_char_count': 29,
  'page_word_count': 4,
  'page_token_count': 4,
  'text': 'Human Nutrition: 2020 Edition'},
 {'page_number': -40,
  'page_char_count': 0,
  'page_word_count': 0,
  'page_token_count': 0,
  'text': ''},
 {'page_number': -39,
  'page_char_count': 320,
  'page_word_count': 42,
  'page_token_count': 42,
  'text': 'Human Nutrition: 2020  Edition  UNIVERSITY OF HAWAI‘I AT MĀNOA  FOOD SCIENCE AND HUMAN  NUTRITION PROGRAM  ALAN TITCHENAL, SKYLAR HARA,  NOEMI ARCEO CAACBAY, WILLIAM  MEINKE-LAU, YA-YUN YANG, MARIE  KAINOA FIALKOWSKI REVILLA,  JENNIFER DRAPER, GEMADY  LANGFELDER, CHERYL GIBBY, CHYNA  NICOLE CHUN, AND ALLISON  CALABRESE'}]

In [6]:
for page in pages_and_text:
    print(page)
    break


{'page_number': -41, 'page_char_count': 29, 'page_word_count': 4, 'page_token_count': 4, 'text': 'Human Nutrition: 2020 Edition'}


Check a random sample of the extracted pages

In [7]:
import random
random.sample(pages_and_text, k = 3)

[{'page_number': 73,
  'page_char_count': 1134,
  'page_word_count': 184,
  'page_token_count': 184,
  'text': 'Image by  Allison  Calabrese /  CC BY 4.0  once it begins. As you swallow, the bolus is pushed from the mouth  through the pharynx and into a muscular tube called the esophagus.  As the bolus travels through the pharynx, a small flap called the  epiglottis closes to prevent choking by keeping food from going  into the trachea. Peristaltic contractions also known as peristalsis in  the esophagus propel the food bolus down to the stomach (Figure  3.6 “Peristalsis in the Esophagus”). At the junction between the  esophagus and stomach there is a sphincter muscle that remains  closed until the food bolus approaches. The pressure of the food  bolus stimulates the lower esophageal sphincter to relax and open  and food then moves from the esophagus into the stomach. The  mechanical breakdown of food is accentuated by the muscular  contractions of the stomach and small intestine that 

Convert the text to a DataFrame

In [8]:
import pandas as pd
df = pd.DataFrame(pages_and_text)
df.head(5)

| | page_number | page_char_count | page_word_count | page_token_count | text |
|---|---|---|---|---|---|
| 0 | -41 | 29 | 4 | 4 | Human Nutrition: 2020 Edition |
| 1 | -40 | 0 | 0 | 0 | |
| 2 | -39 | 320 | 42 | 42 | Human Nutrition: 2020 Edition UNIVERSITY OF ... |
| 3 | -38 | 212 | 30 | 30 | Human Nutrition: 2020 Edition by University of... |
| 4 | -37 | 797 | 116 | 116 | Contents Preface University of Hawai‘i at Mā... |


In [9]:
df.shape

(1208, 5)

In [10]:
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_token_count |
|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 562.5 | 1148.59 | 172.56 | 172.56 |
| std | 348.86 | 560.44 | 86.48 | 86.48 |
| min | -41.0 | 0.0 | 0.0 | 0.0 |
| 25% | 260.75 | 762.75 | 109.75 | 109.75 |
| 50% | 562.5 | 1232.5 | 183.0 | 183.0 |
| 75% | 864.25 | 1605.25 | 240.0 | 240.0 |
| max | 1166.0 | 2308.0 | 393.0 | 393.0 |


**What is a token?**

Text generation and embedding models process text in chunks called tokens. Tokens represent commonly occurring sequences of characters. For example, the string " tokenization" is decomposed as " token" and "ization", while a short and common word like " the" is represented as a single token. Note that in a sentence, the first token of each word typically starts with a space character. A tokenizer tool can be used to test specific strings and see how they are translated into tokens. As a rough rule of thumb, 1 token is approximately 4 characters or 0.75 words for English text.
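
The rule of thumb above can be turned into a quick estimate. The sketch below assumes the 4-characters-per-token approximation; `estimate_token_count` is a hypothetical helper name, and a real tokenizer will give different counts.

```python
def estimate_token_count(text: str, chars_per_token: float = 4.0) -> float:
    """Estimate tokens from character count (rough rule of thumb, not a tokenizer)."""
    return len(text) / chars_per_token

sample = "Human Nutrition: 2020 Edition"
estimated_tokens = estimate_token_count(sample)  # 29 characters / 4
word_count = len(sample.split())                 # whitespace-separated words

print(f"chars: {len(sample)} | words: {word_count} | estimated tokens: {estimated_tokens}")
```

Note that the per-page `page_token_count` computed earlier uses the simpler "1 word = 1 token" shortcut, so it will usually be lower than this character-based estimate.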

## Text Processing (splitting pages into sentences)

Two ways to do this:
1. Split the text on `". "`
2. Use an NLP library such as spaCy (https://spacy.io/) or NLTK (https://www.nltk.org/)
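
To see why the first option is fragile, here is a minimal sketch of splitting on `". "`. The helper `naive_sentence_split` and both example strings are hypothetical and only meant to show the failure mode.

```python
from typing import List

def naive_sentence_split(text: str) -> List[str]:
    """Split on '. ' and restore the full stop that the split removes."""
    parts = text.split(". ")
    # Every part except the last one lost its trailing "." during the split
    sentences = [part + "." for part in parts[:-1]] + [parts[-1]]
    return [s.strip() for s in sentences if s.strip()]

clean_text = "This is a sentence. This is another sentence. I like elephants."
tricky_text = "Dr. Smith recommends 2.5 g of salt. That is a low intake."

clean_sentences = naive_sentence_split(clean_text)
tricky_sentences = naive_sentence_split(tricky_text)

print(clean_sentences)   # three well-formed sentences
print(tricky_sentences)  # "Dr." is split off as if it were its own sentence
```

The naive split works on tidy text but breaks on abbreviations, which is one reason a dedicated library is often preferred for real documents.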

In [11]:
from spacy.lang.en import English

nlp = English()

# Add a sentencizer pipeline, see https://spacy.io/api/sentencizer
nlp.add_pipe("sentencizer")

# Create document instance as an example
doc = nlp("This is a sentence. This is another sentence. I like elephants.")

# Ensure that there are three sentences in the document
assert len(list(doc.sents)) == 3

# Output the sentences
list(doc.sents)


[This is a sentence., This is another sentence., I like elephants.]

In [12]:
import spacy

In [13]:
for item in tqdm(pages_and_text):
  # Split the page text into sentences and convert the spaCy spans to plain strings
  item["sentences"] = [str(sentence) for sentence in nlp(item["text"]).sents]
  item["num_sentences"] = len(item["sentences"])
  # Count spaCy tokens per sentence (computed once and reused below)
  item["num_tokens"] = sum(len(nlp(sentence)) for sentence in item["sentences"])
  item["num_chars"] = sum(len(sentence) for sentence in item["sentences"])
  # Note: despite its name, this column holds the same spaCy token count as num_tokens
  item["num_token_chars"] = item["num_tokens"]


  0%|          | 0/1208 [00:00<?, ?it/s]

In [14]:
random.sample(pages_and_text, 2)

[{'page_number': 41,
  'page_char_count': 557,
  'page_word_count': 78,
  'page_token_count': 78,
  'text': 'Types of Scientific Studies  UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN  NUTRITION PROGRAM AND HUMAN NUTRITION PROGRAM  There are various types of scientific studies on humans that can  be used to provide supporting evidence for a particular hypothesis.  These include epidemiological studies, interventional clinical trials,  and randomized clinical trials. Valuable nutrition knowledge also is  obtained from animal studies and cellular and molecular biology  research.  Table 1.4 Types of Scientific Studies  Types of Scientific Studies  |  41',
  'sentences': ['Types of Scientific Studies  UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN  NUTRITION PROGRAM AND HUMAN NUTRITION PROGRAM  There are various types of scientific studies on humans that can  be used to provide supporting evidence for a particular hypothesis.',
   ' These include epidemiological studies, int

In [15]:
random.sample(pages_and_text, 2)

[{'page_number': 216,
  'page_char_count': 1357,
  'page_word_count': 226,
  'page_token_count': 226,
  'text': 'Popular Beverage Choices  UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN  NUTRITION PROGRAM AND HUMAN NUTRITION PROGRAM  Caffeine  Caffeine is a chemical called xanthine found in the seeds, leaves,  and fruit of many plants, where it acts as a natural pesticide. It is  the most widely consumed psychoactive substance and is such an  important part of many people’s lives that they might not even think  of it as a drug. Up to 90 percent of adults around the world use it on  a daily basis. According to both the FDA and the American Medical  Association the moderate use of caffeine is “generally recognized as  safe.” It is considered a legal psychoactive drug and, for the most  part, is completely unregulated.  Typical Doses and Dietary Sources  What is a “moderate intake” of caffeine? Caffeine intakes are  described in the following manner:  • Low–moderate intake. 130–300

In [16]:
df = pd.DataFrame(pages_and_text)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_token_count | num_sentences | num_tokens | num_chars | num_token_chars |
|---|---|---|---|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 562.5 | 1148.59 | 172.56 | 172.56 | 10.32 | 232.17 | 1140.0 | 232.17 |
| std | 348.86 | 560.44 | 86.48 | 86.48 | 6.3 | 113.56 | 555.52 | 113.56 |
| min | -41.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 260.75 | 762.75 | 109.75 | 109.75 | 5.0 | 151.5 | 758.75 | 151.5 |
| 50% | 562.5 | 1232.5 | 183.0 | 183.0 | 10.0 | 253.0 | 1225.0 | 253.0 |
| 75% | 864.25 | 1605.25 | 240.0 | 240.0 | 15.0 | 323.0 | 1591.25 | 323.0 |
| max | 1166.0 | 2308.0 | 393.0 | 393.0 | 28.0 | 491.0 | 2293.0 | 491.0 |


## Chunking our sentences together
The concept of splitting larger pieces of text into smaller ones is often referred to as text splitting or chunking.
There is no 100% correct way to do this.

Let's try groups of 10 sentences.

There are frameworks such as LangChain which can help with this. However, we'll stick with plain Python for now.

**Why do we do this?**
1. Smaller groups of text are easier to filter and inspect than large passages of text.
2. Text chunks can fit into our embedding model's context window.
3. Contexts passed to an LLM can be more specific and focused.
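
The second point can be checked with a rough estimate. The sketch below assumes a limit of 384 tokens and the 4-characters-per-token rule of thumb; both numbers are assumptions, so check the documentation of the embedding model you actually use. `fits_context_window` is a hypothetical helper.

```python
ASSUMED_MAX_TOKENS = 384  # assumption: look up the real limit for your embedding model

def fits_context_window(text: str,
                        max_tokens: int = ASSUMED_MAX_TOKENS,
                        chars_per_token: float = 4.0) -> bool:
    """Return True if the estimated token count is within the assumed limit."""
    estimated_tokens = len(text) / chars_per_token
    return estimated_tokens <= max_tokens

# Each repeated sentence is 40 characters long (including the trailing space)
short_chunk = "Caffeine is a chemical called xanthine. " * 10  # ~100 estimated tokens
long_chunk = "Caffeine is a chemical called xanthine. " * 50   # ~500 estimated tokens

print(fits_context_window(short_chunk))
print(fits_context_window(long_chunk))
```

Text beyond a model's limit is typically truncated or rejected, so chunks that fail a check like this are worth splitting further.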

In [17]:
num_sentence_chunk_size = 10
# chunks document with python
# [30] -> [10,10,10,] or [35] -> [10,10,10,5]
def split_list(input_list, chunk_size):
    return [input_list[i:i+chunk_size] for i in range(0, len(input_list), chunk_size)]
test_list = list(range(35))
split_list(test_list, num_sentence_chunk_size)

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
 [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
 [30, 31, 32, 33, 34]]

In [18]:
# loop through pages and texts and split sentences into chunks
for item in tqdm(pages_and_text):
  item["sentences_chunks"] = split_list(item["sentences"], num_sentence_chunk_size)
  item["num_sentences_chunks"] = len(item["sentences_chunks"])

  0%|          | 0/1208 [00:00<?, ?it/s]

In [19]:
random.sample(pages_and_text, k=2)

[{'page_number': 480,
  'page_char_count': 1569,
  'page_word_count': 214,
  'page_token_count': 214,
  'text': 'DietaryGuidelines2010.pdf. Published 2010. Accessed September 22,  2017.  Table 8.5 Acceptable Macronutrient Distribution Ranges  Age  Carbohydrates (%  of Calories)  Protein (% of  Calories)  Fat (% of  Calories)  Young Children (1–3)  45–65  5–20  30–40  Older children/ adolescents (4–18)  45–65  10–30  25–35  Adults (19 and older)  45–65  10–35  20–35  Source: Dietary Reference Intakes: Macronutrients.” Dietary  Reference Intakes for Energy, Carbohydrate. Fiber, Fat, Fatty Acids,  Cholesterol, Protein, and Amino Acids. Institute of Medicine.  http:/ /nationalacademies.org/hmd/~/media/Files/ Activity%20Files/Nutrition/DRI-Tables/ 8_Macronutrient%20Summary.pdf?la=en. Accessed September 22,  2017.  Total Energy Expenditure (Output)  The amount of energy you expend every day includes not only the  calories you burn during physical activity, but also the calories you  burn whi

In [20]:
df.shape

(1208, 10)

In [21]:
df = pd.DataFrame(pages_and_text)
df.head(5)

Unnamed: 0,page_number,page_char_count,page_word_count,page_token_count,text,sentences,num_sentences,num_tokens,num_chars,num_token_chars,sentences_chunks,num_sentences_chunks
0,-41,29,4,4,Human Nutrition: 2020 Edition,[Human Nutrition: 2020 Edition],1,5,29,5,[[Human Nutrition: 2020 Edition]],1
1,-40,0,0,0,,[],0,0,0,0,[],0
2,-39,320,42,42,Human Nutrition: 2020 Edition UNIVERSITY OF ...,[Human Nutrition: 2020 Edition UNIVERSITY OF...,1,69,320,69,[[Human Nutrition: 2020 Edition UNIVERSITY O...,1
3,-38,212,30,30,Human Nutrition: 2020 Edition by University of...,[Human Nutrition: 2020 Edition by University o...,1,35,212,35,[[Human Nutrition: 2020 Edition by University ...,1
4,-37,797,116,116,Contents Preface University of Hawai‘i at Mā...,[Contents Preface University of Hawai‘i at M...,2,150,796,150,[[Contents Preface University of Hawai‘i at ...,1


In [22]:
df.shape


(1208, 12)

In [23]:
df.describe().round()

| | page_number | page_char_count | page_word_count | page_token_count | num_sentences | num_tokens | num_chars | num_token_chars | num_sentences_chunks |
|---|---|---|---|---|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 562.0 | 1149.0 | 173.0 | 173.0 | 10.0 | 232.0 | 1140.0 | 232.0 | 2.0 |
| std | 349.0 | 560.0 | 86.0 | 86.0 | 6.0 | 114.0 | 556.0 | 114.0 | 1.0 |
| min | -41.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 261.0 | 763.0 | 110.0 | 110.0 | 5.0 | 152.0 | 759.0 | 152.0 | 1.0 |
| 50% | 562.0 | 1232.0 | 183.0 | 183.0 | 10.0 | 253.0 | 1225.0 | 253.0 | 1.0 |
| 75% | 864.0 | 1605.0 | 240.0 | 240.0 | 15.0 | 323.0 | 1591.0 | 323.0 | 2.0 |
| max | 1166.0 | 2308.0 | 393.0 | 393.0 | 28.0 | 491.0 | 2293.0 | 491.0 | 3.0 |


In [24]:
df.columns

Index(['page_number', 'page_char_count', 'page_word_count', 'page_token_count',
       'text', 'sentences', 'num_sentences', 'num_tokens', 'num_chars',
       'num_token_chars', 'sentences_chunks', 'num_sentences_chunks'],
      dtype='object')

In [25]:
df.sample(5)

Unnamed: 0,page_number,page_char_count,page_word_count,page_token_count,text,sentences,num_sentences,num_tokens,num_chars,num_token_chars,sentences_chunks,num_sentences_chunks
326,285,1173,180,180,Sucrose • Sugar ~4 kcal/ g Extracted from ...,[Sucrose • Sugar ~4 kcal/ g Extracted from ...,13,274,1162,274,[[Sucrose • Sugar ~4 kcal/ g Extracted from...,2
255,214,1512,210,210,Source: Image credit Robert Tauxe. Drinking W...,"[Source: Image credit Robert Tauxe., Drinking...",14,288,1500,288,"[[Source: Image credit Robert Tauxe., Drinkin...",2
638,597,1606,247,247,The Body’s Offense UNIVERSITY OF HAWAI‘I AT M...,[The Body’s Offense UNIVERSITY OF HAWAI‘I AT ...,12,306,1595,306,[[The Body’s Offense UNIVERSITY OF HAWAI‘I AT...,2
894,853,1741,274,274,becoming school-aged children. Their physical ...,"[becoming school-aged children., Their physica...",14,362,1728,362,"[[becoming school-aged children., Their physic...",2
868,827,1901,283,283,Components of Breastmilk Human breast milk no...,[Components of Breastmilk Human breast milk n...,20,377,1884,377,[[Components of Breastmilk Human breast milk ...,2


### Splitting each chunk into its own item

We want to embed each chunk of sentences into its own numerical representation. That gives us a good level of granularity.

In other words, we can dive into the specific text sample that was retrieved and passed to our model.


In [26]:
import re

pages_and_chunks = []
for item in tqdm(pages_and_text):
    for sentence_chunk in item["sentences_chunks"]:
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]
        # Join the sentences together into a paragraph-like structure, aka a chunk (so they are a single string)
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk) # ".A" -> ". A" for any full-stop/capital letter combo
        chunk_dict["sentence_chunk"] = joined_sentence_chunk

        # Get stats about the chunk
        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len(joined_sentence_chunk.split(" "))
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4 # rough estimate: 1 token ~= 4 characters

        pages_and_chunks.append(chunk_dict)
len(pages_and_chunks)

  0%|          | 0/1208 [00:00<?, ?it/s]

1843

In [27]:
random.sample(pages_and_chunks, k=2)

[{'page_number': 553,
  'sentence_chunk': 'Water-Soluble Vitamins | 553',
  'chunk_char_count': 28,
  'chunk_word_count': 4,
  'chunk_token_count': 7.0},
 {'page_number': 822,
  'sentence_chunk': 'http:/ /www.fao.org/3/ca5162en/ca5162en.pdf 822 | Infancy',
  'chunk_char_count': 57,
  'chunk_word_count': 5,
  'chunk_token_count': 14.25}]

In [28]:
df = pd.DataFrame(pages_and_chunks)
df.head(5)

| | page_number | sentence_chunk | chunk_char_count | chunk_word_count | chunk_token_count |
|---|---|---|---|---|---|
| 0 | -41 | Human Nutrition: 2020 Edition | 29 | 4 | 7.25 |
| 1 | -39 | Human Nutrition: 2020 Edition UNIVERSITY OF HA... | 308 | 42 | 77.0 |
| 2 | -38 | Human Nutrition: 2020 Edition by University of... | 210 | 30 | 52.5 |
| 3 | -37 | Contents Preface University of Hawai‘i at Māno... | 766 | 114 | 191.5 |
| 4 | -36 | Lifestyles and Nutrition University of Hawai‘i... | 941 | 142 | 235.25 |


In [29]:
df.columns

Index(['page_number', 'sentence_chunk', 'chunk_char_count', 'chunk_word_count',
       'chunk_token_count'],
      dtype='object')

In [30]:
df.shape

(1843, 5)

In [32]:
# Show random chunks with under 30 tokens in length
min_token_length = 30
for row in df[df["chunk_token_count"] <= min_token_length].sample(5).iterrows():
    print(f'Chunk token count: {row[1]["chunk_token_count"]} | Text: {row[1]["sentence_chunk"]}')

Chunk token count: 28.75 | Text: Bouayed, J. and T. Bohn. (2010). Exogenous Antioxidants—Double-Edged Swords in Cellular Redox MyPlate Planner | 753
Chunk token count: 15.25 | Text: Accessed November 30, 2017. Discovering Nutrition Facts | 737
Chunk token count: 9.75 | Text: Table 3.5 Salt Substitutes Sodium | 185
Chunk token count: 16.0 | Text: PART II CHAPTER 2. THE HUMAN BODY Chapter 2. The Human Body | 53
Chunk token count: 24.25 | Text: http:/ /pressbooks.oer.hawaii.edu/ humannutrition2/?p=485 930 | Older Adulthood: The Golden Years


### Filter chunks of text for short chunks


In [33]:
# Define minimum token length
min_token_length = 20

# Filter the dataframe to get rows where chunk_token_count is less than or equal to the minimum token length
filtered_df = df[df["chunk_token_count"] <= min_token_length]

# Randomly sample 5 rows from the filtered dataframe and iterate over them
for index, row in filtered_df.sample(5).iterrows():
    print(f"Chunk token count: {row['chunk_token_count']} | Text: {row['sentence_chunk']}")

Chunk token count: 13.0 | Text: PART VII CHAPTER 7. ALCOHOL Chapter 7. Alcohol | 429
Chunk token count: 15.75 | Text: https:/ /www.ncbi.nlm.nih.gov/pubmed/24456350 Introduction | 57
Chunk token count: 12.0 | Text: PART V CHAPTER 5. LIPIDS Chapter 5. Lipids | 289
Chunk token count: 16.0 | Text: Accessed January 20, 2018. The Effect of New Technologies | 1031
Chunk token count: 18.0 | Text: Updated July 24, 2017. Accessed April 15, 2018. 1112 | Threats to Health


In [34]:
pages_and_chunks_over_min_token_len = df[df["chunk_token_count"] > min_token_length].to_dict(orient="records")
pages_and_chunks_over_min_token_len[:2]

[{'page_number': -39,
  'sentence_chunk': 'Human Nutrition: 2020 Edition UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN NUTRITION PROGRAM ALAN TITCHENAL, SKYLAR HARA, NOEMI ARCEO CAACBAY, WILLIAM MEINKE-LAU, YA-YUN YANG, MARIE KAINOA FIALKOWSKI REVILLA, JENNIFER DRAPER, GEMADY LANGFELDER, CHERYL GIBBY, CHYNA NICOLE CHUN, AND ALLISON CALABRESE',
  'chunk_char_count': 308,
  'chunk_word_count': 42,
  'chunk_token_count': 77.0},
 {'page_number': -38,
  'sentence_chunk': 'Human Nutrition: 2020 Edition by University of Hawai‘i at Mānoa Food Science and Human Nutrition Program is licensed under a Creative Commons Attribution 4.0 International License, except where otherwise noted.',
  'chunk_char_count': 210,
  'chunk_word_count': 30,
  'chunk_token_count': 52.5}]

## Embedding text chunks

Embeddings are a broad but powerful concept.

Humans understand text; machines understand numbers. The next step is to:

* Turn our text chunks into numbers, specifically embeddings.

In [35]:

!pip install sentence-transformers


Collecting sentence-transformers
  Downloading sentence_transformers-3.0.0-py3-none-any.whl (224 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.11.0->sentence-transformers)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.11.0->sentence-transformers)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch>=1.11.0->sentence-transformers)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch>=1.11.0->sentence-transformers)
  Using cached nvidia_cudnn_cu12-8.9.2.26-py3-no

In [36]:
from sentence_transformers import SentenceTransformer

In [37]:
# Load the embedding model (it is downloaded on first use)
embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2",
                                     device="cpu")
# Create a list of sentences
sentences = ["Bangladesh will won the world  one day","This is a test sentence","I love my family",
            "i am use linux operating system","I like to eat pizza"]
# Sentences are encoded/embedded by calling model.encode()
embeddings = embedding_model.encode(sentences)
embedding_dict = dict(zip(sentences, embeddings))

# See the embeddings
for sentence, embedding in embedding_dict.items():
  print(f"Sentence: {sentence} \nEmbedding: {embedding}")
  print("")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Sentence: Bangladesh will won the world  one day 
Embedding: [-8.58733710e-03  4.94242683e-02 -2.66705304e-02  1.00702336e-02
 -7.47658126e-03 -4.01978120e-02 -5.94082810e-02 -4.96949963e-02
 -5.26912697e-03  4.78123240e-02  2.49715559e-02 -3.60853821e-02
 -9.95584298e-03  4.79048043e-02  6.21235073e-02 -1.55137004e-02
  3.73650575e-04  1.50357550e-02 -2.31155213e-02  2.15814542e-02
  2.08892152e-02 -8.05956870e-03 -2.37267511e-03  2.41653360e-02
  2.06762310e-02  7.63245858e-03  3.22654136e-02 -3.43736261e-03
 -8.52692053e-02  1.81441754e-02  5.00198379e-02 -1.81737747e-02
 -5.42332605e-02 -2.82335710e-02  1.16497995e-06 -3.83836254e-02
  5.65074896e-03  1.76016632e-02 -7.34784920e-03  1.79011561e-02
 -7.16348439e-02  5.66040650e-02 -3.60859483e-02 -8.62864195e-04
 -2.12407187e-02  2.59555932e-02 -9.55192000e-03  7.81845748e-02
 -1.10999383e-02 -6.77117109e-02 -3.47377593e-03  4.06222977e-02
 -6.36007860e-02  1.28257237e-02  2.75766756e-02  1.04985889e-02
  1.96203180e-02 -1.77630316e

In [38]:
embeddings[0].shape

(768,)

In [39]:
embedd_a_sentence = embedding_model.encode("My name is sujon islam from jamalpur.")
embedd_a_sentence.shape

(768,)

In [40]:
%%time
embedding_model.to("cpu")

for item in tqdm(pages_and_chunks_over_min_token_len):
  item["embedding"] = embedding_model.encode(item["sentence_chunk"])


  0%|          | 0/1749 [00:00<?, ?it/s]

CPU times: user 20min 53s, sys: 7.32 s, total: 21min 1s
Wall time: 21min 38s


In [41]:
# %%time
# embedding_model.to("cuda")

# for item in tqdm(pages_and_chunks_over_min_token_len):
#   item["embedding"] = embedding_model.encode(item["sentence_chunk"])
# RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx


In [42]:
%%time
text_chunks = [item["sentence_chunk"] for item in pages_and_chunks_over_min_token_len]
text_chunks[:3]

CPU times: user 1.74 ms, sys: 0 ns, total: 1.74 ms
Wall time: 1.75 ms


['Human Nutrition: 2020 Edition UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN NUTRITION PROGRAM ALAN TITCHENAL, SKYLAR HARA, NOEMI ARCEO CAACBAY, WILLIAM MEINKE-LAU, YA-YUN YANG, MARIE KAINOA FIALKOWSKI REVILLA, JENNIFER DRAPER, GEMADY LANGFELDER, CHERYL GIBBY, CHYNA NICOLE CHUN, AND ALLISON CALABRESE',
 'Human Nutrition: 2020 Edition by University of Hawai‘i at Mānoa Food Science and Human Nutrition Program is licensed under a Creative Commons Attribution 4.0 International License, except where otherwise noted.',
 'Contents Preface University of Hawai‘i at Mānoa Food Science and Human Nutrition Program and Human Nutrition Program xxv About the Contributors University of Hawai‘i at Mānoa Food Science and Human Nutrition Program and Human Nutrition Program xxvi Acknowledgements University of Hawai‘i at Mānoa Food Science and Human Nutrition Program and Human Nutrition Program xl Part\xa0I.\xa0Chapter 1. Basic Concepts in Nutrition Introduction University of Hawai‘i at Mānoa Food

In [43]:
len(text_chunks)

1749

In [44]:
%%time

#embed all texts in batches
text_chunk_embeddings = embedding_model.encode(text_chunks,
                                               batch_size=32, # number of chunks embedded per batch; change this to experiment
                                               convert_to_tensor=True, # return a torch tensor instead of a NumPy array
                                               show_progress_bar=True
                                               )

Batches:   0%|          | 0/55 [00:00<?, ?it/s]

CPU times: user 23min 7s, sys: 5min 26s, total: 28min 33s
Wall time: 28min 51s
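
The `0/55` in the progress bar above follows from the batch size: 1749 chunks in batches of 32 need 55 batches, with the last one only partly full. A small sketch of that arithmetic (`count_batches` is a hypothetical helper, not part of sentence-transformers):

```python
import math

def count_batches(num_items: int, batch_size: int) -> int:
    """Number of batches needed to cover all items (the last batch may be partial)."""
    return math.ceil(num_items / batch_size)

num_chunks = 1749
batch_size = 32

num_batches = count_batches(num_chunks, batch_size)
last_batch_size = num_chunks - (num_batches - 1) * batch_size

print(f"batches: {num_batches} | items in last batch: {last_batch_size}")
```

A larger `batch_size` means fewer batches but more memory per batch, which is why it is worth experimenting with on your own hardware.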


In [45]:
text_chunk_embeddings

tensor([[ 0.0674,  0.0902, -0.0051,  ..., -0.0221, -0.0232,  0.0126],
        [ 0.0552,  0.0592, -0.0166,  ..., -0.0120, -0.0103,  0.0227],
        [ 0.0280,  0.0340, -0.0206,  ..., -0.0054,  0.0213,  0.0313],
        ...,
        [ 0.0771,  0.0098, -0.0122,  ..., -0.0409, -0.0752, -0.0241],
        [ 0.1030, -0.0165,  0.0083,  ..., -0.0574, -0.0283, -0.0295],
        [ 0.0864, -0.0125, -0.0113,  ..., -0.0522, -0.0337, -0.0299]])

### Save embeddings to file

In [46]:
text_chunks_and_embeddings_df = pd.DataFrame(pages_and_chunks_over_min_token_len)
embedding_df_save_path = "/content/drive/MyDrive/rag/text_chunks_and_embeddings.csv"
text_chunks_and_embeddings_df.to_csv(embedding_df_save_path, index=False)

In [47]:
sentences = ["The Sentence Transformer library provides an easy way to create embedding ",
             "I hate no longer to pain",]


In [48]:
# Load the saved file and inspect it
text_chunks_and_embeddings_df_loaded = pd.read_csv("/content/drive/MyDrive/rag/text_chunks_and_embeddings.csv")

In [49]:
text_chunks_and_embeddings_df_loaded.describe().round(2)

| | page_number | chunk_char_count | chunk_word_count | chunk_token_count |
|---|---|---|---|---|
| count | 1749.0 | 1749.0 | 1749.0 | 1749.0 |
| mean | 579.87 | 771.8 | 118.38 | 192.95 |
| std | 348.94 | 429.09 | 68.51 | 107.27 |
| min | -39.0 | 81.0 | 7.0 | 20.25 |
| 25% | 277.0 | 374.0 | 55.0 | 93.5 |
| 50% | 579.0 | 783.0 | 121.0 | 195.75 |
| 75% | 887.0 | 1134.0 | 175.0 | 283.5 |
| max | 1166.0 | 1831.0 | 297.0 | 457.75 |


In [50]:
text_chunks_and_embeddings_df_loaded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1749 entries, 0 to 1748
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   page_number        1749 non-null   int64  
 1   sentence_chunk     1749 non-null   object 
 2   chunk_char_count   1749 non-null   int64  
 3   chunk_word_count   1749 non-null   int64  
 4   chunk_token_count  1749 non-null   float64
 5   embedding          1749 non-null   object 
dtypes: float64(1), int64(3), object(2)
memory usage: 82.1+ KB


### Vector Database
A vector database is a type of database that stores data as high-dimensional vectors, which are mathematical representations of features or attributes. Each vector has a certain number of dimensions, which can range from tens to thousands, depending on the complexity and granularity of the data.
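To make the idea concrete, here is a minimal in-memory sketch of what a vector store does: keep vectors next to their payloads and return the payloads whose vectors are closest to a query. The `TinyVectorStore` class, its methods, and the toy 3-dimensional vectors are all hypothetical names made up for illustration. A real vector database adds persistence and approximate indexing on top of this.

```python
import numpy as np


class TinyVectorStore:
    """Hypothetical in-memory store: one row per vector plus a parallel list of payloads."""

    def __init__(self, dim):
        self.vectors = np.empty((0, dim), dtype=np.float32)
        self.payloads = []

    def add(self, vector, payload):
        # Append the vector as a new row and remember what it represents
        row = np.asarray(vector, dtype=np.float32).reshape(1, -1)
        self.vectors = np.vstack([self.vectors, row])
        self.payloads.append(payload)

    def search(self, query, k=1):
        # Score every stored vector against the query with cosine similarity
        query = np.asarray(query, dtype=np.float32)
        norms = np.linalg.norm(self.vectors, axis=1) * np.linalg.norm(query)
        scores = (self.vectors @ query) / norms
        # Highest score first, keep the top k
        top = np.argsort(-scores)[:k]
        return [(self.payloads[i], float(scores[i])) for i in top]


# Toy data (made up): 3-dimensional vectors instead of 768-dimensional embeddings
store = TinyVectorStore(dim=3)
store.add([1.0, 0.0, 0.0], "chunk about protein")
store.add([0.0, 1.0, 0.0], "chunk about vitamins")
store.add([0.9, 0.1, 0.0], "chunk about amino acids")

results = store.search([1.0, 0.0, 0.0], k=2)
print(results)
```

For a collection of a few thousand chunks like the one in this notebook, a plain tensor of embeddings plays the same role as this store.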


## RAG - Search and Answer
RAG goal: retrieve relevant passages.

Comparing embeddings is known as similarity search, vector search, semantic search, or "vibe" search.
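As a small illustration of how two embeddings can be compared, the sketch below computes the dot product and the cosine similarity for toy 3-dimensional vectors. The vectors and the helper names `dot_product` and `cosine_similarity` are made up for this example. The dot product grows with vector length, while cosine similarity only measures direction.

```python
import torch

# Toy vectors (made up for illustration)
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([2.0, 4.0, 6.0])     # same direction as a, twice as long
c = torch.tensor([-1.0, -2.0, -3.0])  # opposite direction to a


def dot_product(v1, v2):
    # Sum of element-wise products: sensitive to both direction and magnitude
    return torch.dot(v1, v2)


def cosine_similarity(v1, v2):
    # Dot product divided by the product of the vector lengths: direction only
    return torch.dot(v1, v2) / (torch.linalg.norm(v1) * torch.linalg.norm(v2))


print("a vs a:", dot_product(a, a).item(), cosine_similarity(a, a).item())
print("a vs b:", dot_product(a, b).item(), cosine_similarity(a, b).item())
print("a vs c:", dot_product(a, c).item(), cosine_similarity(a, c).item())
```

When all vectors have length 1, the two measures give the same scores, so the cheaper dot product is enough for ranking.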


In [51]:
import random

import torch
import numpy as np
import pandas as pd

device = "cuda" if torch.cuda.is_available() else "cpu"

# Import texts and embedding df
text_chunks_and_embedding_df = pd.read_csv("/content/drive/MyDrive/rag/text_chunks_and_embeddings.csv")

# Convert embedding column back to np.array (it got converted to string when it got saved to CSV)
text_chunks_and_embedding_df["embedding"] = text_chunks_and_embedding_df["embedding"].apply(lambda x: np.fromstring(x.strip("[]"), sep=" "))

# Convert texts and embedding df to list of dicts
pages_and_chunks = text_chunks_and_embedding_df.to_dict(orient="records")

# Convert embeddings to torch tensor and send to device (note: NumPy arrays are float64, torch tensors are float32 by default)
embeddings = torch.tensor(np.array(text_chunks_and_embedding_df["embedding"].tolist()), dtype=torch.float32).to(device)
embeddings.shape

torch.Size([1749, 768])
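The string-to-array conversion used above can be checked on a toy example. This is a sketch under the assumption that each CSV cell holds the plain string form of a NumPy array, such as `[ 0.0674  0.0902 ...]`. The 4-value array stands in for a real 768-value embedding.

```python
import numpy as np

# Toy stand-in for one embedding (the real ones have 768 values)
original = np.array([0.0674, 0.0902, -0.0051, 0.0126], dtype=np.float32)

# Assumption: the CSV cell stores the array's string form, brackets included
as_text = str(original)

# Strip the brackets and parse the whitespace-separated numbers back into an array
restored = np.fromstring(as_text.strip("[]"), sep=" ")

print(as_text)
print(restored.dtype)  # parsed values come back as float64
print(np.allclose(original, restored, atol=1e-6))
```

The parsed array is `float64`, which is why the cell above casts to `torch.float32` when building the tensor. The text form also keeps only the printed digits, so saving in a binary format is worth considering if exact values matter.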

In [52]:
embeddings

tensor([[ 0.0674,  0.0902, -0.0051,  ..., -0.0221, -0.0232,  0.0126],
        [ 0.0552,  0.0592, -0.0166,  ..., -0.0120, -0.0103,  0.0227],
        [ 0.0280,  0.0340, -0.0206,  ..., -0.0054,  0.0213,  0.0313],
        ...,
        [ 0.0771,  0.0098, -0.0122,  ..., -0.0409, -0.0752, -0.0241],
        [ 0.1030, -0.0165,  0.0083,  ..., -0.0574, -0.0283, -0.0295],
        [ 0.0864, -0.0125, -0.0113,  ..., -0.0522, -0.0337, -0.0299]])

In [53]:
# Create model
from sentence_transformers import util, SentenceTransformer

embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2",
                                      device=device)




Embedding model ready

Let's create a small semantic search pipeline.

In essence, we want to search for a query and get back relevant passages from our textbook.

We can do so with the following steps:
1. Define a query string
2. Turn the query string into an embedding.
3. Compute the dot product or cosine similarity between the text embeddings and the query embedding.
4. Sort the results from 3 in descending order.

In [54]:
# 1. Define the query
# Note: This could be anything. But since we're working with a nutrition textbook, we'll stick with nutrition-based queries.
query = input("Query")
print(f"Query: {query}")

# 2. Embed the query to the same numerical space as the text examples
# Note: It's important to embed your query with the same model you embedded your examples with.
query_embedding = embedding_model.encode(query, convert_to_tensor=True)

# 3. Get similarity scores with the dot product (we'll time this for fun)
from time import perf_counter as timer

start_time = timer()
dot_scores = util.dot_score(a=query_embedding, b=embeddings)[0]
end_time = timer()

print(f"Time taken to get scores on {len(embeddings)} embeddings: {end_time-start_time:.5f} seconds.")

# 4. Get the top-k results (we'll keep this to 5)
top_results_dot_product = torch.topk(dot_scores, k=5)
top_results_dot_product

Querywhat is feed neutrition
Query: what is feed neutrition
Time taken to get scores on 1749 embeddings: 0.00159 seconds.


torch.return_types.topk(
values=tensor([0.4683, 0.4430, 0.4272, 0.4267, 0.4263]),
indices=tensor([1696, 1675, 1596,  856,  888]))
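The `indices` returned by `torch.topk` are row positions in the embeddings tensor, so they can be used to look up the matching text chunks. Below is a minimal sketch with toy data. The names `toy_chunks`, `toy_embeddings`, and `toy_query`, and their contents, are made up for illustration. In the notebook, the same lookup would go against `pages_and_chunks`, assuming it is in the same order as `embeddings`.

```python
import torch

# Toy stand-ins (made up): 4 chunks with 3-dimensional embeddings
toy_chunks = [
    {"page_number": 1, "sentence_chunk": "Proteins are built from amino acids."},
    {"page_number": 2, "sentence_chunk": "Vitamin C is water soluble."},
    {"page_number": 3, "sentence_chunk": "Amino acids can be essential or non-essential."},
    {"page_number": 4, "sentence_chunk": "Water makes up most of the body."},
]
toy_embeddings = torch.tensor([[1.0, 0.0, 0.0],
                               [0.0, 1.0, 0.0],
                               [0.8, 0.6, 0.0],
                               [0.0, 0.0, 1.0]])
toy_query = torch.tensor([1.0, 0.0, 0.0])

# Dot product between the query and every chunk embedding
scores = toy_embeddings @ toy_query

# Keep the two best-scoring rows
top = torch.topk(scores, k=2)

# Map each index back to the chunk it came from
for score, idx in zip(top.values, top.indices):
    chunk = toy_chunks[idx.item()]
    print(f"Score: {score.item():.4f} | Page: {chunk['page_number']} | {chunk['sentence_chunk']}")
```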

In [None]:
larger_embeddings = torch.randn(1000*embeddings.shape[0], 768).to(device)
print(f"Embeddings shape: {larger_embeddings.shape}")

Embeddings shape: torch.Size([1749000, 768])


In [55]:
1000

1680000