## Create and run a local RAG pipeline from scratch

### What is RAG?

RAG stands for Retrieval Augmented Generation.

The goal of RAG is to take information and pass it to an LLM so it can generate outputs based on that information.

* Retrieval - Find relevant information given a query, e.g. "what are the macronutrients and what do they do?" -> retrieve passages of text related to macronutrients from a nutrition textbook.

* Augmented - Take the relevant information and use it to augment the input (prompt) to an LLM.

* Generation - Pass the augmented prompt to an LLM so it can generate an output based on that information.
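The three steps above can be sketched in a few lines of plain Python. This is only a toy illustration: the `retrieve` and `augment` helpers, the word-overlap scoring and the example passages are all assumptions made for this sketch (the pipeline built later in this notebook uses embeddings for retrieval), and the generation step is left as a comment because it needs an LLM.

```python
def retrieve(query: str, passages: list[str], top_k: int = 1) -> list[str]:
    """Toy retrieval: rank passages by how many words they share with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(passages,
                    key=lambda passage: len(query_words & set(passage.lower().split())),
                    reverse=True)
    return ranked[:top_k]

def augment(query: str, context: list[str]) -> str:
    """Build a prompt that places the retrieved passages before the question."""
    context_block = "\n".join(f"- {passage}" for passage in context)
    return f"Use the following context to answer the question.\n{context_block}\nQuestion: {query}"

# Illustrative passages (not taken from the textbook)
passages = [
    "Macronutrients are carbohydrates, lipids and proteins.",
    "Water makes up a large part of body weight.",
]
query = "what are the macronutrients"

prompt = augment(query, retrieve(query, passages))
print(prompt)  # Generation: this prompt would now be passed to an LLM
```

Word overlap is a crude stand-in for retrieval, but the shape of the flow (find passages, put them in the prompt, generate) is the same one the rest of the notebook follows.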

If you want to read where RAG came from, see the paper from Facebook AI: https://proceedings.neurips.cc/paper/2020/file/6b493230205f780e1bc26945df7481e5-Paper.pdf

> This work offers several positive societal benefits over previous work: the fact that it is more
strongly grounded in real factual knowledge (in this case Wikipedia) makes it “hallucinate” less
with generations that are more factual, and offers more control and interpretability. RAG could be
employed in a wide variety of scenarios with direct benefit to society, for example by endowing it
with a medical index and asking it open-domain questions on that topic, or by helping people be more
effective at their jobs.

### Why RAG?

The main goal of RAG is to improve the generation outputs of LLMs.

1. Reduce hallucinations - LLMs are incredibly good at generating good-*looking* text, but good-looking text is not necessarily factual. RAG can help LLMs generate outputs that are grounded in relevant, factual passages.

2. Work with custom data - Many base LLMs are trained on internet-scale data. This means they have a fairly good understanding of language in general. However, it also means a lot of their responses can be generic in nature. RAG helps to create specific responses based on specific documents (e.g. your own company's customer support documents).


### What can RAG be used for?

* Customer support Q&A chat - Treat your existing support documents as a resource. When a customer asks a question, a retrieval system finds the relevant documentation snippets and an LLM crafts those snippets into an answer. Think of this as a "chatbot" for your documentation.

* Email chain analysis - Let's say you're a large insurance company and you have chains and chains of emails of customer claims. You could use a RAG pipeline to find relevant information from those emails and then use an LLM to process that information into structured data.

* Company internal documentation chat

* Textbook Q&A - Let's say you are a nutrition student with a 1200-page textbook to read. You could build a RAG pipeline to go through the textbook and find passages relevant to the questions you have.

Common theme here: take the documents relevant to a query and process them with an LLM.

From this angle, you can consider an LLM as a calculator for words.


### Why local?

Fun.

Privacy, speed and cost.

* Privacy - If you have private documentation, maybe you do not want to send it to an API. Instead, you can set up an LLM and run it on your own hardware.

* Speed - Whenever you use an API, you have to send some kind of data across the internet. This takes time. Running locally means we do not have to wait for transfers of data.

* Cost - If you own your hardware, the cost is paid. It may have a large cost to begin with, but over time you do not have to keep paying API fees.

* No vendor lock-in - If you run your own software/hardware, you can still run your business even if OpenAI or another large internet company shuts down tomorrow.

In [None]:
!nvidia-smi

Sun Dec  7 03:51:47 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   56C    P8             10W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

### What we are going to build

* https://github.com/mrdbourke/simple-local-rag
* https://whimsical.com/simple-local-rag-workflow-39kToR3yNf7E8kY4sS2tjV

We are going to build NutriChat to "chat with a nutrition textbook".

Specifically:

1. Open a PDF document (you could use almost any PDF here or even a collection of PDFs).

2. Format the text of the PDF textbook ready for an embedding model.

3. Embed all of the chunks of text in the textbook and turn them into numerical representations which we can store for later.

4. Build a retrieval system that uses vector search to find relevant chunks of text based on a query.

5. Create a prompt that incorporates the retrieved pieces of text.

6. Generate an answer to a query based on the passages of the textbook with an LLM.

These steps fall into two parts:

1. Steps 1-3: Document preprocessing and embedding creation.

2. Steps 4-6: Search and answer.

### 1. Document/text processing and embedding creation

Ingredients:

* PDF document of choice (note: this could be almost any kind of document, I have just chosen to focus on PDFs for now).

* Embedding model of choice.

Steps:

1. Import PDF document.

2. Process text for embedding (e.g. split into chunks of sentences).

3. Embed text chunks with embedding model.

4. Save embeddings to file for later (embeddings will keep on file for many years or until you lose your hard drive).



In [None]:
#### Import PDF document
# !wget https://github.com/mrdbourke/simple-local-rag/blob/main/human-nutrition-text.pdf

In [None]:
import os
import requests

# Get PDF document path
pdf_path = "human-nutrition-text.pdf"

# Download PDF
if not os.path.exists(pdf_path):
  print("File does not exist, downloading...")

  # Enter the URL of the pdf
  url = "https://pressbooks.oer.hawaii.edu/humannutrition2/open/download?type=pdf"

  # The local filename to save the downloaded file
  filename = pdf_path

  # Send a GET request to the URL
  response = requests.get(url)

  # Check if the request is successful
  if response.status_code == 200:
    # Open the file and save it
    with open(filename, "wb") as file:
      file.write(response.content)
    print(f"[INFO] the file has been downloaded and saved as {filename}")
  else:
    print(f"[INFO] failed to download the file. Status code: {response.status_code}")

else:
  print(f"File {pdf_path} exists.")

File does not exist, downloading...
[INFO] the file has been downloaded and saved as human-nutrition-text.pdf


We have got a PDF, let's open it!

In [None]:
!pip install PyMuPDF

Collecting PyMuPDF
  Downloading pymupdf-1.26.6-cp310-abi3-manylinux_2_28_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.26.6-cp310-abi3-manylinux_2_28_x86_64.whl (24.1 MB)
Installing collected packages: PyMuPDF
Successfully installed PyMuPDF-1.26.6


In [None]:
import fitz  # requires pip install PyMuPDF, see: https://github.com/pymupdf/PyMuPDF
from tqdm.auto import tqdm # pip install tqdm

In [None]:
def text_formatter(text: str) -> str:
  """
  Performs minor formatting on text.
  """
  cleaned_text = text.replace("\n", " ").strip()

  # More text formatting steps can be added here if needed
  return cleaned_text

def open_and_read_pdf(pdf_path: str) -> list[dict]:
  """
  Opens a PDF, reads the text of each page and collects simple per-page statistics.
  """
  doc = fitz.open(pdf_path)
  pages_and_texts = []
  for page_number, page in tqdm(enumerate(doc)):
    text = page.get_text()
    text = text_formatter(text=text)
    # The printed page numbers of this PDF start 41 pages in,
    # so shift the index to match them (adjust for other PDFs)
    pages_and_texts.append({"page_number": page_number - 41,
                            "page_char_count": len(text),
                            "page_word_count": len(text.split(" ")),
                            "page_sentence_count_raw": len(text.split(". ")),
                            "page_token_count": len(text) / 4, # 1 token ~ 4 characters
                            "text": text
                          })
  return pages_and_texts


pages_and_texts = open_and_read_pdf(pdf_path=pdf_path)

0it [00:00, ?it/s]

In [None]:
pages_and_texts[:2]

[{'page_number': -41,
  'page_char_count': 29,
  'page_word_count': 4,
  'page_sentence_count_raw': 1,
  'page_token_count': 7.25,
  'text': 'Human Nutrition: 2020 Edition'},
 {'page_number': -40,
  'page_char_count': 0,
  'page_word_count': 1,
  'page_sentence_count_raw': 1,
  'page_token_count': 0.0,
  'text': ''}]

In [None]:
import random

random.sample(pages_and_texts, k=3)

[{'page_number': 653,
  'page_char_count': 387,
  'page_word_count': 66,
  'page_sentence_count_raw': 2,
  'page_token_count': 96.75,
  'text': 'Image by  Allison  Calabrese /  CC BY 4.0  Learning Activities  Technology Note: The second edition of the Human  Nutrition Open Educational Resource (OER) textbook  features interactive learning activities.\xa0 These activities are  available in the web-based textbook and not available in the  downloadable versions (EPUB, Digital PDF, Print_PDF, or  Open Document).  Introduction  |  653'},
 {'page_number': 795,
  'page_char_count': 1238,
  'page_word_count': 211,
  'page_sentence_count_raw': 13,
  'page_token_count': 309.5,
  'text': 'Nutrient  Nonpregnant Women  Pregnant Women  Vitamin A (mcg)  700.0  770.0  Vitamin B6 (mg)  1.5  1.9  Vitamin B12 (mcg)  2.4  2.6  Vitamin C (mg)  75  85  Vitamin D (mcg)  15  15  Vitamin E (mg)  15  15  Calcium (mg)  1,000.0  1,000.0  Folate (mcg)  400  600  Iron (mg)  18  27  Magnesium (mg)  320  360  Niacin(

In [None]:
import pandas as pd

df = pd.DataFrame(pages_and_texts)
df.head()

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,text
0,-41,29,4,1,7.25,Human Nutrition: 2020 Edition
1,-40,0,1,1,0.0,
2,-39,320,54,1,80.0,Human Nutrition: 2020 Edition UNIVERSITY OF ...
3,-38,212,32,1,53.0,Human Nutrition: 2020 Edition by University of...
4,-37,797,145,2,199.25,Contents Preface University of Hawai‘i at Mā...


In [None]:
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count
count,1208.0,1208.0,1208.0,1208.0,1208.0
mean,562.5,1148.0,198.3,9.97,287.0
std,348.86,560.38,95.76,6.19,140.1
min,-41.0,0.0,1.0,1.0,0.0
25%,260.75,762.0,134.0,4.0,190.5
50%,562.5,1231.5,214.5,10.0,307.88
75%,864.25,1603.5,271.0,14.0,400.88
max,1166.0,2308.0,429.0,32.0,577.0


Why would we care about token count?

Token count is important to think about because:

1. Embedding models can only process a limited number of tokens at a time.
2. LLMs can only process a limited number of tokens at a time.

For example, an embedding model may have been trained to embed sequences of up to 384 tokens into numerical space (sentence-transformers `all-mpnet-base-v2`, see: https://sbert.net/docs/sentence_transformer/pretrained_models.html). Longer inputs are typically truncated, so anything past the limit is not represented in the embedding.

As for LLMs, the prompt (including any retrieved passages) has to fit within the model's context window.
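To make the token budget concrete, here is a minimal sketch that applies the same rough rule used above (1 token is about 4 characters) to the mean and maximum page lengths from the `df.describe()` output. The helper names `estimate_tokens` and `fits_embedding_model` are made up for this example, and the 4-characters rule is only an approximation, not a real tokenizer.

```python
EMBEDDING_TOKEN_LIMIT = 384  # limit quoted above for all-mpnet-base-v2

def estimate_tokens(text: str, chars_per_token: int = 4) -> float:
    """Rough token estimate using the ~4 characters per token rule of thumb."""
    return len(text) / chars_per_token

def fits_embedding_model(text: str, limit: int = EMBEDDING_TOKEN_LIMIT) -> bool:
    """Check whether the estimated token count is within the embedding model limit."""
    return estimate_tokens(text) <= limit

# Stand-in strings with the mean and max page_char_count from the table above
average_page = "a" * 1148
longest_page = "a" * 2308

print(estimate_tokens(average_page), fits_embedding_model(average_page))  # 287.0 True
print(estimate_tokens(longest_page), fits_embedding_model(longest_page))  # 577.0 False
```

An average page fits, but the longest pages do not, which is one reason to split pages into smaller pieces before embedding them.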

### Further text processing (splitting pages into sentences)

Two ways to do this:

1. Split on `". "` (we have already done this to get `page_sentence_count_raw`).

2. Use an NLP library such as spaCy (https://spacy.io/usage) or NLTK (https://www.nltk.org/).
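A quick sketch of why splitting on `". "` only gives a rough count. The example sentence is made up for illustration; the point is that abbreviations create false splits, while `?` and `!` boundaries are missed.

```python
text = "Dr. Lee studies nutrition. Is fiber important? Yes! It aids digestion."

# Naive split: breaks after "Dr." and misses the "?" and "!" boundaries
naive_sentences = text.split(". ")
print(naive_sentences)
# ['Dr', 'Lee studies nutrition', 'Is fiber important? Yes! It aids digestion.']

# An empty page still yields one "sentence", because "".split(". ") returns ['']
empty_page = ""
print(len(empty_page.split(". ")))  # 1
```

The second case explains why the empty page in the earlier output has a `page_sentence_count_raw` of 1 even though it contains no text.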


In [None]:
from spacy.lang.en import English

nlp = English()

# Add a sentencizer pipeline, see https://spacy.io/api/sentencizer
nlp.add_pipe("sentencizer")

# Create document instance as an example
doc = nlp("This is a sentence. This another sentence. I like elephants.")
assert len(list(doc.sents)) == 3

# Print out our sentences split
list(doc.sents)

[This is a sentence., This another sentence., I like elephants.]

In [None]:
pages_and_texts[600]

{'page_number': 559,
 'page_char_count': 863,
 'page_word_count': 136,
 'page_sentence_count_raw': 8,
 'page_token_count': 215.75,
 'text': 'Image by  Allison  Calabrese /  CC BY 4.0  Korsakoff syndrome can cause similar symptoms as beriberi such  as confusion, loss of coordination, vision changes, hallucinations,  and may progress to coma and death. This condition is specific  to alcoholics as diets high in alcohol can cause thiamin deficiency.  Other individuals at risk include individuals who also consume diets  typically low in micronutrients such as those with eating disorders,  elderly, and individuals who have gone through gastric bypass  surgery.5  Figure 9.10 The Role of Thiamin  Figure 9.11 Beriberi, Thiamin Deficiency  5.\xa0Fact Sheets for Health Professionals: Thiamin. National  Institute of Health, Office of Dietary Supplements.  \xa0https://ods.od.nih.gov/factsheets/Thiamin- HealthProfessional/. Updated Feburary 11, 2016.  Accessed October 22, 2017.  Water-Soluble Vitami

In [None]:
for item in tqdm(pages_and_texts):
  item["sentences"] = list(nlp(item["text"]).sents)

  # Make sure all sentences are strings (default type is a spaCy datatype)
  item["sentences"] = [str(sentence) for sentence in item["sentences"]]

  # Count the sentences
  item["page_sentence_count_spacy"] = len(item["sentences"])

  0%|          | 0/1208 [00:00<?, ?it/s]

In [None]:
random.sample(pages_and_texts, k=1)

[{'page_number': 374,
  'page_char_count': 1174,
  'page_word_count': 205,
  'page_sentence_count_raw': 11,
  'page_token_count': 293.5,
  'text': 'The Role of Proteins in  Foods: Cooking and  Denaturation  UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN  NUTRITION PROGRAM AND HUMAN NUTRITION PROGRAM  In addition to having many vital functions within the body, proteins  perform different roles in our foods by adding certain functional  qualities to them. Protein provides food with structure and texture  and enables water retention. For example, proteins foam when  agitated. (Picture whisking egg whites to make angel food cake. The  foam bubbles are what give the angel food cake its airy texture.)  Yogurt is another good example of proteins providing texture. Milk  proteins called caseins coagulate, increasing yogurt’s thickness.  Cooked proteins add some color and flavor to foods as the amino  group binds with carbohydrates and produces a brown pigment and  aroma. Eggs are betwee

In [None]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,page_sentence_count_spacy
count,1208.0,1208.0,1208.0,1208.0,1208.0,1208.0
mean,562.5,1148.0,198.3,9.97,287.0,10.32
std,348.86,560.38,95.76,6.19,140.1,6.3
min,-41.0,0.0,1.0,1.0,0.0,0.0
25%,260.75,762.0,134.0,4.0,190.5,5.0
50%,562.5,1231.5,214.5,10.0,307.88,10.0
75%,864.25,1603.5,271.0,14.0,400.88,15.0
max,1166.0,2308.0,429.0,32.0,577.0,28.0


### Chunking our sentences together

The concept of splitting larger pieces of text into smaller ones is often referred to as text splitting or chunking.

There is no 100% correct way to do this.

We will keep it simple and split our sentences into groups of 10 (however, you could also try 5, 7, 8, whatever you like).

There are frameworks such as LangChain which can help with this, however, we will stick with pure Python for now:
https://python.langchain.com/docs/modules/data_connection/document_transformers/

Why we do this:

1. So our texts are easier to filter (smaller groups of text can be easier to inspect than large passages of text).

2. So our text chunks can fit into our embedding model context window (e.g. 384 tokens as a limit).

3. So our contexts passed to an LLM can be more specific and focused.

In [None]:
# Define split size to turn groups of sentences into chunks
num_sentence_chunk_size = 10

# Create a function to split a list into consecutive slices of at most slice_size items
# e.g. 20 items -> [10, 10] or 25 items -> [10, 10, 5]

def split_list(input_list: list,
               slice_size: int=num_sentence_chunk_size) -> list[list[str]]:
    return [input_list[i: i + slice_size] for i in range(0, len(input_list), slice_size)]

test_list = list(range(25))
split_list(test_list)

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
 [20, 21, 22, 23, 24]]

In [None]:
# Loop through pages and texts and split sentences into chunks
for item in tqdm(pages_and_texts):
  item["sentence_chunks"] = split_list(input_list = item["sentences"],
                                       slice_size = num_sentence_chunk_size)
  item["num_chunks"] = len(item["sentence_chunks"])

  0%|          | 0/1208 [00:00<?, ?it/s]

In [None]:
random.sample(pages_and_texts, k=1)

[{'page_number': 31,
  'page_char_count': 1556,
  'page_word_count': 259,
  'page_sentence_count_raw': 14,
  'page_token_count': 389.0,
  'text': 'Adequacy  An adequate diet is one that favors nutrient-dense foods. Nutrient- dense foods are defined as foods that contain many essential  nutrients per calorie. Nutrient-dense foods are the opposite of  “empty-calorie” foods, such as sugary carbonated beverages, which  are also called “nutrient-poor.” Nutrient-dense foods include fruits  and vegetables, lean meats, poultry, fish, low-fat dairy products, and  whole grains. Choosing more nutrient-dense foods will facilitate  weight loss, while simultaneously providing all necessary nutrients.  Balance  Balance the foods in your diet. Achieving balance in your diet entails  not consuming one nutrient at the expense of another. For example,  calcium is essential for healthy teeth and bones, but too much  calcium will interfere with iron absorption. Most foods that are  good sources of iron are

In [None]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,page_sentence_count_spacy,num_chunks
count,1208.0,1208.0,1208.0,1208.0,1208.0,1208.0,1208.0
mean,562.5,1148.0,198.3,9.97,287.0,10.32,1.53
std,348.86,560.38,95.76,6.19,140.1,6.3,0.64
min,-41.0,0.0,1.0,1.0,0.0,0.0,0.0
25%,260.75,762.0,134.0,4.0,190.5,5.0,1.0
50%,562.5,1231.5,214.5,10.0,307.88,10.0,1.0
75%,864.25,1603.5,271.0,14.0,400.88,15.0,2.0
max,1166.0,2308.0,429.0,32.0,577.0,28.0,3.0


### Splitting each chunk into its own item

We would like to embed each chunk of sentences into its own numerical representation.

That'll give us a good level of granularity.

Meaning, we can trace an answer back to the specific chunk of text that was passed to our model.

In [None]:
import re

# Split each chunk into its own item
pages_and_chunks = []

for item in tqdm(pages_and_texts):
    for sentence_chunk in item["sentence_chunks"]:
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]

        # Join the sentences together into paragraph like structure, aka join the list of sentences into one paragraph
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk)  #  ".A" => ". A" (will work for any capital letter)

        chunk_dict["sentence_chunk"] = joined_sentence_chunk

        # Get some stats on our chunks
        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len(joined_sentence_chunk.split(" "))
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4  # 1 token = ~4 characters

        pages_and_chunks.append(chunk_dict)

len(pages_and_chunks)

  0%|          | 0/1208 [00:00<?, ?it/s]

1843
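To see what the join and the regex in the loop above are doing, here is a small standalone sketch. It assumes, as in the loop, that each sentence string has no trailing space, so joining with `""` glues sentences together and the regex has to put the space back. The two example sentences are borrowed from the lipids page of the textbook.

```python
import re

sentence_chunk = [
    "Fatty acids and glycerol are the building blocks of triglycerides.",
    "Glycerol is a thick, smooth, syrupy compound that is often used in the food industry.",
]

# Joining with "" leaves no space between sentences: "...triglycerides.Glycerol is..."
joined = "".join(sentence_chunk).replace("  ", " ").strip()

# Insert a space after any full stop that is directly followed by a capital letter
fixed = re.sub(r'\.([A-Z])', r'. \1', joined)
print(fixed)

# Limitation: a full stop followed by a lowercase letter is left untouched
print(re.sub(r'\.([A-Z])', r'. \1', "fatty acid chains.triglycerides contain"))
```

The lowercase case is why a fragment such as `chains.triglycerides` can still show up in some chunks.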

In [None]:
random.sample(pages_and_chunks, k=1)

[{'page_number': 305,
  'sentence_chunk': 'Image by Allison Calabrese/ CC BY 4.0 How Lipids Work UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN NUTRITION PROGRAM AND HUMAN NUTRITION PROGRAM Lipids are unique organic compounds, each serving key roles and performing specific functions within the body. As we discuss the various types of lipids (triglycerides, phospholipids, and sterols) in further detail, we will compare their structures and functions and examine their impact on human health. Triglycerides Structure and Functions Triglycerides are the main form of lipid found in the body and in the diet. Fatty acids and glycerol are the building blocks of triglycerides. Glycerol is a thick, smooth, syrupy compound that is often used in the food industry. To form a triglyceride, a glycerol molecule is joined by three fatty acid chains.triglycerides contain varying mixtures of fatty acids. Figure 5.3 The Structure of a Triglycerides How Lipids Work | 305',
  'chunk_char_count': 927,


In [None]:
df = pd.DataFrame(pages_and_chunks)
df.describe().round(2)

Unnamed: 0,page_number,chunk_char_count,chunk_word_count,chunk_token_count
count,1843.0,1843.0,1843.0,1843.0
mean,583.38,734.44,112.33,183.61
std,347.79,447.54,71.22,111.89
min,-41.0,12.0,3.0,3.0
25%,280.5,315.0,44.0,78.75
50%,586.0,746.0,114.0,186.5
75%,890.0,1118.5,173.0,279.62
max,1166.0,1831.0,297.0,457.75


#### Filter chunks of text for short chunks

Very short chunks (for example, page headers, footers and reference fragments) may not contain much useful information.

In [None]:
df.head()

Unnamed: 0,page_number,sentence_chunk,chunk_char_count,chunk_word_count,chunk_token_count
0,-41,Human Nutrition: 2020 Edition,29,4,7.25
1,-39,Human Nutrition: 2020 Edition UNIVERSITY OF HA...,308,42,77.0
2,-38,Human Nutrition: 2020 Edition by University of...,210,30,52.5
3,-37,Contents Preface University of Hawai‘i at Māno...,766,114,191.5
4,-36,Lifestyles and Nutrition University of Hawai‘i...,941,142,235.25


In [None]:
# Show random chunks with 30 tokens or fewer
min_token_length = 30

for row in df[df["chunk_token_count"] <= min_token_length].sample(5).iterrows():
    print(f'Chunk token count: {row[1]["chunk_token_count"]} | Text: {row[1]["sentence_chunk"]}')

Chunk token count: 11.75 | Text: Accessed March 17, 2018. Sports Nutrition | 961
Chunk token count: 19.25 | Text: http://pressbooks.oer.hawaii.edu/ humannutrition2/?p=463   870 | Introduction
Chunk token count: 16.0 | Text: Accessed January 20, 2018. 1032 | The Effect of New Technologies
Chunk token count: 20.75 | Text: http://pressbooks.oer.hawaii.edu/ humannutrition2/?p=84   The Digestive System | 81
Chunk token count: 9.5 | Text: 742 | Building Healthy Eating Patterns


In [None]:
# Keep only the rows of our DataFrame with more than 30 tokens
pages_and_chunks_over_min_token_len = df[df["chunk_token_count"] > min_token_length].to_dict(orient="records")
pages_and_chunks_over_min_token_len[:2]

[{'page_number': -39,
  'sentence_chunk': 'Human Nutrition: 2020 Edition UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN NUTRITION PROGRAM ALAN TITCHENAL, SKYLAR HARA, NOEMI ARCEO CAACBAY, WILLIAM MEINKE-LAU, YA-YUN YANG, MARIE KAINOA FIALKOWSKI REVILLA, JENNIFER DRAPER, GEMADY LANGFELDER, CHERYL GIBBY, CHYNA NICOLE CHUN, AND ALLISON CALABRESE',
  'chunk_char_count': 308,
  'chunk_word_count': 42,
  'chunk_token_count': 77.0},
 {'page_number': -38,
  'sentence_chunk': 'Human Nutrition: 2020 Edition by University of Hawai‘i at Mānoa Food Science and Human Nutrition Program is licensed under a Creative Commons Attribution 4.0 International License, except where otherwise noted.',
  'chunk_char_count': 210,
  'chunk_word_count': 30,
  'chunk_token_count': 52.5}]

In [None]:
random.sample(pages_and_chunks_over_min_token_len, k=1)

[{'page_number': 545,
  'sentence_chunk': 'Institute of Medicine. Dietary reference intakes for vitamin A, vitamin K, arsenic, boron, chromium, copper, iodine, iron, manganese, molybdenum, nickel, silicon, vanadium, and zinc. Washington, DC: National Academy Press; 2001. Table 9.8 Dietary Reference Intakes for Vitamin K Fat-Soluble Vitamins | 545',
  'chunk_char_count': 305,
  'chunk_word_count': 42,
  'chunk_token_count': 76.25}]
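The same filter can be written without pandas. The sketch below is a plain-Python equivalent on a hand-made list of two chunk dictionaries (the texts and token counts are copied from outputs above, the list itself is only for illustration).

```python
min_token_length = 30

example_chunks = [
    {"sentence_chunk": "742 | Building Healthy Eating Patterns", "chunk_token_count": 9.5},
    {"sentence_chunk": "Human Nutrition: 2020 Edition by University of Hawai‘i at Mānoa", "chunk_token_count": 52.5},
]

# Keep only the chunks whose estimated token count is above the threshold
kept_chunks = [chunk for chunk in example_chunks
               if chunk["chunk_token_count"] > min_token_length]

print(len(kept_chunks))
print(kept_chunks[0]["sentence_chunk"])
```

The page footer is dropped and the longer chunk is kept. The threshold of 30 is a judgment call, so it is worth sampling the removed chunks (as done above) before settling on a value.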

### Embedding our text chunks

Embeddings are a broad but powerful concept.

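While humans understand text, machines understand numbers.

What we'd like to do:

- Turn our text chunks into numbers, specifically embeddings.

An embedding is a useful numerical representation of a piece of data.

The best part about embeddings is that they are a *learned* representation: instead of a hand-written mapping from words to arbitrary numbers, such as the one below, the model learns values that capture meaning.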

```
{"the":0,
"a": 1,
...
```

For a great resource on learning embeddings, see here: https://vickiboykis.com/what_are_embeddings/


In [None]:
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2",
                                      device="cpu")

# Create a list of sentences
sentences = ["The Sentence Transformer library provides an easy way to create embeddings",
             "Sentences can be embedded one by one or in a list.",
             "I like horses!"]

# Sentences are encoded/embedded by calling model.encode()
embeddings = embedding_model.encode(sentences)
embeddings_dict = dict(zip(sentences, embeddings))

# See the embeddings
for sentence, embedding in embeddings_dict.items():
    print(f"Sentence: {sentence}")
    print(f"Embedding: {embedding}")
    print("")

Sentence: The Sentence Transformer library provides an easy way to create embeddings
Embedding: [-3.17512117e-02  3.37267779e-02 -2.52437871e-02  5.22287413e-02
 -2.35249251e-02 -6.19112700e-03  1.35026313e-02 -6.25501126e-02
  7.50829559e-03 -2.29684301e-02  2.98147090e-02  4.57554609e-02
 -3.26700062e-02  1.39847305e-02  4.18014117e-02 -5.92969656e-02
  4.26309630e-02  5.04656229e-03 -2.44552456e-02  3.98593862e-03
  3.55897620e-02  2.78742500e-02  1.84098668e-02  3.67700271e-02
 -2.29961295e-02 -3.01797204e-02  5.99575753e-04 -3.64504009e-02
  5.69104478e-02 -7.49938656e-03 -3.70003581e-02 -3.04356613e-03
  4.64354493e-02  2.36149714e-03  9.06850175e-07  7.00033410e-03
 -3.92289534e-02 -5.95696364e-03  1.38653377e-02  1.87106978e-03
  5.34202680e-02 -6.18613027e-02  2.19613463e-02  4.86050583e-02
 -4.25697528e-02 -1.69858951e-02  5.04178032e-02  1.54733751e-02
  8.12859237e-02  5.07106185e-02 -2.27497052e-02 -4.35721017e-02
 -2.18391954e-03 -2.14091651e-02 -2.01757699e-02  3.0683213

In [None]:
embeddings[0].shape

(768,)

In [None]:
embedding = embedding_model.encode("My favorite animal is the cow!")
embedding

array([-1.45920366e-02,  8.02744478e-02, -2.35814210e-02, -3.19282077e-02,
        4.08718586e-02,  5.27201593e-02, -6.59843534e-02,  1.63273532e-02,
        1.03796236e-02, -3.25809605e-02, -2.78962869e-02,  5.04059680e-02,
       -3.03209741e-02, -5.52543020e-03, -2.24996568e-03, -3.40672955e-02,
        4.15263586e-02, -6.02290686e-03, -1.18760532e-02,  5.03419824e-02,
       -2.46707872e-02,  4.90849502e-02, -1.78524461e-02, -2.02775057e-02,
       -3.04977577e-02,  8.45101848e-03, -2.10023876e-02, -2.69276239e-02,
        1.77504830e-02,  1.21456198e-02, -5.96181341e-02, -8.12657177e-02,
        3.16369571e-02, -1.59019569e-03,  1.23865721e-06, -8.03155825e-03,
       -3.90659943e-02,  2.38245446e-02,  3.93481143e-02,  2.12699478e-03,
        2.04685982e-02,  4.92226332e-03, -2.50347406e-02,  1.32806078e-02,
        3.23007554e-02,  5.64530268e-02,  4.20428775e-02,  1.70866624e-02,
       -9.11563933e-02,  1.88811645e-02, -3.20625724e-03,  4.00640332e-04,
       -3.64725441e-02,  

In [None]:
%%time

# embedding_model.to("cpu")

# # Embed each chunk one by one
# for item in tqdm(pages_and_chunks_over_min_token_len):
#     item["embedding"] = embedding_model.encode(item["sentence_chunk"])

CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 5.48 µs


In [None]:
%%time

embedding_model.to("cuda")

for item in tqdm(pages_and_chunks_over_min_token_len):
    item["embedding"] = embedding_model.encode(item["sentence_chunk"])

  0%|          | 0/1680 [00:00<?, ?it/s]

CPU times: user 29.9 s, sys: 420 ms, total: 30.3 s
Wall time: 31.2 s


In [None]:
%%time

text_chunks = [item["sentence_chunk"] for item in pages_and_chunks_over_min_token_len]

text_chunks[419]

CPU times: user 311 µs, sys: 0 ns, total: 311 µs
Wall time: 317 µs


'often. • Calm your “sweet tooth” by eating fruits, such as berries or an apple. • Replace sugary soft drinks with seltzer water, tea, or a small amount of 100 percent fruit juice added to water or soda water. The Food Industry: Functional Attributes of Carbohydrates and the Use of Sugar Substitutes In the food industry, both fast-releasing and slow-releasing carbohydrates are utilized to give foods a wide spectrum of functional attributes, including increased sweetness, viscosity, bulk, coating ability, solubility, consistency, texture, body, and browning capacity. The differences in chemical structure between the different carbohydrates confer their varied functional uses in foods. Starches, gums, and pectins are used as thickening agents in making jam, cakes, cookies, noodles, canned products, imitation cheeses, and a variety of other foods. Molecular gastronomists use slow- releasing carbohydrates, such as alginate, to give shape and texture to their fascinating food creations. Add

In [None]:
%%time

# Embed all texts in batches
text_chunk_embeddings = embedding_model.encode(text_chunks,
                                                batch_size = 32,  # you can experiment to find which batch size leads to best results
                                                convert_to_tensor = True)

text_chunk_embeddings

CPU times: user 24.3 s, sys: 55 ms, total: 24.3 s
Wall time: 24 s


tensor([[ 0.0674,  0.0902, -0.0051,  ..., -0.0221, -0.0232,  0.0126],
        [ 0.0552,  0.0592, -0.0166,  ..., -0.0120, -0.0103,  0.0227],
        [ 0.0280,  0.0340, -0.0206,  ..., -0.0054,  0.0213,  0.0313],
        ...,
        [ 0.0771,  0.0098, -0.0122,  ..., -0.0409, -0.0752, -0.0241],
        [ 0.1030, -0.0165,  0.0083,  ..., -0.0574, -0.0283, -0.0295],
        [ 0.0864, -0.0125, -0.0113,  ..., -0.0522, -0.0337, -0.0299]],
       device='cuda:0')

#### Save embeddings to file



In [None]:
pages_and_chunks_over_min_token_len[419]

{'page_number': 277,
 'sentence_chunk': 'often. • Calm your “sweet tooth” by eating fruits, such as berries or an apple. • Replace sugary soft drinks with seltzer water, tea, or a small amount of 100 percent fruit juice added to water or soda water. The Food Industry: Functional Attributes of Carbohydrates and the Use of Sugar Substitutes In the food industry, both fast-releasing and slow-releasing carbohydrates are utilized to give foods a wide spectrum of functional attributes, including increased sweetness, viscosity, bulk, coating ability, solubility, consistency, texture, body, and browning capacity. The differences in chemical structure between the different carbohydrates confer their varied functional uses in foods. Starches, gums, and pectins are used as thickening agents in making jam, cakes, cookies, noodles, canned products, imitation cheeses, and a variety of other foods. Molecular gastronomists use slow- releasing carbohydrates, such as alginate, to give shape and texture 

In [None]:
# Save embeddings to file
text_chunks_and_embeddings_df = pd.DataFrame(pages_and_chunks_over_min_token_len)

embeddings_df_save_path = "text_chunks_and_embeddings_df.csv"
text_chunks_and_embeddings_df.to_csv(embeddings_df_save_path, index=False)

In [None]:
# Import saved file and view
text_chunks_and_embeddings_df_load = pd.read_csv(embeddings_df_save_path)
text_chunks_and_embeddings_df_load.head()

Unnamed: 0,page_number,sentence_chunk,chunk_char_count,chunk_word_count,chunk_token_count,embedding
0,-39,Human Nutrition: 2020 Edition UNIVERSITY OF HA...,308,42,77.0,[ 6.74242675e-02 9.02281404e-02 -5.09548886e-...
1,-38,Human Nutrition: 2020 Edition by University of...,210,30,52.5,[ 5.52156419e-02 5.92139773e-02 -1.66167244e-...
2,-37,Contents Preface University of Hawai‘i at Māno...,766,114,191.5,[ 2.79801842e-02 3.39813754e-02 -2.06426680e-...
3,-36,Lifestyles and Nutrition University of Hawai‘i...,941,142,235.25,[ 6.82566911e-02 3.81275006e-02 -8.46854132e-...
4,-35,The Cardiovascular System University of Hawai‘...,998,152,249.5,[ 3.30264494e-02 -8.49763490e-03 9.57159605e-...


If your embedding database is really large (e.g. over 100k-1M samples), you might want to look into using a vector database for storage:
https://en.wikipedia.org/wiki/Vector_database
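
For a dataset of this size, a plain file is enough. One thing worth noticing is that a CSV stores each embedding as a string, so it has to be parsed back into numbers on load. The sketch below uses toy data (an assumption, not the real embeddings) to compare that string round trip with a binary `.npy` file written by `np.save`. The file name is hypothetical, and whether a binary file suits your workflow is a judgment call rather than a recommendation from the original notebook.

```python
import os
import tempfile

import numpy as np

# Toy stand-in for the real embeddings (assumption: 4 chunks, 8 dimensions)
rng = np.random.default_rng(seed=42)
toy_embeddings = rng.standard_normal((4, 8)).astype(np.float32)

# CSV-style round trip: the array is written as a string and must be parsed back
as_string = str(toy_embeddings[0])
parsed = np.fromstring(as_string.strip("[]"), sep=" ").astype(np.float32)

# Binary round trip: np.save / np.load keep dtype, shape and exact values
with tempfile.TemporaryDirectory() as tmp_dir:
    save_path = os.path.join(tmp_dir, "toy_embeddings.npy")  # hypothetical file name
    np.save(save_path, toy_embeddings)
    loaded = np.load(save_path)

print(f"Parsed string is close to original: {np.allclose(parsed, toy_embeddings[0], atol=1e-6)}")
print(f"Loaded .npy equals original: {np.array_equal(loaded, toy_embeddings)}")
```

The string version only keeps as many digits as were printed, which is why the comparison uses a tolerance instead of exact equality.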

### 2. RAG - Search and Answer

RAG goal: Retrieve relevant passages based on a query and use those passages to augment an input to an LLM so it can generate an output based on those relevant passages.

### Similarity search

Embeddings can be used for almost any type of data.

For example, you can turn images into embeddings, sound into embeddings, text into embeddings, etc...

Comparing embeddings is known as similarity search, vector search or semantic search.

In our case, we want to query our nutrition textbook passages based on semantics or "vibe".

So if I search for "macronutrient functions" I should get back passages that are relevant to that text, even if they do not contain exactly the words "macronutrient functions".

Whereas with keyword search, if I search "apple" I get back passages with specifically "apple".
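
To make the difference concrete, here is a minimal sketch that contrasts the two approaches on three toy passages. The 3-dimensional "embeddings" are hand-made values chosen for illustration (an assumption, they are not outputs of a real embedding model), and `keyword_search` and `semantic_search` are hypothetical helper names.

```python
import numpy as np

# Toy passages with hand-made 3-dimensional "embeddings"
# (assumption: illustrative values only, not outputs of a real model)
passages = [
    "Carbohydrates, lipids and proteins supply the body with energy.",
    "An apple a day keeps the doctor away.",
    "Water makes up a large share of body weight.",
]
passage_embeddings = np.array([
    [0.9, 0.1, 0.1],
    [0.1, 0.9, 0.2],
    [0.3, 0.2, 0.9],
])

query = "macronutrient functions"
query_embedding = np.array([0.8, 0.2, 0.2])  # hypothetical query embedding


def keyword_search(query: str, passages: list[str]) -> list[int]:
    """Return indices of passages that contain any query word verbatim."""
    query_words = query.lower().split()
    return [i for i, passage in enumerate(passages)
            if any(word in passage.lower() for word in query_words)]


def semantic_search(query_embedding: np.ndarray, passage_embeddings: np.ndarray) -> int:
    """Return the index of the passage with the highest cosine similarity."""
    norms = np.linalg.norm(passage_embeddings, axis=1) * np.linalg.norm(query_embedding)
    cosine_scores = passage_embeddings @ query_embedding / norms
    return int(np.argmax(cosine_scores))


print(f"Keyword matches: {keyword_search(query, passages)}")
print(f"Best semantic match: {passages[semantic_search(query_embedding, passage_embeddings)]}")
```

None of the toy passages contains the word "macronutrient" or "functions", so the keyword search comes back empty, while the embedding comparison still picks the passage about carbohydrates, lipids and proteins.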


In [None]:
import random

import torch
import numpy as np
import pandas as pd

device = "cuda" if torch.cuda.is_available() else "cpu"

# Import texts and embedding df
text_chunks_and_embedding_df = pd.read_csv("text_chunks_and_embeddings_df.csv")

# Convert embedding column back to np.array (it got converted to string when it saved to CSV)
text_chunks_and_embedding_df["embedding"] = text_chunks_and_embedding_df["embedding"].apply(lambda x: np.fromstring(x.strip("[]"), sep=" "))

# Convert our embeddings into a torch.tensor
embeddings = torch.tensor(np.stack(text_chunks_and_embedding_df['embedding'].tolist(), axis=0), dtype=torch.float32).to(device)

# Convert texts and embedding df to list of dicts
pages_and_chunks = text_chunks_and_embedding_df.to_dict(orient="records")

text_chunks_and_embedding_df

Unnamed: 0,page_number,sentence_chunk,chunk_char_count,chunk_word_count,chunk_token_count,embedding
0,-39,Human Nutrition: 2020 Edition UNIVERSITY OF HA...,308,42,77.00,"[0.0674242675, 0.0902281404, -0.00509548886, -..."
1,-38,Human Nutrition: 2020 Edition by University of...,210,30,52.50,"[0.0552156419, 0.0592139773, -0.0166167244, -0..."
2,-37,Contents Preface University of Hawai‘i at Māno...,766,114,191.50,"[0.0279801842, 0.0339813754, -0.020642668, 0.0..."
3,-36,Lifestyles and Nutrition University of Hawai‘i...,941,142,235.25,"[0.0682566911, 0.0381275006, -0.00846854132, -..."
4,-35,The Cardiovascular System University of Hawai‘...,998,152,249.50,"[0.0330264494, -0.0084976349, 0.00957159605, -..."
...,...,...,...,...,...,...
1675,1164,Flashcard Images Note: Most images in the flas...,1305,176,326.25,"[0.0185622536, -0.0164277665, -0.0127045633, -..."
1676,1164,Hazard Analysis Critical Control Points reused...,375,51,93.75,"[0.0334720612, -0.0570440851, 0.0151489386, -0..."
1677,1165,ShareAlike 11. Organs reused “Pancreas Organ A...,1286,173,321.50,"[0.0770515501, 0.00978557579, -0.0121817412, 0..."
1678,1165,Sucrose reused “Figure 03 02 05” by OpenStax B...,410,59,102.50,"[0.103045158, -0.0164701864, 0.00826846063, 0...."


In [None]:
embeddings.shape

torch.Size([1680, 768])

In [None]:
# Create model
from sentence_transformers import util, SentenceTransformer

embedding_model = SentenceTransformer(model_name_or_path = "all-mpnet-base-v2",
                                      device=device)

Embedding model ready!

Let's create a small semantic search pipeline.

In essence, we want to search for a query (e.g. "macronutrient functions") and get back relevant passages from our textbook.

We can do so with the following steps:

1. Define a query string.

2. Turn the query string into an embedding.

3. Perform a dot product or cosine similarity function between the text embeddings and the query embedding.

4. Sort the results from 3 in descending order.

In [None]:
embeddings.shape

torch.Size([1680, 768])

Note: to use the dot product for comparison, ensure the vectors have the same size (e.g. 768) and the tensors/vectors have the same datatype (e.g. both are torch.float32).
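
The dot product and cosine similarity only agree when the vectors have unit length. The sketch below illustrates this with two toy vectors (an assumption for demonstration, not real embeddings). `dot_product` and `cosine_similarity` are hypothetical helpers written in NumPy, not the `sentence_transformers` utilities.

```python
import numpy as np


def dot_product(a: np.ndarray, b: np.ndarray) -> float:
    """Sum of element-wise products; sensitive to vector magnitude."""
    return float(np.dot(a, b))


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Dot product divided by both vector lengths; only direction matters."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Two toy vectors pointing in the same direction, b is twice as long as a
a = np.array([1.0, 2.0, 3.0], dtype=np.float32)
b = np.array([2.0, 4.0, 6.0], dtype=np.float32)

print(f"Dot product (raw): {dot_product(a, b)}")
print(f"Cosine similarity (raw): {cosine_similarity(a, b):.4f}")

# After normalizing both vectors to unit length, the dot product matches cosine similarity
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
print(f"Dot product (normalized): {dot_product(a_unit, b_unit):.4f}")
```

If a model already returns normalized embeddings, the cheaper dot product gives the same ranking as cosine similarity.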

In [None]:
# 1. Define the query
query = "macronutrient functions"
print(f"Query: {query}")

# 2. Embed the query
# Note: It is important to embed the query with the same model you embedded your passages with
query_embedding = embedding_model.encode(query, convert_to_tensor=True).to(device)

# 3. Get similarity scores with the dot product (use cosine similarity if the model outputs aren't normalized)
from time import perf_counter as timer

start_time = timer()

dot_scores = util.dot_score(a=query_embedding, b=embeddings)[0]

end_time = timer()

print(f"[INFO] Time taken to get scores on {len(embeddings)} embeddings: {end_time - start_time: .5f} seconds.")

# 4. Get the top-k results (we will keep top 5)
top_results_dot_product = torch.topk(dot_scores, k=5)

top_results_dot_product

Query: macronutrient functions
[INFO] Time taken to get scores on 1680 embeddings:  0.00025 seconds.


torch.return_types.topk(
values=tensor([0.6843, 0.6717, 0.6517, 0.6493, 0.6478], device='cuda:0'),
indices=tensor([42, 47, 46, 51, 41], device='cuda:0'))

In [None]:
larger_embeddings = torch.randn(100*embeddings.shape[0], 768).to(device)
print(f"Embeddings shape: {larger_embeddings.shape}")

# Perform dot product across 168000 embeddings
start_time = timer()
dot_scores = util.dot_score(a=query_embedding, b=larger_embeddings)[0]
end_time = timer()

print(f"[INFO] Time taken to get scores on {len(larger_embeddings)} embeddings: {end_time - start_time: .5f} seconds.")

Embeddings shape: torch.Size([168000, 768])
[INFO] Time taken to get scores on 168000 embeddings:  0.00076 seconds.


We can see that searching over embeddings is very fast even if we do exhaustive search.

But if you had 10M+ embeddings, you would likely want to create an index.

An index is like letters in a dictionary.

For example, if you wanted to search "duck" in the dictionary, you would start at "d" then find words close to "du.." etc.

An index helps to narrow the search down to a smaller set of candidates.

A popular indexing library for vector search is Faiss, see here:
https://github.com/facebookresearch/faiss

One technique that the library provides is approximate nearest neighbor search (ANN): https://en.wikipedia.org/wiki/(1%2B%CE%B5)-approximate_nearest_neighbor_search
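
To give a feel for how an index narrows the search, here is a toy sketch in plain NumPy. It groups random vectors into buckets around a few randomly chosen centroids and then searches only the buckets closest to the query. Everything here is an assumption for illustration: the data is random, `exhaustive_search` and `bucketed_search` are hypothetical helpers, and this is not the Faiss API or a description of how Faiss works internally.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Toy data: 1,000 random unit vectors in 16 dimensions (assumption, not real embeddings)
vectors = rng.standard_normal((1000, 16)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

# "Index": pick a few vectors as centroids and file every vector under its closest centroid
num_buckets = 10
centroids = vectors[rng.choice(len(vectors), size=num_buckets, replace=False)]
bucket_ids = np.argmax(vectors @ centroids.T, axis=1)


def exhaustive_search(query: np.ndarray, k: int = 5) -> np.ndarray:
    """Score the query against every vector and return the top-k indices."""
    scores = vectors @ query
    return np.argsort(-scores)[:k]


def bucketed_search(query: np.ndarray, k: int = 5, num_probe: int = 2):
    """Score the query only against vectors in the num_probe closest buckets."""
    closest_buckets = np.argsort(-(centroids @ query))[:num_probe]
    candidate_ids = np.flatnonzero(np.isin(bucket_ids, closest_buckets))
    scores = vectors[candidate_ids] @ query
    top_ids = candidate_ids[np.argsort(-scores)[:k]]
    return top_ids, len(candidate_ids)


query = vectors[123]  # use a stored vector as the query so the best match is known
exact_ids = exhaustive_search(query, k=5)
approx_ids, num_candidates = bucketed_search(query, k=5, num_probe=2)

print(f"Exhaustive search compared {len(vectors)} vectors, top match: {exact_ids[0]}")
print(f"Bucketed search compared {num_candidates} vectors, top match: {approx_ids[0]}")
```

The bucketed search looks at fewer vectors, at the cost of possibly missing neighbors that sit in buckets it did not probe. That trade-off between speed and recall is what makes the search approximate.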

Let's make our vector search results pretty.

In [None]:
import textwrap

def print_wrapped(text, wrap_length=80):
    wrapped_text = textwrap.fill(text, wrap_length)
    print(wrapped_text)

In [None]:
print(f"Query: '{query}'\n")
print("Results:")
# Loop through zipped together scores and indices from torch.topk
for score, idx in zip(top_results_dot_product[0], top_results_dot_product[1]):
    print(f"Score: {score:.4f}")
    print("Text:")
    print_wrapped(pages_and_chunks[idx]['sentence_chunk'])
    print(f"Page number: {pages_and_chunks[idx]['page_number']}")
    print("\n")

Query: 'macronutrient functions'

Results:
Score: 0.6843
Text:
Macronutrients Nutrients that are needed in large amounts are called
macronutrients. There are three classes of macronutrients: carbohydrates,
lipids, and proteins. These can be metabolically processed into cellular energy.
The energy from macronutrients comes from their chemical bonds. This chemical
energy is converted into cellular energy that is then utilized to perform work,
allowing our bodies to conduct their basic functions. A unit of measurement of
food energy is the calorie. On nutrition food labels the amount given for
“calories” is actually equivalent to each calorie multiplied by one thousand. A
kilocalorie (one thousand calories, denoted with a small “c”) is synonymous with
the “Calorie” (with a capital “C”) on nutrition food labels. Water is also a
macronutrient in the sense that you require a large amount of it, but unlike the
other macronutrients, it does not yield calories. Carbohydrates Carbohydrates
are m