# Create & Run a Local RAG Pipeline from Scratch

## What is RAG?

RAG stands for Retrieval Augmented Generation.

The goal of RAG is to take information and pass it to an LLM, so it can generate outputs based on that information.

* Retrieval - Find relevant information given a query, e.g. "What are the macronutrients & what do they do?" -> retrieves passages of text related to the macronutrients from a nutrition textbook.

* Augmented - We want to take the relevant information & augment our input (prompt) to an LLM with that relevant information.

* Generation - Take the first 2 steps & pass them to an LLM for generative outputs.

Where RAG came from - Facebook / Meta AI Paper: *Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks*
> This work offers several positive societal benefits over previous work: the fact that it is more strongly grounded in real factual knowledge (in this case Wikipedia) makes it “hallucinate” less with generations that are more factual, and offers more control and interpretability. RAG could be employed in a wide variety of scenarios with direct benefit to society, for example by endowing it with a medical index and asking it open-domain questions on that topic, or by helping people be more effective at their jobs.

## Why RAG?

The main goal of RAG is to improve the generation outputs of LLMs.

1. Reduce hallucinations - LLMs are incredibly good at generating good-*looking* text, but that doesn't mean the text is factual. RAG can help LLMs generate outputs grounded in relevant, factual passages.

2. Work with custom data - Many base LLMs are trained on internet-scale data. This means they have a fairly good understanding of language in general. However, it also means a lot of their responses can be generic in nature. RAG helps to create specific responses based on specific documents (e.g. your own company's customer support documents).

## What can RAG be used for?

* Customer Support Q&A Chat - Treat your existing customer support documents as a resource: when a customer asks a question, a retrieval system retrieves relevant documentation snippets & an LLM then crafts those snippets into an answer. Think of this as a "chatbot for your documentation".

* Email Chain Analysis - Let's say you are a large insurance company & you have chains and chains of emails of customer claims. You could use a RAG pipeline to find relevant information from those emails & then use an LLM to process that information into structured data.

* Company Internal Documentation Chat

* Textbook Q&A - Let's say you are a nutrition student and you've got a 1200-page textbook to read. You could build a RAG pipeline to go through the textbook and find passages relevant to the questions you have.

Common theme here: Take the documents relevant to a query & process them with an LLM.

From this angle, consider an LLM as a calculator for words.

## Why Local?

Fun.

Privacy, Speed, Cost, No Vendor Lock-in.

* Privacy - If you have private documentation, maybe you don't want to send that to an API. You want to set up an LLM and run it on your own hardware.
* Speed - Whenever you use an API, you have to send some kind of data across the internet. This takes time. Running locally means we don't have to wait for transfers of data.
* Cost - If you own your hardware, the cost is paid. It may have a large cost to begin with, but over time, you don't have to keep paying API fees.
* No Vendor Lock-in - If you run your own software / hardware and a large company shuts down tomorrow, you can still run your business.

## What Will Be Built?

Build NutriChat to "chat with a nutrition document".

Specifically:

1. Open a PDF document (you could use almost any PDF here or even a collection of PDFs).
2. Format the text of the PDF textbook ready for an embedding model.
3. Embed all of the chunks of text in the textbook, turning them into numerical representations (embeddings) which can be stored for later.
4. Build a retrieval system that uses vector search to find relevant chunks of text based on a query.
5. Create a prompt that incorporates the retrieved pieces of text.
6. Generate an answer to a query based on the passages of the textbook with an LLM.

All Locally!

1. Steps 1 - 3: Document Preprocessing & Embedding Creation.
2. Steps 4 - 6: Search & Answer.
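
To make the two stages concrete, here is a toy, dependency-free sketch of the whole flow. Everything in it is a stand-in: `retrieve` uses naive word overlap rather than embeddings, `generate` only returns a placeholder string instead of calling an LLM, and all function names are hypothetical rather than part of NutriChat.

```python
# Toy end-to-end RAG flow. All names here are illustrative assumptions.

def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    '''Rank chunks by how many query words they share (stand-in for vector search).'''
    query_words = set(query.lower().split())
    ranked = sorted(chunks,
                    key=lambda chunk: len(query_words & set(chunk.lower().split())),
                    reverse=True)
    return ranked[:k]

def augment(query: str, context: list[str]) -> str:
    '''Build a prompt that contains the retrieved passages.'''
    context_block = '\n'.join(f'- {passage}' for passage in context)
    return f'Use the context to answer.\nContext:\n{context_block}\nQuery: {query}'

def generate(prompt: str) -> str:
    '''Placeholder for an LLM call; a real pipeline would run a local model here.'''
    return f'[LLM output for a prompt of {len(prompt)} characters]'

chunks = ['Proteins help build and repair tissue.',
          'Kefir is a fermented dairy product.']
query = 'What do proteins do?'

top_chunks = retrieve(query, chunks)
prompt = augment(query, top_chunks)
print(prompt)
print(generate(prompt))
```

The rest of this notebook replaces the word-overlap stand-in with embeddings of real textbook chunks.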

## 1. Document / Text Preprocessing & Embedding Creation

Ingredients:
* PDF document of choice (note: this could be almost any kind of document, PDFs are just the focus for now).
* Embedding model of choice

Steps:
1. Import PDF Document.
2. Preprocess Text for Embedding (e.g. Split into Chunks of Sentences).
3. Embed Text Chunks with Embedding Model.
4. Save Embeddings to File for Later (embeddings stored on file will last for many years, or until you lose your hard drive).

## Import PDF Document

In [1]:
import os
import requests

In [2]:
# path to document
pdf_path = 'human-nutrition-text.pdf'

# download PDF
if not os.path.exists(pdf_path):
    print('[INFO] File does not exist, downloading...')

    # url of the pdf
    url = 'https://pressbooks.oer.hawaii.edu/humannutrition2/open/download?type=pdf'

    # the local file name to save the downloaded file
    fname = pdf_path

    # GET request
    res = requests.get(url)

    # check if the request is successful
    if res.status_code == 200:
        # open the file & save it
        with open(fname, 'wb') as f:
            f.write(res.content)
        print(f'[INFO] The file has been downloaded & saved as {fname}.')
    else:
        print(f'[INFO] Failed to download the file. Status Code: {res.status_code}')
else:
    print(f'[INFO] File {pdf_path} exists.')

[INFO] File does not exist, downloading...
[INFO] The file has been downloaded & saved as human-nutrition-text.pdf.


PDF is now available, let's open it.

In [3]:
!pip install PyMuPDF

Collecting PyMuPDF
  Downloading PyMuPDF-1.24.10-cp310-none-manylinux2014_x86_64.whl.metadata (3.4 kB)
Collecting PyMuPDFb==1.24.10 (from PyMuPDF)
  Downloading PyMuPDFb-1.24.10-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.4 kB)
Downloading PyMuPDF-1.24.10-cp310-none-manylinux2014_x86_64.whl (3.5 MB)
Downloading PyMuPDFb-1.24.10-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (15.9 MB)
Installing collected packages: PyMuPDFb, PyMuPDF
Successfully installed PyMuPDF-1.24.10 PyMuPDFb-1.24.10


In [4]:
import fitz # from PyMuPDF
from tqdm.auto import tqdm

def text_formatter(text: str) -> str:
    '''Performs minor formatting on text.'''
    cleaned_text = text.replace('\n', ' ').strip()

    return cleaned_text

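def open_and_read_pdf(path: str) -> list[dict]:
    '''Opens a PDF and returns one dict of text and basic stats per page.'''
    doc = fitz.open(path)
    pages_and_texts = []

    for page_no, page in tqdm(enumerate(doc)):
        text = page.get_text()
        text = text_formatter(text=text)
        pages_and_texts.append({'page_no': page_no - 41, # offset so page_no matches the page numbers printed in the book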
                                'page_char_cnt': len(text),
                                'page_word_cnt': len(text.split(' ')),
                                'page_sentence_cnt_raw': len(text.split('. ')),
                                'page_token_cnt': len(text) / 4, # 1 token ~ 4 chars
                                'text': text})
    return pages_and_texts

pages_and_texts = open_and_read_pdf(path=pdf_path)
pages_and_texts[:2]

0it [00:00, ?it/s]

[{'page_no': -41,
  'page_char_cnt': 29,
  'page_word_cnt': 4,
  'page_sentence_cnt_raw': 1,
  'page_token_cnt': 7.25,
  'text': 'Human Nutrition: 2020 Edition'},
 {'page_no': -40,
  'page_char_cnt': 0,
  'page_word_cnt': 1,
  'page_sentence_cnt_raw': 1,
  'page_token_cnt': 0.0,
  'text': ''}]

In [5]:
import random

random.sample(pages_and_texts, k=3)

[{'page_no': 907,
  'page_char_cnt': 1002,
  'page_word_cnt': 188,
  'page_sentence_cnt_raw': 12,
  'page_token_cnt': 250.5,
  'text': 'Image by  David Marcu  on  unsplash.co m / CC0  Young Adulthood  UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN  NUTRITION PROGRAM AND HUMAN NUTRITION PROGRAM  Young adulthood is the period from ages nineteen to thirty years.  It is a stable time compared to childhood and adolescence. Physical  growth has been completed and all of the organs and body systems  are fully developed. Typically, a young adult who is active has  reached his or her physical peak and is in prime health. For example,  vital capacity, or the maximum amount of air that the lungs can  inhale and exhale, is at its peak between the ages of twenty and  forty.1 During this life stage, it important to continue to practice  good  nutrition.  Healthy  eating  habits  promote  metabolic  functioning, assist repair and regeneration, and prevent the  1.\xa0Polan EU, Taylor DR. (2003)

In [6]:
import pandas as pd

df = pd.DataFrame(pages_and_texts)
df.head()

Unnamed: 0,page_no,page_char_cnt,page_word_cnt,page_sentence_cnt_raw,page_token_cnt,text
0,-41,29,4,1,7.25,Human Nutrition: 2020 Edition
1,-40,0,1,1,0.0,
2,-39,320,54,1,80.0,Human Nutrition: 2020 Edition UNIVERSITY OF ...
3,-38,212,32,1,53.0,Human Nutrition: 2020 Edition by University of...
4,-37,797,145,2,199.25,Contents Preface University of Hawai‘i at Mā...


In [7]:
df.describe().round(2)

Unnamed: 0,page_no,page_char_cnt,page_word_cnt,page_sentence_cnt_raw,page_token_cnt
count,1208.0,1208.0,1208.0,1208.0,1208.0
mean,562.5,1148.0,198.3,9.97,287.0
std,348.86,560.38,95.76,6.19,140.1
min,-41.0,0.0,1.0,1.0,0.0
25%,260.75,762.0,134.0,4.0,190.5
50%,562.5,1231.5,214.5,10.0,307.88
75%,864.25,1603.5,271.0,14.0,400.88
max,1166.0,2308.0,429.0,32.0,577.0


Why would we care about token count?

Token count is important to think about because:

1. Embedding models don't deal with infinite tokens.
2. LLMs don't deal with infinite tokens.

For example, an embedding model may have been trained to embed sequences of 384 tokens into numerical space (sentence-transformers `all-mpnet-base-v2`, see: https://www.sbert.net/docs/sentence_transformer/pretrained_models.html).

As for LLMs, they can't accept infinite tokens in their context window.
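
As a rough check, the "1 token ~ 4 chars" rule of thumb used for the page statistics above can be turned into a small helper. This is only a sketch: `estimate_tokens` and `fits_in_context` are hypothetical names, and the real count depends on the tokenizer of the model you use.

```python
def estimate_tokens(text: str) -> float:
    '''Approximate token count using the ~4 characters per token rule of thumb.'''
    return len(text) / 4

def fits_in_context(text: str, max_tokens: int = 384) -> bool:
    '''Return True if the estimated token count is within the given limit.'''
    return estimate_tokens(text) <= max_tokens

short_text = 'Human Nutrition: 2020 Edition'
long_text = 'a' * 2308  # same character count as the longest page in the stats above

print(estimate_tokens(short_text), fits_in_context(short_text))
print(estimate_tokens(long_text), fits_in_context(long_text))
```

Text that fails this kind of check is a candidate for being split into smaller pieces before embedding.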

## Further Text Preprocessing

Splitting pages into sentences.

There are 2 ways to do this:

1. Split on `'. '` with plain Python (as done above for `page_sentence_cnt_raw`).
2. Use an NLP library, such as spaCy or nltk.
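
The first option is quick to try, but a sketch like the following shows where it tends to go wrong: every period followed by a space is treated as a sentence boundary, including the one in an abbreviation. The example text is made up for illustration.

```python
text = 'Dr. Smith studies nutrition. She likes kefir.'

# Naive approach: treat every '. ' as a sentence boundary
naive_sentences = text.split('. ')
print(naive_sentences)
print(len(naive_sentences))
```

The two sentences come back as three pieces and the separating periods are dropped, which is one reason a library-based splitter is often preferred.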

In [8]:
from spacy.lang.en import English

nlp = English()

# Add a sentencizer pipeline (turning texts into sentences)
nlp.add_pipe('sentencizer')

# Create document instance as an example
doc = nlp('This is a sentence. This is another sentence. I like elephants.')
assert len(list(doc.sents)) == 3

# Print sentences split
list(doc.sents)

[This is a sentence., This is another sentence., I like elephants.]

In [9]:
pages_and_texts[0]

{'page_no': -41,
 'page_char_cnt': 29,
 'page_word_cnt': 4,
 'page_sentence_cnt_raw': 1,
 'page_token_cnt': 7.25,
 'text': 'Human Nutrition: 2020 Edition'}

In [10]:
for item in tqdm(pages_and_texts):
    item['sentences'] = list(nlp(item['text']).sents)

    # Make sure all sentences are string (default type is spaCy data type)
    item['sentences'] = [str(sentence) for sentence in item['sentences']]

    # Count the sentences
    item['page_sentence_cnt_spacy'] = len(item['sentences'])

  0%|          | 0/1208 [00:00<?, ?it/s]

In [11]:
random.sample(pages_and_texts, k=1)

[{'page_no': -13,
  'page_char_cnt': 379,
  'page_word_cnt': 69,
  'page_sentence_cnt_raw': 4,
  'page_token_cnt': 94.75,
  'text': 'Students  Noemi Arceo Caacbay  Noemi Arceo Caacbay is a Masters Student in the Public Health,  Health Policy and Management Program at the University of Hawai‘i  at Mānoa. She enjoys learning about all things health-science  related. She is passionate about returning to her home of Saipan,  CNMI where she will give back and serve her community.  About the Contributors  |  xxix',
  'sentences': ['Students  Noemi Arceo Caacbay  Noemi Arceo Caacbay is a Masters Student in the Public Health,  Health Policy and Management Program at the University of Hawai‘i  at Mānoa.',
   'She enjoys learning about all things health-science  related.',
   'She is passionate about returning to her home of Saipan,  CNMI where she will give back and serve her community.',
   ' About the Contributors  |  xxix'],
  'page_sentence_cnt_spacy': 4}]

In [12]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

Unnamed: 0,page_no,page_char_cnt,page_word_cnt,page_sentence_cnt_raw,page_token_cnt,page_sentence_cnt_spacy
count,1208.0,1208.0,1208.0,1208.0,1208.0,1208.0
mean,562.5,1148.0,198.3,9.97,287.0,10.32
std,348.86,560.38,95.76,6.19,140.1,6.3
min,-41.0,0.0,1.0,1.0,0.0,0.0
25%,260.75,762.0,134.0,4.0,190.5,5.0
50%,562.5,1231.5,214.5,10.0,307.88,10.0
75%,864.25,1603.5,271.0,14.0,400.88,15.0
max,1166.0,2308.0,429.0,32.0,577.0,28.0


### Chunking our sentences together

The concept of splitting larger pieces of text into smaller ones is often referred to as `text splitting` or `chunking`.

There is no single correct way to do this - experiment!

To keep it simple, we will split into groups of 10 sentences.

There are frameworks such as `langchain` which can help with this, but we will use plain Python for now.
- https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/

Why do we do this:
1. So the texts are easier to filter (smaller groups of text can be easier to inspect than large passages of text).
2. So the text chunks can fit into the embedding model's context window (e.g. a limit of 384 tokens).
3. So the context passed to the LLM can be more specific and focused.
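
Since there is no single correct way to chunk, one variation worth experimenting with is letting neighbouring chunks share a few sentences, so that context is not cut off at a chunk boundary. The sketch below is an assumption for illustration only (`split_list_with_overlap` and its `overlap` parameter are hypothetical); the rest of this notebook uses plain, non-overlapping groups of 10.

```python
def split_list_with_overlap(input_list: list,
                            split_size: int = 10,
                            overlap: int = 2) -> list[list]:
    '''Split a list into chunks of split_size where consecutive chunks share `overlap` items.'''
    if not 0 <= overlap < split_size:
        raise ValueError('overlap must be >= 0 and smaller than split_size')

    step = split_size - overlap
    chunks = []
    for start in range(0, len(input_list), step):
        chunks.append(input_list[start : start + split_size])
        # Stop once a chunk has reached the end of the list
        if start + split_size >= len(input_list):
            break
    return chunks

overlap_chunks = split_list_with_overlap(list(range(25)))
for chunk in overlap_chunks:
    print(chunk)
```

The trade-off is more chunks to embed and store in exchange for less context lost at the boundaries.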

In [13]:
# Define split size to turn groups of sentences into chunks
num_sentence_chunk_size = 10

# Create a function to split a list of texts into sublists of at most split_size items
# eg. [20] -> [10, 10] or [25] -> [10, 10, 5]
def split_list(input_list: list[str],
               split_size: int=num_sentence_chunk_size) -> list[list[str]]:
    return [input_list[i : i + split_size] for i in range(0, len(input_list), split_size)]

test_list = list(range(25))
split_list(test_list)

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
 [20, 21, 22, 23, 24]]

In [14]:
# Loop through pages and texts and split sentences into chunks
for item in tqdm(pages_and_texts):
    item['sentence_chunks'] = split_list(input_list=item['sentences'],
                                         split_size=num_sentence_chunk_size)
    item['num_chunks'] = len(item['sentence_chunks'])

  0%|          | 0/1208 [00:00<?, ?it/s]

In [15]:
random.sample(pages_and_texts, k=1)

[{'page_no': 71,
  'page_char_cnt': 958,
  'page_word_cnt': 150,
  'page_sentence_cnt_raw': 11,
  'page_token_cnt': 239.5,
  'text': 'consensus that probiotics ward off viral-induced diarrhea  and reduce the symptoms of lactose intolerance.1  Expert nutritionists agree that more health benefits of  pre- and probiotics will likely reach scientific consensus. As  the fields of pre- and probiotic manufacturing and their  clinical study progress, more information on proper dosing  and what exact strains of bacteria are potentially “friendly”  will become available.  You may be interested in trying some of these foods in  your diet. A simple food to try is kefir. Several websites  provide good recipes, including http://www.kefir.net/ recipes.htm.  Kefir, a dairy product fermented with probiotic bacteria,  can make a pleasant tasting milkshake.  \xa0 Figure 2.5 The Human Digestive System  1.\xa0Farnworth ER. (2008). The Evidence to Support Health  Claims for Probiotics. Journal of Nutrition,

In [16]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

Unnamed: 0,page_no,page_char_cnt,page_word_cnt,page_sentence_cnt_raw,page_token_cnt,page_sentence_cnt_spacy,num_chunks
count,1208.0,1208.0,1208.0,1208.0,1208.0,1208.0,1208.0
mean,562.5,1148.0,198.3,9.97,287.0,10.32,1.53
std,348.86,560.38,95.76,6.19,140.1,6.3,0.64
min,-41.0,0.0,1.0,1.0,0.0,0.0,0.0
25%,260.75,762.0,134.0,4.0,190.5,5.0,1.0
50%,562.5,1231.5,214.5,10.0,307.88,10.0,1.0
75%,864.25,1603.5,271.0,14.0,400.88,15.0,2.0
max,1166.0,2308.0,429.0,32.0,577.0,28.0,3.0


### Splitting each chunk into its own item

We'd like to embed each chunk of sentences into its own numerical representation.

That'll give us a good level of granularity.

Meaning, we can dive specifically into the text sample that was used in our model.

In [17]:
import re

# Split each chunk into its own item
pages_and_chunks = []
for item in tqdm(pages_and_texts):
    for sentence_chunk in item['sentence_chunks']:
        chunk_dict = {}
        chunk_dict['page_no'] = item['page_no']

        # Join the list of sentences into one paragraph-like string
        joined_sentence_chunk = ''.join(sentence_chunk).replace('  ', ' ').strip()
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk) # '.A' -> '. A'

        chunk_dict['sentence_chunk'] = joined_sentence_chunk

        # get some stats on the chunks
        chunk_dict['chunk_char_count'] = len(joined_sentence_chunk)
        chunk_dict['chunk_word_count'] = len(joined_sentence_chunk.split(' '))
        chunk_dict['chunk_token_count'] = len(joined_sentence_chunk) / 4 # 1 token = ~4 chars

        pages_and_chunks.append(chunk_dict)

len(pages_and_chunks)

  0%|          | 0/1208 [00:00<?, ?it/s]

1843
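
The join-and-repair step inside the loop can be seen in isolation. Joining with an empty string glues sentences together without a space, and the regular expression then re-inserts a space after any full stop that is directly followed by a capital letter. The sentences below are made up for the example.

```python
import re

sentence_chunk = ['Kefir is a dairy product.', 'It is fermented.', 'Try it in a milkshake.']

# Join without a separator, collapse double spaces, trim the ends
joined = ''.join(sentence_chunk).replace('  ', ' ').strip()
print(joined)

# Re-insert the missing space between sentences: '.A' -> '. A'
repaired = re.sub(r'\.([A-Z])', r'. \1', joined)
print(repaired)
```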

In [18]:
random.sample(pages_and_chunks, k=1)

[{'page_no': 814,
  'sentence_chunk': 'of eight weeks. The anterior fontanel closes about a year later, at eighteen months on average. Developmental milestones include sitting up without support, learning to walk, teething, and vocalizing among many, many others. All of these changes require adequate nutrition to ensure development at the appropriate rate.8 Healthy infants grow steadily, but not always at an even pace. For example, during the first year of life, height increases by 50 percent, while weight triples. Physicians and other health professionals use growth charts to track a baby’s development process. Because infants cannot stand, length is used instead of height to determine the rate of a child’s growth. Other important developmental measurements include head circumference and weight. All of these must be tracked and compared against standard measurements for an infant’s age. In the US, for infants and toddlers from birth to 24 months of age, the WHO growth charts are used 

In [19]:
df = pd.DataFrame(pages_and_chunks)
df.describe().round(2)

Unnamed: 0,page_no,chunk_char_count,chunk_word_count,chunk_token_count
count,1843.0,1843.0,1843.0,1843.0
mean,583.38,734.44,112.33,183.61
std,347.79,447.54,71.22,111.89
min,-41.0,12.0,3.0,3.0
25%,280.5,315.0,44.0,78.75
50%,586.0,746.0,114.0,186.5
75%,890.0,1118.5,173.0,279.62
max,1166.0,1831.0,297.0,457.75


### Filtering out short chunks of text

Very short chunks may not contain much useful information.

In [20]:
# Show random chunks with at most 30 tokens in length
min_token_len = 30

for row in df[df['chunk_token_count'] <= min_token_len].sample(5).iterrows():
    print(f'Chunk token count: {row[1]["chunk_token_count"]} | Text: {row[1]["sentence_chunk"]}')

Chunk token count: 16.5 | Text: PART X CHAPTER 10. MAJOR MINERALS Chapter 10. Major Minerals | 607
Chunk token count: 3.5 | Text: Fluoride | 697
Chunk token count: 26.0 | Text: http://www.ncbi.nlm.nih.gov/pubmed/20182023. Accessed September 22, 2017. 220 | Popular Beverage Choices
Chunk token count: 28.75 | Text: Image by FDA/ Changes to the Nutrition Facts Label Figure 12.5 Food Serving Sizes 728 | Discovering Nutrition Facts
Chunk token count: 19.5 | Text: 2009). Dietary Glycemic Index: Digestion and Absorption of Carbohydrates | 247


In [21]:
# Keep only the rows with more than 30 tokens
pages_and_chunks_over_min_token_len = df[df['chunk_token_count'] > min_token_len].to_dict(orient='records')
pages_and_chunks_over_min_token_len[:2]

[{'page_no': -39,
  'sentence_chunk': 'Human Nutrition: 2020 Edition UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN NUTRITION PROGRAM ALAN TITCHENAL, SKYLAR HARA, NOEMI ARCEO CAACBAY, WILLIAM MEINKE-LAU, YA-YUN YANG, MARIE KAINOA FIALKOWSKI REVILLA, JENNIFER DRAPER, GEMADY LANGFELDER, CHERYL GIBBY, CHYNA NICOLE CHUN, AND ALLISON CALABRESE',
  'chunk_char_count': 308,
  'chunk_word_count': 42,
  'chunk_token_count': 77.0},
 {'page_no': -38,
  'sentence_chunk': 'Human Nutrition: 2020 Edition by University of Hawai‘i at Mānoa Food Science and Human Nutrition Program is licensed under a Creative Commons Attribution 4.0 International License, except where otherwise noted.',
  'chunk_char_count': 210,
  'chunk_word_count': 30,
  'chunk_token_count': 52.5}]

In [22]:
random.sample(pages_and_chunks_over_min_token_len, k=1)

[{'page_no': 888,
  'sentence_chunk': 'Learning Activities Technology Note: The second edition of the Human Nutrition Open Educational Resource (OER) textbook features interactive learning activities.\xa0 These activities are available in the web-based textbook and not available in the downloadable versions (EPUB, Digital PDF, Print_PDF, or Open Document). Learning activities may be used across various mobile devices, however, for the best user experience it is strongly recommended that users complete these activities using a desktop or laptop computer and in Google Chrome. \xa0 An interactive or media element has been excluded from this version of the text. You can view it online here: http://pressbooks.oer.hawaii.edu/ humannutrition2/?p=469 \xa0 888 | Adolescence',
  'chunk_char_count': 727,
  'chunk_word_count': 103,
  'chunk_token_count': 181.75}]

### Embedding Text Chunks

Embeddings are a broad but powerful concept.

While humans understand text, machines understand numbers.

What we'd like to do:
- Turn our text chunks into numbers, specifically embeddings.

An embedding is a useful numerical representation.

The best part about embeddings is that they are a *learned* representation.

For example, a toy mapping might look like this (in reality, embeddings are very high-dimensional):
```
'a': 0
'the': 1
```

In [23]:
!pip install sentence-transformers

Collecting sentence-transformers
  Downloading sentence_transformers-3.0.1-py3-none-any.whl.metadata (10 kB)
Downloading sentence_transformers-3.0.1-py3-none-any.whl (227 kB)
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-3.0.1


In [24]:
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer(model_name_or_path='all-mpnet-base-v2',
                                      device='cpu')

# create a list of sentences
sentences = ['The Sentence Transformer libary provides an easy way to create embeddings.',
             'Sentences can be embedded one by one or in a list.',
             'I like horses!']

# sentences are encoded / embedded by calling model.encode()
embeddings = embedding_model.encode(sentences)
embeddings_dict = dict(zip(sentences, embeddings))

# see the embeddings
for sentence, embedding in embeddings_dict.items():
    print(f'Sentence: {sentence}')
    print(f'Embedding: {embedding}')
    print()


Sentence: The Sentence Transformer libary provides an easy way to create embeddings.
Embedding: [-2.99810041e-02  2.60943677e-02 -2.29293108e-02  6.35951087e-02
 -1.93734914e-02 -4.49133106e-03  9.92148370e-03 -4.67883199e-02
  1.22560589e-02 -2.75072306e-02  2.65685376e-02  5.37345894e-02
 -3.85486335e-02  1.32278493e-02  4.85129729e-02 -4.82199155e-02
  4.89330776e-02  1.24478452e-02 -3.11446693e-02 -2.84732872e-04
  3.22389007e-02  2.24109069e-02  2.44735926e-02  4.07998487e-02
 -1.42253265e-02 -1.04100816e-02  9.76728392e-04 -4.08065096e-02
  4.98060323e-02 -6.61039073e-03 -3.11634112e-02 -9.80593637e-03
  5.56001887e-02  1.03648228e-03  1.02035688e-06  5.70027344e-03
 -3.94802354e-02 -6.44749170e-03  1.08795492e-02 -4.85746795e-03
  4.14262228e-02 -6.11538552e-02  1.98641513e-02  5.36945611e-02
 -4.52734940e-02 -1.35530392e-02  4.97607291e-02  1.83713101e-02
  9.01330784e-02  5.36868535e-02 -2.36761309e-02 -4.49780822e-02
  7.29141803e-03 -2.20344625e-02 -1.63788702e-02  2.3345733

In [25]:
embeddings[0].shape

(768,)

In [26]:
embedding = embedding_model.encode('My favourite animal is the cow!')
embedding

array([-1.45473834e-02,  7.66726956e-02, -2.85872258e-02, -3.31283063e-02,
        3.65210213e-02,  4.78570424e-02, -7.08107948e-02,  1.62834004e-02,
        1.93443689e-02, -2.80482266e-02, -2.91747209e-02,  5.11309654e-02,
       -3.28720324e-02, -8.98755714e-03, -1.03672966e-02, -3.15488502e-02,
        4.22783755e-02, -9.13285278e-03, -1.94017198e-02,  4.35689613e-02,
       -2.31998134e-02,  4.29883078e-02, -1.72393341e-02, -2.01372430e-02,
       -3.13574113e-02,  8.08165129e-03, -2.06725020e-02, -2.27869749e-02,
        2.44812425e-02,  1.71968192e-02, -6.26672879e-02, -7.54797533e-02,
        3.57421599e-02, -5.46570029e-03,  1.24730320e-06, -7.63198826e-03,
       -3.53221968e-02,  1.91327017e-02,  3.99045721e-02,  2.11737561e-03,
        1.64565910e-02,  9.84057318e-03, -1.80701055e-02,  9.33837332e-03,
        3.23482789e-02,  5.84785417e-02,  4.23187092e-02,  1.62091255e-02,
       -9.14910734e-02,  1.82305351e-02, -5.25730150e-03, -7.81022478e-03,
       -3.47644649e-02, -

In [27]:
# %%time

# embedding_model.to('cpu')

# # embed each chunk one by one
# for item in tqdm(pages_and_chunks_over_min_token_len):
#     item['embedding'] = embedding_model.encode(item['sentence_chunk'])

In [28]:
%%time

embedding_model.to('cuda') # requires a CUDA-capable GPU

# embed each chunk one by one
for item in tqdm(pages_and_chunks_over_min_token_len):
    item['embedding'] = embedding_model.encode(item['sentence_chunk'])

  0%|          | 0/1680 [00:00<?, ?it/s]

CPU times: user 33.5 s, sys: 535 ms, total: 34 s
Wall time: 41.4 s


In [29]:
%%time

text_chunks = [item['sentence_chunk'] for item in pages_and_chunks_over_min_token_len]
text_chunks[419]

CPU times: user 577 µs, sys: 0 ns, total: 577 µs
Wall time: 577 µs


'often. • Calm your “sweet tooth” by eating fruits, such as berries or an apple. • Replace sugary soft drinks with seltzer water, tea, or a small amount of 100 percent fruit juice added to water or soda water. The Food Industry: Functional Attributes of Carbohydrates and the Use of Sugar Substitutes In the food industry, both fast-releasing and slow-releasing carbohydrates are utilized to give foods a wide spectrum of functional attributes, including increased sweetness, viscosity, bulk, coating ability, solubility, consistency, texture, body, and browning capacity. The differences in chemical structure between the different carbohydrates confer their varied functional uses in foods. Starches, gums, and pectins are used as thickening agents in making jam, cakes, cookies, noodles, canned products, imitation cheeses, and a variety of other foods. Molecular gastronomists use slow- releasing carbohydrates, such as alginate, to give shape and texture to their fascinating food creations. Add

In [30]:
len(text_chunks)

1680

In [31]:
%%time

# embed all texts in batches
text_chunk_embeddings = embedding_model.encode(text_chunks,
                                               batch_size=32,
                                               convert_to_tensor=True)
text_chunk_embeddings

CPU times: user 23.3 s, sys: 45 ms, total: 23.4 s
Wall time: 23 s


tensor([[ 0.0674,  0.0902, -0.0051,  ..., -0.0221, -0.0232,  0.0126],
        [ 0.0552,  0.0592, -0.0166,  ..., -0.0120, -0.0103,  0.0227],
        [ 0.0280,  0.0340, -0.0206,  ..., -0.0054,  0.0213,  0.0313],
        ...,
        [ 0.0771,  0.0098, -0.0122,  ..., -0.0409, -0.0752, -0.0241],
        [ 0.1030, -0.0165,  0.0083,  ..., -0.0574, -0.0283, -0.0295],
        [ 0.0864, -0.0125, -0.0113,  ..., -0.0522, -0.0337, -0.0299]],
       device='cuda:0')

### Save embeddings to file

In [32]:
pages_and_chunks_over_min_token_len[419]

{'page_no': 277,
 'sentence_chunk': 'often. • Calm your “sweet tooth” by eating fruits, such as berries or an apple. • Replace sugary soft drinks with seltzer water, tea, or a small amount of 100 percent fruit juice added to water or soda water. The Food Industry: Functional Attributes of Carbohydrates and the Use of Sugar Substitutes In the food industry, both fast-releasing and slow-releasing carbohydrates are utilized to give foods a wide spectrum of functional attributes, including increased sweetness, viscosity, bulk, coating ability, solubility, consistency, texture, body, and browning capacity. The differences in chemical structure between the different carbohydrates confer their varied functional uses in foods. Starches, gums, and pectins are used as thickening agents in making jam, cakes, cookies, noodles, canned products, imitation cheeses, and a variety of other foods. Molecular gastronomists use slow- releasing carbohydrates, such as alginate, to give shape and texture to t

In [33]:
# save embeddings to file
text_chunks_and_embeddings_df = pd.DataFrame(pages_and_chunks_over_min_token_len)
embeddings_df_save_path = 'text_chunks_and_embeddings_df.csv'
text_chunks_and_embeddings_df.to_csv(embeddings_df_save_path, index=False)

In [34]:
# import saved file and view
text_chunks_and_embeddings_df_load = pd.read_csv(embeddings_df_save_path)
text_chunks_and_embeddings_df_load.head()

Unnamed: 0,page_no,sentence_chunk,chunk_char_count,chunk_word_count,chunk_token_count,embedding
0,-39,Human Nutrition: 2020 Edition UNIVERSITY OF HA...,308,42,77.0,[ 6.74242675e-02 9.02281404e-02 -5.09548886e-...
1,-38,Human Nutrition: 2020 Edition by University of...,210,30,52.5,[ 5.52156419e-02 5.92139773e-02 -1.66167244e-...
2,-37,Contents Preface University of Hawai‘i at Māno...,766,114,191.5,[ 2.79801842e-02 3.39813754e-02 -2.06426680e-...
3,-36,Lifestyles and Nutrition University of Hawai‘i...,941,142,235.25,[ 6.82566911e-02 3.81275006e-02 -8.46854132e-...
4,-35,The Cardiovascular System University of Hawai‘...,998,152,249.5,[ 3.30264494e-02 -8.49763490e-03 9.57159605e-...


If your embedding database is really large (eg. over 100k-1M samples), you might want to look into using a vector database for storage.
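
One thing to keep in mind with CSV is that each embedding is written out as text, so it needs to be parsed back into an array after loading. A minimal sketch of that round trip follows, using a made-up 4-dimensional vector and an in-memory buffer in place of a file. The parsing here uses `str.split`, which is just one of several ways to do it.

```python
import io

import numpy as np
import pandas as pd

# Made-up 4-dimensional "embedding", purely for illustration
toy_df = pd.DataFrame({'sentence_chunk': ['example chunk'],
                       'embedding': [np.array([0.5, -0.25, 0.125, 1.0])]})

# Write to an in-memory CSV and read it back
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)
buffer.seek(0)
loaded_df = pd.read_csv(buffer)

# The embedding is now a string such as '[ 0.5 -0.25 ...]', not an array
raw_value = loaded_df['embedding'][0]
print(type(raw_value))

# Parse the string back into a float array
parsed = np.array(raw_value.strip('[]').split(), dtype=float)
print(parsed)
```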

## RAG - Search & Answer

RAG goal: Retrieve relevant passages based on a query and use those passages to augment an input to an LLM so it can generate output based on those relevant passages.

### Similarity Search

Embeddings can be used for almost any type of data.

For example, you can turn images into embeddings, sound into embeddings, text into embeddings, etc...

Comparing embeddings is known as similarity search, vector search or semantic search.

In our case, we want to query our nutrition textbook passages based on semantics or *vibe*.

So, if we search for 'macronutrient functions', passages relevant to that text should be returned, even if they don't contain exactly the words 'macronutrient functions'.

Whereas with keyword search, if we search for 'apple', only passages that specifically contain 'apple' are returned.
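
One common way to compare embeddings is cosine similarity: the closer the score is to 1, the more the two vectors point in the same direction. A minimal sketch with NumPy follows. The 3-dimensional vectors are made up for illustration (the embeddings created earlier have 768 dimensions), and `cosine_similarity` and `top_k_similar` are hypothetical helper names.

```python
import numpy as np

def cosine_similarity(query: np.ndarray, matrix: np.ndarray) -> np.ndarray:
    '''Cosine similarity between one query vector and each row of a matrix.'''
    query_unit = query / np.linalg.norm(query)
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix_unit @ query_unit

def top_k_similar(query: np.ndarray, matrix: np.ndarray, k: int = 2) -> np.ndarray:
    '''Indices of the k rows most similar to the query, best match first.'''
    scores = cosine_similarity(query, matrix)
    return np.argsort(scores)[::-1][:k]

# Made-up "embeddings" for three passages
passage_embeddings = np.array([[1.0, 0.0, 0.0],
                               [0.9, 0.1, 0.0],
                               [0.0, 0.0, 1.0]])
query_embedding = np.array([1.0, 0.0, 0.0])

scores = cosine_similarity(query_embedding, passage_embeddings)
top_indices = top_k_similar(query_embedding, passage_embeddings)
print(scores)
print(top_indices)
```

The same idea applies to real embeddings: embed the query with the same model as the passages, score it against every stored embedding, and keep the highest-scoring chunks.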

In [42]:
import random
import numpy as np
import pandas as pd
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# import texts and embedding df
text_chunks_and_embeddings_df = pd.read_csv('text_chunks_and_embeddings_df.csv')

# convert embedding column back to np.array (it got converted to string when it was saved to csv)
text_chunks_and_embeddings_df['embedding'] = text_chunks_and_embeddings_df['embedding'].apply(lambda x: np.fromstring(x.strip('[]'), sep=' '))

# convert texts and embedding df to list of dicts
pages_and_chunks = text_chunks_and_embeddings_df.to_dict(orient='records')

text_chunks_and_embeddings_df.head()

Unnamed: 0,page_no,sentence_chunk,chunk_char_count,chunk_word_count,chunk_token_count,embedding
0,-39,Human Nutrition: 2020 Edition UNIVERSITY OF HA...,308,42,77.0,"[0.0674242675, 0.0902281404, -0.00509548886, -..."
1,-38,Human Nutrition: 2020 Edition by University of...,210,30,52.5,"[0.0552156419, 0.0592139773, -0.0166167244, -0..."
2,-37,Contents Preface University of Hawai‘i at Māno...,766,114,191.5,"[0.0279801842, 0.0339813754, -0.020642668, 0.0..."
3,-36,Lifestyles and Nutrition University of Hawai‘i...,941,142,235.25,"[0.0682566911, 0.0381275006, -0.00846854132, -..."
4,-35,The Cardiovascular System University of Hawai‘...,998,152,249.5,"[0.0330264494, -0.0084976349, 0.00957159605, -..."


In [43]:
text_chunks_and_embeddings_df['embedding']

Unnamed: 0,embedding
0,"[0.0674242675, 0.0902281404, -0.00509548886, -..."
1,"[0.0552156419, 0.0592139773, -0.0166167244, -0..."
2,"[0.0279801842, 0.0339813754, -0.020642668, 0.0..."
3,"[0.0682566911, 0.0381275006, -0.00846854132, -..."
4,"[0.0330264494, -0.0084976349, 0.00957159605, -..."
...,...
1675,"[0.0185622536, -0.0164277665, -0.0127045633, -..."
1676,"[0.0334720612, -0.0570440851, 0.0151489386, -0..."
1677,"[0.0770515501, 0.00978557579, -0.0121817412, 0..."
1678,"[0.103045158, -0.0164701864, 0.00826846063, 0...."


In [45]:
embeddings = np.stack(text_chunks_and_embeddings_df['embedding'].tolist(), axis=0)
embeddings[:10]

array([[ 6.74242675e-02,  9.02281404e-02, -5.09548886e-03, ...,
        -2.21155025e-02, -2.32136492e-02,  1.25690866e-02],
       [ 5.52156419e-02,  5.92139773e-02, -1.66167244e-02, ...,
        -1.20406421e-02, -1.02847274e-02,  2.27396358e-02],
       [ 2.79801842e-02,  3.39813754e-02, -2.06426680e-02, ...,
        -5.36187319e-03,  2.12560110e-02,  3.13055031e-02],
       ...,
       [ 5.77196702e-02,  4.03853692e-02,  3.68254795e-03, ...,
        -1.42200831e-02, -4.67004674e-03, -9.98311117e-03],
       [ 6.42098412e-02,  2.41014995e-02, -2.16656341e-03, ...,
        -2.43524034e-02, -6.30805080e-05, -1.12714572e-02],
       [ 6.53467849e-02,  1.77536588e-02,  4.34765406e-03, ...,
        -4.22488786e-02,  6.30401541e-04,  1.44888656e-02]])