In [7]:
#Create and run a RAG pipleline from scratch

## What we're going to build

* https://github.com/mrdbourke/simple-local-rag
* https://whimsical.com/simple-local-rag-workflow-39kToR3yNf7E8kY4sS2tjV

We're going to build NutriChat to "chat with a nutrition textbook".

Specifically:

1. Open a PDF document (you could use almost any PDF here or even a collection of PDFs).
2. Format the text of the PDF textbook ready for an embedding model.
3. Embed all of the chunks of text in the textbook and turn them into numerical reprentations (embedding) which can store for later.
4. Build a retrieval system that uses vector search to find relevant chunk of text based on a query.
5. Create a prompt that incorporates the retrieved pieces of text.
6. Generate an answer to a query based on the passages of the textbook with an LLM.

All locally!

1. Steps 1-3: Document preprocessing and embedding creation.
2. Steps 4-6: Search and answer.

In [8]:
#Import a pdf as the data source

In [9]:
import os
import requests

pdf_path = 'human-nutrition-text.pdf'

# Download PDF
if not os.path.exists(pdf_path):
    print("[INFO] File doesn't exist, downloading...")

    # Enter the URL of the PDF
    url = "https://pressbooks.oer.hawaii.edu/humannutrition2/open/download?type=pdf"

    # The local filename to save the downloaded file
    filename = pdf_path

    # Send a GET request to the URL
    response = requests.get(url)

    # Check if the request was successful
    if response.status_code == 200:
        # Open the file and save it
        with open(filename, "wb") as file:
            file.write(response.content)
        print(f"[INFO] The file has been download and saved as {filename}")
    else:
        print(f"[INFO] Failed to download the file. Status code: {reponse.status_code}")

else:
    print(f"File {pdf_path} exists.")

File human-nutrition-text.pdf exists.


In [10]:
import fitz # requires: !pip install PyMuPDF, see: https://github.com/pymupdf/PyMuPDF
from tqdm.auto import tqdm # pip install tqdm

def text_formatter(text: str) -> str:
    """Performs minor formatting on text."""
    cleaned_text = text.replace("\n", " ").strip()

    # Potentially more text formatting functions can go here
    return cleaned_text

def open_and_read_pdf(pdf_path: str) -> list[dict]:
    doc = fitz.open(pdf_path)
    pages_and_texts = []
    for page_number, page in tqdm(enumerate(doc)):
        text = page.get_text()
        text = text_formatter(text=text)
        pages_and_texts.append({"page_number": page_number - 41,
                                "page_char_count": len(text),
                                "page_word_count": len(text.split(" ")),
                                "page_setence_count_raw": len(text.split(". ")),
                                "page_token_count": len(text) / 4, # 1 token = ~4 characters
                                "text": text})
    return pages_and_texts

pages_and_texts = open_and_read_pdf(pdf_path=pdf_path)
pages_and_texts[:2]

1208it [00:01, 660.94it/s]


[{'page_number': -41,
  'page_char_count': 29,
  'page_word_count': 4,
  'page_setence_count_raw': 1,
  'page_token_count': 7.25,
  'text': 'Human Nutrition: 2020 Edition'},
 {'page_number': -40,
  'page_char_count': 0,
  'page_word_count': 1,
  'page_setence_count_raw': 1,
  'page_token_count': 0.0,
  'text': ''}]

In [11]:
import random
random.sample(pages_and_texts,2)

[{'page_number': 166,
  'page_char_count': 1783,
  'page_word_count': 315,
  'page_setence_count_raw': 21,
  'page_token_count': 445.75,
  'text': 'Thirst Mechanism: Why Do We Drink?  Thirst is an osmoregulatory mechanism to increase water input.  The thirst mechanism is activated in response to changes in water  volume in the blood, but is even more sensitive to changes in blood  osmolality. Blood osmolality is primarily driven by the concentration  of sodium cations. The urge to drink results from a complex  interplay of hormones and neuronal responses that coordinate to  increase water input and contribute toward fluid balance and  composition in the body. The “thirst center” is contained within  the hypothalamus, a portion of the brain that lies just above the  brainstem. In older people the thirst mechanism is not as responsive  and as we age there is a higher risk for dehydration. Thirst happens  in the following sequence of physiological events:  1. Receptor proteins in the kidn

In [12]:
import pandas as pd
df = pd.DataFrame(pages_and_texts)
df.head()

Unnamed: 0,page_number,page_char_count,page_word_count,page_setence_count_raw,page_token_count,text
0,-41,29,4,1,7.25,Human Nutrition: 2020 Edition
1,-40,0,1,1,0.0,
2,-39,320,54,1,80.0,Human Nutrition: 2020 Edition UNIVERSITY OF ...
3,-38,212,32,1,53.0,Human Nutrition: 2020 Edition by University of...
4,-37,797,145,2,199.25,Contents Preface University of Hawai‘i at Mā...


In [14]:
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_setence_count_raw,page_token_count
count,1208.0,1208.0,1208.0,1208.0,1208.0
mean,562.5,1148.59,198.89,9.97,287.15
std,348.86,560.44,95.75,6.19,140.11
min,-41.0,0.0,1.0,1.0,0.0
25%,260.75,762.75,134.0,4.0,190.69
50%,562.5,1232.5,215.0,10.0,308.12
75%,864.25,1605.25,271.25,14.0,401.31
max,1166.0,2308.0,429.0,32.0,577.0


Why would we care about token count?

Token count is important to think about because:
1. Embedding models don't deal with infinite tokens.
2. LLMs don't deal with infinite tokens.

For example an embedding model may have been trained to embed sequences of 384 tokens into numerical space (sentence-transformers `all-mpnet-base-v2`, see: https://www.sbert.net/docs/pretrained_models.html).

As for LLMs, they can't accept infinite tokens in their context window, plus it would be cost ineffective to send 100,000s of tokens to an LLM every time.

We want the tokens we send to an LLM to valuable tokens.

In [15]:
#split text into chunks of ~10 sentences using nltk and spacy
from spacy.lang.en import English

nlp = English()

#add a sentencizer pipeline
nlp.add_pipe('sentencizer')

doc = nlp("This is a sentence. This is another. Welcome to RAG.")

assert len(list(doc.sents))==3
print(list(doc.sents))

[This is a sentence., This is another., Welcome to RAG.]


In [16]:
for item in tqdm(pages_and_texts):
  #make sure all sentences are strings instead of spacy datatypes
  item['sentences'] = [str(sentence) for sentence in list(nlp(item['text']).sents)]
  #count the sentences
  item['page_sentence_count_spacy'] = len(item['sentences'])

100%|██████████| 1208/1208 [00:01<00:00, 775.67it/s]


In [17]:
import random
random.seed(14)
random.sample(pages_and_texts, 1)

[{'page_number': 177,
  'page_char_count': 1915,
  'page_word_count': 323,
  'page_setence_count_raw': 23,
  'page_token_count': 478.75,
  'text': 'Sodium Imbalances  Sweating is a homeostatic mechanism for maintaining body  temperature, which influences fluid and electrolyte balance. Sweat  is mostly water but also contains some electrolytes, mostly sodium  and chloride. Under normal environmental conditions (i.e., not hot,  humid days) water and sodium loss through sweat is negligible,  but is highly variable among individuals. It is estimated that sixty  minutes of high-intensity physical activity, like playing a game of  tennis, can produce approximately one liter of sweat; however the  amount of sweat produced is highly dependent on environmental  conditions. A liter of sweat typically contains between 1 and 2 grams  of sodium and therefore exercising for multiple hours can result in a  high amount of sodium loss in some people. Additionally, hard labor  can produce substantial so

In [18]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_setence_count_raw,page_token_count,page_sentence_count_spacy
count,1208.0,1208.0,1208.0,1208.0,1208.0,1208.0
mean,562.5,1148.59,198.89,9.97,287.15,10.32
std,348.86,560.44,95.75,6.19,140.11,6.3
min,-41.0,0.0,1.0,1.0,0.0,0.0
25%,260.75,762.75,134.0,4.0,190.69,5.0
50%,562.5,1232.5,215.0,10.0,308.12,10.0
75%,864.25,1605.25,271.25,14.0,401.31,15.0
max,1166.0,2308.0,429.0,32.0,577.0,28.0


### Chunking our sentences together

The concept of splitting larger pieces of text into smaller ones is often referred to as text splitting or chunking.

There is no 100% correct way to do this.

We'll keep it simple and split into groups of 10 sentences (however, you could also try 5, 7, 8, whatever you like).

There are frameworks such as LangChain which can help with this, however, we'll stick with Python for now: https://python.langchain.com/docs/modules/data_connection/document_transformers/

Why we do this:
1. So our texts are easier to filter (smaller groups of text can be easier to inspect that large passages of text).
2. So our text chunks can fit into our embedding model context window (e.g. 384 tokens as a limit).
3. So our contexts passed to an LLM can be more specific and focused.

In [19]:
# Define split size to turn groups of sentences into chunks
num_sentence_chunk_size = 10

# Create a function to split lists of texts recursively into chunk size
# e.g. [20] -> [10, 10] or [25] -> [10, 10, 5]
def split_list(input_list: list[str],
               slice_size: int=num_sentence_chunk_size) -> list[list[str]]:
    return [input_list[i:i+slice_size] for i in range(0, len(input_list), slice_size)]

test_list = list(range(25))
split_list(test_list)

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
 [20, 21, 22, 23, 24]]

In [20]:
for item in tqdm(pages_and_texts):
  item['sentence_chunks'] = split_list(item['sentences'], slice_size = num_sentence_chunk_size)
  item['num_chunks'] = len(item['sentence_chunks'])
  # del item['chunk_count']




100%|██████████| 1208/1208 [00:00<00:00, 709425.82it/s]


In [21]:
random.sample(pages_and_texts,1)

[{'page_number': 1038,
  'page_char_count': 1156,
  'page_word_count': 209,
  'page_setence_count_raw': 9,
  'page_token_count': 289.0,
  'text': 'to any mold spores hanging in the air. Use plastic wrap to cover  foods that you want to remain moist, such as fresh fruits, vegetables,  and salads. After a meal, do not keep leftovers at room temperature  for more than two hours. They should be refrigerated as promptly  as possible. It is also helpful to date leftovers, so they can be used  within a safe time, which is generally three to five days when stored  in a refrigerator.  Learning Activities  Technology Note: The second edition of the Human  Nutrition Open Educational Resource (OER) textbook  features interactive learning activities.\xa0 These activities are  available in the web-based textbook and not available in the  downloadable versions (EPUB, Digital PDF, Print_PDF, or  Open Document).  Learning activities may be used across various mobile  devices, however, for the best user

In [22]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_setence_count_raw,page_token_count,page_sentence_count_spacy,num_chunks
count,1208.0,1208.0,1208.0,1208.0,1208.0,1208.0,1208.0
mean,562.5,1148.59,198.89,9.97,287.15,10.32,1.53
std,348.86,560.44,95.75,6.19,140.11,6.3,0.64
min,-41.0,0.0,1.0,1.0,0.0,0.0,0.0
25%,260.75,762.75,134.0,4.0,190.69,5.0,1.0
50%,562.5,1232.5,215.0,10.0,308.12,10.0,1.0
75%,864.25,1605.25,271.25,14.0,401.31,15.0,2.0
max,1166.0,2308.0,429.0,32.0,577.0,28.0,3.0


### Splitting each chunk into its own item

We'd like to embed each chunk of sentences into its own numerical representation.

That'll give us a good level of granularity.

Meaning, we can dive specifically into the text sample that was used in our model.

In [23]:
import re

#split each chunk into its own item
pages_and_chunks = []
for item in tqdm(pages_and_texts):
  for sentence_chunk in item['sentence_chunks']:
    chunk_dict = {}
    chunk_dict['page_number'] = item['page_number']
    #join sentences into one paragraph
    joined_sentence_chunk = ''.join(sentence_chunk).replace("  ", " ").strip()
    joined_sentence_chunk = re.sub(r'\.([A-Z])', r'.  \1', joined_sentence_chunk) # ".A" => ". A" (will work for any capital letter)

    chunk_dict['sentence_chunk'] = joined_sentence_chunk
    #get some stats
    chunk_dict['chunk_char_count'] = len(joined_sentence_chunk)
    chunk_dict["chunk_word_count"] = len([word for word in joined_sentence_chunk.split(" ")])
    chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4 # 1 token = ~4 chars

    pages_and_chunks.append(chunk_dict)

len(pages_and_chunks)


  0%|          | 0/1208 [00:00<?, ?it/s]

100%|██████████| 1208/1208 [00:00<00:00, 40337.87it/s]


1843

In [24]:
random.sample(pages_and_chunks,1)

[{'page_number': 310,
  'sentence_chunk': 'an unsaturated fatty acid, can result in different structures for the same fatty acid composition.  When the hydrogen atoms are bonded to the same side of the carbon chain, it is called a cis fatty acid. Because the hydrogen atoms are on the same side, the carbon chain has a bent structure.  Naturally occurring fatty acids usually have a cis configuration. In a trans fatty acid, the hydrogen atoms are attached on opposite sides of the carbon chain.  Unlike cis fatty acids, most trans fatty acids are not found naturally in foods, but are a result of a process called hydrogenation.  Hydrogenation is the process of adding hydrogen to the carbon double bonds, thus making the fatty acid saturated (or less unsaturated, in the case of partial hydrogenation). This is how vegetable oils are converted into semisolid fats for use in the manufacturing process. According to the ongoing Harvard Nurses’ Health Study, trans fatty acids have been associated wi

In [25]:
df = pd.DataFrame(pages_and_chunks)
df.describe().round(2)
#note some chunks have more than 384 chunks so might get cut

Unnamed: 0,page_number,chunk_char_count,chunk_word_count,chunk_token_count
count,1843.0,1843.0,1843.0,1843.0
mean,583.38,738.16,116.05,184.54
std,347.79,449.33,72.99,112.33
min,-41.0,12.0,3.0,3.0
25%,280.5,317.5,46.0,79.38
50%,586.0,749.0,117.0,187.25
75%,890.0,1125.5,178.0,281.38
max,1166.0,1838.0,304.0,459.5


Filter chunks of texts that are too short. These chunks may not contain much useful information

In [26]:
min_token_len = 30
for row in df[df['chunk_token_count']<=min_token_len].sample(5).iterrows():
    print(f'Chunk token count: {row[1]["chunk_token_count"]} | Text: {row[1]["sentence_chunk"]}')#     print(row)

# print(df.head())
# df.describe().round(2)



Chunk token count: 9.75 | Text: Older Adulthood: The Golden Years | 925
Chunk token count: 16.25 | Text: Complementary foods include baby meats, vegetables, Infancy | 837
Chunk token count: 3.5 | Text: 190 | Chloride
Chunk token count: 19.5 | Text: 2009). Dietary Glycemic Index: Digestion and Absorption of Carbohydrates | 247
Chunk token count: 3.25 | Text: 814 | Infancy


In [27]:
pages_and_chunks_over_min_len  = df[df['chunk_token_count']>min_token_len].to_dict(orient='records')
pages_and_chunks_over_min_len[:2]

[{'page_number': -39,
  'sentence_chunk': 'Human Nutrition: 2020 Edition UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN NUTRITION PROGRAM ALAN TITCHENAL, SKYLAR HARA, NOEMI ARCEO CAACBAY, WILLIAM MEINKE-LAU, YA-YUN YANG, MARIE KAINOA FIALKOWSKI REVILLA, JENNIFER DRAPER, GEMADY LANGFELDER, CHERYL GIBBY, CHYNA NICOLE CHUN, AND ALLISON CALABRESE',
  'chunk_char_count': 308,
  'chunk_word_count': 42,
  'chunk_token_count': 77.0},
 {'page_number': -38,
  'sentence_chunk': 'Human Nutrition: 2020 Edition by University of Hawai‘i at Mānoa Food Science and Human Nutrition Program is licensed under a Creative Commons Attribution 4.0 International License, except where otherwise noted.',
  'chunk_char_count': 210,
  'chunk_word_count': 30,
  'chunk_token_count': 52.5}]

In [28]:
random.sample(pages_and_chunks_over_min_len,k=1)

[{'page_number': 370,
  'sentence_chunk': 'Proteins are similar to carbohydrates and lipids in that they are polymers of simple repeating units; however, proteins are much more structurally complex.  In contrast to carbohydrates, which have identical repeating units, proteins are made up of amino acids that are different from one another.  Furthermore, a protein is organized into four different structural levels. Primary: The first level is the one-dimensional sequence of amino acids that are held together by peptide bonds.  Carbohydrates and lipids also are one-dimensional sequences of their respective monomers, which may be branched, coiled, fibrous, or globular, but their conformation is much more random and is not organized by their sequence of monomers. Secondary: The second level of protein structure is dependent on the chemical interactions between amino acids, which cause the protein to fold into a specific shape, such as a helix (like a coiled spring) or sheet. 370 | Defining 

### Embedding our text chunks

While humans understand text, machines understand numbers best.

An [embedding](https://vickiboykis.com/what_are_embeddings/index.html) is a broad concept.

But one of my favourite and simple definitions is "a useful numerical representation".

The most powerful thing about modern embeddings is that they are *learned* representations.

Meaning rather than directly mapping words/tokens/characters to numbers directly (e.g. `{"a": 0, "b": 1, "c": 3...}`), the numerical representation of tokens is learned by going through large corpuses of text and figuring out how different tokens relate to each other.

Ideally, embeddings of text will mean that similar meaning texts have similar numerical representation.

> **Note:** Most modern NLP models deal with "tokens" which can be considered as multiple different sizes and combinations of words and characters rather than always whole words or single characters. For example, the string `"hello world!"` gets mapped to the token values `{15339: b'hello', 1917: b' world', 0: b'!'}` using [Byte pair encoding](https://en.wikipedia.org/wiki/Byte_pair_encoding) (or BPE via OpenAI's [`tiktoken`](https://github.com/openai/tiktoken) library). Google has a tokenization library called [SentencePiece](https://github.com/google/sentencepiece).

Our goal is to turn each of our chunks into a numerical representation (an embedding vector, where a vector is a sequence of numbers arranged in order).

Once our text samples are in embedding vectors, us humans will no longer be able to understand them.

However, we don't need to.

The embedding vectors are for our computers to understand.

We'll use our computers to find patterns in the embeddings and then we can use their text mappings to further our understanding.

Enough talking, how about we import a text embedding model and see what an embedding looks like.

To do so, we'll use the [`sentence-transformers`](https://www.sbert.net/docs/installation.html) library which contains many pre-trained embedding models.

Specifically, we'll get the `all-mpnet-base-v2` model (you can see the model's intended use on the [Hugging Face model card](https://huggingface.co/sentence-transformers/all-mpnet-base-v2#intended-uses)).

In [68]:
from mlx_embedding_models.embedding import EmbeddingModel
model = EmbeddingModel.from_registry("bge-small")
texts = [
    "isn't it nice to be inside such a fancy computer",
    "the horse raced past the barn fell"
]
embs = model.encode(texts)
print(embs.shape)
# 2, 384

100%|██████████| 1/1 [00:00<00:00,  2.10it/s]

(2, 384)





In [69]:
from mlx_embedding_models.embedding import EmbeddingModel
embedding_model = EmbeddingModel.from_registry("bge-base")

# Create a list of sentences to turn into numbers
sentences = [
    "The Sentences Transformers library provides an easy and open-source way to create embeddings.",
    "Sentences can be embedded one by one or as a list of strings.",
    "Embeddings are one of the most powerful concepts in machine learning!",
    "Learn to use embeddings well and you'll be well on your way to being an AI engineer."
]

# Sentences are encoded/embedded by calling model.encode()
embeddings = embedding_model.encode(sentences)
embeddings_dict = dict(zip(sentences, embeddings))

# See the embeddings
for sentence, embedding in embeddings_dict.items():
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

100%|██████████| 1/1 [00:01<00:00,  1.13s/it]

Sentence: The Sentences Transformers library provides an easy and open-source way to create embeddings.
Embedding: [-5.51846577e-03 -6.47100732e-02 -2.27735355e-03  1.53941074e-02
  1.98685471e-02 -1.44866332e-02  3.05495854e-03  2.49807499e-02
 -6.16510073e-03 -7.83707656e-05 -1.02855237e-02  3.63104418e-02
 -6.07884936e-02 -3.88314296e-03 -5.82998283e-02  2.87660267e-02
 -1.36547014e-02 -8.43168143e-03  4.05378174e-03  3.25725749e-02
 -8.57108738e-04  4.44165841e-02  4.00496423e-02  5.43163624e-03
  2.26071905e-02 -2.40620561e-02  5.05977347e-02  3.35783549e-02
 -4.70999628e-02  4.14791554e-02  2.02941857e-02 -9.56162438e-03
 -2.14908570e-02 -3.82919163e-02  6.41154416e-04 -3.39001305e-02
 -7.60734007e-02 -1.43217305e-02 -7.50351744e-03 -1.81966163e-02
 -2.46403296e-03  1.16081443e-02  1.34070190e-02  1.84941385e-02
 -3.62511240e-02  9.65335965e-03 -9.29942951e-02  3.76412459e-02
 -1.14869848e-02 -5.49627952e-02 -7.00905547e-02  3.90855642e-03
 -1.07010836e-02 -6.95637846e-03 -5.3967




In [32]:
# Requires !pip install sentence-transformers
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2", 
                                      device="mps") # choose the device to load the model to (note: GPU will often be *much* faster than CPU)

# Create a list of sentences to turn into numbers
sentences = [
    "The Sentences Transformers library provides an easy and open-source way to create embeddings.",
    "Sentences can be embedded one by one or as a list of strings.",
    "Embeddings are one of the most powerful concepts in machine learning!",
    "Learn to use embeddings well and you'll be well on your way to being an AI engineer."
]

# Sentences are encoded/embedded by calling model.encode()
embeddings = embedding_model.encode(sentences)
embeddings_dict = dict(zip(sentences, embeddings))

# See the embeddings
for sentence, embedding in embeddings_dict.items():
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

Sentence: The Sentences Transformers library provides an easy and open-source way to create embeddings.
Embedding: [-2.07981002e-02  3.03164218e-02 -2.01218333e-02  6.86483607e-02
 -2.55255681e-02 -8.47688783e-03 -2.07074991e-04 -6.32376373e-02
  2.81606670e-02 -3.33353393e-02  3.02634798e-02  5.30720688e-02
 -5.03526963e-02  2.62287669e-02  3.33313756e-02 -4.51578386e-02
  3.63044068e-02 -1.37108692e-03 -1.20170591e-02  1.14947157e-02
  5.04510514e-02  4.70857583e-02  2.11913139e-02  5.14607467e-02
 -2.03746296e-02 -3.58889587e-02 -6.67867833e-04 -2.94392928e-02
  4.95858751e-02 -1.05640218e-02 -1.52013293e-02 -1.31751422e-03
  4.48196568e-02  1.56023316e-02  8.60379942e-07 -1.21393392e-03
 -2.37978380e-02 -9.09428170e-04  7.34483870e-03 -2.53930432e-03
  5.23370281e-02 -4.68042754e-02  1.66215152e-02  4.71578725e-02
 -4.15599868e-02  9.01964260e-04  3.60279046e-02  3.42215002e-02
  9.68226567e-02  5.94828874e-02 -1.64984781e-02 -3.51249427e-02
  5.92513289e-03 -7.07964587e-04 -2.4103

In [37]:
emb = embedding_model.encode(['My name is your problem'])
emb

100%|██████████| 1/1 [00:00<00:00,  4.72it/s]


array([[-3.35747041e-02,  3.89149450e-02, -3.52746770e-02,
        -2.10953727e-02,  1.12882406e-02,  4.10134196e-02,
         8.47852323e-03,  5.36031276e-03, -5.20757353e-03,
        -2.52985582e-02, -6.08957261e-02,  3.88779268e-02,
        -5.53545915e-02,  5.72436452e-02, -4.02058624e-02,
         4.12053689e-02,  2.05153581e-02,  5.14449254e-02,
        -2.78987037e-03,  1.55184930e-02,  1.65427905e-02,
         5.20124435e-02,  3.59962210e-02,  3.40543389e-02,
        -2.30918769e-02, -9.96335968e-03, -3.69936153e-02,
         1.85862053e-02, -9.23186094e-02, -1.88498534e-02,
         4.92023826e-02, -2.74767596e-02,  1.13612283e-02,
        -3.05947028e-02,  2.41323821e-02, -1.39139714e-02,
         3.13020609e-02,  7.14839203e-03, -4.36471635e-03,
         1.18261911e-02, -5.89422463e-03, -1.05523206e-02,
        -8.56208056e-02, -3.97295430e-02, -1.79162845e-02,
        -2.73144487e-02, -3.46093401e-02, -3.17586027e-02,
        -1.88401863e-02, -6.59625083e-02, -4.34057824e-0

In [38]:
pages_and_chunks_over_min_len[0].keys()

dict_keys(['page_number', 'sentence_chunk', 'chunk_char_count', 'chunk_word_count', 'chunk_token_count'])

In [40]:
%%time
# embedding_model.to('cpu')

for item in tqdm(pages_and_chunks_over_min_len):
    item['embedding'] = embedding_model.encode([item['sentence_chunk']])

100%|██████████| 1/1 [00:00<00:00, 28.33it/s]
100%|██████████| 1/1 [00:00<00:00, 72.34it/s]
100%|██████████| 1/1 [00:00<00:00, 56.82it/s]
100%|██████████| 1/1 [00:00<00:00, 63.96it/s]
100%|██████████| 1/1 [00:00<00:00, 64.96it/s]
100%|██████████| 1/1 [00:00<00:00, 61.92it/s]/s]
100%|██████████| 1/1 [00:00<00:00, 17.83it/s]
100%|██████████| 1/1 [00:00<00:00, 63.15it/s]
100%|██████████| 1/1 [00:00<00:00, 70.64it/s]
100%|██████████| 1/1 [00:00<00:00, 69.91it/s]
100%|██████████| 1/1 [00:00<00:00, 71.25it/s]t/s]
100%|██████████| 1/1 [00:00<00:00, 68.01it/s]
100%|██████████| 1/1 [00:00<00:00, 74.06it/s]
100%|██████████| 1/1 [00:00<00:00, 77.16it/s]
100%|██████████| 1/1 [00:00<00:00, 72.45it/s]
100%|██████████| 1/1 [00:00<00:00, 72.28it/s]
100%|██████████| 1/1 [00:00<00:00, 76.70it/s]t/s]
100%|██████████| 1/1 [00:00<00:00, 73.21it/s]
100%|██████████| 1/1 [00:00<00:00, 70.89it/s]
100%|██████████| 1/1 [00:00<00:00, 71.26it/s]
100%|██████████| 1/1 [00:00<00:00, 69.06it/s]
100%|██████████| 1/1 [0

CPU times: user 14.6 s, sys: 7.23 s, total: 21.9 s
Wall time: 30.1 s





In [41]:
text_chunks = [item['sentence_chunk'] for item in pages_and_chunks_over_min_len]
 

In [48]:
len(text_chunks)

1681

In [71]:
%%time

# Embed all texts in batches
text_chunk_embeddings = embedding_model.encode(text_chunks,
                                               batch_size=64, # you can use different batch sizes here for speed/performance, I found 32 works well for this use case
                                               convert_to_tensor=True) # optional to return embeddings as tensor instead of array

text_chunk_embeddings.shape

100%|██████████| 27/27 [00:11<00:00,  2.41it/s]

CPU times: user 1e+03 ms, sys: 4.36 s, total: 5.36 s
Wall time: 11.9 s





(1681, 768)

In [49]:
#save embeddings to file
import pandas as pd
text_chunks_and_embeddings_df = pd.DataFrame(pages_and_chunks_over_min_len)
embeddings_df_save_path = 'text_chunks_and_embeddings_df.csv'
text_chunks_and_embeddings_df.to_csv(embeddings_df_save_path, index=False)


In [50]:
text_chunks_and_embeddings_df_load = pd.read_csv(embeddings_df_save_path)
text_chunks_and_embeddings_df_load.head()

Unnamed: 0,page_number,sentence_chunk,chunk_char_count,chunk_word_count,chunk_token_count,embedding
0,-39,Human Nutrition: 2020 Edition UNIVERSITY OF HA...,308,42,77.0,[[-1.29149691e-03 -1.95020176e-02 -1.70224458e...
1,-38,Human Nutrition: 2020 Edition by University of...,210,30,52.5,[[ 3.40403058e-02 2.80001163e-02 -4.80141714e...
2,-37,Contents Preface University of Hawai‘i at Māno...,767,115,191.75,[[ 4.99454373e-03 1.81751139e-02 -8.38201120e...
3,-36,Lifestyles and Nutrition University of Hawai‘i...,942,143,235.5,[[ 1.48556866e-02 3.00708711e-02 -9.68560353e...
4,-35,The Cardiovascular System University of Hawai‘...,998,152,249.5,[[ 2.62509473e-02 4.04834226e-02 -8.41170847e...


### Chunking and embedding questions

> **Which embedding model should I use?**

This depends on many factors. My best advice is to experiment, experiment, experiment! 

If you want the model to run locally, you'll have to make sure it's feasible to run on your own hardware. 

A good place to see how different models perform on a wide range of embedding tasks is the [Hugging Face Massive Text Embedding Benchmark (MTEB) Leaderboard](https://huggingface.co/spaces/mteb/leaderboard).

> **What other forms of text chunking/splitting are there?**

There are a fair few options here too. We've kept it simple with groups of sentences.

For more, [Pinecone has a great guide on different kinds of chunking](https://www.pinecone.io/learn/chunking-strategies/) including for different kinds of data such as markdown and LaTeX.

Libraries such as [LangChain also have a good amount of in-built text splitting options](https://python.langchain.com/docs/modules/data_connection/document_transformers/).

> **What should I think about when creating my embeddings?**

Our model turns text inputs up to 384 tokens long in embedding vectors of size 768.

Generally, the larger the vector size, the more information that gets encoded into the embedding (however, this is not always the case, as smaller, better models can outperform larger ones).

Though with larger vector sizes comes larger storage and compute requirements.

Our model is also relatively small (420MB) in size compared to larger models that are available.

Larger models may result in better performance but will also require more compute.

So some things to think about:
* Size of input - If you need to embed longer sequences, choose a model with a larger input capacity.
* Size of embedding vector - Larger is generally a better representation but requires more compute/storage.
* Size of model - Larger models generally result in better embeddings but require more compute power/time to run.
* Open or closed - Open models allow you to run them on your own hardware whereas closed models can be easier to setup but require an API call to get embeddings.

> **Where should I store my embeddings?**

If you've got a relatively small dataset, for example, under 100,000 examples (this number is rough and only based on first hand experience), `np.array` or `torch.tensor` can work just fine as your dataset.

But if you've got a production system and want to work with 100,000+ embeddings, you may want to look into a [vector database]( https://en.wikipedia.org/wiki/Vector_database) (these have become very popular lately and there are many offerings).

### Document Ingestion and Embedding Creation Extensions

One major extension to the workflow above would to functionize it.

Or turn it into a script.

As in, take all the functionality we've created and package it into a single process (e.g. go from document -> embeddings file).

So you could input a document on one end and have embeddings come out the other end. The hardest part of this is knowing what kind of preprocessing your text may need before it's turned into embeddings. Cleaner text generally means better results.


## 2. RAG - Search and Answer

We discussed RAG briefly in the beginning but let's quickly recap.

RAG stands for Retrieval Augmented Generation.

Which is another way of saying "given a query, search for relevant resources and answer based on those resources".

Let's breakdown each step:
* **Retrieval** - Get relevant resources given a query. For example, if the query is "what are the macronutrients?" the ideal results will contain information about protein, carbohydrates and fats (and possibly alcohol) rather than information about which tractors are the best for farming (though that is also cool information).
* **Augmentation** - LLMs are capable of generating text given a prompt. However, this generated text is designed to *look* right. And it often has some correct information, however, they are prone to hallucination (generating a result that *looks* like legit text but is factually wrong). In augmentation, we pass relevant information into the prompt and get an LLM to use that relevant information as the basis of its generation.
* **Generation** - This is where the LLM will generate a response that has been flavoured/augmented with the retrieved resources. In turn, this not only gives us a potentially more correct answer, it also gives us resources to investigate more (since we know which resources went into the prompt).

The whole idea of RAG is to get an LLM to be more factually correct based on your own input as well as have a reference to where the generated output may have come from.

This is an incredibly helpful tool.

Let's say you had 1000s of customer support documents.

You could use RAG to generate direct answers to questions with links to relevant documentation.

Or you were an insurance company with large chains of claims emails.

You could use RAG to answer questions about the emails with sources.

One helpful analogy is to think of LLMs as calculators for words.

With good inputs, the LLM can sort them into helpful outputs.

How? 

It starts with better search.
### Similarity search

Similarity search or semantic search or vector search is the idea of searching on *vibe*.

If this sounds like woo, woo. It's not.

Perhaps searching via *meaning* is a better analogy.

With keyword search, you are trying to match the string "apple" with the string "apple".

Whereas with similarity/semantic search, you may want to search "macronutrients functions".

And get back results that don't necessarily contain the words "macronutrients functions" but get back pieces of text that match that meaning.

> **Example:** Using similarity search on our textbook data with the query "macronutrients function" returns a paragraph that starts with: 
>
>*There are three classes of macronutrients: carbohydrates, lipids, and proteins. These can be metabolically processed into cellular energy. The energy from macronutrients comes from their chemical bonds. This chemical energy is converted into cellular energy that is then utilized to perform work, allowing our bodies to conduct their basic functions.*
> 
> as the first result. How cool!

If you've ever used Google, you know this kind of workflow.

But now we'd like to perform that across our own data.

Let's import our embeddings we created earlier (tk -link to embedding file) and prepare them for use by turning them into a tensor.

In [64]:
import mlx.core as mx
import numpy as np
import pandas as pd

In [101]:
text_chunks_and_embeddings_df = pd.read_csv('text_chunks_and_embeddings_df.csv')
#convert to numpy array 
text_chunks_and_embeddings_df['embedding'] = text_chunks_and_embeddings_df['embedding'].apply(lambda x: np.fromstring(x.strip('[]'), sep = ' ', dtype=float))
#convert dataframe to list of dicts
pages_and_chunks = text_chunks_and_embeddings_df.to_dict('records')
#get just embedding array as mx array
embeddings = mx.array(np.array(text_chunks_and_embeddings_df['embedding'].tolist())) #shape (1681, 768)

embeddings.shape

(1681, 768)

In [109]:
from mlx_embedding_models.embedding import EmbeddingModel
import mlx.nn as nn
model = EmbeddingModel.from_registry("bge-base")


Embedding model ready!

Time to perform a semantic search.

Let's say you were studying the macronutrients.

And wanted to search your textbook for "macronutrients functions".

Well, we can do so with the following steps:
1. Define a query string (e.g. `"macronutrients functions"`) - note: this could be anything, specific or not.
2. Turn the query string in an embedding with same model we used to embed our text chunks.
3. Perform a [dot product](https://pytorch.org/docs/stable/generated/torch.dot.html) or [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) function between the text embeddings and the query embedding (we'll get to what these are shortly) to get similarity scores.
4. Sort the results from step 3 in descending order (a higher score means more similarity in the eyes of the model) and use these values to inspect the texts. 

Easy!


In [161]:
query = ["what to eat when playing sport"]
print(f"query = {query}")
#embed query with same model as the passages/document
query_embedding = mx.array(model.encode(query))
#calculate dot product between query and passages/document or use cosine similarity if model not normalised
from time import perf_counter as timer
start = timer()
cosine_scores = nn.losses.cosine_similarity_loss(query_embedding, embeddings)
end = timer() - start
print(f"time taken for similarity search between query and {embeddings.shape[0]} embeddings = {end:.5f} seconds")
n_res= 5
top_results = mx.topk(cosine_scores, k=n_res)[::-1] #largest to smallest
top_indices = mx.argpartition(cosine_scores, -n_res)[-n_res:][::-1] #argpartition seems to already be sort from smallest to largest
most_relevant_chunk = pages_and_chunks[top_indices[0].item()]
most_relevant_chunk.keys()
print(f"most relevant page: {most_relevant_chunk['page_number']}\n{most_relevant_chunk['sentence_chunk']}")

query = ['what to eat when playing sport']


100%|██████████| 1/1 [00:00<00:00, 35.93it/s]

time taken for similarity search between query and 1681 embeddings = 0.00010 seconds
most relevant page: 406
amino acids at adequate levels.  Growing children and the elderly need to ensure they get enough protein in their diet to help build and maintain muscle strength.  Even if you’re a hardcore athlete, get your proteins from nutrient-dense foods as you need more than just protein to power up for an event.  Nuts are one nutrient-dense food with a whole lot of protein.  One ounce of pistachios, which is about fifty nuts, has the same amount of protein as an egg and contains a lot of vitamins, minerals, healthy polyunsaturated fats, and antioxidants.  Moreover, the FDA says that eating one ounce of nuts per day can lower your risk for heart disease.  Can you be a hardcore athlete and a vegetarian? The analysis of vegetarian diets by the Dietary Guidelines Advisory Committee (DGAC) did not find professional athletes were inadequate in any nutrients, but did state that people who obtain




In [185]:
#test with much larger embeddings
large_embeddings = mx.random.normal((1000000*embedding.shape[0], 768))
print(large_embeddings.shape)
start = timer()
cosine_scores = nn.losses.cosine_similarity_loss(query_embedding, large_embeddings)
end = timer() - start
print(cosine_scores.shape, f"{end}")

(768000000, 768)
(768000000,) 0.00015495799016207457


Wow. That's quick!

That means we can get pretty far by just storing our embeddings in `torch.tensor` for now.

However, for *much* larger datasets, we'd likely look at a dedicated vector database/indexing libraries such as [Faiss](https://github.com/facebookresearch/faiss).


In [187]:
import textwrap
def print_wrapped(text, width = 80):
    print(textwrap.fill(text, width))

In [190]:
print(f"Query: '{query}'\n")
print("Results:")
# Loop through zipped together scores and indicies from torch.topk
for score, idx in zip(top_results.tolist(), top_indices.tolist()):
    print(f"Score: {score:.4f}")
    # Print relevant sentence chunk (since the scores are in descending order, the most relevant chunk will be first)
    print("Text:")
    print_wrapped(pages_and_chunks[idx]["sentence_chunk"])
    # Print the page number too so we can reference the textbook further (and check the results)
    print(f"Page number: {pages_and_chunks[idx]['page_number']}")
    print("\n")

Query: '['what to eat when playing sport']'

Results:
Score: 0.7483
Text:
amino acids at adequate levels.  Growing children and the elderly need to ensure
they get enough protein in their diet to help build and maintain muscle
strength.  Even if you’re a hardcore athlete, get your proteins from nutrient-
dense foods as you need more than just protein to power up for an event.  Nuts
are one nutrient-dense food with a whole lot of protein.  One ounce of
pistachios, which is about fifty nuts, has the same amount of protein as an egg
and contains a lot of vitamins, minerals, healthy polyunsaturated fats, and
antioxidants.  Moreover, the FDA says that eating one ounce of nuts per day can
lower your risk for heart disease.  Can you be a hardcore athlete and a
vegetarian? The analysis of vegetarian diets by the Dietary Guidelines Advisory
Committee (DGAC) did not find professional athletes were inadequate in any
nutrients, but did state that people who obtain proteins solely from plants
shoul