## Create and run a local RAG pipeline from scratch

### What is RAG?

RAG stands for Retrieval Augmented Generation.

The goal of RAG is to take information and pass it to an LLM so it can generate outputs based on that information.

* Retrieval - Find relevant information given a query, e.g. "what are the macronutrients and what do they do?"  -> retrieve passages of text related to the macronutrients from a nutrition textbook.

* Augmented - We want to take the relevant information and augment our input (prompt) to an LLM with that relevant information.

* Generation - Take the first two steps and pass them to an LLM for generative outputs.
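The three steps above can be sketched as a toy pipeline. This is only a minimal illustration under loud assumptions: the `passages` list, the word-overlap scoring in `retrieve` (a stand-in for the vector search built later) and the `generate` stub (no real LLM is called) are all hypothetical names invented for this example.

```python
import re

# Toy "knowledge base": in a real pipeline these would be chunks of a textbook.
passages = [
    "Macronutrients are carbohydrates, lipids and proteins.",
    "Vitamin C is a water-soluble vitamin.",
    "Proteins are built from amino acids.",
]

def tokenize(text: str) -> set[str]:
    """Lowercase a string and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(query: str, passages: list[str], k: int = 1) -> list[str]:
    """Retrieval: rank passages by word overlap with the query (stand-in for vector search)."""
    query_words = tokenize(query)
    ranked = sorted(passages, key=lambda p: len(query_words & tokenize(p)), reverse=True)
    return ranked[:k]

def augment(query: str, context: list[str]) -> str:
    """Augmentation: place the retrieved passages into the prompt."""
    context_block = "\n".join(f"- {c}" for c in context)
    return f"Answer the query using the context.\nContext:\n{context_block}\nQuery: {query}"

def generate(prompt: str) -> str:
    """Generation: placeholder for an LLM call (hypothetical, no model is loaded here)."""
    return f"[LLM output for a prompt of {len(prompt)} characters]"

query = "What are the macronutrients?"
context = retrieve(query, passages)
prompt = augment(query, context)
print(prompt)
print(generate(prompt))
```

The rest of this notebook replaces each stub with a real component: embeddings for retrieval and a local LLM for generation.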

If you want to read where RAG came from, see the paper from Facebook AI: https://proceedings.neurips.cc/paper/2020/file/6b493230205f780e1bc26945df7481e5-Paper.pdf

> This work offers several positive societal benefits over previous work: the fact that it is more
strongly grounded in real factual knowledge (in this case Wikipedia) makes it “hallucinate” less
with generations that are more factual, and offers more control and interpretability. RAG could be
employed in a wide variety of scenarios with direct benefit to society, for example by endowing it
with a medical index and asking it open-domain questions on that topic, or by helping people be more
effective at their jobs.

### Why RAG?

The main goal of RAG is to improve the generation outputs of LLMs.

1. Reduce hallucinations - LLMs are incredibly good at generating good-*looking* text; however, good-looking text is not necessarily factual. RAG can help LLMs generate outputs grounded in relevant, factual passages.

2. Work with custom data - Many base LLMs are trained on internet-scale data. This means they have a fairly good understanding of language in general. However, it also means a lot of their responses can be generic in nature. RAG helps to create specific responses based on specific documents (e.g. your own company's customer support documents).


### What can RAG be used for?

* Customer support Q&A chat - Treat your existing support documents as a resource: when a customer asks a question, a retrieval system retrieves relevant documentation snippets and an LLM crafts those snippets into an answer. Think of this as a "chatbot" for your documentation.

* Email chain analysis - Let's say you're a large insurance company and you have chains and chains of emails of customer claims. You could use a RAG pipeline to find relevant information from those emails and then use an LLM to process that information into structured data.

* Company internal documentation chat

* Textbook Q&A - Let's say you are a nutrition student and you have a 1200-page textbook to read. You could build a RAG pipeline to go through the textbook and find passages relevant to your questions.

Common theme here: take the documents relevant to a query and process them with an LLM.

From this angle, you can consider an LLM as a calculator for words.


### Why local?

Fun.

Privacy, speed and cost.

* Privacy - If you have private documentation, maybe you do not want to send it to an API. Instead, you can set up an LLM and run it on your own hardware.

* Speed - Whenever you use an API, you have to send some kind of data across the internet. This takes time. Running locally means we do not have to wait for transfers of data.

* Cost - If you own your hardware, the cost is already paid. The upfront cost may be large, but over time you do not have to keep paying API fees.

* No vendor lock-in - If you run your own software and hardware, you can keep running your business even if OpenAI or another large internet company shuts down tomorrow.

In [26]:
!nvidia-smi

/bin/bash: line 1: nvidia-smi: command not found
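The `command not found` output above means the `nvidia-smi` tool is not visible to this environment, which usually suggests no NVIDIA GPU driver is available. As a small sketch (the helper name `nvidia_smi_available` is made up for this example), the same check can be done from Python without raising an error:

```python
import shutil

def nvidia_smi_available() -> bool:
    """Return True if the `nvidia-smi` executable can be found on the PATH."""
    return shutil.which("nvidia-smi") is not None

if nvidia_smi_available():
    print("nvidia-smi found: an NVIDIA driver appears to be installed.")
else:
    print("nvidia-smi not found: no NVIDIA driver is visible to this environment.")
```

This only checks for the command-line tool. It does not confirm that a deep learning framework can actually use the GPU.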


### What we are going to build

* https://github.com/mrdbourke/simple-local-rag
* https://whimsical.com/simple-local-rag-workflow-39kToR3yNf7E8kY4sS2tjV

We are going to build NutriChat to "chat with a nutrition textbook".

Specifically:

1. Open a PDF document (you could use almost any PDF here or even a collection of PDFs).

2. Format the text of the PDF textbook ready for an embedding model.

3. Embed all of the chunks of text in the textbook and turn them into numerical representations which we can store for later.

4. Build a retrieval system that uses vector search to find relevant chunks of text based on a query.

5. Create a prompt that incorporates the retrieved pieces of text.

6. Generate an answer to a query based on the passages of the textbook with an LLM.

These steps fall into two parts:

1. Steps 1-3: Document preprocessing and embedding creation.

2. Steps 4-6: Search and answer.

### 1. Document/text processing and embedding creation

Ingredients:

* PDF document of choice (note: this could be almost any kind of document, I have just chosen to focus on PDFs for now).

* Embedding model of choice.

Steps:

1. Import PDF document.

2. Process text for embedding (e.g. split into chunks of sentences).

3. Embed text chunks with embedding model.

4. Save embeddings to file for later (embeddings will stay on file for many years or until you lose your hard drive).



In [27]:
#### Import PDF document
# !wget https://github.com/mrdbourke/simple-local-rag/blob/main/human-nutrition-text.pdf

In [28]:
import os
import requests

# Get PDF document path
pdf_path = "human-nutrition-text.pdf"

# Download PDF
if not os.path.exists(pdf_path):
  print("File does not exist, downloading...")

  # Enter the URL of the pdf
  url = "https://pressbooks.oer.hawaii.edu/humannutrition2/open/download?type=pdf"

  # The local filename to save the downloaded file
  filename = pdf_path

  # Send a GET request to the URL
  response = requests.get(url)

  # Check if the request is successful
  if response.status_code == 200:
    # Open the file and save it
    with open(filename, "wb") as file:
      file.write(response.content)
    print(f"[INFO] the file has been downloaded and saved as {filename}")
  else:
    print(f"[INFO] failed to download the file. Status code: {response.status_code}")

else:
  print(f"File {pdf_path} exists.")

File human-nutrition-text.pdf exists.


We have got a PDF, let's open it!

In [29]:
!pip install PyMuPDF



In [30]:
import fitz  # requires pip install PyMuPDF, see: https://github.com/pymupdf/PyMuPDF
from tqdm.auto import tqdm # pip install tqdm

In [31]:
def text_formatter(text: str) -> str:
  """
  Performs minor formatting on text.
  """
  cleaned_text = text.replace("\n", " ").strip()

  # Potentially more text formatting function can go here
  return cleaned_text

def open_and_read_pdf(pdf_path: str) -> list[dict]:
  """
  Opens a PDF file, reads its text page by page and collects per-page statistics.
  """
  doc = fitz.open(pdf_path)
  pages_and_texts = []
  for page_number, page in tqdm(enumerate(doc)):
    text = page.get_text()
    text = text_formatter(text=text)
    # Subtract 41 so page_number lines up with the page numbers printed in this textbook
    pages_and_texts.append({"page_number": page_number - 41,
                            "page_char_count": len(text),
                            "page_word_count": len(text.split(" ")),
                            "page_sentence_count_raw": len(text.split(". ")),
                            "page_token_count": len(text) / 4, # 1 token ~ 4 characters
                            "text": text
                          })
  return pages_and_texts


pages_and_texts = open_and_read_pdf(pdf_path=pdf_path)


In [32]:
pages_and_texts[:2]

[{'page_number': -41,
  'page_char_count': 29,
  'page_word_count': 4,
  'page_sentence_count_raw': 1,
  'page_token_count': 7.25,
  'text': 'Human Nutrition: 2020 Edition'},
 {'page_number': -40,
  'page_char_count': 0,
  'page_word_count': 1,
  'page_sentence_count_raw': 1,
  'page_token_count': 0.0,
  'text': ''}]

In [33]:
import random

random.sample(pages_and_texts, k=3)

[{'page_number': 258,
  'page_char_count': 1648,
  'page_word_count': 290,
  'page_sentence_count_raw': 15,
  'page_token_count': 412.0,
  'text': 'inhibited. Thus, glucose additionally has a “fat-sparing” effect. This  is because an increase in blood glucose stimulates release of the  hormone insulin, which tells cells to use glucose (instead of lipids) to  make energy. Adequate glucose levels in the blood also prevent the  development of ketosis. Ketosis is a metabolic condition resulting  from an elevation of ketone bodies in the blood. Ketone bodies are  an alternative energy source that cells can use when glucose supply  is insufficient, such as during fasting. Ketone bodies are acidic and  high elevations in the blood can cause it to become too acidic. This  is rare in healthy adults, but can occur in alcoholics, people who  are malnourished, and in individuals who have Type 1 diabetes. The  minimum amount of carbohydrate in the diet required to inhibit  ketosis in adults is 50 g

In [34]:
import pandas as pd

df = pd.DataFrame(pages_and_texts)
df.head()

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | text |
|---|---|---|---|---|---|---|
| 0 | -41 | 29 | 4 | 1 | 7.25 | Human Nutrition: 2020 Edition |
| 1 | -40 | 0 | 1 | 1 | 0.0 | |
| 2 | -39 | 320 | 54 | 1 | 80.0 | Human Nutrition: 2020 Edition UNIVERSITY OF ... |
| 3 | -38 | 212 | 32 | 1 | 53.0 | Human Nutrition: 2020 Edition by University of... |
| 4 | -37 | 797 | 145 | 2 | 199.25 | Contents Preface University of Hawai‘i at Mā... |


In [35]:
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count |
|---|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 562.5 | 1148.0 | 198.3 | 9.97 | 287.0 |
| std | 348.86 | 560.38 | 95.76 | 6.19 | 140.1 |
| min | -41.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 25% | 260.75 | 762.0 | 134.0 | 4.0 | 190.5 |
| 50% | 562.5 | 1231.5 | 214.5 | 10.0 | 307.88 |
| 75% | 864.25 | 1603.5 | 271.0 | 14.0 | 400.88 |
| max | 1166.0 | 2308.0 | 429.0 | 32.0 | 577.0 |


Why would we care about token count?

Token count is important to think about, because:

1. Embedding models do not deal with infinite tokens.
2. LLMs do not deal with infinite tokens.

For example, an embedding model may have been trained to embed sequences of up to 384 tokens into numerical space (e.g. sentence-transformers `all-mpnet-base-v2`, see: https://sbert.net/docs/sentence_transformer/pretrained_models.html). Text beyond that limit is typically truncated.

As for LLMs, they can't accept infinite tokens in their context window.
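To make this concrete, here is a small worked example using the same rough rule of thumb as this notebook (1 token ~ 4 characters) and the 384-token limit quoted above. The helper names `estimate_tokens` and `min_pieces_needed` are made up for illustration, and the estimate is only an approximation of what a real tokenizer would produce.

```python
import math

CHARS_PER_TOKEN = 4      # rough rule of thumb used in this notebook
MAX_MODEL_TOKENS = 384   # example limit quoted above for all-mpnet-base-v2

def estimate_tokens(text: str) -> float:
    """Estimate the token count of a string from its character count."""
    return len(text) / CHARS_PER_TOKEN

def min_pieces_needed(text: str, max_tokens: int = MAX_MODEL_TOKENS) -> int:
    """Smallest number of pieces so that each piece fits the limit (by estimate)."""
    return max(1, math.ceil(estimate_tokens(text) / max_tokens))

# A dummy page with the same character count as the longest page in the table above
page_text = "a" * 2308
print(estimate_tokens(page_text))    # 577.0
print(min_pieces_needed(page_text))  # 2
```

So the longest page in the textbook (about 577 estimated tokens) would not fit into a 384-token embedding model in one piece, which is one reason to split pages into smaller chunks.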

### Further text processing (splitting pages into sentences)

Two ways to do this:

1. Split on `". "` (we have already done this for the raw sentence counts above).

2. Use an NLP library such as spaCy (https://spacy.io/usage) or NLTK (https://www.nltk.org/).
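A quick illustration of why splitting on `". "` is only a rough estimate. The example sentence below is made up for this sketch: abbreviations such as "e.g." and "Dr." also contain `". "`, so the naive split over-counts.

```python
# Three real sentences, but two abbreviations that also contain ". "
text = "Eat a variety of foods, e.g. fruits and vegetables. Drink water. Dr. Lee agrees."

naive_sentences = text.split(". ")
print(len(naive_sentences))  # 5
for sentence in naive_sentences:
    print(repr(sentence))

# An empty page still counts as one "sentence" with this method
print(len("".split(". ")))  # 1
```

The second result also explains why the empty page in the earlier output has a `page_sentence_count_raw` of 1.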


In [36]:
from spacy.lang.en import English

nlp = English()

# Add a sentencizer pipeline, see https://spacy.io/api/sentencizer
nlp.add_pipe("sentencizer")

# Create document instance as an example
doc = nlp("This is a sentence. This another sentence. I like elephants.")
assert len(list(doc.sents)) == 3

# Print out our sentences split
list(doc.sents)

[This is a sentence., This another sentence., I like elephants.]

In [37]:
pages_and_texts[600]

{'page_number': 559,
 'page_char_count': 863,
 'page_word_count': 136,
 'page_sentence_count_raw': 8,
 'page_token_count': 215.75,
 'text': 'Image by  Allison  Calabrese /  CC BY 4.0  Korsakoff syndrome can cause similar symptoms as beriberi such  as confusion, loss of coordination, vision changes, hallucinations,  and may progress to coma and death. This condition is specific  to alcoholics as diets high in alcohol can cause thiamin deficiency.  Other individuals at risk include individuals who also consume diets  typically low in micronutrients such as those with eating disorders,  elderly, and individuals who have gone through gastric bypass  surgery.5  Figure 9.10 The Role of Thiamin  Figure 9.11 Beriberi, Thiamin Deficiency  5.\xa0Fact Sheets for Health Professionals: Thiamin. National  Institute of Health, Office of Dietary Supplements.  \xa0https://ods.od.nih.gov/factsheets/Thiamin- HealthProfessional/. Updated Feburary 11, 2016.  Accessed October 22, 2017.  Water-Soluble Vitami

In [38]:
for item in tqdm(pages_and_texts):
  item["sentences"] = list(nlp(item["text"]).sents)

  # Make sure all sentences are strings (default type is a spaCy datatype)
  item["sentences"] = [str(sentence) for sentence in item["sentences"]]

  # Count the sentences
  item["page_sentence_count_spacy"] = len(item["sentences"])


In [39]:
random.sample(pages_and_texts, k=1)

[{'page_number': 626,
  'page_char_count': 1416,
  'page_word_count': 247,
  'page_sentence_count_raw': 9,
  'page_token_count': 354.0,
  'text': 'Tools for Change  If you need to increase calcium intake, are a vegan, or  have a food allergy to dairy products, it is helpful to know  that there are some plant-based foods that are high in  calcium. Tofu (made with calcium sulfate), turnip greens,  mustard greens, and chinese cabbage are good sources. For  a list of non-dairy sources you can find the calcium content  for thousands of foods by visiting the USDA National  Nutrient Database (http://www.nal.usda.gov/fnic/ foodcomp/search/). When obtaining your calcium from a  vegan diet, it is important to know that some plant-based  foods significantly impair the absorption of calcium. These  include spinach, Swiss chard, rhubarb, beets, cashews, and  peanuts. With careful planning and good selections, you  can ensure that you are getting enough calcium in your diet  even if you do not drink

In [40]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | page_sentence_count_spacy |
|---|---|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 562.5 | 1148.0 | 198.3 | 9.97 | 287.0 | 10.32 |
| std | 348.86 | 560.38 | 95.76 | 6.19 | 140.1 | 6.3 |
| min | -41.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 25% | 260.75 | 762.0 | 134.0 | 4.0 | 190.5 | 5.0 |
| 50% | 562.5 | 1231.5 | 214.5 | 10.0 | 307.88 | 10.0 |
| 75% | 864.25 | 1603.5 | 271.0 | 14.0 | 400.88 | 15.0 |
| max | 1166.0 | 2308.0 | 429.0 | 32.0 | 577.0 | 28.0 |


### Chunking our sentences together

The concept of splitting larger pieces of text into smaller ones is often referred to as text splitting or chunking.

There is no 100% correct way to do this.

We will keep it simple and split it into groups of 10 sentences (however, you could also try 5, 7, 8, whatever you like).

There are frameworks such as LangChain which can help with this; however, we will stick with plain Python for now:
https://python.langchain.com/docs/modules/data_connection/document_transformers/

Why we do this:

1. So our texts are easier to filter (smaller groups of text can be easier to inspect than large passages of text).

2. So our text chunks can fit into our embedding model context window (e.g. 384 tokens as a limit).

3. So our contexts passed to an LLM can be more specific and focused.
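Reason 2 above suggests an alternative to a fixed number of sentences: group sentences until an estimated token budget is reached. The sketch below is not the approach used in this notebook (which uses fixed groups of 10 sentences). The function name `chunk_by_token_budget` and its parameters are hypothetical, and it reuses the rough "1 token ~ 4 characters" estimate.

```python
def chunk_by_token_budget(sentences: list[str],
                          max_tokens: int = 384,
                          chars_per_token: int = 4) -> list[list[str]]:
    """Greedily group sentences so each group's estimated token count stays within max_tokens.

    Note: a single sentence longer than the budget still gets its own (oversized) chunk.
    """
    chunks, current, current_tokens = [], [], 0.0
    for sentence in sentences:
        sentence_tokens = len(sentence) / chars_per_token
        # Start a new chunk if adding this sentence would exceed the budget
        if current and current_tokens + sentence_tokens > max_tokens:
            chunks.append(current)
            current, current_tokens = [], 0.0
        current.append(sentence)
        current_tokens += sentence_tokens
    if current:
        chunks.append(current)
    return chunks

# Five dummy sentences of 400 characters each (~100 estimated tokens each)
sentences = ["x" * 400] * 5
chunks = chunk_by_token_budget(sentences)
print([len(chunk) for chunk in chunks])  # [3, 2]
```

Fixed-size groups are simpler to reason about, while a token budget adapts to pages with unusually long or short sentences.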

In [41]:
# Define split size to turn groups of sentences into chunks
num_sentence_chunk_size = 10

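# Create a function to split a list into consecutive sublists of at most slice_size items
# e.g. a list of 20 items -> sublists of sizes [10, 10], a list of 25 items -> sizes [10, 10, 5]

def split_list(input_list: list,
               slice_size: int=num_sentence_chunk_size) -> list[list]:
    """
    Splits input_list into sublists of length slice_size (the last sublist may be shorter).
    """
    return [input_list[i: i + slice_size] for i in range(0, len(input_list), slice_size)]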

test_list = list(range(25))
split_list(test_list)

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
 [20, 21, 22, 23, 24]]

In [42]:
# Loop through pages and texts and split sentences into chunks
for item in tqdm(pages_and_texts):
  item["sentence_chunks"] = split_list(input_list = item["sentences"],
                                       slice_size = num_sentence_chunk_size)
  item["num_chunks"] = len(item["sentence_chunks"])


In [43]:
random.sample(pages_and_texts, k=1)

[{'page_number': 238,
  'page_char_count': 505,
  'page_word_count': 86,
  'page_sentence_count_raw': 2,
  'page_token_count': 126.25,
  'text': 'Image by  Allison  Calabrese /  CC BY 4.0  Learning Activities  Technology Note: The second edition of the Human  Nutrition Open Educational Resource (OER) textbook  features interactive learning activities.\xa0 These activities are  available in the web-based textbook and not available in the  downloadable versions (EPUB, Digital PDF, Print_PDF, or  Open Document).  Learning activities may be used across various mobile  devices, however, for the best user experience it is strongly  238  |  Introduction',
  'sentences': ['Image by  Allison  Calabrese /  CC BY 4.0  Learning Activities  Technology Note: The second edition of the Human  Nutrition Open Educational Resource (OER) textbook  features interactive learning activities.',
   '\xa0 These activities are  available in the web-based textbook and not available in the  downloadable versions (

In [44]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | page_sentence_count_spacy | num_chunks |
|---|---|---|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 562.5 | 1148.0 | 198.3 | 9.97 | 287.0 | 10.32 | 1.53 |
| std | 348.86 | 560.38 | 95.76 | 6.19 | 140.1 | 6.3 | 0.64 |
| min | -41.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 25% | 260.75 | 762.0 | 134.0 | 4.0 | 190.5 | 5.0 | 1.0 |
| 50% | 562.5 | 1231.5 | 214.5 | 10.0 | 307.88 | 10.0 | 1.0 |
| 75% | 864.25 | 1603.5 | 271.0 | 14.0 | 400.88 | 15.0 | 2.0 |
| max | 1166.0 | 2308.0 | 429.0 | 32.0 | 577.0 | 28.0 | 3.0 |


### Splitting each chunk into its own item

We would like to embed each chunk of sentences into its own numerical representation.

That'll give us a good level of granularity.

This means we can dive into the specific text sample that was passed to our model.

In [48]:
import re

# Split each chunk into its own item
pages_and_chunks = []

for item in tqdm(pages_and_texts):
    for sentence_chunk in item["sentence_chunks"]:
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]

        # Join the sentences together into paragraph like structure, aka join the list of sentences into one paragraph
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk)  #  ".A" => ". A" (will work for any capital letter)

        chunk_dict["sentence_chunk"] = joined_sentence_chunk

        # Get some stats on our chunks
        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len([word for word in joined_sentence_chunk.split(" ")])
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4  # 1 token = ~4 characters

        pages_and_chunks.append(chunk_dict)

len(pages_and_chunks)


1843
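To see what the join and the regex in the cell above do, here is a small standalone example. The three sentences are made up, and it assumes (as appears to be the case for the sentence strings here) that each sentence carries no trailing whitespace, so joining with `""` glues sentences together.

```python
import re

# Hypothetical chunk of sentences without trailing whitespace
sentence_chunk = ["Protein is a macronutrient.", "It is made of amino acids.", "Eat  enough of it."]

# Join into one string and collapse double spaces
joined = "".join(sentence_chunk).replace("  ", " ").strip()
print(joined)
# Protein is a macronutrient.It is made of amino acids.Eat enough of it.

# Insert a space after any full stop that is directly followed by a capital letter
fixed = re.sub(r'\.([A-Z])', r'. \1', joined)
print(fixed)
# Protein is a macronutrient. It is made of amino acids. Eat enough of it.

# Side effect: abbreviations with capitals are also split
print(re.sub(r'\.([A-Z])', r'. \1', "U.S.A"))
# U. S. A
```

The side effect on abbreviations is usually harmless for embedding purposes, but it is worth knowing about when inspecting chunks by eye.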

In [49]:
random.sample(pages_and_chunks, k=1)

[{'page_number': 1056,
  'sentence_chunk': 'butter on your toast, making your own salad dressing using olive oil, vinegar or lemon juice, and herbs, cooking with olive oil exclusively, or simply adding a dose of it to your favorite meal.11 The Raw Food Diet The raw food diet is followed by those who avoid cooking as much as possible in order to take advantage of the full nutrient content of foods. The principle behind raw foodism is that plant foods in their natural state are the most wholesome for the body. The raw food diet is not a weight-loss plan, it is a lifestyle choice. People who practice raw foodism eat only uncooked and unprocessed foods, emphasizing whole fruits and vegetables. Staples of the raw food diet include whole grains, beans, dried fruits, seeds and nuts, seaweed, sprouts, and unprocessed produce. As a result, food preparation mostly involves peeling, chopping, blending, straining, and dehydrating fruits and vegetables. The positive aspects of this eating method in

In [50]:
df = pd.DataFrame(pages_and_chunks)
df.describe().round(2)

| | page_number | chunk_char_count | chunk_word_count | chunk_token_count |
|---|---|---|---|---|
| count | 1843.0 | 1843.0 | 1843.0 | 1843.0 |
| mean | 583.38 | 734.44 | 112.33 | 183.61 |
| std | 347.79 | 447.54 | 71.22 | 111.89 |
| min | -41.0 | 12.0 | 3.0 | 3.0 |
| 25% | 280.5 | 315.0 | 44.0 | 78.75 |
| 50% | 586.0 | 746.0 | 114.0 | 186.5 |
| 75% | 890.0 | 1118.5 | 173.0 | 279.62 |
| max | 1166.0 | 1831.0 | 297.0 | 457.75 |


#### Filter out short chunks of text

Very short chunks (for example page headers, footers and reference fragments) may not contain much useful information.

In [51]:
df.head()

| | page_number | sentence_chunk | chunk_char_count | chunk_word_count | chunk_token_count |
|---|---|---|---|---|---|
| 0 | -41 | Human Nutrition: 2020 Edition | 29 | 4 | 7.25 |
| 1 | -39 | Human Nutrition: 2020 Edition UNIVERSITY OF HA... | 308 | 42 | 77.0 |
| 2 | -38 | Human Nutrition: 2020 Edition by University of... | 210 | 30 | 52.5 |
| 3 | -37 | Contents Preface University of Hawai‘i at Māno... | 766 | 114 | 191.5 |
| 4 | -36 | Lifestyles and Nutrition University of Hawai‘i... | 941 | 142 | 235.25 |


In [53]:
# Show random chunks with 30 tokens or fewer
min_token_length = 30

for row in df[df["chunk_token_count"] <= min_token_length].sample(5).iterrows():
    print(f'Chunk token count: {row[1]["chunk_token_count"]} | Text: {row[1]["sentence_chunk"]}')

Chunk token count: 15.75 | Text: PART IV CHAPTER 4. CARBOHYDRATES Chapter 4. Carbohydrates | 227
Chunk token count: 27.75 | Text: https://jamanetwork.com/journals/jama/ fullarticle/195531. Accessed October 5, 2017. 538 | Fat-Soluble Vitamins
Chunk token count: 19.5 | Text: 2009). Dietary Glycemic Index: Digestion and Absorption of Carbohydrates | 247
Chunk token count: 16.0 | Text: Accessed January 20, 2018. The Effect of New Technologies | 1031
Chunk token count: 16.5 | Text: Updated March 12, 2015. Accessed December 5, 2017. 882 | Childhood


In [54]:
# Filter our DataFrame to keep only rows with more than 30 tokens
pages_and_chunks_over_min_token_len = df[df["chunk_token_count"] > min_token_length].to_dict(orient="records")
pages_and_chunks_over_min_token_len[:2]

[{'page_number': -39,
  'sentence_chunk': 'Human Nutrition: 2020 Edition UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN NUTRITION PROGRAM ALAN TITCHENAL, SKYLAR HARA, NOEMI ARCEO CAACBAY, WILLIAM MEINKE-LAU, YA-YUN YANG, MARIE KAINOA FIALKOWSKI REVILLA, JENNIFER DRAPER, GEMADY LANGFELDER, CHERYL GIBBY, CHYNA NICOLE CHUN, AND ALLISON CALABRESE',
  'chunk_char_count': 308,
  'chunk_word_count': 42,
  'chunk_token_count': 77.0},
 {'page_number': -38,
  'sentence_chunk': 'Human Nutrition: 2020 Edition by University of Hawai‘i at Mānoa Food Science and Human Nutrition Program is licensed under a Creative Commons Attribution 4.0 International License, except where otherwise noted.',
  'chunk_char_count': 210,
  'chunk_word_count': 30,
  'chunk_token_count': 52.5}]

In [57]:
random.sample(pages_and_chunks_over_min_token_len, k=1)

[{'page_number': 474,
  'sentence_chunk': 'If this is not feasible, walk while you are at work. • Take the stairs when you come upon them or better yet, seek them out. • Walk your neighborhood and know your surroundings. This benefits both health and safety. • Watch less television. Community Level • Request that your college/workplace provides more access to healthy low-cost foods. • Support changes in school lunch programs. • Participate in cleaning up local green spaces and then enjoy them during your leisure time. 474 | Weight Management',
  'chunk_char_count': 504,
  'chunk_word_count': 85,
  'chunk_token_count': 126.0}]
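The same filter can be written without pandas as a list comprehension. The records below are hypothetical placeholders (not real chunks from the textbook) chosen to show the boundary behaviour: because the filter uses `>`, a chunk with exactly 30 estimated tokens is dropped.

```python
min_token_length = 30

# Hypothetical records with the same keys as the chunk dictionaries in this notebook
example_chunks = [
    {"sentence_chunk": "short footer text", "chunk_token_count": 16.0},
    {"sentence_chunk": "chunk exactly at the threshold", "chunk_token_count": 30.0},
    {"sentence_chunk": "a longer, more informative chunk", "chunk_token_count": 52.5},
]

# Keep only chunks strictly above the minimum token length
kept = [chunk for chunk in example_chunks if chunk["chunk_token_count"] > min_token_length]
print([chunk["sentence_chunk"] for chunk in kept])
# ['a longer, more informative chunk']
```

The threshold of 30 is a judgment call. Inspecting a few random short chunks, as done above, is a reasonable way to pick a value for a different document.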

### Embedding our text chunks