# Create & Run a Local RAG Pipeline from Scratch

## What is RAG?

RAG stands for Retrieval Augmented Generation.

The goal of RAG is to take information and pass it to an LLM, so it can generate outputs based on that information.

* Retrieval - Find relevant information given a query, e.g. "What are the macronutrients & what do they do?" -> retrieves passages of text related to the macronutrients from a nutrition textbook.

* Augmented - We want to take the relevant information & augment our input (prompt) to an LLM with that relevant information.

* Generation - Take the first 2 steps & pass them to an LLM for generative outputs.
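
As a rough illustration of how the three steps fit together, here is a toy sketch in plain Python. Everything in it is an assumption made for illustration: the `passages` list, the word-overlap scoring in `retrieve`, and the `generate` stub that stands in for a real LLM call. The pipeline built in this notebook uses embeddings and vector search rather than word overlap.

```python
# Toy RAG flow: retrieve -> augment -> generate (all names are illustrative).
passages = [
    "Macronutrients are carbohydrates, proteins and fats.",
    "Vitamin C is a water-soluble vitamin found in citrus fruit.",
    "Water makes up a large share of body weight.",
]

def tokenize(text: str) -> set[str]:
    """Lowercase, strip basic punctuation and return the set of words."""
    return {w.strip('.,?!"').lower() for w in text.split()}

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Rank docs by shared words with the query (a stand-in for vector search)."""
    query_words = tokenize(query)
    ranked = sorted(docs, key=lambda d: len(query_words & tokenize(d)), reverse=True)
    return ranked[:k]

def augment(query: str, context: list[str]) -> str:
    """Build a prompt that places the retrieved passages before the question."""
    context_block = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{context_block}\n\nQuestion: {query}"

def generate(prompt: str) -> str:
    """Placeholder for an LLM call; it only reports the prompt size."""
    return f"[LLM would answer here, given a prompt of {len(prompt)} characters]"

query = "What are the macronutrients?"
top_passages = retrieve(query, passages)
prompt = augment(query, top_passages)
print(prompt)
print(generate(prompt))
```

The point of the sketch is the shape of the flow: the LLM never sees the whole document collection, only the few passages the retrieval step selected.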

Where RAG came from - Facebook / Meta AI Paper: *Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks*
> This work offers several positive societal benefits over previous work: the fact that it is more strongly grounded in real factual knowledge (in this case Wikipedia) makes it “hallucinate” less with generations that are more factual, and offers more control and interpretability. RAG could be employed in a wide variety of scenarios with direct benefit to society, for example by endowing it with a medical index and asking it open-domain questions on that topic, or by helping people be more effective at their jobs.

## Why RAG?

The main goal of RAG is to improve the generation outputs of LLMs.

1. Reduce hallucinations - LLMs are incredibly good at generating good *looking* text; however, good-looking text is not necessarily factual. RAG can help LLMs generate outputs grounded in relevant, factual passages.

2. Work with custom data - Many base LLMs are trained with internet-scale data. This means they have a fairly good understanding of language in general. However, it also means a lot of their responses can be generic in nature. RAG helps to create specific responses based on specific documents (e.g. your own company's customer support documents).

## What can RAG be used for?

* Customer Support Q&A Chat - Treat your existing customer support documents as a resource. When a customer asks a question, a retrieval system finds relevant documentation snippets and an LLM crafts those snippets into an answer. Think of this as a "chatbot for your documentation".

* Email Chain Analysis - Let's say you are a large insurance company & you have chains and chains of emails of customer claims. You could use a RAG pipeline to find relevant information from those emails & then use an LLM to process that information into structured data.

* Company Internal Documentation Chat

* Textbook Q&A - Let's say you are a nutrition student and you've got a 1,200-page textbook to read. You could build a RAG pipeline to go through the textbook and find passages relevant to the questions you have.

Common theme here: Take the documents relevant to a query & process them with an LLM.

From this angle, consider an LLM as a calculator for words.

## Why Local?

Fun. 

Privacy, Speed, Cost.

* Privacy - If you have private documentation, maybe you don't want to send that to an API. You want to set up an LLM and run it on your own hardware.
* Speed - Whenever you use an API, you have to send some kind of data across the internet. This takes time. Running locally means we don't have to wait for transfers of data.
* Cost - If you own your hardware, the cost is paid. It may have a large cost to begin with, but over time you don't have to keep paying API fees.
* No Vendor Lock-in - If you run your own software and hardware, you can still run your business even if a large company shuts down tomorrow.

## What Will Be Built?

Build NutriChat to "chat with a nutrition document".

Specifically:

1. Open a PDF document (you could use almost any PDF here or even a collection of PDFs).
2. Format the text of the PDF textbook ready for an embedding model.
3. Embed all of the chunks of text in the textbook, turning them into numerical representations (embeddings) which can be stored for later.
4. Build a retrieval system that uses vector search to find relevant chunks of text based on a query.
5. Create a prompt that incorporates the retrieved pieces of text.
6. Generate an answer to a query based on the passages of the textbook with an LLM.

All Locally!

1. Steps 1 - 3: Document Preprocessing & Embedding Creation.
2. Steps 4 - 6: Search & Answer.

## 1. Document / Text Preprocessing & Embedding Creation

Ingredients:
* PDF document of choice (note: this could be almost any kind of document, we just focus on PDFs for now).
* Embedding model of choice

Steps:
1. Import PDF Document.
2. Preprocess Text for Embedding (e.g. Split into Chunks of Sentences).
3. Embed Text Chunks with Embedding Model.
4. Save Embeddings to File for Later (embeddings stored in a file will last for many years or until you lose your hard drive).

## Import PDF Document

In [1]:
import os
import requests

In [2]:
# path to document
pdf_path = 'human-nutrition-text.pdf'

# download PDF
if not os.path.exists(pdf_path):
    print('[INFO] File does not exist, downloading...')
    
    # url of the pdf
    url = 'https://pressbooks.oer.hawaii.edu/humannutrition2/open/download?type=pdf'
    
    # the local file name to save the downloaded file
    fname = pdf_path
    
    # GET request
    res = requests.get(url)
    
    # check if the request is successful
    if res.status_code == 200:
        # open the file & save it
        with open(fname, 'wb') as f:
            f.write(res.content)
        print(f'[INFO] The file has been downloaded & saved as {fname}.')
    else:
        print(f'[INFO] Failed to download the file. Status Code: {res.status_code}')
else:
    print(f'[INFO] File {pdf_path} exists.')

[INFO] File human-nutrition-text.pdf exists.


The PDF is now available, so let's open it.

In [3]:
import fitz # from PyMuPDF
from tqdm.auto import tqdm 

def text_formatter(text: str) -> str:
    '''Performs minor formatting on text.'''
    cleaned_text = text.replace('\n', ' ').strip()
    
    return cleaned_text

def open_and_read_pdf(path: str) -> list[dict]:
    '''Opens a PDF and returns one dict of text and basic stats per page.'''
    doc = fitz.open(path)
    pages_and_texts = []
    
    for page_no, page in tqdm(enumerate(doc)):
        text = page.get_text()
        text = text_formatter(text=text)
        # The offset of 41 lines the index up with the page numbers printed in this particular PDF
        pages_and_texts.append({'page_no': page_no - 41,
                                'page_char_cnt': len(text),
                                'page_word_cnt': len(text.split(' ')),
                                # Raw sentence count: a naive split on '. '
                                'page_sentence_cnt_raw': len(text.split('. ')),
                                'page_token_cnt': len(text) / 4, # 1 token ~ 4 chars
                                'text': text})
    return pages_and_texts

pages_and_texts = open_and_read_pdf(path=pdf_path)
pages_and_texts[:2]

[{'page_no': -41,
  'page_char_cnt': 29,
  'page_word_cnt': 4,
  'page_sentence_cnt_raw': 1,
  'page_token_cnt': 7.25,
  'text': 'Human Nutrition: 2020 Edition'},
 {'page_no': -40,
  'page_char_cnt': 0,
  'page_word_cnt': 1,
  'page_sentence_cnt_raw': 1,
  'page_token_cnt': 0.0,
  'text': ''}]

In [4]:
import random

random.sample(pages_and_texts, k=3)

[{'page_no': 60,
  'page_char_cnt': 1013,
  'page_word_cnt': 168,
  'page_sentence_cnt_raw': 6,
  'page_token_cnt': 253.25,
  'text': 'all other organ systems in the human body. We will learn the  process of nutrient digestion and absorption, which further  reiterates the importance of developing a healthy diet to maintain  a healthier you. The evidence abounds that food can indeed be “thy  medicine.”  Learning Activities  Technology Note: The second edition of the Human  Nutrition Open Educational Resource (OER) textbook  features interactive learning activities.\xa0 These activities are  available in the web-based textbook and not available in the  downloadable versions (EPUB, Digital PDF, Print_PDF, or  Open Document).  Learning activities may be used across various mobile  devices, however, for the best user experience it is strongly  recommended that users complete these activities using a  desktop or laptop computer and in Google Chrome.  \xa0 An interactive or media element has 

In [5]:
import pandas as pd

df = pd.DataFrame(pages_and_texts)
df.head()

| | page_no | page_char_cnt | page_word_cnt | page_sentence_cnt_raw | page_token_cnt | text |
|---|---|---|---|---|---|---|
| 0 | -41 | 29 | 4 | 1 | 7.25 | Human Nutrition: 2020 Edition |
| 1 | -40 | 0 | 1 | 1 | 0.0 | |
| 2 | -39 | 320 | 54 | 1 | 80.0 | Human Nutrition: 2020 Edition UNIVERSITY OF ... |
| 3 | -38 | 212 | 32 | 1 | 53.0 | Human Nutrition: 2020 Edition by University of... |
| 4 | -37 | 797 | 145 | 2 | 199.25 | Contents Preface University of Hawai‘i at Mā... |


In [6]:
df.describe().round(2)

| | page_no | page_char_cnt | page_word_cnt | page_sentence_cnt_raw | page_token_cnt |
|---|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 562.5 | 1148.59 | 198.89 | 9.97 | 287.15 |
| std | 348.86 | 560.44 | 95.75 | 6.19 | 140.11 |
| min | -41.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 25% | 260.75 | 762.75 | 134.0 | 4.0 | 190.69 |
| 50% | 562.5 | 1232.5 | 215.0 | 10.0 | 308.12 |
| 75% | 864.25 | 1605.25 | 271.25 | 14.0 | 401.31 |
| max | 1166.0 | 2308.0 | 429.0 | 32.0 | 577.0 |


Why would we care about token count?

Token count is important to think about because:

1. Embedding models can only process a limited number of tokens per input.
2. LLMs can only fit a limited number of tokens in their context window.

For example, an embedding model may have been trained to embed sequences of up to 384 tokens into numerical space (sentence-transformers `all-mpnet-base-v2`, see: https://www.sbert.net/docs/sentence_transformer/pretrained_models.html).

Likewise, an LLM cannot accept more tokens than its context window allows.
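
To make the limit concrete, the sketch below applies the same rough heuristic used earlier (1 token ~ 4 characters) to check whether a piece of text is likely to fit within a 384-token limit. The function names and the `max_tokens` default are assumptions for illustration; a model's own tokenizer gives the real count.

```python
def estimate_tokens(text: str, chars_per_token: int = 4) -> float:
    """Rough token estimate using the ~4 characters per token heuristic."""
    return len(text) / chars_per_token

def fits_in_context(text: str, max_tokens: int = 384) -> bool:
    """Return True if the estimated token count is within the assumed limit."""
    return estimate_tokens(text) <= max_tokens

short_text = "Human Nutrition: 2020 Edition"
long_text = "protein " * 400  # 3,200 characters, roughly 800 estimated tokens

print(estimate_tokens(short_text), fits_in_context(short_text))
print(estimate_tokens(long_text), fits_in_context(long_text))
```

A page that fails this check is a candidate for splitting into smaller pieces before embedding.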

## Further Text Preprocessing

Next, split each page into sentences.

There are 2 ways to do this:

1. Split on `'. '` (as done for the raw sentence count above).
2. Use an NLP library, such as spaCy or nltk.
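
A small sketch of why the first option can be fragile: splitting on `'. '` also breaks at abbreviations. The example sentence is made up for illustration.

```python
text = "Dr. Smith ate 2.5 servings of fruit. She felt great."

# Naive approach: split wherever a period is followed by a space
naive_sentences = text.split(". ")
print(naive_sentences)
print(len(naive_sentences))  # the text has 2 sentences, but the split yields 3 pieces
```

The abbreviation "Dr." is treated as a sentence boundary, which is one reason to try a dedicated NLP library for this step.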

In [7]:
from spacy.lang.en import English

nlp = English()

# Add a sentencizer pipeline (turning texts into sentences)
nlp.add_pipe('sentencizer')

# Create document instance as an example
doc = nlp('This is a sentence. This is another sentence. I like elephants.')
assert len(list(doc.sents)) == 3

# Print sentences split
list(doc.sents)

[This is a sentence., This is another sentence., I like elephants.]

In [8]:
pages_and_texts[0]

{'page_no': -41,
 'page_char_cnt': 29,
 'page_word_cnt': 4,
 'page_sentence_cnt_raw': 1,
 'page_token_cnt': 7.25,
 'text': 'Human Nutrition: 2020 Edition'}

In [9]:
for item in tqdm(pages_and_texts):
    item['sentences'] = list(nlp(item['text']).sents)

    # Make sure all sentences are strings (the default type is a spaCy Span)
    item['sentences'] = [str(sentence) for sentence in item['sentences']]
    
    # Count the sentences
    item['page_sentence_cnt_spacy'] = len(item['sentences'])

In [10]:
random.sample(pages_and_texts, k=1)

[{'page_no': 1166,
  'page_char_cnt': 257,
  'page_word_cnt': 44,
  'page_sentence_cnt_raw': 3,
  'page_token_cnt': 64.25,
  'text': '23. Vitamin D reused “The Functions of Vitamin D” by Allison  Calabrese / Attribution – Sharealike  24. Vitamin K reused “Kale Lacinato Lacinato Kale” by BlackRiv\xa0/  Pixabay License; “Phylloquinone structure” by Mysid\xa0/ Public  Domain  1166  |  Attributions',
  'sentences': ['23.',
   'Vitamin D reused “The Functions of Vitamin D” by Allison  Calabrese / Attribution – Sharealike  24.',
   'Vitamin K reused “Kale Lacinato Lacinato Kale” by BlackRiv\xa0/  Pixabay License; “Phylloquinone structure” by Mysid\xa0/ Public  Domain  1166  |  Attributions'],
  'page_sentence_cnt_spacy': 3}]

In [11]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

| | page_no | page_char_cnt | page_word_cnt | page_sentence_cnt_raw | page_token_cnt | page_sentence_cnt_spacy |
|---|---|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 562.5 | 1148.59 | 198.89 | 9.97 | 287.15 | 10.32 |
| std | 348.86 | 560.44 | 95.75 | 6.19 | 140.11 | 6.3 |
| min | -41.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 25% | 260.75 | 762.75 | 134.0 | 4.0 | 190.69 | 5.0 |
| 50% | 562.5 | 1232.5 | 215.0 | 10.0 | 308.12 | 10.0 |
| 75% | 864.25 | 1605.25 | 271.25 | 14.0 | 401.31 | 15.0 |
| max | 1166.0 | 2308.0 | 429.0 | 32.0 | 577.0 | 28.0 |


### Chunking our sentences together

Splitting larger pieces of text into smaller ones is often referred to as `text splitting` or `chunking`.

There is no single correct way to do this - experiment!

To keep it simple, we will split each page into groups of 10 sentences.

There are frameworks such as `langchain` which can help with this, but we will use plain Python for now.
- https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/

Why do we do this?
1. So the texts are easier to filter (smaller groups of text can be easier to inspect than large passages).
2. So the text chunks can fit into the embedding model's context window (e.g. a limit of 384 tokens).
3. So the context passed to the LLM can be more specific and focused.

In [15]:
# Define split size to turn groups of sentences into chunks
num_sentence_chunk_size = 10

# Create a function to split a list into sublists of at most split_size items
# e.g. 20 items -> [10, 10] or 25 items -> [10, 10, 5]
def split_list(input_list: list[str],
               split_size: int=num_sentence_chunk_size) -> list[list[str]]:
    return [input_list[i : i + split_size] for i in range(0, len(input_list), split_size)]

test_list = list(range(25))
split_list(test_list)

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
 [20, 21, 22, 23, 24]]
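
Since chunking is something to experiment with, one variation worth trying is letting neighbouring chunks share a few sentences so that context is not cut off at a chunk boundary. The sketch below is not the approach used in this notebook; `split_list_with_overlap` and its `overlap` parameter are hypothetical names introduced for illustration.

```python
def split_list_with_overlap(input_list: list,
                            split_size: int = 10,
                            overlap: int = 2) -> list[list]:
    """Split a list into chunks of up to split_size items, repeating
    `overlap` items between neighbouring chunks."""
    if not 0 <= overlap < split_size:
        raise ValueError("overlap must be >= 0 and smaller than split_size")
    if not input_list:
        return []
    step = split_size - overlap
    # Stop early enough that the final chunk always contains new items
    last_start = max(len(input_list) - overlap, 1)
    return [input_list[i : i + split_size] for i in range(0, last_start, step)]

chunks = split_list_with_overlap(list(range(25)), split_size=10, overlap=2)
for chunk in chunks:
    print(chunk)
```

With `overlap=0` this behaves like `split_list`; larger overlaps produce more chunks and therefore more embeddings to store.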

In [16]:
# Loop through pages and texts and split sentences into chunks
for item in tqdm(pages_and_texts):
    item['sentence_chunks'] = split_list(input_list=item['sentences'],
                                         split_size=num_sentence_chunk_size)
    item['num_chunks'] = len(item['sentence_chunks'])

In [18]:
random.sample(pages_and_texts, k=1)

[{'page_no': 400,
  'page_char_cnt': 2070,
  'page_word_cnt': 357,
  'page_sentence_cnt_raw': 15,
  'page_token_cnt': 517.5,
  'text': 'as those that derive more than 30 percent of calories from protein.  Many people follow high-protein diets because marketers tout  protein’s ability to stimulate weight loss. It is true that following  high-protein diets increases weight loss in some people. However  the number of individuals that remain on this type of diet is low  and many people who try the diet and stop regain the weight they  had lost. Additionally, there is a scientific hypothesis that there may  be health consequences of remaining on high-protein diets for the  long-term, but clinical trials are ongoing or scheduled to examine  this hypothesis further. As the high-protein diet trend arose so  did the intensely debated issue of whether there are any health  consequences of eating too much protein. Observational studies  conducted in the general population suggest diets high in an

In [19]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

| | page_no | page_char_cnt | page_word_cnt | page_sentence_cnt_raw | page_token_cnt | page_sentence_cnt_spacy | num_chunks |
|---|---|---|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 562.5 | 1148.59 | 198.89 | 9.97 | 287.15 | 10.32 | 1.53 |
| std | 348.86 | 560.44 | 95.75 | 6.19 | 140.11 | 6.3 | 0.64 |
| min | -41.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 25% | 260.75 | 762.75 | 134.0 | 4.0 | 190.69 | 5.0 | 1.0 |
| 50% | 562.5 | 1232.5 | 215.0 | 10.0 | 308.12 | 10.0 | 1.0 |
| 75% | 864.25 | 1605.25 | 271.25 | 14.0 | 401.31 | 15.0 | 2.0 |
| max | 1166.0 | 2308.0 | 429.0 | 32.0 | 577.0 | 28.0 | 3.0 |


### Splitting each chunk into its own item

We'd like to embed each chunk of sentences into its own numerical representation.

That'll give us a good level of granularity.

Meaning, we can trace an answer back to the specific text sample that was passed to our model.

In [23]:
import re

# Split each chunk into its own item
pages_and_chunks = []
for item in tqdm(pages_and_texts):
    for sentence_chunk in item['sentence_chunks']:
        chunk_dict = {}
        chunk_dict['page_no'] = item['page_no']
        
        # Join the sentences together into paragraph-like structure, aka join the list of sentences into one paragraph
        joined_sentence_chunk = ''.join(sentence_chunk).replace('  ', ' ').strip()
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk) # '.A' -> '. A'
        
        chunk_dict['sentence_chunk'] = joined_sentence_chunk
        
        # get some stats on the chunks
        chunk_dict['chunk_char_count'] = len(joined_sentence_chunk)
        chunk_dict['chunk_word_count'] = len([word for word in joined_sentence_chunk.split(' ')])
        chunk_dict['chunk_token_count'] = len(joined_sentence_chunk) / 4 # 1 token = ~4 chars
        
        pages_and_chunks.append(chunk_dict)
        
len(pages_and_chunks)

1843
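
The regex step in the cell above is easier to follow on a tiny example. The sketch below uses made-up sentences and assumes the sentence strings carry no trailing whitespace, so joining them with `''` glues them together. It also shows a side effect of the regex and one possible alternative (joining with a space), which is an assumption and not what the cell above does.

```python
import re

sentences = ["Protein is a macronutrient.",
             "It is made of amino acids.",
             "The U.S.A. has dietary guidelines."]

# Joining without a separator glues sentences together: '...macronutrient.It is...'
joined = "".join(sentences)

# Re-insert a space after any period that is directly followed by a capital letter
fixed = re.sub(r"\.([A-Z])", r". \1", joined)
print(fixed)

# Side effect: abbreviations such as 'U.S.A.' are split apart as well.
# Alternative (assumption): join with a space and skip the regex
joined_with_space = " ".join(sentences)
print(joined_with_space)
```

Which option works better depends on how clean the sentence strings are, so it is worth inspecting a few random chunks either way.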

In [25]:
random.sample(pages_and_chunks, k=1)

[{'page_no': 434,
  'sentence_chunk': 'and the impairment and ability to perform certain activities, as in driving a car. As a general rule, the liver can metabolize one standard drink (defined as 12 ounces of beer, 5 ounces of wine, or 1 ½ ounces of hard liquor) per hour. Drinking more than this, or more quickly, will cause BAC to rise to potentially unsafe levels. Table 10.1 “Mental and Physical Effects of Different BAC Levels” summarizes the mental and physical effects associated with different BAC levels. \xa0 Table 7.1 Mental and Physical Effects of Different BAC Levels BAC Percent Typical Effects 0.02 Some loss of judgment, altered mood, relaxation, increased body warmth 0.05 Exaggerated behavior, impaired judgment, may have some loss of muscle control (focusing eyes), usually good feeling, lowered alertness, release of inhibition 0.08 Poor muscle coordination (balance, speech, vision, reaction time), difficulty detecting danger, and impaired judgment, self-control, reasoning, an

In [26]:
df = pd.DataFrame(pages_and_chunks)
df.describe().round(2)

| | page_no | chunk_char_count | chunk_word_count | chunk_token_count |
|---|---|---|---|---|
| count | 1843.0 | 1843.0 | 1843.0 | 1843.0 |
| mean | 583.38 | 734.83 | 112.72 | 183.71 |
| std | 347.79 | 447.43 | 71.07 | 111.86 |
| min | -41.0 | 12.0 | 3.0 | 3.0 |
| 25% | 280.5 | 315.0 | 45.0 | 78.75 |
| 50% | 586.0 | 746.0 | 114.0 | 186.5 |
| 75% | 890.0 | 1118.5 | 173.0 | 279.62 |
| max | 1166.0 | 1831.0 | 297.0 | 457.75 |


### Filter out short chunks of text

These chunks may not contain much useful information.

In [29]:
# Show random chunks with 30 tokens or fewer
min_token_len = 30

for _, row in df[df['chunk_token_count'] <= min_token_len].sample(5).iterrows():
    print(f'Chunk token count: {row["chunk_token_count"]} | Text: {row["sentence_chunk"]}')

Chunk token count: 27.75 | Text: In exchange, for the reabsorption of sodium and water, potassium is excreted. Regulation of Water Balance | 169
Chunk token count: 12.75 | Text: PART VI CHAPTER 6. PROTEIN Chapter 6. Protein | 357
Chunk token count: 15.25 | Text: Accessed November 30, 2017. Discovering Nutrition Facts | 737
Chunk token count: 17.75 | Text: Table 6.1 Essential and Nonessential Amino Acids Defining Protein | 365
Chunk token count: 16.25 | Text: Table 14.2  Micronutrient Levels during Puberty 886 | Adolescence


In [30]:
# Keep only the rows with more than 30 tokens (this drops the short chunks)
pages_and_chunks_over_min_token_len = df[df['chunk_token_count'] > min_token_len].to_dict(orient='records')
pages_and_chunks_over_min_token_len[:2]

[{'page_no': -39,
  'sentence_chunk': 'Human Nutrition: 2020 Edition UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN NUTRITION PROGRAM ALAN TITCHENAL, SKYLAR HARA, NOEMI ARCEO CAACBAY, WILLIAM MEINKE-LAU, YA-YUN YANG, MARIE KAINOA FIALKOWSKI REVILLA, JENNIFER DRAPER, GEMADY LANGFELDER, CHERYL GIBBY, CHYNA NICOLE CHUN, AND ALLISON CALABRESE',
  'chunk_char_count': 308,
  'chunk_word_count': 42,
  'chunk_token_count': 77.0},
 {'page_no': -38,
  'sentence_chunk': 'Human Nutrition: 2020 Edition by University of Hawai‘i at Mānoa Food Science and Human Nutrition Program is licensed under a Creative Commons Attribution 4.0 International License, except where otherwise noted.',
  'chunk_char_count': 210,
  'chunk_word_count': 30,
  'chunk_token_count': 52.5}]

In [31]:
random.sample(pages_and_chunks_over_min_token_len, k=1)

[{'page_no': 1127,
  'sentence_chunk': 'Donini LM, Marsili D, Graziani MP, Imbriale M, Cannella C. (2004). Orthorexia nervosa: a preliminary study with a proposal for diagnosis and an attempt to measure the dimension of the phenomenon. Eating and Weight Disorders,\xa09(2), 151‐157. 9.\xa0Orthorexia. (2017, February 26). National Eating Disorders Association. https:/ /www.nationaleatingdisorders.org/learn/by- eating-disorder/other/orthorexia 10.\xa0Mathieu J. (2005). What is orthorexia?',
  'chunk_char_count': 441,
  'chunk_word_count': 53,
  'chunk_token_count': 110.25}]

### Embedding Text Chunks

Embeddings are a broad but powerful concept.

While humans understand text, machines understand numbers.

What we'd like to do:
- Turn our text chunks into numbers, specifically embeddings, which are a useful numerical representation.

The best part about embeddings is that they are a *learned* representation.

For example, a simple mapping from words to numbers could look like this (in reality, embeddings have many dimensions):
```
'a': 0
'the': 1
```
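
To hint at why a learned, multi-dimensional representation is more useful than a single index per word, here is a toy sketch. The 3-dimensional vectors are invented by hand purely for illustration (a real embedding model learns them and uses far more dimensions), and `cosine_similarity` is a helper written for this example.

```python
import math

# Simple index mapping: each word gets an arbitrary integer
word_to_index = {"a": 0, "the": 1}

# Hand-made 3-dimensional vectors standing in for learned embeddings (values are invented)
toy_embeddings = {
    "protein": [0.9, 0.1, 0.0],
    "amino acid": [0.8, 0.2, 0.1],
    "bicycle": [0.0, 0.1, 0.9],
}

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; values near 1.0 mean similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_related = cosine_similarity(toy_embeddings["protein"], toy_embeddings["amino acid"])
sim_unrelated = cosine_similarity(toy_embeddings["protein"], toy_embeddings["bicycle"])
print(f"protein vs amino acid: {sim_related:.2f}")
print(f"protein vs bicycle:    {sim_unrelated:.2f}")
```

With plain indices, the distance between two numbers says nothing about meaning. With vectors, texts about related topics can end up close together, which is what the retrieval step relies on.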