# Create & Run a Local RAG Pipeline from Scratch

## What is RAG?

RAG stands for Retrieval Augmented Generation.

The goal of RAG is to take information and pass it to an LLM, so it can generate outputs based on that information.

* Retrieval - Find relevant information given a query, e.g. "What are the macronutrients & what do they do?" -> retrieves passages of text related to the macronutrients from a nutrition textbook.

* Augmented - We want to take the relevant information & augment our input (prompt) to an LLM with that relevant information.

* Generation - Take the first 2 steps & pass them to an LLM for generative outputs.
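The three stages above can be sketched in a few lines of plain Python. This is only a toy illustration: the passages, the word-overlap `retrieve` function, the prompt wording in `augment`, and the stubbed `generate` function are all hypothetical stand-ins, not the retrieval system or LLM built later in this notebook.

```python
# Toy walk-through of Retrieval -> Augmentation -> Generation.
# Every name and passage below is a hypothetical placeholder.

passages = [
    "Macronutrients are carbohydrates, lipids and proteins.",
    "Water makes up a large share of body weight.",
    "Vitamins are needed in small amounts.",
]

def _words(text: str) -> set[str]:
    """Lowercase the text, drop simple punctuation and return its set of words."""
    return set(text.lower().replace("?", "").replace(".", "").replace(",", "").split())

def retrieve(query: str, passages: list[str], k: int = 1) -> list[str]:
    """Retrieval: rank passages by how many words they share with the query."""
    query_words = _words(query)
    ranked = sorted(passages, key=lambda p: len(query_words & _words(p)), reverse=True)
    return ranked[:k]

def augment(query: str, context: list[str]) -> str:
    """Augmentation: put the retrieved passages into the prompt."""
    context_block = "\n".join(f"- {c}" for c in context)
    return f"Use the context to answer the query.\n\nContext:\n{context_block}\n\nQuery: {query}"

def generate(prompt: str) -> str:
    """Generation: stand-in for a real LLM call."""
    return f"[LLM output for a prompt of {len(prompt)} characters]"

query = "What are the macronutrients?"
context = retrieve(query, passages, k=1)
prompt = augment(query, context)
print(prompt)
print(generate(prompt))
```

A real pipeline swaps the word-overlap ranking for embedding-based search and the stub for an actual LLM, but the flow of data stays the same.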

Where RAG came from - Facebook / Meta AI Paper: *Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks*
> This work offers several positive societal benefits over previous work: the fact that it is more strongly grounded in real factual knowledge (in this case Wikipedia) makes it “hallucinate” less with generations that are more factual, and offers more control and interpretability. RAG could be employed in a wide variety of scenarios with direct benefit to society, for example by endowing it with a medical index and asking it open-domain questions on that topic, or by helping people be more effective at their jobs.

## Why RAG?

The main goal of RAG is to improve the generation outputs of LLMs.

1. Prevent hallucinations - LLMs are incredibly good at generating good *looking* text. However, good-looking text is not necessarily factual. RAG can help LLMs generate answers based on relevant passages that are factual.

2. Work with custom data - Many base LLMs are trained with internet-scale data. This means they have a fairly good understanding of language in general. However, it also means a lot of their responses can be generic in nature. RAG helps to create specific responses based on specific documents (e.g. your own company's customer support documents).

## What can RAG be used for?

* Customer Support Q&A Chat - Treat your existing customer support documents as a resource. When a customer asks a question, a retrieval system finds relevant documentation snippets & an LLM crafts those snippets into an answer. Think of this as a "chatbot for your documentation".

* Email Chain Analysis - Let's say you are a large insurance company & you have chains and chains of emails of customer claims. You could use a RAG pipeline to find relevant information from those emails & then use an LLM to process that information into structured data.

* Company Internal Documentation Chat

* Textbook Q&A - Let's say you are a nutrition student and you've got a 1200-page textbook to read. You could build a RAG pipeline to go through the textbook and find passages relevant to the questions you have.

Common theme here: Take the documents relevant to a query & process them with an LLM.

From this angle, consider an LLM as a calculator for words.

## Why Local?

Fun. 

Privacy, Speed, Cost.

* Privacy - If you have private documentation, maybe you don't want to send that to an API. You want to set up an LLM and run it on your own hardware.
* Speed - Whenever you use an API, you have to send some kind of data across the internet. This takes time. Running locally means we don't have to wait for transfers of data.
* Cost - If you own your hardware, the cost is paid. It may have a large cost to begin with. But over time, you don't have to keep paying API fees.
* No Vendor Lock-in - If you run your own software / hardware, you can still run your business even if a large company shuts down tomorrow.

## What Will Be Built?

Build NutriChat to "chat with a nutrition document".

Specifically:

1. Open a PDF document (you could use almost any PDF here or even a collection of PDFs).
2. Format the text of the PDF textbook ready for an embedding model.
3. Embed all of the chunks of text in the textbook, turning them into numerical representations (embeddings) which can be stored for later.
4. Build a retrieval system that uses vector search to find relevant chunks of text based on a query.
5. Create a prompt that incorporates the retrieved pieces of text.
6. Generate an answer to a query based on the passages of the textbook with an LLM.

All Locally!

1. Steps 1 - 3: Document Preprocessing & Embedding Creation.
2. Steps 4 - 6: Search & Answer.
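Step 4 depends on vector search: comparing a query embedding against every chunk embedding and keeping the closest matches. As a minimal sketch, assuming embeddings are plain lists of floats and that cosine similarity is the chosen score, it could look like the code below. The three-dimensional vectors are made-up placeholders (real embeddings have hundreds of dimensions), and `cosine_similarity` and `top_k` are hypothetical helper names.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 2) -> list[tuple[float, int]]:
    """Return (score, chunk_index) pairs for the k most similar chunks, best first."""
    scores = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)]
    return sorted(scores, reverse=True)[:k]

# Placeholder embeddings, not produced by a real embedding model
chunk_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
]
query_vector = [1.0, 0.0, 0.0]

results = top_k(query_vector, chunk_vectors, k=2)
for score, idx in results:
    print(f"chunk {idx}: score {score:.4f}")
```

Looping over every chunk like this is fine for a single textbook. Larger collections usually call for batched tensor operations or a dedicated vector index.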

## 1. Document / Text Preprocessing & Embedding Creation

Ingredients:
* PDF document of choice (note: this could be almost any kind of document, PDFs are just the focus for now).
* Embedding model of choice

Steps:
1. Import PDF Document.
2. Preprocess Text for Embedding (e.g. Split into Chunks of Sentences).
3. Embed Text Chunks with Embedding Model.
4. Save Embeddings to File for Later (embeddings saved to file will last for many years or until you lose your hard drive).

## Import PDF Document

In [1]:
import os
import requests

In [2]:
# path to document
pdf_path = 'human-nutrition-text.pdf'

# download PDF
if not os.path.exists(pdf_path):
    print('[INFO] File does not exist, downloading...')
    
    # url of the pdf
    url = 'https://pressbooks.oer.hawaii.edu/humannutrition2/open/download?type=pdf'
    
    # the local file name to save the downloaded file
    fname = pdf_path
    
    # GET request
    res = requests.get(url)
    
    # check if the request is successful
    if res.status_code == 200:
        # open the file & save it
        with open(fname, 'wb') as f:
            f.write(res.content)
        print(f'[INFO] The file has been downloaded & saved as {fname}.')
    else:
        print(f'[INFO] Failed to download the file. Status Code: {res.status_code}')
else:
    print(f'[INFO] File {pdf_path} exists.')

[INFO] File does not exist, downloading...
[INFO] The file has been downloaded & saved as human-nutrition-text.pdf.


PDF is now available, let's open it.

In [4]:
import fitz # from PyMuPDF
from tqdm.auto import tqdm 

def text_formatter(text: str) -> str:
    '''Performs minor formatting on text.'''
    cleaned_text = text.replace('\n', ' ').strip()
    
    return cleaned_text

def open_and_read_pdf(path: str) -> list[dict]:
    '''Opens a PDF and returns one dict of text and basic statistics per page.'''
    doc = fitz.open(path)
    pages_and_texts = []
    
    for page_no, page in tqdm(enumerate(doc)):
        text = page.get_text()
        text = text_formatter(text=text)
        # Subtract 41 so page_no lines up with the page numbers printed in the book
        pages_and_texts.append({'page_no': page_no - 41,
                                'page_char_cnt': len(text),
                                'page_word_cnt': len(text.split(' ')),
                                'page_sentence_cnt_raw': len(text.split('. ')),
                                'page_token_cnt': len(text) / 4, # 1 token ~ 4 chars
                                'text': text})
    return pages_and_texts

pages_and_texts = open_and_read_pdf(path=pdf_path)
pages_and_texts[:2]


[{'page_no': -41,
  'page_char_cnt': 29,
  'page_word_cnt': 4,
  'page_sentence_cnt_raw': 1,
  'page_token_cnt': 7.25,
  'text': 'Human Nutrition: 2020 Edition'},
 {'page_no': -40,
  'page_char_cnt': 0,
  'page_word_cnt': 1,
  'page_sentence_cnt_raw': 1,
  'page_token_cnt': 0.0,
  'text': ''}]

In [5]:
import random

random.sample(pages_and_texts, k=3)

[{'page_no': 404,
  'page_char_cnt': 116,
  'page_word_cnt': 16,
  'page_sentence_cnt_raw': 1,
  'page_token_cnt': 29.0,
  'text': 'view it online here:  http:/ /pressbooks.oer.hawaii.edu/ humannutrition2/?p=268  404  |  Diseases Involving Proteins'},
 {'page_no': 1090,
  'page_char_cnt': 536,
  'page_word_cnt': 98,
  'page_sentence_cnt_raw': 6,
  'page_token_cnt': 134.0,
  'text': 'Image by  BruceBlaus/  CC BY 4.0  When the vertebral bone tissue is weakened, it can cause the spine  to curve. The increase in spine curvature not only causes pain,  but also decreases a person’s height. Curvature of the upper spine  produces what is called Dowager’s hump, also known as kyphosis.  Severe upper-spine deformity can compress the chest cavity and  cause difficulty breathing. It may also cause abdominal pain and loss  of appetite because of the increased pressure on the abdomen.  1090  |  Nutrition, Health and Disease'},
 {'page_no': 166,
  'page_char_cnt': 1783,
  'page_word_cnt': 315,
  'page

In [6]:
import pandas as pd

df = pd.DataFrame(pages_and_texts)
df.head()

| | page_no | page_char_cnt | page_word_cnt | page_sentence_cnt_raw | page_token_cnt | text |
|---|---|---|---|---|---|---|
| 0 | -41 | 29 | 4 | 1 | 7.25 | Human Nutrition: 2020 Edition |
| 1 | -40 | 0 | 1 | 1 | 0.0 | |
| 2 | -39 | 320 | 54 | 1 | 80.0 | Human Nutrition: 2020 Edition UNIVERSITY OF ... |
| 3 | -38 | 212 | 32 | 1 | 53.0 | Human Nutrition: 2020 Edition by University of... |
| 4 | -37 | 797 | 145 | 2 | 199.25 | Contents Preface University of Hawai‘i at Mā... |


In [8]:
df.describe().round(2)

| | page_no | page_char_cnt | page_word_cnt | page_sentence_cnt_raw | page_token_cnt |
|---|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 562.5 | 1148.59 | 198.89 | 9.97 | 287.15 |
| std | 348.86 | 560.44 | 95.75 | 6.19 | 140.11 |
| min | -41.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 25% | 260.75 | 762.75 | 134.0 | 4.0 | 190.69 |
| 50% | 562.5 | 1232.5 | 215.0 | 10.0 | 308.12 |
| 75% | 864.25 | 1605.25 | 271.25 | 14.0 | 401.31 |
| max | 1166.0 | 2308.0 | 429.0 | 32.0 | 577.0 |


Why would we care about token count?

Token count is important to think about because:

1. Embedding models don't deal with infinite tokens.
2. LLMs don't deal with infinite tokens.

For example, an embedding model may have been trained to embed sequences of up to 384 tokens into numerical space (sentence-transformers `all-mpnet-base-v2`, see: https://www.sbert.net/docs/sentence_transformer/pretrained_models.html). Text beyond that limit is typically cut off, so its information never reaches the embedding.

As for LLMs, they can't accept infinite tokens in their context window.
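To make the limit concrete, here is a small sketch that reuses the rough "1 token ~ 4 characters" rule from the page statistics above and checks a text against a 384-token limit. The helper names `estimate_token_count` and `exceeds_limit` are hypothetical, and the estimate is only a heuristic: a model's real tokenizer may count differently.

```python
def estimate_token_count(text: str, chars_per_token: int = 4) -> float:
    """Rough token estimate: number of characters divided by ~4."""
    return len(text) / chars_per_token

def exceeds_limit(text: str, max_tokens: int = 384) -> bool:
    """True if the estimated token count is above the assumed model limit."""
    return estimate_token_count(text) > max_tokens

title_page = 'Human Nutrition: 2020 Edition'
print(estimate_token_count(title_page))  # 29 characters / 4 = 7.25

# The longest page in the statistics above has 2308 characters (~577 tokens).
# A string of that length stands in for the page text here.
longest_page_stand_in = 'x' * 2308
print(estimate_token_count(longest_page_stand_in))  # 577.0
print(exceeds_limit(longest_page_stand_in))         # True
```

Since the 75th percentile page is already around 401 estimated tokens, a sizeable share of whole pages would not fit into a 384-token embedding model, which is the motivation for splitting pages into smaller pieces.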

## Further Text Preprocessing

The next step is to split pages into sentences.

There are 2 ways to do this:

1. Split on `'. '` (this is how `page_sentence_cnt_raw` was computed above).
2. Use an NLP library such as spaCy or NLTK.
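A quick sketch shows why plain string splitting is only a rough approximation. The example sentence is made up, and `naive_sentence_count` is a hypothetical helper that mirrors the `len(text.split('. '))` expression used for `page_sentence_cnt_raw`.

```python
def naive_sentence_count(text: str) -> int:
    """Count 'sentences' by splitting on a period followed by a space."""
    return len(text.split('. '))

# Abbreviations contain '. ' too, so two sentences are counted as three
example = 'Dr. Smith wrote the chapter. It covers lipids.'
print(example.split('. '))            # ['Dr', 'Smith wrote the chapter', 'It covers lipids.']
print(naive_sentence_count(example))  # 3

# An empty page still counts as one sentence, because ''.split('. ') returns ['']
print(naive_sentence_count(''))       # 1
```

The second case explains why the blank page in the output above reports `page_sentence_cnt_raw` of 1. A rule-based sentence splitter from an NLP library tends to handle these edge cases more sensibly.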

In [11]:
from spacy.lang.en import English

nlp = English()

# Add a sentencizer pipeline (turning texts into sentences)
nlp.add_pipe('sentencizer')

# Create document instance as an example
doc = nlp('This is a sentence. This is another sentence. I like elephants.')
assert len(list(doc.sents)) == 3

# Print sentences split
list(doc.sents)

[This is a sentence., This is another sentence., I like elephants.]

In [12]:
pages_and_texts[0]

{'page_no': -41,
 'page_char_cnt': 29,
 'page_word_cnt': 4,
 'page_sentence_cnt_raw': 1,
 'page_token_cnt': 7.25,
 'text': 'Human Nutrition: 2020 Edition'}

In [13]:
for item in tqdm(pages_and_texts):
    item['sentences'] = list(nlp(item['text']).sents)

    # Make sure all sentences are string (default type is spaCy data type)
    item['sentences'] = [str(sentence) for sentence in item['sentences']]
    
    # Count the sentences
    item['page_sentence_cnt_spacy'] = len(item['sentences'])


In [15]:
random.sample(pages_and_texts, k=1)

[{'page_no': 196,
  'page_char_cnt': 1598,
  'page_word_cnt': 279,
  'page_sentence_cnt_raw': 13,
  'page_token_cnt': 399.5,
  'text': 'Potassium also is involved in protein synthesis, energy metabolism,  and platelet function, and acts as a buffer in blood, playing a role in  acid-base balance.  Imbalances of Potassium  Insufficient potassium levels in the body (hypokalemia) can be  caused by a low dietary intake of potassium or by high sodium  intakes, but more commonly it results from medications that  increase water excretion, mainly diuretics. The signs and symptoms  of hypokalemia are related to the functions of potassium in nerve  cells and consequently skeletal and smooth-muscle contraction.  The signs and symptoms include muscle weakness and cramps,  respiratory distress, and constipation. Severe potassium depletion  can cause the heart to have abnormal contractions and can even  be fatal. High levels of potassium in the blood, or hyperkalemia,  also affects the heart. It is a

In [17]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

| | page_no | page_char_cnt | page_word_cnt | page_sentence_cnt_raw | page_token_cnt | page_sentence_cnt_spacy |
|---|---|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 562.5 | 1148.59 | 198.89 | 9.97 | 287.15 | 10.32 |
| std | 348.86 | 560.44 | 95.75 | 6.19 | 140.11 | 6.3 |
| min | -41.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 25% | 260.75 | 762.75 | 134.0 | 4.0 | 190.69 | 5.0 |
| 50% | 562.5 | 1232.5 | 215.0 | 10.0 | 308.12 | 10.0 |
| 75% | 864.25 | 1605.25 | 271.25 | 14.0 | 401.31 | 15.0 |
| max | 1166.0 | 2308.0 | 429.0 | 32.0 | 577.0 | 28.0 |


### Chunking our sentences together

The concept of splitting larger pieces of text into smaller ones is often referred to as `text splitting` or `chunking`.

There is no 100% correct way to do this - experiment!

To keep it simple, we will split the text into groups of 10 sentences.

There are frameworks such as `langchain` which can help with this, but we will use plain Python for now.
- https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/

Why do we do this:
1. So the texts are easier to filter (smaller groups of text can be easier to inspect than large passages of text).
2. So the text chunks can fit into the embedding model's context window (e.g. a limit of 384 tokens).
3. So the context passed into the LLM can be more specific and focused.
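Grouping sentences into chunks of 10 can be done with list slicing. The following is a minimal sketch in plain Python: `split_list` is a hypothetical helper name, and the numbered placeholder sentences stand in for the `sentences` list attached to each page earlier.

```python
def split_list(items: list[str], chunk_size: int = 10) -> list[list[str]]:
    """Split a list into consecutive sub-lists of at most chunk_size items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

# Placeholder sentences standing in for one page's sentence list
sentences = [f'Sentence {i}.' for i in range(25)]

chunks = split_list(sentences, chunk_size=10)
print([len(chunk) for chunk in chunks])  # [10, 10, 5]

# Each chunk can be joined back into a single string before embedding
first_chunk_text = ' '.join(chunks[0])
print(first_chunk_text)
```

The last chunk is simply shorter when the sentence count is not a multiple of 10. Very short leftover chunks may carry little information, so filtering them by length is one option worth experimenting with.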