# Create & Run a Local RAG Pipeline from Scratch

## What is RAG?

RAG stands for Retrieval Augmented Generation.

The goal of RAG is to take information and pass it to an LLM, so it can generate outputs based on that information.

* Retrieval - Find relevant information given a query, e.g. "What are the macronutrients & what do they do?" -> retrieves passages of text related to the macronutrients from a nutrition textbook.

* Augmented - We want to take the relevant information & augment our input (prompt) to an LLM with that relevant information.

* Generation - Take the first 2 steps & pass them to an LLM for generative outputs.
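
The three steps above can be sketched in a few lines of Python. This is a toy illustration rather than the pipeline built later: retrieval here is plain word overlap instead of embedding search, the passages are made-up examples, and `generate` is a hypothetical placeholder standing in for a real LLM call.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase a string and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query: str, passages: list[str], top_k: int = 2) -> list[str]:
    """Retrieval: rank passages by how many words they share with the query."""
    query_words = tokenize(query)
    ranked = sorted(passages,
                    key=lambda p: len(query_words & tokenize(p)),
                    reverse=True)
    return ranked[:top_k]


def augment(query: str, context: list[str]) -> str:
    """Augmented: place the retrieved passages in the prompt, ahead of the query."""
    context_block = "\n".join(f"- {passage}" for passage in context)
    return f"Answer the query using the context.\nContext:\n{context_block}\nQuery: {query}"


def generate(prompt: str) -> str:
    """Generation: hypothetical placeholder, a real pipeline would call an LLM here."""
    return f"[LLM output for a prompt of {len(prompt)} characters]"


# Made-up example passages (assumption: not taken from the textbook).
passages = [
    "Macronutrients are carbohydrates, proteins and fats.",
    "Water makes up a large part of body weight.",
    "Proteins are macronutrients that help build and repair tissue.",
]

query = "What are the macronutrients?"
context = retrieve(query, passages)
prompt = augment(query, context)
print(prompt)
print(generate(prompt))
```

Swapping the word-overlap scoring for embedding similarity and the placeholder for a local LLM gives the shape of the full pipeline.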

Where RAG came from - Facebook / Meta AI Paper: *Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks*
> This work offers several positive societal benefits over previous work: the fact that it is more strongly grounded in real factual knowledge (in this case Wikipedia) makes it “hallucinate” less with generations that are more factual, and offers more control and interpretability. RAG could be employed in a wide variety of scenarios with direct benefit to society, for example by endowing it with a medical index and asking it open-domain questions on that topic, or by helping people be more effective at their jobs.

## Why RAG?

The main goal of RAG is to improve the generation outputs of LLMs.

1. Reduce hallucinations - LLMs are incredibly good at generating good *looking* text, but good-looking text is not necessarily factual. RAG can help LLMs generate outputs grounded in relevant, factual passages.

2. Work with custom data - Many base LLMs are trained with internet-scale data. This means they have a fairly good understanding of language in general. However, it also means a lot of their responses can be generic in nature. RAG helps to create specific responses based on specific documents (e.g. your own company's customer support documents).

## What can RAG be used for?

* Customer Support Q&A Chat - Treat your existing customer support documents as a resource. When a customer asks a question, a retrieval system retrieves relevant documentation snippets & an LLM crafts those snippets into an answer. Think of this as a "chatbot for your documentation".

* Email Chain Analysis - Let's say you are a large insurance company & you have chains and chains of emails of customer claims. You could use a RAG pipeline to find relevant information from those emails & then use an LLM to process that information into structured data.

* Company Internal Documentation Chat

* Textbook Q&A - Let's say you are a nutrition student and you've got a 1200-page textbook to read. You could build a RAG pipeline to go through the textbook and find passages relevant to the questions you have.

Common theme here: Take the documents relevant to a query & process them with an LLM.

From this angle, consider an LLM as a calculator for words.

## Why Local?

Fun. 

Privacy, Speed, Cost.

* Privacy - If you have private documentation, maybe you don't want to send that to an API. You want to set up an LLM and run it on your own hardware.
* Speed - Whenever you use an API, you have to send some kind of data across the internet. This takes time. Running locally means we don't have to wait for transfers of data.
* Cost - If you own your hardware, the cost is paid. It may have a large cost to begin with. But over time, you don't have to keep paying API fees.
* No Vendor Lock-in - If you run your own software / hardware and a large company shuts down tomorrow, you can still run your business.

## What Will Be Built?

Build NutriChat to "chat with a nutrition document".

Specifically:

1. Open a PDF document (you could use almost any PDF here or even a collection of PDFs).
2. Format the text of the PDF textbook ready for an embedding model.
3. Embed all of the chunks of text in the textbook, turning them into numerical representations (embeddings) which can be stored for later.
4. Build a retrieval system that uses vector search to find relevant chunks of text based on a query.
5. Create a prompt that incorporates the retrieved pieces of text.
6. Generate an answer to a query based on the passages of the textbook with an LLM.

All Locally!

1. Steps 1 - 3: Document Preprocessing & Embedding Creation.
2. Steps 4 - 6: Search & Answer.
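
Step 4 relies on vector search: the query and every chunk are represented as vectors, and the chunks whose vectors point in the most similar direction to the query are returned. A minimal sketch using cosine similarity, assuming tiny hand-written 3-dimensional vectors in place of real embeddings (a real embedding model produces far larger vectors) and a hypothetical `vector_search` helper:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def vector_search(query_vec: list[float],
                  chunk_vecs: list[list[float]],
                  top_k: int = 2) -> list[tuple[int, float]]:
    """Return (chunk index, score) pairs for the top_k most similar chunks."""
    scores = [(i, cosine_similarity(query_vec, vec))
              for i, vec in enumerate(chunk_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_k]


# Hand-written toy vectors (assumption: stand-ins for real embeddings).
chunks = ["protein chunk", "vitamin chunk", "hydration chunk"]
chunk_vecs = [[0.9, 0.1, 0.0],
              [0.1, 0.9, 0.1],
              [0.0, 0.2, 0.9]]
query_vec = [0.8, 0.2, 0.1]  # pretend embedding of a protein question

for idx, score in vector_search(query_vec, chunk_vecs):
    print(f"{score:.3f}  {chunks[idx]}")
```

Pure Python keeps the sketch dependency-free. With thousands of chunks, the same computation would normally be done as one matrix operation in a numerical library.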

## 1. Document / Text Preprocessing & Embedding Creation

Ingredients:
* PDF document of choice (note: this could be almost any kind of document, PDFs are just the focus for now).
* Embedding model of choice

Steps:
1. Import PDF Document.
2. Preprocess Text for Embedding (e.g. Split into Chunks of Sentences).
3. Embed Text Chunks with Embedding Model.
4. Save Embeddings to File for Later (Embeddings stored in a file will last for many years or until you lose your hard drive).
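
Steps 2 and 4 can be sketched with the standard library alone. This assumes a naive sentence split on `'. '`, a hypothetical chunk size of 3 sentences, made-up embedding values, and JSON as the file format. None of these choices are prescribed by the steps above.

```python
import json
import tempfile
from pathlib import Path


def split_into_chunks(text: str, sentences_per_chunk: int = 3) -> list[str]:
    """Naively split text into sentences on '. ', then group them into chunks."""
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    return [". ".join(sentences[i:i + sentences_per_chunk])
            for i in range(0, len(sentences), sentences_per_chunk)]


def save_embeddings(records: list[dict], path: Path) -> None:
    """Write chunk text and embedding vectors to a JSON file."""
    path.write_text(json.dumps(records), encoding="utf-8")


def load_embeddings(path: Path) -> list[dict]:
    """Read the records back, e.g. in a later session."""
    return json.loads(path.read_text(encoding="utf-8"))


text = "One. Two. Three. Four. Five. Six. Seven."
chunks = split_into_chunks(text)

# Made-up 2-dimensional vectors (assumption: a real embedding model goes here).
records = [{"chunk": chunk, "embedding": [float(i), float(len(chunk))]}
           for i, chunk in enumerate(chunks)]

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = Path(tmp_dir) / "embeddings.json"
    save_embeddings(records, file_path)
    restored = load_embeddings(file_path)

print(chunks)
print(restored == records)
```

Saving once and loading later means the embedding step, usually the slowest part of preprocessing, does not need to be repeated for every session.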

## Import PDF Document

In [1]:
import os
import requests

In [5]:
# path to document
pdf_path = 'human-nutrition-text.pdf'

# download PDF
if not os.path.exists(pdf_path):
    print('[INFO] File does not exist, downloading...')
    
    # url of the pdf
    url = 'https://pressbooks.oer.hawaii.edu/humannutrition2/open/download?type=pdf'
    
    # the local file name to save the downloaded file
    fname = pdf_path
    
    # GET request
    res = requests.get(url)
    
    # check if the request is successful
    if res.status_code == 200:
        # open the file & save it
        with open(fname, 'wb') as f:
            f.write(res.content)
        print(f'[INFO] The file has been downloaded & saved as {fname}.')
    else:
        print(f'[INFO] Failed to download the file. Status Code: {res.status_code}')
else:
    print(f'[INFO] File {pdf_path} exists.')

[INFO] File does not exist, downloading...
[INFO] The file has been downloaded & saved as human-nutrition-text.pdf.


The PDF is now available, so let's open it.

In [None]:
import fitz # from PyMuPDF
from tqdm.auto import tqdm 

def text_formatter(text: str) -> str:
    '''Performs minor formatting on text.'''
    cleaned_text = text.replace('\n', ' ').strip()
    
    return cleaned_text

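def open_and_read_pdf(path: str) -> list[dict]:
    '''Opens a PDF and collects basic text statistics for each page.'''
    doc = fitz.open(path)
    pages_and_texts = []
    
    for page_no, page in tqdm(enumerate(doc)):
        text = page.get_text()
        text = text_formatter(text=text)
        # the - 41 offset shifts the 0-based PDF page index
        # (assumption: so it lines up with the book's printed page numbers)
        pages_and_texts.append({'page_no': page_no - 41,
                                'page_char_cnt': len(text),
                                'page_word_cnt': len(text.split(' ')),
                                'page_sentence_cnt_raw': len(text.split('. ')),
                                'page_token_cnt': len(text) / 4})  # 1 token ~ 4 chars
    
    return pages_and_texts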