## Create and run a local RAG pipeline from scratch

### What is RAG?

RAG stands for Retrieval Augmented Generation.

The goal of RAG is to take information and pass it to an LLM so it can generate outputs based on that information.

* Retrieval - Find relevant information given a query. For example, for the query "What are the macronutrients and what do they do?", retrieve passages about macronutrients from a nutrition textbook.

* Augmented - Take the retrieved information and use it to augment the input (prompt) to an LLM.

* Generation - Pass the augmented prompt to an LLM so it generates an output based on the retrieved information.
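
The three steps above can be sketched in a few lines of Python. This is a toy illustration only: counting shared words stands in for a real retrieval model, no LLM is called, and the function names (`tokenize`, `retrieve`, `augment`) and the example passages are made up for this sketch.

```python
# Toy sketch of retrieve -> augment -> generate.
# Assumption: word overlap stands in for a real retrieval model.

def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its words without surrounding punctuation."""
    return {word.strip(".,?!") for word in text.lower().split()}

def retrieve(query: str, passages: list[str], top_k: int = 1) -> list[str]:
    """Retrieval: rank passages by how many words they share with the query."""
    query_words = tokenize(query)
    ranked = sorted(passages, key=lambda p: len(query_words & tokenize(p)), reverse=True)
    return ranked[:top_k]

def augment(query: str, context: list[str]) -> str:
    """Augmentation: put the retrieved passages into the prompt before the question."""
    context_block = "\n".join(f"- {passage}" for passage in context)
    return f"Answer the question using the context.\nContext:\n{context_block}\nQuestion: {query}"

# Made-up example passages
passages = [
    "Macronutrients are carbohydrates, lipids and proteins.",
    "Water makes up a large share of body weight.",
    "Vitamins are needed in small amounts.",
]
query = "What are the macronutrients?"

prompt = augment(query, retrieve(query, passages))
print(prompt)  # Generation: this prompt is what would be passed to an LLM
```

The rest of this notebook replaces the word-overlap step with embeddings and vector search.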

If you want to read where RAG came from, see the paper from Facebook AI: https://proceedings.neurips.cc/paper/2020/file/6b493230205f780e1bc26945df7481e5-Paper.pdf

> This work offers several positive societal benefits over previous work: the fact that it is more
strongly grounded in real factual knowledge (in this case Wikipedia) makes it “hallucinate” less
with generations that are more factual, and offers more control and interpretability. RAG could be
employed in a wide variety of scenarios with direct benefit to society, for example by endowing it
with a medical index and asking it open-domain questions on that topic, or by helping people be more
effective at their jobs.

### Why RAG?

The main goal of RAG is to improve the generation outputs of LLMs.

1. Reduce hallucinations - LLMs are incredibly good at generating text that *looks* good, but good-looking text is not necessarily factual. RAG can help LLMs generate outputs based on relevant passages that are factual.

2. Work with custom data - Many base LLMs are trained on internet-scale data. This gives them a fairly good understanding of language in general, but it also means many of their responses can be generic. RAG helps to create specific responses based on specific documents (e.g. your own company's customer support documents).


### What can RAG be used for?

* Customer support Q&A chat - Treat your existing support documents as a resource. When a customer asks a question, a retrieval system retrieves relevant documentation snippets and an LLM crafts those snippets into an answer. Think of this as a "chatbot" for your documentation.

* Email chain analysis - Let's say you're a large insurance company and you have chains and chains of emails of customer claims. You could use a RAG pipeline to find relevant information from those emails and then use an LLM to process that information into structured data.

* Company internal documentation chat

* Textbook Q&A - Let's say you are a nutrition student with a 1200-page textbook to read. You could build a RAG pipeline to go through the textbook and find passages relevant to your questions.

Common theme here: take the documents relevant to a query and process them with an LLM.

From this angle, you can consider an LLM as a calculator for words.


### Why local?

Fun.

Privacy, speed and cost.

* Privacy - If you have private documentation, maybe you do not want to send it to an API. Instead, you can set up an LLM and run it on your own hardware.

* Speed - Whenever you use an API, you have to send some kind of data across the internet. This takes time. Running locally means we do not have to wait for transfers of data.

* Cost - If you own your hardware, the cost is paid. It may be a large cost to begin with, but over time you do not have to keep paying API fees.

* No vendor lock-in - If you run your own software and hardware, you can still run your business even if OpenAI or another large internet company shuts down tomorrow.

In [None]:
!nvidia-smi

/bin/bash: line 1: nvidia-smi: command not found


### What we are going to build

* https://github.com/mrdbourke/simple-local-rag
* https://whimsical.com/simple-local-rag-workflow-39kToR3yNf7E8kY4sS2tjV

We are going to build NutriChat to "chat with a nutrition textbook".

Specifically:

1. Open a PDF document (you could use almost any PDF here or even a collection of PDFs).

2. Format the text of the PDF textbook ready for an embedding model.

3. Embed all of the chunks of text in the textbook and turn them into numerical representations which we can store for later.

4. Build a retrieval system that uses vector search to find relevant chunks of text based on a query.

5. Create a prompt that incorporates the retrieved pieces of text.

6. Generate an answer to a query based on the passages of the textbook with an LLM.

These steps fall into two parts:

* Steps 1-3: Document preprocessing and embedding creation.

* Steps 4-6: Search and answer.
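
Step 4 relies on vector search. As a rough sketch of the idea, the snippet below ranks chunks by cosine similarity between a query embedding and each chunk embedding. The similarity measure is an assumption here (the steps above do not name one), the 3-dimensional vectors are made up rather than produced by an embedding model, and `top_k_cosine` is a hypothetical helper name.

```python
import numpy as np

def top_k_cosine(query_vec: np.ndarray, chunk_vecs: np.ndarray, k: int = 2) -> list[tuple[int, float]]:
    """Return (chunk index, cosine similarity) pairs for the k most similar chunks."""
    # Normalise to unit length so that a dot product equals cosine similarity
    query_unit = query_vec / np.linalg.norm(query_vec)
    chunk_units = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = chunk_units @ query_unit
    # Sort indices from highest to lowest score and keep the first k
    best = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in best]

# Made-up embeddings: one row per text chunk
chunk_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
])
query_embedding = np.array([1.0, 0.1, 0.0])

results = top_k_cosine(query_embedding, chunk_embeddings, k=2)
print(results)  # chunk 0 ranks first, then chunk 2
```

The returned indices would then be used to look up the matching text chunks for the prompt in step 5.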

### 1. Document/text processing and embedding creation

Ingredients:

* PDF document of choice (note: this could be almost any kind of document; I have just chosen to focus on PDFs for now).

* Embedding model of choice.

Steps:

1. Import PDF document.

2. Process text for embedding (e.g. split into chunks of sentences).

3. Embed text chunks with embedding model.

4. Save the embeddings to a file for later (embeddings stored on file will last for many years or until you lose your hard drive).
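
As an illustration of step 2, here is one possible way to split text into chunks of sentences. It is a sketch under simple assumptions: sentences are separated by `". "`, the chunk size of 2 is arbitrary, and `split_into_chunks` is a hypothetical helper, not a function from any library.

```python
def split_into_chunks(items: list[str], chunk_size: int) -> list[list[str]]:
    """Split a list into consecutive sublists of at most chunk_size items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

text = (
    "Proteins are made of amino acids. Carbohydrates include sugars and starches. "
    "Lipids include fats and oils. Water is essential for life. "
    "Vitamins are needed in small amounts."
)

# Naive sentence split on ". " (a real pipeline may need a proper sentence splitter)
sentences = [s.strip() for s in text.split(". ") if s.strip()]

# Group sentences into chunks; the last chunk may be shorter than chunk_size
chunks = split_into_chunks(sentences, chunk_size=2)
for chunk in chunks:
    print(chunk)
```

Each chunk (joined back into a single string) is what would be passed to the embedding model in step 3.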



In [32]:
#### Import PDF document
# !wget https://github.com/mrdbourke/simple-local-rag/raw/main/human-nutrition-text.pdf

In [35]:
import os
import requests

# Get PDF document path
pdf_path = "human-nutrition-text.pdf"

# Download PDF
if not os.path.exists(pdf_path):
  print("File does not exist, downloading...")

  # Enter the URL of the pdf
  url = "https://pressbooks.oer.hawaii.edu/humannutrition2/open/download?type=pdf"

  # The local filename to save the downloaded file
  filename = pdf_path

  # Send a GET request to the URL
  response = requests.get(url)

  # Check if the request is successful
  if response.status_code == 200:
    # Open the file and save it
    with open(filename, "wb") as file:
      file.write(response.content)
    print(f"[INFO] the file has been downloaded and saved as {filename}")
  else:
    print(f"[INFO] failed to download the file. Status code: {response.status_code}")

else:
  print(f"File {pdf_path} exists.")

File does not exist, downloading...
[INFO] the file has been downloaded and saved as human-nutrition-text.pdf


We have got a PDF, let's open it!

In [None]:
# !pip install PyMuPDF

In [6]:
import fitz  # requires pip install PyMuPDF, see: https://github.com/pymupdf/PyMuPDF
from tqdm.auto import tqdm # pip install tqdm

In [36]:
def text_formatter(text: str) -> str:
  """
  Performs minor formatting on text.
  """
  cleaned_text = text.replace("\n", " ").strip()

  # Potentially more text formatting function can go here
  return cleaned_text

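def open_and_read_pdf(pdf_path: str) -> list[dict]:
  """
  Opens a PDF and returns a list with one dict of text and basic statistics per page.
  """
  doc = fitz.open(pdf_path)
  pages_and_texts = []
  for page_number, page in tqdm(enumerate(doc)):
    text = page.get_text()
    text = text_formatter(text=text)
    # The -41 offset is specific to this PDF (its front matter ends up with negative page numbers)
    pages_and_texts.append({"page_number": page_number - 41,
                            "page_char_count": len(text),
                            "page_word_count": len(text.split(" ")), # rough count, splits on single spaces
                            "page_sentence_count_raw": len(text.split(". ")), # rough count, splits on ". "
                            "page_token_count": len(text) / 4, # 1 token ~ 4 characters
                            "text": text
                          })
  return pages_and_texts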


pages_and_texts = open_and_read_pdf(pdf_path=pdf_path)

0it [00:00, ?it/s]

In [38]:
pages_and_texts[:2]

[{'page_number': -41,
  'page_char_count': 29,
  'page_word_count': 4,
  'page_sentence_count_raw': 1,
  'page_token_count': 7.25,
  'text': 'Human Nutrition: 2020 Edition'},
 {'page_number': -40,
  'page_char_count': 0,
  'page_word_count': 1,
  'page_sentence_count_raw': 1,
  'page_token_count': 0.0,
  'text': ''}]
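
Note that the empty page above still reports a `page_word_count` of 1. This comes from how `str.split(" ")` behaves, as the short check below shows. The `split()` alternative is offered only as a possible adjustment, not as what the notebook uses.

```python
empty_page = ""
double_spaced = "red  meats"  # two spaces, as left behind when "\n" is replaced with " "

# split(" ") keeps empty strings, so an empty page counts as 1 "word"
print(len(empty_page.split(" ")))     # 1
print(len(double_spaced.split(" ")))  # 3

# split() with no argument splits on runs of whitespace and drops empty strings
print(len(empty_page.split()))        # 0
print(len(double_spaced.split()))     # 2
```

The counts in this notebook are only used as rough statistics, so the difference is small, but it explains the 1 in the output.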

In [39]:
import random

random.sample(pages_and_texts, k=3)

[{'page_number': 755,
  'page_char_count': 1300,
  'page_word_count': 198,
  'page_sentence_count_raw': 17,
  'page_token_count': 325.0,
  'text': 'whole-grain foods, fish, poultry, and nuts are emphasized while red  meats, sweets, and sugar-containing beverages are mostly avoided.  Results from a follow-up study published in the December 2009  issue of the Journal of Human Hypertension suggest the low- sodium DASH diet reduces oxidative stress, which may have  contributed to the improved blood vessel function observed in salt- sensitive people (between 10 to 20 percent of the population)6.  Diets high in fruits and vegetables. An analysis of The Nurses’  Health Study and the Health Professionals’ Follow-up Study  reported that for every increased serving of fruits or vegetables  per day, especially green leafy vegetables and vitamin C-rich fruits,  there was a 4 percent lower risk for heart disease7.  6.\xa0Al-Solaiman Y, et al. (2008). Low-Sodium DASH Reduces  Oxidative Stress and Im

In [40]:
import pandas as pd

df = pd.DataFrame(pages_and_texts)
df.head()

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | text |
|---|---|---|---|---|---|---|
| 0 | -41 | 29 | 4 | 1 | 7.25 | Human Nutrition: 2020 Edition |
| 1 | -40 | 0 | 1 | 1 | 0.0 | |
| 2 | -39 | 320 | 54 | 1 | 80.0 | Human Nutrition: 2020 Edition UNIVERSITY OF ... |
| 3 | -38 | 212 | 32 | 1 | 53.0 | Human Nutrition: 2020 Edition by University of... |
| 4 | -37 | 797 | 145 | 2 | 199.25 | Contents Preface University of Hawai‘i at Mā... |


In [42]:
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count |
|---|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 562.5 | 1148.0 | 198.3 | 9.97 | 287.0 |
| std | 348.86 | 560.38 | 95.76 | 6.19 | 140.1 |
| min | -41.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 25% | 260.75 | 762.0 | 134.0 | 4.0 | 190.5 |
| 50% | 562.5 | 1231.5 | 214.5 | 10.0 | 307.88 |
| 75% | 864.25 | 1603.5 | 271.0 | 14.0 | 400.88 |
| max | 1166.0 | 2308.0 | 429.0 | 32.0 | 577.0 |


Why would we care about token count?

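Token count is important to think about because:

1. Embedding models can only take a limited number of tokens per input.
2. LLMs can only take a limited number of tokens per input.

For example, an embedding model may have been trained to embed sequences of 384 tokens into numerical space. Text beyond that length may be cut off, so its information would not make it into the embedding.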
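
To make this concrete, the sketch below applies the same rough estimate used earlier (1 token ~ 4 characters) to check whether a page would fit within a 384-token limit. The 384 figure is just the example from the text, real limits depend on the model and its tokenizer, and `estimate_tokens` and `exceeds_limit` are hypothetical helper names.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> float:
    """Rough token estimate: 1 token ~ 4 characters."""
    return len(text) / chars_per_token

def exceeds_limit(text: str, max_tokens: int = 384) -> bool:
    """Return True if the estimated token count is above max_tokens."""
    return estimate_tokens(text) > max_tokens

# Stand-in pages with the mean and max character counts from the statistics above
average_page = "a" * 1148
longest_page = "a" * 2308

print(estimate_tokens(average_page), exceeds_limit(average_page))  # 287.0 False
print(estimate_tokens(longest_page), exceeds_limit(longest_page))  # 577.0 True
```

By this estimate an average page fits within the example limit but the longest pages do not, which is one reason to split pages into smaller chunks before embedding.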