# Create and run a RAG model pipeline from scratch.

## What does RAG stand for?

RAG stands for Retrieval Augmented Generation.

In short, it retrieves relevant information and passes it to an LLM, which then generates output based on that information.

* Retrieval - Find information relevant to a question. In this example, with a PDF textbook on nutrition, the question might be "What are micronutrients, and what do they do?". The retrieval step fetches text related to micronutrients from the textbook.
* Augmented - Take the relevant information and augment our input (the prompt we give to the LLM) with it.
* Generation - Pass the augmented prompt from the first two steps to an LLM to generate the output.
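
To make the three steps concrete, here is a minimal sketch of the flow. It is only an illustration: the retrieval is plain word overlap rather than embeddings, `generate` is a placeholder instead of a real LLM call, and the function names and sample passages are made up for this example.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, passages: list[str], top_k: int = 1) -> list[str]:
    """Rank passages by how many words they share with the query (a stand-in for vector search)."""
    query_words = tokenize(query)
    ranked = sorted(
        passages,
        key=lambda passage: len(query_words & tokenize(passage)),
        reverse=True,
    )
    return ranked[:top_k]


def augment(query: str, context: list[str]) -> str:
    """Build a prompt that places the retrieved text ahead of the question."""
    context_block = "\n".join(f"- {passage}" for passage in context)
    return f"Answer using only this context:\n{context_block}\n\nQuestion: {query}"


def generate(prompt: str) -> str:
    """Placeholder for an LLM call; it only reports the prompt size."""
    return f"[LLM would answer here, given a {len(prompt)}-character prompt]"


# Made-up sample passages for illustration.
passages = [
    "Micronutrients are vitamins and minerals required in small amounts.",
    "The raw food diet emphasizes uncooked and unprocessed foods.",
]
query = "What are micronutrients?"

retrieved = retrieve(query, passages)   # Retrieval
prompt = augment(query, retrieved)      # Augmented
answer = generate(prompt)               # Generation
print(prompt)
```

The rest of the notebook replaces the word-overlap step with embeddings and the placeholder with an actual LLM.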


## What is the usage of Retrieval Augmented Generation?

RAG improves the generated output of an LLM.

1. Reduces hallucinations - An LLM that is given factual information in its prompt is less likely to make up its own.
2. Allows the LLM to work with custom data - LLMs are trained on internet-scale data, so they have a decent understanding of language in general, but their responses can be fairly generic. RAG helps create specific responses based on input documents (e.g. your own company's customer support docs).




## How can RAG be used?

* Customer support chat - Support an existing LLM with the company's documentation. Retrieve existing documents on how to do certain things and have the LLM use them when answering. Essentially a chatbot for the given documentation.
* Email chain analysis - If your company has a lot of customer email chains, you could retrieve the relevant emails, pass them to an LLM, and have it turn them into more structured data (JSON, for example) for you to parse.
* Company internal documentation chat
* Textbook Q&A

Essentially: take the documents relevant to a query and process them with an LLM.

You could think of the LLM as a calculator for words in this instance.


## Why run locally?

1. No API calls - Potentially faster, since there is no round trip to a remotely hosted LLM.
2. Privacy - You may be using internal documents that you don't want to send to a third party.
3. Cost - No per-call API fees.

## Build goals: 

Build a RAG pipeline that runs locally on my device. It will do the following:

1. Open a document I pass it, whether a PDF, a Markdown file, etc.
2. Format the text for an embedding model.
3. Embed all the necessary chunks of text, turning them into numerical representations that can be stored.
4. Build a retrieval system (vector search?) to find relevant chunks of text based on the query.
5. Generate a prompt that incorporates the retrieved pieces of text.
6. Generate an answer to the query with an LLM, based on the retrieved passages of the document.

These steps fall into two parts:

1. Steps 1-3: Document preprocessing and embedding creation.
2. Steps 4-6: Search and answer.
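
As a rough idea of what steps 3 and 4 involve, here is a toy vector search. Everything in it is an assumption for illustration: `toy_embed` is a hypothetical stand-in that uses word counts, whereas a real embedding model returns dense vectors, and the sample chunks are short paraphrases of topics from the textbook.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query: str, chunks: list[str], top_k: int = 2) -> list[tuple[float, str]]:
    """Return the top_k (score, chunk) pairs most similar to the query."""
    query_vec = toy_embed(query)
    scored = [(cosine_similarity(query_vec, toy_embed(chunk)), chunk) for chunk in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


# Made-up sample chunks for illustration.
chunks = [
    "Water is lost through skin, lungs, urine and feces.",
    "Olive oil can be used for cooking and salad dressing.",
    "The raw food diet emphasizes whole fruits and vegetables.",
]
results = search("How is water lost from the body?", chunks, top_k=1)
print(results)
```

The shape stays the same with a real model: embed the chunks once, embed the query, score each chunk against it, and keep the best matches.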

### 1. Document/Text processing and embedding creation

Necessary:
* A document of choice (a PDF here, but it could be Markdown, .txt, etc.)
* An embedding model of choice.

Steps:
1. Import the document.
2. Process the text for embedding (split into chunks of sentences).
3. Embed the text chunks with the embedding model.
4. Save the embeddings to a file for later use. (They can stay on file for as long as you need them.)
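
For step 4, one simple option is to write the chunks and their vectors to a JSON file and read them back later. This is only a sketch: the records, the three-number "embeddings", and the file name are all made up, and for a large book a different format may be more practical.

```python
import json
import os
import tempfile

# Made-up chunks and vectors; a real embedding model returns much longer vectors.
records = [
    {"page_number": 5, "chunk": "First chunk of text.", "embedding": [0.12, -0.48, 0.33]},
    {"page_number": 6, "chunk": "Second chunk of text.", "embedding": [0.05, 0.91, -0.27]},
]

# Hypothetical file name, written to a temporary directory for this demo.
save_path = os.path.join(tempfile.mkdtemp(), "text_chunks_and_embeddings.json")

# Save once...
with open(save_path, "w", encoding="utf-8") as f:
    json.dump(records, f)

# ...and load again whenever needed, without re-running the embedding model.
with open(save_path, "r", encoding="utf-8") as f:
    loaded = json.load(f)

print(f"Loaded {len(loaded)} records from {save_path}")
```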

### Import PDF document


In [4]:
import os
import requests

# Grab PDF path
pdf_path = "human-nutrition-text.pdf"

# Download PDF
if not os.path.exists(pdf_path):
    print("[INFO] File doesn't exist, downloading...")

    # Enter URL of the PDF
    url = "https://pressbooks.oer.hawaii.edu/humannutrition2/open/download?type=pdf"

    # Local filename to save the file we just downloaded.

    filename = pdf_path

    # Send a GET request to the URL. 
    response = requests.get(url)

    # See if request was successful 
    if response.status_code == 200:
        # Open file and save it.
        with open(filename, "wb") as file:
            file.write(response.content)
        print(f"[INFO] The file has been downloaded and saved as {filename}")
    else:
        print(f"[INFO] Failed to download the file. Status code: {response.status_code}")

else:
    print(f"File {pdf_path} already exists.")

    

[INFO] File doesn't exist, downloading...
[INFO] The file has been downloaded and saved as human-nutrition-text.pdf


We now have the PDF! So we can open it.

In [33]:
import fitz
from tqdm.auto import tqdm

def text_formatter(text: str) -> str:
    """Performs minor formatting on text."""
    cleaned_text = text.replace("\n", " ").strip()

    # Potentially more text formatting functions if you need them can go here.

    return cleaned_text

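def open_and_read_pdf(pdf_path: str) -> list[dict]:
    """Opens a PDF and returns one dict of text and rough statistics per page."""
    doc = fitz.open(pdf_path)
    pages_and_texts = []
    for page_number, page in tqdm(enumerate(doc)):
        text = page.get_text()
        text = text_formatter(text=text)
        # Shift the zero-based index by 41 so it lines up with the page numbers printed in the book.
        pages_and_texts.append({"page_number": page_number - 41,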
                               "page_char_count": len(text), 
                               "page_word_count": len(text.split(" ")),
                               "page_sentence_count_raw": len(text.split(". ")),
                               "page_token_count": len(text) / 4, # 1 token = ~4 characters.
                               "text": text})
    return pages_and_texts

pages_and_texts = open_and_read_pdf(pdf_path=pdf_path)
pages_and_texts[:2]
                               

0it [00:00, ?it/s]

[{'page_number': -41,
  'page_char_count': 29,
  'page_word_count': 4,
  'page_sentence_count_raw': 1,
  'page_token_count': 7.25,
  'text': 'Human Nutrition: 2020 Edition'},
 {'page_number': -40,
  'page_char_count': 0,
  'page_word_count': 1,
  'page_sentence_count_raw': 1,
  'page_token_count': 0.0,
  'text': ''}]
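
The empty page above has a word count and sentence count of 1, not 0. The sketch below reproduces the same per-page statistics on plain strings to show why; `page_stats` is a helper name made up for this example.

```python
def page_stats(text: str) -> dict:
    """Compute the same rough statistics the notebook stores for each page."""
    return {
        "page_char_count": len(text),
        "page_word_count": len(text.split(" ")),
        "page_sentence_count_raw": len(text.split(". ")),
        "page_token_count": len(text) / 4,  # rule of thumb: 1 token is roughly 4 characters
    }


title_stats = page_stats("Human Nutrition: 2020 Edition")
empty_stats = page_stats("")

# "".split(" ") returns [""], so an empty page still counts as one word and one sentence.
# Calling split() with no argument returns [] for an empty string instead.
empty_word_count_fixed = len("".split())

print(title_stats)
print(empty_stats)
print(empty_word_count_fixed)
```

These counts are only rough estimates, so the off-by-one on empty pages matters mainly when reading the `min` row of the summary statistics.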

In [34]:
import random

random.sample(pages_and_texts, k=3)

[{'page_number': 1056,
  'page_char_count': 1545,
  'page_word_count': 269,
  'page_sentence_count_raw': 16,
  'page_token_count': 386.25,
  'text': 'butter on your toast, making your own salad dressing using  olive oil, vinegar or lemon juice, and herbs, cooking with  olive oil exclusively, or simply adding a dose of it to your  favorite meal.11  The Raw Food Diet  The raw food diet is followed by those who avoid cooking as much  as possible in order to take advantage of the full nutrient content  of foods. The principle behind raw foodism is that plant foods in  their natural state are the most wholesome for the body. The raw  food diet is not a weight-loss plan, it is a lifestyle choice. People who  practice raw foodism eat only uncooked and unprocessed foods,  emphasizing whole fruits and vegetables. Staples of the raw food  diet include whole grains, beans, dried fruits, seeds and nuts,  seaweed, sprouts, and unprocessed produce. As a result, food  preparation mostly involves peel

In [7]:
import pandas as pd

df = pd.DataFrame(pages_and_texts)
df.head()

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | text |
|---|---|---|---|---|---|---|
| 0 | -41 | 29 | 4 | 1 | 7.25 | Human Nutrition: 2020 Edition |


In [8]:
df.describe().round(2)

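| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count |
|---|---|---|---|---|---|
| count | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| mean | -41.0 | 29.0 | 4.0 | 1.0 | 7.25 |
| std | NaN | NaN | NaN | NaN | NaN |
| min | -41.0 | 29.0 | 4.0 | 1.0 | 7.25 |
| 25% | -41.0 | 29.0 | 4.0 | 1.0 | 7.25 |
| 50% | -41.0 | 29.0 | 4.0 | 1.0 | 7.25 |
| 75% | -41.0 | 29.0 | 4.0 | 1.0 | 7.25 |
| max | -41.0 | 29.0 | 4.0 | 1.0 | 7.25 |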


### Further Text processing (Split pages into certain amount of sentences)

Two ways to do this:
1. Split on `". "`.
2. Use an NLP library such as spaCy or NLTK.
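
A quick look at why option 1 is fragile. The sample sentences are made up for illustration: splitting on `". "` breaks on abbreviations, drops the periods, and ignores other sentence endings such as `?`.

```python
# Two sentences, but the abbreviation "Dr." adds an extra split point.
text = "Dr. Smith recommends 2.5 cups of vegetables per day. Most adults eat less."
naive_sentences = text.split(". ")
print(naive_sentences)

# Three sentences, but the question mark is not a split point.
question_text = "Is fat bad? Not always. It depends."
question_pieces = question_text.split(". ")
print(question_pieces)
```

This is the main reason to prefer a library for sentence splitting when the text is messier than a toy example.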

In [11]:
from spacy.lang.en import English

# Create an instance of English.

nlp = English()

# Add a sentencizer pipeline
nlp.add_pipe("sentencizer")

# Create a document instance. 
doc = nlp("This is a sentence. This is a second sentence. Even a third sentence.")
assert len(list(doc.sents)) == 3

# Print the sentences.
list(doc.sents)

[This is a sentence., This is a second sentence., Even a third sentence.]

In [24]:
pages_and_texts[0]

{'page_number': -41,
 'page_char_count': 29,
 'page_word_count': 4,
 'page_sentence_count_raw': 1,
 'page_token_count': 7.25,
 'text': 'Human Nutrition: 2020 Edition',
 'sentences': ['Human Nutrition: 2020 Edition'],
 'page_sentence_count_spacy': 1}

In [35]:
for item in tqdm(pages_and_texts):
    item["sentences"] = list(nlp(item["text"]).sents)

    # Make sure all of our sentences are strings
    # (the default is a spaCy Span).
    item["sentences"] = [str(sentence) for sentence in item["sentences"]]

    # Count the sentences.
    item["page_sentence_count_spacy"] = len(item["sentences"])

  0%|          | 0/1208 [00:00<?, ?it/s]

In [36]:
random.sample(pages_and_texts, k=1)

[{'page_number': 163,
  'page_char_count': 664,
  'page_word_count': 140,
  'page_sentence_count_raw': 3,
  'page_token_count': 166.0,
  'text': 'CO2 + H20 +  ATP); sources  of water loss:  Skin and  lungs  (insensible  water loss  0.9 L/day),  Urine 1.5 L/ day, Feces 0.1  L/day.  TOTAL  intake 2.2 L/ day +  Metabolic  Production  0.3 L/day –  Output  (0.9+1.5=0.1)  L/day = 0 ”  class=”wp-i mage-141  size-full”  width=”629″  height=”777″ > Daily Fluid  Loss and  Gain    Dietary Recommendations  The Food and Nutrition Board of the Institute of Medicine (IOM) has  set the Adequate Intake (AI) for water for adult males at 3.7 liters  (15.6 cups) and at 2.7 liters (11 cups) for adult females.1 These intakes  1. Institute of Medicine Panel on Dietary Reference Intakes  Regulation of Water Balance  |  163',
  'sentences': ['CO2 + H20 +  ATP); sources  of water loss:  Skin and  lungs  (insensible  water loss  0.9 L/day),  Urine 1.5 L/ day, Feces 0.1  L/day.',
   ' TOTAL  intake 2.2 L/ day +  

In [37]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | page_sentence_count_spacy |
|---|---|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 562.5 | 1148.0 | 199.5 | 10.52 | 287.0 | 10.32 |
| std | 348.86 | 560.38 | 95.83 | 6.55 | 140.1 | 6.3 |
| min | -41.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 25% | 260.75 | 762.0 | 134.0 | 5.0 | 190.5 | 5.0 |
| 50% | 562.5 | 1231.5 | 216.0 | 10.0 | 307.88 | 10.0 |
| 75% | 864.25 | 1603.5 | 272.0 | 15.0 | 400.88 | 15.0 |
| max | 1166.0 | 2308.0 | 430.0 | 39.0 | 577.0 | 28.0 |


### Time to chunk our sentences together.

We want to split large pieces of text into smaller ones. This is referred to as chunking or text splitting.

There is no single 'correct' way to do this.

To keep it simple, we split into groups of 10 sentences. Other group sizes, such as 5 or 7, are worth trying too.
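
One possible way to do the grouping is with list slicing. This is a sketch, not necessarily the approach used later in the notebook: `split_list` is a helper name made up here, and the sentences are placeholders.

```python
def split_list(items: list[str], chunk_size: int = 10) -> list[list[str]]:
    """Split a list into consecutive sublists of at most chunk_size items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


# 25 placeholder sentences standing in for one page's "sentences" list.
sentences = [f"Sentence {n}." for n in range(1, 26)]

sentence_chunks = split_list(sentences, chunk_size=10)
chunk_sizes = [len(chunk) for chunk in sentence_chunks]
print(chunk_sizes)  # [10, 10, 5]
```

The last chunk holds whatever is left over, so chunks can be much shorter than the target size.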

There are frameworks for 