# Retrieval Augmented Generation (RAG)

**Note:** These notes and code are based on the code provided in [this GitHub repo](https://github.com/dinocodesx/simple-local-rag/tree/master). I highly suggest you take a look at it and the author's other work.

## What is RAG?
* RAG is a method to improve the outputs of large language models
    * **Reduces hallucinations:** RAG supplies factual information as input to an LLM, which reduces (though does not eliminate) hallucinations.
    * **Allows LLMs to work with custom data:** RAG provides a quick way to have an LLM produce output about data outside its training scope. This is much faster than the alternative, which is to continue training the LLM on the custom data.
* Overview: RAG combines the generative abilities of LLMs with relevant text from trusted sources. The retrieved text can be used to check that the information produced by the LLM is correct, or to generate outputs from ideas or data the LLM was not trained on.

## Breakdown
1. **Retrieval**: An embedding-based algorithm searches a provided text for passages related to the query being passed to the LLM. In this example the provided text is a PDF copy of a textbook, but in practice it can be any text, from any source, that is taken to be factual.
2. **Augmentation**: Create a custom prompt for an LLM containing both the user query and the related information retrieved from the provided text which is taken to be factual.
3. **Generation:** Use the LLM to generate an answer to the user query.
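The three stages above can be sketched as three small functions. This is only a toy illustration: the function names, the word-overlap scoring, and the placeholder `generate` are assumptions made for this example, not the method used later in the notebook (which uses embeddings and a real LLM).

```python
def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    """Toy retrieval: rank passages by how many lowercase words they share with the query."""
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(p.lower().split())), p) for p in passages]
    # Python's sort is stable, so passages with equal scores keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:k]]


def augment(query: str, context: list[str]) -> str:
    """Build a prompt that contains both the retrieved context and the user query."""
    context_block = "\n".join(f"- {c}" for c in context)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context_block}\n"
        f"Question: {query}\nAnswer:"
    )


def generate(prompt: str) -> str:
    """Placeholder for an LLM call (hypothetical stand-in, no model is loaded here)."""
    return f"[LLM response to a {len(prompt)}-character prompt]"


passages = [
    "Iron is a mineral needed to make hemoglobin.",
    "Vitamin C improves the absorption of iron from plants.",
    "Water makes up most of the human body.",
]
query = "What does iron do?"

context = retrieve(query, passages)
prompt = augment(query, context)
print(prompt)
print(generate(prompt))
```

The passage about water shares no words with the query, so it is left out of the prompt.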

## How it Works
1. **Retrieve the known text**: Import the text into a format readable by Python. This can be from importing a PDF, scraping webpages, or any other method to import text into Python.
2. **Chunk the text:** The next step is to divide the text into small(ish) pieces called chunks. There is no correct way to do this but you want chunks which are large enough that they will contain enough relevant information to augment the LLM's knowledge, but not so large that they contain unrelated information (or are so long they slow down the LLM's processing of the prompt).
3. **Embedding:** Once the text is chunked, embed the chunks using an embedding method. This can be Word2Vec which we have discussed in class or a newer embedding method such as the one used in this notebook. This means that all of the chunks of text are now vectors, whose directions can tell us how related their contents are.
4. **Retrieval:** Embed the user query in the same embedding space as the text. Determine which chunks are most related to the query using a similarity metric (that is, which chunk vectors point in a similar direction to the query vector). Save the top k most related chunks.
5. **Prompt Engineering:** Pass the user query and the chunks of related text to an LLM. Use a custom prompt to tell the LLM to answer the query using the related text as known truth.
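To make the retrieval step concrete, here is a minimal sketch of ranking chunks by how closely their vectors align with a query vector. The three-dimensional vectors are made up for illustration (real embeddings have hundreds of dimensions), and `cosine_similarity` and `top_k` are helper names assumed for this example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int) -> list[tuple[int, float]]:
    """Return (chunk index, score) pairs for the k chunks most similar to the query."""
    scores = [(cosine_similarity(query_vec, vec), index) for index, vec in enumerate(chunk_vecs)]
    scores.sort(reverse=True)  # highest similarity first
    return [(index, score) for score, index in scores[:k]]


# Made-up "embeddings" for three chunks and one query.
chunk_vectors = [
    [0.9, 0.1, 0.0],  # chunk 0: points mostly along the first axis
    [0.1, 0.9, 0.1],  # chunk 1: points mostly along the second axis
    [0.8, 0.3, 0.1],  # chunk 2: similar direction to chunk 0
]
query_vector = [1.0, 0.2, 0.0]

results = top_k(query_vector, chunk_vectors, k=2)
for index, score in results:
    print(f"chunk {index}: similarity {score:.3f}")
```

Chunks 0 and 2 point in nearly the same direction as the query, so they are the two that would be passed on to the LLM.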

**NOTE:** It is possible to run the notebook below on a normal laptop, but it runs much faster (and you have access to larger LLMs) with an Nvidia GPU. If you do not have a GPU on your laptop or desktop, you could transfer this notebook to Google Colab (note the education account is free) or Kaggle.
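If you are not sure whether your machine has a usable Nvidia GPU, a quick check like the one below can help. This is a small sketch, assuming PyTorch may or may not be installed yet; `pick_device` is a helper name made up for this example.

```python
def pick_device() -> str:
    """Return "cuda" if PyTorch can see an Nvidia GPU, otherwise "cpu"."""
    try:
        import torch  # imported here so the check still runs if torch is not installed
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print("This notebook will run on:", pick_device())
```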

## Getting Started with Hugging Face

Hugging Face hosts a collection of pre-trained LLMs of various sizes. We will be using one of these models later in the code. First, you need to make an account with [Hugging Face](https://huggingface.co/). In order to use the LLMs you must have an account and a token from that account associated with your computer. After you create an account, run the code cell below.

In [1]:
# Install the huggingface_hub package to interact with Hugging Face from Python.
# Replace pip3 with pip as needed.
!pip3 install huggingface_hub

# Import the login function and call it to authenticate your Hugging Face account.
from huggingface_hub import login

# Call the login function to authenticate. This will prompt for your Hugging 
# Face token in the console below. There is a link to the webpage where you can
# retrieve your token. In theory you should be able to paste it directly, but if 
# that doesn't work, try typing it out manually.
login() 



Defaulting to user installation because normal site-packages is not writeable



## RAG Code

In [2]:
###################
## INSTALLATIONS ##
###################
# Install required packages. These are likely new installations for most of you. If you
# have issues with any of the libraries, please let me know. You may also want to make
# a virtual environment for this installation process. If you do so, add numpy and
# torch to the install list.

# Change from pip3 to pip as needed. NOTE: Your Python version must be 3.10 or higher for
# the bitsandbytes library to work properly.
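# PyMuPDF is installed under the name pymupdf. Do not also install the unrelated PyPI
# package "fitz", which can conflict with it.
! pip3 install pymupdf spacy sentence_transformers tf_keras bitsandbytes accelerate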

Defaulting to user installation because normal site-packages is not writeable


In [3]:
#############
## IMPORTS ##
#############
# Alphabetically ordered imports for all libraries used in this notebook.
import numpy as np
import os
import pandas as pd
# Imports PDFs into Python
import pymupdf 
import random
# Regular expressions (used for text cleaning)
import re
# Used to pull the pdf from the web
import requests
# Used to create sentence embeddings
from sentence_transformers import SentenceTransformer, util
# Used to split the text into chunks/sentences
from spacy.lang.en import English 
# For displaying long text outputs
import textwrap
# For timing code execution
from time import perf_counter as timer
# PyTorch (tensor library used to run the embedding model and the LLM)
import torch
# Progress bar for loops
from tqdm.auto import tqdm
# Hugging Face transformers 
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig


### Retrieval

First, we need to be able to retrieve relevant text from a document in order to augment the results from the LLM. Below we will use the concepts of embedding to determine which portions of text are the most relevant to our question on the text.

In [4]:
###############################
## DOWNLOAD THE PDF DOCUMENT ##
###############################

# We will be using the human nutrition textbook provided in the original GitHub
# as our document for this example. If you have your own PDF document you would 
# like to use, feel free to change this. You will get better results if the document
# is longer and more complex. You can also use multiple documents if you like.

# If you want to change the document you will also need to change the below code 
# depending on if your document is already downloaded or not.

# Get PDF document
pdf_path = "human-nutrition-text.pdf"

# Download PDF if it doesn't already exist
if not os.path.exists(pdf_path):
  print("File doesn't exist, downloading...")

  # The URL of the PDF you want to download
  url = "https://pressbooks.oer.hawaii.edu/humannutrition2/open/download?type=pdf"

  # The local filename to save the downloaded file
  filename = pdf_path

  # Send a GET request to the URL
  response = requests.get(url)

  # Check if the request was successful
  if response.status_code == 200:
      # Open a file in binary write mode and save the content to it
      with open(filename, "wb") as file:
          file.write(response.content)
      print(f"The file has been downloaded and saved as {filename}")
  else:
      print(f"Failed to download the file. Status code: {response.status_code}")
else:
  print(f"File {pdf_path} exists.")

File human-nutrition-text.pdf exists.


In [5]:
################################
## IMPORT AND FORMAT PDF TEXT ##
################################

def text_formatter(text: str) -> str:
    """Performs minor formatting on text."""
    # note: this might be different for each doc (best to experiment)
    cleaned_text = text.replace("\n", " ").strip() 

    # Other potential text formatting functions can go here. If you change the document
    # being used, you may need to modify this function.
    return cleaned_text

# Open PDF and get lines/pages
# Note: this only focuses on text, rather than images/figures etc
def open_and_read_pdf(pdf_path: str) -> list[dict]:
    """
    Opens a PDF file, reads its text content page by page, and collects statistics.

    Parameters:
        pdf_path (str): The file path to the PDF document to be opened and read.

    Returns:
        list[dict]: A list of dictionaries, each containing the page number
        (adjusted), character count, word count, sentence count, token count, and the extracted text
        for each page.
    """
    doc = pymupdf.open(pdf_path)  # open a document
    pages_and_texts = []
    for page_number, page in tqdm(enumerate(doc)):  # iterate the document pages
        text = page.get_text()  # get plain text encoded as UTF-8
        text = text_formatter(text)
        pages_and_texts.append({"page_number": page_number - 41,  # adjust page numbers since our PDF starts on page 42
                                "page_char_count": len(text),
                                "page_word_count": len(text.split(" ")),
                                "page_sentence_count_raw": len(text.split(". ")),
                                "page_token_count": len(text) / 4,  # 1 token = ~4 chars, see: https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them
                                "text": text})
    return pages_and_texts

# Read the PDF and get the pages and texts, show the first two entries
pages_and_texts = open_and_read_pdf(pdf_path=pdf_path)
pages_and_texts[:2]


[{'page_number': -41,
  'page_char_count': 29,
  'page_word_count': 4,
  'page_sentence_count_raw': 1,
  'page_token_count': 7.25,
  'text': 'Human Nutrition: 2020 Edition'},
 {'page_number': -40,
  'page_char_count': 0,
  'page_word_count': 1,
  'page_sentence_count_raw': 1,
  'page_token_count': 0.0,
  'text': ''}]

In [6]:
##########################
## DISPLAY RANDOM PAGES ##
##########################

# Change the k value to display more or fewer random pages
random.sample(pages_and_texts, k=3)

[{'page_number': 566,
  'page_char_count': 114,
  'page_word_count': 24,
  'page_sentence_count_raw': 1,
  'page_token_count': 28.5,
  'text': 'Image by  Allison  Calabrese /  CC BY 4.0  Figure 9.13 Niacin Deficiency, Pellagra  566  |  Water-Soluble Vitamins'},
 {'page_number': 183,
  'page_char_count': 165,
  'page_word_count': 37,
  'page_sentence_count_raw': 2,
  'page_token_count': 41.25,
  'text': 'Sodium  levels in  milligrams is  a required  listing on a  Nutrition  Facts label.  Sodium on the Nutrition Facts Panel  Figure 3.10 Nutrition Label  Sodium  |  183'},
 {'page_number': 215,
  'page_char_count': 195,
  'page_word_count': 33,
  'page_sentence_count_raw': 2,
  'page_token_count': 48.75,
  'text': 'An interactive or media element has been  excluded from this version of the text. You can  view it online here:  http://pressbooks.oer.hawaii.edu/ humannutrition2/?p=162  \xa0 Water Concerns  |  215'}]

In [7]:
######################################
## STATISTICAL SUMMARY OF PDF PAGES ##
#######################################

# Convert the list of dictionaries to a DataFrame and display summary statistics
# This helps us understand the distribution of text lengths across the pages.

# This is important because we will be embedding the text later. Embedding models
# have maximum input lengths, so we need to understand how long the text chunks are.
# The embedding model we will be using later has a maximum input length of 384 tokens.
# The text is currently chunked by page. The average page is about 287 tokens, but the
# 75th percentile is about 400 tokens, which is above the input limit. This means we will
# need to chunk the text further. Note that 1 token is approximately 4 characters, or
# 0.75 words on average.

pd.DataFrame(pages_and_texts).describe()

| statistic | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count |
| --- | --- | --- | --- | --- | --- |
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 562.5 | 1148.004139 | 198.299669 | 9.972682 | 287.001035 |
| std | 348.86387 | 560.382275 | 95.759336 | 6.187226 | 140.095569 |
| min | -41.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 25% | 260.75 | 762.0 | 134.0 | 4.0 | 190.5 |
| 50% | 562.5 | 1231.5 | 214.5 | 10.0 | 307.875 |
| 75% | 864.25 | 1603.5 | 271.0 | 14.0 | 400.875 |
| max | 1166.0 | 2308.0 | 429.0 | 32.0 | 577.0 |
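The 4-characters-per-token rule of thumb used above can be turned into a quick check of whether a piece of text is likely to fit the embedding model's input limit. This is a rough sketch: the helper names are assumptions for this example, and the estimate is only an approximation of what a real tokenizer would produce.

```python
MAX_EMBEDDING_TOKENS = 384  # input limit quoted in this notebook for the embedding model
CHARS_PER_TOKEN = 4         # rule of thumb: 1 token is roughly 4 characters


def estimate_tokens(text: str) -> float:
    """Estimate the token count of a string from its character count."""
    return len(text) / CHARS_PER_TOKEN


def exceeds_limit(text: str, limit: int = MAX_EMBEDDING_TOKENS) -> bool:
    """True if the estimated token count is above the embedding model's limit."""
    return estimate_tokens(text) > limit


# Stand-in pages with roughly the mean and 75th-percentile lengths from the table above.
average_page = "a" * 1148
long_page = "a" * 1604

print(estimate_tokens(average_page), exceeds_limit(average_page))
print(estimate_tokens(long_page), exceeds_limit(long_page))
```

An average page fits under the limit, but a page at the 75th percentile does not, which is why the pages are split into smaller chunks in the following cells.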


In [8]:
###################################
## SPLIT THE TEXT INTO SENTENCES ##
###################################
# We will use the spaCy library to split the text into sentences. Later we will group
# the sentences into chunks that fit within the embedding model's input limits.
nlp = English()

# Add a sentencizer pipeline, see https://spacy.io/api/sentencizer/
nlp.add_pipe("sentencizer")

# EXAMPLE
# Create a document instance as an example
doc = nlp("This is a sentence. This another sentence.")
# Access the sentences of the document
print(list(doc.sents))

# For every page, split the text into sentences and store them back in the 
# pages_and_texts list
for item in tqdm(pages_and_texts):
    item["sentences"] = list(nlp(item["text"]).sents)

    # Make sure all sentences are strings
    item["sentences"] = [str(sentence) for sentence in item["sentences"]]

    # Count the sentences
    item["page_sentence_count_spacy"] = len(item["sentences"])

# Display a random page to see the sentences.
random.sample(pages_and_texts, k=1)    

[This is a sentence., This another sentence.]



[{'page_number': 819,
  'page_char_count': 746,
  'page_word_count': 134,
  'page_sentence_count_raw': 8,
  'page_token_count': 186.5,
  'text': 'breastfed infants, 6 months of age is a good time to introduce  sources of highly bioavailable iron and zinc such as baby meats.  Iron-fortified cereals and beans can boost the iron intake as well.  Fluids  Infants have a high need for fluids, 1.5 milliliters per kilocalorie  consumed compared to 1.0 milliliters per kilocalorie consumed for  adults. This is because children have larger body surface area per  unit of body weight and a higher metabolic rate. Therefore, they are  at greater risk of dehydration. However, parents or other caregivers  can meet an infant’s fluid needs with breast milk or formula. As  solids are introduced, parents must make sure that young children  continue to drink fluids throughout the day.  Infancy  |  819',
  'sentences': ['breastfed infants, 6 months of age is a good time to introduce  sources of highly bioava
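The raw sentence count (splitting on `". "`) and the spaCy count can disagree because a plain string split only recognizes one kind of sentence boundary. The sketch below shows the effect with a simple regular expression standing in for a sentence splitter. The regex is an assumption for illustration only and is not how spaCy's sentencizer is implemented.

```python
import re

text = (
    "Infants need 1.5 milliliters per kilocalorie. Adults need 1.0. "
    "Why? Children have a higher metabolic rate!"
)

# Crude approach: split only where a period is followed by a space.
raw_split = text.split(". ")

# Slightly better: split on whitespace that follows ".", "!" or "?".
regex_split = re.split(r"(?<=[.!?])\s+", text)

print(len(raw_split), raw_split)
print(len(regex_split), regex_split)
```

The raw split misses the boundary after "Why?", so it reports one sentence fewer. Neither approach splits inside "1.5" because that period is not followed by whitespace.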

In [9]:
# Inspect an example


In [10]:
# Check updated statistics after sentence splitting. There is a new column
# for the number of sentences based on spaCy's sentencizer.
pd.DataFrame(pages_and_texts).describe()

| statistic | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | page_sentence_count_spacy |
| --- | --- | --- | --- | --- | --- | --- |
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 562.5 | 1148.004139 | 198.299669 | 9.972682 | 287.001035 | 10.319536 |
| std | 348.86387 | 560.382275 | 95.759336 | 6.187226 | 140.095569 | 6.300843 |
| min | -41.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 25% | 260.75 | 762.0 | 134.0 | 4.0 | 190.5 | 5.0 |
| 50% | 562.5 | 1231.5 | 214.5 | 10.0 | 307.875 | 10.0 |
| 75% | 864.25 | 1603.5 | 271.0 | 14.0 | 400.875 | 15.0 |
| max | 1166.0 | 2308.0 | 429.0 | 32.0 | 577.0 | 28.0 |


In [11]:
##############
## CHUNKING ##
##############
# Define split size to turn groups of sentences into chunks. This number may need
# to be adjusted based on the document and the embedding model being used. Try different
# values and see how it affects the chunk sizes and the results.
num_sentence_chunk_size = 10

# Create a function that splits a list into sublists of the desired size
def split_list(input_list: list,
               slice_size: int) -> list[list[str]]:
    """
    Splits the input_list into sublists of size slice_size (the last sublist may be shorter).

    For example, a list of 17 sentences would be split into two sublists of 10 and 7 sentences.
    """
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

# Loop through pages and texts and split sentences into chunks
for item in tqdm(pages_and_texts):
    item["sentence_chunks"] = split_list(input_list=item["sentences"],
                                         slice_size=num_sentence_chunk_size)
    item["num_chunks"] = len(item["sentence_chunks"])


In [12]:
# Generate new statistics after chunking, now with chunk counts
pd.DataFrame(pages_and_texts).describe()

| statistic | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | page_sentence_count_spacy | num_chunks |
| --- | --- | --- | --- | --- | --- | --- | --- |
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 562.5 | 1148.004139 | 198.299669 | 9.972682 | 287.001035 | 10.319536 | 1.525662 |
| std | 348.86387 | 560.382275 | 95.759336 | 6.187226 | 140.095569 | 6.300843 | 0.644397 |
| min | -41.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 25% | 260.75 | 762.0 | 134.0 | 4.0 | 190.5 | 5.0 | 1.0 |
| 50% | 562.5 | 1231.5 | 214.5 | 10.0 | 307.875 | 10.0 | 1.0 |
| 75% | 864.25 | 1603.5 | 271.0 | 14.0 | 400.875 | 15.0 | 2.0 |
| max | 1166.0 | 2308.0 | 429.0 | 32.0 | 577.0 | 28.0 | 3.0 |
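To see what the chunking step does on a small input, the sketch below reruns the same slicing logic on 17 placeholder sentences. It also shows one possible variation, chunks that overlap by a few sentences so that context is not cut off at chunk boundaries. The overlapping version (`split_list_with_overlap`) is not part of this notebook; it is an assumption added here for comparison.

```python
def split_list(input_list: list, slice_size: int) -> list[list]:
    """Same slicing logic as the notebook: consecutive, non-overlapping sublists."""
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]


def split_list_with_overlap(input_list: list, slice_size: int, overlap: int) -> list[list]:
    """Hypothetical variant: each sublist repeats the last `overlap` items of the previous one."""
    if not 0 <= overlap < slice_size:
        raise ValueError("overlap must be at least 0 and smaller than slice_size")
    if not input_list:
        return []
    step = slice_size - overlap
    # Stop once the remaining items are already covered by the previous sublist.
    last_start = max(len(input_list) - overlap, 1)
    return [input_list[i:i + slice_size] for i in range(0, last_start, step)]


sentences = [f"s{i}" for i in range(17)]

plain_chunks = split_list(sentences, 10)
overlap_chunks = split_list_with_overlap(sentences, 10, overlap=2)

print([len(chunk) for chunk in plain_chunks])    # [10, 7]
print([len(chunk) for chunk in overlap_chunks])  # [10, 9]
print(overlap_chunks[1][:2])                     # first two items repeat the end of chunk 0
```

Overlap increases the total amount of text that gets embedded, so it is a trade-off rather than a strict improvement.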


In [13]:
##########################
## SPLITTING THE CHUNKS ##
##########################
# Split each chunk into its own item. The data is currently separated by pages, but that is not
# needed now that we have the chunks.
pages_and_chunks = []
for item in tqdm(pages_and_texts):
    for sentence_chunk in item["sentence_chunks"]:
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]

        # Join the sentences together into a paragraph-like structure, aka a chunk (so they are a single string)
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk) # ".A" -> ". A" for any full-stop/capital letter combo
        chunk_dict["sentence_chunk"] = joined_sentence_chunk

        # Get stats about the chunk
        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len([word for word in joined_sentence_chunk.split(" ")])
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4 # 1 token = ~4 characters

        pages_and_chunks.append(chunk_dict)

# How many chunks do we have?
len(pages_and_chunks)


1843

In [14]:
# Generate a final set of statistics about the chunks
pd.DataFrame(pages_and_chunks).describe()

| statistic | page_number | chunk_char_count | chunk_word_count | chunk_token_count |
| --- | --- | --- | --- | --- |
| count | 1843.0 | 1843.0 | 1843.0 | 1843.0 |
| mean | 583.381443 | 734.442756 | 112.333152 | 183.610689 |
| std | 347.78867 | 447.541546 | 71.220313 | 111.885387 |
| min | -41.0 | 12.0 | 3.0 | 3.0 |
| 25% | 280.5 | 315.0 | 44.0 | 78.75 |
| 50% | 586.0 | 746.0 | 114.0 | 186.5 |
| 75% | 890.0 | 1118.5 | 173.0 | 279.625 |
| max | 1166.0 | 1831.0 | 297.0 | 457.75 |
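The smallest chunk in the statistics above is only 12 characters (about 3 tokens). Chunks that short are often page headers, footers, or captions and carry little information. One optional cleanup step, not performed in this notebook, is to drop chunks below a minimum estimated token count. The threshold of 30 and the example chunks below are assumptions for illustration, so experiment with values that suit your document.

```python
MIN_TOKEN_COUNT = 30  # hypothetical threshold, tune for your own document


def filter_short_chunks(chunks: list[dict], min_tokens: float = MIN_TOKEN_COUNT) -> list[dict]:
    """Keep only the chunks whose estimated token count is above min_tokens."""
    return [chunk for chunk in chunks if chunk["chunk_token_count"] > min_tokens]


# Made-up chunks in the same dictionary format as pages_and_chunks.
example_chunks = [
    {"page_number": 661, "sentence_chunk": "Iron | 661", "chunk_token_count": 2.5},
    {"page_number": 662, "sentence_chunk": "A long passage about iron absorption ...", "chunk_token_count": 186.5},
    {"page_number": 663, "sentence_chunk": "Figure 10.2", "chunk_token_count": 2.75},
]

kept_chunks = filter_short_chunks(example_chunks)
print(len(example_chunks), "chunks before filtering,", len(kept_chunks), "after")
```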


In [15]:
###############
## EMBEDDING ##
###############
# Create the embedding model using Sentence Transformers. This is similar to the 
# embedding layers used in deep learning models. The embedding model here is 
# all-mpnet-base-v2, which is a general-purpose embedding model that works well
# for many tasks. Word2Vec is another popular embedding model, but it is
# somewhat outdated compared to more recent models like those in Sentence Transformers.

# all-mpnet-base-v2 has a maximum input length of 384 tokens and produces embedded vectors
# which have a length of 768, no matter the length of the input text (as long as it is within the limit).

# If you have a GPU available, you can change device="cpu" to device="cuda" for faster processing.
embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2",
                                      device="cpu") 

## EXAMPLES
# Create a list of sentences to turn into numbers
sentences = [
    "The Sentences Transformers library provides an easy and open-source way to create embeddings.",
    "Sentences can be embedded one by one or as a list of strings.",
    "Embeddings are one of the most powerful concepts in machine learning!",
    "Learn to use embeddings well and you'll be well on your way to being an AI engineer."
]

# Sentences are encoded/embedded by calling model.encode()
embeddings = embedding_model.encode(sentences)
embeddings_dict = dict(zip(sentences, embeddings))

# See the embeddings
for sentence, embedding in embeddings_dict.items():
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

Sentence: The Sentences Transformers library provides an easy and open-source way to create embeddings.
Embedding: [-2.07983255e-02  3.03164460e-02 -2.01218147e-02  6.86484799e-02
 -2.55256239e-02 -8.47688317e-03 -2.07214689e-04 -6.32378086e-02
  2.81606764e-02 -3.33354436e-02  3.02633625e-02  5.30721843e-02
 -5.03527448e-02  2.62288898e-02  3.33314165e-02 -4.51577418e-02
  3.63045111e-02 -1.37122045e-03 -1.20170712e-02  1.14947269e-02
  5.04510924e-02  4.70856726e-02  2.11913958e-02  5.14606722e-02
 -2.03747228e-02 -3.58889922e-02 -6.67759916e-04 -2.94394121e-02
  4.95859236e-02 -1.05640236e-02 -1.52013823e-02 -1.31756917e-03
  4.48197611e-02  1.56023633e-02  8.60379600e-07 -1.21388992e-03
 -2.37977933e-02 -9.09364375e-04  7.34490482e-03 -2.53936765e-03
  5.23370318e-02 -4.68043648e-02  1.66215319e-02  4.71579246e-02
 -4.15599458e-02  9.01898893e-04  3.60277742e-02  3.42214964e-02
  9.68227163e-02  5.94829470e-02 -1.64984688e-02 -3.51249278e-02
  5.92519110e-03 -7.07932282e-04 -2.4103

In [16]:
# Send the model to the CPU. If you have a GPU, you can use "cuda" instead.
# This will speed up the embedding process significantly.
embedding_model.to("cpu") 

# Create embeddings one by one
for item in tqdm(pages_and_chunks):
    item["embedding"] = embedding_model.encode(item["sentence_chunk"])

# Turn text chunks into a single list
text_chunks = [item["sentence_chunk"] for item in pages_and_chunks]    

# This takes 1.5 minutes on my Mac


In [17]:
# Embed all texts in batches
text_chunk_embeddings = embedding_model.encode(text_chunks,
                                               batch_size=32, # you can use different batch sizes here for speed/performance, I found 32 works well for this use case
                                               convert_to_tensor=True) # optional to return embeddings as tensor instead of array

print(text_chunk_embeddings.shape)

device = "cuda" if torch.cuda.is_available() else "cpu"
# Convert embeddings to torch tensor and send to device (note: NumPy arrays are float64, torch tensors are float32 by default)
embeddings = torch.tensor(np.array(text_chunk_embeddings.tolist()), dtype=torch.float32).to(device)
print(embeddings.shape)

## This takes about two minutes on my Mac

torch.Size([1843, 768])
torch.Size([1843, 768])


In [18]:
###############################
## TEST THE RETRIEVAL SYSTEM ##
################################
# Define helper function to print wrapped text. Just makes long outputs easier to read.
# This uses the textwrap library to wrap long text outputs.
def print_wrapped(text, wrap_length=80):
    wrapped_text = textwrap.fill(text, wrap_length)
    print(wrapped_text)

# Define the query, i.e., what we want to search for in the text chunks
# Note: This could be anything. But since we're working with a nutrition textbook, 
# we'll stick with nutrition-based queries.
query = "daily value iron"
print("Query:", query)

# Embed the query to the same numerical space as the text examples
# Note: It's important to embed your query with the same model you embedded your examples with.
# This is how we will determine what chunks are most similar to the query.
# The query embedding is moved to the same device as the chunk embeddings so they can be compared.
query_embedding = embedding_model.encode(query, convert_to_tensor=True).to(embeddings.device)

# Get similarity scores with the dot product. A dot product is a common way to measure similarity
# as it determines how closely aligned two vectors are in space. The higher the dot product, the
# closer the vectors are. The embeddings from all-mpnet-base-v2 are normalized to unit length, so
# the dot product equals the cosine similarity: two vectors pointing in the same direction score 1.


# Also measure the time taken to compute the scores.
start_time = timer()
dot_scores = util.dot_score(a=query_embedding, b=embeddings)[0]
end_time = timer()

print("Time take to get scores on", len(embeddings), "embeddings:", end_time-start_time, "seconds.")

# Get the top-k results (we'll keep this to 5)
# You can change k to get more or fewer results.
top_results_dot_product = torch.topk(dot_scores, k=5)
print(top_results_dot_product)

# Print the results (in text not tokens) of the top-k most similar chunks
print("Results:")
# Loop through zipped together scores and indices from torch.topk
for score, idx in zip(top_results_dot_product[0], top_results_dot_product[1]):
    print("Score:", score)
    # Print relevant sentence chunk (since the scores are in descending order, the most relevant chunk will be first)
    print("Text:")
    print_wrapped(pages_and_chunks[idx]["sentence_chunk"])
    print()
    # Print the page number too so we can reference the textbook further (and check the results)
    print("Page number:", pages_and_chunks[idx]['page_number'])
    print("\n\n\n")

Query: daily value iron
Time take to get scores on 1843 embeddings: 0.0015535410020675045 seconds.
torch.return_types.topk(
values=tensor([0.4900, 0.4880, 0.4782, 0.4698, 0.4340]),
indices=tensor([1027, 1509, 1025, 1508, 1031]))
Results:
Score: tensor(0.4900)
Text:
Centers for Disease Control and Prevention.http://www.cdc.gov/nutrition/ Iron |
661

Page number: 661




Score: tensor(0.4880)
Text:
Potential Iron Loss in Endurance Athletes” for the potential amounts of iron
loss each day in male and female athletes. An increased recommendation for both
genders are shown below. These recommendations are based on the assumption that
iron has a 10%  absorption efficiency. As noted above, women athletes have a
greater iron loss due to menstruation and therefore must increase their dietary
needs more than male athletes. Table 16.3 The Potential Iron Loss in Endurance
Athletes   Approximate Daily Iron Losses in Endurance Athletes (mg/day)and
Increased Dietary Need Male Female Sedentary 1 1.5 A
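The dot product used above only behaves like a pure "direction" score when the vectors have unit length. The short sketch below, using made-up two-dimensional vectors and helper functions assumed for this example, shows the difference between a raw dot product, cosine similarity, and a dot product on normalized vectors.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


def norm(a: list[float]) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))


def normalize(a: list[float]) -> list[float]:
    """Scale a vector to unit length without changing its direction."""
    length = norm(a)
    return [x / length for x in a]


def cosine(a: list[float], b: list[float]) -> float:
    """Dot product divided by both lengths, so only direction matters."""
    return dot(a, b) / (norm(a) * norm(b))


a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, but twice as long

print("raw dot product:       ", dot(a, b))  # 50.0, grows with vector length
print("cosine similarity:     ", cosine(a, b))
print("dot of unit vectors:   ", dot(normalize(a), normalize(b)))
```

The last two values agree (both are 1, up to floating-point rounding), while the raw dot product depends on how long the vectors are.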

### Generation

In [19]:
######################
## LLM MODEL SET UP ##
######################

# If you do not have a GPU, keep this as True; if you do have a GPU, you can set it to False.
# This has to do with how the model runs on the reduced resource settings of a CPU.
use_quantization_config = True

# Below are two model options. The 270M parameter model is smaller and faster, but the
# 1B parameter model is more powerful. If you have a GPU you should be able to run a larger
# model, just be careful about VRAM usage. The 1B parameter model can use up to 8GB of VRAM
# during inference, so make sure you have enough available. If you run into issues, try
# the smaller model. If you do not have a GPU, stick with the smaller model (1 billion parameters
# or less).

# The below models are both from Google and are open-source. You can find them on Hugging Face:
# https://huggingface.co/google/functiongemma-270m-it
# https://huggingface.co/google/gemma-3-1b-it
# If you have never used Hugging Face before, you may need to create a free account
# and accept the model license before you can download them. Specifically, use one of the
# above links to access the model page, then accept the license terms from Google. You also
# need to be logged in via the huggingface_hub library (see the start of this notebook).
# Feel free to experiment with other models as well, but make sure they fit within your
# hardware limits.

#model_id = "google/functiongemma-270m-it"
model_id = "google/gemma-3-1b-it"
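A rough way to judge whether a model will fit on your hardware is to multiply the number of parameters by the number of bytes used to store each one. The sketch below does this arithmetic for the two model sizes mentioned above. It counts the weights only. Activations, the attention cache, and framework overhead add more on top, so treat these numbers as a lower bound rather than a prediction of actual usage.

```python
# Approximate storage cost per parameter for common precisions.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "4bit": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Memory needed for the model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for name, num_params in [("270M", 270e6), ("1B", 1e9)]:
    for dtype in ("float32", "float16", "4bit"):
        print(f"{name} parameters in {dtype}: about {weight_memory_gb(num_params, dtype):.2f} GB")
```

This is why 4-bit quantization helps on limited hardware: the same weights take roughly a quarter of the float16 footprint.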

In [20]:
# Create quantization config for smaller model loading (optional)
# For models that require 4-bit quantization (use this if you have low GPU memory available)
# You do not need to use this if you have a more powerful GPU with more VRAM.
quantization_config = BitsAndBytesConfig(load_in_4bit=True,
                                         bnb_4bit_compute_dtype=torch.float16)

# Scaled Dot-Product Attention (other options may be available)
# Flash attention is available for the Gemma models with an Nvidia GPU
attn_implementation = "sdpa"
print("[INFO] Using attention implementation:", attn_implementation)

print("[INFO] Using model_id:", model_id)

# Instantiate tokenizer (tokenizer turns text into numbers ready for the model)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_id)

# Instantiate the model
llm_model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=model_id,
                                                 torch_dtype=torch.float16, # datatype to use, we want float16
                                                 quantization_config=quantization_config if use_quantization_config else None,
                                                 low_cpu_mem_usage=False, # use full memory (CHANGE THIS IF NEEDED)
                                                 attn_implementation=attn_implementation) # which attention version to use

if not use_quantization_config: # quantization takes care of device setting automatically, so if it's not used, send model to GPU
    llm_model.to("cuda")

# This takes approximately 5 minutes on my Mac for the 270M parameter model for the first time.
# After the first time, loading is much faster since the model is cached locally.

[INFO] Using attention implementation: sdpa
[INFO] Using model_id: google/gemma-3-1b-it


`torch_dtype` is deprecated! Use `dtype` instead!


In [21]:
#########################################
## TEST THE MODEL WITHOUT AUGMENTATION ##
##########################################
input_text = "What are the macronutrients, and what roles do they play in the human body?"
print("Input text:\n", input_text)

# Create prompt template for instruction-tuned model
# This should be the same for most models chosen from Hugging Face that are instruction-tuned
# but check the model card to be sure.
dialogue_template = [
    {"role": "user",
     "content": input_text}
]

# Apply the chat template to format the prompt (needs to be tokenized)
prompt = tokenizer.apply_chat_template(conversation=dialogue_template,
                                       tokenize=False, # keep as raw text (not tokenized)
                                       add_generation_prompt=True)
print("\nPrompt (formatted):\n", prompt)

Input text:
 What are the macronutrients, and what roles do they play in the human body?

Prompt (formatted):
 <bos><start_of_turn>user
What are the macronutrients, and what roles do they play in the human body?<end_of_turn>
<start_of_turn>model
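To see what the chat template is doing, the sketch below rebuilds the formatted prompt shown above by hand. It is written only to mirror that output for a simple conversation. The function name is made up, and the real template may handle cases (such as system messages) that this does not, so in practice keep using `tokenizer.apply_chat_template`.

```python
def format_gemma_prompt(messages: list[dict]) -> str:
    """Hand-written imitation of the formatted prompt printed above (illustration only)."""
    parts = ["<bos>"]
    for message in messages:
        # Assumption: replies from the model use the role name "model" in this format.
        role = "model" if message["role"] == "assistant" else message["role"]
        parts.append(f"<start_of_turn>{role}\n{message['content']}<end_of_turn>\n")
    # Open a new model turn so the LLM knows it is its turn to write.
    parts.append("<start_of_turn>model\n")
    return "".join(parts)


dialogue = [
    {"role": "user",
     "content": "What are the macronutrients, and what roles do they play in the human body?"}
]
manual_prompt = format_gemma_prompt(dialogue)
print(manual_prompt)
```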



In [None]:
# Cast the model to bfloat16 for faster inference (optional, depending on model and hardware)
# I got numerical errors for the probability calculations without this.
llm_model.bfloat16()

# Tokenize the input text (turn it into numbers) and send it to the same device as the model (CPU/GPU)
input_ids = tokenizer(prompt, return_tensors="pt").to(llm_model.device)
print("Model input (tokenized):\n", input_ids, "\n")

# Generate outputs based on the tokenized input
outputs = llm_model.generate(**input_ids,
                             max_new_tokens=256) # define the maximum number of new tokens to create
                                                 # This can be changed as needed.
# Print the raw output from the LLM, which is in token form
print("Model output (tokens):\n", outputs[0], "\n")

# Takes approximately 6 minutes on my Mac for the 1 billion parameter model.

Model input (tokenized):
 {'input_ids': tensor([[     2,      2,    105,   2364,    107,   3689,    659,    506, 216955,
         151268, 236764,    532,   1144,  13616,    776,    901,   1441,    528,
            506,   3246,   2742, 236881,    106,    107,    105,   4368,    107]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1]])} 

Model output (tokens):
 tensor([     2,      2,    105,   2364,    107,   3689,    659,    506, 216955,
        151268, 236764,    532,   1144,  13616,    776,    901,   1441,    528,
           506,   3246,   2742, 236881,    106,    107,    105,   4368,    107,
         19058, 236764,   1531, 236789, 236751,   2541,   1679,    506,   3853,
           529, 216955, 151268,    528,    506,   3246,   2742, 236888,    108,
          1018,  19253,   1863, 151268,    753,    669,  16087,  82801,   1018,
           108,  19253,   1863, 151268,    659,    506,  28935,    822,   2742,
        

In [23]:
# Decode the output tokens to text (so we can read it)
# Note that most LLMs produce output using markdown formatting, so the output
# may contain markdown syntax (e.g., **bold**, _italic_, etc.). There are also
# special tokens in angle brackets (e.g., <bos>, <start_of_turn>) that mark the
# structure of the conversation for the model. The user prompt is also included.
outputs_decoded = tokenizer.decode(outputs[0])
print("Model output (decoded):\n", outputs_decoded, "\n")

Model output (decoded):
 <bos><bos><start_of_turn>user
What are the macronutrients, and what roles do they play in the human body?<end_of_turn>
<start_of_turn>model
Okay, let's break down the role of macronutrients in the human body!

**Macronutrients - The Building Blocks**

Macronutrients are the nutrients your body needs in large amounts. They are the primary food groups that provide energy and support all the functions of your body.

Here's a breakdown of each:

**1. Carbohydrates**

* **What are they?**  Sugars (like glucose, fructose, etc.) and starches.
* **Role:**  Provide energy. They're broken down into glucose, which fuels your cells for daily activities.
* **Types:**
    * **Simple Carbohydrates:** (like table sugar - glucose) Provide quick energy but offer little sustained energy.
    * **Complex Carbohydrates:** (like whole grains, vegetables) Provide more stable energy and fiber.
* **Important:**  Essential for brain function and blood sugar control.


**2. Proteins**

*
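
The decoded output above repeats the whole prompt before the answer, because `generate` returns the prompt tokens followed by the new tokens. One way to keep only the answer is to drop the prompt-length prefix before decoding. The sketch below shows the idea with plain Python lists, using token ids copied from the printed outputs above (only the first few generated ids are included, and the variable names are illustrative). With the real tensors, the equivalent slice would be something like `outputs[0][input_ids["input_ids"].shape[1]:]`.

```python
# Prompt token ids, copied from the "Model input (tokenized)" output above.
# The two leading 2s are the doubled <bos> token seen in the decoded text.
prompt_ids = [2, 2, 105, 2364, 107, 3689, 659, 506, 216955,
              151268, 236764, 532, 1144, 13616, 776, 901, 1441, 528,
              506, 3246, 2742, 236881, 106, 107, 105, 4368, 107]

# Start of the generated sequence, copied from "Model output (tokens)" above
output_ids = prompt_ids + [19058, 236764, 1531, 236789, 236751, 2541, 1679, 506, 3853]

# The output begins with an exact copy of the prompt ...
assert output_ids[:len(prompt_ids)] == prompt_ids

# ... so slicing it off leaves only the newly generated tokens
new_token_ids = output_ids[len(prompt_ids):]
print(len(prompt_ids), "prompt tokens,", len(new_token_ids), "new tokens")
print(new_token_ids)
```

Slicing by token count is usually more robust than removing the prompt with a string replace, since decoding can change whitespace or special tokens.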

### Augmentation

In [24]:
###########################################
## TEST THE MODEL WITH AUGMENTED INPUTS  ##
###########################################
# Define a list of nutrition-style questions to test the retrieval-augmented generation (RAG) pipeline

# Nutrition-style questions generated with GPT4
gpt4_questions = [
    "What are the macronutrients, and what roles do they play in the human body?",
    "How do vitamins and minerals differ in their roles and importance for health?",
    "Describe the process of digestion and absorption of nutrients in the human body.",
    "What role does fibre play in digestion? Name five fibre containing foods.",
    "Explain the concept of energy balance and its importance in weight management."
]

# Manually created question list
manual_questions = [
    "How often should infants be breastfed?",
    "What are symptoms of pellagra?",
    "How does saliva help with digestion?",
    "What is the RDI for protein per day?",
    "water soluble vitamins"
]

query_list = gpt4_questions + manual_questions

In [25]:
# Create a function to do the retrieval and print the results
# Note that you can change n_resources_to_return to get more or fewer results.
# All returned results will be used to augment the prompt for the LLM.
def retrieve_relevant_resources(query: str,
                                embeddings: torch.Tensor,
                                model: SentenceTransformer=embedding_model,
                                n_resources_to_return: int=5,
                                print_time: bool=True):
    """
    Embeds a query with model and returns top k scores and indices from embeddings.
    """

    # Embed the query
    query_embedding = model.encode(query,
                                   convert_to_tensor=True)

    # Get dot product scores on embeddings
    start_time = timer()
    dot_scores = util.dot_score(query_embedding, embeddings)[0]
    end_time = timer()

    if print_time:
        print("[INFO] Time taken to get scores on", len(embeddings), "embeddings:", end_time-start_time, "seconds.")

    scores, indices = torch.topk(input=dot_scores,
                                 k=n_resources_to_return)

    return scores, indices

# Function to format and print the top results in a nice manner
def print_top_results_and_scores(query: str,
                                 embeddings: torch.Tensor,
                                 pages_and_chunks: list[dict]=pages_and_chunks,
                                 n_resources_to_return: int=5):
    """
    Takes a query, retrieves most relevant resources and prints them out in descending order.

    Note: Requires pages_and_chunks to be formatted in a specific way (see above for reference).
    """

    scores, indices = retrieve_relevant_resources(query=query,
                                                  embeddings=embeddings,
                                                  n_resources_to_return=n_resources_to_return)

    print("Query:", query, "\n")
    print("Results:")
    # Loop through the zipped scores and indices
    for score, index in zip(scores, indices):
        print("Score:", score, "\n")
        # Print relevant sentence chunk (since the scores are in descending order, the most relevant chunk will be first)
        print_wrapped(pages_and_chunks[index]["sentence_chunk"])
        # Print the page number too so we can reference the textbook further and check the results
        print("Page number:", pages_and_chunks[index]["page_number"])
        print("\n")

# Define a prompt formatter function to create augmented prompts based on the retrieved information
def prompt_formatter(query: str,
                     context_items: list[dict]) -> str:
    """
    Augments query with text-based context from context_items.
    """
    # Join context items into one bulleted list (one "- " item per chunk)
    context = "- " + "\n- ".join([item["sentence_chunk"] for item in context_items])

    # Create a base prompt with examples to help the model
    # Note: this is very customizable, I've chosen to use 3 examples of the answer style we'd like.
    # We could also write this in a txt file and import it in if we wanted.

    # Change this as you like to get different answer styles. Look up examples of prompt engineering for more ideas.
    base_prompt = """Based on the following context items, please answer the query.
Give yourself room to think by extracting relevant passages from the context before answering the query.
Don't return the thinking, only return the answer.
Make sure your answers are as explanatory as possible.
Use the following examples as reference for the ideal answer style.
\nExample 1:
Query: What are the fat-soluble vitamins?
Answer: The fat-soluble vitamins include Vitamin A, Vitamin D, Vitamin E, and Vitamin K. These vitamins are absorbed along with fats in the diet and can be stored in the body's fatty tissue and liver for later use. Vitamin A is important for vision, immune function, and skin health. Vitamin D plays a critical role in calcium absorption and bone health. Vitamin E acts as an antioxidant, protecting cells from damage. Vitamin K is essential for blood clotting and bone metabolism.
\nExample 2:
Query: What are the causes of type 2 diabetes?
Answer: Type 2 diabetes is often associated with overnutrition, particularly the overconsumption of calories leading to obesity. Factors include a diet high in refined sugars and saturated fats, which can lead to insulin resistance, a condition where the body's cells do not respond effectively to insulin. Over time, the pancreas cannot produce enough insulin to manage blood sugar levels, resulting in type 2 diabetes. Additionally, excessive caloric intake without sufficient physical activity exacerbates the risk by promoting weight gain and fat accumulation, particularly around the abdomen, further contributing to insulin resistance.
\nExample 3:
Query: What is the importance of hydration for physical performance?
Answer: Hydration is crucial for physical performance because water plays key roles in maintaining blood volume, regulating body temperature, and ensuring the transport of nutrients and oxygen to cells. Adequate hydration is essential for optimal muscle function, endurance, and recovery. Dehydration can lead to decreased performance, fatigue, and increased risk of heat-related illnesses, such as heat stroke. Drinking sufficient water before, during, and after exercise helps ensure peak physical performance and recovery.
\nNow use the following context items to answer the user query:
{context}
\nRelevant passages: <extract relevant passages from the context here>
User query: {query}
Answer:"""

    # Update base prompt with context items and query
    base_prompt = base_prompt.format(context=context, query=query)

    # Create prompt template for instruction-tuned model
    dialogue_template = [
        {"role": "user",
         "content": base_prompt}
    ]

    # Apply the chat template
    prompt = tokenizer.apply_chat_template(conversation=dialogue_template,
                                          tokenize=False,
                                          add_generation_prompt=True)
    return prompt
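
To make the retrieval and augmentation steps above concrete, here is a toy version in plain Python with no `torch` or `sentence_transformers`. It scores a query against a few chunks with a dot product, keeps the top `k`, and joins the selected chunks into the same bulleted context string that `prompt_formatter` builds. The 3-dimensional "embeddings", the chunk texts, and the helpers `dot` and `top_k` are all made up for illustration; real embeddings come from the embedding model and have hundreds of dimensions.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def top_k(query_vec, chunk_vecs, k=2):
    """Return the k highest dot-product scores and their indices, best first."""
    all_scores = [dot(query_vec, vec) for vec in chunk_vecs]
    ranked = sorted(range(len(all_scores)), key=lambda i: all_scores[i], reverse=True)[:k]
    return [all_scores[i] for i in ranked], ranked

# Made-up chunks with made-up 3-dimensional embeddings
toy_chunks = [
    {"sentence_chunk": "Protein is made of amino acids.", "embedding": [0.9, 0.1, 0.0]},
    {"sentence_chunk": "Vitamin C is water soluble.", "embedding": [0.0, 0.2, 0.9]},
    {"sentence_chunk": "Adults need protein every day.", "embedding": [0.8, 0.3, 0.1]},
]
toy_query = [1.0, 0.0, 0.0]  # pretend this is the embedded query

# Retrieval: score every chunk and keep the two best
scores, indices = top_k(toy_query, [c["embedding"] for c in toy_chunks], k=2)
context_items = [toy_chunks[i] for i in indices]

# Augmentation: build the bulleted context that gets pasted into the prompt
context = "- " + "\n- ".join(item["sentence_chunk"] for item in context_items)
print(indices)
print(context)
```

The real functions do the same thing, only with `util.dot_score` and `torch.topk` operating on all chunk embeddings at once.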

In [26]:
# Choose a random query from the list
query = random.choice(query_list)
print("Query:", query)

# Retrieve the relevant chunks for the query. With the default
# n_resources_to_return=5, five pieces of context are returned.
scores, indices = retrieve_relevant_resources(query=query,
                                              embeddings=embeddings)

# Create a list of context items
context_items = [pages_and_chunks[i] for i in indices]

# Format prompt with context items
prompt = prompt_formatter(query=query,
                          context_items=context_items)
print(prompt)
print("\n\n\n\n")

# Tokenize the prompt and send to CPU/GPU
input_ids = tokenizer(prompt, return_tensors="pt").to("cpu")

# Generate an output of tokens
outputs = llm_model.generate(**input_ids,
                             temperature=0.7, # lower temperature = more deterministic outputs, higher temperature = more creative outputs
                             do_sample=True, # whether or not to use sampling, see https://huyenchip.com/2024/01/16/sampling.html for more
                             max_new_tokens=256) # how many new tokens to generate from prompt

# Turn the output tokens into text
output_text = tokenizer.decode(outputs[0])

print("RAG answer:\n", output_text.replace(prompt, ''))

Query: What is the RDI for protein per day?
[INFO] Time taken to get scores on 1843 embeddings: 0.0010961250009131618 seconds.
<bos><start_of_turn>user
Based on the following context items, please answer the query.
Give yourself room to think by extracting relevant passages from the context before answering the query.
Don't return the thinking, only return the answer.
Make sure your answers are as explanatory as possible.
Use the following examples as reference for the ideal answer style.

Example 1:
Query: What are the fat-soluble vitamins?
Answer: The fat-soluble vitamins include Vitamin A, Vitamin D, Vitamin E, and Vitamin K. These vitamins are absorbed along with fats in the diet and can be stored in the body's fatty tissue and liver for later use. Vitamin A is important for vision, immune function, and skin health. Vitamin D plays a critical role in calcium absorption and bone health. Vitamin E acts as an antioxidant, protecting cells from damage. Vitamin K is essential for blood 
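
The `temperature=0.7` setting in the generation call above rescales the model's next-token scores (logits) before a token is sampled. Below is a minimal sketch of the standard temperature-scaled softmax, using made-up logits for three candidate tokens. The function name `softmax_with_temperature` and the numbers are illustrative assumptions, not the library's internal code.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [logit / temperature for logit in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens

for t in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(toy_logits, temperature=t)
    print("temperature", t, "->", [round(p, 3) for p in probs])
```

Lower temperatures push almost all of the probability onto the highest-scoring token (closer to deterministic output), while higher temperatures flatten the distribution so less likely tokens are sampled more often.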

## Additional Information
* [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Original Paper)](https://arxiv.org/abs/2005.11401)
* [Simple Local RAG](https://github.com/dinocodesx/simple-local-rag/tree/master)
* [Building a RAG Pipeline from Scratch with PyTorch and Transformers](https://python.plainenglish.io/building-a-rag-pipeline-from-scratch-with-pytorch-and-transformers-b52e5504cde2)
* [Local Retrieval Augmented Generation (RAG) from Scratch (step by step tutorial) (Video)](https://www.youtube.com/watch?v=qN_2fnOPY-M)