## **Build the RAG**
We have already built a vector store that holds our locally computed embeddings. Now we need a Q&A pipeline that augments each prompt with passages from our data files and sends it to an LLM, which can either be hosted locally (on my GPU) or accessed in the cloud through an API (e.g. OpenAI).

#### **Retrieval**

In [2]:
# load the embeddings from our local vector store
import pandas as pd
import torch
import numpy as np
from sentence_transformers import util, SentenceTransformer

embeddings_df_save_path = "vector_store/embeddings.csv"
embeddings_loaded_df = pd.read_csv(embeddings_df_save_path)
embeddings_loaded_df.head()

| Unnamed: 0 | file_path | page_number | sentence_chunk | chunk_char_count | chunk_word_count | chunk_token_count | embedding |
|---|---|---|---|---|---|---|---|
| 0 | data\\attention is all you need.pdf | 0 | Provided proper attribution is provided, Googl... | 1165 | 154 | 291.25 | [ 2.07077805e-02 2.70413030e-02 -1.68691296e-... |
| 1 | data\\attention is all you need.pdf | 0 | Our model achieves 28.4 BLEU on the WMT 2014 E... | 620 | 95 | 155.0 | [ 1.10328430e-03 5.08999750e-02 3.29319574e-... |
| 2 | data\\attention is all you need.pdf | 0 | Jakob proposed replacing RNNs with self-attent... | 658 | 90 | 164.5 | [ 1.54208224e-02 1.14086864e-03 -6.22348813e-... |
| 3 | data\\attention is all you need.pdf | 0 | Lukasz and Aidan spent countless long days des... | 392 | 49 | 98.0 | [ 2.14229785e-02 5.34767583e-02 -1.31562511e-... |
| 4 | data\\attention is all you need.pdf | 1 | 1 Introduction Recurrent neural networks, long... | 1115 | 158 | 278.75 | [-2.53893668e-03 4.21451516e-02 -5.52566350e-... |


Next, convert the loaded embeddings dataframe into the formats used for retrieval:

In [3]:
# set the device
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Convert the embedding column back to np.array (the CSV stores each array as a string)
if isinstance(embeddings_loaded_df["embedding"][0], str):
    embeddings_loaded_df["embedding"] = embeddings_loaded_df["embedding"].apply(lambda x: np.fromstring(x.strip("[]"), sep=" "))

# Convert texts and embedding df to list of dicts (useful later)
embeddings_loaded_dict = embeddings_loaded_df.to_dict(orient="records")

# Convert embeddings to torch tensor and send to device
embeddings = torch.tensor(np.array(embeddings_loaded_df["embedding"].tolist()), dtype=torch.float32).to(DEVICE)
embeddings.shape

torch.Size([450, 768])
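
To see what the string-to-array conversion does to a single cell, here is a minimal pure-Python sketch. The sample string is shortened from the first row shown above (only its first two values), and `parse_embedding` is a hypothetical helper that is not part of the notebook. The `np.fromstring(..., sep=" ")` call performs a roughly equivalent split-and-convert step and returns an array instead of a list.

```python
def parse_embedding(cell: str) -> list:
    """Turn a stringified array like '[ 0.1 -0.2 ]' into a list of floats."""
    # Drop the surrounding brackets, then split on any run of whitespace.
    return [float(value) for value in cell.strip("[]").split()]


# Shortened, illustrative cell: the real strings hold 768 values each.
sample_cell = "[ 2.07077805e-02 2.70413030e-02]"
vector = parse_embedding(sample_cell)
print(len(vector), vector)
```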

In [4]:
# load the embedding model
embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2", 
                                      device=DEVICE)



In [5]:
# Do a similarity search between a query and the vector store, return the best 3 matches

query = "what are the main layers of the transformer? "
# embed the query
embedded_query = embedding_model.encode(query, convert_to_tensor=True)
# get a similarity score
dot_scores = util.dot_score(a=embedded_query, b=embeddings)[0]

# get the top 3 matches
top_results_dot_product = torch.topk(dot_scores, k=3)
top_results_dot_product

torch.return_types.topk(
values=tensor([0.5689, 0.4328, 0.3861], device='cuda:0'),
indices=tensor([  9,  86, 100], device='cuda:0'))

In [6]:
# import torch, gc
# gc.collect()
# torch.cuda.empty_cache()

In [7]:
# use the results to map back to our text chunks
print(f"Query: '{query}'\n")
print("Results: \n")
# Loop through the zipped scores and indices from torch.topk
for score, idx in zip(top_results_dot_product[0], top_results_dot_product[1]):
    print(f"Score: {score:.4f}")
    # Print the page number too so we can look up the passage in the source PDF (and check the results)
    print(f"Page number: {embeddings_loaded_dict[idx]['page_number']}")
    print(f"File path: {embeddings_loaded_dict[idx]['file_path']}")
    # Print relevant sentence chunk 
    print("Text:")
    print(embeddings_loaded_dict[idx]["sentence_chunk"])

    print("\n")

Query: 'what are the main layers of the transformer? '

Results: 

Score: 0.5689
Page number: 2
File path: data\\attention is all you need.pdf
Text:
Figure 1: The Transformer - model architecture. The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.3.1 Encoder and Decoder Stacks Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position- wise fully connected feed-forward network.


Score: 0.4328
Page number: 3
File path: data\\CNN.pdf
Text:
2.1 Overall architecture CNNs are comprised of three types of layers. These are convolutional layers, pooling layers and fully-connected layers. When these layers are stacked, a CNN architecture has been formed. A simpliﬁed CNN architecture for MNIST classiﬁcati

In [8]:
def rag_retrieve(query, embedding_model, vectore_store, top_k):
    # embed the query
    embedded_query = embedding_model.encode(query, convert_to_tensor=True)
    # dot product (equals cosine similarity because the vectors are normalized)
    scores = util.dot_score(a=embedded_query, b=vectore_store)[0]
    # get the top k results
    scores, indices = torch.topk(input=scores, k=top_k)
    return scores, indices

def show_retrieval_results(data_dict, query, scores, indices):
    print(f"Query: {query}\n")
    print("Results:\n")
    for score, index in zip(scores, indices):
        print(f"Score: {score:.4f}")
        # Print the file path and the page number
        print(f"File path: {data_dict[index]['file_path']}")
        print(f"Page number: {data_dict[index]['page_number']}")
        # Print relevant sentence chunk 
        print("Text:")
        print(data_dict[index]["sentence_chunk"])
        print("\n")
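
The comment in `rag_retrieve` treats the dot product as cosine similarity, which holds only when both vectors have unit length. The sketch below illustrates this with tiny hand-made 2-D vectors in pure Python. The vectors and the helpers `normalize`, `cosine`, and `top_k_indices` are made up for illustration; the notebook itself uses `util.dot_score` and `torch.topk` on 768-dimensional embeddings.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity, valid for vectors of any non-zero length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def top_k_indices(scores, k):
    """Indices of the k highest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Toy "vector store" and query, all normalized to unit length
store = [normalize(v) for v in ([1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [-1.0, 0.0])]
query = normalize([2.0, 1.0])

dot_scores = [dot(query, v) for v in store]
cos_scores = [cosine(query, v) for v in store]

# For unit-length vectors the two scores agree (up to rounding).
assert all(math.isclose(d, c) for d, c in zip(dot_scores, cos_scores))
print(top_k_indices(dot_scores, k=2))
```

If the stored vectors were not normalized, the dot product would also reward long vectors, and the ranking could differ from the cosine ranking.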

In [9]:
query = "vpt: video pre-training, hyperparameters "
scores, indices = rag_retrieve(query, embedding_model, embeddings, 4)
print(scores, "\n",  indices, "\n")
show_retrieval_results(embeddings_loaded_dict, query, scores, indices)

tensor([0.5530, 0.5499, 0.5483, 0.5409], device='cuda:0') 
 tensor([279, 417, 305, 271], device='cuda:0') 

Query: vpt: video pre-training, hyperparameters 

Results:

Score: 0.5530
File path: data\\video pretraining VPT.pdf
Page number: 4
Text:
These searches resulted in ∼270k hours of video, which we filtered down to “clean” video segments yielding an unlabeled dataset of ∼70k hours, which we refer to as web_clean (Appendix A has further details on data scraping and filtering). We then generated pseudo-labels for web_clean with our best IDM (Section 3) and then trained the VPT foundation model with behavioral cloning. Preliminary experiments suggested that our model could benefit from 30 epochs of training and that a 0.5 billion parameter model was required to stay in the efficient learning regime63 for that training duration (Appendix H), which took ∼9 days on 720 V100 GPUs. We evaluate our models by measuring validation loss (Fig.4, left) and rolling them out in the Minecraft envir

#### **Loading an LLM locally** 

In [10]:
import torch
gpu_memory_bytes = torch.cuda.get_device_properties(0).total_memory
gpu_memory_gb = round(gpu_memory_bytes / (2**30))
print(f"Available GPU memory: {gpu_memory_gb} GB")

Available GPU memory: 6 GB


In [11]:
# My GPU is an Nvidia RTX 3060 with 6 GB of memory
# Loading a 2-billion-parameter model in full precision (float32) needs about 2B * 4 bytes = 8 GB of GPU memory
# So I need lower precision: float16 halves that, and quantizing to int8 or 4-bit shrinks it further
model_id  = "google/gemma-2b-it"
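
The arithmetic in the comments above can be checked with a quick back-of-the-envelope calculation. This sketch counts weight storage only (activations, the KV cache, and framework overhead come on top) and assumes the round figure of 2 billion parameters used above; `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Memory needed to store the weights alone, in GiB (2**30 bytes)."""
    return num_params * bits_per_param / 8 / 2**30


NUM_PARAMS = 2e9  # round figure taken from the comment above (an assumption)

for name, bits in [("float32", 32), ("float16 / bfloat16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{name:>18}: {weight_memory_gib(NUM_PARAMS, bits):.2f} GiB")
```

Under this estimate the float32 weights alone (about 7.5 GiB) exceed the 6 GB card, while the lower-precision options fit.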

In [12]:
# Download and load the model for inference
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# load in 4-bit precision (greatly reduces GPU memory use)
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_id)
llm_model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=model_id,
                                                 torch_dtype=torch.bfloat16,
                                                 quantization_config=quantization_config,
                                                 low_cpu_mem_usage=False)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [13]:
llm_model

GemmaForCausalLM(
  (model): GemmaModel(
    (embed_tokens): Embedding(256000, 2048, padding_idx=0)
    (layers): ModuleList(
      (0-17): 18 x GemmaDecoderLayer(
        (self_attn): GemmaSdpaAttention(
          (q_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): GemmaRotaryEmbedding()
        )
        (mlp): GemmaMLP(
          (gate_proj): Linear4bit(in_features=2048, out_features=16384, bias=False)
          (up_proj): Linear4bit(in_features=2048, out_features=16384, bias=False)
          (down_proj): Linear4bit(in_features=16384, out_features=2048, bias=False)
          (act_fn): GELUActivation()
        )
        (input_layernorm): GemmaRMSNorm()
        (post_attention_layernorm): GemmaRMSNorm()
     

From this printout we can see:
- the vocabulary size is 256k (256,000 tokens)
- the hidden size is 2048
- the context length is 8192 tokens (this comes from the model card, not from the printout)
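
As a rough cross-check of the "2b" label, the weight matrices visible in the printout can be tallied by hand. The sketch below counts only the embedding table and the `Linear4bit` projections listed above. The printout is truncated, so anything after the layer norms is not included, and the small norm weights are ignored.

```python
HIDDEN = 2048      # hidden size (embed_tokens output dimension)
VOCAB = 256_000    # vocabulary size
KV_DIM = 256       # k_proj / v_proj out_features
MLP_DIM = 16_384   # gate_proj / up_proj out_features
NUM_LAYERS = 18    # "(0-17): 18 x GemmaDecoderLayer"

embedding = VOCAB * HIDDEN
attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM  # q_proj, o_proj + k_proj, v_proj
mlp = 3 * HIDDEN * MLP_DIM                             # gate_proj, up_proj, down_proj
per_layer = attention + mlp

total = embedding + NUM_LAYERS * per_layer
print(f"embedding: {embedding:,}")
print(f"per layer: {per_layer:,}")
print(f"total:     {total:,}")
```

Under these assumptions the tally comes to roughly 2.5 billion weights, about 0.5 billion of which sit in the embedding table, so memory estimates based on a flat 2 billion are slightly optimistic.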

In [14]:
%%time

input_text = "what is attention as described in the attention all you need paper? "
print(f"Input text:\n{input_text}")

# Create prompt template for instruction-tuned model
dialogue_template = [
    {"role": "user",
     "content": input_text}
]

# Apply the chat template
prompt = tokenizer.apply_chat_template(conversation=dialogue_template,
                                       tokenize=False, # keep as raw text (not tokenized)
                                       add_generation_prompt=True)

# Tokenize the prompt (turn it into token IDs) and send it to the device
# Note: the chat template already inserts <bos> and the tokenizer adds another one by default,
# which is why the IDs below start with 2, 2; passing add_special_tokens=False would avoid the duplicate
input_ids = tokenizer(prompt, return_tensors="pt").to(DEVICE)
print(f"Model input (tokenized):\n{input_ids}\n")

# Generate outputs based on the tokenized input
outputs = llm_model.generate(**input_ids,
                             max_new_tokens=256,
                             do_sample=True)
print(f"Model output (tokens):\n{outputs[0]}\n")

# Decode the output tokens to text
outputs_decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Model output (decoded):\n{outputs_decoded}\n")

Input text:
what is attention as described in the attention all you need paper? 
Model input (tokenized):
{'input_ids': tensor([[     2,      2,    106,   1645,    108,   5049,    603,   6137,    685,
           6547,    575,    573,   6137,    832,    692,   1476,   4368, 235336,
            107,    108,    106,   2516,    108]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
       device='cuda:0')}



Model output (tokens):
tensor([     2,      2,    106,   1645,    108,   5049,    603,   6137,    685,
          6547,    575,    573,   6137,    832,    692,   1476,   4368, 235336,
           107,    108,    106,   2516,    108,    886,    573,   4368,    664,
         41261,    685,  20555,  22021, 235281,    731,  24556,  88085,   1008,
           717, 235265,    591, 235284, 235276, 235274, 235324,    823,   6137,
           603,   6547,    685,    476,  15613,    604,  25737,    577,   3383,
          4942,    576,    573,   3772,  10629,    575,   2184,    577,   1501,
          2525,  32794, 235265,   1417,  15613,   8563,    573,   2091,    577,
         36570,   1277,   6044,    611,   4942,    576,    573,   3772,    674,
           708,   9666,    577,    573,   6911,    696,   1634, 235265, 235248,
           109,   4858, 235303, 235256,    476,  25497,    576,    573,   2621,
          3782, 235292,    109, 235287,   5231,  41261,    603,    476,  15613,
         95573,  

#### **Augment prompts** 
We want to add relevant data to the prompt before it reaches the LLM. The relevant data comes from our files through a vector similarity search performed on the fly: each time we query the LLM, we embed the query with our embedding model, fetch the top-k most similar text chunks from our vector store, add that text to the prompt as context, and then prompt the LLM to generate an answer.

In [53]:
def prepare_augmented_prompt(query, relevant_chunks):

    """
    Format the final prompt for the LLM:
    - use few-shot prompting (in-context learning)
    - add context from the relevant chunks (augmentation)
    """
    
    # join relevant chunks in one context string
    chunks  = [chunk["sentence_chunk"] for chunk in relevant_chunks]
    chunks = " -" + "\n -".join(chunks)

    # few-shot prompting
    base_prompt = """Based on the following context items, please answer the query. Give yourself room to think by extracting relevant passages from the context before answering the query. Don't return the thinking, only return the answer. Make sure your answers are as explanatory as possible. Use the following examples as a reference for the ideal answer style.
\nExample 1:
Query: What is the role of backpropagation in neural networks?
Answer: Backpropagation is a key algorithm used for training neural networks by minimizing the error between predicted and actual outputs. It involves a forward pass where the input data is propagated through the network to generate an output, and a backward pass where the error is propagated back through the network to update the weights. This is done using the gradient descent optimization method, which calculates the gradient of the loss function with respect to each weight and adjusts the weights to reduce the error. Backpropagation allows neural networks to learn complex patterns in data by iteratively improving the model's accuracy.
\nExample 2:
Query: How does a convolutional neural network (CNN) process image data?
Answer:  A convolutional neural network (CNN) processes image data by applying a series of convolutional layers that automatically detect and learn features such as edges, textures, and shapes. Each convolutional layer consists of filters (also known as kernels) that slide over the input image, performing element-wise multiplication and summing the results to produce feature maps. These feature maps are then passed through activation functions (like ReLU) and pooling layers to reduce dimensionality while retaining essential features. As the data moves through deeper layers, the CNN captures increasingly abstract and complex patterns, ultimately enabling the model to recognize objects and patterns within the image. CNNs are particularly effective for tasks such as image classification, object detection, and facial recognition due to their ability to learn spatial hierarchies of features.
\nNow use the following context items to answer the user query:
{context}
\nRelevant passages: <extract relevant passages from the context here>
User query: {query}
Answer:"""

    # Add relevant chunks
    base_prompt = base_prompt.format(context=chunks, query=query)
    # final prompt, suited for instruction-tuned models
    template = [{"role": "user", "content": base_prompt}]
    # add_generation_prompt argument tells the template to add tokens that indicate the start of a bot response
    prompt = tokenizer.apply_chat_template(conversation=template, tokenize=False, add_generation_prompt=True)
   
    return prompt
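
To make the formatting step concrete, here is a small sketch of how the retrieved chunks become the `{context}` block. The two chunks and the shortened template are placeholders invented for illustration; only the join expression and the `str.format` call mirror `prepare_augmented_prompt`.

```python
# Placeholder records shaped like the entries of embeddings_loaded_dict
relevant_chunks = [
    {"sentence_chunk": "first retrieved chunk"},
    {"sentence_chunk": "second retrieved chunk"},
]

# Same join as in prepare_augmented_prompt: one " -" bullet per chunk
context = " -" + "\n -".join(chunk["sentence_chunk"] for chunk in relevant_chunks)

# Shortened stand-in for base_prompt (hypothetical, not the real template)
template = "Context items:\n{context}\nUser query: {query}\nAnswer:"
print(template.format(context=context, query="example query"))
```

One thing to keep in mind with `str.format`: literal `{` or `}` characters in the template itself must be doubled (`{{`, `}}`), whereas braces inside the substituted chunk text are left untouched.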



In [54]:
query = "what is a transformer and how it works? what is used for? what is the compute and time complexity of it?"

# get relevant chunks 
scores, indices = rag_retrieve(query=query, embedding_model=embedding_model, 
                               vectore_store=embeddings, top_k=10)
relevant_chunks = [embeddings_loaded_dict[i] for i in indices]

# prepare the prompt
prompt = prepare_augmented_prompt(query=query, relevant_chunks=relevant_chunks)
prompt

'<bos><start_of_turn>user\nBased on the following context items, please answer the query. Give yourself room to think by extracting relevant passages from the context before answering the query. Don\'t return the thinking, only return the answer. Make sure your answers are as explanatory as possible. Use the following examples as a reference for the ideal answer style.\n\nExample 1:\nQuery: What is the role of backpropagation in neural networks?\nAnswer: Backpropagation is a key algorithm used for training neural networks by minimizing the error between predicted and actual outputs. It involves a forward pass where the input data is propagated through the network to generate an output, and a backward pass where the error is propagated back through the network to update the weights. This is done using the gradient descent optimization method, which calculates the gradient of the loss function with respect to each weight and adjusts the weights to reduce the error. Backpropagation allows

In [55]:
%%time 

# tokenize the prompt and send it to the device
input_ids = tokenizer(prompt, return_tensors="pt").to(DEVICE)

# Generate an output of tokens
outputs = llm_model.generate(**input_ids,
                             temperature=0.7, 
                             do_sample=True, 
                             max_new_tokens=256) 

# Turn the output tokens into text
output_text = tokenizer.decode(outputs[0])

print(f"Query: {query}")
print(f"RAG answer:\n{output_text.replace(prompt, '')}")


Query: what is a transformer and how it works? what is used for? what is the compute and time complexity of it?
RAG answer:
<bos>**What is a Transformer?**

A Transformer is a novel neural network architecture for sequence-to-sequence tasks. It is an extension of the self-attention mechanism, which is a powerful technique for learning long-range dependencies in sequences.

**What is used for?**

The Transformer is used for tasks that require learning representations of sequences of data, such as natural language processing (NLP) tasks (e.g., machine translation, text summarization, sentiment analysis).

**What is the compute and time complexity?**

The compute complexity of the Transformer is similar to that of a single-head self-attention layer. However, due to the use of multiple attention heads, the total computational cost is reduced. The time complexity of training a Transformer model is typically lower than that of other sequence-to-sequence models.<eos>
CPU times: total: 7.42 s


In [56]:
# add everything to one function
def augmented_generation(query, embedding_model, vector_store, data_index,
                   top_k, llm_model, temperature, max_new_tokens, device):

    # query your RAG to get relevant text
    scores, indices = rag_retrieve(query=query, embedding_model=embedding_model, vectore_store=vector_store, top_k=top_k)
    relevant_chunks = [data_index[i] for i in indices]
    
    # prepare the prompt
    prompt = prepare_augmented_prompt(query=query, relevant_chunks=relevant_chunks)

    # prompt the LLM
    input_ids = tokenizer(prompt, return_tensors="pt").to(device)
    
    # Generate an output of tokens
    outputs = llm_model.generate(**input_ids,
                                 temperature=temperature, 
                                 do_sample=True, 
                                 max_new_tokens=max_new_tokens) 
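    # decode (the decoded text still starts with the prompt, because generate returns prompt + new tokens)
    output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # TODO: strip the echoed prompt so that only the generated answer remains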
    
    generated_response = {"completion": output_text,
                          "retrieved_chunks": relevant_chunks}
    return generated_response
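
`generate` returns the prompt tokens followed by the newly generated ones (the token dump from the earlier cell starts with the same IDs as its input), so the decoded `completion` begins with the whole prompt. One common way to keep only the answer is to slice off the prompt tokens before decoding. The sketch below shows the idea on plain Python lists, using the token IDs printed earlier; `strip_prompt_tokens` is a hypothetical helper.

```python
def strip_prompt_tokens(output_ids, prompt_length):
    """Keep only the tokens generated after the prompt."""
    return output_ids[prompt_length:]


# Token IDs copied from the earlier cell: the 23 prompt tokens ...
prompt_ids = [2, 2, 106, 1645, 108, 5049, 603, 6137, 685, 6547, 575, 573,
              6137, 832, 692, 1476, 4368, 235336, 107, 108, 106, 2516, 108]
# ... and the start of the model output (prompt followed by the first new tokens)
output_ids = prompt_ids + [886, 573, 4368, 664]

new_ids = strip_prompt_tokens(output_ids, len(prompt_ids))
print(new_ids)
```

In the notebook this would correspond to decoding something like `outputs[0][input_ids["input_ids"].shape[1]:]` instead of `outputs[0]`, which is an untested suggestion rather than part of the original code.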
    

In [58]:
# test pipeline
query = "What is video pre-training (VPT)? and how they trained the VPT?"
response = augmented_generation(query, embedding_model, embeddings, embeddings_loaded_dict,
                   3, llm_model, 0.7, 512, DEVICE)
print(response["completion"])

user
Based on the following context items, please answer the query. Give yourself room to think by extracting relevant passages from the context before answering the query. Don't return the thinking, only return the answer. Make sure your answers are as explanatory as possible. Use the following examples as a reference for the ideal answer style.

Example 1:
Query: What is the role of backpropagation in neural networks?
Answer: Backpropagation is a key algorithm used for training neural networks by minimizing the error between predicted and actual outputs. It involves a forward pass where the input data is propagated through the network to generate an output, and a backward pass where the error is propagated back through the network to update the weights. This is done using the gradient descent optimization method, which calculates the gradient of the loss function with respect to each weight and adjusts the weights to reduce the error. Backpropagation allows neural networks to learn c