## **Build the RAG**
We have already created a vector store that holds the locally computed embeddings. Now we need a Q&A pipeline that augments prompts with content from our data files and sends them to an LLM, which can be hosted locally (on my GPU) or accessed in the cloud through an API (OpenAI).

#### **Retrieval**

In [1]:
# load the embeddings from our local vector store
import pandas as pd
import torch
import numpy as np
from sentence_transformers import util, SentenceTransformer

embeddings_df_save_path = "vector_store/embeddings.csv"
embeddings_loaded_df = pd.read_csv(embeddings_df_save_path)
embeddings_loaded_df.head()

| Unnamed: 0 | file_path | page_number | sentence_chunk | chunk_char_count | chunk_word_count | chunk_token_count | embedding |
|---|---|---|---|---|---|---|---|
| 0 | data\\attention is all you need.pdf | 0 | Provided proper attribution is provided, Googl... | 1165 | 154 | 291.25 | [ 2.07077805e-02 2.70413030e-02 -1.68691296e-... |
| 1 | data\\attention is all you need.pdf | 0 | Our model achieves 28.4 BLEU on the WMT 2014 E... | 620 | 95 | 155.0 | [ 1.10328430e-03 5.08999750e-02 3.29319574e-... |
| 2 | data\\attention is all you need.pdf | 0 | Jakob proposed replacing RNNs with self-attent... | 658 | 90 | 164.5 | [ 1.54208224e-02 1.14086864e-03 -6.22348813e-... |
| 3 | data\\attention is all you need.pdf | 0 | Lukasz and Aidan spent countless long days des... | 392 | 49 | 98.0 | [ 2.14229785e-02 5.34767583e-02 -1.31562511e-... |
| 4 | data\\attention is all you need.pdf | 1 | 1 Introduction Recurrent neural networks, long... | 1115 | 158 | 278.75 | [-2.53893668e-03 4.21451516e-02 -5.52566350e-... |


Convert the embeddings DataFrame into the forms we need for search (NumPy arrays, a list of dicts, and a torch tensor):

In [2]:
# set the device
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Convert the embedding column back to np.array if the values were loaded as strings
if isinstance(embeddings_loaded_df["embedding"][0], str):
    embeddings_loaded_df["embedding"] = embeddings_loaded_df["embedding"].apply(lambda x: np.fromstring(x.strip("[]"), sep=" "))

# Convert texts and embedding df to list of dicts (useful later)
embeddings_loaded_dict = embeddings_loaded_df.to_dict(orient="records")

# Convert embeddings to torch tensor and send to device
embeddings = torch.tensor(np.array(embeddings_loaded_df["embedding"].tolist()), dtype=torch.float32).to(DEVICE)
embeddings.shape

torch.Size([450, 768])
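
The string-to-array conversion above works because a saved embedding is just the printed form of a NumPy vector: numbers separated by whitespace inside square brackets. As a rough, dependency-free sketch of the same idea (the helper name `parse_embedding` and the sample values are made up for illustration, not taken from the vector store):

```python
def parse_embedding(text):
    """Parse a NumPy-style printed vector, e.g. '[ 1.5e-02 -2.0e-01\n  3.0e+00]'."""
    # Drop the surrounding brackets, then split on any whitespace.
    # Long vectors are wrapped over several lines when printed, so
    # newlines have to be treated the same way as spaces.
    return [float(value) for value in text.strip().strip("[]").split()]


# Hypothetical sample string in the same layout as the "embedding" column
sample = "[ 1.5e-02 -2.0e-01\n  3.0e+00]"
vector = parse_embedding(sample)
print(len(vector), vector)
```

In the notebook itself `np.fromstring(..., sep=" ")` does this job and returns an array directly, so this sketch is only meant to show what the parsing step amounts to.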

In [3]:
# load the embedding model
embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2", 
                                      device=DEVICE)



In [4]:
# Do a similarity search between a query and the vector store, return the best 3 matches

query = "what are the main layers of the transformer? "
# embed the query
embedded_query = embedding_model.encode(query, convert_to_tensor=True)
# get a similarity score
dot_scores = util.dot_score(a=embedded_query, b=embeddings)[0]

# get the top 3 matches
top_results_dot_product = torch.topk(dot_scores, k=3)
top_results_dot_product

torch.return_types.topk(
values=tensor([0.5689, 0.4328, 0.3861], device='cuda:0'),
indices=tensor([  9,  86, 100], device='cuda:0'))

In [5]:
# import torch, gc
# gc.collect()
# torch.cuda.empty_cache()

In [6]:
# Use the results to map back to our text chunks
print(f"Query: '{query}'\n")
print("Results: \n")
# Loop through the zipped scores and indices from torch.topk
for score, idx in zip(top_results_dot_product[0], top_results_dot_product[1]):
    print(f"Score: {score:.4f}")
    # Print the page number too so we can look up the source PDF (and check the results)
    print(f"Page number: {embeddings_loaded_dict[idx]['page_number']}")
    print(f"File path: {embeddings_loaded_dict[idx]['file_path']}")
    # Print relevant sentence chunk 
    print("Text:")
    print(embeddings_loaded_dict[idx]["sentence_chunk"])

    print("\n")

Query: 'what are the main layers of the transformer? '

Results: 

Score: 0.5689
Page number: 2
File path: data\\attention is all you need.pdf
Text:
Figure 1: The Transformer - model architecture. The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.3.1 Encoder and Decoder Stacks Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position- wise fully connected feed-forward network.


Score: 0.4328
Page number: 3
File path: data\\CNN.pdf
Text:
2.1 Overall architecture CNNs are comprised of three types of layers. These are convolutional layers, pooling layers and fully-connected layers. When these layers are stacked, a CNN architecture has been formed. A simpliﬁed CNN architecture for MNIST classiﬁcati

In [7]:
def rag_retrieve(query, embedding_model, vector_store, top_k):
    # embed the query
    embedded_query = embedding_model.encode(query, convert_to_tensor=True)
    # dot product (equals cosine similarity because the vectors are normalized)
    scores = util.dot_score(a=embedded_query, b=vector_store)[0]
    # get the top k results
    scores, indices = torch.topk(input=scores, k=top_k)
    return scores, indices

def show_retrieval_results(data_dict, query, scores, indices):
    print(f"Query: {query}\n")
    print("Results:\n")
    for score, index in zip(scores, indices):
        print(f"Score: {score:.4f}")
        # Print the file path and the page number
        print(f"File path: {data_dict[index]['file_path']}")
        print(f"Page number: {data_dict[index]['page_number']}")
        # Print relevant sentence chunk 
        print("Text:")
        print(data_dict[index]["sentence_chunk"])
        print("\n")
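
The comment in `rag_retrieve` relies on the fact that, for unit-length vectors, the dot product and the cosine similarity are the same number. Here is a small pure-Python sketch of that claim and of the top-k step, using toy 2-D vectors. The helper names (`normalize`, `cosine`, `top_k_by_dot`) and the data are hypothetical and only stand in for `util.dot_score` and `torch.topk`:

```python
import heapq
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(vector):
    # Scale the vector to unit length
    norm = math.sqrt(dot(vector, vector))
    return [x / norm for x in vector]


def cosine(a, b):
    # Cosine similarity computed directly on unnormalized vectors
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def top_k_by_dot(query, store, k):
    # Score every stored vector, then keep the k highest-scoring indices
    scores = [dot(query, row) for row in store]
    best = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return [(index, scores[index]) for index in best]


# Toy "vector store": four 2-D vectors, normalized like the real embeddings
raw_store = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [-1.0, 0.0]]
raw_query = [2.0, 1.0]
store = [normalize(v) for v in raw_store]
results = top_k_by_dot(normalize(raw_query), store, k=2)

for index, score in results:
    print(index, round(score, 4), round(cosine(raw_query, raw_store[index]), 4))
```

If the stored embeddings were not normalized, the dot product would also reward long vectors, and the ranking could differ from the cosine ranking.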

In [8]:
query = "vpt: video pre-training, hyperparameters "
scores, indices = rag_retrieve(query, embedding_model, embeddings, 4)
print(scores, "\n",  indices, "\n")
show_retrieval_results(embeddings_loaded_dict, query, scores, indices)

tensor([0.5530, 0.5499, 0.5483, 0.5409], device='cuda:0') 
 tensor([279, 417, 305, 271], device='cuda:0') 

Query: vpt: video pre-training, hyperparameters 

Results:

Score: 0.5530
File path: data\\video pretraining VPT.pdf
Page number: 4
Text:
These searches resulted in ∼270k hours of video, which we filtered down to “clean” video segments yielding an unlabeled dataset of ∼70k hours, which we refer to as web_clean (Appendix A has further details on data scraping and filtering). We then generated pseudo-labels for web_clean with our best IDM (Section 3) and then trained the VPT foundation model with behavioral cloning. Preliminary experiments suggested that our model could benefit from 30 epochs of training and that a 0.5 billion parameter model was required to stay in the efficient learning regime63 for that training duration (Appendix H), which took ∼9 days on 720 V100 GPUs. We evaluate our models by measuring validation loss (Fig.4, left) and rolling them out in the Minecraft envir
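
With retrieval working, the retrieved chunks still have to be turned into an augmented prompt for the LLM. The following is only a sketch of one way to do that: the function name `build_augmented_prompt` and the prompt wording are my assumptions, and the only thing taken from the notebook is the `sentence_chunk` key of the record dicts.

```python
def build_augmented_prompt(query, context_items):
    """Join retrieved chunks into a context block and append the query."""
    # One bullet per retrieved chunk, in retrieval order
    context = "\n".join(f"- {item['sentence_chunk']}" for item in context_items)
    return (
        "Answer the query using the context items below.\n\n"
        f"Context:\n{context}\n\n"
        f"Query: {query}\nAnswer:"
    )


# Stand-in records shaped like the entries of embeddings_loaded_dict
example_items = [
    {"sentence_chunk": "The encoder is composed of a stack of N = 6 identical layers."},
    {"sentence_chunk": "Each layer has two sub-layers."},
]
augmented_prompt = build_augmented_prompt(
    "what are the main layers of the transformer?", example_items
)
print(augmented_prompt)
```

In the real pipeline, `context_items` would be built from the indices returned by `rag_retrieve`, for example `[embeddings_loaded_dict[int(i)] for i in indices]`.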

#### **Loading an LLM locally** 

In [9]:
import torch
gpu_memory_bytes = torch.cuda.get_device_properties(0).total_memory
gpu_memory_gb = round(gpu_memory_bytes / (2**30))
print(f"Available GPU memory: {gpu_memory_gb} GB")

Available GPU memory: 6 GB


In [10]:
# My GPU is an Nvidia RTX 3060 with 6 GB of memory
# Loading a 2-billion-parameter model in full precision (float32) needs about 2B * 4 bytes ~ 8 GB of GPU memory
# so I need a lower precision: float16, or quantization to int8 / 4-bit
model_id  = "google/gemma-2b-it"
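
The estimate in the comments above is plain arithmetic: number of parameters times bytes per parameter. A quick sketch that repeats it for the common precisions is below. It counts weights only and ignores activations, the KV cache and any quantization overhead, so real usage will be somewhat higher. The names `BYTES_PER_PARAM` and `weight_memory_gib` are made up for this example.

```python
# Bytes needed to store one parameter at each precision
BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16/bfloat16": 2.0,
    "int8": 1.0,
    "4-bit": 0.5,
}


def weight_memory_gib(num_params, bytes_per_param):
    # Same unit as the GPU check above: bytes divided by 2**30
    return num_params * bytes_per_param / 2**30


num_params = 2_000_000_000  # the rough "2 billion" figure used in the comment above
estimates = {
    name: weight_memory_gib(num_params, size)
    for name, size in BYTES_PER_PARAM.items()
}
for name, gib in estimates.items():
    print(f"{name:>17}: {gib:.2f} GiB")
```

On a 6 GB card, only the float16, int8 and 4-bit rows leave room for the weights, and float16 leaves little headroom for everything else.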

In [11]:
# Download and load the model for inference
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# load in 4-bit precision (cuts GPU memory use substantially, so the model fits in 6 GB)
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_id)
llm_model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=model_id,
                                                 torch_dtype=torch.bfloat16,
                                                 quantization_config=quantization_config,
                                                 low_cpu_mem_usage=False)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [12]:
llm_model

GemmaForCausalLM(
  (model): GemmaModel(
    (embed_tokens): Embedding(256000, 2048, padding_idx=0)
    (layers): ModuleList(
      (0-17): 18 x GemmaDecoderLayer(
        (self_attn): GemmaSdpaAttention(
          (q_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): GemmaRotaryEmbedding()
        )
        (mlp): GemmaMLP(
          (gate_proj): Linear4bit(in_features=2048, out_features=16384, bias=False)
          (up_proj): Linear4bit(in_features=2048, out_features=16384, bias=False)
          (down_proj): Linear4bit(in_features=16384, out_features=2048, bias=False)
          (act_fn): GELUActivation()
        )
        (input_layernorm): GemmaRMSNorm()
        (post_attention_layernorm): GemmaRMSNorm()
     

From this printout we can see:
- the vocabulary size is 256k
- the hidden size is 2048
- the context length is 8192 (taken from the model card, not the printout)
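
The layer shapes in the printout are also enough to roughly sanity-check the model's size. The sketch below counts only the weight matrices visible above (the embedding table plus the attention and MLP projections of the 18 decoder layers). Because the printout is cut off, it leaves out the norm weights and anything printed after them, so treat the result as an approximation.

```python
# Shapes read from the printed model
hidden = 2048        # q_proj / o_proj width
kv_dim = 256         # k_proj / v_proj output width
mlp_dim = 16384      # gate_proj / up_proj output width
vocab = 256_000      # embed_tokens rows
num_layers = 18      # (0-17): 18 x GemmaDecoderLayer

# Attention: q_proj and o_proj are hidden x hidden, k_proj and v_proj are hidden x kv_dim
attention = 2 * hidden * hidden + 2 * hidden * kv_dim
# MLP: gate_proj, up_proj and down_proj each hold hidden * mlp_dim weights
mlp = 3 * hidden * mlp_dim

per_layer = attention + mlp
embedding = vocab * hidden
total = embedding + num_layers * per_layer

print(f"per decoder layer: {per_layer:,}")
print(f"embedding table:   {embedding:,}")
print(f"total (approx.):   {total:,}")
```

The total comes out at roughly 2.5 billion, a bit above the round "2B" in the model name, and the embedding table alone accounts for about a fifth of it. The narrow `k_proj`/`v_proj` (256 outputs against 2048 for `q_proj`) is consistent with keys and values being shared across the query heads.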

In [13]:
%%time

input_text = "what is attention as described in the attention all you need paper? "
print(f"Input text:\n{input_text}")

# Create prompt template for instruction-tuned model
dialogue_template = [
    {"role": "user",
     "content": input_text}
]

# Apply the chat template
prompt = tokenizer.apply_chat_template(conversation=dialogue_template,
                                       tokenize=False, # keep as raw text (not tokenized)
                                       add_generation_prompt=True)

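# Tokenize the input text (turn it into numbers) and send it to the GPU
# Note: the chat template already prepends <bos> and the tokenizer adds another one,
# which is why the input below starts with two 2s; add_special_tokens=False avoids this
input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")
print(f"Model input (tokenized):\n{input_ids}\n")

# Generate outputs based on the tokenized input
outputs = llm_model.generate(**input_ids,
                             max_new_tokens=256,
                             do_sample=True)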
print(f"Model output (tokens):\n{outputs[0]}\n")

# Decode the output tokens to text
outputs_decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Model output (decoded):\n{outputs_decoded}\n")

Input text:
what is attention as described in the attention all you need paper? 
Model input (tokenized):
{'input_ids': tensor([[     2,      2,    106,   1645,    108,   5049,    603,   6137,    685,
           6547,    575,    573,   6137,    832,    692,   1476,   4368, 235336,
            107,    108,    106,   2516,    108]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
       device='cuda:0')}





Model output (tokens):
tensor([     2,      2,    106,   1645,    108,   5049,    603,   6137,    685,
          6547,    575,    573,   6137,    832,    692,   1476,   4368, 235336,
           107,    108,    106,   2516,    108,    886,    573,  19422,    576,
           573,  50289,   2262,   1646,  15213,   4368, 235269,   6137,    603,
          6908,    685,   6397, 235292,    109,    688,  75927,   6137,  66058,
          1417,   8563,    573,   2091,    577,  27104,   2369,   2113,    774,
          2167,   4942,    576,    573,   3772,  10629, 235269,   4998,   1280,
          3185,    573,   8761,   3668,    576,   3907,    575,    573,  10629,
        235265,   1417,   7154,    577,  16446,   1497, 235290,   4201,  43634,
          1865,   3907, 235269,    948,    708,   3695,  20305,    604,  13333,
          1582,    685,   5255,  17183, 235269,   6479,  17183, 235269,    578,
          2872,  39534, 235265,    109,    688, 129780,  12846, 235290,   6088,
          6137,  