## 2. RAG - Search and Answer

RAG goal: Retrieve relevant passages based on query and use those passages to augment an input into llm for more specific and relevant output


### Simmilarity search

Embeddings can be used for any type of data
> you can turn images, sounds and text into embeddings (and etc)

Comparing embeddings is known as simmilarity search, vector search, semantic search.

in our case, we want to auery our htmx book based on the semantic or "vibed".  

So if i search for "htmx application", I should get relavant passages to that text

In [1]:
import random

import torch
import numpy as np
import pandas as pd

device = "cuda" if torch.cuda.is_available() else "cpu"

# Import text and embeddings
text_chunks_and_embeddings_df = pd.read_csv("/content/drive/MyDrive/htmx-helper/text_chunks_and_embeddings_df.csv")

# Convert embeddings column back to np.array (it got convert to stringwhen we saved into csv)
text_chunks_and_embeddings_df['embedding'] = text_chunks_and_embeddings_df['embedding'].apply(lambda x : np.fromstring(x.strip("[]"), sep=" "))

# Convert embeddings into torch.tensor
embeddings = torch.tensor(np.stack(text_chunks_and_embeddings_df['embedding'].tolist(), axis=0), dtype=torch.float32).to(device)

# Convert text and embedding df to list of dicts
chap_and_chunks = text_chunks_and_embeddings_df.to_dict('records')

text_chunks_and_embeddings_df.head()


Unnamed: 0,chapter,sentence_chunk,chunk_char_count,chunk_word_count,chunk_token_count,embedding
0,JSON Data APIs,So far we have been focusing on using hypermed...,881,147,220.25,"[0.0273659118, 0.0403993875, -0.0349355265, 0...."
1,JSON Data APIs,"Now, believe it or not, we have been creating ...",859,152,214.75,"[0.0440846756, 0.0380228162, -0.0166515242, 0...."
2,JSON Data APIs,Should we include a Data API for Contact.app a...,987,156,246.75,"[0.0074886689, 0.0516783707, -0.0307607464, -0..."
3,JSON Data APIs,You want programmatic access to your system vi...,659,119,164.75,"[0.0210245363, 0.0760841295, -0.000929900678, ..."
4,JSON Data APIs,"As with the bulk-import example, this isn’t a ...",1047,181,261.75,"[0.0316230878, 0.0600564703, -0.0132916411, -0..."


In [2]:
text_chunks_and_embeddings_df['embedding']

Unnamed: 0,embedding
0,"[0.0273659118, 0.0403993875, -0.0349355265, 0...."
1,"[0.0440846756, 0.0380228162, -0.0166515242, 0...."
2,"[0.0074886689, 0.0516783707, -0.0307607464, -0..."
3,"[0.0210245363, 0.0760841295, -0.000929900678, ..."
4,"[0.0316230878, 0.0600564703, -0.0132916411, -0..."
...,...
625,"[0.0571991205, -0.0458471216, 0.0115438141, -0..."
626,"[0.0553417318, -0.054534886, -0.0073707113, -0..."
627,"[0.031214226, -0.0322923549, 0.0167680215, -0...."
628,"[0.0563624762, -0.00889210217, -0.00101378269,..."


In [3]:
# embeddings = np.stack(text_chunks_and_embeddings_df['embedding'].tolist(), axis=0)
# embeddings.shape

In [4]:
# Create model
from sentence_transformers import util, SentenceTransformer

embedding_model = SentenceTransformer(model_name_or_path='all-mpnet-base-v2', device=device)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Embeddings model ready!  

Lets create a small semantic search pipeline

In essence, we want to search for a query  

We can do so with the following steps:
1. Define a query string
2. Turn the query into an emeddings.
3. Perform a dot product or cosine similarity function between the text and the query embedding.
4. Sort the result from 3 in descending order

Note: to use the dot product comparison, ensure vector sizes are of same shape and tensor/datatypes are in same type

In [5]:
# 1 Define query
query = "htmx application"
print(f"Query: {query}")

# 2 Embed the query
query_embedding = embedding_model.encode(query, convert_to_tensor=True)

# 3 Get simmilarity scores with the dot product (se cosine simmilarity if outputs of models is not normalized)
from time import perf_counter as timer

start_time = timer()
dot_scores = util.dot_score(a=query_embedding, b=embeddings)[0]
end_time = timer()

print(f"[INFO] Time taken to get scores on {len(embeddings)} embeddings: {end_time - start_time:.5f} seconds")

# 4 Get the top k results in descending form
top_results_dot_product = torch.topk(dot_scores, k=5)
top_results_dot_product

Query: htmx application
[INFO] Time taken to get scores on 630 embeddings: 0.00320 seconds


torch.return_types.topk(
values=tensor([0.6847, 0.6566, 0.6539, 0.6492, 0.6072], device='cuda:0'),
indices=tensor([ 39, 447, 317, 305, 319], device='cuda:0'))

In [6]:
chap_and_chunks[39]

{'chapter': 'Tricks Of The Htmx Masters',
 'sentence_chunk': 'In this chapter we are going to look deeper into the htmx toolkit. We’ve accomplished quite a bit with what we’ve learned so far. Still, when you are developing Hypermedia-Driven Applications, there will be times when you need to reach for additional options and techniques. We will go over the more advanced attributes in htmx, as well as expand on the advanced details of attributes we have already used. Additionally, we will look at functionality that htmx offers beyond simple HTML attributes: how htmx extends standard HTTP request and responses, how htmx works with (and produces) events, and how to approach situations where there isn’t a simple, single target on the page to be updated. Finally, we will take a look at practical considerations when doing htmx development: how to debug htmx-based applications effectively, security considerations you will need to take into account when working with htmx, and how to configure th

We can see that searching over embeddings is very fast.  

But if you whad 10M+ embeddings, you likely want to create an index.

An indes is like letters in dictionary.  

For example searching for cat will start with "ca.." etc.  

A popular index library is Faiss (use in facebook search)

one of Faiss technique used in the library is approximate nearest neighbour (ANN)


Now lets make our vector search result pretty!

In [7]:
import textwrap

def print_wrapped(text, wrap_length=80):
  wrapped_text = textwrap.fill(text, wrap_length)
  print(wrapped_text)


In [8]:
print(f"Query: {query}\n")
print("Results: ")
# Loop through zipped together scores and indices from torch.topk
for score, idx in zip(top_results_dot_product[0], top_results_dot_product[1]):
  print(f"Score: {score:.4f}")
  print("Text: ")
  print_wrapped(chap_and_chunks[idx]['sentence_chunk'])
  print(f"Chapter: {chap_and_chunks[idx]['chapter']}")
  print("\n")

Query: htmx application

Results: 
Score: 0.6847
Text: 
In this chapter we are going to look deeper into the htmx toolkit. We’ve
accomplished quite a bit with what we’ve learned so far. Still, when you are
developing Hypermedia-Driven Applications, there will be times when you need to
reach for additional options and techniques. We will go over the more advanced
attributes in htmx, as well as expand on the advanced details of attributes we
have already used. Additionally, we will look at functionality that htmx offers
beyond simple HTML attributes: how htmx extends standard HTTP request and
responses, how htmx works with (and produces) events, and how to approach
situations where there isn’t a simple, single target on the page to be updated.
Finally, we will take a look at practical considerations when doing htmx
development: how to debug htmx-based applications effectively, security
considerations you will need to take into account when working with htmx, and
how to configure the beha

Note: we could potentially improve the order of these results with a training model. A model that has been trained specifically to take search results (e.g. the top 25 semantic results) and rank them in order of most likely top-1 to least likely.

for open source options: mixedbread-ai

### Functionizing our semantic search pipeline

Lets pull all of the steps above for semantic search into a function or two so we can repeat the flow smoothly.

In [9]:
def retrieve_relevant_resources(query: str,
                                embeddings: torch.tensor,
                                model: SentenceTransformer=embedding_model,
                                n_resources_to_return: int=5,
                                print_time: bool=True):
  """
  Embeds a query with model and returns top k scores and indices from embeddings.
  """

  # Embed the query
  query_embedding = model.encode(query, convert_to_tensor=True)

  # Get dot product scores on embeddings
  start_time = timer()
  dot_scores = util.dot_score(query_embedding, embeddings)[0]
  end_time = timer()

  if print_time:
    print(f"[INFO] Time taken to get scores on {len(embeddings)} embeddings: {end_time - start_time:.5f} seconds")

  # Get the top k results in descending form
  scores, indices = torch.topk(input=dot_scores,
                               k=n_resources_to_return
                               )

  return scores, indices

def print_top_results_and_scores(query: str,
                                 embeddings: torch.tensor,
                                 chap_and_chunks: list[dict]=chap_and_chunks,
                                 n_resources_to_return: int=5):
  """
  Finds relevant passages given a query and print them out along with their scores
  """
  scores, indices = retrieve_relevant_resources(query=query,
                                                embeddings=embeddings,
                                                n_resources_to_return=n_resources_to_return)

  # Loop through zipped together scores and indices from torch.topk
  for scores, idx in zip(scores, indices):
    print(f"Score: {score:.4f}")
    print("Text: ")
    print_wrapped(chap_and_chunks[idx]['sentence_chunk'])
    print("\n")

In [10]:
query = "Steps to integrate htmx"
# retrieve_relevant_resources(query=query, embeddings=embeddings)
print_top_results_and_scores(query=query, embeddings=embeddings)

[INFO] Time taken to get scores on 630 embeddings: 0.00010 seconds
Score: 0.6072
Text: 
Like hx-swap, hx-trigger can often be omitted when you are using htmx, because
the default behavior is typically what you want. Recall the default triggering
events are determined by an element’s type:Requests on input, textarea & select
elements are triggered by the change event. Requests on form elements are
triggered on the submit event. Requests on all other elements are triggered by
the click event. There are times, however, when you want a more elaborate
trigger specification. A classic example is the active search example we
implemented in Contact.app:The active search inputAn elaborate trigger
specification. This example took advantage of two modifiers available for the
hx-trigger attribute:Allows you to specify a delay to wait before a request is
issued.


Score: 0.6072
Text: 
This SHA can be found on the htmx website. We also mark the script as
crossorigin="anonymous" so no credentials wil

### Getting an LLm for our local generation
We want to focus on local generation.  
However, this process could also works well wioth llm via API.  

What is a generative LLM?
- Goes from input -> generate text output.  

Which LLM should we use?
Depends on local GPU capability such as :-
- VRAM space

#### Checking our local GPU VRAM memory availibility

In [11]:
# Get GPU memory
import torch
gpu_memory_bytes = torch.cuda.get_device_properties(0).total_memory
gpu_memory_gb = round(gpu_memory_bytes / (2**30))
print(f"Available GPU memory: {gpu_memory_gb}GB")

Available GPU memory: 15GB


Notes:
- For this project we wanted to use gemma model with instruction tuned specifications.
- we need to first accept terms and conditions on hugging face
- then we must login to huggingface via CLI to sign in and import the model.

In [12]:
# Note: the following is Gemma focused, however, there are more and more LLMs of the 2B and 7B size appearing for local use.
if gpu_memory_gb < 5.1:
  print(f"Your availability GPU memory is {gpu_memory_gb}GB, you may not have enough memory to run a Gemma LLM locally without quantization.")
elif gpu_memory_gb < 8.1:
  print(f"GPU memory: {gpu_memory_gb}GB | Recommended model: Gemma 2B in 4-bit precision")
  use_quantization_config = True
  model_id = "google/gemma-2b-it"
elif gpu_memory_gb < 19.0:
  print(f"GPU memory: {gpu_memory_gb}GB | Recommended model: Gemma 2B in float16 or Gemma 7B in 4-bit precision")
  use_quantization_config = False
  model_id = "google/gemma-2b-it"
elif gpu_memory_gb > 19.0:
  print(f"GPU memory: {gpu_memory_gb}GB | Recommended model: Gemma 7B in 4-bit or float16 precision")
  use_quantization_config = False
  model_id = "google/gemma-7b-it"

print(f"  - use_quantization_config: {use_quantization_config}")
print(f"  - model_id: {model_id}")

GPU memory: 15GB | Recommended model: Gemma 2B in float16 or Gemma 7B in 4-bit precision
  - use_quantization_config: False
  - model_id: google/gemma-2b-it


### Loading an LLM locally
We can load an LLM using HUgginf face `transformers`.  
The model suitable for google collab free tier is https://huggingface.co/google/gemma-2-2b-it


To get a model running local we are going to need a few things:
1. A quantization config(optional). a config on what precision to load our model in (e.g. 8bit, 4bit etc)
2. A model id - this will tell `transformers` which model/tokenizer to load
3. A tokenizor - this turns text into numbers ready for the LLM (note: a tokenizer is different from an embedding model)
4. An LLM model - This will be what we use for text genertion based on input!

> Note there are tricks on loading /making LLMs work faster. One of the newest one is Flash Attention 2


In [13]:
# !pip install bitsandbytes

In [14]:
# !huggingface-cli login

In [15]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers.utils import is_flash_attn_2_available

# 1. Create a quantization config
# Note: requires !pip install bitsandbytes accelerate
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_4bit=True,
                                         bnb_4bit_compute_dtype=torch.float16)

# Bonus: flash attention 2 = faster attention mechanism
# Flash Attention 2 reqiers a GPU with a compute capability score of 8.0+
# Our free tier in google colab is not capable for flash attention 2, However we will just functionize this and still write the code for flast attention 2
if (is_flash_attn_2_available) and (torch.cuda.get_device_capability(0)[0] >= 8):
  attn_implementation = "flash_attention_2"
else:
  attn_implementation = "sdpa" #scaled dot product attention
print(f"Using attention implementation: {attn_implementation}")

# 2. Pick a model we'd like to use
model_id = "google/gemma-2-2b-it" # we cahnged the model a bit with the new release of gemma2

# 3. Instantiate tokenizer (tokenizer turns text into token)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_id)

# 4. Instantiate the model
llm_model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=model_id,
                                                 torch_dtype=torch.float16,
                                                 quantization_config=quantization_config,
                                                 low_cpu_mem_usage=True,
                                                 attn_implementation=attn_implementation,
                                                 device_map='auto',
                                                 )

if not use_quantization_config:
  llm_model.to("cuda")


Using attention implementation: sdpa


tokenizer_config.json:   0%|          | 0.00/47.0k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.24M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.5M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/636 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/838 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/24.2k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.99G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/241M [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/187 [00:00<?, ?B/s]

In [16]:
llm_model

Gemma2ForCausalLM(
  (model): Gemma2Model(
    (embed_tokens): Embedding(256000, 2304, padding_idx=0)
    (layers): ModuleList(
      (0-25): 26 x Gemma2DecoderLayer(
        (self_attn): Gemma2Attention(
          (q_proj): Linear4bit(in_features=2304, out_features=2048, bias=False)
          (k_proj): Linear4bit(in_features=2304, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=2304, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=2048, out_features=2304, bias=False)
        )
        (mlp): Gemma2MLP(
          (gate_proj): Linear4bit(in_features=2304, out_features=9216, bias=False)
          (up_proj): Linear4bit(in_features=2304, out_features=9216, bias=False)
          (down_proj): Linear4bit(in_features=9216, out_features=2304, bias=False)
          (act_fn): PytorchGELUTanh()
        )
        (input_layernorm): Gemma2RMSNorm((2304,), eps=1e-06)
        (post_attention_layernorm): Gemma2RMSNorm((2304,), eps=1e-06)
        (pre_

In [18]:
def get_model_num_params(model: torch.nn.Module):
  return sum([param.numel() for param in model.parameters()])

print(f"Number of parameters in model: {get_model_num_params(model=llm_model):,}")

Number of parameters in model: 1,602,203,904


In [22]:
def get_model_size(model: torch.nn.Module):
  mem_params = sum([param.nelement() * param.element_size() for param in model.parameters()])
  mem_buffers = sum([buffer.nelement() * buffer.element_size() for buffer in model.buffers()])

  # Calculate memory sizes
  model_mem_bytes = mem_params + mem_buffers
  model_mem_mb = model_mem_bytes / (2**20)
  model_mem_gb = round(model_mem_bytes / (2**30))

  return {
      "model_mem_bytes": model_mem_bytes,
      "model_mem_mb": round(model_mem_mb, 2),
      "model_mem_gb": round(model_mem_gb, 2),
  }

get_model_size(model=llm_model)

{'model_mem_bytes': 2192270336, 'model_mem_mb': 2090.71, 'model_mem_gb': 2}

We get the size of our model.  

This means to load the current model we need minimum 2 gb vram on GPU

In [17]:
# Note, if you don't want to reinstall BNBs dependencies, append the `--no-deps` flag!
# !pip install --force-reinstall --no-deps 'https://github.com/bitsandbytes-foundation/bitsandbytes/releases/download/continuous-release_main/bitsandbytes-0.44.2.dev0-py3-none-manylinux_2_24_x86_64.whl'

### Generating text with our LLM

Let's generate text with our local LLM!

* Note: Some models have been trained/tuned to generate text with a specific template in mind.  

Because `gemma-2-2b-it` has been trained in a instruction-tuned manner, we should follow the instructions template for the best results

In [25]:
input_text = "How does htmx works?"
print(f"Input text:\n{input_text}")

# Create a prompt template for instruction tuned model
dialogue_template = [
    {"role": "user",
     "content": input_text}
]

# Apply the chat template
prompt = tokenizer.apply_chat_template(conversation=dialogue_template,
                                       tokenize=False,
                                       add_generation_prompt=True)

print(f"\nPrompt (formartted):\n{prompt}")

Input text:
How does htmx works?

Prompt (formartted):
<bos><start_of_turn>user
How does htmx works?<end_of_turn>
<start_of_turn>model



In [27]:
%%time

# Tokenize the input text (turn it into numbers) and send it to the GPU
input_ids = tokenizer(prompt,
                      return_tensors='pt').to('cuda')

# input_ids

# Generate outputs from local LLM
outputs = llm_model.generate(**input_ids,
                             max_new_tokens=256)
print(f"Model output (tokens):\n{outputs[0]}\n")

The 'batch_size' attribute of HybridCache is deprecated and will be removed in v4.49. Use the more precisely named 'self.max_batch_size' attribute instead.


Model output (tokens):
tensor([     2,      2,    106,   1645,    108,   2299,   1721,  64673,  17106,
          3598, 235336,    107,    108,    106,   2516,    108,   3991,  32619,
           603,    476,  22978,   9581,    674, 120646,    573,   3505,    576,
         29295,   2744,   8557, 235265,   1165, 235303, 235256,   6869,    577,
          1501,    665,  10154,    577,   3104,  12415,    578,  39351,   2744,
         12219,   2346,  60206,    611,   9121,  22978,  70774,   1154,   8071,
           689,  62173, 235265, 235248,    109,   4858, 235303, 235256,    476,
         25497,    576,   1368,  11079,  32619,   3598, 235292,    109,    688,
          5826,  54737,  66058,    109, 235287,   5231,   7538, 235290,  13983,
        127551,    591, 101516,   1245,    688,  11079,  32619,  22049,   2208,
           573,   2384,    576,   6934, 235290,   2043,  28256,    591, 101516,
        235275,    577,    953, 235290,   4636,  19319,   3381,    611,    573,
          6934, 2

In [28]:
# Decode the output token into text
outputs_decoded = tokenizer.decode(outputs[0])
print(f"Model output (decoded text):\n{outputs_decoded}")

Model output (decoded text):
<bos><bos><start_of_turn>user
How does htmx works?<end_of_turn>
<start_of_turn>model
HTMX is a JavaScript library that simplifies the development of interactive web applications. It's designed to make it easier to create dynamic and responsive web experiences without relying on traditional JavaScript frameworks like React or Angular. 

Here's a breakdown of how HTMX works:

**Core Concepts:**

* **Server-Side Rendering (SSR):** HTMX leverages the power of server-side rendering (SSR) to pre-render HTML content on the server. This means the HTML is generated and sent to the client, ready for the browser to display.
* **Client-Side Interactions:** HTMX uses a combination of HTML elements and JavaScript to handle user interactions. It allows you to create dynamic elements and modify the page content without relying on full-blown JavaScript frameworks.
* **Data Fetching:** HTMX uses a simple and efficient way to fetch data from the server using AJAX requests. Th

In [29]:
# Question generated by GPT4
gpt4_questions = [
    "How does HTMX modify traditional HTML to handle AJAX requests without writing JavaScript?",
    "How does HTMX handle partial page updates, and how do I specify which elements should be replaced?",
    "How does HTMX integrate with backend frameworks like Flask, Django, or Node.js?",
    "How can I use HTMX for real-time updates, such as notifications or live data feeds?",
    "How does HTMX compare to traditional JavaScript frameworks like React or Vue, and when should I choose it over them?"
]

manual_questions = [
    "What is htmx?",
    "Why use htmx?",
    "How does htmx performed compared to other library?",
]

query_list = gpt4_questions + manual_questions
query_list

['How does HTMX modify traditional HTML to handle AJAX requests without writing JavaScript?',
 'How does HTMX handle partial page updates, and how do I specify which elements should be replaced?',
 'How does HTMX integrate with backend frameworks like Flask, Django, or Node.js?',
 'How can I use HTMX for real-time updates, such as notifications or live data feeds?',
 'How does HTMX compare to traditional JavaScript frameworks like React or Vue, and when should I choose it over them?',
 'What is htmx?',
 'Why use htmx?',
 'How does htmx performed compared to other library?']

In [30]:
import random

query = random.choice(query_list)
print(f"Query: {query}")

# Get just the scores and indices of top related results
scores, indices = retrieve_relevant_resources(query=query, embeddings=embeddings)

scores, indices

Query: How does HTMX handle partial page updates, and how do I specify which elements should be replaced?
[INFO] Time taken to get scores on 630 embeddings: 0.00010 seconds


(tensor([0.7008, 0.6320, 0.6212, 0.6195, 0.6139], device='cuda:0'),
 tensor([ 60,  61, 218,  63, 215], device='cuda:0'))

### Augmenting our prompt with context items

We've done retrieval.  
We've done generation.  

Time to augment!

The concept of augmenting a prompt with context items is also reffered as prompt engineering.  

Prompt engineering is an active field of research and many new styles and techniques are being found out.  

However, there are a fair new techniques that works quite well.


We're going to use a couple of prompting technique
1. Give clear instructions.
2. Give a few examples of input/outputs (e.g. given this input, I'd like this output).
3. Give them room to think (e.g. create a scratchpad/"show your working space"/"Lets think step by step...")

Lets create a function to format promt with context items

In [54]:
def prompt_formatter(query: str,
                     context_items: list[dict]) -> str:
  context = "- " + "\n- ".join([item['sentence_chunk'] for item in context_items])

#                      yt_base_prompt = """
# Based on the following context items, please answer the query.
# Give yourself room to think by extracting relevant passages from the context before answering the query.
# Don't return the thinking, only return the answer.
# Make sure your answers are as explanatory as possible.
# Use the following examples as reference for the ideal answer style.
# \nExample 1:
# Query: What are the fat-soluble vitamins?
# Answer: The fat-soluble vitamins include Vitamin A, Vitamin D, Vitamin E, and Vitamin K. These vitamins are absorbed along with fats in the diet and can be stored in the body's fatty tissue and liver for later use. Vitamin A is important for vision, immune function, and skin health. Vitamin D plays a critical role in calcium absorption and bone health. Vitamin E acts as an antioxidant, protecting cells from damage. Vitamin K is essential for blood clotting and bone metabolism.
# \nExample 2:
# Query: What are the causes of type 2 diabetes?
# Answer: Type 2 diabetes is often associated with overnutrition, particularly the overconsumption of calories leading to obesity. Factors include a diet high in refined sugars and saturated fats, which can lead to insulin resistance, a condition where the body's cells do not respond effectively to insulin. Over time, the pancreas cannot produce enough insulin to manage blood sugar levels, resulting in type 2 diabetes. Additionally, excessive caloric intake without sufficient physical activity exacerbates the risk by promoting weight gain and fat accumulation, particularly around the abdomen, further contributing to insulin resistance.
# \nExample 3:
# Query: What is the importance of hydration for physical performance?
# Answer: Hydration is crucial for physical performance because water plays key roles in maintaining blood volume, regulating body temperature, and ensuring the transport of nutrients and oxygen to cells. Adequate hydration is essential for optimal muscle function, endurance, and recovery. Dehydration can lead to decreased performance, fatigue, and increased risk of heat-related illnesses, such as heat stroke. Drinking sufficient water before, during, and after exercise helps ensure peak physical performance and recovery.
# \nNow use the following context items to answer the user query:
# {context}
# \nRelevant passages: <extract relevant passages from the context here>
# User query: {query}
# Answer:
#                      """
  base_prompt = """
Based on the following context items, please answer the query.
Give yourself room to think by extracting relevant passages from the context before answering the query.
Don't return the thinking, only return the answer.
Make sure your answers are as explanatory as possible. because water plays key roles in maintaining blood volume, regulating body temperature, and ensuring the transport of nutrients and oxygen to cells. Adequate hydration is essential for optimal muscle function, endurance, and recovery. Dehydration can lead to decreased performance, fatigue, and increased risk of heat-related illnesses, such as heat stroke. Drinking sufficient water before, during, and after exercise helps ensure peak physical performance and recovery.
\nNow use the following context items to answer the user query:
{context}
\nRelevant passages: <extract relevant passages from the context here>
User query: {query}
Answer:
"""
  base_prompt = base_prompt.format(context=context,
                                                 query=query)

  #  Create a prompt template following the hf documentation (for instruction tuned model)
  dialogue_template = [
      {"role": "user",
      "content": base_prompt}
      ]

  #  Apply the chat template
  prompt = tokenizer.apply_chat_template(conversation=dialogue_template,
                                                          tokenize=False,
                                                          add_generation_prompt=True)

  return prompt

query = random.choice(query_list)
print(f"Query: {query}")

# Get relevents resorces
scores, indices = retrieve_relevant_resources(query=query,
                                              embeddings=embeddings)

# Create a list of context items
context_items = [chap_and_chunks[i] for i in indices]

# Format the prompt
prompt = prompt_formatter(query=query,
                          context_items=context_items)

print(prompt)

Query: How does htmx performed compared to other library?
[INFO] Time taken to get scores on 630 embeddings: 0.00006 seconds
<bos><start_of_turn>user
Based on the following context items, please answer the query.
Give yourself room to think by extracting relevant passages from the context before answering the query.
Don't return the thinking, only return the answer.
Make sure your answers are as explanatory as possible. because water plays key roles in maintaining blood volume, regulating body temperature, and ensuring the transport of nutrients and oxygen to cells. Adequate hydration is essential for optimal muscle function, endurance, and recovery. Dehydration can lead to decreased performance, fatigue, and increased risk of heat-related illnesses, such as heat stroke. Drinking sufficient water before, during, and after exercise helps ensure peak physical performance and recovery.

Now use the following context items to answer the user query:
- Htmx itself is, of course, written in J

Prompt example:
Based on the following contexts:
-asdasdasdasd
-dsfsdfsdfsdf
-sfgdfggdfgdfgdfgdf
-ghfjfghfghfghfghfgh
-tytryrtyrtyrtyrtyrty

Please answer the following query: Explained htmx to me.  
Answer:

In [55]:
%%time

input_ids = tokenizer(prompt,
                      return_tensors='pt').to('cuda')

# Generate outputs from local LLM
outputs = llm_model.generate(**input_ids,
                             temperature=0.7, # equivalent to halucinations level, 0 -> no hallucination, 1 -> lots of hallucination
                             do_sample=True, # Whether ot not yo use sampling : read-resource :: Google chip-huyen sampling
                             max_new_tokens=256)

# Turn the output tokens into text
output_text = tokenizer.decode(outputs[0])
print(f"Query: {query}")
print(f"RAG answer:\n{output_text.replace(prompt, '')}")

Query: How does htmx performed compared to other library?
RAG answer:
<bos>htmx focuses on simplifying HTML by minimizing the need for scripting. It's not about replacing JavaScript entirely but rather reducing the amount of code and making it easier to manage.  It plays well with other libraries and offers a simplified approach to web development. 
<end_of_turn>
CPU times: user 3.17 s, sys: 99.6 ms, total: 3.26 s
Wall time: 3.26 s


### Functionize our LLM answering feature

Wouldn't it be cool if the RAG pipeline worked from a single function?

E.g. you input a query and you get a generated answer + optionally also get the source documents (the context) where that answer was generated from.  

Lets make a function to do it!

In [60]:
def ask(quer: str,
        temperature: float=0.7,
        max_new_tokens: int=256,
        format_answer_text=True,
        return_answer_only=True,):
  """
  Takes a query, finds relevant resources/context and generates an answer to the query based on the relevant resources.
  """

  # RETRIEVAL
  # Get just the scores and indices of top related results
  scores, indices = retrieve_relevant_resources(query=query, embeddings=embeddings)

  # Create a list of context items
  context_items = [chap_and_chunks[i] for i in indices]

  # Add score to context items
  for i, item in enumerate(context_items):
    item['score'] = scores[i].cpu()

  # AUGMENTATION
  # Create the prompt and format it with context items
  prompt = prompt_formatter(query=query,
                            context_items=context_items)

  # GENERATION
  input_ids = tokenizer(prompt,
                        return_tensors='pt').to('cuda')

  # Generate outputs from local LLM. Check out hf generation config
  outputs = llm_model.generate(**input_ids,
                               temperature=temperature,
                               do_sample=True,
                               max_new_tokens=max_new_tokens)

  # Decode thr tokens into text
  output_text = tokenizer.decode(outputs[0])

  # Format the answer
  if format_answer_text:
    # Replace prompt and special tokens
    output_text = output_text.replace(prompt, '').replace("<bos>", '').replace("<end_of_turn>", '')

  if return_answer_only:
    return output_text
  else:
    return output_text, context_items


In [57]:
ask("What are the htmx thinking styles")

[INFO] Time taken to get scores on 630 embeddings: 0.00007 seconds


"The passage highlights that htmx  plays well with other libraries, making it easy to integrate them when needed, and it doesn't aim to completely replace existing frameworks by offering a simpler approach. \n<end_of_turn>"

In [61]:
query = random.choice(query_list)
print(f"Query: {query}")
ask(query,
    # temperature=0.2,
    # return_answer_only=False,
    )

Query: How does HTMX modify traditional HTML to handle AJAX requests without writing JavaScript?
[INFO] Time taken to get scores on 630 embeddings: 0.00008 seconds


'HTMX extends HTML to handle AJAX requests without writing JavaScript. It uses attributes to trigger HTTP requests and then replaces the content of the page with the response, rather than replacing the entire page. Here\'s how it works:\n\n1. **HTMX Attributes:** HTMX utilizes special attributes on HTML elements to initiate HTTP requests. These include `hx-get`, `hx-post`, `hx-put`, `hx-patch`,  `hx-delete`, `hx-trigger`.\n2. **HTTP Request:** These attributes are used to trigger HTTP requests, bypassing the need for complex JavaScript code.\n3. **HTML Response:** HTMX expects HTML responses from the server.\n4. **Content Swap:** HTMX utilizes "transclusion" to replace the page content with the HTML response without rewriting the entire page, thus minimizing the load time. \n\nIn essence, HTMX leverages the browser\'s native HTML parser to process the response efficiently. \n'

## Summary
* RAG = powerful technique for generating text based on references documents.
* Hardware use = use GPU where possible to accelerate embedding creation and LLM generation.
  * Keep in mind the limitation on your local hardware
* Many open-source embedding models and LLMs starting to become available, keep experimenting to find which is best.