# **Comparison of Embedding Models & How Other Variables Affect RAG Performance**

This notebook analyzes the effect that different embedding models and other variables, such as chunk size, have on RAG performance.

## Why should you read this notebook?  

You want to learn:  
- How to do basic Retrieval Augmented Generation (RAG)  
- How different variables like chunk size, overlap, etc. affect model performance  

## Pre-requisites

If you'd like to reproduce this notebook, you'll need:  

- A Hugging Face account  
- (optional) Google Drive  
- [Access to Llama 2](https://llama.meta.com/llama-downloads/)  
- A GPU (a free NVIDIA T4 on Google Colab is fine for Llama 7B & 13B models)  
- Familiarity with Python  

## Introduction

Retrieval Augmented Generation (RAG) is a technique for improving language model performance in question answering by providing additional context in the prompt. The technique compares a given question against a related body of text, retrieves the most similar passages, and asks a language model to answer the question with the help of this additional information. Unlike a simple keyword search, however, we create [embeddings](/handbook/embeddings) of the question and dataset, then perform a *semantic search* to retrieve the snippets that best match the question's *meaning*, regardless of whether exact keyword matches exist.  

RAG is often a good option when you don't have enough data to fine-tune, but you still want to teach the model about a new topic.  

### RAG Steps
1. **Chunk** - Take a dataset (in this notebook we use a PDF of recent COVID-19 statistics) and split it into chunks of approximately 200-4000 [tokens](/handbook/tokenization) per chunk.
2. **Embed** - Using an embedding model, such as OpenAI's `text-embedding-ada-002`, create embeddings for the chunks and a given question.  
3. **Retrieve** - Compare the question embedding to the dataset embeddings, finding the top-k most similar results (typically we use cosine similarity or dot product).  
4. **Augment** - Using the indices of the top-k embedding results, get the corresponding chunks of text and include these along with the question in a prompt to a language model.

With the additional context, hopefully the language model can better answer the question.  
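
The four steps above can be sketched end to end in a few lines. This is only a toy illustration: `toy_embed` is a hypothetical stand-in that counts words, so it behaves like keyword matching rather than a real semantic embedding model, and the three chunks are made-up sentences loosely modeled on the dataset.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    # Hypothetical stand-in for an embedding model: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, chunks, k):
    # Rank chunks by similarity to the question and keep the top k.
    q = toy_embed(question)
    return sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)[:k]

def augment(question, snippets):
    # Build a prompt that places the retrieved snippets ahead of the question.
    context = "\n\n".join(snippets)
    return (f"Here are some snippets:\n\n{context}\n\n"
            f"Answer the following question using only the snippets above.\n\n{question}")

# 1. Chunk (already split here, purely for illustration)
chunks = [
    "Over 1.1 million new cases were reported globally.",
    "The African Region reported a decrease in new cases.",
    "Hospitalizations are tracked separately from cases.",
]

# 2-4. Embed, retrieve, augment
question = "How many new cases were reported globally?"
top_chunks = retrieve(question, chunks, k=1)
prompt = augment(question, top_chunks)
print(prompt)
```

In the rest of this notebook the same shape is used, with real embedding models in place of `toy_embed` and Llama 2 consuming the final prompt.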

### Variables Affecting Performance

There are several variables in the above steps that can significantly affect how well the language model performs in answering questions.  

1. **Data quality** - In the context of these experiments, data quality means that the text format is free of special characters, complex formatting, discontinuities, etc.
2. **Chunk size** - If chunks are too small, they don't contain enough information to provide a prompt with sufficient context. If chunks are too large, the relevant information is diluted, and the language model's attention may not focus on the right part of the context.  
3. **Chunk overlap** - We typically configure chunks to overlap, i.e. the end of one sequence is repeated at the beginning of the next sequence. Overlapping helps preserve context between chunks. An overlap of at least 10% is often used.  
4. **Embedding model** - Not all embedding models are created equal.
5. **Similarity method** - Given two vectorized embeddings, we can calculate and rank their similarity in a number of ways, but most often we use either cosine—which measures the angle between the vectors—or their dot product—which accounts for both angle and respective magnitudes.  
6. **k-samples** - How many of the most relevant snippets from the dataset should we include as context to the language model? Too many and the information is too diffuse; too few and we might not pass enough context to answer a given question. Finding the right value of **k** is important.
7. **Language model** - We must have a model that is sufficiently advanced to parse the question and context and answer appropriately.
8. **Quantization** - Using [quantized models](/handbook/quantization) can degrade model performance. Additionally, the method of quantization can have an effect.   
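
To make chunk size and chunk overlap concrete, here is a minimal sketch of a sliding-window chunker. The function name `chunk_tokens` and its parameters are hypothetical (the dataset for this notebook was prepared with the functions in the Utility Functions notebook, not with this code), and plain integers stand in for token IDs.

```python
def chunk_tokens(tokens, chunk_size, overlap_fraction):
    """Split a token list into windows of `chunk_size` tokens.

    Each window repeats the last `overlap_fraction` of the previous window.
    """
    overlap = round(chunk_size * overlap_fraction)
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap_fraction must be less than 1")
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks

tokens = list(range(100))  # stand-in for 100 token IDs
small_overlap = chunk_tokens(tokens, chunk_size=10, overlap_fraction=0.1)
large_overlap = chunk_tokens(tokens, chunk_size=10, overlap_fraction=0.5)
print(len(small_overlap), len(large_overlap))
```

With the same text and chunk size, the larger overlap produces more chunks, which means more rows to embed and search.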

## Experimental Setup

### Dataset  

The point of RAG is to augment a language model that might be unfamiliar with a given topic, for example, something obscure, proprietary, or that has occurred since the model was trained. It's this last scenario that we'll focus on in this notebook.    

In these experiments, I use Llama 2 as the language model, which was trained starting in February 2023. To ensure that the model does not already know the answers to the questions I'll be asking, I ask questions about COVID-19 statistics reported between December 2023 and January 2024.  

- PDF of the dataset: [World Health Organization COVID-19 Epidemiological Update - Edition 163 (pdf)](https://www.who.int/publications/m/item/covid-19-epidemiological-update---19-january-2024)  

- The chunked dataset is available on [my Hugging Face](https://huggingface.co/datasets/gadkins/who-covid-19-epidemiological-update-edition-163) in CSV format.  
- You can find the functions I used for preparing this dataset in the [Utility Functions notebook](/notebooks/data-processing/utility-functions)  

### Variables Tested

In this notebook, I do not attempt to perform an exhaustive testing of each permutation of variables listed below. Instead, I select a few illustrative configurations and leave it up to the reader to experiment when reproducing my results.  


| Chunk Size (tokens) | Overlap | Embedding Model | Similarity method     | Top-k | Language Model | Quantization |
|------------|---------------|-----------------|------------------|-----------|----------------|----------------|
| 200        | 10%           | Llama 2         | Cosine | 1         | Llama 2 7B     | None (full-precision) |
| 500        | 25%           | OpenAI         | Dot Product | 3         | Llama 2 13B     | GPTQ |
| 1000        | 50%           | MS MARCO         | - | 7         | Llama 2 70B     | `bitsandbytes` (NF4) |
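
To see why exhaustive testing is impractical, the table above can be expanded into a full grid. This is a rough sketch: the dictionary keys are labels chosen here for illustration, and the values are copied from the table.

```python
from itertools import product

grid = {
    "chunk_size": [200, 500, 1000],
    "overlap": [0.10, 0.25, 0.50],
    "embedding_model": ["Llama 2", "OpenAI", "MS MARCO"],
    "similarity": ["cosine", "dot product"],
    "top_k": [1, 3, 7],
    "language_model": ["Llama 2 7B", "Llama 2 13B", "Llama 2 70B"],
    "quantization": ["none", "GPTQ", "bitsandbytes NF4"],
}

# One dict per combination of values, in the order the keys are listed above
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))
```

Every one of those combinations would need its own evaluation run, so picking a handful of configurations is the pragmatic choice on a free GPU.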

---

## Summary of Results

1. **Data quality** - Cleaning the dataset with the help of an LLM like `gpt-3.5-turbo` helped the chunking process and, thus, helped Llama answer questions.  
2. **Chunk size** - 500-token chunks yielded more correct answers than 200-token chunks, although inference was significantly slower. This makes sense, as a larger prompt requires more compute and memory and can incur greater latency. Interestingly, Llama missed some questions with 500-token chunks that it got right with 200-token chunks. I did not try 1000-token chunks, since my dataset is only 24 pages and the 500-token chunks were already taking almost 10 minutes to answer 10 questions on a free T4.    
3. **Chunk overlap** - 50% overlap performed slightly better than 10%. Larger overlap results in more rows in the training set, so it might be interesting to see how a large overlap and a larger k-value compare.  
4. **Embedding model** - OpenAI's `text-embedding-ada-002` just barely beat out MS MARCO. Llama 2's embedding model performed noticeably worse. If open-source is important to you, MS MARCO is a good alternative to OpenAI.    
5. **Similarity method** - With OpenAI embeddings and Llama 2 13B (NF4), dot product scored slightly higher than cosine similarity (5/10 versus 4.2/10).  
6. **k-samples** - Returning the top-3 most similar samples seemed to be the sweet spot.
7. **Language model** - Unsurprisingly, the larger 13B model was able to understand larger contexts better than the smaller 7B model. I did not test with the 70B model, but I suspect it would have done better, especially on questions where the 13B model got confused (e.g. answered a question I didn't ask).  
8. **Quantization** - Quantized models fared worse than full-precision models. GPTQ fared slightly worse than `bitsandbytes`, although not by much.  



# Setup

In [1]:
# Authenticate to Hugging Face to pull and push models
!pip install huggingface_hub
from huggingface_hub import notebook_login

notebook_login()




### Connect Google Drive

Optional but saves time by caching the model and allows for training data to be saved on Drive.

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
import os
cache_dir = "/content/drive/My Drive/huggingface_cache"
os.makedirs(cache_dir, exist_ok=True) # Ensure the directory exists

In [6]:
# https://stackoverflow.com/questions/56081324/why-are-google-colab-shell-commands-not-working
import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

# Installation and Loading

## GPTQ
This quantization method is faster than `bitsandbytes`, although probably a little less accurate.

In [None]:
!pip install -q -U transformers peft accelerate optimum
!pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
!pip install -q datasets

In [8]:
from transformers import AutoTokenizer, AutoModelForCausalLM

# model_id = "TheBloke/Llama-2-7B-chat-GPTQ"
model_id = "TheBloke/Llama-2-13B-chat-GPTQ"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    cache_dir=cache_dir
    )

tokenizer = AutoTokenizer.from_pretrained(model_id)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



## `bitsandbytes` NF4

This is the same quantization method described in my [QLoRA notebook](/notebooks/fine-tuning/qlora).

In [None]:
# !pip install -q -U bitsandbytes
# !pip install -q -U git+https://github.com/huggingface/transformers.git
# !pip install -q -U git+https://github.com/huggingface/peft.git
# !pip install -q -U git+https://github.com/huggingface/accelerate.git
# !pip install -q datasets

In [None]:
# import torch
# from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# # model_id = "meta-llama/Llama-2-7b-chat-hf"
# model_id = "meta-llama/Llama-2-13b-chat-hf"
# bnb_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_use_double_quant=True,
#     bnb_4bit_quant_type="nf4",
#     bnb_4bit_compute_dtype=torch.bfloat16
# )

# tokenizer = AutoTokenizer.from_pretrained(model_id)
# model = AutoModelForCausalLM.from_pretrained(
#     model_id,
#     quantization_config=bnb_config,
#     device_map={"":0},
#     cache_dir=cache_dir)

# Load the Dataset

Note: I've noticed that [cleaning up the dataset](/notebooks/data-processing/utility-functions#Clean-training-data-with-GPT-3.5-turbo-and-create-test-data) using an LLM can improve embedding retrieval performance slightly. In this example I used a cleaned COVID-19 statistics dataset.

In [15]:
from datasets import load_dataset

data = load_dataset("gadkins/who-covid-19-epidemiological-update-edition-163")

# Print first row of 'train' and 'test'
print("First row of train:", data['train'][0])
print("First row of test:", data['test'][0])

data = data.map(lambda samples: tokenizer(samples["Text"]), batched=True)


First row of train: {'Text': '1 \n COVID -19 Epidemiological Update  \nEdition  163 published  19 January  2024 \nIn this edition:   \n• Key highlights   \n• Global overview  \n• Hospitalizations and ICU admissions  \n• SARS -CoV-2 variants of interest and variants under monitoring  \n• WHO regional overviews   \n \n \nKey highlights  \n \n• Globally, during the 28 -day period from 11  December 2023  to 7 January 2024 , 106 countries report ed COVID -\n19 cases and 51 countries  report ed COVID -19 deaths. Note  that this does not reflect the actual number of \ncountries where cases or deaths are occurring , as many  countries have stopped or changed frequency of \nreporting .  \n• From the available data , the number of reported cases has increased while deaths have decreased during \nthe 28 -day period , with over 1.1 million new cases and 8700  new deaths , an increase of 4% and a decrease \nof 26%, respectively, compared to the previous 28 days ( 13 November to 10 December 2023 ). 


# Test Embeddings
Here, we're calculating the embeddings for each row of training data, by averaging across the token embeddings for each token in each row.

We're also using a test question to pick out and print the three most relevant snippets.
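
The averaging step can be illustrated on a tiny example. The sketch below uses plain Python lists and a hypothetical `mean_pool` helper; in the notebook itself the same idea is expressed with `torch.mean` over the token dimension.

```python
def mean_pool(token_embeddings):
    """Average a list of equal-length token vectors into a single vector."""
    n_tokens = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(vec[d] for vec in token_embeddings) / n_tokens for d in range(dim)]

# Two tokens, each with a 3-dimensional embedding (made-up values)
token_embeddings = [
    [1.0, 0.0, 2.0],
    [3.0, 4.0, 0.0],
]
row_embedding = mean_pool(token_embeddings)
print(row_embedding)  # [2.0, 2.0, 1.0]
```

However long the input text is, the result is one fixed-size vector per row, which is what makes rows and questions directly comparable.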

## Llama Embeddings

In [None]:
import torch
from torch.nn.functional import cosine_similarity
import numpy as np

n_samples=3

# Function to get average embeddings
def get_avg_embedding(input_ids):
    with torch.no_grad():
        embeddings = model.get_input_embeddings()(input_ids)
    avg_embedding = torch.mean(embeddings, dim=1)
    return avg_embedding

# Calculate average embeddings for each row in 'train' dataset
train_embeddings = []
for example in data['train']:
    input_ids = torch.tensor(example['input_ids']).unsqueeze(0).to("cuda")  # Moved to CUDA
    avg_embedding = get_avg_embedding(input_ids)
    train_embeddings.append(avg_embedding)
train_embeddings = torch.stack(train_embeddings).squeeze()

# Tokenize and get embedding for the question
question = "For the 28-day period from 11 December 2023 to 7 January 2024, how " + \
"many new cases of COVID-19 were reported globally?"
question_input_ids = tokenizer(question, return_tensors="pt", truncation=True, \
                               max_length=500)["input_ids"].to("cuda")  # Moved to CUDA
question_embedding = get_avg_embedding(question_input_ids).squeeze()

# Calculate cosine similarity between question and each row in 'train'
# Convert to float32 and move to CPU
cosine_similarities = cosine_similarity(question_embedding.unsqueeze(0), \
                                        train_embeddings, dim=1).cpu().float()

# Sort and find top n_samples most similar rows
top_indices = torch.topk(cosine_similarities, n_samples).indices.tolist()

for idx in top_indices:
    print("Similarity Score:", cosine_similarities[idx].item())
    print("Text:", data['train'][idx]['Text'])
    print("---")


Similarity Score: 0.8447265625
Text: V-2 sequence data and metadata from GISAID, from 3 July to 31 December 2023.  
 
 
 
 
A 
B  
22 
 Additional resources  
 
• Tracking SARS -CoV-2 Variants  
• WHO statement on updated tracking system on SARS -CoV-2 variants of concern and variants of interest   
• SARS -CoV-2 variant risk evaluation framework, 30 August 2023  
• WHO JN.1 Initial Risk Evaluation, 1 3 December 2023  
• WHO BA.2.86 Initial Risk Evaluation, 21 November 2023  
• WHO EG.5 Updated Risk Evaluation, 21 November 2023  
• WHO XBB.1.5 Updated Risk Assessment, 20 June 2023  
• WHO XBB.1.16 Updated Risk Assessment, 5 June 2023    
22 WHO  regional  overviews   
Data for 11 December  2023  to 7 January 2024  
 
African  Region 
The African Region reported over 3354 new cases, a 63% decrease as compared to the previous 28 -day period. Five (10%) of the 50 countries for which data are 
available reported increases in new cases of 20% or greater, with the highest proportional increa

## OpenAI embeddings

In [None]:
# https://stackoverflow.com/questions/56081324/why-are-google-colab-shell-commands-not-working
import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

In [None]:
import os
from getpass import getpass

# Prompt for the API key without echoing it into the notebook output
api_key = getpass("Please enter your OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = api_key


In [None]:
!pip install openai

In [None]:
import os
from openai import OpenAI

# Get the API key from environment variable
client = OpenAI(
  api_key=os.environ['OPENAI_API_KEY'],
)

In [None]:
!pip install tqdm



In [None]:
# Reload embeddings if previously saved to drive
import csv
import os

import numpy as np
import torch

def reload_embeddings(cache_dir, embeddings_filename):
    csv_path = os.path.join(cache_dir, embeddings_filename)

    # Initialize an empty list to hold the read data
    train_embeddings_list = []

    # Read from CSV
    with open(csv_path, 'r', newline='') as csvfile:
        csvreader = csv.reader(csvfile)
        for row in csvreader:
            train_embeddings_list.append(list(map(float, row)))

    # Convert list of lists to a numpy array and then to a tensor
    train_embeddings = torch.tensor(np.array(train_embeddings_list))

    # Convert tensor to float32
    train_embeddings = train_embeddings.to(torch.float32)

    print("Reloaded train_embeddings:", train_embeddings)
    return train_embeddings

def get_openai_embedding(text):
    embedding = client.embeddings.create(
        input = [text],
        model="text-embedding-ada-002").data[0].embedding
    return torch.tensor(embedding)

def embeddings_to_csv(train_embeddings, cache_dir, embeddings_filename):
    # Check if train_embeddings is already a tensor
    if not isinstance(train_embeddings, torch.Tensor):
        # Convert list of tensors to a single tensor
        train_embeddings = torch.stack(train_embeddings)

    # Save train_embeddings to a CSV file
    import csv

    # Convert tensor to numpy array and then to list
    train_embeddings_list = train_embeddings.cpu().numpy().tolist()

    csv_path = os.path.join(cache_dir, embeddings_filename)
    with open(csv_path, 'w', newline='') as csvfile:
        csvwriter = csv.writer(csvfile)
        csvwriter.writerows(train_embeddings_list)

    print(f"Saved train_embeddings to {csv_path}")

def create_embeddings():
    from tqdm import tqdm
    import time

    # Initialize train_embeddings as an empty list
    train_embeddings = []

    # Fetch embeddings for training data
    for i in tqdm(range(0, len(data['train']), 1)):  # Assume each API call handles one item
        example = data['train'][i]
        text = example['Text']
        embedding = get_openai_embedding(text)
        train_embeddings.append(embedding)
        time.sleep(2)  # Sleep for 2 seconds to avoid rate limiting

    return train_embeddings

# Check if embeddings file already exists in cache, else create new embeddings
embeddings_filename="train_embeddings.csv"
if os.path.exists(os.path.join(cache_dir, embeddings_filename)):
    train_embeddings = reload_embeddings(cache_dir, embeddings_filename)
    print("Loaded train_embeddings from cache")
else:
    train_embeddings = create_embeddings()
    embeddings_to_csv(train_embeddings, cache_dir, embeddings_filename)
    print(f"Created new train_embeddings. Saved embeddings to {os.path.join(cache_dir, embeddings_filename)}")

In [None]:
# from tqdm import tqdm
# import time

# def get_openai_embedding(text):
#     embedding = client.embeddings.create(
#         input = [text],
#         model="text-embedding-ada-002").data[0].embedding
#     return torch.tensor(embedding)

# # Initialize train_embeddings as an empty list
# train_embeddings = []

# # Fetch embeddings for training data
# for i in tqdm(range(0, len(data['train']), 1)):  # Assume each API call handles one item
#     example = data['train'][i]
#     text = example['Text']
#     embedding = get_openai_embedding(text)
#     train_embeddings.append(embedding)
#     time.sleep(2)  # Sleep for 2 seconds to avoid rate limiting

100%|██████████| 64/64 [02:18<00:00,  2.17s/it]


In [None]:
# # Convert list to float32 tensor
# train_embeddings = torch.tensor(np.array(train_embeddings))
# train_embeddings = train_embeddings.to(torch.float32)

In [None]:
# Fetch embedding for the question
question = "For the 28-day period from 11 December 2023 to 7 January 2024, how " + \
"many new cases of COVID-19 were reported globally?"
question_embedding = get_openai_embedding(question)

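# Calculate cosine similarity between question and each row in 'train'
# (freshly created embeddings are a list of tensors, so stack them first)
if not isinstance(train_embeddings, torch.Tensor):
    train_embeddings = torch.stack(train_embeddings)
cosine_similarities = cosine_similarity(question_embedding.unsqueeze(0), train_embeddings, dim=-1)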

# Sort and find top n_samples most similar rows
top_indices = torch.topk(cosine_similarities, n_samples).indices.tolist()

# Output similar texts
for idx in top_indices:
    print("Similarity Score:", cosine_similarities[idx].item())
    print("Text:", data['train'][idx]['Text'])
    print("---")

Similarity Score: 0.8974971175193787
Text: cases remained stable during the 28-day period of 11 December 2023 to 7 January 2024 as compared to the previous 28-day period, with over 1.1 million new cases reported. The number of new weekly deaths decreased by 26% as compared to the previous 28-day period, with 8700 new fatalities reported. As of 7 January 2024, over 774 million confirmed cases and over 7 million deaths have been reported globally. According to estimates obtained from viral loads in wastewater surveillance, clinical detection of cases underestimated the real burden 2 to 19-fold. Reported cases do not accurately represent infection rates due to the reduction in testing and reporting globally. During this 28-day period, only 45% of countries reported at
---
Similarity Score: 0.8882509469985962
Text: COVID-19 Epidemiological Update Edition 163 published 19 January 2024

In this edition:
• Key highlights
• Global overview
• Hospitalizations and ICU admissions
• SARS-CoV-2 var

In [None]:
# import numpy as np

# # Check if train_embeddings is already a tensor
# if not isinstance(train_embeddings, torch.Tensor):
#     # Convert list of tensors to a single tensor
#     train_embeddings = torch.stack(train_embeddings)

# # Save train_embeddings to a CSV file
# import csv

# # Convert tensor to numpy array and then to list
# train_embeddings_list = train_embeddings.cpu().numpy().tolist()

# csv_path = os.path.join(cache_dir, "train_embeddings.csv")
# with open(csv_path, 'w', newline='') as csvfile:
#     csvwriter = csv.writer(csvfile)
#     csvwriter.writerows(train_embeddings_list)

# print(f"Saved train_embeddings to {csv_path}")

Saved train_embeddings to /content/drive/My Drive/huggingface_cache/train_embeddings.csv


## MS MARCO embeddings

The MS MARCO models are embedding models from the [SentenceTransformers](https://www.sbert.net/index.html) framework, trained on the MS MARCO dataset of real user search queries from the Bing search engine. Individual models are tuned for either cosine similarity or dot product. In this section, I use the `msmarco-distilbert-base-tas-b` model, which is tuned for dot product and gives better results than Llama embeddings.
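
The choice between cosine similarity and dot product matters because the two can rank the same candidates differently when vectors are not normalized. Here is a small sketch with made-up two-dimensional vectors; the names are illustrative only.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # The dot product divided by both magnitudes, so only the angle matters
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

query = [1.0, 0.0]
candidates = {
    "short_aligned": [1.0, 0.0],  # same direction as the query, small magnitude
    "long_off_axis": [3.0, 3.0],  # 45 degrees off, larger magnitude
}

best_by_cosine = max(candidates, key=lambda name: cosine(query, candidates[name]))
best_by_dot = max(candidates, key=lambda name: dot(query, candidates[name]))
print(best_by_cosine, best_by_dot)
```

Cosine prefers the perfectly aligned vector, while dot product prefers the longer one, so it is worth using the similarity measure a model was tuned for.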

In [None]:
!pip install sentence_transformers

In [16]:
from sentence_transformers import SentenceTransformer
import torch

n_samples = 3

# Initialize the model
marco_model = SentenceTransformer('msmarco-distilbert-base-tas-b').to("cuda")  # Moved to CUDA

# Function to get embeddings
def get_marco_embedding(text):
    return torch.tensor(marco_model.encode(text)).to("cuda")  # Moved to CUDA

# Calculate embeddings for each row in 'train' dataset
train_embeddings = []
for example in data['train']:
    train_embeddings.append(get_marco_embedding(example['Text']))
train_embeddings = torch.stack(train_embeddings)

# Get embedding for the question
question = "For the 28-day period from 11 December 2023 to 7 January 2024, how " + \
"many new cases of COVID-19 were reported globally?"
question_embedding = get_marco_embedding(question)

# Calculate dot product between question and each row in 'train'
dot_products = torch.matmul(question_embedding, train_embeddings.T)

# Sort and find top n_samples most similar rows
top_indices = torch.topk(dot_products, n_samples).indices.tolist()

for idx in top_indices:
    print("Similarity Score:", dot_products[idx].item())
    print("Text:", data['train'][idx]['Text'])
    print("---")

Similarity Score: 107.57888793945312
Text: 1 
 COVID -19 Epidemiological Update  
Edition  163 published  19 January  2024 
In this edition:   
• Key highlights   
• Global overview  
• Hospitalizations and ICU admissions  
• SARS -CoV-2 variants of interest and variants under monitoring  
• WHO regional overviews   
 
 
Key highlights  
 
• Globally, during the 28 -day period from 11  December 2023  to 7 January 2024 , 106 countries report ed COVID -
19 cases and 51 countries  report ed COVID -19 deaths. Note  that this does not reflect the actual number of 
countries where cases or deaths are occurring , as many  countries have stopped or changed frequency of 
reporting .  
• From the available data , the number of reported cases has increased while deaths have decreased during 
the 28 -day period , with over 1.1 million new cases and 8700  new deaths , an increase of 4% and a decrease 
of 26%, respectively, compared to the previous 28 days ( 13 November to 10 December 2023 ). Trends

# Evaluate with and without embeddings

In [17]:
from transformers import TextStreamer
from torch.nn.functional import cosine_similarity
import torch

# Assume that train_embeddings has been computed and is available

# Define a stream
def stream(user_prompt):
    model.config.use_cache = True
    system_prompt = "You are an expert on COVID-19 statistics. You provide very " + \
    "succinct answers no longer than ten words."

    B_INST, E_INST = "[INST]", "[/INST]"
    B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

    prompt = f"{B_INST}{B_SYS}{system_prompt.strip()}{E_SYS}{user_prompt.strip()} {E_INST}\n\n"

    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

    streamer = TextStreamer(tokenizer)

    _ = model.generate(**inputs, streamer=streamer, max_new_tokens=100)

def evaluation(questions, answers, use_embeddings=False, n_samples=3, embedding_type='marco'):
    for i, (question, answer) in enumerate(zip(questions, answers)):
        if use_embeddings:
          print(f'Using the {embedding_type} embedding model')
          # Tokenize and get embedding for the question
          if embedding_type == 'llama':
              question_input_ids = tokenizer(question, return_tensors="pt", truncation=True, max_length=512)["input_ids"].to("cuda")
              question_embedding = get_avg_embedding(question_input_ids).squeeze()
              question_embedding = question_embedding.unsqueeze(0) #average over all tokens.
          elif embedding_type == 'openai':
              question_embedding = get_openai_embedding(question)
          elif embedding_type == 'marco':
              question_embedding = get_marco_embedding(question)
          else:
              raise ValueError("Invalid embedding_type. It should be either 'llama', 'openai', or 'marco'.")

          # # Cosine Similarity
          # # Calculate cosine similarity between question and each row in 'train'
          # cosine_similarities = cosine_similarity(question_embedding, train_embeddings, dim=1).cpu().float()

          # # Sort and find top n_samples most similar rows
          # top_indices = torch.topk(cosine_similarities, n_samples).indices.tolist()
          # print(f'Top indices are {top_indices}')

          # Dot Product Similarity
          # Calculate dot product similarity between question and each row in 'train'
          dot_product_similarities = torch.matmul(question_embedding, train_embeddings.T).cpu().float()

          # Sort and find top n_samples most similar rows
          top_indices = torch.topk(dot_product_similarities, n_samples).indices.tolist()
          # print(f'Top indices are {top_indices}')

          similar_rows = []
          for idx in top_indices:
              text = data['train'][idx]['Text']
              if isinstance(text, list):
                  # If text is a list of strings, join the strings into a single string with a newline in between
                  text = "\n".join(text)
              # Append the string to similar_rows
              similar_rows.append(text)

          # print(similar_rows)
          similar_rows_text = "\n".join(similar_rows)
          user_prompt = "\n\nHere are some snippets from the World Health Organization " + \
          "Epidemiological Update - Edition 163 regarding COVID-19 statistics from " + \
          f"11 December 2023 to 7 January 2024:\n\n{similar_rows_text}\n\nAnswer the " + \
          f"following question succinctly, solely using the above snippets of text.\n\n{question}"
        else:
          user_prompt = question

        print(f"Question {i + 1}:")  # Printing the question number
        stream(user_prompt)
        print("Correct Answer:", answer)
        print('\n\n')

questions = [
    "During the 28-day period from 11 December 2023 to 7 January, how many " + \
    "countries reported new COVID-19 cases?",
    "To date, how many confirmed cases of COVID-19 have been reported globally?",
    "During the 28-day period from 11 December 2023 to 7 January, by what percent " + \
    "did new hospitalization increase globally due to COVID-19?",
    "During the 28-day period from 11 December 2023 to 7 January, which countries " + \
    "in the Americas reported the highest number of new COVID-19 cases?",
    "During the 28-day period from 11 December 2023 to 7 January, which Eastern " + \
    "Mediterranean country reported the highest proportional increase in new COVID-19 cases?",
    "During the 28-day period from 11 December 2023 to 7 January, which World Health " + \
    "Organization Region reported the largest decrease in new COVID-19 cases?",
    "During the 28-day period from 11 December 2023 to 7 January, what was the percent " + \
    "increase in new deaths in the South-East Asia Region?",
    "During the 28-day period from 11 December 2023 to 7 January, which countries " + \
    "reported no new hospitalizations?",
    "When did the ICU-to-hospitalizations ratio peak during the COVID-19 pandemic?",
    "During the 28-day period from 11 December 2023 to 7 January, which SARS-CoV-2 " + \
    "variants of interest (VOIs) were the WHO tracking?"
]

answers = [
    "106 countries.",
    "774 million.",
    "40%.",
    "Canada, Chile, and Peru.",
    "Kuwait.",
    "The African Region.",
    "564%",
    "Mauritania, Mali, Turks and Caicos Islands, Saint Lucia, and Honduras.",
    "July 2021",
    "XBB.1.5, XBB.1.16, EG.5, BA.2.86 and JN.1."
]

In [18]:
print("Question embedding shape:", question_embedding.shape)
print("Train embeddings shape:", train_embeddings.shape)

Question embedding shape: torch.Size([768])
Train embeddings shape: torch.Size([29, 768])


## Evaluate without Embeddings (Raw Llama)

Given that this COVID-19 information is recent, Llama 2 13B answers 10 of 10 questions incorrectly without the help of embeddings.

In [None]:
# Call evaluation without embeddings
evaluation(questions, answers, use_embeddings=False)

Question 1:
<s> [INST]<<SYS>>
You are an expert on COVID-19 statistics. You provide very succinct answers no longer than ten words.
<</SYS>>

During the 28-day period from 11 December 2023 to 7 January, how many countries reported new COVID-19 cases? [/INST]

47 countries reported new COVID-19 cases.</s>
Correct Answer: 106 countries.



Question 2:
<s> [INST]<<SYS>>
You are an expert on COVID-19 statistics. You provide very succinct answers no longer than ten words.
<</SYS>>

To date, how many confirmed cases of COVID-19 have been reported globally? [/INST]

Over 25 million confirmed cases worldwide.</s>
Correct Answer: 774 million.



Question 3:
<s> [INST]<<SYS>>
You are an expert on COVID-19 statistics. You provide very succinct answers no longer than ten words.
<</SYS>>

During the 28-day period from 11 December 2023 to 7 January, by what percent did new hospitalization increase globally due to COVID-19? [/INST]

12.3% increase.</s>
Correct Answer: 40%.



Question 4:
<s> [INST]<<

## Evaluate with the help of Embeddings

In [19]:
# Call evaluation with embeddings
# evaluation(questions, answers, use_embeddings=True, n_samples=3, embedding_type='llama')
# evaluation(questions, answers, use_embeddings=True, n_samples=3, embedding_type='openai')
evaluation(questions, answers, use_embeddings=True, n_samples=3, embedding_type='marco')

Using the marco embedding model
Question 1:
<s> [INST]<<SYS>>
You are an expert on COVID-19 statistics. You provide very succinct answers no longer than ten words.
<</SYS>>

Here are some snippets from the World Health Organization Epidemiological Update - Edition 163 regarding COVID-19 statistics from 11 December 2023 to 7 January 2024:

1 
 COVID -19 Epidemiological Update  
Edition  163 published  19 January  2024 
In this edition:   
• Key highlights   
• Global overview  
• Hospitalizations and ICU admissions  
• SARS -CoV-2 variants of interest and variants under monitoring  
• WHO regional overviews   
 
 
Key highlights  
 
• Globally, during the 28 -day period from 11  December 2023  to 7 January 2024 , 106 countries report ed COVID -
19 cases and 51 countries  report ed COVID -19 deaths. Note  that this does not reflect the actual number of 
countries where cases or deaths are occurring , as many  countries have stopped or changed frequency of 
reporting .  
• From the availa

## Scoring:

Fractional scores mean the correct answer was a list of several items and the model named only some of them.  
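
One way to compute such partial credit automatically is sketched below. This is an assumption for illustration, not necessarily how the scores in this notebook were assigned: `partial_credit` is a hypothetical helper, the model answer is invented, and simple substring matching can misfire when one item is contained in another.

```python
def partial_credit(expected_items, model_answer):
    """Fraction of expected items that appear (case-insensitively) in the answer."""
    answer = model_answer.lower()
    hits = sum(1 for item in expected_items if item.lower() in answer)
    return hits / len(expected_items)

expected = ["XBB.1.5", "XBB.1.16", "EG.5", "BA.2.86", "JN.1"]
score = partial_credit(expected, "The WHO was tracking JN.1 and EG.5.")
print(score)  # 0.4
```

Summing these per-question fractions over the ten questions gives totals such as 4.2/10.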

### NF4 Quantization with BNB

Llama 2, 7B - RAW NF4:
- 0/10

Llama 2, 7B NF4 - with embeddings, 200-token chunks 50% overlap:
- 2/10

Llama 2, 13B - RAW NF4:
- 0/10

Llama 2, 13B NF4 - with llama embeddings, dot product similarity, 500-token chunks w/ 50% overlap, top-3 similar embeddings:
- 4/10

Llama 2, 13B NF4 - with *openai* embeddings, cosine similarity, 500-token chunks w/ 50% overlap, top-3 similar embeddings:
- 4.2/10

Llama 2, 13B NF4 - with *openai* embeddings, dot product similarity, 500-token chunks w/ 50% overlap, top-3 similar embeddings:
- 5/10

Llama 2, 13B NF4 - with *marco* embeddings, dot product similarity, 500-token chunks w/ 50% overlap, top-3 similar embeddings:
- 5/10

### GPTQ Quantization

Llama 2, 7B - RAW GPTQ:
- 0/10

Llama 2, 7B GPTQ - with llama embeddings, 200-token chunks w/ 50% overlap, top-3 similar embeddings:
- 1.2/10

Llama 2, 7B GPTQ - with openai embeddings, 200-token chunks w/ 50% overlap, top-3 similar embeddings:
- 3/10

Llama 2, 7B GPTQ - with marco embeddings, 200-token chunks w/ 50% overlap, top-3 similar embeddings:
- 4/10

Llama 2, 13B - RAW GPTQ:
- 0/10

Llama 2, 13B GPTQ - with llama embeddings, 200-token chunks w/ 50% overlap, top-3 similar embeddings:
- 2.4/10

Llama 2, 13B GPTQ - with openai embeddings, 200-token chunks w/ 50% overlap, top-3 similar embeddings:
- 4.2/10

Llama 2, 13B GPTQ - with marco embeddings, 200-token chunks w/ 50% overlap, top-3 similar embeddings:
- 4/10

Llama 2, 13B GPTQ - with marco embeddings, 500-token chunks w/ 50% overlap, top-3 similar embeddings:
- 5/10

Llama 2, 13B GPTQ - with marco embeddings, 500-token chunks w/ 10% overlap, top-3 similar embeddings:
- 4.6/10