# Embeddings & You - A Brief Introduction to Embeddings in Machine Learning

If you've toyed with LangChain, LlamaIndex, or even OpenAI's `ada` model - you've likely run into the word "Embeddings" a few times.

They've had a recent surge in popularity due to the proliferation of Retrieval Augmented Generation, but they've been around for a very long time.

If you come from an NLP background, embeddings are something you might be intimately familiar with - otherwise, you might find the topic a bit...dense. (this attempt at a joke will make more sense later)

In all seriousness, embeddings are a powerful piece of the NLP puzzle, so let's dive in!

> NOTE: While this notebook is language/NLP-centric, embeddings have uses beyond just text!

## Notebook Table of Contents:

- Breakout Room #1: Training Word2Vec from Scratch
  - Task 1: Dependencies
  - Task 2: Data Collection
  - Task 3: Data Preprocessing
    - ❓ Question #1
    - 👪❓ Discussion Question #1
  - Task 4: Training Word2Vec
    - 🏗️ Activity #1
    - ❓ Question #2
- Breakout Room #2: Fine-tuning a BERT-Style Embedding Model on Question Answer Pairs
  - Task 1: Fine-tuning Our Embedding Model
    - ❓ Question #3
    - 🏗️ Activity #2
  - Task 2: Evaluating our Embedding Model
    - 👪❓ Discussion Question #2

### Why Do We Even Need Embeddings?

In order to fully understand what Embeddings are, we first need to understand why we have them:

Machine Learning algorithms, ranging from the very big to the very small, all have one thing in common:

*They need numeric inputs.*

So we need a process by which to translate the domain we live in, dominated by images, audio, language, and more, into the domain of the machine: Numbers.

Another thing we want to be able to do is capture "semantic information" about words/phrases so that we can use algorithmic approaches to determine if words are closely related or not!

So, we need to come up with a process that does these two things well:

1. Convert non-numeric data into numeric-data
2. Capture potential semantic relationships between individual pieces of data
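As a rough illustration of these two goals, the sketch below compares a one-hot encoding (numeric, but with no notion of relatedness) against a small dense encoding. The three-word vocabulary and the dense vector values are made up for illustration and are not the output of any trained model.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


vocab = ["cat", "kitten", "car"]

# Goal 1 only: one-hot vectors are numeric, but all distinct words are orthogonal.
one_hot: Dict[str, List[float]] = {
    word: [1.0 if i == j else 0.0 for j in range(len(vocab))]
    for i, word in enumerate(vocab)
}

# Goals 1 and 2: hypothetical dense vectors, hand-picked for illustration.
dense: Dict[str, List[float]] = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

one_hot_sim = cosine_similarity(one_hot["cat"], one_hot["kitten"])
related_sim = cosine_similarity(dense["cat"], dense["kitten"])
unrelated_sim = cosine_similarity(dense["cat"], dense["car"])

print(f"one-hot cat vs kitten: {one_hot_sim:.2f}")
print(f"dense   cat vs kitten: {related_sim:.2f}")
print(f"dense   cat vs car:    {unrelated_sim:.2f}")
```

With one-hot vectors, "cat" is exactly as far from "kitten" as it is from "car". The dense vectors can place related words closer together, which is the property we want an embedding model to learn from data rather than have us pick by hand.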

## Breakout Room #1: Training Word2Vec from Scratch

Now that we have a bit of background on Embeddings - let's look at what it takes to create our own embeddings using Word2Vec!

We'll be leveraging the `gensim` library, which you can read all about [here](https://pypi.org/project/gensim/).

Before we begin training, however, we need some data!

Let's use the Wikipedia pages for Wicked and Gladiator as examples.

### Task 1: Dependencies
We'll leverage the `wikipedia` library and `langchain`'s `WikipediaLoader` to obtain our Wikipedia data!

In [1]:
!pip install -U -q wikipedia langchain langchain_community lxml datasets


> NOTE: Please reset the Colab environment after running the install cells.

### Task 2: Data Collection



In [2]:
from langchain_community.document_loaders import WikipediaLoader

wicked_docs = WikipediaLoader(
    query="Wicked (2024 film)",
    load_max_docs=5,
    doc_content_chars_max=1_000_000
    ).load()




In [3]:
len(wicked_docs)

4

In [4]:
gladiator_2_docs = WikipediaLoader(
    query="Gladiator II",
    load_max_docs=5,
    doc_content_chars_max=1_000_000
    ).load()

In [5]:
len(gladiator_2_docs)

5

### Task 3: Data Preprocessing

Now that we have some text, we need to do some preprocessing! That's right - classic NLP!

Let's begin by cleaning up our text, we'll:

- Remove special characters
- Remove stop words
- Remove links
- Convert to lowercase
- Strip whitespace

To do this, we'll need two main modules:

- The `re` standard library module
- `nltk`, a classic NLP library

In [6]:
import nltk

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

Let's take a peek at what these "stopwords" are - very common words that traditional NLP pipelines and embedding models typically filter out.

In [7]:
from nltk.corpus import stopwords
stopwords.words('english')[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

#### Text Normalization

The first step is to make a helper function that normalizes our text.

In [8]:
import re
from typing import List
from nltk.tokenize import word_tokenize

def preprocess_text(text: str) -> List[str]:
  # remove links
  text = re.sub(r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", "", text)
  # remove all special characters (keep alphabet characters)
  text = re.sub("[^a-zA-Z ]", " ", text)
  # tokenize text, make lowercase, and remove stop words
  stop_words = set(stopwords.words('english'))
  tokens = word_tokenize(text)
  filtered_tokens = [token.lower() for token in tokens if token.lower() not in stop_words]
  return filtered_tokens

Let's see how this works on some of our Wikipedia data!

In [9]:
preprocess_text(wicked_docs[0].page_content[:100])

['wicked',
 'titled',
 'onscreen',
 'wicked',
 'part',
 'american',
 'musical',
 'fantasy',
 'film',
 'directed',
 'jon']

#### Sentence Tokenization:

Now we'll turn our corpus into sets of sentences and apply our pre-processing function to each sentence individually.

In [10]:
from nltk.tokenize import sent_tokenize

def sentence_tokenization(text: str) -> List[List[str]]:
    # Tokenize the text into sentences
    sentences = sent_tokenize(text)
    # Tokenize each sentence into words and store them in a list of lists
    sentence_tokens = [preprocess_text(sentence) for sentence in sentences]
    return sentence_tokens

In [11]:
sentence_tokenization(wicked_docs[0].page_content[:200])

[['wicked',
  'titled',
  'onscreen',
  'wicked',
  'part',
  'american',
  'musical',
  'fantasy',
  'film',
  'directed',
  'jon',
  'chu',
  'written',
  'winnie',
  'holzman',
  'dana',
  'fox',
  'songs',
  'stephen',
  'schwartz'],
 ['first']]

Perfect, with that, we're ready to create our corpus!

In [12]:
corpus = []

for doc in wicked_docs:
  corpus += sentence_tokenization(doc.page_content)

for doc in gladiator_2_docs:
  corpus += sentence_tokenization(doc.page_content)

##### ❓ Question #1:

Why is this normalization and tokenization necessary to train a Word2Vec Embedding Model?

##### 👪❓ Discussion Question #1:

When creating training data for Large Language Models, do we need to/should we use text normalization?

What arguments for or against text normalization exist at LLM-scale datasets?

### Task 4: Training Word2Vec

Now that we have our corpus set up, we can train our Word2Vec model.

Training is straightforward, thanks to `gensim`, and more can be understood about the process by reading the paper - but let's see it in code!

It's also worth considering/playing around with the `gensim` parameters.

In [13]:
!pip install -q -U gensim

### An Aside on Skip-gram (SG) and Continuous Bag of Words (CBOW):

**Skip-gram**:

Skip-gram is an approach to teaching computers the meaning of words by predicting the surrounding context from a given word. Think of it as a student who learns by taking a single word and trying to guess what words might appear around it. For example, given the word "sun," Skip-gram would learn to predict related words like "bright," "sky," and "shine." This method is particularly effective at handling rare words in the vocabulary and capturing multiple meanings of words, though it typically requires more training time. The key insight is that words appearing in similar contexts often have related meanings.

**Continuous Bag of Words (CBOW)**:

CBOW takes the opposite approach to Skip-gram by predicting a target word based on its surrounding context words. Imagine playing a fill-in-the-blank game where you see "The ___ is barking at the mailman" and need to predict "dog" based on the surrounding words. CBOW looks at multiple context words at once and tries to understand what word would make sense in the middle. This method tends to be faster to train than Skip-gram and performs particularly well with frequent words in the vocabulary. However, it might not be as effective at handling rare words or capturing multiple word meanings since it averages the context.
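To make the difference concrete, here is a minimal sketch of how training examples could be sliced out of a tokenized sentence under each approach. The helper names `skipgram_pairs` and `cbow_examples` are hypothetical, and the sketch leaves out parts of real Word2Vec training such as negative sampling and frequent-word subsampling.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int) -> List[Tuple[str, str]]:
    """(center, context) pairs: the center word is used to predict each neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens: List[str], window: int) -> List[Tuple[List[str], str]]:
    """(context_words, target) examples: the neighbors jointly predict the center word."""
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:
            examples.append((context, target))
    return examples


sentence = ["dog", "barking", "mailman"]
sg = skipgram_pairs(sentence, window=1)
cb = cbow_examples(sentence, window=1)

print(sg)
print(cb)
```

Skip-gram produces one training pair per (center, neighbor) combination, while CBOW produces one example per position with all neighbors grouped together. That grouping is one intuition for why CBOW tends to train faster and why Skip-gram gives rare words more individual updates.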

##### 🏗️ Activity #1:

Set appropriate hyperparameters for the gensim `Word2Vec` model.

> NOTE: Documentation is available [here](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec)

##### ❓ Question #2:

What does each of the hyper-parameters mean or relate to:

- `VECTOR_SIZE` -> YOUR ANSWER HERE
- `WINDOW` -> YOUR ANSWER HERE
- `MIN_COUNT` -> YOUR ANSWER HERE

In [14]:
from gensim.models import Word2Vec

### Modify These Hyper Parameters
VECTOR_SIZE = 128
WINDOW = 10
MIN_COUNT = 2

### Leave this Hyper Parameter
SG = 1

model = Word2Vec(
    sentences=corpus,
    vector_size=VECTOR_SIZE,
    window=WINDOW,
    min_count=MIN_COUNT,
    sg=SG
    )

Blink and you'll miss it. You just trained an embeddings model!

Let's try it out and see what we did!

In [15]:
model.wv["elphaba"]

array([-0.01647268, -0.05579745, -0.05371885,  0.20872699,  0.26404533,
        0.07942157,  0.14951208, -0.05713009, -0.05722675,  0.09519398,
       -0.01493426, -0.00949061, -0.13161813,  0.0983023 ,  0.1252948 ,
        0.26750708, -0.10325535,  0.06487989, -0.17538625,  0.26842296,
       -0.10175849,  0.09815633, -0.21272291, -0.27283755, -0.29781756,
        0.02653749, -0.18122639, -0.03993816, -0.04062214, -0.13352792,
       -0.04847975, -0.02046371, -0.01597715, -0.00195248, -0.05497558,
        0.13171007,  0.11007937, -0.05881838,  0.08789193,  0.06449124,
       -0.01659209,  0.08143873,  0.02870691, -0.03546425,  0.3423432 ,
        0.04861787, -0.21866477,  0.02380689,  0.04406934,  0.17926669,
        0.04415169,  0.10751387,  0.09540914,  0.13030998, -0.00906207,
        0.00285842,  0.05891665,  0.11090235, -0.07538593,  0.10516503,
       -0.2840302 ,  0.12155498,  0.12498969, -0.10673953,  0.3452448 ,
       -0.12075447, -0.03450308,  0.018638  , -0.23050109, -0.26

Finally! We see it: An embedding in the wild.

Notice how we input a word, in this case "elphaba", and we got back a 128-dimensional vector of floats (matching our `VECTOR_SIZE`).

Let's see if we can get back a list of the words whose vectors are most similar to the vectors for "elphaba" and "maximus"!

In [16]:
model.wv.most_similar(positive=["elphaba"], topn=3)

[('glinda', 0.9975523352622986),
 ('grande', 0.9959437251091003),
 ('ariana', 0.995334267616272)]

In [17]:
model.wv.most_similar(positive=["maximus"], topn=3)

[('acacius', 0.9980629682540894),
 ('commodus', 0.9979823231697083),
 ('son', 0.9979185461997986)]
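The scores returned by `most_similar` are cosine similarities. As a rough sketch of the idea (not gensim's actual implementation), the same kind of ranking can be reproduced over a tiny embedding table. The two-dimensional vectors below are made-up values, not the ones our model learned.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(
    query: str, embeddings: Dict[str, List[float]], topn: int = 3
) -> List[Tuple[str, float]]:
    """Rank every other word by cosine similarity to the query word's vector."""
    query_vec = embeddings[query]
    scored = [
        (word, cosine_similarity(query_vec, vec))
        for word, vec in embeddings.items()
        if word != query
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Hypothetical 2-dimensional embeddings, chosen by hand for illustration.
toy_embeddings = {
    "elphaba": [0.9, 0.1],
    "glinda": [0.8, 0.3],
    "maximus": [0.1, 0.9],
    "commodus": [0.2, 0.8],
}

neighbors = most_similar("elphaba", toy_embeddings, topn=2)
print(neighbors)
```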

Now, for the moment of truth - let's do some vector math and see what happens!

In [18]:
galinda_vec = model.wv["galinda"]
good_vec = model.wv["good"]
mystery_vector = galinda_vec - good_vec

In [19]:
model.wv.most_similar(positive=[mystery_vector], topn=3)

[('galinda', 0.698475182056427),
 ('glinda', 0.6933204531669617),
 ('elphaba', 0.6876640915870667)]

And there we have it - embeddings, and a demonstration of what makes them so powerful!

> Note: This is a very small sample size, and while this result is what we'd hope for, it is largely coincidental - this behaviour shows up more reliably with much larger corpora of text.

#### Visualization:

In [20]:
import numpy as np
from sklearn.manifold import TSNE
import plotly.graph_objects as go

def create_2d_word_cloud(word, model, num_neighbors=100):
    """
    Creates a 2D visualization of a word and its nearest neighbors using existing Word2Vec embeddings.
    Target word is centered in the visualization.

    Parameters:
    word (str): Target word to visualize
    model: Existing Word2Vec model
    num_neighbors (int): Number of nearest neighbors to display
    """
    # Get nearest neighbors
    try:
        neighbors = model.wv.most_similar(word, topn=num_neighbors)
    except KeyError:
        print(f"'{word}' not found in vocabulary")
        return

    # Get vectors for the word and its neighbors
    words = [word] + [n[0] for n in neighbors]
    vectors = np.vstack([model.wv[w] for w in words])

    # Reduce dimensionality to 2D using t-SNE
    print("Reducing dimensionality...")
    tsne = TSNE(n_components=2, random_state=42)
    vectors_2d = tsne.fit_transform(vectors)

    # Center the target word by subtracting its coordinates
    center_point = vectors_2d[0]
    vectors_2d = vectors_2d - center_point

    # Calculate similarity scores for color mapping
    similarities = [1.0] + [n[1] for n in neighbors]

    # Adjust marker sizes based on similarity
    sizes = [20] + [10 + 10 * sim for sim in similarities[1:]]

    # Create the 2D scatter plot
    trace = go.Scatter(
        x=vectors_2d[:, 0],
        y=vectors_2d[:, 1],
        mode='markers+text',
        text=words,
        hoverinfo='text',
        hovertext=[f"{w} (sim: {s:.3f})" for w, s in zip(words, similarities)],
        marker=dict(
            size=sizes,
            color=similarities,
            colorscale='Viridis',
            opacity=0.8,
            colorbar=dict(title='Similarity Score')
        ),
        textposition='top center',
        textfont=dict(
            size=[14 if i == 0 else 10 for i in range(len(words))],
            color=['red' if i == 0 else 'black' for i in range(len(words))]
        )
    )

    # Create layout
    layout = go.Layout(
        title=f'2D Word Cloud for "{word}" and {num_neighbors} Nearest Neighbors',
        xaxis=dict(title='', showticklabels=False, zeroline=True),
        yaxis=dict(title='', showticklabels=False, zeroline=True),
        showlegend=False,
        width=900,
        height=900,
        margin=dict(l=0, r=0, b=0, t=40)
    )

    # Create and show figure
    fig = go.Figure(data=[trace], layout=layout)
    fig.show()

In [21]:
create_2d_word_cloud("galinda", model)

Reducing dimensionality...


## Breakout Room #2: Fine-tuning a BERT-Style Embedding Model on Question Answer Pairs.

Now that we've seen where embeddings "started", as it were, let's see where they've gotten.

In this section, we'll be fine-tuning Hugging Face's [sentence transformers](https://www.sbert.net/).

Sentence Transformers leverages the work done in the [Sentence-BERT](https://arxiv.org/abs/1908.10084) paper. So while the idea of converting input text into a dense vector representation is the same, the way we got to those embeddings is a bit different.

> NOTE: As the name implies, the following model is an *ENTIRE* transformer model (though Encoder-only, as described by Sentence-BERT).

### Task 1: Fine-tuning Our Embedding Model

Finally, the set up is complete - and we can move on to fine-tuning our sentence transformer embedding model!

The process is simplified considerably by how amazing the Hugging Face `sentence-transformers` library is, so let's jump straight in!

In [22]:
!pip install -U -q sentence-transformers

In [23]:
from sentence_transformers import SentenceTransformer

We're going to use the `BAAI/bge-small-en` embedding model as an example, but you could use any of the `sentence-transformers` embedding models.

In [24]:
model_id = "BAAI/bge-small-en"
model = SentenceTransformer(model_id)




In [25]:
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
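The printout above lists three modules: a Transformer, a Pooling layer with `pooling_mode_cls_token` enabled, and a `Normalize()` step. As a rough sketch of what the last two stages do to the transformer's per-token output, consider the toy example below. The token vectors are made-up values with 4 dimensions for readability, whereas the printout reports 384.

```python
import math
from typing import List

# Hypothetical per-token output of the Transformer module for one short input.
token_embeddings: List[List[float]] = [
    [0.5, -1.0, 2.0, 0.0],   # [CLS]
    [1.2, 0.3, -0.7, 0.9],   # "hello"
    [-0.4, 0.8, 0.1, -1.5],  # "world"
    [0.0, 0.2, 0.3, 0.1],    # [SEP]
]

# CLS pooling: keep only the first token's vector as the sentence representation.
cls_vector = token_embeddings[0]

# Normalization: rescale to unit L2 length, so a dot product equals cosine similarity.
norm = math.sqrt(sum(x * x for x in cls_vector))
sentence_embedding = [x / norm for x in cls_vector]

print(sentence_embedding)
```

However many tokens go in, one fixed-size vector comes out, which is what lets us compare a short query against a much longer passage.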

Let's load our data into the desired format!

In [26]:
from torch.utils.data import DataLoader
from sentence_transformers import InputExample

In [27]:
!git clone https://github.com/AI-Maker-Space/DataRepository

Cloning into 'DataRepository'...
remote: Enumerating objects: 119, done.
remote: Counting objects: 100% (111/111), done.
remote: Compressing objects: 100% (96/96), done.
remote: Total 119 (delta 36), reused 40 (delta 10), pack-reused 8 (from 1)
Receiving objects: 100% (119/119), 78.04 MiB | 15.15 MiB/s, done.
Resolving deltas: 100% (36/36), done.


In [28]:
TRAIN_DATASET_FPATH = './DataRepository/embedding_data/train_dataset.json'
VAL_DATASET_FPATH = './DataRepository/embedding_data/eval_dataset.json'

In [29]:
import json

with open(TRAIN_DATASET_FPATH, 'r') as f:
    train_dataset = json.load(f)

with open(VAL_DATASET_FPATH, 'r') as f:
    val_dataset = json.load(f)

In [30]:
dataset = train_dataset

corpus = dataset['corpus']
queries = dataset['queries']
relevant_docs = dataset['relevant_docs']

examples = []
for query_id, query in queries.items():
    node_id = relevant_docs[query_id][0]
    text = corpus[node_id]
    example = InputExample(texts=[query, text])
    examples.append(example)

We're going to be leveraging the `MultipleNegativesRankingLoss` from `sentence_transformers` as our loss function.

You can read more about it in the docs, [here](https://www.sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss).

Note that there is [research](https://arxiv.org/pdf/1705.00652.pdf) that indicates that performance generally scales with `BATCH_SIZE`, but we're going to stick with an arbitrary 10 for the example in the notebook.
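To build some intuition for why batch size matters here, the following is a rough pure-Python sketch of an in-batch-negatives ranking loss in the spirit of `MultipleNegativesRankingLoss`. It is not the library's implementation: the function name, the toy 3-dimensional vectors, and the `scale=20.0` value are assumptions made for illustration.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def in_batch_negatives_loss(
    queries: List[List[float]], docs: List[List[float]], scale: float = 20.0
) -> float:
    """Mean cross-entropy where docs[i] is the correct match for queries[i]
    and every other document in the batch acts as a negative."""
    losses = []
    for i, query in enumerate(queries):
        scores = [scale * cosine_similarity(query, doc) for doc in docs]
        # Numerically stable log-sum-exp over all documents in the batch
        max_score = max(scores)
        log_sum_exp = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        losses.append(log_sum_exp - scores[i])
    return sum(losses) / len(losses)


# Hypothetical embeddings for a batch of three (query, document) pairs.
docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned_queries = [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.1, 0.0, 0.9]]
# Same queries, rotated so that each one is paired with the wrong document.
misaligned_queries = aligned_queries[1:] + aligned_queries[:1]

aligned_loss = in_batch_negatives_loss(aligned_queries, docs)
misaligned_loss = in_batch_negatives_loss(misaligned_queries, docs)

print(f"aligned:    {aligned_loss:.4f}")
print(f"misaligned: {misaligned_loss:.4f}")
```

Each query only needs a positive document. Its negatives come for free from the other pairs in the batch, so a larger batch means more negatives per query.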

##### ❓ Question #3:

What is happening in `MultipleNegativesRankingLoss` that makes it useful for our task?

In [31]:
from sentence_transformers import losses

In [32]:
loss = losses.MultipleNegativesRankingLoss(model)

In [33]:
BATCH_SIZE = 10

loader = DataLoader(
    examples, batch_size=BATCH_SIZE
)

We'll set up the `InformationRetrievalEvaluator` to determine performance during training.

In [34]:
from sentence_transformers.evaluation import InformationRetrievalEvaluator

In [35]:
dataset = val_dataset

corpus = dataset['corpus']
queries = dataset['queries']
relevant_docs = dataset['relevant_docs']

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs)

You could train for more epochs here, but for the example in the notebook, we'll stick with 10.

In [36]:
EPOCHS = 10

Nothing left to do but #trainthatmodel!

> NOTE: You'll need to make sure you enter the desired Weights and Biases key - you should be able to simply click the link `https://wandb.ai/authorize` and follow the outlined steps to get the API key.

In [37]:
from datasets import Dataset
from torch.utils.data import DataLoader

warmup_steps = int(len(loader) * EPOCHS * 0.1)

model.fit(
    train_objectives=[(loader, loss)],
    epochs=EPOCHS,
    warmup_steps=warmup_steps,
    output_path='exp_finetune',
    show_progress_bar=True,
    evaluator=evaluator,
    evaluation_steps=50,
)

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

 ··········


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


| Step | Training Loss | Validation Loss | Cosine Accuracy@1 | Cosine Accuracy@3 | Cosine Accuracy@5 | Cosine Accuracy@10 | Cosine Precision@1 | Cosine Precision@3 | Cosine Precision@5 | Cosine Precision@10 | Cosine Recall@1 | Cosine Recall@3 | Cosine Recall@5 | Cosine Recall@10 | Cosine Ndcg@10 | Cosine Mrr@10 | Cosine Map@100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 50 | No log | No log | 0.630872 | 0.791946 | 0.838926 | 0.899329 | 0.630872 | 0.263982 | 0.167785 | 0.089933 | 0.630872 | 0.791946 | 0.838926 | 0.899329 | 0.76343 | 0.720113 | 0.724508 |
| 60 | No log | No log | 0.627517 | 0.778523 | 0.842282 | 0.885906 | 0.627517 | 0.259508 | 0.168456 | 0.088591 | 0.627517 | 0.778523 | 0.842282 | 0.885906 | 0.755411 | 0.713607 | 0.718658 |
| 100 | No log | No log | 0.644295 | 0.788591 | 0.845638 | 0.895973 | 0.644295 | 0.262864 | 0.169128 | 0.089597 | 0.644295 | 0.788591 | 0.845638 | 0.895973 | 0.76773 | 0.726899 | 0.731367 |
| 120 | No log | No log | 0.634228 | 0.788591 | 0.832215 | 0.895973 | 0.634228 | 0.262864 | 0.166443 | 0.089597 | 0.634228 | 0.788591 | 0.832215 | 0.895973 | 0.764712 | 0.722739 | 0.727859 |
| 150 | No log | No log | 0.630872 | 0.771812 | 0.83557 | 0.885906 | 0.630872 | 0.257271 | 0.167114 | 0.088591 | 0.630872 | 0.771812 | 0.83557 | 0.885906 | 0.756317 | 0.714995 | 0.720317 |
| 180 | No log | No log | 0.637584 | 0.798658 | 0.832215 | 0.902685 | 0.637584 | 0.266219 | 0.166443 | 0.090268 | 0.637584 | 0.798658 | 0.832215 | 0.902685 | 0.76739 | 0.724489 | 0.728832 |
| 200 | No log | No log | 0.630872 | 0.795302 | 0.83557 | 0.90604 | 0.630872 | 0.265101 | 0.167114 | 0.090604 | 0.630872 | 0.795302 | 0.83557 | 0.90604 | 0.76655 | 0.722321 | 0.726448 |
| 240 | No log | No log | 0.627517 | 0.778523 | 0.812081 | 0.889262 | 0.627517 | 0.259508 | 0.162416 | 0.088926 | 0.627517 | 0.778523 | 0.812081 | 0.889262 | 0.755208 | 0.712708 | 0.717628 |
| 250 | No log | No log | 0.634228 | 0.778523 | 0.815436 | 0.889262 | 0.634228 | 0.259508 | 0.163087 | 0.088926 | 0.634228 | 0.778523 | 0.815436 | 0.889262 | 0.758341 | 0.716775 | 0.722066 |
| 300 | No log | No log | 0.637584 | 0.791946 | 0.838926 | 0.895973 | 0.637584 | 0.263982 | 0.167785 | 0.089597 | 0.637584 | 0.791946 | 0.838926 | 0.895973 | 0.764657 | 0.722772 | 0.727183 |



### Task 2: Evaluating our Embeddings Models

Now that we've fine-tuned our embedding model on our data - let's see how it performs compared to the base embeddings!

In [38]:
import json
from tqdm.notebook import tqdm
import pandas as pd

In [39]:
TRAIN_DATASET_FPATH = './DataRepository/embedding_data/train_dataset.json'
EVAL_DATASET_FPATH = './DataRepository/embedding_data/eval_dataset.json'

In [40]:
with open(TRAIN_DATASET_FPATH, 'r') as f:
    train_dataset = json.load(f)

with open(EVAL_DATASET_FPATH, 'r') as f:
    eval_dataset = json.load(f)

We're going to be using the `InformationRetrievalEvaluator` to help us determine how well our embedding model is performing on a widely used task: Information Retrieval!

You can dive deeper into the documentation [here](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#informationretrievalevaluator) to see under the hood.

You'll notice, however, that we have common suffixes for our evaluation metrics:

- `X_accuracy@1`, `X_accuracy@3`, etc.

These metrics cover accuracy, recall, precision, Mean Reciprocal Rank (MRR), Normalized Discounted Cumulative Gain (NDCG), and Mean Average Precision (MAP), each computed at various numbers of retrieved items.

That is to say:

We look at these scores as we include the first closest document, top three closest documents, etc.

We can think of these `@k` as "top k" metrics.

These will help us guide important hyper-parameters when using these models for Information Retrieval tasks down the road!
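As a minimal sketch of how a few of these `@k` metrics can be computed, consider the toy example below. The function name `metrics_at_k`, the query IDs, and the rankings are all hypothetical, and this is a simplified illustration rather than the evaluator's own code.

```python
from typing import Dict, List, Set


def metrics_at_k(
    ranked: Dict[str, List[str]], relevant: Dict[str, Set[str]], k: int
) -> Dict[str, float]:
    """Average accuracy, precision, recall, and reciprocal rank over all queries."""
    hits, precisions, recalls, reciprocal_ranks = [], [], [], []
    for query_id, doc_ids in ranked.items():
        top_k = doc_ids[:k]
        rel = relevant[query_id]
        n_found = sum(1 for doc_id in top_k if doc_id in rel)
        hits.append(1.0 if n_found > 0 else 0.0)   # any relevant doc in the top k?
        precisions.append(n_found / k)             # share of the top k that is relevant
        recalls.append(n_found / len(rel))         # share of relevant docs retrieved
        first_rank = next((i + 1 for i, d in enumerate(top_k) if d in rel), None)
        reciprocal_ranks.append(1.0 / first_rank if first_rank else 0.0)
    n = len(ranked)
    return {
        f"accuracy@{k}": sum(hits) / n,
        f"precision@{k}": sum(precisions) / n,
        f"recall@{k}": sum(recalls) / n,
        f"mrr@{k}": sum(reciprocal_ranks) / n,
    }


# Hypothetical retrieval results: documents ordered from most to least similar.
ranked = {
    "q1": ["d1", "d7", "d3"],  # relevant doc at rank 1
    "q2": ["d4", "d2", "d9"],  # relevant doc at rank 2
    "q3": ["d5", "d6", "d8"],  # relevant doc not retrieved
}
relevant = {"q1": {"d1"}, "q2": {"d2"}, "q3": {"d0"}}

m1 = metrics_at_k(ranked, relevant, k=1)
m3 = metrics_at_k(ranked, relevant, k=3)
print(m1)
print(m3)
```

When every query has exactly one relevant document, as in this toy data, accuracy@k equals recall@k and precision@k is simply recall@k divided by k. The results tables in this notebook appear to follow the same pattern (for example, 0.791946 / 3 ≈ 0.263982).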

In [41]:
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from sentence_transformers import SentenceTransformer

def evaluate_st(
    dataset,
    model_id,
    name,
):
    corpus = dataset['corpus']
    queries = dataset['queries']
    relevant_docs = dataset['relevant_docs']

    evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name=name)
    model = SentenceTransformer(model_id)
    return evaluator(model, output_path="/content/")

##### 🏗️ Activity #2:

Describe what the `evaluate_st` function is doing in the above cell in natural language.

#### Base Embeddings Model Results

In [42]:
evaluate_st(eval_dataset, "BAAI/bge-small-en", name='bge')

{'bge_cosine_accuracy@1': 0.5067114093959731,
 'bge_cosine_accuracy@3': 0.714765100671141,
 'bge_cosine_accuracy@5': 0.7818791946308725,
 'bge_cosine_accuracy@10': 0.8288590604026845,
 'bge_cosine_precision@1': 0.5067114093959731,
 'bge_cosine_precision@3': 0.23825503355704697,
 'bge_cosine_precision@5': 0.1563758389261745,
 'bge_cosine_precision@10': 0.08288590604026844,
 'bge_cosine_recall@1': 0.5067114093959731,
 'bge_cosine_recall@3': 0.714765100671141,
 'bge_cosine_recall@5': 0.7818791946308725,
 'bge_cosine_recall@10': 0.8288590604026845,
 'bge_cosine_ndcg@10': 0.6710313851865369,
 'bge_cosine_mrr@10': 0.619814637264302,
 'bge_cosine_map@100': 0.6279603491960256}

#### Fine-tuned Results

In [43]:
evaluate_st(eval_dataset, "exp_finetune", name='finetuned')

{'finetuned_cosine_accuracy@1': 0.6442953020134228,
 'finetuned_cosine_accuracy@3': 0.7885906040268457,
 'finetuned_cosine_accuracy@5': 0.8456375838926175,
 'finetuned_cosine_accuracy@10': 0.8959731543624161,
 'finetuned_cosine_precision@1': 0.6442953020134228,
 'finetuned_cosine_precision@3': 0.2628635346756152,
 'finetuned_cosine_precision@5': 0.16912751677852347,
 'finetuned_cosine_precision@10': 0.0895973154362416,
 'finetuned_cosine_recall@1': 0.6442953020134228,
 'finetuned_cosine_recall@3': 0.7885906040268457,
 'finetuned_cosine_recall@5': 0.8456375838926175,
 'finetuned_cosine_recall@10': 0.8959731543624161,
 'finetuned_cosine_ndcg@10': 0.7677298263110102,
 'finetuned_cosine_mrr@10': 0.7268989027378289,
 'finetuned_cosine_map@100': 0.7313671834327362}

### Conclusion

Now we can compare the embeddings models to see which performed the best!

In [44]:
df_st_bge = pd.read_csv('/content/Information-Retrieval_evaluation_bge_results.csv')
df_st_finetuned = pd.read_csv('/content/Information-Retrieval_evaluation_finetuned_results.csv')

In [45]:
df_st_bge['model'] = 'bge'
df_st_finetuned['model'] = 'fine_tuned'
df_st_all = pd.concat([df_st_bge, df_st_finetuned])
df_st_all = df_st_all.set_index('model')
df_st_all

| model | epoch | steps | cosine-Accuracy@1 | cosine-Accuracy@3 | cosine-Accuracy@5 | cosine-Accuracy@10 | cosine-Precision@1 | cosine-Recall@1 | cosine-Precision@3 | cosine-Recall@3 | cosine-Precision@5 | cosine-Recall@5 | cosine-Precision@10 | cosine-Recall@10 | cosine-MRR@10 | cosine-NDCG@10 | cosine-MAP@100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| bge | -1 | -1 | 0.506711 | 0.714765 | 0.781879 | 0.828859 | 0.506711 | 0.506711 | 0.238255 | 0.714765 | 0.156376 | 0.781879 | 0.082886 | 0.828859 | 0.619815 | 0.671031 | 0.62796 |
| fine_tuned | -1 | -1 | 0.644295 | 0.788591 | 0.845638 | 0.895973 | 0.644295 | 0.644295 | 0.262864 | 0.788591 | 0.169128 | 0.845638 | 0.089597 | 0.895973 | 0.726899 | 0.76773 | 0.731367 |


##### 👪❓ Discussion Question #2:

Discuss the results with your group!