# Embedding Models

This notebook explores the embedding models we may use in the project. The code implementing these models lives in `/src/embedders` and is imported into this notebook.

It goes over each embedding model, how it works, and its potential strengths and weaknesses.

In [8]:
# Import sys and use it to add the root of the project to the path to import from src
import sys
sys.path.append('../')

# Necessary imports for basic similarity calculation
from sentence_transformers import util
import torch
import pandas as pd

# Embedders
from src.embedders import CountEmbeddor, TFIDFEmbeddor, BERTEmbeddor, SentenceTransformerEmbeddor, EmbeddorBase

# Reloading of embedders when the file changes
%load_ext autoreload
%autoreload 1
%aimport src.embedders




In [10]:
# Declare each embedder
models: list[EmbeddorBase] = [
    CountEmbeddor(),
    TFIDFEmbeddor(),
    BERTEmbeddor(),
    SentenceTransformerEmbeddor(),
]

First, let's see how the embedders work on a simple example.

In [11]:
import pandas as pd

# Create a dataframe with some sample reviews
df = pd.DataFrame(
    {
        "text": [
            "Sweet nutty flavours",
            "This is a different sentence",
            "Piss yellow beer",
            "Not sweet enough",
        ]
    }
)

# Loop through each model and add the embeddings to the dataframe
for model in models:
    df[model.name] = model.transform(df["text"]).tolist()
df.head()


index,text,CountEmbeddor,TFIDFEmbeddor,BERTEmbeddor,SentenceTransformerEmbeddor
0,Sweet nutty flavours,"[0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0]","[0.0, 0.0, 0.0, 0.6176143709756019, 0.0, 0.0, ...","[-0.4556387960910797, -0.5342035293579102, 0.1...","[-0.04542968049645424, -0.11505721509456635, 0..."
1,This is a different sentence,"[0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0]","[0.0, 0.5, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0, 0.5, ...","[-0.19627796113491058, -0.5809494853019714, 0....","[0.045435935258865356, 0.03793025761842728, 0...."
2,Piss yellow beer,"[1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1]","[0.5773502691896257, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.04581831395626068, 0.7215461134910583, 0.1...","[-0.033432573080062866, 0.057189930230379105, ..."
3,Not sweet enough,"[0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0]","[0.0, 0.0, 0.6176143709756019, 0.0, 0.0, 0.617...","[-0.41678184270858765, -0.3863939940929413, -0...","[-0.039078567177057266, -0.05166036635637283, ..."
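To make the `CountEmbeddor` column concrete, here is a minimal bag-of-words sketch that reproduces the count vectors shown in the table. It is an illustration only: the source does not show how `CountEmbeddor` is implemented, and the tokenization rule (lowercase, tokens of two or more word characters, alphabetically sorted vocabulary) is an assumption chosen because it matches the output above. The helper names `tokenize` and `count_vector` are hypothetical.

```python
import re

texts = [
    "Sweet nutty flavours",
    "This is a different sentence",
    "Piss yellow beer",
    "Not sweet enough",
]


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase, and keep only tokens of two or more word
    # characters, so the single-letter word "a" is dropped.
    return re.findall(r"\b\w\w+\b", text.lower())


# Vocabulary: every distinct token in the corpus, sorted alphabetically.
vocabulary = sorted({token for text in texts for token in tokenize(text)})


def count_vector(text: str) -> list[int]:
    # One dimension per vocabulary word, holding that word's count in the text.
    tokens = tokenize(text)
    return [tokens.count(word) for word in vocabulary]


count_vectors = [count_vector(text) for text in texts]
for text, vector in zip(texts, count_vectors):
    print(text, vector)
```

Under these assumptions the vocabulary has 12 entries, which is why each count vector has 12 dimensions, and "Sweet nutty flavours" and "Not sweet enough" overlap only in the dimension for "sweet".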


With these embeddings we can compute the cosine similarity between any two reviews.

In [20]:
def get_similarity(review1: str, review2: str) -> dict[str, float]:
    """Computes the similarity between two reviews, returning one score per model name"""
    texts = [review1, review2]
    similarities = {}
    for model in models:
        embeddings = model.transform(texts).tolist()
        similarities[model.name] = util.cos_sim(
            torch.Tensor(embeddings[0]), torch.Tensor(embeddings[1])
        ).item()
    return similarities

df_methods = pd.DataFrame(index=[model.name for model in models])

# Compute the similarity between the first and the nth sentence.
# Wrapping the result in a Series aligns each score with its model name in the index.
for i in range(1, len(df)):
    df_methods["Similarity " + str(i + 1)] = pd.Series(
        get_similarity(df["text"][0], df["text"][i])
    )

print("Texts: " + str(df["text"].values))
print("Similarity between first and nth sentence:")
df_methods.head()

Texts: ['Sweet nutty flavours' 'This is a different sentence' 'Piss yellow beer'
 'Not sweet enough']
Similarity between first and nth sentence:


model,Similarity 2,Similarity 3,Similarity 4
CountEmbeddor,0.0,0.0,0.333333
TFIDFEmbeddor,0.0,0.0,0.201993
BERTEmbeddor,0.51483,0.68803,0.661113
SentenceTransformerEmbeddor,0.094743,0.236396,0.527919
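The `CountEmbeddor` scores can be checked by hand, which helps explain why the sparse models return exactly 0 for sentences that share no words. The sketch below is a plain-Python cosine similarity (not the `util.cos_sim` call used in the notebook) applied to the count vectors copied from the first table. The function and variable names are chosen for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    # cos(a, b) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Count vectors copied from the first table in this notebook
sweet_nutty = [0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0]  # "Sweet nutty flavours"
different = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0]    # "This is a different sentence"
not_sweet = [0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0]    # "Not sweet enough"

sim_disjoint = cosine_similarity(sweet_nutty, different)
sim_shared = cosine_similarity(sweet_nutty, not_sweet)

print(sim_disjoint)          # 0.0
print(round(sim_shared, 6))  # 0.333333
```

"Sweet nutty flavours" and "Not sweet enough" share one word out of three each, so the score is 1 / (√3 · √3) = 1/3. A count-based model therefore rates these two reviews as its most similar pair even though they express opposite opinions about sweetness.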


In [18]:


print(get_similarity("Sweet nutty flavours", "This is a different sentence"))

{'CountEmbeddor': 0.0, 'TFIDFEmbeddor': 0.0, 'BERTEmbeddor': 0.514830470085144, 'SentenceTransformerEmbeddor': 0.09474346786737442}
