# Embedding Models

This notebook explores the embedding models we could use in the project. The code implementing these models lives in `/src/embedders` and is imported into this notebook.

This document goes over each embedding model, explaining how it works and its potential strengths and weaknesses.

In [38]:
# Import sys and use it to add the root of the project to the path to import from src
import sys
import os
sys.path.append('../')

# Necessary imports for basic similarity calculation
from sentence_transformers import util
import torch
import pandas as pd

# Embedders
from src.embedders import CountEmbeddor, TFIDFEmbeddor, BERTEmbeddor, SentenceTransformerEmbeddor, EmbeddorBase
from src import utils

# Reloading of embedders when the file changes
%load_ext autoreload
%autoreload 1
%aimport src.embedders
%aimport src


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [24]:
# Declare each embedder
models: list[EmbeddorBase] = [
    CountEmbeddor(),
    TFIDFEmbeddor(),
    BERTEmbeddor(),
    SentenceTransformerEmbeddor(),
]

First, let's see how the embedders work on a simple example.

In [57]:
import pandas as pd

# Create a dataframe with some sample reviews
df = pd.DataFrame(
    {
        "text": [
            "The beer is nice, with sweet nutty flavours",
            "This is a very different sentence",
            "Not sweet enough. I like my beer sweet. ",
            "Not sweet at all. Terrible beer. ",
            "Not sweet at all. But I like bitter beers so it is a nice beer. ",
            "Piss yellow beer",
            "Sweet beer"
        ]
    }
)

# Loop through each model and add the embeddings to the dataframe
for model in models:
    df[model.name] = model.transform(df["text"]).tolist()
df.head()


index,text,CountEmbeddor,TFIDFEmbeddor,BERTEmbeddor,SentenceTransformerEmbeddor
0,"The beer is nice, with sweet nutty flavours","[0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, ...","[0.0, 0.0, 0.19880075548313855, 0.0, 0.0, 0.0,...","[-0.13595613837242126, -0.12485707551240921, 0...","[-0.03745955228805542, -0.04137440770864487, 0..."
1,This is a very different sentence,"[0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.4712248263552...","[-0.4844885468482971, -0.45589154958724976, 0....","[0.063898965716362, 0.035758040845394135, 0.01..."
2,Not sweet enough. I like my beer sweet.,"[0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, ...","[0.0, 0.0, 0.2218918546331659, 0.0, 0.0, 0.0, ...","[0.1911739706993103, 0.1460820585489273, 0.283...","[-0.010137715376913548, -0.03818701580166817, ..."
3,Not sweet at all. Terrible beer.,"[1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...","[0.45028691074908056, 0.45028691074908056, 0.2...","[-0.46719881892204285, 0.5336718559265137, -0....","[-0.018374629318714142, -0.0001956572814378887..."
4,Not sweet at all. But I like bitter beers so i...,"[1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, ...","[0.2724910606076718, 0.2724910606076718, 0.155...","[-0.2824631333351135, 0.30220910906791687, 0.1...","[-0.025126323103904724, -0.04645886644721031, ..."


Let's go over how each one works.

- CountEmbeddor uses sklearn's [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). Count vectorisation assigns each word in the vocabulary its own position in the feature vector, and the value at that position is the number of times the word occurs in the text.

- TFIDFEmbeddor is similar to CountEmbeddor, but also multiplies each count by an 'inverse document frequency' term. This weights each word in the vocabulary inversely to how many documents in the corpus contain it: very common words are penalised, and rarer words are given more weight. It uses sklearn's [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

*NB: Unlike the other embeddors, both CountEmbeddor and TFIDFEmbeddor require the **entire** list of reviews to be passed at once in their current implementation. You cannot pass in reviews separately, since the vectoriser must be fitted to the entire vocabulary before transforming any reviews. This could be changed in future by passing a list of reviews into the constructor of the embeddors for fitting, and transforming only when called. However, given that these models are simple and fast, this doesn't feel necessary.*

- BERTEmbeddor uses `bert-base-uncased` [from HuggingFace](https://huggingface.co/bert-base-uncased). BERT is a bidirectional encoder-only transformer. There are many options for extracting embeddings from the model, since there are 12 layers and an embedding for each input token. Currently, the implementation takes the penultimate hidden state of the model and averages it across all tokens in the input (see [this guide](https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/)).

- SentenceTransformerEmbeddor uses the recommended `all-MiniLM-L6-v2` model from the `sentence-transformers` [library](https://www.sbert.net/docs/pretrained_models.html). These models take the token outputs of a BERT-style transformer, pool them in a similar way to the above (by default, the mean of the last layer), and are trained on various sentence-level NLP tasks using [Siamese networks](https://towardsdatascience.com/a-friendly-introduction-to-siamese-networks-85ab17522942).
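
The mean-pooling step that both transformer embeddors rely on can be sketched without loading any model. The token vectors below are made-up toy values, not real BERT outputs, and `mean_pool` is a hypothetical helper rather than the function in `/src/embedders`. Skipping padded positions via an attention mask is an assumption on my part; the text above only says the mean is taken across all input tokens.

```python
def mean_pool(token_vectors: list[list[float]], attention_mask: list[int]) -> list[float]:
    """Average token vectors, skipping padded positions (mask == 0)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask excludes every token")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy "hidden state" for one sentence: 4 token positions, 3 dimensions each.
hidden_state = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 4.0],
    [9.0, 9.0, 9.0],  # padding position, ignored via the mask
]
attention_mask = [1, 1, 1, 0]

sentence_embedding = mean_pool(hidden_state, attention_mask)
print(sentence_embedding)  # [2.0, 2.0, 2.0]
```

Whichever layer the token vectors come from (penultimate for BERTEmbeddor, last for the sentence-transformers default), the result is one fixed-length vector per review regardless of how many tokens the review has.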

We can now compare the models' behaviour with desired behaviour using cosine similarity.
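
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The cells below use `util.cos_sim` from `sentence-transformers`; the following is only a minimal pure-Python sketch of the same quantity. The function name `cosine_similarity` and the hand-built count vectors are illustrative assumptions, and returning 0.0 for an all-zero vector is a convention I chose here.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


# Hand-built count vectors over the (assumed) vocabulary
# [beer, flavours, is, nice, nutty, sweet, the, with]
first_review = [1, 1, 1, 1, 1, 1, 1, 1]  # "The beer is nice, with sweet nutty flavours"
sweet_beer = [1, 0, 0, 0, 0, 1, 0, 0]    # "Sweet beer"

similarity = cosine_similarity(first_review, sweet_beer)
print(round(similarity, 6))  # 0.5
```

Two shared words out of eight and two give 2 / (sqrt(8) * sqrt(2)) = 0.5, which agrees with the value CountEmbeddor reports for "Sweet beer" in the results table below.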

In [59]:
def get_similarity(review1: str, review2: str) -> dict[str, float]:
    """Computes the similarity between two reviews using all the models"""
    texts = [review1, review2]
    similarities = {}
    for model in models:
        embeddings = model.transform(texts).tolist()
        similarities[model.name] = util.cos_sim(
            torch.Tensor(embeddings[0]), torch.Tensor(embeddings[1])
        ).item()
    return similarities

df_methods = pd.DataFrame(index=[model.name for model in models])

# Compute the similarity between the first and the nth sentence
for i in range(1, len(df)):
    df_methods["Similarity " + str(i+1)] = get_similarity(df["text"][0], df["text"][i]).values()

print("Texts: \n" + "\n".join(df["text"].values.tolist()))
print("\nSimilarity between first and nth sentence:")
df_methods.head()

Texts: 
The beer is nice, with sweet nutty flavours
This is a very different sentence
Not sweet enough. I like my beer sweet. 
Not sweet at all. Terrible beer. 
Not sweet at all. But I like bitter beers so it is a nice beer. 
Piss yellow beer
Sweet beer

Similarity between first and nth sentence:


model,Similarity 2,Similarity 3,Similarity 4,Similarity 5,Similarity 6,Similarity 7
CountEmbeddor,0.158114,0.353553,0.288675,0.392232,0.204124,0.5
TFIDFEmbeddor,0.087044,0.224413,0.170776,0.248458,0.116718,0.379978
BERTEmbeddor,0.617567,0.79183,0.791252,0.828123,0.721489,0.776372
SentenceTransformerEmbeddor,0.094767,0.667616,0.69976,0.718871,0.334227,0.650784


Here we can see the obvious pitfall of count vectorisation and tf-idf: they lose all context. If, during the pipeline, we were to group beers by some measure that affects their sweetness, then to confirm our hypothesis we would want to see higher similarity within each group. Negations can undermine this: "sweet" and "not sweet" share the same word, so bag-of-words vectors cannot tell a review praising sweetness from one complaining about its absence.
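
The loss of context can be made concrete with a small pure-Python bag-of-words sketch. It assumes that CountEmbeddor keeps `CountVectorizer`'s default tokenisation (lowercased tokens of two or more word characters); the helper names `bag_of_words` and `cosine` are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    # Lowercase, then keep tokens of two or more word characters
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    # Only words present in both bags contribute to the dot product
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


positive = "The beer is nice, with sweet nutty flavours"
negated = "Not sweet at all. Terrible beer. "
unrelated = "This is a very different sentence"

sim_negated = cosine(bag_of_words(positive), bag_of_words(negated))
sim_unrelated = cosine(bag_of_words(positive), bag_of_words(unrelated))

print(round(sim_negated, 6))    # 0.288675
print(round(sim_unrelated, 6))  # 0.158114
```

Both numbers match the CountEmbeddor row above (Similarity 4 and Similarity 2). The negated review scores higher than the unrelated sentence purely because it shares "sweet" and "beer" with the first review; the word "not" does nothing to flip the meaning.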

However, BERT and SentenceTransformers are not necessarily better. Their values are far less interpretable, and the sentence embeddor falls into a similar negation trap: it scores the negated reviews (0.67 to 0.72) at least as high as "Sweet beer" (0.65). Interestingly, SentenceTransformer was far better than BERT at separating sentences on a different topic (0.09 for the unrelated sentence, against BERT's 0.62). Perhaps with many descriptors sentence-transformer would do very well. BERT's scores are all broadly similar (0.62 to 0.83); it roughly gets it right, but I have little faith that this translates any better than sentence-transformer to the real reviews.

However, for now, I think tf-idf is our best shot. It is the most interpretable (we can extract the most impactful words at the end), and if there are enough reviews and they are long enough, we should see a meaningful vocabulary emerge. If the tf-idf embeddings seem to be limiting us in the future, we can experiment further with other methods.
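
To illustrate what "getting out the most impactful words" could look like, here is a minimal pure-Python sketch. It mirrors the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and the L2 normalisation that `TfidfVectorizer` applies by default, as far as I understand them. The helpers `tfidf_weights` and `top_terms` are hypothetical and are not part of `TFIDFEmbeddor`.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Lowercase, then keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf_weights(corpus: list[str]) -> list[dict[str, float]]:
    """Return one {term: weight} dict per document (smoothed idf, L2-normalised)."""
    counts_per_doc = [Counter(tokenize(doc)) for doc in corpus]
    n_docs = len(corpus)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for counts in counts_per_doc for term in counts)
    weights = []
    for counts in counts_per_doc:
        raw = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        weights.append({term: v / norm for term, v in raw.items()})
    return weights


def top_terms(weights: dict[str, float], k: int = 3) -> list[str]:
    # Highest weight first; ties broken alphabetically for stable output
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


corpus = ["Sweet beer", "Piss yellow beer", "Not sweet at all. Terrible beer. "]
weights = tfidf_weights(corpus)
for doc, doc_weights in zip(corpus, weights):
    print(doc.strip(), "->", top_terms(doc_weights, k=2))
```

In this toy corpus "beer" appears in every document, so it always receives the lowest weight, while words unique to one review rise to the top. Note that without stop-word filtering, function words such as "all" and "at" can rank highly too.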

Now let's try with some real sample reviews.

In [43]:
ROOT_DIR = os.getcwd()
DATA_DIR = os.path.join(ROOT_DIR, "../data")

sample_reviews = utils.load_data(DATA_DIR, num_samples=4)

In [50]:
reviews = sample_reviews['review']['text'].tolist()

print("\n".join(reviews))

df = pd.DataFrame(
    {
        "text": reviews
    }
)

# Loop through each model and add the embeddings to the dataframe
for model in models:
    df[model.name] = model.transform(df["text"]).tolist()
print("Embeddings:")
df



From a bottle, pours a piss yellow color with a fizzy white head.  This is carbonated similar to soda.The nose is basic.. malt, corn, a little floral, some earthy straw.  The flavor is boring, not offensive, just boring.  Tastes a little like corn and grain.  Hard to write a review on something so simple.Its ok, could be way worse.
Pours pale copper with a thin head that quickly goes. Caramel, golden syrup nose. Taste is big toasty, grassy hops backed by dark fruit, candy corn and brack malts. Clingy. Dries out at the end with more hops. Brave, more going on that usual for this type.
500ml Bottle bought from The Vintage, Antrim...Poured a golden yellow / orange colour... White head poured quite thick and foamy and faded to thin layer...Aroma - Fruity (burnt orange, some apple hints), light maltiness, spicy hops, vanilla, some sea saltiness...Taste - Spicy / peppery hop notes, citrusy, light sweetness, grassy, slight creaminess, some bready notes...Feel - Quite sharp and pretty dry. Lig

index,text,CountEmbeddor,TFIDFEmbeddor,BERTEmbeddor,SentenceTransformerEmbeddor
0,"From a bottle, pours a piss yellow color with ...","[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, ...","[0.0, 0.0, 0.07052336179037168, 0.0, 0.0, 0.0,...","[-0.09374433755874634, 0.11626264452934265, 0....","[-0.02019408345222473, -0.056765299290418625, ..."
1,Pours pale copper with a thin head that quickl...,"[0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, ...","[0.0, 0.0, 0.0805288704304574, 0.0, 0.0, 0.0, ...","[-0.32329174876213074, 0.22151388227939606, 0....","[-0.03042781352996826, -0.047437869012355804, ..."
2,"500ml Bottle bought from The Vintage, Antrim.....","[1, 0, 3, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, ...","[0.08614715820209162, 0.0, 0.1710597639733336,...","[-0.5143056511878967, -0.01398462150245905, 0....","[-0.01840173825621605, -0.04508772864937782, 0..."
3,Serving: 500ml brown bottlePour: Good head wit...,"[1, 1, 2, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, ...","[0.08193284252694036, 0.10392142170524836, 0.1...","[-0.33155936002731323, 0.016586432233452797, 0...","[-0.006383390631526709, -0.05744396895170212, ..."


In [52]:
df_methods = pd.DataFrame(index=[model.name for model in models])

# Compute the similarity between the first and the nth sentence
for i in range(1, len(df)):
    df_methods["Similarity " + str(i+1)] = get_similarity(df["text"][0], df["text"][i]).values()

print("Texts: \n" + "\n".join(df["text"].values.tolist()))
print("Similarity between first and nth sentence:")
df_methods.head()

Texts: 
From a bottle, pours a piss yellow color with a fizzy white head.  This is carbonated similar to soda.The nose is basic.. malt, corn, a little floral, some earthy straw.  The flavor is boring, not offensive, just boring.  Tastes a little like corn and grain.  Hard to write a review on something so simple.Its ok, could be way worse.
Pours pale copper with a thin head that quickly goes. Caramel, golden syrup nose. Taste is big toasty, grassy hops backed by dark fruit, candy corn and brack malts. Clingy. Dries out at the end with more hops. Brave, more going on that usual for this type.
500ml Bottle bought from The Vintage, Antrim...Poured a golden yellow / orange colour... White head poured quite thick and foamy and faded to thin layer...Aroma - Fruity (burnt orange, some apple hints), light maltiness, spicy hops, vanilla, some sea saltiness...Taste - Spicy / peppery hop notes, citrusy, light sweetness, grassy, slight creaminess, some bready notes...Feel - Quite sharp and pretty 

model,Similarity 2,Similarity 3,Similarity 4
CountEmbeddor,0.242251,0.175406,0.14915
TFIDFEmbeddor,0.143139,0.100011,0.082265
BERTEmbeddor,0.932061,0.88774,0.88353
SentenceTransformerEmbeddor,0.6447,0.596318,0.472954


All the methods agree on the ordering of similarity, which is a good sign. BERT is the least discriminating, with uniformly high values (0.88 to 0.93), as in the previous example. The other three follow the same pattern, just at different magnitudes.
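
The agreement can be checked mechanically. The sketch below copies the values from the table above; `ranking` and `relative_spread` are hypothetical names, and relative spread (range divided by the maximum) is just one rough way I chose to express how well a method separates the reviews.

```python
similarities = {
    "CountEmbeddor": [0.242251, 0.175406, 0.14915],
    "TFIDFEmbeddor": [0.143139, 0.100011, 0.082265],
    "BERTEmbeddor": [0.932061, 0.88774, 0.88353],
    "SentenceTransformerEmbeddor": [0.6447, 0.596318, 0.472954],
}


def ranking(scores: list[float]) -> list[int]:
    # Column positions ordered from most to least similar
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


rankings = {name: ranking(scores) for name, scores in similarities.items()}
all_agree = len({tuple(r) for r in rankings.values()}) == 1

# Range of the scores relative to the largest score, per method
relative_spread = {
    name: (max(scores) - min(scores)) / max(scores)
    for name, scores in similarities.items()
}

print(all_agree)
for name, spread in relative_spread.items():
    print(f"{name}: {spread:.3f}")
```

Every method produces the same ranking, and BERTEmbeddor has by far the smallest relative spread, which is what "least discriminating" means here.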

Reading the reviews, it's very hard to define what the ordering *should* be.