### Embedding model performance evaluation - Cosine Similarity

In [19]:
import numpy as np
from sklearn.preprocessing import normalize
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

pd.set_option('display.max_colwidth', None)  # show full text (no truncation)

In [14]:
# Load samples
samples_df = pd.read_parquet("data/sample_reviews.parquet")

In [15]:
samples_df.iloc[12:14, :]['text']
# 12: amazing food, amazing service, food recommendations
# 13: bad service, bad food
# the sentiment differs, but the topic (food and service) is the same

12                       Amazing food and service ! Get the mushrooms! And wedge salad was great. Filet was cooked perfectly.
13    Bruh the food was so got dam bad my meat wasn't even good Bad service literally we had to wait for our food like twice
Name: text, dtype: object

#### Load embeddings generated

In [16]:
# Define paths (adjust to your filenames)
model_paths = {
    "Glove300": "glove_300_embeddings.npy",
    "Word2Vec": "word_2_vec_embeddings.npy",
    "RoBERTa": "all_distill_roberta_v1_embeddings.npy",
    "HKU_INSTRUCTOR": "hku_nlp_instructor_embeddings.npy",
    "MiniLM-L6": "all_mini_LM_L6_v2_embeddings.npy",
    "MiniLM-L12": "all_mini_LM_L12_v2_embeddings.npy",
    "Paraphrase-MPNET": "paraphrase_mpnet_base_v2_embeddings.npy",
    "Paraphrase-MiniLM-L3": "paraphrase_mini_LM_L3_v2_embeddings.npy",
    "Open-AI-Text-Embedding": "open_ai_text_small_embeddings.npy"
}

# Load and normalize embeddings
models = {}
for name, path in model_paths.items():
    emb = np.load(f"embeddings/{path}")
    models[name] = normalize(emb)
    print(f"{name:25s} loaded: {emb.shape}")

Glove300                  loaded: (1000, 300)
Word2Vec                  loaded: (1000, 300)
RoBERTa                   loaded: (1000, 768)
HKU_INSTRUCTOR            loaded: (1000, 768)
MiniLM-L6                 loaded: (1000, 384)
MiniLM-L12                loaded: (1000, 384)
Paraphrase-MPNET          loaded: (1000, 768)
Paraphrase-MiniLM-L3      loaded: (1000, 384)
Open-AI-Text-Embedding    loaded: (1000, 1536)
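
Since `normalize` rescales every row to unit L2 length by default, the cosine similarity between two rows reduces to a plain dot product. The sketch below illustrates this with NumPy only; the array `emb` is synthetic and stands in for one of the loaded matrices.

```python
import numpy as np

# Synthetic stand-in for an embedding matrix (5 texts, 8 dimensions)
rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 8))

# Row-wise L2 normalization, written out by hand
norms = np.linalg.norm(emb, axis=1, keepdims=True)
unit = emb / norms

# Every row now has length 1
row_norms = np.linalg.norm(unit, axis=1)

# Cosine similarity from its definition, on the raw vectors
cos_sims = (emb @ emb.T) / (norms * norms.T)

# For unit vectors the dot product gives the same matrix
dot_sims = unit @ unit.T
```

Calling `cosine_similarity` on already normalized vectors is therefore harmless, just slightly redundant.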


#### Test 1: Capturing sentiment

In [24]:
samples_df.iloc[[12,13], :]['text']

12                       Amazing food and service ! Get the mushrooms! And wedge salad was great. Filet was cooked perfectly.
13    Bruh the food was so got dam bad my meat wasn't even good Bad service literally we had to wait for our food like twice
Name: text, dtype: object

In [22]:
results = []
review_example1_index = 12
review_example2_index = 13

for model_name, model_embedding in models.items():
    # select the two review vectors
    vec_a = model_embedding[review_example1_index].reshape(1, -1)
    vec_b = model_embedding[review_example2_index].reshape(1, -1)
    
    # compute cosine similarity
    sim = cosine_similarity(vec_a, vec_b)[0][0]
    
    results.append({"Model": model_name, "Cosine Similarity": sim})

similarity_df = pd.DataFrame(results).sort_values(by="Cosine Similarity", ascending=False)
similarity_df

| | Model | Cosine Similarity |
|---|---|---|
| 1 | Word2Vec | 0.999989 |
| 0 | Glove300 | 0.640628 |
| 3 | HKU_INSTRUCTOR | 0.629629 |
| 2 | RoBERTa | 0.452794 |
| 5 | MiniLM-L12 | 0.422687 |
| 6 | Paraphrase-MPNET | 0.418015 |
| 4 | MiniLM-L6 | 0.4067 |
| 7 | Paraphrase-MiniLM-L3 | 0.365562 |
| 8 | Open-AI-Text-Embedding | 0.335018 |


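** As expected, the contextual embeddings capture the difference in sentiment better than their static counterparts (Word2Vec, Glove300): every contextual model scores this pair lower than Glove300 does, while Word2Vec rates the two reviews as nearly identical (0.999989).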
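
The same comparison loop is reused for every test pair, so it can be wrapped in a small helper. This is only a sketch: `compare_pair` is a hypothetical name, and `toy_models` is made-up data standing in for the `models` dict.

```python
import numpy as np
import pandas as pd

def compare_pair(models, idx_a, idx_b):
    """Return one row per model with the cosine similarity of two rows."""
    rows = []
    for name, emb in models.items():
        a, b = emb[idx_a], emb[idx_b]
        # Explicit cosine, so it also works on non-normalized vectors
        sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
        rows.append({"Model": name, "Cosine Similarity": sim})
    return pd.DataFrame(rows).sort_values(by="Cosine Similarity", ascending=False)

# Toy stand-ins for the real embedding matrices (hypothetical values)
toy_models = {
    "model_a": np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]),
    "model_b": np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 2.0]]),
}
toy_result = compare_pair(toy_models, 0, 1)
print(toy_result)
```

With the real data, a call such as `compare_pair(models, 12, 13)` should give the same ranking as the cell above.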

#### Test 2: Different tone, different reviews

In [30]:
samples_df.iloc[[337, 492], :]['text']

337    Great brews if you like IPA's and spurs.  I wish there were some different options, but the sours were really good.  Below is the Cherry fluff and it was a mix of sour, fruity, and he had some vanilla in it to smooth it out.  The staff was super friendly.  I definitely would recommend checking it out.
492    My check engine light was on and when I went to get gas and my car wouldn't start. I got my car jumped to go over to this shop and made the owner aware that my che

In [31]:
results = []
review_example1_index = 337
review_example2_index = 492

for model_name, model_embedding in models.items():
    # select the two review vectors
    vec_a = model_embedding[review_example1_index].reshape(1, -1)
    vec_b = model_embedding[review_example2_index].reshape(1, -1)
    
    # compute cosine similarity
    sim = cosine_similarity(vec_a, vec_b)[0][0]
    
    results.append({"Model": model_name, "Cosine Similarity": sim})

similarity_df = pd.DataFrame(results).sort_values(by="Cosine Similarity", ascending=False)
similarity_df

| | Model | Cosine Similarity |
|---|---|---|
| 1 | Word2Vec | 0.999986 |
| 0 | Glove300 | 0.631834 |
| 3 | HKU_INSTRUCTOR | 0.365408 |
| 6 | Paraphrase-MPNET | 0.180938 |
| 2 | RoBERTa | 0.145081 |
| 7 | Paraphrase-MiniLM-L3 | 0.138384 |
| 5 | MiniLM-L12 | 0.125265 |
| 8 | Open-AI-Text-Embedding | 0.093781 |
| 4 | MiniLM-L6 | 0.089851 |


** The two reviews are about completely different things (a brewery visit and a car problem). Word2Vec and Glove300 perform worse here, probably because the reviews still share many common words (after lemmatization).
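
A Word2Vec score near 0.99999 for both pairs so far may also point to a large component shared by all averaged vectors, not only to overlapping words. That is an assumption the notebook does not verify. One rough check is to measure the average pairwise cosine before and after subtracting the mean vector. The sketch below uses synthetic data, and `mean_offdiag_cosine` is a hypothetical helper.

```python
import numpy as np

def mean_offdiag_cosine(emb):
    """Average cosine similarity over all distinct pairs of rows."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = len(emb)
    # Drop the diagonal (self-similarity is always 1)
    return (sims.sum() - np.trace(sims)) / (n * (n - 1))

# Synthetic embeddings: per-text signal plus a large shared offset
rng = np.random.default_rng(0)
signal = rng.normal(size=(200, 50))
shared = np.full(50, 20.0)  # hypothetical common component
emb = signal + shared

before = mean_offdiag_cosine(emb)

# Subtract the mean vector, then measure again
centered = emb - emb.mean(axis=0, keepdims=True)
after = mean_offdiag_cosine(centered)

print(f"mean pairwise cosine before centering: {before:.4f}")
print(f"mean pairwise cosine after centering:  {after:.4f}")
```

If the real Word2Vec matrix behaves like the synthetic one, its scores would spread out after centering; if they stay near 1, the cause lies elsewhere.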

#### Test 3: Negation

In [32]:
samples_df.iloc[[32, 887], :]['text']

32                                                                   This place was recommended by locals and now I know why. The burger was beyond great and made to order. The fries were just as good and they have draft local beer. Wow! I highly recommend this place.
887    this place is really not good. some of the meat tasted like it had no seasoning/gamey. rice is horrible and bean is gross. food came out slow and seemed like the man that is at the register hate being there and does not seem to be friendly. wont be coming back.
Name: text, dtype: object

In [33]:
results = []
review_example1_index = 32
review_example2_index = 887

for model_name, model_embedding in models.items():
    # select the two review vectors
    vec_a = model_embedding[review_example1_index].reshape(1, -1)
    vec_b = model_embedding[review_example2_index].reshape(1, -1)
    
    # compute cosine similarity
    sim = cosine_similarity(vec_a, vec_b)[0][0]
    
    results.append({"Model": model_name, "Cosine Similarity": sim})

similarity_df = pd.DataFrame(results).sort_values(by="Cosine Similarity", ascending=False)
similarity_df

| | Model | Cosine Similarity |
|---|---|---|
| 1 | Word2Vec | 0.99999 |
| 0 | Glove300 | 0.803961 |
| 5 | MiniLM-L12 | 0.573196 |
| 4 | MiniLM-L6 | 0.536775 |
| 6 | Paraphrase-MPNET | 0.527897 |
| 3 | HKU_INSTRUCTOR | 0.494168 |
| 2 | RoBERTa | 0.452413 |
| 7 | Paraphrase-MiniLM-L3 | 0.34795 |
| 8 | Open-AI-Text-Embedding | 0.296729 |


** Word2Vec fails to capture the negative sentiment (0.99999); the contextual embeddings place these two reviews much further apart, which I agree with.
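
Raw cosine values are hard to compare across models, since each model has its own typical range: in Tests 1 to 3, Glove300 stays between 0.63 and 0.80, while MiniLM-L6 ranges from 0.09 to 0.54. One way to put models on a common footing is to ask where a pair's similarity falls among all pairwise similarities for that model. The sketch below is not part of the original analysis; `pair_percentile` is a hypothetical helper and the array is toy data.

```python
import numpy as np

def pair_percentile(emb, idx_a, idx_b):
    """Percent of distinct pairs whose cosine is below that of (idx_a, idx_b)."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = unit @ unit.T
    # Upper triangle without the diagonal: each pair counted once
    upper = sims[np.triu_indices(len(emb), k=1)]
    return float((upper < sims[idx_a, idx_b]).mean() * 100)

# Toy embedding matrix (hypothetical values)
toy_emb = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 0.1],
    [-1.0, 0.0],
])

closest = pair_percentile(toy_emb, 0, 2)   # nearly parallel rows
farthest = pair_percentile(toy_emb, 0, 3)  # opposite rows
print(closest, farthest)
```

With the real data this could be called as `pair_percentile(models["Glove300"], 32, 887)`; a low percentile would mean the model treats the pair as unusually dissimilar relative to its own scale.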

#### Test 4: Different words, same expressed sentiment

In [34]:
samples_df.iloc[[44, 172], :]['text']

44     All the food was absolutely excellent! The menu is so unique and amazing we loved everything we got. Even just simple chicken was so crispy and juicy but the mahi was one of the best dishes I have had in a while. I would recommend this place and the mahi yo everyone!
172                                                                                                            Fantastic Service every item that was presented was tasty and delicious the stuffed mushrooms and tea sandwiches were phenomenal the ambiance is cozy yet elegant !
Name: text, dtype: object

In [35]:
results = []
review_example1_index = 44
review_example2_index = 172

for model_name, model_embedding in models.items():
    # select the two review vectors
    vec_a = model_embedding[review_example1_index].reshape(1, -1)
    vec_b = model_embedding[review_example2_index].reshape(1, -1)
    
    # compute cosine similarity
    sim = cosine_similarity(vec_a, vec_b)[0][0]
    
    results.append({"Model": model_name, "Cosine Similarity": sim})

similarity_df = pd.DataFrame(results).sort_values(by="Cosine Similarity", ascending=False)
similarity_df

| | Model | Cosine Similarity |
|---|---|---|
| 1 | Word2Vec | 0.999986 |
| 0 | Glove300 | 0.78995 |
| 3 | HKU_INSTRUCTOR | 0.771921 |
| 2 | RoBERTa | 0.652833 |
| 6 | Paraphrase-MPNET | 0.604493 |
| 5 | MiniLM-L12 | 0.581455 |
| 4 | MiniLM-L6 | 0.560633 |
| 8 | Open-AI-Text-Embedding | 0.496293 |
| 7 | Paraphrase-MiniLM-L3 | 0.474207 |
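
To see how much each model's score moves between the four pairs, the reported similarities can be collected into one table. The values below are copied from the four result tables; the `Range` column (maximum minus minimum per model) is an illustrative summary added here, not a metric from the original analysis.

```python
import pandas as pd

# Cosine similarities as reported for Test 1 to Test 4
reported = {
    "Glove300":               [0.640628, 0.631834, 0.803961, 0.78995],
    "Word2Vec":               [0.999989, 0.999986, 0.99999, 0.999986],
    "RoBERTa":                [0.452794, 0.145081, 0.452413, 0.652833],
    "HKU_INSTRUCTOR":         [0.629629, 0.365408, 0.494168, 0.771921],
    "MiniLM-L6":              [0.4067, 0.089851, 0.536775, 0.560633],
    "MiniLM-L12":             [0.422687, 0.125265, 0.573196, 0.581455],
    "Paraphrase-MPNET":       [0.418015, 0.180938, 0.527897, 0.604493],
    "Paraphrase-MiniLM-L3":   [0.365562, 0.138384, 0.34795, 0.474207],
    "Open-AI-Text-Embedding": [0.335018, 0.093781, 0.296729, 0.496293],
}
tests = ["Test 1", "Test 2", "Test 3", "Test 4"]
scores = pd.DataFrame.from_dict(reported, orient="index", columns=tests)

# Spread of each model's scores across the four pairs
scores["Range"] = scores[tests].max(axis=1) - scores[tests].min(axis=1)

summary = scores.sort_values(by="Range")
print(summary)
```

A model whose score barely changes between an unrelated pair (Test 2) and a same-sentiment pair (Test 4) is not separating the reviews at all, which is the pattern Word2Vec shows here.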
