# Methodology of Scoring Semantic Similarity

* In this section, we outline the methodology for scoring reviews by their semantic similarity to a predefined list of emotions. As a preliminary trial, we use 27 reviews of two different beers taken from a single website.

In [50]:
import pandas as pd

file_path = 'sample.xls'
df = pd.read_excel(file_path)
df.head()

Unnamed: 0,beer_name,beer_id,brewery_name,brewery_id,style,abv,date,user_name,user_id,appearance,aroma,palate,taste,overall,rating,text
0,Bullhouse El Capitan,441604,Bullhouse Brewing Company,27006,Imperial Stout,8.0,1500976800,mR_fr0g,108099,4,7,4,7,15,3.7,Bottle shared at Chris & Ruth’s pre GBBF Shi...
1,Bullhouse El Capitan,441604,Bullhouse Brewing Company,27006,Imperial Stout,8.0,1499076000,anstei,288109,2,4,2,4,4,1.6,Bottle at MC Zurich. Pours black with a medium...
2,Bullhouse El Capitan,441604,Bullhouse Brewing Company,27006,Imperial Stout,8.0,1485255600,Beersiveknown,128086,4,8,4,8,15,3.9,Bottle shared with chriso et all and also a bo...
3,Bullhouse El Capitan,441604,Bullhouse Brewing Company,27006,Imperial Stout,8.0,1473847200,The_Osprey,249130,4,8,4,7,16,3.9,August 2016 - Bottle share at Chriso’s Sunda...
4,Bullhouse El Capitan,441604,Bullhouse Brewing Company,27006,Imperial Stout,8.0,1471773600,rlgk,7982,4,7,4,7,14,3.6,"Chris and Ruth’s Shindig 2016, Sunday. Black..."


In [None]:
!pip install sentence-transformers

In [52]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')

reviews = df['text'].tolist()

joy = ['joy']
sadness = ['sadness']
anger = ['anger']
fear = ['fear']
love = ['love']
surprise = ['surprise']

In [None]:
# Computing embeddings for the reviews and for each emotion label
embeddings1 = model.encode(reviews, convert_to_tensor=True)
embeddings2 = model.encode(joy, convert_to_tensor=True)
embeddings3 = model.encode(sadness, convert_to_tensor=True)
embeddings4 = model.encode(anger, convert_to_tensor=True)
embeddings5 = model.encode(fear, convert_to_tensor=True)
embeddings6 = model.encode(love, convert_to_tensor=True)
embeddings7 = model.encode(surprise, convert_to_tensor=True)

#Computing cosine-similarities
joy_scores = util.cos_sim(embeddings1, embeddings2)
sadness_scores = util.cos_sim(embeddings1, embeddings3)
anger_scores = util.cos_sim(embeddings1, embeddings4)
fear_scores = util.cos_sim(embeddings1, embeddings5)
love_scores = util.cos_sim(embeddings1, embeddings6)
surprise_scores = util.cos_sim(embeddings1, embeddings7)
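A minimal sketch of what the cosine-similarity step computes, assuming the embeddings are available as plain NumPy arrays. The helper `cosine_matrix` and the toy vectors are hypothetical and not part of sentence-transformers. The sketch also suggests that several emotion labels could be scored in a single call, giving one row per review and one column per emotion.

```python
import numpy as np

def cosine_matrix(a, b):
    """Return the matrix of cosine similarities between rows of a and rows of b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Scale every row to unit length so the dot product equals the cosine.
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T

# Toy 2-D "embeddings" (made up for illustration): three reviews, two emotions.
review_vecs = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
emotion_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])

# Shape (3, 2): scores[i, j] is the similarity of review i to emotion j.
scores = cosine_matrix(review_vecs, emotion_vecs)
print(scores.shape)
```

With a single emotion label, as in the cell above, the result has shape (N, 1), which is why each score later appears wrapped in a one-element list.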

In [54]:
# Adding the similarity scores as new columns to the DataFrame
df['joy_score'] = joy_scores.tolist()
df['sadness_score'] = sadness_scores.tolist()
df['anger_score'] = anger_scores.tolist()
df['fear_score'] = fear_scores.tolist()
df['love_score'] = love_scores.tolist()
df['surprise_score'] = surprise_scores.tolist()

# Saving the updated DataFrame to a new CSV file
df.to_csv('updated_data.csv', index=False)

In [58]:
df.head()

Unnamed: 0,beer_name,beer_id,brewery_name,brewery_id,style,abv,date,user_name,user_id,appearance,...,taste,overall,rating,text,joy_score,sadness_score,anger_score,fear_score,love_score,surprise_score
0,Bullhouse El Capitan,441604,Bullhouse Brewing Company,27006,Imperial Stout,8.0,1500976800,mR_fr0g,108099,4,...,7,15,3.7,Bottle shared at Chris & Ruth’s pre GBBF Shi...,[0.13793334364891052],[0.09664055705070496],[0.1604452133178711],[0.07894943654537201],[0.21690380573272705],[0.11983683705329895]
1,Bullhouse El Capitan,441604,Bullhouse Brewing Company,27006,Imperial Stout,8.0,1499076000,anstei,288109,2,...,4,4,1.6,Bottle at MC Zurich. Pours black with a medium...,[0.07160473614931107],[0.055134691298007965],[0.05836186930537224],[0.06901727616786957],[0.08915862441062927],[0.11216261237859726]
2,Bullhouse El Capitan,441604,Bullhouse Brewing Company,27006,Imperial Stout,8.0,1485255600,Beersiveknown,128086,4,...,8,15,3.9,Bottle shared with chriso et all and also a bo...,[0.04942207410931587],[0.006872272118926048],[0.03351932018995285],[-0.0033654700964689255],[0.1062934622168541],[0.07568646222352982]
3,Bullhouse El Capitan,441604,Bullhouse Brewing Company,27006,Imperial Stout,8.0,1473847200,The_Osprey,249130,4,...,7,16,3.9,August 2016 - Bottle share at Chriso’s Sunda...,[0.1778445839881897],[0.06544509530067444],[0.14239011704921722],[0.10937458276748657],[0.271146297454834],[0.13629484176635742]
4,Bullhouse El Capitan,441604,Bullhouse Brewing Company,27006,Imperial Stout,8.0,1471773600,rlgk,7982,4,...,7,14,3.6,"Chris and Ruth’s Shindig 2016, Sunday. Black...",[0.16992557048797607],[0.09210918843746185],[0.14046210050582886],[0.06581304967403412],[0.20374217629432678],[0.1373400241136551]
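The score columns above hold one-element lists rather than plain numbers, because each similarity result has one row per review and a single column. A minimal sketch of flattening such columns into floats, assuming pandas is available. The frame `nested` is a toy stand-in that reuses the first two joy and sadness values from the output above.

```python
import pandas as pd

# Toy frame mimicking the nested score columns (values copied from rows 0 and 1 above).
nested = pd.DataFrame({
    "joy_score": [[0.13793334364891052], [0.07160473614931107]],
    "sadness_score": [[0.09664055705070496], [0.055134691298007965]],
})

score_cols = ["joy_score", "sadness_score"]

flat = nested.copy()
for col in score_cols:
    # Each cell is a list with exactly one element; keep that element as a float.
    flat[col] = flat[col].map(lambda cell: cell[0])

print(flat.dtypes)
```

Plain float columns can then be sorted, averaged, or subtracted directly. In the notebook itself, a similar effect could probably be obtained by squeezing the tensor before assignment, for example `joy_scores.squeeze(1).tolist()`.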


## **Preliminary Analysis for Model Selection**

> In this part, our goal is to select the most effective multilingual model for our analysis: one that clearly differentiates between opposite emotions, for which we use joy and sadness.

* We consider the four multilingual models listed below so that reviews written in the different languages present in our dataset are not missed.

* We compare the models by the difference between each review's 'joy' and 'sadness' scores, to determine which one best represents the emotional spectrum.



**Multi-Lingual Models**

> The models of sentence-transformers trained for many languages are listed below:

**distiluse-base-multilingual-cased-v1**: Multilingual knowledge distilled version of multilingual Universal Sentence Encoder. Supports 15 languages: Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, Turkish.

**distiluse-base-multilingual-cased-v2**: Multilingual knowledge distilled version of multilingual Universal Sentence Encoder. This version supports 50+ languages, but performs a bit weaker than the v1 model.

**paraphrase-multilingual-MiniLM-L12-v2**: Multilingual version of paraphrase-MiniLM-L12-v2, trained on parallel data for 50+ languages.

**paraphrase-multilingual-mpnet-base-v2**: Multilingual version of paraphrase-mpnet-base-v2, trained on parallel data for 50+ languages.

Reference: https://www.sbert.net/docs/pretrained_models.html

In [60]:
file_path = 'sample.xls'
df2 = pd.read_excel(file_path)
df2.head()

Unnamed: 0,beer_name,beer_id,brewery_name,brewery_id,style,abv,date,user_name,user_id,appearance,aroma,palate,taste,overall,rating,text
0,Bullhouse El Capitan,441604,Bullhouse Brewing Company,27006,Imperial Stout,8.0,1500976800,mR_fr0g,108099,4,7,4,7,15,3.7,Bottle shared at Chris & Ruth’s pre GBBF Shi...
1,Bullhouse El Capitan,441604,Bullhouse Brewing Company,27006,Imperial Stout,8.0,1499076000,anstei,288109,2,4,2,4,4,1.6,Bottle at MC Zurich. Pours black with a medium...
2,Bullhouse El Capitan,441604,Bullhouse Brewing Company,27006,Imperial Stout,8.0,1485255600,Beersiveknown,128086,4,8,4,8,15,3.9,Bottle shared with chriso et all and also a bo...
3,Bullhouse El Capitan,441604,Bullhouse Brewing Company,27006,Imperial Stout,8.0,1473847200,The_Osprey,249130,4,8,4,7,16,3.9,August 2016 - Bottle share at Chriso’s Sunda...
4,Bullhouse El Capitan,441604,Bullhouse Brewing Company,27006,Imperial Stout,8.0,1471773600,rlgk,7982,4,7,4,7,14,3.6,"Chris and Ruth’s Shindig 2016, Sunday. Black..."


In [None]:
from sentence_transformers import SentenceTransformer, util
model1 = SentenceTransformer('distiluse-base-multilingual-cased-v1')
model2 = SentenceTransformer('distiluse-base-multilingual-cased-v2')
model3 = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
model4 = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')

reviews = df2['text'].tolist()

joy = ['joy']
sadness = ['sadness']

In [57]:
# Computing embeddings for the reviews and for the 'joy' and 'sadness' labels with each model
embeddings_1_a = model1.encode(reviews, convert_to_tensor=True)
embeddings_1_b = model1.encode(joy, convert_to_tensor=True)
embeddings_1_c = model1.encode(sadness, convert_to_tensor=True)

embeddings_2_a = model2.encode(reviews, convert_to_tensor=True)
embeddings_2_b = model2.encode(joy, convert_to_tensor=True)
embeddings_2_c = model2.encode(sadness, convert_to_tensor=True)

embeddings_3_a = model3.encode(reviews, convert_to_tensor=True)
embeddings_3_b = model3.encode(joy, convert_to_tensor=True)
embeddings_3_c = model3.encode(sadness, convert_to_tensor=True)

embeddings_4_a = model4.encode(reviews, convert_to_tensor=True)
embeddings_4_b = model4.encode(joy, convert_to_tensor=True)
embeddings_4_c = model4.encode(sadness, convert_to_tensor=True)

In [61]:
#Computing cosine-similarities for all the models
joy_scores1 = util.cos_sim(embeddings_1_a, embeddings_1_b)
sadness_scores1 = util.cos_sim(embeddings_1_a, embeddings_1_c)

joy_scores2 = util.cos_sim(embeddings_2_a, embeddings_2_b)
sadness_scores2 = util.cos_sim(embeddings_2_a, embeddings_2_c)

joy_scores3 = util.cos_sim(embeddings_3_a, embeddings_3_b)
sadness_scores3 = util.cos_sim(embeddings_3_a, embeddings_3_c)

joy_scores4 = util.cos_sim(embeddings_4_a, embeddings_4_b)
sadness_scores4 = util.cos_sim(embeddings_4_a, embeddings_4_c)

In [62]:
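# Calculate the signed difference between joy and sadness scores for all the models.
# cos_sim returns an (N, 1) tensor, so flatten it to a 1-D NumPy array before assigning it as a column.
df2['score_diff_1'] = (joy_scores1 - sadness_scores1).squeeze(1).cpu().numpy()
df2['score_diff_2'] = (joy_scores2 - sadness_scores2).squeeze(1).cpu().numpy()
df2['score_diff_3'] = (joy_scores3 - sadness_scores3).squeeze(1).cpu().numpy()
df2['score_diff_4'] = (joy_scores4 - sadness_scores4).squeeze(1).cpu().numpy()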

# Saving the updated DataFrame to a new CSV file
df2.to_csv('updated_data2.csv', index=False)


In [64]:
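# Calculating the absolute differences between joy and sadness scores for all the models.
# Each (N, 1) tensor is flattened to a 1-D NumPy array so the column holds plain floats.
df2['joy_sadness_score_abs_diff_1'] = (joy_scores1 - sadness_scores1).abs().squeeze(1).cpu().numpy()
df2['joy_sadness_score_abs_diff_2'] = (joy_scores2 - sadness_scores2).abs().squeeze(1).cpu().numpy()
df2['joy_sadness_score_abs_diff_3'] = (joy_scores3 - sadness_scores3).abs().squeeze(1).cpu().numpy()
df2['joy_sadness_score_abs_diff_4'] = (joy_scores4 - sadness_scores4).abs().squeeze(1).cpu().numpy()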

# Calculating the average of the absolute differences for each model
avg_abs_diff_1 = df2['joy_sadness_score_abs_diff_1'].mean()
avg_abs_diff_2 = df2['joy_sadness_score_abs_diff_2'].mean()
avg_abs_diff_3 = df2['joy_sadness_score_abs_diff_3'].mean()
avg_abs_diff_4 = df2['joy_sadness_score_abs_diff_4'].mean()

# Print the average absolute differences for each model
print("Average Absolute Difference Model 1:", avg_abs_diff_1)
print("Average Absolute Difference Model 2:", avg_abs_diff_2)
print("Average Absolute Difference Model 3:", avg_abs_diff_3)
print("Average Absolute Difference Model 4:", avg_abs_diff_4)

Average Absolute Difference Model 1: 0.034360938
Average Absolute Difference Model 2: 0.046473566
Average Absolute Difference Model 3: 0.12992051
Average Absolute Difference Model 4: 0.0681108


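> In this case, the model that best separates joy from sadness is model 3, *paraphrase-multilingual-MiniLM-L12-v2*: its average absolute difference (about 0.130) is roughly twice that of the next best model (model 4, about 0.068) on this sample of reviews.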
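The selection criterion used above, the mean absolute gap between joy and sadness scores, can be sketched as a small helper. The function names `mean_abs_gap` and `best_model` are hypothetical. The `reported` dictionary simply reuses the four averages printed above, mapped to the model names in the order they were loaded.

```python
def mean_abs_gap(joy_scores, sadness_scores):
    """Average of |joy - sadness| over paired per-review scores."""
    if len(joy_scores) != len(sadness_scores):
        raise ValueError("score lists must have the same length")
    gaps = [abs(j - s) for j, s in zip(joy_scores, sadness_scores)]
    return sum(gaps) / len(gaps)

def best_model(gap_by_model):
    """Return the model name with the largest mean absolute gap."""
    return max(gap_by_model, key=gap_by_model.get)

# Averages reported in the output above (model 1 to model 4).
reported = {
    "distiluse-base-multilingual-cased-v1": 0.034360938,
    "distiluse-base-multilingual-cased-v2": 0.046473566,
    "paraphrase-multilingual-MiniLM-L12-v2": 0.12992051,
    "paraphrase-multilingual-mpnet-base-v2": 0.0681108,
}

winner = best_model(reported)
print(winner)
```

Note that this criterion rewards any large gap, whichever emotion scores higher. The signed `score_diff_*` columns remain useful for checking the direction of the gap for individual reviews.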

In [67]:
df2.head()

Unnamed: 0,beer_name,beer_id,brewery_name,brewery_id,style,abv,date,user_name,user_id,appearance,...,rating,text,score_diff_1,score_diff_2,score_diff_3,score_diff_4,joy_sadness_score_abs_diff_1,joy_sadness_score_abs_diff_2,joy_sadness_score_abs_diff_3,joy_sadness_score_abs_diff_4
0,Bullhouse El Capitan,441604,Bullhouse Brewing Company,27006,Imperial Stout,8.0,1500976800,mR_fr0g,108099,4,...,3.7,Bottle shared at Chris & Ruth’s pre GBBF Shi...,0.003929,0.047032,0.159737,0.071288,0.003929,0.047032,0.159737,0.071288
1,Bullhouse El Capitan,441604,Bullhouse Brewing Company,27006,Imperial Stout,8.0,1499076000,anstei,288109,2,...,1.6,Bottle at MC Zurich. Pours black with a medium...,-0.053292,-0.030284,0.030511,-0.04963,0.053292,0.030284,0.030511,0.04963
2,Bullhouse El Capitan,441604,Bullhouse Brewing Company,27006,Imperial Stout,8.0,1485255600,Beersiveknown,128086,4,...,3.9,Bottle shared with chriso et all and also a bo...,0.035151,0.05874,0.096382,0.070102,0.035151,0.05874,0.096382,0.070102
3,Bullhouse El Capitan,441604,Bullhouse Brewing Company,27006,Imperial Stout,8.0,1473847200,The_Osprey,249130,4,...,3.9,August 2016 - Bottle share at Chriso’s Sunda...,0.033852,0.045596,0.194106,0.085266,0.033852,0.045596,0.194106,0.085266
4,Bullhouse El Capitan,441604,Bullhouse Brewing Company,27006,Imperial Stout,8.0,1471773600,rlgk,7982,4,...,3.6,"Chris and Ruth’s Shindig 2016, Sunday. Black...",0.035561,0.086692,0.169923,0.137073,0.035561,0.086692,0.169923,0.137073
