# Testing Performance of Sentiment Analysers

#### Author: Felipe Valencia - Data Scientist at Dataplicada

This notebook is part of a broader effort to optimise sentiment analysis for customer experience applications. We test and evaluate the accuracy and relevance of several sentiment analysis models (VADER, TextBlob, a fine-tuned MultilingualBERT, and a fine-tuned DistilBERT) by comparing their sentiment scores and their performance when classifying customer reviews on a 5-star scale. In addition, we have designed two ensemble models that aim to improve classification accuracy by combining the strengths of VADER and TextBlob.

The ensemble models employ rule-based heuristics derived from a detailed analysis of each model’s outputs on the dataset, allowing us to refine and improve sentiment scoring. By blending VADER and TextBlob’s predictions according to these heuristics, the ensemble aims to deliver more accurate, consistent results that capture the nuances of customer feedback.

In evaluating each model and the ensemble’s effectiveness, we’re not only focusing on accuracy but also accounting for practical considerations like processing speed, server storage, and resource usage (RAM and CPU). The outcome will inform our choice for the ideal model setup in a customer feedback tool, where businesses can seamlessly upload multiple comments and reviews, with sentiments classified into actionable insights.

In [120]:
import pandas as pd

# Load the four CSV files into DataFrames
tblob = pd.read_csv('output_datasets/output_TextBlob.csv')
vader = pd.read_csv('output_datasets/output_Vader.csv').drop(columns=['id', 'reviews.rating', 'reviews.text'])
dbert = pd.read_csv('output_datasets/output_DistilBert.csv').drop(columns=['id', 'reviews.rating', 'reviews.text'])
mbert = pd.read_csv('output_datasets/output_MultilingualBert.csv').drop(columns=['id', 'reviews.rating', 'reviews.text'])

# Concatenate the dataframes along the columns (axis=1)
merged_df = pd.concat([tblob, vader, dbert, mbert], axis=1)
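
`pd.concat(..., axis=1)` lines rows up by index position, not by review `id`, so the merge above is only valid if every CSV lists the reviews in the same order. The review `id` values repeat (one per hotel), so a key-based merge on `id` would not be a safe substitute. A minimal sketch of a row-order check that could be run before the `id` columns are dropped; the helper name `check_same_order` and the toy frames are illustrative assumptions, not part of the original pipeline:

```python
import pandas as pd

def check_same_order(frames, key="id"):
    """Return True if every frame has the same key column, row for row."""
    reference = frames[0][key].reset_index(drop=True)
    return all(reference.equals(f[key].reset_index(drop=True)) for f in frames[1:])

# Toy frames standing in for the per-model CSV outputs (made-up values)
a = pd.DataFrame({"id": ["x", "x", "y"], "textblob.sentiment": [4, 3, 5]})
b = pd.DataFrame({"id": ["x", "x", "y"], "vader.sentiment": [5, 3, 5]})
c = pd.DataFrame({"id": ["x", "y", "x"], "vader.sentiment": [5, 5, 3]})

aligned = check_same_order([a, b])      # same row order
misaligned = check_same_order([a, c])   # rows shuffled in the second frame
print(aligned, misaligned)
```

If the check fails, concatenating along columns would silently pair a review with another review's score.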


In [121]:
merged_df = merged_df[['id', 'reviews.text', 'reviews.rating', 'textblob.sentiment', 'vader.sentiment', 'distilbert.sentiment', 'multilingual.sentiment']]

In [122]:
# Save the merged DataFrame to a new CSV file
merged_df.to_csv('output_datasets/output_merged.csv', index=False)

In [123]:
# Drop the row containing missing values (its origin is unknown)
merged_df = merged_df.dropna()

# Convert the TextBlob and VADER scores from float to integer
merged_df['textblob.sentiment'] = merged_df['textblob.sentiment'].astype(int)
merged_df['vader.sentiment'] = merged_df['vader.sentiment'].astype(int)

In [124]:
merged_df

Unnamed: 0,id,reviews.text,reviews.rating,textblob.sentiment,vader.sentiment,distilbert.sentiment,multilingual.sentiment
0,AVwc252WIN2L1WUfpqLP,Our experience at Rancho Valencia was absolute...,5,4,5,5,5
1,AVwc252WIN2L1WUfpqLP,Amazing place. Everyone was extremely warm and...,5,4,5,5,5
2,AVwc252WIN2L1WUfpqLP,We booked a 3 night stay at Rancho Valencia to...,5,4,5,5,5
3,AVwdOclqIN2L1WUfti38,Currently in bed writing this for the past hr ...,2,3,3,2,1
4,AVwdOclqIN2L1WUfti38,I live in Md and the Aloft is my Home away fro...,5,4,5,5,5
...,...,...,...,...,...,...,...
9995,AVwd4TMv_7pvs4fz-Ers,It is hard for me to review an oceanfront hote...,3,3,5,5,4
9996,AVwdRp4DIN2L1WUfuGZZ,"I live close by, and needed to stay somewhere ...",4,4,5,5,5
9997,AVwd1TbkByjofQCxs6FH,Rolled in 11:30 laid out heads down woke up to...,4,4,5,5,5
9998,AVwdHbizIN2L1WUfsXto,Absolutely terrible..I was told I was being gi...,1,3,3,2,1


In [125]:
merged_df.describe()

Unnamed: 0,reviews.rating,textblob.sentiment,vader.sentiment,distilbert.sentiment,multilingual.sentiment
count,9999.0,9999.0,9999.0,9999.0,9999.0
mean,3.982498,3.711371,4.347435,4.146815,3.746775
std,1.175445,0.659233,1.189867,1.335966,1.283516
min,1.0,1.0,1.0,2.0,1.0
25%,3.0,3.0,4.0,2.0,3.0
50%,4.0,4.0,5.0,5.0,4.0
75%,5.0,4.0,5.0,5.0,5.0
max,5.0,5.0,5.0,5.0,5.0


In [126]:


# Calculate differences
merged_df['diff_textblob'] = merged_df['reviews.rating'] - merged_df['textblob.sentiment']
merged_df['diff_vader'] = merged_df['reviews.rating'] - merged_df['vader.sentiment']
merged_df['diff_distilbert'] = merged_df['reviews.rating'] - merged_df['distilbert.sentiment']
merged_df['diff_multilingual'] = merged_df['reviews.rating'] - merged_df['multilingual.sentiment']

merged_df.head()


Unnamed: 0,id,reviews.text,reviews.rating,textblob.sentiment,vader.sentiment,distilbert.sentiment,multilingual.sentiment,diff_textblob,diff_vader,diff_distilbert,diff_multilingual
0,AVwc252WIN2L1WUfpqLP,Our experience at Rancho Valencia was absolute...,5,4,5,5,5,1,0,0,0
1,AVwc252WIN2L1WUfpqLP,Amazing place. Everyone was extremely warm and...,5,4,5,5,5,1,0,0,0
2,AVwc252WIN2L1WUfpqLP,We booked a 3 night stay at Rancho Valencia to...,5,4,5,5,5,1,0,0,0
3,AVwdOclqIN2L1WUfti38,Currently in bed writing this for the past hr ...,2,3,3,2,1,-1,-1,0,1
4,AVwdOclqIN2L1WUfti38,I live in Md and the Aloft is my Home away fro...,5,4,5,5,5,1,0,0,0


In [127]:
import pandas as pd
from scipy.stats import shapiro, ttest_rel

# Assuming 'merged_df' is your DataFrame
# Perform normality tests on the difference columns
stat_textblob, p_textblob = shapiro(merged_df['diff_textblob'])
stat_vader, p_vader = shapiro(merged_df['diff_vader'])
stat_distilbert, p_distilbert = shapiro(merged_df['diff_distilbert'])
stat_multilingualbert, p_multilingualbert = shapiro(merged_df['diff_multilingual'])

# Print normality test results
print("Shapiro-Wilk Test for Normality:")
print(f"TextBlob: Statistics={stat_textblob:.3f}, p-value={p_textblob:.3f}")
print(f"VADER: Statistics={stat_vader:.3f}, p-value={p_vader:.3f}")
print(f"DistilBERT: Statistics={stat_distilbert:.3f}, p-value={p_distilbert:.3f}")
print(f"MultilingualBERT: Statistics={stat_multilingualbert:.3f}, p-value={p_multilingualbert:.3f}")

# Interpret the normality test results
alpha = 0.05
for model, p in zip(['TextBlob', 'VADER', 'DistilBERT', 'MultilingualBERT'], [p_textblob, p_vader, p_distilbert, p_multilingualbert]):
    if p > alpha:
        print(f"{model} differences show no evidence against normality (fail to reject H0)")
    else:
        print(f"{model} differences are not normally distributed (reject H0)")

# Perform paired t-tests
t_stat_textblob, p_textblob_ttest = ttest_rel(merged_df['reviews.rating'], merged_df['textblob.sentiment'])
t_stat_vader, p_vader_ttest = ttest_rel(merged_df['reviews.rating'], merged_df['vader.sentiment'])
t_stat_distilbert, p_distilbert_ttest = ttest_rel(merged_df['reviews.rating'], merged_df['distilbert.sentiment'])
t_stat_multilingualbert, p_multilingualbert_ttest = ttest_rel(merged_df['reviews.rating'], merged_df['multilingual.sentiment'])

# Print t-test results
print("\nPaired T-Test Results:")
print(f"TextBlob: t-statistic={t_stat_textblob:.3f}, p-value={p_textblob_ttest:.3f}")
print(f"VADER: t-statistic={t_stat_vader:.3f}, p-value={p_vader_ttest:.3f}")
print(f"DistilBERT: t-statistic={t_stat_distilbert:.3f}, p-value={p_distilbert_ttest:.3f}")
print(f"MultilingualBERT: t-statistic={t_stat_multilingualbert:.3f}, p-value={p_multilingualbert_ttest:.3f}")



Shapiro-Wilk Test for Normality:
TextBlob: Statistics=0.880, p-value=0.000
VADER: Statistics=0.885, p-value=0.000
DistilBERT: Statistics=0.878, p-value=0.000
MultilingualBERT: Statistics=0.860, p-value=0.000
TextBlob differences are not normally distributed (reject H0)
VADER differences are not normally distributed (reject H0)
DistilBERT differences are not normally distributed (reject H0)
MultilingualBERT differences are not normally distributed (reject H0)

Paired T-Test Results:
TextBlob: t-statistic=26.688, p-value=0.000
VADER: t-statistic=-33.230, p-value=0.000
DistilBERT: t-statistic=-15.247, p-value=0.000
MultilingualBERT: t-statistic=24.964, p-value=0.000
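
The sign of each t-statistic follows the sign of the mean difference `reviews.rating - prediction`: a positive value means the model scores lower than the actual rating on average, and a negative value means it scores higher. A small sketch on made-up ratings (the arrays below are illustrative assumptions) that reproduces the paired t-statistic by hand and compares it with `ttest_rel`:

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical actual ratings and model predictions
rating = np.array([5, 5, 2, 4, 3, 1, 4, 5])
predicted = np.array([4, 4, 3, 4, 3, 2, 3, 4])

d = rating - predicted                      # paired differences
n = len(d)
# Paired t-statistic: mean difference divided by its standard error
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(n))
t_scipy = ttest_rel(rating, predicted).statistic

print(f"mean difference={d.mean():.3f}, manual t={t_manual:.3f}, scipy t={t_scipy:.3f}")
```

Here the predictions sit below the ratings on average, so the statistic is positive, as it is for TextBlob and MultilingualBERT above.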




In [128]:
import pandas as pd
from scipy.stats import wilcoxon

# Perform Wilcoxon signed-rank tests
wilcoxon_textblob = wilcoxon(merged_df['reviews.rating'], merged_df['textblob.sentiment'])
wilcoxon_vader = wilcoxon(merged_df['reviews.rating'], merged_df['vader.sentiment'])
wilcoxon_distilbert = wilcoxon(merged_df['reviews.rating'], merged_df['distilbert.sentiment'])
wilcoxon_multilingualbert = wilcoxon(merged_df['reviews.rating'], merged_df['multilingual.sentiment'])

# Print Wilcoxon test results
print("Wilcoxon Signed-Rank Test Results:")
print(f"TextBlob: Statistic={wilcoxon_textblob.statistic:.3f}, p-value={wilcoxon_textblob.pvalue:.3f}")
print(f"VADER: Statistic={wilcoxon_vader.statistic:.3f}, p-value={wilcoxon_vader.pvalue:.3f}")
print(f"DistilBERT: Statistic={wilcoxon_distilbert.statistic:.3f}, p-value={wilcoxon_distilbert.pvalue:.3f}")
print(f"MultilingualBERT: Statistic={wilcoxon_multilingualbert.statistic:.3f}, p-value={wilcoxon_multilingualbert.pvalue:.3f}")


Wilcoxon Signed-Rank Test Results:
TextBlob: Statistic=7549700.000, p-value=0.000
VADER: Statistic=3435417.000, p-value=0.000
DistilBERT: Statistic=5400486.000, p-value=0.000
MultilingualBERT: Statistic=3270391.500, p-value=0.000
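
The `p-value=0.000` entries are a formatting effect: the `:.3f` specifier rounds any value below 0.0005 to zero. A short sketch of the difference between fixed-point and scientific formatting, using a hypothetical p-value that is not taken from the results above:

```python
# Hypothetical, very small p-value used only to illustrate formatting
p_value = 3.2e-41

fixed = f"{p_value:.3f}"       # fixed-point: rounds to zero
scientific = f"{p_value:.3e}"  # scientific notation: keeps the magnitude

print(f"fixed={fixed}, scientific={scientific}")
```

Printing with `:.3e` would make it possible to compare how strongly each test rejects the null hypothesis.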


In [129]:
import numpy as np

# Define a function to calculate accuracy
def calculate_accuracy(actual, predicted):
    # Consider a prediction correct if it's equal to the actual rating
    correct_predictions = np.sum(actual == predicted)
    accuracy = correct_predictions / len(actual)  # Calculate accuracy
    return accuracy

# Calculate accuracy for each model
accuracy_textblob = calculate_accuracy(merged_df['reviews.rating'], merged_df['textblob.sentiment'])
accuracy_vader = calculate_accuracy(merged_df['reviews.rating'], merged_df['vader.sentiment'])
accuracy_distilbert = calculate_accuracy(merged_df['reviews.rating'], merged_df['distilbert.sentiment'])
accuracy_multilingualbert = calculate_accuracy(merged_df['reviews.rating'], merged_df['multilingual.sentiment'])

# Print the accuracy results
print("Accuracy Results:")
print(f"TextBlob Accuracy: {accuracy_textblob:.4f} ({accuracy_textblob * 100:.2f}%)")
print(f"VADER Accuracy: {accuracy_vader:.4f} ({accuracy_vader * 100:.2f}%)")
print(f"DistilBERT Accuracy: {accuracy_distilbert:.4f} ({accuracy_distilbert * 100:.2f}%)")
print(f"MultilingualBERT Accuracy: {accuracy_multilingualbert:.4f} ({accuracy_multilingualbert * 100:.2f}%)")


Accuracy Results:
TextBlob Accuracy: 0.3265 (32.65%)
VADER Accuracy: 0.4817 (48.17%)
DistilBERT Accuracy: 0.4726 (47.26%)
MultilingualBERT Accuracy: 0.5371 (53.71%)
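
Exact-match accuracy does not show where a model goes wrong. One possible complement is a confusion matrix built with `pd.crosstab`, sketched here on a made-up frame (the values are illustrative, only the column names follow the notebook):

```python
import pandas as pd

# Toy data standing in for merged_df
toy = pd.DataFrame({
    "reviews.rating":  [5, 5, 4, 2, 1, 3],
    "vader.sentiment": [5, 5, 5, 3, 3, 5],
})

# Rows are actual ratings, columns are predicted ratings
confusion = pd.crosstab(
    toy["reviews.rating"], toy["vader.sentiment"],
    rownames=["actual"], colnames=["predicted"],
)

# The diagonal holds the exact matches, so it recovers the accuracy
exact_matches = sum(
    confusion.at[k, k] for k in confusion.index if k in confusion.columns
)
accuracy = exact_matches / len(toy)
print(confusion)
print(f"accuracy={accuracy:.4f}")
```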


In [130]:
# Define a function to calculate proximity
def calculate_proximity(actual, predicted):
    # Calculate the absolute differences between actual and predicted ratings
    differences = np.abs(actual - predicted)
    average_proximity = np.mean(differences)  # Average of absolute differences
    return average_proximity

# Calculate proximity for each model
proximity_textblob = calculate_proximity(merged_df['reviews.rating'], merged_df['textblob.sentiment'])
proximity_vader = calculate_proximity(merged_df['reviews.rating'], merged_df['vader.sentiment'])
proximity_distilbert = calculate_proximity(merged_df['reviews.rating'], merged_df['distilbert.sentiment'])
proximity_multilingualbert = calculate_proximity(merged_df['reviews.rating'], merged_df['multilingual.sentiment'])

# Print the proximity results
print("Average Proximity Results:")
print(f"TextBlob Average Proximity: {proximity_textblob:.4f}")
print(f"VADER Average Proximity: {proximity_vader:.4f}")
print(f"DistilBERT Average Proximity: {proximity_distilbert:.4f}")
print(f"MultilingualBERT Average Proximity: {proximity_multilingualbert:.4f}")


Average Proximity Results:
TextBlob Average Proximity: 0.8096
VADER Average Proximity: 0.7452
DistilBERT Average Proximity: 0.7206
MultilingualBERT Average Proximity: 0.5992
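
The "average proximity" computed above is the mean absolute error (MAE), so lower values are better. It can also be read directly from the `diff_*` columns created earlier. A minimal sketch on toy values (the numbers are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy data standing in for merged_df
toy = pd.DataFrame({
    "reviews.rating":     [5, 2, 4, 1],
    "textblob.sentiment": [4, 3, 4, 3],
})
toy["diff_textblob"] = toy["reviews.rating"] - toy["textblob.sentiment"]

# Same quantity computed two ways
mae_from_diff = toy["diff_textblob"].abs().mean()
mae_direct = np.mean(np.abs(toy["reviews.rating"] - toy["textblob.sentiment"]))

print(mae_from_diff, mae_direct)
```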


In [146]:
import numpy as np
import time

def ensemble_predictions(row):
    vader_pred = row['vader.sentiment']
    textblob_pred = row['textblob.sentiment']
    
    if vader_pred == textblob_pred:
        return vader_pred  # Rule 1: Same predictions
    else:
        # Rule 2: Different predictions
        if abs(vader_pred - textblob_pred) == 1:  # Adjacent values
            return vader_pred  # Prioritise VADER
        else:
            # Calculate the average
            avg = (vader_pred + textblob_pred) / 2
            
            # Check if average is an integer
            if avg.is_integer():
                return int(avg)  # Return average as integer
            else:
                # Round the half-point average toward VADER's prediction
                if vader_pred > avg:
                    return int(np.ceil(avg))  # Round up if VADER is higher
                else:
                    return int(np.floor(avg))  # Round down if VADER is lower

# Process the ensemble
print("Starting the ensemble heuristics...")
start_time = time.time()

# Apply the ensemble function to create a new column in the DataFrame
merged_df['ensemble_prediction'] = merged_df.apply(ensemble_predictions, axis=1)

print(f"Processing completed in {(time.time() - start_time) / 60:.2f} minutes")

# Display the updated DataFrame with ensemble predictions
merged_df[['reviews.rating', 'vader.sentiment', 'textblob.sentiment', 'ensemble_prediction']]


Starting the ensemble heuristics...
Processing completed in 0.00 minutes


Unnamed: 0,reviews.rating,vader.sentiment,textblob.sentiment,ensemble_prediction
0,5,5,4,5
1,5,5,4,5
2,5,5,4,5
3,2,3,3,3
4,5,5,4,5
...,...,...,...,...
9995,3,5,3,4
9996,4,5,4,5
9997,4,5,4,5
9998,1,3,3,3
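
The row-wise rule above can also be expressed without `apply`. The following is a sketch of one vectorised formulation, assuming integer ratings from 1 to 5; the function name `ensemble_vectorised` is an illustrative assumption, and the example evaluates it on every possible (VADER, TextBlob) pair rather than on the real data:

```python
import numpy as np
import pandas as pd

def ensemble_vectorised(vader, textblob):
    """Keep VADER when the models agree or are adjacent; otherwise
    average the two scores, rounding half-points toward VADER."""
    avg = (vader + textblob) / 2
    # ceil and floor coincide when avg is already an integer
    toward_vader = np.where(vader > textblob, np.ceil(avg), np.floor(avg))
    combined = np.where((vader - textblob).abs() <= 1, vader, toward_vader)
    return pd.Series(combined, index=vader.index).astype(int)

# All 25 combinations of 1-5 star predictions
pairs = pd.DataFrame(
    [(v, t) for v in range(1, 6) for t in range(1, 6)],
    columns=["vader.sentiment", "textblob.sentiment"],
)
pairs["ensemble_prediction"] = ensemble_vectorised(
    pairs["vader.sentiment"], pairs["textblob.sentiment"]
)
print(pairs)
```

On 9,999 rows the row-wise version is already fast, so this mainly matters for larger uploads.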


In [147]:
import numpy as np
import time

def ensemble_predictions2(row):
    vader_pred = row['vader.sentiment']
    textblob_pred = row['textblob.sentiment']
    
    if vader_pred == textblob_pred:
        return vader_pred  # Rule 1: Same predictions
    else:
        # Rule 2: Different predictions
        if abs(vader_pred - textblob_pred) == 1:  # Adjacent values
            return vader_pred  # Prioritise VADER
        else:
            # Calculate the average
            avg = (vader_pred + textblob_pred) / 2
            
            # Check if average is an integer
            if avg.is_integer():
                return int(avg)  # Return average as integer
            else:
                return int(np.floor(avg))  # Always round down if not an integer

# Process the ensemble
print("Starting the ensemble heuristics...")
start_time = time.time()

# Apply the ensemble function to create a new column in the DataFrame
merged_df['ensemble_prediction2'] = merged_df.apply(ensemble_predictions2, axis=1)

print(f"Processing completed in {(time.time() - start_time) / 60:.2f} minutes")

# Display the updated DataFrame with ensemble predictions
merged_df[['reviews.rating', 'vader.sentiment', 'textblob.sentiment', 'ensemble_prediction', 'ensemble_prediction2']]


Starting the ensemble heuristics...
Processing completed in 0.00 minutes


Unnamed: 0,reviews.rating,vader.sentiment,textblob.sentiment,ensemble_prediction,ensemble_prediction2
0,5,5,4,5,5
1,5,5,4,5,5
2,5,5,4,5,5
3,2,3,3,3,3
4,5,5,4,5,5
...,...,...,...,...,...
9995,3,5,3,4,4
9996,4,5,4,5,5
9997,4,5,4,5,5
9998,1,3,3,3,3
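
The two ensemble rules differ only in how a half-point average is rounded, so they can disagree only when VADER and TextBlob are three stars apart. A small sketch that enumerates every possible pair of predictions and lists the ones where the rules diverge; `rule_one` and `rule_two` are compact restatements of the two functions above, written for illustration:

```python
import math

def rule_one(v, t):
    """First ensemble: half-point averages are rounded toward VADER."""
    if abs(v - t) <= 1:
        return v
    avg = (v + t) / 2
    return math.ceil(avg) if v > avg else math.floor(avg)

def rule_two(v, t):
    """Second ensemble: half-point averages are always rounded down."""
    if abs(v - t) <= 1:
        return v
    return math.floor((v + t) / 2)

# (VADER, TextBlob) pairs on which the two rules give different results
differing = [
    (v, t)
    for v in range(1, 6)
    for t in range(1, 6)
    if rule_one(v, t) != rule_two(v, t)
]
print(differing)  # [(4, 1), (5, 2)]
```

Only reviews where VADER predicts 4 or 5 and TextBlob predicts three stars lower are affected, which would explain why the two ensembles produce almost the same predictions.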


In [136]:
# Calculate difference for the ensemble of Vader and Textblob
merged_df['diff_ensemble'] = merged_df['reviews.rating'] - merged_df['ensemble_prediction']

# Calculate difference for the ensemble2 of Vader and Textblob
merged_df['diff_ensemble2'] = merged_df['reviews.rating'] - merged_df['ensemble_prediction2']

In [152]:
# Summary statistics for the actual ratings and every model's predictions
merged_df[['reviews.rating', 'vader.sentiment', 'textblob.sentiment', 'distilbert.sentiment', 'multilingual.sentiment', 'ensemble_prediction', 'ensemble_prediction2']].describe()

Unnamed: 0,reviews.rating,vader.sentiment,textblob.sentiment,distilbert.sentiment,multilingual.sentiment,ensemble_prediction,ensemble_prediction2
count,9999.0,9999.0,9999.0,9999.0,9999.0,9999.0,9999.0
mean,3.982498,4.347435,3.711371,4.146815,3.746775,4.288829,4.288529
std,1.175445,1.189867,0.659233,1.335966,1.283516,1.054234,1.054648
min,1.0,1.0,1.0,2.0,1.0,1.0,1.0
25%,3.0,4.0,3.0,2.0,3.0,4.0,4.0
50%,4.0,5.0,4.0,5.0,4.0,5.0,5.0
75%,5.0,5.0,4.0,5.0,5.0,5.0,5.0
max,5.0,5.0,5.0,5.0,5.0,5.0,5.0


In [140]:
import pandas as pd
from scipy.stats import shapiro, ttest_rel

# Assuming 'merged_df' is your DataFrame
# Perform normality tests on the difference columns
stat_textblob, p_textblob = shapiro(merged_df['diff_textblob'])
stat_vader, p_vader = shapiro(merged_df['diff_vader'])
stat_distilbert, p_distilbert = shapiro(merged_df['diff_distilbert'])
stat_multilingualbert, p_multilingualbert = shapiro(merged_df['diff_multilingual'])
# With ensembles
stat_ensemble, p_ensemble = shapiro(merged_df['diff_ensemble'])
stat_ensemble2, p_ensemble2 = shapiro(merged_df['diff_ensemble2'])

# Print normality test results
print("Shapiro-Wilk Test for Normality:")
print(f"TextBlob: Statistics={stat_textblob:.3f}, p-value={p_textblob:.3f}")
print(f"VADER: Statistics={stat_vader:.3f}, p-value={p_vader:.3f}")
print(f"DistilBERT: Statistics={stat_distilbert:.3f}, p-value={p_distilbert:.3f}")
print(f"MultilingualBERT: Statistics={stat_multilingualbert:.3f}, p-value={p_multilingualbert:.3f}")
print(f"Ensemble: Statistics={stat_ensemble:.3f}, p-value={p_ensemble:.3f}")
print(f"Ensemble2: Statistics={stat_ensemble2:.3f}, p-value={p_ensemble2:.3f}")
print()

# Interpret the normality test results
alpha = 0.05
for model, p in zip(['TextBlob', 'VADER', 'DistilBERT', 'MultilingualBERT', 'Ensemble', 'Ensemble2'], [p_textblob, p_vader, p_distilbert, p_multilingualbert, p_ensemble, p_ensemble2]):
    if p > alpha:
        print(f"{model} differences show no evidence against normality (fail to reject H0)")
    else:
        print(f"{model} differences are not normally distributed (reject H0)")

# Perform paired t-tests
t_stat_textblob, p_textblob_ttest = ttest_rel(merged_df['reviews.rating'], merged_df['textblob.sentiment'])
t_stat_vader, p_vader_ttest = ttest_rel(merged_df['reviews.rating'], merged_df['vader.sentiment'])
t_stat_distilbert, p_distilbert_ttest = ttest_rel(merged_df['reviews.rating'], merged_df['distilbert.sentiment'])
t_stat_multilingualbert, p_multilingualbert_ttest = ttest_rel(merged_df['reviews.rating'], merged_df['multilingual.sentiment'])
t_stat_ensemble, p_ensemble_ttest = ttest_rel(merged_df['reviews.rating'], merged_df['ensemble_prediction'])
t_stat_ensemble2, p_ensemble2_ttest = ttest_rel(merged_df['reviews.rating'], merged_df['ensemble_prediction2'])

# Print t-test results
print("\nPaired T-Test Results:")
print(f"TextBlob: t-statistic={t_stat_textblob:.3f}, p-value={p_textblob_ttest:.3f}")
print(f"VADER: t-statistic={t_stat_vader:.3f}, p-value={p_vader_ttest:.3f}")
print(f"DistilBERT: t-statistic={t_stat_distilbert:.3f}, p-value={p_distilbert_ttest:.3f}")
print(f"MultilingualBERT: t-statistic={t_stat_multilingualbert:.3f}, p-value={p_multilingualbert_ttest:.3f}")
print(f"Ensemble: t-statistic={t_stat_ensemble:.3f}, p-value={p_ensemble_ttest:.3f}")
print(f"Ensemble2: t-statistic={t_stat_ensemble2:.3f}, p-value={p_ensemble2_ttest:.3f}")

Shapiro-Wilk Test for Normality:
TextBlob: Statistics=0.880, p-value=0.000
VADER: Statistics=0.885, p-value=0.000
DistilBERT: Statistics=0.878, p-value=0.000
MultilingualBERT: Statistics=0.860, p-value=0.000
Ensemble: Statistics=0.887, p-value=0.000
Ensemble2: Statistics=0.887, p-value=0.000

TextBlob differences are not normally distributed (reject H0)
VADER differences are not normally distributed (reject H0)
DistilBERT differences are not normally distributed (reject H0)
MultilingualBERT differences are not normally distributed (reject H0)
Ensemble differences are not normally distributed (reject H0)
Ensemble2 differences are not normally distributed (reject H0)

Paired T-Test Results:
TextBlob: t-statistic=26.688, p-value=0.000
VADER: t-statistic=-33.230, p-value=0.000
DistilBERT: t-statistic=-15.247, p-value=0.000
MultilingualBERT: t-statistic=24.964, p-value=0.000
Ensemble: t-statistic=-30.748, p-value=0.000
Ensemble2: t-statistic=-30.725, p-value=0.000




In [141]:
import pandas as pd
from scipy.stats import wilcoxon

# Perform Wilcoxon signed-rank tests
wilcoxon_textblob = wilcoxon(merged_df['reviews.rating'], merged_df['textblob.sentiment'])
wilcoxon_vader = wilcoxon(merged_df['reviews.rating'], merged_df['vader.sentiment'])
wilcoxon_distilbert = wilcoxon(merged_df['reviews.rating'], merged_df['distilbert.sentiment'])
wilcoxon_multilingualbert = wilcoxon(merged_df['reviews.rating'], merged_df['multilingual.sentiment'])
wilcoxon_ensemble = wilcoxon(merged_df['reviews.rating'], merged_df['ensemble_prediction'])
wilcoxon_ensemble2 = wilcoxon(merged_df['reviews.rating'], merged_df['ensemble_prediction2'])

# Print Wilcoxon test results
print("Wilcoxon Signed-Rank Test Results:")
print(f"TextBlob: Statistic={wilcoxon_textblob.statistic:.3f}, p-value={wilcoxon_textblob.pvalue:.3f}")
print(f"VADER: Statistic={wilcoxon_vader.statistic:.3f}, p-value={wilcoxon_vader.pvalue:.3f}")
print(f"DistilBERT: Statistic={wilcoxon_distilbert.statistic:.3f}, p-value={wilcoxon_distilbert.pvalue:.3f}")
print(f"MultilingualBERT: Statistic={wilcoxon_multilingualbert.statistic:.3f}, p-value={wilcoxon_multilingualbert.pvalue:.3f}")
print(f"Ensemble: Statistic={wilcoxon_ensemble.statistic:.3f}, p-value={wilcoxon_ensemble.pvalue:.3f}")
print(f"Ensemble2: Statistic={wilcoxon_ensemble2.statistic:.3f}, p-value={wilcoxon_ensemble2.pvalue:.3f}")

Wilcoxon Signed-Rank Test Results:
TextBlob: Statistic=7549700.000, p-value=0.000
VADER: Statistic=3435417.000, p-value=0.000
DistilBERT: Statistic=5400486.000, p-value=0.000
MultilingualBERT: Statistic=3270391.500, p-value=0.000
Ensemble: Statistic=3613358.500, p-value=0.000
Ensemble2: Statistic=3613823.500, p-value=0.000


In [148]:
import numpy as np

# Define a function to calculate accuracy
def calculate_accuracy(actual, predicted):
    # Consider a prediction correct if it's equal to the actual rating
    correct_predictions = np.sum(actual == predicted)
    accuracy = correct_predictions / len(actual)  # Calculate accuracy
    return accuracy

# Calculate accuracy for each model
accuracy_textblob = calculate_accuracy(merged_df['reviews.rating'], merged_df['textblob.sentiment'])
accuracy_vader = calculate_accuracy(merged_df['reviews.rating'], merged_df['vader.sentiment'])
accuracy_distilbert = calculate_accuracy(merged_df['reviews.rating'], merged_df['distilbert.sentiment'])
accuracy_multilingualbert = calculate_accuracy(merged_df['reviews.rating'], merged_df['multilingual.sentiment'])
accuracy_ensemble = calculate_accuracy(merged_df['reviews.rating'], merged_df['ensemble_prediction'])
accuracy_ensemble2 = calculate_accuracy(merged_df['reviews.rating'], merged_df['ensemble_prediction2'])

# Print the accuracy results
print("Accuracy Results:")
print(f"TextBlob Accuracy: {accuracy_textblob:.4f} ({accuracy_textblob * 100:.2f}%)")
print(f"VADER Accuracy: {accuracy_vader:.4f} ({accuracy_vader * 100:.2f}%)")
print(f"DistilBERT Accuracy: {accuracy_distilbert:.4f} ({accuracy_distilbert * 100:.2f}%)")
print(f"MultilingualBERT Accuracy: {accuracy_multilingualbert:.4f} ({accuracy_multilingualbert * 100:.2f}%)")
print(f"Ensemble Accuracy: {accuracy_ensemble:.4f} ({accuracy_ensemble * 100:.2f}%)")
print(f"Ensemble2 Accuracy: {accuracy_ensemble2:.4f} ({accuracy_ensemble2 * 100:.2f}%)")


Accuracy Results:
TextBlob Accuracy: 0.3265 (32.65%)
VADER Accuracy: 0.4817 (48.17%)
DistilBERT Accuracy: 0.4726 (47.26%)
MultilingualBERT Accuracy: 0.5371 (53.71%)
Ensemble Accuracy: 0.4879 (48.79%)
Ensemble2 Accuracy: 0.4880 (48.80%)


In [143]:
# Define a function to calculate proximity
def calculate_proximity(actual, predicted):
    # Calculate the absolute differences between actual and predicted ratings
    differences = np.abs(actual - predicted)
    average_proximity = np.mean(differences)  # Average of absolute differences
    return average_proximity

# Calculate proximity for each model
proximity_textblob = calculate_proximity(merged_df['reviews.rating'], merged_df['textblob.sentiment'])
proximity_vader = calculate_proximity(merged_df['reviews.rating'], merged_df['vader.sentiment'])
proximity_distilbert = calculate_proximity(merged_df['reviews.rating'], merged_df['distilbert.sentiment'])
proximity_multilingualbert = calculate_proximity(merged_df['reviews.rating'], merged_df['multilingual.sentiment'])
proximity_ensemble = calculate_proximity(merged_df['reviews.rating'], merged_df['ensemble_prediction'])
proximity_ensemble2 = calculate_proximity(merged_df['reviews.rating'], merged_df['ensemble_prediction2'])

# Print the proximity results
print("Average Proximity Results:")
print(f"TextBlob Average Proximity: {proximity_textblob:.4f}")
print(f"VADER Average Proximity: {proximity_vader:.4f}")
print(f"DistilBERT Average Proximity: {proximity_distilbert:.4f}")
print(f"MultilingualBERT Average Proximity: {proximity_multilingualbert:.4f}")
print(f"Ensemble Average Proximity: {proximity_ensemble:.4f}")
print(f"Ensemble2 Average Proximity: {proximity_ensemble2:.4f}")

Average Proximity Results:
TextBlob Average Proximity: 0.8096
VADER Average Proximity: 0.7452
DistilBERT Average Proximity: 0.7206
MultilingualBERT Average Proximity: 0.5992
Ensemble Average Proximity: 0.6782
Ensemble2 Average Proximity: 0.6779
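
The accuracy and proximity cells repeat the same call once per model. One way to reduce the repetition, sketched here under the assumption that every prediction column shares the 1-5 scale of `reviews.rating`, is a single helper that loops over the columns. The name `evaluate` and the toy frame are illustrative, not part of the original notebook:

```python
import pandas as pd

def evaluate(df, target, prediction_columns):
    """Return exact-match accuracy and mean absolute error per column."""
    rows = {}
    for col in prediction_columns:
        diff = df[target] - df[col]
        rows[col] = {
            "accuracy": (diff == 0).mean(),
            "mean_abs_error": diff.abs().mean(),
        }
    return pd.DataFrame.from_dict(rows, orient="index")

# Toy data standing in for merged_df (made-up values)
toy = pd.DataFrame({
    "reviews.rating": [5, 4, 2, 1],
    "model_a":        [5, 5, 3, 3],
    "model_b":        [4, 4, 2, 1],
})
summary = evaluate(toy, "reviews.rating", ["model_a", "model_b"])
print(summary)
```

Applied to `merged_df`, the same call would take the six prediction columns and return one comparison table.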
