In [1]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

import seaborn as sns
import matplotlib.pyplot as plt

# Statistical Tests:
- For our question, we run hypothesis tests on two comparisons: the difference in mean accuracy of NLTK on English versus Spanish text, and the difference in mean accuracy of BERT on the same text.
- One sample comes from passing reviews in Spanish and English directly into NLTK's Sentiment Intensity Analyzer and taking the difference in accuracy; the second sample uses the same method with BERT's pretrained model. To get a good estimate of each model's accuracy, we sample with replacement, drawing 15 samples of 1,000 reviews each.
- We expect BERT to perform better than NLTK because the BERT model is trained and fine-tuned to our testing data, whereas the NLTK model is used as provided.
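The sampling scheme described above can be sketched as follows. This is a minimal illustration using only the standard library; `bootstrap_accuracies`, the `correct` flags, the seed, and the toy input are hypothetical and are not the code that produced the results in this notebook.

```python
import random

def bootstrap_accuracies(correct, n_samples=15, sample_size=1000, seed=0):
    """Return one accuracy per sample, each sample drawn with replacement.

    `correct` is a list of booleans: True where a model's predicted
    sentiment matched the review's label (hypothetical input).
    """
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_samples):
        # random.choices samples with replacement
        sample = rng.choices(correct, k=sample_size)
        accuracies.append(sum(sample) / sample_size)
    return accuracies

# Toy input: a model that is right on 70% of 5,000 reviews (made-up numbers)
toy_correct = [True] * 3500 + [False] * 1500
toy_accuracies = bootstrap_accuracies(toy_correct)
print(len(toy_accuracies), min(toy_accuracies), max(toy_accuracies))
```

Each call yields 15 accuracy values, which is the shape of the per-group lists used in the tests that follow.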


In [22]:
nltk_en_values

[0.688,
 0.678,
 0.69,
 0.704,
 0.705,
 0.675,
 0.683,
 0.698,
 0.708,
 0.679,
 0.698,
 0.691,
 0.678,
 0.69,
 0.712]

In [23]:
nltk_es_values

[0.649,
 0.666,
 0.648,
 0.667,
 0.645,
 0.646,
 0.656,
 0.65,
 0.657,
 0.658,
 0.662,
 0.686,
 0.648,
 0.657,
 0.648]

## Accuracy results from BERT

In [25]:
bert_en_values = [0.895,
 0.877,
 0.897,
 0.903,
 0.891,
 0.889,
 0.89,
 0.877,
 0.887,
 0.89,
 0.899,
 0.886,
 0.895,
 0.881,
 0.887]
bert_es_values = [0.8581,0.8652,0.853,0.8584,0.8635,0.8566,0.8627,0.8698,0.8519,0.84810,0.84911,0.84312,0.8513,0.86214,0.873]

In [26]:
import statistics

## EDA on the variance of each group

In [27]:
print(statistics.variance(nltk_en_values))
print(statistics.variance(nltk_es_values))
print(statistics.variance(bert_en_values))
print(statistics.variance(bert_es_values))

0.0001425999999999992
0.00011860000000000021
5.725714285714296e-05
7.086881238095256e-05


In [28]:
statistics.variance(bert_es_values) * 2

0.00014173762476190512

In [29]:
levene_statistic = stats.levene(nltk_en_values, nltk_es_values, bert_en_values, bert_es_values)
levene_statistic.pvalue

0.31540351933199584

In [30]:
levene_statistic.statistic

1.207684518369143

## Levene's test does not give significant evidence that the variances differ: the p-value (about 0.315) is greater than 0.05, so we fail to reject the null hypothesis of equal variances. The groups can therefore be treated as homoscedastic, meaning their variances are similar.
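As an informal companion to Levene's test, the ratio of the largest to the smallest group variance can be checked. The sketch below uses the NLTK English and BERT English accuracies listed earlier, which have the largest and smallest sample variances of the four groups. The names `largest`, `smallest`, and `ratio` are ours, and a small ratio is only a rough indication, not a substitute for the test.

```python
import statistics

# Accuracy lists copied from the cells above
nltk_en_values = [0.688, 0.678, 0.69, 0.704, 0.705, 0.675, 0.683, 0.698,
                  0.708, 0.679, 0.698, 0.691, 0.678, 0.69, 0.712]
bert_en_values = [0.895, 0.877, 0.897, 0.903, 0.891, 0.889, 0.89, 0.877,
                  0.887, 0.89, 0.899, 0.886, 0.895, 0.881, 0.887]

largest = statistics.variance(nltk_en_values)   # largest of the four group variances
smallest = statistics.variance(bert_en_values)  # smallest of the four group variances
ratio = largest / smallest
print(f"largest / smallest variance: {ratio:.2f}")
```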

In [31]:
print('average accuracy of nltk on english : ' + str(np.mean(nltk_en_values)))
print('average accuracy of nltk on spanish : ' + str(np.mean(nltk_es_values)))
print('average accuracy of bert on english : ' + str(np.mean(bert_en_values)))
print('average accuracy of bert on spanish : ' + str(np.mean(bert_es_values)))

average accuracy of nltk on english : 0.6918000000000001
average accuracy of nltk on spanish : 0.6562
average accuracy of bert on english : 0.8896000000000001
average accuracy of bert on spanish : 0.8577313333333333


## Two-way ANOVA test with the model and the language as the independent variables and the accuracy as the dependent variable

In [32]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [33]:
# Accuracy data: NLP model, language, and accuracy scores
data = {
    'NLP_model': ['NLTK'] * 30 + ['BERT'] * 30,
    'Language': ['English'] * 15 + ['Spanish'] * 15 + ['English'] * 15 + ['Spanish'] * 15,
    'Performance': nltk_en_values + nltk_es_values + bert_en_values + bert_es_values
}

# Convert data to pandas DataFrame
df = pd.DataFrame(data)
print(df)

# Perform two-way ANOVA
formula = 'Performance ~ C(NLP_model) + C(Language) + C(NLP_model):C(Language)'
model = ols(formula, data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

print(anova_table)

   NLP_model Language  Performance
0       NLTK  English      0.68800
1       NLTK  English      0.67800
2       NLTK  English      0.69000
3       NLTK  English      0.70400
4       NLTK  English      0.70500
5       NLTK  English      0.67500
6       NLTK  English      0.68300
7       NLTK  English      0.69800
8       NLTK  English      0.70800
9       NLTK  English      0.67900
10      NLTK  English      0.69800
11      NLTK  English      0.69100
12      NLTK  English      0.67800
13      NLTK  English      0.69000
14      NLTK  English      0.71200
15      NLTK  Spanish      0.64900
16      NLTK  Spanish      0.66600
17      NLTK  Spanish      0.64800
18      NLTK  Spanish      0.66700
19      NLTK  Spanish      0.64500
20      NLTK  Spanish      0.64600
21      NLTK  Spanish      0.65600
22      NLTK  Spanish      0.65000
23      NLTK  Spanish      0.65700
24      NLTK  Spanish      0.65800
25      NLTK  Spanish      0.66200
26      NLTK  Spanish      0.68600
27      NLTK  Spanis

- #### The results of this ANOVA test indicate that both the NLP model and the language individually have a significant effect on accuracy, since both p-values were extremely low.
- #### The choice of model explains about 61% of the variance in our accuracy outcomes, and the choice of language explains around 19%. However, the p-value of the interaction between model and language was relatively large, and thus not significant.
- #### This can be interpreted as the accuracy gap between English and Spanish being similar for both models: neither model has an advantage that is specific to one language.
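The interaction term asks whether the English versus Spanish gap differs between the two models. One way to see what it measures is the difference of differences in the group means printed earlier. The sketch below only restates those means; the names `nltk_gap`, `bert_gap`, and `interaction_contrast` are illustrative.

```python
# Group means copied from the averages printed earlier in the notebook
mean_accuracy = {
    ("NLTK", "English"): 0.6918,
    ("NLTK", "Spanish"): 0.6562,
    ("BERT", "English"): 0.8896,
    ("BERT", "Spanish"): 0.8577313333333333,
}

# English minus Spanish accuracy within each model
nltk_gap = mean_accuracy[("NLTK", "English")] - mean_accuracy[("NLTK", "Spanish")]
bert_gap = mean_accuracy[("BERT", "English")] - mean_accuracy[("BERT", "Spanish")]

# The interaction compares the two gaps with each other
interaction_contrast = nltk_gap - bert_gap

print(f"NLTK English - Spanish gap: {nltk_gap:.4f}")
print(f"BERT English - Spanish gap: {bert_gap:.4f}")
print(f"Difference of the two gaps: {interaction_contrast:.4f}")
```

The two gaps are close to each other relative to the gap between the models, which is consistent with a non-significant interaction.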


## T-test between Spanish and English performance within each NLP model and between each NLP model within each language. 

In [35]:
def ttest(X_1, X_2):
    """Run a paired t-test on two equal-length accuracy samples and print the result."""
    t_statistic, p_value = stats.ttest_rel(X_1, X_2)

    # Output results
    print("Paired t-test results:")
    print("T-statistic:", t_statistic)
    print("P-value:", p_value)

    # Interpret results
    alpha = 0.05
    if p_value < alpha:
        print("Reject null hypothesis: There is a significant difference in accuracies.")
    else:
        print("Fail to reject null hypothesis: There is no significant difference in accuracies.")

# ttest prints its own results and returns nothing, so call it directly
ttest(nltk_en_values, nltk_es_values)
ttest(bert_en_values, bert_es_values)
ttest(nltk_en_values, bert_en_values)
ttest(nltk_es_values, bert_es_values)

Paired t-test results:
T-statistic: 8.470247564897582
P-value: 7.00010101220569e-07
Reject null hypothesis: There is a significant difference in accuracies.
Paired t-test results:
T-statistic: 9.1067718153829
P-value: 2.9383496627066053e-07
Reject null hypothesis: There is a significant difference in accuracies.
Paired t-test results:
T-statistic: -57.41522397544252
P-value: 5.076521071604921e-18
Reject null hypothesis: There is a significant difference in accuracies.
Paired t-test results:
T-statistic: -47.19061047330478
P-value: 7.801600968013823e-17
Reject null hypothesis: There is a significant difference in accuracies.


- #### These results show that there is a significant difference within each pair of groups.
- #### The large positive t-statistic and small p-value in the first test indicate that NLTK performed significantly better on the English text than on the Spanish text.
- #### The large positive t-statistic and small p-value in the second test indicate that BERT performed significantly better on the English text than on the Spanish text.
- #### The large negative t-statistic and small p-value in the third test indicate that NLTK performed significantly worse than BERT on the English text.
- #### The large negative t-statistic and small p-value in the fourth test indicate that NLTK performed significantly worse than BERT on the Spanish text.
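Because four tests are run at once, a stricter threshold can be used to guard against false positives. The sketch below applies a Bonferroni correction to the four p-values reported above. The correction is our addition rather than part of the original analysis, and the dictionary keys are descriptive labels we chose.

```python
# p-values copied from the paired t-test output above
p_values = {
    "NLTK: English vs Spanish": 7.00010101220569e-07,
    "BERT: English vs Spanish": 2.9383496627066053e-07,
    "English: NLTK vs BERT": 5.076521071604921e-18,
    "Spanish: NLTK vs BERT": 7.801600968013823e-17,
}

alpha = 0.05
# Bonferroni: divide the significance level by the number of tests
corrected_alpha = alpha / len(p_values)

still_significant = {name: p < corrected_alpha for name, p in p_values.items()}
for name, significant in still_significant.items():
    print(f"{name}: significant at {corrected_alpha} -> {significant}")
```

All four p-values are far below 0.05 / 4 = 0.0125, so the conclusions do not change under this correction.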



## Conclusion:
- ### The trained BERT model significantly outperformed the NLTK model on both the English and Spanish texts, while both BERT and NLTK performed significantly better on the English text than on the Spanish text.


## Future work:

- For future work, we can test the performance of both the BERT and NLTK models on languages other than English and Spanish. An analysis covering more languages could also compare automatically translated text against manually translated text.
- Similarly, there is an opportunity to extend this exploration to more NLP models. The Friedman test would work well for this because it makes fewer assumptions than ANOVA; however, to use it, the testing rows would need to be held constant across models for each sample.
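A sketch of how the Friedman test could be applied with `scipy.stats.friedmanchisquare`, assuming a third model is added and every model is scored on the same samples. The accuracy values and the names `model_a`, `model_b`, and `model_c` are made up for illustration.

```python
from scipy import stats

# Hypothetical accuracies: position i in every list is the SAME sample of reviews
model_a = [0.69, 0.68, 0.70, 0.71, 0.67, 0.69]
model_b = [0.80, 0.79, 0.82, 0.81, 0.78, 0.80]
model_c = [0.89, 0.88, 0.90, 0.90, 0.87, 0.89]

# The test ranks the models within each sample, so the samples must be matched
statistic, p_value = stats.friedmanchisquare(model_a, model_b, model_c)
print("Friedman statistic:", statistic)
print("P-value:", p_value)
```

The function requires at least three groups of matched measurements, which is why it suits a comparison across more models rather than the two used here.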


___