# Dictionary Validation
Not all dictionaries are made the same, which is why we validate them pairwise for each source. It is hard to say why the analysis yields different results depending on the dictionary. It may have to do with the language each magazine uses. For instance, the conservative Deutschland Kurier uses sarcasm a lot: it may wax positive about an event, but only to make fun of it and expose what it sees as the delusion of Germans and German authorities. In this case, the positive vocabulary may mislead the algorithm into marking a text as "positive." For such corpora, it may therefore be better to use a dictionary with a greater emphasis on negative vocabulary.

In [5]:
import pandas as pd
import os
from pandas import read_csv
import numpy as np

In [6]:
from sklearn.preprocessing import MinMaxScaler
from scipy import stats
from scipy.stats import shapiro
from sklearn import metrics

The terms normalization and standardization are sometimes used interchangeably, but they usually refer to different things. Normalization usually means scaling a variable to values between 0 and 1, while standardization transforms the data to have a mean of zero and a standard deviation of 1. For the purposes of this research, we needed to be able to compare the means of sentiment scores. Standardization would not work, as it sets every mean to zero. Besides, the standard deviation can tell us a lot about the diversity of scores, so forcing it to one would deprive us of some very useful information.
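To make the difference concrete, here is a minimal sketch in plain Python. The `raw_scores` values and both helper names are made up for illustration and are not taken from the corpus. The min-max formula is the one `MinMaxScaler` applies with its default range of 0 to 1.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly so the smallest becomes 0 and the largest 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift and scale values to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical raw sentiment scores, for illustration only
raw_scores = [-0.8, -0.2, 0.0, 0.4, 1.0]

normalized = min_max_scale(raw_scores)
standardized = standardize(raw_scores)

# After min-max scaling, mean and spread are rescaled but still informative
print(mean(normalized), pstdev(normalized))
# After standardization, they are forced to (about) 0 and 1 for any input
print(mean(standardized), pstdev(standardized))
```

Whatever the input, the second pair of numbers comes out as roughly 0 and 1, which is why standardized scores cannot be used to compare means across dictionaries.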

In [7]:
df_taz_sentiws = pd.read_excel('Taz Sentiment.xlsx')
df_taz_zurich = pd.read_excel('Taz_Sentiment_Zurich.xlsx')

#list of values from the Zurich dictionary
#does not need normalization as they lie between 0 and 1
data_taz_Zu = df_taz_zurich['Value'].tolist()


#normalization of the SentiWS scores:
#reshaping into a column vector, min-max scaling, converting back to a list,
#and flattening the nested one-element lists

scaler = MinMaxScaler() 
data_taz_WS = df_taz_sentiws['Value'].tolist()
data_taz_WS = np.reshape(data_taz_WS, (-1,1))
data_scaled_WS = scaler.fit_transform(data_taz_WS)
data_scaled_WS = data_scaled_WS.tolist()
data_scaled_WS = [item for sublist in data_scaled_WS for item in sublist]

print(stats.ttest_ind(data_scaled_WS, data_taz_Zu))

Ttest_indResult(statistic=0.7623471473248491, pvalue=0.4465232642872813)



Student's t-test shows no significant difference between the means of the two groups (p ≈ 0.45). In other words, after normalization, the scores obtained with SentiWS and PolartLexicon are, on average, more or less similar. But do they hold up when two humans (me and a native speaker) judge the same texts?
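For reference, `stats.ttest_ind` with its default settings computes the pooled-variance Student's t statistic. The sketch below reproduces that statistic in plain Python on two toy samples and shows the usual decision rule applied to the p-value printed above. The helper names, the toy data, and the 0.05 threshold are assumptions made for illustration.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t_statistic(a, b):
    """Student's t statistic for two independent samples, pooled variance."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    std_err = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(a) - mean(b)) / std_err

def is_significant(p_value, alpha=0.05):
    """Reject the hypothesis of equal means only if p falls below alpha."""
    return p_value < alpha

# Toy samples, not the notebook's data
group_a = [1, 2, 3, 4, 5]
group_b = [2, 3, 4, 5, 6]
t_toy = pooled_t_statistic(group_a, group_b)
print(t_toy)

# p-value reported by the notebook for SentiWS vs. PolartLexicon on Taz
print(is_significant(0.4465232642872813))  # False: no significant difference
```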

In [18]:
def fisher_test(df1,df2):
    """Fisher's exact test on a 2x2 table built from the counts of the two
    most frequent judgments obtained with each of the two dictionaries"""
    
    #value_counts() sorts by frequency, so index 0 is the most frequent judgment;
    #this assumes both frames rank the same two judgments in the same order
    list_of_numbers_for_ws = df1['Judgment'].value_counts().tolist()
    list_of_numbers_for_zu = df2['Judgment'].value_counts().tolist()

    oddsratio, pvalue = stats.fisher_exact([[list_of_numbers_for_ws[0], list_of_numbers_for_zu[0]],
                                        [list_of_numbers_for_ws[1], list_of_numbers_for_zu[1]]]) 
    return oddsratio, pvalue

In [20]:
print("Taz", fisher_test(df_taz_sentiws, df_taz_zurich))

Taz (1.0054171180931744, 1.0)


Taz oddsratio = 1.0054171180931744, p-value = 1.0

Fisher's exact test is a statistical test used to determine whether there is a nonrandom association between two categorical variables. Our sample is rather modest, so its use is justified. As you can see, with an odds ratio of about 1 and a p-value of 1.0, there is no evidence that the distribution of judgments depends on which of the two dictionaries produced them. Note that the function above compares only the two most frequent judgment categories.
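To show what the test actually computes, here is a minimal pure-Python sketch of the two-sided Fisher's exact test for a 2x2 table. It sums the hypergeometric probabilities of all tables that are no more likely than the observed one, which is the usual two-sided definition. The function name and the toy table are illustrative assumptions, not the notebook's data.

```python
from math import comb

def fisher_exact_2x2(table):
    """Two-sided Fisher's exact test for a 2x2 table of counts.

    Returns the sample odds ratio and the p-value."""
    (a, b), (c, d) = table
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance guards against floating-point ties
    p_value = sum(prob(x) for x in range(lowest, highest + 1)
                  if prob(x) <= p_observed * (1 + 1e-7))
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, min(p_value, 1.0)

# Toy table: rows are two categories, columns are two groups
odds, p = fisher_exact_2x2([[8, 2], [1, 5]])
print(odds, p)

# A perfectly balanced table gives p = 1.0, as in the Taz comparison
print(fisher_exact_2x2([[10, 10], [10, 10]]))
```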



In [9]:
#Confusion matrices for Taz.de
#Actual is the result of manual validation

P="Positive"
N="Negative"
O="Neutral"

taz_ws_actual = [N, P, N, P, N, O, P, P, P, O, P, N, N, O, P, N, P, O, P, N, N, N, P, P, P, P, N, P, N, P, P, N, N, N, P]
taz_ws_pred = [N, P, P, N, N, P, P, P, P, O, P, N, N, P, P, P, P, N, P, P, P, N, P, P, P, P, N, P, N, P, P, N, N, N, P]

taz_zu_actual = [N, P, N, P, N, O, P, P, P, O, P, N, N, O, P, N, P, P, P, P, N, N, P, P, P, P, P, N, N, P]
taz_zu_pred = [N, P, N, N, N, P, P, P, N, N, P, N, N, P, P, N, P, P, P,P, N, N, P, P, P, P, P, N, N, P]
taz_zu_pred_norm = [O, P, N, N, N, P, P, P, N, N, P, N, N, P, P, N, P, P, P, P, N, N, P, P, P, P, P, N, N, P] 
taz_hyb_pred = [N, N, N, N, N, N, N, P, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, P, N, N, N, P]
taz_hyb_act = [N, P, N, P, N, O, P, P, P, O, P, N, N, O, P, N, P, P, P, P, N, N, P, P, P, P, P, N, N, P]


print(metrics.confusion_matrix(taz_ws_actual,taz_ws_pred))
print(metrics.classification_report(taz_ws_actual,taz_ws_pred, digits=3))

print(metrics.confusion_matrix(taz_zu_actual,taz_zu_pred))
print(metrics.classification_report(taz_zu_actual,taz_zu_pred, digits=3))

print(metrics.confusion_matrix(taz_hyb_act,taz_hyb_pred))
print(metrics.classification_report(taz_hyb_act,taz_hyb_pred, digits=3))

[[10  0  4]
 [ 1  1  2]
 [ 1  0 16]]
              precision    recall  f1-score   support

    Negative      0.833     0.714     0.769        14
     Neutral      1.000     0.250     0.400         4
    Positive      0.727     0.941     0.821        17

    accuracy                          0.771        35
   macro avg      0.854     0.635     0.663        35
weighted avg      0.801     0.771     0.752        35

[[10  0  0]
 [ 1  0  2]
 [ 2  0 15]]
              precision    recall  f1-score   support

    Negative      0.769     1.000     0.870        10
     Neutral      0.000     0.000     0.000         3
    Positive      0.882     0.882     0.882        17

    accuracy                          0.833        30
   macro avg      0.551     0.627     0.584        30
weighted avg      0.756     0.833     0.790        30

[[10  0  0]
 [ 3  0  0]
 [14  0  3]]
              precision    recall  f1-score   support

    Negative      0.370     1.000     0.541        10
     Neutral      0.000     0.000     0.000         3
    Positive      1.000     0.176     0.300        17

    accuracy                          0.433        30
   macro avg      0.457     0.392     0.280        30
weighted avg      0.690     0.433     0.350        30


Comparing the manual labels with the SentiWS predictions for Taz gives a macro-averaged precision of 85.4% (accuracy 77.1%), which is pretty good for a rule-based sentiment analysis algorithm.
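The figures in the report follow directly from the confusion matrix. As a quick check, the sketch below recomputes per-class precision and recall, macro-averaged precision, and accuracy from the SentiWS matrix printed above, in plain Python. The helper name is an assumption; rows are taken to be the manual labels and columns the predictions, as in the Taz calls.

```python
def per_class_metrics(matrix):
    """Per-class precision and recall from a square confusion matrix.

    Rows are reference (manual) labels, columns are predictions."""
    size = len(matrix)
    precision, recall = [], []
    for k in range(size):
        predicted_k = sum(matrix[i][k] for i in range(size))  # column total
        actual_k = sum(matrix[k])                             # row total
        # A class that is never predicted gets precision 0, as in the reports
        precision.append(matrix[k][k] / predicted_k if predicted_k else 0.0)
        recall.append(matrix[k][k] / actual_k if actual_k else 0.0)
    return precision, recall

# SentiWS confusion matrix for Taz (order: Negative, Neutral, Positive)
sentiws_taz = [[10, 0, 4],
               [1, 1, 2],
               [1, 0, 16]]

precision, recall = per_class_metrics(sentiws_taz)
macro_precision = sum(precision) / len(precision)
accuracy = sum(sentiws_taz[k][k] for k in range(3)) / sum(map(sum, sentiws_taz))

print(round(macro_precision, 3), round(accuracy, 3))  # 0.854 0.771
```

Macro averaging weights the four Neutral texts as heavily as the seventeen Positive ones, so a single correct Neutral prediction (precision 1.0) lifts the macro precision well above the accuracy.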


#### SentiWS Classification Report

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| Negative | 0.833 | 0.714 | 0.769 | 14 |
| Neutral | 1.000 | 0.250 | 0.400 | 4 |
| Positive | 0.727 | 0.941 | 0.821 | 17 |
| accuracy | | | 0.771 | 35 |
| macro avg | 0.854 | 0.635 | 0.663 | 35 |
| weighted avg | 0.801 | 0.771 | 0.752 | 35 |


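On macro-averaged precision, PolartLexicon (55.1%) and the hybrid dictionary (45.7%) do much worse than SentiWS, so we did not use them any further for Taz. PolartLexicon's accuracy (83.3%) is actually higher than that of SentiWS (77.1%); its macro average is dragged down by the Neutral class, which it never predicts in this sample.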

#### PolartLexicon Classification Report

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| Negative | 0.769 | 1.000 | 0.870 | 10 |
| Neutral | 0.000 | 0.000 | 0.000 | 3 |
| Positive | 0.882 | 0.882 | 0.882 | 17 |
| accuracy | | | 0.833 | 30 |
| macro avg | 0.551 | 0.627 | 0.584 | 30 |
| weighted avg | 0.756 | 0.833 | 0.790 | 30 |

#### Hybrid Classification Report

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| Negative | 0.370 | 1.000 | 0.541 | 10 |
| Neutral | 0.000 | 0.000 | 0.000 | 3 |
| Positive | 1.000 | 0.176 | 0.300 | 17 |
| accuracy | | | 0.433 | 30 |
| macro avg | 0.457 | 0.392 | 0.280 | 30 |
| weighted avg | 0.690 | 0.433 | 0.350 | 30 |

In [14]:
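def list_for_confmat(file, length):
    """this function returns a list of judgments 
    used in confusion matrices"""
    
    if not file.endswith('xlsx'):
        raise ValueError('Wrong format: expected an .xlsx file')
    df = pd.read_excel(file)
    #.loc slicing is label-based and inclusive: with the default index,
    #df.loc[:30] returns 31 rows
    df_slice = df.loc[:length]
    return df_slice['Judgment'].tolist()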

In [23]:
stern_zu_pred = list_for_confmat('Stern_Sentiment_Zurich.xlsx', 30)
stern_ws_pred = list_for_confmat('Stern_Sentiment.xlsx', 30)

stern_actual = ['negative', 'negative', 'positive', 'negative', 'negative', 'negative',
               'negative', 'negative', 'negative', 'negative', 'negative', 'negative',
               'negative', 'positive', 'negative', 'negative', 'negative', 'negative',
               'negative', 'negative', 'negative', 'negative', 'positive', 'negative', 
               'neutral', 'positive', 'negative', 'negative', 'positive', 'neutral',
               'positive']

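#note: sklearn expects (y_true, y_pred); predictions are passed first here,
#so in the output rows are predictions, columns are manual labels,
#support counts predictions, and the precision and recall columns are swapped
print(metrics.confusion_matrix(stern_ws_pred,stern_actual))
print(metrics.classification_report(stern_ws_pred,stern_actual, digits=3))

print(metrics.confusion_matrix(stern_zu_pred,stern_actual))
print(metrics.classification_report(stern_zu_pred,stern_actual, digits=3))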

[[13  1  1]
 [ 3  0  3]
 [ 7  1  2]]
              precision    recall  f1-score   support

    negative      0.565     0.867     0.684        15
     neutral      0.000     0.000     0.000         6
    positive      0.333     0.200     0.250        10

    accuracy                          0.484        31
   macro avg      0.300     0.356     0.311        31
weighted avg      0.381     0.484     0.412        31

[[17  1  0]
 [ 0  1  0]
 [ 6  0  6]]
              precision    recall  f1-score   support

    negative      0.739     0.944     0.829        18
     neutral      0.500     1.000     0.667         1
    positive      1.000     0.500     0.667        12

    accuracy                          0.774        31
   macro avg      0.746     0.815     0.721        31
weighted avg      0.832     0.774     0.761        31



For the social-democratic source (Stern), PolartLexicon, with an accuracy of 77.4%, seems to be the optimal choice. SentiWS, at 48.4%, is not much better than flipping a coin, and the latter is arguably easier to implement :)
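One caveat when reading the Stern reports: scikit-learn's metric functions take the reference labels first (`y_true`, then `y_pred`), while the Stern calls pass the predictions first. This transposes the confusion matrix. Accuracy is unaffected, so the comparison above still holds, but precision and recall trade places. The plain-Python sketch below illustrates this on the PolartLexicon matrix as printed; the helper names are assumptions.

```python
def transpose(matrix):
    """Swap rows and columns of a square matrix given as nested lists."""
    return [list(row) for row in zip(*matrix)]

def precision_recall(matrix, k):
    """Precision and recall of class k; rows are reference labels, columns predictions."""
    column_total = sum(row[k] for row in matrix)
    row_total = sum(matrix[k])
    precision = matrix[k][k] / column_total if column_total else 0.0
    recall = matrix[k][k] / row_total if row_total else 0.0
    return precision, recall

def accuracy(matrix):
    """Share of items on the diagonal."""
    return sum(matrix[k][k] for k in range(len(matrix))) / sum(map(sum, matrix))

# PolartLexicon matrix for Stern as printed (rows: predictions, columns: manual labels)
as_printed = [[17, 1, 0],
              [0, 1, 0],
              [6, 0, 6]]
# Conventional orientation (rows: manual labels, columns: predictions)
conventional = transpose(as_printed)

print(precision_recall(as_printed, 0))    # values shown in the report for "negative"
print(precision_recall(conventional, 0))  # the same two values, swapped
print(accuracy(as_printed), accuracy(conventional))
```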

#### SentiWS

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| negative | 0.565 | 0.867 | 0.684 | 15 |
| neutral | 0.000 | 0.000 | 0.000 | 6 |
| positive | 0.333 | 0.200 | 0.250 | 10 |
| accuracy | | | 0.484 | 31 |
| macro avg | 0.300 | 0.356 | 0.311 | 31 |
| weighted avg | 0.381 | 0.484 | 0.412 | 31 |


#### PolartLexicon


| | precision | recall | f1-score | support |
|---|---|---|---|---|
| negative | 0.739 | 0.944 | 0.829 | 18 |
| neutral | 0.500 | 1.000 | 0.667 | 1 |
| positive | 1.000 | 0.500 | 0.667 | 12 |
| accuracy | | | 0.774 | 31 |
| macro avg | 0.746 | 0.815 | 0.721 | 31 |
| weighted avg | 0.832 | 0.774 | 0.761 | 31 |

In [21]:
dk_ws_pred = list_for_confmat('DK Sentiment.xlsx', 30)

dk_zu_pred = list_for_confmat('DK_Sentiment_Zurich.xlsx', 30)
dk_hyb_pred = list_for_confmat('Dk_Sentiment_Hybrid.xlsx', 30)

dk_actual = ['negative', 'negative', 'negative', 'neutral', 'negative', 'negative', 'negative',
            'negative', 'negative', 'neutral', 'negative', 'negative', 'negative', 'negative',
            'negative', 'negative', 'positive', 'negative', 'neutral', 'negative', 'negative',
            'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative',
            'negative', 'negative', 'negative']

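#note: as with Stern, predictions are passed first (sklearn expects y_true, y_pred),
#so rows are predictions and the precision and recall columns are swapped
print(metrics.confusion_matrix(dk_ws_pred, dk_actual))
print(metrics.classification_report(dk_ws_pred, dk_actual, digits = 3))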

print(metrics.confusion_matrix(dk_zu_pred, dk_actual))
print(metrics.classification_report(dk_zu_pred, dk_actual, digits = 3))

print(metrics.confusion_matrix(dk_hyb_pred, dk_actual))
print(metrics.classification_report(dk_hyb_pred, dk_actual, digits = 3))

[[12  1  0]
 [ 9  1  1]
 [ 6  1  0]]
              precision    recall  f1-score   support

    negative      0.444     0.923     0.600        13
     neutral      0.333     0.091     0.143        11
    positive      0.000     0.000     0.000         7

    accuracy                          0.419        31
   macro avg      0.259     0.338     0.248        31
weighted avg      0.305     0.419     0.302        31

[[10  1  0]
 [ 2  0  0]
 [15  2  1]]
              precision    recall  f1-score   support

    negative      0.370     0.909     0.526        11
     neutral      0.000     0.000     0.000         2
    positive      1.000     0.056     0.105        18

    accuracy                          0.355        31
   macro avg      0.457     0.322     0.211        31
weighted avg      0.712     0.355     0.248        31

[[16  2  0]
 [ 0  0  0]
 [11  1  1]]
              precision    recall  f1-score   support

    negative      0.593     0.889     0.711        18
     neutral      0.000     0.000     0.000         0
    positive      1.000     0.077     0.143        13

    accuracy                          0.548        31
   macro avg      0.531     0.322     0.285        31
weighted avg      0.763     0.548     0.473        31


For the conservative source, we proceeded with the hybrid of the two dictionaries: with an accuracy of 54.8%, against 41.9% for SentiWS and 35.5% for PolartLexicon, it seemed to be the best choice.

#### SentiWS 
  
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| negative | 0.444 | 0.923 | 0.600 | 13 |
| neutral | 0.333 | 0.091 | 0.143 | 11 |
| positive | 0.000 | 0.000 | 0.000 | 7 |
| accuracy | | | 0.419 | 31 |
| macro avg | 0.259 | 0.338 | 0.248 | 31 |
| weighted avg | 0.305 | 0.419 | 0.302 | 31 |

#### PolartLexicon

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| negative | 0.370 | 0.909 | 0.526 | 11 |
| neutral | 0.000 | 0.000 | 0.000 | 2 |
| positive | 1.000 | 0.056 | 0.105 | 18 |
| accuracy | | | 0.355 | 31 |
| macro avg | 0.457 | 0.322 | 0.211 | 31 |
| weighted avg | 0.712 | 0.355 | 0.248 | 31 |

#### Hybrid

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| negative | 0.593 | 0.889 | 0.711 | 18 |
| neutral | 0.000 | 0.000 | 0.000 | 0 |
| positive | 1.000 | 0.077 | 0.143 | 13 |
| accuracy | | | 0.548 | 31 |
| macro avg | 0.531 | 0.322 | 0.285 | 31 |
| weighted avg | 0.763 | 0.548 | 0.473 | 31 |

Making humans (us) manually validate all texts would be painful. This is why we sampled around 1/5 of the texts for human reading. But can we be sure that the smaller samples represent the full samples well? What if they contain mostly extreme values and skew the picture? To check this, we applied Student's t-test to the sentiment scores of each subsample and of its full sample. So far so good: there is no significant difference between the two for any source (all p-values are above 0.13).

In [77]:
def t_test(filename, length):
    '''this function checks whether drawing a smaller sample 
    from a big sample may skew the results'''
    df = pd.read_excel(filename)
    df_slice_for_testing = df.loc[:length]
    list_for_testing_small = df_slice_for_testing['Value'].tolist()
    list_for_testing_big = df['Value'].tolist()
    return stats.ttest_ind(list_for_testing_small, list_for_testing_big)

print('Taz WS', t_test('Taz Sentiment.xlsx', 30))
print('Taz Zu', t_test('Taz_Sentiment_Zurich.xlsx', 30))

print('DK WS', t_test('DK Sentiment.xlsx', 30))
print('DK Zu', t_test('DK_Sentiment_Zurich.xlsx', 30))

print('Stern WS', t_test('Stern_Sentiment.xlsx', 30))
print('Stern Zu', t_test('Stern_Sentiment_Zurich.xlsx', 30))

Taz WS Ttest_indResult(statistic=0.23352894920570835, pvalue=0.8156419312334099)
Taz Zu Ttest_indResult(statistic=0.3519095262971695, pvalue=0.7253578796107707)
DK WS Ttest_indResult(statistic=0.6671381789173757, pvalue=0.5055991574300808)
DK Zu Ttest_indResult(statistic=1.5021159239971085, pvalue=0.13494445959895884)
Stern WS Ttest_indResult(statistic=0.8092765972524079, pvalue=0.41943650500320184)
Stern Zu Ttest_indResult(statistic=-1.4297644923319337, pvalue=0.15453744986583487)
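The `t_test` function takes the first 31 rows of each file as the validation sample. If the rows happen to be ordered (by date, for instance), one alternative is a seeded random draw, which is less likely to be skewed by that ordering. The sketch below is only an illustration of that idea in plain Python: the function name, the 1/5 fraction as a parameter, the seed, and the toy scores are all assumptions, not part of the notebook.

```python
import random
from statistics import mean

def draw_validation_sample(values, fraction=0.2, seed=0):
    """Draw a reproducible random subsample; returns (row indices, values)."""
    rng = random.Random(seed)  # fixed seed so the same texts are picked every run
    k = max(1, round(len(values) * fraction))
    indices = sorted(rng.sample(range(len(values)), k))
    return indices, [values[i] for i in indices]

# Hypothetical sentiment scores standing in for a 'Value' column
scores = [round(0.01 * i, 2) for i in range(150)]

indices, sample = draw_validation_sample(scores)
print(len(sample), mean(sample), mean(scores))
```

The returned indices identify which texts to hand to the human readers, and the subsample can then be compared with the full list in the same way as above.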

