# Categorization using log-likelihood of word2vec models

In [4]:
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
from pandas import DataFrame
import numpy as np

Construct the file names for the trained models and the validation sets.

In [5]:
category_names = ['Sonstiges', 'Lifestyle', 
          'Wirtschaft', 'Finanzen', 'Lokal', 
          'Politik', 'Sport', 'Technologie', 'Kultur']

num_models = len(category_names)
model_paths = ["data/corpus{}+base.word2vec.model".format(x) for x in category_names]
validation_paths = ["data/corpus{}.validation.txt".format(x) for x in category_names]

Load the validation sets and count the articles in each of them.

In [6]:
validators = [LineSentence(path) for path in validation_paths]
num_validation_entries = [sum([1 for _ in x]) for x in validators]
validation_stats = DataFrame(num_validation_entries, category_names, ['dim(V_i)'])
print('number of articles used for validation:')
print(validation_stats) 

number of articles used for validation:
             dim(V_i)
Sonstiges         447
Lifestyle         203
Wirtschaft        502
Finanzen          159
Lokal             151
Politik           890
Sport             262
Technologie       216
Kultur            132


## Validation

Score each validation set with every model, i.e. calculate the score (log-likelihood) of each sentence under each model.

`scores` is a nested array of the form:

    [for each model:
        [for each validation set:
            [score of each sentence]
        ]
    ]
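
To make the nesting concrete, here is a minimal sketch with made-up numbers. The names `toy_scores` and `sentence_score` are hypothetical and not part of the notebook; the only assumption is that the outer index is the model, the middle index the validation set, and the inner index the sentence, as described above.

```python
import numpy as np

# Hypothetical toy data: 2 models, 2 validation sets with 3 and 2 sentences.
toy_scores = [
    [np.array([-10.0, -12.0, -11.0]), np.array([-30.0, -28.0])],  # model 0
    [np.array([-20.0, -25.0, -19.0]), np.array([-15.0, -14.0])],  # model 1
]

def sentence_score(scores, model_index, set_index, sentence_index):
    """Look up one score in the [model][validation set][sentence] layout."""
    return scores[model_index][set_index][sentence_index]

# Score of the second sentence of validation set 0 under model 1.
print(sentence_score(toy_scores, 1, 0, 1))
```

Note that the innermost arrays may have different lengths, because the validation sets differ in size.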

**Note**:
Cross-validation is not implemented, because it would require training the category models ${ M }_{ i }$ on the fly, which needs too much memory in the current implementation.

In [7]:
def get_scores(model_path):
    # load the model
    print('loading model {}'.format(model_path))
    model = Word2Vec.load(model_path)
    # calculate the score (log likelihood) of each validation set for this model
    print('validating model...')
    scores = [model.score(validator) for validator in validators]
    return scores

Main loop that calculates the scores for all models.

In [8]:
# container to hold the calculated scores, indexed as [model][validation set][sentence]
scores = []

for model_path in model_paths:
    model_scores = get_scores(model_path)
    scores.append(model_scores)

loading model data/corpusSonstiges+base.word2vec.model
validating model...
loading model data/corpusLifestyle+base.word2vec.model
validating model...
loading model data/corpusWirtschaft+base.word2vec.model
validating model...
loading model data/corpusFinanzen+base.word2vec.model
validating model...
loading model data/corpusLokal+base.word2vec.model
validating model...
loading model data/corpusPolitik+base.word2vec.model
validating model...
loading model data/corpusSport+base.word2vec.model
validating model...
loading model data/corpusTechnologie+base.word2vec.model
validating model...
loading model data/corpusKultur+base.word2vec.model
validating model...


## Log Likelihood

Output the average score (log-likelihood) over all sentences of a validation set for each model.

Each row is a validation set and each column is a model: a cell holds the average log-likelihood that an item of the row's category is generated by the column's model.

E.g. the highest (least negative) value in each row marks the model that most likely generated the items of that category.

$$ { S }_{ i,j }=\frac { \sum _{ s=1 }^{ dim({ V }_{ i }) }{ score(S({ V }_{ i },\ s),\ { M }_{ j }) }  }{ dim({ V }_{ i }) } $$

where:
* ${ V }_{ i } $ is the validation set for category $i$
* ${ M }_{ i } $ is the model trained for category $i$
* ${ S }_{ i,j }$ is the average score ($i$ is the row, $j$ is the column in the table)
* $S({V}_{i}, s)$ is the $s$th sentence in the validation set ${V}_{i}$
* $dim({ V }_{ i })$ is the number of elements in the validation set ${V}_{i}$
* $score(s, m)$ is the log-likelihood that the sentence $s$ is generated by the model $m$ [[Taddy, Matt. Document Classification by Inversion of Distributed Language Representations](https://arxiv.org/pdf/1504.07295.pdf)]
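
The averaging can be sketched on toy data as follows. This is only an illustration under stated assumptions: `average_score_table` and `toy_scores` are hypothetical names, the numbers are made up, and the table is built with rows indexing validation sets and columns indexing models, which matches the DataFrame the notebook prints after transposing.

```python
import numpy as np

def average_score_table(scores):
    """Return table[i][j]: mean score of validation set i under model j.

    `scores` is assumed to use the [model][validation set][sentence] layout.
    """
    num_models = len(scores)
    num_sets = len(scores[0])
    table = np.empty((num_sets, num_models))
    for j, model_scores in enumerate(scores):
        for i, sentence_scores in enumerate(model_scores):
            table[i, j] = np.mean(sentence_scores)
    return table

# Hypothetical toy data: 2 models, 2 validation sets.
toy_scores = [
    [np.array([-10.0, -12.0, -11.0]), np.array([-30.0, -28.0])],  # model 0
    [np.array([-20.0, -25.0, -19.0]), np.array([-15.0, -14.0])],  # model 1
]

table = average_score_table(toy_scores)
# For each validation set (row), the index of the model with the highest average.
best_model = np.argmax(table, axis=1)
print(table)
print(best_model)
```

In this toy case each validation set scores highest under the model with the same index, which is the pattern one would hope to see on the diagonal of the real table.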


In [9]:
average_scores = []
for score_set in scores:
    average_scores.append([sum(x) / len(x) for x in score_set])

#transpose the array before creating the DataFrame because pandas is row-oriented
result = DataFrame(np.transpose(average_scores), category_names, category_names)
print(result)

               Sonstiges    Lifestyle   Wirtschaft     Finanzen        Lokal  \
Sonstiges   -2883.616902 -3410.463742 -3547.662959 -3614.414392 -3579.666276   
Lifestyle   -3681.461870 -3159.540947 -3655.325537 -3923.542843 -3904.480044   
Wirtschaft  -2119.378694 -1932.718305 -1640.188265 -2157.852859 -2254.527094   
Finanzen    -2800.564684 -2973.300450 -2823.798655 -2201.317234 -3228.492117   
Lokal       -4969.932780 -5048.889821 -5475.762360 -5186.843923 -3581.847029   
Politik     -3254.685915 -3239.631214 -3256.058618 -3473.693254 -3498.426982   
Sport       -4275.419466 -4378.874614 -4568.987498 -4521.432009 -4382.287570   
Technologie -3678.785152 -3675.204371 -3730.093747 -3799.082383 -3899.725014   
Kultur      -6804.892096 -6822.291942 -7274.635491 -7297.410869 -7319.332980   

                 Politik        Sport  Technologie       Kultur  
Sonstiges   -3469.315377 -3618.162423 -3492.961093 -3588.139992  
Lifestyle   -3722.618492 -3890.242349 -3728.899020 -3854.253210  
W

## Classification Quality
Calculate the number of classifications into every category.

rows = category of the validation set

columns = number of items of that validation set classified into the column's category

E.g. the highest number in each row should lie on the diagonal of the matrix.

The first step is to transpose the scores (switch the first two dimensions from model->category to category->model); then numpy.argmax is used to find, for every sentence of a category, the index of the model that yields the highest score.

$${ C }_{ i,j }=\sum _{ s=1 }^{ dim({ V }_{ i }) }{ \begin{cases} 1\quad if\quad score(S({ V }_{ i },s),{ M }_{ j })\quad >\quad \underset { k\neq j }{ max } \ score(S({ V }_{ i },s),{ M }_{ k }) \\ 0\quad otherwise \end{cases} } $$

where:
* ${ C }_{ i,j }$ is the number of elements in ${V}_{i}$ that are classified to ${M}_{j}$
* $score(S({ V }_{ i },s),{ M }_{ j })\quad >\quad \underset { k\neq j }{ max } \ score(S({ V }_{ i },s),{ M }_{ k })$ is the classification rule that assigns the article to the model that yields the highest score for it
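
A minimal sketch of the counting rule on toy data, assuming the `[model][validation set][sentence]` layout of `scores`. The helper `count_classifications` and the numbers are hypothetical; `np.bincount` is used here as one possible way to count the winners and is not what the notebook itself does.

```python
import numpy as np

def count_classifications(scores, set_index):
    """Count how many sentences of one validation set each model wins."""
    # Stack the per-model score vectors into shape (num_models, num_sentences).
    per_model = np.array([model_scores[set_index] for model_scores in scores])
    # For every sentence, the index of the model with the highest score.
    winners = np.argmax(per_model, axis=0)
    # Count the wins per model; minlength keeps models with zero wins.
    return np.bincount(winners, minlength=len(scores))

# Hypothetical toy data: 2 models, 2 validation sets.
toy_scores = [
    [np.array([-10.0, -12.0, -30.0]), np.array([-30.0, -28.0])],  # model 0
    [np.array([-20.0, -11.0, -19.0]), np.array([-15.0, -14.0])],  # model 1
]

toy_matrix = np.array([count_classifications(toy_scores, i) for i in range(2)])
print(toy_matrix)
```

Each row of `toy_matrix` sums to the size of the corresponding validation set, since every sentence is assigned to exactly one model. On ties, `np.argmax` picks the first model, which differs slightly from a strict "greater than" rule.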

In [10]:
classification_matrix = np.empty([num_models, num_models], dtype=int)

for category_index in range(num_models):
    # regroup the scores from [model][category][sentence_score] to [model][sentence_score] for this category
    category_scores = [model_scores[category_index] for model_scores in scores]
    # for every sentence, get the index of the model with the highest score
    # (the values are the indices of the categories the sentences are assigned to)
    classifications = np.argmax(category_scores, axis=0)

    # convert the per-sentence assignments to a count of classifications per category
    classification_count = [np.sum(classifications == x) for x in range(len(category_names))]
    classification_matrix[category_index] = classification_count
    
result = DataFrame(classification_matrix, category_names, category_names)
print(result)   

             Sonstiges  Lifestyle  Wirtschaft  Finanzen  Lokal  Politik  \
Sonstiges          336         58          19         0      5       19   
Lifestyle           18        157          12         2      0        8   
Wirtschaft          17         17         417         9      0       19   
Finanzen             0          3          35       120      0        0   
Lokal               13          7           7         0    106       17   
Politik             59         22          71         3      1      730   
Sport                3          4           2         0      1        1   
Technologie          8          8          17         1      0        1   
Kultur              21         31           4         1      0        1   

             Sport  Technologie  Kultur  
Sonstiges        0            5       5  
Lifestyle        0            6       0  
Wirtschaft       0           20       3  
Finanzen         0            1       0  
Lokal            0            0       1

## Accuracy

Calculate the accuracy matrix. Accuracy is defined as 
$\frac { TP +TN }{ total\ elements }$

$${ ACC }_{ i,j }=\frac { { C }_{ i,j } }{ \sum _{ k }{ { C }_{ i,k } }  } $$

**Note**:
True accuracy measures as per definition are only found on the diagonal of the matrix. The off-diagonal values are the share of items of the row's category that were wrongly classified into the column's category.
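
As a small illustration, the row-wise normalization can be sketched with NumPy on a made-up count matrix. `row_normalize` and `toy_counts` are hypothetical names; the guard against empty rows is an assumption about how a category without any classified items should be handled.

```python
import numpy as np

def row_normalize(counts):
    """Divide every row of a count matrix by its row sum."""
    counts = np.asarray(counts, dtype=float)
    row_sums = counts.sum(axis=1, keepdims=True)
    # Use 1 as the divisor for empty rows so they stay all-zero instead of NaN.
    return counts / np.maximum(row_sums, 1.0)

# Hypothetical counts: rows are true categories, columns are assigned categories.
toy_counts = [[8, 2, 0],
              [1, 3, 0],
              [0, 0, 0]]

toy_accuracy = row_normalize(toy_counts)
print(toy_accuracy)
```

Every non-empty row of the result sums to 1, so the diagonal entry is the share of correctly classified items of that category.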

In [11]:
# wrapping the sum in max(..., 1) makes sure we don't divide by 0 if a row is empty
accuracy_matrix = [category / float(max(sum(category), 1)) for category in classification_matrix]

result = DataFrame(accuracy_matrix, category_names, category_names)
print(result) 

             Sonstiges  Lifestyle  Wirtschaft  Finanzen     Lokal   Politik  \
Sonstiges     0.751678   0.129754    0.042506  0.000000  0.011186  0.042506   
Lifestyle     0.088670   0.773399    0.059113  0.009852  0.000000  0.039409   
Wirtschaft    0.033865   0.033865    0.830677  0.017928  0.000000  0.037849   
Finanzen      0.000000   0.018868    0.220126  0.754717  0.000000  0.000000   
Lokal         0.086093   0.046358    0.046358  0.000000  0.701987  0.112583   
Politik       0.066292   0.024719    0.079775  0.003371  0.001124  0.820225   
Sport         0.011450   0.015267    0.007634  0.000000  0.003817  0.003817   
Technologie   0.037037   0.037037    0.078704  0.004630  0.000000  0.004630   
Kultur        0.159091   0.234848    0.030303  0.007576  0.000000  0.007576   

                Sport  Technologie    Kultur  
Sonstiges    0.000000     0.011186  0.011186  
Lifestyle    0.000000     0.029557  0.000000  
Wirtschaft   0.000000     0.039841  0.005976  
Finanzen     0.000000

Calculate the average of the diagonal accuracies, weighted by the size of each validation set, to get a single accuracy score.

$$score=\frac { \sum _{ i=1 }^{ n }{ { ACC }_{ i,i } * dim({ V }_{ i }) }  }{  \sum _{ i=1 }^{ n }{dim({ V }_{ i }) }   } $$

where $n$ is the number of categories.

This value is comparable to the accuracy score reported by scikit-learn.
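
The weighted average of the diagonal accuracies reduces to the number of correctly classified items divided by the total number of items. The following sketch checks this on a made-up count matrix; both function names are hypothetical, and it assumes that every item is classified, so that each row sum equals the size of the validation set.

```python
import numpy as np

def overall_accuracy(counts):
    """Correctly classified items (the diagonal) divided by all items."""
    counts = np.asarray(counts, dtype=float)
    return np.trace(counts) / counts.sum()

def weighted_diagonal_accuracy(counts):
    """Per-category accuracy on the diagonal, weighted by category size."""
    counts = np.asarray(counts, dtype=float)
    sizes = counts.sum(axis=1)  # assumed to equal the validation set sizes
    diagonal_accuracy = np.diag(counts) / np.maximum(sizes, 1.0)
    return np.sum(diagonal_accuracy * sizes) / np.sum(sizes)

# Hypothetical counts: rows are true categories, columns are assigned categories.
toy_counts = [[8, 2],
              [1, 3]]

print(overall_accuracy(toy_counts))
print(weighted_diagonal_accuracy(toy_counts))
```

Both functions return the same value for the toy matrix, which is why a simple loop over the diagonal of the classification matrix is enough to compute the score.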

In [12]:
true_positives = 0.0
num_samples = 0
for x in range(num_models):
    true_positives += classification_matrix[x][x]
    num_samples += sum(classification_matrix[x])
    
average_score = true_positives / num_samples

print('score: {}'.format(average_score))

score: 0.79810938555
