# Categorization using log-likelihood of word2vec models

In [1]:
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
from pandas import DataFrame
import numpy as np



Construct the file names for the trained models and the validation sets.

In [6]:
category_names = ['Sonstiges', 'Aktuell', 'Lifestyle', 
          'Wirtschaft', 'Finanzen', 'Ausland', 'Lokal', 
          'Politik', 'Sport', 'Technologie', 'Kultur']

num_models = len(category_names)
model_paths = ["data/corpus{}+base.200dim.word2vec.model".format(x) for x in category_names]
validation_paths = ["data/corpus{}.validation.txt".format(x) for x in category_names]

load the validation sets

In [7]:
validators = [LineSentence(path) for path in validation_paths]

num_validation_entries = [sum([1 for _ in x]) for x in validators]
validation_stats = DataFrame(num_validation_entries, category_names, ['dim(V_i)'])
print('number of articles used for validation:')
print(validation_stats) 

number of articles used for validation:
             dim(V_i)
Sonstiges         770
Aktuell            20
Lifestyle         328
Wirtschaft        817
Finanzen          333
Ausland           483
Lokal             249
Politik          1596
Sport             449
Technologie       361
Kultur            217


## Validation

Score every validation set with every model: for each sentence, calculate the score (log-likelihood) under each model.

`scores` is a nested list of the form:

    [for each model:
        [for each validation set:
            [score of each sentence]
        ]
    ]

**Note**:
Cross-validation is not implemented, because it would require training the category models ${ M }_{ i }$ on the fly, which needs too much memory in the current implementation.
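
As a rough illustration of the nested layout described above, the sketch below builds a tiny stand-in for `scores` by hand. The numbers and the helper `sentence_score` are hypothetical; nothing here comes from the trained models.

```python
# Hypothetical toy data: 2 models, 2 validation sets, made-up log-likelihoods.
# Layout mirrors the notebook: toy_scores[model][validation_set][sentence].
toy_scores = [
    # model 0
    [[-10.0, -12.0, -11.0],   # validation set 0 (3 sentences)
     [-20.0, -22.0]],         # validation set 1 (2 sentences)
    # model 1
    [[-15.0, -9.0, -14.0],    # validation set 0 scored by model 1
     [-18.0, -19.0]],         # validation set 1 scored by model 1
]


def sentence_score(scores, model_index, set_index, sentence_index):
    """Look up one sentence score in the [model][set][sentence] layout."""
    return scores[model_index][set_index][sentence_index]


# Every model scores the same validation sets, so the set sizes
# can be read from the first model.
set_sizes = [len(sentence_scores) for sentence_scores in toy_scores[0]]
print(set_sizes)
print(sentence_score(toy_scores, 1, 0, 1))
```

The outermost index selects the model, so all scores for one validation set are spread across the outer list and have to be gathered before models can be compared.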

In [8]:
def get_scores(model_path):
    # load the model
    print('loading model {}'.format(model_path))
    model =  Word2Vec.load(model_path)
    # calculate the score (log likelihood) of each validation set for this model
    print('validating model...')
    scores = [model.score(validator) for validator in validators]
    return scores

Main loop that calculates the scores for all models.

In [9]:
#container to hold the calculated scores
scores = []

for model in model_paths:
    model_scores = get_scores(model)
    scores.append(model_scores)

loading model data/corpusSonstiges+base.200dim.word2vec.model
validating model...
loading model data/corpusAktuell+base.200dim.word2vec.model
validating model...
loading model data/corpusLifestyle+base.200dim.word2vec.model
validating model...
loading model data/corpusWirtschaft+base.200dim.word2vec.model
validating model...
loading model data/corpusFinanzen+base.200dim.word2vec.model
validating model...
loading model data/corpusAusland+base.200dim.word2vec.model
validating model...
loading model data/corpusLokal+base.200dim.word2vec.model
validating model...
loading model data/corpusPolitik+base.200dim.word2vec.model
validating model...
loading model data/corpusSport+base.200dim.word2vec.model
validating model...
loading model data/corpusTechnologie+base.200dim.word2vec.model
validating model...
loading model data/corpusKultur+base.200dim.word2vec.model
validating model...


## Log Likelihood

Output the average score (log-likelihood) of all sentences in each validation set under each model.

rows = validation sets, columns = models; each cell is the average log-likelihood that an item of the row's category is generated by the column's model

e.g.: the highest value (closest to zero) in each row marks the model that most likely generated the sentences of that category

$$ { S }_{ i,j } = \frac { \sum _{ s=1 }^{ dim({ V }_{ i }) }{ score(S({ V }_{ i }, s), { M }_{ j }) } }{ dim({ V }_{ i }) } $$

where:
* ${ V }_{ i }$ is the validation set for category $i$
* ${ M }_{ j }$ is the model trained for category $j$
* ${ S }_{ i,j }$ is the average score ($i$ is the row, $j$ is the column in the table)
* $S({V}_{i}, x)$ is the $x$th sentence in the validation set ${V}_{i}$
* $dim({ V }_{ i })$ is the number of elements in the validation set
* $score(s, m)$ is the log-likelihood that the sentence $s$ is generated by the model $m$ [[Taddy, Matt. Document Classification by Inversion of Distributed Language Representations](https://arxiv.org/pdf/1504.07295.pdf)]
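
The averaging can be checked by hand on a small example. The sketch below is plain Python with made-up scores; `average_scores_by_set` is a hypothetical helper, not part of the notebook. It returns the matrix with validation sets as rows and models as columns, the same orientation as the printed table.

```python
def average_scores_by_set(scores):
    """Return S where S[i][j] is the mean score of validation set i under model j.

    `scores` uses the notebook layout [model][validation_set][sentence].
    """
    n_models = len(scores)
    n_sets = len(scores[0])
    return [
        [sum(scores[j][i]) / len(scores[j][i]) for j in range(n_models)]
        for i in range(n_sets)
    ]


# Hypothetical toy data: 2 models, 2 validation sets, made-up log-likelihoods.
toy_scores = [
    [[-10.0, -12.0, -11.0], [-20.0, -22.0]],  # model 0
    [[-15.0, -9.0, -14.0], [-18.0, -19.0]],   # model 1
]

S = average_scores_by_set(toy_scores)
for row in S:
    print(row)
```

In this toy case the largest value of each row sits on the diagonal, which is the pattern one would hope for if every model fits its own category best.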


In [10]:
average_scores = []
for score_set in scores:
    average_scores.append([sum(x) / len(x) for x in score_set])

#transpose the array before creating the DataFrame because pandas is row-oriented
result = DataFrame(np.transpose(average_scores), category_names, category_names)
print(result)

               Sonstiges      Aktuell    Lifestyle   Wirtschaft     Finanzen  \
Sonstiges   -3010.084928 -3620.930076 -3316.684047 -3428.110842 -3470.883899   
Aktuell     -3342.287811 -3502.018492 -3442.068571 -3354.793344 -3558.918169   
Lifestyle   -3411.059577 -3741.202155 -3233.621596 -3353.026731 -3641.836474   
Wirtschaft  -2220.297472 -2389.661734 -2083.490276 -1863.786035 -2240.937092   
Finanzen    -2820.391919 -3376.509877 -3022.316987 -2871.141353 -2434.042256   
Ausland     -2118.389157 -2299.874385 -2225.302254 -2162.516560 -2305.146807   
Lokal       -4992.469881 -5883.736465 -5182.716055 -5553.870577 -5203.016032   
Politik     -3222.012936 -3683.076455 -3300.886094 -3248.723149 -3440.617581   
Sport       -4089.661399 -4552.034989 -4251.830995 -4440.834318 -4350.140081   
Technologie -3643.723178 -3964.762998 -3671.437263 -3708.586836 -3770.256772   
Kultur      -8901.370265 -9732.151271 -9192.834764 -9610.559394 -9710.615421   

                 Ausland        Lokal  

## Classification Quality
Count, for every category, how the sentences of its validation set are classified.

rows = categories of the validation sets

columns = number of items of the validation set classified into each category

e.g.: the highest number in each row should be on the diagonal of the matrix

The first step is to transpose the scores (switch the first two dimensions from model->category to category->model); then numpy.argmax is used to find, for each sentence, the index of the model that yields the highest score.

$${ C }_{ i,j } = \sum _{ s=1 }^{ dim({ V }_{ i }) }{ \begin{cases} 1 & \text{if } j = \underset { k }{ \operatorname{argmax} }\ score(S({ V }_{ i }, s), { M }_{ k }) \\ 0 & \text{otherwise} \end{cases} } $$

where:
* ${ C }_{ i,j }$ is the number of elements in ${V}_{i}$ that are classified to ${M}_{j}$
* $j = \underset { k }{ \operatorname{argmax} }\ score(S({ V }_{ i }, s), { M }_{ k })$ is the classification rule that assigns each sentence to the category whose model yields the highest score
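
The counting rule can be traced on a small example. This is a sketch in plain Python with made-up scores; `classification_counts` is a hypothetical helper, and ties are resolved in favour of the first model, which is also how `numpy.argmax` behaves.

```python
def classification_counts(scores):
    """Return C where C[i][j] counts sentences of set i assigned to model j.

    `scores` uses the notebook layout [model][validation_set][sentence].
    """
    n_models = len(scores)
    n_sets = len(scores[0])
    counts = [[0] * n_models for _ in range(n_sets)]
    for i in range(n_sets):
        for s in range(len(scores[0][i])):
            # Scores of sentence s of set i under every model.
            per_model = [scores[j][i][s] for j in range(n_models)]
            # Index of the first maximum, i.e. the best-scoring model.
            best = per_model.index(max(per_model))
            counts[i][best] += 1
    return counts


# Hypothetical toy data: 2 models, 2 validation sets, made-up log-likelihoods.
toy_scores = [
    [[-10.0, -12.0, -11.0], [-20.0, -22.0]],  # model 0
    [[-15.0, -9.0, -14.0], [-18.0, -19.0]],   # model 1
]

toy_counts = classification_counts(toy_scores)
for row in toy_counts:
    print(row)
```

Each row sums to the size of its validation set, since every sentence is assigned to exactly one model.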

In [11]:
classification_matrix = np.empty([num_models, num_models], dtype=int)

for category_index in range(num_models):
    #rearrange the scores from the form [model][category][sentence_score] to [category][model][sentence_score]
    category_scores = [model[category_index] for model in scores]
    #for each sentence, find the index of the model with the highest score
    #the values represent the category index the sentence was assigned to
    classifications = np.argmax(category_scores, axis = 0)
    
    #convert the classification matrix to a count of classification in each category
    classification_count = [np.sum(classifications == x) for x in range(len(category_names))]
    classification_matrix[category_index]=classification_count
    
result = DataFrame(classification_matrix, category_names, category_names)
print(result)   

             Sonstiges  Aktuell  Lifestyle  Wirtschaft  Finanzen  Ausland  \
Sonstiges          507        0         85          28        10       31   
Aktuell              8        1          1           6         0        1   
Lifestyle           87        0        145          27         6       13   
Wirtschaft          43        0         34         575        66       18   
Finanzen            11        0          5          91       212        1   
Ausland             39        1          8          29         3      184   
Lokal               23        0          7          23         0        2   
Politik            118        0         19          97         7      175   
Sport               14        0          1           7         2        0   
Technologie         25        0         22          27         3        5   
Kultur              71        0         34           9         1        8   

             Lokal  Politik  Sport  Technologie  Kultur  
Sonstiges       1

## Accuracy

Calculate the accuracy matrix. Accuracy is defined as 
$\frac { TP +TN }{ total\ elements }$

$${ ACC }_{ i,j }=\frac { { C }_{ i,j } }{ \sum _{ k }{ { C }_{ i,k } } } $$

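**Note**:
True accuracy measures are found only on the diagonal of the matrix: the share of each validation set that was classified correctly. The other values in row $i$ are the shares of ${V}_{i}$ that were misclassified into category $j$, i.e. false positives for category $j$.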
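
The row-wise normalization can be sketched as follows. The count matrix is made up and `row_normalize` is a hypothetical helper; the `max(..., 1)` guard mirrors the one used in the notebook to avoid dividing by zero for an empty row.

```python
def row_normalize(matrix):
    """Divide every row by its sum so that each non-empty row adds up to 1."""
    normalized = []
    for row in matrix:
        total = max(sum(row), 1)  # guard against empty rows
        normalized.append([value / total for value in row])
    return normalized


# Hypothetical count matrix: rows are validation sets, columns are models.
toy_counts = [[2, 1], [0, 2]]
toy_acc = row_normalize(toy_counts)
for row in toy_acc:
    print(row)
```

Because the division is per row, the values of different rows are comparable even when the validation sets differ strongly in size.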

In [12]:
# wrapping the sum in max(..., 1) makes sure we don't divide by 0 if no match occurred
accuracy_matrix = [category / float(max([sum(category) ,1])) for category in classification_matrix]

result = DataFrame(accuracy_matrix, category_names, category_names)
print(result) 

             Sonstiges  Aktuell  Lifestyle  Wirtschaft  Finanzen   Ausland  \
Sonstiges     0.658442  0.00000   0.110390    0.036364  0.012987  0.040260   
Aktuell       0.400000  0.05000   0.050000    0.300000  0.000000  0.050000   
Lifestyle     0.265244  0.00000   0.442073    0.082317  0.018293  0.039634   
Wirtschaft    0.052632  0.00000   0.041616    0.703794  0.080783  0.022032   
Finanzen      0.033033  0.00000   0.015015    0.273273  0.636637  0.003003   
Ausland       0.080745  0.00207   0.016563    0.060041  0.006211  0.380952   
Lokal         0.092369  0.00000   0.028112    0.092369  0.000000  0.008032   
Politik       0.073935  0.00000   0.011905    0.060777  0.004386  0.109649   
Sport         0.031180  0.00000   0.002227    0.015590  0.004454  0.000000   
Technologie   0.069252  0.00000   0.060942    0.074792  0.008310  0.013850   
Kultur        0.327189  0.00000   0.156682    0.041475  0.004608  0.036866   

                Lokal   Politik     Sport  Technologie    Kultu

Calculate the average accuracy of the diagonal, weighted by validation set size, to get an overall accuracy score.

$$score=\frac { \sum _{ i=1 }^{ N }{ { ACC }_{ i,i } * dim({ V }_{ i }) }  }{  \sum _{ i=1 }^{ N }{dim({ V }_{ i }) }   } $$

where $N$ is the number of categories.

This value is comparable to the accuracy score reported by scikit-learn.
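
Since ${ ACC }_{ i,i } * dim({ V }_{ i })$ is just the number of correctly classified sentences of category $i$, the weighted average should reduce to the diagonal sum of the count matrix divided by the total number of sentences. The sketch below checks this on a made-up count matrix; both function names are hypothetical.

```python
def overall_accuracy(counts):
    """Correctly classified sentences divided by all sentences."""
    correct = sum(counts[i][i] for i in range(len(counts)))
    total = sum(sum(row) for row in counts)
    return correct / total


def weighted_diagonal_accuracy(counts):
    """Average of the per-category diagonal accuracy, weighted by set size."""
    sizes = [sum(row) for row in counts]
    diagonal = [counts[i][i] / max(sizes[i], 1) for i in range(len(counts))]
    return sum(acc * size for acc, size in zip(diagonal, sizes)) / sum(sizes)


# Hypothetical count matrix: rows are validation sets, columns are models.
toy_counts = [[2, 1], [0, 2]]
print(overall_accuracy(toy_counts))
print(weighted_diagonal_accuracy(toy_counts))
```

The first form avoids the intermediate division and is the one that matches a simple loop over the diagonal of the count matrix.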

In [13]:
true_positives = 0.0
num_samples = 0
for x in range(num_models):
    true_positives += classification_matrix[x][x]
    num_samples += sum(classification_matrix[x])
    
average_score = true_positives / num_samples

print('score: {}'.format(average_score))

score: 0.658723101547
