# Categorization using log-likelihood of word2vec models

In [15]:
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
from pandas import DataFrame
import numpy as np

Construct the file names for the learned models and the validation sets.

In [16]:
category_names = ['Sonstiges', 'Aktuell', 'Lifestyle', 
          'Wirtschaft', 'Finanzen', 'Ausland', 'Lokal', 
          'Politik', 'Sport', 'Technologie', 'Kultur']

num_models = len(category_names)
model_paths = ["data/corpus{}+base.word2vec.model".format(x) for x in category_names]
validation_paths = ["data/corpus{}.validation.txt".format(x) for x in category_names]

Load the validation sets and count the articles in each of them.

In [17]:
validators = [LineSentence(path) for path in validation_paths]

num_validation_entries = [sum([1 for _ in x]) for x in validators]
validation_stats = DataFrame(num_validation_entries, category_names, ['dim(V_i)'])
print('number of articles used for validation:')
print(validation_stats) 

number of articles used for validation:
             dim(V_i)
Sonstiges         770
Aktuell            20
Lifestyle         328
Wirtschaft        817
Finanzen          333
Ausland           483
Lokal             249
Politik          1596
Sport             449
Technologie       361
Kultur            217


## Validation

Score each validation set with every model and calculate the score (log-likelihood) of each sentence under each model.

`scores` is a nested list of the form:

    [for each model:
        [for each validation set:
            [score of each sentence]
        ]
    ]

**Note**:
There is no cross-validation implemented, since it would require training the category models ${ M }_{ i }$ on the fly, which needs too much memory in the current implementation.
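To make the nested layout of `scores` concrete, here is a small sketch with two models and two validation sets. All numbers are invented, and `toy_scores` and `describe_scores` are hypothetical names used only for illustration.

```python
# Toy stand-in for `scores`: toy_scores[model][validation_set][sentence]
# (all numbers are invented for illustration)
toy_scores = [
    [[-10.0, -12.0, -11.0], [-20.0, -22.0]],   # model 0
    [[-15.0, -13.0, -14.0], [-18.0, -19.0]],   # model 1
]

def describe_scores(scores):
    """Return (number of models, number of validation sets, sentences per set)."""
    num_models = len(scores)
    num_sets = len(scores[0])
    sentences_per_set = [len(s) for s in scores[0]]
    return num_models, num_sets, sentences_per_set

layout = describe_scores(toy_scores)
print(layout)
```

Every model scores the same validation sets, so the inner lists have the same lengths for all models.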

In [18]:
def get_scores(model_path):
    # load the model
    print('loading model {}'.format(model_path))
    model = Word2Vec.load(model_path)
    # calculate the score (log-likelihood) of each validation set for this model
    # (gensim implements score() only for models trained with hs=1 and negative=0)
    print('validating model...')
    scores = [model.score(validator) for validator in validators]
    return scores

Main loop that calculates the scores for all models.

In [21]:
# container to hold the calculated scores
scores = []

for model_path in model_paths:
    model_scores = get_scores(model_path)
    scores.append(model_scores)

loading model data/corpusSonstiges+base.word2vec.model
validating model...
loading model data/corpusAktuell+base.word2vec.model
validating model...
loading model data/corpusLifestyle+base.word2vec.model
validating model...
loading model data/corpusWirtschaft+base.word2vec.model
validating model...
loading model data/corpusFinanzen+base.word2vec.model
validating model...
loading model data/corpusAusland+base.word2vec.model
validating model...
loading model data/corpusLokal+base.word2vec.model
validating model...
loading model data/corpusPolitik+base.word2vec.model
validating model...
loading model data/corpusSport+base.word2vec.model
validating model...
loading model data/corpusTechnologie+base.word2vec.model
validating model...
loading model data/corpusKultur+base.word2vec.model
validating model...


## Log Likelihood

Output the average score (log-likelihood) of all sentences in a validation set for each model.

rows = validation set (category of the articles), columns = model that scored them

i.e. the highest (least negative) value in each row marks the model that is most likely to have generated the articles of that category, so it should lie on the diagonal

$$ { S }_{ i,j }=\frac { \sum _{ s=1 }^{ dim({ V }_{ i }) }{ score(S({ V }_{ i }, s), { M }_{ j }) } }{ dim({ V }_{ i }) } $$

where:
* ${ V }_{ i }$ is the validation set for category $i$
* ${ M }_{ j }$ is the model trained for category $j$
* ${ S }_{ i,j }$ is the average score ($i$ is the row, $j$ is the column in the table)
* $S({V}_{i}, s)$ is the $s$th sentence in the validation set ${V}_{i}$
* $dim({ V }_{ i })$ is the number of elements in the validation set ${V}_{i}$
* $score(x, m)$ is the log-likelihood that the sentence $x$ is generated by the model $m$ [[Taddy, Matt. Document Classification by Inversion of Distributed Language Representations](https://arxiv.org/pdf/1504.07295.pdf)]
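
The averaging in the formula above can be sketched on a toy example. The numbers are invented, and `toy_scores`, `avg_by_model` and `S` are illustrative names, not part of the notebook.

```python
import numpy as np

# Toy stand-in for `scores`: toy_scores[model j][validation set i][sentence s]
# (all numbers are invented for illustration)
toy_scores = [
    [[-10.0, -12.0, -11.0], [-20.0, -22.0]],   # model 0
    [[-15.0, -13.0, -14.0], [-18.0, -19.0]],   # model 1
]

# the mean over the sentences gives a [model][validation set] table
avg_by_model = np.array([[np.mean(v) for v in per_model] for per_model in toy_scores])

# transpose so that rows are validation sets (i) and columns are models (j)
S = avg_by_model.T
print(S)
```

In this toy table the highest value of each row lies on the diagonal, which is the pattern one would hope to see in the real result as well.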


In [22]:
average_scores = []
for score_set in scores:
    average_scores.append([sum(x) / len(x) for x in score_set])

# transpose from [model][category] to [category][model] so that rows are validation sets and columns are models
result = DataFrame(np.transpose(average_scores), category_names, category_names)
print(result)

               Sonstiges      Aktuell    Lifestyle   Wirtschaft     Finanzen  \
Sonstiges   -3023.305563 -3591.851440 -3316.458243 -3458.459421 -3502.693819   
Aktuell     -3406.847200 -3478.971946 -3452.729374 -3400.865962 -3605.820707   
Lifestyle   -3470.167490 -3697.053330 -3230.019704 -3379.121821 -3694.238708   
Wirtschaft  -2255.291836 -2371.583910 -2080.033885 -1850.201417 -2273.048428   
Finanzen    -2819.153567 -3333.887122 -3010.546884 -2862.410102 -2414.655164   
Ausland     -2153.040332 -2278.715084 -2212.477327 -2172.399341 -2328.481654   
Lokal       -5072.434925 -5768.085391 -5172.428481 -5601.607124 -5249.026092   
Politik     -3259.656002 -3642.098892 -3270.765876 -3251.658240 -3473.375445   
Sport       -4126.546658 -4519.755654 -4262.982045 -4460.560663 -4381.310551   
Technologie -3697.492521 -3914.482529 -3691.743123 -3755.177561 -3831.746594   
Kultur      -9062.581682 -9565.431608 -9165.819357 -9754.343478 -9762.327799   

                 Ausland        Lokal  

## Classification Quality
Calculate the number of categorizations for every category.

rows = category of the validation set

columns = number of items of that validation set classified into each category

i.e. the highest number in each row should be on the diagonal of the matrix

The first step is to transpose the scores (switch the first two dimensions from model->category to category->model). Then `numpy.argmax` is used to find, for every sentence, the index of the model that gives it the highest score.

$$ { C }_{ i,j }=\sum _{ s=1 }^{ dim({ V }_{ i }) }{ \begin{cases} 1 & \text{if } score(S({ V }_{ i }, s), { M }_{ j }) > \underset { k\neq j }{ \max } \ score(S({ V }_{ i }, s), { M }_{ k }) \\ 0 & \text{otherwise} \end{cases} } $$

where:
* ${ C }_{ i,j }$ is the number of elements in ${V}_{i}$ that are classified to ${M}_{j}$
* $score(S({ V }_{ i }, s), { M }_{ j }) > \underset { k\neq j }{ \max } \ score(S({ V }_{ i }, s), { M }_{ k })$ is the classification rule that assigns an article to the model that yields the highest score for it
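
As a minimal sketch of this rule, the following toy example assigns the sentences of a single validation set to the best-scoring model and counts the assignments. The numbers are invented, and `np.bincount` is used here as one possible way to do the counting, not as the notebook's method.

```python
import numpy as np

# Toy scores for ONE validation set: rows = models, columns = sentences
# (invented numbers, playing the role of `category_scores` in the notebook)
category_scores = np.array([
    [-10.0, -12.0, -11.0, -30.0],   # model 0
    [-15.0, -11.0, -14.0, -25.0],   # model 1
    [-20.0, -13.0, -12.0, -40.0],   # model 2
])

# index of the best-scoring model for every sentence
assigned = np.argmax(category_scores, axis=0)

# count how many sentences were assigned to each model
counts = np.bincount(assigned, minlength=category_scores.shape[0])
print(assigned, counts)
```

The vector `counts` corresponds to one row of the classification matrix.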

In [23]:
classification_matrix = np.empty([num_models, num_models], dtype=int)

for category_index in range(num_models):
    # reorder the scores from [model][category][sentence_score] to [category][model][sentence_score]
    category_scores = [model[category_index] for model in scores]
    # for every sentence, get the index of the model with the highest score
    # (the values represent the category index the sentence was assigned to)
    classifications = np.argmax(category_scores, axis=0)

    # convert the assignments to a count of classifications in each category
    classification_count = [np.sum(classifications == x) for x in range(num_models)]
    classification_matrix[category_index] = classification_count
    
result = DataFrame(classification_matrix, category_names, category_names)
print(result)   

             Sonstiges  Aktuell  Lifestyle  Wirtschaft  Finanzen  Ausland  \
Sonstiges          422        1        151          30        10       57   
Aktuell              4        1          2           7         0        3   
Lifestyle           50        0        179          29         6       13   
Wirtschaft          27        1         52         575        66       28   
Finanzen             9        0          9          97       208        4   
Ausland             24        1         10          30         4      260   
Lokal               20        0         11          21         1       11   
Politik             88        1         40         106        11      247   
Sport               12        2          2           9         2        2   
Technologie         13        2         30          25         3       12   
Kultur              41        1         57           6         2       16   

             Lokal  Politik  Sport  Technologie  Kultur  
Sonstiges       1

## Accuracy

Calculate the accuracy matrix. Accuracy is defined as 
$\frac { TP +TN }{ total\ elements }$

$${ ACC }_{ i,j }=\frac { { C }_{ i,j } }{ \sum _{ k }{ { C }_{ i,k } } } $$

**Note**:
Only the diagonal of the matrix holds per-category accuracy values (the share of items of category $i$ that were assigned to ${M}_{i}$). The off-diagonal values give the share of items of category $i$ that were misclassified as category $j$.
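
The row-wise normalization can be sketched with a small invented matrix. The names `C`, `row_totals` and `acc` are illustrative, and the vectorized form is only one possible way to write it.

```python
import numpy as np

# Toy classification matrix: rows = true category, columns = assigned category
C = np.array([
    [8, 2, 0],
    [1, 3, 0],
    [0, 0, 0],   # a category without any validation items
])

# number of validation items per category, kept as a column for broadcasting
row_totals = C.sum(axis=1, keepdims=True)

# guard against division by zero for empty rows
acc = C / np.maximum(row_totals, 1)
print(acc)
```

Each non-empty row of `acc` sums to 1, so a row can be read as the distribution of one category's items over all assigned categories.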

In [24]:
# max(..., 1) around the sum makes sure we don't divide by 0 if a category has no items
accuracy_matrix = [category / float(max([sum(category) ,1])) for category in classification_matrix]

result = DataFrame(accuracy_matrix, category_names, category_names)
print(result) 

             Sonstiges   Aktuell  Lifestyle  Wirtschaft  Finanzen   Ausland  \
Sonstiges     0.548052  0.001299   0.196104    0.038961  0.012987  0.074026   
Aktuell       0.200000  0.050000   0.100000    0.350000  0.000000  0.150000   
Lifestyle     0.152439  0.000000   0.545732    0.088415  0.018293  0.039634   
Wirtschaft    0.033048  0.001224   0.063647    0.703794  0.080783  0.034272   
Finanzen      0.027027  0.000000   0.027027    0.291291  0.624625  0.012012   
Ausland       0.049689  0.002070   0.020704    0.062112  0.008282  0.538302   
Lokal         0.080321  0.000000   0.044177    0.084337  0.004016  0.044177   
Politik       0.055138  0.000627   0.025063    0.066416  0.006892  0.154762   
Sport         0.026726  0.004454   0.004454    0.020045  0.004454  0.004454   
Technologie   0.036011  0.005540   0.083102    0.069252  0.008310  0.033241   
Kultur        0.188940  0.004608   0.262673    0.027650  0.009217  0.073733   

                Lokal   Politik     Sport  Technolo

Calculate the overall accuracy from the diagonal to get a single accuracy score: the number of correctly classified items divided by the number of all validation items.

$$score=\frac { \sum _{ i }{ { C }_{ i,i } } }{ \sum _{ i }{ \sum _{ j }{ { C }_{ i,j } } } } $$

This value is comparable with the accuracy score of scikit-learn.
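
The overall accuracy (correct items over all items) generally differs from the unweighted mean of the per-category values on the diagonal, because large categories weigh more in the former. A toy matrix with invented counts illustrates the difference; `overall` and `macro` are hypothetical names.

```python
import numpy as np

# Toy classification matrix: rows = true category, columns = assigned category
C = np.array([
    [8, 2, 0],
    [1, 3, 1],
    [0, 1, 4],
])

# overall accuracy: correctly classified items divided by all items
overall = np.trace(C) / C.sum()

# unweighted mean of the per-category accuracies on the diagonal
per_class = np.diag(C) / C.sum(axis=1)
macro = per_class.mean()
print(overall, macro)
```

With strongly unbalanced validation sets, such as the ones in this notebook, the two numbers can be far apart, so it helps to state which one is reported.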

In [27]:
true_positives = 0.0
num_samples = 0
for x in range(num_models):
    true_positives += classification_matrix[x][x]
    num_samples += sum(classification_matrix[x])
    
average_score = true_positives / num_samples

print('score: {}'.format(average_score))

score: 0.645740707807
