## Topic model for EUR-lex collection classification

### Murat Apishev, great-mel@yandex.ru

#### Dataset and experiment description

In this experiment we train a topic model on the EUR-lex collection. The collection has two modalities, regular tokens and class labels, and after pre-processing it has the following properties:
- 20000 documents: about 18000 in the training sample (split into batches of 1000 documents each) and about 1950 in the test sample (a single batch);
- 21000 regular tokens in the vocabulary;
- 3900 classes;
- each document belongs to 3-6 classes on average.

The goal of the experiment is to build a topic model that classifies documents with high quality. The quality measures are:
- the area under the ROC curve (AUC-ROC);
- the area under the precision-recall curve (AUC-PR);
- the percentage of documents whose most probable predicted label is wrong (OneError);
- the percentage of documents that are not classified perfectly, i.e. at least one wrong label is ranked above a right one (IsError);
- the average precision (AverPrec): for each right label we compute the fraction of right labels among the labels ranked higher than it, then average this value first over the right labels of the document and then over all documents.
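
As a minimal sketch of the two ranking-based error measures, consider a toy document with five classes. The helper names `one_error` and `is_perfect` and all numbers are illustrative assumptions, not part of the experiment; ties are broken by class index.

```python
def one_error(true_labels, probs):
    """Return True if the most probable class is not a right label."""
    top = max(range(len(probs)), key=lambda i: probs[i])
    return true_labels[top] == 0


def is_perfect(true_labels, probs):
    """Return True if the k most probable classes are exactly the k right labels."""
    k = sum(true_labels)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return all(true_labels[i] == 1 for i in ranked[:k])


# Toy document: classes 0 and 3 are right.
true_labels = [1, 0, 0, 1, 0]
probs_good = [0.40, 0.05, 0.10, 0.35, 0.10]  # both right labels are on top
probs_bad = [0.40, 0.30, 0.10, 0.15, 0.05]   # wrong class 1 outranks right class 3

print(one_error(true_labels, probs_good), is_perfect(true_labels, probs_good))
print(one_error(true_labels, probs_bad), is_perfect(true_labels, probs_bad))
```

The second document does not count towards OneError, because its top class is right, but it does count towards IsError, so IsError is always at least as large as OneError.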

The AUC measures are computed per document, between the vector of predicted class probabilities and the vector of right answers for that document, and are then averaged over all documents.
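
The per-document averaging can be sketched in plain Python. The function below uses the pairwise formulation of AUC-ROC (the share of right/wrong label pairs that are ordered correctly, with ties counted as one half), which for binary labels gives the same value as the area under the ROC curve. The function name `roc_auc` and the toy data are assumptions made for illustration.

```python
def roc_auc(true_labels, scores):
    """AUC-ROC as the share of (right, wrong) label pairs ranked correctly."""
    pos = [s for s, y in zip(scores, true_labels) if y == 1]
    neg = [s for s, y in zip(scores, true_labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Two toy documents over four classes (made-up values).
docs_true = [[1, 0, 0, 1], [0, 1, 0, 0]]
docs_prob = [[0.5, 0.2, 0.4, 0.3], [0.1, 0.6, 0.2, 0.1]]

# One AUC value per document, then a plain mean over documents.
per_doc = [roc_auc(t, p) for t, p in zip(docs_true, docs_prob)]
mean_auc = sum(per_doc) / len(per_doc)
print(per_doc, mean_auc)
```

Because the score is computed inside each document, a document with few right labels weighs as much in the final average as a document with many.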

All measures are computed on the test sample; the documents in the test batch carry no information about their class labels.

All measures are taken from the article by T. Rubin, A. Chambers, P. Smyth, M. Steyvers, "Statistical topic models for multi-label document classification".

#### The steps of the experiment

First, let's import all the necessary Python packages and the new BigARTM API:

In [None]:
from __future__ import division

import os
import sys
import glob
import pickle
import numpy
import sklearn.metrics

from artm.artm_model import *

Now we define two helper functions that are used to compute some of the quality measures:

In [None]:
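def perfect_classification(true_labels, probs):
    # True if the k most probable classes are exactly the k right labels of the document.
    temp_true_labels = list(true_labels)
    temp_probs = list(probs)
    for i in xrange(sum(true_labels)):
        # Take the most probable remaining class ...
        idx = temp_probs.index(max(temp_probs))

        # ... and fail as soon as it is not a right label.
        if temp_true_labels[idx] == 0:
            return False

        del temp_true_labels[idx]
        del temp_probs[idx]
    return True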

In [None]:
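def count_precision(true_labels, probs):
    # For every right label: the share of right labels among the classes ranked strictly above it.
    # A right label with nothing ranked above it contributes 0.
    # Relies on true division (from __future__ import division).
    retval, index = 0, -1
    for label in true_labels:
        denominator, numerator = 0, 0
        index += 1
        if label:
            for prob_idx in xrange(len(probs)):
                if probs[prob_idx] > probs[index]:
                    denominator += 1
                    if true_labels[prob_idx] == 1:
                        numerator += 1
        if denominator > 0:
            retval += numerator / denominator
    # Average over the right labels of the document.
    retval /= sum(true_labels)
    return retval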

Now let's define several useful constants:
- the names of the modalities (the ones used by the parser when the batches and the dictionary were created);
- the full path to the folder containing the batches;
- the full name of the file with the labels of the test documents;
- the name of the file with the '.batch_test' extension that contains the test documents.

In [None]:
labels_class = '@labels_class'
tokens_class = '@default_class'

data_folder         = 'D:/Work/University/course_work/bigartm/multimodal_experiments/eurlex_data'
test_labels_file    = os.path.join(data_folder, 'test_labels.eurlex_artm')
test_documents_file = '7d6a65e7-712a-43e5-bdad-529075961598.batch_test'

We load the labels of the test documents right away:

In [None]:
with open(test_labels_file, 'rb') as f:
    true_p_cd = [[int(p_cd) for p_cd in p_d] for p_d in pickle.load(f)]
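
The loading code implies that the pickled object is a sequence with one entry per test document, each entry being a sequence of 0/1 flags over the classes. A minimal sketch of that layout, using a hypothetical temporary file and made-up labels rather than the real `test_labels.eurlex_artm`:

```python
import os
import pickle
import tempfile

# Hypothetical label matrix: 2 test documents x 4 classes (values are made up).
toy_labels = [[1, 0, 0, 1], [0, 1, 0, 0]]

# Write the toy labels to a temporary file standing in for the real labels file.
path = os.path.join(tempfile.mkdtemp(), 'toy_labels.pkl')
with open(path, 'wb') as f:
    pickle.dump(toy_labels, f)

# Same loading pattern as in the notebook cell.
with open(path, 'rb') as f:
    loaded = [[int(flag) for flag in doc] for doc in pickle.load(f)]

# Number of documents and number of right labels per document.
print(len(loaded), [sum(doc) for doc in loaded])
```

Each inner list must have one flag per class, in the same class order as the rows of the Psi matrix, because the flags are later compared position by position with the predicted p(c|d) vector.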

Our model is defined by a set of parameters:
- the number of topics;
- the number of iterations over the whole collection;
- the number of iterations over a single document (+);
- the weight of the "class labels" modality (+);
- the weight of the "tokens" modality (+);
- the smoothing coefficient for the Theta matrix (+);
- the smoothing coefficient for the Phi matrix (+);
- the smoothing coefficient for the Psi matrix (+);
- the coefficient of the LabelRegularization regularizer (+).

(+) means that the parameter is a list of values, one value per pass over the collection.

Besides these key values we also need one technical variable: the list of pass numbers at which the quality measures are computed.

In [None]:
num_topics            = 500
num_collection_passes = 10

num_document_passes   = [16] * num_collection_passes
labels_class_weight   = [1.0, 1.0, 0.9, 0.9, 0.9, 0.8, 0.8, 0.8, 0.7, 0.7]
tokens_class_weight   = [1] * num_collection_passes

smooth_theta_tau      = [0.02] * num_collection_passes
smooth_phi_tau        = [0.01] * num_collection_passes

smooth_psi_tau        = [0.01] * num_collection_passes
label_psi_tau         = [0.0] * num_collection_passes

count_scores_iters = [num_collection_passes - 1]
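
Since `labels_class_weight` is written out by hand, changing `num_collection_passes` can leave a schedule too short (the training loop would then fail with an `IndexError`) or too long (the extra values would be ignored). A small sanity check could look like the sketch below; the `schedules` dictionary is a hypothetical helper that repeats only three of the lists for brevity.

```python
num_collection_passes = 10

# Hypothetical container for the per-pass schedules defined above.
schedules = {
    'num_document_passes': [16] * num_collection_passes,
    'labels_class_weight': [1.0, 1.0, 0.9, 0.9, 0.9, 0.8, 0.8, 0.8, 0.7, 0.7],
    'smooth_theta_tau': [0.02] * num_collection_passes,
}

# Collect the names of schedules whose length differs from the number of passes.
bad = [name for name, values in schedules.items()
       if len(values) != num_collection_passes]
if bad:
    raise ValueError('schedules with wrong length: ' + ', '.join(sorted(bad)))
print('all schedules have %d values' % num_collection_passes)
```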

Now let's create the model and initialize it with the dictionary:

In [None]:
model = ArtmModel(num_topics=num_topics, num_document_passes=1)

In [None]:
model.load_dictionary(dictionary_name='dictionary', dictionary_path=os.path.join(data_folder, 'dictionary.eurlex_artm'))
model.initialize(dictionary_name='dictionary')

The next step is to create the smoothing regularizers for all three matrices and the LabelRegularization regularizer for the Psi matrix:

In [None]:
model.regularizers.add(SmoothSparsePhiRegularizer(name='SmoothPsiRegularizer', class_ids=[labels_class]))
model.regularizers.add(LabelRegularizationPhiRegularizer(name='LabelPsiRegularizer', class_ids=[labels_class]))

model.regularizers.add(SmoothSparsePhiRegularizer(name='SmoothPhiRegularizer', class_ids=[tokens_class]))
model.regularizers.add(SmoothSparseThetaRegularizer(name='SmoothThetaRegularizer'))

Now we can start training the model. On each pass over the collection we update the regularization coefficients, the modality weights and the number of iterations over a single document, and then call the learning method. If the quality measures are to be computed on this pass, the following operations are performed:
- build the Theta matrix for the test batch from the current model state;
- extract the Psi matrix and compute the label probabilities in documents as p(c|d) = sum_t p(c|t) * p(t|d);
- call the functions defined earlier, or the ones from the sklearn package, with the p(c|d) vectors and the right answers as input;
- average all results over the test documents and print them.
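
The formula p(c|d) = sum_t p(c|t) * p(t|d) is a product of the Psi matrix (classes by topics) with the document's column of Theta. Here is a minimal sketch with a made-up model of three classes and two topics; the function name `label_probabilities` and all values are assumptions for illustration.

```python
def label_probabilities(psi, theta_column):
    """p(c|d) = sum_t p(c|t) * p(t|d) for one document."""
    return [sum(p_ct * p_td for p_ct, p_td in zip(row, theta_column))
            for row in psi]


# Toy Psi: rows are classes, columns are topics; each column sums to 1.
psi = [[0.5, 0.0],
       [0.5, 0.25],
       [0.0, 0.75]]
# Toy p(t|d) of one document over the two topics.
theta_column = [0.25, 0.75]

p_cd = label_probabilities(psi, theta_column)
print(p_cd)
```

Because every column of Psi and the Theta column each sum to 1, the resulting vector is itself a probability distribution over the classes. With NumPy arrays the same result for all documents at once is the single matrix product `numpy.dot(Psi, Theta)`.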

In [None]:
for iter in xrange(num_collection_passes):
    print 'Iter #' + str(iter)
    model.regularizers['SmoothPsiRegularizer'].tau = smooth_psi_tau[iter]
    model.regularizers['LabelPsiRegularizer'].tau = label_psi_tau[iter]
    model.regularizers['SmoothPhiRegularizer'].tau = smooth_phi_tau[iter]
    model.regularizers['SmoothThetaRegularizer'].tau = smooth_theta_tau[iter]
    
    model.class_ids = {tokens_class: tokens_class_weight[iter], labels_class: labels_class_weight[iter]}

    model.num_document_passes = num_document_passes[iter]

    model.fit_offline(num_collection_passes=1, data_path=data_folder)
    
    if iter in count_scores_iters:
        print 'Start processing iteration #' + str(iter) + '...'
        # Theta for the test batch and Psi are needed only on scoring passes.
        test_theta = model.find_theta(data_path=data_folder, batches=[test_documents_file])
        Psi = model.get_phi(class_ids=[labels_class]).as_matrix()

        items_auc_roc, items_auc_pr = [], []
        one_error, is_error, precision = 0, 0, 0
        for item_index in xrange(len(test_theta.columns)):
            p_cd = [numpy.dot(test_theta[item_index], p_w) for p_w in Psi]

            items_auc_roc.append(sklearn.metrics.roc_auc_score(true_p_cd[item_index], p_cd))
            prec, rec, _ = sklearn.metrics.precision_recall_curve(true_p_cd[item_index], p_cd)
            items_auc_pr.append(sklearn.metrics.auc(rec, prec))

            if true_p_cd[item_index][p_cd.index(max(p_cd))] == 0:
                one_error += 1

            if not perfect_classification(true_p_cd[item_index], p_cd):
                is_error += 1

            precision += count_precision(true_p_cd[item_index], p_cd)

        # All averages are taken over the number of processed test documents.
        num_items = len(items_auc_roc)
        average_auc       = sum(items_auc_roc) / num_items
        average_auc_pr    = sum(items_auc_pr) / num_items
        average_one_error = (one_error / num_items) * 100
        average_is_error  = (is_error / num_items) * 100
        average_precision = precision / num_items

        print "AUC-ROC = %.3f " % average_auc,
        print "| OneError = %.1f " % average_one_error,
        print "| IsError = %.1f " % average_is_error,
        print "| AverPrec = %.3f " % average_precision,
        print "| AUC-PR = %.3f" % average_auc_pr