6. Tokens Co-occurrence and Coherence Computation
=================================================

* **Interpretability and Coherence of Topics**

One of the main requirements of topic models is interpretability, i.e., whether the topics contain tokens that, according to subjective human judgment, are representative of a single coherent concept. `Newman et al. <http://www.aclweb.org/anthology/N10-1012>`_ showed that human evaluation of interpretability correlates well with an automated quality measure called coherence. The coherence of a topic :math:`t` is defined as

:math:`\mathcal{C}_t = \cfrac{2}{k(k - 1)} \sum\limits_{i = 1}^{k - 1} \sum\limits_{j = i + 1}^{k} \mathrm{value}(w_i, w_j)`,

where :math:`w_1, \dots, w_k` are the top tokens of the topic and :math:`\mathrm{value}` is some symmetric pairwise information about tokens in the collection, supplied by the user according to their goals, for instance:

* positive PMI: :math:`\mathrm{value}(u, v)=\left[\log\cfrac{p(u, v)}{p(u)p(v)}\right]_{+}`,

where `p(u, v)` is the joint probability of tokens `u` and `v` in the corpus according to some probabilistic model. Joint probabilities are required to be symmetric.

Several token co-occurrence models are implemented in BigARTM. You can compute them automatically or use your own model to supply the pairwise information needed for coherence.
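
The coherence formula above can be sketched in plain Python. This is only an illustration and not BigARTM code: the names ``topic_coherence``, ``toy_values`` and ``toy_value`` are hypothetical, and the pairwise values are made up for the example. ::

    from itertools import combinations

    def topic_coherence(top_tokens, value):
        """Average value(w_i, w_j) over all unordered pairs of top tokens."""
        if len(top_tokens) < 2:
            return 0.0  # no pairs to average over
        pairs = list(combinations(top_tokens, 2))
        # len(pairs) == k * (k - 1) / 2, so this equals 2 / (k (k - 1)) * sum
        return sum(value(u, v) for u, v in pairs) / len(pairs)

    # Toy symmetric pairwise values (assumed data, for illustration only)
    toy_values = {
        frozenset(("cat", "dog")): 1.0,
        frozenset(("cat", "fish")): 0.5,
    }

    def toy_value(u, v):
        # frozenset keys make the lookup independent of token order
        return toy_values.get(frozenset((u, v)), 0.0)

    coherence = topic_coherence(["cat", "dog", "fish"], toy_value)
    print(coherence)  # (1.0 + 0.5 + 0.0) / 3 = 0.5

Any symmetric function of two tokens can be passed as ``value``, which mirrors the freedom the definition gives to the user.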


* **Tokens Co-occurrence Dictionary**

BigARTM provides automatic gathering of co-occurrence statistics and coherence computation. The co-occurrence gathering tool is available from :doc:`../bigartm_cli`. The probabilities in the PPMI formula can be estimated from frequencies in the corpus:

:math:`p(u, v) = \cfrac{n(u, v)}{n}`;

:math:`p(u) = \cfrac{n(u)}{n}`;

:math:`n(u) = \sum\limits_{w \in W}{n(u, w)}`;

:math:`n = \sum\limits_{w \in W}{n(w)}`.
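
As a minimal sketch of these estimates, the snippet below derives positive PMI from a table of joint counts. It is not the BigARTM implementation: ``ppmi_from_counts`` is a hypothetical helper, the counts are toy data, and a pair with zero joint count is assumed to get the value 0 (the positive part of minus infinity). ::

    import math
    from collections import defaultdict

    def ppmi_from_counts(pair_counts):
        """Return a ppmi(u, v) function built from symmetric counts n(u, v).

        pair_counts maps ordered pairs (u, v) to n(u, v) and is assumed to
        contain both (u, v) and (v, u).
        """
        token_counts = defaultdict(float)       # n(u) = sum over w of n(u, w)
        for (u, _), n_uv in pair_counts.items():
            token_counts[u] += n_uv
        total = sum(token_counts.values())      # n = sum over w of n(w)

        def ppmi(u, v):
            n_uv = pair_counts.get((u, v), 0.0)
            if n_uv == 0:
                return 0.0
            # p(u, v) / (p(u) p(v)) simplifies to n(u, v) * n / (n(u) n(v))
            pmi = math.log(n_uv * total / (token_counts[u] * token_counts[v]))
            return max(pmi, 0.0)

        return ppmi

    # Toy counts: n(a) = 3, n(b) = 2, n(c) = 1, n = 6
    ppmi = ppmi_from_counts({
        ("a", "b"): 2, ("b", "a"): 2,
        ("a", "c"): 1, ("c", "a"): 1,
    })
    print(ppmi("a", "b"))  # log(2 * 6 / (3 * 2)) = log 2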

Everything depends on how the joint frequencies (i.e., co-occurrences) are calculated. The following types of co-occurrences are currently available:

* Cooc TF: :math:`n(u, v) = \sum\limits_{d = 1}^{|D|} \sum\limits_{i = 1}^{N_d} \sum\limits_{j = 1}^{N_d} [0 < |i - j| \leq k] [w_{di} = u] [w_{dj} = v]`,

* Cooc DF: :math:`n(u, v) = \sum\limits_{d = 1}^{|D|} [\, \exists \, i, j : w_{di} = u, w_{dj} = v, 0 < |i - j| \leq k]`,

where :math:`k` is the window width, which can be specified by the user, :math:`D` is the collection, and :math:`N_{d}` is the length of document :math:`d`. In brief, cooc TF measures how many times a given pair occurred within a window across the collection, and cooc DF measures in how many documents the pair occurred at least once within a window of the given width.
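
The two definitions can be illustrated with a direct, unoptimized Python sketch. It follows the formulas literally on tokenized documents and is not how the BigARTM tool is implemented; the function name ``count_cooccurrences`` and its ``window`` argument (playing the role of :math:`k`) are hypothetical. ::

    from collections import Counter

    def count_cooccurrences(documents, window):
        """Return (cooc_tf, cooc_df) counters keyed by ordered pairs (u, v)."""
        cooc_tf = Counter()
        cooc_df = Counter()
        for tokens in documents:
            seen_in_doc = set()
            for i, u in enumerate(tokens):
                lo = max(0, i - window)
                hi = min(len(tokens), i + window + 1)
                for j in range(lo, hi):
                    if j == i:
                        continue                 # enforce 0 < |i - j|
                    pair = (u, tokens[j])
                    cooc_tf[pair] += 1           # every occurrence counts
                    seen_in_doc.add(pair)
            for pair in seen_in_doc:
                cooc_df[pair] += 1               # at most once per document
        return cooc_tf, cooc_df

    docs = [["a", "b", "a"], ["a", "b"]]
    cooc_tf, cooc_df = count_cooccurrences(docs, window=1)
    print(cooc_tf[("a", "b")], cooc_df[("a", "b")])  # 3 2

Because both positions :math:`i` and :math:`j` range over the whole document, the resulting counts are symmetric in :math:`u` and :math:`v`.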

This document should be enough to gather co-occurrence statistics, but you can also look at runnable examples and common mistakes in `bigartm-book <http://nbviewer.jupyter.org/github/bigartm/bigartm-book/blob/master/junk/cooc_dictionary/example_of_gathering.ipynb>`_ (in Russian).

* **Coherence Computation**

Let's assume you have a file `cooc.txt` with some pairwise information in Vowpal Wabbit format, which means that the lines of that file look like this: ::

    token_u token_v:cooc_uv token_w:cooc_uw

You should also have a vocabulary file in UCI format, `vocab.txt`, corresponding to the `cooc.txt` file.

.. note::

   If tokens have **non-default modalities** in the collection, you should specify their modalities in `vocab.txt` (in `cooc.txt` they're added automatically).
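
To make the line format concrete, here is a small parsing sketch for a single line of `cooc.txt`. It is an illustration only: ``parse_cooc_line`` is a hypothetical helper, the sample line is made up, and the sketch assumes whitespace-separated fields without any modality markers. ::

    def parse_cooc_line(line):
        """Split 'token_u token_v:cooc_uv ...' into (token_u, {token_v: cooc_uv})."""
        parts = line.split()
        if not parts:
            return None, {}
        first_token = parts[0]
        values = {}
        for item in parts[1:]:
            # Split on the last colon so tokens containing ':' stay intact
            token, sep, raw_value = item.rpartition(":")
            if not sep:
                raise ValueError("expected token:value, got %r" % item)
            values[token] = float(raw_value)
        return first_token, values

    first, values = parse_cooc_line("cat dog:3 fish:1.5")
    print(first, values)  # cat {'dog': 3.0, 'fish': 1.5}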

To load co-occurrence data into BigARTM, use an ``artm.Dictionary`` object and its ``gather`` method: ::

    import artm

    # Build a dictionary that also stores the co-occurrence values
    cooc_dict = artm.Dictionary()
    cooc_dict.gather(
        data_path='batches_folder',
        cooc_file_path='cooc.txt',
        vocab_file_path='vocab.txt',
        symmetric_cooc_values=True)

Here:

* ``data_path`` is the path to the folder with your collection in internal BigARTM format (batches);
* ``cooc_file_path`` is the path to the file with co-occurrences of tokens;
* ``vocab_file_path`` is the path to the vocabulary file corresponding to `cooc.txt`;
* ``symmetric_cooc_values`` is a Boolean argument. ``True`` means that the co-occurrence information is symmetric in the order of tokens; ``False`` means that it is not, i.e. `value(w1, w2)` and `value(w2, w1)` can be different.


Now you can create the coherence score: ::

    coherence_score = artm.TopTokensScore(
        name='TopTokensCoherenceScore',
        class_id='@default_class',
        num_tokens=10,
        topic_names=[u'topic_0', u'topic_1'],
        dictionary=cooc_dict)

Arguments:

* ``name`` is the name of the score;
* ``class_id`` is the name of the modality that contains the tokens with co-occurrence information;
* ``num_tokens`` is the number ``k`` of top tokens used for topic coherence computation;
* ``topic_names`` is the list of topic names for coherence computation;
* ``dictionary`` is the ``artm.Dictionary`` with co-occurrence statistics.

To add ``coherence_score`` to the model, use the following line: ::

	model_artm.scores.add(coherence_score)

To access the results of the coherence computation, use, for example: ::

    model_artm.score_tracker['TopTokensCoherenceScore'].average_coherence

General explanations and details about score usage can be found in :doc:`regularizers_and_scores`.