/
coherence.txt
95 lines (56 loc) · 4.91 KB
/
coherence.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
6. Tokens Co-occurrence and Coherence Computation
=================================================
* **Interpretability and Coherence of Topics**
The one of main requirements of topic models is interpretability (i.e., do the topics contain tokens that, according to subjective human judgment, are representative of single coherent concept). `Newman at al. <http://www.aclweb.org/anthology/N10-1012>`_ showed the human evaluation of interpretability is well correlated with the following automated quality measure called coherence. The coherence formula for topic is defined as
:math:`\mathcal{C}_t = \cfrac{2}{k(k - 1)} \sum\limits_{i = 1}^{k - 1} \sum\limits_{j = i + 1}^{k} \mathrm{value}(w_i, w_j)`,
where value is some symetric pairwise information about tokens in collection, which is provided by user according to his goals, for instance:
* positive PMI: :math:`value(u, v)=\left[\log\cfrac{p(u, v)}{p(u)p(v)}\right]_{+}`,
where `p(u, v)` is joint probability of tokens `u` and `v` in corpora acording to some probabilistic model. We will require joint probablities to be symetrical.
Some models of tokens co-occurrence are implemented in BigARTM, you can automatically calculate them or use your own model to provide pairwise information to calculate coherence.
* **Tokens Co-occurrence Dictionary**
BigARTM provides automatic gathering of co-occurrence statistics and coherence computation. Co-occurrence gathering tool is available from :doc:`../bigartm_cli`. Probabilities in PPMI formula can be estimated by frequencies in corpora:
:math:`p(u, v) = \cfrac{n(u, v)}{n}`;
:math:`p(u) = \cfrac{n(u)}{n}`;
:math:`n(u) = \sum\limits_{w \in W}{n(u, w)}`;
:math:`n = \sum\limits_{w \in W}{n(w)}`.
All depends of the way to calculate joint frequencies (i.e. co-occurrences). Here are types of co-occurrences available now:
* Cooc TF: :math:`n(u, v) = \sum\limits_{d = 1}^{|D|} \sum\limits_{i = 1}^{N_d} \sum\limits_{j = 1}^{N_d} [0 < |i - j| \leq k] [w_{di} = u] [w_{dj} = v]`,
* Cooc DF: :math:`n_(u, v) = \sum\limits_{d = 1}^{|D|} [\, \exists \, i, j : w_{di} = u, w_{dj} = v, 0 < |i - j| \leq k]`,
where k is parameter of window width which can be specified by user, D is collection, :math:`N_{d}` is length of document d. In brief cooc TF measures how many times the given pair occurred in the collection in a window and cooc DF measures in how many documents the given pair occurred at least once in a window of a given width.
This document should be enough to gather co-occurrence statistics, but also you can look at launchable examples and common mistakes in `bigartm-book <http://nbviewer.jupyter.org/github/bigartm/bigartm-book/blob/master/junk/cooc_dictionary/example_of_gathering.ipynb>`_ (in russian).
* **Coherence Computation**
Let's assume you have file `cooc.txt` with some pairwise information in Vowpal Wabbit format, which means that lines of that file look like this: ::
token_u token_v:cooc_uv token_w:cooc_uw
Also you should have vocabulary file in UCI format `vocab.txt` corresponding to `cooc.txt` file.
.. note::
If tokens has **nondefault modalities** in collection, you should specify their modalities in `vocab.txt` (in `cooc.txt` they're added automatically).
To upload co-occurrence data into BigARTM you should use ``artm.Dictionary`` object and method ``gather``: ::
cooc_dict = artm.Dictionary()
cooc_dict.gather(
data_path='batches_folder',
cooc_file_path='cooc.txt',
vocab_file_path='vocab.txt',
symmetric_cooc_values=True)
Where
* ``data_path`` is path to folder with your collection in internal BigARTM format (batches);
* ``cooc_file_path`` is path to file with co-occurrences of tokens;
* ``vocab_file_path`` is path to vocabulary file corresponding to `cooc.txt`;
* ``symmetric_cooc_values`` is Boolean argument. ``False`` means that co-occurrence information is not symmetric by order of tokens, i.e. `value(w1, w2)` and `value(w2, w1)` can be different.
So, now you can create coherence computation score: ::
coherence_score = artm.TopTokensScore(
name='TopTokensCoherenceScore',
class_id='@default_class',
num_tokens=10,
topic_names=[u'topic_0',u'topic_1'],
dictionary=cooc_dict)
Arguments:
* ``name`` is name of score;
* ``class_id`` is name of modality, which contains tokens with co-occurrence information;
* ``num_tokens`` is number ``k`` of used top tokens for topic coherence computation;
* ``topic_names`` is list of topic names for coherence computation;
* ``dictionary`` is artm.Dictionary with co-occurrence statistic.
To add ``coherence_score`` to the model use next line: ::
model_artm.scores.add(coherence_score)
To access to results of coherence computation use, for example: ::
model.score_tracker['TopTokensCoherenceScore'].average_coherence
General explanations and details about scores usage can be found here: :doc:`regularizers_and_scores`.