Commit da19394
Fixes in documentation (#918)
* refreshed documentation, add link to bigartm-book
MichaelSolotky committed Jul 24, 2018
1 parent 76f137d commit da19394
Showing 12 changed files with 88 additions and 60 deletions.
21 changes: 15 additions & 6 deletions docs/tutorials/bigartm_cli.txt
@@ -14,18 +14,27 @@ Then you can use ``bigartm`` as described by ``bigartm --help``.
You may also get more information about built-in regularizers by typing ``bigartm --help --regularizer``.


* **Gathering Co-occurrence Statistics Files**

In order to gather co-occurrence statistics files you need two files: a collection in Vowpal Wabbit format and a file of tokens (the so-called `"vocab"`) in UCI format. The vocab is used to filter the tokens of the collection, which means that co-occurrence is not calculated for a pair unless both of its tokens are present in the vocab. Two types of co-occurrence counts are currently available: TF and DF (see the description below). You may also want to calculate the positive PMI of the gathered co-occurrence values; this is useful if you plan to use the co-occurrence dictionary in coherence computation. The utility writes the pointwise information as a pseudo-collection in Vowpal Wabbit format.

.. note::

    If you want to compute co-occurrences of tokens of a **non-default modality**, you should specify those modalities in the vocab file. For more information about the UCI file format, please visit :doc:`./datasets`.
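
For illustration, a vocab file that assigns tokens to modalities might look like this (the tokens and the ``@labels`` modality are invented for this example; tokens without an explicit modality belong to the default one): ::

    discovery
    science @default_class
    good @labels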

Here is a combination of command-line keys that allows you to build co-occurrence dictionaries:

.. code-block:: bash

    bigartm -c vw -v vocab --cooc-window 10 --cooc-min-tf 200 --write-cooc-tf cooc_tf_ --cooc-min-df 200 --write-cooc-df cooc_df_ --write-ppmi-tf ppmi_tf_ --write-ppmi-df ppmi_df_


You can also look at runnable examples and common mistakes in the `bigartm-book <http://nbviewer.jupyter.org/github/bigartm/bigartm-book/blob/master/junk/cooc_dictionary/example_of_gathering.ipynb>`_ (in Russian).

The numbers and file names above are just an example and can be changed.
For a description of each key, see the table of all available keys below.

* **BigARTM CLI keys**

.. code-block:: bash

44 changes: 34 additions & 10 deletions docs/tutorials/python_userguide/coherence.txt
@@ -5,26 +5,50 @@

One of the main requirements for topic models is interpretability (i.e., do the topics contain tokens that, according to subjective human judgment, represent a single coherent concept?). `Newman et al. <http://www.aclweb.org/anthology/N10-1012>`_ showed that human evaluation of interpretability is well correlated with the following automated quality measure, called coherence. The coherence of a topic is defined as

:math:`\mathcal{C}_t = \cfrac{2}{k(k - 1)} \sum\limits_{i = 1}^{k - 1} \sum\limits_{j = i + 1}^{k} \mathrm{value}(w_i, w_j)`,

where value is some symmetric pairwise information about tokens in the collection, which is provided by the user according to their goals, for instance:

* positive PMI: :math:`value(u, v)=\left[\log\cfrac{p(u, v)}{p(u)p(v)}\right]_{+}`,

where `p(u, v)` is the joint probability of tokens `u` and `v` in the corpus according to some probabilistic model. We require the joint probabilities to be symmetric.

Several models of token co-occurrence are implemented in BigARTM; you can calculate them automatically or use your own model to provide the pairwise information for coherence computation.
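
As a toy illustration (plain Python, not a BigARTM API), the coherence of a topic's top tokens can be computed from any such pairwise ``value`` function as follows: ::

    def coherence(top_tokens, value):
        # C_t = 2 / (k * (k - 1)) * sum of value(w_i, w_j) over pairs with i < j
        k = len(top_tokens)
        pair_sum = sum(value(top_tokens[i], top_tokens[j])
                       for i in range(k - 1)
                       for j in range(i + 1, k))
        return 2.0 * pair_sum / (k * (k - 1))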


* **Tokens Co-occurrence Dictionary**

BigARTM supports automatic gathering of co-occurrence statistics and coherence computation. The co-occurrence gathering tool is described in :doc:`../bigartm_cli`. The probabilities in the PPMI formula can be estimated from frequencies in the corpus:

:math:`p(u, v) = \cfrac{n(u, v)}{n}`;

:math:`p(u) = \cfrac{n(u)}{n}`;

:math:`n(u) = \sum\limits_{w \in W}{n(u, w)}`;

:math:`n = \sum\limits_{w \in W}{n(w)}`.

Everything depends on how the joint frequencies (i.e., co-occurrences) are calculated. The following types of co-occurrence are currently available:

* Cooc TF: :math:`n(u, v) = \sum\limits_{d = 1}^{|D|} \sum\limits_{i = 1}^{N_d} \sum\limits_{j = 1}^{N_d} [0 < |i - j| \leq k] [w_{di} = u] [w_{dj} = v]`,

* Cooc DF: :math:`n(u, v) = \sum\limits_{d = 1}^{|D|} [\, \exists \, i, j : w_{di} = u, w_{dj} = v, 0 < |i - j| \leq k]`,

where k is the window-width parameter, which can be specified by the user, D is the collection, and :math:`N_{d}` is the length of document d. In brief, cooc TF measures how many times a given pair occurs in the collection within a window, and cooc DF measures in how many documents the given pair occurs at least once within a window of the given width.
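
The following standalone Python sketch (an illustration, not the BigARTM implementation) mirrors these definitions on a toy tokenized collection; note that the TF formula counts ordered pairs, so every unordered pair is counted twice: ::

    import math
    from collections import defaultdict

    def gather_cooc(docs, k):
        """docs: list of documents, each a list of tokens; k: window width."""
        cooc_tf = defaultdict(int)  # n(u, v) summed over all windows in the collection
        cooc_df = defaultdict(int)  # number of documents where the pair co-occurs in a window
        for doc in docs:
            pairs_in_doc = set()
            for i in range(len(doc)):
                for j in range(i + 1, min(i + k + 1, len(doc))):
                    pair = tuple(sorted((doc[i], doc[j])))
                    cooc_tf[pair] += 2  # the formula counts both (i, j) and (j, i)
                    pairs_in_doc.add(pair)
            for pair in pairs_in_doc:
                cooc_df[pair] += 1
        return cooc_tf, cooc_df

    def ppmi(cooc, marginal, n):
        """PPMI(u, v) = max(0, log(p(u, v) / (p(u) p(v)))) with the estimates above."""
        return {(u, v): max(0.0, math.log(c * n / (marginal[u] * marginal[v])))
                for (u, v), c in cooc.items() if c > 0}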

This document should be enough to gather co-occurrence statistics, but you can also look at runnable examples and common mistakes in the `bigartm-book <http://nbviewer.jupyter.org/github/bigartm/bigartm-book/blob/master/junk/cooc_dictionary/example_of_gathering.ipynb>`_ (in Russian).

* **Coherence Computation**

Let's assume you have a file `cooc.txt` with some pairwise information in Vowpal Wabbit format, which means that its lines look like this: ::

    token_u token_v:cooc_uv token_w:cooc_uw

You should also have a vocabulary file `vocab.txt` in UCI format corresponding to the `cooc.txt` file.
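
For illustration, `vocab.txt` might contain: ::

    cat
    dog
    mouse

and the corresponding `cooc.txt` (invented values, default modality only) might look like: ::

    cat dog:15 mouse:3
    dog mouse:7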

.. note::

    If tokens have **non-default modalities** in the collection, you should specify their modalities in `vocab.txt` (in `cooc.txt` they are added automatically).

To upload the co-occurrence data into BigARTM, use an ``artm.Dictionary`` object and its ``gather`` method.
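
A minimal sketch of such an upload (folder and file names are placeholders; ``data_path`` should point to a folder with previously created batches of the collection): ::

    import artm

    cooc_dict = artm.Dictionary()
    cooc_dict.gather(data_path='my_batches',       # folder with batches of the collection
                     cooc_file_path='cooc.txt',    # co-occurrence file described above
                     vocab_file_path='vocab.txt',  # UCI vocab corresponding to cooc.txt
                     symmetric_cooc_values=True)   # treat the pairwise values as symmetric

    # The resulting dictionary can then be passed to a coherence-enabled score, e.g.
    # artm.TopTokensScore(name='coherence', num_tokens=10, dictionary=cooc_dict).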

2 changes: 1 addition & 1 deletion docs/tutorials/scores_descr.txt
@@ -56,7 +56,7 @@ Return ``k`` (= requested number of top tokens) most probable tokens in each requested topic.

The coherence formula for a topic is defined as

:math:`\mathcal{C}_t = \cfrac{2}{k(k - 1)} \sum_{i = 1}^{k - 1} \sum_{j = i + 1}^{k} \mathrm{value}(w_i, w_j)`,

where value is some pairwise information about tokens in the collection dictionary, which is provided by the user according to their goals.

4 changes: 2 additions & 2 deletions python/tests/artm/test_regularizer_biterms.py
@@ -1,4 +1,4 @@
# Copyright 2018, Additive Regularization of Topic Models.

import shutil
import glob
@@ -102,7 +102,7 @@ def test_func():
model.regularizers.add(artm.BitermsPhiRegularizer(name='Biterms', tau=biterms_tau, dictionary=dictionary))

assert abs(model.phi_.as_matrix()[0][0] - phi_first_elem) < phi_eps

model.fit_offline(batch_vectorizer=batch_vectorizer)
for i in range(len(phi_values)):
for j in range(len(phi_values[0])):
2 changes: 1 addition & 1 deletion src/artm/core/batch_manager.cc
@@ -1,4 +1,4 @@
// Copyright 2018, Additive Regularization of Topic Models.

#include "artm/core/batch_manager.h"

2 changes: 1 addition & 1 deletion src/artm/core/batch_manager.h
@@ -1,4 +1,4 @@
// Copyright 2018, Additive Regularization of Topic Models.

#pragma once

27 changes: 14 additions & 13 deletions src/artm/core/collection_parser.cc
@@ -1,19 +1,19 @@
// Copyright 2018, Additive Regularization of Topic Models.

#include "artm/core/collection_parser.h"

#include <algorithm>
#include <atomic>
#include <future>  // NOLINT
#include <iostream>  // NOLINT
#include <map>
#include <memory>
#include <sstream>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

#include "boost/algorithm/string.hpp"
#include "boost/algorithm/string/predicate.hpp"
@@ -26,11 +26,11 @@
#include "artm/utility/ifstream_or_cin.h"
#include "artm/utility/progress_printer.h"

#include "artm/core/cooccurrence_collector.h"
#include "artm/core/common.h"
#include "artm/core/exceptions.h"
#include "artm/core/helpers.h"
#include "artm/core/protobuf_helpers.h"
#include "artm/core/cooccurrence_collector.h"

using ::artm::utility::ifstream_or_cin;

@@ -465,9 +465,9 @@ CollectionParserInfo CollectionParser::ParseVowpalWabbit() {
// During parsing it gathers co-occurrence counters for pairs of tokens (if the corresponding flag == true)
// Steps 1-4 are repeated in a while loop until there is no content left in docword file.
// Multiple copies of the function can work in parallel.
auto func = [&docword, &global_line_no, &progress, &batch_name_generator, &read_access,
             &cooc_config_access, &token_map_access, &token_statistics_access, &parser_info,
             &token_map, &total_num_of_pairs, &cooc_collector, &gather_transaction_cooc, config]() {
int64_t local_num_of_pairs = 0; // statistics for future ppmi calculation
while (true) {
// The following variable remembers at which line the batch has started.
@@ -510,7 +510,7 @@ CollectionParserInfo CollectionParser::ParseVowpalWabbit() {
// and then this storage is destroyed
CooccurrenceStatisticsHolder cooc_stat_holder;
// For every token from vocab keep the information about the last document this token occurred in

// ToDo (MichaelSolotky): consider the case if there is no vocab
std::vector<int> num_of_last_document_token_occured(cooc_collector.vocab_.token_map_.size(), -1);

@@ -530,6 +530,7 @@

std::string item_title = strs[0];

// ToDo: calculate cross-modality co-occurrence (window width is doc length)
std::vector<ClassId> class_ids = { DefaultClass };
for (unsigned elem_index = 1; elem_index < strs.size(); ++elem_index) {
std::string elem = strs[elem_index];
2 changes: 1 addition & 1 deletion src/artm/core/collection_parser.h
@@ -1,4 +1,4 @@
// Copyright 2018, Additive Regularization of Topic Models.

#pragma once

2 changes: 1 addition & 1 deletion src/artm/core/common.h
@@ -1,4 +1,4 @@
// Copyright 2018, Additive Regularization of Topic Models.

// File 'common.h' contains constants, helpers and typedefs used across the entire library.
// The goal is to keep this file as short as possible.
24 changes: 12 additions & 12 deletions src/artm/core/cooccurrence_collector.cc
@@ -2,34 +2,34 @@

#include "artm/core/cooccurrence_collector.h"

#include <algorithm>
#include <cassert>
#include <fstream>
#include <future>  // NOLINT
#include <iomanip>
#include <iostream>
#include <map>
#include <memory>
#include <mutex>  // NOLINT
#include <queue>
#include <sstream>
#include <stdexcept>
#include <string>
#include <thread>  // NOLINT
#include <unordered_map>
#include <utility>
#include <vector>

#include "boost/algorithm/string.hpp"
#include "boost/filesystem.hpp"
#include "boost/utility.hpp"
#include "boost/lexical_cast.hpp"
#include "boost/utility.hpp"
#include "boost/uuid/uuid_io.hpp"
#include "boost/uuid/uuid_generators.hpp"

#include "artm/core/collection_parser.h"
#include "artm/core/common.h"
#include "artm/core/exceptions.h"
#include "artm/core/collection_parser.h"

namespace fs = boost::filesystem;

11 changes: 5 additions & 6 deletions src/artm/core/cooccurrence_collector.h
@@ -2,23 +2,22 @@

#pragma once

#include <fstream>
#include <iomanip>
#include <iostream>
#include <map>
#include <memory>
#include <mutex>  // NOLINT
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

#include "boost/algorithm/string.hpp"
#include "boost/filesystem.hpp"
#include "boost/utility.hpp"

#include "artm/core/collection_parser.h"

#include "artm/core/common.h"

namespace artm {
7 changes: 1 addition & 6 deletions src/bigartm/srcmain.cc
@@ -2,12 +2,9 @@

#include <stdlib.h>

#include <algorithm>
#include <chrono>
#include <cstring>
#include <fstream>
#include <future>
#include <iostream>
@@ -272,7 +269,6 @@ struct artm_options {
std::string dictionary_max_df;
int dictionary_size;
int cooc_window;
int cooc_min_df;
int cooc_min_tf;

@@ -1766,7 +1762,6 @@ int main(int argc, char * argv[]) {
("cooc-min-tf", po::value(&options.cooc_min_tf)->default_value(0), "minimal value of cooccurrences of a pair of tokens that are saved in dictionary of cooccurrences")
("cooc-min-df", po::value(&options.cooc_min_df)->default_value(0), "minimal value of documents in which a specific pair of tokens occurred together closely")
("cooc-window", po::value(&options.cooc_window)->default_value(5), "number of tokens around specific token, which are used in calculation of cooccurrences")
("doc-per-cooc-batch", po::value(&options.doc_per_cooc_batch)->default_value(10000), "number of documents which will be processed and written in 1 cooc batch")
("dictionary-min-df", po::value(&options.dictionary_min_df)->default_value(""), "filter out tokens present in less than N documents / less than P% of documents")
("dictionary-max-df", po::value(&options.dictionary_max_df)->default_value(""), "filter out tokens present in less than N documents / less than P% of documents")
("dictionary-size", po::value(&options.dictionary_size)->default_value(0), "limit dictionary size by filtering out tokens with high document frequency")
