Commit da19394
Fixes in documentation (#918)
* refreshed documentation, add link to bigartm-book
MichaelSolotky committed Jul 24, 2018
1 parent 76f137d commit da19394
Showing 12 changed files with 88 additions and 60 deletions.
21 changes: 15 additions & 6 deletions docs/tutorials/bigartm_cli.txt
@@ -14,18 +14,27 @@ Then you can use ``bigartm`` as described by ``bigartm --help``.
You may also get more information about built-in regularizers by typing ``bigartm --help --regularizer``.


* **Gathering Co-occurrence Statistics Files**

In order to gather co-occurrence statistics files you need two files: a collection in Vowpal Wabbit format and a file of tokens (the so-called `"vocab"`) in UCI format. The vocab is used to filter the tokens of the collection, which means that co-occurrence is not calculated for a pair unless both of its tokens are present in the vocab. Two types of co-occurrence counts are currently available: TF and DF (see the description below). You may also want to calculate the positive PMI of the gathered co-occurrence values; this is useful if you plan to use the co-occurrence dictionary in coherence computation. The utility writes the pointwise information as a pseudo-collection in Vowpal Wabbit format.

.. note::

    If you want to compute co-occurrences of tokens of a **non-default modality**, you should specify those modalities in the vocab file. For more information about the UCI file format, please visit :doc:`./datasets`.
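
For illustration, a vocab file that assigns tokens to modalities might look like this (the tokens and the ``@labels`` modality are invented for this example; tokens without an explicit modality belong to the default one): ::

    discovery
    science @default_class
    good @labels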

Here is a combination of command-line keys that allows you to build co-occurrence dictionaries:

.. code-block:: bash

    bigartm -c vw -v vocab --cooc-window 10 --cooc-min-tf 200 --write-cooc-tf cooc_tf_ --cooc-min-df 200 --write-cooc-df cooc_df_ --write-ppmi-tf ppmi_tf_ --write-ppmi-df ppmi_df_


You can also look at runnable examples and common mistakes in the `bigartm-book <http://nbviewer.jupyter.org/github/bigartm/bigartm-book/blob/master/junk/cooc_dictionary/example_of_gathering.ipynb>`_ (in Russian).

The numbers and file names above are just an example and can be changed.
For a description of each key, see the table of all available keys below.

* **BigARTM CLI keys**

.. code-block:: bash

44 changes: 34 additions & 10 deletions docs/tutorials/python_userguide/coherence.txt
@@ -5,26 +5,50 @@

One of the main requirements for topic models is interpretability (i.e., do the topics contain tokens that, according to subjective human judgment, represent a single coherent concept?). `Newman et al. <http://www.aclweb.org/anthology/N10-1012>`_ showed that human evaluation of interpretability is well correlated with the following automated quality measure, called coherence. The coherence of a topic is defined as

:math:`\mathcal{C}_t = \cfrac{2}{k(k - 1)} \sum\limits_{i = 1}^{k - 1} \sum\limits_{j = i + 1}^{k} \mathrm{value}(w_i, w_j)`,

where value is some symmetric pairwise information about tokens in the collection, which is provided by the user according to their goals, for instance:

* positive PMI: :math:`value(u, v)=\left[\log\cfrac{p(u, v)}{p(u)p(v)}\right]_{+}`,

where `p(u, v)` is the joint probability of tokens `u` and `v` in the corpus according to some probabilistic model. We require the joint probabilities to be symmetric.

Several models of token co-occurrence are implemented in BigARTM; you can calculate them automatically or use your own model to provide the pairwise information for coherence computation.
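
As a toy illustration (plain Python, not a BigARTM API), the coherence of a topic's top tokens can be computed from any such pairwise ``value`` function as follows: ::

    def coherence(top_tokens, value):
        # C_t = 2 / (k * (k - 1)) * sum of value(w_i, w_j) over pairs with i < j
        k = len(top_tokens)
        pair_sum = sum(value(top_tokens[i], top_tokens[j])
                       for i in range(k - 1)
                       for j in range(i + 1, k))
        return 2.0 * pair_sum / (k * (k - 1))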


* **Tokens Co-occurrence Dictionary**

BigARTM supports automatic gathering of co-occurrence statistics and coherence computation. The co-occurrence gathering tool is described in :doc:`../bigartm_cli`. The probabilities in the PPMI formula can be estimated from frequencies in the corpus:

:math:`p(u, v) = \cfrac{n(u, v)}{n}`;

:math:`p(u) = \cfrac{n(u)}{n}`;

:math:`n(u) = \sum\limits_{w \in W}{n(u, w)}`;

:math:`n = \sum\limits_{w \in W}{n(w)}`.

Everything depends on how the joint frequencies (i.e., co-occurrences) are calculated. The following types of co-occurrence are currently available:

* Cooc TF: :math:`n(u, v) = \sum\limits_{d = 1}^{|D|} \sum\limits_{i = 1}^{N_d} \sum\limits_{j = 1}^{N_d} [0 < |i - j| \leq k] [w_{di} = u] [w_{dj} = v]`,

* Cooc DF: :math:`n(u, v) = \sum\limits_{d = 1}^{|D|} [\, \exists \, i, j : w_{di} = u, w_{dj} = v, 0 < |i - j| \leq k]`,

where k is the window-width parameter, which can be specified by the user, D is the collection, and :math:`N_{d}` is the length of document d. In brief, cooc TF measures how many times a given pair occurs in the collection within a window, and cooc DF measures in how many documents the given pair occurs at least once within a window of the given width.
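
The following standalone Python sketch (an illustration, not the BigARTM implementation) mirrors these definitions on a toy tokenized collection; note that the TF formula counts ordered pairs, so every unordered pair is counted twice: ::

    import math
    from collections import defaultdict

    def gather_cooc(docs, k):
        """docs: list of documents, each a list of tokens; k: window width."""
        cooc_tf = defaultdict(int)  # n(u, v) summed over all windows in the collection
        cooc_df = defaultdict(int)  # number of documents where the pair co-occurs in a window
        for doc in docs:
            pairs_in_doc = set()
            for i in range(len(doc)):
                for j in range(i + 1, min(i + k + 1, len(doc))):
                    pair = tuple(sorted((doc[i], doc[j])))
                    cooc_tf[pair] += 2  # the formula counts both (i, j) and (j, i)
                    pairs_in_doc.add(pair)
            for pair in pairs_in_doc:
                cooc_df[pair] += 1
        return cooc_tf, cooc_df

    def ppmi(cooc, marginal, n):
        """PPMI(u, v) = max(0, log(p(u, v) / (p(u) p(v)))) with the estimates above."""
        return {(u, v): max(0.0, math.log(c * n / (marginal[u] * marginal[v])))
                for (u, v), c in cooc.items() if c > 0}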

This document should be enough to gather co-occurrence statistics, but you can also look at runnable examples and common mistakes in the `bigartm-book <http://nbviewer.jupyter.org/github/bigartm/bigartm-book/blob/master/junk/cooc_dictionary/example_of_gathering.ipynb>`_ (in Russian).

* **Coherence Computation**

Let's assume you have a file `cooc.txt` with some pairwise information in Vowpal Wabbit format, which means that its lines look like this: ::

    token_u token_v:cooc_uv token_w:cooc_uw

You should also have a vocabulary file `vocab.txt` in UCI format corresponding to the `cooc.txt` file.
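
For illustration, `vocab.txt` might contain: ::

    cat
    dog
    mouse

and the corresponding `cooc.txt` (invented values, default modality only) might look like: ::

    cat dog:15 mouse:3
    dog mouse:7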

.. note::

    If tokens have **non-default modalities** in the collection, you should specify their modalities in `vocab.txt` (in `cooc.txt` they are added automatically).

To upload the co-occurrence data into BigARTM, use an ``artm.Dictionary`` object and its ``gather`` method.
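
A minimal sketch of such an upload (folder and file names are placeholders; ``data_path`` should point to a folder with previously created batches of the collection): ::

    import artm

    cooc_dict = artm.Dictionary()
    cooc_dict.gather(data_path='my_batches',       # folder with batches of the collection
                     cooc_file_path='cooc.txt',    # co-occurrence file described above
                     vocab_file_path='vocab.txt',  # UCI vocab corresponding to cooc.txt
                     symmetric_cooc_values=True)   # treat the pairwise values as symmetric

    # The resulting dictionary can then be passed to a coherence-enabled score, e.g.
    # artm.TopTokensScore(name='coherence', num_tokens=10, dictionary=cooc_dict).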

2 changes: 1 addition & 1 deletion docs/tutorials/scores_descr.txt
@@ -56,7 +56,7 @@ Return ``k`` (= requested number of top tokens) most probable tokens in each requested topic.

The coherence formula for a topic is defined as

:math:`\mathcal{C}_t = \cfrac{2}{k(k - 1)} \sum_{i = 1}^{k - 1} \sum_{j = i + 1}^{k} \mathrm{value}(w_i, w_j)`,

where value is some pairwise information about tokens in the collection dictionary, which is provided by the user according to their goals.

4 changes: 2 additions & 2 deletions python/tests/artm/test_regularizer_biterms.py
@@ -1,4 +1,4 @@
# Copyright 2018, Additive Regularization of Topic Models.

import shutil
import glob
@@ -102,7 +102,7 @@ def test_func():
model.regularizers.add(artm.BitermsPhiRegularizer(name='Biterms', tau=biterms_tau, dictionary=dictionary))

assert abs(model.phi_.as_matrix()[0][0] - phi_first_elem) < phi_eps

model.fit_offline(batch_vectorizer=batch_vectorizer)
for i in range(len(phi_values)):
for j in range(len(phi_values[0])):
2 changes: 1 addition & 1 deletion src/artm/core/batch_manager.cc
@@ -1,4 +1,4 @@
// Copyright 2018, Additive Regularization of Topic Models.

#include "artm/core/batch_manager.h"

2 changes: 1 addition & 1 deletion src/artm/core/batch_manager.h
@@ -1,4 +1,4 @@
// Copyright 2018, Additive Regularization of Topic Models.

#pragma once

27 changes: 14 additions & 13 deletions src/artm/core/collection_parser.cc
@@ -1,19 +1,19 @@
// Copyright 2018, Additive Regularization of Topic Models.

#include "artm/core/collection_parser.h"

#include <algorithm>
#include <atomic>
#include <future>  // NOLINT
#include <iostream>  // NOLINT
#include <map>
#include <memory>
#include <sstream>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

#include "boost/algorithm/string.hpp"
#include "boost/algorithm/string/predicate.hpp"
@@ -26,11 +26,11 @@
#include "artm/utility/ifstream_or_cin.h"
#include "artm/utility/progress_printer.h"

#include "artm/core/cooccurrence_collector.h"
#include "artm/core/common.h"
#include "artm/core/exceptions.h"
#include "artm/core/helpers.h"
#include "artm/core/protobuf_helpers.h"
#include "artm/core/cooccurrence_collector.h"

using ::artm::utility::ifstream_or_cin;

@@ -465,9 +465,9 @@ CollectionParserInfo CollectionParser::ParseVowpalWabbit() {
// During parsing it gathers co-occurrence counters for pairs of tokens (if the corresponding flag == true)
// Steps 1-4 are repeated in a while loop until there is no content left in docword file.
// Multiple copies of the function can work in parallel.
auto func = [&docword, &global_line_no, &progress, &batch_name_generator, &read_access,
             &cooc_config_access, &token_map_access, &token_statistics_access, &parser_info,
             &token_map, &total_num_of_pairs, &cooc_collector, &gather_transaction_cooc, config]() {
int64_t local_num_of_pairs = 0; // statistics for future ppmi calculation
while (true) {
// The following variable remembers at which line the batch has started.
@@ -510,7 +510,7 @@ CollectionParserInfo CollectionParser::ParseVowpalWabbit() {
// and then this storage is destroyed
CooccurrenceStatisticsHolder cooc_stat_holder;
// For every token from vocab keep the information about the last document this token occurred in

// ToDo (MichaelSolotky): consider the case if there is no vocab
std::vector<int> num_of_last_document_token_occured(cooc_collector.vocab_.token_map_.size(), -1);

@@ -530,6 +530,7 @@

std::string item_title = strs[0];

// ToDo: calculate cross-modality co-occurrence (window width is doc length)
std::vector<ClassId> class_ids = { DefaultClass };
for (unsigned elem_index = 1; elem_index < strs.size(); ++elem_index) {
std::string elem = strs[elem_index];
2 changes: 1 addition & 1 deletion src/artm/core/collection_parser.h
@@ -1,4 +1,4 @@
// Copyright 2018, Additive Regularization of Topic Models.

#pragma once

2 changes: 1 addition & 1 deletion src/artm/core/common.h
@@ -1,4 +1,4 @@
// Copyright 2018, Additive Regularization of Topic Models.

// File 'common.h' contains constants, helpers and typedefs used across the entire library.
// The goal is to keep this file as short as possible.
24 changes: 12 additions & 12 deletions src/artm/core/cooccurrence_collector.cc
@@ -2,34 +2,34 @@

#include "artm/core/cooccurrence_collector.h"

#include <algorithm>
#include <cassert>
#include <fstream>
#include <future>  // NOLINT
#include <iomanip>
#include <iostream>
#include <map>
#include <memory>
#include <mutex>  // NOLINT
#include <queue>
#include <sstream>
#include <stdexcept>
#include <string>
#include <thread>  // NOLINT
#include <unordered_map>
#include <utility>
#include <vector>

#include "boost/algorithm/string.hpp"
#include "boost/filesystem.hpp"
#include "boost/utility.hpp"
#include "boost/lexical_cast.hpp"
#include "boost/utility.hpp"
#include "boost/uuid/uuid_io.hpp"
#include "boost/uuid/uuid_generators.hpp"

#include "artm/core/collection_parser.h"
#include "artm/core/common.h"
#include "artm/core/exceptions.h"
#include "artm/core/collection_parser.h"

namespace fs = boost::filesystem;

11 changes: 5 additions & 6 deletions src/artm/core/cooccurrence_collector.h
@@ -2,23 +2,22 @@

#pragma once

#include <fstream>
#include <iomanip>
#include <iostream>
#include <map>
#include <memory>
#include <mutex>  // NOLINT
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

#include "boost/algorithm/string.hpp"
#include "boost/filesystem.hpp"
#include "boost/utility.hpp"

#include "artm/core/collection_parser.h"

#include "artm/core/common.h"

namespace artm {
7 changes: 1 addition & 6 deletions src/bigartm/srcmain.cc
@@ -2,12 +2,9 @@

#include <stdlib.h>

#include <algorithm>
#include <chrono>
#include <cstring>
#include <fstream>
#include <future>
#include <iostream>
@@ -272,7 +269,6 @@ struct artm_options {
std::string dictionary_max_df;
int dictionary_size;
int cooc_window;
int cooc_min_df;
int cooc_min_tf;

@@ -1766,7 +1762,6 @@ int main(int argc, char * argv[]) {
("cooc-min-tf", po::value(&options.cooc_min_tf)->default_value(0), "minimal value of cooccurrences of a pair of tokens that are saved in dictionary of cooccurrences")
("cooc-min-df", po::value(&options.cooc_min_df)->default_value(0), "minimal value of documents in which a specific pair of tokens occurred together closely")
("cooc-window", po::value(&options.cooc_window)->default_value(5), "number of tokens around specific token, which are used in calculation of cooccurrences")
("doc-per-cooc-batch", po::value(&options.doc_per_cooc_batch)->default_value(10000), "number of documents which will be processed and written in 1 cooc batch")
("dictionary-min-df", po::value(&options.dictionary_min_df)->default_value(""), "filter out tokens present in less than N documents / less than P% of documents")
("dictionary-max-df", po::value(&options.dictionary_max_df)->default_value(""), "filter out tokens present in less than N documents / less than P% of documents")
("dictionary-size", po::value(&options.dictionary_size)->default_value(0), "limit dictionary size by filtering out tokens with high document frequency")
