/
cpp_client.txt
104 lines (87 loc) · 6.33 KB
/
cpp_client.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
============================
BigARTM command line utility
============================
This document provides an overview of ``cpp_client``,
a simple command-line utility shipped with BigARTM.
To run *cpp_client* you need to download input data (a textual collection represented in bag-of-words format).
We recommend to download *vocab* and *docword* files by links provided in :ref:`TutorialParseCollection` section of the tutorial.
Then you can use *cpp_client* as follows:
.. code-block:: bash
cpp_client -d docword.kos.txt -v vocab.kos.txt
You may append the following options to customize the resulting topic model:
* ``-t`` or ``--num_topic`` sets the number of topics in the resulting topic model.
* ``-i`` or ``--num_iters`` sets the number of iterative scans over the collection.
* ``--num_inner_iters`` sets the number of updates of theta matrix performed on each iteration.
* ``--reuse_theta`` enables caching of Theta matrix and re-uses last Theta matrix from
the previous iteration as initial approximation on the next iteration. The default alternative
without ``--reuse_theta`` switch is to generate random approximation of Theta matrix on each iteration.
* ``--tau_phi``, ``--tau_theta`` and ``--tau_decor`` allows you to specify weights of different regularizers.
Currently *cpp_client* does not allow you to customize regularizer weights for different topics
and for different iterations. This limitation is only related to *cpp_client*,
and you can simply achieve this by using BigARTM interface (either in Python or in C++).
* ``--online_decay`` and ``--online_period`` are the parameters of online algorithm.
*online_period* specifies time interval in milliseconds between model synchronization.
Once per this time interval the topic model will be re-calculated as follows:
*n_wt* counters in the old model are decreased according to *online_decay* coefficient,
and increased by *n_wt* counters calculated by all processors since the last model synchronization.
The suggested way to choose *online_period* is to measure single iteration time for *cpp_client* in offline mode
(e.g. without *online_period* option). Then set *online_period* to ``lambda = 0.25`` of the iteration time.
For larger collections it may be more efficient to set *online_period* to ``lambda = 0.1`` of the iteration time,
but then remember to update *online_decay* parameter accordingly (``1-lambda`` should be a reasonable value).
You may also apply the following optimizations that should not change the resulting model
* ``--reuse_batches`` skips parsing of *docword* and *vocab* files, and tries to use batches located in ``--batch_folder``.
You may download pre-parsed batches by links provided in :ref:`TutorialParseCollection` section of the tutorial.
* ``-p`` allows you to specify number of concurrent processors.
The recommended value is to use the number of logical cores on your machine.
* ``--no_scores`` disables calculation and visualization of all scores.
This is a clean way of measuring pure performance of BigARTM,
because at the moment some scores takes unnecessary long time to calculate.
* ``--disk_cache_folder`` applies only together with ``--reuse_theta``.
This parameter allows you to specify a writable disk location where BigARTM can cache Theta matrix
between iterations to avoid storing it in main memory.
.. code-block:: bash
>cpp_client --help
BigARTM - library for advanced topic modeling (http://bigartm.org):
Basic options:
-h [ --help ] display this help message
-d [ --docword ] arg docword file in UCI format
-v [ --vocab ] arg vocab file in UCI format
-t [ --num_topic ] arg (=16) number of topics
-p [ --num_processors ] arg (=2) number of concurrent processors
-i [ --num_iters ] arg (=10) number of outer iterations
--num_inner_iters arg (=10) number of inner iterations
--reuse_theta reuse theta between iterations
--batch_folder arg (=batches) temporary folder to store batches
--dictionary_file arg (=filename of dictionary file)
--reuse_batches reuse batches found in batch_folder
(default = false)
--items_per_batch arg (=500) number of items per batch
--tau_phi arg (=0) regularization coefficient for PHI
matrix
--tau_theta arg (=0) regularization coefficient for THETA
matrix
--tau_decor arg (=0) regularization coefficient for topics
decorrelation (use with care, since
this value heavily depends on the size
of the dataset)
--paused wait for keystroke (allows to attach a
debugger)
--no_scores disable calculation of all scores
--online_period arg (=0) period in milliseconds between model
synchronization on the online algorithm
--online_decay arg (=0.75) decay coefficient [0..1] for online
algorithm
--parsing_format arg (=0) parsing format (0 - UCI, 1 - matrix
market)
--disk_cache_folder arg disk cache folder
Networking options (experimental):
--nodes arg endpoints of the remote nodes (enables network
modus operandi)
--localhost arg (=localhost) DNS name or the IP address of the localhost
--port arg (=5550) port to use for master node
--proxy arg proxy endpoint
--timeout arg (=1000) network communication timeout in milliseconds
Examples:
cpp_client -d docword.kos.txt -v vocab.kos.txt
set GLOG_logtostderr=1 & cpp_client -d docword.kos.txt -v vocab.kos.txt
For further details please refer to the `source code <https://raw.githubusercontent.com/bigartm/bigartm/master/src/cpp_client/srcmain.cc>`_ of cpp_client.