docs/ref/cpp_client.txt

============================
BigARTM command line utility
============================

This document provides an overview of ``cpp_client``,
a simple command-line utility shipped with BigARTM.

To run *cpp_client* you need to download input data (a textual collection represented in bag-of-words format).
We recommend to download *vocab* and *docword* files by links provided in :ref:`TutorialParseCollection` section of the tutorial.
Then you can use *cpp_client* as follows:

.. code-block:: bash

   cpp_client -d docword.kos.txt -v vocab.kos.txt

You may append the following options to customize the resulting topic model:

  * ``-t`` or ``--num_topic`` sets the number of topics in the resulting topic model.
  * ``-i`` or ``--num_iters`` sets the number of iterative scans over the collection.
  * ``--num_inner_iters`` sets the number of updates of theta matrix performed on each iteration.
  * ``--reuse_theta`` enables caching of Theta matrix and re-uses last Theta matrix from
    the previous iteration as initial approximation on the next iteration. The default alternative
    without ``--reuse_theta`` switch is to generate random approximation of Theta matrix on each iteration.
  * ``--tau_phi``, ``--tau_theta`` and ``--tau_decor`` allows you to specify weights of different regularizers.
    Currently *cpp_client* does not allow you to customize regularizer weights for different topics
    and for different iterations. This limitation is only related to *cpp_client*,
    and you can simply achieve this by using BigARTM interface (either in Python or in C++).
  * ``--online_decay`` and ``--online_period`` are the parameters of online algorithm.
    *online_period* specifies time interval in milliseconds between model synchronization.
    Once per this time interval the topic model will be re-calculated as follows:
    *n_wt* counters in the old model are decreased according to *online_decay* coefficient,
    and increased by *n_wt* counters calculated by all processors since the last model synchronization.
    The suggested way to choose *online_period* is to measure single iteration time for *cpp_client* in offline mode
    (e.g. without *online_period* option). Then set *online_period* to ``lambda = 0.25`` of the iteration time.
    For larger collections it may be more efficient to set *online_period* to ``lambda = 0.1`` of the iteration time,
    but then remember to update *online_decay* parameter accordingly (``1-lambda`` should be a reasonable value).

You may also apply the following optimizations that should not change the resulting model

  * ``--reuse_batches`` skips parsing of *docword* and *vocab* files, and tries to use batches located in ``--batch_folder``.
    You may download pre-parsed batches by links provided in :ref:`TutorialParseCollection` section of the tutorial.

  * ``-p`` allows you to specify number of concurrent processors.
    The recommended value is to use the number of logical cores on your machine.

  * ``--no_scores`` disables calculation and visualization of all scores.
    This is a clean way of measuring pure performance of BigARTM,
    because at the moment some scores takes unnecessary long time to calculate.

  * ``--disk_cache_folder`` applies only together with ``--reuse_theta``.
    This parameter allows you to specify a writable disk location where BigARTM can cache Theta matrix
    between iterations to avoid storing it in main memory.

.. code-block:: bash

   >cpp_client --help
   BigARTM - library for advanced topic modeling (http://bigartm.org):

   Basic options:
     -h [ --help ]                         display this help message
     -d [ --docword ] arg                  docword file in UCI format
     -v [ --vocab ] arg                    vocab file in UCI format
     -t [ --num_topic ] arg (=16)          number of topics
     -p [ --num_processors ] arg (=2)      number of concurrent processors
     -i [ --num_iters ] arg (=10)          number of outer iterations
     --num_inner_iters arg (=10)           number of inner iterations
     --reuse_theta                         reuse theta between iterations
     --batch_folder arg (=batches)         temporary folder to store batches
     --dictionary_file arg (=filename of dictionary file)
     --reuse_batches                       reuse batches found in batch_folder
                                           (default = false)
     --items_per_batch arg (=500)          number of items per batch
     --tau_phi arg (=0)                    regularization coefficient for PHI
                                           matrix
     --tau_theta arg (=0)                  regularization coefficient for THETA
                                           matrix
     --tau_decor arg (=0)                  regularization coefficient for topics
                                           decorrelation (use with care, since
                                           this value heavily depends on the size
                                           of the dataset)
     --paused                              wait for keystroke (allows to attach a
                                           debugger)
     --no_scores                           disable calculation of all scores
     --online_period arg (=0)              period in milliseconds between model
                                           synchronization on the online algorithm
     --online_decay arg (=0.75)            decay coefficient [0..1] for online
                                           algorithm
     --parsing_format arg (=0)             parsing format (0 - UCI, 1 - matrix
                                           market)
     --disk_cache_folder arg               disk cache folder
   
   Networking options (experimental):
     --nodes arg                  endpoints of the remote nodes (enables network
                                  modus operandi)
     --localhost arg (=localhost) DNS name or the IP address of the localhost
     --port arg (=5550)           port to use for master node
     --proxy arg                  proxy endpoint
     --timeout arg (=1000)        network communication timeout in milliseconds

   Examples:
           cpp_client -d docword.kos.txt -v vocab.kos.txt
           set GLOG_logtostderr=1 & cpp_client -d docword.kos.txt -v vocab.kos.txt

For further details please refer to the `source code <https://raw.githubusercontent.com/bigartm/bigartm/master/src/cpp_client/srcmain.cc>`_ of cpp_client.