docs/formats.txt

Formats
=======

This page describes input data formats compatible with BigARTM.
Currently all formats correspond to `Bag-of-words representation <https://en.wikipedia.org/wiki/Bag-of-words_model>`_,
meaning that all linguistic processing (lemmatization, tokenization, detection of n-grams, etc) needs to be done outside BigARTM.

1. `Vowpal Wabbit <https://github.com/JohnLangford/vowpal_wabbit/wiki/Input-format>`_ is a single-format file, based on the following principles:

   * each document is depresented in a single line
   * all tokens are represented as strings (no need to convert them into an integer identifier)
   * token frequency defaults to ``1.0``, and can be optionally specified after a colon (:)
   * namespaces (:attr:`Batch.class_id`) can be identified by a pipe (|)
  
   *Example 1*
 
   .. code-block:: bash

      doc1 Alpha Bravo:10 Charlie:5 |author Ola_Nordmann
      doc2 Bravo:5 Delta Echo:3 |author Ivan_Ivanov

   *Example 2*

   .. code-block:: bash
 
      user123 |track-like track2 track5 track7 |track-play track1:10 track2:25 track3:2 track7:8 |track-skip track2:3 track8:1 |artist-like artist4:2 artist5:6 |artist-play artist4:100 artist5:20
      user345 |track-like track2 track5 track7 |track-play track1:10 track2:25 track3:2 track7:8 |track-skip track2:3 track8:1 |artist-like artist4:2 artist5:6 |artist-play artist4:100 artist5:20
  
2. `UCI Bag-of-words <https://archive.ics.uci.edu/ml/datasets/Bag+of+Words>`_
   format consists of two files - ``vocab.*.txt`` and ``docword.*.txt``.
   The format of the ``docword.*.txt`` file is 3 header lines, followed by NNZ triples:

   .. code-block:: bash

      D
      W
      NNZ
      docID wordID count
      docID wordID count
      ...
      docID wordID count

   The file must be sorted on docID.
   Values of wordID must be unity-based (not zero-based).
   The format of the ``vocab.*.txt`` file is line containing wordID=n.
   Note that words must not have spaces or tabs.
   In ``vocab.*.txt`` file it is also possible to specify
   the namespace (:attr:`Batch.class_id`) for tokens, as it is shown in this example:

   .. code-block:: bash

      token1 @default_class
      token2 custom_class
      token3 @default_class
      token4

   Use space or tab to separate token from its class.
   Token that are not followed by class label automatically
   get ''@default_class'' as a label (see ''token4'' in the example).

   **Unicode support**. For non-ASCII characters save ``vocab.*.txt`` file in **UTF-8** format.
  
3. Batches (binary BigARTM-specific format).

   This is compact and efficient format, based on several protobuf messages in public BigARTM interface (:ref:`Batch <Batch>`, :ref:`Item <Item>` and :ref:`Field <Field>`).
   
   * A batch is a collection of several items
   * An item is a collection of several fields
   * A field is a collection of pairs ``(token_id, token_weight)``.
   
   The following example shows a Python code that generates a synthetic batch.

   .. code-block:: bash
   
      import artm.messages, random, uuid

      num_tokens = 60
      num_items = 100
      batch = artm.messages.Batch()
      batch.id = str(uuid.uuid4())
      for token_id in range(0, num_tokens):
          batch.token.append('token' + str(token_id))

      for item_id in range(0, num_items):
          item = batch.item.add()
          item.id = item_id
          field = item.field.add()
          for token_id in range(0, num_tokens):
              field.token_id.append(token_id)
              background_count = random.randint(1, 5) if (token_id >= 40) else 0
              topical_count = 10 if (token_id < 40) and ((token_id % 10) == (item_id % 10)) else 0
              field.token_weight.append(background_count + topical_count)
			  
   Note that the batch has its local dictionary, ``batch.token``.
   This dictionary which maps ``token_id`` into the actual token.
   In order to create a batch from textual files involve one needs to find all distinct words,
   and map them into sequential indices.

   ``batch.id`` must be set to a unique GUID in a format of ``00000000-0000-0000-0000-000000000000``.