Formats
=======
This page describes input data formats compatible with BigARTM.
Currently, all formats correspond to the `Bag-of-words representation <https://en.wikipedia.org/wiki/Bag-of-words_model>`_,
meaning that all linguistic processing (lemmatization, tokenization, detection of n-grams, etc.) needs to be done outside BigARTM.
1. `Vowpal Wabbit <https://github.com/JohnLangford/vowpal_wabbit/wiki/Input-format>`_ is a single-file format, based on the following principles:
* each document is represented in a single line
* all tokens are represented as strings (no need to convert them into an integer identifier)
* token frequency defaults to ``1.0``, and can be optionally specified after a colon (:)
* namespaces (:attr:`Batch.class_id`) can be identified by a pipe (|)
*Example 1*

.. code-block:: bash

    doc1 Alpha Bravo:10 Charlie:5 |author Ola_Nordmann
    doc2 Bravo:5 Delta Echo:3 |author Ivan_Ivanov

*Example 2*

.. code-block:: bash

    user123 |track-like track2 track5 track7 |track-play track1:10 track2:25 track3:2 track7:8 |track-skip track2:3 track8:1 |artist-like artist4:2 artist5:6 |artist-play artist4:100 artist5:20
    user345 |track-like track2 track5 track7 |track-play track1:10 track2:25 track3:2 track7:8 |track-skip track2:3 track8:1 |artist-like artist4:2 artist5:6 |artist-play artist4:100 artist5:20

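The principles above can be illustrated with a minimal Python sketch that splits one Vowpal Wabbit line into its namespaces, tokens, and frequencies. This is not the parser BigARTM uses. The function name ``parse_vw_line`` is hypothetical, the first field is assumed to be the document title (as in the examples), and tokens that appear before any pipe are assumed to fall into ``@default_class``.

```python
DEFAULT_CLASS = "@default_class"  # assumed label for tokens outside any namespace


def parse_vw_line(line):
    """Split one non-empty line into (doc_title, {class_id: {token: weight}})."""
    parts = line.split()
    doc_title, rest = parts[0], parts[1:]
    namespaces = {}
    current = DEFAULT_CLASS
    for part in rest:
        if part.startswith("|"):
            # A pipe switches the namespace for the tokens that follow.
            current = part[1:]
            continue
        # An optional ":weight" suffix overrides the default frequency of 1.0.
        token, sep, weight = part.partition(":")
        bucket = namespaces.setdefault(current, {})
        bucket[token] = bucket.get(token, 0.0) + (float(weight) if sep else 1.0)
    return doc_title, namespaces


title, ns = parse_vw_line("doc1 Alpha Bravo:10 Charlie:5 |author Ola_Nordmann")
print(title, ns)
```

Repeated tokens within one namespace are summed here; that is a choice of this sketch, not documented behaviour.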
2. `UCI Bag-of-words <https://archive.ics.uci.edu/ml/datasets/Bag+of+Words>`_
format consists of two files - ``vocab.*.txt`` and ``docword.*.txt``.
The ``docword.*.txt`` file has 3 header lines (the number of documents D, the vocabulary size W, and the number of nonzero counts NNZ), followed by NNZ triples:
.. code-block:: bash

    D
    W
    NNZ
    docID wordID count
    docID wordID count
    ...
    docID wordID count

The file must be sorted by docID.
Values of wordID must be one-based (not zero-based).
The ``vocab.*.txt`` file contains one word per line; the word on line n has wordID=n.
Note that words must not have spaces or tabs.
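The layout described above can be sketched as a small Python reader. It is only an illustration, not BigARTM's own loader: ``read_uci`` is a hypothetical helper that takes the lines of the two files and returns word counts per document, and the sample data is made up.

```python
def read_uci(docword_lines, vocab_lines):
    """Return {docID: {word: count}} from UCI docword and vocab lines."""
    # Line n of the vocabulary (1-based) holds the word with wordID=n.
    vocab = [line.split()[0] for line in vocab_lines if line.strip()]
    rows = [line.split() for line in docword_lines if line.strip()]
    num_docs, num_words, nnz = (int(row[0]) for row in rows[:3])
    triples = rows[3:]
    if len(triples) != nnz or len(vocab) != num_words:
        raise ValueError("header does not match file contents")
    docs = {}
    for doc_id, word_id, count in triples:
        word_index = int(word_id)
        if not 1 <= word_index <= num_words:
            raise ValueError("wordID must be between 1 and W")
        docs.setdefault(int(doc_id), {})[vocab[word_index - 1]] = int(count)
    if len(docs) > num_docs:
        raise ValueError("more documents than declared in the header")
    return docs


docword = ["2", "3", "4", "1 1 2", "1 3 1", "2 2 5", "2 3 1"]
vocab = ["alpha", "bravo", "charlie"]
print(read_uci(docword, vocab))
```

Only the first field of each vocabulary line is used as the word, so anything after it on the line is ignored by this sketch.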
In ``vocab.*.txt`` file it is also possible to specify
the namespace (:attr:`Batch.class_id`) for tokens, as it is shown in this example:
.. code-block:: bash

    token1 @default_class
    token2 custom_class
    token3 @default_class
    token4

Use a space or a tab to separate a token from its class.
Tokens that are not followed by a class label automatically
get ``@default_class`` as a label (see ``token4`` in the example).
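The rule for class labels can be sketched in a few lines of Python. ``parse_vocab_line`` is a hypothetical helper written for this illustration and is not part of BigARTM.

```python
DEFAULT_CLASS = "@default_class"


def parse_vocab_line(line):
    """Return (token, class_id) for one line of a vocab file."""
    fields = line.split()  # splits on spaces and tabs alike
    token = fields[0]
    # A missing class label falls back to the default class.
    class_id = fields[1] if len(fields) > 1 else DEFAULT_CLASS
    return token, class_id


for line in ["token1 @default_class", "token2\tcustom_class", "token4"]:
    print(parse_vocab_line(line))
```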
**Unicode support**. To use non-ASCII characters, save the ``vocab.*.txt`` file in **UTF-8** encoding.
3. Batches (binary BigARTM-specific format).
This is a compact and efficient format, based on several protobuf messages in the public BigARTM interface (:ref:`Batch <Batch>`, :ref:`Item <Item>` and :ref:`Field <Field>`).
* A batch is a collection of several items
* An item is a collection of several fields
* A field is a collection of pairs ``(token_id, token_weight)``.
The following Python code generates a synthetic batch.

.. code-block:: python

    import random
    import uuid

    import artm.messages

    num_tokens = 60
    num_items = 100

    batch = artm.messages.Batch()
    batch.id = str(uuid.uuid4())  # unique GUID identifying the batch

    # Local dictionary of the batch: position token_id holds the token string.
    for token_id in range(num_tokens):
        batch.token.append('token' + str(token_id))

    for item_id in range(num_items):
        item = batch.item.add()
        item.id = item_id
        field = item.field.add()
        for token_id in range(num_tokens):
            field.token_id.append(token_id)
            # Tokens 40..59 are background: a random count in every item.
            background_count = random.randint(1, 5) if token_id >= 40 else 0
            # Tokens 0..39 are topical: counted only when token_id and item_id agree modulo 10.
            topical_count = 10 if token_id < 40 and token_id % 10 == item_id % 10 else 0
            field.token_weight.append(background_count + topical_count)

Note that the batch has its own local dictionary, ``batch.token``.
This dictionary maps each ``token_id`` to the actual token.
To create a batch from text files, one needs to find all distinct words
and map them to sequential indices.
``batch.id`` must be set to a unique GUID in the format ``00000000-0000-0000-0000-000000000000``.
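The mapping from distinct words to sequential indices can be sketched as follows. To stay runnable without BigARTM installed, the sketch builds a plain dictionary that only mirrors the ``id``, ``token``, and ``item`` fields rather than a real ``artm.messages.Batch``. The helper ``build_batch_dict`` and its output layout are assumptions made for this illustration.

```python
import uuid


def build_batch_dict(documents):
    """Turn tokenized documents into a plain-dict stand-in for a batch."""
    token_to_id = {}  # token -> sequential index, in order of first appearance
    items = []
    for item_id, tokens in enumerate(documents):
        counts = {}
        for token in tokens:
            # Reuse the existing index or assign the next free one.
            token_id = token_to_id.setdefault(token, len(token_to_id))
            counts[token_id] = counts.get(token_id, 0) + 1
        items.append({
            "id": item_id,
            "token_id": list(counts.keys()),
            "token_weight": list(counts.values()),
        })
    return {
        "id": str(uuid.uuid4()),     # GUID in 8-4-4-4-12 hexadecimal form
        "token": list(token_to_id),  # position in this list is the token_id
        "item": items,
    }


batch = build_batch_dict([["alpha", "bravo", "alpha"], ["bravo", "charlie"]])
print(batch["token"])
print(batch["item"])
```

Indices are assigned in order of first appearance, so the list in ``token`` can be used directly to translate a ``token_id`` back into its word.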