Typical python example
======================
This page is obsolete. Please use the high-level API described in the
ARTM notebook
(`in Russian <http://nbviewer.ipython.org/github/bigartm/bigartm-book/blob/master/BigARTM_example_RU.ipynb>`_
or `in English <http://nbviewer.ipython.org/github/bigartm/bigartm-book/blob/master/BigARTM_example_EN.ipynb>`_).

Examples of low-level API
-------------------------
The folder ``C:\BigARTM\python\examples`` contains several toy examples:

* `example01_synthetic_collection.py <https://raw.githubusercontent.com/bigartm/bigartm/master/python/examples/example01_synthetic_collection.py>`_
* `example02_parse_collection.py <https://raw.githubusercontent.com/bigartm/bigartm/master/python/examples/example02_parse_collection.py>`_
* `example03_concurrency.py <https://raw.githubusercontent.com/bigartm/bigartm/master/python/examples/example03_concurrency.py>`_
* `example04_online_algorithm.py <https://raw.githubusercontent.com/bigartm/bigartm/master/python/examples/example04_online_algorithm.py>`_
* `example05_train_and_test_stream.py <https://raw.githubusercontent.com/bigartm/bigartm/master/python/examples/example05_train_and_test_stream.py>`_
* `example06_use_dictionaries.py <https://raw.githubusercontent.com/bigartm/bigartm/master/python/examples/example06_use_dictionaries.py>`_
* `example09_regularizers.py <https://raw.githubusercontent.com/bigartm/bigartm/master/python/examples/example09_regularizers.py>`_
* `example10_multimodal.py <https://raw.githubusercontent.com/bigartm/bigartm/master/python/examples/example10_multimodal.py>`_
* `example11_get_theta_matrix.py <https://raw.githubusercontent.com/bigartm/bigartm/master/python/examples/example11_get_theta_matrix.py>`_
* `example12_get_topic_model.py <https://raw.githubusercontent.com/bigartm/bigartm/master/python/examples/example12_get_topic_model.py>`_
* `example13_overwrite_topic_model.py <https://raw.githubusercontent.com/bigartm/bigartm/master/python/examples/example13_overwrite_topic_model.py>`_
* `example14_initialize_topic_model.py <https://raw.githubusercontent.com/bigartm/bigartm/master/python/examples/example14_initialize_topic_model.py>`_
* `example15_import_export_topic_model.py <https://raw.githubusercontent.com/bigartm/bigartm/master/python/examples/example15_import_export_topic_model.py>`_
* `example17_process_batches.py <https://raw.githubusercontent.com/bigartm/bigartm/master/python/examples/example17_process_batches.py>`_
* `example18_merge_model.py <https://raw.githubusercontent.com/bigartm/bigartm/master/python/examples/example18_merge_model.py>`_
* `example19_regularize_model.py <https://raw.githubusercontent.com/bigartm/bigartm/master/python/examples/example19_regularize_model.py>`_
* `example20_attach_model.py <https://raw.githubusercontent.com/bigartm/bigartm/master/python/examples/example20_attach_model.py>`_

None of the examples requires any parameters, so you can run them without arguments:

.. code-block:: bash

    C:\BigARTM\python\examples>python example02_parse_collection.py

    No batches found, parsing them from textual collection... OK.
    Iter#0 : Perplexity = 6885.223 , Phi sparsity = 0.050 , Theta sparsity = 0.012
    Iter#1 : Perplexity = 2409.510 , Phi sparsity = 0.113 , Theta sparsity = 0.063
    Iter#2 : Perplexity = 2075.445 , Phi sparsity = 0.203 , Theta sparsity = 0.174
    Iter#3 : Perplexity = 1855.196 , Phi sparsity = 0.293 , Theta sparsity = 0.261
    Iter#4 : Perplexity = 1728.749 , Phi sparsity = 0.370 , Theta sparsity = 0.302
    Iter#5 : Perplexity = 1661.044 , Phi sparsity = 0.429 , Theta sparsity = 0.317
    Iter#6 : Perplexity = 1621.851 , Phi sparsity = 0.475 , Theta sparsity = 0.327
    Iter#7 : Perplexity = 1596.965 , Phi sparsity = 0.511 , Theta sparsity = 0.331

    Top tokens per topic:
    Topic#1: poll(0.05) iraq(0.04) people(0.02) news(0.02) john(0.01) media(0.01)
    Topic#2: republican(0.02) party(0.02) state(0.02) general(0.01) democrats(0.01)
    Topic#3: dean(0.04) edwards(0.02) percent(0.02) primary(0.02) clark(0.02)
    Topic#4: forces(0.01) baghdad(0.01) iraqis(0.01) coburn(0.01) carson(0.01)
    Topic#5: military(0.01) officials(0.01) intelligence(0.01) american(0.01)
    Topic#6: electoral(0.04) labor(0.02) culture(0.02) exit(0.02) scoop(0.01)
    Topic#7: law(0.01) court(0.01) marriage(0.01) gay(0.01) amendment(0.01)
    Topic#8: president(0.03) administration(0.02) campaign(0.01) million(0.01)
    Topic#9: years(0.01) ballot(0.01) rights(0.01) nader(0.01) life(0.01)
    Topic#10: house(0.08) war(0.03) republicans(0.02) voting(0.02) vote(0.02)

    Snippet of theta matrix:
    Item#3000: 0.432 0.507 0.059 0.000 0.000 0.000 0.000 0.000 0.002 0.000
    Item#2991: 0.249 0.382 0.269 0.000 0.000 0.025 0.016 0.034 0.000 0.026
    Item#2992: 0.000 0.001 0.000 0.000 0.000 0.000 0.000 0.851 0.000 0.147
    Item#2993: 0.358 0.058 0.030 0.141 0.152 0.000 0.002 0.248 0.000 0.010
    Item#2994: 0.051 0.142 0.056 0.000 0.000 0.146 0.000 0.000 0.000 0.604
    Item#2995: 0.004 0.593 0.000 0.000 0.128 0.005 0.168 0.040 0.030 0.033
    Item#2996: 0.069 0.063 0.054 0.000 0.000 0.107 0.008 0.004 0.000 0.696
    Item#2997: 0.000 0.194 0.000 0.000 0.043 0.000 0.471 0.228 0.062 0.002
    Item#2998: 0.026 0.085 0.042 0.001 0.180 0.000 0.146 0.485 0.022 0.012
    Item#2999: 0.312 0.547 0.099 0.000 0.000 0.004 0.008 0.017 0.013 0.000

This simple example loads a text collection from disk and uses iterative scans over the collection to infer a topic model.
It then prints the top words in each topic and the topic distributions of the last processed documents.
For further information about this example refer to :doc:`typical_python_example`.
Parse collection step
---------------------
The following Python script parses the ``docword.kos.txt`` and ``vocab.kos.txt`` files
and converts them into a set of binary-serialized :ref:`batches <Batch>`, stored on disk.
In addition, the script creates a :ref:`dictionary <DictionaryConfig>` with all unique tokens
in the collection and stores it on disk.
The script also detects whether it has already been executed, in which case it just loads the
dictionary and saves it in the *unique_tokens* variable.
The same logic is implemented in the helper method
:py:meth:`ParseCollectionOrLoadDictionary <artm.library.Library.ParseCollectionOrLoadDictionary>`.

.. code-block:: python

    import glob
    import os
    import sys

    import artm.library
    import artm.messages_pb2

    # Optional first argument: folder that holds the docword/vocab files.
    data_folder = sys.argv[1] if (len(sys.argv) >= 2) else ''
    target_folder = 'kos'
    collection_name = 'kos'

    batches_found = len(glob.glob(target_folder + "/*.batch"))
    if batches_found == 0:
        print "No batches found, parsing them from textual collection...",
        parser_config = artm.messages_pb2.CollectionParserConfig()
        parser_config.format = artm.library.CollectionParserConfig_Format_BagOfWordsUci
        # os.path.join also works when data_folder lacks a trailing separator.
        parser_config.docword_file_path = os.path.join(data_folder, 'docword.' + collection_name + '.txt')
        parser_config.vocab_file_path = os.path.join(data_folder, 'vocab.' + collection_name + '.txt')
        parser_config.target_folder = target_folder
        parser_config.dictionary_file_name = 'dictionary'
        unique_tokens = artm.library.Library().ParseCollection(parser_config)
        print " OK."
    else:
        print "Found " + str(batches_found) + " batches, using them."
        unique_tokens = artm.library.Library().LoadDictionary(target_folder + '/dictionary')

You may also download larger collections from the following links.
You can get either the original collection (docword and vocab files)
or precompiled batches and a dictionary.
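
For orientation, the UCI bag-of-words layout that ``docword.kos.txt`` and ``vocab.kos.txt`` follow can be read with a few lines of plain Python. The sketch below is not part of BigARTM: the helper name ``read_uci`` and the tiny inline collection are made up for illustration. It assumes the usual UCI layout, that is, three header lines (number of documents, vocabulary size, number of non-zero entries) followed by ``docID wordID count`` triples with 1-based IDs, and one token per line in the vocab file.

```python
from collections import defaultdict


def read_uci(docword_lines, vocab_lines):
    """Return {doc_id: {token: count}} from UCI bag-of-words lines (hypothetical helper)."""
    vocab = [line.strip() for line in vocab_lines if line.strip()]
    rows = [line.split() for line in docword_lines if line.strip()]
    # Header: number of documents, vocabulary size, number of non-zero entries.
    n_docs, n_words, n_nonzero = (int(row[0]) for row in rows[:3])
    docs = defaultdict(dict)
    for doc_id, word_id, count in rows[3:]:
        # Word IDs are 1-based indices into the vocab file.
        docs[int(doc_id)][vocab[int(word_id) - 1]] = int(count)
    # Sanity checks against the header values.
    assert len(vocab) == n_words and len(docs) <= n_docs
    assert sum(len(d) for d in docs.values()) == n_nonzero
    return dict(docs)


# Made-up toy collection: 2 documents, 3 vocabulary words, 4 non-zero entries.
docword = ["2", "3", "4", "1 1 2", "1 3 1", "2 2 5", "2 3 1"]
vocab = ["iraq", "poll", "vote"]
collection = read_uci(docword, vocab)
print(collection)
```

In the real workflow the parser shown earlier performs this step and additionally writes binary batches, so a reader like this is only useful for inspecting the raw input files.
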
MasterComponent
---------------
The master component is your main entry point to all BigARTM functionality.
The following script creates a master component and configures it with
several regularizers and score calculators.

.. code-block:: python

    with artm.library.MasterComponent(disk_path=target_folder) as master:
        perplexity_score = master.CreatePerplexityScore()
        sparsity_theta_score = master.CreateSparsityThetaScore()
        sparsity_phi_score = master.CreateSparsityPhiScore()
        top_tokens_score = master.CreateTopTokensScore()
        theta_snippet_score = master.CreateThetaSnippetScore()

        dirichlet_theta_reg = master.CreateDirichletThetaRegularizer()
        dirichlet_phi_reg = master.CreateDirichletPhiRegularizer()
        decorrelator_reg = master.CreateDecorrelatorPhiRegularizer()

The master component must be configured with a disk path, which should contain the set of batches
produced in the previous step of this tutorial.
Score calculators allow you to retrieve important quality measures for your topic model.
Perplexity, sparsity of the theta and phi matrices, and lists of tokens with the highest probability
within each topic are all examples of such scores.
By default BigARTM does not calculate any scores, so you have to create them in the master component.
The same is true for regularizers, which allow you to customize your topic model.
For further details about the master component refer to :ref:`MasterComponentConfig`.
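
To make these scores concrete, here is a minimal sketch of how perplexity and sparsity are commonly defined for a topic model with matrices Phi (tokens by topics) and Theta (topics by documents). It is an illustration only and does not reproduce BigARTM's internal score calculators; the helper names, the zero threshold ``eps`` and the toy matrices are all assumptions.

```python
import math


def perplexity(counts, phi, theta):
    """exp(-log-likelihood / token count), with p(w|d) = sum_t phi[w][t] * theta[t][d]."""
    log_likelihood, total = 0.0, 0
    for d, doc in enumerate(counts):
        for w, n_dw in enumerate(doc):
            if n_dw == 0:
                continue
            p_wd = sum(phi[w][t] * theta[t][d] for t in range(len(theta)))
            log_likelihood += n_dw * math.log(p_wd)
            total += n_dw
    return math.exp(-log_likelihood / total)


def sparsity(matrix, eps=1e-37):
    """Fraction of matrix elements whose absolute value is below eps."""
    cells = [x for row in matrix for x in row]
    return sum(1 for x in cells if abs(x) < eps) / float(len(cells))


# Toy model: 3 tokens, 2 topics, 2 documents (all numbers are made up).
phi = [[0.5, 0.0],
       [0.5, 0.5],
       [0.0, 0.5]]      # phi[w][t]; each topic column sums to 1
theta = [[1.0, 0.5],
         [0.0, 0.5]]    # theta[t][d]; each document column sums to 1
counts = [[2, 2, 0],    # counts[d][w]: document 0 uses tokens 0 and 1
          [1, 2, 1]]

print("Perplexity = %.3f" % perplexity(counts, phi, theta))
print("Phi sparsity = %.3f" % sparsity(phi))
print("Theta sparsity = %.3f" % sparsity(theta))
```

Lower perplexity means the model assigns higher probability to the observed tokens, while higher sparsity means more exact zeros in the matrix.
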
Configure Topic Model
---------------------
Topic model configuration defines the number of topics in the model,
the list of scores to be calculated, and the list of regularizers to apply to the model.
For further details about model configuration refer to :ref:`ModelConfig`.

.. code-block:: python

    # Runs inside the "with ... as master:" block shown above.
    model = master.CreateModel(topics_count=10, inner_iterations_count=10)
    model.EnableScore(perplexity_score)
    model.EnableScore(sparsity_phi_score)
    model.EnableScore(sparsity_theta_score)
    model.EnableScore(top_tokens_score)
    model.EnableScore(theta_snippet_score)
    model.EnableRegularizer(dirichlet_theta_reg, -0.1)
    model.EnableRegularizer(dirichlet_phi_reg, -0.2)
    model.EnableRegularizer(decorrelator_reg, 1000000)
    model.Initialize(unique_tokens)    # Set up the initial approximation for the Phi matrix.

Note that in the last step we configured the initial approximation of the Phi matrix.
This step is optional, because BigARTM is able to collect all tokens dynamically
during the first scan of the collection. However, a deterministic initial approximation
helps to reproduce the same results from run to run.
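
The negative coefficients passed to the two Dirichlet regularizers can be read as sparsing terms. As a rough sketch of that idea, assume an additive-regularization update in which a constant ``tau`` is added to every topic-token counter, negative results are clipped to zero, and each topic column is renormalized. This is a simplified illustration with made-up counters, not BigARTM's actual implementation, and ``regularized_phi`` is a hypothetical name.

```python
def regularized_phi(n_wt, tau):
    """Add tau to every counter, clip at zero, then normalize each topic column."""
    n_tokens, n_topics = len(n_wt), len(n_wt[0])
    shifted = [[max(n_wt[w][t] + tau, 0.0) for t in range(n_topics)]
               for w in range(n_tokens)]
    column_sums = [sum(shifted[w][t] for w in range(n_tokens))
                   for t in range(n_topics)]
    return [[shifted[w][t] / column_sums[t] if column_sums[t] > 0 else 0.0
             for t in range(n_topics)]
            for w in range(n_tokens)]


# Made-up counters n_wt[w][t] for 3 tokens and 2 topics.
n_wt = [[4.0, 0.1],
        [0.1, 3.0],
        [2.0, 1.0]]

smoothed = regularized_phi(n_wt, tau=0.5)    # positive tau keeps every entry non-zero
sparsed = regularized_phi(n_wt, tau=-0.2)    # negative tau zeroes out small counters
print(sparsed)
```

Under this reading, a larger negative coefficient removes more low-count entries, which is consistent with the growing Phi and Theta sparsity in the sample output at the top of this page.
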
Invoke Iterations
-----------------
The following script performs several scans over the set of batches.
Depending on the size of the collection, this step might be quite time-consuming.
It is a good idea to output some information after every step.

.. code-block:: python

    for iter in range(0, 8):
        master.InvokeIteration(1)    # Invoke one scan of the entire collection...
        master.WaitIdle()            # and wait until it completes.
        model.Synchronize()          # Synchronize topic model.
        print "Iter#" + str(iter),
        print ": Perplexity = %.3f" % perplexity_score.GetValue(model).value,
        print ", Phi sparsity = %.3f" % sparsity_phi_score.GetValue(model).value,
        print ", Theta sparsity = %.3f" % sparsity_theta_score.GetValue(model).value

If your collection is very large, you may want to use the online algorithm,
which updates the topic model several times during each iteration,
as demonstrated by the following script:

.. code-block:: python

    master.InvokeIteration(1)          # Invoke one scan of the entire collection...
    while True:
        done = master.WaitIdle(100)    # wait 100 ms
        # Decay weights in the current topic model by 0.9,
        # append all increments and invoke all regularizers.
        model.Synchronize(0.9)
        if done:
            break

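
The effect of the decay weight passed to ``Synchronize`` can be pictured as an exponentially weighted running sum of topic-token counters. The sketch below assumes that reading of the comments in the script above: old counters are multiplied by the decay weight, then the increments from newly processed batches are added. The functions ``synchronize`` and ``normalize_columns`` and all numbers are hypothetical stand-ins for BigARTM's internal state.

```python
def synchronize(n_wt, increments, decay_weight):
    """Decay the current counters, then append the accumulated increments."""
    return [[decay_weight * old + new for old, new in zip(old_row, new_row)]
            for old_row, new_row in zip(n_wt, increments)]


def normalize_columns(n_wt):
    """Turn counters into a Phi matrix whose topic columns sum to one."""
    sums = [sum(column) for column in zip(*n_wt)]
    return [[x / s if s > 0 else 0.0 for x, s in zip(row, sums)] for row in n_wt]


# Made-up state: 2 tokens, 2 topics, and increments from two chunks of batches.
n_wt = [[10.0, 0.0],
        [0.0, 10.0]]
chunks = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[0.0, 2.0], [2.0, 0.0]],
]
for increments in chunks:
    # Each pass corresponds to one Synchronize(0.9) call during the scan.
    n_wt = synchronize(n_wt, increments, decay_weight=0.9)
    phi = normalize_columns(n_wt)

print(n_wt)
print(phi)
```

A decay weight close to 1 keeps most of the history, while a smaller value lets recent batches dominate the model.
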
Retrieve and visualize scores
-----------------------------
Finally, you can retrieve and visualize all collected scores.

.. code-block:: python

    artm.library.Visualizers.PrintTopTokensScore(top_tokens_score.GetValue(model))
    artm.library.Visualizers.PrintThetaSnippetScore(theta_snippet_score.GetValue(model))
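
For readers without the library at hand, a listing similar to the "Top tokens per topic" output can be derived from any Phi matrix. The sketch below is independent of BigARTM: ``top_tokens`` is a hypothetical helper, and the vocabulary and probabilities are invented.

```python
def top_tokens(phi, vocab, num_tokens=2):
    """For each topic column of phi, return the num_tokens most probable tokens."""
    result = []
    for t in range(len(phi[0])):
        # Rank token indices by their probability in topic t, highest first.
        ranked = sorted(range(len(vocab)), key=lambda w: phi[w][t], reverse=True)
        result.append([(vocab[w], phi[w][t]) for w in ranked[:num_tokens]])
    return result


# Made-up vocabulary and phi[w][t] for 4 tokens and 2 topics.
vocab = ["iraq", "poll", "vote", "court"]
phi = [[0.6, 0.0],
       [0.3, 0.1],
       [0.1, 0.4],
       [0.0, 0.5]]

for t, tokens in enumerate(top_tokens(phi, vocab)):
    line = " ".join("%s(%.2f)" % (token, p) for token, p in tokens)
    print("Topic#%d: %s" % (t + 1, line))
```
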