/
windows_basic.txt
138 lines (105 loc) · 6.22 KB
/
windows_basic.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
Basic BigARTM tutorial for Windows users
========================================
This tutorial gives guidelines for installing and running existing BigARTM examples via command-line interface and from Python environment.
Download
--------
Download latest binary distribution of BigARTM from https://github.com/bigartm/bigartm/releases.
Explicit download links can be found at :doc:`/download` section (for 32 bit and 64 bit configurations).
The distribution will contain pre-build binaries, command-line interface and BigARTM API for Python.
The distribution also contains a simple dataset and few python examples that we will be running in this tutorial.
More datasets in BigARTM-compatible format are available in the :doc:`/download` section.
Refer to :doc:`/ref/windows_distribution` for details about other files, included in the binary distribution package.
Running BigARTM from command line
---------------------------------
No installation steps are required to run BigARTM from command line.
After unpacking binary distribution simply open command prompt (``cmd.exe``),
change current directory to ``bin`` folder inside BigARTM package, and run ``cpp_client.exe`` application as in the following example.
As an optional step, we recommend to add ``bin`` folder of the BigARTM distribution to your ``PATH`` system variable.
.. code-block:: bash
>C:\BigARTM\bin>set PATH=%PATH%;C:\BigARTM\bin
>C:\BigARTM\bin>cpp_client.exe -v ../python/examples/vocab.kos.txt -d ../python/examples/docword.kos.txt -t 4
Parsing text collection... OK.
Iteration 1 took 197 milliseconds.
Test perplexity = 7108.35,
Train perplexity = 7106.18,
Test spatsity theta = 0,
Train sparsity theta = 0,
Spatsity phi = 0.000144802,
Test items processed = 343,
Train items processed = 3087,
Kernel size = 5663,
Kernel purity = 0.958901,
Kernel contrast = 0.292389
Iteration 2 took 195 milliseconds.
Test perplexity = 2563.31,
Train perplexity = 2517.07,
Test spatsity theta = 0,
Train sparsity theta = 0,
Spatsity phi = 0.000144802,
Test items processed = 343,
Train items processed = 3087,
Kernel size = 5559.5,
Kernel purity = 0.956709,
Kernel contrast = 0.298198
...
#1: november(0.054) poll(0.015) bush(0.013) kerry(0.012) polls(0.012) governor(0.011)
#2: bush(0.0083) president(0.0059) republicans(0.0047) house(0.0042) people(0.0039) administration(0.0036)
#3: bush(0.031) iraq(0.018) war(0.012) kerry(0.0096) president(0.0078) administration(0.0076)
#4: kerry(0.018) democratic(0.013) dean(0.012) campaign(0.0097) poll(0.0095) race(0.0082)
ThetaMatrix (last 7 processed documents, ids = 1995,1996,1997,1998,1992,2000,1994):
Topic0: 0.02104 0.02155 0.00604 0.00835 0.00965 0.00006 0.91716
Topic1: 0.15441 0.76643 0.06484 0.11643 0.20409 0.00006 0.00957
Topic2: 0.00399 0.16135 0.00093 0.03890 0.10498 0.00001 0.00037
Topic3: 0.82055 0.05066 0.92819 0.83632 0.68128 0.99987 0.07289
We recommend to download larger datasets, available in :doc:`/download` section.
All ``docword`` and ``vocab`` files can be consumed by BigARTM exactly as in the previous example.
Internally BigARTM always parses such files into batches format
(for example, `enron_1k (7.1 MB) <https://s3-eu-west-1.amazonaws.com/artm/enron_1k.7z>`_).
If you have downloaded such pre-parsed collection, you may feed it into BigARTM as follows:
.. code-block:: bash
>C:\BigARTM\bin>cpp_client.exe --batch_folder C:\BigARTM\enron
Reuse 40 batches in folder 'enron'
Loading dictionary file... OK.
Iteration 1 took 2502 milliseconds.
For more information about ``cpp_client.exe`` refer to :doc:`/ref/cpp_client` section.
Configure BigARTM Python API
----------------------------
1. Install Python, for example from the following links:
* Python 2.7.9, 64 bit -- https://www.python.org/ftp/python/2.7.9/python-2.7.9.amd64.msi, or
* Python 2.7.9, 32 bit -- https://www.python.org/ftp/python/2.7.9/python-2.7.9.msi
Remember that the version of BigARTM package must match your version Python installed on your machine.
If you have 32 bit operating system then you must select 32 bit for Python and BigARTM package.
If you have 64 bit operating system then you are free to select either version.
However, please note that memory usage of 32 bit processes is limited by 2 GB.
For this reason we recommend to select 64 bit configurations.
Also you need to have several Python libraries to be installed on your machine:
* numpy >= 1.9.2
* scipy >= 0.15.0
* pandas >= 0.16.2
* scikit-learn >= 0.16.1
2. Add ``C:\BigARTM\bin`` folder to your ``PATH`` system variable, and
add ``C:\BigARTM\python`` to your ``PYTHONPATH`` system variable:
.. code-block:: bash
set PATH=%PATH%;C:\BigARTM\bin
set PATH=%PATH%;C:\Python27;C:\Python27\Scripts
set PYTHONPATH=%PYTHONPATH%;C:\BigARTM\Python
Remember to change ``C:\BigARTM`` and ``C:\Python27`` with your local folders.
3. Setup *Google Protocol Buffers* library, included in the BigARTM release package.
* Copy ``C:\BigARTM\bin\protoc.exe`` file into ``C:\BigARTM\protobuf\src`` folder
* Run the following commands from command prompt
.. code-block:: bash
cd C:\BigARTM\protobuf\Python
python setup.py build
python setup.py install
Avoid ``python setup.py test`` step, as it produces several confusing errors. Those errors are harmless.
For further details about protobuf installation refer to `protobuf/python/README
<https://raw.githubusercontent.com/bigartm/bigartm/master/3rdparty/protobuf/python/README.txt>`_.
If you are getting errors when configuring or using Python API,
please refer to Troubleshooting chapter in :doc:`/tutorials/linux_basic`.
The list of issues is common between Windows and Linux.
Running BigARTM from Python API
-------------------------------
Refer to ARTM notebook
(`in Russian <http://nbviewer.ipython.org/github/bigartm/bigartm-book/blob/master/BigARTM_example_RU.ipynb>`_
or `in English <http://nbviewer.ipython.org/github/bigartm/bigartm-book/blob/master/BigARTM_example_EN.ipynb>`_),
which describes high-level Python API of BigARTM.