Include experiment02_artm.py and its documentation [skip ci]
sashafrey committed Feb 10, 2015
1 parent cd8b9a3 commit f11a024
Showing 8 changed files with 470 additions and 36 deletions.
10 changes: 6 additions & 4 deletions docs/download.txt
@@ -1,11 +1,13 @@
Download
========

* Windows (latest, experimental)
* https://s3-eu-west-1.amazonaws.com/artmdev/BigARTM_v0.5.5_x64_testing.7z
* https://s3-eu-west-1.amazonaws.com/artmdev/BigARTM_v0.5.5_x32_testing.7z
* Windows - latest release
* https://github.com/bigartm/bigartm/releases/download/v0.5.6/BigARTM_v0.5.6_win32.7z
* https://github.com/bigartm/bigartm/releases/download/v0.5.6/BigARTM_v0.5.6_x64.7z

* Windows (previous releases)
* Windows - previous releases
* https://github.com/bigartm/bigartm/releases/download/v0.5.5/BigARTM_v0.5.5_win32.7z
* https://github.com/bigartm/bigartm/releases/download/v0.5.5/BigARTM_v0.5.5_x64.7z
* https://github.com/bigartm/bigartm/releases/download/v0.5.4/BigARTM_v0.5.4_win32.7z
* https://github.com/bigartm/bigartm/releases/download/v0.5.4/BigARTM_v0.5.4_x64.7z
* https://github.com/bigartm/bigartm/releases/download/v0.5.3/BigARTM_v0.5.3_win32.7z
1 change: 1 addition & 0 deletions docs/index.txt
@@ -15,6 +1,7 @@ Welcome to BigARTM's documentation!
download
tutorial
network
stories/index
faq
devguide
ref/index
62 changes: 61 additions & 1 deletion docs/network.txt
@@ -63,7 +63,7 @@ in one of your target machines.
Alternatively, if you have launched several nodes, you can utilize all of them
by configuring your remote MasterComponent to work in the Network modus operandi.

.. code-block:: bash
.. code-block:: python

library = ArtmLibrary('artm.dll')

@@ -76,3 +76,63 @@

with library.CreateMasterComponent(master_proxy_config) as master_proxy:
    # Use master_proxy in the same way you usually use master component

Combining network modus operandi with proxy
-------------------------------------------


This Python script assumes that you have started a local node_controller process as follows:

.. code-block:: bash

set GLOG_logtostderr=1 & node_controller.exe tcp://*:5000 tcp://*:5556 tcp://*:5557

This Python script will use the following ports:

* 5000 - port on which the MasterComponent communicates with the Proxy
  (this endpoint must be created by node_controller)
* 5550 - port on which the MasterComponent communicates with the Nodes
  (this endpoint is created automatically by the master component)
* 5556, 5557 - ports on which the NodeControllerComponents communicate with the
  MasterComponent (these endpoints must be created by node_controller)

.. code-block:: python

import artm.messages_pb2, artm.library, sys

# Network path of a shared folder with batches to process.
# The folder must be reachable from all remote node controllers.
target_folder = 'D:\\datasets\\nips'

# Dictionary file (must be located on developer's box that runs python script)
dictionary_file = 'D:\\datasets\\nips\\dictionary'

unique_tokens = artm.library.Library().LoadDictionary(dictionary_file)

# Create master component and infer topic model
proxy_config = artm.messages_pb2.MasterProxyConfig()
proxy_config.node_connect_endpoint = 'tcp://localhost:5000'
proxy_config.communication_timeout = 10000 # timeout (in ms) for communication between proxy and master component
proxy_config.polling_frequency = 50 # polling frequency (in ms) for long-lasting operations, for example WaitIdle()
proxy_config.config.modus_operandi = artm.library.MasterComponentConfig_ModusOperandi_Network
proxy_config.config.communication_timeout = 2000 # timeout (in ms) for communication between master component and nodes
proxy_config.config.disk_path = target_folder
proxy_config.config.create_endpoint = 'tcp://*:5550'
proxy_config.config.connect_endpoint = 'tcp://localhost:5550'
proxy_config.config.node_connect_endpoint.append('tcp://localhost:5556')
proxy_config.config.node_connect_endpoint.append('tcp://localhost:5557')
proxy_config.config.processors_count = 1 # number of processors to create at every node

with artm.library.MasterComponent(config=proxy_config) as master:
    dictionary = master.CreateDictionary(unique_tokens)
    perplexity_score = master.CreatePerplexityScore()
    model = master.CreateModel(topics_count=10, inner_iterations_count=10)
    model.EnableScore(perplexity_score)
    model.Initialize(dictionary)

    for iter in range(0, 8):
        master.InvokeIteration(1)  # Invoke one scan of the entire collection...
        master.WaitIdle()          # and wait until it completes.
        model.Synchronize()        # Synchronize topic model.
        print "Iter#" + str(iter),
        print ": Perplexity = %.3f" % perplexity_score.GetValue(model).value
Binary file added docs/stories/_images/experiment02_artm.png
70 changes: 70 additions & 0 deletions docs/stories/experiment02_artm.txt
@@ -0,0 +1,70 @@
Enabling Basic BigARTM Regularizers
===================================

This page describes an experiment with topic model regularization in the BigARTM library using
`experiment02_artm.py <https://raw.githubusercontent.com/bigartm/bigartm/master/src/python/experiments/experiment02_artm.py>`_.
The script learns a topic model with three regularizers
(Phi sparsing, Theta sparsing, and pairwise topic decorrelation in Phi).
It also monitors the learning process with several quality measures: hold-out perplexity,
Phi and Theta sparsity, and average topic kernel characteristics.

.. warning::

Note that perplexity estimation can influence the learning process of the online algorithm,
so we evaluate perplexity only once per 20 synchronizations to avoid this influence.
You can change this frequency via the ``test_every`` variable.
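
The evaluation schedule above can be sketched as follows. This is a minimal illustration under stated assumptions, not the script's actual code; only the name ``test_every`` is taken from ``experiment02_artm.py``.

```python
# Minimal sketch of the evaluation schedule: with test_every = 20,
# hold-out perplexity is estimated only on every 20th synchronization,
# so the estimation does not influence the online algorithm.
test_every = 20

def should_estimate_perplexity(sync_index, period=test_every):
    """Return True on every period-th synchronization (counting from 1)."""
    return (sync_index + 1) % period == 0

# Over 200 synchronizations, only 10 trigger a perplexity estimation.
estimations = [i for i in range(200) if should_estimate_perplexity(i)]
print(len(estimations))  # -> 10
```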

We assume you have BigARTM installed in ``$YOUR_HOME_DIRECTORY``.
To run the experiment, execute the following steps:

1. Download the collection, represented as BigARTM batches:

* https://s3-eu-west-1.amazonaws.com/artm/enwiki-20141208_1k.7z
* https://s3-eu-west-1.amazonaws.com/artm/enwiki-20141208_10k.7z

This data represents a complete dump of the English Wikipedia (approximately 3.7 million documents).
In the first archive each batch contains 1000 documents; in the second, 10000. We used 10000.
Put the decompressed folder with batches into ``$YOUR_HOME_DIRECTORY``.
You also need to move the dictionary file from the batches folder there.

The batch you would like to use for hold-out perplexity estimation must also be placed in ``$YOUR_HOME_DIRECTORY``.
In our experiment we used the batch named ``243af5b8-beab-4332-bb42-61892df5b044.batch``.
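
As a sanity check, the layout described in step 1 can be verified with a few lines of Python. ``home_folder`` is a placeholder for ``$YOUR_HOME_DIRECTORY``, and the file names are the examples used on this page, not values the script requires.

```python
import os

# Example layout check; all names below are illustrative placeholders.
home_folder = 'D:\\experiment'
expected = [
    'enwiki-20141208_10k',                         # decompressed batches folder
    'dictionary',                                  # dictionary file moved out of it
    '243af5b8-beab-4332-bb42-61892df5b044.batch',  # hold-out test batch
]
missing = [name for name in expected
           if not os.path.exists(os.path.join(home_folder, name))]
if missing:
    print('Missing from %s: %s' % (home_folder, ', '.join(missing)))
```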

2. The next step is to prepare the script. Open its code and find the declarations of the following variables:

* ``home_folder`` (line 8) and assign it the path ``$YOUR_HOME_DIRECTORY``;
* ``batch_size`` (line 28) and assign it the chosen batch size;
* ``batches_disk_path`` (line 36) and replace the string 'wiki_10k' with the name of your directory with batches;
* ``test_batch_name`` (line 43) and replace the hard-coded batch name with the name of your test batch;
* ``tau_decor``, ``tau_phi`` and ``tau_theta`` (lines 54-56) and substitute the values you would like to use.
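
The edits in step 2 amount to assignments of the following shape. All values here are illustrative examples, not recommended settings; only the variable names come from the script.

```python
# Illustrative example of the step-2 edits (all values are placeholders).
home_folder = 'D:\\experiment\\'                                 # line 8
batch_size = 10000                                               # line 28
batches_disk_path = home_folder + 'enwiki-20141208_10k'          # line 36
test_batch_name = '243af5b8-beab-4332-bb42-61892df5b044.batch'   # line 43
tau_decor, tau_phi, tau_theta = 0.05, -0.1, -0.15                # lines 54-56
```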

3. If you want to estimate the final perplexity on another, larger test sample, put the chosen batches into the test folder (inside ``$YOUR_HOME_DIRECTORY``).
Then find the declaration of the variable ``save_and_test_model`` (line 30) in the script and assign it ``True``.

4. Finally, launch the script. The current measure values will be printed to the console.
Note that after synchronizations without perplexity estimation, the perplexity value is printed as the string 'NO'.
The results of synchronizations with perplexity estimation are additionally written to the corresponding files in the results folder.
The file format is the same for all measures: a set of lines of the form ``(accumulated number of processed documents, measure value)``:

.. code-block:: text

(10000, 0.018)
(220000, 0.41)
(430000, 0.456)
(640000, 0.475)
...

These files can be used for plot building.
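
For example, a results file in this format could be parsed for plotting as follows. This is a sketch; substitute the actual measure file produced by the script.

```python
import ast

def parse_measure_file(lines):
    """Parse lines like '(10000, 0.018)' into (documents, value) tuples."""
    return [ast.literal_eval(line.strip()) for line in lines if line.strip()]

# The same format as the sample output above.
sample = ['(10000, 0.018)', '(220000, 0.41)', '(430000, 0.456)', '(640000, 0.475)']
points = parse_measure_file(sample)
docs = [d for d, _ in points]    # x axis: accumulated processed documents
values = [v for _, v in points]  # y axis: measure value
```

These two lists can then be passed directly to any plotting library.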

If desired, you can easily change the value of any variable in the script, since each one is clearly commented.
If you use the same parameters and data as in our experiment, you should get results close to these:

.. image:: _images/experiment02_artm.png
:alt: experiment02_artm

Here you can see a comparison between the ARTM and LDA models.
To run the experiment with LDA instead of ARTM, you only need to change the values of the variables ``tau_decor``,
``tau_phi`` and ``tau_theta`` to 0, 1 / topics_count and 1 / topics_count respectively, and run the script again.
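
The LDA configuration described above boils down to the following assignments; ``topics_count`` is an example value here, and the script defines its own.

```python
topics_count = 10  # example value; use the script's own setting

# LDA-equivalent regularizer coefficients: no decorrelation,
# symmetric smoothing of Phi and Theta by 1 / topics_count.
tau_decor = 0
tau_phi = 1.0 / topics_count
tau_theta = 1.0 / topics_count
```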

.. warning::
Note that we used a machine with 8 cores and 15 GB RAM for our experiment.
14 changes: 14 additions & 0 deletions docs/stories/index.txt
@@ -0,0 +1,14 @@
.. BigARTM documentation master file, created by
sphinx-quickstart on Sun Jul 13 20:00:11 2014.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.

.. _reference:

BigARTM Stories
===============

.. toctree::
:maxdepth: 2

experiment02_artm
67 changes: 36 additions & 31 deletions docs/tutorial.txt
@@ -222,42 +222,47 @@ You may also download larger collections from the following links.
You can get the original collection (docword and vocab files)
or already precompiled batches and a dictionary.

========= ========= ======= ======= ========================================================================================================
Task Source #Words #Items Files
========= ========= ======= ======= ========================================================================================================
kos `UCI`_ 6906 3430 * `docword.kos.txt.gz (1 MB) <https://s3-eu-west-1.amazonaws.com/artm/docword.kos.txt.gz>`_
* `vocab.kos.txt (54 KB) <https://s3-eu-west-1.amazonaws.com/artm/vocab.kos.txt>`_
* `kos_1k (700 KB) <https://s3-eu-west-1.amazonaws.com/artm/kos_1k.7z>`_
* `kos_dictionary <https://s3-eu-west-1.amazonaws.com/artm/kos_dictionary>`_

nips `UCI`_ 12419 1500 * `docword.nips.txt.gz (2.1 MB) <https://s3-eu-west-1.amazonaws.com/artm/docword.nips.txt.gz>`_
* `vocab.nips.txt (98 KB) <https://s3-eu-west-1.amazonaws.com/artm/vocab.nips.txt>`_
* `nips_200 (1.5 MB) <https://s3-eu-west-1.amazonaws.com/artm/nips_200.7z>`_
* `nips_dictionary <https://s3-eu-west-1.amazonaws.com/artm/nips_dictionary>`_

enron `UCI`_ 28102 39861 * `docword.enron.txt.gz (11.7 MB) <https://s3-eu-west-1.amazonaws.com/artm/docword.enron.txt.gz>`_
* `vocab.enron.txt (230 KB) <https://s3-eu-west-1.amazonaws.com/artm/vocab.enron.txt>`_
* `enron_1k (7.1 MB) <https://s3-eu-west-1.amazonaws.com/artm/enron_1k.7z>`_
* `enron_dictionary <https://s3-eu-west-1.amazonaws.com/artm/enron_dictionary>`_

nytimes `UCI`_ 102660 300000 * `docword.nytimes.txt.gz (223 MB) <https://s3-eu-west-1.amazonaws.com/artm/docword.nytimes.txt.gz>`_
* `vocab.nytimes.txt (1.2 MB) <https://s3-eu-west-1.amazonaws.com/artm/vocab.nytimes.txt>`_
* `nytimes_1k (131 MB) <https://s3-eu-west-1.amazonaws.com/artm/nytimes_1k.7z>`_
* `nytimes_dictionary <https://s3-eu-west-1.amazonaws.com/artm/nytimes_dictionary>`_

pubmed `UCI`_ 141043 8200000 * `docword.pubmed.txt.gz (1.7 GB) <https://s3-eu-west-1.amazonaws.com/artm/docword.pubmed.txt.gz>`_
* `vocab.pubmed.txt (1.3 MB) <https://s3-eu-west-1.amazonaws.com/artm/vocab.pubmed.txt>`_
* `pubmed_10k (1 GB) <https://s3-eu-west-1.amazonaws.com/artm/pubmed_10k.7z>`_
* `pubmed_dictionary <https://s3-eu-west-1.amazonaws.com/artm/pubmed_dictionary>`_

wiki `Gensim`_ 100000 3665223 * `wiki_10k (1.1 GB) <https://s3-eu-west-1.amazonaws.com/artm/wiki_10k.7z>`_
* `wiki_dictionary <https://s3-eu-west-1.amazonaws.com/artm/wiki_dictionary>`_
========= ========= ======= ======= ========================================================================================================
========= ========= ======= ======= ================== ==================================================================================================================
Task Source #Words #Items class_id(s) Files
========= ========= ======= ======= ================== ==================================================================================================================
kos `UCI`_ 6906 3430 * @default_class * `docword.kos.txt.gz (1 MB) <https://s3-eu-west-1.amazonaws.com/artm/docword.kos.txt.gz>`_
* `vocab.kos.txt (54 KB) <https://s3-eu-west-1.amazonaws.com/artm/vocab.kos.txt>`_
* `kos_1k (700 KB) <https://s3-eu-west-1.amazonaws.com/artm/kos_1k.7z>`_
* `kos_dictionary <https://s3-eu-west-1.amazonaws.com/artm/kos_dictionary>`_

nips `UCI`_ 12419 1500 * @default_class * `docword.nips.txt.gz (2.1 MB) <https://s3-eu-west-1.amazonaws.com/artm/docword.nips.txt.gz>`_
* `vocab.nips.txt (98 KB) <https://s3-eu-west-1.amazonaws.com/artm/vocab.nips.txt>`_
* `nips_200 (1.5 MB) <https://s3-eu-west-1.amazonaws.com/artm/nips_200.7z>`_
* `nips_dictionary <https://s3-eu-west-1.amazonaws.com/artm/nips_dictionary>`_

enron `UCI`_ 28102 39861 * @default_class * `docword.enron.txt.gz (11.7 MB) <https://s3-eu-west-1.amazonaws.com/artm/docword.enron.txt.gz>`_
* `vocab.enron.txt (230 KB) <https://s3-eu-west-1.amazonaws.com/artm/vocab.enron.txt>`_
* `enron_1k (7.1 MB) <https://s3-eu-west-1.amazonaws.com/artm/enron_1k.7z>`_
* `enron_dictionary <https://s3-eu-west-1.amazonaws.com/artm/enron_dictionary>`_

nytimes `UCI`_ 102660 300000 * @default_class * `docword.nytimes.txt.gz (223 MB) <https://s3-eu-west-1.amazonaws.com/artm/docword.nytimes.txt.gz>`_
* `vocab.nytimes.txt (1.2 MB) <https://s3-eu-west-1.amazonaws.com/artm/vocab.nytimes.txt>`_
* `nytimes_1k (131 MB) <https://s3-eu-west-1.amazonaws.com/artm/nytimes_1k.7z>`_
* `nytimes_dictionary <https://s3-eu-west-1.amazonaws.com/artm/nytimes_dictionary>`_

pubmed `UCI`_ 141043 8200000 * @default_class * `docword.pubmed.txt.gz (1.7 GB) <https://s3-eu-west-1.amazonaws.com/artm/docword.pubmed.txt.gz>`_
* `vocab.pubmed.txt (1.3 MB) <https://s3-eu-west-1.amazonaws.com/artm/vocab.pubmed.txt>`_
* `pubmed_10k (1 GB) <https://s3-eu-west-1.amazonaws.com/artm/pubmed_10k.7z>`_
* `pubmed_dictionary <https://s3-eu-west-1.amazonaws.com/artm/pubmed_dictionary>`_

wiki `Gensim`_ 100000 3665223 * @default_class * `enwiki-20141208_10k (1.2 GB) <https://s3-eu-west-1.amazonaws.com/artm/enwiki-20141208_10k.7z>`_
* `enwiki-20141208_1k (1.4 GB) <https://s3-eu-west-1.amazonaws.com/artm/enwiki-20141208_1k.7z>`_
* `enwiki-20141208_dictionary (3.6 MB) <https://s3-eu-west-1.amazonaws.com/artm/enwiki-20141208_dictionary>`_

wiki_enru `Wiki`_ 196749 216175 * @english * `wiki_enru (282 MB) <https://s3-eu-west-1.amazonaws.com/artm/wiki_enru.7z>`_
* @russian * `wiki_enru_dictionary (5.3 MB) <https://s3-eu-west-1.amazonaws.com/artm/wiki_enru_dictionary>`_
========= ========= ======= ======= ================== ==================================================================================================================

.. _UCI: https://archive.ics.uci.edu/ml/datasets/Bag+of+Words

.. _Gensim: http://radimrehurek.com/gensim/wiki.html

.. _Wiki: http://dumps.wikimedia.org

MasterComponent
---------------
