Include experiment02_artm.py and its documentation [skip ci]
sashafrey committed Feb 10, 2015
1 parent cd8b9a3 commit f11a024
Showing 8 changed files with 470 additions and 36 deletions.
10 changes: 6 additions & 4 deletions docs/download.txt
@@ -1,11 +1,13 @@
Download
========

* Windows (latest, experimental)
* https://s3-eu-west-1.amazonaws.com/artmdev/BigARTM_v0.5.5_x64_testing.7z
* https://s3-eu-west-1.amazonaws.com/artmdev/BigARTM_v0.5.5_x32_testing.7z
* Windows - latest release
* https://github.com/bigartm/bigartm/releases/download/v0.5.6/BigARTM_v0.5.6_win32.7z
* https://github.com/bigartm/bigartm/releases/download/v0.5.6/BigARTM_v0.5.6_x64.7z

* Windows (previous releases)
* Windows - previous releases
* https://github.com/bigartm/bigartm/releases/download/v0.5.5/BigARTM_v0.5.5_win32.7z
* https://github.com/bigartm/bigartm/releases/download/v0.5.5/BigARTM_v0.5.5_x64.7z
* https://github.com/bigartm/bigartm/releases/download/v0.5.4/BigARTM_v0.5.4_win32.7z
* https://github.com/bigartm/bigartm/releases/download/v0.5.4/BigARTM_v0.5.4_x64.7z
* https://github.com/bigartm/bigartm/releases/download/v0.5.3/BigARTM_v0.5.3_win32.7z
1 change: 1 addition & 0 deletions docs/index.txt
@@ -15,6 +1,7 @@ Welcome to BigARTM's documentation!
download
tutorial
network
stories/index
faq
devguide
ref/index
62 changes: 61 additions & 1 deletion docs/network.txt
@@ -63,7 +63,7 @@ in one of your target machines.
Alternatively, if you have launched several nodes, you can utilize all of them
by configuring your remote MasterComponent to work in the Network modus operandi.

.. code-block:: bash
.. code-block:: python

library = ArtmLibrary('artm.dll')

@@ -76,3 +76,63 @@

with library.CreateMasterComponent(master_proxy_config) as master_proxy:
    # Use master_proxy in the same way you usually use master component

Combining network modus operandi with proxy
-------------------------------------------


This Python script assumes that you have started a local node_controller process as follows:

.. code-block:: bash

set GLOG_logtostderr=1 & node_controller.exe tcp://*:5000 tcp://*:5556 tcp://*:5557

This Python script will use the following ports:

* 5000 - port on which the MasterComponent communicates with the Proxy
  (this endpoint must be created by node_controller)
* 5550 - port on which the MasterComponent communicates with the Nodes
  (this endpoint is created automatically by the master component)
* 5556, 5557 - ports on which the NodeControllerComponents communicate with the
  MasterComponent (these endpoints must be created by node_controller)

.. code-block:: python

import artm.messages_pb2, artm.library, sys

# Network path of a shared folder with batches to process.
# The folder must be reachable from all remote node controllers.
target_folder = 'D:\\datasets\\nips'

# Dictionary file (must be located on developer's box that runs python script)
dictionary_file = 'D:\\datasets\\nips\\dictionary'

unique_tokens = artm.library.Library().LoadDictionary(dictionary_file)

# Create master component and infer topic model
proxy_config = artm.messages_pb2.MasterProxyConfig()
proxy_config.node_connect_endpoint = 'tcp://localhost:5000'
proxy_config.communication_timeout = 10000 # timeout (in ms) for communication between proxy and master component
proxy_config.polling_frequency = 50 # polling frequency (in ms) for long-lasting operations, for example WaitIdle()
proxy_config.config.modus_operandi = artm.library.MasterComponentConfig_ModusOperandi_Network
proxy_config.config.communication_timeout = 2000 # timeout (in ms) for communication between master component and nodes
proxy_config.config.disk_path = target_folder
proxy_config.config.create_endpoint = 'tcp://*:5550'
proxy_config.config.connect_endpoint = 'tcp://localhost:5550'
proxy_config.config.node_connect_endpoint.append('tcp://localhost:5556')
proxy_config.config.node_connect_endpoint.append('tcp://localhost:5557')
proxy_config.config.processors_count = 1 # number of processors to create at every node

with artm.library.MasterComponent(config=proxy_config) as master:
    dictionary = master.CreateDictionary(unique_tokens)
    perplexity_score = master.CreatePerplexityScore()
    model = master.CreateModel(topics_count=10, inner_iterations_count=10)
    model.EnableScore(perplexity_score)
    model.Initialize(dictionary)

    for iter in range(0, 8):
        master.InvokeIteration(1)  # Invoke one scan of the entire collection...
        master.WaitIdle()          # and wait until it completes.
        model.Synchronize()        # Synchronize topic model.
        print "Iter#" + str(iter),
        print ": Perplexity = %.3f" % perplexity_score.GetValue(model).value
Binary file added docs/stories/_images/experiment02_artm.png
70 changes: 70 additions & 0 deletions docs/stories/experiment02_artm.txt
@@ -0,0 +1,70 @@
Enabling Basic BigARTM Regularizers
===================================

This page describes an experiment with topic model regularization in the BigARTM library using
`experiment02_artm.py <https://raw.githubusercontent.com/bigartm/bigartm/master/src/python/experiments/experiment02_artm.py>`_.
The script learns a topic model with three regularizers
(Phi sparsing, Theta sparsing, and pairwise topic decorrelation in Phi).
It also monitors the learning process with several quality measures: hold-out perplexity,
Phi and Theta sparsity, and average topic kernel characteristics.

.. warning::

Note that perplexity estimation can influence the learning process of the online algorithm,
so we evaluate perplexity only once per 20 synchronizations to avoid this influence.
You can change this frequency via the ``test_every`` variable.
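
The evaluation schedule above can be sketched as follows. This is a minimal illustration under stated assumptions, not the script's actual code; only the name ``test_every`` is taken from ``experiment02_artm.py``.

```python
# Minimal sketch of the evaluation schedule: with test_every = 20,
# hold-out perplexity is estimated only on every 20th synchronization,
# so the estimation does not influence the online algorithm.
test_every = 20

def should_estimate_perplexity(sync_index, period=test_every):
    """Return True on every period-th synchronization (counting from 1)."""
    return (sync_index + 1) % period == 0

# Over 200 synchronizations, only 10 trigger a perplexity estimation.
estimations = [i for i in range(200) if should_estimate_perplexity(i)]
print(len(estimations))  # -> 10
```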

We assume you have BigARTM installed in ``$YOUR_HOME_DIRECTORY``.
To run the experiment, execute the following steps:

1. Download the collection, represented as BigARTM batches:

* https://s3-eu-west-1.amazonaws.com/artm/enwiki-20141208_1k.7z
* https://s3-eu-west-1.amazonaws.com/artm/enwiki-20141208_10k.7z

This data represents a complete dump of the English Wikipedia (approximately 3.7 million documents).
In the first archive each batch contains 1000 documents; in the second, 10000. We used 10000.
Put the decompressed folder with batches into ``$YOUR_HOME_DIRECTORY``.
You also need to move the dictionary file from the batches folder there.

The batch you would like to use for hold-out perplexity estimation must also be placed in ``$YOUR_HOME_DIRECTORY``.
In our experiment we used the batch named ``243af5b8-beab-4332-bb42-61892df5b044.batch``.
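
As a sanity check, the layout described in step 1 can be verified with a few lines of Python. ``home_folder`` is a placeholder for ``$YOUR_HOME_DIRECTORY``, and the file names are the examples used on this page, not values the script requires.

```python
import os

# Example layout check; all names below are illustrative placeholders.
home_folder = 'D:\\experiment'
expected = [
    'enwiki-20141208_10k',                         # decompressed batches folder
    'dictionary',                                  # dictionary file moved out of it
    '243af5b8-beab-4332-bb42-61892df5b044.batch',  # hold-out test batch
]
missing = [name for name in expected
           if not os.path.exists(os.path.join(home_folder, name))]
if missing:
    print('Missing from %s: %s' % (home_folder, ', '.join(missing)))
```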

2. The next step is to prepare the script. Open its code and find the declarations of the following variables:

* ``home_folder`` (line 8) and assign it the path ``$YOUR_HOME_DIRECTORY``;
* ``batch_size`` (line 28) and assign it the chosen batch size;
* ``batches_disk_path`` (line 36) and replace the string 'wiki_10k' with the name of your directory with batches;
* ``test_batch_name`` (line 43) and replace the hard-coded batch name with the name of your test batch;
* ``tau_decor``, ``tau_phi`` and ``tau_theta`` (lines 54-56) and substitute the values you would like to use.
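
The edits in step 2 amount to assignments of the following shape. All values here are illustrative examples, not recommended settings; only the variable names come from the script.

```python
# Illustrative example of the step-2 edits (all values are placeholders).
home_folder = 'D:\\experiment\\'                                 # line 8
batch_size = 10000                                               # line 28
batches_disk_path = home_folder + 'enwiki-20141208_10k'          # line 36
test_batch_name = '243af5b8-beab-4332-bb42-61892df5b044.batch'   # line 43
tau_decor, tau_phi, tau_theta = 0.05, -0.1, -0.15                # lines 54-56
```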

3. If you want to estimate the final perplexity on another, larger test sample, put the chosen batches into the test folder (inside ``$YOUR_HOME_DIRECTORY``).
Then find the declaration of the variable ``save_and_test_model`` (line 30) in the script and assign it ``True``.

4. Finally, launch the script. The current measure values will be printed to the console.
Note that after synchronizations without perplexity estimation, the perplexity value is printed as the string 'NO'.
The results of synchronizations with perplexity estimation are additionally written to the corresponding files in the results folder.
The file format is the same for all measures: a set of lines of the form ``(accumulated number of processed documents, measure value)``:

.. code-block:: text

(10000, 0.018)
(220000, 0.41)
(430000, 0.456)
(640000, 0.475)
...

These files can be used for plot building.
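
For example, a results file in this format could be parsed for plotting as follows. This is a sketch; substitute the actual measure file produced by the script.

```python
import ast

def parse_measure_file(lines):
    """Parse lines like '(10000, 0.018)' into (documents, value) tuples."""
    return [ast.literal_eval(line.strip()) for line in lines if line.strip()]

# The same format as the sample output above.
sample = ['(10000, 0.018)', '(220000, 0.41)', '(430000, 0.456)', '(640000, 0.475)']
points = parse_measure_file(sample)
docs = [d for d, _ in points]    # x axis: accumulated processed documents
values = [v for _, v in points]  # y axis: measure value
```

These two lists can then be passed directly to any plotting library.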

If desired, you can easily change the value of any variable in the script, since each one is clearly commented.
If you use the same parameters and data as in our experiment, you should get results close to these:

.. image:: _images/experiment02_artm.png
:alt: experiment02_artm

Here you can see a comparison between the ARTM and LDA models.
To run the experiment with LDA instead of ARTM, you only need to change the values of the variables ``tau_decor``,
``tau_phi`` and ``tau_theta`` to 0, 1 / topics_count and 1 / topics_count respectively, and run the script again.
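
The LDA configuration described above boils down to the following assignments; ``topics_count`` is an example value here, and the script defines its own.

```python
topics_count = 10  # example value; use the script's own setting

# LDA-equivalent regularizer coefficients: no decorrelation,
# symmetric smoothing of Phi and Theta by 1 / topics_count.
tau_decor = 0
tau_phi = 1.0 / topics_count
tau_theta = 1.0 / topics_count
```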

.. warning::
Note that we used a machine with 8 cores and 15 GB RAM for our experiment.
14 changes: 14 additions & 0 deletions docs/stories/index.txt
@@ -0,0 +1,14 @@
.. BigARTM documentation master file, created by
sphinx-quickstart on Sun Jul 13 20:00:11 2014.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.

.. _reference:

BigARTM Stories
===============

.. toctree::
:maxdepth: 2

experiment02_artm
67 changes: 36 additions & 31 deletions docs/tutorial.txt
@@ -222,42 +222,47 @@ You may also download larger collections from the following links.
You can get the original collection (docword and vocab files)
or already precompiled batches and a dictionary.

========= ========= ======= ======= ========================================================================================================
Task Source #Words #Items Files
========= ========= ======= ======= ========================================================================================================
kos `UCI`_ 6906 3430 * `docword.kos.txt.gz (1 MB) <https://s3-eu-west-1.amazonaws.com/artm/docword.kos.txt.gz>`_
* `vocab.kos.txt (54 KB) <https://s3-eu-west-1.amazonaws.com/artm/vocab.kos.txt>`_
* `kos_1k (700 KB) <https://s3-eu-west-1.amazonaws.com/artm/kos_1k.7z>`_
* `kos_dictionary <https://s3-eu-west-1.amazonaws.com/artm/kos_dictionary>`_

nips `UCI`_ 12419 1500 * `docword.nips.txt.gz (2.1 MB) <https://s3-eu-west-1.amazonaws.com/artm/docword.nips.txt.gz>`_
* `vocab.nips.txt (98 KB) <https://s3-eu-west-1.amazonaws.com/artm/vocab.nips.txt>`_
* `nips_200 (1.5 MB) <https://s3-eu-west-1.amazonaws.com/artm/nips_200.7z>`_
* `nips_dictionary <https://s3-eu-west-1.amazonaws.com/artm/nips_dictionary>`_

enron `UCI`_ 28102 39861 * `docword.enron.txt.gz (11.7 MB) <https://s3-eu-west-1.amazonaws.com/artm/docword.enron.txt.gz>`_
* `vocab.enron.txt (230 KB) <https://s3-eu-west-1.amazonaws.com/artm/vocab.enron.txt>`_
* `enron_1k (7.1 MB) <https://s3-eu-west-1.amazonaws.com/artm/enron_1k.7z>`_
* `enron_dictionary <https://s3-eu-west-1.amazonaws.com/artm/enron_dictionary>`_

nytimes `UCI`_ 102660 300000 * `docword.nytimes.txt.gz (223 MB) <https://s3-eu-west-1.amazonaws.com/artm/docword.nytimes.txt.gz>`_
* `vocab.nytimes.txt (1.2 MB) <https://s3-eu-west-1.amazonaws.com/artm/vocab.nytimes.txt>`_
* `nytimes_1k (131 MB) <https://s3-eu-west-1.amazonaws.com/artm/nytimes_1k.7z>`_
* `nytimes_dictionary <https://s3-eu-west-1.amazonaws.com/artm/nytimes_dictionary>`_

pubmed `UCI`_ 141043 8200000 * `docword.pubmed.txt.gz (1.7 GB) <https://s3-eu-west-1.amazonaws.com/artm/docword.pubmed.txt.gz>`_
* `vocab.pubmed.txt (1.3 MB) <https://s3-eu-west-1.amazonaws.com/artm/vocab.pubmed.txt>`_
* `pubmed_10k (1 GB) <https://s3-eu-west-1.amazonaws.com/artm/pubmed_10k.7z>`_
* `pubmed_dictionary <https://s3-eu-west-1.amazonaws.com/artm/pubmed_dictionary>`_

wiki `Gensim`_ 100000 3665223 * `wiki_10k (1.1 GB) <https://s3-eu-west-1.amazonaws.com/artm/wiki_10k.7z>`_
* `wiki_dictionary <https://s3-eu-west-1.amazonaws.com/artm/wiki_dictionary>`_
========= ========= ======= ======= ========================================================================================================
========= ========= ======= ======= ================== ==================================================================================================================
Task Source #Words #Items class_id(s) Files
========= ========= ======= ======= ================== ==================================================================================================================
kos `UCI`_ 6906 3430 * @default_class * `docword.kos.txt.gz (1 MB) <https://s3-eu-west-1.amazonaws.com/artm/docword.kos.txt.gz>`_
* `vocab.kos.txt (54 KB) <https://s3-eu-west-1.amazonaws.com/artm/vocab.kos.txt>`_
* `kos_1k (700 KB) <https://s3-eu-west-1.amazonaws.com/artm/kos_1k.7z>`_
* `kos_dictionary <https://s3-eu-west-1.amazonaws.com/artm/kos_dictionary>`_

nips `UCI`_ 12419 1500 * @default_class * `docword.nips.txt.gz (2.1 MB) <https://s3-eu-west-1.amazonaws.com/artm/docword.nips.txt.gz>`_
* `vocab.nips.txt (98 KB) <https://s3-eu-west-1.amazonaws.com/artm/vocab.nips.txt>`_
* `nips_200 (1.5 MB) <https://s3-eu-west-1.amazonaws.com/artm/nips_200.7z>`_
* `nips_dictionary <https://s3-eu-west-1.amazonaws.com/artm/nips_dictionary>`_

enron `UCI`_ 28102 39861 * @default_class * `docword.enron.txt.gz (11.7 MB) <https://s3-eu-west-1.amazonaws.com/artm/docword.enron.txt.gz>`_
* `vocab.enron.txt (230 KB) <https://s3-eu-west-1.amazonaws.com/artm/vocab.enron.txt>`_
* `enron_1k (7.1 MB) <https://s3-eu-west-1.amazonaws.com/artm/enron_1k.7z>`_
* `enron_dictionary <https://s3-eu-west-1.amazonaws.com/artm/enron_dictionary>`_

nytimes `UCI`_ 102660 300000 * @default_class * `docword.nytimes.txt.gz (223 MB) <https://s3-eu-west-1.amazonaws.com/artm/docword.nytimes.txt.gz>`_
* `vocab.nytimes.txt (1.2 MB) <https://s3-eu-west-1.amazonaws.com/artm/vocab.nytimes.txt>`_
* `nytimes_1k (131 MB) <https://s3-eu-west-1.amazonaws.com/artm/nytimes_1k.7z>`_
* `nytimes_dictionary <https://s3-eu-west-1.amazonaws.com/artm/nytimes_dictionary>`_

pubmed `UCI`_ 141043 8200000 * @default_class * `docword.pubmed.txt.gz (1.7 GB) <https://s3-eu-west-1.amazonaws.com/artm/docword.pubmed.txt.gz>`_
* `vocab.pubmed.txt (1.3 MB) <https://s3-eu-west-1.amazonaws.com/artm/vocab.pubmed.txt>`_
* `pubmed_10k (1 GB) <https://s3-eu-west-1.amazonaws.com/artm/pubmed_10k.7z>`_
* `pubmed_dictionary <https://s3-eu-west-1.amazonaws.com/artm/pubmed_dictionary>`_

wiki `Gensim`_ 100000 3665223 * @default_class * `enwiki-20141208_10k (1.2 GB) <https://s3-eu-west-1.amazonaws.com/artm/enwiki-20141208_10k.7z>`_
* `enwiki-20141208_1k (1.4 GB) <https://s3-eu-west-1.amazonaws.com/artm/enwiki-20141208_1k.7z>`_
* `enwiki-20141208_dictionary (3.6 MB) <https://s3-eu-west-1.amazonaws.com/artm/enwiki-20141208_dictionary>`_

wiki_enru `Wiki`_ 196749 216175 * @english * `wiki_enru (282 MB) <https://s3-eu-west-1.amazonaws.com/artm/wiki_enru.7z>`_
* @russian * `wiki_enru_dictionary (5.3 MB) <https://s3-eu-west-1.amazonaws.com/artm/wiki_enru_dictionary>`_
========= ========= ======= ======= ================== ==================================================================================================================

.. _UCI: https://archive.ics.uci.edu/ml/datasets/Bag+of+Words

.. _Gensim: http://radimrehurek.com/gensim/wiki.html

.. _Wiki: http://dumps.wikimedia.org

MasterComponent
---------------
