-
Notifications
You must be signed in to change notification settings - Fork 117
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Include experiment02_artm.py and its documentation [skip ci]
- Loading branch information
Showing
8 changed files
with
470 additions
and
36 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -15,6 +15,7 @@ Welcome to BigARTM's documentation! | |
download | ||
tutorial | ||
network | ||
stories/index | ||
faq | ||
devguide | ||
ref/index | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,70 @@ | ||
Enabling Basic BigARTM Regularizers | ||
=================================== | ||
|
||
This paper describes the experiment with topic model regularization in BigARTM library using | ||
`experiment02_artm.py <https://raw.githubusercontent.com/bigartm/bigartm/master/src/python/experiments/experiment02_artm.py>`_. | ||
The script provides the possibility to learn topic model with three regularizers | ||
(sparsing Phi, sparsing Theta and pairwise topic decorrelation in Phi). | ||
It also allows the monitoring of learning process by using quality measures as hold-out perplexity, | ||
Phi and Theta sparsity and average topic kernel characteristics. | ||
|
||
.. warning:: | ||
|
||
Note that perplexity estimation can influence the learning process in the online algorithm, | ||
so we evaluate perplexity only once per 20 synchronizations to avoid this influence. | ||
You can change the frequency using ``test_every`` variable. | ||
|
||
We suggest you to have BigARTM installed in ``$YOUR_HOME_DIRECTORY``. | ||
To proceed the experiment you need to execute the following steps: | ||
|
||
1. Download the collection, represented as BigARTM batches: | ||
|
||
* https://s3-eu-west-1.amazonaws.com/artm/enwiki-20141208_1k.7z | ||
* https://s3-eu-west-1.amazonaws.com/artm/enwiki-20141208_10k.7z | ||
|
||
This data represents a complete dump of the English Wikipedia (approximately 3.7 million documents). | ||
The size of one batch in first version is 1000 documents and 10000 in the second one. We used 10000. | ||
The decompressed folder with batches should be put into ``$YOUR_HOME_DIRECTORY``. | ||
You also need to move there the dictionary file from the batches folder. | ||
|
||
The batch, you’d like to use for hold-out perplexity estimation, also must be placed into ``$YOUR_HOME_DIRECTORY``. | ||
In our experiment we used the batch named ``243af5b8-beab-4332-bb42-61892df5b044.batch``. | ||
|
||
2. The next step is the script preparation. Open it’s code and find the declaration(-s) of variable(-s) | ||
|
||
* ``home_folder`` (line 8) and assign it the path ``$YOUR_HOME_DIRECTORY``; | ||
* ``batch_size`` (line 28) and assign it the chosen size of batch; | ||
* ``batches_disk_path`` (line 36) and replace the string 'wiki_10k' with the name of your directory with batches; | ||
* ``test_batch_name`` (line 43) and replace the string with direct batch’s name with the name of your test batch; | ||
* ``tau_decor``, ``tau_phi`` and ``tau_theta`` (lines 54-56) and substitute the values you’d like to use. | ||
|
||
3. If you want to estimate the final perplexity on another, larger test sample, put chosen batches into test folder (in ``$YOUR_HOME_DIRECTORY`` directory). | ||
Then find in the code of the script the declaration of variable ``save_and_test_model`` (line 30) and assign it ``True``. | ||
|
||
4. After all launch the script. Current measures values will be printed into console. | ||
Note, that after synchronizations without perplexity estimation it’s value will be replaced with string ‘NO’. | ||
The results of synchronizations with perplexity estimation in addition will be put in corresponding files in results folder. | ||
The file format is general for all measures: the set of strings «(accumulated number of processed documents, measure value)»: | ||
|
||
.. code-block:: bash | ||
|
||
(10000, 0.018) | ||
(220000, 0.41) | ||
(430000, 0.456) | ||
(640000, 0.475) | ||
... | ||
|
||
These files can be used for plot building. | ||
|
||
If desired, you can easy change values of any variable in the code of script since it’s sense is clearly commented. | ||
If you used all parameters and data identical our experiment you should get the results, close to these ones | ||
|
||
.. image:: _images/experiment02_artm.png | ||
:alt: experiment02_artm | ||
|
||
Here you can see the results of comparison between ARTM and LDA models. | ||
To make the experiment with LDA instead of ARTM you only need to change the values of variables tau_decor, | ||
tau_phi and tau_theta to 0, 1 / topics_count and 1 / topics_count respectively and run the script again. | ||
|
||
.. warning:: | ||
Note, that we used machine with 8 cores and 15 Gb RAM for our experiment. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
.. BigARTM documentation master file, created by | ||
sphinx-quickstart on Sun Jul 13 20:00:11 2014. | ||
You can adapt this file completely to your liking, but it should at least | ||
contain the root `toctree` directive. | ||
|
||
.. _reference: | ||
|
||
BigARTM Stories | ||
=============== | ||
|
||
.. toctree:: | ||
:maxdepth: 2 | ||
|
||
experiment02_artm |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.