Merge pull request #141 from sashafrey/master
Refactor BigARTM tutorial [skip ci]
bigartm committed Mar 4, 2015
2 parents 53e260d + 1917442 commit 3598f82
Showing 14 changed files with 568 additions and 409 deletions.
52 changes: 50 additions & 2 deletions docs/devguide.txt
@@ -1,8 +1,37 @@
BigARTM Developer's Guide
=========================

This document describes the development process of BigARTM
library.
This document describes the development process of BigARTM library.

You do not need to follow this guide if you are using the pre-built BigARTM library via the command-line interface or from Python
(refer to :doc:`/tutorials/windows_basic` or :doc:`/tutorials/linux_basic`, depending on your operating system).

Downloads (Windows)
-------------------

Download and install the following tools:

* Git for Windows from http://git-scm.com/download/win

  * https://github.com/msysgit/msysgit/releases/download/Git-1.9.5-preview20141217/Git-1.9.5-preview20141217.exe

* GitHub for Windows from https://windows.github.com/

  * https://github-windows.s3.amazonaws.com/GitHubSetup.exe

* Visual Studio 2013 Express for Windows Desktop from https://www.visualstudio.com/en-us/products/visual-studio-express-vs.aspx
* CMake from http://www.cmake.org/download/

  * http://www.cmake.org/files/v3.0/cmake-3.0.2-win32-x86.exe

* Prebuilt Boost binaries from http://sourceforge.net/projects/boost/files/boost-binaries/, for example these two:

  * http://sourceforge.net/projects/boost/files/boost-binaries/1.57.0/boost_1_57_0-msvc-12.0-32.exe/download
  * http://sourceforge.net/projects/boost/files/boost-binaries/1.57.0/boost_1_57_0-msvc-12.0-64.exe/download

* Python from https://www.python.org/downloads/

  * https://www.python.org/ftp/python/2.7.9/python-2.7.9.amd64.msi
  * https://www.python.org/ftp/python/2.7.9/python-2.7.9.msi

* (optional) If you plan to build the documentation, download and install sphinx-doc as described here: http://sphinx-doc.org/latest/index.html
* (optional) 7-zip -- http://www.7-zip.org/a/7z920-x64.msi

The explicit links are given only for convenience when you are setting up a new environment.
You are free to choose other versions or tools, and most likely they will work just fine with BigARTM.
Remember to match the following:

* The Visual Studio version must match the version of the Boost binaries, unless you build Boost yourself.
* Use the same configuration (32 bit or 64 bit) for your Python and BigARTM binaries.
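
To check which configuration your Python interpreter uses, you can inspect its pointer size.
The sketch below relies only on the standard library; the helper name ``python_bitness``
is illustrative and is not part of BigARTM.

.. code-block:: python

   import struct


   def python_bitness():
       """Return 32 or 64 depending on the pointer size of this interpreter."""
       return struct.calcsize("P") * 8


   if __name__ == "__main__":
       print("This Python interpreter is %d-bit" % python_bitness())

Choose the BigARTM binaries whose configuration matches the reported value.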

Source code
-----------
@@ -222,3 +251,22 @@ Then execute the script like this:
On Windows you may run this master-script to check all required files:

``$(BIGARTM_ROOT)/utils/cpplint_all.bat``.

Intel Math Kernel Library
-------------------------

BigARTM can use the Intel Math Kernel Library (MKL) to achieve better performance.
This only applies when :attr:`ModelConfig.use_sparse_bow` is ``false``
(in the sparse mode BigARTM uses a better built-in algorithm that does not rely on Intel MKL).

To enable MKL on Windows, add the path to the MKL library to your ``PATH`` environment variable:

.. code-block:: bat

   set PATH=%PATH%;"C:\Program Files (x86)\Intel\Composer XE 2013 SP1\redist\intel64\mkl"

To enable MKL on Linux, create a new environment variable ``MKL_PATH`` and set it as follows:

.. code-block:: bash

   export MKL_PATH="/opt/intel/mkl/lib/intel64/"
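
To confirm that the variable is visible to a process before you run BigARTM, a small check such as the one below may help.
This is a minimal sketch using only the standard library; the function name ``mkl_path_status`` is hypothetical.
It only inspects the ``MKL_PATH`` variable and does not verify that BigARTM actually loads MKL.

.. code-block:: python

   import os


   def mkl_path_status(environ=None):
       """Describe whether MKL_PATH is set and points to an existing directory."""
       environ = os.environ if environ is None else environ
       path = environ.get("MKL_PATH")
       if not path:
           return "MKL_PATH is not set"
       if not os.path.isdir(path):
           return "MKL_PATH is set to %r, but that directory does not exist" % path
       return "MKL_PATH points to %s" % path


   if __name__ == "__main__":
       print(mkl_path_status())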
98 changes: 62 additions & 36 deletions docs/download.txt
@@ -1,36 +1,62 @@
Download
========

* Windows - latest release
* https://github.com/bigartm/bigartm/releases/download/v0.5.8/BigARTM_v0.5.8_win32.7z
* https://github.com/bigartm/bigartm/releases/download/v0.5.8/BigARTM_v0.5.8_x64.7z

* Windows - previous releases
* https://github.com/bigartm/bigartm/releases/download/v0.5.7/BigARTM_v0.5.7_win32.7z
* https://github.com/bigartm/bigartm/releases/download/v0.5.7/BigARTM_v0.5.7_x64.7z
* https://github.com/bigartm/bigartm/releases/download/v0.5.6/BigARTM_v0.5.6_win32.7z
* https://github.com/bigartm/bigartm/releases/download/v0.5.6/BigARTM_v0.5.6_x64.7z
* https://github.com/bigartm/bigartm/releases/download/v0.5.5/BigARTM_v0.5.5_win32.7z
* https://github.com/bigartm/bigartm/releases/download/v0.5.5/BigARTM_v0.5.5_x64.7z
* https://github.com/bigartm/bigartm/releases/download/v0.5.4/BigARTM_v0.5.4_win32.7z
* https://github.com/bigartm/bigartm/releases/download/v0.5.4/BigARTM_v0.5.4_x64.7z
* https://github.com/bigartm/bigartm/releases/download/v0.5.3/BigARTM_v0.5.3_win32.7z
* https://github.com/bigartm/bigartm/releases/download/v0.5.3/BigARTM_v0.5.3_x64.7z
* https://github.com/bigartm/bigartm/releases/download/v0.5.2/BigARTM_v0.5.2_win32.7z
* https://github.com/bigartm/bigartm/releases/download/v0.5.2/BigARTM_v0.5.2_x64.7z
* https://github.com/bigartm/bigartm/releases/download/v0.5.1/BigARTM_v0.5.1_win32.7z
* https://github.com/bigartm/bigartm/releases/download/v0.5.1/BigARTM_v0.5.1_x64.7z

Please refer to :doc:`tutorial` chapter for installation guide.

* Linux, Mac OS-X
Currently there is no distribution package for Linux or Mac OS-X.
To run BigARTM you need to download the source code and build it on your machine.
The detailed procedure is available in the :doc:`tutorial` and :doc:`devguide` chapters.

Other tools
-----------

* 7-zip -- http://www.7-zip.org/a/7z920-x64.msi
* Python 2.7.9, 64 bit -- https://www.python.org/ftp/python/2.7.9/python-2.7.9.amd64.msi
* Python 2.7.9, 32 bit -- https://www.python.org/ftp/python/2.7.9/python-2.7.9.msi
Downloads
=========

* **Windows**

  * Latest 32 bit release: https://github.com/bigartm/bigartm/releases/download/v0.5.8/BigARTM_v0.5.8_win32.7z
  * Latest 64 bit release: https://github.com/bigartm/bigartm/releases/download/v0.5.8/BigARTM_v0.5.8_x64.7z
  * All previous releases are available at https://github.com/bigartm/bigartm/releases

  Please refer to :doc:`tutorials/windows_basic` for a step-by-step installation procedure.

* **Linux, Mac OS-X**

  To run BigARTM on Linux and Mac OS-X you need to clone the BigARTM repository
  (https://github.com/bigartm/bigartm) and build it as described in
  :doc:`tutorials/linux_basic`.

* **Datasets**

========= ========= ======= ======= ==================================================================================================================
Task      Source    #Words  #Items  Files
========= ========= ======= ======= ==================================================================================================================
kos       `UCI`_    6906    3430    * `docword.kos.txt.gz (1 MB) <https://s3-eu-west-1.amazonaws.com/artm/docword.kos.txt.gz>`_
                                    * `vocab.kos.txt (54 KB) <https://s3-eu-west-1.amazonaws.com/artm/vocab.kos.txt>`_
                                    * `kos_1k (700 KB) <https://s3-eu-west-1.amazonaws.com/artm/kos_1k.7z>`_
                                    * `kos_dictionary <https://s3-eu-west-1.amazonaws.com/artm/kos_dictionary>`_

nips      `UCI`_    12419   1500    * `docword.nips.txt.gz (2.1 MB) <https://s3-eu-west-1.amazonaws.com/artm/docword.nips.txt.gz>`_
                                    * `vocab.nips.txt (98 KB) <https://s3-eu-west-1.amazonaws.com/artm/vocab.nips.txt>`_
                                    * `nips_200 (1.5 MB) <https://s3-eu-west-1.amazonaws.com/artm/nips_200.7z>`_
                                    * `nips_dictionary <https://s3-eu-west-1.amazonaws.com/artm/nips_dictionary>`_

enron     `UCI`_    28102   39861   * `docword.enron.txt.gz (11.7 MB) <https://s3-eu-west-1.amazonaws.com/artm/docword.enron.txt.gz>`_
                                    * `vocab.enron.txt (230 KB) <https://s3-eu-west-1.amazonaws.com/artm/vocab.enron.txt>`_
                                    * `enron_1k (7.1 MB) <https://s3-eu-west-1.amazonaws.com/artm/enron_1k.7z>`_
                                    * `enron_dictionary <https://s3-eu-west-1.amazonaws.com/artm/enron_dictionary>`_

nytimes   `UCI`_    102660  300000  * `docword.nytimes.txt.gz (223 MB) <https://s3-eu-west-1.amazonaws.com/artm/docword.nytimes.txt.gz>`_
                                    * `vocab.nytimes.txt (1.2 MB) <https://s3-eu-west-1.amazonaws.com/artm/vocab.nytimes.txt>`_
                                    * `nytimes_1k (131 MB) <https://s3-eu-west-1.amazonaws.com/artm/nytimes_1k.7z>`_
                                    * `nytimes_dictionary <https://s3-eu-west-1.amazonaws.com/artm/nytimes_dictionary>`_

pubmed    `UCI`_    141043  8200000 * `docword.pubmed.txt.gz (1.7 GB) <https://s3-eu-west-1.amazonaws.com/artm/docword.pubmed.txt.gz>`_
                                    * `vocab.pubmed.txt (1.3 MB) <https://s3-eu-west-1.amazonaws.com/artm/vocab.pubmed.txt>`_
                                    * `pubmed_10k (1 GB) <https://s3-eu-west-1.amazonaws.com/artm/pubmed_10k.7z>`_
                                    * `pubmed_dictionary <https://s3-eu-west-1.amazonaws.com/artm/pubmed_dictionary>`_

wiki      `Gensim`_ 100000  3665223 * `enwiki-20141208_10k (1.2 GB) <https://s3-eu-west-1.amazonaws.com/artm/enwiki-20141208_10k.7z>`_
                                    * `enwiki-20141208_1k (1.4 GB) <https://s3-eu-west-1.amazonaws.com/artm/enwiki-20141208_1k.7z>`_
                                    * `enwiki-20141208_dictionary (3.6 MB) <https://s3-eu-west-1.amazonaws.com/artm/enwiki-20141208_dictionary>`_

wiki_enru `Wiki`_   196749  216175  * `wiki_enru (282 MB) <https://s3-eu-west-1.amazonaws.com/artm/wiki_enru.7z>`_
                                    * `wiki_enru_dictionary (5.3 MB) <https://s3-eu-west-1.amazonaws.com/artm/wiki_enru_dictionary>`_
                                    * class_id(s): ``@english``, ``@russian``
========= ========= ======= ======= ==================================================================================================================

.. _UCI: https://archive.ics.uci.edu/ml/datasets/Bag+of+Words

.. _Gensim: http://radimrehurek.com/gensim/wiki.html

.. _Wiki: http://dumps.wikimedia.org
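
The ``docword.*.txt.gz`` files of the UCI collections begin with three header lines (the number of documents,
the number of words in the vocabulary, and the number of non-zero counts), followed by ``docID wordID count`` triples.
Assuming that layout, the sketch below reads the header so that you can compare a downloaded file against the table above.
The function names are hypothetical and are not part of BigARTM.

.. code-block:: python

   import gzip
   import io


   def read_docword_header(stream):
       """Read the three header lines (D, W, NNZ) from an open docword stream."""
       num_docs = int(stream.readline())
       num_words = int(stream.readline())
       num_nonzero = int(stream.readline())
       return num_docs, num_words, num_nonzero


   def read_docword_header_gz(path):
       """Open a gzip-compressed docword file and return (D, W, NNZ)."""
       with gzip.open(path, "rb") as stream:
           return read_docword_header(stream)


   if __name__ == "__main__":
       # Tiny synthetic sample in the assumed layout, not real dataset content.
       sample = io.BytesIO(b"2\n3\n4\n1 1 5\n1 3 1\n2 2 2\n2 3 7\n")
       print(read_docword_header(sample))

For a downloaded collection you would call ``read_docword_header_gz("docword.kos.txt.gz")`` instead.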
4 changes: 2 additions & 2 deletions docs/index.txt
@@ -13,13 +13,13 @@ Welcome to BigARTM's documentation!

intro
download
tutorial
network
tutorials/index
stories/index
faq
devguide
ref/index
publications
legacy_pages

Indices and tables
==================
16 changes: 16 additions & 0 deletions docs/legacy_pages.txt
@@ -0,0 +1,16 @@
.. BigARTM documentation master file, created by
   sphinx-quickstart on Sun Jul 13 20:00:11 2014.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

.. _legacy_pages:

Legacy documentation pages
==========================

Legacy pages are kept to preserve existing users' links (browser bookmarks, etc.).

.. toctree::
   :maxdepth: 2

   tutorial
4 changes: 2 additions & 2 deletions docs/ref/cpp_client.txt
@@ -6,7 +6,7 @@ This document provides an overview of ``cpp_client``,
a simple command-line utility shipped with BigARTM.

To run *cpp_client* you need to download input data (a textual collection represented in bag-of-words format).
We recommend to download *vocab* and *docword* files by links provided in :ref:`TutorialParseCollection` section of the tutorial.
We recommend downloading the *vocab* and *docword* files using the links provided on the :doc:`/download` page.
Then you can use *cpp_client* as follows:

.. code-block:: bash
@@ -31,7 +31,7 @@ You may append the following options to customize the resulting topic model:
You may also apply the following optimizations, which should not change the resulting model:

* ``--reuse_batches`` skips parsing of the *docword* and *vocab* files, and tries to use batches located in ``--batch_folder``.
  You may download pre-parsed batches by links provided in :ref:`TutorialParseCollection` section of the tutorial.
  You may download pre-parsed batches using the links provided on the :doc:`/download` page.

* ``-p`` allows you to specify the number of concurrent processors.
  The recommended value is the number of logical cores on your machine.
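
If you are unsure how many logical cores your machine has, Python's standard library can report it.
The sketch below builds the ``-p`` option from that number; the helper name ``processor_option`` is illustrative,
and passing the value as a separate argument after ``-p`` is an assumption about the command-line syntax.

.. code-block:: python

   import multiprocessing


   def processor_option():
       """Return the ``-p`` option with the number of logical cores."""
       return ["-p", str(multiprocessing.cpu_count())]


   if __name__ == "__main__":
       # Append the printed text to your usual cpp_client command line.
       print(" ".join(processor_option()))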
2 changes: 2 additions & 0 deletions docs/ref/index.txt
@@ -17,4 +17,6 @@ BigARTM Reference
c_interface
cpp_interface
cpp_client
windows_distribution


74 changes: 74 additions & 0 deletions docs/ref/windows_distribution.txt
@@ -0,0 +1,74 @@
====================
Windows distribution
====================

This chapter describes the contents of the BigARTM distribution package for Windows, available at https://github.com/bigartm/bigartm/releases.

=========================== ==========================================================
``bin/``                    | Precompiled binaries of BigARTM for Windows.
                            | This folder must be added to ``PATH`` system variable.

``bin/artm.dll``            | Core functionality of the BigARTM library.

``bin/node_controller.exe`` | Executable that hosts BigARTM nodes in a distributed
                            | setting.

``bin/cpp_client.exe``      | Command-line utility for running simple experiments
                            | with BigARTM. Remember that not all BigARTM features are
                            | available through cpp_client, but it can serve as a good
                            | starting point to learn basic functionality. For further
                            | details refer to :doc:`/ref/cpp_client`.

``protobuf/``               | A minimalistic version of Google Protocol Buffers
                            | (https://code.google.com/p/protobuf/)
                            | library, required to run BigARTM from Python.
                            | To set up this package follow the instructions in
                            | ``protobuf/python/README`` file.

``python/artm/``            | Python programming interface to BigARTM library.
                            | This folder must be added to ``PYTHONPATH``
                            | system variable.

``library.py``              | Implements all classes of BigARTM python interface.

``messages_pb2.py``         | Contains all protobuf messages that can be transferred in
                            | and out of the BigARTM core library. Most common features
                            | are exposed with their own API methods, so normally you
                            | do not use python protobuf messages to operate BigARTM.

``python/examples/``        | Python examples of how to use BigARTM:

                            * `example01_synthetic_collection.py <https://raw.githubusercontent.com/bigartm/bigartm/master/src/python/examples/example01_synthetic_collection.py>`_

                            * `example02_parse_collection.py <https://raw.githubusercontent.com/bigartm/bigartm/master/src/python/examples/example02_parse_collection.py>`_

                            * `example03_concurrency.py <https://raw.githubusercontent.com/bigartm/bigartm/master/src/python/examples/example03_concurrency.py>`_

                            * `example04_online_algorithm.py <https://raw.githubusercontent.com/bigartm/bigartm/master/src/python/examples/example04_online_algorithm.py>`_

                            * `example05_train_and_test_stream.py <https://raw.githubusercontent.com/bigartm/bigartm/master/src/python/examples/example05_train_and_test_stream.py>`_

                            * `example06_use_dictionaries.py <https://raw.githubusercontent.com/bigartm/bigartm/master/src/python/examples/example06_use_dictionaries.py>`_

                            * `example07_master_component_proxy.py <https://raw.githubusercontent.com/bigartm/bigartm/master/src/python/examples/example07_master_component_proxy.py>`_

                            * `example08_network_modus_operandi.py <https://raw.githubusercontent.com/bigartm/bigartm/master/src/python/examples/example08_network_modus_operandi.py>`_

                            | Files ``docword.kos.txt`` and ``vocab.kos.txt`` represent
                            | a simple collection of text files in Bag-Of-Words format.
                            | The files are taken from UCI Machine Learning Repository
                            | (https://archive.ics.uci.edu/ml/datasets/Bag+of+Words).

``src/``                    | Several programming interfaces to BigARTM library.

``src/c_interface.h``       | :doc:`Low-level BigARTM interface </ref/c_interface>` in C.

``cpp_interface.h,cc``      | :doc:`C++ interface of BigARTM </ref/cpp_interface>`

``messages.pb.h,cc``        | Protobuf messages for C++ interface

``messages.proto``          | Protobuf description for all messages that appear in the
                            | API of BigARTM. Documented :doc:`here </ref/messages>`.

``LICENSE``                 License file of BigARTM.
=========================== ==========================================================
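
The two folders marked above for ``PATH`` and ``PYTHONPATH`` can also be configured for a single Python process instead of system-wide.
The following is a minimal sketch under stated assumptions: ``C:\BigARTM`` is a hypothetical location of the unpacked package,
the helper names are illustrative, and whether this is sufficient to load ``artm.dll`` depends on how the library is loaded (not verified here).

.. code-block:: python

   import os
   import sys


   def bigartm_paths(root):
       """Return the ``bin`` and ``python/artm`` folders of an unpacked package."""
       return os.path.join(root, "bin"), os.path.join(root, "python", "artm")


   def configure_process(root):
       """Extend PATH and sys.path for the current process only."""
       bin_dir, python_dir = bigartm_paths(root)
       os.environ["PATH"] = bin_dir + os.pathsep + os.environ.get("PATH", "")
       if python_dir not in sys.path:
           sys.path.insert(0, python_dir)


   if __name__ == "__main__":
       # Hypothetical location; replace it with the folder where you unpacked the archive.
       configure_process(r"C:\BigARTM")

Setting the system variables as described in the table remains the documented approach; this variant only avoids changing global settings while experimenting.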
6 changes: 3 additions & 3 deletions docs/stories/index.txt
@@ -3,10 +3,10 @@
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.

.. _reference:
.. _stories:

BigARTM Stories
===============
Advanced experiments
====================

.. toctree::
   :maxdepth: 2