
Single Machine implementation of LDA

Modules

  1. parallelLDA contains various implementations of multi-threaded LDA
  2. singleLDA contains various implementations of single-threaded LDA
  3. topwords is a tool to explore the topics learnt by LDA/HDP
  4. perplexity is a tool to calculate perplexity on another dataset using the word|topic matrix (see the formula sketch after this list)
  5. datagen packages txt files for our program
  6. preprocessing converts UCI or cLDA formatted data to a simple txt file with one document per line
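
As a rough guide to what the perplexity tool reports (the exact estimator used in the code may differ), held-out perplexity is conventionally defined from the per-token log-likelihood:

     \mathrm{perplexity}(\mathcal{D}_{\mathrm{test}})
       = \exp\!\left(-\,\frac{\sum_{d=1}^{D}\log p(\mathbf{w}_d)}{\sum_{d=1}^{D} N_d}\right)

where N_d is the length of document d and p(w_d) is evaluated using the learnt word|topic matrix; lower values indicate a better fit to the held-out data.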

Organisation

  1. All code is under src within the respective module folder
  2. Many template scripts for running the topic models are provided under scripts
  3. data is a placeholder folder where the data should be put
  4. build and dist folders will be created to hold the executables

Requirements

  1. gcc >= 5.0 or Intel® C++ Compiler 2016 to use C++14 features
  2. split >= 8.21 (part of GNU coreutils)

How to use

We will show how to run our LDA on a UCI bag-of-words dataset.

  1. First of all, compile by running make

     make
  2. Download an example dataset from the UCI repository. A script has been provided for this.

     scripts/get_data.sh
  3. Prepare the data for our program

     scripts/prepare.sh data/nytimes 1

    For other datasets, replace nytimes with the dataset name or location.

  4. Run LDA!

     scripts/lda_runner.sh

    Inside lda_runner.sh all the parameters, e.g. the number of topics, the LDA hyperparameters, the number of threads, etc., can be specified. By default the outputs are stored under out/. You can also specify which LDA inference algorithm to run:

    1. simpleLDA: Plain vanilla Gibbs sampling by Griffiths04 (see the sketch after this list)
    2. sparseLDA: Sparse LDA of Yao09
    3. aliasLDA: Alias LDA
    4. FTreeLDA: F++LDA (inspired by Yu14)
    5. lightLDA: light LDA of Yuan14
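
For orientation, the following is a minimal, self-contained sketch of the per-token update at the heart of a simpleLDA-style collapsed Gibbs sampler. The struct layout and names (LDAState, sample_topic, flattened count arrays) are our own illustration, not the repository's actual data structures:

     // Minimal collapsed Gibbs sampling step for LDA (illustrative sketch only).
     // For each token, remove its current topic assignment, sample a new topic
     // proportionally to (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta),
     // and add the new assignment back.
     #include <random>
     #include <vector>

     struct LDAState {
         int K, V;                 // number of topics, vocabulary size
         double alpha, beta;       // symmetric Dirichlet hyperparameters
         std::vector<int> n_dk;    // D x K: topic counts per document (flattened)
         std::vector<int> n_kw;    // K x V: word counts per topic (flattened)
         std::vector<int> n_k;     // K: total tokens per topic
     };

     // Resample the topic of one token (word w in document d, currently topic z_old).
     int sample_topic(LDAState& s, int d, int w, int z_old, std::mt19937& rng) {
         // Remove the token's current assignment from all counts.
         --s.n_dk[d * s.K + z_old];
         --s.n_kw[z_old * s.V + w];
         --s.n_k[z_old];

         // Compute cumulative unnormalized full-conditional probabilities.
         std::vector<double> cum(s.K);
         double sum = 0.0;
         for (int k = 0; k < s.K; ++k) {
             double doc_part  = s.n_dk[d * s.K + k] + s.alpha;
             double word_part = (s.n_kw[k * s.V + w] + s.beta) /
                                (s.n_k[k] + s.V * s.beta);
             sum += doc_part * word_part;
             cum[k] = sum;
         }

         // Inverse-CDF sampling: draw u in [0, sum) and find the matching topic.
         double u = std::uniform_real_distribution<double>(0.0, sum)(rng);
         int z_new = 0;
         while (z_new < s.K - 1 && cum[z_new] < u) ++z_new;

         // Add the token back with its new topic.
         ++s.n_dk[d * s.K + z_new];
         ++s.n_kw[z_new * s.V + w];
         ++s.n_k[z_new];
         return z_new;
     }

     int main() {
         // Toy setup: 2 documents, 3 topics, 5 word types.
         LDAState s{3, 5, 0.1, 0.01, {}, {}, {}};
         int D = 2;
         s.n_dk.assign(D * s.K, 0);
         s.n_kw.assign(s.K * s.V, 0);
         s.n_k.assign(s.K, 0);

         std::vector<std::vector<int>> docs = {{0, 1, 2, 2}, {3, 4, 4, 1}};
         std::vector<std::vector<int>> z(docs.size());
         std::mt19937 rng(42);

         // Random initialization of topic assignments and counts.
         for (int d = 0; d < D; ++d)
             for (int w : docs[d]) {
                 int k = std::uniform_int_distribution<int>(0, s.K - 1)(rng);
                 z[d].push_back(k);
                 ++s.n_dk[d * s.K + k]; ++s.n_kw[k * s.V + w]; ++s.n_k[k];
             }

         // A few Gibbs sweeps over all tokens.
         for (int iter = 0; iter < 100; ++iter)
             for (int d = 0; d < D; ++d)
                 for (int i = 0; i < (int)docs[d].size(); ++i)
                     z[d][i] = sample_topic(s, d, docs[d][i], z[d][i], rng);
         return 0;
     }

The sparse, alias, F+-tree, and light variants listed above all accelerate exactly this per-token draw, without changing the underlying model.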

The makefile has some useful features:

  • if you have the Intel® C++ Compiler, then you can instead run

     make intel
  • or if you want to use the Intel® C++ Compiler's cross-file optimization (ipo), then run

     make inteltogether
  • you can also compile individual modules selectively by specifying

     make <module-name>
  • or clean individually by

     make clean-<module-name>
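
For example, assuming the module names match the folders listed under Modules above, you could build and then clean just the multi-threaded implementation:

     make parallelLDA
     make clean-parallelLDA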

Performance

Based on our evaluation, F++LDA works best in terms of both speed and perplexity on a held-out dataset. For example, on an Amazon EC2 c4.8xlarge instance we obtained more than 25 million tokens per second. Below we provide a performance comparison against various inference procedures on publicly available datasets.

Datasets

| Dataset   | V       | L             | D         | L/V      | L/D    |
|-----------|---------|---------------|-----------|----------|--------|
| NY Times  | 101,330 | 99,542,127    | 299,753   | 982.36   | 332.08 |
| PubMed    | 141,043 | 737,869,085   | 8,200,000 | 5,231.52 | 89.98  |
| Wikipedia | 210,218 | 1,614,349,889 | 3,731,325 | 7,679.41 | 432.65 |

Experimental datasets and their statistics. V denotes vocabulary size, L denotes the number of training tokens, D denotes the number of documents, L/V indicates the average number of occurrences of a word, L/D indicates the average length of a document.

log-Perplexity with time