gLDA is a parallel C++ implementation of Latent Dirichlet Allocation (LDA) that uses Gibbs sampling for parameter estimation.
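LDA Gibbs samplers of this kind typically rely on the per-token collapsed update described by Griffiths and Steyvers (see the references below). The following is a minimal single-machine sketch of that update, not gLDA's actual parallel code; the struct and field names (`LdaSampler`, `nwk`, `ndk`, `nk`) are invented for the example.

```cpp
// Minimal single-machine sketch of the per-token collapsed Gibbs update used by
// LDA samplers (illustrative only; gLDA's actual parallel implementation differs).
// nwk[w][k]: word-topic counts, ndk[d][k]: document-topic counts, nk[k]: topic totals.
#include <random>
#include <vector>

struct LdaSampler {                        // hypothetical type, for illustration only
  int K, V;                                // number of topics, vocabulary size
  double alpha, beta;                      // Dirichlet hyper-parameters (cf. cfg.json)
  std::vector<std::vector<int>> nwk, ndk;
  std::vector<int> nk;
  std::mt19937 rng{42};

  LdaSampler(int K, int V, int D, double alpha, double beta)
      : K(K), V(V), alpha(alpha), beta(beta),
        nwk(V, std::vector<int>(K, 0)), ndk(D, std::vector<int>(K, 0)), nk(K, 0) {}

  // Resample the topic of one token: word w in document d, currently assigned topic z.
  int resample(int d, int w, int z) {
    --nwk[w][z]; --ndk[d][z]; --nk[z];     // remove the current assignment from the counts
    std::vector<double> p(K);
    for (int k = 0; k < K; ++k)            // collapsed conditional, up to a constant factor
      p[k] = (nwk[w][k] + beta) / (nk[k] + V * beta) * (ndk[d][k] + alpha);
    std::discrete_distribution<int> pick(p.begin(), p.end());
    int z_new = pick(rng);
    ++nwk[w][z_new]; ++ndk[d][z_new]; ++nk[z_new];  // record the new assignment
    return z_new;
  }
};
```

In gLDA the analogous counts would additionally need to be kept consistent across workers, presumably through Paracel's parameter server via the `update_lib` registered in the configuration.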
- Enter Paracel's home directory
- Generate a test dataset of 1000 documents: `python tool/datagen.py -m lda -n 1000 -o data.txt`
- Set up the link library path, for example: `export LD_LIBRARY_PATH=your_paracel_install_path/lib:$LD_LIBRARY_PATH`
- Create a JSON file named `cfg.json`; see the example in the Parameters section below.
- Run (20 servers, Mesos mode in the following example): `./prun.py -w 100 -p 20 -c cfg.json -m mesos your_paracel_install_path/bin/gLDA`
Default parameters are set in a JSON-format file. For example, we create a `cfg.json` as below (modify the paths to match your setup):

```json
{
    "input" : "data.txt",
    "output" : "/your_output_path/model",
    "alpha" : 0.1,
    "beta" : 0.1,
    "update_lib" : "your_paracel_install_path/lib/libgLDA_update.so"
}
```
In the above configuration file:
- `alpha` and `beta` are hyper-parameters which control the per-document topic distributions and the per-topic word distributions.
- `k_topics` is the number of topics.
- `iters` is the number of Gibbs sampling iterations.
- `top_words` is the number of most likely words reported for each topic.
- If `debug` mode is enabled, more information such as the log likelihood is displayed, but computation time and memory usage increase significantly.
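The list above mentions `k_topics`, `iters`, `top_words`, and `debug`, which do not appear in the short `cfg.json` shown earlier. A fuller configuration might look like the following sketch; the numeric values and the `debug` flag shown here are illustrative assumptions, not documented defaults:

```json
{
    "input" : "data.txt",
    "output" : "/your_output_path/model",
    "update_lib" : "your_paracel_install_path/lib/libgLDA_update.so",
    "alpha" : 0.1,
    "beta" : 0.1,
    "k_topics" : 10,
    "iters" : 100,
    "top_words" : 10,
    "debug" : false
}
```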
Each line is a document; words are separated by `\t` or spaces. For example, either `2 19 4 24 9 3 2 2 9 2 2 1 24 24 2 1 1 9 3 24 0 3 2 4 0 0` or `this is a document words are split by spaces` is valid.
Each line is a topic, containing the `top_words` most likely words of that topic, sorted in descending order of probability.
- The documents and words here are abstract and should not be understood only as ordinary text documents. Also keep in mind that the data should first be preprocessed (removing stop words and rare words, stemming, etc.) before estimating with gLDA.
- Data generated by `tool/datagen.py` are toy data: words are sampled from 10 topics over a vocabulary of 5 * 5 = 25 words, and each topic's word distribution corresponds to a row or a column of the 5 * 5 grid, so the recovered topics should look like rows or columns of that grid. See the references for details and the sketch below for one way such data can be generated.
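The following sketch shows one way toy data of this shape can be generated. It assumes the setup described above (10 topics, each uniform over one row or one column of a 5 * 5 word grid) and is not a transcription of `tool/datagen.py`; the constants (3 documents, 20 tokens per document, Dirichlet parameter 0.1) are arbitrary choices for the example.

```cpp
// Sketch: toy documents drawn from 10 topics over a 5x5 vocabulary grid
// (an illustration of the setup above, not the actual tool/datagen.py logic).
#include <iostream>
#include <random>
#include <vector>

int main() {
  const int side = 5, K = 2 * side;                 // 25 words, 10 topics
  std::mt19937 rng(7);
  std::gamma_distribution<double> gam(0.1, 1.0);    // gamma draws -> Dirichlet(0.1) sample
  std::uniform_int_distribution<int> cell(0, side - 1);

  for (int doc = 0; doc < 3; ++doc) {               // print a few toy documents, one per line
    std::vector<double> theta(K);
    double sum = 0.0;
    for (double& t : theta) { t = gam(rng); sum += t; }
    for (double& t : theta) t /= sum;               // normalize into a topic mixture
    std::discrete_distribution<int> pick_topic(theta.begin(), theta.end());
    for (int n = 0; n < 20; ++n) {                  // 20 tokens per document
      int k = pick_topic(rng);
      int w = (k < side) ? k * side + cell(rng)             // topics 0..4: rows of the grid
                         : cell(rng) * side + (k - side);   // topics 5..9: columns
      std::cout << w << (n + 1 < 20 ? ' ' : '\n');
    }
  }
  return 0;
}
```

Each printed line has the same shape as the numeric example document shown in the input format above.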
- David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 2003.
- Thomas L. Griffiths and Mark Steyvers. Finding Scientific Topics. PNAS, 2004.
- David Newman, Arthur Asuncion, Padhraic Smyth, and Max Welling. Distributed Inference for Latent Dirichlet Allocation. Advances in Neural Information Processing Systems, 2007.