A Latent Dirichlet Allocation topic modeling package based on the SparseLDA Gibbs sampling inference algorithm


python-sparselda is a Latent Dirichlet Allocation (LDA) topic modeling package based on the SparseLDA Gibbs sampling inference algorithm. It is written for Python 2.6 or newer in the 2.x series; Python 3.0 and newer are not supported.

Frankly, python-sparselda is just a mini project. We hope it helps you better understand the standard LDA and SparseLDA algorithms. Read the source code for more details. Have fun.
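For orientation, here is a minimal sketch of the idea behind SparseLDA. The collapsed Gibbs sampling weight of a topic for one word in one document, (N(topic|document) + alpha) * (N(word|topic) + beta) / (N(topic) + V * beta), is split into three buckets: a smoothing-only bucket s, a document-topic bucket r, and a topic-word bucket q. The r and q terms are nonzero only for topics that occur in the document or have been assigned to the word, which is where the speed-up comes from. All function and variable names are illustrative and are not taken from the python-sparselda source.

```python
def standard_weights(n_td, n_wt, n_t, alpha, beta, vocab_size):
    """Unnormalized collapsed Gibbs weights, one per topic."""
    return [(n_td[t] + alpha) * (n_wt[t] + beta) / (n_t[t] + vocab_size * beta)
            for t in range(len(n_t))]


def sparse_buckets(n_td, n_wt, n_t, alpha, beta, vocab_size):
    """Split the same weights into the three SparseLDA buckets (s, r, q)."""
    s, r, q = [], [], []
    for t in range(len(n_t)):
        denom = n_t[t] + vocab_size * beta
        s.append(alpha * beta / denom)                  # smoothing only
        r.append(n_td[t] * beta / denom)                # zero if topic is absent from the document
        q.append((n_td[t] + alpha) * n_wt[t] / denom)   # zero if word was never assigned to topic
    return s, r, q


# Toy counts for 3 topics (made-up numbers).
n_td = [2, 0, 1]       # N(topic|document)
n_wt = [0, 0, 4]       # N(word|topic) for the current word
n_t = [10, 7, 12]      # N(topic)
alpha, beta, vocab_size = 0.1, 0.01, 50

full = standard_weights(n_td, n_wt, n_t, alpha, beta, vocab_size)
s, r, q = sparse_buckets(n_td, n_wt, n_t, alpha, beta, vocab_size)
recombined = [s[t] + r[t] + q[t] for t in range(len(n_t))]
```

Here `recombined` matches `full` up to floating-point rounding, so sampling from the three buckets is equivalent to sampling from the standard weights.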

Please use the github issue tracker for python-sparselda at:



1. Install Google Protocol Buffers

python-sparselda uses protobuf to serialize and persist the LDA model and checkpoints, so you should install it first.

tar -jxvf protobuf-2.5.0.tar.bz2
cd protobuf-2.5.0
./configure
make
sudo make install
cd python
python ./ build
sudo python ./ install

cd python-sparselda/common
protoc -I=. --python_out=. lda.proto

2. Training

2.1 Command line

Usage: python [options].

-h, --help   show this help message and exit
        the corpus directory.
        the vocabulary file.
        the num of topics.
        the topic prior alpha.
        the word prior beta.
        the total iteration.
        the model directory.
        the interval to save lda model.
        the accumulated_prob_threshold of topic top words.
        the interval to save checkpoint.
        the checkpoint directory.
        the interval to compute loglikelihood.

2.2 Input corpus format

The corpus for training/estimating the model has the following line format:

[document1]
[document2]
...
[documentM]

in which each line is one document. [documenti] is the ith document of the dataset and consists of a list of Ni words/terms.

[documenti] = [wordi1]\t[wordi2]\t...\t[wordiNi]

in which all [wordij] <i=1...M, j=1...Ni> are text strings and they are separated by the tab character.
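As an illustration of the format described above, the following sketch splits corpus lines into word lists. The function names are hypothetical and this is not the loader used by python-sparselda itself.

```python
def parse_document(line):
    """Split one corpus line into its words; tabs separate the words."""
    line = line.rstrip("\r\n")
    return [word for word in line.split("\t") if word]


def read_corpus(lines):
    """Yield one word list per non-empty line (one line is one document)."""
    for line in lines:
        words = parse_document(line)
        if words:
            yield words


# Any iterable of lines works here, including an open file object.
sample = ["machine\tlearning\ttopic\tmodel\n", "\n", "gibbs\tsampling\n"]
documents = list(read_corpus(sample))
```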

Note that the terms document and word here are abstract and should not be understood only as normal text documents. This is because LDA can be used to discover the underlying topic structure of any kind of discrete data. Therefore, python-sparselda is not limited to text and natural language processing but can also be applied to other kinds of data, such as images.

Also, keep in mind that for text/Web data collections, you should first preprocess the data (e.g., word segmentation, removing stopwords and rare words, stemming, etc.) before estimating with python-sparselda.
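A minimal preprocessing sketch along these lines might look as follows. It assumes whitespace tokenization as a stand-in for real word segmentation, uses a tiny made-up stopword list, and omits stemming; the function names and the min_count parameter are illustrative only.

```python
from collections import defaultdict

STOPWORDS = set(["the", "a", "of", "and", "is"])  # tiny illustrative list


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real word segmentation)."""
    return text.lower().split()


def preprocess(raw_docs, min_count=2):
    """Drop stopwords and words seen fewer than min_count times in the corpus."""
    tokenized = [[w for w in tokenize(doc) if w not in STOPWORDS] for doc in raw_docs]
    counts = defaultdict(int)
    for words in tokenized:
        for w in words:
            counts[w] += 1
    kept = [[w for w in words if counts[w] >= min_count] for words in tokenized]
    # One document per line, words joined by tabs; empty documents are skipped.
    return ["\t".join(words) for words in kept if words]


raw_docs = [
    "The topic model is a model of documents",
    "Gibbs sampling and the topic model",
]
corpus_lines = preprocess(raw_docs)
```

Each string in `corpus_lines` is one document in the tab-separated corpus format and can be written to a file in the corpus directory.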

2.3 Input vocabulary format

The vocabulary for training/estimating the model has the line format as follows:


in which each line is a unique word. Only words that appear in the vocabulary are considered for parameter estimation.
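The effect of the vocabulary can be sketched as a filter over each document: out-of-vocabulary words are dropped before estimation. Mapping words to integer ids by line order is an assumption made for this example, and the function names are hypothetical.

```python
def load_vocabulary(lines):
    """Map each unique word (one per line) to an integer id."""
    vocabulary = {}
    for line in lines:
        word = line.strip()
        if word and word not in vocabulary:
            vocabulary[word] = len(vocabulary)
    return vocabulary


def words_to_ids(words, vocabulary):
    """Keep only in-vocabulary words and convert them to ids."""
    return [vocabulary[w] for w in words if w in vocabulary]


vocabulary = load_vocabulary(["topic\n", "model\n", "gibbs\n"])
ids = words_to_ids(["topic", "unknown", "gibbs", "topic"], vocabulary)
```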

2.4 Outputs

1) LDA Model

It includes three files.

  • lda.topic_word_hist: This file contains the word-topic histograms, i.e., N(word|topic).
  • lda.global_topic_hist: This file contains the global topic histogram, i.e., N(topic).
  • lda.hyper_params: This file contains the hyperparameters, i.e., alpha and beta.
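To show how these three pieces fit together, the sketch below computes the smoothed estimate p(word|topic) = (N(word|topic) + beta) / (N(topic) + V * beta), where V is the vocabulary size. The in-memory layout (a dict of per-topic count lists) is an assumption for illustration; the files themselves are protobuf-serialized and are not parsed here.

```python
def word_given_topic(topic_word_hist, global_topic_hist, beta, vocab_size):
    """Return {word: [p(word|topic) for each topic]} with Dirichlet smoothing."""
    num_topics = len(global_topic_hist)
    probs = {}
    for word, counts in topic_word_hist.items():
        probs[word] = [
            (counts[t] + beta) / (global_topic_hist[t] + vocab_size * beta)
            for t in range(num_topics)
        ]
    return probs


# Toy histograms for a 3-word vocabulary and 2 topics (made-up counts).
topic_word_hist = {"topic": [3, 0], "model": [1, 1], "gibbs": [0, 4]}
global_topic_hist = [4, 5]   # per-topic totals of topic_word_hist
beta = 0.01

phi = word_given_topic(topic_word_hist, global_topic_hist, beta, len(topic_word_hist))
```

When every vocabulary word is present in `topic_word_hist`, the probabilities for each topic sum to 1.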
2) Checkpoint

Every --save_checkpoint_interval iterations, the lda_trainer dumps the current checkpoint for fault tolerance. The checkpoint mainly includes two types of files.

  • LDA Model: See above.
  • Corpus: This directory contains serialized documents.
3) Topic words
  • lda.topic_words: This file contains the most likely words of each topic. The number of top words per topic depends on --topic_word_accumulated_prob_threshold.
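One plausible reading of the accumulated probability threshold is sketched here: words are ranked by p(word|topic) and taken until their cumulative probability reaches the threshold. The exact stopping rule used by the trainer is an assumption, and the function name is hypothetical.

```python
def top_words(word_probs, accumulated_prob_threshold):
    """Select top words of one topic; word_probs maps word -> p(word|topic)."""
    ranked = sorted(word_probs.items(), key=lambda item: item[1], reverse=True)
    selected, accumulated = [], 0.0
    for word, prob in ranked:
        if accumulated >= accumulated_prob_threshold:
            break
        selected.append((word, prob))
        accumulated += prob
    return selected


topic = {"gibbs": 0.5, "sampling": 0.3, "model": 0.15, "the": 0.05}
words = [w for w, _ in top_words(topic, 0.7)]
```

A higher threshold keeps more words per topic; a threshold of 1.0 or more keeps the whole vocabulary.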

3. Inference

Please refer to the example:

Note that we strongly recommend using the MultiChainGibbsSampler class, which offers a good trade-off between efficiency and effectiveness.
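To illustrate what multi-chain sampling means in general, the sketch below runs several independent Gibbs chains over one document with a fixed p(word|topic) table and averages their topic distributions. It is a generic toy example, not the implementation or API of the MultiChainGibbsSampler class; all names, parameters, and numbers are assumptions.

```python
import random


def run_chain(doc, phi, alpha, iterations, burn_in, rng):
    """One Gibbs chain; returns the averaged topic distribution of doc."""
    num_topics = len(phi)
    assignments = [rng.randrange(num_topics) for _ in doc]
    counts = [0] * num_topics
    for z in assignments:
        counts[z] += 1
    accumulated = [0.0] * num_topics
    for it in range(iterations):
        for i, word in enumerate(doc):
            counts[assignments[i]] -= 1   # remove the current assignment
            weights = [(counts[t] + alpha) * phi[t][word] for t in range(num_topics)]
            threshold = rng.random() * sum(weights)
            z = 0
            cumulative = weights[0]
            while cumulative < threshold and z < num_topics - 1:
                z += 1
                cumulative += weights[z]
            assignments[i] = z
            counts[z] += 1
        if it >= burn_in:                 # only post-burn-in samples contribute
            for t in range(num_topics):
                accumulated[t] += counts[t] + alpha
    total = sum(accumulated)
    return [a / total for a in accumulated]


def multi_chain_infer(doc, phi, alpha, num_chains=5, iterations=50, burn_in=20, seed=0):
    """Average the estimates of several independently seeded chains."""
    num_topics = len(phi)
    mean = [0.0] * num_topics
    for chain in range(num_chains):
        rng = random.Random(seed + chain)
        estimate = run_chain(doc, phi, alpha, iterations, burn_in, rng)
        for t in range(num_topics):
            mean[t] += estimate[t] / num_chains
    return mean


# phi[t][w] = p(word w | topic t) for a 4-word vocabulary (made-up numbers).
phi = [[0.45, 0.45, 0.05, 0.05],
       [0.05, 0.05, 0.45, 0.45]]
doc = [0, 1, 0, 1, 1, 0, 2]   # word ids, mostly typical of topic 0
theta = multi_chain_infer(doc, phi, alpha=0.1)
```

Averaging over chains reduces the variance of a single chain's estimate at the cost of running more sampling passes, which is the efficiency versus effectiveness trade-off mentioned above.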

4. Evaluation

Instead of manual evaluation, we want to evaluate topic quality automatically and filter out meaningless topics to enhance the inference results.



  1. Hyperparameters optimization.
  2. Memory optimization.
  3. More experiments.
  4. Data and model parallelization.



Here are some pointers to other implementations of LDA.

  1. LDA-C: A C implementation of variational EM for latent Dirichlet allocation (LDA), a topic model for text or other discrete data.
  2. GibbsLDA++: A C/C++ implementation of Latent Dirichlet Allocation (LDA) using the Gibbs sampling technique for parameter estimation and inference.
  3. plda/plda+: A parallel C++ implementation of Latent Dirichlet Allocation (LDA).
  4. Mr. LDA: A Latent Dirichlet Allocation topic modeling package based on a Variational Bayesian learning approach using MapReduce and Hadoop, developed by a Cloud Computing Research Team at the University of Maryland, College Park.
  5. Yahoo_LDA: Y!LDA Topic Modelling Framework. It provides a fast C++ implementation of the inference algorithm that can use both multi-core parallelism and multi-machine parallelism on a Hadoop cluster. It can infer about a thousand topics on a million-document corpus while running for a thousand iterations on an eight-core machine in one day.
  6. Mahout: Mahout's goal is to build scalable machine learning libraries.
  7. MALLET: A Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
  8. ompi-lda: OpenMP and MPI Based Parallel Implementation of LDA.
  9. lda-go: Gibbs sampling training and inference of the Latent Dirichlet Allocation model written in Google's Go programming language.
  10. Matlab Topic Modeling Toolbox
  11. lda-j: Java version of LDA-C and a short Java version of Gibbs Sampling for LDA.

Copyright and license

Copyright (c) 2013 python-sparselda project.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this work except in compliance with the License. You may obtain a copy of the License in the LICENSE file, or at:

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.


