================
python-sparselda is a Latent Dirichlet Allocation (LDA) topic modeling package based on the SparseLDA Gibbs sampling inference algorithm. It is written for Python 2.6 or newer in the 2.x series; Python 3.0 and newer are not supported.
Frankly, python-sparselda is just a mini project. We hope it helps you better understand the standard LDA and SparseLDA algorithms. RTFSC for more details. Have fun.
Please use the GitHub issue tracker for python-sparselda at: https://github.com/fandywang/python-sparselda/issues
================
python-sparselda uses protobuf to serialize and persist the LDA model and checkpoints, so you should install protobuf first.
wget https://protobuf.googlecode.com/files/protobuf-2.5.0.tar.bz2
tar -jxvf protobuf-2.5.0.tar.bz2
cd protobuf-2.5.0
./configure
make
sudo make install
cd python
python ./setup.py build
sudo python ./setup.py install
cd python-sparselda/common
protoc -I=. --python_out=. lda.proto
Usage: python lda_trainer.py [options].

Options:
  -h, --help            show this help message and exit
  --corpus_dir=CORPUS_DIR
                        the corpus directory.
  --vocabulary_file=VOCABULARY_FILE
                        the vocabulary file.
  --num_topics=NUM_TOPICS
                        the num of topics.
  --topic_prior=TOPIC_PRIOR
                        the topic prior alpha.
  --word_prior=WORD_PRIOR
                        the word prior beta.
  --total_iterations=TOTAL_ITERATIONS
                        the total iteration.
  --model_dir=MODEL_DIR
                        the model directory.
  --save_model_interval=SAVE_MODEL_INTERVAL
                        the interval to save lda model.
  --topic_word_accumulated_prob_threshold=TOPIC_WORD_ACCUMULATED_PROB_THRESHOLD
                        the accumulated_prob_threshold of topic top words.
  --save_checkpoint_interval=SAVE_CHECKPOINT_INTERVAL
                        the interval to save checkpoint.
  --checkpoint_dir=CHECKPOINT_DIR
                        the checkpoint directory.
  --compute_loglikelihood_interval=COMPUTE_LOGLIKELIHOOD_INTERVAL
                        the interval to compute loglikelihood.
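
As a rough illustration of how these options fit together, the sketch below assembles a command line for lda_trainer.py. Only the flag names come from the help text above; every path and value is a hypothetical placeholder, and whether all flags are required is not stated here.

```python
# Hypothetical values; only the flag names are taken from the help text.
options = [
    ("corpus_dir", "testdata/corpus"),
    ("vocabulary_file", "testdata/vocabulary.dat"),
    ("num_topics", 10),
    ("topic_prior", 0.1),
    ("word_prior", 0.01),
    ("total_iterations", 100),
    ("model_dir", "output/model"),
    ("save_model_interval", 50),
    ("topic_word_accumulated_prob_threshold", 0.5),
    ("save_checkpoint_interval", 10),
    ("checkpoint_dir", "output/checkpoint"),
    ("compute_loglikelihood_interval", 10),
]


def build_command(options):
    """Return the argv list for lda_trainer.py given (flag, value) pairs."""
    command = ["python", "lda_trainer.py"]
    for flag, value in options:
        command.append("--%s=%s" % (flag, value))
    return command


command = build_command(options)
print(" ".join(command))
```

The resulting list can be passed to subprocess.call, or the printed line can be pasted into a shell.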
The corpus for training/estimating the model has the following line format:
[document1]
[document2]
...
[documentM]
in which each line is one document. [documenti] is the ith document of the dataset that consists of a list of Ni words/terms.
[documenti] = [wordi1]\t[wordi2]\t...\t[wordiNi]
in which all [wordij] <i=1...M, j=1...Ni> are text strings and they are separated by the tab character.
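
The format above can be read with a few lines of Python. This is a minimal sketch, not the package's own loader: the function name parse_corpus_lines is made up for illustration, and skipping blank lines and empty fields is an assumption.

```python
import io


def parse_corpus_lines(lines):
    """Turn tab-separated lines into a list of documents (lists of words)."""
    documents = []
    for line in lines:
        line = line.rstrip("\r\n")
        # Drop empty fields that arise from doubled or trailing tabs.
        words = [word for word in line.split("\t") if word]
        if words:  # skip blank lines (an assumption, see above)
            documents.append(words)
    return documents


# Two documents, one per line, words separated by tabs.
sample = io.StringIO(u"apple\tbanana\tapple\nmachine\tlearning\n")
documents = parse_corpus_lines(sample)
print(documents)
```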
Note that the terms document and word here are abstract and should not be understood only as normal text documents. This is because LDA can be used to discover the underlying topic structure of any kind of discrete data. Therefore, python-sparselda is not limited to text and natural language processing; it can also be applied to other kinds of data, such as images.
Also, keep in mind that for text/Web data collections, you should first preprocess the data (e.g., word segmentation, removing stopwords and rare words, stemming, etc.) before estimating with python-sparselda.
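
A minimal preprocessing sketch is shown below. It is not part of python-sparselda. It assumes whitespace-delimited text (languages such as Chinese need a real word segmenter), uses a toy stopword list, and leaves out stemming; the min_count cutoff is a hypothetical parameter.

```python
def preprocess(raw_documents, stopwords, min_count):
    """Lowercase, split on whitespace, drop stopwords and rare words."""
    tokenized = [[w for w in text.lower().split() if w not in stopwords]
                 for text in raw_documents]
    # Count corpus-wide occurrences so rare words can be removed.
    counts = {}
    for words in tokenized:
        for word in words:
            counts[word] = counts.get(word, 0) + 1
    return [[w for w in words if counts[w] >= min_count]
            for words in tokenized]


raw = ["The cat sat on the mat", "The dog chased the cat", "A bird sang"]
stopwords = set(["the", "a", "on"])
docs = preprocess(raw, stopwords, min_count=2)

# Emit the tab-separated corpus format, skipping documents left empty.
lines = ["\t".join(words) for words in docs if words]
print("\n".join(lines))
```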
The vocabulary for training/estimating the model has the following line format:
[word1]
[word2]
...
[wordV]
in which each line is a unique word. Only words that appear in the vocabulary are considered for parameter estimation.
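
The effect of the vocabulary can be sketched as follows. The function names and the mapping of words to integer ids in file order are assumptions made for illustration; the package's internal representation may differ.

```python
def load_vocabulary(lines):
    """Map each unique word to an integer id, in file order."""
    vocabulary = {}
    for line in lines:
        word = line.strip()
        if word and word not in vocabulary:
            vocabulary[word] = len(vocabulary)
    return vocabulary


def filter_document(words, vocabulary):
    """Keep only in-vocabulary words, replaced by their ids."""
    return [vocabulary[w] for w in words if w in vocabulary]


vocabulary = load_vocabulary(["apple\n", "banana\n", "cherry\n"])
# "durian" is not in the vocabulary, so it is ignored.
word_ids = filter_document(["apple", "durian", "cherry", "apple"], vocabulary)
print(word_ids)
```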
The LDA model includes three files.
- lda.topic_word_hist: This file contains the word-topic histograms, i.e., N(word|topic).
- lda.global_topic_hist: This file contains the global topic histogram, i.e., N(topic).
- lda.hyper_params: This file contains the hyperparameters, i.e., alpha and beta.
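
To make the two histograms concrete, the sketch below builds them from a toy list of (word, topic) assignments and derives a smoothed p(word|topic) from them with the usual collapsed Gibbs estimate (N(word|topic) + beta) / (N(topic) + V * beta). The in-memory dict layout is an assumption for illustration, not the protobuf format stored on disk, and a symmetric beta is assumed.

```python
def build_histograms(assignments):
    """assignments: iterable of (word, topic) pairs, one per token."""
    topic_word_hist = {}    # word -> {topic: count}, i.e. N(word|topic)
    global_topic_hist = {}  # topic -> count, i.e. N(topic)
    for word, topic in assignments:
        per_word = topic_word_hist.setdefault(word, {})
        per_word[topic] = per_word.get(topic, 0) + 1
        global_topic_hist[topic] = global_topic_hist.get(topic, 0) + 1
    return topic_word_hist, global_topic_hist


def word_given_topic(word, topic, topic_word_hist, global_topic_hist,
                     beta, vocab_size):
    """Smoothed estimate of p(word|topic) with a symmetric prior beta."""
    count = topic_word_hist.get(word, {}).get(topic, 0)
    total = global_topic_hist.get(topic, 0)
    return (count + beta) / (total + vocab_size * beta)


assignments = [("apple", 0), ("apple", 0), ("banana", 1), ("apple", 1)]
topic_word_hist, global_topic_hist = build_histograms(assignments)

# Probabilities over the two-word vocabulary for topic 0.
probs = [word_given_topic(w, 0, topic_word_hist, global_topic_hist, 0.01, 2)
         for w in ("apple", "banana")]
print(topic_word_hist, global_topic_hist, probs)
```

Summing N(word|topic) over all words for a fixed topic gives N(topic), which is why the global histogram serves as the normalizer.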
Every --save_checkpoint_interval iterations, the lda_trainer will dump the current checkpoint for fault tolerance.
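
The interval options can be pictured with the schedule sketch below. It only lists which action would fire at which iteration; counting iterations from 1 and triggering on exact multiples are assumptions, and the real trainer may count differently.

```python
def plan_actions(total_iterations, save_model_interval,
                 save_checkpoint_interval, compute_loglikelihood_interval):
    """List (iteration, action) pairs for a trainer using these intervals."""
    actions = []
    for iteration in range(1, total_iterations + 1):
        # One Gibbs sampling sweep over the corpus would run here.
        if iteration % compute_loglikelihood_interval == 0:
            actions.append((iteration, "compute_loglikelihood"))
        if iteration % save_checkpoint_interval == 0:
            actions.append((iteration, "save_checkpoint"))
        if iteration % save_model_interval == 0:
            actions.append((iteration, "save_model"))
    return actions


actions = plan_actions(total_iterations=6, save_model_interval=3,
                       save_checkpoint_interval=2,
                       compute_loglikelihood_interval=6)
for iteration, action in actions:
    print("%d: %s" % (iteration, action))
```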
The checkpoint mainly includes two types of files.
- LDA Model: See above.
- Corpus: This directory contains serialized documents.
- lda.topic_words: This file contains the most likely words of each topic. The number of top words per topic depends on --topic_word_accumulated_prob_threshold.
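
One plausible reading of the accumulated probability threshold is sketched below: words are taken in descending order of p(word|topic) until their cumulative probability reaches the threshold. The function name and the exact stopping rule (inclusive of the word that crosses the threshold) are assumptions; check the source for the precise behavior.

```python
def top_words(word_probs, accumulated_prob_threshold):
    """Return the most probable words up to the cumulative threshold."""
    ranked = sorted(word_probs.items(), key=lambda item: item[1],
                    reverse=True)
    selected = []
    accumulated = 0.0
    for word, prob in ranked:
        selected.append((word, prob))
        accumulated += prob
        if accumulated >= accumulated_prob_threshold:
            break  # enough probability mass has been covered
    return selected


# Toy p(word|topic) for a single topic.
topic = {"apple": 0.5, "banana": 0.3, "cherry": 0.15, "durian": 0.05}
selected = top_words(topic, 0.7)
print(selected)
```

A higher threshold yields longer word lists for flat topics and shorter lists for peaked ones.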
Please refer to the example: lda_inferencer.py.
Note that we strongly recommend using the MultiChainGibbsSampler class as a trade-off between efficiency and effectiveness.
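
The general idea behind multi-chain inference can be sketched as averaging the topic distributions produced by several independent Gibbs chains for the same document, which reduces the variance of a single short chain. The code below is a generic illustration under that assumption. It does not use or describe the MultiChainGibbsSampler interface, and the per-chain distributions are made-up numbers.

```python
def average_chains(chain_distributions):
    """Average per-chain topic distributions (dicts topic -> probability)."""
    num_chains = float(len(chain_distributions))
    averaged = {}
    for distribution in chain_distributions:
        for topic, prob in distribution.items():
            averaged[topic] = averaged.get(topic, 0.0) + prob / num_chains
    return averaged


# Hypothetical results of three chains run on the same document.
chains = [{0: 0.7, 1: 0.3}, {0: 0.5, 1: 0.5}, {0: 0.6, 1: 0.4}]
averaged = average_chains(chains)
print(averaged)
```

More chains cost more time per document, which is the efficiency side of the trade-off.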
Instead of evaluating topics manually, we want to evaluate topic quality automatically and filter out a few meaningless topics to enhance the inference results.
================
- Hyperparameters optimization.
- Memory optimization.
- More experiments.
- Data and model parallelization.
================
- D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 2003.
- Gregor Heinrich. Parameter estimation for text analysis. Technical Note, 2004.
- T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences (PNAS), 2004.
- I. Porteous, D. Newman, A. Ihler, A. Asuncion, P. Smyth, and M. Welling. Fast collapsed Gibbs sampling for latent Dirichlet allocation. In SIGKDD, 2008.
- Limin Yao, David Mimno, Andrew McCallum. Efficient methods for topic model inference on streaming document collections, In SIGKDD, 2009.
- Newman et al. Distributed Inference for Latent Dirichlet Allocation, NIPS 2007.
- X. Wei, W. Bruce Croft. LDA-based document models for ad hoc retrieval. In Proc. SIGIR. 2006.
- Rickjin, LDA 数学八卦. Technical Note, 2013.
- Yi Wang, Hongjie Bai, Matt Stanton, Wen-Yen Chen, and Edward Y. Chang. PLDA: Parallel Latent Dirichlet Allocation for Large-scale Applications. AAIM 2009.
===============
Here are some pointers to other implementations of LDA.
- LDA-C: A C implementation of variational EM for latent Dirichlet allocation (LDA), a topic model for text or other discrete data.
- GibbsLDA++: A C/C++ implementation of Latent Dirichlet Allocation (LDA) using Gibbs Sampling technique for parameter estimation and inference.
- plda/plda+: A parallel C++ implementation of Latent Dirichlet Allocation (LDA).
- Mr. LDA: A Latent Dirichlet Allocation topic modeling package based on a Variational Bayesian learning approach using MapReduce and Hadoop, developed by a Cloud Computing Research Team at the University of Maryland, College Park.
- Yahoo_LDA: Y!LDA Topic Modelling Framework. It provides a fast C++ implementation of the inference algorithm that can use both multi-core parallelism and multi-machine parallelism on a Hadoop cluster. It can infer about a thousand topics on a million-document corpus while running for a thousand iterations on an eight-core machine in one day.
- Mahout: Mahout's goal is to build scalable machine learning libraries.
- MALLET: A Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
- ompi-lda: OpenMP and MPI Based Parallel Implementation of LDA.
- lda-go: Gibbs sampling training and inference of the Latent Dirichlet Allocation model written in Google's Go programming language.
- Matlab Topic Modeling Toolbox
- lda-j: Java version of LDA-C and a short Java version of Gibbs Sampling for LDA.
==============================
Copyright (c) 2013 python-sparselda project.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this work except in compliance with the License. You may obtain a copy of the License in the LICENSE file, or at:
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.