Rethinking LDA: moment matching for Discrete ICA
This project contains a Matlab/C++ implementation of the algorithms and the scripts for reproducing all the experiments from the paper:
- A. Podosinnikova, F. Bach, S. Lacoste-Julien. Rethinking LDA: moment matching for discrete ICA. NIPS, 2015.
Please cite this paper if you use this code for your research.
If you are only interested in the implementations of the algorithms, but not reproducing the experiments, check this repo.
This project contains implementations of several moment matching algorithms for topic modeling. In brief, these algorithms are based on the construction of moment/cumulant tensors from the data and matching them to the respective theoretical expressions in order to learn the parameters of the model.
The implementation of the algorithms consists of two parts. One part efficiently constructs the moment/cumulant tensors, while the other part implements several so-called joint diagonalization type algorithms used for matching the tensors. Either tensor type (see below) can be combined with any of the three diagonalization algorithms, giving 6 algorithms in total.
The focus is on the latent Dirichlet allocation (LDA) and discrete independent component analysis (DICA) models. Importantly, the latter model was shown to be similar and sometimes equivalent to the former. Accordingly, two types of tensors are considered: the LDA moments and the DICA cumulants. The theoretical expressions for the LDA moments were previously derived by Anima Anandkumar, Dean P. Foster, Daniel Hsu, Sham M. Kakade, and Yi-Kai Liu in A Spectral Algorithm for Latent Dirichlet Allocation. Algorithmica 72(1): 193-214 (2015). The expressions for the DICA cumulants are derived in our paper (cited above).
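As a rough illustration of what "constructing a cumulant from the data" means, the sketch below estimates a second-order DICA cumulant from a small matrix of word counts. It assumes the cumulant has the form S = cov(x) - diag(E[x]), as in the cited paper, and uses the unbiased sample covariance. The function name `dica_second_cumulant` and the dense `Matrix` type are hypothetical and are not taken from this repository, whose implementation is more efficient than this dense version.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Hypothetical helper (not part of this repository).
// Input: N documents, each a row of M word counts.
// Output: M x M estimate of S = cov(x) - diag(E[x]).
Matrix dica_second_cumulant(const Matrix& counts) {
    const std::size_t n = counts.size();
    assert(n >= 2);  // the unbiased covariance needs at least two documents
    const std::size_t m = counts[0].size();

    // Sample mean of the word counts.
    std::vector<double> mean(m, 0.0);
    for (const auto& doc : counts) {
        assert(doc.size() == m);
        for (std::size_t i = 0; i < m; ++i) mean[i] += doc[i] / n;
    }

    // Unbiased sample covariance of the word counts.
    Matrix s(m, std::vector<double>(m, 0.0));
    for (const auto& doc : counts) {
        for (std::size_t i = 0; i < m; ++i) {
            for (std::size_t j = 0; j < m; ++j) {
                s[i][j] += (doc[i] - mean[i]) * (doc[j] - mean[j]) / (n - 1);
            }
        }
    }

    // Subtract the mean from the diagonal.
    for (std::size_t i = 0; i < m; ++i) s[i][i] -= mean[i];
    return s;
}
```

Calling `dica_second_cumulant` on a dense count matrix costs O(N M^2) time and O(M^2) memory, so this form is only practical for small vocabularies.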
The diagonalization type algorithms include the spectral algorithm (spectral), based on two eigendecompositions; the orthogonal joint diagonalization algorithm (jd); and the tensor power method (tpm).
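To give a feel for the tensor power method, here is a minimal sketch of its core iteration, u <- T(I, u, u) / ||T(I, u, u)||, on a tiny symmetric third-order tensor that is assumed to be orthogonally decomposable already. Whitening, random restarts, and deflation, which a full method typically needs, are omitted. All names (`Tensor3`, `contract`, `tensor_power_iteration`) and the fixed dimension `K = 2` are hypothetical and do not mirror the tpm code in this project.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

constexpr std::size_t K = 2;  // illustrative dimension only
using Vec = std::array<double, K>;
using Tensor3 = std::array<std::array<std::array<double, K>, K>, K>;

// T(I, u, u): contract the last two modes of T with u.
Vec contract(const Tensor3& t, const Vec& u) {
    Vec out{};
    for (std::size_t i = 0; i < K; ++i)
        for (std::size_t j = 0; j < K; ++j)
            for (std::size_t k = 0; k < K; ++k)
                out[i] += t[i][j][k] * u[j] * u[k];
    return out;
}

double euclidean_norm(const Vec& v) {
    double sum = 0.0;
    for (double x : v) sum += x * x;
    return std::sqrt(sum);
}

// Hypothetical helper: runs the power iteration from the start vector u,
// overwrites u with the final unit vector, and returns T(u, u, u).
double tensor_power_iteration(const Tensor3& t, Vec& u, int iterations) {
    double len = euclidean_norm(u);
    assert(len > 0.0);
    for (double& x : u) x /= len;  // start from a unit vector

    for (int it = 0; it < iterations; ++it) {
        const Vec next = contract(t, u);
        len = euclidean_norm(next);
        assert(len > 0.0);
        for (std::size_t i = 0; i < K; ++i) u[i] = next[i] / len;
    }

    // Eigenvalue estimate T(u, u, u).
    const Vec tu = contract(t, u);
    double lambda = 0.0;
    for (std::size_t i = 0; i < K; ++i) lambda += tu[i] * u[i];
    return lambda;
}
```

For a tensor of the form T = sum_i lambda_i v_i (x) v_i (x) v_i with orthonormal v_i and positive lambda_i, the iteration converges to the component with the largest value of lambda_i times the start vector's coordinate along v_i.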
- make sure your Matlab recognizes a C++ compiler:
- save all required paths and build mex files:
- reproduce experiments:
- sample the semi-toy data from our experiments:
- the real data from our experiments can be found in
- make the plots without rerunning the experiments:
- when finished, remove all paths with
Please don't hesitate to contact me with questions regarding this code, usage of this algorithm, or bug reports.