Skip to content

funkboy/FanLDA

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

FanLDA

This example is a simple JAVA implementation of variational expectation maximization inference for Latent Dirichlet Allocation(LDA) proposed in the original paper paper by David Blei, Andrew Ng, and Michael Jordan: http://www.cs.berkeley.edu/~blei/papers/blei03a.pdf.

data folder contains a formatted corpus and the original text file of 2245 documents from Associated Press -- acquired from David M. Blei: http://www.cs.princeton.edu/~blei/lda-c/index.html

sample_data contains a list of 3 plain text files, can be used as an example of processing plain text files into formatted data

Data format

Under LDA, the words of each document are assumed exchangeable. Thus, each document is succinctly represented as a sparse vector of word counts. The data is a file where each line is of the form:

 [M] [term_1]:[count] [term_2]:[count] ...  [term_N]:[count]

where [M] is the number of unique terms in the document, and the [count] associated with each term is how many times that term appeared in the document. Note that [term_1] is an integer which indexes the term; it is not a string.

The main function of Utils.java helps convert a plain text file/a list od plain text file into the formatted data set.

You are free to redistribute and/or modify this code for research and acamedic purpose. Please contact liufan.seu@gmail.com if you want use this code for publication or commercial use.

About

implementation of Latent Dirichlet Allocation(LDA)

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages