cjrd edited this page Jun 13, 2012 · 11 revisions

Latent Dirichlet Allocation (LDA) is the most commonly used modern topic model and forms the foundation for a large number of extensions. LDA is described in Blei et al. (2003). TMA uses Dave Blei's LDA implementation.

# Input

#### number of topics

The number of topics (thematic groups) assumed to be present in the analyzed data. Several techniques exist for setting this parameter; a common one is to evaluate the held-out likelihood or perplexity across candidate values and pick the best.
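
A minimal sketch of that perplexity-based selection, where `fit_and_score` is a hypothetical stand-in for training an LDA model with k topics and computing its held-out perplexity:

```python
def choose_num_topics(candidate_ks, fit_and_score):
    # Evaluate each candidate number of topics and keep the one with
    # the lowest held-out perplexity (lower is better).
    scores = {k: fit_and_score(k) for k in candidate_ks}
    return min(scores, key=scores.get)

# Toy illustration with a fake perplexity curve that bottoms out at k = 20:
fake_scores = {10: 1500.0, 20: 1200.0, 40: 1300.0, 80: 1450.0}
best_k = choose_num_topics([10, 20, 40, 80], fake_scores.get)
```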

#### topic initialization

This parameter determines how the topics are initialized:

* "random" initializes each topic randomly
* "seeded" initializes each topic to a distribution smoothed from a randomly chosen document
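
The two modes might be sketched as follows (an illustration with assumed inputs, not Blei's actual initialization code; `doc_term_counts` is a hypothetical term-count vector for the chosen document):

```python
import random

def init_topic(vocab_size, mode, doc_term_counts=None, smooth=1.0, rng=random):
    # Return one topic as a normalized distribution over the vocabulary.
    if mode == "random":
        weights = [rng.random() for _ in range(vocab_size)]
    elif mode == "seeded":
        # Smooth the term counts of a randomly chosen document.
        weights = [c + smooth for c in doc_term_counts]
    else:
        raise ValueError(mode)
    total = sum(weights)
    return [w / total for w in weights]

topic = init_topic(3, "seeded", doc_term_counts=[8.0, 0.0, 1.0])
```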

#### Dirichlet hyperparameter

The Dirichlet concentration hyperparameter determines the tendency of topics to concentrate their probability in a few terms (smaller values) or distribute it over a larger number of terms (larger values).

* Note: the TMA default is 50 / (number of topics), a commonly used heuristic for this parameter.
* Note: it is possible to enable sampling of this parameter.
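
The effect of the concentration value can be illustrated by sampling a symmetric Dirichlet directly (a sketch using only the standard library; normalizing independent Gamma draws is the standard Dirichlet sampling construction):

```python
import random

def sample_dirichlet(alpha, k, rng):
    # Normalize independent Gamma(alpha, 1) draws -- the standard
    # construction of a symmetric Dirichlet(alpha) sample.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)
concentrated = sample_dirichlet(0.01, 10, rng)  # mass piles onto few terms
spread = sample_dirichlet(50.0, 10, rng)        # mass spreads more evenly
```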

#### alpha technique

This parameter determines whether to iteratively 'estimate' the Dirichlet parameter during EM or keep it 'fixed' at its initial value.

#### variational convergence threshold

The convergence criterion for variational inference. Stop if (score_old - score) / abs(score_old) is less than this value (or after the maximum number of iterations). Note that the score is the lower bound on the log likelihood for a particular document.
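
In code, the criterion amounts to a relative-change check on the bound (a sketch of the rule stated above, using the magnitude of the change):

```python
def has_converged(score_old, score, threshold):
    # Relative change in the lower bound between successive iterations;
    # stop once it falls below the threshold.
    return abs(score_old - score) / abs(score_old) < threshold
```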

#### maximum variational iterations

The maximum number of iterations of coordinate ascent variational inference for a single document, as described in Blei et al. (2003). A value of -1 indicates that TMA should perform "full" variational inference, iterating until the variational convergence criterion is met.
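
Combining the threshold with the iteration cap, the loop can be sketched generically (`step` is a hypothetical placeholder for one round of coordinate-ascent updates returning the new bound):

```python
def run_until_converged(step, score, threshold, max_iter):
    # Iterate until the relative change in the lower bound drops below
    # `threshold`, or until `max_iter` steps have run; max_iter == -1
    # means no cap ("full" variational inference).
    i = 0
    while max_iter == -1 or i < max_iter:
        new_score = step(score)
        if abs(score - new_score) / abs(score) < threshold:
            return new_score
        score = new_score
        i += 1
    return score

# Toy step that halves the bound's distance from zero each iteration;
# the relative change stays at 0.5, so the iteration cap is what stops it:
result = run_until_converged(lambda s: s / 2.0, -100.0, 1e-3, 10)
```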

#### EM convergence threshold

The convergence criterion for variational EM. Stop if (score_old - score) / abs(score_old) is less than this value (or after the maximum number of iterations). Note that the "score" here is the lower bound on the log likelihood for the whole corpus.

# Output

#### beta

The beta output file is a (number of topics) x (vocabulary size) matrix giving the smoothed log probability of each vocabulary term under each topic.
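
Since beta stores log values, exponentiating a row recovers that topic's term distribution; for instance, listing a topic's highest-probability terms (a sketch with a hypothetical three-word vocabulary):

```python
import math

def top_terms(beta_row, vocab, n):
    # Rank vocabulary indices by log probability and return the top-n
    # terms with their probabilities (exp of the stored log values).
    ranked = sorted(range(len(vocab)), key=lambda i: beta_row[i], reverse=True)
    return [(vocab[i], math.exp(beta_row[i])) for i in ranked[:n]]

beta_row = [math.log(0.2), math.log(0.7), math.log(0.1)]
terms = top_terms(beta_row, ["cat", "dog", "fish"], 2)
```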

#### gamma

The gamma output file is a (number of documents) x (number of topics) matrix of variational posterior Dirichlet parameters. Each row, once normalized, locates the corresponding document on the topic simplex.
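
Because each gamma row is an unnormalized Dirichlet parameter vector, dividing by the row sum gives the document's expected topic proportions (a sketch with a hypothetical two-topic row):

```python
def topic_proportions(gamma_row):
    # Normalize a row of variational Dirichlet parameters into the
    # document's expected topic proportions (a point on the simplex).
    total = sum(gamma_row)
    return [g / total for g in gamma_row]

props = topic_proportions([2.0, 6.0])
```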

# Relationships

The data relationships determined by LDA are described generally in Data Relationships. The two necessary components of these relationship calculations are a documents x topics probability matrix and a topics x terms probability matrix.

#### document-topic

The document-topic matrix used to determine the data relationships is taken directly from the gamma output file.

#### topic-term

The topic-term matrix used to determine the data relationships is taken directly from the beta output file.
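
As one concrete illustration of how such probability matrices support relationship calculations, two documents' normalized topic rows can be compared with a similarity measure such as cosine similarity (a sketch only; TMA's actual measure is defined in Data Relationships):

```python
def cosine_similarity(p, q):
    # Cosine of the angle between two topic-proportion vectors:
    # 1.0 means identical direction, 0.0 means no overlap at all.
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = sum(a * a for a in p) ** 0.5
    norm_q = sum(b * b for b in q) ** 0.5
    return dot / (norm_p * norm_q)
```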

# Perplexity

WARNING: The perplexity calculations have not been fully vetted yet -- do not rely on these calculations until this warning is removed.

Determining the held-out test perplexity is described generally in Perplexity. For LDA, we determine the held-out perplexity by evaluating the held-out data likelihood using the inf (inference) mode of Blei's LDA code [TODO expand this section with more detail].
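
For reference, the standard definition converts a held-out log likelihood into perplexity by exponentiating the negative per-token average (a sketch of the general formula, not TMA's unvetted implementation):

```python
import math

def perplexity(held_out_log_likelihood, num_tokens):
    # perplexity = exp(-log likelihood / number of held-out tokens);
    # lower values indicate a better model of the held-out data.
    return math.exp(-held_out_log_likelihood / num_tokens)

# If every one of 100 tokens had probability 1/2, perplexity is 2:
p = perplexity(100 * math.log(0.5), 100)
```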
