# LDA
Latent Dirichlet Allocation is the most commonly used modern topic model and forms the foundation for a large number of extensions. LDA is described in Blei et al. 2003. TMA uses Dave Blei's LDA implementation.
# Input
- number of topics
- topic initialization
- Dirichlet hyperparameter
- alpha technique
- variational convergence threshold
- maximum variational iterations
- EM convergence threshold
#### number of topics
The number of topics (thematic groups) present in the analysis data. Several techniques exist for setting this parameter; a common one is to evaluate the held-out likelihood or perplexity over a range of candidate values, as sketched below.
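A minimal sketch of that model-selection loop, assuming scikit-learn and the 20 Newsgroups corpus purely as stand-ins (TMA itself uses Blei's LDA implementation, not scikit-learn; the vectorizer settings and candidate topic counts are arbitrary):

```python
# Illustrative sketch only: refit at several candidate topic counts and compare
# held-out perplexity. Lower held-out perplexity generally suggests a better fit.
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

docs = fetch_20newsgroups(remove=("headers", "footers", "quotes")).data[:2000]
counts = CountVectorizer(max_features=5000, stop_words="english").fit_transform(docs)
train, heldout = train_test_split(counts, test_size=0.2, random_state=0)

for k in (10, 25, 50, 100):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(train)
    print(k, lda.perplexity(heldout))
```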
#### topic initialization
This parameter describes how the topics will be initialized:
- "random" initializes each topic randomly
- "seeded" initializes each topic to a distribution smoothed from a randomly chosen document
#### Dirichlet hyperparameter
The Dirichlet concentration hyperparameter determines whether a topic tends to spread its probability over a larger number of terms (larger values) or concentrate it on a few terms (smaller values). A small numerical illustration follows the notes below.
- Note: the TMA default is 50 / (number of topics); this heuristic likely stems from using roughly the average number of words per file divided by the number of topics.
- Note: it is possible to enable sampling of this parameter.
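A quick numerical illustration of the concentration effect (plain NumPy, independent of TMA; the vocabulary size and concentration values are arbitrary):

```python
# Smaller Dirichlet concentration values yield sparser draws (mass piled on a few
# components); larger values spread the mass more evenly across components.
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 10
for concentration in (0.01, 1.0, 50.0):
    draw = rng.dirichlet([concentration] * vocab_size)
    print(f"concentration={concentration:>5}: largest single mass = {draw.max():.2f}")
```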
#### alpha technique
This parameter determines whether to iteratively 'estimate' the Dirichlet parameter or keep it 'fixed':
- 'estimate': use the alpha estimation technique discussed in Blei et al. 2003
- 'fixed': use the value provided for the Dirichlet hyperparameter
#### variational convergence threshold
The convergence criterion for variational inference. Stop if (score_old - score) / abs(score_old) is less than this value (or after the maximum number of iterations); see the sketch under "EM convergence threshold" below. Note that the score is the lower bound on the likelihood for a particular document.
#### maximum variational iterations
The maximum number of iterations of coordinate ascent variational inference for a single document, as described in Blei et al. 2003. Providing a value of -1 indicates that TMA should perform "full" variational inference, until the variational convergence criterion is met.
#### EM convergence threshold
The convergence criterion for variational EM. Stop if (score_old - score) / abs(score_old) is less than this value (or after the maximum number of iterations). Note that "score" here is the lower bound on the likelihood for the whole corpus.
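Both thresholds describe the same kind of relative-change stopping rule. The helper below is a simplified sketch, not TMA or LDA-C code; it takes the magnitude of the relative change for clarity and treats a maximum iteration count of -1 as "no cap", mirroring the option above:

```python
def run_until_converged(step, threshold, max_iters):
    """Repeatedly call step() (which returns the current likelihood lower bound)
    until the relative change falls below threshold or max_iters is reached."""
    score_old = step()
    iteration = 1
    while max_iters < 0 or iteration < max_iters:
        score = step()
        # Stop once the relative improvement of the bound becomes negligible.
        if abs(score - score_old) / abs(score_old) < threshold:
            break
        score_old = score
        iteration += 1
    return score_old
```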
# Output
#### beta
The beta output file is a (number of topics) x (vocabulary size) matrix that provides the smoothed log probability of each vocabulary term under the given topic.
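As a usage sketch, the top terms per topic can be read off by sorting each row. This assumes the beta file is a whitespace-separated matrix with one row per topic, and that a vocabulary listing (here a hypothetical vocab.txt, one term per line, in matching order) is available:

```python
import numpy as np

log_beta = np.loadtxt("final.beta")                   # (number of topics) x (vocabulary size); file name assumed
vocab = [line.strip() for line in open("vocab.txt")]  # hypothetical vocabulary file

for k, row in enumerate(log_beta):
    top = np.argsort(row)[::-1][:10]                  # indices of the 10 highest log probabilities
    print(f"topic {k}:", ", ".join(vocab[i] for i in top))
```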
#### gamma
The gamma output file is a (number of documents) x (number of topics) matrix that provides the variational posterior Dirichlet parameters. Normalizing each row gives the document's position on the topic simplex (its expected topic proportions), as sketched below.
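A small sketch of that normalization (the file name and whitespace-separated format are assumed):

```python
import numpy as np

gamma = np.loadtxt("final.gamma")                     # (number of documents) x (number of topics)
doc_topic = gamma / gamma.sum(axis=1, keepdims=True)  # rows now sum to 1 (topic simplex)
print(doc_topic[0])                                   # expected topic proportions for document 0
```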
# Relationships
The data relationships determined by LDA are described generally in Data Relationships. The two necessary components of these relationship calculations are a document x topics probability matrix and a topics x terms probability matrix.
#### document-topic
The document-topic matrix used to determine the data relationships is taken directly from the gamma output file.
#### topic-term
The topic-term matrix used to determine the data relationships is taken directly from the beta output file.
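For illustration only (not TMA's exact relationship calculation), once the two probability matrices are assembled, pairwise scores such as document-document similarity follow directly, e.g. via cosine similarity of topic mixtures; file names and formats are assumed as above:

```python
import numpy as np

doc_topic = np.loadtxt("final.gamma")
doc_topic /= doc_topic.sum(axis=1, keepdims=True)  # documents x topics probabilities
topic_term = np.exp(np.loadtxt("final.beta"))      # topics x terms probabilities (from log space)

unit = doc_topic / np.linalg.norm(doc_topic, axis=1, keepdims=True)
doc_doc_similarity = unit @ unit.T                 # cosine similarity between document topic mixtures
print(doc_doc_similarity.shape)
```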
# Perplexity
WARNING: The perplexity calculations have not been fully vetted yet -- do not rely on these calculations until this warning is removed.
Determining the held-out test perplexity is described generally in Perplexity. For LDA, we determine the held-out perplexity by evaluating the held-out data likelihood via inf (inference) mode with Blei's LDA code. [TODO expand this section with more detail]
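The usual definition, shown below as a sketch, is the exponential of the negative per-token held-out log likelihood. The file names are hypothetical placeholders for the per-document likelihood bounds produced by inf mode and the per-document token counts:

```python
import numpy as np

heldout_loglik = np.loadtxt("heldout-lda-lhood.dat")     # hypothetical: per-document bounds from inf mode
heldout_tokens = np.loadtxt("heldout-token-counts.dat")  # hypothetical: tokens per held-out document

perplexity = np.exp(-heldout_loglik.sum() / heldout_tokens.sum())
print(f"held-out perplexity: {perplexity:.1f}")
```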