Latent Dirichlet Allocation: Stability and Applications to Studies of User-Generated Content
bibtex:
@inproceedings{koltcov2014latent,
  title={Latent dirichlet allocation: stability and applications to studies of user-generated content},
  author={Koltcov, Sergei and Koltsova, Olessia and Nikolenko, Sergey},
  booktitle={Proceedings of the 2014 ACM conference on Web science},
  pages={161--165},
  year={2014},
  organization={ACM}
}
General:
A good paper that examines the problem of topic instability in LDA.
LDA produces word-topic and topic-document matrices (probabilities of words appearing in topics and of topics appearing in documents).
Inference is done with variational approximations or Gibbs sampling; these algorithms find a local maximum of the joint likelihood function of the dataset.
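As a rough illustration (not from the paper), a minimal sketch of obtaining these two matrices, assuming scikit-learn's LDA, which uses online variational inference rather than Gibbs sampling; the tiny corpus is a placeholder:

```python
# Sketch (assumption: scikit-learn LDA, variational inference; placeholder corpus).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["users post about politics", "users post about sports", "sports and politics news"]

X = CountVectorizer().fit_transform(docs)          # document-term counts
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

word_topic = lda.components_                       # (n_topics, n_words): word weights per topic
topic_document = lda.transform(X)                  # (n_docs, n_topics): topic mixture per document
```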
The LDA approach has been further developed through more complex model extensions with additional parameters and additional information [2, 9, 18, 4].
Problem:
In the case of LDA, there are many local maxima, which may lead to instability in the output.
problem of finding the optimal number of clusters
Since these distributions result from the same dataset with the same vocabulary and model parameters, any differences between them are entirely due to the randomness in Gibbs sampling. This randomness affects perplexity variations, word and document ratios, and the reproducibility of the qualitative topical solution
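A hedged sketch of this instability, again assuming scikit-learn's LDA (where the randomness comes from initialization rather than Gibbs sampling): two runs on the same data that differ only in the random seed generally yield different topic-word matrices.

```python
# Sketch (assumption: scikit-learn LDA; randomness comes from initialization,
# not Gibbs sampling as in the paper). Two runs differ only in the random seed.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["apples and oranges", "oranges and bananas", "cars and trucks", "trucks and buses"]
X = CountVectorizer().fit_transform(docs)

run_a = LatentDirichletAllocation(n_components=2, random_state=1).fit(X)
run_b = LatentDirichletAllocation(n_components=2, random_state=2).fit(X)

# Normalize rows to word distributions per topic and compare: same data and
# parameters, yet the topic-word matrices generally do not coincide.
phi_a = run_a.components_ / run_a.components_.sum(axis=1, keepdims=True)
phi_b = run_b.components_ / run_b.components_.sum(axis=1, keepdims=True)
print(np.abs(phi_a - phi_b).max())
```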
Old Solutions for stability:
A new metric of similarity between topics and a criterion of vocabulary reduction to evaluate stability.
The standard numerical evaluation of topic modeling results is to measure perplexity. Perplexity shows how well the word-topic and topic-document distributions predict new test samples. The smaller the perplexity, the better (less uniform) the LDA model.
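A small sketch of measuring perplexity on held-out documents, assuming scikit-learn's LDA and its perplexity method; the corpus and the train/test split are placeholders:

```python
# Sketch: evaluate a fitted LDA model by perplexity on held-out documents
# (lower is better). Assumes scikit-learn; corpus and split are placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

train_docs = ["politics news today", "sports results today", "election politics debate"]
test_docs = ["sports news", "politics debate"]

vectorizer = CountVectorizer().fit(train_docs)
X_train, X_test = vectorizer.transform(train_docs), vectorizer.transform(test_docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X_train)
print(lda.perplexity(X_test))   # smaller perplexity = better predictive fit
```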
Problem with perplexity:
The value of perplexity drops as the number of topics grows, so it gives little guidance for choosing the number of topics.
Perplexity depends on the dictionary size, so values are not comparable across different vocabularies.
Evaluation Metric:
Symmetric Kullback–Leibler divergence is used to measure the distance between topics.
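For reference, a small helper computing the symmetric Kullback–Leibler divergence between two discrete distributions (e.g., the word distributions of two topics); the smoothing constant is an assumption added to avoid division by zero:

```python
import numpy as np

def symmetric_kl(p, q, eps=1e-12):
    """Symmetric KL divergence: 0.5 * (KL(p || q) + KL(q || p))."""
    p = np.asarray(p, dtype=float) + eps   # eps smoothing is an assumption
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))
```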
Compute the correlation between documents from two topic modeling experiments; unlike perplexity, this correlation does not depend on dictionary size. The method consists of the following steps (a sketch follows the list):
construct a bipartite graph based on two topical solutions;
compute the minimal distance between topics in this bipartite graph;
compare topics between two cluster solutions based on the minimal distance.
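A rough sketch of these steps, under the assumption that topics are represented by their word distributions and that the minimal-distance pairing is found with the Hungarian algorithm; the paper only specifies a bipartite graph with minimal distances, so the assignment routine here is an illustrative choice:

```python
# Sketch: match topics from two runs via a bipartite graph of symmetric KL
# distances. The Hungarian assignment (scipy) is an illustrative choice, not
# necessarily the exact procedure used in the paper.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.special import rel_entr   # elementwise p * log(p / q)

def sym_kl(p, q, eps=1e-12):
    p, q = p + eps, q + eps
    p, q = p / p.sum(), q / q.sum()
    return 0.5 * (rel_entr(p, q).sum() + rel_entr(q, p).sum())

def match_topics(phi_a, phi_b):
    """phi_a, phi_b: (n_topics, n_words) word distributions from two runs."""
    dist = np.array([[sym_kl(pa, pb) for pb in phi_b] for pa in phi_a])
    rows, cols = linear_sum_assignment(dist)         # minimal total distance pairing
    return list(zip(rows, cols)), dist[rows, cols]   # matched pairs and their distances
```

Small matched distances across runs indicate topics that are reproduced stably.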
Preprocessing step:
Document and word ratios show the fraction of words and documents that are actually relevant to specific topics.
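A rough sketch of one way to compute such ratios, assuming a word or document counts as relevant to a topic when its probability exceeds a threshold; the threshold values and this exact definition are assumptions, not the paper's precise formulation:

```python
# Sketch: fraction of documents and words relevant to topics above a probability
# threshold. The thresholds and this definition are assumptions.
import numpy as np

def document_ratio(theta, threshold=0.1):
    """theta: (n_docs, n_topics) topic-document matrix. A document counts as
    relevant if at least one topic exceeds the threshold."""
    return np.mean(theta.max(axis=1) > threshold)

def word_ratio(phi, threshold=0.001):
    """phi: (n_topics, n_words) word-topic matrix (rows sum to 1). A word counts
    as relevant if it exceeds the threshold in at least one topic."""
    return np.mean(phi.max(axis=0) > threshold)
```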