Stability of topics #9

Closed
amritbhanu opened this issue Apr 5, 2016 · 0 comments
Latent Dirichlet Allocation: Stability and Applications to Studies of User-Generated Content

bibtex:

    @inproceedings{koltcov2014latent,
      title        = {Latent Dirichlet allocation: stability and applications to studies of user-generated content},
      author       = {Koltcov, Sergei and Koltsova, Olessia and Nikolenko, Sergey},
      booktitle    = {Proceedings of the 2014 ACM Conference on Web Science},
      pages        = {161--165},
      year         = {2014},
      organization = {ACM}
    }

General:

  • A good paper that demonstrates the problem of topic instability in LDA.
  • LDA outputs word-topic and topic-document matrices (the probabilities of words appearing in topics and of topics appearing in documents).
  • Inference uses variational approximations or Gibbs sampling; these algorithms find a local maximum of the joint likelihood function of the dataset.
  • The LDA approach has been further developed through more complex model extensions with additional parameters and additional information [2, 9, 18, 4].
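The two output matrices mentioned above combine into word probabilities per document. A minimal sketch (toy values, not taken from the paper) of that mixture, where `phi[k][w]` is P(word w | topic k) and `theta_d[k]` is P(topic k | document d):

```python
def word_prob(theta_d, phi, w):
    """P(word w | document d) as a mixture over topics."""
    return sum(theta_d[k] * phi[k][w] for k in range(len(phi)))

# Toy example: 2 topics over a 3-word vocabulary.
phi = [
    [0.7, 0.2, 0.1],   # topic 0: distribution over words
    [0.1, 0.3, 0.6],   # topic 1: distribution over words
]
theta_d = [0.4, 0.6]   # document d mixes topic 0 and topic 1

probs = [word_prob(theta_d, phi, w) for w in range(3)]
print(probs)  # P(w|d) for each vocabulary word; entries sum to 1
```

Because each row of `phi` and `theta_d` is a proper distribution, the resulting word probabilities for the document also sum to 1.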

Problem:

  • In the case of LDA, there are many local maxima, which can lead to instability in the output.
  • The problem of finding the optimal number of clusters (topics) remains open.
  • Since these distributions result from the same dataset with the same vocabulary and model parameters, any differences between them are entirely due to the randomness in Gibbs sampling. This randomness affects perplexity variations, word and document ratios, and the reproducibility of the qualitative topical solution.
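The seed-dependence above can be reproduced with a toy collapsed Gibbs sampler (an illustrative sketch, not the paper's implementation): two runs on identical data and hyperparameters differ only in the random seed, yet generally produce different topic-word counts, possibly with permuted topic ids.

```python
import random

def gibbs_lda(docs, K, V, iters=50, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler; returns the topic-word count matrix."""
    rng = random.Random(seed)
    z = [[rng.randrange(K) for _ in doc] for doc in docs]   # token-topic assignments
    ndk = [[0] * K for _ in docs]                           # doc-topic counts
    nkw = [[0] * V for _ in range(K)]                       # topic-word counts
    nk = [0] * K                                            # tokens per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                                 # remove this token's assignment
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + V * beta)
                           for t in range(K)]
                k = rng.choices(range(K), weights=weights)[0]  # resample from the conditional
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return nkw

# Same data, same hyperparameters, different seeds.
docs = [[0, 0, 1], [1, 2, 2], [0, 1, 2]] * 3   # word ids, vocabulary size 3
run_a = gibbs_lda(docs, K=2, V=3, seed=1)
run_b = gibbs_lda(docs, K=2, V=3, seed=2)
print(run_a)
print(run_b)   # counts generally differ, and topic ids may be permuted
```

Any discrepancy between `run_a` and `run_b` is purely sampler randomness, which is exactly the instability the paper studies.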

Old Solutions for stability:

  • A new metric of similarity between topics and a criterion of vocabulary reduction for evaluating stability.
  • The standard numerical evaluation of topic modeling results is perplexity, which measures how well the learned word-topic and topic-document distributions predict new test samples. The smaller the perplexity, the better (less uniform) the LDA model.
    • Problems with perplexity:
      • its value drops as the number of topics grows;
      • it depends on the dictionary size.
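A quick sketch of the perplexity computation (standard definition, not specific to this paper) makes the dictionary-size dependence concrete: a model that is completely uniform over a vocabulary of size V assigns each held-out token probability 1/V, so its perplexity is exactly V.

```python
import math

def perplexity(log_probs):
    """exp of the negative average log-likelihood over held-out tokens."""
    n = len(log_probs)
    return math.exp(-sum(log_probs) / n)

# A uniform model over V words gives each token log-probability log(1/V),
# so perplexity(...) == V -- it grows with dictionary size alone.
for V in (100, 1000):
    lp = [math.log(1.0 / V)] * 50    # 50 held-out tokens, uniform model
    print(V, perplexity(lp))
```

This is why comparing perplexities across corpora with different vocabularies (or after vocabulary reduction) is unreliable.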

Evaluation Metric:

  • symmetric Kullback–Leibler divergence
  • Compute the correlation between documents from two topic modeling experiments; unlike perplexity, this correlation does not depend on dictionary size. The method consists of the following steps:
    • construct a bipartite graph based on two topical solutions;
    • compute the minimal distance between topics in this bipartite graph;
    • compare topics between two cluster solutions based on the minimal distance.
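The steps above can be sketched as follows. This is a simplified reading, not the paper's code: symmetric KL divergence between topic-word distributions serves as the edge weight of the bipartite graph, and a greedy minimal-distance pairing (an assumption; the paper's exact matching procedure may differ) compares the two solutions.

```python
import math

def sym_kl(p, q, eps=1e-12):
    """Symmetric Kullback-Leibler divergence between two distributions."""
    kl = lambda a, b: sum(ai * math.log((ai + eps) / (bi + eps))
                          for ai, bi in zip(a, b))
    return kl(p, q) + kl(q, p)

def match_topics(phi_a, phi_b):
    """Greedily pair each topic in solution A with its nearest topic in B."""
    pairs = []
    free = set(range(len(phi_b)))      # topics of B not yet matched
    for i, p in enumerate(phi_a):
        j = min(free, key=lambda t: sym_kl(p, phi_b[t]))
        pairs.append((i, j, sym_kl(p, phi_b[j])))
        free.remove(j)
    return pairs

# Two "solutions" with identical topics in permuted order: the matching
# recovers the permutation with near-zero divergence.
phi_a = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
phi_b = [[0.1, 0.3, 0.6], [0.7, 0.2, 0.1]]
print(match_topics(phi_a, phi_b))
```

Small matched divergences across runs indicate a stable topical solution; large ones indicate that topics were not reproduced.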

Preprocessing step:

  • Compute document and word ratios, which show the fraction of documents and words that are actually relevant to specific topics.
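An illustrative sketch only: the paper's exact ratio definitions are not reproduced in these notes, so the threshold-based rule below is one plausible reading, in which a word counts as relevant to a topic when its probability there clears a cutoff.

```python
# Assumption: "relevant" means probability >= threshold in some topic;
# the paper may define word/document ratios differently.

def word_ratio(phi, threshold=0.25):
    """Fraction of vocabulary words relevant to at least one topic."""
    V = len(phi[0])
    relevant = sum(1 for w in range(V)
                   if any(topic[w] >= threshold for topic in phi))
    return relevant / V

phi = [[0.7, 0.2, 0.1], [0.05, 0.05, 0.9]]   # toy topic-word matrix
print(word_ratio(phi))   # words 0 and 2 clear the threshold -> 2/3
```

A document ratio could be defined analogously over the topic-document matrix.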