Latent Dirichlet Allocation: Stability and Applications to Studies of User-Generated Content
bibtex:
@inproceedings{koltcov2014latent,
  title={Latent dirichlet allocation: stability and applications to studies of user-generated content},
  author={Koltcov, Sergei and Koltsova, Olessia and Nikolenko, Sergey},
  booktitle={Proceedings of the 2014 ACM conference on Web science},
  pages={161--165},
  year={2014},
  organization={ACM}
}
General:
A good paper that examines the problem of topic instability in LDA.
LDA produces word-topic and topic-document matrices (probabilities of words appearing in topics and of topics appearing in documents).
Inference is done with variational approximations or Gibbs sampling; these algorithms find a local maximum of the joint likelihood function of the dataset.
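As a rough illustration (not from the paper), a minimal sketch of obtaining these two matrices, assuming scikit-learn's LDA, which uses online variational inference rather than Gibbs sampling; the tiny corpus is a placeholder:

```python
# Sketch (assumption: scikit-learn LDA, variational inference; placeholder corpus).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["users post about politics", "users post about sports", "sports and politics news"]

X = CountVectorizer().fit_transform(docs)          # document-term counts
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

word_topic = lda.components_                       # (n_topics, n_words): word weights per topic
topic_document = lda.transform(X)                  # (n_docs, n_topics): topic mixture per document
```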
The LDA approach has been further developed through more complex model extensions with additional parameters and additional information [2, 9, 18, 4].
Problem:
In the case of LDA, there are many local maxima, which may lead to instability in the output.
problem of finding the optimal number of clusters
Since these distributions result from the same dataset with the same vocabulary and model parameters, any differences between them are entirely due to the randomness in Gibbs sampling. This randomness affects perplexity variations, word and document ratios, and the reproducibility of the qualitative topical solution
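A hedged sketch of this instability, again assuming scikit-learn's LDA (where the randomness comes from initialization rather than Gibbs sampling): two runs on the same data that differ only in the random seed generally yield different topic-word matrices.

```python
# Sketch (assumption: scikit-learn LDA; randomness comes from initialization,
# not Gibbs sampling as in the paper). Two runs differ only in the random seed.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["apples and oranges", "oranges and bananas", "cars and trucks", "trucks and buses"]
X = CountVectorizer().fit_transform(docs)

run_a = LatentDirichletAllocation(n_components=2, random_state=1).fit(X)
run_b = LatentDirichletAllocation(n_components=2, random_state=2).fit(X)

# Normalize rows to word distributions per topic and compare: same data and
# parameters, yet the topic-word matrices generally do not coincide.
phi_a = run_a.components_ / run_a.components_.sum(axis=1, keepdims=True)
phi_b = run_b.components_ / run_b.components_.sum(axis=1, keepdims=True)
print(np.abs(phi_a - phi_b).max())
```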
Old Solutions for stability:
A new metric of similarity between topics and a criterion of vocabulary reduction to evaluate stability.
The standard numerical evaluation of topic modeling results is to measure perplexity. Perplexity shows how well the word-topic and topic-document distributions predict new test samples. The smaller the perplexity, the better (less uniform) the LDA model.
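A small sketch of measuring perplexity on held-out documents, assuming scikit-learn's LDA and its perplexity method; the corpus and the train/test split are placeholders:

```python
# Sketch: evaluate a fitted LDA model by perplexity on held-out documents
# (lower is better). Assumes scikit-learn; corpus and split are placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

train_docs = ["politics news today", "sports results today", "election politics debate"]
test_docs = ["sports news", "politics debate"]

vectorizer = CountVectorizer().fit(train_docs)
X_train, X_test = vectorizer.transform(train_docs), vectorizer.transform(test_docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X_train)
print(lda.perplexity(X_test))   # smaller perplexity = better predictive fit
```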
Problem with perplexity:
The value of perplexity drops as the number of topics grows, so it gives little guidance for choosing the number of topics.
Perplexity depends on the dictionary size, so values are not comparable across different vocabularies.
Evaluation Metric:
Symmetric Kullback–Leibler divergence is used to measure the distance between topics.
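For reference, a small helper computing the symmetric Kullback–Leibler divergence between two discrete distributions (e.g., the word distributions of two topics); the smoothing constant is an assumption added to avoid division by zero:

```python
import numpy as np

def symmetric_kl(p, q, eps=1e-12):
    """Symmetric KL divergence: 0.5 * (KL(p || q) + KL(q || p))."""
    p = np.asarray(p, dtype=float) + eps   # eps smoothing is an assumption
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))
```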
Compute the correlation between documents from two topic modeling experiments; unlike perplexity, this correlation does not depend on dictionary size. The method consists of the following steps (a sketch follows the list):
construct a bipartite graph based on two topical solutions;
compute the minimal distance between topics in this bipartite graph;
compare topics between two cluster solutions based on the minimal distance.
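A rough sketch of these steps, under the assumption that topics are represented by their word distributions and that the minimal-distance pairing is found with the Hungarian algorithm; the paper only specifies a bipartite graph with minimal distances, so the assignment routine here is an illustrative choice:

```python
# Sketch: match topics from two runs via a bipartite graph of symmetric KL
# distances. The Hungarian assignment (scipy) is an illustrative choice, not
# necessarily the exact procedure used in the paper.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.special import rel_entr   # elementwise p * log(p / q)

def sym_kl(p, q, eps=1e-12):
    p, q = p + eps, q + eps
    p, q = p / p.sum(), q / q.sum()
    return 0.5 * (rel_entr(p, q).sum() + rel_entr(q, p).sum())

def match_topics(phi_a, phi_b):
    """phi_a, phi_b: (n_topics, n_words) word distributions from two runs."""
    dist = np.array([[sym_kl(pa, pb) for pb in phi_b] for pa in phi_a])
    rows, cols = linear_sum_assignment(dist)         # minimal total distance pairing
    return list(zip(rows, cols)), dist[rows, cols]   # matched pairs and their distances
```

Small matched distances across runs indicate topics that are reproduced stably.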
Preprocessing step:
Document and word ratios show the fraction of words and documents that are actually relevant to specific topics.
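A rough sketch of one way to compute such ratios, assuming a word or document counts as relevant to a topic when its probability exceeds a threshold; the threshold values and this exact definition are assumptions, not the paper's precise formulation:

```python
# Sketch: fraction of documents and words relevant to topics above a probability
# threshold. The thresholds and this definition are assumptions.
import numpy as np

def document_ratio(theta, threshold=0.1):
    """theta: (n_docs, n_topics) topic-document matrix. A document counts as
    relevant if at least one topic exceeds the threshold."""
    return np.mean(theta.max(axis=1) > threshold)

def word_ratio(phi, threshold=0.001):
    """phi: (n_topics, n_words) word-topic matrix (rows sum to 1). A word counts
    as relevant if it exceeds the threshold in at least one topic."""
    return np.mean(phi.max(axis=0) > threshold)
```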