Improving the Usability of Topic Models #12

Closed · amritbhanu opened this issue Apr 11, 2016 · 0 comments

@amritbhanu (Contributor) commented:
Improving the Usability of Topic Models

bibtex:

@PhDThesis{yang2015improving,
  title  = {Improving the Usability of Topic Models},
  author = {Yang, Yi},
  year   = {2015},
  school = {Northwestern University}
}

Problems:

  • Gibbs sampling inference for LDA runs too slowly on large datasets with many topics (see the sketch after this list for why).
  • The topics learned by LDA are sometimes difficult for end users to interpret.
  • LDA suffers from an instability problem: different runs on the same data can produce different topics.
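
A minimal sketch of why plain collapsed Gibbs sampling is slow (hypothetical variable names, not the thesis's code): every token requires computing the full conditional over all T topics, so one sweep costs O(tokens × T).

```python
import numpy as np

def gibbs_sweep(z, docs, n_dt, n_tw, n_t, alpha, beta, V):
    """One collapsed-Gibbs sweep over every token in the corpus."""
    T = n_t.shape[0]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            # remove this token's current assignment from the count tables
            n_dt[d, t] -= 1; n_tw[t, w] -= 1; n_t[t] -= 1
            # full conditional over all T topics -- this O(T) scan per token
            # per sweep is why Gibbs LDA scales poorly with many topics
            p = (n_dt[d] + alpha) * (n_tw[:, w] + beta) / (n_t + V * beta)
            t = np.random.choice(T, p=p / p.sum())
            z[d][i] = t
            n_dt[d, t] += 1; n_tw[t, w] += 1; n_t[t] += 1
```

Fast-LDA [47] and the distributed/parallel approaches ([41], [67]) in the References below all attack this inner loop.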

Motivation:

  • The goal is to efficiently train a big topic model that incorporates prior knowledge.

Terminologies:

  • Markov random field
  • First-Order Logic

Stability Measures:

  • A common workaround for instability is to run the algorithm many times and choose the model with the highest likelihood.
  • The thesis measures both document-level stability and token-level stability (a sketch of one token-level measure follows this list).
  • Experimental setup: number of topics T = 20, 1000 iterations, uniform α = 1.0, uniform β = 0.01.
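
The thesis's exact stability definitions are not reproduced here; as one plausible reading (an assumption, not the thesis's formula), token-level stability can be measured as the fraction of tokens that receive the same topic in two independent runs, after optimally aligning the two runs' topics:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def token_stability(z1, z2, T):
    """Fraction of tokens assigned matching topics across two runs,
    after aligning run-2 topics to run-1 topics (Hungarian algorithm)."""
    z1, z2 = np.asarray(z1), np.asarray(z2)
    # confusion[i, j] = tokens assigned topic i in run 1 and topic j in run 2
    confusion = np.zeros((T, T), dtype=int)
    np.add.at(confusion, (z1, z2), 1)
    # maximize total agreement along the matched topic pairs
    rows, cols = linear_sum_assignment(-confusion)
    return confusion[rows, cols].sum() / len(z1)
```

Here z1 and z2 are flat arrays of per-token topic assignments from two runs on the same corpus; a value near 1.0 means the runs largely agree.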

General:

  • LDA can be viewed as a dimensionality reduction tool for document modeling: it reduces the dataset dimension from the vocabulary size V to the number of topics T (a sketch follows the Methods list below).
  • Users often have external knowledge, e.g. about word correlation, which can be taken into account to improve the semantic coherence of topic modeling.

Methods:

  • SC-LDA can handle different kinds of knowledge, such as word correlation, document correlation, and document labels. One advantage of SC-LDA over existing methods is that it converges very quickly.
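
The dimensionality-reduction view is easy to demonstrate with any off-the-shelf LDA implementation; a minimal sketch using scikit-learn (my choice of library, not the thesis's):

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = fetch_20newsgroups(remove=("headers", "footers", "quotes")).data[:500]
# bag-of-words matrix: n_docs x V (vocabulary size)
X = CountVectorizer(max_features=5000, stop_words="english").fit_transform(docs)
# document-topic matrix: n_docs x T (the reduced representation)
theta = LatentDirichletAllocation(n_components=20, random_state=0).fit_transform(X)
print(X.shape, "->", theta.shape)  # e.g. (500, 5000) -> (500, 20)
```

Each document goes from a V-dimensional word-count vector to a T-dimensional topic mixture.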

Datasets:

Analogy when topics are unstable:

  • When the topics change across runs, the mental map Jane has built for the paper collection is disrupted, causing confusion and frustration. The tool becomes less useful to Jane unless she spends effort updating her mental map, which significantly increases her cognitive load.

References:

  • Online LDA: [23]
  • [41] presents an algorithm for distributed Gibbs sampling.
  • [67] proposes a MapReduce parallelization framework that uses variational inference as the underlying algorithm.
  • [19] presents the Gibbs sampling method for LDA inference.
  • Labeled LDA: [48] presents a generative model for document collections in which documents are associated with labels.
  • Dirichlet Forest LDA: [3]
  • Logic LDA: [4]
  • Quad-LDA: [42] aims to improve the coherence of the per-topic keywords learned by LDA.
  • NMF-LDA: [64], similar in goal to Quad-LDA.
  • Markov Random Topic Fields (MRTF): [27]
  • Interactive Topic Modeling (ITM): [26] proposes the first interactive framework that lets users iteratively refine the topics discovered by LDA by adding constraints enforcing that sets of words appear together in the same topic.
  • Fast-LDA: [47] constructs an adaptive upper bound on the sampling distribution to achieve faster inference.

Summary:

  • Labeled LDA can only handle document-label knowledge. Dirichlet Forest LDA, Quad-LDA, NMF-LDA, and ITM can only handle word-correlation knowledge. MRTF can only handle document-correlation knowledge. Logic LDA can handle word correlation, document labels, and other kinds of knowledge; however, each kind of knowledge has to be encoded as First-Order Logic.