Improving the Usability of Topic Models #12

Closed · amritbhanu opened this issue Apr 11, 2016 · 0 comments

@amritbhanu (Contributor) commented:
Improving the Usability of Topic Models

bibtex:

@PhDThesis{yang2015improving,
  title  = {Improving the Usability of Topic Models},
  author = {Yang, Yi},
  year   = {2015},
  school = {Northwestern University}
}

Problems:

  • Gibbs sampling inference for LDA runs too slowly on large datasets with many topics (see the sketch after this list for why).
  • The topics learned by LDA are sometimes difficult for end users to interpret.
  • LDA suffers from an instability problem: different runs on the same data can produce different topics.
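
A minimal sketch of why plain collapsed Gibbs sampling is slow (hypothetical variable names, not the thesis's code): every token requires computing the full conditional over all T topics, so one sweep costs O(tokens × T).

```python
import numpy as np

def gibbs_sweep(z, docs, n_dt, n_tw, n_t, alpha, beta, V):
    """One collapsed-Gibbs sweep over every token in the corpus."""
    T = n_t.shape[0]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            # remove this token's current assignment from the count tables
            n_dt[d, t] -= 1; n_tw[t, w] -= 1; n_t[t] -= 1
            # full conditional over all T topics -- this O(T) scan per token
            # per sweep is why Gibbs LDA scales poorly with many topics
            p = (n_dt[d] + alpha) * (n_tw[:, w] + beta) / (n_t + V * beta)
            t = np.random.choice(T, p=p / p.sum())
            z[d][i] = t
            n_dt[d, t] += 1; n_tw[t, w] += 1; n_t[t] += 1
```

Fast-LDA [47] and the distributed/parallel approaches ([41], [67]) in the References below all attack this inner loop.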

Motivation:

  • The goal is to efficiently train a big topic model that incorporates prior knowledge.

Terminologies:

  • Markov random field
  • First-Order Logic

Stability Measures:

  • A common workaround for instability is to run the algorithm many times and choose the model with the highest likelihood.
  • The thesis measures both document-level stability and token-level stability (a sketch of one token-level measure follows this list).
  • Experimental setup: number of topics T = 20, 1000 iterations, uniform α = 1.0, uniform β = 0.01.
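
The thesis's exact stability definitions are not reproduced here; as one plausible reading (an assumption, not the thesis's formula), token-level stability can be measured as the fraction of tokens that receive the same topic in two independent runs, after optimally aligning the two runs' topics:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def token_stability(z1, z2, T):
    """Fraction of tokens assigned matching topics across two runs,
    after aligning run-2 topics to run-1 topics (Hungarian algorithm)."""
    z1, z2 = np.asarray(z1), np.asarray(z2)
    # confusion[i, j] = tokens assigned topic i in run 1 and topic j in run 2
    confusion = np.zeros((T, T), dtype=int)
    np.add.at(confusion, (z1, z2), 1)
    # maximize total agreement along the matched topic pairs
    rows, cols = linear_sum_assignment(-confusion)
    return confusion[rows, cols].sum() / len(z1)
```

Here z1 and z2 are flat arrays of per-token topic assignments from two runs on the same corpus; a value near 1.0 means the runs largely agree.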

General:

  • LDA can be viewed as a dimensionality reduction tool for document modeling: it reduces the dataset dimension from the vocabulary size V to the number of topics T (a sketch follows the Methods list below).
  • Users often have external knowledge, e.g. about word correlation, which can be taken into account to improve the semantic coherence of topic modeling.

Methods:

  • SC-LDA can handle different kinds of knowledge, such as word correlation, document correlation, and document labels. One advantage of SC-LDA over existing methods is that it converges very quickly.
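
The dimensionality-reduction view is easy to demonstrate with any off-the-shelf LDA implementation; a minimal sketch using scikit-learn (my choice of library, not the thesis's):

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = fetch_20newsgroups(remove=("headers", "footers", "quotes")).data[:500]
# bag-of-words matrix: n_docs x V (vocabulary size)
X = CountVectorizer(max_features=5000, stop_words="english").fit_transform(docs)
# document-topic matrix: n_docs x T (the reduced representation)
theta = LatentDirichletAllocation(n_components=20, random_state=0).fit_transform(X)
print(X.shape, "->", theta.shape)  # e.g. (500, 5000) -> (500, 20)
```

Each document goes from a V-dimensional word-count vector to a T-dimensional topic mixture.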

Datasets:

Analogy when topics are unstable:

  • When the topics change across runs, the mental map Jane has built for the paper collection is disrupted, causing confusion and frustration. The tool becomes less useful to Jane unless she spends effort updating her mental map, which significantly increases her cognitive load.

References:

  • Online LDA: [23]
  • [41] presents an algorithm for distributed Gibbs sampling.
  • [67] proposes a MapReduce parallelization framework that uses variational inference as the underlying algorithm.
  • [19] presents the Gibbs sampling method for LDA inference.
  • Labeled LDA: [48] presents a generative model for document collections in which documents are associated with labels.
  • Dirichlet Forest LDA: [3]
  • Logic LDA: [4]
  • Quad-LDA: [42] aims to improve the coherence of the per-topic keywords learned by LDA.
  • NMF-LDA: [64], similar in goal to Quad-LDA.
  • Markov Random Topic Fields (MRTF): [27]
  • Interactive Topic Modeling (ITM): [26] proposes the first interactive framework that lets users iteratively refine the topics discovered by LDA by adding constraints enforcing that sets of words appear together in the same topic.
  • Fast-LDA: [47] constructs an adaptive upper bound on the sampling distribution to achieve faster inference.

Summary:

  • Labeled LDA can only handle document-label knowledge. Dirichlet Forest LDA, Quad-LDA, NMF-LDA, and ITM can only handle word-correlation knowledge. MRTF can only handle document-correlation knowledge. Logic LDA can handle word correlation, document labels, and other kinds of knowledge; however, each kind of knowledge has to be encoded as First-Order Logic.