# Mixed Membership Clustering for Text
## Latent Dirichlet allocation (LDA)
We begin by loading the required libraries 

In [None]:
library(quanteda)     # dfm handling, convert(), docvars()
library(topicmodels)  # LDA(), posterior()
library(reshape2)     # melt()
library(ggplot2)      # plotting
set.seed(1)

and continue with the corpus in the form of a document-term matrix (DTM)

![alt text](data/dtm.png "DTM")

In [None]:
load("data/DTM.2.RData")
dtm <- convert(DTM.2, to = "topicmodels")

Parameter estimation can take some time, depending on the size of the vocabulary, the number of documents, and the number of topics $K$.

In [None]:
# Fit K = 5 topics by Gibbs sampling; verbose = 20 reports progress every 20 iterations
lda <- LDA(dtm, k = 5, method = "Gibbs",
           control = list(iter = 100, verbose = 20, alpha = 0.2, estimate.beta = TRUE))
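
To get a feeling for how the number of topics affects run time, the following minimal sketch times a short Gibbs run for a few values of K. It assumes the `AssociatedPress` example data that ships with *topicmodels* as a stand-in for `DTM.2`. The subset size, the candidate values in `ks`, and the names `small_dtm` and `fit_seconds` are illustrative choices, not part of the original analysis.

```r
library(topicmodels)

# Example data shipped with topicmodels (assumption: a stand-in for DTM.2)
data("AssociatedPress", package = "topicmodels")
small_dtm <- AssociatedPress[1:50, ]  # 50 documents keep the run short

set.seed(1)
ks <- c(2, 5, 10)  # candidate numbers of topics (illustrative values)

# Elapsed seconds of a short Gibbs run for each K
fit_seconds <- sapply(ks, function(k) {
  system.time(
    LDA(small_dtm, k = k, method = "Gibbs", control = list(iter = 50))
  )[["elapsed"]]
})
names(fit_seconds) <- paste0("K=", ks)
print(fit_seconds)
```

The absolute numbers depend on the machine. The sketch is only meant to show how the timings can be compared before committing to a long run on the full corpus.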

We can then inspect the estimated parameters and use them to display the respective distributions $\phi$ (terms per topic) and $\theta$ (topics per document)
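
As a reminder of what the two symbols stand for, the standard LDA generative model can be sketched as follows. The notation is the common textbook one and is an assumption here: $\alpha$ is the document-topic prior set in the `LDA()` call, and $\delta$ denotes the topic-term prior (the name of the corresponding Gibbs control parameter in *topicmodels*).

$$
\begin{aligned}
\phi_k &\sim \mathrm{Dirichlet}(\delta), & k &= 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, D \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d), & n &= 1, \dots, N_d \\
w_{d,n} \mid z_{d,n}, \phi &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
\end{aligned}
$$

Each $\phi_k$ is a distribution over the vocabulary for topic $k$, and each $\theta_d$ is a distribution over the $K$ topics for document $d$, which is what makes the clustering "mixed membership".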

In [None]:
example_ids <- c(1, 2, 3) # select document IDs of interest

lda_posterior <- posterior(lda)
top5termsPerTopicProb <- lda::top.topic.words(lda_posterior$terms, 5, by.score = TRUE)
topicProportionExamples <- lda_posterior$topics[example_ids, ]
colnames(topicProportionExamples) <- apply(top5termsPerTopicProb, 2, paste, collapse = " ")

vizDataFrame <- melt(data = cbind(data.frame(topicProportionExamples), document = docvars(DTM.2[example_ids, ])$docname), 
                     variable.name = "topic", 
                     id.vars = "document")

ggplot(data = vizDataFrame, aes(x = topic, y = value, fill = document)) +
  ylab("proportion") +
  geom_bar(stat = "identity", position = "stack") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1), legend.position="none") +  
  coord_flip() + facet_wrap(~document, ncol = length(example_ids))
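
The structure of the object returned by `posterior()` can be checked on a small example. The sketch below again assumes the `AssociatedPress` data from *topicmodels* as a stand-in for the corpus; `toy_lda`, `phi`, and `theta` are illustrative names. It shows that `$terms` holds one term distribution per topic and `$topics` one topic distribution per document, so every row should sum to one.

```r
library(topicmodels)

# Example data shipped with topicmodels (assumption: a stand-in for DTM.2)
data("AssociatedPress", package = "topicmodels")

set.seed(1)
toy_lda <- LDA(AssociatedPress[1:20, ], k = 3, method = "Gibbs",
               control = list(iter = 50))

toy_posterior <- posterior(toy_lda)
phi   <- toy_posterior$terms   # topics x vocabulary: term distribution per topic
theta <- toy_posterior$topics  # documents x topics: topic distribution per document

dim(phi)
dim(theta)

# Each row is a probability distribution, so both ranges should be close to 1
range(rowSums(phi))
range(rowSums(theta))

# Most probable terms per topic, using the helper from topicmodels
terms(toy_lda, 5)
```

Note that `terms()` ranks by probability, whereas `lda::top.topic.words()` with `by.score = TRUE` ranks by a score, so the two lists may differ.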

For further information, see the *topicmodels* [vignette](https://cran.r-project.org/web/packages/topicmodels/vignettes/topicmodels.pdf) as well as [T. Griffiths and M. Steyvers, 2004](http://psiexp.ss.uci.edu/research/papers/sciencetopics.pdf).