
add measures of topic quality #3

Open
jwijffels opened this issue Jan 2, 2019 · 22 comments

@jwijffels
Collaborator

jwijffels commented Jan 2, 2019

Note that perplexity does not exist for BTM models. We can implement:

  • Coherence
  • Average Intra-Cluster Distance
  • Average Inter-Cluster Distance
  • Purity
  • Normalised mutual information
  • Adjusted Rand Index

As defined in the BTM paper: https://github.com/xiaohuiyan/xiaohuiyan.github.io/blob/master/paper/BTM-WWW13.pdf
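For the external cluster-quality measures, a minimal sketch in base R of purity and normalised mutual information, assuming hard cluster assignments clusters and gold-standard labels labels (both hypothetical integer vectors of equal length); the Adjusted Rand Index is available elsewhere, e.g. via mclust::adjustedRandIndex:

# Purity: fraction of documents assigned to the majority gold label of their cluster
purity <- function(clusters, labels) {
  sum(apply(table(clusters, labels), 1, max)) / length(clusters)
}

# Normalised mutual information, normalised by sqrt(H(clusters) * H(labels))
nmi <- function(clusters, labels) {
  joint <- table(clusters, labels) / length(clusters)
  px <- rowSums(joint)
  py <- colSums(joint)
  lr <- joint * log(joint / outer(px, py))
  mi <- sum(lr[joint > 0])
  hx <- -sum(px * log(px))
  hy <- -sum(py * log(py))
  mi / sqrt(hx * hy)
}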

jwijffels changed the title from "add perplexity" to "add measures of topic quality" on Jan 3, 2019
@jwijffels
Collaborator Author

@manuelbickel I see you have been working on some of these measures for the text2vec package. Are you interested in some extra work for this biterm topic model package?

@manuelbickel

Hi, thank you for your interest in my recent work. I will have to finalize some work for my PhD thesis in the next two months, but afterwards I could try to provide support. It should not be too difficult to apply the metrics implemented in text2vec so far to the biterm model. The input required for the coherence metrics is "just" the n top topic terms plus a reference corpus from which to build a reference TCM. To my knowledge (which is limited, since I am not a computer scientist), coherence metrics have so far been applied to "normal" text, in contrast to the shorter texts BTM is aiming at, so we might have to check how well the metrics work in this context. It should probably be fine if a suitable reference corpus of a similar nature as the texts is selected, I guess.
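For what it is worth, a rough sketch of how this could look. Hedged: the exact arguments of text2vec::coherence() and the construction of the reference TCM should be double-checked against the text2vec documentation, and reference_texts is a hypothetical character vector of reference documents:

library(BTM)
library(text2vec)

# n top terms per topic from a fitted BTM model; terms() returns a list of
# data.frames whose first column holds the tokens
top_terms <- sapply(terms(model, top_n = 10), function(tt) tt$token)

# reference term co-occurrence matrix from a tokenised reference corpus
it  <- itoken(reference_texts)
voc <- create_vocabulary(it)
tcm <- create_tcm(it, vocab_vectorizer(voc), skip_grams_window = 5)

coherence(x = top_terms, tcm = tcm, n_doc_tcm = length(reference_texts))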

@jwijffels
Collaborator Author

That would be great!

@manuelbickel

Just a reminder for when we get to work on this in detail... For the cluster distance metrics, the Jensen-Shannon divergence is needed, which has already been implemented in the LDAvis package. We can use this.
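For reference, the divergence itself is only a few lines; a minimal sketch from the standard definition (not copied from LDAvis), assuming model$phi holds the topic-word distributions with one topic per column:

# Jensen-Shannon divergence between two discrete distributions p and q
jsd <- function(p, q) {
  m  <- 0.5 * (p + q)
  kl <- function(a, b) sum(a * log(a / b), na.rm = TRUE)  # na.rm treats 0 * log(0) as 0
  0.5 * kl(p, m) + 0.5 * kl(q, m)
}

# e.g. the distance between the word distributions of topics 1 and 2:
# jsd(model$phi[, 1], model$phi[, 2])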

@hg-wells

Hi jwijffels,

Thank you very much for BTM. I am trying to find the optimal number of topics for a corpus of tweets. My approach:
(1) Run a set of BTM models, from 1 up to 10 topics.
(2) Identify the model with the highest overall log-likelihood as the baseline model (in my case, model 9, i.e. a 9-topic model).
(3) Run a series of likelihood ratio tests of each model against model 9, with 9 - k degrees of freedom (e.g., for a 3-topic model, 9 - 3 = 6 degrees of freedom).
(4) If significant, keep the baseline model as optimal; a rough sketch of these steps is given below.
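The sketch assumes logLik() on a fitted BTM model returns a list whose element ll holds the overall log-likelihood (as referenced later in this thread), and leaves aside whether 9 - k is an appropriate degrees-of-freedom choice for topic models:

library(BTM)

ks     <- 1:10
models <- lapply(ks, function(k) BTM(x, k = k, iter = 1000))  # x: doc_id/lemma pairs
lls    <- sapply(models, function(m) logLik(m)$ll)
base   <- which.max(lls)  # step (2): baseline = model with highest log-likelihood

# step (3): likelihood ratio test of each smaller model against the baseline
for (k in ks[ks < base]) {
  stat <- 2 * (lls[base] - lls[k])
  p    <- pchisq(stat, df = base - k, lower.tail = FALSE)
  cat(sprintf("k = %d vs k = %d: LR = %.1f, p = %.3g\n", k, base, stat, p))
}
# step (4): if the tests are significant, keep the baseline model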

Could you please let me know what you think? I'd appreciate it very much.
Also, I'm interested in coherence. I was wondering if there is a dev version of BTM for this, or if you know any other way to compute it in R starting from a BTM-type object.

Thank you very much

@jwijffels
Collaborator Author

No comment.
There is, as of now, no public dev version containing measures of topic quality. Anyone interested in taking up the challenge and incorporating them into the package is welcome to do so.

@hg-wells

Thanks. I wish I had the ability to comment and help; I would do it.

@manuelbickel

First of all, sorry that I still have not proceeded with coherence metrics for BTM - since my current main job has nothing to do with programming, it is difficult to find time. It is still on my list.

Now to the question of @hg-wells:
Generating many models with a varying number of topics and selecting the "best" one makes sense from my perspective. I have published a paper using this approach based on the excellent text2vec package by dselivanov - so standard LDA, not BTM; see the paper and vignette here. The code is unpolished and the package is not installation-ready; do not hesitate to contact me with questions.

I used log-likelihood and coherence metrics with a comparably small but thematically very specific reference corpus. The metrics all gave different answers regarding the "best" number of topics, since they measure different qualities.

The best approach, to my current knowledge, would be to create models with different numbers of topics and different hyperparameters, then use different reference corpora and different metrics to evaluate the results. Then pick out the "best" models these metrics propose and inspect them manually using expert knowledge. A lot of work, I know... I think purely automatic detection works in some contexts, but not in all. Texts represent meaning, and it depends on what kind of meaning you are searching for - there is no single correct answer. Think of all these topic modelling algorithms as text coders, as in the social sciences - you need to find out which one you trust the most, and this can vary depending on the task.

I hope that helps a bit, at least.

@hg-wells

hg-wells commented Apr 27, 2020 via email

@hg-wells

Hi,
I tried to adapt the exclusivity function from the stm package (Roberts et al., 2014; see https://www.rdocumentation.org/packages/stm/versions/1.3.5/topics/exclusivity) to BTM, to measure topic quality.
As I said, I am a beginner; apologies in advance if this does not make any sense. I would really appreciate your feedback. Thank you.

# Exclusivity

exclusivity <- function(model, M = 30, frexw = 0.7) {
  w <- frexw
  # transpose BTM's phi so topics are in rows and wrap it in a list,
  # mimicking stm's model$beta$logbeta
  phidf <- t(as.matrix(model$phi))
  phi <- list(phidf)
  if (length(phi) != 1)
    stop("Exclusivity calculation only designed for models without content covariates")
  tbeta <- t(exp(phi[[1]]))  # exponentiated as in stm, where beta is stored on the log scale
  s <- rowSums(tbeta)
  mat <- tbeta / s  # each term's probability mass share across topics (exclusivity part)
  ex <- apply(mat, 2, rank) / nrow(mat)
  fr <- apply(tbeta, 2, rank) / nrow(mat)
  frex <- 1 / (w / ex + (1 - w) / fr)  # harmonic mean of exclusivity and frequency ranks
  index <- apply(tbeta, 2, order, decreasing = TRUE)[1:M, ]  # top M terms per topic
  out <- vector(length = ncol(tbeta))
  for (i in 1:ncol(frex)) {
    out[i] <- sum(frex[index[, i], i])  # FREX score of topic i, summed over its top M terms
  }
  print(mean(out))
}

# Run BTM and compute exclusivity

install.packages("BTM")
install.packages("udpipe")
library(BTM)
library(udpipe)
data("brussels_reviews_anno", package = "udpipe")
x <- subset(brussels_reviews_anno, language == "nl")
x <- subset(x, xpos %in% c("NN", "NNP", "NNS"))
x <- x[, c("doc_id", "lemma")]
model_x <- BTM(x, k = 5, iter = 10, trace = TRUE, background = TRUE)
exclusivity(model_x)

@jwijffels
Collaborator Author

@hg-wells Thanks for the contribution and for having taken the time for this.
Before proceeding: can you create a pull request where you put the R code in the package, document it, provide an example, show the expected behaviour of the function, and possibly add a test using the tinytest package?
I've just set up Travis CI so that we can see from there if there are any issues.
Thanks!

@hg-wells

hg-wells commented May 27, 2020 via email

@abitter

abitter commented Dec 10, 2020

Hi @jwijffels and @hg-wells,
we modified a function by @grenwi that originally calculated semantic coherence for LDA models.

For semantic coherence of biterm topic models, the function is as follows:

# modified from tmca_coherence() function (by A. Niekler & G. Wiedemann)
# https://github.com/tm4ss/tm4ss.github.io
coherenceBTM <- function(model, DTM, N = 10) {
  
  # Ensure matrix or Matrix-format (convert if slam)
  require(Matrix)
  require(slam)
  if (is.simple_triplet_matrix(DTM)) {
    DTM <- sparseMatrix(i=DTM$i, j=DTM$j, x=DTM$v, dims=c(DTM$nrow, DTM$ncol), dimnames = dimnames(DTM))
  }
  
  K <- model$K
  
  DTMBIN <- DTM > 0
  
  documentFrequency <- colSums(DTMBIN)
  names(documentFrequency) <- colnames(DTMBIN)
  
  topNtermsPerTopic <- terms(model, top_n = N)
  
  termcollect <- list()
  for (i in 1:K){
    termcollect[[i]] <- topNtermsPerTopic[[i]][,1]
  }
  
  allTopicModelTerms <- unique(as.vector(unlist(termcollect)))
  
  DTMBIN <- DTMBIN[, allTopicModelTerms]
  DTMBINCooc <- t(DTMBIN) %*% DTMBIN                   # document co-occurrence counts
  DTMBINCooc <- t((DTMBINCooc + 1) / colSums(DTMBIN))  # smoothed conditional probabilities
  DTMBINCooc <- log(DTMBINCooc)                        # UMass-style log-probability scores
  DTMBINCooc <- as.matrix(DTMBINCooc)
  
  coherence <- rep(0, K)
  pb <- txtProgressBar(max = K)
  for (topicIdx in 1:K) {
    setTxtProgressBar(pb, topicIdx)
    topWordsOfTopic <- topNtermsPerTopic[[topicIdx]][,1]
    
    coherence[topicIdx] <- 0
    for (m in 2:length(topWordsOfTopic)) {
      for (l in 1:(m-1)) {
        mTerm <- as.character(topWordsOfTopic[m])
        lTerm <- as.character(topWordsOfTopic[l])
        coherence[topicIdx] <- coherence[topicIdx] + DTMBINCooc[mTerm, lTerm]
      }
    }
  }
  close(pb)
  
  return(coherence)
}

The necessary Document-Term-Matrix (DTM) can be calculated with:

## DTM for coherence calculation ##
library(quanteda)
corpus <- corpus(x$lemma)
DFM <- dfm(tokens(corpus)) # edit the parameters of tokens() to your needs, e.g. removal of separators and punctuation
DTM <- convert(DFM, to = "topicmodels")

where x is:

x <- subset(brussels_reviews_anno, language == "nl")
x <- subset(x, xpos %in% c("NN", "NNP", "NNS"))
x <- x[, c("doc_id", "lemma")]

# one doc per row
x <- aggregate(x$lemma, by = list(x$doc_id), paste, collapse = " ")
names(x) <- c("doc_id", "lemma")

(the example from https://github.com/bnosac/BTM)
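Putting the pieces together, the call would then look something like this (assuming model_x is a BTM model fitted as in the earlier exclusivity example):

coh <- coherenceBTM(model = model_x, DTM = DTM, N = 10)
mean(coh)  # average semantic coherence across the K topics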

We provided this function in our supplementary materials (http://dx.doi.org/10.23668/psycharchives.4372; unfortunately it only works with BTM versions prior to 0.3.2), where we also made use of @hg-wells' exclusivity function. Thanks for that!
And thanks for implementing BTM in R!

@jwijffels
Collaborator Author

Thanks @abitter, we should probably include this functionality in this R package.
Is the code you provided, which was originally written by @grenwi, available under the Apache license, like this project?

@abitter

abitter commented Dec 11, 2020

We made our code available under the LGPL 3.0 license, which should be compatible with the Apache License 2.0.

@hg-wells

Thank you! Sorry for the delay in replying.

@jwijffels
Collaborator Author

@abitter The code can't be included in the package as is: LGPL 3.0 is a less liberal license than the Apache License 2.0 under which BTM ships. Only if your code's license is changed to Apache can it be included in the package.

@abitter

abitter commented Jan 28, 2021

@jwijffels I see – I'll check if that's possible.

@abitter

abitter commented Feb 10, 2021

@jwijffels So I checked with @grenwi, and we now provide the code above (#3 (comment)) under the Apache 2.0 license as well.
So despite the LGPL 3.0 licensing of the script in which these lines of code were included (http://dx.doi.org/10.23668/psycharchives.4372), this part is now also licensed under Apache 2.0 so that it can be included in your BTM package.

@ginalamp

ginalamp commented Sep 16, 2021

Should fit$ll from logLik be maximised, and what does the likelihood look like? If I run the code multiple times and get both very positive and very negative values, is the best fit the value closest to 0, or the most positive (or negative) value?

@mevalerio

@ginalamp I noticed that the ll from logLik increases as the number of topics increases. I suppose that this, as well as the asymptote, can be proved mathematically. Therefore, I suspect that logLik cannot be used for evaluating results, even though @hg-wells wrote in his paper: "The maximum model likelihood or subjective choices are often the basis for model selection [88, 96, 97]. Maximum likelihood models may produce topics with comparably low interpretability [112]." I suspect I am missing something.

@mevalerio

mevalerio commented Mar 16, 2023

@hg-wells why is exp needed in tbeta <- t(exp(phi[[1]])) in your exclusivity function?

Thank you for the clarification, and apologies if the question is obvious.
