
add measures of topic quality #3

Open
jwijffels opened this issue Jan 2, 2019 · 22 comments

@jwijffels
Collaborator

jwijffels commented Jan 2, 2019

Note that perplexity does not exist for BTM models. We can implement:

  • Coherence
  • Average Intra-Cluster Distance
  • Average Inter-Cluster Distance
  • Purity
  • Normalised mutual information
  • Adjusted Rand Index

As defined in the BTM paper: https://github.com/xiaohuiyan/xiaohuiyan.github.io/blob/master/paper/BTM-WWW13.pdf
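For the external cluster-quality measures, a minimal sketch in base R of purity and normalised mutual information, assuming hard cluster assignments clusters and gold-standard labels labels (both hypothetical integer vectors of equal length); the Adjusted Rand Index is available elsewhere, e.g. via mclust::adjustedRandIndex:

# Purity: fraction of documents assigned to the majority gold label of their cluster
purity <- function(clusters, labels) {
  sum(apply(table(clusters, labels), 1, max)) / length(clusters)
}

# Normalised mutual information, normalised by sqrt(H(clusters) * H(labels))
nmi <- function(clusters, labels) {
  joint <- table(clusters, labels) / length(clusters)
  px <- rowSums(joint)
  py <- colSums(joint)
  lr <- joint * log(joint / outer(px, py))
  mi <- sum(lr[joint > 0])
  hx <- -sum(px * log(px))
  hy <- -sum(py * log(py))
  mi / sqrt(hx * hy)
}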

jwijffels changed the title from "add perplexity" to "add measures of topic quality" on Jan 3, 2019
@jwijffels
Collaborator Author

@manuelbickel I see you have been working on some of these measures for the text2vec package. Are you interested in some extra work for this biterm topic model package?

@manuelbickel

Hi, thank you for your interest in my recent work. I will have to finalize some work for my PhD thesis in the next two months, but afterwards I could try to provide support. It should not be too difficult to apply the metrics implemented in text2vec so far to the biterm model. The input required for the coherence metrics is "just" the n top topic terms plus a reference corpus from which to build a reference TCM. To my knowledge (which is limited, since I am not a computer scientist), coherence metrics have so far been applied to "normal" text, in contrast to the shorter texts BTM is aiming at, so we might have to check how well the metrics work in this context. It should probably be fine if a suitable reference corpus of a similar nature as the texts is selected, I guess.
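For what it is worth, a rough sketch of how this could look. Hedged: the exact arguments of text2vec::coherence() and the construction of the reference TCM should be double-checked against the text2vec documentation, and reference_texts is a hypothetical character vector of reference documents:

library(BTM)
library(text2vec)

# n top terms per topic from a fitted BTM model; terms() returns a list of
# data.frames whose first column holds the tokens
top_terms <- sapply(terms(model, top_n = 10), function(tt) tt$token)

# reference term co-occurrence matrix from a tokenised reference corpus
it  <- itoken(reference_texts)
voc <- create_vocabulary(it)
tcm <- create_tcm(it, vocab_vectorizer(voc), skip_grams_window = 5)

coherence(x = top_terms, tcm = tcm, n_doc_tcm = length(reference_texts))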

@jwijffels
Collaborator Author

That would be great!

@manuelbickel

Just a reminder for when we get to work on this in detail... For the cluster distance metrics, the Jensen-Shannon divergence is needed, which has already been implemented in the LDAvis package. We can use this.
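For reference, the divergence itself is only a few lines; a minimal sketch from the standard definition (not copied from LDAvis), assuming model$phi holds the topic-word distributions with one topic per column:

# Jensen-Shannon divergence between two discrete distributions p and q
jsd <- function(p, q) {
  m  <- 0.5 * (p + q)
  kl <- function(a, b) sum(a * log(a / b), na.rm = TRUE)  # na.rm treats 0 * log(0) as 0
  0.5 * kl(p, m) + 0.5 * kl(q, m)
}

# e.g. the distance between the word distributions of topics 1 and 2:
# jsd(model$phi[, 1], model$phi[, 2])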

@hg-wells

Hi jwijffels,

Thank you very much for BTM. I am trying to find the optimal number of topics for a corpus of tweets. My approach:
(1) Run a set of BTM models, from 1 up to 10 topics.
(2) Identify the model with the highest overall log-likelihood as the baseline model (in my case, model 9, i.e. a 9-topic model).
(3) Run a series of likelihood ratio tests of each model against model 9, with 9 - k degrees of freedom (e.g., for a 3-topic model, 9 - 3 = 6 degrees of freedom).
(4) If significant, keep the baseline model as optimal; a rough sketch of these steps is given below.
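The sketch assumes logLik() on a fitted BTM model returns a list whose element ll holds the overall log-likelihood (as referenced later in this thread), and leaves aside whether 9 - k is an appropriate degrees-of-freedom choice for topic models:

library(BTM)

ks     <- 1:10
models <- lapply(ks, function(k) BTM(x, k = k, iter = 1000))  # x: doc_id/lemma pairs
lls    <- sapply(models, function(m) logLik(m)$ll)
base   <- which.max(lls)  # step (2): baseline = model with highest log-likelihood

# step (3): likelihood ratio test of each smaller model against the baseline
for (k in ks[ks < base]) {
  stat <- 2 * (lls[base] - lls[k])
  p    <- pchisq(stat, df = base - k, lower.tail = FALSE)
  cat(sprintf("k = %d vs k = %d: LR = %.1f, p = %.3g\n", k, base, stat, p))
}
# step (4): if the tests are significant, keep the baseline model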

Could you please let me know what you think? I'd appreciate it very much.
Also, I'm interested in coherence. I was wondering if there is a dev version of BTM for this, or if you know any other way to compute it in R starting from a BTM-type object.

Thank you very much

@jwijffels
Collaborator Author

No comment.
There is, as of now, no public dev version containing measures of topic quality. Anyone interested in taking up the challenge and incorporating them into the package is welcome to do so.

@hg-wells

Thanks. I wish I had the ability to comment and help; I would do it.

@manuelbickel

First of all, sorry that I still have not proceeded with coherence metrics for BTM - since my current main job has nothing to do with programming, it is difficult to find time. It is still on my list.

Now to the question of @hg-wells:
Generating many models with a varying number of topics and selecting the "best" one makes sense from my perspective. I have published a paper using this approach based on the excellent text2vec package by dselivanov - so standard LDA, not BTM; see the paper and vignette here. The code is unpolished and the package is not installation-ready; do not hesitate to contact me with questions.

I used log-likelihood and coherence metrics with a comparably small but thematically very specific reference corpus. The metrics all gave different answers regarding the "best" number of topics, since they measure different qualities.

The best approach, to my current knowledge, would be to create models with different numbers of topics and different hyperparameters, then use different reference corpora and different metrics to evaluate the results. Then pick out the "best" models these metrics propose and inspect them manually using expert knowledge. A lot of work, I know... I think purely automatic detection works in some contexts, but not in all. Texts represent meaning, and it depends on what kind of meaning you are searching for - there is no single correct answer. Think of all these topic modelling algorithms as text coders, as in the social sciences - you need to find out which one you trust the most, and this can vary depending on the task.

I hope that helps a bit, at least.

@hg-wells

hg-wells commented Apr 27, 2020 via email

@hg-wells

Hi,
I tried to adapt the exclusivity function from the stm package (Roberts et al., 2014; see https://www.rdocumentation.org/packages/stm/versions/1.3.5/topics/exclusivity) to BTM, to measure topic quality.
As I said, I am a beginner; apologies in advance if this does not make any sense. I would really appreciate your feedback. Thank you.

# Exclusivity

exclusivity <- function(model, M = 30, frexw = 0.7) {
  w <- frexw
  # transpose BTM's phi so topics are in rows and wrap it in a list,
  # mimicking stm's model$beta$logbeta
  phidf <- t(as.matrix(model$phi))
  phi <- list(phidf)
  if (length(phi) != 1)
    stop("Exclusivity calculation only designed for models without content covariates")
  tbeta <- t(exp(phi[[1]]))  # exponentiated as in stm, where beta is stored on the log scale
  s <- rowSums(tbeta)
  mat <- tbeta / s  # each term's probability mass share across topics (exclusivity part)
  ex <- apply(mat, 2, rank) / nrow(mat)
  fr <- apply(tbeta, 2, rank) / nrow(mat)
  frex <- 1 / (w / ex + (1 - w) / fr)  # harmonic mean of exclusivity and frequency ranks
  index <- apply(tbeta, 2, order, decreasing = TRUE)[1:M, ]  # top M terms per topic
  out <- vector(length = ncol(tbeta))
  for (i in 1:ncol(frex)) {
    out[i] <- sum(frex[index[, i], i])  # FREX score of topic i, summed over its top M terms
  }
  print(mean(out))
}

# Run BTM and compute exclusivity

install.packages("BTM")
install.packages("udpipe")
library(BTM)
library(udpipe)
data("brussels_reviews_anno", package = "udpipe")
x <- subset(brussels_reviews_anno, language == "nl")
x <- subset(x, xpos %in% c("NN", "NNP", "NNS"))
x <- x[, c("doc_id", "lemma")]
model_x <- BTM(x, k = 5, iter = 10, trace = TRUE, background = TRUE)
exclusivity(model_x)

@jwijffels
Collaborator Author

@hg-wells Thanks for the contribution and for having taken the time for this.
Before proceeding: can you create a pull request where you put the R code in the package, document it, provide an example, show the expected behaviour of the function, and possibly add a test using the tinytest package?
I've just set up Travis CI so that we can see from there if there are any issues.
Thanks!

@hg-wells

hg-wells commented May 27, 2020 via email

@abitter

abitter commented Dec 10, 2020

Hi @jwijffels and @hg-wells,
we modified a function by @grenwi that originally calculated semantic coherence for LDA models.

For semantic coherence of biterm topic models, the function is as follows:

# modified from tmca_coherence() function (by A. Niekler & G. Wiedemann)
# https://github.com/tm4ss/tm4ss.github.io
coherenceBTM <- function(model, DTM, N = 10) {
  
  # Ensure matrix or Matrix-format (convert if slam)
  require(Matrix)
  require(slam)
  if (is.simple_triplet_matrix(DTM)) {
    DTM <- sparseMatrix(i=DTM$i, j=DTM$j, x=DTM$v, dims=c(DTM$nrow, DTM$ncol), dimnames = dimnames(DTM))
  }
  
  K <- model$K
  
  DTMBIN <- DTM > 0
  
  documentFrequency <- colSums(DTMBIN)
  names(documentFrequency) <- colnames(DTMBIN)
  
  topNtermsPerTopic <- terms(model, top_n = N)
  
  termcollect <- list()
  for (i in 1:K){
    termcollect[[i]] <- topNtermsPerTopic[[i]][,1]
  }
  
  allTopicModelTerms <- unique(as.vector(unlist(termcollect)))
  
  DTMBIN <- DTMBIN[, allTopicModelTerms]
  DTMBINCooc <- t(DTMBIN) %*% DTMBIN                   # document co-occurrence counts
  DTMBINCooc <- t((DTMBINCooc + 1) / colSums(DTMBIN))  # smoothed conditional probabilities
  DTMBINCooc <- log(DTMBINCooc)                        # UMass-style log-probability scores
  DTMBINCooc <- as.matrix(DTMBINCooc)
  
  coherence <- rep(0, K)
  pb <- txtProgressBar(max = K)
  for (topicIdx in 1:K) {
    setTxtProgressBar(pb, topicIdx)
    topWordsOfTopic <- topNtermsPerTopic[[topicIdx]][,1]
    
    coherence[topicIdx] <- 0
    for (m in 2:length(topWordsOfTopic)) {
      for (l in 1:(m-1)) {
        mTerm <- as.character(topWordsOfTopic[m])
        lTerm <- as.character(topWordsOfTopic[l])
        coherence[topicIdx] <- coherence[topicIdx] + DTMBINCooc[mTerm, lTerm]
      }
    }
  }
  close(pb)
  
  return(coherence)
}

The necessary Document-Term-Matrix (DTM) can be calculated with:

## DTM for coherence calculation ##
library(quanteda)
corpus <- corpus(x$lemma)
DFM <- dfm(tokens(corpus)) # edit the parameters of tokens() to your needs, e.g. removal of separators and punctuation
DTM <- convert(DFM, to = "topicmodels")

where x is:

x <- subset(brussels_reviews_anno, language == "nl")
x <- subset(x, xpos %in% c("NN", "NNP", "NNS"))
x <- x[, c("doc_id", "lemma")]

# one doc per row
x <- aggregate(x$lemma, by = list(x$doc_id), paste, collapse = " ")
names(x) <- c("doc_id", "lemma")

(the example from https://github.com/bnosac/BTM)
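Putting the pieces together, the call would then look something like this (assuming model_x is a BTM model fitted as in the earlier exclusivity example):

coh <- coherenceBTM(model = model_x, DTM = DTM, N = 10)
mean(coh)  # average semantic coherence across the K topics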

We provided this function in our supplementary materials (http://dx.doi.org/10.23668/psycharchives.4372; unfortunately it only works with BTM versions prior to 0.3.2), where we also made use of @hg-wells' exclusivity function. Thanks for that!
And thanks for implementing BTM in R!

@jwijffels
Collaborator Author

Thanks @abitter, we should probably include this functionality in this R package.
Is the code you provided, which was originally written by @grenwi, available under the Apache license, like this project?

@abitter

abitter commented Dec 11, 2020

We made our code available under the LGPL 3.0 license, which should be compatible with the Apache License 2.0.

@hg-wells

Thank you! Sorry for the delay in replying.

@jwijffels
Collaborator Author

@abitter The code can't be included in the package as is: LGPL 3.0 is a less liberal license than the Apache License 2.0 under which BTM ships. Only if your code's license is changed to Apache can it be included in the package.

@abitter

abitter commented Jan 28, 2021

@jwijffels I see – I'll check if that's possible.

@abitter

abitter commented Feb 10, 2021

@jwijffels So I checked with @grenwi, and we now provide the code above (#3 (comment)) under the Apache 2.0 license as well.
So despite the LGPL 3.0 licensing of the script in which these lines of code were included (http://dx.doi.org/10.23668/psycharchives.4372), this part is now also licensed under Apache 2.0 so that it can be included in your BTM package.

@ginalamp

ginalamp commented Sep 16, 2021

Should fit$ll from logLik be maximised, and what does the likelihood look like? If I run the code multiple times and get both very positive and very negative values, is the best fit the value closest to 0, or the most positive (or negative) value?

@mevalerio

@ginalamp I noticed that the ll from logLik increases as the number of topics increases. I suppose that this, as well as the asymptote, can be proved mathematically. Therefore, I suspect that logLik cannot be used for evaluating results, even though @hg-wells wrote in his paper: "The maximum model likelihood or subjective choices are often the basis for model selection [88, 96, 97]. Maximum likelihood models may produce topics with comparably low interpretability [112]." I suspect I am missing something.

@mevalerio

mevalerio commented Mar 16, 2023

@hg-wells why is exp needed in tbeta <- t(exp(phi[[1]])) in your exclusivity function?

Thank you for the clarification, and apologies if the question is obvious.
