# Is Automated Topic Model Evaluation Broken? The Incoherence of Coherence<sup>1</sup>

### Authors: Alexander Hoyle, Pranav Goel, Denis Peskov, Andrew Hian-Cheong, Jordan Boyd-Graber and Philip Resnik (University of Maryland)

Link to paper: https://openreview.net/forum?id=tjdHCnPqoo


Paper walkthrough by Armin Catovic, Sr. Data Scientist, Schibsted

<sub><sup>1</sup> Presented in a poster session at NeurIPS 2021</sub>

> ___When a measure becomes a target, it ceases to be a good measure___ _(Goodhart's Law)_

## Background

* The goal of topic models is to facilitate human understanding of a text<sup>2</sup> corpus
* While real-world users evaluate topics based on their specific needs, research/academia has gravitated towards automated proxies of human judgment, i.e. automated evaluation metrics
* __Perplexity__-based evaluations are now seldom used since they are negatively correlated with human interpretability
* The current standard for automated evaluation of topic models is to use __coherence__ metrics, the most popular being __Normalized Pointwise Mutual Information (NPMI)__
* Over the last few years most topic modelling work has migrated to neural approaches, with ever-increasing NPMI scores and state-of-the-art (SOTA) claims
* However, __very few papers have used human evaluations__ when reporting results<sup>3</sup>
* Neural models seem to manifest qualitatively distinct topics compared to classical approaches such as Latent Dirichlet Allocation (LDA) - __can a single coherence metric apply equally to both neural and classical approaches?__
* There seems to be a wide __standardization gap__ when it comes to pre-processing of text, as well as in the calculation of the NPMI itself

<sub><sup>2</sup> Topic modelling has also been applied to non-textual datasets, such as images and genome sequences.</sub>

<sub><sup>3</sup> See <em>Table 6</em> in <em>Appendix A</em> of the original paper. Outside the core method-development literature, human evaluations have been used to develop new metrics and improve understanding of existing model behaviour.</sub>

## Approach

Hoyle et al. make the following contributions in the paper:

* Present a meta-analysis of neural topic model evaluation and the current state of affairs
* Develop standardized, reproducible, pre-processed versions of two widely used evaluation datasets - __Wikitext-103__ (Wikipedia corpus) and __LDC2008T19__ (New York Times corpus)
* Optimize three topic models - one classical (LDA with Gibbs sampling) and two neural (ETM and D-VAE) - using identical pre-processing, model selection criteria, and hyperparameter tuning
* Obtain human evaluations of these models using ratings and word intrusion tasks
* Provide new evaluations of the correlation between automated and human evaluations

## Motivation

* Question the validity of fully automated evaluations
* Define a standard for text pre-processing, model tuning, and model evaluation

## Topic Modelling in Perspective

* Topic models, such as __Latent Dirichlet Allocation (LDA)__, are probabilistic generative models of text - they "tell a (somewhat simplified) story" of how a document came to be
* Topic models assume that each document is a sparse mixture of topics, and that each topic is a specific categorical distribution over the words/vocabulary
* When we evaluate topic models, we typically look at (1) most probable $N$ words (e.g. $N=10$) in each topic, and (2) most probable topic assignments for each document
* With the emergence of deep learning, topic modelling methods have mostly migrated towards __Neural Topic Models (NTMs)__, where instead of using sparse representations and Markov Chain Monte Carlo (MCMC) based estimators, NTMs use dense continuous word representations and gradient optimization to fit the parameters
* NTMs have gained in popularity because of ___results suggesting they produce more interpretable topics compared to "classical" methods such as LDA___
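The two views used for evaluation (top-$N$ words per topic and the most probable topics per document) can be read directly off the fitted distributions. Below is a minimal sketch, assuming the model exposes a topic-word matrix and a document-topic matrix as nested lists; the names `top_words`, `top_topics`, and the toy numbers are illustrative only and do not come from the paper.

```python
def top_words(topic_word, vocab, n=10):
    """Return the n most probable words for each topic (one row per topic)."""
    result = []
    for probs in topic_word:
        # Rank vocabulary indices by descending probability under this topic.
        ranked = sorted(range(len(vocab)), key=lambda j: probs[j], reverse=True)
        result.append([vocab[j] for j in ranked[:n]])
    return result


def top_topics(doc_topic, k=1):
    """Return the indices of the k most probable topics for each document."""
    return [
        sorted(range(len(probs)), key=lambda t: probs[t], reverse=True)[:k]
        for probs in doc_topic
    ]


# Toy model with two topics over a six-word vocabulary (made-up numbers).
vocab = ["goal", "match", "team", "vote", "party", "election"]
topic_word = [
    [0.40, 0.30, 0.20, 0.05, 0.03, 0.02],  # a "sports"-like topic
    [0.02, 0.03, 0.05, 0.30, 0.20, 0.40],  # a "politics"-like topic
]
doc_topic = [[0.9, 0.1], [0.2, 0.8]]

words_per_topic = top_words(topic_word, vocab, n=3)
topics_per_doc = top_topics(doc_topic, k=1)
```

These short word lists are what both the automated metrics and the human annotators see in the sections that follow.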

## NPMI - The Standard Topic Model Coherence Evaluation

* __Coherence__ stems from the Latin _cohaerere_, meaning _"to cling/stick (closely) together"_
* A topic is said to be coherent if its $N$ most probable terms, viewed together, enable one to recognise the topic as an identifiable category<sup>4</sup>
* Today, the evaluation consensus is to use __pairwise normalized pointwise mutual information (NPMI)__, as defined below:

![NPMI](./figures/npmi.png "NPMI")

* NPMI scores highly if the top $N$ words - summed over all pairs of $w_i$ and $w_j$ - have high joint probability $P(w_i, w_j)$ compared to their marginal probabilities $P(w_i)$ and $P(w_j)$
* The probabilities are estimated using word co-occurrence counts from a ___reference corpus___ for a specific ___context window___, which can range from ten words up to the entire document
* Therefore, the choice of a reference corpus and corresponding context window determine the NPMI score
* Furthermore, since neural word representations are intimately connected to NPMI, there is a potential for NTMs to generate topics with high NPMI scores, without (qualitatively) explaining the corpus well to the user
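For reference, the commonly used pairwise form of NPMI can be written out as follows. This is the standard textbook formulation rather than a transcription of the figure, and the averaging over pairs in the second line is an assumption (some implementations sum instead of averaging):

$$
\mathrm{NPMI}(w_i, w_j) = \frac{\log \dfrac{P(w_i, w_j)}{P(w_i)\,P(w_j)}}{-\log P(w_i, w_j)}
$$

$$
\mathrm{coherence}(\text{topic}) = \frac{2}{N(N-1)} \sum_{i<j} \mathrm{NPMI}(w_i, w_j)
$$

The score lies between $-1$ and $1$: it approaches $-1$ when two words never co-occur, equals $0$ when they are independent, and equals $1$ when they always co-occur. As a worked example with made-up numbers, if $P(w_i) = P(w_j) = 0.1$ and $P(w_i, w_j) = 0.05$, then $\mathrm{NPMI} = \log(5) / (-\log 0.05) \approx 1.609 / 2.996 \approx 0.54$.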

<sub><sup>4</sup> Doogan and Buntine state that "an interpretable topic is one that can be easily labeled".</sub>

## Human Metrics of Topic Coherence


* Hoyle et al. obtain human evaluations for three different topic models
* They use __Prolific.co__ to recruit crowdworkers and collect the data via the __Qualtrics__ survey platform
* The authors apply two well established human evaluation tasks - __word intrusion__ and __topic rating__
* __Intrusion__ - a behavioural way to assess topic coherence
  * Each topic is represented by its top $N$ words plus one _"intruder"_ word, which has a low probability of belonging to that topic, but a high probability of belonging to a different topic
  * Topic coherence is judged by how well human annotators detect the _"intruder"_ word
* __Rating__ - human evaluators are presented with a topic and its corresponding $N$ most probable words; they have to rate the _"quality"_ of the topic using a three-point ordinal scale
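The word intrusion task described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' survey code: "low probability" is taken to mean the bottom half of the topic's ranking, "high probability in a different topic" is taken to mean that topic's top-$N$ words, and the names `pick_intruder` and `intrusion_accuracy` are hypothetical.

```python
import random


def pick_intruder(topic_word, vocab, topic, n=5, rng=None):
    """Build one intrusion item: the topic's top-n words plus one intruder."""
    rng = rng or random.Random(0)  # fixed seed so the sketch is repeatable
    probs = topic_word[topic]
    ranked = sorted(range(len(vocab)), key=lambda j: probs[j], reverse=True)
    top = ranked[:n]
    # Assumption: "low probability" = bottom half of this topic's ranking.
    low = set(ranked[len(ranked) // 2:])

    candidates = set()
    for other, other_probs in enumerate(topic_word):
        if other == topic:
            continue
        other_top = sorted(
            range(len(vocab)), key=lambda j: other_probs[j], reverse=True
        )[:n]
        # Keep words that are prominent elsewhere but unlikely in this topic.
        candidates.update(j for j in other_top if j in low)
    if not candidates:
        raise ValueError("no suitable intruder word found")

    intruder = vocab[rng.choice(sorted(candidates))]
    shown = [vocab[j] for j in top] + [intruder]
    rng.shuffle(shown)  # hide the intruder's position from annotators
    return shown, intruder


def intrusion_accuracy(choices, intruder):
    """Fraction of annotators who picked the true intruder."""
    return sum(choice == intruder for choice in choices) / len(choices)


vocab = ["goal", "match", "team", "vote", "party", "election"]
topic_word = [
    [0.40, 0.30, 0.20, 0.05, 0.03, 0.02],
    [0.02, 0.03, 0.05, 0.30, 0.20, 0.40],
]
shown, intruder = pick_intruder(topic_word, vocab, topic=0, n=3)
```

A coherent topic makes the intruder easy to spot, so a higher `intrusion_accuracy` across annotators is read as higher coherence.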

## Human Evaluation - Word Intrusion

![Figure 1a](./figures/human_evaluation_intrusion.png "Figure 1a")

## Human Evaluation - Topic Rating

![Figure 1b](./figures/human_evaluation_rating.png "Figure 1b")

## A Meta-Analysis of Neural Topic Modeling

While analysing a large body of NTM literature, related to forty different models, all claiming superior topic coherence scores, the authors have found the following:

* Variance in all areas subject to authors' analysis
* Pre-processing, which can significantly affect model quality, is inconsistent across datasets __within the same paper__ 30% of the time
* Some pre-processing details are often omitted, making it difficult or impossible to replicate the pipeline
* Datasets used to establish relationships between human annotations and automated metrics are different from the datasets used to train the models
* 40% of papers fail to clearly specify their model tuning and selection procedure
* NPMI evaluation itself is ambiguous:
  * Reference corpus is often not specified
  * Co-occurrence window size and top-$N$ most probable words per topic often not specified
  * Often not clear which __implementation__ of coherence metric is actually used
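To make the ambiguity concrete, here is a minimal NPMI sketch in which the reference corpus and the window size are explicit parameters. It is an illustration, not the implementation used in any of the surveyed papers: averaging over pairs, scoring never-co-occurring pairs as $-1$, and the function names are all assumptions.

```python
import math
from itertools import combinations


def window_counts(docs, window=10):
    """Count the windows containing each word and each word pair.

    window=None treats every document as a single window.
    """
    word_n, pair_n, total = {}, {}, 0
    for tokens in docs:
        size = len(tokens) if window is None else window
        # Documents shorter than the window contribute exactly one window.
        n_windows = max(1, len(tokens) - size + 1)
        for start in range(n_windows):
            seen = sorted(set(tokens[start:start + size]))
            total += 1
            for w in seen:
                word_n[w] = word_n.get(w, 0) + 1
            for pair in combinations(seen, 2):
                pair_n[pair] = pair_n.get(pair, 0) + 1
    return word_n, pair_n, total


def npmi_coherence(top_words, docs, window=10):
    """Average pairwise NPMI of a topic's top words over a reference corpus."""
    word_n, pair_n, total = window_counts(docs, window)
    scores = []
    for a, b in combinations(sorted(set(top_words)), 2):
        joint = pair_n.get((a, b), 0) / total
        if joint == 0:
            scores.append(-1.0)  # assumed convention: never co-occur
        elif joint == 1:
            scores.append(1.0)  # always co-occur; avoids dividing by zero
        else:
            p_a, p_b = word_n[a] / total, word_n[b] / total
            scores.append(math.log(joint / (p_a * p_b)) / -math.log(joint))
    return sum(scores) / len(scores)


# Tiny made-up reference corpus, already tokenized.
docs = [
    ["team", "wins", "match", "after", "late", "goal"],
    ["party", "wins", "election", "after", "close", "vote"],
    ["team", "scores", "goal"],
]
topic = ["team", "goal", "match"]
doc_level = npmi_coherence(topic, docs, window=None)
windowed = npmi_coherence(topic, docs, window=3)
```

On this toy corpus the same three words score higher with document-level co-occurrence than with a three-word window, which is why a reported NPMI value is hard to interpret without the reference corpus, window size, and top-$N$ setting.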

## Closing the Standardization Gap for Topic Models

* __Dataset selection__:
  * The authors choose Wikipedia (__Wikitext-103__) and NY Times (__LDC2008T19__) for both training and as a reference (evaluation) corpus
* __Pre-processing pipeline__:
  * The authors specify in great detail<sup>5</sup> all aspects of document processing, vocabulary creation, and vocabulary filtering
    * The use of Zipf's law for vocabulary pruning, i.e. removing tokens that appear in fewer than $2(0.02|D|)^{-log10}$ documents (where $|D|$ is the size of the corpus), is particularly enlightening
* __Model selection__:
  * Three different models are used for comparison purposes
    * Gibbs-LDA (G-LDA) is a "classic" LDA topic model, estimated using a partially collapsed Gibbs sampler as implemented in the Mallet Java package
    * Dirichlet-VAE (D-VAE) is a SOTA NTM
    * Embedded Topic Model (ETM) is similar to LDA, except it uses dense word embeddings in its generative model
    * A fixed computational budget per model is maintained, and the same hyperparameter selection process is applied to all three models<sup>6</sup>
    * Authors eliminate models with highly redundant topics as follows:
      * Models in which any of the top-5 words of one topic overlap with another
      * Models that have a topic uniqueness score above 0.7
* __Topic coherence metric__:
  * Authors use NPMI, estimated using the reference corpus with a 10-word window over the top-10 topic words
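The two redundancy checks used during model selection can be sketched as below. The topic uniqueness (TU) formula follows the usual definition (each top word contributes the reciprocal of the number of topics whose top words contain it), which is an assumption about what the authors computed. The function names are hypothetical, and the cut-off is deliberately left to the caller.

```python
def topic_uniqueness(topics):
    """Mean TU over topics; 1.0 means no top word is shared between topics."""
    counts = {}
    for words in topics:
        for w in set(words):
            counts[w] = counts.get(w, 0) + 1
    # Each word contributes 1 / (number of topics it appears in).
    per_topic = [
        sum(1 / counts[w] for w in words) / len(words) for words in topics
    ]
    return sum(per_topic) / len(per_topic)


def has_top_k_overlap(topics, k=5):
    """True if any two topics share at least one of their top-k words."""
    heads = [set(words[:k]) for words in topics]
    return any(
        heads[i] & heads[j]
        for i in range(len(heads))
        for j in range(i + 1, len(heads))
    )


# Each topic is its list of top words, most probable first (toy data).
distinct = [["goal", "match", "team"], ["vote", "party", "election"]]
redundant = [["goal", "match", "team"], ["goal", "vote", "party"]]

tu_distinct = topic_uniqueness(distinct)
tu_redundant = topic_uniqueness(redundant)
overlap_distinct = has_top_k_overlap(distinct, k=3)
overlap_redundant = has_top_k_overlap(redundant, k=3)
```

Under this definition, sharing even one word between two topics lowers TU and triggers the overlap check, so both signals point in the same direction for redundant models.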


<sub><sup>5</sup> See Appendix A.2 Preprocessing Details.</sub>

<sub><sup>6</sup> See Appendix A.3 Training Details.</sub>

## Results - Human Judgment Differs from Automated Metrics

![Figure 2](./figures/results_automated_vs_human_evaluation.png "Figure 2")

__Note:__ Coloured circles correspond to pairwise one-tailed significance tests between model scores at $\alpha = 0.05$; for example, the right-most circle at bottom right shows that human topic ratings for D-VAE are significantly higher than those for ETM for topics derived from Wikipedia.

## Results - Human Judgment Differs from Automated Metrics (2)

![Figure 3](./figures/results_human_evaluation_all_vs_filtered.png "Figure 3")

__Note:__ Mean human evaluation on the ratings and word intrusion tasks, after filtering out respondents who reported a lack of familiarity with the topic words.

## Discussion and Conclusion

* While automated evaluations (using NPMI) suggest D-VAE as a clear winner between models, human evaluation is more nuanced
  * Human judgments exhibit greater variability
  * "Good old" LDA seems to perform roughly on-par with SOTA deep learning models, i.e. D-VAE
* While human evaluators report familiarity with terms over 90% of the time for both G-LDA and ETM, D-VAE has a notably lower average term familiarity (70%)
  * This difference, together with the validation that filters out respondents who report a lack of term familiarity, indicates that D-VAE, by default, produces more _"esoteric"_ topics that are narrower in scope than those of the other models
* Automated coherence metrics, such as NPMI, merely provide good _"guidance"_, but should not be solely used to judge topic quality, nor to make superiority claims
  * __Human evaluations should be an integral part of evaluating topic models, and possibly all unsupervised models__
* Standardizing datasets, model tuning, and pre-processing pipelines, should be considered a high priority for topic modelling researchers
  * Hoyle et al. do a good job of formulating one such evaluation standard
* The findings, i.e. the lack of consistency and reproducibility, and Goodhart's law in action, are in a way reflective of the __general trend in machine learning, particularly where unsupervised learning is concerned.__

## Personal Reflections...

* We use topic models - currently LDA with Gibbs sampling, i.e. same as G-LDA specified in the paper - as part of our contextual advertising product, so this paper is highly relevant to us
* We 100% agree that text pre-processing is one of the most influential factors on model performance (and topic coherence)
  * We put constraints on minimum article/document length, we perform vocabulary pruning, and we use a custom list of stop words that was developed together with our product specialists
  * The idea from Hoyle et al. of constraining vocabulary according to power-law distribution makes a lot of sense
* We perform usability tests and topic evaluation together with our product specialists and sales reps
  * In this regard, we are also 100% in agreement with the paper
  * The use of formalized intrusion/rating tasks could be an improvement and could be applied on larger scale across Schibsted Marketing Services
* Overall a very important paper, and very happy that it got some spotlight at NeurIPS 2021, as this standardization gap in topic modelling and ML as a whole, is becoming increasingly important