# Topic Model Coherence

```
Author:
    Zach Wolpe
    zachcolinwolpe@gmail.com
    zachwolpe.com
```

_Topic Coherence_ is a relative metric that helps to:

- Identify performance discrepancies between topic models
- Find the optimal number of topics
    
    
    
### Coherence Calculation

 $$\text{Coherence} = \sum_{i<j} score(w_i, w_j)$$

Where the words $W=\{w_1, w_2, ..., w_n\}$ are ordered from most to least frequently appearing. The two leading coherence algorithms (UMass and UCI) essentially measure the same thing \cite{newman2010automatic}, and as such I have chosen to focus on UMass. The UMass _scores_ between $\{w_i, w_j\}$ combinations (which are summed after calculation) are computed as:

$$score_{UMass}^{k}(w_i, w_j | K) = \log \frac{D(w_i, w_j)+ \epsilon}{D(w_i)}$$

Where $K_i$ is the $i^{th}$ topic returned by the model and $w_i$ is more common than $w_j$. $D(w_i)$ is the document frequency of $w_i$ (the number of documents in which $w_i$ appears) and $D(w_i, w_j)$ is the co-document frequency (the number of documents that contain both $w_i$ and $w_j$). Their ratio estimates the conditional probability that $w_j$ will occur in a document, given $w_i$ is in the document, which alludes to some sort of dependency between key words within a topic \cite{mimno2011optimizing}. $\epsilon$ is a smoothing parameter, often simply set to $1$, that avoids taking $\log(0)$ when the two words never occur in the same document.
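The score and its sum over word pairs can be sketched in a few lines of NumPy. This is a minimal illustration rather than the `coherence_umass` module used later in this notebook: the function names `umass_score` and `umass_coherence` and the toy matrix are hypothetical, the input is assumed to be a dense document-term count array, and every scored word is assumed to appear in at least one document.

```python
import numpy as np

def umass_score(doc_term, i, j, epsilon=1.0):
    """UMass score for word columns i (more common) and j (less common)."""
    present = doc_term > 0                          # binary document-word incidence
    d_i = present[:, i].sum()                       # D(w_i): documents containing w_i
    d_ij = (present[:, i] & present[:, j]).sum()    # D(w_i, w_j): documents containing both
    return np.log((d_ij + epsilon) / d_i)

def umass_coherence(doc_term, top_words, epsilon=1.0):
    """Sum the pairwise score over all pairs i < j of a topic's top words."""
    total = 0.0
    for a in range(len(top_words)):
        for b in range(a + 1, len(top_words)):
            total += umass_score(doc_term, top_words[a], top_words[b], epsilon)
    return total

# Toy corpus (assumption for illustration): 4 documents x 3 words, raw counts.
toy = np.array([[2, 1, 0],
                [1, 0, 0],
                [3, 1, 1],
                [0, 0, 1]])

# Word indices listed from most to least common.
print(umass_coherence(toy, [0, 1, 2]))
```

Values closer to zero indicate that the top words of a topic tend to appear in the same documents.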

One concern arises: the number of $(w_i, w_j)$ combinations grows quadratically with the number of words, so it balloons to absurd quantities even for relatively small corpora with few words in the aggregated vocabulary.


As such, it is adequate to simply compute coherence for word pairs $\{(w_1,w_2),(w_2,w_3), ..., (w_{n-1},w_n)\}$ \cite{mimno2011optimizing}.
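To make the saving concrete, the sketch below counts both kinds of pairs. The helper names `all_pairs` and `adjacent_pairs` are hypothetical and only serve the illustration.

```python
from itertools import combinations

def all_pairs(words):
    """Every pair (w_i, w_j) with i < j: n * (n - 1) / 2 pairs."""
    return list(combinations(words, 2))

def adjacent_pairs(words):
    """Only neighbouring pairs (w_1, w_2), (w_2, w_3), ...: n - 1 pairs."""
    return list(zip(words[:-1], words[1:]))

words = ["w1", "w2", "w3", "w4", "w5"]
print(len(all_pairs(words)))       # 10
print(len(adjacent_pairs(words)))  # 4
```

For a vocabulary of 1000 words the same comparison gives 499500 pairs against 999.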

To improve the reliability of computed coherence, we can factor in the sampling distribution of documents by learning the spread of topic coherence for a given dataset. This is achieved by: randomizing the train-test split; computing LDA parameters; calculating coherence on the returned topics; repeating many times. 
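Once the repeated runs have produced a list of scores, the spread can be summarised with a mean, a standard deviation and a percentile interval. The sketch below assumes one scalar coherence score per run; the helper `summarise_scores` and the example values are hypothetical and are not results from this notebook.

```python
import numpy as np

def summarise_scores(scores, lower=2.5, upper=97.5):
    """Summarise repeated coherence scores (assumes one scalar per run)."""
    scores = np.asarray(scores, dtype=float)
    return {
        "mean": scores.mean(),
        "std": scores.std(ddof=1),   # sample standard deviation
        "interval": (np.percentile(scores, lower), np.percentile(scores, upper)),
    }

# Placeholder values purely for illustration, not measured coherence scores.
example_scores = [-1.0, -2.0, -3.0]
summary = summarise_scores(example_scores)
print(summary)
```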



### Model Parameters
Specify the number of topics to learn as well as the number of words to keep in the model vocabulary.

In [1]:
# Specify parameters

# number of topics
n_topics = 10

# size of vocab
n_words = 1000

In [2]:
%matplotlib inline
import sys, os
%env THEANO_FLAGS=device=cpu,floatX=float64
import theano

from collections import OrderedDict
from copy import deepcopy
import numpy as np
import pandas as pd
from time import time
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.datasets import fetch_20newsgroups
import matplotlib.pyplot as plt
import seaborn as sns
from theano import shared
import theano.tensor as tt
from theano.sandbox.rng_mrg import MRG_RandomStreams

import pymc3 as pm
from pymc3 import math as pmmath
from pymc3 import Dirichlet
from pymc3.distributions.transforms import t_stick_breaking
plt.style.use('seaborn-darkgrid')

env: THEANO_FLAGS=device=cpu,floatX=float64


## Dataset

In [3]:
# The number of words in the vocabulary (n_words) is set in the parameters cell above

print("Loading dataset...")
t0 = time()
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'))
data_samples = dataset.data
print("done in %0.3fs." % (time() - t0))

# Use tf (raw term count) features for LDA.
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=n_words,
                                stop_words='english')

t0 = time()
tf = tf_vectorizer.fit_transform(data_samples)
feature_names = tf_vectorizer.get_feature_names()
print("done in %0.3fs." % (time() - t0))

Loading dataset...
done in 2.191s.
Extracting tf features for LDA...
done in 2.999s.


## Train-Test Split

In [4]:
n_samples_tr = round(tf.shape[0] * 0.7) # train on 70%, test on the remaining 30%
n_samples_te = tf.shape[0] - n_samples_tr
docs_tr = tf[:n_samples_tr, :]
docs_te = tf[n_samples_tr:, :]
print('Number of docs for training = {}'.format(docs_tr.shape[0]))
print('Number of docs for testing = {}'.format(docs_te.shape[0]))


n_tokens = np.sum(docs_tr[docs_tr.nonzero()])
print('Number of tokens in training set = {}'.format(n_tokens))
print('Sparsity = {}'.format(
    len(docs_tr.nonzero()[0]) / float(docs_tr.shape[0] * docs_tr.shape[1])))

Number of docs for training = 7920
Number of docs for testing = 3394
Number of tokens in training set = 384502
Sparsity = 0.0255030303030303


# Randomize Test-Train Split

Randomizing the train-test split before each run introduces sampling variability, which makes it possible to assess the spread of a performance metric when it is computed repeatedly.

As such, I have wrapped the train-test split in a function that reshuffles the documents, to be called before each coherence calculation.

In [5]:
def Random_Train_Test_Split(tf, p=0.7):
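    """
    Return:
        dict holding a random train-test split of a sparse scipy matrix,
        plus the sample counts, token count and sparsity of the training set
    
    Parameters:
        tf: clean dataset, type = scipy.sparse.csr.csr_matrix
        p: proportion of documents to assign to training data
    """
    
    # Shuffle the rows (documents) before splitting
    from sklearn.utils import shuffle
    tf = shuffle(tf)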
    
    n_samples_tr = round(tf.shape[0] * p) 
    n_samples_te = tf.shape[0] - n_samples_tr
    
    docs_tr = tf[:n_samples_tr, :]
    docs_te = tf[n_samples_tr:, :]

    n_tokens = np.sum(docs_tr[docs_tr.nonzero()])
    Sparsity = len(docs_tr.nonzero()[0]) / float(docs_tr.shape[0] * docs_tr.shape[1])

    
    results = {
        'n_samples_tr': n_samples_tr,
        'n_samples_te': n_samples_te,
        'docs_tr': docs_tr,
        'docs_te': docs_te,
        'n_tokens': n_tokens,
        'Sparsity': Sparsity
    }
    
    return results
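The shuffle above is unseeded, so each call produces a different split. When a particular split needs to be reproduced, a seed can be fixed; `sklearn.utils.shuffle` accepts a `random_state` argument for this. The NumPy-only sketch below shows the same idea. The name `seeded_split` and the demo array are hypothetical; the demo uses a dense array, and row indexing of this kind should also apply to a CSR matrix.

```python
import numpy as np

def seeded_split(matrix, p=0.7, seed=0):
    """Shuffle rows with a fixed seed, then split into train and test parts."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(matrix.shape[0])       # shuffled row indices
    n_train = int(round(matrix.shape[0] * p))      # number of training rows
    return matrix[order[:n_train]], matrix[order[n_train:]]

# Small dense stand-in for the term-frequency matrix (assumption for illustration).
demo = np.arange(20).reshape(10, 2)
train, test = seeded_split(demo, p=0.7, seed=42)
print(train.shape, test.shape)
```

Changing the seed on every iteration keeps the variability between runs while still allowing any single run to be repeated.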

In [6]:
# duplicate matrix (copy, so that the original is left untouched)
test_tf = tf.copy()

# Model Coherence Comparison

In [5]:
from LearnParameters import LearnParameters 


# results_pymc3 = LearnParameters.train_pymc3(docs_te, docs_tr, n_samples_te, n_samples_tr, n_words, n_topics, n_tokens)

# results_Sklearn = LearnParameters.train_sklearn(docs_te, docs_tr, n_topics)

Average Loss = 2.3883e+06: 100%|██████████| 10000/10000 [03:45<00:00, 44.67it/s]
Finished [100%]: Average Loss = 2.3899e+06


Predictive log prob (pm3) = -6.10896207824732


# Coherence

For each number of topics, run the coherence algorithm 10 times. To further increase the spread of plausible values, 'reshuffle' the training/testing documents.

The resulting scores give a sampling distribution from which the characteristics of model coherence can be learnt.


The algorithm below:

- Randomly assigns a train-test split
- Learns model parameters
- Computes topic model coherence
- Returns a list of coherence scores for both PyMC3 and Sklearn
   

In [7]:
from time import time
from LearnParameters import LearnParameters 
from coherence_umass import coherence_umass
# coherence_pymc3 = coherence_umass.coherence(data_samples, results_pymc3['beta'], feature_names)


def coherence_list(tf, corpus, feature_names, n_topics, iters=10, n_words=1000, n_common_words=10, epsilon=1):
    """
    Return:
        A map of lists of coherence scores (for various train-test-splits & model parameter approximation)
    
    Parameters:
        REQUIRED:
            tf: unsplit, processed, term-freq matrix
            corpus: unprocessed corpus of data
            feature_names: list of most common n words (processed)
            n_topics: number of topics to learn in the model
        
        OPTIONAL:
            iters: number of times to iterate
            n_words: no. of words to keep in the model vocabulary
            n_common_words: number of 'most common words' to use to compute coherence
            epsilon: smoothing parameter for the coherence joint probability computation
    
    """
    
    coherence_py = []
    coherence_sk = []
    t0 = time()
        
    for i in range(iters):
        print('\n Iteration: ', i)
        
        random_TTS = Random_Train_Test_Split(tf)
        n_samples_tr = random_TTS['n_samples_tr']
        n_samples_te = random_TTS['n_samples_te']
        docs_te = random_TTS['docs_te']
        docs_tr = random_TTS['docs_tr']
        n_tokens = random_TTS['n_tokens']

        # PyMC3 (score against the corpus passed in, not the global data_samples)
        results_pymc3 = LearnParameters.train_pymc3(docs_te, docs_tr, n_samples_te, n_samples_tr, n_words, n_topics, n_tokens)
        coherence_pymc3 = coherence_umass.coherence(corpus, results_pymc3['beta'], feature_names, n_common_words=n_common_words, epsilon=epsilon)
        coherence_py.append(coherence_pymc3)

        # Sklearn
        results_Sklearn = LearnParameters.train_sklearn(docs_te, docs_tr, n_topics)
        coherence_Sklearn = coherence_umass.coherence(corpus, results_Sklearn['beta'], feature_names, n_common_words=n_common_words, epsilon=epsilon)
        coherence_sk.append(coherence_Sklearn)
    
    timer = time() - t0
    
    results = {
        'pymc3 coherence': coherence_py,
        'sklearn coherence': coherence_sk,
        'runtime': timer
    }
    
    return results
    

# 10 Topics

Perform the full calculation (from the train-test split through to computing coherence) for a model that learns $10$ topics.

Save the results as a _.pkl_ (pickle) file.

In [12]:
coherence_10Topics = coherence_list(tf, data_samples, feature_names, n_topics)


 Iteration:  0


Average Loss = 2.4493e+06: 100%|██████████| 10000/10000 [03:55<00:00, 42.42it/s]
Finished [100%]: Average Loss = 2.4521e+06


Predictive log prob (pm3) = -5.966144554906394

 Iteration:  1


Average Loss = 2.3275e+06: 100%|██████████| 10000/10000 [03:37<00:00, 46.06it/s]
Finished [100%]: Average Loss = 2.3283e+06


Predictive log prob (pm3) = -6.211724942216167

 Iteration:  2


Average Loss = 2.2927e+06: 100%|██████████| 10000/10000 [03:32<00:00, 50.18it/s]
Finished [100%]: Average Loss = 2.2916e+06


Predictive log prob (pm3) = -6.1850092251448

 Iteration:  3


Average Loss = 2.4134e+06: 100%|██████████| 10000/10000 [03:48<00:00, 43.72it/s]
Finished [100%]: Average Loss = 2.4146e+06


Predictive log prob (pm3) = -6.295613775368426

 Iteration:  4


Average Loss = 2.3276e+06: 100%|██████████| 10000/10000 [03:28<00:00, 47.93it/s]
Finished [100%]: Average Loss = 2.3284e+06


Predictive log prob (pm3) = -5.9001408152260035

 Iteration:  5


Average Loss = 2.2671e+06: 100%|██████████| 10000/10000 [03:29<00:00, 47.64it/s]
Finished [100%]: Average Loss = 2.2671e+06


Predictive log prob (pm3) = -6.23887260630289

 Iteration:  6


Average Loss = 2.445e+06: 100%|██████████| 10000/10000 [03:54<00:00, 42.60it/s]
Finished [100%]: Average Loss = 2.4456e+06


Predictive log prob (pm3) = -6.293714509667282

 Iteration:  7


Average Loss = 2.4209e+06: 100%|██████████| 10000/10000 [03:27<00:00, 47.97it/s]
Finished [100%]: Average Loss = 2.4211e+06


Predictive log prob (pm3) = -6.191410225644912

 Iteration:  8


Average Loss = 2.4431e+06: 100%|██████████| 10000/10000 [03:38<00:00, 47.39it/s]
Finished [100%]: Average Loss = 2.4402e+06


Predictive log prob (pm3) = -6.127480592896584

 Iteration:  9


Average Loss = 2.4462e+06: 100%|██████████| 10000/10000 [03:38<00:00, 45.76it/s]
Finished [100%]: Average Loss = 2.4487e+06


Predictive log prob (pm3) = -6.372324465737378


In [13]:
# Save the coherence results dict
import pickle
with open('coherence_10Topics.pkl', 'wb') as pickle_out:
    pickle.dump(coherence_10Topics, pickle_out)
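A saved file can be read back in a later session with `pickle.load`, which avoids repeating the multi-hour run. The round trip below is a self-contained sketch: the placeholder dict only mirrors the keys returned by `coherence_list`, and the temporary file name is hypothetical.

```python
import os
import pickle
import tempfile

# Placeholder with the same keys as the results dict (empty, for illustration only).
results = {"pymc3 coherence": [], "sklearn coherence": [], "runtime": 0.0}

# Write to a temporary location so the example does not touch real result files.
path = os.path.join(tempfile.mkdtemp(), "coherence_demo.pkl")
with open(path, "wb") as f:
    pickle.dump(results, f)

# Read the dict back.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored.keys())
```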

In [14]:
coherence_10Topics.keys()

dict_keys(['pymc3 coherence', 'sklearn coherence', 'runtime'])

In [17]:
coherence_10Topics['pymc3 coherence']
coherence_10Topics['runtime']

25862.239357948303

In [33]:
coherence_10Topics['runtime'] / 60 / 60 # hours

7.183955377207862

# 20 Topics

Perform the full calculation (from the train-test split through to computing coherence) for a model that learns $20$ topics.

Save the results as a _.pkl_ (pickle) file.

In [8]:
coherence_20Topics = coherence_list(tf, data_samples, feature_names, n_topics=20)


 Iteration:  0


Average Loss = 2.4369e+06: 100%|██████████| 10000/10000 [05:31<00:00, 34.56it/s]
Finished [100%]: Average Loss = 2.4352e+06


Predictive log prob (pm3) = -6.269980636601577

 Iteration:  1


Average Loss = 2.496e+06: 100%|██████████| 10000/10000 [05:34<00:00, 31.22it/s]
Finished [100%]: Average Loss = 2.4968e+06


Predictive log prob (pm3) = -6.293237777205665

 Iteration:  2


Average Loss = 2.4688e+06: 100%|██████████| 10000/10000 [05:14<00:00, 31.79it/s]
Finished [100%]: Average Loss = 2.4686e+06


Predictive log prob (pm3) = -6.046163368039833

 Iteration:  3


Average Loss = 2.4799e+06: 100%|██████████| 10000/10000 [06:14<00:00, 31.73it/s]
Finished [100%]: Average Loss = 2.4792e+06


Predictive log prob (pm3) = -6.0391100107719335

 Iteration:  4


Average Loss = 2.413e+06: 100%|██████████| 10000/10000 [05:44<00:00, 29.02it/s]
Finished [100%]: Average Loss = 2.4129e+06


Predictive log prob (pm3) = -6.0499330885094

 Iteration:  5


Average Loss = 2.4696e+06: 100%|██████████| 10000/10000 [05:44<00:00, 29.06it/s]
Finished [100%]: Average Loss = 2.4715e+06


Predictive log prob (pm3) = -6.015767036916389

 Iteration:  6


Average Loss = 2.4727e+06: 100%|██████████| 10000/10000 [05:20<00:00, 31.20it/s]
Finished [100%]: Average Loss = 2.472e+06


Predictive log prob (pm3) = -6.076921527381877

 Iteration:  7


Average Loss = 2.4952e+06: 100%|██████████| 10000/10000 [05:50<00:00, 28.50it/s]
Finished [100%]: Average Loss = 2.4968e+06


Predictive log prob (pm3) = -6.524978126225958

 Iteration:  8


Average Loss = 2.4118e+06: 100%|██████████| 10000/10000 [05:26<00:00, 30.62it/s]
Finished [100%]: Average Loss = 2.4119e+06


Predictive log prob (pm3) = -5.904192533848899

 Iteration:  9


Average Loss = 2.5094e+06: 100%|██████████| 10000/10000 [05:28<00:00, 30.47it/s]
Finished [100%]: Average Loss = 2.5093e+06


Predictive log prob (pm3) = -6.069923641639302


In [9]:
# Save the coherence results dict
import pickle
with open('coherence_20Topics.pkl', 'wb') as pickle_out:
    pickle.dump(coherence_20Topics, pickle_out)

# 15 Topics

Perform the full calculation (from the train-test split through to computing coherence) for a model that learns $15$ topics.

Save the results as a _.pkl_ (pickle) file.

In [12]:
coherence_15Topics = coherence_list(tf, data_samples, feature_names, n_topics=15)


 Iteration:  0


Average Loss = 2.423e+06: 100%|██████████| 10000/10000 [9:48:58<00:00,  3.53s/it]       
Finished [100%]: Average Loss = 2.4233e+06


Predictive log prob (pm3) = -6.218041493394998

 Iteration:  1


Average Loss = 2.328e+06: 100%|██████████| 10000/10000 [04:25<00:00, 37.72it/s]
Finished [100%]: Average Loss = 2.3265e+06


Predictive log prob (pm3) = -6.070801035059394

 Iteration:  2


Average Loss = 2.3713e+06: 100%|██████████| 10000/10000 [04:37<00:00, 36.00it/s]
Finished [100%]: Average Loss = 2.3719e+06


Predictive log prob (pm3) = -6.530034771986264

 Iteration:  3


Average Loss = 2.4807e+06: 100%|██████████| 10000/10000 [04:28<00:00, 37.22it/s]
Finished [100%]: Average Loss = 2.4772e+06


Predictive log prob (pm3) = -5.990515072589926

 Iteration:  4


Average Loss = 2.4457e+06: 100%|██████████| 10000/10000 [04:38<00:00, 35.86it/s]
Finished [100%]: Average Loss = 2.4433e+06


Predictive log prob (pm3) = -6.1529731952533995

 Iteration:  5


Average Loss = 2.3339e+06: 100%|██████████| 10000/10000 [04:53<00:00, 34.11it/s]
Finished [100%]: Average Loss = 2.3337e+06


Predictive log prob (pm3) = -6.172745250049275

 Iteration:  6


Average Loss = 2.5308e+06: 100%|██████████| 10000/10000 [15:57<00:00, 10.44it/s]
Finished [100%]: Average Loss = 2.5307e+06


Predictive log prob (pm3) = -6.100908215720134

 Iteration:  7


Average Loss = 2.4651e+06: 100%|██████████| 10000/10000 [05:16<00:00, 31.58it/s]
Finished [100%]: Average Loss = 2.4634e+06


Predictive log prob (pm3) = -6.176964461835821

 Iteration:  8


Average Loss = 2.4209e+06: 100%|██████████| 10000/10000 [04:27<00:00, 37.41it/s]
Finished [100%]: Average Loss = 2.4211e+06


Predictive log prob (pm3) = -6.176914747822377

 Iteration:  9


Average Loss = 2.4454e+06: 100%|██████████| 10000/10000 [04:26<00:00, 37.47it/s]
Finished [100%]: Average Loss = 2.448e+06


Predictive log prob (pm3) = -6.110171294190219


In [13]:
# Save the coherence results dict
import pickle
with open('coherence_15Topics.pkl', 'wb') as pickle_out:
    pickle.dump(coherence_15Topics, pickle_out)