# Learning Topics in The Daily Kos with the Hierarchical Dirichlet Process

The Hierarchical Dirichlet Process (HDP) is typically used for topic modeling when the number of topics is unknown, because it infers the number of topics from the data.

Let's explore the topics of the political blog The Daily Kos.

In [1]:
from microscopes.common.rng import rng
from microscopes.lda.definition import model_definition
from microscopes.lda.model import initialize
from microscopes.lda.testutil import toy_dataset
from microscopes.lda import model, runner
from collections import Counter

import itertools
import numpy as np
import pyLDAvis
import scipy as sp
import re
import simplejson

We will visualize these topics with [pyLDAvis](https://github.com/bmabey/pyLDAvis). To prepare the data, we'll write a function that collects the inputs pyLDAvis needs:

In [2]:
def get_vis_data(latent, num_docs, num_to_word):
    # Fix a vocabulary order so every per-term list lines up
    sorted_num_vocab = sorted(num_to_word.keys())

    # One row per topic, one column per term.
    # Note: prng is the global random state created in a later cell.
    topic_term_distribution = []
    for topic in latent.word_distribution(prng):
        topic_term_distribution.append([topic[word_id] for word_id in sorted_num_vocab])

    doc_topic_distribution = latent.document_distribution()

    doc_lengths = [len(doc) for doc in num_docs]

    vocab = [num_to_word[k] for k in sorted_num_vocab]
    assert all(map(len, vocab))  # no empty vocabulary entries

    # Corpus-wide count of each term, in vocabulary order
    ctr = Counter(list(itertools.chain.from_iterable(num_docs)))
    term_frequency = [ctr[num] for num in sorted_num_vocab]

    return {'topic_term_dists': topic_term_distribution,
            'doc_topic_dists': doc_topic_distribution,
            'doc_lengths': doc_lengths,
            'vocab': vocab,
            'term_frequency': term_frequency}
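
To make the bookkeeping in `get_vis_data` concrete, here is a minimal sketch of the parts that do not depend on the model: document lengths and corpus-wide term frequencies. The corpus `toy_docs` and the vocabulary `toy_id_to_word` are hypothetical stand-ins, not Daily Kos data.

```python
from collections import Counter
import itertools

# Hypothetical toy corpus: each document is a list of word ids.
toy_docs = [[0, 0, 2], [1, 2], [2, 2, 2, 1]]
toy_id_to_word = {0: "senate", 1: "poll", 2: "bush"}

sorted_ids = sorted(toy_id_to_word.keys())

# One length per document, in document order.
doc_lengths = [len(doc) for doc in toy_docs]

# Corpus-wide count of every word id, reported in vocabulary order.
counts = Counter(itertools.chain.from_iterable(toy_docs))
term_frequency = [counts[word_id] for word_id in sorted_ids]
vocab = [toy_id_to_word[word_id] for word_id in sorted_ids]

print(doc_lengths)     # [3, 2, 4]
print(term_frequency)  # [2, 2, 5]
print(vocab)           # ['senate', 'poll', 'bush']
```

The term frequencies sum to the total number of tokens, which is a quick sanity check on the real data as well.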

The Daily Kos data is stored locally in `docword.kos.txt`. We'll expand it into a list of documents, each of which is a list of word ids:

In [3]:
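with open("docword.kos.txt", "r") as f:
    # Skip the three header lines; each remaining line is "doc_id word_id count"
    kos_raw = [[int(tok) for tok in line.strip().split()] for line in f.readlines()][3:]

docs = []
# Group consecutive rows by document id
for _, grp in itertools.groupby(kos_raw, lambda x: x[0]):
    doc = []
    for _, word_id, word_cnt in grp:
        # Repeat each word id by its count, shifting ids to be 0-based
        doc += word_cnt * [word_id - 1]
    docs.append(doc)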
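
The same expansion can be tried on a few lines held in memory. This is a sketch only: `sample_lines` is a made-up excerpt that assumes the layout implied by the cell above (three header lines, then 1-based `doc_id word_id count` triples).

```python
import itertools

# Hypothetical excerpt in the layout assumed for docword.kos.txt.
sample_lines = [
    "2",      # header: number of documents
    "3",      # header: vocabulary size
    "4",      # header: number of (doc, word) pairs that follow
    "1 1 2",
    "1 3 1",
    "2 2 1",
    "2 3 3",
]

rows = [tuple(int(tok) for tok in line.split()) for line in sample_lines[3:]]

sample_docs = []
# groupby only merges adjacent rows, so rows must already be sorted by doc_id.
for _, group in itertools.groupby(rows, key=lambda row: row[0]):
    doc = []
    for _, word_id, count in group:
        doc.extend([word_id - 1] * count)  # shift to 0-based word ids
    sample_docs.append(doc)

print(sample_docs)  # [[0, 0, 2], [1, 2, 2, 2]]
```

Because `itertools.groupby` only groups adjacent rows, a file whose rows were not sorted by document id would need a sort first.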

The data is stored as indices of words in the vocabulary. For reference, we'll build dictionaries that map indices to words and words back to indices.

In [4]:
with open("vocab.kos.txt", "r") as f:
    kos_vocab = [word.strip() for word in f.readlines()]
id_to_word = {i: word for i, word in enumerate(kos_vocab)}
word_to_id = {word: i for i, word in enumerate(kos_vocab)}
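
As a small illustration of how these dictionaries are used, the sketch below decodes a document of word ids back into words. The three-word vocabulary and the `decode` helper are hypothetical and are not part of the notebook.

```python
# Hypothetical three-word vocabulary standing in for vocab.kos.txt.
toy_vocab = ["senate", "poll", "bush"]
toy_id_to_word = dict(enumerate(toy_vocab))
toy_word_to_id = {word: i for i, word in toy_id_to_word.items()}


def decode(doc, id_to_word):
    """Map a list of word ids back to their vocabulary strings."""
    return [id_to_word[word_id] for word_id in doc]


decoded = decode([0, 0, 2], toy_id_to_word)
print(decoded)  # ['senate', 'senate', 'bush']
```

Running `decode(docs[0], id_to_word)` on the real data is a quick way to confirm that the 0-based shift applied while parsing matches the vocabulary file.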

We must define our model before we initialize it. The definition needs the number of documents and the size of the vocabulary.

From there, we can initialize our model and set the hyperparameters.

In [5]:
N, V = len(docs), len(id_to_word)
defn = model_definition(N, V)
prng = rng()
kos_latent = initialize(defn, docs, prng, 
                        vocab_hp=1, 
                        dish_hps={"alpha": 0.1, "gamma": 0.1})
r = runner.runner(defn, docs, kos_latent)

print "number of docs:", N, "vocabulary size:", V

number of docs: 3430 vocabulary size: 6906


In [6]:
def check_params(alpha, gamma, eta, n=len(docs), v=len(id_to_word)):
    """Run 10 iterations with the given hyperparameters and print the topic count."""
    defn = model_definition(n, v)
    prng = rng()
    kos_latent = initialize(defn, docs, prng,
                            vocab_hp=eta,
                            dish_hps={"alpha": alpha, "gamma": gamma})
    r = runner.runner(defn, docs, kos_latent)
    r.run(prng, 10)
    print kos_latent.ntopics()
    return kos_latent
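
For intuition about why concentration hyperparameters such as `alpha` and `gamma` change the number of topics, the sketch below simulates a single-level Chinese restaurant process. This is a simplification of the HDP and is not the sampler microscopes uses; the function `crp_num_tables` and its parameters are hypothetical names chosen for illustration.

```python
import random


def crp_num_tables(num_customers, concentration, seed=0):
    """Simulate a Chinese restaurant process and return the number of tables."""
    rand = random.Random(seed)
    assignments = []  # table index chosen by each customer so far
    num_tables = 0
    for n in range(num_customers):
        # Customer n opens a new table with probability
        # concentration / (n + concentration).
        if rand.random() < concentration / (n + concentration):
            assignments.append(num_tables)
            num_tables += 1
        else:
            # Copying the table of a uniformly chosen earlier customer
            # picks each table in proportion to its current size.
            assignments.append(rand.choice(assignments))
    return num_tables


for concentration in (0.1, 1.0, 10.0):
    print(concentration, crp_num_tables(1000, concentration))
```

Larger concentration values tend to produce more tables, which suggests trying several values with `check_params` before committing to a long run.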

Given the size of the dataset, it'll take some time to run.

We'll run our model for 1000 iterations (40 steps of 25) and overwrite `daily-kos-summary.json` with the latest results every 25 iterations.

In [7]:
step_size = 25
steps = 40

for s in range(steps):
    r.run(prng, step_size)
    with open("daily-kos-summary.json", "w") as fp:
        simplejson.dump(get_vis_data(kos_latent, docs, id_to_word), fp=fp)
    print "iteration:", s * step_size, "perplexity:", kos_latent.perplexity(), "num topics:", kos_latent.ntopics()

iteration: 0 perplexity: 1862.04247596 num topics: 7
iteration: 25 perplexity: 1841.41178003 num topics: 7
iteration: 50 perplexity: 1822.86140826 num topics: 7
iteration: 75 perplexity: 1811.27293604 num topics: 7
iteration: 100 perplexity: 1801.32272475 num topics: 8
iteration: 125 perplexity: 1793.89339826 num topics: 7
iteration: 150 perplexity: 1784.2086994 num topics: 8
iteration: 175 perplexity: 1775.26781672 num topics: 9
iteration: 200 perplexity: 1771.0568549 num topics: 10
iteration: 225 perplexity: 1767.12303448 num topics: 9
iteration: 250 perplexity: 1765.20349732 num topics: 8
iteration: 275 perplexity: 1762.72988472 num topics: 9
iteration: 300 perplexity: 1761.0224189 num topics: 9
iteration: 325 perplexity: 1759.81190482 num topics: 10
iteration: 350 perplexity: 1757.97894703 num topics: 11
iteration: 375 perplexity: 1758.01621354 num topics: 9
iteration: 400 perplexity: 1756.7352334 num topics: 10
iteration: 425 perplexity: 1755.64540322 num topics: 9
iteration: 450 
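
The perplexity values in this log are easier to read with the usual definition in mind: the exponential of the negative mean log-likelihood per token, where lower is better. The sketch below implements that textbook definition. Whether `kos_latent.perplexity()` computes exactly this quantity is an assumption, and the helper name is hypothetical.

```python
import math


def perplexity_from_log_probs(token_log_probs):
    """Return exp of the negative mean per-token log-likelihood (natural log)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that assigns probability 1/4 to every token has perplexity close to 4.
uniform_four = perplexity_from_log_probs([math.log(1.0 / 4)] * 10)
print(uniform_four)
```

Under this definition, a model that guessed uniformly over the 6906-word vocabulary would score 6906.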

Since inference takes a long time, we saved our results to `daily-kos-summary.json` so that they can be visualized later without rerunning the model. We will now load that file and visualize our topics with [pyLDAvis](https://github.com/bmabey/pyLDAvis).

In [3]:
with open("daily-kos-summary.json", "r") as f:
    text = f.read()
    data = simplejson.loads(text)

prepared = pyLDAvis.prepare(**data)
pyLDAvis.display(prepared)
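
Before handing a reloaded dictionary to `pyLDAvis.prepare`, it can help to confirm that its pieces are mutually consistent. The sketch below is one way to do that under a few assumptions: `check_vis_data` is a hypothetical helper, the toy dictionary is made up, and the standard-library `json` module stands in for `simplejson`.

```python
import json


def check_vis_data(data, tol=1e-6):
    """Check that the lists line up and that each distribution sums to one."""
    num_terms = len(data["vocab"])
    num_docs = len(data["doc_lengths"])
    num_topics = len(data["topic_term_dists"])
    assert len(data["term_frequency"]) == num_terms
    assert len(data["doc_topic_dists"]) == num_docs
    for row in data["topic_term_dists"]:
        assert len(row) == num_terms
        assert abs(sum(row) - 1.0) < tol
    for row in data["doc_topic_dists"]:
        assert len(row) == num_topics
        assert abs(sum(row) - 1.0) < tol
    return num_topics, num_docs, num_terms


# Made-up data with 2 topics, 3 documents and 3 terms.
toy = {
    "topic_term_dists": [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]],
    "doc_topic_dists": [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]],
    "doc_lengths": [3, 2, 4],
    "vocab": ["senate", "poll", "bush"],
    "term_frequency": [2, 2, 5],
}

# Round-trip through JSON, as the notebook does with its summary file.
restored = json.loads(json.dumps(toy))
shape = check_vis_data(restored)
print(shape)  # (2, 3, 3)
```

Calling `check_vis_data(data)` on the loaded summary before `pyLDAvis.prepare(**data)` would surface a mismatched vocabulary or an unnormalized row early.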