# Learning Topics in The Daily Kos with the Hierarchical Dirichlet Process

The Hierarchical Dirichlet Process (HDP) is typically used for topic modeling when the number of topics is unknown in advance; the model infers that number from the data.

Let's explore the topics of the political blog The Daily Kos.

In [1]:
from microscopes.common.rng import rng
from microscopes.lda.definition import model_definition
from microscopes.lda.model import initialize
from microscopes.lda.testutil import toy_dataset
from microscopes.lda import model, runner
from collections import Counter

import itertools
import numpy as np
import pyLDAvis
import scipy as sp
import re
import simplejson

We will visualize these topics with [pyLDAvis](https://github.com/bmabey/pyLDAvis). To prep the data, we'll create a function that collects the topic, document, and vocabulary statistics pyLDAvis needs:

In [2]:
def get_vis_data(latent, num_docs, num_to_word):
    """Collect the inputs for pyLDAvis.prepare from the model state.

    Note: this uses the global `prng` created in a later cell, so it can
    only be called after the model has been initialized.
    """
    sorted_num_vocab = sorted(num_to_word.keys())

    # One row per topic, with columns ordered by word id
    topic_term_distribution = []
    for topic in latent.word_distribution(prng):
        topic_term_distribution.append([topic[word_id] for word_id in sorted_num_vocab])

    doc_topic_distribution = latent.document_distribution()

    doc_lengths = [len(doc) for doc in num_docs]

    # Every vocabulary entry must be a non-empty string
    vocab = [num_to_word[k] for k in sorted_num_vocab]
    assert all(map(len, vocab))

    # Corpus-wide count of each word id, in the same order as vocab
    ctr = Counter(itertools.chain.from_iterable(num_docs))
    term_frequency = [ctr[num] for num in sorted_num_vocab]

    return {'topic_term_dists': topic_term_distribution,
            'doc_topic_dists': doc_topic_distribution,
            'doc_lengths': doc_lengths,
            'vocab': vocab,
            'term_frequency': term_frequency}
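
Before handing such a dictionary to pyLDAvis, it can help to check that its pieces line up. The sketch below is not part of the notebook's pipeline: `check_vis_data` is a hypothetical helper, the toy values are made up, and the requirement that every topic and document distribution sums to 1 is our understanding of what pyLDAvis validates rather than something confirmed here.

```python
def check_vis_data(data, tol=1e-6):
    """Return a list of problems in a get_vis_data-style dict (empty if none).

    Hypothetical helper: it assumes rows of the two distribution tables
    should each sum to 1 and cover the whole vocabulary / corpus.
    """
    problems = []
    n_terms = len(data['vocab'])
    n_docs = len(data['doc_lengths'])

    if len(data['term_frequency']) != n_terms:
        problems.append('term_frequency and vocab differ in length')
    if len(data['doc_topic_dists']) != n_docs:
        problems.append('doc_topic_dists and doc_lengths differ in length')

    for k, row in enumerate(data['topic_term_dists']):
        if len(row) != n_terms:
            problems.append('topic %d does not cover the vocabulary' % k)
        elif abs(sum(row) - 1.0) > tol:
            problems.append('topic %d does not sum to 1' % k)

    for d, row in enumerate(data['doc_topic_dists']):
        if abs(sum(row) - 1.0) > tol:
            problems.append('document %d does not sum to 1' % d)

    return problems


# Made-up example: 2 topics, 3 words, 2 documents
toy_vis_data = {
    'topic_term_dists': [[0.5, 0.25, 0.25], [0.1, 0.1, 0.8]],
    'doc_topic_dists': [[0.9, 0.1], [0.3, 0.7]],
    'doc_lengths': [4, 6],
    'vocab': ['senate', 'war', 'poll'],
    'term_frequency': [3, 2, 5],
}
toy_problems = check_vis_data(toy_vis_data)
```

An empty list means the shapes and normalizations are consistent; anything else names the first things to look at.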

The Daily Kos data is stored locally as rows of document id, word id, and count. We'll expand it into a list of documents, each a list of word ids:

In [3]:
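with open("docword.kos.txt", "r") as f:
    # Skip the three header lines; each remaining row is (doc_id, word_id, count)
    kos_raw = [[int(tok) for tok in line.split()] for line in f][3:]

docs = []
# groupby only merges adjacent rows, so rows must already be ordered by doc_id
for _, grp in itertools.groupby(kos_raw, lambda x: x[0]):
    doc = []
    for _, word_id, word_cnt in grp:
        # Repeat each word id by its count, shifting to 0-based ids
        doc += word_cnt * [word_id - 1]
    docs.append(doc)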
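
To make the expansion concrete, here is the same logic on a tiny in-memory sample. The sample rows are invented for illustration; we assume the three skipped header lines hold the document count, vocabulary size, and number of nonzero entries, as in the UCI bag-of-words files.

```python
import itertools

# Hypothetical sample: three header lines, then "doc_id word_id count" rows
sample_lines = ["2", "4", "3",
                "1 1 2",
                "1 3 1",
                "2 4 3"]

sample_rows = [[int(tok) for tok in line.split()] for line in sample_lines][3:]

toy_docs = []
for _, grp in itertools.groupby(sample_rows, key=lambda row: row[0]):
    doc = []
    for _, word_id, word_cnt in grp:
        # Word ids in the file are 1-based; the model wants 0-based ids
        doc += word_cnt * [word_id - 1]
    toy_docs.append(doc)

# toy_docs is now [[0, 0, 2], [3, 3, 3]]
```

Document 1 contains word 1 twice and word 3 once, so it becomes `[0, 0, 2]`; document 2 contains word 4 three times, so it becomes `[3, 3, 3]`.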

Each document is stored as indices into the vocabulary. We'll build dictionaries mapping indices to words and words to indices for reference.

In [4]:
with open("vocab.kos.txt", "r") as f:
    kos_vocab = [word.strip() for word in f.readlines()]
id_to_word = {i: word for i, word in enumerate(kos_vocab)}
word_to_id = {word: i for i, word in enumerate(kos_vocab)}
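
With such a mapping, a document of word ids can be turned back into readable words. The following is a small sketch on a made-up vocabulary; `top_words` and the `toy_` names are hypothetical and not part of the notebook.

```python
from collections import Counter

# Made-up four-word vocabulary standing in for vocab.kos.txt
toy_vocab = ["budget", "election", "senate", "war"]
toy_id_to_word = {i: word for i, word in enumerate(toy_vocab)}

def top_words(doc, id_to_word, n=2):
    """Return the n most frequent words in a document of word ids."""
    counts = Counter(doc)
    return [(id_to_word[word_id], cnt) for word_id, cnt in counts.most_common(n)]

toy_doc = [0, 0, 2]
toy_top = top_words(toy_doc, toy_id_to_word)
# toy_top is [("budget", 2), ("senate", 1)]
```

The same call on `docs[0]` with `id_to_word` would give a quick look at what the first Daily Kos document is about.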

We must define our model before we initialize it. In this case, the definition needs the number of docs and the number of words.

From there, we can initialize our model and set the hyperparameters.

In [5]:
N, V = len(docs), len(id_to_word)
defn = model_definition(N, V)
prng = rng()
kos_latent = initialize(defn, docs, prng, 
                        vocab_hp=0.5, 
                        dish_hps={"alpha": 0.1, "gamma": 0.1})
r = runner.runner(defn, docs, kos_latent)

print "number of docs:", N, "vocabulary size:", V

number of docs: 3430 vocabulary size: 6906
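
The name `dish_hps` suggests the Chinese restaurant view of the HDP, in which concentration parameters control how readily new clusters are opened. We assume here that `alpha` and `gamma` play that role; the notebook does not spell it out. The sketch below simulates a plain Chinese restaurant process (not the full HDP, and not the library's sampler) to show how a concentration parameter changes the number of clusters. `crp_table_sizes` is a hypothetical name.

```python
import random

def crp_table_sizes(n_customers, concentration, seed=0):
    """Simulate a Chinese restaurant process and return the table sizes."""
    rand = random.Random(seed)
    assignments = []  # table index of each customer so far
    sizes = []        # number of customers at each table
    for i in range(n_customers):
        # Customer i opens a new table with probability c / (i + c)
        if rand.random() * (i + concentration) < concentration:
            assignments.append(len(sizes))
            sizes.append(1)
        else:
            # Copying a uniformly chosen earlier customer picks a table
            # with probability proportional to its current size
            table = assignments[rand.randrange(i)]
            assignments.append(table)
            sizes[table] += 1
    return sizes

few_tables = crp_table_sizes(200, 0.1)
many_tables = crp_table_sizes(200, 10.0)
```

Comparing `len(few_tables)` with `len(many_tables)` over several seeds shows the general tendency: larger concentration values tend to produce more clusters, which is the intuition behind keeping both hyperparameters small when few topics are expected.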


Given the size of the dataset, it'll take some time to run.

We'll run our model for 1000 iterations and save our results every 10 iterations.

In [6]:
step_size = 10
steps = 100

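for i in range(steps):
    r.run(prng, step_size)
    # Overwrite the summary after every block of step_size iterations
    with open("daily-kos-summary.json", "w") as fp:
        simplejson.dump(get_vis_data(kos_latent, docs, id_to_word), fp=fp)
    # The label is i * step_size, so the first line printed reads "iteration: 0"
    print "iteration:", i * step_size, "perplexity:", kos_latent.perplexity(), "num topics:", kos_latent.ntopics()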

iteration: 0 perplexity: 1675.55909212 num topics: 11
iteration: 10 perplexity: 1630.75548573 num topics: 11
iteration: 20 perplexity: 1619.42292248 num topics: 11
iteration: 30 perplexity: 1612.05211709 num topics: 12
iteration: 40 perplexity: 1608.09166416 num topics: 13
iteration: 50 perplexity: 1604.36821919 num topics: 12
iteration: 60 perplexity: 1601.07012597 num topics: 16
iteration: 70 perplexity: 1598.9560849 num topics: 14
iteration: 80 perplexity: 1597.29881413 num topics: 14
iteration: 90 perplexity: 1594.30666062 num topics: 15
iteration: 100 perplexity: 1592.87141848 num topics: 16
iteration: 110 perplexity: 1590.75743761 num topics: 17
iteration: 120 perplexity: 1589.87675289 num topics: 15
iteration: 130 perplexity: 1587.88430103 num topics: 15
iteration: 140 perplexity: 1586.88827287 num topics: 16
iteration: 150 perplexity: 1585.28495676 num topics: 16
iteration: 160 perplexity: 1584.17485032 num topics: 17
iteration: 170 perplexity: 1583.18147984 num topics: 17
...
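
The perplexity values in this log fall as sampling proceeds, which indicates a better fit. Under the standard definition, perplexity is the exponential of the negative average log-likelihood per token; we have not checked how `kos_latent.perplexity()` computes its value, so the sketch below illustrates the usual formula rather than the library's implementation. The function name and the toy inputs are hypothetical.

```python
import math

def perplexity_from_log_probs(token_log_probs):
    """exp(-mean log-probability) over a sequence of per-token log-probs."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# A model that assigns probability 1/4 to every token has perplexity 4:
# it is as uncertain as a uniform choice among four words.
uniform_log_probs = [math.log(0.25)] * 10
uniform_perplexity = perplexity_from_log_probs(uniform_log_probs)
```

Read this way, a perplexity near 1600 on a vocabulary of 6906 words means the model is roughly as uncertain as a uniform choice among about 1600 words per token.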

Now that we've finished inference, we will save the final summary and visualize our topics with [pyLDAvis](https://github.com/bmabey/pyLDAvis).

In [7]:
# Build the summary once and reuse it for both the saved file and the plot
data = get_vis_data(kos_latent, docs, id_to_word)
with open("daily-kos-summary.json", "w") as fp:
    simplejson.dump(data, fp=fp)

prepared = pyLDAvis.prepare(**data)
pyLDAvis.display(prepared)
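
Because the summary is written to disk, the visualization can be rebuilt later without rerunning inference. The sketch below reloads such a file and checks that the expected keys are present. It uses the standard-library `json` module instead of `simplejson`, and `load_summary` and the round-trip file are hypothetical additions for illustration.

```python
import json
import os
import tempfile

REQUIRED_KEYS = ('topic_term_dists', 'doc_topic_dists', 'doc_lengths',
                 'vocab', 'term_frequency')

def load_summary(path):
    """Load a saved summary and fail early if a required key is missing."""
    with open(path, "r") as fp:
        data = json.load(fp)
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise KeyError("summary is missing: %s" % ", ".join(missing))
    return data

# Round-trip a made-up summary through a temporary file
toy_summary = {
    'topic_term_dists': [[0.5, 0.5]],
    'doc_topic_dists': [[1.0]],
    'doc_lengths': [2],
    'vocab': ['senate', 'war'],
    'term_frequency': [1, 1],
}
tmp_dir = tempfile.mkdtemp()
tmp_path = os.path.join(tmp_dir, "toy-summary.json")
with open(tmp_path, "w") as fp:
    json.dump(toy_summary, fp)
reloaded = load_summary(tmp_path)
```

In the notebook, the reloaded dictionary from `load_summary("daily-kos-summary.json")` could then be passed to `pyLDAvis.prepare` in place of `data`.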

