# Learning Topics in The Daily Kos with the Hierarchical Dirichlet Process

The Hierarchical Dirichlet Process (HDP) is typically used for topic modeling when the number of topics is unknown.

Let's explore the topics of the political blog, The Daily Kos.

In [8]:
from microscopes.common.rng import rng
from microscopes.lda.definition import model_definition
from microscopes.lda.model import initialize
from microscopes.lda.testutil import toy_dataset
from microscopes.lda import model, runner
from collections import Counter

import itertools
import numpy as np
import pyLDAvis
import scipy as sp
import re
import simplejson

We will visualize these topics with [pyLDAvis](https://github.com/bmabey/pyLDAvis).  To prep the data, we'll create a function:

In [9]:
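def get_vis_data(latent, num_docs, num_to_word):
    """Collect the inputs that pyLDAvis.prepare expects from a fitted model."""
    sorted_num_vocab = sorted(num_to_word.keys())

    # One row per topic, with columns ordered by word id.
    # Note: this relies on the global `prng`, which is created in a later cell.
    topic_term_distribution = []
    for topic in latent.word_distribution(prng):
        topic_term_distribution.append([topic[word_id] for word_id in sorted_num_vocab])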
        
    doc_topic_distribution = latent.document_distribution()
    
    doc_lengths = [len(doc) for doc in num_docs]
    
    vocab = [num_to_word[k] for k in sorted_num_vocab]
    assert all(map(len, vocab))
    
    ctr = Counter(list(itertools.chain.from_iterable(num_docs)))
    term_frequency = [ctr[num] for num in sorted_num_vocab]
    
    return {'topic_term_dists': topic_term_distribution, 
            'doc_topic_dists': doc_topic_distribution,
            'doc_lengths': doc_lengths,
            'vocab': vocab,
            'term_frequency': term_frequency}
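
Before handing this dictionary to pyLDAvis, it can help to check that its pieces agree with each other. The following is a minimal sketch, not part of the original notebook: `check_vis_data` is a hypothetical helper, and the toy dictionary is made up for illustration. It assumes each distribution row should sum to 1 and that the row and column counts should match the vocabulary and document counts.

```python
def check_vis_data(vis_data, tol=1e-6):
    """Hypothetical helper: sanity-check a dict shaped like get_vis_data's output."""
    n_terms = len(vis_data['vocab'])
    n_docs = len(vis_data['doc_lengths'])
    n_topics = len(vis_data['topic_term_dists'])

    # One term frequency per vocabulary entry, one topic mixture per document.
    assert len(vis_data['term_frequency']) == n_terms
    assert len(vis_data['doc_topic_dists']) == n_docs

    # Every topic is a distribution over the whole vocabulary.
    for row in vis_data['topic_term_dists']:
        assert len(row) == n_terms
        assert abs(sum(row) - 1.0) < tol

    # Every document is a distribution over all topics.
    for row in vis_data['doc_topic_dists']:
        assert len(row) == n_topics
        assert abs(sum(row) - 1.0) < tol

    return n_topics, n_docs, n_terms


# Made-up toy data: 2 topics, 2 documents, 3 vocabulary terms.
toy_vis_data = {
    'topic_term_dists': [[0.5, 0.3, 0.2], [0.1, 0.1, 0.8]],
    'doc_topic_dists': [[0.9, 0.1], [0.25, 0.75]],
    'doc_lengths': [4, 2],
    'vocab': ['senate', 'poll', 'iraq'],
    'term_frequency': [3, 2, 1],
}
shape = check_vis_data(toy_vis_data)
print(shape)
```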

The Daily Kos data is stored locally. We'll process it into a list of lists, with one list of word ids per document:

In [10]:
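with open("docword.kos.txt", "r") as f:
    # Skip the three header lines; every remaining row is: doc_id word_id count
    kos_raw = [map(int, _.strip().split()) for _ in f.readlines()][3:]

docs = []
for _, grp in itertools.groupby(kos_raw, lambda x: x[0]):
    doc = []
    for _, word_id, word_cnt in grp:
        # Repeat each word id by its count, shifting ids to be 0-based
        doc += word_cnt * [word_id - 1]
    docs.append(doc)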
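
To make the expansion concrete, here is a small sketch of the same grouping logic on made-up rows instead of the real file. The `toy_rows` values are invented for illustration. Note that `itertools.groupby` only groups consecutive rows, so this assumes the rows are already ordered by document id.

```python
import itertools

# Made-up rows in the (doc_id, word_id, count) layout, already parsed to ints.
# Ids are 1-based, which is why the loop subtracts 1 from each word id.
toy_rows = [
    [1, 1, 2],
    [1, 3, 1],
    [2, 2, 3],
]

toy_docs = []
for _, grp in itertools.groupby(toy_rows, key=lambda row: row[0]):
    doc = []
    for _, word_id, word_cnt in grp:
        # Repeat the 0-based word id once per occurrence.
        doc += word_cnt * [word_id - 1]
    toy_docs.append(doc)

print(toy_docs)  # [[0, 0, 2], [1, 1, 1]]
```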

The data is stored as indices of words in the vocabulary.  We'll turn these indices and words into dictionaries as a reference.

In [11]:
with open("vocab.kos.txt", "r") as f:
    kos_vocab = [word.strip() for word in f.readlines()]
id_to_word = {i: word for i, word in enumerate(kos_vocab)}
word_to_id = {word: i for i, word in enumerate(kos_vocab)}
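
As a quick illustration of how the two dictionaries are used, the sketch below decodes a document of word ids back into words. The three-word vocabulary and the document are made up; the real `id_to_word` and `docs` would be used the same way.

```python
# Made-up three-word vocabulary standing in for the real vocabulary file.
toy_vocab = ["senate", "poll", "iraq"]
toy_id_to_word = {i: word for i, word in enumerate(toy_vocab)}
toy_word_to_id = {word: i for i, word in enumerate(toy_vocab)}

# A document is a list of 0-based word ids, one entry per token.
toy_doc = [0, 0, 2]
decoded = [toy_id_to_word[word_id] for word_id in toy_doc]
print(decoded)  # ['senate', 'senate', 'iraq']

# The two mappings are inverses of each other.
encoded = [toy_word_to_id[word] for word in decoded]
print(encoded == toy_doc)  # True
```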

We must define our model before we initialize it. The definition needs the number of documents and the size of the vocabulary.

From there, we can initialize our model and set the hyperparameters.

In [35]:
N, V = len(docs), len(id_to_word)
defn = model_definition(N, V)
prng = rng()
kos_latent = initialize(defn, docs, prng, 
                        vocab_hp=1, 
                        dish_hps={"alpha": 0.01, "gamma": 0.01})
r = runner.runner(defn, docs, kos_latent)

print "number of docs:", N, "vocabulary size:", V

number of docs: 3430 vocabulary size: 6906
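
For intuition about why the number of topics need not be fixed in advance, here is a rough sketch of a Chinese restaurant process, a standard building block of Dirichlet process models. It is not the microscopes sampler. Reading `alpha` and `gamma` above as concentration parameters of this kind follows the usual HDP convention and is an assumption here, and `crp_table_counts` is a hypothetical helper written for this illustration.

```python
import random


def crp_table_counts(n_customers, concentration, seed=0):
    """Simulate table sizes under a Chinese restaurant process (illustrative only)."""
    rand = random.Random(seed)
    counts = []  # counts[k] is the number of customers seated at table k
    for n in range(n_customers):
        # Total weight is concentration (new table) plus n (existing customers).
        u = rand.random() * (n + concentration)
        if u < concentration:
            counts.append(1)  # open a new table
        else:
            u -= concentration
            for k, c in enumerate(counts):
                if u < c:
                    counts[k] += 1  # join table k with probability proportional to its size
                    break
                u -= c
            else:
                counts[-1] += 1  # guard against floating point rounding
    return counts


# Larger concentration values tend to produce more tables (clusters).
for concentration in (0.01, 1.0, 10.0):
    tables = crp_table_counts(1000, concentration)
    print(concentration, len(tables))
```

The number of occupied tables is not chosen up front; it grows with the data at a rate governed by the concentration parameter.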


Given the size of the dataset, it'll take some time to run.

We'll run our model for 1000 iterations (40 steps of 25 iterations each) and save our results every 25 iterations.

In [13]:
step_size = 25
steps = 40

for s in range(steps):
    r.run(prng, step_size)
    with open("daily-kos-summary-%d.json" % s, "w") as fp:
        simplejson.dump(get_vis_data(kos_latent, docs, id_to_word), fp=fp)
    print "iteration:", s * step_size, "perplexity:", kos_latent.perplexity(), "num topics:", kos_latent.ntopics()

iteration: 0 perplexity: 1613.4726769 num topics: 14
iteration: 25 perplexity: 1585.10452919 num topics: 14
iteration: 50 perplexity: 1572.77452758 num topics: 14
iteration: 75 perplexity: 1565.08170834 num topics: 15
iteration: 100 perplexity: 1559.46222771 num topics: 14
iteration: 125 perplexity: 1555.18427659 num topics: 16
iteration: 150 perplexity: 1551.79867809 num topics: 16
iteration: 175 perplexity: 1548.85912113 num topics: 16
iteration: 200 perplexity: 1547.05724419 num topics: 16
iteration: 225 perplexity: 1543.476413 num topics: 18
iteration: 250 perplexity: 1541.5465986 num topics: 17
iteration: 275 perplexity: 1540.05496137 num topics: 17
iteration: 300 perplexity: 1536.55069285 num topics: 17
iteration: 325 perplexity: 1534.88067832 num topics: 17
iteration: 350 perplexity: 1532.66098019 num topics: 17
iteration: 375 perplexity: 1531.87071545 num topics: 17
iteration: 400 perplexity: 1529.69302106 num topics: 17
iteration: 425 perplexity: 1528.07337694 num topics: 19
...
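
The perplexity values in the log above fall as sampling proceeds, and lower is better. How `kos_latent.perplexity()` is computed is not shown here, so the sketch below uses the standard definition as an assumption: the exponential of the negative mean log probability per token. The `perplexity` function is a hypothetical stand-in, not the library's method.

```python
import math


def perplexity(token_log_probs):
    """Standard perplexity: exp of the negative mean log probability per token."""
    n_tokens = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n_tokens)


# A model that spreads probability uniformly over V words has perplexity V.
vocab_size = 4
uniform_log_probs = [math.log(1.0 / vocab_size)] * 10
uniform_perplexity = perplexity(uniform_log_probs)
print(uniform_perplexity)  # close to 4.0
```

Under this definition, a perplexity near 1500 on a vocabulary of 6906 words means the model is roughly as uncertain as a uniform choice among about 1500 words per token.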

Now that we've finished inference, we will load the last saved summary and visualize our topics with [pyLDAvis](https://github.com/bmabey/pyLDAvis).

In [15]:
with open("daily-kos-summary-39.json", "r") as f:
    text = f.read()
    data = simplejson.loads(text)

prepared = pyLDAvis.prepare(**data)
pyLDAvis.display(prepared)
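
The saved summary can also be inspected without the interactive display. The following is a small sketch that lists the highest-probability words for each topic. `top_words` is a hypothetical helper, and the toy distributions and vocabulary are made up for illustration.

```python
def top_words(topic_term_dists, vocab, n=10):
    """Return the n highest-probability words for each topic (hypothetical helper)."""
    result = []
    for dist in topic_term_dists:
        # Rank vocabulary positions by their probability under this topic.
        ranked = sorted(range(len(vocab)), key=lambda i: dist[i], reverse=True)
        result.append([vocab[i] for i in ranked[:n]])
    return result


# Made-up data: 2 topics over a 3-word vocabulary.
toy_vocab = ["senate", "poll", "iraq"]
toy_dists = [[0.5, 0.3, 0.2], [0.1, 0.2, 0.7]]
toy_top = top_words(toy_dists, toy_vocab, n=2)
print(toy_top)  # [['senate', 'poll'], ['iraq', 'poll']]
```

With the loaded summary, the equivalent call would be `top_words(data['topic_term_dists'], data['vocab'])`.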

