# Learning Topics in The Daily Kos with the Hierarchical Dirichlet Process

The Hierarchical Dirichlet Process (HDP) is typically used for topic modeling when the number of topics is unknown.

Let's explore the topics of the political blog The Daily Kos.
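
For some intuition about how a Dirichlet process prior lets the number of clusters grow with the data, here is a minimal sketch of the Chinese restaurant process, the single-level building block behind the HDP. It is illustrative only and is not how the `microscopes` library performs inference; the function name `sample_crp` and the concentration value `alpha=1.0` are assumptions made for this example.

```python
import random

def sample_crp(n_customers, alpha, seed=0):
    """Seat customers one at a time; return the list of table sizes."""
    rand = random.Random(seed)
    tables = []  # tables[k] = number of customers seated at table k
    for n in range(n_customers):
        # Customer n joins table k with probability tables[k] / (n + alpha)
        # and opens a new table with probability alpha / (n + alpha).
        u = rand.uniform(0, n + alpha)
        cumulative = 0.0
        for k, size in enumerate(tables):
            cumulative += size
            if u < cumulative:
                tables[k] += 1
                break
        else:
            # No existing table was chosen, so open a new one.
            tables.append(1)
    return tables

table_sizes = sample_crp(n_customers=1000, alpha=1.0)
print(len(table_sizes))  # number of clusters that emerged
```

The number of tables is not fixed in advance; it is an outcome of the sampling, which is the property that makes this family of priors useful when the number of topics is unknown.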

In [1]:
from microscopes.common.rng import rng
from microscopes.lda.definition import model_definition
from microscopes.lda.model import initialize
from microscopes.lda.testutil import toy_dataset
from microscopes.lda import model, runner
from microscopes.lda.utils import 
from collections import Counter

import itertools
import numpy as np
import pyLDAvis
import scipy as sp
import re
import simplejson

Before loading the documents, we load maps from the vocabulary words to the word IDs and vice versa.

In [21]:
with open("vocab.kos.txt", "r") as f:
    kos_vocab = [word.strip() for word in f.readlines()]
id_to_word = {i: word for i, word in enumerate(kos_vocab)}
word_to_id = {word: i for i, word in enumerate(kos_vocab)}

The Daily Kos data is stored locally. The format is as follows:

```
The format of the docword.kos.txt file is 3 header lines, followed by
NNZ triples:
---
D
W
NNZ
docID wordID count
docID wordID count
docID wordID count
docID wordID count
...
docID wordID count
docID wordID count
docID wordID count
---
```
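
To make the format concrete, here is a small sketch that parses a made-up example in the same layout. The sample contents are invented for illustration and are not real Daily Kos data, and the grouping step assumes the triples are sorted by docID.

```python
import itertools

# A tiny made-up example in the docword format:
# 2 documents, 3 vocabulary words, 4 nonzero counts.
sample = """2
3
4
1 1 2
1 3 1
2 2 1
2 3 3
"""

lines = sample.strip().splitlines()
n_docs, n_words, nnz = (int(x) for x in lines[:3])
triples = [tuple(int(x) for x in line.split()) for line in lines[3:]]

# The NNZ header should match the number of triples that follow.
assert len(triples) == nnz

# Group the triples by docID (assumes the triples are sorted by docID).
counts_by_doc = {
    doc_id: {word_id: count for _, word_id, count in group}
    for doc_id, group in itertools.groupby(triples, key=lambda t: t[0])
}
print(counts_by_doc)
```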

We'll process the data into a list of lists of words (using the vocabulary maps):

In [23]:
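with open("docword.kos.txt", "r") as f:
    # Skip the 3 header lines (D, W, NNZ); every other line is "docID wordID count".
    kos_raw = [list(map(int, line.split())) for line in f.readlines()[3:]]

docs = []
# groupby only merges consecutive rows, so this relies on the file being sorted by docID.
for _, grp in itertools.groupby(kos_raw, lambda x: x[0]):
    doc = []
    for _, word_id, word_cnt in grp:
        # Word IDs in the file are 1-based; id_to_word is 0-based.
        doc += word_cnt * [id_to_word[word_id - 1]]
    docs.append(doc)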
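
A quick sanity check on the processed data is to look at document lengths and the most frequent words. The sketch below uses a small stand-in `docs` list with made-up words so that it runs on its own; with the real data, the same two lines would be applied to the `docs` list built above.

```python
from collections import Counter

# Stand-in for the real `docs` list (made-up words, for illustration only).
docs = [["bush", "kerry", "poll", "poll"], ["iraq", "war", "bush"]]

# Number of word tokens in each document.
doc_lengths = [len(doc) for doc in docs]

# Corpus-wide token counts.
word_counts = Counter(word for doc in docs for word in doc)

print(doc_lengths)
print(word_counts.most_common(2))
```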

We must define our model before we initialize it. In this case, we need the number of docs and the number of words in the vocabulary.

From there, we can initialize our model and set the hyperparameters.

In [26]:
N, V = len(docs), len(id_to_word)
defn = model_definition(N, V)
prng = rng()
kos_state = initialize(defn, docs, prng, 
                        vocab_hp=1, 
                        dish_hps={"alpha": 0.1, "gamma": 0.1})
r = runner.runner(defn, docs, kos_state)

print "number of docs:", N, "vocabulary size:", V

number of docs: 3430 vocabulary size: 6906


Given the size of the dataset, it'll take some time to run.

We'll run our model for 1000 iterations and save our results every 20 iterations.

In [32]:
step_size = 20
steps = 50

for s in range(steps):
    r.run(prng, step_size)
    with open("daily-kos-summary.json", "w") as fp:
        simplejson.dump(kos_state.pyldavis_data(), fp=fp)
    print "iteration:", s * step_size, "perplexity:", kos_state.perplexity(), "num topics:", kos_state.ntopics()

iteration: 0 perplexity: 1614.14784735 num topics: 12
iteration: 20 perplexity: 1606.02950398 num topics: 11
iteration: 40 perplexity: 1601.56676276 num topics: 12
iteration: 60 perplexity: 1599.25875685 num topics: 12
iteration: 80 perplexity: 1597.83294351 num topics: 11
iteration: 100 perplexity: 1596.15490749 num topics: 13
iteration: 120 perplexity: 1594.70170534 num topics: 12
iteration: 140 perplexity: 1594.03328402 num topics: 12
iteration: 160 perplexity: 1593.23815834 num topics: 13
iteration: 180 perplexity: 1593.03490288 num topics: 12
iteration: 200 perplexity: 1592.60149631 num topics: 12
iteration: 220 perplexity: 1591.85805902 num topics: 12
iteration: 240 perplexity: 1591.39032136 num topics: 12
iteration: 260 perplexity: 1590.17425812 num topics: 14
iteration: 280 perplexity: 1589.90027366 num topics: 12
iteration: 300 perplexity: 1589.86805157 num topics: 11
iteration: 320 perplexity: 1588.9906641 num topics: 13
iteration: 340 perplexity: 1588.63435982 num topics: 12
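
Perplexity is commonly defined as the exponential of the negative average per-word log-likelihood, so lower values mean the model assigns higher probability to the observed words. The sketch below computes it from a hypothetical list of per-word probabilities. It illustrates the standard definition and is not necessarily how `kos_state.perplexity()` is implemented.

```python
import math

def perplexity(word_probs):
    """Return exp(-mean(log p(w))) over all observed word tokens."""
    total_log_prob = sum(math.log(p) for p in word_probs)
    return math.exp(-total_log_prob / len(word_probs))

# Hypothetical probabilities: a model that is uniform over 4 words
# assigns each token probability 0.25, giving a perplexity of about 4.
uniform_probs = [0.25] * 8
print(perplexity(uniform_probs))
```

By this definition, a model that spread probability uniformly over the 6906-word vocabulary would score 6906, which gives a rough reference point for the values printed above.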

[pyLDAvis](https://github.com/bmabey/pyLDAvis) is a Python implementation of the [LDAvis](https://github.com/cpsievert/LDAvis) tool created by [Carson Sievert](https://github.com/cpsievert). 

> LDAvis is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data. The package extracts information from a fitted LDA topic model to inform an interactive web-based visualization.



In [35]:
with open("daily-kos-summary.json", "r") as f:
    text = f.read()
    data = simplejson.loads(text)

prepared = pyLDAvis.prepare(**data)
pyLDAvis.display(prepared)
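
`pyLDAvis.prepare` takes five core inputs: `topic_term_dists`, `doc_topic_dists`, `doc_lengths`, `vocab`, and `term_frequency`. Since the saved JSON is unpacked with `**data`, it presumably holds these keys, though that is an inference from the call rather than something stated here. The sketch below checks a dictionary of that shape for consistency before plotting; the helper name `check_pyldavis_inputs` and the toy values are made up for illustration.

```python
def check_pyldavis_inputs(data, tol=1e-6):
    """Raise ValueError if the inputs are not mutually consistent."""
    n_topics = len(data["topic_term_dists"])
    n_docs = len(data["doc_lengths"])
    n_terms = len(data["vocab"])
    if len(data["term_frequency"]) != n_terms:
        raise ValueError("term_frequency must have one entry per vocab term")
    if len(data["doc_topic_dists"]) != n_docs:
        raise ValueError("doc_topic_dists must have one row per document")
    for row in data["topic_term_dists"]:
        if len(row) != n_terms or abs(sum(row) - 1.0) > tol:
            raise ValueError("each topic must be a distribution over the vocab")
    for row in data["doc_topic_dists"]:
        if len(row) != n_topics or abs(sum(row) - 1.0) > tol:
            raise ValueError("each document must be a distribution over topics")
    return n_topics, n_docs, n_terms

# Toy inputs: 2 topics, 3 documents, 3 vocabulary terms.
toy = {
    "topic_term_dists": [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]],
    "doc_topic_dists": [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]],
    "doc_lengths": [10, 4, 6],
    "vocab": ["bush", "iraq", "poll"],
    "term_frequency": [9, 5, 6],
}
print(check_pyldavis_inputs(toy))
```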