# Learning Topics in The Daily Kos with the Hierarchical Dirichlet Process

The Hierarchical Dirichlet Process (HDP) is typically used for topic modeling when the number of topics is unknown.

Let's explore the topics of the political blog The Daily Kos.

In [2]:
from microscopes.common.rng import rng
from microscopes.lda.definition import model_definition
from microscopes.lda.model import initialize
from microscopes.lda.testutil import toy_dataset
from microscopes.lda import model, runner
from collections import Counter

import itertools
import numpy as np
import pyLDAvis
import scipy as sp
import re
import simplejson

Before loading the documents, we load maps from the vocabulary words to the word IDs and vice versa.

In [4]:
with open("vocab.kos.txt", "r") as f:
    kos_vocab = [word.strip() for word in f.readlines()]
id_to_word = {i: word for i, word in enumerate(kos_vocab)}
word_to_id = {word: i for i, word in enumerate(kos_vocab)}

The Daily Kos data is stored locally. The format is as follows:

```
The format of the docword.kos.txt file is 3 header lines, followed by
NNZ triples:
---
D
W
NNZ
docID wordID count
docID wordID count
docID wordID count
docID wordID count
...
docID wordID count
docID wordID count
docID wordID count
---
```

We'll process the data into a list of lists of words (using the vocabulary maps):

In [5]:
with open("docword.kos.txt", "r") as f:
    kos_raw = [map(int, _.strip().split()) for _ in f.readlines()][3:]

docs = []
for _, grp in itertools.groupby(kos_raw, lambda x: x[0]):
    doc = []
    for _, word_id, word_cnt in grp:
        doc += word_cnt * [id_to_word[word_id-1]]
    docs.append(doc)
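
The grouping step above can be tried without the data files. The following is a minimal sketch that parses a tiny made-up sample in the same format. The sample text, the three-word vocabulary, and the helper name `parse_docword` are illustrative assumptions, not part of the Daily Kos data. Like the cell above, it assumes that word IDs in the file are 1-based and that the triples are sorted by document ID.

```python
import itertools

def parse_docword(text, vocab):
    """Turn bag-of-words triples (docID wordID count) into a list of word lists."""
    lines = text.strip().splitlines()
    # The first three lines are the header: D, W, NNZ.
    triples = [tuple(int(x) for x in line.split()) for line in lines[3:]]
    docs = []
    # groupby only merges adjacent rows, so the triples must be sorted by docID.
    for _, grp in itertools.groupby(triples, key=lambda t: t[0]):
        doc = []
        for _, word_id, count in grp:
            # Word IDs in the file are 1-based; the vocabulary list is 0-based.
            doc.extend([vocab[word_id - 1]] * count)
        docs.append(doc)
    return docs

# Hypothetical sample: 2 documents, 3 vocabulary words, 3 triples.
sample = """2
3
3
1 1 2
1 3 1
2 2 1
"""
toy_vocab = ["iraq", "tax", "war"]
toy_docs = parse_docword(sample, toy_vocab)
print(toy_docs)
```

Each word is repeated `count` times, so the first toy document contains "iraq" twice and "war" once.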

We must define our model before we initialize it. In this case, we need the number of docs and the number of words.

From there, we can initialize our model and set the hyperparameters.

In [6]:
N, V = len(docs), len(id_to_word)
defn = model_definition(N, V)
prng = rng()
kos_state = initialize(defn, docs, prng, 
                        vocab_hp=1, 
                        dish_hps={"alpha": 0.1, "gamma": 0.1})
r = runner.runner(defn, docs, kos_state)

print "number of docs:", N, "vocabulary size:", V

number of docs: 3430 vocabulary size: 6906


Given the size of the dataset, it'll take some time to run.

We'll run our model in 5 steps of 10 iterations each (50 iterations in total) and save our results after every step.

In [9]:
step_size = 10
steps = 5

for s in range(steps):
    r.run(prng, step_size)
    with open("daily-kos-summary.json", "w") as fp:
        simplejson.dump(kos_state.pyldavis_data(), fp=fp)
    print "iteration:", s * step_size, "perplexity:", kos_state.perplexity(), "num topics:", kos_state.ntopics()

iteration: 0 perplexity: 1623.21803167 num topics: 10
iteration: 10 perplexity: 1621.99408015 num topics: 10
iteration: 20 perplexity: 1622.10325697 num topics: 10
iteration: 30 perplexity: 1620.45031166 num topics: 11
iteration: 40 perplexity: 1620.42437657 num topics: 10
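
To make the perplexity numbers above easier to interpret, here is a minimal sketch of the usual definition: the exponential of the negative mean per-token log-likelihood, where the probability of a word in a document is the topic-weighted sum of the word's probability under each topic. Whether `kos_state.perplexity()` computes exactly this quantity is an assumption; the function name and the toy arrays are illustrative only.

```python
import numpy as np

def perplexity(docs, theta, phi):
    """exp of the negative mean per-token log-likelihood.

    docs  : list of documents, each a list of integer word IDs
    theta : (n_docs, n_topics) array of document-topic probabilities
    phi   : (n_topics, n_words) array of topic-word probabilities
    """
    log_lik, n_tokens = 0.0, 0
    for d, doc in enumerate(docs):
        # p(w | d) = sum_k theta[d, k] * phi[k, w], one value per token
        token_probs = theta[d].dot(phi[:, doc])
        log_lik += np.log(token_probs).sum()
        n_tokens += len(doc)
    return np.exp(-log_lik / n_tokens)

# Toy model: 2 documents, 2 topics, 4 vocabulary words.
toy_docs = [[0, 0, 1], [2, 3]]
toy_theta = np.array([[0.9, 0.1], [0.2, 0.8]])
toy_phi = np.array([[0.50, 0.40, 0.05, 0.05],
                    [0.05, 0.05, 0.50, 0.40]])
toy_perplexity = perplexity(toy_docs, toy_theta, toy_phi)
print(toy_perplexity)
```

Under this definition, a model that spreads probability uniformly over a vocabulary of V words has perplexity V, so lower values indicate a better fit.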


[pyLDAvis](https://github.com/bmabey/pyLDAvis) is a Python implementation of the [LDAvis](https://github.com/cpsievert/LDAvis) tool created by [Carson Sievert](https://github.com/cpsievert). 

> LDAvis is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data. The package extracts information from a fitted LDA topic model to inform an interactive web-based visualization.



In [49]:
prepared = pyLDAvis.prepare(**kos_state.pyldavis_data())
pyLDAvis.display(prepared)

## Other Functionality

### Model Serialization

LDA `state` objects are fully serializable with `pickle` and `cPickle`.

In [38]:
import pickle
new_state = pickle.loads(pickle.dumps(kos_state))

In [39]:
kos_state.assignments() == new_state.assignments()

True

In [40]:
kos_state.dish_assignments() == new_state.dish_assignments()

True

In [41]:
kos_state.table_assignments() == new_state.table_assignments()

True
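
Since the round trip above goes through `pickle.dumps`, the same mechanism should also work for checkpointing a state to disk. A minimal sketch, using a plain dictionary as a stand-in for `kos_state` and a hypothetical file name in a temporary directory:

```python
import os
import pickle
import tempfile

# Stand-in for kos_state; any picklable object behaves the same way.
stand_in_state = {"assignments": [[0, 1, 1], [2, 0]], "ntopics": 3}

# Hypothetical checkpoint path inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "kos_state.pkl")

# Pickle files must be opened in binary mode.
with open(path, "wb") as fp:
    pickle.dump(stand_in_state, fp, protocol=pickle.HIGHEST_PROTOCOL)

with open(path, "rb") as fp:
    restored = pickle.load(fp)

print(restored == stand_in_state)
```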

### Term Relevance

We can generate term relevances (as defined by [Sievert and Shirley 2014](http://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf)) for each topic. 

In [53]:
relevance = kos_state.term_relevance_by_topic()

Here are the ten most relevant words for each topic:

In [66]:
for topics in relevance:
    words = [word for word, _ in topics[:10]]
    print ' '.join(words)

november account electoral governor sunzoo contact password meter altsite republicansforkerry
tax jobs billion cuts economic health deficit budget taxes energy
carson coburn bunning oklahoma knowles murkowski thune herseth mongiardo diedrich
bush administration president cheney intelligence commission iraq white weapons rice
kerry dean edwards clark poll bush percent iowa gephardt primary
party media people convention conservative political bloggers ballot time internet
iraq war iraqi military troops soldiers baghdad forces iraqis killed
guard records boat swift service vietnam veterans duty medals awol
district republican elections house delay senate race money seat democrats
scientists science environmental emissions mcdonalds species epa cell cells researchers
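
For reference, Sievert and Shirley define the relevance of term w to topic k as λ log(φ_kw) + (1 − λ) log(φ_kw / p_w), where φ_kw is the probability of the term within the topic, p_w is its marginal probability in the corpus, and λ lies between 0 and 1. The sketch below applies that formula to a toy example. The function name, the toy vocabulary, and the probabilities are illustrative assumptions, and the value of λ used by `term_relevance_by_topic()` is not stated here.

```python
import numpy as np

def term_relevance(phi, term_prob, lam):
    """relevance[k, w] = lam * log(phi[k, w]) + (1 - lam) * log(phi[k, w] / term_prob[w])"""
    log_phi = np.log(phi)
    log_lift = log_phi - np.log(term_prob)
    return lam * log_phi + (1.0 - lam) * log_lift

toy_vocab = np.array(["the", "iraq", "tax", "war"])
toy_phi = np.array([[0.50, 0.30, 0.02, 0.18],   # a "war" topic
                    [0.50, 0.02, 0.40, 0.08]])  # an "economy" topic
# Marginal term probabilities, assuming both topics are equally common.
toy_term_prob = toy_phi.mean(axis=0)

# lam = 1 ranks terms purely by their probability within the topic.
top_by_prob = toy_vocab[np.argmax(term_relevance(toy_phi, toy_term_prob, 1.0), axis=1)]
# A smaller lam rewards terms that are unusually frequent in the topic.
top_by_relevance = toy_vocab[np.argmax(term_relevance(toy_phi, toy_term_prob, 0.3), axis=1)]

print(" ".join(top_by_prob))
print(" ".join(top_by_relevance))
```

With λ = 1 the common word "the" tops both toy topics, while λ = 0.3 surfaces "iraq" and "tax", which is why relevance tends to give more interpretable topic summaries than raw probability.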


### Topic Prediction 

We can also predict how the topics will be distributed within an arbitrary document.

Let's create a document from the 100 most relevant words in the 7th topic.

In [78]:
doc = [word for word, _ in relevance[6][:100]]
print ' '.join(doc)

iraq war iraqi military troops soldiers baghdad forces iraqis killed american abu fallujah ghraib saddam army occupation pentagon prisoners rumsfeld insurgents officials torture coalition shiite prison government najaf afghanistan violence wounded country attacks dead insurgency resistance casualties sunni abuses security sadr minister international invasion abuse united marines police killing americans people foreign combat detainees men deaths civilian chalabi arab civilians shiites iraqs juan enemy muslim official authority fighting wars operations women council cpa saddams terrorists died photos power gen hussein usled bremer armed british command countries rumsfelds rights human contractors cole mosque sovereignty islamic guantanamo marine death explosives sistani falluja


In [79]:
kos_state.predict(doc, r)

[[0.01595686398987793,
  0.05007820755281674,
  0.04411037192755322,
  0.10491765533286677,
  0.020317818565352107,
  0.040100600665650354,
  0.5716339607196672,
  0.06249246472382885,
  0.02219331151872114,
  0.06819874500366653]]

The prediction is that this document is mostly generated by topic 7 (probability of about 0.57).

Similarly, if we create a document from words from the 1st and 7th topics, our prediction is that the document is generated mostly by those topics.

In [86]:
doc = [word for word, _ in relevance[0][:100]] + [word for word, _ in relevance[6][:100]]

kos_state.predict(doc, r)

[[0.4097113742293531,
  0.03569778707009671,
  0.03474053139849126,
  0.06000183895558933,
  0.025217125340158176,
  0.031218204733139295,
  0.29237424043028964,
  0.04032224003163025,
  0.027324349133695868,
  0.043392308677557355]]
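
Rather than reading the dominant topics off by eye, we can sort the predicted probabilities. A small sketch, assuming each inner list returned by `predict` is the topic distribution for one document; the values are copied from the output above and rounded to four decimals.

```python
import numpy as np

# Rounded copy of the prediction for the mixed document above.
prediction = [[0.4097, 0.0357, 0.0347, 0.0600, 0.0252,
               0.0312, 0.2924, 0.0403, 0.0273, 0.0434]]

probs = np.asarray(prediction[0])
# Indices of the two largest probabilities, largest first (0-based).
top_two = np.argsort(probs)[::-1][:2]
# Convert to the 1-based topic numbers used in the text.
top_two_topics = [int(i) + 1 for i in top_two]
# Share of the document attributed to those two topics.
top_two_mass = float(probs[top_two].sum())

print(top_two_topics)
print(top_two_mass)
```

The two largest entries are topics 1 and 7, which together account for roughly 70% of the predicted distribution.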

### Topic and Term Distributions

Of course, we can also get the topic distribution for each document (commonly called $\Theta$).

In [93]:
kos_state.topic_distribution_by_document()[0]

[5.3154028941888714e-05,
 6.17083900427994e-05,
 4.788980476703211e-05,
 0.6273726486533122,
 0.1897739815883383,
 0.00011859123428585457,
 7.399157372083377e-05,
 4.0066586417904096e-05,
 0.1824521921178911,
 5.7760222821151605e-06]

We can also get the raw word distribution for each topic (commonly called $\Phi$). This is related to the _word relevance_.

In [94]:
kos_state.word_distribution_by_topic()[0]

{'foul': 1.4834374269412365e-05,
 'prices': 1.4834374269412365e-05,
 'hanging': 1.4834374269412365e-05,
 'eligible': 1.4834374269412365e-05,
 'electricity': 1.4834374269412365e-05,
 'lord': 1.4834374269412365e-05,
 'regional': 5.933749707764946e-05,
 'oceana': 1.4834374269412365e-05,
 'broward': 1.4834374269412365e-05,
 'bringing': 7.417186861857772e-05,
 'emilys': 2.966874853882473e-05,
 'prize': 0.00010384061897639185,
 'wednesday': 0.00010384061897639185,
 'commented': 1.4834374269412365e-05,
 'guardsmen': 1.4834374269412365e-05,
 'tired': 4.450312189874239e-05,
 'miller': 4.450312189874239e-05,
 'budget': 2.966874853882473e-05,
 'rusty': 4.450312189874239e-05,
 'saddams': 1.4834374269412365e-05,
 'errors': 2.966874853882473e-05,
 'contributed': 1.4834374269412365e-05,
 'fingers': 2.966874853882473e-05,
 'archpundit': 2.966874853882473e-05,
 'bowerss': 2.966874853882473e-05,
 'increasing': 1.4834374269412365e-05,
 'specialist': 1.4834374269412365e-05,
 'hero': 5.933749707764946e-05,