This simple example loads the newsgroups data from sklearn
and trains an LDA-like model on it.

Author: Chris Moody <chrisemoody@gmail.com>

License: MIT

### Setup

In [6]:
from lda2vec import preprocess, LDA2Vec, Corpus
from sklearn.datasets import fetch_20newsgroups
from chainer import serializers
from chainer import cuda
import numpy as np
import os.path
import logging


# Optional: moving the model to the GPU makes it ~10x faster
# set to False if you're having problems with Chainer and CUDA
gpu = cuda.available

logging.basicConfig()

### Preprocess text

Preprocessing the text transforms words into integer indices in order of decreasing frequency, with rare words filtered out and frequent words subsampled.
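
As a rough illustration of the indexing idea described above, the following sketch ranks words by decreasing frequency and drops rare ones. It is not lda2vec's implementation: `build_compact_index` and its `min_count` default are hypothetical names chosen for this example, rare words are dropped outright for brevity, and subsampling of frequent words is left out.

```python
from collections import Counter


def build_compact_index(docs, min_count=2):
    """Map each word to an integer rank, where 0 is the most frequent word.

    Hypothetical helper for illustration only; words seen fewer than
    min_count times are left out of the index.
    """
    counts = Counter(word for doc in docs for word in doc)
    # Sort by decreasing count, breaking ties alphabetically so the
    # result is deterministic
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = [word for word, count in ranked if count >= min_count]
    return {word: rank for rank, word in enumerate(kept)}


docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat"]]
index = build_compact_index(docs, min_count=2)
# Encode each document, skipping words that were filtered out
encoded = [[index[w] for w in doc if w in index] for doc in docs]
print(index)
print(encoded)
```

Because the most frequent word receives index 0, slicing an array of counts or word strings with `[:n_words]` keeps exactly the most frequent `n_words` entries.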

In [2]:
# Fetch data
texts = fetch_20newsgroups(subset='train').data
# Convert to unicode (spaCy only works with unicode)
texts = [unicode(d) for d in texts]

# Preprocess data
max_length = 10000   # Limit of 10k words per document
tokens, vocab = preprocess.tokenize(texts, max_length, tag=False,
                                    parse=False, entity=False)
corpus = Corpus()
# Make a ranked list of rare vs frequent words
corpus.update_word_count(tokens)
corpus.finalize()
# The tokenization uses spaCy indices, and so may have gaps
# between indices for words that aren't present in our dataset.
# This builds a new compact index
compact = corpus.to_compact(tokens)
# Remove extremely rare words
pruned = corpus.filter_count(compact, min_count=50)
# Words tend to have power law frequency, so selectively
# downsample the most prevalent words
clean = corpus.subsample_frequent(pruned)
# Now flatten a 2D array of document per row and word position
# per column to a 1D array of words. This will also remove skips
# and OoV words
doc_ids = np.arange(clean.shape[0])
flattened, (doc_ids,) = corpus.compact_to_flat(clean, doc_ids)
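
The flattening step at the end of the cell above can be pictured with plain NumPy. This is only a sketch of the idea, not the `Corpus.compact_to_flat` implementation: the `SKIP` marker and the `flatten_with_doc_ids` helper are assumptions made for this example, and the real class uses its own special values for skipped and out-of-vocabulary words.

```python
import numpy as np

# Hypothetical marker for padding, skipped, or out-of-vocabulary slots
SKIP = -1


def flatten_with_doc_ids(compact, doc_ids, skip=SKIP):
    """Flatten a (n_docs, max_length) array to 1D, dropping skipped slots.

    Returns the surviving word indices and, aligned with them, the id of
    the document each word came from.
    """
    compact = np.asarray(compact)
    # Repeat each document id across every column of that document's row
    ids_2d = np.repeat(np.asarray(doc_ids)[:, None], compact.shape[1], axis=1)
    keep = compact != skip
    # Boolean indexing walks the array row by row, so word order is preserved
    return compact[keep], ids_2d[keep]


compact = np.array([[3, 0, SKIP, SKIP],
                    [1, SKIP, 2, 0]])
flat, ids = flatten_with_doc_ids(compact, np.arange(2))
print(flat)
print(ids)
```

The two returned arrays have the same length, which is what lets a model look up a document vector for every individual word.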

### Model parameters

Various hyperparameters of the lda2vec algorithm

In [None]:
# Model Parameters
# Number of documents
n_docs = len(texts)
# Number of unique words in the vocabulary
n_words = flattened.max() + 1
# Number of dimensions in a single word vector
n_hidden = 128
# Number of topics to fit
n_topics = 20
# Get the count for each key
counts = corpus.keys_counts[:n_words]
# Get the string representation for every compact key
words = corpus.word_list(vocab)[:n_words]

### Fit the model

In [None]:
# Fit the model
model = LDA2Vec(n_words, n_hidden, counts, dropout_ratio=0.2)
model.add_categorical_feature(n_docs, n_topics, name='document_id')
model.finalize()
if os.path.exists('model.hdf5'):
    serializers.load_hdf5('model.hdf5', model)
for _ in range(200):
    model.top_words_per_topic('document_id', words)
    if gpu:
        model.to_gpu()
    model.fit(flattened, categorical_features=[doc_ids], fraction=1e-3,
              epochs=1)
    model.to_cpu()
serializers.save_hdf5('model.hdf5', model)
model.top_words_per_topic('document_id', words)

### Prepare topics

In [None]:
# Visualize the model -- look at lda.ipynb to see the results
model.to_cpu()
topics = model.prepare_topics('document_id', words)
np.savez('topics.pyldavis', **topics)

### Reading in the saved model topics

After running the `lda.py` script in `lda2vec`'s `examples/twenty_newsgroups` directory, a `topics.pyldavis.npz` file will be created that contains the topic-to-word probabilities and frequencies. What's left is to visualize each topic and label it from its prevalent words.

In [1]:
# You must be using a very recent version of pyLDAvis to use the lda2vec outputs. 
# As of this writing, anything past Jan 6 2016 or this commit 14e7b5f60d8360eb84969ff08a1b77b365a5878e should work.
# You can do this quickly by installing it directly from master like so:
# pip install git+https://github.com/bmabey/pyLDAvis.git@master#egg=pyLDAvis
import numpy as np
import pyLDAvis
pyLDAvis.enable_notebook()

In [8]:
# The topics.pyldavis.npz file is created by lda.py, but also ships with lda2vec
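# Pass the path directly: an .npz archive is binary, so opening it in
# text mode ('r') can fail or corrupt the read
npz = np.load('topics.pyldavis.npz')
dat = {k: v for (k, v) in npz.items()}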
dat['vocab'] = dat['vocab'].tolist()
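
To make the file layout concrete, here is a small round trip through `np.savez` and `np.load` using toy arrays. The key names match the ones used later in this notebook, but the values, shapes, and the `toy_topics.npz` file name are invented for illustration.

```python
import os
import tempfile

import numpy as np

# Toy stand-ins for the arrays that lda.py saves; all values are made up
topics = {
    'topic_term_dists': np.array([[0.7, 0.2, 0.1],
                                  [0.1, 0.3, 0.6]]),
    'doc_topic_dists': np.array([[0.9, 0.1],
                                 [0.5, 0.5],
                                 [0.2, 0.8]]),
    'doc_lengths': np.array([5, 3, 8]),
    'vocab': np.array(['gun', 'scsi', 'comet']),
    'term_frequency': np.array([10, 4, 6]),
}

path = os.path.join(tempfile.mkdtemp(), 'toy_topics.npz')
np.savez(path, **topics)

# NpzFile loads arrays lazily, so copy them out before the file closes
with np.load(path) as npz:
    loaded = {key: npz[key] for key in npz.files}
loaded['vocab'] = loaded['vocab'].tolist()

print(sorted(loaded))
print(loaded['topic_term_dists'].shape)  # (n_topics, n_words)
```

Each row of `topic_term_dists` is a probability distribution over the vocabulary, and each row of `doc_topic_dists` is a distribution over topics.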

In [9]:
dat['doc_lengths']

array([157, 147, 413, ..., 167, 256,  98], dtype=int32)

### Top words in every topic

In [10]:
top_n = 10
for j, topic_to_word in enumerate(dat['topic_term_dists']):
    top = np.argsort(topic_to_word)[::-1][:top_n]
    msg = 'topic %i ' % j
    msg += ' '.join([dat['vocab'][i].strip()[:35] for i in top])
    print(msg)

topic 0 out_of_vocabulary hicnet x/  oname hiv  pts lds eof_not_ok
topic 1 out_of_vocabulary hiv vitamin infections candida foods infection dyer diet patients
topic 2 out_of_vocabulary duo adb c650 centris lciii motherboard fpu vram simm
topic 3 yeast candida judas infections  vitamin foods scholars greek tyre
topic 4 jupiter lebanese lebanon karabakh israeli israelis comet roby hezbollah hernlem
topic 5 xfree86 printer speedstar font jpeg imake deskjet pov fonts borland
topic 6 nubus 040 scsi-1 scsi-2 pds israelis 68040 lebanese powerpc livesey
topic 7 colormap cursor xterm handler pixmap gcc xlib openwindows font expose
topic 8 out_of_vocabulary circuits magellan voltage outlet circuit grounding algorithm algorithms polygon
topic 9 amp alomar scsi-1 scsi-2 68040  mhz connectors hz wiring
topic 10 astronomical  astronomy telescope larson jpl satellites aerospace visualization redesign
topic 11 homicides homicide handgun ># firearms cramer guns minorities gun rushdie
topic 12 out_of_vo

### Visualizing top words per topic

In [11]:
# Unfortunately for me, pyLDAvis spews out numpy deprecation warnings
import warnings
warnings.filterwarnings('ignore')
prepared_data = pyLDAvis.prepare(dat['topic_term_dists'], dat['doc_topic_dists'], 
                                 dat['doc_lengths'], dat['vocab'], dat['term_frequency'], mds='tsne')

In the visualization below the objective is for a human to label each topic given the top words. 

A few selections:
- Topic 6, for example, has many computer graphics references, with words like 'gui', 'fonts' and 'jpeg'
- Topic 8 is about medicine, with talk about 'patients', 'yeast', 'vitamin' and 'infection'
- Topic 17 is about politics, with the highly relevant words being 'stephanopoulos', 'secretary', 'senator', 'serbs' and 'azerbaijani'
- Topic 18 is about space with 'astronomical', 'satellites', and 'shuttle'
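
The "highly relevant" words in the selections above come from the relevance ranking that pyLDAvis inherits from LDAvis (Sievert and Shirley, 2014): a weighted blend of a term's log probability within the topic and its log lift over the corpus-wide term probability. The sketch below computes that measure with NumPy on toy numbers. The `relevance` helper is written for this example and is not the pyLDAvis code, and `lam=0.6` is the value suggested in that paper rather than a setting taken from this notebook.

```python
import numpy as np


def relevance(topic_term_dists, term_frequency, lam=0.6):
    """Relevance of every term to every topic.

    relevance = lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))
    Zero probabilities would yield -inf, so real data may need smoothing.
    """
    phi = np.asarray(topic_term_dists, dtype=float)
    p_w = np.asarray(term_frequency, dtype=float)
    p_w = p_w / p_w.sum()  # marginal probability of each term
    return lam * np.log(phi) + (1 - lam) * np.log(phi / p_w)


phi = np.array([[0.7, 0.2, 0.1],
                [0.1, 0.3, 0.6]])
term_frequency = np.array([10, 4, 6])

rel = relevance(phi, term_frequency, lam=0.6)
# Term indices per topic, most relevant first
top_terms = np.argsort(rel, axis=1)[:, ::-1]
print(top_terms)
```

With `lam=1.0` the ranking reduces to plain topic-term probability, which is what the "Top words in every topic" cell prints; lower values favor words that are distinctive to a topic.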

Unfortunately, pyLDAvis shuffles the topic order from what we had before :(
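
The reordering is not random: to my understanding, pyLDAvis numbers topics by decreasing share of the corpus, weighting each document's topic mixture by its length. Under that assumption, the following sketch recovers the mapping from displayed topic number back to the original index. `displayed_topic_order` is a hypothetical helper, not a pyLDAvis function, and the toy arrays are invented.

```python
import numpy as np


def displayed_topic_order(doc_topic_dists, doc_lengths):
    """Original topic indices, ordered by decreasing share of corpus tokens.

    Assumes topics are sorted by length-weighted prevalence; verify this
    against the pyLDAvis version you have installed.
    """
    doc_topic = np.asarray(doc_topic_dists, dtype=float)
    lengths = np.asarray(doc_lengths, dtype=float)
    # Expected number of tokens assigned to each topic
    topic_freq = (doc_topic * lengths[:, None]).sum(axis=0)
    proportion = topic_freq / topic_freq.sum()
    # mergesort is stable, so tied topics keep their original order
    return np.argsort(-proportion, kind='mergesort')


doc_topic = np.array([[0.9, 0.1, 0.0],
                      [0.2, 0.3, 0.5],
                      [0.1, 0.1, 0.8]])
doc_lengths = np.array([10, 20, 30])

order = displayed_topic_order(doc_topic, doc_lengths)
for shown, original in enumerate(order, start=1):
    print('displayed topic %d = original topic %d' % (shown, original))
```

Recent pyLDAvis releases also accept `sort_topics=False` in `pyLDAvis.prepare`, which should keep the original order; check that your installed version supports it before relying on it.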

In [12]:
pyLDAvis.display(prepared_data)

Note that the interactive visualization above doesn't show up in GitHub's rendering of ipynb files. If you're viewing this on GitHub, try clicking the link below:

http://nbviewer.jupyter.org/github/cemoody/lda2vec/blob/master/examples/twenty_newsgroups/lda.ipynb

### 'True' topics

The 20 newsgroups dataset is interesting because users effectively classify the topics by posting to a particular newsgroup. This lets us qualitatively check our unsupervised topics against the 'true' labels. For example, the four topics we highlighted above are intuitively close to `comp.graphics`, `sci.med`, `talk.politics.misc`, and `sci.space`.

    comp.graphics
    comp.os.ms-windows.misc
    comp.sys.ibm.pc.hardware
    comp.sys.mac.hardware
    comp.windows.x
    rec.autos
    rec.motorcycles
    rec.sport.baseball
    rec.sport.hockey
    sci.crypt
    sci.electronics
    sci.med
    sci.space
    misc.forsale
    talk.politics.misc
    talk.politics.guns
    talk.politics.mideast
    talk.religion.misc
    alt.atheism
    soc.religion.christian
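
One way to make this comparison slightly less qualitative is to cross-tabulate each document's strongest topic against its newsgroup. The sketch below does this on toy data. `dominant_label_per_topic` is a hypothetical helper, and in real use the labels would presumably come from `fetch_20newsgroups(subset='train').target` and `.target_names`, provided the rows of `doc_topic_dists` line up with the same documents.

```python
import numpy as np


def dominant_label_per_topic(doc_topic_dists, labels, label_names):
    """For each topic, the most common true label among documents it dominates.

    Returns a dict of topic index to label name, plus the full
    topic-by-label contingency table.
    """
    doc_topic = np.asarray(doc_topic_dists)
    labels = np.asarray(labels)
    dominant = doc_topic.argmax(axis=1)  # strongest topic of each document
    n_topics, n_labels = doc_topic.shape[1], len(label_names)
    # Contingency table: rows are topics, columns are true labels
    table = np.zeros((n_topics, n_labels), dtype=int)
    np.add.at(table, (dominant, labels), 1)
    result = {}
    for topic in range(n_topics):
        if table[topic].sum() > 0:  # skip topics that dominate no document
            result[topic] = label_names[table[topic].argmax()]
    return result, table


doc_topic = np.array([[0.9, 0.1],
                      [0.8, 0.2],
                      [0.3, 0.7],
                      [0.4, 0.6],
                      [0.6, 0.4]])
labels = np.array([0, 0, 1, 1, 1])
label_names = ['sci.med', 'sci.space']

best, table = dominant_label_per_topic(doc_topic, labels, label_names)
print(best)
print(table)
```

A topic whose row in the table is spread evenly over many newsgroups is a hint that it captures something other than subject matter, such as headers or quoting style.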