### Reading in the saved model topics

After running the `lda.py` script in `lda2vec`'s `examples/twenty_newsgroups` directory, a `topics.pyldavis.npz` file will be created that contains the topic-to-word probabilities and frequencies. What's left is to visualize each topic and label it from its most prevalent words.

In [10]:
# You must be using a very recent version of pyLDAvis to use the lda2vec outputs. 
# As of this writing, anything past Jan 6 2016 or this commit 14e7b5f60d8360eb84969ff08a1b77b365a5878e should work.
# You can do this quickly by installing it directly from master like so:
# pip install git+https://github.com/bmabey/pyLDAvis.git
import numpy as np
import pyLDAvis
pyLDAvis.enable_notebook()

In [3]:
# The topics.pyldavis.npz file is created by lda.py, but also ships with lda2vec
# np.load opens the file in binary mode itself; .items() works on Python 2 and 3
npz = np.load('topics.pyldavis.npz')
dat = {k: v for (k, v) in npz.items()}
dat['vocab'] = dat['vocab'].tolist()
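
Before handing these arrays to pyLDAvis it can help to confirm that they agree with each other. The sketch below is not part of lda2vec: `check_lda_inputs` is a hypothetical helper, written in plain Python on toy data so it runs without the saved file. It assumes the layout implied by the key names, namely one row per topic in `topic_term_dists`, one row per document in `doc_topic_dists`, and one entry per vocabulary word in `term_frequency`.

```python
def check_lda_inputs(topic_term_dists, doc_topic_dists, doc_lengths,
                     vocab, term_frequency, tol=1e-6):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    n_topics = len(topic_term_dists)
    # Every topic needs one probability per vocabulary word, summing to 1.
    for j, row in enumerate(topic_term_dists):
        if len(row) != len(vocab):
            problems.append('topic %d has %d terms, vocab has %d'
                            % (j, len(row), len(vocab)))
        if abs(sum(row) - 1.0) > tol:
            problems.append('topic %d probabilities do not sum to 1' % j)
    # Every document needs one weight per topic, summing to 1.
    for d, row in enumerate(doc_topic_dists):
        if len(row) != n_topics:
            problems.append('doc %d has %d topic weights, expected %d'
                            % (d, len(row), n_topics))
        if abs(sum(row) - 1.0) > tol:
            problems.append('doc %d topic weights do not sum to 1' % d)
    if len(doc_lengths) != len(doc_topic_dists):
        problems.append('doc_lengths and doc_topic_dists disagree on doc count')
    if len(term_frequency) != len(vocab):
        problems.append('term_frequency and vocab disagree on vocabulary size')
    return problems

# Toy stand-in for the real arrays (values are made up for illustration).
toy = dict(
    topic_term_dists=[[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]],
    doc_topic_dists=[[0.9, 0.1], [0.5, 0.5]],
    doc_lengths=[12, 30],
    vocab=['scsi', 'gun', 'comet'],
    term_frequency=[20, 10, 12],
)
print(check_lda_inputs(**toy))
```

On the real data the equivalent call would pass `dat['topic_term_dists']`, `dat['doc_topic_dists']`, `dat['doc_lengths']`, `dat['vocab']` and `dat['term_frequency']`. An empty list means nothing looked off; the tolerance may need loosening if the arrays are stored as float32.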

### Top words in every topic

In [9]:
top_n = 10
for j, topic_to_word in enumerate(dat['topic_term_dists']):
    top = np.argsort(topic_to_word)[::-1][:top_n]
    msg = 'topic %i '  % j
    msg += ' '.join([dat['vocab'][i].strip()[:35] for i in top])
    print(msg)

topic 0 out_of_vocabulary hicnet x/  oname hiv  pts lds eof_not_ok
topic 1 out_of_vocabulary hiv vitamin infections candida foods infection dyer diet patients
topic 2 out_of_vocabulary duo adb c650 centris lciii motherboard fpu vram simm
topic 3 yeast candida judas infections  vitamin foods scholars greek tyre
topic 4 jupiter lebanese lebanon karabakh israeli israelis comet roby hezbollah hernlem
topic 5 xfree86 printer speedstar font jpeg imake deskjet pov fonts borland
topic 6 nubus 040 scsi-1 scsi-2 pds israelis 68040 lebanese powerpc livesey
topic 7 colormap cursor xterm handler pixmap gcc xlib openwindows font expose
topic 8 out_of_vocabulary circuits magellan voltage outlet circuit grounding algorithm algorithms polygon
topic 9 amp alomar scsi-1 scsi-2 68040  mhz connectors hz wiring
topic 10 astronomical  astronomy telescope larson jpl satellites aerospace visualization redesign
topic 11 homicides homicide handgun ># firearms cramer guns minorities gun rushdie
topic 12 out_of_vo

### Visualizing top words per topic

In [8]:
# Unfortunately for me, pyLDAvis spews out numpy deprecation warnings, so silence them
import warnings
warnings.filterwarnings('ignore')
prepared_data = pyLDAvis.prepare(dat['topic_term_dists'], dat['doc_topic_dists'], 
                                 dat['doc_lengths'], dat['vocab'], dat['term_frequency'], mds='tsne')

In the visualization below the objective is for a human to label each topic given the top words. 

A few selections:
- Topic 6, for example, has lots of computer graphics references with words like 'gui', 'fonts' and 'jpeg'
- Topic 8 is about medicine with talk about 'patients', 'yeast', 'vitamin' and 'infection'
- Topic 17 is about politics, with the highly relevant words being 'stephanopoulos', 'secretary', 'senator', 'serbs' and 'azerbaijani'
- Topic 18 is about space with 'astronomical', 'satellites', and 'shuttle'
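
The 'highly relevant' words in these selections come from the relevance ranking in the pyLDAvis sidebar. As a rough sketch of that idea, relevance mixes a word's probability within a topic with its lift over the word's overall probability: `lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))`. The formula is the one from the LDAvis paper by Sievert and Shirley; the function name and the toy numbers below are made up for illustration and are not taken from the saved model.

```python
import math

def relevance_ranking(topic_probs, overall_probs, vocab, lam=0.6):
    """Rank words for one topic by LDAvis-style relevance, highest first."""
    scores = {}
    for word, p_wt, p_w in zip(vocab, topic_probs, overall_probs):
        # lam weights the in-topic probability, (1 - lam) weights the lift.
        scores[word] = (lam * math.log(p_wt)
                        + (1.0 - lam) * math.log(p_wt / p_w))
    return sorted(vocab, key=lambda w: scores[w], reverse=True)

# Toy numbers: 'the' is frequent everywhere, 'vitamin' is rare but topic-specific.
toy_vocab = ['the', 'patients', 'vitamin']
toy_topic_probs = [0.50, 0.30, 0.20]      # p(word | topic)
toy_overall_probs = [0.60, 0.05, 0.02]    # p(word) in the whole corpus

by_probability = relevance_ranking(toy_topic_probs, toy_overall_probs,
                                   toy_vocab, lam=1.0)
by_lift = relevance_ranking(toy_topic_probs, toy_overall_probs,
                            toy_vocab, lam=0.0)
print(by_probability)
print(by_lift)
```

With `lam=1.0` the ranking is plain probability, so the common word 'the' comes first; with `lam=0.0` it is pure lift, so 'vitamin' comes first. Moving the slider in the visualization trades off between these two extremes.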

Unfortunately, pyLDAvis shuffles the topic order from what we had before :(
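
The reordering should be recoverable. As far as I can tell, pyLDAvis sorts topics by their overall share of tokens (each document's topic weights multiplied by its length, summed over documents) and numbers them from 1 in the display. The sketch below reproduces that ranking in plain Python on toy data. `topic_display_order` is a hypothetical helper, and the sorting rule is an assumption about pyLDAvis internals rather than a documented guarantee.

```python
def topic_display_order(doc_topic_dists, doc_lengths):
    """Map displayed topic number (1-based) to original topic index (0-based)."""
    n_topics = len(doc_topic_dists[0])
    # Expected number of tokens assigned to each topic across all documents.
    mass = [0.0] * n_topics
    for weights, length in zip(doc_topic_dists, doc_lengths):
        for j, w in enumerate(weights):
            mass[j] += w * length
    # Assumed pyLDAvis behaviour: the largest topic is displayed first.
    ranked = sorted(range(n_topics), key=lambda j: mass[j], reverse=True)
    return {shown: original for shown, original in enumerate(ranked, start=1)}

# Toy data: three documents, three topics (values are made up).
toy_doc_topic = [[0.1, 0.7, 0.2],
                 [0.2, 0.2, 0.6],
                 [0.0, 0.9, 0.1]]
toy_doc_lengths = [10, 50, 40]

order = topic_display_order(toy_doc_topic, toy_doc_lengths)
print(order)
```

Here original topic 1 carries the most tokens, so it would be shown as topic 1, and original topic 0 would be shown last as topic 3. Newer pyLDAvis releases also appear to accept `sort_topics=False` in `pyLDAvis.prepare` to keep the original order; check the version you have installed before relying on it.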

In [9]:
pyLDAvis.display(prepared_data)

### 'True' topics

The 20 newsgroups dataset is interesting because users effectively classify the topics by posting to a particular newsgroup. This lets us qualitatively check our unsupervised topics against the 'true' labels. For example, the four topics we highlighted above are intuitively close to `comp.graphics`, `sci.med`, `talk.politics.misc`, and `sci.space`.

    comp.graphics
    comp.os.ms-windows.misc
    comp.sys.ibm.pc.hardware
    comp.sys.mac.hardware
    comp.windows.x
    rec.autos
    rec.motorcycles
    rec.sport.baseball
    rec.sport.hockey
    sci.crypt
    sci.electronics
    sci.med
    sci.space
    misc.forsale
    talk.politics.misc
    talk.politics.guns
    talk.politics.mideast
    talk.religion.misc
    alt.atheism
    soc.religion.christian
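
To keep the comparison easy to revisit, the hand-assigned labels can be written down and checked against the list of group names. The sketch below only records the four selections discussed earlier; the topic numbers are the ones shown in the pyLDAvis display, and the variable names are illustrative.

```python
# The 20 newsgroup names, as listed above.
NEWSGROUPS = [
    'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware',
    'comp.sys.mac.hardware', 'comp.windows.x', 'rec.autos', 'rec.motorcycles',
    'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics',
    'sci.med', 'sci.space', 'misc.forsale', 'talk.politics.misc',
    'talk.politics.guns', 'talk.politics.mideast', 'talk.religion.misc',
    'alt.atheism', 'soc.religion.christian',
]

# Hand labels for the four highlighted topics (pyLDAvis topic number -> group).
hand_labels = {
    6: 'comp.graphics',
    8: 'sci.med',
    17: 'talk.politics.misc',
    18: 'sci.space',
}

# Catch typos in the labels, and list the groups that still lack a topic.
unknown = [g for g in hand_labels.values() if g not in NEWSGROUPS]
unlabeled = [g for g in NEWSGROUPS if g not in hand_labels.values()]

print('%d of %d newsgroups have a hand-labeled topic'
      % (len(NEWSGROUPS) - len(unlabeled), len(NEWSGROUPS)))
```

Extending `hand_labels` as more topics are inspected gives a running picture of which newsgroups the unsupervised topics seem to cover.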