# Topic Modeling with gensim
We'll try out [Latent Dirichlet Allocation (LDA)](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) in [gensim](http://radimrehurek.com/gensim/index.html) on the [20 Newsgroups dataset](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html) with some simple preprocessing.

#### Install gensim

In [1]:
!pip install --upgrade gensim

Collecting gensim
  Downloading gensim-0.13.3-cp35-cp35m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (4.3MB)
Collecting numpy>=1.3 (from gensim)
  Downloading numpy-1.11.2-cp35-cp35m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (3.9MB)
Collecting smart-open>=1.2.1 (from gensim)
  Downloading smart_open-1.3.5.tar.gz
Requirement already up-to-date: six>=1.5.0 in /Users/Bob/anaconda/lib/python3.5/site-packages (from gensim)
Requirement already up-to-date: scipy>=0.7.0 in /Users/Bob/anaconda/lib/python3.5/site-packages (from gensim)
Collecting boto>=2.32 (from smart-open>=1.2.1->gensim)
  Downloading boto-2.43.0-py2.py3-none-any.whl (1.3MB)
Collecting bz2file (from smart-open>=1.2.1->

##### Imports

In [2]:
# gensim
from gensim import corpora, models, similarities, matutils
# sklearn
from sklearn import datasets
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
# logging for gensim (set to INFO)
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)



Let's retain only a subset of the 20 categories in the original 20 Newsgroups Dataset.

In [3]:
# Set categories
categories = ['alt.atheism', 'comp.graphics', 'rec.sport.baseball']#, 
              #'rec.motorcycles', 'sci.space', 'talk.politics.mideast']
# Download the training subset of the 20 NG dataset, with headers, footers, quotes removed
# Only keep docs from the 3 categories listed above
ng_train = datasets.fetch_20newsgroups(subset='train', categories=categories, 
                                      remove=('headers', 'footers', 'quotes'))

2016-11-01 11:47:42,054 : INFO : Decompressing /Users/Bob/scikit_learn_data/20news_home/20news-bydate.tar.gz


In [5]:
# Take a look at the first 5 docs
ng_train.data[0:5]

['\n\n\nI happen to be a big fan of Jayson Stark.  He is a baseball writer for the \nPhiladelphia Inquirer.  Every tuesday he writes a "Week in Review" column.  \nHe writes about unusual situations that occured during the week.  Unusual\nstats.  He has a section called "Kinerisms of the Week" which are stupid\nlines by Mets brodcaster Ralph Kiner.  Every year he has the LGTGAH contest.\nThat stands for "Last guy to get a hit."  He also writes for Baseball \nAmerica.  That column is sort of a highlights of "Week in Review."  If you \ncan, check his column out sometime.  He might make you laugh.\n\nRob Koffler\n',
 '\nHere\'s one I remember: (sort of)\nYogi\'s asleep in a hotel room late at night and gets a call from someone.\nAfter he answers the phone the person at the other end asks if he woke Yogi\nup. Yogi answered, "No, the phone did."',
 '\n\n\tSorry, I was, but I somehow have misplaced my diskette from the last \ncouple of months or so. However, thanks to the efforts of Bobby, it

## Document Preprocessing
LDA needs a term-document matrix of word (token) counts as its input.

We'll use `sklearn`'s `CountVectorizer` to build that matrix. A few of its parameters handle all of the text preprocessing within the `CountVectorizer` itself:
* `analyzer='word'` (the default): Tokenize by word
* `ngram_range=(1,2)`: Keep all unigrams and bigrams (1- and 2-word sequences)
* `stop_words='english'`: Remove English stop words
* `token_pattern='\\b[a-z][a-z]+\\b'`: Keep only tokens made of 2 or more letters (text is lowercased first, so `[a-z]` is enough)
* `max_df=0.7`: Drop terms that appear in more than 70% of the documents
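To get a feel for what the `token_pattern` keeps, the regular expression can be tried directly with Python's `re` module. This is only a rough sketch of the tokenization step, not `CountVectorizer`'s actual implementation. The sample sentence and the `bigrams` helper are made up for illustration, and stop-word removal and `max_df` filtering are left out.

```python
import re

# Same pattern as passed to CountVectorizer: two or more lowercase letters
token_pattern = r"\b[a-z][a-z]+\b"

# Hypothetical sample text (not from the dataset)
text = "He writes a 'Week in Review' column, e.g. about JPEG2000 in 1993."

# CountVectorizer lowercases by default, so we do the same here
tokens = re.findall(token_pattern, text.lower())
print(tokens)

def bigrams(words):
    """Pair each token with its successor, joined by a space (hypothetical helper)."""
    return [" ".join(pair) for pair in zip(words, words[1:])]

# Unigrams followed by bigrams, roughly what ngram_range=(1,2) asks for
print(tokens + bigrams(tokens))
```

Single letters, numbers, and mixed tokens such as `jpeg2000` are dropped, because the pattern only accepts a run of letters with a word boundary on each side.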

In [45]:
# Create a CountVectorizer for parsing/counting words
vectorizer = CountVectorizer(ngram_range=(1,2),stop_words='english',token_pattern='\\b[a-z][a-z]+\\b',max_df=0.7)
vectorizer.fit(ng_train.data)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.7, max_features=None, min_df=1,
        ngram_range=(1, 2), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='\\b[a-z][a-z]+\\b',
        tokenizer=None, vocabulary=None)

In [46]:
# Create the term-document matrix
# Transpose it so the terms are the rows
counts = vectorizer.transform(ng_train.data).transpose()

##### Convert to gensim
We need to convert our sparse `scipy` matrix to a `gensim`-friendly object called a Corpus:

In [47]:
# Convert sparse matrix of counts to a gensim corpus
corpus = matutils.Sparse2Corpus(counts)
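A gensim corpus is an iterable of documents, where each document is a list of `(term_id, count)` pairs for the terms it contains. The sketch below builds that structure by hand from a tiny, made-up term-document matrix (terms as rows, documents as columns) to show conceptually what `Sparse2Corpus` yields. It is not gensim's implementation, and `toy_counts` and `to_bow_corpus` are hypothetical names.

```python
# Toy term-document matrix: 4 terms (rows) x 3 documents (columns)
toy_counts = [
    [2, 0, 1],  # term 0
    [0, 1, 0],  # term 1
    [1, 1, 0],  # term 2
    [0, 0, 3],  # term 3
]

def to_bow_corpus(term_doc):
    """Turn each column into a list of (term_id, count) pairs, skipping zeros."""
    n_docs = len(term_doc[0])
    return [
        [(term_id, row[doc_id]) for term_id, row in enumerate(term_doc) if row[doc_id]]
        for doc_id in range(n_docs)
    ]

toy_corpus = to_bow_corpus(toy_counts)
for doc in toy_corpus:
    print(doc)
```

This is also why the matrix was transposed first: by default `Sparse2Corpus` treats each column as a document.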

##### Map matrix rows to words (tokens)
We need to save a mapping (dict) of row id to word (token) for later use by gensim:

In [48]:
# Invert sklearn's word -> id vocabulary into an id -> word dict
id2word = {idx: word for word, idx in vectorizer.vocabulary_.items()}

In [49]:
len(id2word)

118113

## LDA
At this point we can create the LDA model. It requires our corpus of word counts, the mapping of row ids to words, and the number of topics (3, matching the number of newsgroups we kept). We also ask for 10 passes over the corpus.

In [50]:
# Create lda model (equivalent to "fit" in sklearn)
lda = models.LdaModel(corpus=corpus, num_topics=3, id2word=id2word, passes=10)

2016-11-01 14:23:19,460 : INFO : using symmetric alpha at 0.3333333333333333
2016-11-01 14:23:19,461 : INFO : using symmetric eta at 0.3333333333333333
2016-11-01 14:23:19,461 : INFO : using serial LDA version on this node
2016-11-01 14:23:21,538 : INFO : running online LDA training, 3 topics, 10 passes over the supplied corpus of 1661 documents, updating model once every 1661 documents, evaluating perplexity every 1661 documents, iterating 50x with a convergence threshold of 0.001000
2016-11-01 14:23:34,803 : INFO : -17.439 per-word bound, 177712.2 perplexity estimate based on a held-out corpus of 1661 documents with 241620 words
2016-11-01 14:23:34,804 : INFO : PROGRESS: pass 0, at document #1661/1661
2016-11-01 14:23:38,226 : INFO : topic #0 (0.333): 0.002*"don" + 0.002*"image" + 0.001*"does" + 0.001*"good" + 0.001*"people" + 0.001*"graphics" + 0.001*"think" + 0.001*"edu" + 0.001*"god" + 0.001*"like"
2016-11-01 14:23:38,228 : INFO : topic #1 (0.333): 0.002*"like" + 0.002*"think" + 0
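The training log above reports both a per-word bound and a perplexity estimate. The two numbers appear to be related by perplexity = 2^(-bound), which can be checked with a little arithmetic. The values below are copied from the log; the relationship is inferred from those numbers, so treat it as an assumption rather than a statement of gensim's documented behavior.

```python
# Values copied from the training log above
per_word_bound = -17.439
logged_perplexity = 177712.2
num_topics = 3

# Perplexity as 2 raised to the negative per-word bound
perplexity = 2 ** (-per_word_bound)
print(round(perplexity))

# The logged bound is rounded to three decimals, so expect only approximate agreement
relative_error = abs(perplexity - logged_perplexity) / logged_perplexity
print(relative_error < 1e-3)

# The symmetric alpha and eta values in the log equal 1 / num_topics
print(1 / num_topics)
```

A lower perplexity (a bound closer to zero) on later passes would suggest the model is fitting the corpus better.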

Let's take a look at what happened.  Here are the 10 most important words for each of the 3 topics we found:

In [51]:
lda.print_topics()

2016-11-01 14:25:40,921 : INFO : topic #0 (0.333): 0.002*"don" + 0.002*"graphics" + 0.001*"does" + 0.001*"like" + 0.001*"edu" + 0.001*"good" + 0.001*"think" + 0.001*"people" + 0.001*"just" + 0.001*"know"
2016-11-01 14:25:40,923 : INFO : topic #1 (0.333): 0.002*"image" + 0.002*"jpeg" + 0.002*"like" + 0.001*"don" + 0.001*"think" + 0.001*"people" + 0.001*"just" + 0.001*"good" + 0.001*"year" + 0.001*"know"
2016-11-01 14:25:40,925 : INFO : topic #2 (0.333): 0.002*"god" + 0.002*"don" + 0.002*"just" + 0.001*"think" + 0.001*"know" + 0.001*"does" + 0.001*"atheism" + 0.001*"time" + 0.001*"image" + 0.001*"data"


[(0,
  '0.002*"don" + 0.002*"graphics" + 0.001*"does" + 0.001*"like" + 0.001*"edu" + 0.001*"good" + 0.001*"think" + 0.001*"people" + 0.001*"just" + 0.001*"know"'),
 (1,
  '0.002*"image" + 0.002*"jpeg" + 0.002*"like" + 0.001*"don" + 0.001*"think" + 0.001*"people" + 0.001*"just" + 0.001*"good" + 0.001*"year" + 0.001*"know"'),
 (2,
  '0.002*"god" + 0.002*"don" + 0.002*"just" + 0.001*"think" + 0.001*"know" + 0.001*"does" + 0.001*"atheism" + 0.001*"time" + 0.001*"image" + 0.001*"data"')]
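Each topic is printed as a string of `weight*"word"` terms joined by `+`. If you want the words and weights as data, gensim's `show_topic` method returns `(word, weight)` pairs directly. The sketch below instead parses the printed string with a regular expression, so it runs without a trained model. `parse_topic` is a hypothetical helper, and the string holds the first five terms of topic 2 above.

```python
import re

# First five terms of the topic 2 string printed above
topic_str = '0.002*"god" + 0.002*"don" + 0.002*"just" + 0.001*"think" + 0.001*"know"'

def parse_topic(topic):
    """Split a printed topic string into (word, weight) pairs."""
    pairs = re.findall(r'([\d.]+)\*"([^"]+)"', topic)
    return [(word, float(weight)) for weight, word in pairs]

for word, weight in parse_topic(topic_str):
    print(f"{word:10s} {weight:.3f}")
```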

#### Topic Space
To map our documents into the topic space, we apply the `LdaModel` we trained above as a transformer, like so:

In [54]:
# Transform the docs from the word space to the topic space (like "transform" in sklearn)
lda_corpus = lda[corpus]

In [55]:
# Store the documents' topic vectors in a list so we can take a peek
lda_docs = [doc for doc in lda_corpus]

Now we can take a look at the document vectors in the topic space. Each vector is a list of (topic id, weight) pairs that measure the component of the document along each topic. A document vector can therefore have at most num_topics=3 nonzero components, and most have fewer, because gensim omits topics whose weight falls below a small threshold.

In [56]:
# Check out the topic-space vector for the second document (index 1)
lda_docs[1]

[(2, 0.98227502693866175)]

In [57]:
ng_train.data[1]

'\nHere\'s one I remember: (sort of)\nYogi\'s asleep in a hotel room late at night and gets a call from someone.\nAfter he answers the phone the person at the other end asks if he woke Yogi\nup. Yogi answered, "No, the phone did."'
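Because only the topics with noticeable weight are returned, document vectors come back as lists of `(topic_id, weight)` pairs of varying length. For clustering (for example with the `KMeans` class imported earlier), fixed-length vectors are more convenient. Here is a minimal pure-Python sketch of that conversion. `to_dense` is a hypothetical helper (gensim's `matutils.sparse2full` does a similar job), and only the first vector is taken from the output above; the second is made up.

```python
num_topics = 3

def to_dense(sparse_vec, length):
    """Expand (index, weight) pairs into a fixed-length list, filling gaps with 0.0."""
    dense = [0.0] * length
    for idx, weight in sparse_vec:
        dense[idx] = weight
    return dense

example_docs = [
    [(2, 0.98227502693866175)],  # lda_docs[1] from the output above
    [(0, 0.60), (1, 0.39)],      # made-up vector for illustration
]

dense_docs = [to_dense(doc, num_topics) for doc in example_docs]
print(dense_docs)
```

Each row of `dense_docs` now has one entry per topic, which is the shape most clustering tools expect.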

## On your own...
- Go get some of the NIPS papers from [here](https://archive.ics.uci.edu/ml/datasets/Bag+of+Words).  
- Try performing LDA on this data with gensim
- Play with some of the preprocessing options and LDA parameters, and observe what happens
- See if you can use the resulting topic space to extract topic vectors and cluster some documents
- How do your results look?