# Topic Modeling with gensim
We'll try out [Latent Dirichlet Allocation (LDA)](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) in [gensim](http://radimrehurek.com/gensim/index.html) on a collection of AAAI abstracts (loaded below from `aaai_abstracts.pkl`) with some simple preprocessing.

#### Install gensim

In [1]:
# pip install --upgrade gensim

##### imports

In [1]:
# gensim
from gensim import corpora, models, similarities, matutils
# sklearn
from sklearn import datasets
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
# logging for gensim (set to INFO)
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Let's load the pickled topic categories and abstracts, and make sure each abstract is a plain string.

In [60]:
import pickle

# Pickle files should be opened in binary mode ('rb')
with open("aaai_topics.pkl", 'rb') as datafile:
    categories = pickle.load(datafile)

with open("aaai_abstracts.pkl", 'rb') as datafile:
    abstracts = pickle.load(datafile)

# Make sure every abstract is a plain string
abstracts = [str(abstract) for abstract in abstracts]


## Document Preprocessing
We'll need to generate a term-document matrix of word (token) counts for use in LDA.

We'll use `sklearn`'s `CountVectorizer` to generate our term-document matrix of counts. We'll make use of a few parameters to accomplish the following preprocessing of the text documents all within the `CountVectorizer`:
* `analyzer='word'`: Tokenize by word
* `ngram_range=(1, 2)`: Keep all 1- and 2-word grams
* `stop_words='english'`: Remove all English stop words
* `token_pattern='\\b[a-z][a-z]+\\b'`: Match only tokens made up of 2 or more letters (no digits or underscores)
* `min_df=5`: Terms must appear in at least 5 documents
* `max_df=0.1`: Terms must appear in at most 10% of the documents
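
To make these options concrete, here is a rough pure-Python sketch of the tokenization, stop-word removal, and 1- and 2-gram steps. It only approximates what `CountVectorizer` does and is not its actual implementation; the tiny `STOP_WORDS` set and the `tokenize_ngrams` helper are hypothetical names introduced for illustration.

```python
import re

# Hypothetical, tiny stop-word list; sklearn's built-in English list is much longer.
STOP_WORDS = {"the", "of", "a", "is", "and", "to", "in"}
TOKEN_PATTERN = re.compile(r"\b[a-z][a-z]+\b")


def tokenize_ngrams(text, max_n=2):
    """Lowercase, keep tokens of 2+ letters, drop stop words, emit 1..max_n grams."""
    tokens = [t for t in TOKEN_PATTERN.findall(text.lower()) if t not in STOP_WORDS]
    ngrams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams


example_ngrams = tokenize_ngrams("Transfer learning is a form of Machine Learning")
print(example_ngrams)
```

Because stop words are dropped before the n-grams are built in this sketch, a bigram such as `learning form` can join two words that were not adjacent in the original sentence.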

In [61]:
# Create a CountVectorizer for parsing/counting words
count_vectorizer = CountVectorizer(analyzer='word',
                                  ngram_range=(1, 2), stop_words='english',
                                  token_pattern='\\b[a-z][a-z]+\\b', max_df=0.1, min_df=5)
count_vectorizer.fit(abstracts)

CountVectorizer(analyzer='word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=0.1, max_features=None, min_df=5,
        ngram_range=(1, 2), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='\\b[a-z][a-z]+\\b',
        tokenizer=None, vocabulary=None)

In [62]:
# Create the term-document matrix
# Transpose it so the terms are the rows
ng_vecs = count_vectorizer.transform(abstracts).transpose()
ng_vecs.shape

(1380, 398)

##### Convert to gensim
We need to convert our sparse `scipy` matrix to a `gensim`-friendly object called a Corpus:

In [63]:
# Convert sparse matrix of counts to a gensim corpus
corpus = matutils.Sparse2Corpus(ng_vecs)
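
Conceptually, a gensim corpus is an iterable of documents, each represented as a list of `(term_id, count)` pairs. `Sparse2Corpus` treats the matrix columns as documents by default, which is why the matrix was transposed earlier. The sketch below mimics that conversion in pure Python on a toy matrix; `columns_to_bow` is a hypothetical helper, not part of gensim.

```python
# Toy term-document count matrix: rows are terms, columns are documents
# (the same orientation as the transposed matrix above).
term_doc_counts = [
    [2, 0, 1],  # term 0
    [0, 1, 0],  # term 1
    [1, 1, 0],  # term 2
]


def columns_to_bow(matrix):
    """Turn each column into a list of (term_id, count) pairs, skipping zeros."""
    n_terms, n_docs = len(matrix), len(matrix[0])
    return [
        [(term_id, matrix[term_id][doc])
         for term_id in range(n_terms)
         if matrix[term_id][doc]]
        for doc in range(n_docs)
    ]


toy_corpus = columns_to_bow(term_doc_counts)
print(toy_corpus)
```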

##### Map matrix rows to words (tokens)
We need to save a mapping (dict) of row id to word (token) for later use by gensim:

In [64]:
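# Invert the vocabulary (word -> row id) into row id -> word;
# .items() works on both Python 2 and Python 3
id2word = dict((v, k) for k, v in count_vectorizer.vocabulary_.items())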

In [65]:
len(id2word)

1380
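
As a small illustration of why this mapping matters, the sketch below inverts a toy vocabulary and uses it to decode a bag-of-words document back into readable terms. The vocabulary and document here are made up for the example.

```python
# Hypothetical vocabulary in the same shape as count_vectorizer.vocabulary_ (word -> row id).
toy_vocabulary = {"transfer": 0, "learning": 1, "transfer learning": 2}

# Invert it to row id -> word.
toy_id2word = {row_id: word for word, row_id in toy_vocabulary.items()}

# A bag-of-words document as (row_id, count) pairs.
toy_doc = [(0, 2), (2, 1)]
decoded = [(toy_id2word[row_id], count) for row_id, count in toy_doc]
print(decoded)
```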

## LDA
At this point we can simply plow ahead in creating an LDA model.  It requires our corpus of word counts, the mapping of row ids to words, and the number of topics (here `len(categories)`, one topic per category).

In [66]:
# Create lda model (equivalent to "fit" in sklearn)
lda = models.LdaModel(corpus, id2word=id2word, num_topics=len(categories), passes=20)

Let's take a look at what happened.  Here are the 8 most important words for each of the topics we found:

In [67]:
lda.print_topics(num_words=8, num_topics=len(categories))

[(0,
  u'0.013*graph + 0.012*goods + 0.012*knowledge + 0.011*agents + 0.011*level + 0.010*matching + 0.010*free + 0.010*graphs'),
 (1,
  u'0.020*single + 0.013*human + 0.010*knowledge + 0.009*voting + 0.009*activities + 0.009*rule + 0.008*users + 0.008*sets'),
 (2,
  u'0.030*transfer + 0.020*target + 0.015*classification + 0.015*recognition + 0.014*transfer learning + 0.014*missing + 0.013*knowledge + 0.012*target domain'),
 (3,
  u'0.029*label + 0.024*labels + 0.022*case + 0.018*examples + 0.017*view + 0.015*classification + 0.014*labeled + 0.014*supervised'),
 (4,
  u'0.016*bound + 0.013*behavior + 0.011*machine learning + 0.011*machine + 0.010*analysis + 0.010*class + 0.009*learning algorithms + 0.009*markov'),
 (5,
  u'0.025*view + 0.021*objective + 0.016*multi view + 0.013*topic + 0.011*supervised + 0.009*technique + 0.009*analysis + 0.008*matrix'),
 (6,
  u'0.021*dynamics + 0.021*probabilistic + 0.012*program + 0.012*form + 0.011*items + 0.010*online + 0.009*temporal + 0.009*sche
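
Each topic is printed as a single string of `weight*term` pieces joined by ` + `. If you want the terms and weights as data, a small parser like the hypothetical `parse_topic` below can split them apart. It assumes the string format shown above, and also strips the double quotes that newer gensim versions put around each term. gensim's own `lda.show_topic(...)` returns the pairs directly and is likely the better choice when the model is at hand.

```python
def parse_topic(topic_string):
    """Split '0.013*graph + 0.012*goods' into [('graph', 0.013), ('goods', 0.012)]."""
    pairs = []
    for piece in topic_string.split(" + "):
        weight, term = piece.split("*", 1)
        # Newer gensim versions quote the term, e.g. 0.013*"graph"
        pairs.append((term.strip().strip('"'), float(weight)))
    return pairs


# Topic 2, copied from the output above.
topic_2 = ("0.030*transfer + 0.020*target + 0.015*classification + "
           "0.015*recognition + 0.014*transfer learning + 0.014*missing + "
           "0.013*knowledge + 0.012*target domain")
parsed = parse_topic(topic_2)
print(parsed)
```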

#### Topic Space
If we want to map our documents to the topic space we need to actually use the LdaModel transformer that we created above, like so:

In [56]:
# Transform the docs from the word space to the topic space (like "transform" in sklearn)
lda_corpus = lda[corpus]

In [57]:
# Store the documents' topic vectors in a list so we can take a peek
lda_docs = [doc for doc in lda_corpus]

Now we can take a look at the document vectors in the topic space, which measure the component of each document along each topic.  A document vector can have at most `num_topics` (here `len(categories)`) nonzero components, and most have far fewer because gensim omits topics whose weight falls below a small threshold (the `minimum_probability` parameter of `LdaModel`).

In [58]:
# Check out the document vectors in the topic space for the first 5 documents
lda_docs[0:5]

[[(16, 0.9867424242288535)],
 [(6, 0.20650803173677695),
  (10, 0.70208851743068701),
  (16, 0.07790913262600753)],
 [(8, 0.97727272723301539)],
 [(3, 0.97552447545325294)],
 [(5, 0.98295454542590188)]]
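
For clustering or other downstream work it is often handy to expand these sparse vectors into fixed-length dense ones and to pick each document's dominant topic. The sketch below does this in pure Python using the five vectors above (rounded). `NUM_TOPICS = 17` is an assumption: the real value is `len(categories)`, which is not shown here, and 17 is simply the smallest value consistent with topic id 16 appearing in the output. `to_dense` is a hypothetical helper; gensim's `matutils.sparse2full` does a similar job.

```python
# Topic vectors for the first five documents, copied from the output above (rounded).
docs_topics = [
    [(16, 0.9867)],
    [(6, 0.2065), (10, 0.7021), (16, 0.0779)],
    [(8, 0.9773)],
    [(3, 0.9755)],
    [(5, 0.9830)],
]

# Assumption: stands in for len(categories), which the notebook does not display.
NUM_TOPICS = 17


def to_dense(doc, num_topics):
    """Expand sparse (topic_id, weight) pairs into a fixed-length list."""
    dense = [0.0] * num_topics
    for topic_id, weight in doc:
        dense[topic_id] = weight
    return dense


dense_docs = [to_dense(doc, NUM_TOPICS) for doc in docs_topics]

# The dominant topic is the one with the largest weight in each document.
dominant_topics = [max(doc, key=lambda pair: pair[1])[0] for doc in docs_topics]
print(dominant_topics)
```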

In [59]:
abstracts[0]

'Transfer learning considers related but distinct tasks defined on heterogenous domains and tries to transfer knowledge between these tasks to improve generalization performance. It is particularly useful when we do not have sufficient amount of labeled training data in some tasks, which may be very costly, laborious, or even infeasible to obtain. Instead, learning the tasks jointly enables us to effectively increase the amount of labeled training data. In this paper, we formulate a kernelized Bayesian transfer learning framework that is a principled combination of kernel-based dimensionality reduction models with task-specific projection matrices to find a shared subspace and a coupled classification model for all of the tasks in this subspace. Our two main contributions are: (i) two novel probabilistic models for binary and multiclass classification, and (ii) very efficient variational approximation procedures for these models. We illustrate the generalization performance of our algo

## On your own...
- Go get some of the NIPS papers from [here](https://archive.ics.uci.edu/ml/datasets/Bag+of+Words).  
- Try performing LDA on this data with gensim
- Play with some of the preprocessing options and parameters for LDA, observe what happens
- See if you can use the resulting topic space to extract topic vectors and cluster some documents
- How do your results look?