# Topic Modeling with gensim
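We'll try out [Latent Dirichlet Allocation (LDA)](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) in [gensim](http://radimrehurek.com/gensim/index.html) on a small corpus of plain-text files loaded from a local directory, with some simple preprocessing. (`sklearn.datasets`, which provides the [20 Newsgroups dataset](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html), is imported below but not used in the cells shown here.)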

In [1]:
# gensim
from gensim import corpora, models, similarities, matutils
# sklearn
from sklearn import datasets
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
# logging for gensim (set to INFO)
#import logging
#logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [2]:
import glob, os

In [3]:
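import io

texts = []
path = '/home/amn34/metis/stuff/noirs/chandler/'
# Read every .txt file in the directory as one document
for book in glob.glob(os.path.join(path, '*.txt')):
    # io.open decodes to unicode on Python 2 and 3 and closes the file for us
    with io.open(book, encoding='utf-8') as f:
        texts.append(f.read())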

In [4]:
from nltk.corpus import stopwords
# Start from NLTK's English stop words, then add frequent corpus-specific words
stop = stopwords.words('english')
stop = stop + [u'said', u'went', u'could', u'would', u'got', u'get', u'looked',
               u'around', u'man', u'one', u'put', u'back',
               u'like', u'know', u'little']

In [5]:
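# Create a CountVectorizer for parsing/counting words:
# unigrams and bigrams of lowercase alphabetic tokens (2+ letters),
# keeping terms found in at least 2 documents and in at most 50% of them
count_vectorizer = CountVectorizer(analyzer='word',
                                   ngram_range=(1, 2), stop_words=stop,
                                   token_pattern='\\b[a-z][a-z]+\\b',
                                   max_df=.5, min_df=2)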

In [6]:
count_vectorizer.fit(texts)

CountVectorizer(analyzer='word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=0.5, max_features=None, min_df=1,
        ngram_range=(1, 2), preprocessor=None,
        stop_words=[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'himself', u'she', u'her', u'hers', u'herself', u'it', u'its', u'itself', u'they', u'them', u'their', u'theirs', u'themselves', u'what', u'w... u'got', u'get', u'looked', u'around', u'man', u'one', u'put', u'back', u'like', u'know', u'little'],
        strip_accents=None, token_pattern='\\b[a-z][a-z]+\\b',
        tokenizer=None, vocabulary=None)
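
To make the vectorizer settings more concrete, here is a minimal pure-Python sketch of two of them: the `token_pattern` regex and the `min_df`/`max_df` document-frequency cutoffs. It is an illustration under simplifying assumptions (unigrams only, no stop-word removal), not scikit-learn's implementation; `tokenize`, `document_frequency_filter`, and `toy_docs` are hypothetical names introduced for this example.

```python
import re
from collections import Counter

# Same pattern as passed to CountVectorizer: runs of 2+ lowercase letters
TOKEN_PATTERN = re.compile(r'\b[a-z][a-z]+\b')

def tokenize(text):
    # CountVectorizer lowercases by default before applying the pattern
    return TOKEN_PATTERN.findall(text.lower())

def document_frequency_filter(docs, min_df=2, max_df=0.5):
    """Keep terms found in at least `min_df` documents (absolute count)
    and in at most `max_df` of all documents (fraction)."""
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))  # count each term once per document
    max_count = max_df * len(docs)
    return sorted(term for term, n in df.items() if min_df <= n <= max_count)

# Hypothetical toy documents, not taken from the corpus
toy_docs = [
    "Marlowe drove to Bay City",
    "Marlowe met Degarmo in Bay City",
    "Geiger sold books and Marlowe watched",
    "Geiger drove home",
]

print(tokenize("Marlowe's .38 was in the car, a Chrysler"))
print(document_frequency_filter(toy_docs))
```

In the toy example a word that occurs in three of the four documents is dropped by `max_df=0.5`. With only a handful of documents, the same cutoff removes any term shared by most of them.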

In [7]:
# Create the term-document matrix
# Transpose it so the terms are the rows
ng_vecs = count_vectorizer.transform(texts).transpose()
ng_vecs.shape

(186633, 7)

##### Convert to gensim
We need to convert our sparse `scipy` matrix to a `gensim`-friendly object called a Corpus:

In [8]:
# Convert sparse matrix of counts to a gensim corpus
corpus = matutils.Sparse2Corpus(ng_vecs)
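
As a rough picture of what the conversion produces: a gensim corpus is an iterable of documents, each a list of `(term_id, count)` pairs, and `Sparse2Corpus` reads documents from the matrix columns by default, which is why the matrix was transposed first. The sketch below mimics that on a small dense list-of-lists. It is not gensim's code, and `columns_to_corpus` and `toy_matrix` are hypothetical names.

```python
def columns_to_corpus(term_doc_rows):
    """term_doc_rows[t][d] is the count of term t in document d.
    Return one list of (term_id, count) pairs per document (column),
    skipping zero counts."""
    n_docs = len(term_doc_rows[0])
    corpus = []
    for d in range(n_docs):
        doc = [(t, row[d]) for t, row in enumerate(term_doc_rows) if row[d] != 0]
        corpus.append(doc)
    return corpus

# Hypothetical 3-term x 3-document count matrix (terms are rows)
toy_matrix = [
    [2, 0, 1],
    [0, 3, 0],
    [1, 1, 0],
]

corpus = columns_to_corpus(toy_matrix)
for doc in corpus:
    print(doc)
```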

##### Map matrix rows to words (tokens)
We need to save a mapping (dict) of row id to word (token) for later use by gensim:

In [9]:
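# Invert sklearn's term -> column-index vocabulary into id -> term
# (.items() works on both Python 2 and 3, unlike .iteritems())
id2word = {idx: term for term, idx in count_vectorizer.vocabulary_.items()}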

In [10]:
len(id2word)

186633
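
To illustrate why the mapping is needed, here is a small sketch that inverts a toy vocabulary and uses it to turn one bag-of-words document back into readable terms. The vocabulary and counts are made up for the example, and `toy_vocabulary`, `toy_id2word`, and `toy_doc` are hypothetical names.

```python
# Toy stand-in for count_vectorizer.vocabulary_ (term -> row id)
toy_vocabulary = {'geiger': 0, 'bay city': 1, 'wade': 2}

# Same inversion as above: row id -> term
toy_id2word = {idx: term for term, idx in toy_vocabulary.items()}

# One document in gensim's bag-of-words form: (term_id, count) pairs
toy_doc = [(0, 4), (2, 1)]

# Replace ids with terms so the counts are human-readable
readable = [(toy_id2word[term_id], count) for term_id, count in toy_doc]
print(readable)
```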

## LDA
At this point we can simply plow ahead in creating an LDA model.  It requires our corpus of word counts, mapping of row ids to words, and the number of topics (3).

In [14]:
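# Create lda model (equivalent to "fit" in sklearn)
# eta must be numeric (a scalar or array) or a supported string such as 'auto';
# passing the string '0.1' raises
# "ValueError: Unable to determine proper eta value given '0.1'"
lda = models.LdaModel(corpus, id2word=id2word, num_topics=3, passes=5, alpha='auto', eta=0.1)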

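Let's take a look at what happened. `print_topics` lists the most heavily weighted words for each topic. Note that the output shown below lists seven topics (0 to 6) even though the call asks for 3, so it appears to come from an earlier run of the notebook with different settings: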

In [12]:
lda.print_topics(num_words=5, num_topics=3)

[(0,
  u'0.000*wade + 0.000*mitchell + 0.000*murdock + 0.000*hench + 0.000*steelgrave'),
 (1,
  u'0.000*wade + 0.000*kingsley + 0.000*degarmo + 0.000*mitchell + 0.000*geiger'),
 (2,
  u'0.002*mitchell + 0.001*goble + 0.001*betty + 0.001*brandon + 0.001*mayfield'),
 (3,
  u'0.000*wade + 0.000*kingsley + 0.000*degarmo + 0.000*bay city + 0.000*murdock'),
 (4,
  u'0.000*wade + 0.000*degarmo + 0.000*geiger + 0.000*kingsley + 0.000*lavery'),
 (5,
  u'0.001*wade + 0.001*degarmo + 0.001*kingsley + 0.001*lavery + 0.001*lennox'),
 (6,
  u'0.002*geiger + 0.001*murdock + 0.001*regan + 0.001*vannier + 0.001*brody')]
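
The listing above can be read as "take each topic's word weights, sort them, and keep the top few". A minimal sketch of that selection follows, using made-up weights; it is not gensim's implementation, and `top_words`, `format_topic`, `toy_weights`, and `toy_id2word` are hypothetical names.

```python
def top_words(topic_weights, id2word, num_words=5):
    """Return the `num_words` highest-weighted (word, weight) pairs
    for one topic, given a list of weights indexed by term id."""
    ranked = sorted(enumerate(topic_weights), key=lambda pair: pair[1], reverse=True)
    return [(id2word[term_id], weight) for term_id, weight in ranked[:num_words]]

def format_topic(pairs):
    # Mimic the "weight*word + weight*word" layout of the output above
    return ' + '.join('%.3f*%s' % (weight, word) for word, weight in pairs)

# Made-up weights for a four-term vocabulary
toy_weights = [0.1, 0.5, 0.25, 0.15]
toy_id2word = {0: 'wade', 1: 'geiger', 2: 'regan', 3: 'brody'}

best = top_words(toy_weights, toy_id2word, num_words=2)
print(format_topic(best))
```

Weights are printed to three decimals, so with a vocabulary of 186,633 terms many of the individual word weights round to `0.000`, as in several of the topics above.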