<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Guided Practice With Topic Modeling and LDA

_Authors: Dave Yerrington (SF)_

---

> **Note: This lab is intended to be completed with guidance from the instructor.**

You'll rarely need to build an unsupervised topic model like LDA from scratch. Luckily, scikit-learn comes with LDA topic modeling functionality, and the `gensim` package is another popular LDA implementation.

Let's start with a brief walk-through of LDA and topic modeling using `gensim`, then fit the same kind of model with scikit-learn. We'll work with a small collection of documents represented as a list.

### 1) Load the packages and create the small "documents."

You may need to install the `gensim` package with `pip` or `conda`.

In [1]:
from gensim import corpora, models, matutils
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from collections import defaultdict
import pandas as pd


doc_a = "Brocolli is good to eat. My brother likes to eat good brocolli, but not my mother."
doc_b = "My mother spends a lot of time driving my brother around to baseball practice."
doc_c = "Some health experts suggest that driving may cause increased tension and blood pressure."
doc_d = "I often feel pressure to perform well at school, but my mother never seems to drive my brother to do better."
doc_e = "Health professionals say that brocolli is good for your health."

# Compile the sample documents into a list.
documents = [doc_a, doc_b, doc_c, doc_d, doc_e]
df        = pd.DataFrame(documents, columns=['text'])

In [2]:
df

Unnamed: 0,text
0,Brocolli is good to eat. My brother likes to e...
1,My mother spends a lot of time driving my brot...
2,Some health experts suggest that driving may c...
3,I often feel pressure to perform well at schoo...
4,Health professionals say that brocolli is good...


### 2) Load stop words either from NLTK or scikit-learn.

In [3]:
from nltk.corpus import stopwords

In [4]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

In [5]:
# A:

### 3) Use `CountVectorizer` to transform our text and take out the stop words.

In [6]:
# A:
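
One possible approach, as a minimal sketch: pass `stop_words="english"` to `CountVectorizer` so that scikit-learn's built-in English stop word list is applied during tokenization. The names `vectorizer` and `X` are assumptions chosen to match how later steps in this lab refer to them.

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "Brocolli is good to eat. My brother likes to eat good brocolli, but not my mother.",
    "My mother spends a lot of time driving my brother around to baseball practice.",
    "Some health experts suggest that driving may cause increased tension and blood pressure.",
    "I often feel pressure to perform well at school, but my mother never seems to drive my brother to do better.",
    "Health professionals say that brocolli is good for your health.",
]

# stop_words="english" applies scikit-learn's built-in English stop word list.
vectorizer = CountVectorizer(stop_words="english")

# X is a sparse matrix with one row per document and one column per remaining term.
X = vectorizer.fit_transform(documents)
print(X.shape)
```

To use the NLTK list instead, you could pass a list of words to `stop_words` rather than the string `"english"`.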

### 4) Extract the tokens that remain after stop word removal.

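The `.vocabulary_` attribute of the vectorizer contains a dictionary that maps each term to its column index. There's also the built-in `.get_feature_names_out()` method (named `.get_feature_names()` in scikit-learn versions before 1.0), which returns the column names in order.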

In [7]:
# A:

### 5) Get counts of the tokens.

Convert the sparse matrix returned by the vectorizer to a dense matrix, then sum by column to get the counts per term.

In [8]:
# A:

### 6) Set up the vocabulary dictionary.

First, we need to set up the vocabulary. Gensim's LDA expects our vocabulary to be formatted such that the dictionary keys are the column indices and the values are the words themselves, like this:

{0: 'baseball',  
 1: 'better',  
 2: 'blood',  
 3: 'brocolli',  
 4: 'cause',  
 5: 'drive',  
 6: 'driving',  
 7: 'eat',  
 8: 'experts'}  

Create this dictionary below.  

HINT: `vectorizer.vocabulary_.items()` (`.iteritems()` in Python 2)

In [9]:
# A:
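
One way to build this mapping, sketched with a dictionary comprehension that swaps the keys and values of `vectorizer.vocabulary_`. The name `vocab` is an assumption, chosen to match the `id2word` argument used when the model is set up later in this lab.

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "Brocolli is good to eat. My brother likes to eat good brocolli, but not my mother.",
    "My mother spends a lot of time driving my brother around to baseball practice.",
    "Some health experts suggest that driving may cause increased tension and blood pressure.",
    "I often feel pressure to perform well at school, but my mother never seems to drive my brother to do better.",
    "Health professionals say that brocolli is good for your health.",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)

# vocabulary_ maps term -> column index; gensim wants column index -> term.
vocab = {index: term for term, index in vectorizer.vocabulary_.items()}

# Show the first few entries in index order.
for index in sorted(vocab)[:5]:
    print(index, vocab[index])
```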

### 7) Create a token to ID mapping with gensim's `corpora.Dictionary`.

This dictionary class is a more standard way to work with gensim models. There are a few standard steps we should take:

**7.A) Count the frequency of the words.**

We can do this with Python's `defaultdict(int)`, which supplies a default count of 0 for missing keys, so we can increment a token's count without first checking that the key exists:

```python
frequency = defaultdict(int)

for text in documents:
    for token in text.split():
        frequency[token] += 1
```




In [10]:
# A:
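
Note that splitting on whitespace alone treats `Brocolli`, `brocolli,` and `brocolli` as three different tokens. The sketch below is one possible refinement, not the only one: it lowercases the text and strips surrounding punctuation before counting. The `tokenize` helper is hypothetical and written for this example.

```python
import string
from collections import defaultdict

documents = [
    "Brocolli is good to eat. My brother likes to eat good brocolli, but not my mother.",
    "My mother spends a lot of time driving my brother around to baseball practice.",
    "Some health experts suggest that driving may cause increased tension and blood pressure.",
    "I often feel pressure to perform well at school, but my mother never seems to drive my brother to do better.",
    "Health professionals say that brocolli is good for your health.",
]


def tokenize(text):
    """Lowercase, split on whitespace, and strip punctuation from token edges."""
    stripped = (word.strip(string.punctuation) for word in text.lower().split())
    return [token for token in stripped if token]


frequency = defaultdict(int)

for text in documents:
    for token in tokenize(text):
        # Missing keys start at 0, so no existence check is needed.
        frequency[token] += 1

print(frequency["brocolli"], frequency["health"])
```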

**7.B) Remove any words that appear only once or that are in the stop word list.**

Iterate through the documents and only keep the useful words and tokens.

In [11]:
# A:
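
A minimal sketch of the filtering step, assuming scikit-learn's `ENGLISH_STOP_WORDS` as the stop word list and a simple hypothetical `tokenize` helper. The result is stored in `texts`, a name assumed here because a later hint in this lab refers to `texts[2]`.

```python
import string
from collections import Counter

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

documents = [
    "Brocolli is good to eat. My brother likes to eat good brocolli, but not my mother.",
    "My mother spends a lot of time driving my brother around to baseball practice.",
    "Some health experts suggest that driving may cause increased tension and blood pressure.",
    "I often feel pressure to perform well at school, but my mother never seems to drive my brother to do better.",
    "Health professionals say that brocolli is good for your health.",
]


def tokenize(text):
    """Lowercase, split on whitespace, and strip punctuation from token edges."""
    stripped = (word.strip(string.punctuation) for word in text.lower().split())
    return [token for token in stripped if token]


tokenized = [tokenize(text) for text in documents]

# Corpus-wide token frequencies.
frequency = Counter(token for tokens in tokenized for token in tokens)

# Keep tokens that occur more than once and are not stop words.
texts = [
    [
        token
        for token in tokens
        if frequency[token] > 1 and token not in ENGLISH_STOP_WORDS
    ]
    for tokens in tokenized
]

print(texts[0])
```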

**7.C) Create the `corpora.Dictionary` object with the retained tokens.**

In [12]:
# A:

**7.D) Use the `dictionary.doc2bow()` function to convert the texts to bag-of-words representations.**

In [13]:
# A:

**Why should we use this process?**

The main advantage is that this dictionary object comes with quick helper functions, such as `.doc2bow()` and the `.token2id` mapping.

There are also some major performance advantages. Tokenization can take a while to compute, especially when the text files are quite large. You can save the computed dictionary to a file, then quickly load it from disk later instead of rebuilding it.

It's also possible to add new documents to your corpus without having to re-tokenize your entire set. This is great for online systems that can take new documents on demand.  

This is a much better way to handle LDA and other gensim models as you work with larger text data sets.

### 8) Set up the LDA model.

We can create the gensim LDA model object like so:

```python
lda = models.LdaModel(
    # Supply our sparse predictor matrix wrapped in a matutils.Sparse2Corpus object
    # (documents_columns=False because each row of X is a document):
    matutils.Sparse2Corpus(X, documents_columns=False),
    # Or, alternatively, use the corpus object created with the dictionary in step 7:
    # corpus,
    # The number of topics we want:
    num_topics=3,
    # How many passes over the corpus during training:
    passes=20,
    # The id2word vocabulary we made ourselves:
    id2word=vocab,
    # Or, use the gensim dictionary object:
    # id2word=dictionary,
)
```

In [14]:
# A:

### 9) Look at the topics.

The model has a `.print_topics()` function that accepts the number of topics to print and the number of words per topic. The number before each word is that word's probability within the topic.

In [15]:
for topic in lda.print_topics(num_topics=3, num_words=5):
    print(topic[1])

### 10) Get the topic scores for a document.

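The `.get_document_topics()` function accepts a bag-of-words representation for a document and returns a list of `(topic_id, probability)` pairs. Topics whose probability falls below the model's `minimum_probability` are omitted.

HINT: `dictionary.doc2bow(texts[2])`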

In [16]:
# A:

### 11) Label and visualize the topics.

Let's come up with some high-level labels. This is the subjective part of LDA: the model only gives us word probabilities for each topic, and it's up to us to decide what each topic means. Let's make some labels up.

Plot a heat map of the topic probabilities for each of the documents.

In [17]:
# A:

### 12) Fit an LDA model with scikit-learn.

Scikit-learn's LDA model is in the decomposition submodule:

```python
from sklearn.decomposition import LatentDirichletAllocation
```

One of the greatest benefits of the scikit-learn implementation is that it comes with the familiar `.fit()`, `.transform()`, and `.fit_transform()` methods.

**12.A) Initialize and fit a scikit-learn LDA with `n_components=3` (called `n_topics` before scikit-learn 0.19) on our output from the `CountVectorizer`.**

In [18]:
# A:

**12.B) Print out the topic-word distributions using the `.components_` attribute.**

Each row of this matrix represents a topic, and the columns represent the words. (The values are unnormalized pseudocounts, not probabilities; dividing each row by its sum gives the topic-word distribution.)

In [19]:
# A:

**12.C) Use the `.transform()` method to convert the matrix into the topic scores.**

These are the document-topic distributions.

In [20]:
# A:

### 13) Further steps.

This has been a very basic example. LDA typically doesn't perform well on small data sets, so try it on a larger data set of your own to see how it behaves. Keep in mind that finding the optimal number of topics can be tricky and subjective.

**Generally, you should consider:**
- How well the topics apply to the documents overall.
- The strength of the topics across all documents.
- Improving preprocessing, such as stop word removal.
- Building a nice web interface to explore your documents (see: [LDAExplorer](https://github.com/dyerrington/LDAExplorer) and [pyLDAvis](https://github.com/bmabey/pyLDAvis/blob/master/README.rst)).

These general guidelines should help you tune your hyperparameter **k** for the number of topics.