# [LEGALST-190] Lab 4/10: Topic Models

This lab will cover latent Dirichlet allocation (LDA) and topic models using `gensim` and `scikit-learn`.

*Estimated Time: 35 Minutes*

### Table of Contents
[The Data](#section data)<br>
1 - [Using Gensim to Implement a LDA Model](#section 1)<br>
2 - [Using scikit-learn](#section 2)<br>
3 - [Finding topics from UN Debates](#section 3)<br>

**Dependencies:**

In [3]:
import string

import nltk
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords

#!pip install gensim
from gensim import corpora, models

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

import numpy as np
import pandas as pd

from helper import *




----
## The Data<a id='section data'></a>

For this lab, we'll use scikit-learn's `20 newsgroups` dataset, a collection of approximately 18,000 newsgroup posts. Because of its size, we'll only be working with about 750 posts. At the end of this lab, we'll also work with a selected portion of the UN data.

----

## Section 1: Using Gensim to Implement a LDA Model<a id='section 1'></a>

### What Is Latent Dirichlet Allocation?
Latent Dirichlet allocation (LDA) is a way of discovering topics in a set of documents based on word frequency. LDA is a probabilistic bag-of-words model: it assumes that each document is produced from a mixture of topics, and that each topic produces words with certain probabilities. It then works backwards, finding a set of topics that would most plausibly have generated the documents.
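
The generative story described above can be sketched in a few lines of plain Python. This is only an illustration of the assumption LDA makes, not how `gensim` or `scikit-learn` fit a model; the two toy topics, their probabilities, and the function names are all made up for the example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    """Generate a toy document by following LDA's generative story."""
    # 1) Pick this document's mixture over topics.
    mixture = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # 2) Pick a topic for this word position according to the mixture.
        k = rng.choices(range(len(topics)), weights=mixture)[0]
        # 3) Pick a word according to that topic's word probabilities.
        topic = topics[k]
        word = rng.choices(list(topic), weights=list(topic.values()))[0]
        words.append(word)
    return mixture, words

# Two made-up topics (word -> probability); values are illustrative only.
toy_topics = [
    {"orbit": 0.5, "launch": 0.3, "space": 0.2},
    {"god": 0.4, "church": 0.4, "bibl": 0.2},
]
rng = random.Random(0)
mixture, doc = generate_document(toy_topics, alpha=[0.5, 0.5], length=8, rng=rng)
print([round(p, 2) for p in mixture], doc)
```

Fitting an LDA model amounts to running this story in reverse: given only the words, estimate topics and mixtures that could plausibly have produced them.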

----

### Using `gensim`

We'll use the LDA algorithm from `gensim`, a Python library for topic modelling.

Let's get working with the data. The `20 newsgroups` data is under the name `20newsgroups_data.csv` in the data folder.

**Question 1.1:** Retrieve the posts from the DataFrame and assign the list to a variable named `documents`.

In [None]:
data = pd.read_csv('data/20newsgroups_data.csv')
...

In [6]:
#SOLUTION
data = pd.read_csv('data/20newsgroups_data.csv')
documents = data['posts'].values
documents[5]

" \n \nI read somewhere, I think in Morton Smith's _Jesus the Magician_, that\nold Lazarus wasn't dead, but going in the tomb was part of an initiation\nrite for a magi-cult, of which Jesus was also a part.   It appears that\na 3-day stay was normal.   I wonder .... ?"

Awesome! We now have data we can work with. Before we start anything, we must clean the text!

Just to review, we want to process our text by:<br>
1) Tokenizing our document<br>
2) Removing stop words (remove meaningless words)<br>
3) Stemming or merging words that have equivalent meanings<br>

<a id='gensim'></a>**Question 1.2:** Tokenize and stem the text in `documents` and filter your tokens by using both `stop` and `more_stops` and punctuation.

In [None]:
stop = stopwords.words('english')
punctuation = string.punctuation
stemmer = SnowballStemmer("english")
more_stops = ['--', '``', "''", "s'", "\'s", "n\'t", "...", "\'m", "-*-", "-|"]

...

In [3]:
#Potential Solution
stop = stopwords.words('english')
punctuation = string.punctuation
stemmer = SnowballStemmer("english")

more_stops = ['--', '``', "''", "s'", "\'s", "n\'t", "...", "\'m", "-*-", "-|"] 
tokenized = []
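for doc in documents:
    # Split into sentences, then words, and lowercase everything
    tokens = [word.lower() for sent in nltk.sent_tokenize(doc) for word in nltk.word_tokenize(sent)]
    # Drop punctuation and the extra stop tokens
    filtered_tokens = [x for x in tokens if x not in punctuation and x not in more_stops]
    # Drop English stop words
    stopped_tokens = [x for x in filtered_tokens if x not in stop]
    # Reduce each remaining word to its stem
    stemmed_tokens = [stemmer.stem(token) for token in stopped_tokens]
    tokenized.append(stemmed_tokens)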

Now that we have our tokenized documents, we have to convert them to a *document-term matrix*, which starts with instantiating a `gensim` dictionary object. Our first step is to turn our tokenized documents into a "dictionary" that maps each word to an integer ID, as in a bag-of-words model. <a id='Q1.3'></a>

**Question 1.3:** Implement a gensim dictionary from the `corpora` package and assign it to a variable named `dictionary`. You can look [at the documentation](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary) for the corpora package if necessary.

In [4]:
#SOLUTION
dictionary = corpora.Dictionary(tokenized)

This is the last step before we implement the model! We must convert our documents to bag-of-words format using our dictionary. Every document is represented as a list of tuples, each holding a word's integer ID and its frequency in that document. This list of 738 documents represents our document-term matrix.
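
To make the bag-of-words format concrete, here is a minimal pure-Python sketch of the idea. It is a simplified stand-in, not `gensim`'s implementation: the helper names `build_token_ids` and `to_bow` and the toy documents are hypothetical, and the IDs `gensim` assigns may differ.

```python
from collections import Counter

def build_token_ids(tokenized_docs):
    """Assign each distinct token an integer ID in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

toy_docs = [["god", "church", "god"], ["orbit", "launch", "god"]]
toy_ids = build_token_ids(toy_docs)
toy_corpus = [to_bow(doc, toy_ids) for doc in toy_docs]
print(toy_ids)     # {'god': 0, 'church': 1, 'orbit': 2, 'launch': 3}
print(toy_corpus)  # [[(0, 2), (1, 1)], [(0, 1), (2, 1), (3, 1)]]
```

Each inner list is one document, and each tuple says "word with this ID appears this many times". Words that do not occur in a document are simply left out, which keeps the representation compact.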

**Question 1.4:** Using `dictionary` from the previous question, convert your tokenized documents into bag-of-words format and store the result in a variable named `corpus`. We want to use the `doc2bow()` method ***for every document*** in our tokenized text. The documentation is linked [here](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.doc2bow).

You should end up with a list of tuples for each document and make sure that `len(corpus)` is 738. Calling `corpus[i]` for some integer i should return something of the form `[(16, 2), (58, 1), (59, 1),...`

In [5]:
#SOLUTION
corpus = [dictionary.doc2bow(text) for text in tokenized]

Now that we have a document-term matrix, we’re ready to generate an LDA model!

The cell below is an example of an implementation of a `gensim` LDA model. Run the cell and take a look at what it displays.

In [6]:
ldamodel = models.LdaModel(corpus, 
                           id2word=dictionary,
                           num_topics=6,
                           chunksize=270, 
                           update_every=20,
                           passes=3)
#We have defined a helper function show_topics that takes in the model as an argument.
show_topics(ldamodel)

Topic 0
0.007*"would" + 0.004*"use" + 0.004*"one" + 0.004*"peopl" + 0.003*"get" + 0.003*"think" + 0.003*"like" + 0.003*"even" + 0.002*"could" + 0.002*"say"

Topic 1
0.006*"one" + 0.005*"peopl" + 0.005*"think" + 0.005*"use" + 0.004*"like" + 0.004*"would" + 0.004*"go" + 0.004*"get" + 0.004*"know" + 0.004*"time"

Topic 2
0.006*"use" + 0.004*"one" + 0.003*"system" + 0.003*"would" + 0.003*"like" + 0.003*"make" + 0.003*"govern" + 0.003*"new" + 0.003*"get" + 0.003*"also"

Topic 3
0.009*"would" + 0.005*"one" + 0.005*"like" + 0.005*"peopl" + 0.004*"use" + 0.004*"god" + 0.004*"know" + 0.003*"get" + 0.003*"good" + 0.003*"hiv"

Topic 4
0.015*"1" + 0.013*"2" + 0.006*"3" + 0.004*"get" + 0.003*"4" + 0.003*"know" + 0.003*"copi" + 0.003*"one" + 0.003*"6" + 0.003*"make"

Topic 5
0.007*"use" + 0.005*"one" + 0.005*"would" + 0.004*"send" + 0.003*"file" + 0.003*"get" + 0.003*"know" + 0.003*"like" + 0.003*"also" + 0.003*"imag"



----

Our model returned some topics! The output looks like a jumble of words and numbers, but remember that LDA is a probabilistic model. For clarity, the `show_topics` function utilizes the `LdaModel` method `.show_topics()`, which gives us the words that contribute the most to each topic. The topics are unlabelled and their order changes from run to run, because we aren't defining them ourselves. The numbers in front of the words represent the probability of that word appearing in the topic. If we look at **Topic 4** in the output above, we can infer from the weights that the topic is more about "1" and "2" than about "get", which has a much lower value than the first two.
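
If you want to compare the weights numerically rather than by eye, one option is to parse a printed topic line into (word, weight) pairs. The sketch below does this with a regular expression on a shortened copy of the Topic 4 line from the output above; `parse_topic` is a hypothetical helper written for this illustration, not part of `gensim` or `helper.py`.

```python
import re

def parse_topic(topic_string):
    # Each term in the printed line looks like: 0.015*"1"
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

# First five terms of Topic 4 as printed above.
topic_4 = '0.015*"1" + 0.013*"2" + 0.006*"3" + 0.004*"get" + 0.003*"4"'
weights = parse_topic(topic_4)
top_word, top_weight = weights[0]
for word, weight in weights:
    # Show each weight and how it compares with the strongest word.
    print(f"{word:>4}  {weight:.3f}  {weight / top_weight:.2f}x of the top word")
```

With a fitted model you do not need to parse strings: `LdaModel.show_topic(topic_id, topn=10)` returns the same kind of (word, probability) pairs directly.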

Although we aren't explicitly defining the topics, we are telling the computer how many topics to look for. LDA treats each document as a mix of words and a mix of topics. It chooses words that contribute to a topic and finds certain topics that describe a document.

Unfortunately, we are working with a pretty small set of documents, and our topics would be more defined with a larger corpus. Still, we can work with our data to get clearer topics. Let's start choosing better parameters for our model!

----

There are quite a few parameters for generating the `LdaModel` that affect the quality of the topics returned. We'll go over some of the helpful and important parameters to use when implementing an LDA model in `gensim`.

| Required Parameters        |Value                          | Default | What it does  |
| :-------------------------:|:-----------------------------:| --|:-------------:|
|                corpus      | corpus (doc-term matrix) | None | The documents the model is trained on, <br> in bag-of-words format. |
| id2word     | `gensim` dictionary | None | Maps the integer IDs in the doc-term matrix <br>back to words. |
| num_topics<br> | integer | 100 |Specifies the number of underlying topics <br> in your documents. Usually, the fewer <br> documents you have, the smaller number <br> you assign [(this is a hot topic!)](https://www.quora.com/Latent-Dirichlet-Allocation-LDA-What-is-the-best-way-to-determine-k-number-of-topics-in-topic-modeling).<br>|


| Optional (but helpful) Parameters   |Default | What it does  |
|:------------------------------------|------- |:-------------------------------:|
| passes | 1 | How many times you want to iterate through the corpus.<br> More passes generally give a better-fitted model, <br> although training takes longer on a large dataset. |
|chunksize | 2000 | The number of documents processed in each batch.<br> e.g. with chunksize = 10, we run 10 documents at a time.| 
|update_every |1 | Update the model after every `n` chunks. |
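
To see how `passes`, `chunksize`, and `update_every` interact, it can help to count roughly how many times the model gets updated during training. The sketch below is a back-of-the-envelope estimate, assuming a single worker, `update_every` of at least 1, and one extra update at the end of a pass when chunks are left over; `count_updates` is a hypothetical helper and `gensim`'s actual scheduling may differ in detail. The first call uses the settings of the model defined earlier; the second uses an alternative setting chosen purely for illustration.

```python
import math

def count_updates(n_docs, chunksize, update_every, passes):
    """Estimate (chunks per pass, total model updates), assuming one worker."""
    chunks_per_pass = math.ceil(n_docs / chunksize)
    # One update after every `update_every` chunks...
    updates_per_pass = chunks_per_pass // update_every
    # ...plus one more at the end of the pass if any chunks are still pending.
    if chunks_per_pass % update_every:
        updates_per_pass += 1
    return chunks_per_pass, updates_per_pass * passes

print(count_updates(738, chunksize=270, update_every=20, passes=3))   # (3, 3)
print(count_updates(738, chunksize=75, update_every=3, passes=15))    # (10, 60)
```

Under these assumptions, the first configuration refines the topics only a handful of times, while the second refines them far more often over the same 738 documents.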




**Question 1.5:** After reviewing the parameters, go back to the previous model. What is problematic about it and its results? Which parameters do you think you should change first to get more explicit results?

*Write your answer here*

<a id='Q1.6a'></a> **Question 1.6a:** Improve the previous LDA model by adjusting the parameters  [(documentation)](https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel). Try to get as clear topics as possible! Remember to call `show_topics` on your model to print out the words for your topics.

In [7]:
#Possible Solution
possible_model = models.LdaModel(corpus, 
                                 id2word=dictionary,
                                 num_topics=7,
                                 chunksize=75, 
                                 update_every=3,
                                 passes=15)
show_topics(possible_model)

Topic 0
0.012*"blue" + 0.007*"second" + 0.005*"orbit" + 0.005*"period" + 0.005*"bassen" + 0.004*"proton" + 0.004*"end" + 0.004*"launch" + 0.004*"play" + 0.004*"greek"

Topic 1
0.008*"would" + 0.008*"one" + 0.008*"peopl" + 0.006*"think" + 0.006*"like" + 0.006*"god" + 0.006*"time" + 0.005*"know" + 0.005*"make" + 0.004*"say"

Topic 2
0.008*"list" + 0.007*"send" + 0.007*"file" + 0.006*"x" + 0.005*"use" + 0.005*"comput" + 0.005*"mail" + 0.005*"request" + 0.005*"graphic" + 0.004*"score"

Topic 3
0.024*"1" + 0.022*"2" + 0.011*"3" + 0.008*"0" + 0.008*"4" + 0.007*"p" + 0.005*"copi" + 0.005*"n" + 0.005*"p-" + 0.005*"6"

Topic 4
0.006*"hiv" + 0.005*"april" + 0.004*"aid" + 0.004*"diseas" + 0.004*"health" + 0.003*"inform" + 0.003*"jewish" + 0.003*"6" + 0.003*"find" + 0.003*"new"

Topic 5
0.008*"would" + 0.008*"get" + 0.007*"use" + 0.007*"one" + 0.007*"like" + 0.005*"know" + 0.005*"go" + 0.005*"car" + 0.005*"could" + 0.005*"problem"

Topic 6
0.013*"pleas" + 0.007*"use" + 0.006*"interest" + 0.006*"pa

**Question 1.6b:** What are some topics that you can infer from your optimized model?

*Write your answer here*

**Question 1.6c:** Did you notice any patterns while changing values of certain parameters (how did num_topics change the quality of results)? What worked in giving you reasonable, clear topics and what didn't? 

*Write your answer here*

----

## Section 2: Using `scikit-learn`<a id='section 2'></a>

Along with `gensim`, we can also use `scikit-learn` to implement an LDA model. The `scikit-learn` workflow is less transparent, since the library does a lot of the work for us. But having gone through the `gensim` version, we now have an idea of how LDA works, so the `scikit-learn` version will be easier to follow.

You may be wondering, *why did we do all of that work in the first section when we can just use `scikit-learn` to implement a LDA model?* 

The motivation to use `gensim` is that you have much more control over how you can implement your model. This is a really big benefit if you want to find topics without having too much overlap, repetition, or inconsistency. For example, with `gensim`, you can tokenize and stem your documents in any way you want. But with `scikit-learn`, you don't have as much control over how you manipulate your data. Another example is the ability to filter your `gensim` dictionary instance, which you will explore later in the last section. If you prefer a more explicit method of implementing a topic model and want command over your data, then `gensim` is a great option.

Anyway, let's get started with using `scikit-learn`!

----

**Question 2.1:** In order to implement a LDA model using `scikit-learn`, we must extract features to a matrix using either the count vectorizer or the tf-idf vectorizer. Which one do we use and why?

*Write your answer here*

----

If you answered a count vectorizer to the previous question, you're right! LDA models each document as words drawn from topics, so it expects raw term counts rather than re-weighted scores such as tf-idf.
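
The difference between the two representations is easy to see on a tiny made-up corpus. The sketch below computes raw counts and a textbook tf-idf weight (count times log(N / document frequency)) by hand. This is an illustrative variant only; `scikit-learn`'s `TfidfVectorizer` uses a different, smoothed formula, and the helper names here are hypothetical.

```python
import math
from collections import Counter

# Three made-up, already-tokenized documents.
tiny_docs = [
    ["space", "orbit", "launch", "space"],
    ["space", "church", "god"],
    ["space", "god", "god"],
]

def term_counts(doc):
    """Raw term counts: whole numbers, as LDA expects."""
    return dict(Counter(doc))

def tfidf(doc, docs):
    """Textbook tf-idf: count * log(N / number of documents containing the term)."""
    n_docs = len(docs)
    weights = {}
    for term, count in Counter(doc).items():
        df = sum(1 for d in docs if term in d)
        weights[term] = count * math.log(n_docs / df)
    return weights

print(term_counts(tiny_docs[0]))  # {'space': 2, 'orbit': 1, 'launch': 1}
print(tfidf(tiny_docs[0], tiny_docs))
```

The counts are whole numbers of occurrences, while the tf-idf weights are fractional and can shrink a word that appears in every document (here "space") all the way to zero, so they no longer describe how many words a document contains.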

<a id='sklearn'></a>

**Question 2.2:** Instantiate a count vectorizer with the parameters `max_df=.95`, `min_df=2`, and `stop_words='english'`.

In [8]:
#SOLUTION
countvec = CountVectorizer(max_df=.95, 
                           min_df=2, 
                           stop_words='english')

With the vectorizer, we can transform our data into a document-term matrix, as well as use `.get_feature_names()` (renamed `.get_feature_names_out()` in scikit-learn 1.0 and later) to get the word to integer ID mapping like we did in [question 1.3](#Q1.3).

**Question 2.3:** Use your vectorizer to transform the same dataset from the first section of this lab to a document-term matrix (concept is similar to Q1.4) and get the feature names.

In [9]:
#SOLUTION
cv = countvec.fit_transform(documents)
# On scikit-learn 1.0 or later, use countvec.get_feature_names_out() instead.
cv_feature_names = countvec.get_feature_names()

We're almost done! The last step is to implement the model, so we can get our topics. Similar to `gensim`, there are parameters that should be adjusted to fit your documents.

| Parameters <br> [(documentation)](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html)|Default | What it does  |
| :-------------------------:|:-----------------------------:|:-------------:|
|          n_components      | 10 | Equivalent to `num_topics` in the `gensim` model. This <br> specifies the number of latent topics in your documents. |
| max_iter | 10 | Equivalent to `passes` in the `gensim` model. |
| batch_size | 128 | Equivalent to `chunksize` |

**Question 2.4:** Implement the LDA model using `LatentDirichletAllocation`. Don't forget to fit your document-term matrix from the previous question!

**Note:** Set the `learning_method` parameter to `'online'` explicitly. In scikit-learn releases before 0.20, `'online'` is the default but leaving it unspecified throws a deprecation warning; from 0.20 onward the default is `'batch'`.

In [10]:
#SOLUTION
lda = LatentDirichletAllocation(n_components=5, 
                                max_iter=25, 
                                batch_size = 99,
                                learning_method='online').fit(cv)

A function `topic_words` is defined for you (from `helper.py`) and it takes in three arguments:

     1) model: your LDA model
     2) feature_names: the feature names from the vectorizer
     3) num_top_words: number of words you want displayed

It prints out the topic number and the words that fall under that topic, although it does not display the weight of each word the way `gensim` does.
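
Since `helper.py` is not shown here, the following is only a guess at what a helper like this might do: rank each topic's word weights and keep the top few. The function name `top_words_per_topic` and the toy weights are hypothetical; the weights stand in for the topic-word matrix that a fitted `scikit-learn` model exposes as `components_`.

```python
def top_words_per_topic(components, feature_names, num_top_words):
    """Return the highest-weighted words for each topic, best first."""
    result = []
    for row in components:
        # Sort word indices by this topic's weight, largest first.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        result.append([feature_names[i] for i in ranked[:num_top_words]])
    return result

# Made-up topic-word weights: one row per topic, one column per word.
toy_components = [
    [0.1, 5.0, 2.0, 0.3],
    [3.0, 0.2, 0.1, 4.0],
]
toy_names = ["car", "god", "church", "game"]
for topic_id, words in enumerate(top_words_per_topic(toy_components, toy_names, 2)):
    print(f"Topic {topic_id}:")
    print(" ".join(words))
```

With a fitted model, the same function could be called as `top_words_per_topic(lda.components_, cv_feature_names, 15)`.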

**Question 2.5:** Specify the number of words you want displayed and call `topic_words` on the LDA model from the previous question.

<sub>**Note:** If your topics are repetitive or aren't very coherent, try tweaking the parameters in the previous question.</sub>

In [11]:
#Possible solution
num_top_words = 15
topic_words(lda, cv_feature_names, num_top_words)

Topic 0:
god gm magi church jesus christ belief bible believe mary satan heaven greek christian did
Topic 1:
car blues game 10 said year just government encryption didn period season record going play
Topic 2:
win titan int mission launch probe jupiter vram centaur earth cb orbiter memory orbit space
Topic 3:
edu graphics mail use send data software pub using com windows computer space thanks file
Topic 4:
people like just don know think time good new say way ve said want years


----
**Question 2.6:** Did your model yield clear, interpretable results? How does it compare to the LDA model you created in [section one](#Q1.6a)?

*Write your answer here*

----
## Section 3: Finding topics from UN General Debates<a id='section 3'></a>

We now have two ways of implementing an LDA model, so let's try both on the UN General Debates dataset. Topic modelling can give us an idea of what was discussed at a specific session!

**Question 3.1**: Load `un-general-debates-2015` from the data folder and extract the data from the 'text' column. This file contains the data from the 70th session. 

In [None]:
un = pd.read_csv('data/un-general-debates-2015.csv')
...

In [12]:
#SOLUTION
un = pd.read_csv('data/un-general-debates-2015.csv')
un_text = un['text'].values

**Question 3.2:** First implement a LDA model using `gensim`. Follow similar steps from the [first section](#gensim), and adjust your parameters accordingly! Use the `show_topics` function to display your topics.

**Tip:** Use the `filter_extremes(no_below=<int>)` method [(documentation)](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.filter_extremes) on your `gensim` dictionary, which helps filter through tokens based on frequency (in this case it'll keep any tokens contained in at least the specified integer number of documents). Feel free to use other parameters in `filter_extremes` to optimize your topics! 
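
The effect of frequency filtering can be sketched without `gensim`. The function below mimics the idea behind `no_below` (a minimum number of documents) and `no_above` (a maximum fraction of documents) on a made-up corpus; `filter_by_doc_frequency` is a hypothetical helper for illustration and leaves out other options that `filter_extremes` supports.

```python
def filter_by_doc_frequency(tokenized_docs, no_below, no_above):
    """Keep tokens found in at least `no_below` documents and
    in no more than a `no_above` fraction of all documents."""
    n_docs = len(tokenized_docs)
    doc_freq = {}
    for doc in tokenized_docs:
        for token in set(doc):  # count each token once per document
            doc_freq[token] = doc_freq.get(token, 0) + 1
    return {
        token for token, df in doc_freq.items()
        if df >= no_below and df <= no_above * n_docs
    }

# Four made-up, already-stemmed documents.
toy_speeches = [
    ["nation", "peac", "island"],
    ["nation", "peac", "trade"],
    ["nation", "island", "ocean"],
    ["nation", "trade", "peac"],
]
kept = filter_by_doc_frequency(toy_speeches, no_below=2, no_above=0.75)
print(sorted(kept))  # ['island', 'peac', 'trade']
```

Here "nation" is dropped because it appears in every document and so cannot help tell topics apart, and "ocean" is dropped because it appears in only one.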

As mentioned at the beginning of [section 2](#section 2), you really do have a lot of control over your model and I encourage you to utilize these tools to refine your topic model.

In [13]:
#Possible solution using gensim
more_stops = ['--', '``', "''", "s'", "\'s", "n\'t", "..."]
un_corp = []
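for speech in un_text:
    # Split into sentences, then words, and lowercase everything
    un_tokens = [word.lower() for sent in nltk.sent_tokenize(speech) for word in nltk.word_tokenize(sent)]
    # Drop punctuation and the extra stop tokens
    un_filtered_tokens = [x for x in un_tokens if x not in punctuation and x not in more_stops]
    # Drop English stop words
    un_stopped_tokens = [x for x in un_filtered_tokens if x not in stop]
    # Reduce each remaining word to its stem
    un_stemmed_tokens = [stemmer.stem(token) for token in un_stopped_tokens]
    un_corp.append(un_stemmed_tokens)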

un_dictionary = corpora.Dictionary(un_corp)

un_dictionary.filter_extremes(no_below=5, no_above=.4)

un_d_t = [un_dictionary.doc2bow(i) for i in un_corp]

un_lda = models.LdaModel(un_d_t, num_topics=5,
                            id2word=un_dictionary, 
                            chunksize=20, 
                            update_every=1,
                            passes=17)

show_topics(un_lda)

Topic 0
0.022*"african" + 0.008*"sign" + 0.007*"congo" + 0.006*"appeal" + 0.006*"despit" + 0.006*"contin" + 0.006*"kingdom" + 0.005*"west" + 0.005*"central" + 0.005*"mali"

Topic 1
0.036*"palestinian" + 0.032*"israel" + 0.026*"palestin" + 0.022*"isra" + 0.021*"islam" + 0.018*"yemen" + 0.015*"syrian" + 0.013*"muslim" + 0.013*"iran" + 0.012*"settlement"

Topic 2
0.004*"trade" + 0.003*"invest" + 0.003*"reduc" + 0.003*"weapon" + 0.003*"energi" + 0.003*"target" + 0.003*"benefit" + 0.003*"forum" + 0.003*"risk" + 0.003*"enhanc"

Topic 3
0.006*"european" + 0.005*"say" + 0.004*"iraq" + 0.004*"never" + 0.004*"home" + 0.004*"islam" + 0.004*"militari" + 0.004*"ask" + 0.004*"know" + 0.003*"think"

Topic 4
0.037*"island" + 0.023*"ocean" + 0.017*"caribbean" + 0.014*"pacif" + 0.013*"fiji" + 0.011*"sid" + 0.011*"sea" + 0.010*"saint" + 0.010*"cuba" + 0.010*"venezuela"



**Question 3.3:** Now, implement a model using `scikit-learn`. Again, follow similar steps from the [second section](#sklearn) and adjust parameters accordingly. You can display your topics using `topic_words`.

In [14]:
#Possible solution using sklearn
countvec = CountVectorizer(max_df=.90, 
                           min_df=2, 
                           stop_words='english')

un_cv = countvec.fit_transform(un_text)
# On scikit-learn 1.0 or later, use countvec.get_feature_names_out() instead.
un_cv_feature_names = countvec.get_feature_names()

un_lda = LatentDirichletAllocation(n_components=5, 
                                   max_iter=30,
                                   learning_decay=0.7,
                                   batch_size=30,
                                   learning_method='online').fit(un_cv)

topic_words(un_lda, un_cv_feature_names, 10)

Topic 0:
islands solomon papua guinea pacific melanesian indonesia caledonia taiwan shelf
Topic 1:
belize grenadines vincent redress dominican passengers defer awesome chain fauna
Topic 2:
venezuela latin argentina guyana trinidad tobago american bolivia uruguay creditors
Topic 3:
madagascar rights global community climate terrorism council republic agenda challenges
Topic 4:
global rights sustainable climate change support agenda state council efforts


----
**Question 3.4:** Which algorithm yielded more well-defined topics? You can also skim the resolutions passed [here](http://research.un.org/en/docs/ga/quick/regular/70) for reference. What do you think are some factors that need to be considered about the data when choosing an algorithm and adjusting its parameters?

*Write your answer here*

**Question 3.5:** What are some differences that you noticed between the `gensim` and `scikit-learn` algorithms? What are some of their drawbacks? Do you prefer one over the other and if so, why?

*Write your answer here*

----
Awesome! Now you know how to implement a topic model two ways using `gensim` and `scikit-learn`. Even though `scikit-learn` is more straightforward and requires less work to implement, the control you have over `gensim` is very valuable and can result in more distinct topics.

Ultimately, the choice is yours and I hope having both options helps you generate great topic models!

----

## Bibliography

 - Chen, Edwin. Introduction to Latent Dirichlet Allocation. http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/
 - Use of `20newsgroups` data set adapted from Topic Modelling tutorial by Aneesha Bakharia: https://medium.com/mlreview/topic-modeling-with-scikit-learn-e80d33668730
 - Resolutions from UN 70th session: http://research.un.org/en/docs/ga/quick/regular/70
 - Text cleaning code adapted from notebook by Alex Estes https://github.com/dlab-berkeley/python-text-analysis/blob/master/Intro_to_TextAnalysis/Intro_to_TextAnalysis.ipynb

----
Notebook developed by: Jason Jiang

Data Science Modules: http://data.berkeley.edu/education/modules