# [LEGALST-190] Lab 4/10: Topic Models

This lab will cover latent Dirichlet allocation (LDA) and topic models using `gensim` and `scikit-learn`.

*Estimated Time: 35 Minutes*

### Table of Contents
[The Data](#section data)<br>
1 - [Using Gensim to Implement an LDA Model](#section 1)<br>
2 - [Using scikit-learn](#section 2)<br>
3 - [Finding topics from UN Debates](#section 3)<br>

**Dependencies:**

In [1]:
import string

import nltk
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords

!pip install gensim
from gensim import corpora, models

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

import numpy as np
import pandas as pd

from helper import *

----
## The Data<a id='section data'></a>

For this lab, we'll use scikit-learn's `20 newsgroups` dataset, which is a list of approximately 18,000 newsgroup posts. Because of its size, we'll only be working with about 750 posts. At the end of this lab, we'll also work with a selected portion of the UN Data.

----

## Section 1: Using Gensim to Implement an LDA Model<a id='section 1'></a>

### What Is Latent Dirichlet Allocation?
Latent Dirichlet allocation (LDA) is a way of discovering topics in a set of documents based on word frequencies. It is a probabilistic bag-of-words model: it assumes that each document is produced from a mixture of topics, and that each topic produces words with certain probabilities. It then works backwards, inferring a set of topics that could plausibly have generated the documents.
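
To make the "documents are produced from topics" assumption more concrete, here is a minimal sketch of that generative story in plain Python. The two toy topics, their word probabilities, and the names `sample_dirichlet` and `generate_document` are all made up for illustration; this is not how `gensim` or `scikit-learn` fit a model, only the process LDA assumes and then tries to reverse.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    """Generate words the way LDA assumes a document is written."""
    # 1. Pick this document's mixture over topics.
    mixture = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # 2. Pick a topic for this word slot according to the mixture.
        topic = rng.choices(range(len(topics)), weights=mixture)[0]
        # 3. Pick a word according to that topic's word probabilities.
        vocab = list(topics[topic])
        probs = [topics[topic][w] for w in vocab]
        words.append(rng.choices(vocab, weights=probs)[0])
    return mixture, words

# Two made-up topics, each a probability distribution over words.
toy_topics = [
    {"space": 0.5, "orbit": 0.3, "mission": 0.2},
    {"file": 0.4, "window": 0.4, "program": 0.2},
]
rng = random.Random(0)
mixture, doc = generate_document(toy_topics, alpha=[0.5, 0.5], length=8, rng=rng)
print(mixture)  # the document's topic proportions (sum to 1)
print(doc)      # eight words drawn from the two topics
```

Fitting an LDA model runs this story in reverse: given only the words, it estimates the topics and each document's mixture.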

----

### Using `Gensim`

We'll use the LDA algorithm from `Gensim`, a Python library for topic modelling.

Let's get working with the data. The `20 newsgroups` data is under the name `20newsgroups_data.csv` in the data folder.

**Question 1.1:** Load your data into a DataFrame and retrieve the posts. Assign the list to a variable named `documents`.

In [17]:
#SOLUTION
data = pd.read_csv('data/20newsgroups_data.csv')
documents = data['posts'].values
documents[5]

" \n \nI read somewhere, I think in Morton Smith's _Jesus the Magician_, that\nold Lazarus wasn't dead, but going in the tomb was part of an initiation\nrite for a magi-cult, of which Jesus was also a part.   It appears that\na 3-day stay was normal.   I wonder .... ?"

Awesome! We now have data we can work with. Before we start anything, we must clean the text!

Just to review, we want to process our text by:<br>
1) Tokenizing our document<br>
2) Removing stop words (remove meaningless words)<br>
3) Stemming or merging words that have equivalent meanings<br>

<a id='gensim'></a>**Question 1.2:** Tokenize and stem the text in `documents`. The first line of code is provided, as well as the stop words (use both `stop` and `more_stops`) and the punctuation that should be filtered out.

In [9]:
#We used four (incl. given one) list comprehensions in total, and appended the final list of cleaned text to a list.
stop = stopwords.words('english')
punctuation = string.punctuation
stemmer = SnowballStemmer("english")

more_stops = ['--', '``', "''", "s'", "\'s", "n\'t", "...", "\'m", "-*-", "-|"]
tokenized = []
for i in documents:
    tokens = [word.lower() for sent in nltk.sent_tokenize(i) for word in nltk.word_tokenize(sent)]
    ...

In [3]:
#SOLUTION
stop = stopwords.words('english')
punctuation = string.punctuation
stemmer = SnowballStemmer("english")

more_stops = ['--', '``', "''", "s'", "\'s", "n\'t", "...", "\'m", "-*-", "-|"] 
tokenized = []
for i in documents:
    # Split into sentences, then words, and lowercase every token
    tokens = [word.lower() for sent in nltk.sent_tokenize(i) for word in nltk.word_tokenize(sent)]
    # Drop punctuation and the extra stop tokens
    filtered_tokens = [x for x in tokens if x not in punctuation and x not in more_stops]
    # Drop English stop words
    stopped_tokens = [x for x in filtered_tokens if x not in stop]
    # Reduce each remaining word to its stem
    stemmed_tokens = [stemmer.stem(x) for x in stopped_tokens]
    tokenized.append(stemmed_tokens)

Now that we have our tokenized documents, we have to convert them to a *document-term matrix*. Our first step is to instantiate a `gensim` dictionary object, which maps each word to an integer ID, as in a bag-of-words model. <a id='Q1.3'></a>

**Question 1.3:** Implement a gensim dictionary from the `corpora` package and assign it to a variable named `dictionary`. You can look [at the documentation](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary) for the corpora package if necessary.

In [4]:
#SOLUTION
dictionary = corpora.Dictionary(tokenized)

This is the last step before we implement the model! We must convert our documents to bag-of-words format using our dictionary. Every document is represented as a list of tuples of the word's integer ID and its frequency. This list of 400 documents represents our document-term matrix.
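
To show the shape of the bag-of-words format described above, here is a plain-Python sketch that builds a word-to-ID mapping and converts toy documents into `(ID, count)` tuples. The helper names `build_vocabulary` and `to_bag_of_words` and the toy documents are made up for illustration; `gensim` may assign different IDs, so treat this as a picture of the format rather than a replacement for the `gensim` calls.

```python
from collections import Counter

def build_vocabulary(tokenized_docs):
    """Assign each distinct token an integer ID in order of first appearance."""
    token_to_id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token_to_id:
                token_to_id[token] = len(token_to_id)
    return token_to_id

def to_bag_of_words(doc, token_to_id):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token_to_id[token] for token in doc if token in token_to_id)
    return sorted(counts.items())

toy_docs = [["space", "orbit", "space"], ["file", "space", "window"]]
toy_vocab = build_vocabulary(toy_docs)
toy_corpus = [to_bag_of_words(doc, toy_vocab) for doc in toy_docs]
print(toy_vocab)   # {'space': 0, 'orbit': 1, 'file': 2, 'window': 3}
print(toy_corpus)  # [[(0, 2), (1, 1)], [(0, 1), (2, 1), (3, 1)]]
```

Word order is discarded: each document keeps only which words occur and how often.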

**Question 1.4:** Using `dictionary` from the previous question, convert your tokenized documents into a bag-of-words format and store the result in a variable named `corpus`. We want to use the `doc2bow()` method ***for every document*** in our tokenized text. The documentation is linked [here](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.doc2bow).

You should end up with a list of tuples for each document; make sure that `len(corpus)` is 400. Calling `corpus[i]` for some integer `i` should return something like `[(16, 2), (58, 1), (59, 1), ...]`.

In [5]:
#SOLUTION
corpus = [dictionary.doc2bow(text) for text in tokenized]

Now that we have a document-term matrix, we’re ready to generate an LDA model!

The cell below is an example of an implementation of a `gensim` LDA model. Run the cell and take a look at what it displays.

In [6]:
ldamodel = models.LdaModel(corpus, 
                           id2word=dictionary,
                           num_topics=6,
                           chunksize=270, 
                           update_every=20,
                           passes=3)
#We have defined a helper function show_topics that takes in the model as an argument.
show_topics(ldamodel)

Topic 0
0.005*"think" + 0.004*"peopl" + 0.004*"like" + 0.004*"would" + 0.003*"one" + 0.003*"use" + 0.003*"get" + 0.002*"even" + 0.002*"know" + 0.002*"new"

Topic 1
0.007*"one" + 0.005*"peopl" + 0.005*"use" + 0.004*"time" + 0.004*"would" + 0.004*"gm" + 0.003*"pleas" + 0.003*"know" + 0.003*"get" + 0.003*"work"

Topic 2
0.004*"use" + 0.004*"would" + 0.004*"like" + 0.003*"one" + 0.003*"space" + 0.003*"time" + 0.003*"billion" + 0.003*"system" + 0.002*"god" + 0.002*"key"

Topic 3
0.006*"one" + 0.005*"use" + 0.004*"would" + 0.004*"year" + 0.004*"make" + 0.004*"get" + 0.003*"know" + 0.003*"good" + 0.003*"like" + 0.003*"peopl"

Topic 4
0.009*"would" + 0.006*"one" + 0.005*"use" + 0.005*"get" + 0.005*"know" + 0.004*"like" + 0.004*"peopl" + 0.004*"go" + 0.003*"say" + 0.003*"also"

Topic 5
0.010*"1" + 0.008*"2" + 0.006*"use" + 0.005*"3" + 0.004*"would" + 0.003*"4" + 0.003*"like" + 0.003*"file" + 0.003*"one" + 0.003*"also"



----
There are quite a few parameters for generating the `LdaModel` that affect the topics returned. We'll go over some of the parameters that will be helpful in implementing the LDA model.

| Required Parameters        |Value                          | Default | What it does  |
| :-------------------------:|:-----------------------------:| --|:-------------:|
|                corpus      | corpus (doc-term matrix) | None | The bag-of-words documents that the model is trained on. |
| id2word     | `gensim` dictionary | None | Maps integer IDs from <br>   the doc-term matrix to words. |
| num_topics<br> | integer | 100 |Specifies the number of underlying topics <br> in your documents. Usually, the fewer <br> documents you have, the smaller number <br> you assign [(this is highly debated!)](https://www.quora.com/Latent-Dirichlet-Allocation-LDA-What-is-the-best-way-to-determine-k-number-of-topics-in-topic-modeling).<br>|


| Optional (but helpful) Parameters   |Default | What it does  |
|:------------------------------------|------- |:-------------------------------:|
| passes | 1 | How many times you want to iterate through the corpus.<br> More passes usually give a more accurate model, <br> although training takes longer if you have a large dataset. |
|chunksize | 2000 | The number of documents processed in each batch.<br> e.g. with chunksize = 10, we process 10 documents at a time.| 
|update_every |1 | Update the model after every `n` number of chunks. |
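
As a rough illustration of how `chunksize` and `update_every` interact, the sketch below counts how many times the model would be updated in one pass, following the table's description that an update happens after every `update_every` chunks. The helper `updates_per_pass` is hypothetical, and it assumes a single worker, `update_every` of at least 1, and that any chunks left over at the end of a pass still trigger one final update. It is an estimate, not `gensim`'s exact bookkeeping.

```python
import math

def updates_per_pass(num_docs, chunksize, update_every):
    """Estimate how many model updates happen in one pass over the corpus.

    Assumes one worker, update_every >= 1, and one final update for
    any chunks left over at the end of the pass.
    """
    num_chunks = math.ceil(num_docs / chunksize)
    return math.ceil(num_chunks / update_every)

# The example model above: 400 documents, chunksize=270, update_every=20.
print(updates_per_pass(400, 270, 20))  # 1
# Smaller chunks with more frequent updates give more updates per pass.
print(updates_per_pass(400, 75, 2))    # 3
```

Multiplying by `passes` gives a rough total number of updates, which is one way to compare parameter settings before training.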



**Question 1.5:** What is problematic about the previous model and its results? Which parameters do you think you should change first to get more explicit results?

*Write your answer here*

<a id='Q1.6a'></a> **Question 1.6a:** Improve the previous LDA model by adjusting the parameters  [(documentation)](https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel). Try to get as clear topics as possible!

In [7]:
#Possible Solution
possible_model = models.LdaModel(corpus, 
                                 id2word=dictionary,
                                 num_topics=4,
                                 chunksize=75, 
                                 update_every=2,
                                 passes=15)
show_topics(possible_model)

Topic 0
0.008*"use" + 0.005*"file" + 0.005*"system" + 0.005*"window" + 0.005*"comput" + 0.005*"send" + 0.004*"softwar" + 0.004*"list" + 0.004*"program" + 0.004*"imag"

Topic 1
0.009*"would" + 0.008*"one" + 0.006*"peopl" + 0.006*"like" + 0.005*"know" + 0.005*"use" + 0.005*"think" + 0.005*"get" + 0.005*"go" + 0.004*"time"

Topic 2
0.023*"1" + 0.021*"2" + 0.011*"3" + 0.008*"4" + 0.005*"copi" + 0.004*"use" + 0.004*"6" + 0.004*"kk" + 0.004*"5" + 0.004*"p"

Topic 3
0.004*"hiv" + 0.004*"orbit" + 0.004*"magi" + 0.003*"aid" + 0.003*"april" + 0.003*"diseas" + 0.003*"health" + 0.003*"mission" + 0.003*"research" + 0.003*"studi"



**Question 1.6b:** Did you notice any patterns while changing values of certain parameters (how did num_topics change the quality of results)? What worked in giving you reasonable, clear topics and what didn't? 

*Write your answer here*

----
## Section 2: Using scikit-learn<a id='section 2'></a>

Along with `gensim`, we can also use `scikit-learn` to implement an LDA model. The `scikit-learn` algorithm is less transparent, since more of the work is done behind the scenes. But having gone through the `gensim` algorithm, we now have an idea of how LDA works, so using the `scikit-learn` algorithm should be a little easier to follow.

**Question 2.1:** In order to implement an LDA model using `scikit-learn`, we must extract features to a matrix using either the count vectorizer or the tf-idf vectorizer. Which one do we use and why?

*Write your answer here*

----

If you answered a count vectorizer to the previous question, you're right! Since LDA is a probabilistic model of how often each word occurs, it expects raw term counts rather than tf-idf weights.

<a id='sklearn'></a>

**Question 2.2:** Instantiate a count vectorizer with the parameters `max_df=.95`, `min_df=2`, and `stop_words='english'`. For more explanation on these parameters, [look at the documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

In [8]:
#SOLUTION
countvec = CountVectorizer(max_df=.95, 
                           min_df=2, 
                           stop_words='english')
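
To make the `max_df` and `min_df` settings more concrete, here is a plain-Python sketch of document-frequency filtering. It is not scikit-learn's implementation; the function name `filter_by_document_frequency` and the toy documents are made up for illustration. It assumes, as in the cell above, that `min_df` is given as an absolute number of documents and `max_df` as a proportion of all documents.

```python
def filter_by_document_frequency(tokenized_docs, min_df=2, max_df=0.95):
    """Keep terms found in at least `min_df` documents (an absolute count)
    and in at most a `max_df` fraction of all documents."""
    num_docs = len(tokenized_docs)
    doc_freq = {}
    for doc in tokenized_docs:
        for term in set(doc):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return sorted(
        term for term, df in doc_freq.items()
        if df >= min_df and df <= max_df * num_docs
    )

toy_docs = [
    ["the", "space", "mission"],
    ["the", "space", "orbit"],
    ["the", "file", "window"],
]
print(filter_by_document_frequency(toy_docs))  # ['space']
```

Here "the" is dropped for appearing in every document, and the words that appear only once are dropped for being too rare.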

With the vectorizer, we can transform our data into a document-term matrix, as well as use `.get_feature_names_out()` (called `.get_feature_names()` in scikit-learn versions before 1.0) to get the word to integer ID mapping like we did in [question 1.3](#Q1.3).

**Question 2.3:** Use your vectorizer to transform the same dataset from the first section of this lab to a document-term matrix (concept is similar to Q1.4) and get the feature names. The specified methods to call are found [here](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer.fit_transform).

In [9]:
#SOLUTION
cv = countvec.fit_transform(documents)
# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0, call it instead
cv_feature_names = countvec.get_feature_names_out()

We're almost done! The last step is to implement the model, so we can get our topics. Similar to `gensim`, there are parameters that should be adjusted to fit your documents.

| Parameters <br> [(documentation)](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html)|Default | What it does  |
| :-------------------------:|:-----------------------------:|:-------------:|
|          n_components      | 10 | Equivalent to `num_topics` in the `gensim` model. This <br> specifies the number of latent topics in your documents. |
| max_iter | 10 | Equivalent to `passes` in the `gensim` model. |
| batch_size | 128 | Equivalent to `chunksize` |

**Question 2.4:** Implement the LDA model using `LatentDirichletAllocation`. Don't forget to fit your document-term matrix from the previous question!

**Note:** Set the `learning_method` parameter to `'online'` explicitly. It was the default in older versions of scikit-learn, where leaving it unset raised a deprecation warning; since version 0.20 the default is `'batch'`.

In [10]:
#SOLUTION
lda = LatentDirichletAllocation(n_components=5, 
                                max_iter=25, 
                                batch_size = 99,
                                learning_method='online').fit(cv)

A function `topic_words` is defined for you (from `helper.py`) and it takes in three arguments:

     1) model: your LDA model
     2) feature_names: the feature names from the vectorizer
     3) num_top_words: number of words you want displayed

It prints out the topic number and the words that fall under that topic, although it does not display weight of words like `gensim`.
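
The source of `helper.py` is not shown here, so the following is only a guess at the kind of logic such a helper might use, not the actual `topic_words` function. A fitted scikit-learn LDA model stores one row of word weights per topic in its `components_` attribute; in this sketch, nested lists with made-up weights stand in for that array, and `top_words_per_topic` is a hypothetical name.

```python
def top_words_per_topic(topic_word_weights, feature_names, num_top_words):
    """Return the `num_top_words` highest-weighted words for each topic."""
    result = []
    for weights in topic_word_weights:
        # Column indices sorted from the largest weight to the smallest.
        ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        result.append([feature_names[i] for i in ranked[:num_top_words]])
    return result

# Made-up weights for 2 topics over a 4-word vocabulary.
toy_weights = [[0.1, 2.0, 0.5, 1.2], [3.0, 0.2, 0.9, 0.1]]
toy_names = ["file", "god", "people", "think"]
for topic_id, words in enumerate(top_words_per_topic(toy_weights, toy_names, 2)):
    print(f"Topic {topic_id}:")
    print(" ".join(words))
```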

**Question 2.5:** Specify the number of words you want displayed and call `topic_words` on the LDA model from the previous question.

<sub>**Note:** If your topics are repetitive or aren't very coherent, try tweaking the parameters in the previous question.</sub>

In [11]:
#Possible solution
num_top_words = 15
topic_words(lda, cv_feature_names, num_top_words)

Topic 0:
people god just think don like time know good said does did way say didn
Topic 1:
edu graphics mail send pub data software space windows 10 file com ray ftp available
Topic 2:
israel aids people health said israeli new care like children jews armenians lebanese state say
Topic 3:
int joystick rr british company jaguar owns owned joy read count think non define bytes
Topic 4:
just like know don use good new think car years government problem time ve need
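
One rough way to check whether two topics are repetitive is to measure how many of their top words overlap. The sketch below computes the Jaccard similarity (shared words divided by all distinct words) for Topic 0 and Topic 4 from the output above. The function `jaccard` is an illustrative helper, not part of `helper.py`, and the score is only a heuristic for spotting near-duplicate topics.

```python
def jaccard(words_a, words_b):
    """Share of distinct words that two topics have in common (0 to 1)."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b)

# Top words copied from Topic 0 and Topic 4 in the output above.
topic_0 = "people god just think don like time know good said does did way say didn".split()
topic_4 = "just like know don use good new think car years government problem time ve need".split()

print(round(jaccard(topic_0, topic_4), 2))  # 0.3
```

A high overlap between two topics suggests they are capturing the same common words, which is a hint to adjust the number of topics or the vectorizer's filtering.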


----
**Question 2.6:** Did your model yield clear, interpretable results? How does it compare to the LDA model you created in [section one](#Q1.6a)?

*Write your answer here*

----
## Section 3: Finding topics from UN General Debates<a id='section 3'></a>

Now that we have two ways of implementing an LDA model, let's try both on the UN General Debates dataset. Topic modelling can give us an idea of what was discussed at a given session!

**Question 3.1**: Load `un-general-debates-2015` from the data folder and extract the data from the 'text' column. This file contains the data from the 70th session. 

In [None]:
un = pd.read_csv('data/un-general-debates-2015.csv')
...

In [12]:
#SOLUTION
un = pd.read_csv('data/un-general-debates-2015.csv')
un_text = un['text'].values

**Question 3.2:** First implement an LDA model using `gensim`. Follow similar steps from the [first section](#gensim), and adjust your parameters accordingly! Use the `show_topics` function to display your topics.

**Tip:** Use the `filter_extremes(no_below=<int>)` [(documentation)](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.filter_extremes) method on your `gensim` dictionary, which filters tokens based on document frequency (in this case it keeps only the words that appear in at least the specified number of documents and removes the rarer ones).

In [13]:
#Possible solution using gensim
more_stops = ['--', '``', "''", "s'", "\'s", "n\'t", "..."]
un_corp = []
for i in un_text:
    un_tokens = [word.lower() for sent in nltk.sent_tokenize(i) for word in nltk.word_tokenize(sent)]
    un_filtered_tokens = [x for x in un_tokens if x not in punctuation and x not in more_stops]
    un_stopped_tokens = [x for x in un_filtered_tokens if not x in stop]
    un_stemmed_tokens = [stemmer.stem(i) for i in un_stopped_tokens]
    un_corp.append(un_stemmed_tokens)

un_dictionary = corpora.Dictionary(un_corp)

un_dictionary.filter_extremes(no_below=10)

un_d_t = [un_dictionary.doc2bow(i) for i in un_corp]

un_lda = models.LdaModel(un_d_t, num_topics=5,
                            id2word=un_dictionary, 
                            chunksize=20, 
                            update_every=1,
                            passes=17)

show_topics(un_lda)

Topic 0
0.006*"african" + 0.005*"peacekeep" + 0.005*"democrat" + 0.004*"educ" + 0.004*"growth" + 0.004*"oper" + 0.003*"prioriti" + 0.003*"inclus" + 0.003*"financ" + 0.003*"play"

Topic 1
0.024*"island" + 0.023*"small" + 0.015*"ocean" + 0.010*"pacif" + 0.010*"impact" + 0.009*"sea" + 0.009*"caribbean" + 0.008*"per" + 0.008*"sid" + 0.007*"vulner"

Topic 2
0.012*"europ" + 0.011*"european" + 0.011*"want" + 0.010*"let" + 0.009*"differ" + 0.009*"never" + 0.008*"say" + 0.007*"think" + 0.007*"other" + 0.007*"deal"

Topic 3
0.010*"syria" + 0.009*"terrorist" + 0.008*"iraq" + 0.008*"crime" + 0.006*"territori" + 0.006*"violat" + 0.006*"syrian" + 0.006*"nuclear" + 0.006*"palestinian" + 0.005*"islam"

Topic 4
0.021*"israel" + 0.015*"islam" + 0.014*"america" + 0.013*"isra" + 0.013*"iran" + 0.012*"palestin" + 0.011*"palestinian" + 0.010*"said" + 0.009*"latin" + 0.009*"brother"



**Question 3.3:** Now, implement a model using `scikit-learn`. Again, follow similar steps from the [second section](#sklearn) and adjust parameters accordingly. You can display your topics using `topic_words`.

In [14]:
#Possible solution using sklearn
countvec = CountVectorizer(max_df=.95, 
                           min_df=2, 
                           stop_words='english')

un_cv = countvec.fit_transform(un_text)
# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0, call it instead
un_cv_feature_names = countvec.get_feature_names_out()

un_lda = LatentDirichletAllocation(n_components=5, 
                                   max_iter=30,
                                   learning_decay=0.7,
                                   batch_size=30,
                                   learning_method='online').fit(un_cv)

In [15]:
topic_words(un_lda, un_cv_feature_names, 10)

Topic 0:
romania tonga seabed occasional morality presently sixtieth worrisome 1991 ruling
Topic 1:
human country rights new council states countries community tobago global
Topic 2:
tobago trinidad grenadines vincent love dominican redress posture 400 storm
Topic 3:
countries states human country global rights new sustainable climate years
Topic 4:
islands iran israel fiji pacific solomon islamic said islam god


**Question 3.4:** Which algorithm yielded more well-defined topics? You can also skim the resolutions passed [here](http://research.un.org/en/docs/ga/quick/regular/70) for reference. What do you think are some factors that need to be considered about the data when choosing an algorithm and adjusting its parameters?

*Write your answer here*

**Question 3.5:** What are some differences between the `gensim` and `scikit-learn` algorithms? What are some of their drawbacks? Do you prefer one over the other? 

*Write your answer here*

## Bibliography

 - Chen, Edwin, Introduction to latent dirichlet allocation. http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/
 - Use of `20newsgroups` data set adapted from Topic Modelling tutorial by Aneesha Bakharia: https://medium.com/mlreview/topic-modeling-with-scikit-learn-e80d33668730
 - Resolutions from UN 70th session: http://research.un.org/en/docs/ga/quick/regular/70
 - Text cleaning code adapted from notebook by Alex Estes https://github.com/dlab-berkeley/python-text-analysis/blob/master/Intro_to_TextAnalysis/Intro_to_TextAnalysis.ipynb

----
Notebook developed by: Jason Jiang

Data Science Modules: http://data.berkeley.edu/education/modules