# [LEGALST-190] Lab 4/10: Topic Models

This lab will cover latent Dirichlet allocation (LDA) and topic models using `gensim` and `scikit-learn`.

*Estimated Time: 35 Minutes*

### Table of Contents
[The Data](#section data)<br>
1 - [Using Gensim to Implement a LDA Model](#section 1)<br>
2 - [Using scikit-learn](#section 2)<br>
3 - [Finding topics from UN Debates](#section 3)<br>

**Dependencies:**

In [1]:
import string
import nltk
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords
from gensim import corpora, models, similarities 

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups

import numpy as np
import pandas as pd

from helper import *

----
## The Data<a id='section data'></a>

For this lab, we'll use scikit-learn's `20 newsgroups` dataset, a collection of approximately 18,000 newsgroup posts. At the end of this lab, we'll also work with a selected portion of the UN General Debates data.

----

## Section 1: Using Gensim to Implement a LDA Model<a id='section 1'></a>

### What Is Latent Dirichlet Allocation?
Latent Dirichlet allocation (LDA) is a way of discovering topics in a set of documents based on word frequency. LDA is a probabilistic bag-of-words model: it assumes that each document is produced from a mixture of topics, and that each topic produces words with certain probabilities. It then works backwards, finding a set of topics that could plausibly have generated the documents.
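
To make the "documents are produced from topics" assumption concrete, here is a minimal sketch of that generative story using only the standard library. The two toy topics, their word probabilities, and the function names are made up for illustration; this is not how `gensim` or `scikit-learn` fit a model, only the forward process that LDA tries to reverse.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    """Generate one toy document under the LDA generative story.

    topics: list of {word: probability} dicts (hand-written, hypothetical)
    alpha:  Dirichlet concentration shared by all topics
    """
    # 1) Each document gets its own mixture over topics.
    mixture = sample_dirichlet([alpha] * len(topics), rng)
    words = []
    for _ in range(length):
        # 2) Pick a topic for this word position according to the mixture.
        topic = rng.choices(topics, weights=mixture, k=1)[0]
        # 3) Pick a word according to that topic's word probabilities.
        word = rng.choices(list(topic), weights=list(topic.values()), k=1)[0]
        words.append(word)
    return mixture, words

toy_topics = [
    {"orbit": 0.5, "launch": 0.3, "probe": 0.2},  # a "space" topic
    {"file": 0.4, "image": 0.4, "format": 0.2},   # a "graphics" topic
]
rng = random.Random(0)
mixture, words = generate_document(toy_topics, alpha=0.5, length=8, rng=rng)
print(mixture)
print(words)
```

Fitting LDA goes the other way: given only the words, it estimates the topics and the per-document mixtures.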

----

### Using `Gensim`

We'll use the LDA algorithm from `Gensim`, a Python library for topic modelling.

Let's get working with the data. Run the cell below to load the data from `20newsgroups`.

In [2]:
dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))

After you've loaded the dataset, use the `.data` attribute of the set to get the list of newsgroup posts.

**Question 1.1:** Retrieve the first 400 documents in the set and assign them to a variable named `documents`.

In [23]:
#SOLUTION
documents = dataset.data[:400]
documents[5]

" \n \nI read somewhere, I think in Morton Smith's _Jesus the Magician_, that\nold Lazarus wasn't dead, but going in the tomb was part of an initiation\nrite for a magi-cult, of which Jesus was also a part.   It appears that\na 3-day stay was normal.   I wonder .... ?"

In [25]:
#If nothing returns, you're good to go.
assert len(documents) == 400

Awesome! We now have data we can work with. Before we start anything, we must clean the text!

Just to review, we want to process our text by:<br>
1) Tokenizing our document<br>
2) Removing stop words (remove meaningless words)<br>
3) Stemming or merging words that have equivalent meanings<br>
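
The three steps above can be sketched on a single sentence with nothing but the standard library. The tiny stop list and the crude `toy_stem` suffix stripper are hypothetical stand-ins for NLTK's stop words and `SnowballStemmer`, so the results will not match what the real tools produce on harder input.

```python
import re
import string

# Hypothetical miniature stop list; NLTK's English list is much longer.
TOY_STOP_WORDS = {"the", "a", "is", "are", "of", "and", "to", "in"}

def toy_stem(word):
    """Very crude suffix stripper, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def clean(text):
    # 1) Tokenize: lowercase, then split into word runs and single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # Drop the punctuation tokens.
    tokens = [t for t in tokens if t not in string.punctuation]
    # 2) Remove stop words.
    tokens = [t for t in tokens if t not in TOY_STOP_WORDS]
    # 3) Stem what is left.
    return [toy_stem(t) for t in tokens]

print(clean("The probes are orbiting the planets, and launched in 1977."))
# ['probe', 'orbit', 'planet', 'launch', '1977']
```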

<a id='gensim'></a>**Question 1.2:** Tokenize and stem the text in `documents`. The first line of code is provided, as well as the stop words (use both `stop` and `more_stops`) and the punctuation that should be filtered out.

In [None]:
#We used four (incl. given one) list comprehensions in total, and appended the final list of cleaned text to a list.
stop = stopwords.words('english')
punctuation = string.punctuation
stemmer = SnowballStemmer("english")

more_stops = ['--', '``', "''", "s'", "\'s", "n\'t", "...", "\'m", "-*-"]
tokenized = []
for i in documents:
    tokens = [word.lower() for sent in nltk.sent_tokenize(i) for word in nltk.word_tokenize(sent)]
    ...

In [5]:
#SOLUTION
stop = stopwords.words('english')
punctuation = string.punctuation
stemmer = SnowballStemmer("english")

more_stops = ['--', '``', "''", "s'", "\'s", "n\'t", "...", "\'m", "-*-"] 
tokenized = []
for i in documents:
    tokens = [word.lower() for sent in nltk.sent_tokenize(i) for word in nltk.word_tokenize(sent)]
    filtered_tokens = [x for x in tokens if x not in punctuation and x not in more_stops]
    stopped_tokens = [x for x in filtered_tokens if not x in stop]
    stemmed_tokens = [stemmer.stem(i) for i in stopped_tokens]
    tokenized.append(stemmed_tokens)

Now that we have our tokenized documents, we have to convert them to a *document-term matrix*, which starts with instantiating a `gensim` dictionary object. Our first step is to turn our tokenized documents into a "dictionary" that maps each word to an integer ID, as in a bag-of-words model. <a id='Q1.3'></a>

**Question 1.3:** Implement a gensim dictionary from the `corpora` package and assign it to a variable named `dictionary`. You can look [at the documentation](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary) for the corpora package if necessary.

In [6]:
#SOLUTION
dictionary = corpora.Dictionary(tokenized)

This is the last step before we implement the model! We must convert our documents to bag-of-words format using our dictionary. Every document is represented as a list of tuples of the word's integer ID and its frequency. This list of 400 documents represents our document-term matrix.
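
To see what the dictionary and the bag-of-words format hold, here is a plain-Python sketch on two toy documents. `build_vocabulary` and `to_bag_of_words` are illustrative names, not `gensim` functions, and `gensim` may assign different integer IDs to the same words.

```python
from collections import Counter

def build_vocabulary(tokenized_docs):
    """Assign each distinct token an integer ID in order of first appearance."""
    token_to_id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token_to_id:
                token_to_id[token] = len(token_to_id)
    return token_to_id

def to_bag_of_words(doc, token_to_id):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token_to_id[t] for t in doc if t in token_to_id)
    return sorted(counts.items())

toy_docs = [["orbit", "launch", "orbit"], ["file", "imag", "file", "launch"]]
vocab = build_vocabulary(toy_docs)
toy_corpus = [to_bag_of_words(doc, vocab) for doc in toy_docs]
print(vocab)       # {'orbit': 0, 'launch': 1, 'file': 2, 'imag': 3}
print(toy_corpus)  # [[(0, 2), (1, 1)], [(1, 1), (2, 2), (3, 1)]]
```

Each inner list is one row of the document-term matrix, stored sparsely: words with a count of zero are simply left out.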

**Question 1.4:** Using `dictionary` from the previous question, convert your tokenized documents into bag-of-words format and store the result in a variable named `corpus`. We want to use the `doc2bow()` method ***for every document*** in our tokenized text. The documentation is linked [here](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.doc2bow).

You should end up with a list of tuples for each document; make sure that `len(corpus)` is 400. Calling `corpus[i]` for some integer i should return something like `[(16, 2), (58, 1), (59, 1),...`

In [7]:
#SOLUTION
corpus = [dictionary.doc2bow(text) for text in tokenized]

Now that we have a document-term matrix, we’re ready to generate an LDA model! There are quite a few parameters for generating the `LdaModel` that affect the topics returned. We'll go over some of the parameters that will be helpful in implementing the LDA model.

| Required Parameters        |Value                          | Default | What it does  |
| :-------------------------:|:-----------------------------:| --|:-------------:|
|                corpus      | corpus (doc-term matrix) | None | The bag-of-words corpus (document-term matrix) <br> that the model is trained on. |
| num_topics<br> | integer | 100 |Specifies the number of underlying<br>topics in  your documents. Usually, the <br> fewer documents  you have, the  smaller <br> the number you assign [(this is highly debated!)](https://www.quora.com/Latent-Dirichlet-Allocation-LDA-What-is-the-best-way-to-determine-k-number-of-topics-in-topic-modeling).<br>|
| id2word     | `gensim` dictionary | None | Maps integer IDs from <br>   the doc-term matrix to words. |

| Optional (but helpful) Parameters   |Default | What it does  |
|:------------------------------------|------- |:-------------------------------:|
| passes | 1 | How many times you want to iterate through the corpus.<br> More passes generally give the model more chances to converge, <br> although training takes longer if you have a large dataset. |
|chunksize | 2000 | The size of the batch documents you want to run through.<br> e.g. chunksize = 10, we run 10 documents at a time.| 
|update_every |1 | Update the model after every `n` number of chunks. |
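
As a rough back-of-the-envelope check on how `passes` and `chunksize` interact, the sketch below counts how many chunks the corpus is split into and how many chunks are processed in total. The settings are illustrative, and this is only arithmetic; it does not reproduce `gensim`'s exact update scheduling.

```python
import math

def chunk_visits(num_docs, chunksize, passes):
    """Return (chunks per pass, total chunks processed over all passes)."""
    chunks_per_pass = math.ceil(num_docs / chunksize)
    return chunks_per_pass, chunks_per_pass * passes

# Illustrative settings: 400 documents, chunksize=150, passes=15.
per_pass, total = chunk_visits(num_docs=400, chunksize=150, passes=15)
print(per_pass, total)  # 3 45
```

With the default `chunksize` of 2000, all 400 documents fit into a single chunk, which is one reason a smaller `chunksize` can change the topics you get on a small corpus.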

<a id='Q1.5A'></a> **Question 1.5a:** Implement a LDA model using `LdaModel` and adjust the parameters in the function so that you get `num_topics` distinct topics back.

In [8]:
#One of many solutions
ldamodel = models.LdaModel(corpus, 
                           num_topics=4,
                           id2word=dictionary, 
                           chunksize=150, 
                           update_every=1,
                           passes=15)

Run the cell below to display the words that represent underlying topics in our data.

In [9]:
for i in ldamodel.show_topics():
    print('Topic ' + str(i[0]))
    print(i[1])
    print()

Topic 0
0.009*"one" + 0.007*"use" + 0.007*"would" + 0.006*"get" + 0.005*"like" + 0.005*"year" + 0.004*"time" + 0.004*"two" + 0.004*"could" + 0.004*"good"

Topic 1
0.009*"peopl" + 0.009*"would" + 0.005*"one" + 0.004*"like" + 0.004*"god" + 0.004*"make" + 0.003*"say" + 0.003*"state" + 0.003*"said" + 0.003*"hiv"

Topic 2
0.005*"file" + 0.005*"imag" + 0.004*"send" + 0.004*"mail" + 0.004*"graphic" + 0.004*"object" + 0.004*"format" + 0.003*"also" + 0.003*"packag" + 0.003*"3d"

Topic 3
0.007*"orbit" + 0.006*"1" + 0.005*"2" + 0.005*"launch" + 0.005*"3" + 0.005*"probe" + 0.005*"mission" + 0.005*"gm" + 0.004*"titan" + 0.004*"earth"



**Question 1.5b:** From your model, what are some topics that you can infer? Does changing `num_topics` affect the quality of your results?

*Write your answer here*

**Question 1.6:** How did you go about adjusting the parameters of your `LdaModel`? Did you notice any patterns while changing values of certain parameters? What worked in giving you reasonable, clear topics and what didn't? 

*Write your answer here*

----
## Section 2: Using scikit-learn<a id='section 2'></a>

Along with `gensim`, we can also use `scikit-learn` to implement a LDA model. The `scikit-learn` implementation is less transparent, since more of the work happens behind the scenes. But having gone through the `gensim` steps, we now have an idea of how LDA works, and using the `scikit-learn` algorithm will be a little easier.

**Question 2.1:** In order to implement a LDA model using `scikit-learn`, we must extract features to a matrix using either the count vectorizer or the tf-idf vectorizer. Which one do we use and why?

*Write your answer here*

----

If you answered the count vectorizer to the previous question, you're right! Since LDA is a probabilistic model of word counts, we only need the raw term counts.

<a id='sklearn'></a>

**Question 2.2:** Implement a count vectorizer with the parameters `max_df=.95`, `min_df=2`, and `stop_words='english'`. For more explanation on these parameters, [look at the documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

In [10]:
#SOLUTION
countvec = CountVectorizer(max_df=.95, 
                           min_df=2, 
                           stop_words='english')

With the vectorizer, we can transform our data into a document-term matrix, and use `.get_feature_names_out()` (called `.get_feature_names()` in scikit-learn versions before 1.0) to get the word to integer ID mapping like we did in [question 1.3](#Q1.3).

**Question 2.3:** Use your vectorizer to transform the same dataset from the first section of this lab to a document-term matrix (concept is similar to Q1.4) and get the feature names. The specified methods to call are found [here (link to documentation)](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer.fit_transform).

In [11]:
#SOLUTION
cv = countvec.fit_transform(documents[:400])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead.
cv_feature_names = countvec.get_feature_names_out()

We're almost done! The last step is to implement the model, so we can get our topics. Similar to `gensim`, there are parameters that should be adjusted to fit your documents.

| Parameters <br> [(documentation)](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html)|Default | What it does  |
| :-------------------------:|:-----------------------------:|:-------------:|
|          n_components      | 10 | Equivalent to `num_topics` in the `gensim` model. This <br> specifies the number of latent topics in your documents. |
| max_iter | 10 | Equivalent to `passes` in the `gensim` model. |
| batch_size | 128 | Equivalent to `chunksize` |

**Question 2.4:** Implement the LDA model using `LatentDirichletAllocation`. Don't forget to fit your document-term matrix from the previous question!

In [12]:
#SOLUTION
lda = LatentDirichletAllocation(n_components=5, 
                                max_iter=25, 
                                learning_method='online').fit(cv)

A function `topic_words` is defined for you (from `helper.py`) and it takes in three arguments:

     1) model: your LDA model
     2) feature_names: the feature names from the vectorizer
     3) num_top_words: number of words you want displayed

It prints out the topic number and the words that fall under that topic, although it does not display the weights of the words like `gensim` does.
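
The code in `helper.py` is not shown in this lab, so the following is only a guess at how such a function could work, not the helper's actual implementation. It assumes the model exposes a `components_` attribute with one row of word weights per topic, as a fitted `LatentDirichletAllocation` does. `top_words_per_topic` and the toy model are hypothetical.

```python
from types import SimpleNamespace

def top_words_per_topic(model, feature_names, num_top_words):
    """Return, for each topic, the num_top_words highest-weighted words."""
    result = []
    for weights in model.components_:
        # Word indices sorted from highest to lowest weight.
        ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        result.append([feature_names[i] for i in ranked[:num_top_words]])
    return result

# Hypothetical stand-in for a fitted model: 2 topics over 4 words.
toy_model = SimpleNamespace(components_=[[0.1, 2.0, 0.5, 1.2], [3.0, 0.2, 0.1, 0.9]])
toy_names = ["file", "orbit", "mail", "launch"]
for topic_idx, words in enumerate(top_words_per_topic(toy_model, toy_names, 2)):
    print("Topic %d:" % topic_idx)
    print(" ".join(words))
```

On the toy model this prints `orbit launch` for topic 0 and `file launch` for topic 1.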

**Question 2.5:** Specify the number of words you want displayed and call `topic_words` on the LDA model from the previous question. 

<sub>**Note:** If your topics are repetitive or aren't very coherent, try tweaking the parameters in the previous question.</sub>

In [13]:
#Possible solution
num_top_words = 15
topic_words(lda, cv_feature_names, num_top_words)

Topic 0:
edu graphics like mail know pub don comp using send mac want program things ray
Topic 1:
key government armenians keys encryption chip people clipper know going public karabagh phone armenia door
Topic 2:
aids gm health van ik information infected patients medical fake said care met people trials
Topic 3:
years space people god little time build didn ve built use generation great rocket just
Topic 4:
just food good time way like think don msg power car question year monitor people


----
**Question 2.6:** Did your model yield clear, interpretable results? How does it compare to the LDA model you created in [section one](#Q1.5A)?

*Write your answer here*

----
## Section 3: Finding topics from UN General Debates<a id='section 3'></a>

We now have two ways of implementing an LDA model, so let's try both on the UN General Debates dataset. Topic modelling can give us an idea of what was discussed at a particular session!

**Question 3.1**: Load `un-general-debates-25` from the data folder and extract the data from the 'text' column. This csv contains the data from the 25th session. 

In [None]:
un = pd.read_csv('data/un-general-debates-25.csv')
...

In [14]:
#SOLUTION
un = pd.read_csv('data/un-general-debates-25.csv')
un_text = un['text'].values

**Question 3.2:** First implement a LDA model using `gensim`. Follow similar steps from the [first section](#gensim), but adjust your parameters accordingly!

**Note:** Use the `filter_extremes(no_below=10)` [(documentation)](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.filter_extremes) method on your `gensim` dictionary, which filters tokens by document frequency (in this case it keeps only tokens that appear in at least 10 documents).
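
Here is a small plain-Python sketch of what filtering on `no_below` means: a token survives only if it appears in at least that many documents. `filter_by_document_frequency` is an illustrative name, and the sketch covers `no_below` only; the real `filter_extremes` also applies its `no_above` and `keep_n` limits, which would remove further tokens from this toy example.

```python
from collections import Counter

def filter_by_document_frequency(tokenized_docs, no_below):
    """Keep only tokens that appear in at least no_below documents."""
    # Count each token once per document, however often it repeats inside it.
    doc_freq = Counter(token for doc in tokenized_docs for token in set(doc))
    keep = {token for token, df in doc_freq.items() if df >= no_below}
    return [[t for t in doc if t in keep] for doc in tokenized_docs]

toy_docs = [
    ["israel", "ceasefir", "sea"],
    ["sea", "seab", "sea"],
    ["israel", "sea", "jordan"],
]
print(filter_by_document_frequency(toy_docs, no_below=2))
# [['israel', 'sea'], ['sea', 'sea'], ['israel', 'sea']]
```

Note that the count is over documents, not total occurrences: a word repeated many times in a single speech still has a document frequency of 1.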

In [15]:
#Possible solution using gensim
more_stops = ['--', '``', "''", "s'", "\'s", "n\'t", "..."]
un_corp = []
for i in un_text:
    un_tokens = [word.lower() for sent in nltk.sent_tokenize(i) for word in nltk.word_tokenize(sent)]
    un_filtered_tokens = [x for x in un_tokens if x not in punctuation and x not in more_stops]
    un_stopped_tokens = [x for x in un_filtered_tokens if not x in stop]
    un_stemmed_tokens = [stemmer.stem(i) for i in un_stopped_tokens]
    un_corp.append(un_stemmed_tokens)

un_dictionary = corpora.Dictionary(un_corp)

un_dictionary.filter_extremes(no_below=9)

un_d_t = [un_dictionary.doc2bow(i) for i in un_corp]

un_lda = models.LdaModel(un_d_t, num_topics=5,
                            id2word=un_dictionary, 
                            chunksize=20, 
                            update_every=2,
                            passes=17)

un_lda.show_topics()

[(0,
  '0.022*"african" + 0.010*"southern" + 0.007*"portug" + 0.007*"racist" + 0.006*"portugues" + 0.006*"rhodesia" + 0.006*"lusaka" + 0.005*"oppress" + 0.005*"namibia" + 0.004*"guinea"'),
 (1,
  '0.019*"imperialist" + 0.014*"american" + 0.012*"cambodia" + 0.012*"palestinian" + 0.012*"liber" + 0.011*"imperi" + 0.009*"socialist" + 0.008*"democrat" + 0.008*"korea" + 0.007*"europ"'),
 (2,
  '0.007*"sea" + 0.005*"american" + 0.005*"seab" + 0.005*"per" + 0.004*"latin" + 0.004*"jurisdict" + 0.004*"cent" + 0.004*"draft" + 0.004*"floor" + 0.003*"concept"'),
 (3,
  '0.015*"israel" + 0.005*"ceasefir" + 0.004*"land" + 0.004*"neighbor" + 0.004*"jordan" + 0.004*"hijack" + 0.003*"europ" + 0.003*"coexist" + 0.003*"aid" + 0.003*"palestin"'),
 (4,
  '0.016*"european" + 0.015*"europ" + 0.005*"draft" + 0.005*"context" + 0.005*"feder" + 0.005*"factor" + 0.005*"chemic" + 0.004*"submit" + 0.004*"stress" + 0.004*"court"')]

**Question 3.3:** Now, implement a model using `scikit-learn`. Again, follow similar steps from the [second section](#sklearn) and adjust parameters accordingly.

In [27]:
#Possible solution using sklearn
countvec = CountVectorizer(max_df=.95, 
                           min_df=2, 
                           stop_words='english')

un_cv = countvec.fit_transform(un_text)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead.
un_cv_feature_names = countvec.get_feature_names_out()

un_lda = LatentDirichletAllocation(n_components=4, 
                                   max_iter=30,
                                   learning_decay=0.4,
                                   batch_size=15,
                                   learning_method='online').fit(un_cv)

In [28]:
topic_words(un_lda, un_cv_feature_names, 10)

Topic 0:
flying absolve regionally riches falling fatal possesses accredited surprised augury
Topic 1:
people great republic arab africa rights political problems human principles
Topic 2:
law committee sea regional conference latin work north problems council
Topic 3:
israel india pakistan ceasefire lebanon jordan aircraft kashmir hijackers egypt


**Question 3.4:** Which algorithm yielded more well-defined topics (you can also skim the resolutions passed [here](http://research.un.org/en/docs/ga/quick/regular/25) if you're interested)? What do you think are some factors that need to be considered when choosing an algorithm and adjusting its parameters?

*Write your answer here*

**Question 3.5:** What are some differences between the `gensim` and `scikit-learn` algorithms? What are some of their drawbacks? Do you prefer one over the other? 

*Write your answer here*

## Bibliography

 - Chen, Edwin, Introduction to latent dirichlet allocation. http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/
 - Use of `20newsgroups` data set adapted from Topic Modelling tutorial by Aneesha Bakharia https://medium.com/mlreview/topic-modeling-with-scikit-learn-e80d33668730
 - UN 25th session from http://research.un.org/en/docs/ga/quick/regular/25
 - Text cleaning code adapted from notebook by Alex Estes https://github.com/dlab-berkeley/python-text-analysis/blob/master/Intro_to_TextAnalysis/Intro_to_TextAnalysis.ipynb

----
Notebook developed by: Jason Jiang

Data Science Modules: http://data.berkeley.edu/education/modules