# DSCI 575 - Advanced Machine Learning

# Lab 3: HMMs and Topic modeling

In [None]:
import os.path
import numpy as np
import re
import pandas as pd
import matplotlib.pyplot as plt

import gensim 
from gensim.models import LdaModel
from gensim.models.wrappers import LdaMallet

import gensim.corpora as corpora
from gensim.corpora import Dictionary

from gensim import matutils, models

import pyLDAvis.gensim
import string
pd.set_option('display.max_colwidth', 100)
%matplotlib inline

## Table of contents

- [Submission guidelines](#sg)
- [Learning outcomes](#lo)
- [Exercise 1: Hidden Markov models (HMMs) by hand](#hmm)
- [Exercise 2: Topic modeling with LDA](#lda)

## Submission guidelines <a name="sg"></a>

#### Tidy submission
rubric={mechanics:2}
- To submit this assignment, submit this jupyter notebook with your answers embedded.
- Be sure to follow the [general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions/).
- Use proper English, spelling, and grammar throughout your submission.

#### Code quality and writing
- These rubrics will be assessed on a question-by-question basis and are included in individual question rubrics below where appropriate.
- See the [quality rubric](https://github.com/UBC-MDS/public/blob/master/rubric/rubric_quality.md) and [writing rubric](https://github.com/UBC-MDS/public/blob/master/rubric/rubric_writing.md) as a guide to what we are looking for.
- Refer to [Python PEP 8 Style Guide](https://www.python.org/dev/peps/pep-0008/) for coding style.

## Learning outcomes <a name="lo"></a>

After finishing this lab you will be able to:
- formulate an HMM for part-of-speech tagging
- prepare data for topic modeling
- build a topic model using `gensim`
- interpret and visualize your topic model
- evaluate your topic model

## Exercise 1: Hidden Markov models (HMMs) by hand <a name="hmm"></a>

We saw the Viterbi algorithm for hidden Markov models (HMMs) in class. In general, it's a useful algorithm to know, as it can be used to find the most likely assignment of all or some subset of latent variables in graphical models such as hidden Markov models, Bayesian networks and conditional random fields. Recently, it has been used in conjunction with deep learning approaches. (For example, see [Le et al. 2017](https://aclweb.org/anthology/P17-1044).)

In this exercise, you will work through the Viterbi algorithm by hand on a toy dataset to do part-of-speech tagging. Recall that part-of-speech tagging is the problem of assigning a part-of-speech tag to each word in a given text. Usually, given a raw text corpus, only the words are observable and the part-of-speech tags are "hidden", so an HMM is a natural choice for this problem. In fact, HMMs used to be a popular model for this problem, but the current state-of-the-art approaches are based on deep learning. See [here](https://aclweb.org/aclwiki/POS_Tagging_(State_of_the_art)) for the current state of the art in part-of-speech tagging. 

### 1(a) 
rubric={reasoning:4}

Consider the sentence below:
<blockquote>
    Will the chair chair the meeting from this chair ?
</blockquote>

and a simple part-of-speech tagset: 
<blockquote>
{noun, verb, determiner, preposition, punctuation}
</blockquote>    

The table below shows the possible assignments for words and part-of-speech tags. The symbol `x` denotes that the word and part-of-speech tag combination is possible. For instance, the word _chair_ is unlikely to be used as a determiner and so we do not have an `x` there. 

|    <i></i>    | Will    | the     | chair   | chair   | the     | meeting  | from    | this    | chair   | ?       |
| ------------- | :-----: | :-----: | :-----: | :-----: | :----:  | :------: | :-----: | :-----: | :-----: | :----:  |
| noun          | x       | x       |  x      | x       | x       | x        | <i></i> | <i></i> | x       | <i></i> |
| verb          | x       | <i></i> |  x      | x       | <i></i> | x        | <i></i> | <i></i> | x       | <i></i> |
| determiner    | <i></i> | x       | <i></i> | <i></i> | x       | <i></i>  | <i></i> | x       | <i></i> | <i></i> |
| preposition   | <i></i> | <i></i> | <i></i> | <i></i> | <i></i> | <i></i>  | x       | <i></i> | <i></i> | <i></i> |
| punctuation   | <i></i> | <i></i> | <i></i> | <i></i> | <i></i> | <i></i>  | <i></i> | <i></i> | <i></i> | x       |


Given this information, answer the following questions: 
1. With this simple tagset of part-of-speech tags, how many possible part-of-speech tag sequences (i.e., hidden state sequences) are there for the given sentence (observation sequence)?
2. Restricting to the possibilities shown above with `x`, how many possible part-of-speech tag sequences are there?
3. Given an HMM with states as part-of-speech tags and observations as words, one way to decode the observation sequence is as follows: 
    - enumerate all possible hidden state sequences (i.e., enumerate all solutions)
    - for each hidden state sequence, calculate the probability of the observation sequence given the hidden state sequence (i.e., score each solution)
    - pick the hidden state sequence which gives the highest probability for the observation sequence (i.e., pick the best solution)
    
   What is the time complexity of this method in terms of the number of states ($N$) and the length of the output sequence ($T$)?
   
4. If you decode the sequence using the Viterbi algorithm instead, what will be the time complexity in terms of the number of states ($N$) and the length of the output sequence ($T$)?
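The enumerate-score-pick procedure in question 3 can be sketched in a few lines. This is only an illustration on a hypothetical toy HMM (states `H` and `C`, symbols `x` and `y`, made-up probabilities), not the HMM from this lab. It assumes each candidate is scored with the joint probability of the observations and the state sequence, which is the quantity the Viterbi algorithm maximizes.

```python
from itertools import product

# Hypothetical toy HMM, unrelated to the lab data: all numbers are made up.
states = ["H", "C"]
initial = {"H": 0.6, "C": 0.4}
transition = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}
emission = {"H": {"x": 0.2, "y": 0.8}, "C": {"x": 0.9, "y": 0.1}}


def joint_probability(state_seq, observations):
    """Return P(observations, state_seq) under the toy HMM."""
    prob = initial[state_seq[0]] * emission[state_seq[0]][observations[0]]
    for prev, curr, obs in zip(state_seq, state_seq[1:], observations[1:]):
        prob *= transition[prev][curr] * emission[curr][obs]
    return prob


def brute_force_decode(observations):
    """Enumerate every state sequence, score each one, and keep the best."""
    candidates = list(product(states, repeat=len(observations)))
    best = max(candidates, key=lambda seq: joint_probability(seq, observations))
    return best, joint_probability(best, observations), len(candidates)


observations = ["x", "y", "x"]
best_seq, best_prob, n_candidates = brute_force_decode(observations)
print(best_seq, n_candidates)
```

Try increasing the length of `observations` and watch how quickly `n_candidates` grows.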

### YOUR ANSWER HERE

### 1(b) 
rubric={reasoning:5}

Consider a two-word language consisting of _fish_ and _sleep_, with two possible part-of-speech tags: _noun_ and _verb_. Suppose that in our training corpus, _fish_ appears 8 times as a noun and 5 times as a verb, and _sleep_ appears 2 times as a noun and 5 times as a verb. The probability of a sentence starting with a noun is 0.8 and with a verb is 0.2. The state transitions between noun and verb are shown in the picture below. Note that the picture has one extra state, labeled 'End'. This is an "accepting" state: there are no transitions out of it. We have added this state to make the transition probabilities work. Include it in the transition matrix, but do not include it in the emission probabilities or when you run the Viterbi algorithm in the next exercise. 

<center>
<img src="HMM_POS_tagging.png" width="400" height="400">
</center>  

Your task: 

Define a hidden Markov model for part-of-speech tagging for this language. In particular, specify 
1. the set of states
2. the output alphabet (the set of observation symbols)
3. the initial state probability distribution
4. the transition probability matrix
5. the emission probabilities. 

(Credit: The idea of a two-word language with the words _fish_ and _sleep_ is due to [Ralph Grishman](https://cs.nyu.edu/grishman/).)

### YOUR ANSWER HERE

### (optional) 1(c) Find the best part-of-speech tag sequence
rubric={reasoning:1}

- Run the Viterbi algorithm to find the best part-of-speech sequence for the observed sequence **fish fish sleep**. In particular, calculate $\delta$ and $\psi$ at each state for each time step and then find the globally optimal sequence of tags for the observed sequence **fish fish sleep**. Show your work. (You may copy the LaTeX code from the lecture notes. That said, if you do not feel like typing all this in Markdown, you may do this on paper, take a picture, and upload an image here.)
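If you want to check your hand calculations, a generic Viterbi routine along the following lines may help. This is a minimal sketch, not the lecture's reference implementation. The function name `viterbi`, the dictionary-based parameter layout, and the toy HMM (states `H` and `C`, symbols `x` and `y`) are all assumptions made for illustration, and the sketch has no 'End' state.

```python
def viterbi(observations, states, initial, transition, emission):
    """Return the most probable state sequence and its joint probability."""
    # delta[t][s]: probability of the best path that ends in state s at time t.
    # psi[t][s]: the state that precedes s on that best path.
    delta = [{s: initial[s] * emission[s][observations[0]] for s in states}]
    psi = [{s: None for s in states}]

    for obs in observations[1:]:
        prev_delta = delta[-1]
        delta_t, psi_t = {}, {}
        for s in states:
            best_prev = max(states, key=lambda r: prev_delta[r] * transition[r][s])
            delta_t[s] = prev_delta[best_prev] * transition[best_prev][s] * emission[s][obs]
            psi_t[s] = best_prev
        delta.append(delta_t)
        psi.append(psi_t)

    # Backtrack from the most probable final state using the psi pointers.
    last = max(states, key=lambda s: delta[-1][s])
    path = [last]
    for psi_t in reversed(psi[1:]):
        path.append(psi_t[path[-1]])
    path.reverse()
    return path, delta[-1][last]


# Hypothetical toy HMM, unrelated to the fish/sleep language: numbers are made up.
toy_states = ["H", "C"]
toy_initial = {"H": 0.6, "C": 0.4}
toy_transition = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}
toy_emission = {"H": {"x": 0.2, "y": 0.8}, "C": {"x": 0.9, "y": 0.1}}

toy_path, toy_prob = viterbi(["x", "y", "x"], toy_states, toy_initial,
                             toy_transition, toy_emission)
print(toy_path)
```

For long sequences, the products of probabilities become very small, so a practical implementation would typically work with sums of log probabilities instead.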

### YOUR ANSWER HERE

## Exercise 2: Topic modeling with LDA <a name="lda"></a>

In this exercise you will explore the topics in `scikit-learn`'s [20 newsgroups text dataset](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html) using [`gensim`'s `ldamodel`](https://radimrehurek.com/gensim/models/ldamodel.html). Usually, topic modeling is used for discovering the abstract "topics" that occur in a collection of documents when you do not know the actual topics present in the documents. But since the 20 newsgroups text dataset is labeled with categories (e.g., sports, hardware, religion), you will be able to cross-check the topics discovered by your model with the actual topics. 

Let's load the data and examine the first few rows. Note that we won't be violating the golden rule by looking at the training subset; later we will be using a separate test subset to evaluate the model. 

Below I am giving you starter code to load the train and test portions of the data and convert the train portion into a pandas DataFrame. Note that we are using train and test splits so that we can later examine how well the LDA model we learn is able to assign topics to unseen documents. 

In [None]:
### BEGIN STARTER CODE
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')
### END STARTER CODE

In [None]:
### BEGIN STARTER CODE
data = {'text':[], 'target_name':[], 'target':[]}
data['text'] = newsgroups_train.data
data['target_name'] = [newsgroups_train.target_names[target] for target in newsgroups_train.target]
data['target'] = [target for target in newsgroups_train.target]
df = pd.DataFrame(data)
df.head()
### END STARTER CODE

### 2(a) Preprocessing
rubric={accuracy:4,quality:1,reasoning:2}

We want our topic model to identify interesting and important patterns. For that we need to "normalize" our text. Preprocessing is a crucial step before you train an LDA model and it markedly affects the results. In this exercise you'll prepare the data for topic modeling. We have been using `nltk` for preprocessing so far. In this lab, we will use another popular Python NLP library called [spaCy](https://spacy.io/), which we briefly discussed in Lecture 2. Install the library and the models for English using the following commands. You can find more information about the installation [here](https://spacy.io/usage).

`conda install -c conda-forge spacy`

`python -m spacy download en_core_web_sm`

spaCy is a powerful library and it can do many other things, but we'll be using it for preprocessing.  With this library, you can run the NLP pipeline by simply calling the function `nlp`. You can then access information about each token in a `for` loop as shown below. 

```
doc = nlp(text)
for token in doc:
    print(token.pos_)
    print(token.lemma_)
```

Your task is to complete the function `preprocess` below to carry out preprocessing. In particular, 

1. Get rid of email addresses and other weird characters and patterns.  
2. Replace multiple spaces with a single space. 
3. Run NLP analysis using `spaCy` and exclude tokens 
    - which are stop words.
    - which have length < `min_token_len`.
    - which have irrelevant part-of-speech tags as given in `irrelevant_pos`. [Here](https://spacy.io/api/annotation/#pos-en) is the list of part-of-speech tags used by spaCy. 
4. Get the lemma of each token, which is the root form of a word. You can access it using `token.lemma_`. 
5. Carry out other preprocessing, if necessary. 
6. Return the preprocessed text.  

**Note that preprocessing the corpus might take time.** So here are a couple of suggestions:
- During the debugging phase, work on a smaller subset of the data. 
- You might want to add an extra column in your dataframe for preprocessed text and save the dataframe as a CSV. 
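Steps 1 and 2 of the task above can be handled with the standard `re` module before the text is passed to spaCy. The sketch below is only a starting point: the helper name `basic_clean` and the patterns are assumptions, and the email pattern is deliberately crude (it removes any whitespace-delimited chunk containing `@`). You will likely need additional patterns for this corpus.

```python
import re

# Crude pattern: any run of non-whitespace characters containing an "@".
EMAIL_PATTERN = re.compile(r"\S+@\S+")
# One or more whitespace characters (spaces, tabs, newlines).
WHITESPACE_PATTERN = re.compile(r"\s+")


def basic_clean(text):
    """Remove email-like chunks and collapse whitespace runs (hypothetical helper)."""
    text = EMAIL_PATTERN.sub(" ", text)
    text = WHITESPACE_PATTERN.sub(" ", text)
    return text.strip()


sample = "From: alice@example.com\n\nHello   world!\tBye"
cleaned = basic_clean(sample)
print(cleaned)
```

Cleaning the raw string first keeps the spaCy step focused on token-level filtering (stop words, token length, part-of-speech tags, and lemmas).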

In [None]:
### BEGIN STARTER CODE
import spacy
# Load English model for SpaCy
nlp = spacy.load("en_core_web_sm")
### END STARTER CODE

In [None]:
### BEGIN STARTER CODE
def preprocess(text, 
               min_token_len = 2, 
               irrelevant_pos = ['ADV','PRON','CCONJ','PUNCT','PART','DET','ADP','SPACE']): 
    """
    Given text, min_token_len, and irrelevant_pos carry out preprocessing of the text 
    and return a preprocessed string. 
    
    Parameters
    -------------
    text : (str) 
        the text to be preprocessed
    min_token_len : (int) 
        min_token_length required
    irrelevant_pos : (list) 
        a list of irrelevant pos tags
    
    Returns
    -------------
    (str) the preprocessed text
    """
    #YOUR CODE HERE
### END STARTER CODE    

In [None]:
### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE

### 2(b) Build dictionary and document-term co-occurrence matrix
rubric={accuracy:2,quality:1}

We need two things to build `gensim`'s `LdaModel`: a dictionary and a document-term co-occurrence matrix. In this exercise, you'll

1. Create a dictionary using `gensim`'s [`corpora.Dictionary`](https://radimrehurek.com/gensim/corpora/dictionary.html) class. 
2. Create the document-term co-occurrence matrix by calling the dictionary's `doc2bow` method on each tokenized document. 
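To make the target data structure concrete, the sketch below mimics the bag-of-words format in plain Python: each document becomes a list of `(token_id, count)` pairs. It does not use `gensim`; the names `token2id` and `to_bow` and the tiny corpus are made up for illustration, and the ids that `gensim` assigns may differ. In `gensim` itself, the corresponding calls are `Dictionary(tokenized_docs)` and `dictionary.doc2bow(doc)`.

```python
from collections import Counter

# Hypothetical tokenized corpus: each document is a list of preprocessed tokens.
tokenized_docs = [
    ["hockey", "team", "game", "team"],
    ["space", "orbit", "launch"],
    ["game", "launch", "game"],
]

# Map each distinct token to an integer id, in order of first appearance.
token2id = {}
for doc in tokenized_docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))


def to_bow(doc):
    """Return sorted (token_id, count) pairs, skipping tokens not in the vocabulary."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())


corpus = [to_bow(doc) for doc in tokenized_docs]
print(corpus[0])
```

Note that word order is discarded: only the counts per token survive, which is all LDA needs.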

In [None]:
### YOUR ANSWER HERE

### 2(c): Build a topic model
rubric={accuracy:4,reasoning:2}

In this exercise you will build an LDA topic model on the prepared data.  

1. Build an LDA model using `gensim`'s [`models.LdaModel`](https://radimrehurek.com/gensim/models/ldamodel.html) with `num_topics` = 5. Note: If you get many warnings when you build your model, update your gensim installation. See [here](https://github.com/RaRe-Technologies/gensim/pull/2296).
2. Print the LDA topics with the `model.print_topics()` method, where `model` is your LDA model. 
3. Experiment with a few choices of the `num_topics` and `passes` hyperparameters. 
4. Settle on the hyperparameters where the topics make sense to you and briefly explain your results. 

In [None]:
### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE

### 2(d) Visualization and interpretation
rubric={viz:2,reasoning:4}

Once you settle on the number of topics and passes, visualize the topics and interpret them. In particular,  

1. Visualize the topics using [pyLDAvis](https://github.com/bmabey/pyLDAvis), which is a Python library for interactive topic model visualization. Note: Use `sort_topics=False`. Otherwise the topic ids in the previous exercise won't match with the topics here.
2. Using the words in each topic and their corresponding weights, manually assign a label (e.g., *sports, politics, religion*) to each topic based on the common theme in the most probable words in that topic.
3. Create a dictionary named `topic_labels` with keys as the topic id and your manually-assigned topic label as the values. An example key-value pair in the dictionary is shown in the starter code below. (Of course in your topic model the topic with id 0 might not be 'Science and technology'. I am just showing you an example here.) 

In [None]:
### BEGIN STARTER CODE
topic_labels = {0:'Science and technology'}
### END STARTER CODE

In [None]:
### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE

### 2(e) Test on unseen documents 
rubric={accuracy:4,quality:2,reasoning:2}

In this particular data, we already know the "topics" (labels) for each article. In this exercise, you will examine to what extent the topics identified by the LDA model match with the actual labels of unseen documents. I am giving you starter code to create a DataFrame for the test data. 

Your tasks:
1. Complete the function `get_most_prob_topic` below, which takes an unseen document and the model as input and returns a string of the form `most likely topic label:probability of the label` (e.g., 'Science and Technology:0.435'). Hint: You can access the topic assignment of the unseen document using `lda[bow_vector]`, where `lda` is your LDA model and `bow_vector` is the bag-of-words vector created using `dictionary.doc2bow`. 
2. Call `get_most_prob_topic` for each document (i.e., each cell in the `text` column) in `sample_test_df`. 
3. For around 10 to 20 documents, manually examine their gold labels (`target_name`) and LDA-assigned topics. Comment on whether the LDA topic assignments make sense and to what extent they match the corresponding values in the `target_name` column.  
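The last step of task 1, turning a topic distribution into the required string, can be sketched as follows. This assumes the topic assignment is a list of `(topic_id, probability)` pairs and that `topic_labels` is the dictionary from 2(d). The helper name `most_probable_topic` and the example values are made up for illustration; preprocessing the document and building its bag-of-words vector are left to you.

```python
def most_probable_topic(topic_distribution, topic_labels):
    """Format the highest-probability topic as 'label:probability' (hypothetical helper)."""
    topic_id, prob = max(topic_distribution, key=lambda pair: pair[1])
    # Fall back to a generic name if a topic id has no manual label.
    label = topic_labels.get(topic_id, f"topic {topic_id}")
    return f"{label}:{prob:.3f}"


# Made-up distribution and labels, standing in for a model's output on one document.
example_distribution = [(0, 0.435), (1, 0.120), (3, 0.445)]
example_labels = {0: "Science and technology", 1: "Sports", 3: "Religion"}

result = most_probable_topic(example_distribution, example_labels)
print(result)
```

Topics with very small probability may be missing from the distribution, which is why the sketch looks labels up by topic id rather than by list position.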

In [None]:
### BEGIN STARTER CODE
data = {'text':[], 'target_name':[], 'target':[]}
data['text'] = newsgroups_test.data
data['target_name'] = [newsgroups_test.target_names[target] for target in newsgroups_test.target]
data['target'] = [target for target in newsgroups_test.target]
test_df = pd.DataFrame(data)
sample_test_df = test_df.sample(100)
sample_test_df
### END STARTER CODE

In [None]:
### BEGIN STARTER CODE
def get_most_prob_topic(unseen_document, model = lda):
    """
    Given an unseen_document, and a trained LDA model, this function
    finds the most likely topic (topic with the highest probability) from the 
    topic distribution of the unseen document and returns the best topic with 
    its probability. 
    
    Parameters
    ------------
    unseen_document : (str) 
        the document to be labeled with a topic
    model : (gensim ldamodel) 
        the trained LDA model
    
    Returns
    -------------
        (str) a string of the form 
        `most likely topic label:probability of that label` 
    
    Examples
    ----------
    >>> get_most_prob_topic("The research uses an HMM for discovering gene sequence.", 
    ...                     model = lda)
    Science and Technology:0.435
    """    
### END STARTER CODE    

In [None]:
### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE

### YOUR ANSWER HERE