# DSCI 575: Advanced Machine Learning (in the context of Natural Language Processing (NLP) applications)

UBC Master of Data Science program, 2019-20

Instructor: Varada Kolhatkar [ʋəɾəda kɔːlɦəʈkər]

## Lecture 4: Hidden Markov Models (HMMs)

## Learning outcomes

From this lesson you will be able to

- explain the motivation for using HMMs
- define an HMM
- state the Markov assumption in HMMs
- explain three fundamental questions for an HMM
- apply the forward algorithm given an HMM
- explain supervised training in HMMs

### Observable Markov models 

- Example
    - States: {uniformly, are, charming}   
    
<img src="images/observable_Markov.png" height="600" width="600"> 


[Source](https://web.stanford.edu/~jurafsky/slp3/A.pdf)

### Hidden phenomenon 

Very often the things you observe in the real world are only a function of some other **hidden** variable.



### Hidden phenomenon example 

- Speech sounds are the outputs of hidden phonemes
- Phonemes
    - distinct units of sound
    - Example: seven $\rightarrow$ seh v ax n
    
<img src="images/hmm_eks.gif" height="600" width="600"> 


[Source](https://www.uea.ac.uk/computing/research-at-the-uea-speech-group)

### Hidden phenomenon example 

- Words are the outputs of hidden parts-of-speech


<img src="images/hmm_pos_tagging.png" height="1000" width="1000"> 


[Source](https://web.stanford.edu/~jurafsky/slp3/8.pdf)

### Hidden phenomenon 

More examples

- Encrypted symbols are outputs of hidden messages
- Genes are outputs of functional relationships
- Stock prices or trader's mood are the output of market conditions


<img src="images/stock_market_hmm.png" height="1000" width="1000"> 


[Source](https://letianquant.com/hidden-markov-chain.html)


### Markov process with hidden variables: Example

- Suppose you have a little robot that is trying to estimate the posterior probability that you are **Happy (H or 🙂)** or **Sad (S or 😔)**, given that the robot has observed whether you are doing one of the following activities: 
    - **Learning data science (L or 📚)**, 
    - **Eat (E or 🍎)**, 
    - **Cry (C or 😿)**, 
    - **Social media (F)**

- The robot is trying to estimate the unknown (hidden) state $Q$, where $Q =H$ when you are happy (🙂) and $Q = S$ when you are sad (😔). 
- The robot is able to observe the activity you are doing: $O = {L, E, C, F}$ 

(Attribution: Example adapted from [here](https://www.cs.ubc.ca/~nando/340-2012/lectures/l6.pdf).)

### Markov process with hidden variables: Example

- Example questions we are interested in answering are:
    - What is $P(Q = 😔|O = F)$?
    - What is the best possible sequence of state of mind (e.g.,🙂,🙂,😔,🙂,🙂 ) given an observation sequence (e.g., L,L,C,L,L). 

### HMM ingredients

- State space (e.g., 🙂 (H), 😔 (S))
- An initial probability distribution over the states (categorical)
- Transition probabilities (categorical) 
- **Emission probabilities (categorical)** 
    - Conditional probabilities for all observations given a hidden state
    - Example: Below $P(L|🙂) = 0.7$ and $P(L|😔) = 0.1$
    

<img src="files/images/HMM_example.png" height="600" width="600"> 

### Definition of an HMM

- A hidden Markov model (HMM) is specified by the 5-tuple:  $\{S, O, \pi, T, B\}$ 
    - $S = \{s_1, s_2, \dots, s_n\}$ is a set of states (e.g., moods)
    - **$Y = \{y_1, y_2, \dots, y_k\}$ is output alphabet (e.g., set of activities)**
    - $\pi = {\pi_1, \pi_2, \dots, \pi_n}$ is discrete initial state probability distribution 
    - Transition probability matrix $T$, where each $a_{ij}$ represents the probability of moving from state $s_i$ to state $s_j$
    - **Emission probabilities B = $b_i(o), i \in S, o \in Y\$**
    
<img src="files/images/HMM_example.png" height="600" width="600"> 

### Definition of an HMM continued

- Yielding the state sequence and the observation sequences in an unrolled HMM 
    - State sequence: $Q = {q_0,q_1, q_2, \dots q_T}, q_i \in S$ 
    - Observation sequence: $O = {o_0,o_1, o_2, \dots o_T}, o_i \in Y$

<img src="files/images/HMM_example.png" height="600" width="600"> 


<img src="files/images/HMM_unrolling_timesteps.png" height="700" width="700"> 

### Unrolling the timesteps 

- Each state produces only a single observation and the sequence of hidden states and the sequence of observations have the same length. 


<img src="files/images/HMM_unrolling_timesteps.png" height="700" width="700"> 



### HMM assumptions

- **The probability of a particular state only depends on the previous state.**
    * $P(q_i|q_0,q_1,\dots,q_{i-1})$ = $P(q_i|q_{i-1})$
    
- **The probability of an output observation $o_i$ depends only on the state that produces the observation and not on any other state or any other observation.** 
    * $P(o_i|q_0,q_1,\dots,q_{i-1}, o_0,o_1,\dots,o_{i-1})$ = $P(o_i|q_i)$

<img src="files/images/HMM_unrolling_timesteps.png" height="800" width="800"> 

### Questions? 

### Three fundamental questions for an HMM

#### Likelihood
Given a model with parameters $\theta = <\pi, T, B>$, how do we efficiently compute the likelihood of a particular observation sequence $O$?
#### Decoding
Given an observation sequence $O$ and a model $\theta$ how do we choose a state sequence $Q={q_0, q_1, \dots q_T}$ that best explains the observation sequence?
#### Learning
Training: Given a large observation sequence $O$ how do we choose the best parameters $\theta$ that explain the data $O$? 

#### Likelihood

Given a model with parameters $\theta = <\pi, T, B>$, how do we efficiently compute the likelihood of a particular observation sequence $O$?

- Example: What's the probability of the sequence below? 

<img src="files/images/HMM_example_activity_seq.png" height="400" width="400"> 

- Recall that in HMMs, the observations are dependent upon the hidden states in the same time step. 
<br><br>

<img src="files/images/HMM_likelihood_known_hidden.png" height="500" width="500"> 

### Probability of an observation sequence given state sequence 

- Suppose we know both the sequence of hidden states (moods) and the sequence of activities emitted by them. 
- $P(O|Q) = \prod\limits_{i=1}^{T} P(o_i|q_i)$
- $P(E L F C|🙂 🙂 😔 😔) = P(E|🙂) \times P(L|🙂) \times P(F|😔) \times P(C|😔)$

<img src="files/images/HMM_likelihood_known_hidden.png" height="400" width="400"> 

### Joint probability of observations and a possible hidden sequence 

- But we do not know what the hidden state sequence was. 


### Joint probability of observations and a possible hidden sequence 

- We need to look at hidden states. 
- Let's consider the joint probability of being in a particular state sequence $Q$ and generating a particular sequence $O$ of activities. 

<br>

<img src="files/images/HMM_likelihood_unknown_hidden.png" height="800" width="800"> 

### Joint probability of observations and a possible hidden sequence 

- $P(O,Q) = P(O|Q)\times P(Q) = \prod\limits_{i=1}^T P(o_i|q_i) \times \prod\limits_{i=1}^T P(q_i|q_{i-1})$ 

\begin{equation}
\begin{split}
P(E L F C, 🙂 🙂 😔 😔) = & P(🙂|start)\\ 
                          & \times P(🙂|🙂) \times P(😔|🙂) \times P(😔|😔)\\
                          & \times P(E|🙂) \times P(L|🙂) \times P(F|😔) \times P(C|😔)\\
                      = & 0.8 \times 0.7 \times 0.3 \times 0.6 \times 0.2 \times 0.7 \times 0.2 \times 0.6 
\end{split}
\end{equation}
<br>
<img src="files/images/HMM_likelihood_unknown_hidden.png" height="700" width="700"> 

### Total probability of an observation sequence 

- But we do not know the state sequence $Q$
- We need to compute the probability of activity sequence (ELFC) by summing over all possible state (mood) sequences.  

- $P(O) = \sum\limits_Q P(O,Q) = \sum\limits_QP(O|Q)P(Q)$

\begin{equation}
\begin{split}
P(E L F C) = & P(E L F C,🙂🙂🙂🙂)\\ 
             & + P(E L F C,🙂🙂🙂😔)\\
             & + P(E L F C,🙂🙂😔😔) + \dots
\end{split}
\end{equation}

- Computationally inefficient 
    - For HMMs with $n$ hidden states and an observation sequence of $T$ observations, there are $n^T$ possible hidden sequences!!
    - In real-world problems both $n$ and $T$ are large numbers. 

### How to compute $P(O)$ cleverly? 

- To avoid this complexity we use **dynamic programming**; we remember the results rather than recomputing them. 
- We make a **trellis** which is an array of states vs. time.
- The element at $(i,t)$ is $\alpha_i(t)$, which is the probability of being in state $i$ at time $t$ after seeing all previous observations: $P(o_{1:t-1}, q_t = s_i;\theta)$

<img src="files/images/HMM_trellis.png" height="600" width="600"> 

### Trellis 

- Note the alternative paths in the trellis

<img src="files/images/HMM_trellis.png" height="600" width="600"> 

### The forward procedure: intuition 

- To compute $\alpha_j(t)$, we can compute $\alpha_{i}(t-1)$ for all possible states $i$ and then use our knowledge of $a_{ij}$ and $b_j(o_t)$.
- We compute the trellis left-to-right because of the convention of time.
- Remember that $o_t$ is fixed and known.
<center>
<img src="files/images/HMM_trellis.png" height="600" width="600"> 
</center> 

### The forward procedure

Three steps of the forward procedure. 

- Initialization: Compute the $\alpha$ values for nodes in the first column of the trellis $(t = 0)$.
- Induction: Iteratively compute the $\alpha$ values for nodes in the rest of the trellis $(1 \leq t < T)$.
- Conclusion: Sum over the $\alpha$ values for nodes in the last column of the trellis $(t = T)$.

<img src="files/images/HMM_example_trellis.png" height="800" width="800"> 


### The forward procedure: Initialization $\alpha_🙂(0)$ and $\alpha_😔(0)$

- Compute the nodes in the first column of the trellis $(T = 0)$.
    * Probability of starting at state 🙂 and observing the activity E: $\alpha_🙂(0) = \pi_🙂 \times b_🙂(E) = 0.8 \times 0.2 = 0.16$ 
    * Probability of starting at state 😔 and observing the activity E: $\alpha_😔(0) = \pi_😔 \times b_😔(E) = 0.2 \times 0.1 = 0.02$  


<img src="files/images/HMM_example_trellis.png" height="1000" width="1000"> 


### The forward procedure: Induction

- Iteratively compute the nodes in the rest of the trellis $(1 \leq t < T)$.
-  To compute $\alpha_j(t+1)$ we can compute $\alpha_{i}(t)$ for all possible states $i$ and then use our knowledge of $a_{ij}$ and $b_j(o_{t+1})$ 
- $\alpha_j(t+1) = \sum\limits_{i=1}^n \alpha_i(t) a_{ij} b_j(o_{t+1})$

<img src="files/images/HMM_example_trellis.png" height="1000" width="1000"> 


### The forward procedure: Induction $\alpha_🙂(1)$

- $\alpha_j(t+1) = \sum\limits_{i=1}^n \alpha_i(t) a_{ij} b_j(o_{t+1})$

- Probability of being at state 🙂 at $t=1$ and observing the activity L

\begin{equation}
\begin{split}
\alpha_🙂(1) = & \alpha_🙂(0)a_{🙂🙂}b_🙂(L) + \alpha_😔(0)a_{😔🙂}b_🙂(L)\\
             = & 0.16 \times 0.7 \times 0.7 + 0.02 \times 0.4 \times 0.7\\ 
             = & 0.084\\
\end{split}
\end{equation}


<img src="files/images/HMM_example_trellis.png" height="700" width="700"> 


### The forward procedure: Induction $\alpha_😔(1)$

- $\alpha_j(t+1) = \sum\limits_{i=1}^n \alpha_i(t) a_{ij} b_j(o_{t+1})$
- Probability of being at state 😔 at $t=1$ and observing the activity L:
\begin{equation}
\begin{split}             
\alpha_😔(1) = & \alpha_🙂(0)a_{🙂😔}b_😔(L) + \alpha_😔(0)a_{😔😔}b_😔(L)\\
             = & 0.16 \times 0.3 \times 0.1 + 0.02 \times 0.6 \times 0.1\\
             = & 0.006\\
\end{split}
\end{equation}



<img src="files/images/HMM_example_trellis.png" height="700" width="700"> 

### The forward procedure: Induction $\alpha_🙂(2)$

- $\alpha_j(t+1) = \sum\limits_{i=1}^n \alpha_i(t) a_{ij} b_j(o_{t+1})$

- Probability of being at state 🙂 at $t=2$ and observing the activity F

\begin{equation}
\begin{split}
\alpha_🙂(2) = & \alpha_🙂(1)a_{🙂🙂}b_🙂(F) + \alpha_😔(1)a_{😔🙂}b_🙂(F)\\
             = & 0.084 \times 0.7 \times 0.0 + 0.006 \times 0.4 \times 0.0\\ 
             = & 0.0\\
\end{split}
\end{equation}



<img src="files/images/HMM_example_trellis.png" height="700" width="700"> 

### The forward procedure: Induction $\alpha_😔(2)$

- $\alpha_j(t+1) = \sum\limits_{i=1}^n \alpha_i(t) a_{ij} b_j(o_{t+1})$
- Probability of being at state 😔 at $t=2$ and observing the activity F:
\begin{equation}
\begin{split}             
\alpha_😔(2) = & \alpha_🙂(1)a_{🙂😔}b_😔(F) + \alpha_😔(1)a_{😔😔}b_😔(F)\\
             = & 0.084 \times 0.3 \times 0.2 + 0.006 \times 0.6 \times 0.2\\
             = & 0.00576\\
\end{split}
\end{equation}


<img src="files/images/HMM_example_trellis.png" height="700" width="700"> 

### The forward procedure: Induction $\alpha_🙂(3)$ (Activity)

- $\alpha_j(t+1) = \sum\limits_{i=1}^n \alpha_i(t) a_{ij} b_j(o_{t+1})$

- Probability of being at state 🙂 at $t=3$ and observing the activity C:

\begin{equation}
\begin{split}
\alpha_🙂(3) = & \alpha_🙂(2)a_{🙂🙂}b_🙂(C) + \alpha_😔(2)a_{😔🙂}b_🙂(C)\\
             = & 0 \times 0.7 \times 0.1 + 0.00576 \times 0.4 \times 0.1\\ 
             = & 2.3 \times 10^{-4}\\
\end{split}
\end{equation}


<img src="files/images/HMM_example_trellis.png" height="700" width="700"> 

### The forward procedure: Induction $\alpha_😔(3)$ (Activity)

- $\alpha_j(t+1) = \sum\limits_{i=1}^n \alpha_i(t) a_{ij} b_j(o_{t+1})$
- Probability of being at state 😔 at $t=3$ and observing the activity C:
\begin{equation}
\begin{split}             
\alpha_😔(3) = & \alpha_🙂(2)a_{🙂😔}b_😔(C) + \alpha_😔(2)a_{😔😔}b_😔(C)\\
             = & 0.0 \times 0.3 \times 0.6 + 0.00576 \times 0.6 \times 0.6\\
             = & 2.07 \times 10^{-3}\\
\end{split}
\end{equation}


<img src="files/images/HMM_example_trellis.png" height="700" width="700"> 

### The forward procedure: Conclusion

- Sum over all possible final states:
  * $P(O;\theta) = \sum\limits_{i=1}^{n}\alpha_i(T-1)$
  * $P(E,L,F,C) = \alpha_🙂(3) + \alpha_😔(3) = 2.3 \times 10^{-4} + 2.07 \times 10^{-3}$ 

- The forward procedure using dynamic programming needs only $2N^2T$ multiplications compared to the $(2T)N^T$ multiplications with the naive approach!! 

<img src="files/images/HMM_example_trellis.png" height="700" width="700"> 


### Generation with an HMM

- An HMM is a generative model and we can generate new sequences using an HMM
- $t = 0$
- Start in state $q_0$ = $s_i$ with probability $\pi_i$
- Emit observation symbol $o_0 = y_k$ with probability $b_i(o_0)$
- While (not forever): 
    * Go from state $q_t = s_i$ to state $q_{t+1} = s_j$ with probability $a_{ij}$
    * Emit observation symbol $o_{t+1} = y_k$ with probability $b_j(o_{t+1})$
    * $t = t + 1$  
    
<img src="files/images/HMM_example.png" height="500" width="500"> 

### Supervised training of HMMs

- Suppose we have training data where we have $O$ and corresponding $Q$, then we can use MLE to learn parameters $\theta = <\pi, T, B>$
- Get transition matrix and the emission probabilities. 
    - Suppose $i$, $j$ are unique states from the state space and $k$ is a unique observation.    
    - $\pi_0(i) = P(q_0 = i) = \frac{Count(q_0 = i)}{\#samples}$
    - $a_{ij} = P(q_{t+1} = j|q_t = i) = \frac{Count(i,j)}{Count(i)}$
    - $b_{ik} = P(o_{t} = k|q_t = i) = \frac{Count(i,k)}{Count(i)}$

<img src="files/images/HMM_unrolling_timesteps.png" height="700" width="700"> 

### Supervised training of HMMs

- Suppose we have training data where we have $O$ and corresponding $Q$, then we can use MLE to learn parameters $\theta = <\pi, T, B>$
    - Count how often $q_{i-1}$ and $q_i$ occur together normalized by how often $q_{i-1}$ occurs: 
      $p(q_i|q_{i-1}) = \frac{Count(q_{i-1} q_i)}{Count(q_{i-1})}$
    - Count how often $q_i$ is associated with the observation $o_i$.   
      $p(o_i|q_{i}) = \frac{Count(o_i \wedge q_i)}{Count(q_{i})}$    

<center>
<img src="files/images/HMM_unrolling_timesteps.png" height="700" width="700"> 
</center>    

### Unsupervised Learning of HMMs 

- So far we were assuming a supervised setting where we knew the hidden states $S$. 
- We used MLE to get transition matrix and the emission probabilities. 
- In many cases, the number of states is unknown and we cannot count them. 
- How to deal with the incomplete data?
    - Use expectation-maximization
    - Baum-Welch re-estimation
    - We do not have time to talk about it in this class but if curious here are some resources:
        * [Frank Rudzicz's slides](http://www.cs.toronto.edu/~frank/csc401/lectures2018/5-HMMs.pdf) (from page 77 to 95). 
        * [Andrew McCallum's slides](https://people.cs.umass.edu/~mccallum/courses/inlp2004a/lect10-hmm2.pdf)

### (Optional) HMMs with [ `hmmlearn`](https://hmmlearn.readthedocs.io)

In [9]:
import numpy as np
from hmmlearn import hmm

# Initializing an HMM 
states = ['Happy', 'Sad']
n_states = len(states)

observations = ['Learn', 'Eat', 'Cry', 'Facebook']
n_observations = len(observations)

model = hmm.MultinomialHMM(n_components=n_states)
model.startprob_ = np.array([0.8,0.2])
model.transprob_ = np.array([
 [0.7, 0.3],
 [0.4, 0.6]
])
model.emissionprob_ = np.array([
    [0.6, 0.3, 0.1, 0.0],
    [0.1, 0.1, 0.6, 0.2]
])
observation_sequence = np.array([[1, 0, 3, 2]])
print(observation_sequence)
model.
#print('loglikelihood of X: ', model.score(observation_sequence))
# Assume the following observation sequence: 
# Learn, Learn, Cry, Facebook, Cry, Learn, Eat, Learn, Eat, Cry, Cry
#observation_sequence = np.array([[1, 0, 3, 2]]).T
#print(observation_sequence)
#print('loglikelihood of X: ', model.score(observation_sequence))

# Fit the model
#model = model.fit(observation_sequence)

[[1 0 3 2]]


AttributeError: 'MultinomialHMM' object has no attribute 'transmat_'

In [3]:
# Likelihood computation
X, Z = model.sample(5)
print(X)
print(Z)
print('loglikelihood of X: ', model.score(X))
X, Z = model.sample(9)
print(X)
print(Z)
print('loglikelihood of X: ', model.score(X))

[[1]
 [0]
 [2]
 [2]
 [0]]
[0 0 1 1 0]
loglikelihood of X:  -5.524303857896946
[[0]
 [0]
 [1]
 [0]
 [1]
 [0]
 [2]
 [2]
 [2]]
[0 0 0 0 0 0 1 1 1]
loglikelihood of X:  -8.276619324554098


  return np.log(self.emissionprob_)[:, np.concatenate(X)].T
  return np.log(self.emissionprob_)[:, np.concatenate(X)].T


### Summary

- Hidden Markov models (HMMs) model time-series with latent factors
- There are tons of applications associated with them and they are more realistic than Markov chains

Important ideas we learned 
- The definition of an HMM
- Three fundamental questions for HMMs
- The purpose of the forward algorithm and how to calculate $\alpha_i(t)$
- Supervised training in HMMs

### Other useful/interesting material 

- [Hidden Markov Models chapter from Jurafsky and Martin](https://web.stanford.edu/~jurafsky/slp3/A.pdf)
- Attribution: Many presentation ideas in this notebook are taken from [Frank Rudzicz's slides](http://www.cs.toronto.edu/~frank/csc401/lectures2018/5-HMMs.pdf).
- [Jason Eisner's lecture on hidden Markov Models](https://vimeo.com/31374528)
- [Jason Eisner's interactive spreadsheet for HMMs](https://cs.jhu.edu/~jason/papers/eisner.hmm.xls)
- [Who each player is guarding?](https://www.youtube.com/watch?v=JvNkZdZJBt4)
- [The Viterbi Algorithm: A Personal History](https://arxiv.org/pdf/cs/0504020v2.pdf)
- [A nice demo of independent vs. Markov vs. HMMs for DNA](https://a-little-book-of-r-for-bioinformatics.readthedocs.io/en/latest/src/chapter10.html)