![](img/575_banner.png)

# Lecture 3: Introduction to Hidden Markov Models (HMMs)

UBC Master of Data Science program, 2022-23

Instructor: Varada Kolhatkar

## Lecture plan, imports, LO

### Lecture plan 

- Motivation (~5 mins)
- Definition and terminology of HMMs (~15 mins)
- Q&A and activities (~5 mins) 
- Break (~5 mins)
- The forward algorithm (~25 mins)
- Supervised training of HMMs (~10 mins)
- Q&A and activities (~5 mins) 
- Final comments and summary (~2 mins)

### Imports 

In [1]:
import os
import re
import sys

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import IPython
from IPython.display import HTML, display
from nltk.tag.hmm import HiddenMarkovModelTrainer

### Learning outcomes

From this lesson you will be able to

- explain the motivation for using HMMs
- define an HMM
- state the Markov assumption in HMMs
- explain three fundamental questions for an HMM
- apply the forward algorithm given an HMM
- explain supervised training in HMMs

<br><br><br><br>

## Motivation

### Personal virtual assistants

- An important component of virtual assistants (e.g., Siri, Cortana, Google Home, Alexa) is speech recognition. 
- We ask such assistants questions. They convert the question into text, make sense of the question, and return the appropriate answer most of the time. 

In [2]:
url = "https://www.ibm.com/demos/live/speech-to-text/self-service/home"

IPython.display.IFrame(url, width=800, height=900)

- A number of speech recognition APIs are available.
- You can access them with Python. 
- A Python module called [`SpeechRecognition`](https://pypi.org/project/SpeechRecognition/) can let you access some of these APIs. 
    - CMU Sphinx (works offline)
    - Google Speech Recognition
    - Google Cloud Speech API
    - Wit.ai
    - Microsoft Bing Voice Recognition
    - Houndify API
    - IBM Speech to Text
    - Snowboy Hotword Detection (works offline)
- Usually, you have to pay some money if you want to use these APIs.     

### Speech recognition 

- You are given a sequence of sound waves and your job is to recognize the corresponding sequence of phonemes or words. 
- Phonemes: distinct units of sound. For example: 
    - tree $\rightarrow$ T R IY
    - cat $\rightarrow$ K AE T
    - stats $\rightarrow$ S T AE T S
    - eks $\rightarrow$ E K S     
- There are ~44 phonemes in North American English. 
- Is it possible to use the ML models we learned in 571, 573, 563 for this problem?
- In written text, we know that certain transitions are more likely than others
    - "th" as in "this"
    - "sh" as in "shoe"
    - "ch" as in "chair"
    - "ck" as in "back"
- Which transition do you think is easier and more natural/efficient/common for phonemes? 
    - /s/ to /t/: "stop", "best", "fast"
    - /t/ to /r/: "try", "tree", "train"
    - /f/ to /v/: "of value"
    - /s/ to /b/
    - In other words, is it easier to say "stop" or "of value"?
    
<br><br><br><br>

- Speech recognition is a sequence modeling problem. 
    - It's a good idea to incorporate sequential information in the model for speech recognition. 
- Many modern statistical speech recognition systems are based on hidden Markov models. 

> Note that the most recent speech recognition models use deep learning, but HMMs are still popular. They are particularly useful when the training data is small and interpretability is important. Also, it's useful to understand HMMs before moving on to deep learning models for sequence processing (e.g., RNNs). 

### What are HMMs? 

### Observable Markov models 

- Example
    - States: {uniformly, are, charming}   

![](img/observable_Markov.png)
<!-- <center> -->
<!-- <img src="img/observable_Markov.png" height="600" width="600"> -->
<!-- </center> -->

[Source](https://web.stanford.edu/~jurafsky/slp3/A.pdf)

### Hidden phenomenon 

Very often the things you observe in the real world can be thought of as a function of some other **hidden** variables.

Example 1: 
- Observations: Acoustic features of the speech signal, hidden states: phonemes that are spoken

Example 2: 
- Observations: Words, hidden states: parts-of-speech

![](img/hmm_pos_tagging.png)
<!-- <center> -->
<!-- <img src="img/hmm_pos_tagging.png" height="1000" width="1000"> -->
<!-- </center> -->


[Source](https://web.stanford.edu/~jurafsky/slp3/8.pdf)

More examples

- Observations: Encrypted symbols, hidden states: messages
- Observations: Exchange rates, hidden states: volatility of the market

<!-- ![](img/stock_market_hmm.png) -->

<br><br><br><br>

## HMM definition and example

- Last week we used the following toy example to demonstrate how we learn initial state probabilities and transition probabilities in Markov models. 
- Companies such as Facebook or Google can track many of our activities. Suppose they want to predict our mood so that they can send us certain ads. They cannot directly observe our mood, but they can predict it from our activities. So the mood is hidden here and the activities are observable.  

![](img/activity-seqs.png)
<!-- <img src="img/activity-seqs.png" height="800" width="800"> -->

### Markov process with hidden variables: Example

- Let's simplify the example above. 
- Suppose you have a little robot that is trying to estimate the posterior probability that you are **Happy (H or 🙂)** or **Sad (S or 😔)**, given that the robot has observed whether you are doing one of the following activities: 
    - **Learning data science (L or 📚)**
    - **Eat (E or 🍎)** 
    - **Cry (C or 😿)** 
    - **Social media (F)**

- The robot is trying to estimate the unknown (hidden) state $Q$, where $Q =H$ when you are happy (🙂) and $Q = S$ when you are sad (😔). 
- The robot is able to observe the activity you are doing: $O \in \{L, E, C, F\}$ 

(Attribution: Example adapted from [here](https://www.cs.ubc.ca/~nando/340-2012/lectures/l6.pdf).)

- Example questions we are interested in answering are:
    - What is the probability of observation sequence 📚📚😿📚📚?
    - What is the best possible sequence of states of mind (e.g., 🙂,🙂,😔,🙂,🙂) given an observation sequence (e.g., L,L,C,L,L or 📚📚😿📚📚)? 

### HMM ingredients

- State space (e.g., 🙂 (H), 😔 (S))
- An initial probability distribution over the states
- Transition probabilities
- **Emission probabilities** 
    - Conditional probabilities for all observations given a hidden state
    - Example: Below $P(L|🙂) = 0.7$ and $P(L|😔) = 0.1$
    
![](img/HMM_example.png)

<!-- <center> -->
<!-- <img src="img/HMM_example.png" height="600" width="600"> -->
<!-- </center> -->

### Definition of an HMM

- A hidden Markov model (HMM) is specified by the 5-tuple:  $\{S, Y, \pi, T, B\}$ 
    - $S = \{s_1, s_2, \dots, s_n\}$ is a set of states (e.g., moods)
    - **$Y = \{y_1, y_2, \dots, y_k\}$ is the output alphabet (e.g., set of activities)**
    - $\pi = \{\pi_1, \pi_2, \dots, \pi_n\}$ is the discrete initial state probability distribution 
    - Transition probability matrix $T$, where each $a_{ij}$ represents the probability of moving from state $s_i$ to state $s_j$
    - **Emission probabilities $B = \{b_i(o)\}, i \in S, o \in Y$**
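
As a rough sketch, the ingredients of the definition can be written down as NumPy arrays. The numbers are the ones that appear in this lecture's worked calculations for the mood example; the variable names (`states`, `symbols`, `pi`, `T`, `B`) and the row/column ordering are assumptions made purely for illustration.

```python
import numpy as np

states = ["H", "S"]             # hidden states S: Happy, Sad
symbols = ["L", "E", "C", "F"]  # output alphabet Y: Learn, Eat, Cry, Facebook

# Initial state distribution pi (one entry per state)
pi = np.array([0.8, 0.2])

# Transition matrix: T[i, j] = a_ij = P(next state j | current state i)
T = np.array([
    [0.7, 0.3],  # from H
    [0.4, 0.6],  # from S
])

# Emission matrix: B[i, k] = b_i(y_k) = P(symbol k | state i)
B = np.array([
    [0.7, 0.2, 0.1, 0.0],  # H
    [0.1, 0.1, 0.6, 0.2],  # S
])

# Sanity checks: every distribution must sum to 1
assert np.isclose(pi.sum(), 1.0)
assert np.allclose(T.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)

# Look up a single emission probability, here P(L | H)
p_learn_given_happy = B[states.index("H"), symbols.index("L")]
print(p_learn_given_happy)
```

Storing one row per hidden state makes each row of `T` and `B` a probability distribution, which is why the checks sum along `axis=1`.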
    
![](img/HMM_example.png)    

<!-- <center> -->
<!-- <img src="img/HMM_example.png" height="600" width="600"> -->
<!-- </center> -->

- Yielding the state sequence and the observation sequences in an unrolled HMM 
    - State sequence: $Q = \{q_0,q_1, q_2, \dots, q_T\}, q_i \in S$ 
    - Observation sequence: $O = \{o_0,o_1, o_2, \dots, o_T\}, o_i \in Y$
<!-- ![](img/HMM_unrolling_timesteps.png) -->

<!-- <center> -->
<!-- <img src="img/HMM_example.png" height="600" width="600"> -->
<!-- </center> -->

<!-- <center> -->
<!-- <img src="img/HMM_unrolling_timesteps.png" height="700" width="700"> -->

<!-- </center> -->

Here is an example of an unrolled HMM for six time steps, a possible realization of a sequence of states and a sequence of observations. 

![](img/HMM_unrolling_timesteps.png)

<!-- <center> -->
<!-- <img src="img/HMM_unrolling_timesteps.png" height="800" width="800"> -->
<!-- </center> -->

- Each state produces only a single observation and the sequence of hidden states and the sequence of observations have the same length. 

### HMM assumptions

- **The probability of a particular state only depends on the previous state.**
    * $P(q_i|q_0,q_1,\dots,q_{i-1})$ = $P(q_i|q_{i-1})$
    
- **The probability of an output observation $o_i$ depends only on the state that produces the observation and not on any other state or any other observation.** 
    * $P(o_i|q_0,q_1,\dots,q_i, o_0,o_1,\dots,o_{i-1})$ = $P(o_i|q_i)$

<!-- ![](img/HMM_unrolling_timesteps.png) -->


<br><br><br><br>

## ❓❓ Questions for you

### Exercise 3.1: Select all of the following statements which are **True** (iClicker)

- (A) Emission probabilities in our toy example give us the probabilities of being happy or sad given that you are performing one of the four activities: Learn, Eat, Cry, Facebook.  
- (B) In hidden Markov models, the observation at time step $t$ is conditionally independent of previous observations and previous hidden states given the hidden state at time $t$. 
- (C) In hidden Markov models, given the hidden state at time $t-1$, the hidden state at time step $t$ is conditionally independent of the previous hidden states and observations. 
- (D) In hidden Markov models, each hidden state has a probability distribution over all observations. 

<br><br><br><br>

```{admonition} Exercise 3.1: V's Solutions!
:class: tip, dropdown
- (A) False
- (B) True
- (C) True
- (D) True
```

### Exercise 3.2: Discuss the following questions with your neighbour. 
1. What are the parameters $\theta$ of a hidden Markov model?
2. Below is a hidden Markov model that relates numbers of ice creams eaten by Jason to the weather. Identify observations, hidden states, transition probabilities, and emission probabilities in the model.

![](img/ice-cream-hmm.png)

<!-- <img src="img/ice-cream-hmm.png" height="600" width="600"> -->

[Source](https://web.stanford.edu/~jurafsky/slp3/A.pdf)

```{admonition} Exercise 3.2: V's Solutions!
:class: tip, dropdown
1. Initial state probabilities $\pi_0$, transition probabilities $T$, emission probabilities $B$
2. 
- Observations: 1, 2, 3
- Hidden states: HOT, COLD 
- Transition probabilities: 

|      | HOT | COLD |
| ---- |:---:|:----:|
| HOT  | 0.6 | 0.4  |
| COLD | 0.5 | 0.5  |

- Emission probabilities: 

|      | 1   | 2   | 3   |
| ---- |:---:|:---:|:---:|
| HOT  | 0.2 | 0.4 | 0.4 |
| COLD | 0.5 | 0.4 | 0.1 |
```

### Three fundamental questions for an HMM

#### Likelihood
Given a model with parameters $\theta = \langle \pi, T, B \rangle$, how do we efficiently compute the likelihood of a particular observation sequence $O$?
#### Decoding
Given an observation sequence $O$ and a model $\theta$, how do we choose a state sequence $Q=\{q_0, q_1, \dots, q_T\}$ that best explains the observation sequence?
#### Learning
Training: Given a large observation sequence $O$ how do we choose the best parameters $\theta$ that explain the data $O$? 

<br><br><br><br>

### Break (~5 mins)

![](img/eva-coffee.png)

<br><br><br><br>

## Likelihood

In the context of HMMs, the likelihood of an observation sequence is the probability of observing that sequence given a particular set of model parameters $\theta$. 

Given a model with parameters $\theta = \langle \pi, T, B \rangle$, how do we efficiently compute the likelihood of a particular observation sequence $O$?

- Example: What's the probability of the sequence below? 

![](img/HMM_example_activity_seq.png)

<!-- <center> -->
<!-- <img src="img/HMM_example_activity_seq.png" height="400" width="400"> -->
<!-- </center> -->

- Recall that in HMMs, the observations are dependent upon the hidden states in the same time step. 
<br><br>

![](img/HMM_likelihood_known_hidden.png)
<!-- <center> -->
<!-- <img src="img/HMM_likelihood_known_hidden.png" height="500" width="500"> -->
<!-- </center> -->

### Probability of an observation sequence given the state sequence 

- Suppose we know both the sequence of hidden states (moods) and the sequence of activities emitted by them. 
- $P(O|Q) = \prod\limits_{i=1}^{T} P(o_i|q_i)$
- $P(E L F C|🙂 🙂 😔 😔) = P(E|🙂) \times P(L|🙂) \times P(F|😔) \times P(C|😔)$

### Joint probability of observations and a possible hidden sequence 

- Let's consider the joint probability of being in a particular state sequence $Q$ and generating a particular sequence $O$ of activities. 

<br>

![](img/HMM_likelihood_unknown_hidden.png)

<!-- <center> -->
<!-- <img src="img/HMM_likelihood_unknown_hidden.png" height="600" width="500"> -->
<!-- </center> -->

- $P(O,Q) = P(O|Q)\times P(Q) = \prod\limits_{i=1}^T P(o_i|q_i) \times \prod\limits_{i=1}^T P(q_i|q_{i-1})$ 

For example, for our toy sequence: 

\begin{equation}
\begin{split}
P(E L F C, 🙂 🙂 😔 😔) = & P(🙂|start)\\ 
                          & \times P(🙂|🙂) \times P(😔|🙂) \times P(😔|😔)\\
                          & \times P(E|🙂) \times P(L|🙂) \times P(F|😔) \times P(C|😔)\\
                      = & 0.8 \times 0.7 \times 0.3 \times 0.6 \times 0.2 \times 0.7 \times 0.2 \times 0.6 
\end{split}
\end{equation}
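
The product above can be checked numerically. The following is a minimal sketch, assuming the toy parameters used in this lecture's calculations; the helper name `joint_probability` and the single-letter encodings (`H`/`S` for the moods) are made up for illustration.

```python
import numpy as np

states = {"H": 0, "S": 1}                    # Happy, Sad
symbols = {"L": 0, "E": 1, "C": 2, "F": 3}   # Learn, Eat, Cry, Facebook

pi = np.array([0.8, 0.2])
T = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.7, 0.2, 0.1, 0.0],
              [0.1, 0.1, 0.6, 0.2]])


def joint_probability(obs, hidden):
    """P(O, Q) = P(q_1 | start) * prod P(q_i | q_{i-1}) * prod P(o_i | q_i)."""
    o = [symbols[x] for x in obs]
    q = [states[x] for x in hidden]
    # First time step: initial state probability times its emission
    prob = pi[q[0]] * B[q[0], o[0]]
    # Remaining steps: one transition and one emission each
    for t in range(1, len(o)):
        prob *= T[q[t - 1], q[t]] * B[q[t], o[t]]
    return prob


p = joint_probability("ELFC", "HHSS")
print(p)  # roughly 0.0017
```

Multiplying the eight factors in the equation by hand gives the same value, about $1.69 \times 10^{-3}$.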
<br>
![](img/HMM_likelihood_unknown_hidden.png)

<!-- <center> -->
<!-- <img src="img/HMM_likelihood_unknown_hidden.png" height="500" width="500"> -->
<!-- </center> -->

### Total probability of an observation sequence 

- But we do not know the hidden state sequence $Q$.
- We need to look at all combinations of hidden states. 
- We need to compute the probability of activity sequence (ELFC) by summing over all possible state (mood) sequences.  
- $P(O) = \sum\limits_Q P(O,Q) = \sum\limits_QP(O|Q)P(Q)$

\begin{equation}
\begin{split}
P(E L F C) = & P(E L F C,🙂🙂🙂🙂)\\ 
             & + P(E L F C,🙂🙂🙂😔)\\
             & + P(E L F C,🙂🙂😔😔) + \dots
\end{split}
\end{equation}

- Computationally inefficient 
    - For HMMs with $n$ hidden states and an observation sequence of $T$ observations, there are $n^T$ possible hidden sequences!!
    - In real-world problems both $n$ and $T$ are large numbers. 
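
To make the cost of the naive approach concrete, here is a small brute-force sketch that enumerates every hidden sequence for the toy example and adds up the joint probabilities. The parameters are the ones used in this lecture's calculations; the function name `brute_force_likelihood` is hypothetical.

```python
from itertools import product

import numpy as np

symbols = {"L": 0, "E": 1, "C": 2, "F": 3}   # Learn, Eat, Cry, Facebook

pi = np.array([0.8, 0.2])                    # order: Happy, Sad
T = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.7, 0.2, 0.1, 0.0],
              [0.1, 0.1, 0.6, 0.2]])


def brute_force_likelihood(obs):
    """Sum P(O, Q) over all n^T hidden state sequences Q."""
    o = [symbols[x] for x in obs]
    total, n_paths = 0.0, 0
    # product(...) yields every possible assignment of states to time steps
    for q in product(range(len(pi)), repeat=len(o)):
        prob = pi[q[0]] * B[q[0], o[0]]
        for t in range(1, len(o)):
            prob *= T[q[t - 1], q[t]] * B[q[t], o[t]]
        total += prob
        n_paths += 1
    return total, n_paths


likelihood, n_paths = brute_force_likelihood("ELFC")
print(n_paths)      # 2 states, 4 time steps: 2**4 = 16 paths
print(likelihood)   # roughly 0.0023
```

With only 16 paths this is instant, but the loop count grows as $n^T$, so the same code becomes unusable for even moderately long sequences.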

### How to compute $P(O)$ cleverly? 

- To avoid this complexity we use **dynamic programming**; we remember the results rather than recomputing them. 
- We make a **trellis** which is an array of states vs. time.
- Note the alternative paths in the trellis. We are covering all the 16 combinations of states. 
- We compute $\alpha_i(t)$ at each $(i,t)$, which represents the probability of being in state $i$ at time $t$ after seeing all previous observations and emitting the current observation at time step $t$. 

![](img/HMM_trellis.png)

<!-- <center> -->
<!-- <img src="img/HMM_trellis.png" height="400" width="400"> -->
<!-- </center> -->

### The forward procedure: intuition 

- To compute $\alpha_j(t)$, we can compute $\alpha_{i}(t-1)$ for all possible states $i$ and then use our knowledge of $a_{ij}$ and $b_j(o_t)$.
- We compute the trellis left-to-right because of the convention of time.
- Remember that $o_t$ is fixed and known.

### The forward procedure

Three steps of the forward procedure. 

- Initialization: Compute the $\alpha$ values for nodes in the first column of the trellis $(t = 0)$.
- Induction: Iteratively compute the $\alpha$ values for nodes in the rest of the trellis $(1 \leq t < T)$.
- Conclusion: Sum over the $\alpha$ values for nodes in the last column of the trellis $(t = T-1)$.
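
The three steps can be sketched in a few lines of NumPy. This is an illustrative implementation under the toy parameters used in this lecture, not the code behind the figures; the function name `forward` and the array layout (`alpha[t, i]` for time step `t` and state `i`) are assumptions.

```python
import numpy as np

symbols = {"L": 0, "E": 1, "C": 2, "F": 3}   # Learn, Eat, Cry, Facebook

pi = np.array([0.8, 0.2])                    # order: Happy, Sad
T = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.7, 0.2, 0.1, 0.0],
              [0.1, 0.1, 0.6, 0.2]])


def forward(obs):
    """Return the trellis of alpha values and the likelihood P(O)."""
    o = [symbols[x] for x in obs]
    alpha = np.zeros((len(o), len(pi)))
    # Initialization: first column of the trellis (t = 0)
    alpha[0] = pi * B[:, o[0]]
    # Induction: alpha_j(t) = sum_i alpha_i(t-1) * a_ij * b_j(o_t)
    for t in range(1, len(o)):
        alpha[t] = (alpha[t - 1] @ T) * B[:, o[t]]
    # Conclusion: sum over the last column
    return alpha, alpha[-1].sum()


alpha, likelihood = forward("ELFC")
print(alpha)        # one row per time step, columns: Happy, Sad
print(likelihood)   # roughly 0.0023
```

The matrix product `alpha[t - 1] @ T` performs the sum over previous states for every next state at once, so each time step costs on the order of $n^2$ multiplications.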

![](img/HMM_example_trellis.png)

<!-- <center> -->
<!-- <img src="img/HMM_example_trellis.png" height="700" width="700"> -->
<!-- </center> -->


### The forward procedure: Initialization $\alpha_🙂(0)$ and $\alpha_😔(0)$

- Compute the nodes in the first column of the trellis $(t = 0)$.
    * Probability of starting at state 🙂 and observing the activity E: $\alpha_🙂(0) = \pi_🙂 \times b_🙂(E) = 0.8 \times 0.2 = 0.16$ 
    * Probability of starting at state 😔 and observing the activity E: $\alpha_😔(0) = \pi_😔 \times b_😔(E) = 0.2 \times 0.1 = 0.02$  

![](img/HMM_example_trellis.png)

<!-- <center> -->
<!-- <img src="img/HMM_example_trellis.png" height="700" width="700"> -->
<!-- </center> -->


### The forward procedure: Induction

- Iteratively compute the nodes in the rest of the trellis $(1 \leq t < T)$.
-  To compute $\alpha_j(t+1)$ we can compute $\alpha_{i}(t)$ for all possible states $i$ and then use our knowledge of $a_{ij}$ and $b_j(o_{t+1})$ 
- $\alpha_j(t+1) = \sum\limits_{i=1}^n \alpha_i(t) a_{ij} b_j(o_{t+1})$

![](img/HMM_example_trellis.png)

<!-- <center> -->
<!-- <img src="img/HMM_example_trellis.png" height="700" width="700"> -->
<!-- </center> -->


### The forward procedure: Induction $\alpha_🙂(1)$

- $\alpha_j(t+1) = \sum\limits_{i=1}^n \alpha_i(t) a_{ij} b_j(o_{t+1})$

- Probability of being at state 🙂 at $t=1$ and observing the activity L

\begin{equation}
\begin{split}
\alpha_🙂(1) = & \alpha_🙂(0)a_{🙂🙂}b_🙂(L) + \alpha_😔(0)a_{😔🙂}b_🙂(L)\\
             = & 0.16 \times 0.7 \times 0.7 + 0.02 \times 0.4 \times 0.7\\ 
             = & 0.084\\
\end{split}
\end{equation}

![](img/HMM_example_trellis.png)

<!-- <center> -->
<!-- <img src="img/HMM_example_trellis.png" height="700" width="700"> -->
<!-- </center> -->


### The forward procedure: Induction $\alpha_😔(1)$

- $\alpha_j(t+1) = \sum\limits_{i=1}^n \alpha_i(t) a_{ij} b_j(o_{t+1})$
- Probability of being at state 😔 at $t=1$ and observing the activity L:
\begin{equation}
\begin{split}             
\alpha_😔(1) = & \alpha_🙂(0)a_{🙂😔}b_😔(L) + \alpha_😔(0)a_{😔😔}b_😔(L)\\
             = & 0.16 \times 0.3 \times 0.1 + 0.02 \times 0.6 \times 0.1\\
             = & 0.006\\
\end{split}
\end{equation}

![](img/HMM_example_trellis.png)

<!-- <center> -->
<!-- <img src="img/HMM_example_trellis.png" height="700" width="700"> -->
<!-- </center> -->

### The forward procedure: Induction $\alpha_🙂(2)$

- $\alpha_j(t+1) = \sum\limits_{i=1}^n \alpha_i(t) a_{ij} b_j(o_{t+1})$

- Probability of being at state 🙂 at $t=2$ and observing the activity F

\begin{equation}
\begin{split}
\alpha_🙂(2) = & \alpha_🙂(1)a_{🙂🙂}b_🙂(F) + \alpha_😔(1)a_{😔🙂}b_🙂(F)\\
             = & 0.084 \times 0.7 \times 0.0 + 0.006 \times 0.4 \times 0.0\\ 
             = & 0.0\\
\end{split}
\end{equation}

![](img/HMM_example_trellis.png)

<!-- <center> -->
<!-- <img src="img/HMM_example_trellis.png" height="700" width="700"> -->
<!-- </center> -->

### The forward procedure: Induction $\alpha_😔(2)$

- $\alpha_j(t+1) = \sum\limits_{i=1}^n \alpha_i(t) a_{ij} b_j(o_{t+1})$
- Probability of being at state 😔 at $t=2$ and observing the activity F:
\begin{equation}
\begin{split}             
\alpha_😔(2) = & \alpha_🙂(1)a_{🙂😔}b_😔(F) + \alpha_😔(1)a_{😔😔}b_😔(F)\\
             = & 0.084 \times 0.3 \times 0.2 + 0.006 \times 0.6 \times 0.2\\
             = & 0.00576\\
\end{split}
\end{equation}

<!-- ![](img/HMM_example_trellis.png) -->

<!-- <center> -->
<!-- <img src="img/HMM_example_trellis.png" height="700" width="700"> -->
<!-- </center> -->

### The forward procedure: Induction $\alpha_🙂(3)$ (Activity)

- $\alpha_j(t+1) = \sum\limits_{i=1}^n \alpha_i(t) a_{ij} b_j(o_{t+1})$

- Probability of being at state 🙂 at $t=3$ and observing the activity C:

\begin{equation}
\begin{split}
\alpha_🙂(3) = & \alpha_🙂(2)a_{🙂🙂}b_🙂(C) + \alpha_😔(2)a_{😔🙂}b_🙂(C)\\
             = & 0 \times 0.7 \times 0.1 + 0.00576 \times 0.4 \times 0.1\\ 
             = & 2.3 \times 10^{-4}\\
\end{split}
\end{equation}

![](img/HMM_example_trellis.png)

<!-- <center> -->
<!-- <img src="img/HMM_example_trellis.png" height="700" width="700"> -->
<!-- </center> -->

### The forward procedure: Induction $\alpha_😔(3)$ (Activity)

- $\alpha_j(t+1) = \sum\limits_{i=1}^n \alpha_i(t) a_{ij} b_j(o_{t+1})$
- Probability of being at state 😔 at $t=3$ and observing the activity C:
\begin{equation}
\begin{split}             
\alpha_😔(3) = & \alpha_🙂(2)a_{🙂😔}b_😔(C) + \alpha_😔(2)a_{😔😔}b_😔(C)\\
             = & 0.0 \times 0.3 \times 0.6 + 0.00576 \times 0.6 \times 0.6\\
             = & 2.07 \times 10^{-3}\\
\end{split}
\end{equation}

![](img/HMM_example_trellis.png)

<!-- <center> -->
<!-- <img src="img/HMM_example_trellis.png" height="700" width="700"> -->
<!-- </center> -->

### The forward procedure: Conclusion

- Sum over all possible final states:
  * $P(O;\theta) = \sum\limits_{i=1}^{n}\alpha_i(T-1)$
  * $P(E,L,F,C) = \alpha_🙂(3) + \alpha_😔(3) = 2.3 \times 10^{-4} + 2.07 \times 10^{-3} \approx 2.3 \times 10^{-3}$ 

![](img/HMM_example_trellis.png)

<!-- <center> -->
<!-- <img src="img/HMM_example_trellis.png" height="700" width="700"> -->
<!-- </center> -->


- The forward procedure using dynamic programming needs only $\approx 2n^2T$ multiplications compared to the $\approx(2T)n^T$ multiplications with the naive approach!! 
- Why? Discuss with your neighbour.  
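
To get a feel for the size of the gap, the two approximate counts quoted above can be tabulated for a few problem sizes. This is only arithmetic on those two formulas; the function names are made up for illustration.

```python
def naive_multiplications(n_states, n_steps):
    # Approximate count quoted above for the naive approach: (2T) * n^T
    return 2 * n_steps * n_states ** n_steps


def forward_multiplications(n_states, n_steps):
    # Approximate count quoted above for the forward procedure: 2 * n^2 * T
    return 2 * n_states ** 2 * n_steps


for n_states, n_steps in [(2, 4), (2, 20), (10, 20)]:
    naive = naive_multiplications(n_states, n_steps)
    fwd = forward_multiplications(n_states, n_steps)
    print(f"n={n_states:>2}, T={n_steps:>2}: naive={naive:.3e}, forward={fwd}")
```

For the toy example ($n = 2$, $T = 4$) the counts are 128 versus 32; the difference only becomes dramatic as $T$ grows.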

<br><br><br><br>

## Supervised training of HMMs

### (Optional) Generation with an HMM

- An HMM is a generative model and we can generate new sequences using an HMM
- $t = 0$
- Start in state $q_0$ = $s_i$ with probability $\pi_i$
- Emit observation symbol $o_0 = y_k$ with probability $b_i(o_0)$
- While (not forever): 
    * Go from state $q_t = s_i$ to state $q_{t+1} = s_j$ with probability $a_{ij}$
    * Emit observation symbol $o_{t+1} = y_k$ with probability $b_j(o_{t+1})$
    * $t = t + 1$  
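
The generation procedure above can be sketched as follows. The toy parameters are the ones used in this lecture's calculations; the function name `generate`, the fixed sequence length (standing in for "not forever"), and the seed are assumptions made for illustration.

```python
import numpy as np

states = ["H", "S"]             # Happy, Sad
symbols = ["L", "E", "C", "F"]  # Learn, Eat, Cry, Facebook

pi = np.array([0.8, 0.2])
T = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.7, 0.2, 0.1, 0.0],
              [0.1, 0.1, 0.6, 0.2]])


def generate(length, seed=0):
    """Sample a hidden state sequence and an observation sequence."""
    rng = np.random.default_rng(seed)
    # t = 0: draw the start state from pi, then emit from that state's row of B
    q = rng.choice(len(states), p=pi)
    o = rng.choice(len(symbols), p=B[q])
    hidden, observed = [states[q]], [symbols[o]]
    for _ in range(1, length):
        # Move to the next state using row q of T, then emit from it
        q = rng.choice(len(states), p=T[q])
        o = rng.choice(len(symbols), p=B[q])
        hidden.append(states[q])
        observed.append(symbols[o])
    return hidden, observed


hidden, observed = generate(6)
print(hidden)
print(observed)
```

Because $b_🙂(F) = 0$ in the toy model, a sampled sequence never pairs the Happy state with the Facebook activity.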
    
![](img/HMM_example.png)

<!-- <center> -->
<!-- <img src="img/HMM_example.png" height="500" width="500"> -->
<!-- </center> -->

### Supervised training of HMMs

- Suppose we have training data where we have $O$ and corresponding $Q$, then we can use MLE to learn parameters $\theta = \langle \pi, T, B \rangle$
- Get transition matrix and the emission probabilities. 
    - Suppose $i$, $j$ are unique states from the state space and $k$ is a unique observation.    
    - $\pi_0(i) = P(q_0 = i) = \frac{Count(q_0 = i)}{\#sequences}$
    - $a_{ij} = P(q_{t+1} = j|q_t = i) = \frac{Count(i,j)}{Count(i, anything)}$
    - $b_i(k) = P(o_{t} = k|q_t = i) = \frac{Count(i,k)}{Count(i, anything)}$

![](img/HMM_unrolling_timesteps.png)

<!-- <center> -->
<!-- <img src="img/HMM_unrolling_timesteps.png" height="700" width="700"> -->
<!-- </center> -->

- Suppose we have training data where we have $O$ and corresponding $Q$, then we can use MLE to learn parameters $\theta = \langle \pi, T, B \rangle$
    - Count how often $q_{i-1}$ and $q_i$ occur together normalized by how often $q_{i-1}$ occurs with anything: 
      $p(q_i|q_{i-1}) = \frac{Count(q_{i-1} q_i)}{Count(q_{i-1} \text{anything})}$
    - Count how often $q_i$ is associated with the observation $o_i$.   
      $p(o_i|q_{i}) = \frac{Count(o_i \wedge q_i)}{Count(q_{i} \text{anything})}$    

<!-- ![](img/HMM_unrolling_timesteps.png) -->

<center>
<img src="img/HMM_unrolling_timesteps.png" height="700" width="700">
</center>

**In real life, all the calculations above are done with log probabilities for numerical stability.** 

### HMM supervised training demo

Part-of-speech tagging task

- Given a text assign part-of-speech tags to the words in the text.

- Input sentence: 
<blockquote>
    MDS students are hard-working .
</blockquote>    

- POS-tagged sentence: 
<blockquote>
    MDS/<span style="color:green">PROPER_NOUN</span> students/<span style="color:green">NOUN</span> are/<span style="color:green">VERB</span> hard-working/<span style="color:green">ADJECTIVE</span> ./<span style="color:green">PUNCTUATION</span>
</blockquote>    


In [3]:
words = ["book", "that", "flight", "like", "I", "."]
POS = ["Noun", "Verb", "Punct", "Pron"]

In [4]:
corpus = [
    [("book", "Verb"), ("that", "Pron"), ("flight", "Noun"), (".", "Punct")],
    [
        ("I", "Pron"),
        ("like", "Verb"),
        ("that", "Pron"),
        ("book", "Noun"),
        (".", "Punct"),
    ],
    [("book", "Verb"), ("flight", "Noun"), (".", "Punct")],
    [("book", "Verb"), ("like", "Noun"), ("flight", "Noun")],
    [("I", "Pron"), ("book", "Verb"), ("flight", "Noun"), (".", "Punct")],
    [
        ("I", "Pron"),
        ("like", "Verb"),
        ("that", "Pron"),
        ("book", "Noun"),
        (".", "Punct"),
    ],
]

The syntax is a bit weird. This is just for demonstration purposes. You're unlikely to use this when you carry out POS tagging.

In [5]:
trainer = HiddenMarkovModelTrainer(POS, words)
hmm = trainer.train_supervised(
    corpus,
)

In [6]:
hmm._create_cache()
P, O, X, S = hmm._cache

### From the documentation: 

The cache is a tuple (P, O, X, S) where:

- S maps symbols to integers, i.e., it is the inverse mapping from `self._symbols`; for each symbol `s` in `self._symbols`, the following is true: `self._symbols[S[s]] == s`
- O is the log output probabilities: `O[i,k] = log( P(token[t]=sym[k]|tag[t]=state[i]) )`
- X is the log transition probabilities: `X[i,j] = log( P(tag[t]=state[j]|tag[t-1]=state[i]) )`
- P is the log prior probabilities: `P[i] = log( P(tag[0]=state[i]) )`


- Mapping from the observations (symbols) to integers. 

In [7]:
S

{'book': 0, 'that': 1, 'flight': 2, 'like': 3, 'I': 4, '.': 5}

#### HMM states

In [8]:
hmm._states

['Noun', 'Verb', 'Punct', 'Pron']

#### Log prior probabilities 
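- $\pi_0$ for all states (the values are base-2 logs, so $-1.0$ corresponds to a probability of $0.5$ and `-inf` to a probability of $0$) 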

In [9]:
pd.DataFrame(P, index=hmm._states, columns=["pi_0"])

|       | pi_0 |
| ----- | ----:|
| Noun  | -inf |
| Verb  | -1.0 |
| Punct | -inf |
| Pron  | -1.0 |


#### Log output probabilities

- log(P(observation | tag)) for all observations and tags. 

In [10]:
pd.DataFrame(O, index=hmm._states, columns=S.keys())

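|       | book      | that | flight    | like      | I    | .    |
| ----- | ---------:| ----:| ---------:| ---------:| ----:| ----:|
| Noun  | -1.807355 | -inf | -0.807355 | -2.807355 | -inf | -inf |
| Verb  | -0.584962 | -inf | -inf      | -1.584962 | -inf | -inf |
| Punct | -inf      | -inf | -inf      | -inf      | -inf | 0.0  |
| Pron  | -inf      | -1.0 | -inf      | -inf      | -1.0 | -inf |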


#### Log transition probabilities 

- Transition matrix 

In [11]:
pd.DataFrame(X, index=hmm._states, columns=hmm._states)

|       | Noun      | Verb | Punct     | Pron |
| ----- | ---------:| ----:| ---------:| ----:|
| Noun  | -2.584963 | -inf | -0.263034 | -inf |
| Verb  | -1.0      | -inf | -inf      | -1.0 |
| Punct | -inf      | -inf | -inf      | -inf |
| Pron  | -1.0      | -1.0 | -inf      | -inf |
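
The numbers in these tables can be reproduced by hand with the counting formulas for supervised training. The sketch below restates the toy corpus and uses plain counters; the helper names (`pi_hat`, `a_hat`, `b_hat`) are made up, and base-2 logs are used because that is what the printed values correspond to.

```python
from collections import Counter
from math import log2

corpus = [
    [("book", "Verb"), ("that", "Pron"), ("flight", "Noun"), (".", "Punct")],
    [("I", "Pron"), ("like", "Verb"), ("that", "Pron"), ("book", "Noun"), (".", "Punct")],
    [("book", "Verb"), ("flight", "Noun"), (".", "Punct")],
    [("book", "Verb"), ("like", "Noun"), ("flight", "Noun")],
    [("I", "Pron"), ("book", "Verb"), ("flight", "Noun"), (".", "Punct")],
    [("I", "Pron"), ("like", "Verb"), ("that", "Pron"), ("book", "Noun"), (".", "Punct")],
]

init_counts = Counter()    # Count(q_0 = i)
trans_counts = Counter()   # Count(i, j)
from_counts = Counter()    # Count(i, anything) for transitions
emit_counts = Counter()    # Count(i, k)
tag_counts = Counter()     # Count(i, anything) for emissions

for sentence in corpus:
    tags = [tag for _, tag in sentence]
    init_counts[tags[0]] += 1
    for word, tag in sentence:
        emit_counts[(tag, word)] += 1
        tag_counts[tag] += 1
    # Consecutive tag pairs; the last tag of a sentence has no successor
    for prev_tag, next_tag in zip(tags, tags[1:]):
        trans_counts[(prev_tag, next_tag)] += 1
        from_counts[prev_tag] += 1


def pi_hat(i):
    return init_counts[i] / len(corpus)


def a_hat(i, j):
    return trans_counts[(i, j)] / from_counts[i]


def b_hat(i, k):
    return emit_counts[(i, k)] / tag_counts[i]


print(log2(pi_hat("Verb")))          # compare with pi_0 for Verb
print(log2(a_hat("Noun", "Punct")))  # compare with the Noun -> Punct entry
print(log2(b_hat("Noun", "flight"))) # compare with the Noun / flight entry
```

Note that the transition denominator counts only the occurrences of a tag that are followed by another tag (6 for Noun), while the emission denominator counts all occurrences (7 for Noun).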


### Tagging a sentence 

In [12]:
hmm.tag(["book", "flight", "."])

[('book', 'Verb'), ('flight', 'Noun'), ('.', 'Punct')]

We'll see in the next lecture the algorithm used for such tagging. 

### Let's try it out on a bigger dataset

- You don't have to understand the code. 

In [13]:
import sys
sys.path.append("code/.")
from hmm_pos_demo import *

In [14]:
import nltk
# nltk.download('brown')

In [15]:
hmm = demo_pos_supervised()


HMM POS tagging demo

Training HMM...
Testing...
Test: the/AT fulton/NP county/NN grand/JJ jury/NN said/VBD friday/NR an/AT investigation/NN of/IN atlanta's/NP$ recent/JJ primary/NN election/NN produced/VBD ``/`` no/AT evidence/NN ''/'' that/CS any/DTI irregularities/NNS took/VBD place/NN ./.

Untagged: the fulton county grand jury said friday an investigation of atlanta's recent primary election produced `` no evidence '' that any irregularities took place .

HMM-tagged: the/AT fulton/NP county/NN grand/JJ jury/NN said/VBD friday/NR an/AT investigation/NN of/IN atlanta's/NP$ recent/JJ primary/NN election/NN produced/VBD ``/`` no/AT evidence/NN ''/'' that/CS any/DTI irregularities/NNS took/VBD place/NN ./.

Entropy: 18.733173970451787

------------------------------------------------------------
Test: the/AT jury/NN further/RBR said/VBD in/IN term-end/NN presentments/NNS that/CS the/AT city/NN executive/JJ committee/NN ,/, which/WDT had/HVD over-all/JJ charge/NN of/IN the/AT election/

### Explanation of the output

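- What do these tags (e.g., NN, AT, IN, NNS) mean? Where do they come from?
    - The demo is trained on the Brown corpus (hence the `nltk.download('brown')` line above), so the tags come from the Brown tagset. Tags such as AT (article), NP$ (possessive proper noun), and HVD (*had*) are Brown tags and do not exist in the Penn Treebank tagset. 
    - For comparison, the widely used tagset of [the Penn Treebank Project](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) consists of 36 POS tags to label different parts of speech of words in English. 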
- Entropy is a common metric used to measure the degree of uncertainty or ambiguity in the tagging process. 
    - Lower entropy $\rightarrow$ the tagger is relatively certain about the tags
    - High entropy $\rightarrow$ the tagger is less certain about the tags    

### Let's try it out on a new unseen sentence

In [16]:
hmm.tag(["keep", "the", "book", "on", "the", "table", "."])

[('keep', 'VB'),
 ('the', 'AT'),
 ('book', 'NN'),
 ('on', 'IN'),
 ('the', 'AT'),
 ('table', 'NN'),
 ('.', '.')]

### Other libraries 

Some other libraries 
- [hmmlearn](https://hmmlearn.readthedocs.io/en/latest/)
- [pomegranate](https://github.com/jmschrei/pomegranate)
> Note that there are not many actively maintained off-the-shelf libraries available for supervised training of HMMs. [seqlearn](https://pypi.org/project/seqlearn/) used to be part of `sklearn`. But it's separated now and is not being maintained. 

### Why not use traditional ML models? 

- We could extract features and treat it as a multi-class classification problem of predicting POS for each word. Some example features could be: 
    - Whether the word ends with an "ing" (for verbs)
    - What's the previous word?         
    - Or whether the word occurs at the beginning or end of a sentence  
- But coming up with such features is time consuming and limited. It can get unwieldy quite quickly and it leads to fragile and overfit models.     
- HMMs provide a much more elegant way to model sequences, and they are often a preferred way to do so.  

<br><br><br><br>

## ❓❓ Questions for you

### Exercise 3.3: Discuss the following question with your neighbour. 

- Give an advantage of using the forward procedure compared to summing over all possible state combinations of length T. 

```{admonition} Exercise 3.3: V's Solutions!
:class: tip, dropdown
The forward procedure is computationally efficient compared to summing over all possible state combinations of length $T$. The forward procedure requires about $2n^2T$ multiplications, compared to about $2Tn^T$ multiplications for the naive summation, where $n$ is the number of states and $T$ is the number of time steps.

```

<br><br><br><br>

## Quick summary

### Summary

- Hidden Markov models (HMMs) model time-series with latent factors.
- There are tons of applications associated with them and they are more realistic than Markov models. 
- The most successful application of HMMs is speech recognition. 


### Important ideas we learned 

- HMM ingredients
    - Hidden states (e.g., Happy, Sad)
    - Output alphabet or output symbols (e.g., learn, eat, cry, facebook)
    - Discrete initial state probability distribution
    - Transition probabilities
    - Emission probabilities    

![](img/HMM_example.png)

<!-- <center> -->
<!-- <img src="img/HMM_example.png" height="600" width="600"> -->
<!-- </center> -->

### Fundamental questions for HMMs 

- Three fundamental questions for HMMs: 
    - likelihood
    - decoding
    - parameter learning 
- The forward algorithm is a dynamic programming algorithm to efficiently calculate the probability of an observation sequence given an HMM. 

### Supervised training of HMMs 
- HMMs for POS tagging.
- Not many tools out there for supervised training of HMMs. 

### Coming up

- Decoding: Viterbi algorithm
    - Given an HMM and an observation sequence, how do we efficiently compute the most likely hidden state sequence?
- Unsupervised training of HMMs (Optional)

<br><br>

### Resources

- [Hidden Markov Models chapter from Jurafsky and Martin](https://web.stanford.edu/~jurafsky/slp3/A.pdf)
- Attribution: Many presentation ideas in this notebook are taken from [Frank Rudzicz's slides](http://www.cs.toronto.edu/~frank/csc401/lectures2018/5-HMMs.pdf).
- [Jason Eisner's lecture on hidden Markov Models](https://vimeo.com/31374528)
- [Jason Eisner's interactive spreadsheet for HMMs](https://cs.jhu.edu/~jason/papers/eisner.hmm.xls)
- [Who each player is guarding?](https://www.youtube.com/watch?v=JvNkZdZJBt4)
- [The Viterbi Algorithm: A Personal History](https://arxiv.org/pdf/cs/0504020v2.pdf)
- [A nice demo of independent vs. Markov vs. HMMs for DNA](https://a-little-book-of-r-for-bioinformatics.readthedocs.io/en/latest/src/chapter10.html)

<br><br><br><br>