# Hidden Markov Models

## Introduction

This section covers probabilistic models for sequences of observations $X _ { 1 } , \dots , X _ { T }$ of arbitrary length $T$. We focus on the case where the observations occur at discrete "time steps".

## Markov models

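Assuming the first-order Markov property (the future is independent of the past given the present), the joint distribution is: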
$$p \left( X _ { 1 : T } \right) = p \left( X _ { 1 } \right) p \left( X _ { 2 } | X _ { 1 } \right) p \left( X _ { 3 } | X _ { 2 } \right) \ldots = p \left( X _ { 1 } \right) \prod _ { t = 2 } ^ { T } p \left( X _ { t } | X _ { t - 1 } \right)$$

If we assume the transition function $p \left( X _ { t } | X _ { t - 1 } \right)$ is independent of time, then the chain is called homogeneous, stationary or time-invariant. This is an example of parameter sharing (tying): the same parameters are reused at every time step, which allows us to model an arbitrary number of variables using a fixed number of parameters. Such models are called stochastic processes.

If the observed variables are discrete, $X _ { t } \in \{ 1 , \ldots , K \}$, this is called a discrete-state or finite-state Markov chain.

### Transition Matrix:
The conditional distribution $p \left( X _ { t } | X _ { t - 1 } \right)$ can be written as a $K \times K$ matrix, known as the transition matrix $A$, where $A _ { i j } = p \left( X _ { t } =j | X _ { t - 1 } = i \right)$ is the probability of going from state $i$ to state $j$. Each row of the matrix sums to one, $\sum _ { j } A _ { i j } = 1$, so $A$ is called a stochastic matrix.

The $n$-step transition matrix $A(n)$ is defined as:
$$A _ { i j } ( n ) \triangleq p \left( X _ { t + n } = j | X _ { t } = i \right)$$

which is the probability of getting from $i$ to $j$ in exactly $n$ steps. 

The Chapman-Kolmogorov equations: 

$$A _ { i j } ( m + n ) = \sum _ { k = 1 } ^ { K } A _ { i k } ( m ) A _ { k j } ( n )$$

The probability of getting from $i$ to $j$ in $m + n$ steps is the probability of getting from $i$ to $k$ in $m$ steps and then from $k$ to $j$ in $n$ steps, summed over all $k$. In matrix form:

$$\mathbf { A } ( m + n ) = \mathbf { A } ( m ) \mathbf { A } ( n )$$

Hence: $$\mathbf { A } ( n ) = \mathbf { A } \mathbf { A } ( n - 1 ) = \mathbf { A } \mathbf { A } \mathbf { A } ( n - 2 ) = \cdots = \mathbf { A } ^ { n }$$
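
The identities above can be checked numerically. The following is a minimal NumPy sketch, assuming a hypothetical 3-state chain (the matrix values and the helper name `n_step` are made up for illustration):

```python
import numpy as np

# Hypothetical 3-state transition matrix; each row sums to one.
A = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.0, 0.3, 0.7],
])

def n_step(A, n):
    """Return A(n) = A^n, the n-step transition matrix."""
    return np.linalg.matrix_power(A, n)

# Chapman-Kolmogorov: A(m + n) = A(m) A(n)
m, n = 2, 3
lhs = n_step(A, m + n)
rhs = n_step(A, m) @ n_step(A, n)
ck_holds = np.allclose(lhs, rhs)

# A(n) is still a stochastic matrix: every row sums to one.
rows_sum_to_one = np.allclose(lhs.sum(axis=1), 1.0)
```

Entry `lhs[i, j]` is then the probability of moving from state `i` to state `j` in exactly five steps.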




## Hidden Markov Models
An HMM consists of a discrete-time, discrete-state Markov chain with hidden states $z _ { t } \in \{ 1 , \ldots , K \}$, plus an observation model $p \left( \mathbf { x } _ { t } | z _ { t } \right)$.

Joint distribution:
$$p \left( \mathbf { z } _ { 1 : T } , \mathbf { x } _ { 1 : T } \right) = p \left( \mathbf { z } _ { 1 : T } \right) p \left( \mathbf { x } _ { 1 : T } | \mathbf { z } _ { 1 : T } \right) = \left[ p \left( z _ { 1 } \right) \prod _ { t = 2 } ^ { T } p \left( z _ { t } | z _ { t - 1 } \right) \right] \left[ \prod _ { t = 1 } ^ { T } p \left( \mathbf { x } _ { t } | z _ { t } \right) \right]$$

Observations can be discrete or continuous. 
+ If they are discrete, observation matrix: $p \left( \mathbf { x } _ { t } = l | z _ { t } = k , \boldsymbol { \theta } \right) = B ( k , l )$
+ If they are continuous, conditional Gaussian: $p \left( \mathbf { x } _ { t } | z _ { t } = k , \boldsymbol { \theta } \right) = \mathcal { N } \left( \mathbf { x } _ { t } | \boldsymbol { \mu } _ { k } , \mathbf { \Sigma } _ { k } \right)$
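
To make the factorised joint distribution concrete, here is a small sketch for the discrete-observation case. The parameter values, the initial distribution `pi`, and the helper names `sample_hmm` and `joint_log_prob` are assumptions for illustration only, and states and symbols are indexed from 0 rather than 1:

```python
import numpy as np

# Hypothetical 2-state HMM with 3 discrete observation symbols.
pi = np.array([0.6, 0.4])            # pi[k] = p(z_1 = k)
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])           # A[i, j] = p(z_t = j | z_{t-1} = i)
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])      # B[k, l] = p(x_t = l | z_t = k)

def sample_hmm(pi, A, B, T, rng):
    """Ancestral sampling: draw z_1, then alternate transitions and emissions."""
    z = np.empty(T, dtype=int)
    x = np.empty(T, dtype=int)
    z[0] = rng.choice(len(pi), p=pi)
    x[0] = rng.choice(B.shape[1], p=B[z[0]])
    for t in range(1, T):
        z[t] = rng.choice(A.shape[1], p=A[z[t - 1]])
        x[t] = rng.choice(B.shape[1], p=B[z[t]])
    return z, x

def joint_log_prob(z, x, pi, A, B):
    """log p(z_{1:T}, x_{1:T}) using the factorisation of the joint."""
    lp = np.log(pi[z[0]]) + np.log(B[z[0], x[0]])
    for t in range(1, len(z)):
        lp += np.log(A[z[t - 1], z[t]]) + np.log(B[z[t], x[t]])
    return lp

rng = np.random.default_rng(0)
z, x = sample_hmm(pi, A, B, T=5, rng=rng)
lp = joint_log_prob(z, x, pi, A, B)
```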

## Inference in HMM
Different kinds of inference:
+ Filtering: to compute the belief state $p \left( z _ { t } | \mathbf { x } _ { 1 : t } \right)$ online. 

+ Smoothing: to compute $p \left( z _ { t } | \mathbf { x } _ { 1 : T } \right)$ offline

+ Fixed lag smoothing: to compute $p \left( z _ { t - \ell } | \mathbf { x } _ { 1 : t } \right)$, where $\ell > 0$ is called the lag

+ Prediction: to compute $p \left( z _ { t + h } | \mathbf { x } _ { 1 : t } \right)$, where $h > 0$ is called prediction horizon.
    $$p \left( z _ { t + 2 } | \mathbf { x } _ { 1 : t } \right) = \sum _ { z _ { t + 1 } } \sum _ { z _ { t } } p \left( z _ { t + 2 } | z _ { t + 1 } \right) p \left( z _ { t + 1 } | z _ { t } \right) p \left( z _ { t } | \mathbf { x } _ { 1 : t } \right)$$
    Prediction about future observation: $$p \left( \mathbf { x } _ { t + h } | \mathbf { x } _ { 1 : t } \right) = \sum _ { z _ { t + h } } p \left( \mathbf { x } _ { t + h } | z _ { t + h } \right) p \left( z _ { t + h } | \mathbf { x } _ { 1 : t } \right)$$
    
+ MAP estimation: to compute $\arg \max _ { \mathbf { z } _ { 1 : T } } p \left( \mathbf { z } _ { 1 : T } | \mathbf { x } _ { 1 : T } \right)$, which is a most probable state sequence. This is known as **Viterbi decoding**.

+ Posterior samples: $\mathbf { z } _ { 1 : T } \sim p \left( \mathbf { z } _ { 1 : T } | \mathbf { x } _ { 1 : T } \right)$

+ Probability of the evidence: $p \left( \mathbf { x } _ { 1 : T } \right)$, computed by summing over all hidden paths, $p \left( \mathbf { x } _ { 1 : T } \right) = \sum _ { \mathbf { z } _ { 1 : T } } p \left( \mathbf { z } _ { 1 : T } , \mathbf { x } _ { 1 : T } \right)$
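
As an illustration of the prediction task in the list above, the sketch below pushes a belief state through the transition matrix $h$ times and compares the result with the explicit double sum for $h = 2$. The parameter values, the starting belief, and the helper names `predict_state` and `predict_obs` are assumptions for illustration:

```python
import numpy as np

# Hypothetical 2-state HMM; values are illustrative only.
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])           # A[i, j] = p(z_{t+1} = j | z_t = i)
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])      # B[k, l] = p(x_t = l | z_t = k)
belief = np.array([0.8, 0.2])        # assumed filtered belief p(z_t | x_{1:t})

def predict_state(belief, A, h):
    """p(z_{t+h} | x_{1:t}): push the belief through h transitions."""
    return belief @ np.linalg.matrix_power(A, h)

def predict_obs(belief, A, B, h):
    """p(x_{t+h} | x_{1:t}): sum out the predicted hidden state."""
    return predict_state(belief, A, h) @ B

# Explicit double sum over z_t and z_{t+1} for h = 2.
K = len(belief)
explicit = np.zeros(K)
for z2 in range(K):
    for z1 in range(K):
        for z0 in range(K):
            explicit[z2] += A[z1, z2] * A[z0, z1] * belief[z0]

state_pred = predict_state(belief, A, 2)
obs_pred = predict_obs(belief, A, B, 2)
```

The matrix-power form and the explicit sum agree because summing out the intermediate states is exactly a matrix product.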

![](../images/17.HMM.png)

### The forwards algorithm:
Recursively compute the filtered marginals: $p \left( z _ { t } | \mathbf { x } _ { 1 : t } \right)$

We can also compute the log probability of the evidence:
$$\log p \left( \mathbf { x } _ { 1 : T } | \boldsymbol { \theta } \right) = \sum _ { t = 1 } ^ { T } \log p \left( \mathbf { x } _ { t } | \mathbf { x } _ { 1 : t - 1 } \right) = \sum _ { t = 1 } ^ { T } \log Z _ { t }$$

![](../images/17.Forward2.png)

where $\alpha_t = p \left( z _ { t } | \mathbf { x } _ { 1 : t } \right)$ is the filtered belief state at time $t$, a vector of $K$ numbers, and $Z _ { t } \triangleq p \left( \mathbf { x } _ { t } | \mathbf { x } _ { 1 : t - 1 } \right)$ is the normalization constant.
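
A minimal sketch of the forwards recursion for discrete observations, written as a predict-update-normalise loop. The function name `forwards`, the initial distribution `pi`, and all parameter values are assumptions for illustration; states and symbols are indexed from 0:

```python
import numpy as np

def forwards(pi, A, B, x):
    """Return alpha[t, j] = p(z_t = j | x_{1:t}) and log p(x_{1:T})."""
    T, K = len(x), len(pi)
    alpha = np.zeros((T, K))
    log_Z = np.zeros(T)
    for t in range(T):
        # One-step-ahead predictive p(z_t | x_{1:t-1}); this is the prior pi at the first step.
        pred = pi if t == 0 else alpha[t - 1] @ A
        unnorm = B[:, x[t]] * pred       # multiply by p(x_t | z_t = j)
        Z_t = unnorm.sum()               # Z_t = p(x_t | x_{1:t-1})
        alpha[t] = unnorm / Z_t
        log_Z[t] = np.log(Z_t)
    return alpha, log_Z.sum()

# Hypothetical 2-state HMM with 3 observation symbols.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])
x = [0, 2, 1]

alpha, log_evidence = forwards(pi, A, B, x)
```

Normalising at every step keeps the numbers in a safe range for long sequences, and the accumulated $\log Z_t$ terms give the log evidence without ever forming the tiny product directly.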

### The forwards-backwards algorithm

Compute the smoothed marginals: $p \left( z _ { t } = j | \mathbf { x } _ { 1 : T } \right)$ 

Let: 
+ $\alpha _ { t } ( j ) \triangleq p \left( z _ { t } = j | \mathbf { x } _ { 1 : t } \right)$ be the filtered belief state
+ $\beta _ { t } ( j ) \triangleq p \left( \mathbf { x } _ { t + 1 : T } | z _ { t } = j \right)$ be the conditional likelihood of future evidence given that the hidden state at time $t$ is $j$. 
+ $\psi(i,j) \triangleq A _ { i j } = p \left( z _ { t } = j | z _ { t - 1 } = i \right)$ be the transition matrix
+ $\phi_t(j) \triangleq p \left( \mathbf { x } _ { t } | z _ { t } = j \right)$ be the local evidence at time $t$ (for discrete observations, $\phi_t(j) = B(j, \mathbf { x } _ { t })$)
+ $\gamma _ { t } ( j ) \triangleq p \left( z _ { t } = j | \mathbf { x } _ { 1 : T } \right)$ be the desired smoothed posterior marginals.

    Since: $$p \left( z _ { t } = j | \mathbf { x } _ { 1 : T } \right) \propto p \left( z _ { t } = j , \mathbf { x } _ { t + 1 : T } | \mathbf { x } _ { 1 : t } \right) = p \left( z _ { t } = j | \mathbf { x } _ { 1 : t } \right) p \left( \mathbf { x } _ { t + 1 : T } | z _ { t } = j , \mathbf { x } _ { 1 : t } \right)$$
    and, given $z_t$, the future observations are independent of the past ones, the last factor equals $p \left( \mathbf { x } _ { t + 1 : T } | z _ { t } = j \right) = \beta _ { t } ( j )$,

    so we have: $$\gamma _ { t } ( j ) \propto \alpha _ { t } ( j ) \beta _ { t } ( j )$$

    We compute the $\alpha$'s in a left-to-right fashion with the forwards algorithm, and the $\beta$'s in a right-to-left fashion. If we have already computed $\beta_t$, we can compute $\beta_{t-1}$ as follows:
    $$\beta _ { t - 1 } ( i ) = \sum _ { j } \psi ( i , j ) \phi _ { t } ( j ) \beta _ { t } ( j ) , \quad \text{i.e.} \quad \boldsymbol { \beta } _ { t - 1 } = \boldsymbol { \Psi } \left( \boldsymbol { \phi } _ { t } \odot \boldsymbol { \beta } _ { t } \right)$$

    with the base case: $$\beta _ { T } ( i ) = p \left( \mathbf { x } _ { T + 1 : T } | z _ { T } = i \right) = p ( \emptyset | z _ { T } = i ) = 1$$ 
    which is the probability of a non-event.

+ $\xi _ { t , t + 1 } ( i , j ) \triangleq  p \left( z _ { t } = i , z _ { t + 1 } = j | \mathbf { x } _ { 1 : T } \right)$ be the two-slice smoothed marginals:
    $$\xi _ { t , t + 1 } ( i , j ) \propto \alpha _ { t } ( i ) \phi _ { t + 1 } ( j ) \beta _ { t + 1 } ( j ) \psi ( i , j )$$
    Or in matrix form: $$\boldsymbol { \xi } _ { t , t + 1 } \propto \mathbf { \Psi } \odot \left( \boldsymbol { \alpha } _ { t } \left( \boldsymbol { \phi } _ { t + 1 } \odot \boldsymbol { \beta } _ { t + 1 } \right) ^ { T } \right)$$
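
The smoothing recursions can be sketched as follows for discrete observations, with `A` the transition matrix and `B` the observation matrix. The function name, the initial distribution `pi`, and the parameter values are assumptions for illustration. The sketch also rescales each $\beta_t$ to sum to one, which is a numerical convenience and not part of the definition; this is harmless because $\gamma_t$ and $\xi_{t,t+1}$ only need $\beta_t$ up to a constant.

```python
import numpy as np

def forwards_backwards(pi, A, B, x):
    """Return smoothed marginals gamma[t, j] and two-slice marginals xi[t, i, j]."""
    T, K = len(x), len(pi)
    ev = B[:, x].T                       # ev[t, j] = p(x_t | z_t = j)

    # Forwards pass: filtered marginals.
    alpha = np.zeros((T, K))
    for t in range(T):
        pred = pi if t == 0 else alpha[t - 1] @ A
        alpha[t] = ev[t] * pred
        alpha[t] /= alpha[t].sum()

    # Backwards pass: base case is a vector of ones at the final step.
    beta = np.ones((T, K))
    for t in range(T - 1, 0, -1):
        beta[t - 1] = A @ (ev[t] * beta[t])
        beta[t - 1] /= beta[t - 1].sum()   # rescale only; see note above

    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)

    xi = np.zeros((T - 1, K, K))
    for t in range(T - 1):
        xi[t] = A * np.outer(alpha[t], ev[t + 1] * beta[t + 1])
        xi[t] /= xi[t].sum()
    return gamma, xi

# Hypothetical 2-state HMM with 3 observation symbols.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])
x = [0, 2, 1]

gamma, xi = forwards_backwards(pi, A, B, x)
```

A useful sanity check is that summing `xi[t]` over its second index recovers `gamma[t]`, and summing over its first index recovers `gamma[t + 1]`.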
    
### The Viterbi algorithm: 
Compute: $\mathbf { z } ^ { * } = \arg \max _ { \mathbf { z } _ { 1 : T } } p \left( \mathbf { z } _ { 1 : T } | \mathbf { x } _ { 1 : T } \right)$

Trellis diagram: the weight of the path $z_1, z_2, \ldots, z_T$ is given by:
$$\log \pi _ { 1 } \left( z _ { 1 } \right) + \log \phi _ { 1 } \left( z _ { 1 } \right) + \sum _ { t = 2 } ^ { T } \left[ \log \psi \left( z _ { t - 1 } , z _ { t } \right) + \log \phi _ { t } \left( z _ { t } \right) \right]$$

![](../images/17.Viterbi.png)

Let:
+ $\delta _ { t } ( j ) \triangleq \underset { z _ { 1 } , \ldots , z _ { t - 1 } } { \max } p \left( \mathbf { z } _ { 1 : t - 1 } , z _ { t } = j | \mathbf { x } _ { 1 : t } \right)$ be the probability of ending up in state $j$ at time $t$, given that we take the most probable path. The key insight is that the most probable path to state $j$ at time $t$ must consist of the most probable path to some state $i$ at time $t-1$, followed by a transition from $i$ to $j$:

    $$\delta _ { t } ( j ) = \max _ { i } \delta _ { t - 1 } ( i ) \psi ( i , j ) \phi _ { t } ( j )$$
    where $\delta _ { 1 } ( j ) = \pi _ { j } \phi _ { 1 } ( j )$ is the initialization.
    
+ $a_t(j)$ tells us the most likely previous state on the most probable path to $z_t = j$:
    $$a _ { t } ( j ) = \underset { i } { \operatorname { argmax } } \delta _ { t - 1 } ( i ) \psi ( i , j ) \phi _ { t } ( j )$$
    
We compute the most probable final state: 
$z _ { T } ^ { * } = \arg \max _ { i } \delta _ { T } ( i )$

We compute the most probable sequence of states using traceback:
$z _ { t } ^ { * } = a _ { t + 1 } \left( z _ { t + 1 } ^ { * } \right)$
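
A log-space sketch of the Viterbi recursion and traceback for discrete observations, with `A` the transition matrix and `B` the observation matrix. The function name, the initial distribution `pi`, and the parameter values are assumptions for illustration. Unlike the definition of $\delta_t$ above, this sketch does not normalise at each step, so it tracks the log of the joint probability of the best path; the argmax is unaffected.

```python
import numpy as np

def viterbi(pi, A, B, x):
    """Return the most probable state path and its joint log probability."""
    T, K = len(x), len(pi)
    log_delta = np.zeros((T, K))
    a = np.zeros((T, K), dtype=int)      # a[t, j] = best predecessor of state j at step t
    log_delta[0] = np.log(pi) + np.log(B[:, x[0]])
    for t in range(1, T):
        # scores[i, j] = log delta_{t-1}(i) + log p(z_t = j | z_{t-1} = i)
        scores = log_delta[t - 1][:, None] + np.log(A)
        a[t] = scores.argmax(axis=0)
        log_delta[t] = scores.max(axis=0) + np.log(B[:, x[t]])

    # Traceback from the most probable final state.
    z = np.zeros(T, dtype=int)
    z[-1] = log_delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        z[t] = a[t + 1, z[t + 1]]
    return z, log_delta[-1].max()

# Hypothetical 2-state HMM with 3 observation symbols.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])
x = [0, 2, 1]

z_star, log_p_star = viterbi(pi, A, B, x)
```

Working with sums of logs instead of products of probabilities avoids underflow on long sequences. This sketch assumes all entries of `pi`, `A` and `B` are strictly positive, since `np.log(0)` would give `-inf` with a warning.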

### Forwards filtering, backwards sampling:
Sample paths from the posterior:
$$\mathbf { z } _ { 1 : T } ^ { s } \sim p \left( \mathbf { z } _ { 1 : T } | \mathbf { x } _ { 1 : T } \right)$$

The posterior factorizes backwards in time:
$$p \left( \mathbf { z } _ { 1 : T } | \mathbf { x } _ { 1 : T } \right) = p \left( z _ { T } | \mathbf { x } _ { 1 : T } \right) \prod _ { t = T - 1 } ^ { 1 } p \left( z _ { t } | z _ { t + 1 } , \mathbf { x } _ { 1 : T } \right)$$

So we have the process: 
$$z _ { T } ^ { s } \sim p \left( z _ { T } = i | \mathbf { x } _ { 1 : T } \right) = \alpha _ { T } ( i )$$

then
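$$z _ { t } ^ { s } \sim p \left( z _ { t } = i | z _ { t + 1 } = j , \mathbf { x } _ { 1 : t } \right) \propto \frac { \phi _ { t + 1 } ( j ) \psi ( i , j ) \alpha _ { t } ( i ) } { \alpha _ { t + 1 } ( j ) }$$
where $j = z _ { t + 1 } ^ { s }$ is the state already sampled for time $t + 1$, and the right-hand side is normalised over $i$.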
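
A minimal sketch of forwards filtering, backwards sampling for discrete observations, with `A` the transition matrix and `B` the observation matrix. The function names, the initial distribution `pi`, and the parameter values are assumptions for illustration. Since the sampling weights are normalised over the previous state, only the factors that depend on it (the transition probability and the filtered belief) are needed.

```python
import numpy as np

def forwards(pi, A, B, x):
    """Filtered marginals alpha[t, j] = p(z_t = j | x_{1:t})."""
    T, K = len(x), len(pi)
    alpha = np.zeros((T, K))
    for t in range(T):
        pred = pi if t == 0 else alpha[t - 1] @ A
        alpha[t] = B[:, x[t]] * pred
        alpha[t] /= alpha[t].sum()
    return alpha

def ffbs(pi, A, B, x, rng):
    """Draw one path z_{1:T} from p(z_{1:T} | x_{1:T})."""
    alpha = forwards(pi, A, B, x)
    T, K = alpha.shape
    z = np.zeros(T, dtype=int)
    z[-1] = rng.choice(K, p=alpha[-1])           # final state from the last filtered belief
    for t in range(T - 2, -1, -1):
        w = A[:, z[t + 1]] * alpha[t]            # transition into sampled z_{t+1} times alpha_t
        z[t] = rng.choice(K, p=w / w.sum())      # normalise over the previous state
    return z

# Hypothetical 2-state HMM with 3 observation symbols.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])
x = [0, 2, 1]

rng = np.random.default_rng(0)
samples = np.array([ffbs(pi, A, B, x, rng) for _ in range(1000)])
```

With enough samples, the empirical frequency of each state at each time step should approach the smoothed marginals $\gamma_t$.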
