# HMM Model details


The probability distributions that define the model are:

- $P(Y_{i}|Y_{i-1})$, called the transition probabilities;
- $P(Y_{1}|Y_{0} = {\tt start})$, the initial probabilities;
- $P(Y_{N+1}={\tt stop} |Y_{N})$, the final probabilities.

A first order HMM model has the following independence assumptions over the joint distribution $P(X=x,Y=y)$:

- $\textbf{Independence of previous states.}$: The probability of
    being in a given state at position $i$ only depends on
    the state of the previous position $i-1$. Formally:
    
    \begin{equation*}
    P (Y_i = y_i | Y_{i-1} = y_{i-1}, Y_{i-2} = y_{i-2}, \ldots, Y_1 = y_1) = P (Y_i = y_i | Y_{i-1} = y_{i-1})
    \end{equation*} 
    
    This defines a first order Markov chain.


- $\textbf{Homogeneous transition.}$: The probability of
    making a transition from state $c_l$ to state $c_k$ is independent of
    the particular position in the sequence. That is, for all $i,t \in \{1,\ldots,N\}$,
    
     \begin{equation*}
    P (Y_i = c_k | Y_{i-1} = c_l) =  P (Y_{t} = c_k | Y_{t-1} = c_l)
     \end{equation*}


- $\textbf{Observation independence.}$  The probability of
    observing $X_i = x_i$ at position $i$ is fully determined by the state $Y_i$
    at that position. Formally, 
    
     \begin{equation*}
     P (X_i = x_i | Y_1=y_1, \ldots, Y_i=y_i, \ldots, Y_N=y_N) = P(X_i = x_i | Y_i = y_i)
      \end{equation*}
     
     This probability is independent of the
    particular position so, for every $i$ and $t$, we can write:  
    
     \begin{equation*}
    P(X_i = w_j | Y_i = c_k) = P(X_{t} = w_j | Y_{t} = c_k)
     \end{equation*}

These conditional independence assumptions are crucial for
efficient inference, as will be described.


### Table summary

The distributions that define the HMM are summarized in the following table:

<img src="../images_for_notebooks/day_2/Hmm_table.png" style="max-width:100%; width: 75%">

### Joint distribution $P(X,Y)$

The joint probability of a first order HMM can be written as follows:
$$
P(X_1=x_1,\ldots,X_N=x_N,Y_1=y_1,\ldots,Y_N=y_N)= 
P_{\mathrm{init}}(y_1|\text{ start}) 
\cdot
\left(
\prod_{i=1}^{N-1} P_{\mathrm{trans}}(y_{i+1}|y_i)
\right)
\times
P_{\mathrm{final}}(\text{ stop}|y_N)
\cdot 
\prod_{i=1}^{N} P_{\mathrm{emiss}}(x_i|y_i)
$$
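The factorization above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `joint_log_prob`, the array layout (`trans[k, l]` holds $P_{\mathrm{trans}}(c_k|c_l)$, `emiss[j, k]` holds $P_{\mathrm{emiss}}(w_j|c_k)$) and the toy parameter values are hypothetical and are not taken from the toolkit. The sum is accumulated in log space, which is a common way to avoid underflow on long sequences.

```python
import numpy as np

def joint_log_prob(x, y, init, trans, final, emiss):
    """Return log P(X=x, Y=y) for a first order HMM.

    Assumed (hypothetical) array layout:
      init[k]     = P_init(c_k | start)
      trans[k, l] = P_trans(c_k | c_l)
      final[l]    = P_final(stop | c_l)
      emiss[j, k] = P_emiss(w_j | c_k)
    x and y are equal-length lists of integer observation/state ids.
    """
    # Initial term
    logp = np.log(init[y[0]])
    # Transition terms, one per consecutive pair of states
    for i in range(1, len(y)):
        logp += np.log(trans[y[i], y[i - 1]])
    # Final term
    logp += np.log(final[y[-1]])
    # Emission terms, one per position
    for xi, yi in zip(x, y):
        logp += np.log(emiss[xi, yi])
    return logp

# Illustrative parameters only, not estimated from any data.
# For each state l, trans[:, l].sum() + final[l] equals 1.
init = np.array([0.6, 0.4])
trans = np.array([[0.5, 0.2],
                  [0.3, 0.6]])
final = np.array([0.2, 0.2])
emiss = np.array([[0.7, 0.1],
                  [0.3, 0.9]])

x = [0, 1, 1]
y = [0, 1, 1]
prob = np.exp(joint_log_prob(x, y, init, trans, final, emiss))
print(prob)
```

The printed value is the product of one initial, two transition, one final and three emission factors, matching the formula term by term.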

#### Example: computing the probability of a pair $(x,y)$
The probability that the HMM assigns to the first training instance of Example 2.1, which is 

$$
(x,y) = ([\text{walk}, \text{walk}, \text{shop}, \text{clean}],  [\text{rainy}, \text{sunny},\text{ sunny}, \text{sunny}])
$$
can be computed as

$$
P(X_1=x_1,\ldots,X_4=x_4,Y_1=y_1,\ldots,Y_4=y_4)= 
P_{\text{init}}(\text{rainy}|\text{ start}) 
\cdot
P_{\mathrm{trans}}(\text{ sunny}|\text{ rainy}) 
\cdot
P_{\mathrm{trans}}(\text{ sunny}|\text{ sunny}) 
\cdot
P_{\mathrm{trans}}(\text{ sunny}|\text{ sunny}) 
\cdot
P_{\mathrm{final}}(\text{ stop}|\text{ sunny}) 
\cdot
P_{\mathrm{emiss}}(\text{ walk}|\text{ rainy}) 
\cdot
P_{\mathrm{emiss}}(\text{ walk}|\text{ sunny}) 
\cdot
P_{\mathrm{emiss}}(\text{ shop}|\text{ sunny})
\cdot
P_{\mathrm{emiss}}(\text{ clean}|\text{ sunny}).
$$

# HMM Maximum Likelihood Training

We have seen how to compute the probability of a pair $(x,y)$ given the probabilities $P_{\text{init}}, P_{\text{trans}},P_{\text{final}},P_{\text{emiss}}$.

Now we will study how to find the parameters that define $P_{\text{init}}, P_{\text{trans}},P_{\text{final}},P_{\text{emiss}}$. We will refer to the set of parameters as $\theta$.

Given a dataset $\mathcal{D}_L$ of $M$ labeled sequences $(x^m, y^m)$, we will try to find the parameters $\theta$ that maximize the log likelihood function:

$$
\log \prod_{m=1}^M P_{\theta} (X=x^m,Y=y^m) =  \sum_{m=1}^M  \log P_{\theta} (X=x^m,Y=y^m)
$$

where the joint distribution $P_{\theta} (X=x^m,Y=y^m)$ is given by the formula 

$$
P(X_1=x_1,\ldots,X_N=x_N,Y_1=y_1,\ldots,Y_N=y_N)= 
P_{\mathrm{init}}(y_1|\text{ start}) 
\cdot
\left(
\prod_{i=1}^{N-1} P_{\mathrm{trans}}(y_{i+1}|y_i)
\right)
\times
P_{\mathrm{final}}(\text{ stop}|y_N)
\cdot 
\prod_{i=1}^{N} P_{\mathrm{emiss}}(x_i|y_i)
$$

In some applications (such as speech recognition),
the observation variables are continuous, hence the emission distributions are defined over real values (e.g. mixtures of Gaussians). In our case, both the state set and the observation set are discrete (and finite), therefore we use
multinomial distributions for the emission and 
transition probabilities. 

Multinomial distributions are attractive for several reasons: first of
all, they are easy to implement; secondly, the maximum likelihood estimation of the parameters has a simple closed form. The parameters are just normalized counts of events that occur in the corpus.

 Let us define the following
quantities, called sufficient statistics, that represent the counts of
each event in the corpus:


- Initial counts:
$$C_{\text{init}}(c_k) = \sum_{m=1}^M
\mathbb{1} (y^m_1 = c_k)
$$

- Transition counts:
$$
C_{\text{trans}}(c_k,c_l) =
\sum_{m=1}^M  \sum_{i = 2}^{N}
\mathbb{1} (y^m_i = c_k \wedge y^m_{i-1} = c_l)
$$

- Final counts:
$$
C_{\text{final}}(c_k) = \sum_{m=1}^M
\mathbb{1} (y^m_N = c_k)
$$

- Emission counts:
$$
C_{\text{emiss}}(w_j,c_k) = \sum_{m=1}^M
\sum_{i = 1}^{N}
\mathbb{1} (x^m_i = w_j \wedge y^m_i = c_k)
$$

Here, in $y^m_i$, the subscript denotes the position within a sequence and the superscript denotes the index of the sequence in the dataset; the same applies to the observations $x^m_i$.
Note that $\mathbb{1}$ is an indicator function that takes the value 1 when the
particular event happens, and 0 otherwise. In other words, the previous
equations go through the training corpus and count how
often each event occurs. For example, the transition counts record how many times state $c_k$ follows state $c_l$. Therefore, $C_{\text{trans}}(\text{ sunny},\text{ rainy})$ contains the number of times that a sunny day followed a rainy day.
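The four count definitions can be sketched as a single pass over a labeled corpus. This is a minimal illustration, not the toolkit's implementation: the function name `count_events`, the id dictionaries and the fourth observation `tennis` are assumptions. The first sequence is the training instance quoted earlier; the other two are assumed, chosen so that the result agrees with the counts printed in Exercise 2.2.

```python
import numpy as np

# Assumed id mappings (hypothetical; the toolkit builds its own dictionaries)
states = {"rainy": 0, "sunny": 1}
words = {"walk": 0, "shop": 1, "clean": 2, "tennis": 3}

# First pair is the instance from the text; the other two are assumptions
corpus = [
    (["walk", "walk", "shop", "clean"], ["rainy", "sunny", "sunny", "sunny"]),
    (["walk", "walk", "shop", "clean"], ["rainy", "rainy", "rainy", "sunny"]),
    (["walk", "shop", "shop", "clean"], ["sunny", "sunny", "sunny", "sunny"]),
]

def count_events(corpus, words, states):
    """Accumulate the sufficient statistics of a supervised HMM."""
    K, J = len(states), len(words)
    c_init = np.zeros(K)
    c_trans = np.zeros((K, K))   # c_trans[k, l]: c_k follows c_l
    c_final = np.zeros(K)
    c_emiss = np.zeros((J, K))   # c_emiss[j, k]: w_j emitted in c_k
    for x, y in corpus:
        xs = [words[w] for w in x]
        ys = [states[s] for s in y]
        c_init[ys[0]] += 1
        for prev, cur in zip(ys[:-1], ys[1:]):
            c_trans[cur, prev] += 1
        c_final[ys[-1]] += 1
        for xi, yi in zip(xs, ys):
            c_emiss[xi, yi] += 1
    return c_init, c_trans, c_final, c_emiss

c_init, c_trans, c_final, c_emiss = count_events(corpus, words, states)
print(c_trans)
```

Note the index order: the row of `c_trans` is the state being entered and the column is the state being left, mirroring $C_{\text{trans}}(c_k,c_l)$.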


#### Sanity check for the HMM


- Initial counts must sum to the number of sequences: $$ \sum_{k=1}^K C_{\text{init}}(c_k) = M$$

- Transition counts and final counts together must sum to the number of tokens: $$\sum_{k,l=1}^K C_{\text{trans}}(c_k,c_l)  + \sum_{k=1}^K C_{\text{final}}(c_k) = M \cdot N$$

- Emission counts must sum to the number of tokens:
$$
\sum_{j=1}^J \sum_{k=1}^K C_{\text{emiss}}(w_j,c_k) = M \cdot N 
$$
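These three identities are easy to check numerically. The sketch below copies the count arrays from the output of Exercise 2.2 and assumes $M = 3$ sequences of length $N = 4$ (an assumption consistent with the four-token example sequence); the helper name `check_counts` is hypothetical.

```python
import numpy as np

# Count arrays as printed in Exercise 2.2
c_init = np.array([2., 1.])
c_trans = np.array([[2., 0.],
                    [2., 5.]])
c_final = np.array([0., 3.])
c_emiss = np.array([[3., 2.],
                    [1., 3.],
                    [0., 3.],
                    [0., 0.]])

M, N = 3, 4  # assumed: 3 sequences, 4 tokens each

def check_counts(c_init, c_trans, c_final, c_emiss, n_sequences, n_tokens):
    """Return a dict saying which of the three sanity checks hold."""
    return {
        "init": c_init.sum() == n_sequences,
        "trans_plus_final": c_trans.sum() + c_final.sum() == n_tokens,
        "emiss": c_emiss.sum() == n_tokens,
    }

checks = check_counts(c_init, c_trans, c_final, c_emiss, M, M * N)
print(checks)
```

The second check works because every token is followed either by another token (a transition) or by the end of its sequence (a final event).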

## Training an HMM: Finding the parameters of the distributions 

The following formulas specify how to find the parameters of the HMM:

$$
P_{\text{init}}(c_k \,\vert\, \text{start}) = \frac{C_{\text{init}}(c_k)}{ \sum_{l=1}^K
C_{\text{init}} (c_l)}
$$

$$
P_{\text{final}}(\text{stop} \,\vert\, c_l) = \frac{C_{\text{final}}(c_l) }
{\sum_{k=1}^K C_{\text{trans}}(c_k,c_l) + C_{\text{final}}(c_l)}
$$

$$
P_{\text{trans}}( c_k \,\vert\, c_l) = \frac{C_{\text{trans}}(c_k, c_l) }
{\sum_{p=1}^K C_{\text{trans}}(c_p,c_l) + C_{\text{final}}(c_l)}
$$

$$
P_{\text{emiss}} (w_j \,\vert\, c_k) = \frac{C_{\text{emiss}} (w_j, c_k) }{\sum_{q=1}^J C_{\text{emiss}}(w_q,c_k)}
$$
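To see in what sense these normalized counts are maximum likelihood estimates, note that the log likelihood regroups into a sum of counts times log-probabilities, since each factor of the joint distribution appears once per counted event. The sketch below uses the count arrays printed in Exercise 2.2 and compares the estimates with an arbitrary alternative (uniform) parameter setting; the function names `mle` and `log_likelihood` are hypothetical, and events with zero count are skipped so that $0 \log 0$ is treated as 0.

```python
import numpy as np

# Count arrays as printed in Exercise 2.2
c_init = np.array([2., 1.])
c_trans = np.array([[2., 0.], [2., 5.]])
c_final = np.array([0., 3.])
c_emiss = np.array([[3., 2.], [1., 3.], [0., 3.], [0., 0.]])
counts = (c_init, c_trans, c_final, c_emiss)

def mle(c_init, c_trans, c_final, c_emiss):
    """Normalized counts, following the four formulas above."""
    # Shared denominator: occurrences of each source state c_l
    denom = c_trans.sum(axis=0) + c_final
    return (c_init / c_init.sum(),
            c_trans / denom,
            c_final / denom,
            c_emiss / c_emiss.sum(axis=0))

def log_likelihood(counts, probs):
    """Sum of count * log(prob) over all events that occur."""
    total = 0.0
    for c, p in zip(counts, probs):
        seen = c > 0
        total += np.sum(c[seen] * np.log(p[seen]))
    return total

p_mle = mle(*counts)

# Arbitrary alternative: uniform distributions of matching shapes
p_uniform = (np.full(2, 0.5),
             np.full((2, 2), 1.0 / 3.0),
             np.full(2, 1.0 / 3.0),
             np.full((4, 2), 0.25))

ll_mle = log_likelihood(counts, p_mle)
ll_uniform = log_likelihood(counts, p_uniform)
print(ll_mle, ll_uniform)
```

The estimated parameters give the larger log likelihood of the two, as expected from a maximizer.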


## Exercise 2.2 

The provided function `train_supervised` from the `hmm.py` file implements the above parameter estimates. Run this function on the simple dataset above and look at the estimated probabilities. Are they correct? 

You can also check the variables ending in `_counts` instead of `_probs` to see the raw counts (for example, typing ``hmm.initial_counts`` will show you the raw counts of initial states). How are the counts related to the probabilities?

In [2]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

import sys
# Append the toolkit path so that the lxmls toolkit can be imported
sys.path.append('../../lxmls-toolkit')

In [6]:
import numpy as np
import lxmls.sequences.hmm as hmmc
import lxmls.readers.simple_sequence as ssr

# Load data
simple = ssr.SimpleSequence()

# Instantiate the HMM using the loaded data
hmm = hmmc.HMM(simple.x_dict, simple.y_dict)

# Train the HMM
hmm.train_supervised(simple.train)

In [4]:
# Single-argument print calls run under both Python 2 and Python 3
print("Initial Counts:\n{}\n".format(hmm.initial_counts))
print("Transition Counts:\n{}\n".format(hmm.transition_counts))
print("Final Counts:\n{}\n".format(hmm.final_counts))
print("Emission Counts\n{}".format(hmm.emission_counts))

Initial Counts:
[ 2.  1.] 

Transition Counts:
[[ 2.  0.]
 [ 2.  5.]] 

Final Counts:
[ 0.  3.] 

Emission Counts
[[ 3.  2.]
 [ 1.  3.]
 [ 0.  3.]
 [ 0.  0.]]


In [7]:
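# Initial probabilities: normalize the initial counts
print("initial_probs ")
print(hmm.initial_counts / np.sum(hmm.initial_counts))

# Transition and final probabilities share one denominator per source state:
# the number of transitions out of that state plus the times it ended a sequence
print("\ntransition_probs")
print(hmm.transition_counts / (np.sum(hmm.transition_counts, 0) + hmm.final_counts))

print("\nfinal_probs")
print(hmm.final_counts / (np.sum(hmm.transition_counts, 0) + hmm.final_counts))

# Emission probabilities: normalize each state's column of emission counts
print("\nemission_probs")
print(hmm.emission_counts / np.sum(hmm.emission_counts, 0))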

initial_probs 
[ 0.66666667  0.33333333]

transition_probs
[[ 0.5    0.   ]
 [ 0.5    0.625]]

final_probs
[ 0.     0.375]

emission_probs
[[ 0.75   0.25 ]
 [ 0.25   0.375]
 [ 0.     0.375]
 [ 0.     0.   ]]


OBSERVATION:

**If we stack the transition and final counts and normalize them, we get
a proper conditional probability distribution.**

In [8]:
transitions_with_final_counts = np.vstack((hmm.transition_counts,
                                           hmm.final_counts))

In [9]:
transitions_with_final_counts 

array([[ 2.,  0.],
       [ 2.,  5.],
       [ 0.,  3.]])

In [10]:
transitions_with_final_counts / np.sum(transitions_with_final_counts, 0)

array([[ 0.5  ,  0.   ],
       [ 0.5  ,  0.625],
       [ 0.   ,  0.375]])
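A quick way to confirm the observation is to sum each column of the normalized stacked matrix. The sketch below recreates the arrays from the output above rather than reading them from `hmm`, so the variable names are local to this example.

```python
import numpy as np

# Arrays as printed above
transition_counts = np.array([[2., 0.],
                              [2., 5.]])
final_counts = np.array([0., 3.])

# Stack the final counts as an extra "stop" row and normalize per column
stacked = np.vstack((transition_counts, final_counts))
stacked_probs = stacked / stacked.sum(axis=0)

# Each column (one per source state) is now a full distribution
stacked_column_sums = stacked_probs.sum(axis=0)

# The transition rows alone do not sum to one for every state
transition_column_sums = stacked_probs[:-1].sum(axis=0)

print(stacked_column_sums)
print(transition_column_sums)
```

The missing mass in the transition rows is exactly the probability of stopping, which is why transitions and final probabilities must be normalized together.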

In [11]:
# Single-argument print calls run under both Python 2 and Python 3
print("Initial Probabilities:\n{}\n".format(hmm.initial_probs))
print("Transition 'Probabilities':\n{}\n".format(hmm.transition_probs))
print("Final 'Probabilities':\n{}\n".format(hmm.final_probs))
print("Emission Probabilities\n{}".format(hmm.emission_probs))

Initial Probabilities:
[ 0.66666667  0.33333333] 

Transition 'Probabilities':
[[ 0.5    0.   ]
 [ 0.5    0.625]] 

Final 'Probabilities':
[ 0.     0.375] 

Emission Probabilities
[[ 0.75   0.25 ]
 [ 0.25   0.375]
 [ 0.     0.375]
 [ 0.     0.   ]]
