## Sequential Models

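Hidden Markov models (HMMs) are useful for modelling sequential data where the assumption of independent and identically distributed (i.i.d.) samples no longer holds. Speech is a typical example: successive frames are strongly correlated, so they cannot be treated as i.i.d.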

An HMM is described by:
1. A matrix of transition probabilities $\mathbf{A}$, where $A_{jk}$ is the probability of moving from state $j$ to state $k$
1. The distribution of the initial state $\mathbf{\pi}$
1. The parameters of the emission probabilities $\phi$


We will use the package [`hmmlearn`](https://hmmlearn.readthedocs.io/en/latest/index.html) to generate sample sequences and train HMM models.

Let's generate a synthetic sequence from an HMM with the following parameters:

$\begin{equation*}
\mathbf{A}= \begin{bmatrix}
    0.7&0.2& 0.1\\
    0.3&0.5& 0.2 \\
    0.3& 0.3& 0.4
\end{bmatrix}
\end{equation*}
$

$\begin{equation*}
\mathbf{\pi}= \begin{bmatrix}
    0.6\\
    0.3\\
    0.1
\end{bmatrix}
\end{equation*}
$

The emission probabilities are Gaussian, $\mathcal{N}(\mu_i,\sigma_i^2)$, where $\mu_1=0,\mu_2=2,\mu_3=4$ and $\sigma_i^2=0.01$ (that is, $\sigma_i=0.1$), matching the covariances set in the code below.
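To make the generative process concrete, here is a minimal NumPy sketch of ancestral sampling from an HMM with the parameters above: draw the first state from $\mathbf{\pi}$, draw each later state from the row of $\mathbf{A}$ selected by the previous state, and draw each observation from the Gaussian of the current state. The function name `sample_hmm`, the seed, and the variable names are illustrative assumptions, and this is not how `hmmlearn` implements `sample` internally. The standard deviation is set to 0.1, i.e. the variance 0.01 passed to `covars_` in the notebook code.

```python
import numpy as np

def sample_hmm(pi, A, means, stds, T, rng):
    """Draw a state path Z and observations X of length T from a 1-D Gaussian HMM."""
    Z = np.empty(T, dtype=int)
    X = np.empty(T)
    # First state comes from the initial distribution.
    Z[0] = rng.choice(len(pi), p=pi)
    X[0] = rng.normal(means[Z[0]], stds[Z[0]])
    for t in range(1, T):
        # Next state depends only on the previous state (Markov property).
        Z[t] = rng.choice(len(pi), p=A[Z[t - 1]])
        # Observation depends only on the current state.
        X[t] = rng.normal(means[Z[t]], stds[Z[t]])
    return X, Z

pi = np.array([0.6, 0.3, 0.1])
A = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.5, 0.2],
              [0.3, 0.3, 0.4]])
means = np.array([0.0, 2.0, 4.0])
stds = np.full(3, 0.1)  # assumed: standard deviation 0.1 (variance 0.01)

rng = np.random.default_rng(0)  # illustrative seed
X_demo, Z_demo = sample_hmm(pi, A, means, stds, T=100, rng=rng)
```

Because the state means are 20 standard deviations apart, each observation in `X_demo` sits close to the mean of its state in `Z_demo`, which is what makes the states easy to recover later.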

In [None]:
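%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from hmmlearn import hmm

# Build a 3-state Gaussian HMM and set its parameters by hand (no fitting).
model = hmm.GaussianHMM(n_components=3, covariance_type="full")
model.startprob_ = np.array([0.6, 0.3, 0.1])
model.transmat_ = np.array([[0.7, 0.2, 0.1],
                            [0.3, 0.5, 0.2],
                            [0.3, 0.3, 0.4]])

model.means_ = np.array([[0.0], [2.0], [4.0]])
# One 1x1 covariance matrix per state: variance 0.01 (standard deviation 0.1).
model.covars_ = np.tile(0.01, (3, 1, 1))

# X: observations, shape (100, 1); Z: hidden state sequence, shape (100,).
X, Z = model.sample(100)

plt.figure()
plt.plot(X)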


 The joint distribution of latent variables and observed variables is given by
  
  $\begin{equation*}
    p(\mathbf{X},\mathbf{Z}|\theta) = p(\mathbf{z}_1|\pi)\Big(\prod_{i=2}^Tp(\mathbf{z}_i|\mathbf{z}_{i-1},\mathbf{A})\Big)\prod_{i=1}^Tp(\mathbf{x}_i|\mathbf{z}_{i},\phi)
  \end{equation*}
  $
  
where $\theta = \{\pi, \mathbf{A}, \phi\}$.

Given data, the parameters $\theta$ are estimated by maximum likelihood. Because the states $\mathbf{Z}$ are not observed, this is done with the Expectation-Maximization (EM) algorithm, which for HMMs is known as the Baum-Welch algorithm.
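The factorisation of the joint distribution can be checked numerically. The sketch below evaluates $\log p(\mathbf{X},\mathbf{Z}|\theta)$ for a known state path as the sum of an initial-state term, the transition terms, and the emission terms. The function name `joint_log_prob`, the toy sequence, and the use of 1-D Gaussians with variance 0.01 are assumptions made for illustration.

```python
import numpy as np

def joint_log_prob(X, Z, pi, A, means, variances):
    """log p(X, Z | theta) for a 1-D Gaussian HMM and a known state path Z."""
    X = np.asarray(X, dtype=float).ravel()
    Z = np.asarray(Z, dtype=int)
    # log p(z_1 | pi)
    log_init = np.log(pi[Z[0]])
    # sum over t >= 2 of log p(z_t | z_{t-1}, A)
    log_trans = np.sum(np.log(A[Z[:-1], Z[1:]]))
    # sum over t of log N(x_t | mu_{z_t}, var_{z_t})
    mu, var = means[Z], variances[Z]
    log_emit = np.sum(-0.5 * np.log(2 * np.pi * var) - 0.5 * (X - mu) ** 2 / var)
    return log_init + log_trans + log_emit

pi = np.array([0.6, 0.3, 0.1])
A = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.5, 0.2],
              [0.3, 0.3, 0.4]])
means = np.array([0.0, 2.0, 4.0])
variances = np.full(3, 0.01)  # assumed variance per state

# Toy sequence: two observations sitting exactly on the means of states 0 and 1.
X_toy = np.array([0.0, 2.0])
Z_toy = np.array([0, 1])
lp = joint_log_prob(X_toy, Z_toy, pi, A, means, variances)
```

For this toy sequence the result equals $\log(0.6 \cdot 0.2) - \log(2\pi \cdot 0.01)$: one initial-state factor, one transition factor, and two Gaussian densities evaluated at their means.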

In [None]:
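# Fit a fresh 3-state model to X with EM (at most 100 iterations),
# then decode the most likely state for every sample.
remodel = hmm.GaussianHMM(n_components=3, covariance_type="full", n_iter=100)
remodel.fit(X)
Z2 = remodel.predict(X)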
plt.figure()
plt.subplot(211)
plt.plot(X)
plt.ylabel(r'$\mathbf{X}$')
plt.subplot(212)
plt.plot(Z2, 'bo')
plt.ylabel(r'$\mathbf{Z}$')


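We see that the inferred state sequence groups together the samples that were drawn from the same state. However, note that the state labels are permuted relative to those used to generate the data. The likelihood does not change when the states are renumbered, so EM can recover the states only up to a relabelling.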
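One simple way to compare inferred and true states despite the arbitrary numbering is to renumber the inferred states by increasing emission mean. The helper below is a sketch under the assumption that the features are one-dimensional and the state means are distinct; the name `relabel_by_mean` is hypothetical.

```python
import numpy as np

def relabel_by_mean(state_means, states):
    """Renumber states so that label 0 has the smallest mean, 1 the next, etc."""
    state_means = np.asarray(state_means).ravel()
    # order[k] is the old label of the state with the k-th smallest mean.
    order = np.argsort(state_means)
    # Invert the permutation: new_label[old] is the rank of that state's mean.
    new_label = np.empty_like(order)
    new_label[order] = np.arange(len(order))
    return new_label[np.asarray(states)]

# Toy example: the fitted model happened to call the highest-mean state "0".
toy_means = np.array([[4.0], [0.1], [2.0]])
toy_states = np.array([1, 1, 2, 0, 2])
aligned = relabel_by_mean(toy_means, toy_states)  # array([0, 0, 1, 2, 1])
```

In the notebook this could be applied as `relabel_by_mean(remodel.means_, Z2)`, after which the result is directly comparable with `Z`, since the generating model already orders its states by increasing mean.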

## Left-Right HMM

The left-right HMM is widely used in speech modelling. It allows transitions only from a state to itself or to later states, never backwards. In addition, we can limit how many states a single transition may skip.
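The structure described here corresponds to an upper-triangular transition matrix, and limiting the jump size zeroes out everything beyond a band above the diagonal. The sketch below builds such a matrix. The function name `left_right_transmat`, the `max_jump` parameter, and the choice to split the non-self probability equally among the allowed forward states are assumptions made for illustration.

```python
import numpy as np

def left_right_transmat(n_states, self_prob, max_jump=1):
    """Left-right transition matrix: stay, or move forward by at most max_jump states."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states):
        last = min(i + max_jump, n_states - 1)  # furthest reachable state
        if last == i:
            # The final state has nowhere to go, so it is absorbing.
            A[i, i] = 1.0
        else:
            A[i, i] = self_prob
            # Assumption: share the remaining mass equally among forward states.
            A[i, i + 1:last + 1] = (1.0 - self_prob) / (last - i)
    return A

A_step = left_right_transmat(3, self_prob=0.9, max_jump=1)  # no skipping
A_skip = left_right_transmat(3, self_prob=0.9, max_jump=2)  # may skip one state
```

With `max_jump=1` this gives a matrix with 0.9 on the diagonal, 0.1 on the first superdiagonal, and an absorbing last state. Zeros in the transition matrix stay zero during EM, because the expected transition counts that update them are proportional to the current transition probabilities.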

In [None]:
st = 0.9 # self transition

lr_hmm = hmm.GaussianHMM(n_components=3, covariance_type="full")
lr_hmm.startprob_ = np.array([1.0, 0.0, 0.0])
lr_hmm.transmat_ = np.array([[st, 1 - st, 0.0],
                             [0.0, st, 1 - st],
                             [0.0, 0.0, 1.0]])
lr_hmm.means_ = np.array([[0.0], [2.0], [4.0]])
lr_hmm.covars_ = np.tile(.01, (3, 1, 1))

X, Z = lr_hmm.sample(100)

%matplotlib inline
plt.figure()
plt.plot(X)

We then fit a new model to this sequence and infer the states.
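Inferring the states means finding the most likely state path given the observations. As far as the `hmmlearn` documentation describes, `predict` uses the Viterbi algorithm by default. The following is a minimal NumPy sketch of Viterbi decoding in the log domain, not `hmmlearn`'s implementation; the function name `viterbi`, the toy observations, and the variance 0.01 are assumptions for illustration.

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Most likely state path. log_B[t, k] = log p(x_t | z_t = k)."""
    T, K = log_B.shape
    delta = np.empty((T, K))           # best log score of any path ending in k at t
    psi = np.zeros((T, K), dtype=int)  # best predecessor of state k at time t
    delta[0] = log_pi + log_B[0]
    for t in range(1, T):
        # scores[j, k]: best path ending in j at t-1, followed by transition j -> k
        scores = delta[t - 1][:, None] + log_A
        psi[t] = np.argmax(scores, axis=0)
        delta[t] = scores[psi[t], np.arange(K)] + log_B[t]
    # Backtrack from the best final state.
    path = np.empty(T, dtype=int)
    path[-1] = np.argmax(delta[-1])
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path

# Left-right model with self-transition probability 0.9.
pi = np.array([1.0, 0.0, 0.0])
A = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.9, 0.1],
              [0.0, 0.0, 1.0]])
means = np.array([0.0, 2.0, 4.0])
var = 0.01  # assumed emission variance

x = np.array([0.05, -0.1, 1.9, 2.1, 4.0, 3.9])  # toy observations
log_B = -0.5 * np.log(2 * np.pi * var) - 0.5 * (x[:, None] - means[None, :]) ** 2 / var
with np.errstate(divide="ignore"):  # log(0) = -inf encodes forbidden moves
    log_pi, log_A = np.log(pi), np.log(A)

path = viterbi(log_pi, log_A, log_B)  # array([0, 0, 1, 1, 2, 2])
```

Working in the log domain avoids numerical underflow on long sequences, and the `-inf` entries guarantee that the decoded path never uses a forbidden backward transition.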

In [None]:
# Fit an unconstrained 3-state model to the left-right sequence and decode it.
remodel = hmm.GaussianHMM(n_components=3, covariance_type="full", n_iter=100)
remodel.fit(X)
Z2 = remodel.predict(X)
plt.figure()
plt.subplot(211)
plt.plot(X)
plt.ylabel(r'$\mathbf{X}$')
plt.subplot(212)
plt.plot(Z2, 'bo')
plt.ylabel(r'$\mathbf{Z}$')

## HMMs for Speech Processing - VAD and Speech Segmentation
We consider the application of HMMs to voice activity detection (VAD) and to detecting fricatives. We will use the short-time energy and the zero-crossing rate as features.
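To show what these two features measure, here is a small NumPy sketch that computes the root-mean-square energy and the zero-crossing rate per frame on a synthetic signal made of silence, a tone, and noise. It only approximates the `librosa` functions used in the notebook: it does not pad or centre the frames, so frame counts and edge values differ. All function names and the synthetic signal are assumptions for illustration.

```python
import numpy as np

def frame_signal(y, frame_length=512, hop_length=256):
    """Slice y into overlapping frames, shape (n_frames, frame_length). No padding."""
    n_frames = 1 + (len(y) - frame_length) // hop_length
    idx = np.arange(frame_length)[None, :] + hop_length * np.arange(n_frames)[:, None]
    return y[idx]

def rms_energy(frames):
    """Root-mean-square amplitude of each frame."""
    return np.sqrt(np.mean(frames ** 2, axis=1))

def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs in each frame whose signs differ."""
    signs = np.signbit(frames)
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)

sr = 8000  # assumed sampling rate for the synthetic signal
rng = np.random.default_rng(0)
n = np.arange(2048)
silence = np.zeros(2048)
tone = 0.5 * np.sin(2 * np.pi * 200 * n / sr)  # stand-in for a voiced sound
noise = 0.1 * rng.standard_normal(2048)        # stand-in for a fricative
y = np.concatenate([silence, tone, noise])

frames = frame_signal(y)
energy = rms_energy(frames)
zcr_demo = zero_crossing_rate(frames)
```

The tone has high energy but a low zero-crossing rate, while the noise has low energy and a high zero-crossing rate. This is why energy suits speech versus silence decisions and the zero-crossing rate suits locating fricatives.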


In [None]:
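import librosa

# sr=None keeps the file's native sampling rate instead of resampling.
signal, sampling_rate = librosa.load('sita.wav', sr=None)

# Frame-level features, each of shape (1, n_frames).
rmse = librosa.feature.rms(y=signal, frame_length=512, hop_length=256)
zcr = librosa.feature.zero_crossing_rate(y=signal, frame_length=512, hop_length=256)

# Time axes: one for the samples, one derived from the actual frame count,
# so the plotted x and y arrays always have the same length.
t_signal = np.arange(len(signal)) / sampling_rate
t_frames = librosa.frames_to_time(np.arange(rmse.shape[1]), sr=sampling_rate, hop_length=256)

plt.figure()
plt.plot(t_signal, signal)
plt.plot(t_frames, rmse.T, 'r')
plt.plot(t_frames, zcr.T, 'g')
plt.xlabel('Time (s)')
plt.xlim([1, 1.5])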

For this problem a two-state model is reasonable, with one state representing speech and the other silence.
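A fitted two-state model does not say which state is speech, and its output is a label per frame rather than a list of time intervals. The sketch below handles both points under the assumption that the speech state is the one with the higher mean energy. The function name `speech_segments` and the toy values are hypothetical.

```python
import numpy as np

def speech_segments(states, state_means, hop_length, sampling_rate):
    """Turn a frame-level state sequence into (start, end) speech intervals in seconds."""
    states = np.asarray(states)
    # Assumption: the state with the larger mean energy is speech.
    speech_state = int(np.argmax(np.asarray(state_means).ravel()))
    is_speech = (states == speech_state).astype(int)
    # Pad with zeros so runs touching either end of the sequence are still closed.
    edges = np.diff(np.concatenate(([0], is_speech, [0])))
    starts = np.flatnonzero(edges == 1)   # first frame of each speech run
    ends = np.flatnonzero(edges == -1)    # first frame after each speech run
    frame_dur = hop_length / sampling_rate
    return [(s * frame_dur, e * frame_dur) for s, e in zip(starts, ends)]

# Toy example: state 0 has the higher mean energy, so it is treated as speech.
toy_states = [0, 0, 1, 1, 1, 0, 1, 1]
toy_means = [[0.2], [0.01]]
segments = speech_segments(toy_states, toy_means, hop_length=256, sampling_rate=16000)
```

With the notebook's variables, a call such as `speech_segments(vad_model_energy.predict(rmse.T), vad_model_energy.means_, 256, sampling_rate)` would return the detected speech intervals once the energy model has been fitted.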

In [None]:
vad_model_energy = hmm.GaussianHMM(n_components=2, covariance_type="full", n_iter=100)
vad_model_energy.fit(rmse.T)  # hmmlearn expects shape (n_frames, n_features)

# Frame times derived from the frame count, so x and y always have equal length.
t_frames = librosa.frames_to_time(np.arange(rmse.shape[1]), sr=sampling_rate, hop_length=256)

plt.figure()
plt.plot(np.arange(len(signal)) / sampling_rate, signal)
plt.plot(t_frames, rmse.T, 'r')
plt.plot(t_frames, vad_model_energy.predict(rmse.T), 'g')
plt.xlabel('Time (s)')
plt.xlim([.5, 2])

In [None]:
# explore parameters
vad_model_energy.transmat_

With the zero-crossing rate we can estimate the location of fricatives, whose noise-like waveform crosses zero far more often than voiced speech does.

In [None]:
vad_model_zcr = hmm.GaussianHMM(n_components=2, covariance_type="full", n_iter=100)
vad_model_zcr.fit(zcr.T)  # hmmlearn expects shape (n_frames, n_features)

# Frame times derived from the frame count, so x and y always have equal length.
t_frames = librosa.frames_to_time(np.arange(zcr.shape[1]), sr=sampling_rate, hop_length=256)

plt.figure()
plt.plot(np.arange(len(signal)) / sampling_rate, signal)
plt.plot(t_frames, zcr.T, 'r')
plt.plot(t_frames, vad_model_zcr.predict(zcr.T), 'g')
plt.xlabel('Time (s)')
plt.xlim([.5, 2])