## Speech Modelling

In this notebook we explore acoustic modelling using the Gaussian mixture model (GMM). We will see how the GMM is used in voice activity detection and in speech recognition.

### Voice Activity Detection
In this problem we aim to determine sections of the audio signal corresponding to speech.

#### Data
For the experiments, we use data from the [TensorFlow Speech Recognition Challenge](https://www.kaggle.com/c/tensorflow-speech-recognition-challenge/data). This dataset contains recordings of isolated words such as "Yes", "No", "Up" and "Down".

Before running the cell below, create the `data` directory in the same folder containing this notebook.


In [None]:
#obtain the data and visualise
import os
import urllib.request
import tarfile

url = 'http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz'
data_files = os.listdir('data/')

if 'speech_commands_v0.01.tar.gz' not in data_files:
    print('Downloading...')
    urllib.request.urlretrieve(url, 'data/speech_commands_v0.01.tar.gz')
    print('Done!')
else:
    print('Data Available')
    
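# extract the archive (skipped if the 'yes' directory already exists)
if 'yes' not in data_files:
    try:
        with tarfile.open('data/speech_commands_v0.01.tar.gz') as archive:
            archive.extractall('data/')
    except (tarfile.TarError, OSError) as err:
        # report the failure instead of silently ignoring it
        print('Extraction failed:', err)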

## Visualisation
We now pick a random recording of the word "yes" and plot its waveform and spectrogram.

The cell below requires [librosa](https://github.com/librosa/librosa), which must be installed first.

In [None]:
%matplotlib inline
import librosa
import librosa.display
import random
import matplotlib.pyplot as plt
import numpy as np
import time


# set up speech processing parameters
NFFT = 512  # 32 ms at 16kHz
HOP_LENGTH = 256 
NUM_MFCC = 13


yes_files = os.listdir('data/yes')
signal, sampling_rate = librosa.load('data/yes/' + random.choice(yes_files), sr=None) # use native sampling rate
    
# compute spectrogram
spect = np.abs(librosa.stft(signal, n_fft=NFFT, hop_length=HOP_LENGTH))

plt.figure()
plt.subplot(211)
librosa.display.specshow(librosa.amplitude_to_db(spect, ref=np.max),
                         sr=sampling_rate,
                         hop_length=HOP_LENGTH,
                         y_axis='linear', 
                         x_axis='time')
plt.title('Spectrogram of "Yes"')
plt.xticks([])
plt.xlabel('')

plt.subplot(212)
plt.plot(np.arange(len(signal)) / sampling_rate, signal)
plt.xlim([0, 1]);


## Voice activity detection
Voice activity detection (VAD) is a classification problem. For each frame of the signal, we seek to determine whether the frame contains speech or not. We will need to select features for our task.

Two features we can start with are
* energy
* zero crossing rate

The short-time energy is computed as
\begin{equation}
E_{\hat{n}}=\sum_{m=-\infty}^\infty (x[m]w[\hat{n} - m])^2
\end{equation}

and the zero crossing rate (ZCR) is
\begin{equation}
Z_{\hat{n}}=\sum_{m=-\infty}^\infty 0.5\,|\mathrm{sgn}\{x[m]\} - \mathrm{sgn}\{x[m-1]\}|\,w[\hat{n} - m]
\end{equation}
where
\begin{equation}
\mathrm{sgn}\{x\}=\left\{ \begin{array}{ll}
1 & x\geq 0\\
-1 & x<0
\end{array} \right.
\end{equation}
and $w$ is the window function positioned at sample $\hat{n}$.

The signal is processed frame by frame by moving a window across the signal.
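
As a rough illustration of frame-by-frame processing, the sketch below computes the zero crossing count of a toy sine wave. It assumes a rectangular window (the text above does not fix a particular $w$) and drops any incomplete final frame. The function name `short_time_zcr` and the toy signal are hypothetical and are not part of the exercises.

```python
import numpy as np

def short_time_zcr(x, frame_length, hop_length):
    """Zero crossings per frame, assuming a rectangular window."""
    x = np.asarray(x, dtype=float)
    # sgn{x} as defined above: +1 for x >= 0, -1 otherwise
    sgn = np.where(x >= 0, 1.0, -1.0)
    # 0.5 * |sgn{x[m]} - sgn{x[m-1]}| is 1 at a sign change and 0 elsewhere;
    # the first sample has no predecessor, so it never counts as a crossing
    crossings = 0.5 * np.abs(np.diff(sgn, prepend=sgn[0]))
    # number of complete frames (assumes len(x) >= frame_length)
    num_frames = 1 + (len(x) - frame_length) // hop_length
    zcr = np.empty(num_frames)
    for i in range(num_frames):
        start = i * hop_length
        zcr[i] = crossings[start:start + frame_length].sum()
    return zcr

# toy signal: a 100 Hz sine sampled at 1600 Hz (one period spans 16 samples);
# the small phase offset keeps samples away from exact zeros
t = np.arange(1600) / 1600
toy = np.sin(2 * np.pi * 100 * t + 0.1)
toy_zcr = short_time_zcr(toy, frame_length=160, hop_length=80)
```

Each 160-sample frame spans ten periods of the sine, so the count per frame should be close to twenty crossings. The same framing loop applies to other per-frame features.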


## Exercise 1
Write a function to compute the short-time energy of the signal. As input, the function should take the signal, the window size and the hop length (or, equivalently, the overlap) between frames. Do not use the built-in `librosa` functions.


In [None]:
# Your code here

## Exercise 2
Using the short time energy function you have computed, plot the frame energy as a function of time on the same graph as the speech signal.

In [None]:
#Your code here

## Exercise 3

Use `librosa` to compute the short time energy and compare it with your function. Comment on similarities and differences.

In [None]:
# your code here

## Modelling

We will work with the `energy` to build a voice activity detection system.

### Exercise 4

Plot a histogram of the log of the energy.

In [None]:
# Your code here

We will attempt to fit a one-dimensional GMM with two components to the log energy.

We will use `scikit-learn`. Installation instructions can be found [here](https://scikit-learn.org/stable/install.html)

In [None]:
from sklearn import mixture

# placeholder: replace with the log energy of each frame,
# stored as a column vector of shape (num_frames, 1)
log_energy = np.zeros((len(signal), 1))

gmm = mixture.GaussianMixture(n_components=2)
gmm.fit(log_energy)

In [None]:
#get the component means
gmm.means_

The component with the higher mean corresponds to speech, since speech frames have more energy than the background.
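
A minimal sketch of this idea on synthetic data is shown here. The two clusters of "log energy" values are made up for illustration and do not come from the recordings, and all variable names are hypothetical. It only shows how the component with the larger mean can be identified and how `predict` assigns each sample to a component.

```python
import numpy as np
from sklearn import mixture

rng = np.random.default_rng(0)
# synthetic stand-in for log energy (an assumption, not real data):
# 200 low-energy "silence" frames followed by 100 high-energy "speech" frames
toy_log_energy = np.concatenate([rng.normal(-8.0, 0.5, 200),
                                 rng.normal(-2.0, 0.5, 100)]).reshape(-1, 1)

toy_gmm = mixture.GaussianMixture(n_components=2, random_state=0)
toy_gmm.fit(toy_log_energy)

# index of the component with the larger mean
speech_component = int(np.argmax(toy_gmm.means_.ravel()))
# hard assignment of each frame to its most likely component
toy_labels = toy_gmm.predict(toy_log_energy)
# 1 where the frame is assigned to the speech component, 0 otherwise
is_speech = (toy_labels == speech_component).astype(int)
```

The order of the components returned by the fit is not guaranteed, which is why the speech component is looked up through `means_` rather than assumed to be index 1.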

### Exercise 5
Get the mean corresponding to the speech component and predict the category of each frame. Finally, plot the speech signal together with an indicator function that is 1 over a speech frame and 0 over a non-speech frame.

In [None]:
# Your code here