## Import libraries

In [1]:
import numpy as np
import scipy
import tqdm

## Data

In [None]:
example = np.load('lab1_example.npz', allow_pickle=True)['example'].item()

dict_keys(['samples', 'samplingrate', 'frames', 'preemph', 'windowed', 'spec', 'mspec', 'mfcc', 'lmfcc'])

* **samples**: speech samples for one utterance
* **samplingrate**: sampling rate
* **frames**: speech samples organized in overlapping frames
* **preemph**: pre-emphasized speech samples
* **windowed**: hamming windowed speech samples
* **spec**: squared absolute value of Fast Fourier Transform
* **mspec**: natural log of spec multiplied by Mel filterbank
* **mfcc**: Mel Frequency Cepstrum Coefficients
* **lmfcc**: Liftered Mel Frequency Cepstrum Coefficients

The files contain spoken utterances of isolated digits (2 repetitions of each digit per speaker).
The digits are "oh", "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine".

In [None]:
data = np.load('lab1_data.npz', allow_pickle=True)['data']

* **filename**: filename of the wave file in the database
* **samplingrate**: sampling rate of the speech signal (20 kHz in all examples)
* **gender**: gender of the speaker for the current utterance (man, woman)
* **speaker**: speaker ID for the current utterance (ae, ac)
* **digit**: digit contained in the current utterance (o, z, 1, ..., 9)
* **repetition**: whether this was the first (a) or second (b) repetition
* **samples**: array of speech samples

## Mel Frequency Cepstrum Coefficients step-by-step

### 4.1 Enframe

Implement the `enframe` function in `lab1_proto.py`.

This function takes as input the speech samples, the frame length in samples, and the number of samples of overlap between consecutive frames, and outputs a two-dimensional array in which each row is a frame of samples. Consider only the frames that fit entirely within the original signal, discarding any leftover samples.

Apply the enframe function to the utterance `example['samples']` with window length of 20 milliseconds and shift of 10 ms (figure out the length and shift in samples from the sampling rate).

Use the `pcolormesh` function from `matplotlib.pyplot` to plot the resulting array. Verify that your result corresponds to the array in `example['frames']`.
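One possible shape for this function is sketched below. It assumes the third argument is the shift between frame starts in samples (named `winshift` here, which is an assumption; with a 20 ms window and a 10 ms shift, the overlap and the shift happen to be equal). It runs on a synthetic signal rather than on `example['samples']`, and the name `enframe_sketch` is hypothetical.

```python
import numpy as np

def enframe_sketch(samples, winlen, winshift):
    """Slice a 1-D signal into overlapping frames, one frame per row."""
    samples = np.asarray(samples)
    # Number of complete frames that fit; leftover samples are dropped.
    n_frames = max(0, 1 + (len(samples) - winlen) // winshift)
    starts = winshift * np.arange(n_frames)[:, None]   # first index of each frame
    offsets = np.arange(winlen)[None, :]               # offsets within a frame
    return samples[starts + offsets]

samplingrate = 20000                      # 20 kHz, as in the data description
winlen = samplingrate * 20 // 1000        # 20 ms -> 400 samples
winshift = samplingrate * 10 // 1000      # 10 ms -> 200 samples
signal = np.arange(1000)                  # synthetic stand-in for the utterance
frames = enframe_sketch(signal, winlen, winshift)
```

With the real utterance, `np.allclose(frames, example['frames'])` would be one way to run the requested check.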

### 4.2 Pre-emphasis

Implement the `preemp` function in `lab1_proto.py`.

To do this, define a pre-emphasis filter with pre-emphasis coefficient 0.97 using the `lfilter` function from `scipy.signal`.

Explain how you defined the filter coefficients.

Apply the filter to each frame in the output from the `enframe` function. This should correspond to the `example['preemph']` array.
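A minimal sketch of such a filter follows, assuming the usual first-order pre-emphasis difference equation $y[n] = x[n] - 0.97\,x[n-1]$. The function name `preemp_sketch` and the toy input are illustrative only.

```python
import numpy as np
from scipy.signal import lfilter

def preemp_sketch(frames, p=0.97):
    """Apply y[n] = x[n] - p * x[n-1] independently to every row (frame)."""
    # FIR filter: numerator b = [1, -p], denominator a = [1].
    return lfilter([1.0, -p], [1.0], frames, axis=1)

toy_frames = np.ones((2, 4))              # synthetic frames, not the lab data
emphasized = preemp_sketch(toy_frames)
```

Because the filter state starts at zero for each row, the first sample of every frame passes through unchanged.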

### 4.3 Hamming Window

Implement the `windowing` function in `lab1_proto.py`.

To do this, define a Hamming window of the correct size using the `hamming` function from `scipy.signal` (in recent SciPy releases, from `scipy.signal.windows`) with the extra option `sym=False`.

Plot the window shape and explain why this windowing should be applied to the frames of speech signal.

Apply the Hamming window to the pre-emphasized frames of the previous step. This should correspond to the `example['windowed']` array.
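The windowing step could look roughly like the sketch below. It imports `hamming` from `scipy.signal.windows`, and the function name `windowing_sketch` and the all-ones input are assumptions made for illustration.

```python
import numpy as np
from scipy.signal.windows import hamming

def windowing_sketch(frames):
    """Multiply every frame (row) by a periodic Hamming window."""
    window = hamming(frames.shape[1], sym=False)   # one weight per sample in a frame
    return frames * window                         # broadcast over rows

toy_frames = np.ones((3, 400))            # synthetic frames of 400 samples
windowed = windowing_sketch(toy_frames)
```

With an all-ones input, each output row is the window itself, so plotting `windowed[0]` shows the window shape.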

### 4.4 Fast Fourier Transform

Implement the `powerSpectrum` function in `lab1_proto.py`. To do this, compute the Fast Fourier Transform (FFT) of the input using the `fft` function from `scipy.fftpack`, and then take the squared modulus of the result.

Apply your function to the windowed speech frames, with FFT length of 512 samples.

Plot the resulting power spectrogram with `pcolormesh`. Beware of the fact that the FFT bins correspond to frequencies that go from 0 to fmax and back to 0.

What is fmax in this case according to the Sampling Theorem? The array should correspond to `example['spec']`.
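As a rough sketch of this step, the snippet below computes the power spectrum of synthetic frames and the frequency associated with each FFT bin (bin $k$ sits at $k \cdot f_s / N_\text{FFT}$ before the spectrum mirrors). The name `power_spectrum_sketch` and the toy data are assumptions.

```python
import numpy as np
from scipy.fftpack import fft

def power_spectrum_sketch(frames, nfft):
    """Squared modulus of the nfft-point FFT of every frame (row)."""
    return np.abs(fft(frames, n=nfft, axis=1)) ** 2

samplingrate = 20000
nfft = 512
toy_frames = np.ones((2, 400))            # synthetic frames; zero-padded to nfft by fft
spec = power_spectrum_sketch(toy_frames, nfft)

# Nominal frequency of each bin; for real input, bin k and bin nfft - k carry the same power.
bin_freqs = np.arange(nfft) * samplingrate / nfft
```

Since the input is real, only the first `nfft // 2 + 1` bins carry distinct information, which is worth keeping in mind when reading the plot.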

### 4.5 Mel filterbank log spectrum

Implement the `logMelSpectrum` function in `lab1_proto.py`. Use the `trfbank` function, provided
in the `lab1_tools.py` file, to create a bank of triangular filters linearly spaced in the Mel
frequency scale. Plot the filters in linear frequency scale. Describe the distribution of the
filters along the frequency axis. Apply the filters to the output of the power spectrum from the
previous step for each frame and take the natural log of the result. Plot the resulting filterbank
outputs with `pcolormesh`. This should correspond to the `example['mspec']` array.
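The filtering and logarithm can be sketched as a matrix product followed by `np.log`. The snippet below does not call `trfbank`; it uses a tiny hand-made filterbank as a stand-in, and it assumes the filterbank is laid out with one filter per row and one column per FFT bin. The name `log_mel_sketch` is hypothetical.

```python
import numpy as np

def log_mel_sketch(spec, fbank):
    """spec: (n_frames, nfft); fbank: (n_filters, nfft) -> (n_frames, n_filters)."""
    # Each output is the filter-weighted sum of the power spectrum, then its natural log.
    return np.log(spec @ fbank.T)

# Toy stand-in for the real filterbank: two overlapping triangular filters.
toy_fbank = np.array([[1.0, 0.5, 0.0, 0.0],
                      [0.0, 0.5, 1.0, 0.5]])
toy_spec = np.array([[1.0, 2.0, 3.0, 4.0]])   # one synthetic power-spectrum frame
mspec = log_mel_sketch(toy_spec, toy_fbank)
```

With the real data, `fbank` would be the output of `trfbank` and `spec` the power spectrum from the previous step, provided the shapes match this layout.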

### 4.6 Cosine Transform and Liftering

Implement the `cepstrum` function in `lab1_proto.py`. To do this, apply the Discrete Cosine Transform (`dct` function from `scipy.fftpack.realtransforms`) to the outputs of the filterbank.

Use coefficients from 0 to 12 (13 coefficients).

Note that using the $n=13$ input parameter in `dct` is not the same as running without the argument and taking the first 13 elements in the results.

Note that $n=13$ does not only determine the number of output DCT coefficients the function returns, but will also truncate the input in case n is smaller than the input length.
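The difference between the two variants can be checked numerically. The sketch below uses random data in place of the filterbank outputs (40 channels is an arbitrary choice for the illustration) and the default DCT type of `scipy.fftpack.dct`.

```python
import numpy as np
from scipy.fftpack import dct

rng = np.random.default_rng(0)
toy_mspec = rng.normal(size=(5, 40))      # synthetic stand-in: 5 frames, 40 channels

# Variant 1: transform all 40 channels, then keep coefficients 0..12.
first_13 = dct(toy_mspec, axis=1)[:, :13]

# Variant 2: n=13 truncates the input to its first 13 channels before transforming.
truncated = dct(toy_mspec, n=13, axis=1)

same = np.allclose(first_13, truncated)   # False for this data
```

Both results have 13 columns, but only the first variant uses every filterbank channel.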

Apply liftering using the function `lifter` in `lab1_tools.py`. This last step is used to correct the range of the coefficients. Plot the resulting coefficients with `pcolormesh`. The coefficients before and after liftering should correspond to `example['mfcc']` and `example['lmfcc']` respectively.

Once you are sure all the above steps are correct, use the `mfcc` function (`lab1_proto.py`) to compute the liftered MFCCs for all the utterances in the `data` array.

Observe differences for different utterances.

## Feature Correlation

Concatenate all the MFCC frames from all utterances in the `data` array into a big feature
$[N × M]$ array where $N$ is the total number of frames in the data set and $M$ is the number of
coefficients. Then compute the correlation coefficients between features and display the result
with `pcolormesh`. Are features correlated? Is the assumption of diagonal covariance matrices
for Gaussian modelling justified? Compare the results you obtain for the MFCC features with
those obtained with the Mel filterbank features (`mspec` features).
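One way to carry this out is sketched below, with random arrays standing in for the per-utterance MFCC matrices. The variable names are illustrative, and `rowvar=False` tells `np.corrcoef` that columns, not rows, are the variables.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins: three utterances with different frame counts, 13 coefficients each.
utterance_feats = [rng.normal(size=(n, 13)) for n in (50, 70, 60)]

features = np.vstack(utterance_feats)          # shape (N, M), N = total number of frames
corr = np.corrcoef(features, rowvar=False)     # shape (M, M), correlation between coefficients
```

The resulting `corr` matrix is symmetric with ones on the diagonal, so the off-diagonal entries are the ones to inspect in the plot.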

## Explore Speech Segments with Clustering

Using the concatenated data from the previous section, train a Gaussian mixture model with `sklearn.mixture.GaussianMixture` (or `sklearn.mixture.GMM` in old versions of sklearn).

Vary the number of components, for example: 4, 8, 16, 32.

Once the models are trained, compute GMM posteriors for utterances containing the same words.
Plot the results with pcolormesh and observe the evolution of the GMM posteriors in time.

* Can you say something about the classes discovered by the unsupervised learning method?
* Do the classes roughly correspond to the phonemes you expect to compose each word?
* Are those classes a stable representation of the word if you compare utterances from different speakers?

As an example, plot and discuss the GMM posteriors for the model with 32 components for the four occurrences of the word “seven” (utterances 16, 17, 38, and 39).
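The training and posterior computation described in this section might be sketched as follows. The data here is random, 4 components are used to keep the example fast, and the diagonal covariance type is an assumption rather than something the text prescribes.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
train_feats = rng.normal(size=(500, 13))   # synthetic stand-in for the concatenated MFCCs
utterance = rng.normal(size=(80, 13))      # synthetic stand-in for one utterance

gmm = GaussianMixture(n_components=4, covariance_type="diag", random_state=0)
gmm.fit(train_feats)

# One row per frame, one column per component; each row sums to 1.
posteriors = gmm.predict_proba(utterance)
```

Plotting `posteriors.T` with `pcolormesh` puts time on the horizontal axis and the component index on the vertical axis.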

## Comparing Utterances

Given two utterances of length $N$ and $M$ respectively, compute an $[N × M]$ matrix of local Euclidean distances between each MFCC vector in the first utterance and each MFCC vector in the second utterance.
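A compact way to obtain such a matrix is `scipy.spatial.distance.cdist`. The sketch below uses two tiny made-up "utterances" with 2-dimensional feature vectors so the distances are easy to verify by hand.

```python
import numpy as np
from scipy.spatial.distance import cdist

utt_a = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])   # N = 3 toy feature vectors
utt_b = np.array([[0.0, 0.0], [3.0, 4.0]])               # M = 2 toy feature vectors

# local_dist[i, j] is the Euclidean distance between utt_a[i] and utt_b[j].
local_dist = cdist(utt_a, utt_b, metric="euclidean")
```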

Write a function called `dtw` (`lab1_proto.py`) that takes as input this matrix of local distances and outputs the result of the Dynamic Time Warping algorithm. The main output is the global distance between the two sequences (utterances), but you may also want to output the best path for debugging purposes.
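A basic version of the algorithm is sketched below under several assumptions: the allowed steps are horizontal, vertical and diagonal with equal weight, the global distance is the unnormalized accumulated cost, and the input matrix is non-empty. The name `dtw_sketch` and the return format are illustrative, not the interface required by `lab1_proto.py`.

```python
import numpy as np

def dtw_sketch(local_dist):
    """Return (global distance, best path) for a matrix of local distances."""
    local_dist = np.asarray(local_dist, dtype=float)
    n, m = local_dist.shape
    # acc[i, j]: cheapest accumulated cost of aligning the first i and first j frames.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = local_dist[i - 1, j - 1] + best_prev
    # Backtrack from the end to the start, always moving to the cheapest predecessor.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(candidates, key=lambda ij: acc[ij])
        path.append((i - 1, j - 1))
    path.reverse()
    return acc[n, m], path

toy_local = np.array([[0.0, 1.0],
                      [1.0, 0.0]])
global_dist, best_path = dtw_sketch(toy_local)
```

Other choices, such as normalizing the global distance by the sequence lengths, are equally plausible and would change the numbers but not the structure of the recursion.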

For each pair of utterances in the data array:

1. compute the local Euclidean distances between MFCC vectors in the first and second utterance
2. compute the global distance between utterances with the dtw function you have written

Store the global pairwise distances in a matrix D (44×44).
Display the matrix with pcolormesh.

Compare distances within the same digit and across different digits.
Does the distance separate digits well even between different speakers?

Run hierarchical clustering on the distance matrix D using the `linkage` function from `scipy.cluster.hierarchy`.
Use the "complete" linkage method.
Display the results with the function `dendrogram` from the same library, and comment on them.
Use the `tidigit2labels` function (`lab1_tools.py`) to create labels to add to the dendrogram to simplify the interpretation of the results.
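The clustering step could be sketched as below, using a small hand-built distance matrix and made-up labels in place of D and the output of `tidigit2labels`. One detail worth noting: `linkage` treats a 2-D input as a set of observation vectors, so this sketch first converts the square matrix to condensed form with `squareform`, which assumes the matrix is symmetric with a zero diagonal.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

# Toy symmetric distance matrix: four items forming two tight pairs.
points = np.array([0.0, 1.0, 10.0, 11.0])
toy_D = np.abs(points[:, None] - points[None, :])
toy_labels = ["a_1", "a_2", "b_1", "b_2"]      # hypothetical labels

Z = linkage(squareform(toy_D), method="complete")

# no_plot=True only computes the layout; drop it to draw with matplotlib.
tree = dendrogram(Z, labels=toy_labels, no_plot=True)
```

In `Z`, each row describes one merge, and the third column holds the distance at which the merge happened.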