# Script for Training Gaussian Mixture Models for TIMIT

In [1]:
####################################################################################
### RUN THIS CELL FIRST ##########
### It suppresses warnings about memory leaks, deprecations and future changes.
### This is a brute-force approach.
### Do not run it when you are debugging code or testing new installations.
import os, warnings 
os.environ["OMP_NUM_THREADS"] = '2'  
warnings.filterwarnings("ignore")
####################################################################################

In [2]:
#!pip install git+https://github.com/compi1234/pyspch.git
try:
    import pyspch
except ModuleNotFoundError:
    # pyspch is missing: tell the user how to install it (e.g. on Google Colab)
    print(
        """
        To enable this notebook on platforms such as Google Colab,
        install the pyspch package and its dependencies by running the following code:

        !pip install git+https://github.com/compi1234/pyspch.git
        """
    )

In [2]:
%matplotlib inline
import io, os, sys
import logging

import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn import metrics as skmetrics 
from sklearn.neural_network import MLPClassifier
from IPython.display import display, HTML, Audio

# reproducibility 
np.random.seed(0)
logging.basicConfig(level=logging.INFO)

# print and plot
mpl.rcParams['figure.figsize'] = [7.,7.]
mpl.rcParams['font.size'] = 11
np.set_printoptions(precision=3)
cmap_jet2 = sns.mpl_palette("jet",60)[5:55]

# pyspch
import pyspch
from pyspch.stats import GMM
import pyspch.core as Spch

# classifier training/testing helpers and TIMIT data utilities
from pyspch.core.utils_clf import train_GMM, train_MLP, test_clf
from pyspch.core.utils_timit import load_timit_data, print_dataset_statistics, select_subset

## 1.  Get TIMIT Data and do Feature Extraction

#### TIMIT database & Feature Extraction
- we read the MEL24 mel filterbank features
- we postprocess these to obtain MFCC39 features (13 cepstra plus deltas and delta-deltas, mean and variance normalized)
- we will later select the first 13, 26 or 39 features in different setups

At this point all data is stored sequentially, sentence by sentence.
For our classification experiments (the main purpose of this notebook) we restructure the data
as one long list of frames (feature vectors) with a synchronized list of labels.
The classes (the set of all labels) are 41 phonetic symbols, i.e. the CMU-39 set plus silence and closure.    
The labels were obtained from the phonetic segmentations that come with the TIMIT database.
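
The restructuring described above can be illustrated with a minimal NumPy sketch. The toy `sentences` list and its layout are assumptions made for illustration only; this is not how `pyspch` stores or converts the data internally.

```python
import numpy as np

# Hypothetical toy corpus: each sentence is a (features, labels) pair, where
# features has shape (n_frames, n_features) and labels holds one label per frame.
sentences = [
    (np.zeros((3, 39)), np.array(["sil", "iy", "iy"])),
    (np.ones((2, 39)), np.array(["aa", "sil"])),
]

# Stack all frames into one matrix and all labels into one aligned vector,
# so that row i of X is described by y[i].
X = np.concatenate([feats for feats, _ in sentences], axis=0)
y = np.concatenate([labs for _, labs in sentences], axis=0)

print(X.shape, y.shape)
```

After this step the sentence boundaries are gone: a classifier sees every frame as an independent observation.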

In [5]:
def extract_Xy(data):
    X = data.get_features_as_numpy()
    y = data.get_labels_as_numpy()
    return(X,y)
#
alphabet = 'timit41'
timit41 = pyspch.timit.get_timit_alphabet('timit41')
ftrs = "mel24"  # "mel24" or "mel80" are available
modify_feature_args = {"Norm":"meanvar","Deltas":"delta_delta2","n_cep":13}
#
data = load_timit_data(corpus="train",ftrs=ftrs,alphabet=alphabet)
data.modify_features(modify_feature_args)
X_timit_train, y_timit_train = extract_Xy(data)
#
data = load_timit_data(corpus="test",ftrs=ftrs,alphabet=alphabet)
data.modify_features(modify_feature_args)
X_timit_test, y_timit_test = extract_Xy(data)

## 2. Statistics For the TIMIT Database
These are the numbers for the TRAIN database:   
Number of speakers 600  
Number of samples 1417087   
Minimum number of samples per class:  1218   
Maximum number of samples per class:  226626 

You will note that the data distribution is heavily skewed.
This is normal, as some phonemes are far more frequent than others.  
The average duration of a phoneme also plays a role, because each frame counts as one observation and simply takes its label from the TIMIT transcriptions.
Some phonemes thus have thousands of examples available for training, while others (e.g. 'zh') are only marginally represented.     
The biggest class is 'sil' with 200k+ samples; the biggest phone classes are 's' and 'ih' with 80k+ samples each. The smallest classes are 'b', 'uh' and 'zh' with fewer than 5000 samples each. Detailed counts per subdatabase and class are given below.
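
As a rough illustration of how such per-class counts and the degree of skew can be computed from a frame-level label vector, here is a small sketch. The toy vector `y` stands in for `y_timit_train`; this is not the code behind `print_dataset_statistics`.

```python
import numpy as np

# Toy label vector standing in for y_timit_train (one label per frame).
y = np.array(["sil"] * 4 + ["s"] * 2 + ["ih"] * 2 + ["zh"])

# Count how many frames carry each label.
labels, counts = np.unique(y, return_counts=True)
class_counts = dict(zip(labels.tolist(), counts.tolist()))

# Identify the smallest and largest class and the ratio between them.
smallest = min(class_counts, key=class_counts.get)
largest = max(class_counts, key=class_counts.get)
imbalance = class_counts[largest] / class_counts[smallest]

print(class_counts, smallest, largest, imbalance)
```

Applied to the training counts listed in this notebook (226626 frames for 'sil' against 1218 for 'zh'), the same ratio comes out at about 186 to 1.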

In [6]:
print_dataset_statistics(y_timit_train,Details=True,txt="Full Train Database")
print_dataset_statistics(y_timit_test,Details=True,txt="Full Test Database")

Statistics for Full Train Database:
Number of classes 41
Number of samples 1417087
Minimun/Maximum number of samples per class:  1218  /  226626
[('aa', 37481), ('ae', 59237), ('ah', 39140), ('ao', 36404), ('aw', 11723), ('ay', 34376), ('b', 3823), ('ch', 7116), ('cl', 137933), ('d', 8158), ('dh', 10640), ('eh', 35133), ('er', 51761), ('ey', 28955), ('f', 22828), ('g', 6031), ('hh', 14142), ('ih', 84270), ('iy', 62901), ('jh', 7063), ('k', 25064), ('l', 43515), ('m', 24859), ('n', 46996), ('ng', 8513), ('ow', 26853), ('oy', 11024), ('p', 11482), ('r', 39609), ('s', 84491), ('sh', 25828), ('sil', 226626), ('t', 28720), ('th', 6862), ('uh', 4077), ('uw', 26444), ('v', 12033), ('w', 20946), ('y', 11322), ('z', 31490), ('zh', 1218)]
Statistics for Full Test Database:
Number of classes 41
Number of samples 517845
Minimun/Maximum number of samples per class:  643  /  81852
[('aa', 14133), ('ae', 21072), ('ah', 14948), ('ao', 14233), ('aw', 3751), ('ay', 12613), ('b', 1562), ('ch', 2244), ('c

## Exercise 1:  Use the vow6-database

##### Downsampling and data selection
For most simple experiments in this notebook we will downsample the full database (typically by a factor of 10).   
For some experiments we also work with a subset of the phonemes only.
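
A minimal sketch of both operations, label-based selection and downsampling, is given here. The helper `select_frames` is hypothetical and is not the implementation of `select_subset`; in particular, taking every `step`-th frame is an assumption about how the downsampling could be done.

```python
import numpy as np

def select_frames(X, y, labels=None, step=1):
    """Hypothetical helper: keep frames whose label is in `labels`,
    then keep every `step`-th frame of what remains."""
    if labels is not None:
        mask = np.isin(y, labels)
        X, y = X[mask], y[mask]
    return X[::step], y[::step]

# Synthetic data standing in for the TIMIT frames and labels.
rng = np.random.default_rng(0)
X_all = rng.normal(size=(1000, 39))
y_all = rng.choice(np.array(["iy", "aa", "uw", "s", "sil"]), size=1000)

# Keep three vowels only and downsample by a factor of 10.
X_sub, y_sub = select_frames(X_all, y_all, labels=["iy", "aa", "uw"], step=10)
print(X_sub.shape, sorted(set(y_sub.tolist())))
```

Strided downsampling keeps the class proportions roughly intact; random sampling would be an equally valid choice.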

#### 1.1 Do the Reference Experiment

In this experiment we perform vowel recognition from frame data.  
The experiment is easy in the sense that the selected vowels are rather well distinguishable.   
The experiment is hard because each prediction is based on a single frame, which may lie well inside a vowel but also at its boundaries. 

We have entered quite reasonable default values in the next cell:
- static and dynamic MFCCs (26D) as features
- 8 mixture components in the GMM of each class
- 4 iterations for the GMM training (the KMeans initialization routine does almost all the work, so little is gained from iterating further)
- priors derived from the training set (set the priors variable to 'training' or 'uniform')
- with these default settings you should achieve an accuracy of approximately 68% (if you don't get this result, there is probably something wrong in your setup)
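
To make the role of the priors concrete, the sketch below applies the Bayes decision rule to a handful of made-up per-class log-likelihoods. The function `classify` and all numbers are illustrative assumptions; the sketch does not reproduce how `test_clf` handles its `priors` argument.

```python
import numpy as np

def classify(loglik, priors):
    """For each frame (row), pick argmax_k [ log p(x|k) + log P(k) ]."""
    return np.argmax(loglik + np.log(priors), axis=1)

# Made-up log-likelihoods log p(x|k) for 2 frames and 3 classes,
# as a per-class GMM would deliver them.
loglik = np.array([
    [-10.0, -10.5, -14.0],
    [-12.0, -11.0, -11.2],
])

# 'training' priors follow the class frequencies, 'uniform' priors ignore them.
train_counts = np.array([100, 800, 100])
priors_training = train_counts / train_counts.sum()
priors_uniform = np.full(3, 1.0 / 3)

pred_uniform = classify(loglik, priors_uniform)
pred_training = classify(loglik, priors_training)
print(pred_uniform, pred_training)
```

With uniform priors the first frame goes to class 0, which has the highest likelihood; with training priors the much more frequent class 1 wins instead. On skewed data such as TIMIT this choice can noticeably shift the frame accuracy.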

In [34]:
vow6=['iy','aa','uw','ih','eh','er']
plosives=['p','t','k','b','d','g']
timit41 = pyspch.timit.get_timit_alphabet('timit41')
#
def gmm_experiment(X_train,y_train,X_test,y_test,classes,ng=1,priors='training',Verbose=False):
    kwargs =dict(verbose=0,verbose_interval=1,max_iter=4,init_params='kmeans') 
    clf_GMM,acc_train = train_GMM(X_train, y_train,  classes=classes, n_components=ng, Verbose=Verbose,**kwargs)
    acc_test,cm = test_clf(clf_GMM,X_test, y_test,priors=priors,Verbose=Verbose)
    return(clf_GMM,acc_train,acc_test,cm)
    
def train_gmms(X_train,y_train,classes,ng=1):
    kwargs =dict(verbose=0,verbose_interval=1,max_iter=4,init_params='kmeans') 
    models = []
    for nftrs in [13,26,39]:
        clf,acc_train = train_GMM(X_train[:,0:nftrs], y_train,  classes=classes, n_components=ng, **kwargs)
        models.append(clf)
        print("Train Accuracy (#FEAT=%d): %.2f%%" %(nftrs,acc_train) )
    return(models)

In [23]:
classes=vow6
X_train, y_train = select_subset(X_timit_train,y_timit_train,labels=classes)
X_test, y_test = select_subset(X_timit_test,y_timit_test,labels=classes)
print_dataset_statistics(y_train,Details=True,txt="TRAIN")
print_dataset_statistics(y_test,Details=True,txt="TEST")

Statistics for TRAIN:
Number of classes 6
Number of samples 297990
Minimun/Maximum number of samples per class:  26444  /  84270
[('aa', 37481), ('eh', 35133), ('er', 51761), ('ih', 84270), ('iy', 62901), ('uw', 26444)]
Statistics for TEST:
Number of classes 6
Number of samples 108802
Minimun/Maximum number of samples per class:  7962  /  28815
[('aa', 14133), ('eh', 12793), ('er', 21134), ('ih', 28815), ('iy', 23965), ('uw', 7962)]


In [44]:
results = []
for nftrs in [13,26,39]:
    for ng in [1,4,16]:
        clf,acc_train,acc_test,cm = gmm_experiment(X_train[:,0:nftrs],y_train, X_test[:,0:nftrs],y_test,classes, ng=ng, Verbose=True)
        results.append([nftrs,ng,acc_train,acc_test,clf,cm])
results_df = pd.DataFrame([ r[0:4] for r in results],columns=['nftrs','ng','acc_train','acc_test'])

Training Set:  Accuracy = 61.49%
Training Set:  LL(per sample) = -16.06
Training Set:  BIC = 9572271.91
Accuracy = 62.35%
Training Set:  Accuracy = 62.51%
Training Set:  LL(per sample) = -15.19
Training Set:  BIC = 9062445.59
Accuracy = 63.05%
Training Set:  Accuracy = 64.10%
Training Set:  LL(per sample) = -14.54
Training Set:  BIC = 8695761.92
Accuracy = 64.21%
Training Set:  Accuracy = 62.95%
Training Set:  LL(per sample) = -33.87
Training Set:  BIC = 20189052.25
Accuracy = 63.38%
Training Set:  Accuracy = 65.84%
Training Set:  LL(per sample) = -31.83
Training Set:  BIC = 18983745.09
Accuracy = 66.03%
Training Set:  Accuracy = 69.12%
Training Set:  LL(per sample) = -30.29
Training Set:  BIC = 18114963.26
Accuracy = 69.16%
Training Set:  Accuracy = 60.38%
Training Set:  LL(per sample) = -47.66
Training Set:  BIC = 28408302.75
Accuracy = 60.92%
Training Set:  Accuracy = 63.47%
Training Set:  LL(per sample) = -43.84
Training Set:  BIC = 26150748.41
Accuracy = 63.60%
Training Set:  Accu

In [45]:
print('classes: ',classes)
print_dataset_statistics(y_train,Details=False,txt="TRAIN")
#print_dataset_statistics(y_test,Details=False,txt="TEST")
display(results_df)

classes:  ['iy', 'aa', 'uw', 'ih', 'eh', 'er']
Statistics for TRAIN:
Number of classes 6
Number of samples 297990
Minimun/Maximum number of samples per class:  26444  /  84270


|   | nftrs | ng | acc_train | acc_test |
|---|---|---|---|---|
| 0 | 13 | 1 | 61.49099 | 62.350876 |
| 1 | 13 | 4 | 62.514849 | 63.053988 |
| 2 | 13 | 16 | 64.102487 | 64.20654 |
| 3 | 26 | 1 | 62.951441 | 63.37935 |
| 4 | 26 | 4 | 65.839458 | 66.034632 |
| 5 | 26 | 16 | 69.116413 | 69.155898 |
| 6 | 39 | 1 | 60.384241 | 60.920755 |
| 7 | 39 | 4 | 63.468573 | 63.602691 |
| 8 | 39 | 16 | 67.379442 | 67.340674 |


In [46]:
# TRAIN/TEST for full TIMIT
X_train = X_timit_train
y_train = y_timit_train
X_test = X_timit_test
y_test = y_timit_test
classes = timit41
#
results = []
for nftrs in [13,26,39]:
    for ng in [1,8,64]:
        clf,acc_train,acc_test,cm = gmm_experiment(X_train[:,0:nftrs],y_train, X_test[:,0:nftrs],y_test,classes, ng=ng, Verbose=True)
        results.append([nftrs,ng,acc_train,acc_test,clf,cm])
results_df = pd.DataFrame([ r[0:4] for r in results],columns=['nftrs','ng','acc_train','acc_test'])

Training Set:  Accuracy = 43.93%
Training Set:  LL(per sample) = -15.70
Training Set:  BIC = 44498262.86
Accuracy = 43.75%
Training Set:  Accuracy = 46.48%
Training Set:  LL(per sample) = -14.50
Training Set:  BIC = 41213630.63
Accuracy = 45.91%
Training Set:  Accuracy = 49.55%
Training Set:  LL(per sample) = -13.88
Training Set:  BIC = 40339821.14
Accuracy = 46.82%
Training Set:  Accuracy = 47.30%
Training Set:  LL(per sample) = -33.53
Training Set:  BIC = 95047361.14
Accuracy = 46.79%
Training Set:  Accuracy = 53.18%
Training Set:  LL(per sample) = -30.64
Training Set:  BIC = 87075940.92
Accuracy = 52.44%
Training Set:  Accuracy = 59.47%
Training Set:  LL(per sample) = -29.00
Training Set:  BIC = 84150432.25
Accuracy = 56.36%
Training Set:  Accuracy = 45.50%
Training Set:  LL(per sample) = -50.70
Training Set:  BIC = 143726594.15
Accuracy = 44.94%
Training Set:  Accuracy = 51.21%
Training Set:  LL(per sample) = -46.24
Training Set:  BIC = 131428425.11
Accuracy = 50.47%
Training Set: 

In [47]:
#
print('classes: ',classes)
print_dataset_statistics(y_train,Details=False,txt="TRAIN")
print_dataset_statistics(y_test,Details=False,txt="TEST")
display(results_df)

classes:  ['aa', 'ae', 'ah', 'ao', 'aw', 'er', 'ay', 'b', 'ch', 'd', 'dh', 'eh', 'm', 'ng', 'ey', 'f', 'g', 'hh', 'ih', 'iy', 'jh', 'k', 'l', 'n', 'ow', 'oy', 'p', 'r', 's', 'sh', 't', 'th', 'uh', 'uw', 'v', 'w', 'y', 'z', 'zh', 'sil', 'cl']
Statistics for TRAIN:
Number of classes 41
Number of samples 1417087
Minimun/Maximum number of samples per class:  1218  /  226626
Statistics for TEST:
Number of classes 41
Number of samples 517845
Minimun/Maximum number of samples per class:  643  /  81852


|   | nftrs | ng | acc_train | acc_test |
|---|---|---|---|---|
| 0 | 13 | 1 | 43.928566 | 43.748033 |
| 1 | 13 | 8 | 46.476681 | 45.906401 |
| 2 | 13 | 64 | 49.553838 | 46.81536 |
| 3 | 26 | 1 | 47.302389 | 46.793925 |
| 4 | 26 | 8 | 53.181562 | 52.43963 |
| 5 | 26 | 64 | 59.472919 | 56.364549 |
| 6 | 39 | 1 | 45.498759 | 44.93835 |
| 7 | 39 | 8 | 51.20603 | 50.47437 |
| 8 | 39 | 64 | 57.649954 | 54.615377 |


### Saving Selected Results to disk

In [62]:
import pickle
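def save_pickle(data,filename):
    # the context manager closes the file even if pickling fails
    with open(filename, 'wb') as picklefile:
        pickle.dump(data, picklefile)
# results[8] is the last run (nftrs=39, ng=64); element 4 of each entry is the trained classifier
GMM39_64 = results[8][4]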
save_pickle(GMM39_64,"S41_D39_G64_FULL.pkl")
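
A saved model can be read back with the matching `pickle.load` call. The sketch below shows a hypothetical `load_pickle` counterpart and checks the round trip on a placeholder dictionary instead of a real classifier, so that it runs without `pyspch`.

```python
import os
import pickle
import tempfile

def load_pickle(filename):
    """Hypothetical counterpart of save_pickle: read one pickled object."""
    with open(filename, "rb") as picklefile:
        return pickle.load(picklefile)

# Round trip with a placeholder object standing in for the trained model.
payload = {"nftrs": 39, "ng": 64}
path = os.path.join(tempfile.mkdtemp(), "demo.pkl")
with open(path, "wb") as picklefile:
    pickle.dump(payload, picklefile)

restored = load_pickle(path)
print(restored)
```

Unpickling the real classifier requires that `pyspch` and its dependencies are importable in the loading environment. Only unpickle files from sources you trust, since `pickle.load` can execute arbitrary code.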