**EIE558 Speech Recognition Lab (Part 1): Spoken-Digit Recognition**

In this lab, you will train and evaluate a CNN model that comprises several 1-D CNN layers for spoken digit recognition. By default, the input to the CNN is an MFCC matrix of size *C* x *T*, where *C* is the number of MFCC coefficients per frame and *T* is the number of frames.
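
To see why *T* differs from utterance to utterance, the following minimal sketch computes the size of the MFCC matrix for two durations. The 16 kHz sampling rate, the 10 ms hop, and the centered-framing convention (one frame per hop, plus one) are assumptions made for illustration; the actual values are set in "digitrec.py" and may differ.

```python
def num_frames(n_samples, hop_length):
    """Frame count under centered framing: one frame per hop, plus one."""
    return 1 + n_samples // hop_length

sample_rate = 16000   # assumed sampling rate (Hz), not taken from the lab code
hop_length = 160      # assumed hop of 10 ms at 16 kHz
n_mfcc = 20           # C; matches the n_mfcc=20 used when the test set is loaded

for seconds in (0.5, 1.0):
    n_samples = int(seconds * sample_rate)
    T = num_frames(n_samples, hop_length)
    print(f"{seconds:.1f} s -> MFCC matrix of size {n_mfcc} x {T}")
```

Longer utterances give more frames, so *T* varies unless the utterances are padded or truncated to a common length.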

Two pooling methods are available for converting frame-based features to utterance-based features: adaptive average pooling and statistics pooling. The former uses PyTorch's AdaptiveAvgPool2d() to average the last convolutional layer's activations across the frame axis. The latter concatenates the mean and the standard deviation of the activations across frames, as is commonly done in the x-vector network. If no pooling method is used, all utterances must have the same number of frames so that the number of nodes after flattening is identical for all utterances.
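
The effect of the three options on the output size can be sketched as follows. This is an illustration only, not the code in "model.py": the activation tensor is random, its shape (4 utterances, 64 channels) is hypothetical, and the sketch uses the 1-D pooling layer `nn.AdaptiveAvgPool1d` for simplicity.

```python
import torch
import torch.nn as nn

def adaptive_avg_pool(h):
    # h: (batch, channels, frames) -> (batch, channels)
    return nn.AdaptiveAvgPool1d(1)(h).squeeze(-1)

def stats_pool(h):
    # Concatenate the per-channel mean and standard deviation over frames
    # -> (batch, 2 * channels)
    return torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)

def flatten(h):
    # No pooling: (batch, channels, frames) -> (batch, channels * frames)
    return h.flatten(start_dim=1)

shapes = {}
for T in (50, 80):                 # two different numbers of frames
    h = torch.randn(4, 64, T)      # hypothetical last-layer activations
    shapes[T] = (tuple(adaptive_avg_pool(h).shape),
                 tuple(stats_pool(h).shape),
                 tuple(flatten(h).shape))
    print(f"T={T}: avg {shapes[T][0]}, stats {shapes[T][1]}, flatten {shapes[T][2]}")
```

The two pooling methods give an output whose size does not depend on *T*, whereas the flattened size grows with *T*.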

<font color="green">*Step 1: Prepare environment*</font>

In [None]:
# If you use Colab, run this cell to mount Google Drive
from google.colab import drive
drive.mount('/content/drive')
!mkdir -p /content/drive/MyDrive/Learning/EIE558/asr
%cd /content/drive/MyDrive/Learning/EIE558/asr

In [None]:
# Check the version of PyTorch
import torch
print(torch.__version__)

<font color="green">*Step 2: Download programs and data. If the 'python-asr' directory exists and is empty, delete it and run this step again.*</font> <font color="red">*If the website http://bioinfo.eie.polyu.edu.hk is too slow or busy, you can find the files [here](https://polyuit-my.sharepoint.com/:f:/g/personal/enmwmak_polyu_edu_hk/EpX3v5ykT_VLoiBa8jrpJ70B52X4XbEPQcyrDnLAquEcIA?e=d5Xjrv).*</font>



In [None]:
%%bash
pwd
dir="python-asr"
if [ ! -d "$dir" ]; then
  echo "Directory $dir does not exist. Downloading ${dir}.tgz"
  wget http://bioinfo.eie.polyu.edu.hk/download/EIE558/asr/${dir}.tgz;
  tar zxf ${dir}.tgz;
  rm -f ${dir}.tgz*;
else
  echo "Directory $dir already exists"
fi

In [None]:
# If you run this notebook file on your own computer, run this cell
%cd python-asr
!ls

In [None]:
# If you run this notebook file on Colab, run this cell
%cd /content/drive/MyDrive/Learning/EIE558/asr/python-asr/
!ls

<font color="green">*Download datasets (532 MBytes). If the 'data' directory exists and is empty, delete it and run this step again.*</font> <font color="red">*If the website http://bioinfo.eie.polyu.edu.hk is too slow or busy, you can find the files [here](https://polyuit-my.sharepoint.com/:f:/g/personal/enmwmak_polyu_edu_hk/EpX3v5ykT_VLoiBa8jrpJ70B52X4XbEPQcyrDnLAquEcIA?e=d5Xjrv). This step will take a while.*</font>

In [None]:
%%bash
dir="data"
if [ ! -d "$dir" ]; then
  echo "Directory $dir does not exist. Downloading ${dir}.zip"
  wget http://bioinfo.eie.polyu.edu.hk/download/EIE558/asr/${dir}.zip;
  unzip -o ${dir}.zip;
  rm -f ${dir}.zip*;
else
  echo "Directory $dir already exists"
fi

<font color="green">*Step 3: Train a CNN model. Training may take several hours if you use all of the training data in the list file "data/digits/train.lst". If you only want to obtain the test accuracy, you may use the pre-trained models in the folder "models/". Read the files "digitrec.py" and "model.py" to see how to implement a CNN for spoken digit recognition. If you want to train your own models, you may modify the file "digitrec.py" such that "data/digits/train.lst" is replaced by "data/digits/short_train.lst" and "data/digits/test.lst" is replaced by "data/digits/short_test.lst". With these modifications, it will take about 30 minutes to train a network, but the accuracy will be lower.*</font>

In [None]:
# Make sure that you are still under the folder 'python-asr'
!pwd

In [None]:
# Create reduced training and test sets (the first 2000 and 500 list entries)
# to shorten training and test time
!head -n 2000 data/digits/train.lst > data/digits/short_train.lst
!head -n 500 data/digits/test.lst > data/digits/short_test.lst
!mkdir -p models/mymodels
!python3 digitrec.py --pool_method stats --model_file models/mymodels/spokendigit_cnn_stats.pth

<font color="green">*Step 4: Load the trained model (or the pre-trained model) and evaluate it*</font>

In [None]:
# Define the prediction function, which takes a DataLoader object that
# comprises the test data as input
from digitrec import get_default_device

@torch.no_grad()
def predict_dl(model, dl):
    """Return (predicted class indices, targets) for all batches in dl."""
    device = get_default_device()
    torch.cuda.empty_cache()
    model.eval()  # inference behaviour for layers such as dropout and batch norm
    batch_probs = []
    batch_targ = []
    for xb, yb in dl:
        xb = xb.float().to(device)
        probs = model(xb)
        batch_probs.append(probs.cpu())
        batch_targ.append(yb.float().cpu())
    batch_probs = torch.cat(batch_probs)
    batch_targ = torch.cat(batch_targ)
    # The predicted class is the index of the largest output in each row
    return batch_probs.argmax(dim=1).tolist(), batch_targ

In [None]:
# Load the trained model
from digitrec import get_default_device
from model import CNNModel
device = get_default_device()
model = CNNModel(pool_method='stats').to(device)
# map_location lets a model saved on a GPU be loaded on a CPU-only machine
model.load_state_dict(torch.load('models/mymodels/spokendigit_cnn_stats.pth',
                                 map_location=device))

In [None]:
# Evaluate the loaded model
from torch.utils.data import Dataset, DataLoader
from digitrec import SpeechDataset, evaluate
test_set = SpeechDataset(filelist='data/digits/short_test.lst', rootdir='data/digits', n_mfcc=20)
test_dl = DataLoader(test_set, batch_size=64, shuffle=False, num_workers=8, pin_memory=True)
r = evaluate(model, test_dl)
yp, yt = predict_dl(model, test_dl)
print("Loss: ", r['loss'], "\nAccuracy: ", r['accuracy'])

In [None]:
# Load the pre-trained model that uses statistics pooling in its embedding layer.
from digitrec import get_default_device
from model import CNNModel
device = get_default_device()
model = CNNModel(pool_method='stats').to(device)
model.load_state_dict(torch.load('models/spokendigit_cnn_stats.pth', 
                                 map_location=device))

In [None]:
# Evaluate the loaded model
r = evaluate(model, test_dl)
yp, yt = predict_dl(model, test_dl)
print("Loss: ", r['loss'], "\nAccuracy: ", r['accuracy'])

In [None]:
# Load the pre-trained model that uses adaptive average pooling in its embedding layer.
model = CNNModel(pool_method='adapt').to(device)
model.load_state_dict(torch.load('models/spokendigit_cnn_adapt.pth', 
                                 map_location=device))

In [None]:
# Evaluate the loaded model
r = evaluate(model, test_dl)
yp, yt = predict_dl(model, test_dl)
print("Loss: ", r['loss'], "\nAccuracy: ", r['accuracy'])

In [None]:
# Load the pre-trained model that uses flattening in its embedding layer.
model = CNNModel(pool_method='none').to(device)
model.load_state_dict(torch.load('models/spokendigit_cnn_none.pth', 
                                 map_location=device))

In [None]:
# Evaluate the loaded model
r = evaluate(model, test_dl)
yp, yt = predict_dl(model, test_dl)
print("Loss: ", r['loss'], "\nAccuracy: ", r['accuracy'])

<font color="blue">*Explain the performance difference between (1) CNN with statistics pooling, (2) CNN with average pooling, and (3) CNN with flattening*</font>
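
The predictions `yp` and targets `yt` returned by `predict_dl()` can be summarized in a confusion matrix, which shows which digits each model confuses and may help with the comparison. The sketch below is a minimal pure-Python version; it assumes that the labels are the integers 0 to 9 and that `yt` is one-dimensional. The toy lists `yp_demo` and `yt_demo` are made up for illustration.

```python
def confusion_matrix(y_pred, y_true, n_classes=10):
    """Rows index the true digit, columns index the predicted digit."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for p, t in zip(y_pred, y_true):
        cm[int(t)][int(p)] += 1
    return cm

def accuracy_from_cm(cm):
    """Fraction of samples on the diagonal of the confusion matrix."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total if total else 0.0

# Toy example: one utterance of digit 3 is misrecognized as digit 2
yp_demo = [0, 1, 2, 2, 9]
yt_demo = [0, 1, 2, 3, 9]
cm = confusion_matrix(yp_demo, yt_demo)
print("Accuracy:", accuracy_from_cm(cm))
```

After any of the evaluation cells above, `confusion_matrix(yp, yt)` gives the table for the model that was just evaluated.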

<font color="green">*Step 5: Vary the kernel size. Increase the kernel size in "model.py" to 7 (or even larger) and repeat Step 4. Record the test loss and accuracy. Reduce the kernel size to 1 and observe the results. Can the CNN still capture the temporal characteristics in the MFCCs when kernel_size=1? Explain your answer.*</font> <font color="red">*If the model remains unchanged even after you have saved the file "model.py", you may reset the runtime by selecting "Runtime", followed by "Reset runtime".*</font>
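
When comparing kernel sizes, it can help to know how many input frames influence one output frame of the last convolutional layer (the receptive field). The sketch below computes it for a stack of 1-D convolutions. It assumes four convolutional layers with stride 1, no dilation, and no pooling between them; check "model.py" for the actual settings.

```python
def receptive_field(kernel_sizes, strides=None, dilations=None):
    """Receptive field (in input frames) of a stack of 1-D convolutions."""
    n = len(kernel_sizes)
    strides = strides or [1] * n
    dilations = dilations or [1] * n
    rf, jump = 1, 1   # jump = spacing of this layer's outputs in input frames
    for k, s, d in zip(kernel_sizes, strides, dilations):
        rf += (k - 1) * d * jump
        jump *= s
    return rf

# Four layers (assumed) that all use the same kernel size
for k in (1, 3, 7):
    print(f"kernel_size={k}: receptive field = {receptive_field([k] * 4)} frame(s)")
```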


<font color="green">*Step 6: Reduce the depth of the network by removing conv2, conv3, and conv4 in "model.py". After the change, the network has only one convolutional layer. Observe the performance of the network. Note that large and deep networks may not necessarily produce better results, especially when the amount of training data is limited.*</font>

In [None]:
from model import CNNModel
model = CNNModel(pool_method='adapt')
print(model)
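
Besides printing the architecture, counting the trainable parameters is a quick way to compare the shallow and deep networks. The sketch below uses two hypothetical stand-in networks, not the lab's `CNNModel`; their channel counts and kernel sizes are chosen only for illustration.

```python
import torch.nn as nn

def count_parameters(module):
    """Total number of trainable parameters in a module."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

# Hypothetical stand-ins for a one-layer and a four-layer 1-D CNN
shallow = nn.Sequential(nn.Conv1d(20, 32, kernel_size=3), nn.ReLU())
deep = nn.Sequential(
    nn.Conv1d(20, 32, kernel_size=3), nn.ReLU(),
    nn.Conv1d(32, 32, kernel_size=3), nn.ReLU(),
    nn.Conv1d(32, 32, kernel_size=3), nn.ReLU(),
    nn.Conv1d(32, 32, kernel_size=3), nn.ReLU(),
)

n_shallow = count_parameters(shallow)
n_deep = count_parameters(deep)
print("shallow:", n_shallow, "deep:", n_deep)
```

The same function can be applied to the lab's network with `count_parameters(model)` before and after editing "model.py".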

In [None]:
!python3 digitrec.py --pool_method stats --model_file models/mymodels/spokendigit_resnet_stats.pth