**EIE558 Speech Recognition Lab (Part 1): Spoken-Digit Recognition**

In this lab, you will train and evaluate a CNN that comprises several 1-D convolutional layers for spoken digit recognition. By default, the input to the CNN is an MFCC matrix of size *C* × *T*, where *C* is the number of MFCC coefficients per frame and *T* is the number of frames.

Two pooling methods are available for converting frame-based features to utterance-based features: adaptive average pooling and statistics pooling. The former uses PyTorch's AdaptiveAvgPool2d() to average the last convolutional layer's activations across the frame axis. The latter concatenates the mean and the standard deviation of the activations across frames, as is commonly done in x-vector networks. If no pooling method is used, every utterance must have the same number of frames so that the number of nodes after flattening is identical for all utterances.
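The two pooling methods can be sketched in a few lines. This is a minimal illustration with made-up tensor sizes, not the code in "model.py". It assumes the activations are 3-D (batch, channels, frames) and therefore uses the 1-D variant, AdaptiveAvgPool1d.

```python
import torch
import torch.nn as nn

# Made-up activations of the last convolutional layer:
# 4 utterances, 64 channels, 50 frames (batch, channels, frames).
x = torch.randn(4, 64, 50)

# Adaptive average pooling: average each channel over the frame axis.
avg_pool = nn.AdaptiveAvgPool1d(1)
avg_feat = avg_pool(x).squeeze(-1)                              # shape (4, 64)

# Statistics pooling: concatenate per-channel mean and standard deviation.
stats_feat = torch.cat([x.mean(dim=2), x.std(dim=2)], dim=1)    # shape (4, 128)

# A longer utterance gives features of the same size after pooling.
x_long = torch.randn(4, 64, 80)
avg_feat_long = avg_pool(x_long).squeeze(-1)                    # shape (4, 64)

print(avg_feat.shape, stats_feat.shape, avg_feat_long.shape)
```

Both methods remove the frame axis, which is why utterances of different lengths can share the same fully connected layers. Statistics pooling doubles the feature dimension because it keeps the standard deviation as well as the mean.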

<font color="green">*Step 1: Install PyTorch*</font>

In [2]:
from google.colab import drive
drive.mount('/content/drive')
!mkdir -p /content/drive/MyDrive/Learning/EIE558
%cd /content/drive/MyDrive/Learning/EIE558/

Mounted at /content/drive
/content/drive/MyDrive/Learning/EIE558


In [None]:
# Create the working directory. Skip this step if the 'EIE558' directory already exists.
!mkdir -p /content/drive/MyDrive/Learning/EIE558

In [None]:
# Go to working directory.
%cd /content/drive/MyDrive/Learning/EIE558/

In [3]:
# Make sure that a GPU is used: click "Edit" --> "Notebook settings" and select a GPU.
!pip3 install torch==1.5.1 torchaudio==0.5 -f https://download.pytorch.org/whl/cu101/torch_stable.html

Looking in links: https://download.pytorch.org/whl/cu101/torch_stable.html
Collecting torch==1.5.1
  Downloading https://download.pytorch.org/whl/cu101/torch-1.5.1%2Bcu101-cp37-cp37m-linux_x86_64.whl (704.4MB)
     |████████████████████████████████| 704.4MB 25kB/s 
Collecting torchaudio==0.5
  Downloading https://files.pythonhosted.org/packages/56/22/f9b9448cd7298dbe2adb428a1527dd4b3836275337da6f34da3efcd12798/torchaudio-0.5.0-cp37-cp37m-manylinux1_x86_64.whl (3.2MB)
     |████████████████████████████████| 3.2MB 10.6MB/s 
ERROR: torchvision 0.9.0+cu101 has requirement torch==1.8.0, but you'll have torch 1.5.1+cu101 which is incompatible.
ERROR: torchtext 0.9.0 has requirement torch==1.8.0, but you'll have torch 1.5.1+cu101 which is incompatible.
ERROR: torchaudio 0.5.0 has requirement torch==1.5.0, but you'll have torch 1.5.1+cu101 which is incompatible.
Installing collected packages: torch, torchaudio
  Found existing installation: to

<font color="green">*Step 2: Download data.*</font> <font color="red">*If the website http://bioinfo.eie.polyu.edu.hk is too slow or busy, you may find the files [here](https://polyuit-my.sharepoint.com/:f:/g/personal/enmwmak_polyu_edu_hk/EpX3v5ykT_VLoiBa8jrpJ70B52X4XbEPQcyrDnLAquEcIA?e=d5Xjrv).*</font>



In [5]:
%%shell
# Download the lab package. If the 'python-asr' directory exists but is empty,
# delete it and run this step again.
pwd
dir="python-asr" 
if [ ! -d $dir ]; then
  echo "Directory $dir does not exist. Downloading ${dir}.tgz"
  wget http://bioinfo.eie.polyu.edu.hk/download/EIE558/asr/${dir}.tgz;
  unzip -o ${dir}.tgz;
  rm -f ${dir}.tgz*;
else
  echo "Directory $dir already exist"
fi

/content/drive/MyDrive/Learning/EIE558
Directory python-asr already exist




In [3]:
%cd /content/drive/MyDrive/Learning/EIE558/python-asr/
!ls

/content/drive/MyDrive/Learning/EIE558/python-asr
data  digitrec.py  model.py  models  __pycache__  short_test.lst  sphrec.py


In [None]:
%%shell
dir="data" 
if [ ! -d $dir ]; then
  echo "Directory $dir does not exist. Downloading ${dir}.zip"
  wget http://bioinfo.eie.polyu.edu.hk/download/EIE558/asr/${dir}.zip;
  unzip -o ${dir}.zip;
  rm -f ${dir}.zip*;
else
  echo "Directory $dir already exist"
fi

<font color="green">*Step 3: Train a CNN model. Training may take several hours if you use all of the training data listed in "data/digits/train.lst". If you only want to obtain the test accuracy, you may use the pre-trained models in the folder "models/" instead. Read "digitrec.py" and "model.py" to see how a CNN for spoken digit recognition is implemented. If you want to train your own models, you may modify "digitrec.py" so that "data/digits/train.lst" is replaced by "data/digits/short_train.lst" and "data/digits/test.lst" is replaced by "data/digits/short_test.lst". With these modifications, training takes about 30 minutes, but the accuracy is lower.*</font>
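The shortened list files can also be created in Python rather than with shell commands. The sketch below assumes each list file holds one utterance per line; `make_short_list` is a hypothetical helper and is not part of "digitrec.py".

```python
from pathlib import Path

def make_short_list(src, dst, n_lines):
    """Copy the first n_lines lines of src to dst and return how many were written."""
    with open(src) as f:
        lines = f.readlines()[:n_lines]
    Path(dst).write_text("".join(lines))
    return len(lines)

# Example calls with the lab's file names (uncomment inside the python-asr directory):
# make_short_list("data/digits/train.lst", "data/digits/short_train.lst", 2000)
# make_short_list("data/digits/test.lst", "data/digits/short_test.lst", 500)
```

Taking the first lines of a list is simple, but the resulting subset follows the original ordering of the list file, so it is worth checking that all ten digits are still represented.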

In [5]:
%cd /content/drive/MyDrive/Learning/EIE558/python-asr
!head -n 2000 data/digits/train.lst > data/digits/short_train.lst
!head -n 500 data/digits/test.lst > data/digits/short_test.lst
!mkdir -p models/mymodels
!python3 digitrec.py --pool_method stats --model_file models/mymodels/spokendigit_cnn_stats.pth

/content/drive/MyDrive/Learning/EIE558/python-asr
Epoch  0
100% 32/32 [01:04<00:00,  2.03s/it]
Last lr:  0.00027089460458501675  Train_loss:  2.319438934326172  Val_loss:  2.272307872772217  Accuracy: 8.62%
Epoch  1
100% 32/32 [01:04<00:00,  2.03s/it]
Last lr:  0.0007554032818386438  Train_loss:  2.190788745880127  Val_loss:  2.1043598651885986  Accuracy: 10.82%
Epoch  2
100% 32/32 [01:04<00:00,  2.02s/it]
Last lr:  0.001  Train_loss:  1.9589784145355225  Val_loss:  1.9117239713668823  Accuracy: 39.63%
Epoch  3
100% 32/32 [01:05<00:00,  2.03s/it]
Last lr:  0.0009504846320134736  Train_loss:  1.8161309957504272  Val_loss:  1.8114361763000488  Accuracy: 59.90%
Epoch  4
100% 32/32 [01:05<00:00,  2.05s/it]
Last lr:  0.000811745653949763  Train_loss:  1.7427036762237549  Val_loss:  1.756650686264038  Accuracy: 66.98%
Epoch  5
100% 32/32 [01:05<00:00,  2.04s/it]
Last lr:  0.0006112620219362892  Train_loss:  1.7039320468902588  Val_loss:  1.7252854108810425  Accuracy: 73.75%
Epoch  6
100% 32/

<font color="green">*Step 4: Load the trained model (or the pre-trained model) and evaluate it.*</font>

In [3]:
%cd /content/drive/MyDrive/Learning/EIE558/python-asr
!ls -F data

/content/drive/MyDrive/Learning/EIE558/python-asr
digits/  noise/  speech/  text/


In [4]:
# Load a model. The example below loads a pre-trained model that uses adaptive average pooling.
%cd /content/drive/MyDrive/Learning/EIE558/python-asr
!ls models
from model import CNNModel
import torch
DEVICE = torch.device('cuda')
model = CNNModel(pool_method='adapt').to(DEVICE)
model.load_state_dict(torch.load('models/spokendigit_cnn_adapt.pth'))

/content/drive/MyDrive/Learning/EIE558/python-asr
mymodels		   spokendigit_cnn_none.pth
spokendigit_cnn_adapt.pth  spokendigit_cnn_stats.pth


<All keys matched successfully>
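The cell above assumes that a GPU is available. On a CPU-only runtime, moving the model to `torch.device('cuda')` fails. The sketch below shows one common fallback pattern. It uses a tiny stand-in network (`TinyNet`, hypothetical) and a temporary file instead of `CNNModel` and the lab's model files, so that it runs on its own.

```python
import os
import tempfile
import torch
import torch.nn as nn

# Pick the GPU when one is available, otherwise fall back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

class TinyNet(nn.Module):
    """Stand-in for CNNModel (hypothetical, for illustration only)."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(20, 10)

    def forward(self, x):
        return self.fc(x)

# Save a state dict, then reload it onto whichever device was selected.
path = os.path.join(tempfile.mkdtemp(), 'tiny.pth')
torch.save(TinyNet().state_dict(), path)

model = TinyNet().to(device)
state = torch.load(path, map_location=device)   # remap tensors saved on another device
result = model.load_state_dict(state)
model.eval()   # switch off training-only behaviour such as dropout
print(result)
```

With the lab's files, the same pattern would be `model = CNNModel(pool_method='adapt').to(device)` followed by `model.load_state_dict(torch.load('models/spokendigit_cnn_adapt.pth', map_location=device))`.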

In [5]:
@torch.no_grad()
def predict_dl(model, dl):
    """Return (predicted class indices, targets) for all utterances in dl."""
    torch.cuda.empty_cache()
    batch_probs = []
    batch_targ = []
    for xb, yb in dl:
        xb = xb.float().to(DEVICE)
        probs = model(xb)                    # one row of class scores per utterance
        # No .detach() needed: gradients are disabled by @torch.no_grad().
        batch_probs.append(probs.cpu())
        batch_targ.append(yb.float().cpu())
    batch_probs = torch.cat(batch_probs)
    batch_targ = torch.cat(batch_targ)
    # The predicted digit is the index of the largest score in each row.
    return batch_probs.argmax(dim=1).tolist(), batch_targ

In [6]:
!head -n 500 data/digits/test.lst > data/digits/short_test.lst
from torch.utils.data import Dataset, DataLoader
from digitrec import SpeechDataset, evaluate
test_set = SpeechDataset(filelist='data/digits/short_test.lst', rootdir='data/digits', n_mfcc=20)
test_dl = DataLoader(test_set, batch_size=64, shuffle=False, num_workers=16, pin_memory=True)
r = evaluate(model, test_dl)
yp, yt = predict_dl(model, test_dl)
print("Loss: ", r['loss'], "\nAccuracy: ", r['accuracy'])



Loss:  1.737992286682129 
Accuracy:  0.6930589079856873
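The predictions `yp` and targets `yt` returned by `predict_dl()` are not used further in the cell above. One way to inspect them is a per-digit confusion matrix. The sketch below is plain Python with made-up labels. `confusion_matrix` and `accuracy` are hypothetical helpers, and the code assumes that the targets are class indices from 0 to 9.

```python
def confusion_matrix(pred, target, n_classes=10):
    """Return an n_classes x n_classes table; rows are true labels, columns are predictions."""
    table = [[0] * n_classes for _ in range(n_classes)]
    for p, t in zip(pred, target):
        table[int(t)][int(p)] += 1
    return table

def accuracy(pred, target):
    """Fraction of utterances whose predicted label equals the true label."""
    correct = sum(int(p) == int(t) for p, t in zip(pred, target))
    return correct / len(pred)

# Made-up example: six utterances, one of which is misrecognized (a 4 heard as a 9).
yp_demo = [0, 1, 2, 3, 9, 5]
yt_demo = [0, 1, 2, 3, 4, 5]
cm = confusion_matrix(yp_demo, yt_demo)
print(accuracy(yp_demo, yt_demo))
print(cm[4])   # row for true digit 4
```

With the lab's variables, the call would be `confusion_matrix(yp, yt.tolist())`, again assuming that `yt` holds class indices. Large off-diagonal entries show which digit pairs the model confuses most often.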


In [None]:
# Load the model that was trained on the trimmed dataset in Step 3. The example
# below is a model using statistics pooling in its embedding layer.
%cd /content/drive/MyDrive/Learning/EIE558/python-asr
!ls models
from model import CNNModel
import torch
DEVICE = torch.device('cuda')
model = CNNModel(pool_method='stats').to(DEVICE)
model.load_state_dict(torch.load('models/mymodels/spokendigit_cnn_stats.pth'))

In [None]:
!head -n 500 data/digits/test.lst > data/digits/short_test.lst
from torch.utils.data import Dataset, DataLoader
from digitrec import SpeechDataset, evaluate
test_set = SpeechDataset(filelist='data/digits/short_test.lst', rootdir='data/digits', n_mfcc=20)
test_dl = DataLoader(test_set, batch_size=64, shuffle=False, num_workers=16, pin_memory=True)
r = evaluate(model, test_dl)
yp, yt = predict_dl(model, test_dl)
print("Loss: ", r['loss'], "\nAccuracy: ", r['accuracy'])

<font color="green">*Step 5: Vary the kernel size. Increase the kernel size in "model.py" to 7 (or even larger) and repeat Step 3 and Step 4. Record the test loss and accuracy. Then reduce the kernel size to 1 and observe the results. Can the CNN still capture the temporal characteristics in the MFCCs when kernel_size=1? Explain your answer.*</font>
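When comparing kernel sizes, it can help to work out how many input frames each output frame depends on. For a stack of 1-D convolutions with stride 1 and dilation 1, the receptive field is 1 plus the sum of (kernel size − 1) over the layers. The sketch below assumes four such layers (conv1 to conv4) with no striding or pooling between them; check "model.py" for the actual configuration.

```python
def receptive_field(kernel_sizes):
    """Receptive field (in frames) of stacked 1-D convolutions with stride 1 and dilation 1."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

# Same kernel size in all four (assumed) convolutional layers.
for k in (1, 3, 7):
    print(f"kernel_size={k}: receptive field = {receptive_field([k] * 4)} frames")
```

Under these assumptions, the receptive field is 1 frame for kernel_size=1, 9 frames for kernel_size=3, and 25 frames for kernel_size=7.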

<font color="green">*Step 6: Reduce the depth of the network by removing conv2, conv3, and conv4 in "model.py", so that the network has only one convolutional layer. Observe the performance of the network. Note that larger and deeper networks do not necessarily produce better results, especially when the amount of training data is limited.*</font>
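To see how depth affects model size, recall that a 1-D convolutional layer has out_channels × in_channels × kernel_size weights plus out_channels biases. The sketch below counts the parameters for a four-layer and a one-layer stack. The channel widths are hypothetical (only the 20 MFCC input channels come from this notebook); substitute the values from "model.py".

```python
def conv1d_params(in_ch, out_ch, kernel_size):
    """Weights plus biases of one 1-D convolutional layer."""
    return out_ch * in_ch * kernel_size + out_ch

def stack_params(channels, kernel_size):
    """Total parameters of consecutive conv layers with the given channel widths."""
    return sum(conv1d_params(i, o, kernel_size)
               for i, o in zip(channels[:-1], channels[1:]))

# Hypothetical widths: 20 MFCCs in, then 64, 128, 256, and 512 channels.
deep = stack_params([20, 64, 128, 256, 512], kernel_size=3)
shallow = stack_params([20, 64], kernel_size=3)
print("four conv layers:", deep)
print("one conv layer:  ", shallow)
```

With these made-up widths, the four-layer stack has 520,896 convolutional parameters and the single layer has 3,904. The count excludes the fully connected layers, whose input size also changes when layers are removed.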