# mini-project: Automatic Speech Recognition

Automatic Speech Recognition (ASR) is probably the most famous task in speech processing. The task itself is simple - we need to build an *automatic* system that transcribes an audio clip into a sequence of words in a certain language, but the model design and the details can be very complicated.

As you have learned in the course, conventional ASR systems contain multiple components. We can broadly categorize them into an ***acoustic model*** and a ***language model***, where the acoustic model maps the waveform (or any time-frequency representation) to a sequence of phonemes, and the language model learns the temporal dependencies between phonemes (phoneme-level language model) or words (word-level language model). Prior to the advances in deep learning, acoustic modeling was typically done with HMMs and Gaussian mixture models (HMM-GMMs), and language modeling was typically done with statistical methods such as N-grams. The acoustic model (with phoneme labels) and the language model are first trained independently and then combined to form the full pipeline. Implementing such pipelines can be complex and difficult without prior experience, so we will not ask you to do so in this course.

With the development of neural networks, the acoustic modeling part (HMM-GMM) can be well replaced by more advanced neural networks (HMM-DNN). Moreover, given a pretrained language model, the DNN acoustic model can be trained to maximize its performance on the given language model, which is something that GMMs cannot do. Sometimes people call this **end-to-end training**, as the DNN acoustic models are trained to maximize the ASR performance (i.e., minimize the word-error-rate (WER)) instead of phoneme recognition accuracy.
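
For reference, WER is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. Below is a minimal sketch, assuming whitespace tokenization; the function name `word_error_rate` is our own and is not part of any library used in this notebook.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dist[i][j]: edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution or match
    return dist[len(ref)][len(hyp)] / max(len(ref), 1)


# one substituted word out of three reference words
print(word_error_rate("THE SAME TERMS", "THE SAME TERM"))
```

Note that WER can exceed 1 when the hypothesis contains many inserted words.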

Sequence-to-sequence (Seq2Seq) models provide another view of ASR - combine the acoustic model and the language model within the same neural network, and directly generate the character sequence. As the name implies, Seq2Seq models map one sequence to another sequence, and this is done by a deep neural network. In the ASR task, the input sequence is the waveform (or spectrogram), and the output sequence is the words. In this pipeline, we don't really have a clear definition of "acoustic model" and "language model"; what we have is a neural network that takes the waveform (or MFCCs) as input and generates the probabilities of different characters in the output sequence. Such pipelines are not only end-to-end trained, but also **end-to-end processed**, since the language model part is now jointly optimized and the network directly generates the output characters. This is the so-called ***E2E ASR***, which is also one of the most popular research directions in the development of ASR systems. Several applications, e.g., Google voice search and Google Home/Amazon Alexa, have already deployed E2E ASR systems to replace the conventional ones.

Here we will implement a simplified version of the famous [***Listen, attend and spell (LAS)***](https://ieeexplore.ieee.org/document/7472621) framework. LAS is one of the first and most famous E2E ASR systems and has made pure E2E ASR systems comparable to hybrid systems (HMM-DNN acoustic model + language model). It takes the spectrogram as input and generates the output **characters** without the need for phoneme-level labels or an explicit language model. Since modern E2E ASR models typically require very large amounts of transcribed speech to generalize well, in this homework we will focus on how to overfit the network on a small training set.

## 1. Data Preparation

We will use a small subset of the [Librispeech](https://www.openslr.org/12) dataset for this homework. Download the data from the provided Google Drive link and extract it into the same directory as this notebook.

In [1]:
import numpy as np
import librosa
import os
import time
import h5py
import soundfile as sf

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

import matplotlib.pyplot as plt
from IPython.display import Audio

In [2]:
dir_path = 'ASR'  # directory path

# walk through the directory
wav_files = []
txt_files = []
for (dirpath, dirnames, filenames) in os.walk(dir_path):
    for file in filenames:
        if '.flac' in file:
            wav_files.append(dirpath+'/'+file)
        elif '.txt' in file:
            txt_files.append(dirpath+'/'+file)
        
wav_files = sorted(wav_files)
txt_files = sorted(txt_files)
num_data = len(wav_files)
print(num_data)

# print sample filenames
print(wav_files[0])

# the .txt files contain the transcriptions for the wave files
# load one of them and take a look at it
with open(txt_files[0], "r") as sample_file:
    sample_transcription = sample_file.readlines()
print(sample_transcription[0])

2556
ASR/1188/133604/1188-133604-0000.flac
1188-133604-0000 YOU WILL FIND ME CONTINUALLY SPEAKING OF FOUR MEN TITIAN HOLBEIN TURNER AND TINTORET IN ALMOST THE SAME TERMS



In [3]:
# listen to one example
y, sr = librosa.load(wav_files[0])

Audio(y, rate=sr)

We need to match the corresponding transcriptions to the file names of the wave files. Let's do that here.

In [4]:
# match the transcription labels to their corresponding wave file names

transcription = []
for i in range(len(txt_files)):
    with open(txt_files[i], "r") as sample_file:
        transcription += sample_file.readlines()
        
# check if the number of elements in the transcription matches the number of utterances
assert (len(transcription) == num_data)

# note that the transcription sequences contain a file name at the beginning
# remove it for each transcription
# there might also be a '\n' representing a "new line"
# we also need to remove it
transcription = [' '.join(transcription[i].rstrip('\n').split(' ')[1:]) for i in range(num_data)]

# print a sample
# the header is now removed
print(transcription[0])

YOU WILL FIND ME CONTINUALLY SPEAKING OF FOUR MEN TITIAN HOLBEIN TURNER AND TINTORET IN ALMOST THE SAME TERMS


Remember that we will train an E2E system that directly generates the characters (including numbers, spaces and punctuation marks). Let's count the number of distinct characters in our training data.

In [5]:
# count the number of characters in the dataset
characters = []
for i in range(num_data):
    characters += list(set(transcription[i]))
characters = sorted(list(set(characters)))
print(characters, len(characters))

[' ', "'", 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'] 28


Here we have 28 characters (26 letters, one space and one apostrophe), hence the output of our model will be the estimated probability distribution across these characters (like a 28-way classification).

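In language modeling tasks, we often need to explicitly mark the end of the sequence. Typically this is done by inserting an "end-of-sequence (EOS)" token at the end. We can treat it as a new "character". We also add a "begin-of-sequence (BOS)" token to the character list; it will serve as the first input to the decoder later.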

In [6]:
# BOS and EOS tokens
# we use "#" to represent BOS and "/" to represent EOS
characters.append('/')
characters.append('#')

transcription = [transcription[i] + '/' for i in range(num_data)]

# transform the characters into label indices, since we will use cross entropy loss during training
label_transcription = []
for i in range(num_data):
    # convert the string into an array
    this_transcription = np.array([transcription[i][j] for j in range(len(transcription[i]))])
    
    # transform the characters into label indices
    this_label = []
    for j in range(len(this_transcription)):
        this_label.append(np.where(np.array(characters) == this_transcription[j])[0][0])
    label_transcription.append(this_label)
    
label_length = [len(label_transcription[i]) for i in range(len(label_transcription))]

# print a sample label
print(transcription[0])
print(label_transcription[0])

YOU WILL FIND ME CONTINUALLY SPEAKING OF FOUR MEN TITIAN HOLBEIN TURNER AND TINTORET IN ALMOST THE SAME TERMS/
[26, 16, 22, 0, 24, 10, 13, 13, 0, 7, 10, 15, 5, 0, 14, 6, 0, 4, 16, 15, 21, 10, 15, 22, 2, 13, 13, 26, 0, 20, 17, 6, 2, 12, 10, 15, 8, 0, 16, 7, 0, 7, 16, 22, 19, 0, 14, 6, 15, 0, 21, 10, 21, 10, 2, 15, 0, 9, 16, 13, 3, 6, 10, 15, 0, 21, 22, 19, 15, 6, 19, 0, 2, 15, 5, 0, 21, 10, 15, 21, 16, 19, 6, 21, 0, 10, 15, 0, 2, 13, 14, 16, 20, 21, 0, 21, 9, 6, 0, 20, 2, 14, 6, 0, 21, 6, 19, 14, 20, 28]


Long sequences can be hard for the model to transcribe, especially when the model is small and computational resources are limited. In this homework, we only use 400 utterances whose transcription labels have no more than 50 characters. If you are interested in a more challenging task, you can try using all the data after this homework.

In [7]:
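# select the first 400 utterances whose labels have at most 50 characters (EOS included)
select_idx = np.where(np.array(label_length) <= 50)[0][:400]

wav_files = np.array(wav_files)[select_idx]
# keep the labels aligned with the selected wave files
label_transcription = [label_transcription[i] for i in select_idx]
label_length = [label_length[i] for i in select_idx]
num_data = len(wav_files)

print(num_data)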

400


## 2. Network Architecture

![](https://opennmt.net/OpenNMT/img/las.png)

The diagram briefly shows the LAS pipeline. An ***RNN encoder*** stacks multiple RNN layers to extract high-level features from the input features. This is the ***listener*** in the LAS framework. An ***RNN decoder*** predicts a probability distribution over the candidate characters for the current time step, based on the prediction of the previous time step and the encoder outputs. This is the ***speller*** in the LAS framework. An ***attention*** module is added to the speller so that the decoder can make better use of the encoder outputs during prediction.

### 2.1 Encoder

The RNN used in the original LAS model is a ***pyramidal encoder***, where multiple RNN layers are applied at different time scales. For example, the output of the first RNN layer is downsampled by a factor of two (in the original paper, by concatenating every two consecutive frames) and passed to the second RNN layer. However, in this homework we will not use this architecture, but simply use a deep LSTM model similar to the ones we used in the previous homework.
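
Although we do not use the pyramidal structure here, the time reduction between two layers can be sketched as follows. This is an illustration only, not the original LAS implementation: `halve_time_resolution` and its `mode` argument are hypothetical names. With `mode="concat"` each pair of consecutive frames is stacked along the feature axis, and with `mode="mean"` each pair is averaged.

```python
import torch


def halve_time_resolution(x, mode="concat"):
    # x: batch_size, seq_length, feature_dim
    batch_size, seq_length, feature_dim = x.shape
    # drop the last frame if the sequence length is odd
    x = x[:, :seq_length - seq_length % 2]
    # group consecutive frames into pairs: batch_size, seq_length // 2, 2, feature_dim
    pairs = x.reshape(batch_size, seq_length // 2, 2, feature_dim)
    if mode == "concat":
        # stack each pair along the feature axis (feature dimension doubles)
        return pairs.reshape(batch_size, seq_length // 2, 2 * feature_dim)
    # otherwise average each pair (feature dimension unchanged)
    return pairs.mean(dim=2)


x = torch.rand(3, 13, 4)
print(halve_time_resolution(x, "concat").shape)  # torch.Size([3, 6, 8])
print(halve_time_resolution(x, "mean").shape)    # torch.Size([3, 6, 4])
```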

Before we start with the encoder architecture, note that the input/output for the ASR task differs from all the previous tasks - different input utterances can have *different lengths*. We cannot truncate all the inputs to the same length as we did for enhancement/separation/AED, since here we need to transcribe a full sentence and such truncation would make the labels incomplete. Moreover, since we train all models batch-wise, we need to make sure that utterances with different lengths can be concatenated into the same batch, and that the model is able to deal with those utterances within a batch.

We typically have the following procedure to process those variable-length inputs in a batch:
- For a batch of utterances, calculate the length of each sequence and find the maximum length
- Zero-pad the utterances whose lengths are smaller than the maximum length (pad at the end)
- Concatenate the padded utterances into a batch, and pass it to [pack_padded_sequence](https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pack_padded_sequence.html#torch.nn.utils.rnn.pack_padded_sequence) in PyTorch (check the link for how the function works)

To reverse *pack_padded_sequence*, you can use [pad_packed_sequence](https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pad_packed_sequence.html) to transform the packed utterances back into a padded batch.
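
As a compact alternative to padding by hand, the same procedure can be sketched with `torch.nn.utils.rnn.pad_sequence`, which zero-pads a list of tensors to the longest one. The tensor sizes below are arbitrary and chosen only for illustration.

```python
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

# three utterances with different lengths: seq_length, feature_dim
utterances = [torch.rand(6, 4), torch.rand(13, 4), torch.rand(10, 4)]
lengths = torch.tensor([u.shape[0] for u in utterances])

# zero-pad at the end and stack into a batch: 3, 13, 4
padded = pad_sequence(utterances, batch_first=True)

# pack the batch; enforce_sorted=False means the batch need not be sorted by length
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)

# reverse the packing; the lengths come back in the original batch order
unpacked, unpacked_lengths = pad_packed_sequence(packed, batch_first=True)

print(padded.shape)       # torch.Size([3, 13, 4])
print(packed.data.shape)  # torch.Size([29, 4]), i.e. 6 + 13 + 10 valid frames
print(unpacked_lengths)   # tensor([ 6, 13, 10])
```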

We provide an example below.

In [8]:
# a batch of utterances with different length

x1 = torch.rand(6, 4)  # seq_length, feature_dim
x2 = torch.rand(13, 4)  # seq_length, feature_dim
x3 = torch.rand(10, 4)  # seq_length, feature_dim

input_length = [x1.shape[0], x2.shape[0], x3.shape[0]]
max_length = max(input_length)

# zero-pad x1 and x3 at the end to max_length (x2 already has the maximum length)
zero_feature = torch.zeros(max_length - x1.shape[0], x1.shape[1])
x1 = torch.cat([x1, zero_feature], 0)
zero_feature = torch.zeros(max_length - x3.shape[0], x3.shape[1])
x3 = torch.cat([x3, zero_feature], 0)

# concatenate the utterances to form a batch
batch_x = torch.cat([x1.unsqueeze(0), x2.unsqueeze(0), x3.unsqueeze(0)], 0)  # 3, max_length, feature_dim
print("Batch feature shape: ", batch_x.shape)
# wrap it with pack_padded_sequence
batch_x_packed = nn.utils.rnn.pack_padded_sequence(batch_x, input_length, batch_first=True, enforce_sorted=False)

# let's take a look at the packed batch feature
print("\nPacked feature: ", batch_x_packed)
print("\nPacked feature data shape: ", batch_x_packed.data.shape)

# unpack the features

batch_x_unpacked, lens_unpacked = nn.utils.rnn.pad_packed_sequence(batch_x_packed, batch_first=True, total_length=max_length)
print("\nUnpacked feature: ", batch_x_unpacked)

print("\nUnpacked feature is the same as the original batch feature: ", torch.equal(batch_x_unpacked, batch_x))

Batch feature shape:  torch.Size([3, 13, 4])

Packed feature:  PackedSequence(data=tensor([[0.5301, 0.2675, 0.5479, 0.9219],
        [0.2873, 0.8899, 0.8324, 0.5298],
        [0.5723, 0.3448, 0.5104, 0.2134],
        [0.3783, 0.2898, 0.5684, 0.8451],
        [0.8093, 0.1661, 0.4574, 0.7041],
        [0.9836, 0.5938, 0.2634, 0.0374],
        [0.0787, 0.9017, 0.5463, 0.8032],
        [0.3102, 0.7769, 0.2235, 0.1065],
        [0.9116, 0.0774, 0.1864, 0.7992],
        [0.6271, 0.1110, 0.9587, 0.5418],
        [0.8274, 0.1583, 0.4707, 0.1101],
        [0.7948, 0.0780, 0.7017, 0.4516],
        [0.2416, 0.9324, 0.1467, 0.2139],
        [0.3126, 0.2478, 0.0209, 0.7836],
        [0.2680, 0.0892, 0.8580, 0.0865],
        [0.8226, 0.0906, 0.3906, 0.1810],
        [0.2734, 0.2140, 0.4030, 0.7325],
        [0.1947, 0.2809, 0.8497, 0.2949],
        [0.9202, 0.1693, 0.1192, 0.4242],
        [0.6470, 0.8047, 0.7554, 0.4054],
        [0.8824, 0.8751, 0.8863, 0.7617],
        [0.4924, 0.0896, 0.6584, 0.

In the previous homework we always used a ***unidirectional LSTM*** for sequence modeling tasks. LAS uses a **bidirectional LSTM (BLSTM)** for better sequence modeling. A BLSTM contains another LSTM layer that **scans the input in the reverse direction**, which allows the model to capture not only the forward dependencies but also the backward dependencies within a sequence. Moreover, we introduce [***residual connections***](https://ieeexplore.ieee.org/document/7780459/) in the BLSTM layers to improve the performance and stabilize the training.
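
To make the bidirectional part concrete, here is a small sketch comparing the output shapes of a unidirectional and a bidirectional `nn.LSTM`. The sizes are arbitrary. The output feature dimension doubles for the BLSTM because the forward and backward outputs are concatenated.

```python
import torch
import torch.nn as nn

x = torch.rand(3, 13, 4)  # batch_size, seq_length, feature_dim

lstm = nn.LSTM(input_size=4, hidden_size=8, batch_first=True)
blstm = nn.LSTM(input_size=4, hidden_size=8, batch_first=True, bidirectional=True)

out_uni, _ = lstm(x)
out_bi, (h_n, _) = blstm(x)

print(out_uni.shape)  # torch.Size([3, 13, 8])
print(out_bi.shape)   # torch.Size([3, 13, 16])

# out_bi[..., :8] is the forward direction, out_bi[..., 8:] is the backward direction
# the forward pass finishes at the last frame, the backward pass finishes at the first frame
print(torch.allclose(h_n[0], out_bi[:, -1, :8]), torch.allclose(h_n[1], out_bi[:, 0, 8:]))
```

This is why any layer that follows a BLSTM needs an input size of `hidden_size * 2`.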

Residual connections are among the most influential ideas in deep learning in recent years. The idea is rather simple - instead of asking a layer to learn a mapping $y = f(x)$, a residual connection adds the input to the output: $y = f(x) + x$. This simple trick has made it possible to train really *deep* neural networks and has significantly boosted performance. Although it was originally proposed for convolutional networks, it has been applied extensively in all types of architectures and has become a basic (and default) building block in many networks.
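
One way to see why this helps is to look at the Jacobian of a residual layer. This is a short sketch rather than a full argument, with $I$ denoting the identity matrix:

$$\frac{\partial y}{\partial x} = \frac{\partial f(x)}{\partial x} + I$$

Even when $\partial f(x) / \partial x$ is small, the identity term lets the gradient pass through the layer unchanged, so stacking many such layers does not necessarily shrink the gradient toward zero.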

In [9]:
# example implementation of deep residual LSTM
# support inputs with variable length

class ResidualLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, bidirectional=False, packed_input=False):
        super(ResidualLSTM, self).__init__()
        
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_direction = int(bidirectional) + 1
        self.packed_input = packed_input
        
        self.rnn = nn.LSTM(input_size, hidden_size, 1, batch_first=True, bidirectional=bidirectional)
        self.proj = nn.Linear(hidden_size*self.num_direction, input_size)

    def forward(self, input, input_length=0, hidden_state=None):
        # input shape: batch_size, sequence_length, feature_dim
        
        if self.packed_input:
            # unpack input for residual connection
            input_unpack, _ = nn.utils.rnn.pad_packed_sequence(input, batch_first=True)

        rnn_output, hidden_state = self.rnn(input, hidden_state)
        
        if self.packed_input:
            # unpack output
            rnn_output, _ = nn.utils.rnn.pad_packed_sequence(rnn_output, batch_first=True)
        
        batch_size, seq_length = rnn_output.shape[:2]
    
        # project the output back to the input dimension
        proj_output = self.proj(rnn_output.contiguous().view(-1, rnn_output.shape[2])).view(batch_size, seq_length, -1)
        
        # residual connection
        if self.packed_input:
            output = input_unpack + proj_output
            # pack output
            output = nn.utils.rnn.pack_padded_sequence(output, input_length, batch_first=True, enforce_sorted=False)
        else:
            output = input + proj_output
            
        return output, hidden_state
            
    
class DeepResidualLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers=1, bidirectional=False, packed_input=False):
        super(DeepResidualLSTM, self).__init__()
        
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_direction = int(bidirectional) + 1
        self.num_layers = num_layers
        
        self.layers = nn.ModuleList([])
        for i in range(self.num_layers):
            self.layers.append(ResidualLSTM(input_size, hidden_size, bidirectional, packed_input))
    
    def forward(self, input, input_length=0, hidden_state=None):
        # input shape: batch_size, sequence_length, feature_dim
        output = input
        if hidden_state is None:
            hidden_state = [None]*self.num_layers
        for i in range(self.num_layers):
            output, layer_hidden_state = self.layers[i](output, input_length, hidden_state[i])
            hidden_state[i] = layer_hidden_state
        return output, hidden_state

Now we take a look at how to use them to construct an LSTM encoder that receives variable-length inputs.

In [10]:
# define a 2-layer deep residual BLSTM as encoder
sample_BLSTM = DeepResidualLSTM(input_size=4, hidden_size=8, num_layers=2, bidirectional=True, packed_input=True)

# use the packed tensor we defined above
print(batch_x_packed.data.shape)

# pass to the model
sample_output, _ = sample_BLSTM(batch_x_packed, input_length)
print(sample_output[0].shape)
# note that the input and output of the residual LSTM will have the same feature dimension

# unpack the output
sample_output_unpacked, _ = nn.utils.rnn.pad_packed_sequence(sample_output, batch_first=True, total_length=max_length)
print(sample_output_unpacked.shape)

torch.Size([29, 4])
torch.Size([29, 4])
torch.Size([3, 13, 4])


### 2.2 Attention + Decoding

The attention-based speller contains an attention module and an RNN decoder to generate the sequence of character probabilities. There are two main questions here:
- What is an attention module?
- How does the RNN decoder generate the sequential output step-by-step?

We start with the second question. We have already worked with RNNs in the previous homework. In all previous models, however, we assumed that **the length of the output is the same as the length of the input**. In the VAD task and the enhancement/separation tasks, the input is the spectrogram of the audio clip, and the output is either a frame-level label (speech/nonspeech) or another spectrogram. In these pipelines, the way we use the RNNs is simple - for every frame in the input sequence, we generate an output frame from the RNN layers. However, the ASR task, and more generally language modeling tasks, are different - the input and output sequences can have different lengths. In this case, we cannot assume that every frame in the input sequence corresponds to an output frame. We need to modify the way we generate the output sequence.

Sequence-to-sequence (Seq2Seq) models were designed for this task - mapping one sequence to another. Typically Seq2Seq models contain an **encoder** and a **decoder**, where the encoder takes the input sequence as the model input and generates another sequence (or a single compressed vector), and the decoder takes the encoder output as the model input and generates another sequence **in an autoregressive way** - it generates the output frames one by one, and terminates when it generates EOS or reaches a predefined maximum sequence length. The image below illustrates the process. In this figure, the word "Yes" is generated for the 1st frame, and it is used as an additional input to generate the 2nd frame (the word "what's"). The generation process stops when "END" ("EOS") is generated.
![](https://i.stack.imgur.com/YjlBt.png)

Traditionally, the encoder output is squeezed to a single vector and then serves as the initial hidden state (or part of the input) of the decoder RNN. The input to the current frame of the decoder RNN should be the output of the previous frame, and for the first frame we don't have a previous frame. In this case, a "begin-of-sequence (BOS)" token is used for the first frame. I'll provide an example below.

In [11]:
# example of a traditional seq2seq model
# I don't use variable-length inputs here just for simplicity

batch_size = 2
seq_length = 7
feature_dim = 4
input_sequence = torch.rand(batch_size, seq_length, feature_dim)

# encoder BLSTM
encoder_BLSTM = DeepResidualLSTM(input_size=feature_dim, hidden_size=feature_dim*2, num_layers=2, 
                                 bidirectional=True, packed_input=False)

encoder_output, _ = encoder_BLSTM(input_sequence)
encoder_output = encoder_output.contiguous()

# we use mean-pooling to squeeze the encoder output into a single vector
encoder_vector = encoder_output.mean(1)  # batch_size, feature_dim (the residual layers keep the input dimension)

# the encoder output vector is used as part of the input to the decoder RNN
# decoder RNN cannot be bidirectional, since we need to generate the outputs one by one
# moreover, the output from a previous step will be the input for the next step
decoder_LSTM = DeepResidualLSTM(input_size=feature_dim+len(characters), hidden_size=feature_dim*2, 
                                num_layers=2, bidirectional=False)

# another MLP for estimating the output character probabilities
# takes the concatenation of encoder context vector and the decoder output as input
# no activation function for the last layer if you use nn.CrossEntropyLoss for objective
decoder_prob = nn.Sequential(nn.Linear(feature_dim*2+len(characters), feature_dim*2),
                             nn.Tanh(),
                             nn.Linear(feature_dim*2, len(characters)-1)
                            )
# the output dimension is len(characters)-1 since we won't have BOS in the outputs

max_decoding_length = 20  # you need to set a larger value for the actual dataset

decoder_output_prob = []
decoder_output_label = []
for step in range(max_decoding_length):
    if step == 0:
        # for the first step, use BOS as part of input
        # we set BOS as an all-zero vector
        prev_label = torch.zeros(batch_size, len(characters))  # the feature dimension is len(characters) because we will use one-hot vectors later
    else:
        # starting from the second step, use the predicted label from the previous step as part of the input
        # generate one-hot vector
        prev_label = torch.zeros(batch_size, len(characters))
        prev_label.scatter_(1, decoder_output_label[-1], 1)  # batch_size, len(characters)

    concat_input = torch.cat([prev_label, encoder_vector], 1)  # batch_size, feature_dim+len(characters)

    # treat it as a sequence with only one time step, send it to the decoder RNN
    if step == 0:
        current_decoder_output, current_decoder_state = decoder_LSTM(concat_input.unsqueeze(1))
    else:
        # use previous decoder state
        # pass it by keyword, since the second positional argument is input_length
        current_decoder_output, current_decoder_state = decoder_LSTM(concat_input.unsqueeze(1), hidden_state=current_decoder_state)
    # concatenate the encoder context vector with the decoder output
    prob_input = torch.cat([current_decoder_output.squeeze(1), encoder_vector], 1)

    # pass to the output MLP
    current_decoder_prob = decoder_prob(prob_input)  # batch_size, len(characters)-1
    decoder_output_prob.append(current_decoder_prob.unsqueeze(1))  # batch_size, 1, len(characters)-1
    # select the one with the highest probability as the predicted character
    _, sample_character_index = torch.max(current_decoder_prob, dim=1)  # select the one with the highest probability
    decoder_output_label.append(sample_character_index.unsqueeze(1))

decoder_output_prob = torch.cat(decoder_output_prob, 1)  # batch, max_length, len(characters) - 1
decoder_output_label = torch.cat(decoder_output_label, 1)  # batch, max_length
print(decoder_output_prob.shape, decoder_output_label.shape)

torch.Size([2, 20, 29]) torch.Size([2, 20])
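
Note that the loop above always runs for `max_decoding_length` steps. To turn the predicted indices into text, one would typically cut each sequence at the first EOS. Below is a minimal sketch, assuming the character ordering used in this notebook (EOS `'/'` at index 28); `labels_to_text` is an illustrative helper name, and the character list is rebuilt so that the snippet runs on its own.

```python
import string
import torch

# same ordering as the `characters` list above: space, apostrophe, A-Z, EOS '/', BOS '#'
characters = [' ', "'"] + list(string.ascii_uppercase) + ['/', '#']
EOS_INDEX = characters.index('/')  # 28


def labels_to_text(label_batch):
    # label_batch: batch_size, max_length (long tensor of character indices)
    texts = []
    for labels in label_batch.tolist():
        if EOS_INDEX in labels:
            # keep only the part before the first EOS
            labels = labels[:labels.index(EOS_INDEX)]
        texts.append(''.join(characters[i] for i in labels))
    return texts


# the first sequence contains EOS after two characters, the second one never emits EOS
sample = torch.tensor([[9, 10, 28, 5, 5], [16, 12, 2, 26, 0]])
print(labels_to_text(sample))  # ['HI', 'OKAY ']
```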


We have provided an example about a Seq2Seq pipeline. We now go to the first question above: what is an attention module?

We saw above that the encoder output sequence is squeezed into a single vector and used throughout the decoding process. The intuition is that the squeezed vector contains all the information in the input sequence, which can help the decoding process. However, sometimes we may want to focus on a certain part of the input, as the output may only be related to a small portion of the input. An example is shown below:
![](https://www.tensorflow.org/images/seq2seq/attention_mechanism.jpg)

Note the difference here: the encoder outputs are not squeezed into a single vector. Instead, a set of ***attention weights*** is calculated, and the encoder outputs at different frames are combined by a ***weighted sum*** to form a ***context vector***. Intuitively, the attention weights allow us to adjust the importance of the encoder outputs at different frames during decoding. In the ASR task, we typically assume that a word can be transcribed from just a few frames in the input (e.g., 100~200 ms), hence a globally squeezed encoder output is of little use - something that is **more focused on those frames** will be enough. In this case, the attention weights can have higher values on those frames and lower values on others, allowing the decoder to use only the encoder features from that specific period.

The attention weights are recalculated at each step of the decoding process. They take the decoder RNN output at that step (which depends on the previously predicted label) as well as all the encoder frames as input. I provide an example below.
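
In symbols, one common choice is dot-product attention. The notation here is ours: $h_i$ is the encoder output at frame $i$, $T$ is the number of valid frames of the utterance, $s_t$ is the decoder RNN output at decoding step $t$, and $W$, $b$ are the parameters of a linear layer that maps $s_t$ to the encoder feature dimension.

$$e_{t,i} = h_i^\top (W s_t + b), \qquad \alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{T} \exp(e_{t,j})}, \qquad c_t = \sum_{i=1}^{T} \alpha_{t,i}\, h_i$$

The weights $\alpha_{t,i}$ are non-negative and sum to one over the valid frames, and $c_t$ is the context vector for step $t$.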

In [12]:
# we use the same encoder and decoder LSTM as above, but define a new *attention* module

# use an FC layer for decoder LSTM outputs to match the feature dimension to the encoder outputs
# we calculate the similarity between the rescaled decoder LSTM output and all the encoder outputs
decoder_FC = nn.Linear(feature_dim+len(characters), feature_dim)

# input length in this batch
input_length = [4, 7]

# the encoder process is the same, so we only modify the decoder process
decoder_output_prob = []
decoder_output_label = []
for step in range(max_decoding_length):
    if step == 0:
        # for the first step, use BOS as the previous output label
        # set it to an all-zero matrix
        prev_label = torch.zeros(batch_size, len(characters))  # BOS

        # the previous attention context vector for the first step is also zero
        # since we don't have an output from the attention module yet
        context_vector = torch.zeros(batch_size, feature_dim)  # batch_size, feature_dim

    else:
        # starting from the second step, use the predicted label from the previous step as part of the input
        # generate one-hot vector
        prev_label = torch.zeros(batch_size, len(characters))
        prev_label.scatter_(1, decoder_output_label[-1], 1)  # batch_size, len(characters)
        # we have the attention context vector from last step, so we use it

    # step 1: calculate decoder RNN output
    # takes the previous attention context vector (zero) and the previous output label (BOS) as input
    decoder_input = torch.cat([prev_label, context_vector], 1)  # batch_size, len(characters)+feature_dim

    # pass it to decoder RNN
    if step == 0:
        current_decoder_output, current_decoder_state = decoder_LSTM(decoder_input.unsqueeze(1))
    else:
        # use previous decoder state
        # pass it by keyword, since the second positional argument is input_length
        current_decoder_output, current_decoder_state = decoder_LSTM(decoder_input.unsqueeze(1), hidden_state=current_decoder_state)

    # step 2: update the attention context vector
    # pass the current decoder output to the decoder FC
    this_embedding = decoder_FC(current_decoder_output.squeeze(1))  # batch_size, feature_dim

    # calculate the similarity between this embedding and all the encoder output embeddings
    # similarity is calculated by dot product
    similarity = encoder_output.bmm(this_embedding.unsqueeze(2))  # batch_size, seq_length, 1

    # the attention weights have the constraint that their summation should be 1
    # so here we use a Softmax function
    # also note that input utterances in the batch might have different actual lengths
    # in the calculation of attention weights, we only want to consider the valid encoder outputs

    attention_weight = [F.softmax(similarity[i, :input_length[i]], dim=0).view(-1,1) for i in range(batch_size)]
    # calculate the weighted sum of the valid encoder outputs
    context_vector = [(encoder_output[i,:input_length[i]] * attention_weight[i]).sum(0).unsqueeze(0) for i in range(batch_size)]  # 1, feature_dim
    context_vector = torch.cat(context_vector, 0)  # batch_size, feature_dim

    # step 3: concatenate the new attention context vector with the decoder output
    # pass to the output MLP to generate the final output
    # this is the same as the standard decoder without attention

    prob_input = torch.cat([current_decoder_output.squeeze(1), context_vector], 1)
    current_decoder_prob = decoder_prob(prob_input)  # batch_size, len(characters)-1
    decoder_output_prob.append(current_decoder_prob.unsqueeze(1))  # batch_size, 1, len(characters)-1
    _, sample_character_index = torch.max(current_decoder_prob, dim=1)  # select the one with the highest probability
    decoder_output_label.append(sample_character_index.unsqueeze(1))

decoder_output_prob = torch.cat(decoder_output_prob, 1)  # batch, max_length, len(characters)-1
decoder_output_label = torch.cat(decoder_output_label, 1)  # batch, max_length
print(decoder_output_prob.shape, decoder_output_label.shape)

torch.Size([2, 20, 29]) torch.Size([2, 20])
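
The per-utterance list comprehension above is easy to read but loops over the batch in Python. An equivalent batched version can be sketched with a mask that sets the similarity of padded frames to minus infinity before the softmax. The tensors below are random stand-ins so that the snippet runs on its own; `query` plays the role of `this_embedding`.

```python
import torch
import torch.nn.functional as F

batch_size, seq_length, feature_dim = 2, 7, 4
encoder_output = torch.rand(batch_size, seq_length, feature_dim)
query = torch.rand(batch_size, feature_dim)  # stand-in for this_embedding
input_length = torch.tensor([4, 7])          # valid frames per utterance

# dot-product similarity: batch_size, seq_length
similarity = encoder_output.bmm(query.unsqueeze(2)).squeeze(2)

# True for valid frames, False for padded frames
valid = torch.arange(seq_length).unsqueeze(0) < input_length.unsqueeze(1)

# padded frames get -inf, so their softmax weight is exactly zero
similarity = similarity.masked_fill(~valid, float('-inf'))
attention_weight = F.softmax(similarity, dim=1)  # batch_size, seq_length

# weighted sum of the encoder outputs: batch_size, feature_dim
context_vector = attention_weight.unsqueeze(1).bmm(encoder_output).squeeze(1)

print(attention_weight[0])  # the last three weights are zero
print(context_vector.shape)
```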


You may wonder what the attention weights look like. Here is an example:
![](https://pravn.files.wordpress.com/2018/10/las_attention.png?w=739)
We can observe that every output character only relies on a few neighbouring frames in the input feature, which is indeed what we expect.

### 2.3 Character Embedding

In the example above we use one-hot vectors for the characters as input to the model. However, there are better representations than one-hot vectors: ***character embeddings***. You might have heard about ***word embeddings*** in natural language processing tasks, where each word has its own representation vector and words with similar semantic meanings have a high similarity between their representations/embeddings:
![](https://www.ibm.com/blogs/research/wp-content/uploads/2018/10/WMEFig1.png)

The idea is the same here: we assign a vector to each character in our character list, and allow it to be jointly optimized with the entire network.
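
An embedding layer can be viewed as a one-hot vector multiplied by a trainable matrix, implemented as a table lookup. The sketch below checks this equivalence for `nn.Embedding`; the sizes (30 characters, 4-dimensional embeddings) match the ones used in this notebook but are otherwise arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_characters, embedding_dim = 30, 4
embedding = nn.Embedding(num_characters, embedding_dim)

indices = torch.tensor([26, 16, 22])  # "YOU" in our character list

# table lookup: 3, embedding_dim
looked_up = embedding(indices)

# the same result via one-hot vectors and a matrix product
one_hot = F.one_hot(indices, num_classes=num_characters).float()  # 3, num_characters
projected = one_hot @ embedding.weight                            # 3, embedding_dim

print(torch.allclose(looked_up, projected))  # True
```

The lookup avoids building the one-hot vectors explicitly, which matters more as the vocabulary grows.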

In [13]:
# example of character embeddings

character_embedding = nn.Embedding(len(characters), feature_dim)

# need to redefine the decoder LSTM, decoder FC and decoder MLP
# since the dimension of the embedding is different from len(characters)

decoder_LSTM = DeepResidualLSTM(input_size=feature_dim*2, hidden_size=feature_dim*2, 
                                num_layers=2, bidirectional=False)

decoder_FC = nn.Linear(feature_dim*2, feature_dim)

decoder_prob = nn.Sequential(nn.Linear(feature_dim*3, feature_dim*2),
                             nn.Tanh(),
                             nn.Linear(feature_dim*2, len(characters)-1)
                            )

decoder_output_prob = []
decoder_output_label = []

for step in range(max_decoding_length):
    if step == 0:
        # for the first step, use BOS as the previous output label
        # BOS is the last character in our character list
        # no need for one-hot vector, we only need the index
        prev_label = torch.ones(batch_size).long() * (len(characters) - 1)  # index for BOS

    else:
        # starting from the second step, use the predicted label from the previous step as part of the input
        prev_label = decoder_output_label[-1].squeeze(1)  # index for the previous character
        
    # extract their embeddings
    prev_embedding = character_embedding(prev_label)
    
    # the rest is the same
    decoder_input = torch.cat([prev_embedding, context_vector], 1)  # batch_size, feature_dim*2

    # pass it to decoder RNN
    if step == 0:
        current_decoder_output, current_decoder_state = decoder_LSTM(decoder_input.unsqueeze(1))
    else:
        # use previous decoder state
        current_decoder_output, current_decoder_state = decoder_LSTM(decoder_input.unsqueeze(1), current_decoder_state)
    this_embedding = decoder_FC(current_decoder_output.squeeze(1))  # batch_size, feature_dim

    # calculate the similarity between this embedding and all the encoder output embeddings
    # similarity is calculated by dot product
    similarity = encoder_output.bmm(this_embedding.unsqueeze(2))  # batch_size, seq_length, 1

    attention_weight = [F.softmax(similarity[i, :input_length[i]], dim=0).view(-1,1) for i in range(batch_size)]
    # calculate the weighted sum of the valid encoder outputs
    context_vector = [(encoder_output[i,:input_length[i]] * attention_weight[i]).sum(0).unsqueeze(0) for i in range(batch_size)]  # 1, feature_dim
    context_vector = torch.cat(context_vector, 0)  # batch_size, feature_dim
    
    prob_input = torch.cat([current_decoder_output.squeeze(1), context_vector], 1)
    current_decoder_prob = decoder_prob(prob_input)  # batch_size, len(characters)-1
    decoder_output_prob.append(current_decoder_prob.unsqueeze(1))  # batch_size, 1, len(characters)-1
    _, sample_character_index = torch.max(current_decoder_prob, dim=1)  # select the one with the highest probability
    decoder_output_label.append(sample_character_index.unsqueeze(1))
    
decoder_output_prob = torch.cat(decoder_output_prob, 1)  # batch, max_length, len(characters)-1
decoder_output_label = torch.cat(decoder_output_label, 1)  # batch, max_length
print(decoder_output_prob.shape, decoder_output_label.shape)

torch.Size([2, 20, 29]) torch.Size([2, 20])


### 2.4 Teacher Forcing

You may notice that the predicted label from the previous step is used as part of the input to predict the current step. However, at the beginning of the training period, the predictions are almost random and inaccurate. If we ask the model to predict the current step with a wrong previous step, it may hurt the actual optimization or convergence.

A technique called [***teacher forcing***](https://machinelearningmastery.com/teacher-forcing-for-recurrent-neural-networks/) can be used in such autoregressive sequence generation tasks to improve convergence. The idea of teacher forcing is pretty simple: since the previously predicted character can be wrong, we use the ***previous target label*** instead during training. An example is provided below.

In [14]:
# apply teacher forcing for decoding

total_epoch = 100

max_decoding_length = 20

teacher_forcing = True

for epoch in range(1, total_epoch+1):
    
    for step in range(max_decoding_length):
        if step == 0:
            # no teacher forcing for the first step
            # always use BOS
            pass
        else:
            if teacher_forcing:
                # use the oracle label at the previous step
                # prev_label = actual_label[:,step-1]
                pass
            else:  # this is for test time
                # use the previous predicted label
                # prev_label = decoder_output_label[-1].squeeze(1)
                pass

# do not use teacher forcing during test time!
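
The selection logic in the skeleton above can be written as a small runnable helper. This is only a sketch: `select_prev_label` is a hypothetical function, and the BOS index 29 and the toy labels are assumed values for illustration.

```python
import torch

def select_prev_label(step, target_label, predicted_labels, bos_index, teacher_forcing):
    """Pick the decoder input label for the current step (hypothetical helper)."""
    batch_size = target_label.shape[0]
    if step == 0:
        # the first step always starts from BOS
        return torch.full((batch_size,), bos_index, dtype=torch.long)
    if teacher_forcing:
        # training: feed the oracle label of the previous step
        return target_label[:, step - 1]
    # test time: feed the model's own previous prediction
    return predicted_labels[-1].squeeze(1)

target_label = torch.tensor([[3, 12, 5], [6, 26, 8]])  # batch, max_length
predicted_labels = [torch.tensor([[7], [7]])]          # pretend predictions from step 0

print(select_prev_label(0, target_label, predicted_labels, bos_index=29, teacher_forcing=True))
print(select_prev_label(1, target_label, predicted_labels, bos_index=29, teacher_forcing=True))
print(select_prev_label(1, target_label, predicted_labels, bos_index=29, teacher_forcing=False))
```

With teacher forcing the second call returns the oracle labels `[3, 6]`, while without it the previous predictions `[7, 7]` are fed back.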

### 2.5 Model Training and Evaluation

During training, we can let the decoder decode until *max_decoding_length* is reached, and we use the cross-entropy loss between the output character probabilities and the target character labels as the training objective. Note that the targets can have different lengths and may be shorter than *max_decoding_length*, so we need to pad the labels to the same length in order to concatenate the targets into a batch. However, we do not want the padded parts to affect training, as they are not valid parts of the target labels. Here, we make use of the *ignore_index* parameter of [nn.CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html): by defining an invalid index, the loss omits all labels with this index and prevents them from generating gradients in the backpropagation step. Let's see an example.

In [15]:
# use ignore_index in nn.CrossEntropyLoss

label1 = torch.from_numpy(np.array([3,12,5,7,2,28])).long()
label2 = torch.from_numpy(np.array([6,26,8,15,19,0,2,0,5,28])).long()

# pad label1 with a given null index
# let's set it to len(characters), as index len(characters)-1 is the last entry (BOS)
ignore_index = len(characters)
label1 = torch.cat([label1, torch.ones(len(label2)-len(label1)).long() * ignore_index], 0)

# concatenate the labels into batch
batch_label = torch.cat([label1.unsqueeze(0), label2.unsqueeze(0)], 0)
print(batch_label.shape)

# randomly generate some outputs
model_output = torch.randn(batch_label.shape[0], batch_label.shape[1], len(characters)-1)

# CE loss
# ignore labels whose entries are ignore_index
objective = nn.CrossEntropyLoss(ignore_index=ignore_index)
loss = objective(model_output.view(-1, len(characters)-1), batch_label.view(-1,))
print(loss)

torch.Size([2, 10])
tensor(3.9383)
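
As a sanity check on what *ignore_index* does, the sketch below compares the masked loss with a loss computed on the valid positions only. The values `num_classes = 29` and `ignore_index = 30` are assumptions chosen to mirror `len(characters)-1` and `len(characters)` in the cell above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

num_classes = 29    # assumed number of output classes
ignore_index = 30   # assumed padding index, outside the valid class range

label = torch.tensor([3, 12, 5, ignore_index, ignore_index])
output = torch.randn(len(label), num_classes)

# loss with the padded positions ignored
masked_loss = nn.CrossEntropyLoss(ignore_index=ignore_index)(output, label)

# the same loss computed explicitly on the valid positions only
valid = label != ignore_index
reference_loss = nn.CrossEntropyLoss()(output[valid], label[valid])

print(masked_loss.item(), reference_loss.item())
```

The two values agree because the default mean reduction averages only over the labels that are not ignored.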


Note that since the target labels are padded and teacher forcing uses the target labels during training, it is possible that this *ignore_index* will be used as *prev_label* during training. Hence we need to assign one more character embedding for it.

In [16]:
# adjust character embeddings

character_embedding = nn.Embedding(len(characters)+1, feature_dim)  # one more embedding for ignore_index

# do not need to change other parts
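
A possible variation, not required by this notebook, is to mark the extra entry with the `padding_idx` argument of `nn.Embedding`, which keeps that row at zero and excludes it from gradient updates. The sketch below uses hypothetical sizes (`num_characters = 30`, `embedding_dim = 8`).

```python
import torch
import torch.nn as nn

num_characters, embedding_dim = 30, 8  # hypothetical sizes
ignore_index = num_characters          # extra index reserved for padded labels

embedding = nn.Embedding(num_characters + 1, embedding_dim, padding_idx=ignore_index)

labels = torch.tensor([3, 12, ignore_index])
vectors = embedding(labels)
vectors.sum().backward()

print(vectors[-1])                          # zero vector for the padded label
print(embedding.weight.grad[ignore_index])  # zero gradient for that row
```

Since the loss already ignores the padded positions, either choice should work; `padding_idx` only makes the intent explicit.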

Note that the RNN decoder may need to generate many character predictions step by step, as *max_decoding_length* can be large. In this case, the model may be hard to converge because the accumulated gradient can be large, especially when the attention module is also applied. To alleviate this, we can use the [nn.utils.clip_grad_norm_](https://pytorch.org/docs/stable/generated/torch.nn.utils.clip_grad_norm_.html) function to clip the gradient norm to a maximum value. A sample usage is shown below:

In [17]:
sample_FC = nn.Linear(4, 8)
sample_input = torch.rand(2, 4)
sample_optimizer = optim.Adam(sample_FC.parameters(), lr=1e-3)

sample_output = sample_FC(sample_input) * 100
sample_loss = sample_output.mean()

sample_loss.backward()
print("Gradient before clipping: ", sample_FC.weight.grad)
# clip the gradient by a maximum norm of 5
nn.utils.clip_grad_norm_(sample_FC.parameters(), 5.)
print("Gradient after clipping: ", sample_FC.weight.grad)
sample_optimizer.step()

Gradient before clipping:  tensor([[4.6183, 9.5318, 0.5038, 5.0339],
        [4.6183, 9.5318, 0.5038, 5.0339],
        [4.6183, 9.5318, 0.5038, 5.0339],
        [4.6183, 9.5318, 0.5038, 5.0339],
        [4.6183, 9.5318, 0.5038, 5.0339],
        [4.6183, 9.5318, 0.5038, 5.0339],
        [4.6183, 9.5318, 0.5038, 5.0339],
        [4.6183, 9.5318, 0.5038, 5.0339]])
Gradient after clipping:  tensor([[0.4761, 0.9827, 0.0519, 0.5190],
        [0.4761, 0.9827, 0.0519, 0.5190],
        [0.4761, 0.9827, 0.0519, 0.5190],
        [0.4761, 0.9827, 0.0519, 0.5190],
        [0.4761, 0.9827, 0.0519, 0.5190],
        [0.4761, 0.9827, 0.0519, 0.5190],
        [0.4761, 0.9827, 0.0519, 0.5190],
        [0.4761, 0.9827, 0.0519, 0.5190]])
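
Note that the norm is computed over all parameters together (weight and bias here), not per tensor. The sketch below, using the same kind of toy layer, reads the norm before clipping from the return value of the function and recomputes it afterwards; `norm_before` and `norm_after` are names introduced for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(4, 8)
loss = (layer(torch.rand(2, 4)) * 100).mean()
loss.backward()

max_norm = 5.0
# the function returns the total gradient norm measured before clipping
norm_before = nn.utils.clip_grad_norm_(layer.parameters(), max_norm)

# recompute the total norm over all parameter gradients after clipping
norm_after = torch.sqrt(sum((p.grad ** 2).sum() for p in layer.parameters()))

print(float(norm_before), float(norm_after))
```

All gradients are scaled by the same factor, so the direction of the update is preserved and only its magnitude changes.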


The evaluation of the model is done by comparing the predicted labels with the target characters. ***Character error rate (CER)*** and ***word error rate (WER)*** are used as the metrics here; both are based on the edit distance between the prediction and the target. Note that the output of your model always has length *max_decoding_length*, so you need to find the first EOS and truncate the output there before calculating CER and WER, as EOS marks the end of the generation process. Here we directly use the [*fastwer*](https://pypi.org/project/fastwer/) library.

In [18]:
# install the library if you haven't done so
!pip install pybind11 fastwer



In [19]:
import fastwer 

# compare two sequences
target = transcription[0][:7]
print(target)
prediction = decoder_output_label[0][:7]
print(prediction)
# prediction is a set of indices, we need to convert them back to the characters
prediction = [characters[prediction[i]] for i in range(len(prediction))]
# concat the characters
prediction = ''.join(prediction)
print(prediction)

# calculate sentence-level CER/WER via the fastwer.score_sent function
sample_cer = fastwer.score_sent(prediction, target, char_level=True)
sample_wer = fastwer.score_sent(prediction, target)
print("Sentence-level CER: {:.2f}%; Sentence-level WER: {:.2f}%.".format(sample_cer, sample_wer))

# if you have multiple sentences, you can also calculate corpus-level CER/WER via the fastwer.score function
# a list of targets and predictions
target_list = [transcription[i][:7] for i in range(2)]
prediction_list = [decoder_output_label[i][:7] for i in range(2)]
prediction_list = [[characters[prediction_list[j][i]] for i in range(len(prediction_list[j]))] for j in range(len(prediction_list))]
prediction_list = [''.join(prediction_list[i]) for i in range(len(prediction_list))]

all_cer = fastwer.score(prediction_list, target_list, char_level=True)
all_wer = fastwer.score(prediction_list, target_list)
print("Corpus-level CER: {:.2f}%; Corpus-level WER: {:.2f}%.".format(all_cer, all_wer))

# note that both CER and WER can be higher than 100%

THEN HE
tensor([21, 24, 10,  4, 24, 10,  4])
TWICWIC
Sentence-level CER: 85.71%; Sentence-level WER: 100.00%.
Corpus-level CER: 85.71%; Corpus-level WER: 100.00%.
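
To make the metric concrete, here is a plain-Python sketch of the edit distance and the resulting error rate. It is not a description of how fastwer is implemented; `edit_distance` and `error_rate` are hypothetical helpers written for illustration.

```python
def edit_distance(hypothesis, reference):
    """Levenshtein distance between two sequences (strings or lists)."""
    previous = list(range(len(reference) + 1))
    for i, hyp_item in enumerate(hypothesis, start=1):
        current = [i]
        for j, ref_item in enumerate(reference, start=1):
            substitution = previous[j - 1] + (hyp_item != ref_item)
            insertion = previous[j] + 1    # extra item in the hypothesis
            deletion = current[j - 1] + 1  # item missing from the hypothesis
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]

def error_rate(hypothesis, reference, char_level=False):
    """Error rate in percent, normalized by the length of the reference."""
    if not char_level:
        hypothesis, reference = hypothesis.split(), reference.split()
    return 100.0 * edit_distance(hypothesis, reference) / len(reference)

print(error_rate("TWICWIC", "THEN HE", char_level=True))
print(error_rate("TWICWIC", "THEN HE"))
```

For the sample pair above this gives 6 character edits out of 7 reference characters (about 85.71%) and 2 word edits out of 2 reference words (100%). Because the distance is normalized by the reference length, a hypothesis much longer than the reference can score above 100%.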


Now we have provided examples for all components in this E2E ASR system - the BLSTM encoder, the LSTM decoder, the attention module, the training objective, and the evaluation metrics. It's your turn to combine them, build a full pipeline, and train it on the dataset we prepared at the beginning.

Note that:
- You need to extract MFCC features from those waveforms and use them as the input to the BLSTM encoder ("listener").
- You need to properly handle the variable-length input/output problem.
- You need to evaluate CER and WER on the entire training set **without teacher forcing**. You need to achieve CER and WER lower than 10% on this training set to get full marks.
- You can play with the hidden size, number of layers, learning rate, or other hyperparameters in any parts of the network.
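
For the variable-length input, one common approach in PyTorch is to pack the padded batch before the LSTM and unpack it afterwards. The sketch below uses a plain `nn.LSTM` rather than the `DeepResidualLSTM` defined earlier, and the sizes and lengths are hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# two hypothetical utterances with 5 and 3 valid frames, zero-padded to 5 frames
feature_dim, hidden_dim = 4, 6
batch = torch.randn(2, 5, feature_dim)
lengths = torch.tensor([5, 3])
batch[1, 3:] = 0.

lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True, bidirectional=True)

# pack the batch so that the LSTM skips the padded frames
packed = nn.utils.rnn.pack_padded_sequence(batch, lengths, batch_first=True, enforce_sorted=False)
packed_output, _ = lstm(packed)

# unpack back to a padded tensor; padded positions are filled with zeros
output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output, batch_first=True)

print(output.shape, output_lengths)
```

Packing matters most for the bidirectional encoder, since otherwise the backward direction would start from the padded frames.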

In [20]:
# TODO: build a full system and train it

class LAS(nn.Module):
    def __init__(self, MFCC_dim=40, num_target=len(characters), max_decoding_length=max_label_length):
        super(LAS, self).__init__()
        
        self.input_dim = MFCC_dim
        self.hidden_dim = 64
        self.num_target = num_target
        self.max_decoding_length = max_decoding_length
        
        # one additional character embedding for padded invalid labels
        # during decoding, this entry might be used in teacher forcing
        # during testing, this will never be generated
        self.character_embedding = nn.Embedding(self.num_target+1, self.hidden_dim)
        
        # encoder LSTM
        self.encoder_LSTM = DeepResidualLSTM(self.input_dim, self.hidden_dim, num_layers=2, 
                                             bidirectional=True, packed_input=True)
        
        # decoder LSTM
        self.decoder_LSTM = DeepResidualLSTM(self.hidden_dim+self.input_dim, self.hidden_dim, 
                                             num_layers=2, bidirectional=False, packed_input=False)
        
        # FC layers for matching the dimension in attention
        self.decoder_FC = nn.Linear(self.hidden_dim+self.input_dim, self.input_dim)
        
        # output MLP
        # BOS and the padded invalid label will not be in the final output
        self.decoder_prob = nn.Sequential(nn.Linear(self.hidden_dim+self.input_dim*2, self.hidden_dim),
                                          nn.Tanh(),
                                          nn.Linear(self.hidden_dim, self.num_target-1)  
                                         )
        
    def encoding(self, input_feature, input_length):
        # input_feature: MFCC of shape (batch_size, max_frame, MFCC_dim)
        # input_length: denotes the actual length for each utterance in the batch, shape (batch_size,)
        # input label: oracle target labels of shape (batch_size, max_frame)
        
        # encoder
        
        enc_output, _ = self.encoder_LSTM(input_feature, input_length)
        
        # unpack the output
        enc_output, _ = nn.utils.rnn.pad_packed_sequence(enc_output, batch_first=True)
        
        return enc_output
    
    def decoding(self, enc_output, context_vector, prev_label, 
                 input_length, current_decoder_state=None):
        
        batch_size = enc_output.shape[0]
        
        # extract character embedding
        prev_label = self.character_embedding(prev_label)

        # step 1: calculate decoder RNN output
        decoder_input = torch.cat([prev_label, context_vector], 1)

        # pass it to decoder RNN
        current_decoder_output, current_decoder_state = self.decoder_LSTM(decoder_input.unsqueeze(1), 
                                                                          hidden_state=current_decoder_state)

        # step 2: update the attention context vector
        this_embedding = self.decoder_FC(current_decoder_output.squeeze(1))
        similarity = enc_output.bmm(this_embedding.unsqueeze(2)) 
        attention_weight = [F.softmax(similarity[i, :input_length[i]], dim=0).view(-1,1) for i in range(batch_size)]
        context_vector = [(enc_output[i,:input_length[i]] * attention_weight[i]).sum(0).unsqueeze(0) for i in range(batch_size)]
        context_vector = torch.cat(context_vector, 0)

        # step 3: concatenate the new attention context vector with the decoder output
        prob_input = torch.cat([current_decoder_output.squeeze(1), context_vector], 1)
        current_decoder_prob = self.decoder_prob(prob_input)

        return current_decoder_prob, current_decoder_state, context_vector

    def forward(self, input_feature, input_length, input_target_label=None, teacher_forcing=True):
        
        # encoding
        enc_output = self.encoding(input_feature, input_length)
        batch_size = enc_output.shape[0]
        
        decoder_output_prob = []
        decoder_output_label = []
        
        for step in range(self.max_decoding_length):
            if step == 0:
                # use BOS at the beginning
                # BOS is the last entry in the character list, second last entry in the character embeddings
                prev_label = torch.ones(batch_size).long() * (self.num_target - 1)
                context_vector = torch.zeros(batch_size, self.input_dim)
                current_decoder_state = None
            else:
                if teacher_forcing:
                    # apply teacher forcing
                    prev_label = input_target_label[:,step-1]
                else:
                    prev_label = decoder_output_label[-1].squeeze(1)
                        
            # decoding
            current_decoder_prob, current_decoder_state, context_vector = self.decoding(enc_output, context_vector, 
                                                                                        prev_label, input_length,
                                                                                        current_decoder_state)
            
            decoder_output_prob.append(current_decoder_prob.unsqueeze(1))
            _, sample_character_index = torch.max(current_decoder_prob, dim=1)  # select the one with the highest probability
            decoder_output_label.append(sample_character_index.unsqueeze(1))

        decoder_output_prob = torch.cat(decoder_output_prob, 1)  # batch, max_length, len(characters)-1 (no BOS)
        decoder_output_label = torch.cat(decoder_output_label, 1)  # batch, max_length

        return decoder_output_prob, decoder_output_label
        
    
# CE loss
def CE(output, target):
    # output shape: (batch, max_length, num_target)
    # target shape: (batch, max_length)
    
    batch_size, max_length, num_target = output.shape
    
    loss = nn.CrossEntropyLoss(ignore_index=len(characters))
    
    return loss(output.view(-1, num_target), target.view(-1,))

In [21]:
# calculate all MFCC first 

MFCC_feature = []
MFCC_length = []
for i in range(len(wav_files)):
    y, sr = librosa.load(wav_files[i])
    # keyword arguments are required by recent librosa releases and also accepted by older ones
    this_MFCC = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40, n_fft=2048, hop_length=512)
    # normalization
    this_MFCC_mean = np.mean(this_MFCC, 1)
    this_MFCC_std = np.sqrt(np.var(this_MFCC, 1) + 1e-6)
    this_MFCC = (this_MFCC - this_MFCC_mean[:,np.newaxis]) / this_MFCC_std[:,np.newaxis]
    MFCC_feature.append(this_MFCC)
    MFCC_length.append(this_MFCC.shape[1])
    
# pad MFCC features whose length is shorter than the maximum length
max_MFCC_length = max(MFCC_length)
zero_vector = np.zeros((40, max_MFCC_length))
for i in range(len(wav_files)):
    if MFCC_feature[i].shape[1] < max_MFCC_length:
        MFCC_feature[i] = np.concatenate([MFCC_feature[i], zero_vector[:,:max_MFCC_length-MFCC_feature[i].shape[1]]], 1)
        
# pad target labels whose length is shorter than the maximum length
# the index for the padded part should be different than any existing indices in the character
ignore_vector = np.ones((max_label_length)) * len(characters)
for i in range(len(label_transcription)):
    if len(label_transcription[i]) < max_label_length:
        label_transcription[i] = np.concatenate([label_transcription[i], 
                                                 ignore_vector[:max_label_length-len(label_transcription[i])]], 0)
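
The normalization and padding steps can be checked in isolation on synthetic data. The sketch below replaces the real MFCC matrices with random arrays of shape (n_mfcc, num_frames); `features`, `normalized` and `padded` are names introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-ins for MFCC matrices of shape (n_mfcc, num_frames)
features = [rng.normal(3.0, 2.0, size=(40, 50)), rng.normal(-1.0, 0.5, size=(40, 35))]

# per-utterance, per-coefficient mean and variance normalization
normalized = []
for feature in features:
    mean = feature.mean(axis=1, keepdims=True)
    std = np.sqrt(feature.var(axis=1, keepdims=True) + 1e-6)
    normalized.append((feature - mean) / std)

lengths = [feature.shape[1] for feature in normalized]
max_length = max(lengths)

# zero-pad along the time axis only
padded = np.stack([np.pad(feature, ((0, 0), (0, max_length - feature.shape[1])))
                   for feature in normalized])

print(padded.shape, lengths)
```

Keeping `lengths` alongside the padded array is what later allows the encoder and the attention module to ignore the padded frames.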

In [22]:
# dataloader preparation

from torch.utils.data import Dataset, DataLoader

batch_size = 8

class dataset_pipeline(Dataset):
    def __init__(self, data, data_length, label_index, label_character, label_length, validation=False):
        super(dataset_pipeline, self).__init__()
        
        self.data = data
        self.data_length = data_length
        self.label_index = label_index
        self.validation = validation
        self.label_length = label_length
        if self.validation:
            self.label_character = label_character
        
        self._len = len(self.data)  # number of utterances
    
    def __getitem__(self, index):
        MFCC = torch.from_numpy(self.data[index].T).type(torch.float)
        MFCC_length = self.data_length[index]
        label_index = torch.from_numpy(np.array(self.label_index[index])).long()
        if self.validation:
            label_character = self.label_character[index]
            return MFCC, MFCC_length, label_index, label_character
        else:
            label_length = self.label_length[index]
            return MFCC, MFCC_length, label_index, label_length
    
    def __len__(self):
        return self._len
    
train_loader = DataLoader(dataset_pipeline(MFCC_feature, MFCC_length, 
                                           label_transcription, transcription,
                                           label_length), 
                          batch_size=batch_size, 
                          shuffle=True,
                         )

validation_loader = DataLoader(dataset_pipeline(MFCC_feature, MFCC_length, 
                                                label_transcription, transcription,
                                                label_length, validation=True), 
                               batch_size=batch_size, 
                               shuffle=False,
                              )

dataset_len = len(train_loader)
log_step = dataset_len // 4

In [23]:
# training and validation pipeline

def train(model, epoch, versatile=True):
    start_time = time.time()
    model = model.train()
    train_loss = 0.
    
    # load batch data
    for batch_idx, data in enumerate(train_loader):
        MFCC, MFCC_length, label_index, label_length = data
        # pack the input batch
        MFCC_packed = nn.utils.rnn.pack_padded_sequence(MFCC, MFCC_length, 
                                                        batch_first=True, enforce_sorted=False)
        
        optimizer.zero_grad()
        
        # apply teacher forcing
        decoder_output_prob, decoder_output_label = model(MFCC_packed, MFCC_length, label_index)
        
        # CE as objective
        loss = CE(decoder_output_prob, label_index)
        
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), 5.)
        optimizer.step()
        
        train_loss += loss.data.item()
        
        #print(loss.data.item())
        
        if versatile:
            if (batch_idx+1) % log_step == 0:
                elapsed = time.time() - start_time
                print('| epoch {:3d} | {:5d}/{:5d} batches | ms/batch {:5.2f} | CE {:5.4f} |'.format(
                    epoch, batch_idx+1, len(train_loader),
                    elapsed * 1000 / (batch_idx+1), 
                    train_loss / (batch_idx+1)
                    ))
    
    train_loss /= (batch_idx+1)
    print('-' * 99)
    print('    | end of training epoch {:3d} | time: {:5.2f}s | CE {:5.4f} |'.format(
            epoch, (time.time() - start_time), train_loss))
    
    return train_loss

def validate(model, epoch):
    start_time = time.time()
    model = model.eval()
    
    all_decoder_output_label = []
    all_label_character = []
    
    # load batch data
    for batch_idx, data in enumerate(validation_loader):
        MFCC, MFCC_length, label_index, label_character = data
        # pack the input batch
        MFCC_packed = nn.utils.rnn.pack_padded_sequence(MFCC, MFCC_length, 
                                                        batch_first=True, enforce_sorted=False)
        
        with torch.no_grad():
            # no teacher forcing
            decoder_output_prob, decoder_output_label = model(MFCC_packed, MFCC_length, teacher_forcing=False)
            
            decoder_output_label = decoder_output_label.data.numpy()
            
            for batch in range(decoder_output_label.shape[0]):
                this_decoder_output_label = decoder_output_label[batch]
                EOS_location = np.where(this_decoder_output_label == (len(characters) - 2))[0]
                if len(EOS_location) > 0:
                    this_decoder_output_label = this_decoder_output_label[:EOS_location[0]]
                all_decoder_output_label.append(this_decoder_output_label)
                all_label_character.append(label_character[batch][:-1])
        
    # calculate CER and WER on the entire training set
    num_utterance = len(all_decoder_output_label)
    
    prediction_list = [[characters[all_decoder_output_label[j][i]] for i in range(len(all_decoder_output_label[j]))] 
                       for j in range(num_utterance)]
    prediction_list = [''.join(prediction_list[i]) for i in range(num_utterance)]

    all_cer = fastwer.score(prediction_list, all_label_character, char_level=True)
    all_wer = fastwer.score(prediction_list, all_label_character)
    print('    | end of validation epoch {:3d} | time: {:5.2f}s | Corpus-level CER: {:.2f}% | Corpus-level WER: {:.2f}% |'.format(
        epoch, (time.time() - start_time), all_cer, all_wer))
    
    return all_cer
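
The EOS truncation inside `validate` can also be tested on its own. Below is a minimal sketch; `truncate_at_eos` is a hypothetical helper, and the EOS index 28 is an assumed value standing in for `len(characters) - 2`.

```python
import numpy as np

def truncate_at_eos(label_sequence, eos_index):
    """Keep the labels before the first EOS; return all of them if EOS is absent."""
    label_sequence = np.asarray(label_sequence)
    eos_location = np.where(label_sequence == eos_index)[0]
    if len(eos_location) > 0:
        return label_sequence[:eos_location[0]]
    return label_sequence

eos_index = 28  # assumed EOS index
print(truncate_at_eos([21, 9, 6, 28, 4, 28, 0], eos_index))
print(truncate_at_eos([21, 9, 6], eos_index))
```

Only the first EOS matters: anything generated after it is discarded, and an output without EOS is kept in full.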

In [24]:
# main 
model_save = 'best_LAS.pt'

model = LAS()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# main function

training_loss = []
validation_loss = []

total_epoch = 150

for epoch in range(1, total_epoch + 1):
    training_loss.append(train(model, epoch))
    validation_loss.append(validate(model, epoch))
    if training_loss[-1] == np.min(training_loss):
        print('      Best training model found.')
    if validation_loss[-1] == np.min(validation_loss):
        with open(model_save, 'wb') as f:
            torch.save(model.state_dict(), f)
            print('      Best validation model found and saved.')
    print('-' * 99)

| epoch   1 |    12/   50 batches | ms/batch 246.15 | CE 3.2627 |
| epoch   1 |    24/   50 batches | ms/batch 251.64 | CE 3.0808 |
| epoch   1 |    36/   50 batches | ms/batch 258.83 | CE 2.9726 |
| epoch   1 |    48/   50 batches | ms/batch 261.35 | CE 2.8925 |
---------------------------------------------------------------------------------------------------
    | end of training epoch   1 | time: 13.06s | CE 2.8785 |
    | end of validation epoch   1 | time:  2.59s | Corpus-level CER: 106.75% | Corpus-level WER: 178.46% |
      Best training model found.
      Best validation model found and saved.
---------------------------------------------------------------------------------------------------
| epoch   2 |    12/   50 batches | ms/batch 249.41 | CE 2.5267 |
| epoch   2 |    24/   50 batches | ms/batch 262.53 | CE 2.4910 |
| epoch   2 |    36/   50 batches | ms/batch 261.43 | CE 2.4834 |
| epoch   2 |    48/   50 batches | ms/batch 261.61 | CE 2.4634 |
--------------------------

| epoch  13 |    12/   50 batches | ms/batch 246.47 | CE 1.6384 |
| epoch  13 |    24/   50 batches | ms/batch 255.86 | CE 1.6431 |
| epoch  13 |    36/   50 batches | ms/batch 257.92 | CE 1.6430 |
| epoch  13 |    48/   50 batches | ms/batch 256.79 | CE 1.6435 |
---------------------------------------------------------------------------------------------------
    | end of training epoch  13 | time: 12.90s | CE 1.6470 |
    | end of validation epoch  13 | time:  2.57s | Corpus-level CER: 99.87% | Corpus-level WER: 165.63% |
      Best training model found.
      Best validation model found and saved.
---------------------------------------------------------------------------------------------------
| epoch  14 |    12/   50 batches | ms/batch 260.89 | CE 1.5678 |
| epoch  14 |    24/   50 batches | ms/batch 256.53 | CE 1.5902 |
| epoch  14 |    36/   50 batches | ms/batch 256.04 | CE 1.6020 |
| epoch  14 |    48/   50 batches | ms/batch 254.58 | CE 1.6090 |
---------------------------

| epoch  25 |    12/   50 batches | ms/batch 247.01 | CE 1.0626 |
| epoch  25 |    24/   50 batches | ms/batch 253.27 | CE 1.0830 |
| epoch  25 |    36/   50 batches | ms/batch 257.10 | CE 1.0926 |
| epoch  25 |    48/   50 batches | ms/batch 260.13 | CE 1.1090 |
---------------------------------------------------------------------------------------------------
    | end of training epoch  25 | time: 13.02s | CE 1.1083 |
    | end of validation epoch  25 | time:  2.58s | Corpus-level CER: 84.53% | Corpus-level WER: 119.39% |
      Best training model found.
      Best validation model found and saved.
---------------------------------------------------------------------------------------------------
| epoch  26 |    12/   50 batches | ms/batch 252.35 | CE 1.0426 |
| epoch  26 |    24/   50 batches | ms/batch 261.59 | CE 1.0723 |
| epoch  26 |    36/   50 batches | ms/batch 259.84 | CE 1.0846 |
| epoch  26 |    48/   50 batches | ms/batch 263.01 | CE 1.0892 |
---------------------------

| epoch  37 |    24/   50 batches | ms/batch 265.24 | CE 0.6574 |
| epoch  37 |    36/   50 batches | ms/batch 264.25 | CE 0.6720 |
| epoch  37 |    48/   50 batches | ms/batch 262.91 | CE 0.6762 |
---------------------------------------------------------------------------------------------------
    | end of training epoch  37 | time: 13.20s | CE 0.6797 |
    | end of validation epoch  37 | time:  2.60s | Corpus-level CER: 82.67% | Corpus-level WER: 114.10% |
      Best training model found.
---------------------------------------------------------------------------------------------------
| epoch  38 |    12/   50 batches | ms/batch 250.66 | CE 0.6259 |
| epoch  38 |    24/   50 batches | ms/batch 261.09 | CE 0.6180 |
| epoch  38 |    36/   50 batches | ms/batch 263.54 | CE 0.6337 |
| epoch  38 |    48/   50 batches | ms/batch 263.19 | CE 0.6449 |
---------------------------------------------------------------------------------------------------
    | end of training epoch  38 | time

---------------------------------------------------------------------------------------------------
    | end of training epoch  49 | time: 13.40s | CE 0.3717 |
    | end of validation epoch  49 | time:  2.65s | Corpus-level CER: 68.02% | Corpus-level WER: 89.43% |
      Best training model found.
      Best validation model found and saved.
---------------------------------------------------------------------------------------------------
| epoch  50 |    12/   50 batches | ms/batch 255.42 | CE 0.3396 |
| epoch  50 |    24/   50 batches | ms/batch 266.67 | CE 0.3434 |
| epoch  50 |    36/   50 batches | ms/batch 267.59 | CE 0.3502 |
| epoch  50 |    48/   50 batches | ms/batch 270.66 | CE 0.3495 |
---------------------------------------------------------------------------------------------------
    | end of training epoch  50 | time: 13.51s | CE 0.3518 |
    | end of validation epoch  50 | time:  2.68s | Corpus-level CER: 66.75% | Corpus-level WER: 88.93% |
      Best training model 

    | end of validation epoch  61 | time:  2.64s | Corpus-level CER: 61.23% | Corpus-level WER: 77.86% |
---------------------------------------------------------------------------------------------------
| epoch  62 |    12/   50 batches | ms/batch 258.93 | CE 0.2597 |
| epoch  62 |    24/   50 batches | ms/batch 260.84 | CE 0.2565 |
| epoch  62 |    36/   50 batches | ms/batch 264.22 | CE 0.2684 |
| epoch  62 |    48/   50 batches | ms/batch 267.27 | CE 0.2743 |
---------------------------------------------------------------------------------------------------
    | end of training epoch  62 | time: 13.35s | CE 0.2745 |
    | end of validation epoch  62 | time:  2.65s | Corpus-level CER: 57.98% | Corpus-level WER: 75.47% |
      Best validation model found and saved.
---------------------------------------------------------------------------------------------------
| epoch  63 |    12/   50 batches | ms/batch 258.82 | CE 0.2492 |
| epoch  63 |    24/   50 batches | ms/batch 260.43 | 

    | end of validation epoch  73 | time:  2.65s | Corpus-level CER: 43.85% | Corpus-level WER: 57.17% |
---------------------------------------------------------------------------------------------------
| epoch  74 |    12/   50 batches | ms/batch 264.35 | CE 0.1425 |
| epoch  74 |    24/   50 batches | ms/batch 270.69 | CE 0.1586 |
| epoch  74 |    36/   50 batches | ms/batch 269.61 | CE 0.1561 |
| epoch  74 |    48/   50 batches | ms/batch 268.70 | CE 0.1586 |
---------------------------------------------------------------------------------------------------
    | end of training epoch  74 | time: 13.43s | CE 0.1584 |
    | end of validation epoch  74 | time:  2.67s | Corpus-level CER: 37.26% | Corpus-level WER: 48.50% |
      Best training model found.
      Best validation model found and saved.
---------------------------------------------------------------------------------------------------
| epoch  75 |    12/   50 batches | ms/batch 264.09 | CE 0.1246 |
| epoch  75 |    24/ 

| epoch  86 |    36/   50 batches | ms/batch 271.29 | CE 0.1368 |
| epoch  86 |    48/   50 batches | ms/batch 265.60 | CE 0.1409 |
---------------------------------------------------------------------------------------------------
    | end of training epoch  86 | time: 13.31s | CE 0.1405 |
    | end of validation epoch  86 | time:  2.68s | Corpus-level CER: 32.06% | Corpus-level WER: 41.91% |
---------------------------------------------------------------------------------------------------
| epoch  87 |    12/   50 batches | ms/batch 263.47 | CE 0.1213 |
| epoch  87 |    24/   50 batches | ms/batch 267.72 | CE 0.1382 |
| epoch  87 |    36/   50 batches | ms/batch 271.59 | CE 0.1375 |
| epoch  87 |    48/   50 batches | ms/batch 270.29 | CE 0.1398 |
---------------------------------------------------------------------------------------------------
    | end of training epoch  87 | time: 13.49s | CE 0.1410 |
    | end of validation epoch  87 | time:  2.68s | Corpus-level CER: 34.09% |

| epoch  99 |    12/   50 batches | ms/batch 270.77 | CE 0.1378 |
| epoch  99 |    24/   50 batches | ms/batch 271.63 | CE 0.1431 |
| epoch  99 |    36/   50 batches | ms/batch 266.71 | CE 0.1575 |
| epoch  99 |    48/   50 batches | ms/batch 267.08 | CE 0.1683 |
---------------------------------------------------------------------------------------------------
    | end of training epoch  99 | time: 13.35s | CE 0.1692 |
    | end of validation epoch  99 | time:  2.70s | Corpus-level CER: 38.72% | Corpus-level WER: 50.33% |
---------------------------------------------------------------------------------------------------
| epoch 100 |    12/   50 batches | ms/batch 254.98 | CE 0.1606 |
| epoch 100 |    24/   50 batches | ms/batch 263.54 | CE 0.1587 |
| epoch 100 |    36/   50 batches | ms/batch 265.40 | CE 0.1538 |
| epoch 100 |    48/   50 batches | ms/batch 266.25 | CE 0.1551 |
---------------------------------------------------------------------------------------------------
    | 

| epoch 112 |    12/   50 batches | ms/batch 268.29 | CE 0.0738 |
| epoch 112 |    24/   50 batches | ms/batch 274.83 | CE 0.0665 |
| epoch 112 |    36/   50 batches | ms/batch 272.28 | CE 0.0683 |
| epoch 112 |    48/   50 batches | ms/batch 271.82 | CE 0.0665 |
---------------------------------------------------------------------------------------------------
    | end of training epoch 112 | time: 13.62s | CE 0.0666 |
    | end of validation epoch 112 | time:  2.70s | Corpus-level CER: 10.06% | Corpus-level WER: 13.01% |
---------------------------------------------------------------------------------------------------
| epoch 113 |    12/   50 batches | ms/batch 269.61 | CE 0.0524 |
| epoch 113 |    24/   50 batches | ms/batch 272.52 | CE 0.0537 |
| epoch 113 |    36/   50 batches | ms/batch 269.20 | CE 0.0573 |
| epoch 113 |    48/   50 batches | ms/batch 270.31 | CE 0.0617 |
---------------------------------------------------------------------------------------------------
    | 

| epoch 124 |    12/   50 batches | ms/batch 267.56 | CE 0.0122 |
| epoch 124 |    24/   50 batches | ms/batch 267.88 | CE 0.0123 |
| epoch 124 |    36/   50 batches | ms/batch 265.49 | CE 0.0124 |
| epoch 124 |    48/   50 batches | ms/batch 265.05 | CE 0.0127 |
---------------------------------------------------------------------------------------------------
    | end of training epoch 124 | time: 13.25s | CE 0.0127 |
    | end of validation epoch 124 | time:  2.68s | Corpus-level CER: 0.38% | Corpus-level WER: 0.35% |
      Best training model found.
      Best validation model found and saved.
---------------------------------------------------------------------------------------------------
| epoch 125 |    12/   50 batches | ms/batch 258.92 | CE 0.0117 |
| epoch 125 |    24/   50 batches | ms/batch 260.26 | CE 0.0118 |
| epoch 125 |    36/   50 batches | ms/batch 263.46 | CE 0.0118 |
| epoch 125 |    48/   50 batches | ms/batch 267.20 | CE 0.0119 |
------------------------------

| epoch 136 |    24/   50 batches | ms/batch 270.03 | CE 0.0073 |
| epoch 136 |    36/   50 batches | ms/batch 270.34 | CE 0.0071 |
| epoch 136 |    48/   50 batches | ms/batch 270.32 | CE 0.0072 |
---------------------------------------------------------------------------------------------------
    | end of training epoch 136 | time: 13.48s | CE 0.0072 |
    | end of validation epoch 136 | time:  2.68s | Corpus-level CER: 0.21% | Corpus-level WER: 0.18% |
      Best training model found.
---------------------------------------------------------------------------------------------------
| epoch 137 |    12/   50 batches | ms/batch 262.57 | CE 0.0065 |
| epoch 137 |    24/   50 batches | ms/batch 266.32 | CE 0.0069 |
| epoch 137 |    36/   50 batches | ms/batch 263.42 | CE 0.0070 |
| epoch 137 |    48/   50 batches | ms/batch 267.22 | CE 0.0069 |
---------------------------------------------------------------------------------------------------
    | end of training epoch 137 | time: 1

| epoch 149 |    12/   50 batches | ms/batch 246.36 | CE 0.0040 |
| epoch 149 |    24/   50 batches | ms/batch 254.35 | CE 0.0042 |
| epoch 149 |    36/   50 batches | ms/batch 262.54 | CE 0.0043 |
| epoch 149 |    48/   50 batches | ms/batch 265.95 | CE 0.0042 |
---------------------------------------------------------------------------------------------------
    | end of training epoch 149 | time: 13.33s | CE 0.0042 |
    | end of validation epoch 149 | time:  2.68s | Corpus-level CER: 0.00% | Corpus-level WER: 0.00% |
      Best training model found.
      Best validation model found and saved.
---------------------------------------------------------------------------------------------------
| epoch 150 |    12/   50 batches | ms/batch 267.90 | CE 0.0040 |
| epoch 150 |    24/   50 batches | ms/batch 271.23 | CE 0.0039 |
| epoch 150 |    36/   50 batches | ms/batch 267.63 | CE 0.0040 |
| epoch 150 |    48/   50 batches | ms/batch 268.32 | CE 0.0040 |
------------------------------

Let's print an example output to see how the model performs.

In [25]:
# TODO: print an example output
# remember to load the model with the lowest CER first

sample_idx = 128
sample_utterance = wav_files[sample_idx]
sample_target = transcription[sample_idx]

model = LAS()
model.load_state_dict(torch.load(model_save))
model.eval()

# compute the MFCC features of the sample utterance

y, sr = librosa.load(sample_utterance)
# pass y and sr as keyword arguments: recent librosa versions no longer accept them positionally
this_MFCC = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40, n_fft=2048, hop_length=512)
# normalization
this_MFCC_mean = np.mean(this_MFCC, 1)
this_MFCC_std = np.sqrt(np.var(this_MFCC, 1) + 1e-6)
this_MFCC = (this_MFCC - this_MFCC_mean[:,np.newaxis]) / this_MFCC_std[:,np.newaxis]

this_MFCC = torch.from_numpy(this_MFCC.T).unsqueeze(0)
MFCC_length = [this_MFCC.shape[1]]

MFCC_packed = nn.utils.rnn.pack_padded_sequence(this_MFCC, MFCC_length, 
                                                batch_first=True, enforce_sorted=False)

with torch.no_grad():
    decoder_output_prob, decoder_output_label = model(MFCC_packed, MFCC_length, teacher_forcing=False)
    decoder_output_label = decoder_output_label[0].data.numpy()
    EOS_location = np.where(decoder_output_label == (len(characters) - 2))[0]
    if len(EOS_location) > 0:
        decoder_output_label = decoder_output_label[:EOS_location[0]]
        
    output_label = np.array(characters)[decoder_output_label]
    print('Transcribed sequence: ', ''.join(output_label))
    print('Target sequence: ', sample_target[:-1])

Transcribed sequence:  HOW DID HER MOTHER EVER LET HER GO
Target sequence:  HOW DID HER MOTHER EVER LET HER GO


## Discussion: Beam Search

Note that in this homework we assume that the decoder always selects the character with the highest probability when generating the output sequence. This is called ***greedy decoding***. A better way is to perform a [***beam search***](https://machinelearningmastery.com/beam-search-decoder-natural-language-processing/). The idea of beam search is to keep multiple candidates instead of only the one with the highest probability. As the decoding continues, it keeps the top K sequences with the highest ***sequence generation probabilities*** (remember the evaluation problem in HMM?). 
![](https://d2l.ai/_images/beam-search.svg)

In the example in the figure, the decoding process keeps the 2 best sequences at each step (a beam size of 2), starting from 5 candidate characters. Greedy decoding can then be viewed as a special case of beam search with a beam size of 1 (keep only the top candidate at each step).
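
To make the difference between the two decoding strategies concrete, here is a minimal sketch of beam search in plain Python. It is not the decoder used in this homework: `beam_search`, `toy_model`, and the 3-symbol vocabulary (`0` = 'A', `1` = 'B', `2` = end-of-sequence) are hypothetical names and values chosen only for illustration, and scores are plain sums of log-probabilities without any length normalization.

```python
import math

def beam_search(next_log_probs, beam_size, max_len, eos):
    """Return all kept hypotheses as (token tuple, log-probability), best first.

    next_log_probs(prefix) must return one log-probability per vocabulary
    entry for the next token given the prefix (a hypothetical interface).
    """
    beams = [((), 0.0)]   # unfinished hypotheses
    finished = []         # hypotheses that already produced EOS
    for _ in range(max_len):
        # Extend every unfinished hypothesis by every possible token.
        candidates = []
        for prefix, score in beams:
            for token, log_p in enumerate(next_log_probs(prefix)):
                candidates.append((prefix + (token,), score + log_p))
        # Keep only the beam_size most probable extended sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            if seq[-1] == eos:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished

def toy_model(prefix):
    """Hand-made next-token distribution where the greedy choice is a trap."""
    if prefix == ():
        probs = [0.55, 0.40, 0.05]   # 'A' looks best at the first step
    elif prefix == (0,):
        probs = [0.34, 0.33, 0.33]   # but nothing is confident after 'A'
    else:
        probs = [0.05, 0.05, 0.90]   # every other prefix strongly prefers EOS
    return [math.log(p) for p in probs]

greedy_best = beam_search(toy_model, beam_size=1, max_len=3, eos=2)[0]
beam_best = beam_search(toy_model, beam_size=2, max_len=3, eos=2)[0]
print('greedy:', greedy_best[0], math.exp(greedy_best[1]))
print('beam 2:', beam_best[0], math.exp(beam_best[1]))
```

With a beam size of 1 the search commits to 'A' at the first step and cannot recover, while a beam size of 2 also keeps 'B' alive and ends up with a sequence of higher overall probability. In a real decoder, `next_log_probs` would be one step of the LAS decoder, and the scores are often length-normalized so that short sequences are not unfairly favored.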

## Discussion: Scheduled Sampling

One problem with teacher forcing is that it introduces a **training-testing mismatch**, as the target label is always used in training but is never available in testing. The original LAS paper proposed a workaround that applies teacher forcing with 90% probability; in the remaining 10% of cases the network uses the predicted label from the previous step as input. However, as we mentioned above in the beam search part, the label with the maximum probability might not always be the one selected by the beam search algorithm during decoding. To alleviate this, LAS ***samples*** a character from the previous output distribution. For example, suppose we have 3 target characters 'A', 'B' and 'C', and the previous step predicts their probabilities as 0.2, 0.3, and 0.5. Then when the current step needs a character from the previous step as input, it *samples* a character according to these probabilities, which means that there is a 20% probability that it samples 'A', a 30% probability that it samples 'B', and a 50% probability that it samples 'C'. This can be done with the [Categorical](https://pytorch.org/docs/stable/distributions.html#categorical) distribution in PyTorch.
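
The choice of the next decoder input described above can be sketched in a few lines. This is a standard-library illustration only: `choose_decoder_input` is a hypothetical helper, not part of the homework code, and `random.choices` stands in for drawing from a PyTorch `Categorical` distribution over the decoder's output probabilities.

```python
import random

def choose_decoder_input(target_char, vocab, predicted_probs,
                         teacher_forcing_prob=0.9, rng=random):
    """Pick the character fed into the next decoder step."""
    # With probability teacher_forcing_prob, use the ground-truth character.
    if rng.random() < teacher_forcing_prob:
        return target_char
    # Otherwise sample from the model's previous output distribution
    # (instead of taking the argmax).
    return rng.choices(vocab, weights=predicted_probs, k=1)[0]

vocab = ['A', 'B', 'C']
predicted_probs = [0.2, 0.3, 0.5]   # the example distribution from the text

# Disable teacher forcing to look at the sampling branch in isolation.
rng = random.Random(0)
draws = [choose_decoder_input('A', vocab, predicted_probs,
                              teacher_forcing_prob=0.0, rng=rng)
         for _ in range(10000)]
frequencies = {c: draws.count(c) / len(draws) for c in vocab}
print(frequencies)
```

Over many draws the empirical frequencies approach 0.2, 0.3, and 0.5. Inside a training loop the same decision would be made at every decoder step, with `predicted_probs` taken from the softmax output of the previous step.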

An improved method for reducing this training-testing mismatch is called [***scheduled sampling***](https://papers.nips.cc/paper/2015/hash/e995f98d56967d946471af29d7bf99f1-Abstract.html). The idea is that instead of using a fixed teacher forcing probability, we can start with fully teacher-forced training and gradually decrease the teacher forcing probability. After a certain number of training iterations, the network only uses the previously predicted labels and no longer requires the target labels.
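
One simple way to realize such a schedule is a linear decay of the teacher forcing probability. The sketch below assumes a linear schedule (the scheduled sampling paper also discusses exponential and inverse sigmoid decay), and `teacher_forcing_prob` and `decay_steps` are hypothetical names introduced here for illustration.

```python
def teacher_forcing_prob(step, decay_steps, start=1.0, end=0.0):
    """Linearly decay the teacher forcing probability from start to end.

    step:        current training iteration (0-based)
    decay_steps: iteration at which the probability reaches `end`
    """
    # Fraction of the decay completed so far, clipped to [0, 1].
    frac = min(max(step / decay_steps, 0.0), 1.0)
    return start + (end - start) * frac

# The probability starts at 1 (always use the target label) and reaches 0
# (always use the model's own prediction) after decay_steps iterations.
for step in (0, 2500, 5000, 7500, 10000, 12000):
    print(step, teacher_forcing_prob(step, decay_steps=10000))
```

In training, the returned value would replace the fixed 90% teacher forcing probability, and it would be recomputed at every iteration (or every epoch) before decoding the batch.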

## Discussion: Language Model Rescoring

LAS does not have an explicit language model, and the network directly generates the probability of different characters given the input MFCC. A method called ***language model rescoring*** can introduce another pretrained language model into the decoding process (either greedy or beam search) to improve the result. The idea is to first train an explicit language model, e.g., an N-gram model, on any available text. During decoding, the score of each candidate character given the previous characters is a weighted sum of the network output log-probability *and* the language model log-probability. This lets the language model steer the prediction toward more plausible character sequences, which can make it more accurate.
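
As a rough illustration of how the two scores can be combined at a single decoding step, the sketch below adds a weighted language model log-probability to the network's log-probability. This log-domain weighted sum is an assumption about the exact formula, and all names and numbers here (`rescore`, `lm_weight`, the toy distributions) are made up for illustration.

```python
import math

def rescore(network_probs, lm_probs, lm_weight=0.5):
    """Combine network and language model scores for one decoding step.

    Returns one score per character:
        log P_network(c) + lm_weight * log P_lm(c)
    """
    return [math.log(p_net) + lm_weight * math.log(p_lm)
            for p_net, p_lm in zip(network_probs, lm_probs)]

def best_index(scores):
    """Index of the highest-scoring character."""
    return max(range(len(scores)), key=lambda i: scores[i])

vocab = ['A', 'B', 'C']
network_probs = [0.45, 0.40, 0.15]   # the acoustic evidence slightly prefers 'A'
lm_probs = [0.10, 0.80, 0.10]        # the language model strongly prefers 'B'

without_lm = vocab[best_index(rescore(network_probs, lm_probs, lm_weight=0.0))]
with_lm = vocab[best_index(rescore(network_probs, lm_probs, lm_weight=0.5))]
print('network only:', without_lm)
print('with LM rescoring:', with_lm)
```

Here the language model flips a close decision between 'A' and 'B'. The weight `lm_weight` is a hyperparameter that would normally be tuned on the validation set, and in beam search the same combined score would be accumulated over each candidate sequence.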