# Lab 12: Transformers, Speech Recognition
# Chanapa Pananookooln | st121395

## Lab Summary

In this lab we study the Transformer implementation from the PyTorch tutorial and how to adapt it to a speech recognition task.
Several `TransformerEncoderLayer` modules are stacked to form a `TransformerEncoder`. The encoder receives the input sequence together with a mask, so that each position can attend only to itself and to earlier positions.
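
As a minimal sketch of the masking idea, the snippet below builds the same kind of additive mask as `generate_square_subsequent_mask` in Part I, but with plain Python lists instead of tensors. The function name `square_subsequent_mask` is made up for this illustration.

```python
import math

def square_subsequent_mask(size):
    """Return a size x size additive attention mask as nested lists.

    Entry [i][j] is 0.0 when position i may attend to position j (j <= i)
    and -inf otherwise, so later positions get zero weight after softmax.
    """
    return [[0.0 if j <= i else -math.inf for j in range(size)]
            for i in range(size)]

mask = square_subsequent_mask(3)
for row in mask:
    print(row)
```

Row `i` of the mask is added to the attention scores of position `i`, which is why the first row only keeps the first column.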

There is also a `PositionalEncoding` module that encodes the position of each token in the sequence. The encoding has the same dimension as the embedding, so the two can be summed. The Transformer paper uses sine and cosine functions of different frequencies to encode the positions.
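
The sinusoidal encoding can be sketched without tensors as follows. This is an illustrative re-implementation of the table that `PositionalEncoding` in Part I precomputes, not the lab code itself; `positional_encoding` is a name chosen for this example.

```python
import math

def positional_encoding(max_len, d_model):
    """Build the sinusoidal table as nested lists of shape (max_len, d_model).

    For every even column index i:
        pe[pos][i]     = sin(pos / 10000 ** (i / d_model))
        pe[pos][i + 1] = cos(pos / 10000 ** (i / d_model))
    """
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(max_len=4, d_model=4)
```

Because each row has `d_model` entries, it can be added element-wise to the token embedding at the same position.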

We also used the torchtext module to help process the text data and to make batch processing more efficient.

To work with time sequences in PyTorch here, the tensors for both inputs and targets are laid out as (time x batch_element); the targets are then flattened before the loss is computed.
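
To make the (time x batch_element) layout concrete, here is a small list-based sketch of what the `batchify` function in Part I does to a flat token sequence. It mirrors the tensor version under the assumption that the data is a plain Python list.

```python
def batchify(tokens, batch_size):
    """Arrange a flat token list into rows of shape (time, batch)."""
    n_steps = len(tokens) // batch_size
    # Trim off the remainder that does not fit evenly.
    tokens = tokens[:n_steps * batch_size]
    # Each batch element is one contiguous chunk of the sequence.
    columns = [tokens[b * n_steps:(b + 1) * n_steps] for b in range(batch_size)]
    # Transpose so that the first index is time and the second is the batch.
    return [[columns[b][t] for b in range(batch_size)] for t in range(n_steps)]

data = batchify(list(range(10)), batch_size=3)
for row in data:
    print(row)
```

Each column is an independent stream of consecutive tokens, and the token `9` is dropped because it does not fill a complete row.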

To initialize the model we have to define these parameters:

    ntokens # the size of vocabulary
    emsize  # embedding dimension
    nhid # the dimension of the feedforward network model in nn.TransformerEncoder
    nlayers # the number of nn.TransformerEncoderLayer in nn.TransformerEncoder
    nhead # the number of heads in the multiheadattention models
    dropout  # the dropout value

Our model is `TransformerModel`:

    model = TransformerModel(ntokens, emsize, nhead, nhid, nlayers, dropout)

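Another thing to note is that in language modeling tasks, the exponential of the average negative log-likelihood (the cross-entropy loss) is called perplexity, or PPL. This is why the training loop reports `math.exp(loss)` as `ppl`.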
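
A short worked example of that relationship, assuming we already have the negative log-likelihood of each target token (the helper name `perplexity` is only for illustration):

```python
import math

def perplexity(token_nlls):
    """Perplexity is exp of the mean negative log-likelihood per token."""
    mean_nll = sum(token_nlls) / len(token_nlls)
    return math.exp(mean_nll)

# A model that spreads probability uniformly over 4 tokens assigns
# probability 1/4 to every target, so each NLL is log(4).
uniform_ppl = perplexity([math.log(4)] * 5)
```

In this uniform case the perplexity equals the number of equally likely choices, which is one common way to read the metric.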


The input of the speech recognition task is audio, and the output in this case is a matrix of character probabilities, which is decoded into the most likely sequence of characters spoken in the audio.
 
First, we need the torchaudio module to process the input audio.
We use the LibriSpeech dataset, where each sample contains waveform, sample_rate, utterance, speaker_id, chapter_id and utterance_id.

Next we apply data augmentation to increase the diversity and the effective size of our dataset. In this lab we chose spectrogram augmentation.
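
Spectrogram augmentation is commonly done by zeroing out a random band of frequencies and a random span of time steps. The sketch below shows that idea on a nested list and is only an assumption about the general technique; in practice torchaudio's `FrequencyMasking` and `TimeMasking` transforms do this on tensors. The function name and parameters are hypothetical.

```python
import random

def mask_spectrogram(spec, freq_mask_param, time_mask_param, rng):
    """Return a copy of spec ([freq][time]) with one frequency band and
    one time span set to zero. Mask widths are drawn uniformly from
    0..freq_mask_param and 0..time_mask_param (capped at the spec size)."""
    n_freq, n_time = len(spec), len(spec[0])
    out = [row[:] for row in spec]          # do not modify the input

    f = rng.randint(0, min(freq_mask_param, n_freq))   # band height
    f0 = rng.randint(0, n_freq - f)                    # band start
    t = rng.randint(0, min(time_mask_param, n_time))   # span width
    t0 = rng.randint(0, n_time - t)                    # span start

    for i in range(f0, f0 + f):
        out[i] = [0.0] * n_time             # frequency mask
    for row in out:
        row[t0:t0 + t] = [0.0] * t          # time mask
    return out

spec = [[1.0] * 6 for _ in range(4)]
masked = mask_spectrogram(spec, freq_mask_param=2, time_mask_param=3,
                          rng=random.Random(0))
```

Since the masks are random, the same utterance produces a slightly different training input every epoch.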

We also need a TextTransform helper that maps each character to an integer and vice versa.
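
A minimal sketch of such a character mapping is shown below. The alphabet and the method names `text_to_int` and `int_to_text` are assumptions made for this example and may differ from the mapping used in the lab.

```python
class TextTransform:
    """Map characters to integer labels and back (illustrative alphabet)."""

    def __init__(self, alphabet="' abcdefghijklmnopqrstuvwxyz"):
        self.char_to_index = {ch: i for i, ch in enumerate(alphabet)}
        self.index_to_char = {i: ch for ch, i in self.char_to_index.items()}

    def text_to_int(self, text):
        # Characters outside the alphabet raise KeyError in this sketch.
        return [self.char_to_index[ch] for ch in text.lower()]

    def int_to_text(self, labels):
        return "".join(self.index_to_char[i] for i in labels)

transform = TextTransform()
labels = transform.text_to_int("Hello world")
restored = transform.int_to_text(labels)
```

The round trip returns the lower-cased text, because the assumed alphabet has no upper-case characters.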

Then we create a data_processing function that transforms the input audio and labels into spectrograms and label sequences, which are then fed to the model for training.
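
One step such a function typically needs is padding, because utterances in a batch have different lengths. The sketch below pads label sequences and keeps their true lengths, which a CTC loss needs later. It is an assumption about the general approach, and `pad_labels` is a hypothetical helper, not the lab's data_processing function.

```python
def pad_labels(label_seqs, pad_value=0):
    """Pad integer label sequences to a common length.

    Returns the padded sequences and the original length of each one.
    """
    lengths = [len(seq) for seq in label_seqs]
    max_len = max(lengths)
    padded = [seq + [pad_value] * (max_len - len(seq)) for seq in label_seqs]
    return padded, lengths

padded, lengths = pad_labels([[5, 2, 9], [7], [3, 4]])
```

The spectrograms are handled the same way along the time axis, so that the whole batch can be stacked into one tensor.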

The tutorial in this lab used Residual Convolutional Neural Networks (ResCNN) and Bidirectional Recurrent Neural Networks (BiRNN) as components in the speech recognition model.

There are also several evaluation metrics for the speech recognition task, such as the Levenshtein distance at the word level and the character level, from which the word error rate (WER) and character error rate (CER) are computed.
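
These metrics can be sketched in a few lines. The version below is a plain dynamic-programming Levenshtein distance, with WER and CER defined as the distance divided by the reference length; the function names are chosen for this example.

```python
def levenshtein(ref, hyp):
    """Minimum number of insertions, deletions and substitutions
    needed to turn sequence ref into sequence hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate: char-level edit distance / reference length."""
    return levenshtein(reference, hypothesis) / len(reference)

print(wer("the cat sat", "the cat sit"))
print(cer("the cat sat", "the cat sit"))
```

The same single-character mistake costs one word out of three for WER but only one character out of eleven for CER, which is why the two numbers are usually reported together.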

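For the optimizer, AdamW is introduced. It decouples weight decay from the gradient-based update, which fixes the way weight decay interacts with Adam's adaptive step sizes and is reported to give faster convergence.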
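
The difference can be seen on a single scalar weight. The sketch below is a simplified one-parameter Adam update written for this explanation, not the PyTorch implementation: with `decoupled=False` the decay term is added to the gradient (L2 penalty), and with `decoupled=True` the weight is shrunk directly, as in AdamW.

```python
import math

def new_state():
    return {"t": 0, "m": 0.0, "v": 0.0}

def adam_step(w, grad, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
              weight_decay=0.0, decoupled=False):
    """One Adam update for a scalar weight; returns the new weight."""
    b1, b2 = betas
    if decoupled:
        w = w * (1.0 - lr * weight_decay)   # AdamW: decay applied to w itself
    else:
        grad = grad + weight_decay * w      # Adam + L2: decay enters the gradient
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * grad
    state["v"] = b2 * state["v"] + (1 - b2) * grad * grad
    m_hat = state["m"] / (1 - b1 ** state["t"])   # bias-corrected moments
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)

# Zero data gradient, so only weight decay acts on w = 4.0.
w_l2 = adam_step(4.0, 0.0, new_state(), lr=0.1, weight_decay=0.5)
w_adamw = adam_step(4.0, 0.0, new_state(), lr=0.1, weight_decay=0.5,
                    decoupled=True)
```

With the L2 variant the decay term is rescaled by the adaptive denominator, so the weight moves by about `lr` regardless of `weight_decay`; with the decoupled variant it shrinks by exactly the factor `1 - lr * weight_decay`.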

The CTC loss function adds a blank label, which lets the model indicate that no character was spoken in a given audio frame.
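
The role of the blank label is easiest to see at decoding time. Below is a sketch of greedy CTC decoding, one simple way to turn the per-frame probability matrix into text: take the most likely label in each frame, merge consecutive repeats, then drop blanks. The alphabet and the blank index 0 are assumptions for this example.

```python
def ctc_greedy_decode(frame_probs, alphabet, blank=0):
    """Greedy CTC decoding of a (frames x labels) probability matrix."""
    # Most likely label index in every frame.
    best = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    decoded = []
    prev = None
    for idx in best:
        # Keep a label only if it differs from the previous frame
        # and is not the blank label.
        if idx != prev and idx != blank:
            decoded.append(alphabet[idx])
        prev = idx
    return "".join(decoded)

alphabet = ["_", "a", "b"]        # index 0 is the blank
frame_probs = [
    [0.1, 0.8, 0.1],    # a
    [0.1, 0.7, 0.2],    # a (repeat, merged)
    [0.9, 0.05, 0.05],  # blank
    [0.2, 0.6, 0.2],    # a (new character after the blank)
    [0.1, 0.2, 0.7],    # b
    [0.1, 0.3, 0.6],    # b (repeat, merged)
]
text = ctc_greedy_decode(frame_probs, alphabet)
```

Without the blank frame in the middle, the two runs of `a` would collapse into one character, so the blank is also what makes doubled letters possible.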

# PART I : Transformer With Torchtext @ Language Modeling Task on WikiText2

In [1]:
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import TransformerEncoder, TransformerEncoderLayer
from IPython.display import display

class PositionalEncoding(nn.Module):

    def __init__(self, d_model, dropout = 0.1, max_len = 5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0).transpose(0, 1)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:x.size(0), :]
        return self.dropout(x)

class TransformerModel(nn.Module):

    def __init__(self, ntoken, ninp, nhead, nhid, nlayers, dropout=0.5):
        super(TransformerModel, self).__init__()
        self.model_type = 'Transformer'
        self.pos_encoder = PositionalEncoding(ninp, dropout)
        
        encoder_layers = TransformerEncoderLayer(ninp, nhead, nhid, dropout)
        self.transformer_encoder = TransformerEncoder(encoder_layers, nlayers)
        self.encoder = nn.Embedding(ntoken, ninp)
        self.ninp = ninp
        self.decoder = nn.Linear(ninp, ntoken)

        self.init_weights()

    def generate_square_subsequent_mask(self, sz):
        mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
        mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
        return mask

    def init_weights(self):
        initrange = 0.1
        self.encoder.weight.data.uniform_(-initrange, initrange)
        self.decoder.bias.data.zero_()
        self.decoder.weight.data.uniform_(-initrange, initrange)

    def forward(self, src, src_mask):
        src = self.encoder(src) * math.sqrt(self.ninp)
        output = self.transformer_encoder(src, src_mask)
        output = self.decoder(output)
        return output

### Load data

In [2]:
import io
import torch
from torchtext.datasets import WikiText2
from torchtext.data.utils import get_tokenizer
from collections import Counter
from torchtext.vocab import Vocab

train_iter = WikiText2(split='train')
tokenizer = get_tokenizer('basic_english')
counter = Counter()
for line in train_iter:
    counter.update(tokenizer(line))
vocab = Vocab(counter)

def data_process(raw_text_iter):
  data = [torch.tensor([vocab[token] for token in tokenizer(item)],
                       dtype=torch.long) for item in raw_text_iter]
  return torch.cat(tuple(filter(lambda t: t.numel() > 0, data)))

train_iter, val_iter, test_iter = WikiText2()
print("train_iter : ", train_iter)

train_data = data_process(train_iter)
val_data = data_process(val_iter)
test_data = data_process(test_iter)
print('train_data after process', train_data)
print('train_data.shape after process', train_data.shape)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def batchify(data, bsz):
    # Divide the dataset into bsz parts.
    nbatch = data.size(0) // bsz
    # Trim off any extra elements that wouldn't cleanly fit (remainders).
    data = data.narrow(0, 0, nbatch * bsz)
    # Evenly divide the data across the bsz batches.
    data = data.view(bsz, -1).t().contiguous()
    return data.to(device)

batch_size = 20
eval_batch_size = 10

train_data = batchify(train_data, batch_size)
val_data = batchify(val_data, eval_batch_size)
test_data = batchify(test_data, eval_batch_size)
print('train_data after batchify', train_data)
print('train_data.shape after batchify', train_data.shape)

train_iter :  WikiText2
train_data after process tensor([  10, 3850, 3870,  ..., 2443, 4811,    4])
train_data.shape after process torch.Size([2049990])
train_data after batchify tensor([[   10,    60,   565,  ..., 11653,  2436,     2],
        [ 3850,    13,   301,  ...,    48,    31,  1991],
        [ 3870,   316,    20,  ...,    98,  7721,     5],
        ...,
        [  588,  4012,    60,  ...,     2,  1440, 12314],
        [ 4988,    30,     5,  ...,  3166, 17107,  2061],
        [    7,     9,     2,  ...,    63,    19,     3]], device='cuda:0')
train_data.shape after batchify torch.Size([102499, 20])


### Generate input and target sequences

In [3]:
bptt = 35
def get_batch(source, i):
    seq_len = min(bptt, len(source) - 1 - i)
    data    = source[i   : i+seq_len]
    target  = source[i+1 : i+1+seq_len].reshape(-1)
    return data, target
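
To see how inputs and targets line up, here is a list-based sketch of the same slicing, assuming the source is a nested list of shape (time, batch) and passing `bptt` as an argument instead of using the global.

```python
def get_batch(source, i, bptt):
    """Take up to bptt time steps starting at row i.

    The target for every input token is the token one time step later.
    """
    seq_len = min(bptt, len(source) - 1 - i)
    data = source[i:i + seq_len]
    # Flatten the shifted rows, mirroring reshape(-1) in the lab code.
    target = [tok for row in source[i + 1:i + 1 + seq_len] for tok in row]
    return data, target

source = [[0, 3], [1, 4], [2, 5]]   # time = 3, batch = 2
data, target = get_batch(source, 0, bptt=35)
```

The last row of `source` never appears as an input, because there is no following token to predict for it.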

### Initialize the model

In [4]:
ntokens = len(vocab.stoi) # the size of vocabulary

emsize  = 200 # embedding dimension
nhid    = 200 # the dimension of the feedforward network model in nn.TransformerEncoder
nlayers = 2   # the number of nn.TransformerEncoderLayer in nn.TransformerEncoder
nhead   = 2   # the number of heads in the multiheadattention models
dropout = 0.2 # the dropout value

model   = TransformerModel(ntokens, emsize, nhead, nhid, nlayers, dropout).to(device)

### Run the model

In [5]:
criterion = nn.CrossEntropyLoss()
lr = 5.0 # learning rate
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.95)

import time
def train():
    model.train() # Turn on the train mode
    total_loss = 0.
    start_time = time.time()
    src_mask = model.generate_square_subsequent_mask(bptt).to(device)
    for batch, i in enumerate(range(0, train_data.size(0) - 1, bptt)):
        
        data, targets = get_batch(train_data, i)
        
        optimizer.zero_grad()
        
        if data.size(0) != bptt:
            src_mask = model.generate_square_subsequent_mask(data.size(0)).to(device)
            
        output = model(data, src_mask)
        loss = criterion(output.view(-1, ntokens), targets)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
        optimizer.step()
        
        total_loss += loss.item()
        log_interval = 200
        if batch % log_interval == 0 and batch > 0:
            cur_loss = total_loss / log_interval
            elapsed = time.time() - start_time
            print('| epoch {:3d} | {:5d}/{:5d} batches | '
                  'lr {:02.2f} | ms/batch {:5.2f} | '
                  'loss {:5.2f} | ppl {:8.2f}'.format(
                    epoch, batch, len(train_data) // bptt, scheduler.get_last_lr()[0],
                    elapsed * 1000 / log_interval,
                    cur_loss, math.exp(cur_loss)))
            total_loss = 0
            start_time = time.time()

def evaluate(eval_model, data_source):
    eval_model.eval() # Turn on the evaluation mode
    total_loss = 0.
    src_mask = eval_model.generate_square_subsequent_mask(bptt).to(device)
    with torch.no_grad():
        for i in range(0, data_source.size(0) - 1, bptt):
            data, targets = get_batch(data_source, i)
            if data.size(0) != bptt:
                src_mask = model.generate_square_subsequent_mask(data.size(0)).to(device)
            output = eval_model(data, src_mask)
            output_flat = output.view(-1, ntokens)
            total_loss += len(data) * criterion(output_flat, targets).item()
    return total_loss / (len(data_source) - 1), output_flat, targets
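
The evaluation loss above is a length-weighted average: each chunk's mean loss is multiplied by its number of time steps, because the final chunk is usually shorter than `bptt`. A small sketch of that arithmetic, with made-up loss values:

```python
def weighted_mean_loss(batch_losses, batch_lengths):
    """Average per-chunk mean losses, weighting each by its length."""
    total = sum(loss * n for loss, n in zip(batch_losses, batch_lengths))
    return total / sum(batch_lengths)

# Hypothetical values: one full chunk of 35 steps and a final chunk of 5.
losses = [2.0, 4.0]
lengths = [35, 5]
weighted = weighted_mean_loss(losses, lengths)
plain = sum(losses) / len(losses)
```

Here the weighted result is 2.25 while the plain mean is 3.0, so an unweighted average would let the short final chunk count as much as a full one.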

In [7]:
import copy

best_val_loss = float("inf")
epochs        = 10 # number of epochs
best_model    = model

for epoch in range(1, epochs + 1):
    epoch_start_time = time.time()
    train()
    torch.save(model.state_dict(), f'../../weights/transformer_with_torchtext/transformer_torch_text{epoch}.pth')

    val_loss, output_flat, targets = evaluate(model, val_data)
    print('-' * 89)
    print('| end of epoch {:3d} | time: {:5.2f}s | valid loss {:5.2f} | '
          'valid ppl {:8.2f}'.format(epoch, (time.time() - epoch_start_time),
                                     val_loss, math.exp(val_loss)))
    print('-' * 89)

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        # Snapshot the weights; a plain reference would keep changing as training continues.
        best_model = copy.deepcopy(model)

    scheduler.step()

| epoch   1 |   200/ 2928 batches | lr 5.00 | ms/batch 10.20 | loss  5.67 | ppl   289.32
| epoch   1 |   400/ 2928 batches | lr 5.00 | ms/batch  9.87 | loss  5.67 | ppl   288.72
| epoch   1 |   600/ 2928 batches | lr 5.00 | ms/batch  9.86 | loss  5.44 | ppl   230.48
| epoch   1 |   800/ 2928 batches | lr 5.00 | ms/batch  9.86 | loss  5.49 | ppl   243.01
| epoch   1 |  1000/ 2928 batches | lr 5.00 | ms/batch  9.86 | loss  5.44 | ppl   230.74
| epoch   1 |  1200/ 2928 batches | lr 5.00 | ms/batch  9.85 | loss  5.48 | ppl   239.40
| epoch   1 |  1400/ 2928 batches | lr 5.00 | ms/batch  9.88 | loss  5.49 | ppl   242.60
| epoch   1 |  1600/ 2928 batches | lr 5.00 | ms/batch  9.86 | loss  5.51 | ppl   248.02
| epoch   1 |  1800/ 2928 batches | lr 5.00 | ms/batch  9.86 | loss  5.46 | ppl   234.58
| epoch   1 |  2000/ 2928 batches | lr 5.00 | ms/batch  9.86 | loss  5.47 | ppl   237.37
| epoch   1 |  2200/ 2928 batches | lr 5.00 | ms/batch  9.88 | loss  5.33 | ppl   207.05
| epoch   1 |  2400/ 

### Evaluate the model with test data

In [10]:
test_loss, output_flat, targets = evaluate(best_model, test_data)
print('=' * 89)
print('| End of training | test loss {:5.2f} | test ppl {:8.2f}'.format(
    test_loss, math.exp(test_loss)))
print('=' * 89)

| End of training | test loss  6.76 | test ppl   865.01


# Test on my own TEXT data on this Transformer Model

In [11]:
model.load_state_dict(torch.load('../../weights/transformer_with_torchtext/transformer_torch_text49.pth'))

my_text = 'We’ve come a long way since our early days in Älmhult, Sweden, but IKEA founder Ingvar Kamprad’s dream to create a better life for as many people as possible – whatever the size of their wallet – is and will always be our driving force.'

my_text_split = my_text.split(' ')
print(my_text_split)

my_data_processed = data_process(my_text_split)
print(my_data_processed)

my_ready_data = batchify(my_data_processed, eval_batch_size)
print(my_ready_data.shape)

my_loss, output_flat, targets = evaluate(model, my_ready_data)
print('=' * 89)
print('| End of training | test loss {:5.2f} | test ppl {:8.2f}'.format(
    my_loss, math.exp(my_loss)))
print('=' * 89)

['We’ve', 'come', 'a', 'long', 'way', 'since', 'our', 'early', 'days', 'in', 'Älmhult,', 'Sweden,', 'but', 'IKEA', 'founder', 'Ingvar', 'Kamprad’s', 'dream', 'to', 'create', 'a', 'better', 'life', 'for', 'as', 'many', 'people', 'as', 'possible', '–', 'whatever', 'the', 'size', 'of', 'their', 'wallet', '–', 'is', 'and', 'will', 'always', 'be', 'our', 'driving', 'force.']
tensor([   0,  741,    9,  162,  263,  153,  856,  115,  335,    7,    0,    3,
        4450,    3,   39,    0, 3694,    0,    0, 2910,    8, 1057,    9,  876,
         167,   18,   15,  102,  151,   15,  688,   41, 4719,    2,  862,    5,
          37,    0,   41,   24,    6,  198,  949,   34,  856, 3075,  345,    4])
torch.Size([4, 10])
| End of training | test loss  6.54 | test ppl   691.31


## Discussion

I was able to train this model for 3 epochs.

After testing this model on my own text sample, the test loss and perplexity (6.54 and 691.31) are close to the values I got on the WikiText2 test set (6.76 and 865.01).