# Attention-based Speech Recognition

In [37]:
!nvidia-smi

Wed May  3 04:46:24 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   60C    P0    27W /  70W |   1211MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [121]:
# # Install some required libraries
# # Feel free to add more if you want
# !pip install -q python-levenshtein torchsummaryX wandb kaggle pytorch-nlp 

# Imports

In [38]:
# Import Necessary Modules you require for this HW here
import torch
import random
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
from torchsummaryX import summary
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence
import torchaudio.transforms as tat

from sklearn.metrics import accuracy_score
import gc

import zipfile
import pandas as pd
from tqdm import tqdm
import os
import datetime

# imports for decoding and distance calculation
import Levenshtein

import warnings
warnings.filterwarnings('ignore')

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print("Device: ", device)

Device:  cuda


# Toy Dataset Download

In [4]:
# !wget -q https://cmu.box.com/shared/static/om4qpzd4tf1xo4h7230k4v1pbdyueghe --content-disposition --show-progress
# !unzip -q hw4p2_toy.zip -d ./

# Kaggle Dataset Download

In [5]:
api_token = '{"username":"ayh2cfa","key":"d2a236c24691c53b85eb4f0495142fdf"}'

# set up kaggle.json
# TODO: Use the same Kaggle code from HW1P2, HW2P2, HW3P2
!mkdir /root/.kaggle/

with open("/root/.kaggle/kaggle.json", "w+") as f:
    f.write(api_token) # Put your kaggle username & key here

!chmod 600 /root/.kaggle/kaggle.json

mkdir: cannot create directory ‘/root/.kaggle/’: File exists


In [6]:
# # To download the dataset
# !kaggle datasets download -d varunjain3/11-785-s23-hw4p2-dataset

In [7]:
# # To unzip data quickly and quietly
# !unzip -q 11-785-s23-hw4p2-dataset.zip -d ./data

# Dataset and Dataloaders

We have given you 2 datasets. One is a toy dataset, and the other is the standard LibriSpeech dataset. The toy dataset is to help you get your code implemented and tested and debugged easily, to verify that your attention diagonal is produced correctly. Note however that it's task (phonetic transcription) is drawn from HW3P2, it is meant to be familiar and help you understand how to transition from phonetic transcription to alphabet transcription, with a working attention module.

Please make sure you use the right constants in your code implementation for future modules, (SOS_TOKEN vs SOS_TOKEN_TOY) when working with either dataset. We have defined the constants accordingly below. Before you come to OH or post on piazza, make sure you aren't misuing the constants for either dataset in your code. 

## Toy Dataset

The toy dataset is a dataset of fixed length speech sequences that have phonetic transcripts. The reason we made it with phonetic transcripts was to help you understand how attention can work with phonetic transcription that you have done in HW3P2

In [39]:
# Load the toy dataset
import numpy as np
import torch
X_train = np.load("hw4p2_toy/f0176_mfccs_train_new.npy")
X_valid = np.load("hw4p2_toy/f0176_mfccs_dev_new.npy")
Y_train = np.load("hw4p2_toy/f0176_hw3p2_train.npy")
Y_valid = np.load("hw4p2_toy/f0176_hw3p2_dev.npy")

# This is how you actually need to find out the different trancripts in a dataset. 
# Can you think whats going on here? Why are we using a np.unique?
VOCAB_MAP_TOY           = dict(zip(np.unique(Y_valid), range(len(np.unique(Y_valid))))) 
VOCAB_MAP_TOY["[PAD]"]  = len(VOCAB_MAP_TOY)
VOCAB_TOY               = list(VOCAB_MAP_TOY.keys())

SOS_TOKEN_TOY = VOCAB_MAP_TOY["[SOS]"]
EOS_TOKEN_TOY = VOCAB_MAP_TOY["[EOS]"]
PAD_TOKEN_TOY = VOCAB_MAP_TOY["[PAD]"]

Y_train = [np.array([VOCAB_MAP_TOY[p] for p in seq]) for seq in Y_train]
Y_valid = [np.array([VOCAB_MAP_TOY[p] for p in seq]) for seq in Y_valid]

In [40]:
class ToyDataset(torch.utils.data.Dataset):

    def __init__(self, partition):

        if partition == "train":
            self.mfccs = X_train
            self.transcripts = Y_train

        elif partition == "valid":
            self.mfccs = X_valid
            self.transcripts = Y_valid

        assert len(self.mfccs) == len(self.transcripts)

        self.length = len(self.mfccs)

    def __len__(self):

        return self.length

    def __getitem__(self, i):

        x = torch.tensor(self.mfccs[i])
        y = torch.tensor(self.transcripts[i])

        return x, y

    def collate_fn(self, batch):

        x_batch, y_batch = list(zip(*batch))

        x_lens      = [x.shape[0] for x in x_batch] 
        y_lens      = [y.shape[0] for y in y_batch] 

        x_batch_pad = torch.nn.utils.rnn.pad_sequence(x_batch, batch_first=True, padding_value= EOS_TOKEN_TOY)
        y_batch_pad = torch.nn.utils.rnn.pad_sequence(y_batch, batch_first=True, padding_value= EOS_TOKEN_TOY) 
        
        return x_batch_pad, y_batch_pad, torch.tensor(x_lens), torch.tensor(y_lens)

In [41]:
config = {}
config['batch_size'] = 128
train_toy_dataset   = ToyDataset(partition= 'train')
valid_toy_dataset   = ToyDataset(partition= 'valid')

train_toy_loader    = torch.utils.data.DataLoader(
    dataset     = train_toy_dataset, 
    batch_size  = config['batch_size'], 
    shuffle     = True,
    num_workers = 4, 
    pin_memory  = True,
    collate_fn  = train_toy_dataset.collate_fn
)

valid_toy_loader    = torch.utils.data.DataLoader(
    dataset     = valid_toy_dataset, 
    batch_size  = config['batch_size'], 
    shuffle     = False,
    num_workers = 2, 
    pin_memory  = True,
    collate_fn  = valid_toy_dataset.collate_fn
)

print("No. of train mfccs   : ", train_toy_dataset.__len__())
print("Batch size           : ", config['batch_size'])
print("Train batches        : ", train_toy_loader.__len__())
print("Valid batches        : ", valid_toy_loader.__len__())

No. of train mfccs   :  16000
Batch size           :  128
Train batches        :  125
Valid batches        :  13


In [22]:
for batch in train_toy_loader:
    x, y, x_len, y_len = batch
    print(x.shape, y.shape, x_len.shape, y_len.shape)
    break

torch.Size([128, 176, 27]) torch.Size([128, 23]) torch.Size([128]) torch.Size([128])


## LibriSpeech

In terms of the dataset, the dataset structure for HW3P2 and HW4P2 dataset are very similar. Can you spot out the differences? What all will be required?? 

Hints:

- Check how big is the dataset (do you require memory efficient loading techniques??)
- How do we load mfccs? Do we need to normalise them? 
- Does the data have \<SOS> and \<EOS> tokens in each sequences? Do we remove them or do we not remove them? (Read writeup)
- Would we want a collating function? Ask yourself: Why did we need a collate function last time?
- Observe the VOCAB, is the dataset same as HW3P2? 
- Should you add augmentations, if yes which augmentations? When should you add augmentations? (Check bootcamp for answer)


In [42]:
VOCAB = ['<pad>', '<sos>', '<eos>', 
         'A',   'B',    'C',    'D',    
         'E',   'F',    'G',    'H',    
         'I',   'J',    'K',    'L',       
         'M',   'N',    'O',    'P',    
         'Q',   'R',    'S',    'T', 
         'U',   'V',    'W',    'X', 
         'Y',   'Z',    "'",    ' ', 
         ]

VOCAB_MAP = {VOCAB[i]:i for i in range(0, len(VOCAB))}

PAD_TOKEN = VOCAB_MAP["<pad>"]
SOS_TOKEN = VOCAB_MAP["<sos>"]
EOS_TOKEN = VOCAB_MAP["<eos>"]

print(f"Length of vocab: {len(VOCAB)}")
print(f"Vocab: {VOCAB}")
print(f"PAD_TOKEN: {PAD_TOKEN}")
print(f"SOS_TOKEN: {SOS_TOKEN}")
print(f"EOS_TOKEN: {EOS_TOKEN}")

Length of vocab: 31
Vocab: ['<pad>', '<sos>', '<eos>', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', "'", ' ']
PAD_TOKEN: 0
SOS_TOKEN: 1
EOS_TOKEN: 2


In [43]:
class AudioDataset(torch.utils.data.Dataset):

    # For this homework, we give you full flexibility to design your data set class.
    # Hint: The data from HW1 is very similar to this HW

    #TODO
    def __init__(self, root, partition, phonemes = VOCAB): 
        '''
        Initializes the dataset.

        INPUTS: What inputs do you need here?
        '''

        # Load the directory and all files in them

        self.mfcc_dir = root+'/'+partition+'/mfcc'
        self.transcript_dir = root+'/'+partition+'/transcripts'

        self.mfcc_files = sorted(os.listdir(self.mfcc_dir))
        self.transcript_files = sorted(os.listdir(self.transcript_dir))
        self.phonemes = phonemes

        #TODO
        # WHAT SHOULD THE LENGTH OF THE DATASET BE?
        self.length = len(self.mfcc_files)
        
        #TODO
        # HOW CAN WE REPRESENT PHONEMES? CAN WE CREATE A MAPPING FOR THEM?
        # HINT: TENSORS CANNOT STORE NON-NUMERICAL VALUES OR STRINGS
        num_mfccs = self.length

        #TODO
        # CREATE AN ARRAY OF ALL FEATUERS AND LABELS
        # WHAT NORMALIZATION TECHNIQUE DID YOU USE IN HW1? CAN WE USE IT HERE?
        '''
        You may decide to do this in __getitem__ if you wish.
        However, doing this here will make the __init__ function take the load of
        loading the data, and shift it away from training.
        '''
        # Iterate through mfccs and transcripts
        self.transcripts = [None]*num_mfccs
        self.mfccs = [None]*num_mfccs
        for i in range(num_mfccs):
        #   Load a single mfcc
            mfcc = np.load(self.mfcc_dir+'/'+self.mfcc_files[i])
        #   Do Cepstral Normalization of mfcc (explained in writeup)
            sigma = np.std(mfcc,axis=0)
            mfcc -= np.sum(mfcc,axis=0)/mfcc.shape[0]
            mfcc /= sigma
            self.mfccs[i] = torch.tensor(mfcc)

        #   Load the corresponding transcript
            transcript  = np.load(self.transcript_dir+'/'+self.transcript_files[i]) 

            # Map the phonemes to their corresponding list indexes in self.phonemes
            # Now, if an element in self.transcript is 0, it means that it is 'SIL' (as per the above example)
            transcript = [self.phonemes.index(p) for p in transcript]
            self.transcripts[i] = torch.tensor(transcript)

    def __len__(self):
        
        '''
        TODO: What do we return here?
        '''
        return self.length

    def __getitem__(self, ind):
        '''
        TODO: RETURN THE MFCC COEFFICIENTS AND ITS CORRESPONDING LABELS

        If you didn't do the loading and processing of the data in __init__,
        do that here.

        Once done, return a tuple of features and labels.
        '''

        mfcc = self.mfccs[ind]
        transcript = self.transcripts[ind]
        return mfcc, transcript


    def collate_fn(self,batch):
        '''
        TODO:
        1.  Extract the features and labels from 'batch'
        2.  We will additionally need to pad both features and labels,
            look at pytorch's docs for pad_sequence
        3.  This is a good place to perform transforms, if you so wish. 
            Performing them on batches will speed the process up a bit.
        4.  Return batch of features, labels, lenghts of features, 
            and lengths of labels.
        '''
        # batch of input mfcc coefficients
        batch_mfcc = [b[0] for b in batch]
        # batch of output phonemes
        batch_transcript = [b[1] for b in batch]

        # HINT: CHECK OUT -> pad_sequence (imported above)
        # Also be sure to check the input format (batch_first)
        lengths_mfcc = [len(m) for m in batch_mfcc]
        batch_mfcc_pad = pad_sequence(batch_mfcc, batch_first=True)

        lengths_transcript = [len(t) for t in batch_transcript]
        batch_transcript_pad = pad_sequence(batch_transcript, batch_first=True)

        # You may apply some transformation, Time and Frequency masking, here in the collate function;
        # Food for thought -> Why are we applying the transformation here and not in the __getitem__?
        #                  -> Would we apply transformation on the validation set as well?
        #                  -> Is the order of axes / dimensions as expected for the transform functions?
        
        # Return the following values: padded features, padded labels, actual length of features, actual length of the labels
        return batch_mfcc_pad, batch_transcript_pad, torch.tensor(lengths_mfcc), torch.tensor(lengths_transcript)

       

In [44]:
# Test Dataloader
class AudioTestDataset(torch.utils.data.Dataset):

    # For this homework, we give you full flexibility to design your data set class.
    # Hint: The data from HW1 is very similar to this HW

    #TODO
    def __init__(self, root, partition): 
        '''
        Initializes the dataset.

        INPUTS: What inputs do you need here?
        '''

        # Load the directory and all files in them

        self.mfcc_dir = root+'/'+partition+'/mfcc'

        self.mfcc_files = sorted(os.listdir(self.mfcc_dir))

        self.length = len(self.mfcc_files)

        num_mfccs = self.length

        # Iterate through mfccs
        self.mfccs = [None]*num_mfccs
        for i in range(num_mfccs):
        #   Load a single mfcc
            mfcc = np.load(self.mfcc_dir+'/'+self.mfcc_files[i])
        #   Do Cepstral Normalization of mfcc (explained in writeup)
            sigma = np.std(mfcc,axis=0)
            mfcc -= np.sum(mfcc,axis=0)/mfcc.shape[0]
            mfcc /= sigma
            self.mfccs[i] = torch.tensor(mfcc)

    def __len__(self):
        
        '''
        TODO: What do we return here?
        '''
        return self.length

    def __getitem__(self, ind):
        '''
        TODO: RETURN THE MFCC COEFFICIENTS AND ITS CORRESPONDING LABELS

        If you didn't do the loading and processing of the data in __init__,
        do that here.

        Once done, return a tuple of features and labels.
        '''

        mfcc = self.mfccs[ind]
        return mfcc


    def collate_fn(self,batch):
        '''
        TODO:
        1.  Extract the features and labels from 'batch'
        2.  We will additionally need to pad both features and labels,
            look at pytorch's docs for pad_sequence
        3.  This is a good place to perform transforms, if you so wish. 
            Performing them on batches will speed the process up a bit.
        4.  Return batch of features, labels, lenghts of features, 
            and lengths of labels.
        '''
        # batch of input mfcc coefficients
        batch_mfcc = batch
        lengths_mfcc = [len(m) for m in batch_mfcc]
        batch_mfcc_pad = pad_sequence(batch_mfcc, batch_first=True)
        return batch_mfcc_pad, torch.tensor(lengths_mfcc)

In [45]:
root = 'data'

In [7]:
val_data = AudioDataset(root, 'dev-clean')

In [8]:
train_data = AudioDataset(root, 'train-clean-100')

In [9]:
test_data = AudioTestDataset(root, 'test-clean')

In [46]:
train_loader = torch.utils.data.DataLoader(
    dataset     = train_data, 
    num_workers = 4,
    batch_size  = config['batch_size'], 
    pin_memory  = True,
    shuffle     = True,
    collate_fn = train_data.collate_fn
)
val_loader = torch.utils.data.DataLoader(
    dataset     = val_data, 
    num_workers = 2,
    batch_size  = config['batch_size'],
    pin_memory  = True,
    shuffle     = False,
    collate_fn = val_data.collate_fn
)
test_loader = torch.utils.data.DataLoader(
    dataset     = test_data, 
    num_workers = 2, 
    batch_size  = config['batch_size'], 
    pin_memory  = True, 
    shuffle     = False,
    collate_fn = test_data.collate_fn
)

In [47]:
print("\nChecking the shapes of the val data...")
for batch in val_loader:
    x, y, x_len, y_len = batch
    print(x.shape, y.shape, x_len.shape, y_len.shape)
    print(type(x), type(y), type(x_len), type(y_len))
    break


Checking the shapes of the val data...
torch.Size([128, 2936, 27]) torch.Size([128, 364]) torch.Size([128]) torch.Size([128])
<class 'torch.Tensor'> <class 'torch.Tensor'> <class 'torch.Tensor'> <class 'torch.Tensor'>


In [190]:
print("\nChecking the shapes of the train data...")
for batch in train_loader:
    x, y, x_len, y_len = batch
    print(x.shape, y.shape, x_len.shape, y_len.shape)
    break


Checking the shapes of the train data...
torch.Size([128, 1680, 27]) torch.Size([128, 329]) torch.Size([128]) torch.Size([128])


In [None]:
print("\nChecking the shapes of the test data...")
for batch in test_loader:
    x, x_len = batch
    print(x_len)
    print(x.shape, x_len.shape)
    break

Check if you are loading the data correctly with the following:

(Note: These are outputs from loading your data in the dataset class, not your dataloader which will have padded sequences)

- Train Dataset
```
Partition loaded:  train-clean-100
Max mfcc length:  2448
Average mfcc length:  1264.6258453344547
Max transcript:  400
Average transcript length:  186.65321139493324
```

- Dev Dataset
```
Partition loaded:  dev-clean
Max mfcc length:  3260
Average mfcc length:  713.3570107288198
Max transcript:  518
Average transcript length:  108.71698113207547
```

- Test Dataset
```
Partition loaded:  test-clean
Max mfcc length:  3491
Average mfcc length:  738.2206106870229
```

If your values is not matching, read hints, think what could have gone wrong. Then approach TAs.

# THE MODEL 

In [182]:
config = {
  'batch_size': 128,
  'lr':1e-3,
  'epochs': 50,
  'dropout': 0.3,
  'sos': SOS_TOKEN,
  'eos': EOS_TOKEN,
  'pad': PAD_TOKEN,
  'vocab': VOCAB,
  'tmask_length': 60,
  'fmask': 2
}

In [183]:
# Utils for network
torch.cuda.empty_cache()

class PermuteBlock(torch.nn.Module):
    def forward(self, x):
        return x.transpose(1, 2)

In [184]:
# citation: https://pytorchnlp.readthedocs.io/en/latest/_modules/torchnlp/nn/lock_dropout.html
import torch.nn as nn
class LockedDropout(nn.Module):
    """ LockedDropout applies the same dropout mask to every time step.

    **Thank you** to Sales Force for their initial implementation of :class:`WeightDrop`. Here is
    their `License
    <https://github.com/salesforce/awd-lstm-lm/blob/master/LICENSE>`__.

    Args:
        p (float): Probability of an element in the dropout mask to be zeroed.
    """

    def __init__(self, p=0.5):
        self.p = p
        super().__init__()

    def forward(self, x):
        """
        Args:
            x (:class:`torch.FloatTensor` [sequence length, batch size, rnn hidden size]): Input to
                apply dropout too.
        """
        if not self.training or not self.p:
            return x
        x = x.clone()
        mask = x.new_empty(x.size(0), 1, x.size(2), requires_grad=False).bernoulli_(1 - self.p)
        mask = mask.div_(1 - self.p)
        mask = mask.expand_as(x)
        return x * mask


    def __repr__(self):
        return self.__class__.__name__ + '(' \
            + 'p=' + str(self.p) + ')'

In [185]:
class pBLSTM(torch.nn.Module):
    def __init__(self, input_size, hidden_size):
        super(pBLSTM, self).__init__()
        self.blstm = nn.LSTM(input_size, hidden_size, bidirectional=True, batch_first=True)
        self.input_size = input_size

    def forward(self, x_packed):
        x_padded, x_padded_len = pad_packed_sequence(x_packed, batch_first=True)
        x_pad_trunc, x_pad_trunc_len = self.trunc_reshape(x_padded, x_padded_len)
        x_packed = pack_padded_sequence(x_pad_trunc, x_pad_trunc_len, batch_first=True, enforce_sorted=False)
        output, (h_n, c_n) = self.blstm(x_packed)
        return output

    def trunc_reshape(self, x, x_lens): 
        B, T, F = x.shape
        x = x[:,:(T//2)*2, :]
        x = x.reshape(B,T//2,F*2)
        x_lens = x_lens // 2
        return x, x_lens

class Listener(torch.nn.Module):
    def __init__(self, input_size, encoder_hidden_size):
        super(Listener, self).__init__()
        self.permute = PermuteBlock()
        self.embedding = nn.Conv1d(in_channels=27, out_channels=input_size, kernel_size=3, padding=1)
        self.encoder_size = encoder_hidden_size
        self.pBLSTM1 = pBLSTM(input_size*2, encoder_hidden_size)
        self.pBLSTM2 = pBLSTM(encoder_hidden_size*4, encoder_hidden_size)
        self.ldp = LockedDropout(p=config['dropout'])

    def forward(self, x, x_lens):
        x = self.permute(x) 
        E = self.embedding(x) 
        E = self.permute(E)
        E_packed = pack_padded_sequence(E, x_lens, batch_first=True, enforce_sorted=False)

        pblstms_out = self.pBLSTM1(E_packed.to(device))
        pblstms_out, pad_lens = pad_packed_sequence(pblstms_out, batch_first=True)
        pblstms_out = self.ldp(pblstms_out)
        pblstms_out = pack_padded_sequence(pblstms_out, pad_lens, batch_first=True, enforce_sorted=False)

        pblstms_out = self.pBLSTM2(pblstms_out) 
        pblstms_out, pad_lens = pad_packed_sequence(pblstms_out, batch_first=True)
        pblstms_out = self.ldp(pblstms_out)
        pblstms_out = pack_padded_sequence(pblstms_out, pad_lens, batch_first=True, enforce_sorted=False)

        pblstm_out = self.pBLSTM2(pblstms_out)
        pblstms_out, pad_lens = pad_packed_sequence(pblstms_out, batch_first=True)
        pblstms_out = self.ldp(pblstms_out)
        pblstms_out = pack_padded_sequence(pblstms_out, pad_lens, batch_first=True, enforce_sorted=False)

        encoder_outputs, encoder_lens = pad_packed_sequence(pblstms_out, batch_first=True)
        return encoder_outputs, encoder_lens

## Attention

In [186]:
class Attention(torch.nn.Module):
  def __init__(self, listener_hidden_size, speller_hidden_size, projection_size):
    super(Attention, self).__init__()
    self.Wv = nn.Linear(listener_hidden_size*2, projection_size)
    self.Wk = nn.Linear(listener_hidden_size*2, projection_size)
    self.Wq = nn.Linear(speller_hidden_size, projection_size)
    self.listener_hidden_size = listener_hidden_size
    self.speller_hidden_size = speller_hidden_size 
    self.projection_size = projection_size
  
  def set_key_value(self, encoder_outputs):
    self.key = self.Wk(encoder_outputs)
    self.value = self.Wv(encoder_outputs)

  def compute_context(self, decoder_context):
    query = self.Wq(decoder_context)
    query = torch.unsqueeze(query,1)
    raw_weights = torch.bmm(query, self.key.transpose(1,2))
    attention_weights = torch.softmax(raw_weights,dim=2)
    attention_context = torch.bmm(attention_weights, self.value)
    attention_weights = attention_weights.squeeze(1) 
    attention_context = attention_context.squeeze(1)
    return attention_context, attention_weights

## The Speller

In [187]:
class Speller(torch.nn.Module):

  def __init__(self, attender:Attention, vocab_size, embedding_size, hidden_size):
    super(). __init__()

    self.attend = attender # Attention object in speller
    self.max_timesteps = 518

    self.embedding =  torch.nn.Embedding(vocab_size,embedding_size).cuda()
    self.lstm_cells =  torch.nn.Sequential(
        torch.nn.LSTMCell(embedding_size+self.attend.projection_size,hidden_size),
        torch.nn.LSTMCell(hidden_size,hidden_size)
    )
    
    # For CDN (Feel free to change)
    self.output_to_char = nn.Linear(self.attend.projection_size+hidden_size, embedding_size)# Linear module to convert outputs to correct hidden size (Optional: TO make dimensions match)
    self.activation = nn.Tanh()
    self.char_prob = nn.Linear(embedding_size, vocab_size) # Linear layer to convert hidden space back to logits for token classification
    self.char_prob.weight = self.embedding.weight

  def lstm_step(self, input_word, hidden_states_list):

    for i in range(len(self.lstm_cells)):
        hidden_states_list[i] = self.lstm_cells[i](input_word, hidden_states_list[i])
        input_word = hidden_states_list[i][0]
  
    return input_word, hidden_states_list
    
  def CDN(self, cdn_input): # handles the probability distribution too
    # Make the CDN here, you can add the output-to-char
    chars = self.output_to_char(cdn_input)
    chars = self.activation(chars)
    raw_pred = self.char_prob(chars)
    return raw_pred
    
  def forward (self, y=None, teacher_forcing_ratio=1):
    # print(self.attend.value.shape, self.attend.key.shape)
    B = self.attend.value.shape[0]
    attn_context = torch.zeros((B, self.attend.projection_size)).to(device)
    output_symbol = torch.full((B,), fill_value=config['sos'], dtype=torch.long).to(device)
    raw_outputs = []  
    attention_plot = []
    hidden_states_list = [None]*len(self.lstm_cells)
      
    if y is None:
      timesteps = self.max_timesteps
      teacher_forcing_ratio = 0 #Why does it become zero?

    else:
      timesteps = y.shape[1]

    for t in range(timesteps):
      p = torch.rand(1).item()
      # print(t)

      if p < teacher_forcing_ratio and t > 0: # Why do we consider cases only when t > 0? What is considered when t == 0? Think.
        output_symbol = y[:, t-1]

      char_embed = self.embedding(output_symbol) # Embed the character symbol
      lstm_input = torch.cat((char_embed, attn_context), dim=1)

      rnn_out, hidden_states_list = self.lstm_step(lstm_input, hidden_states_list) # Feed the input through LSTM Cells and attention.
      # What should we retrieve from forward_step to prepare for the next timestep?

      attn_context, attn_weights = self.attend.compute_context(rnn_out) # Feed the resulting hidden state into attention

      cdn_input = torch.cat((rnn_out, attn_context), dim=1)
      raw_pred = self.CDN(cdn_input)

      # Generate a prediction for this timestep and collect it in output_symbols
      output_symbol = torch.argmax(raw_pred, dim=1)

      raw_outputs.append(raw_pred) # for loss calculation
      attention_plot.append(attn_weights) # for plotting attention plot

    
    attention_plot = torch.stack(attention_plot, dim=1)
    raw_outputs = torch.stack(raw_outputs, dim=1)

    return raw_outputs, attention_plot

## LAS

Here we finally build the LAS model, comibining the listener, attender and speller together, we have given a template, but you are free to read the paper and implement it yourself.

In [188]:
class LAS(torch.nn.Module):
  def __init__(self, listener_input_size, listener_hidden_size, speller_embedding_size, speller_hidden_size, attn_proj_size): # add parameters
    super().__init__()

    self.augmentations  = torch.nn.Sequential(
        # Add Time Masking/ Frequency Masking
        PermuteBlock(),
        tat.TimeMasking(time_mask_param=config['tmask_length']),
        tat.FrequencyMasking(freq_mask_param=config['fmask']),
        PermuteBlock()
    )
    # Pass the right parameters here
    self.listener = Listener(input_size=listener_input_size, encoder_hidden_size=listener_hidden_size)
    self.attend = Attention(listener_hidden_size, speller_hidden_size, projection_size=attn_proj_size)
    self.speller = Speller(self.attend, vocab_size=len(config['vocab']), embedding_size=speller_embedding_size, hidden_size=speller_hidden_size)

  def forward(self, x,lx,y=None,teacher_forcing_ratio=1):
    
    if self.training:
      x = self.augmentations(x)
    # Encode speech features
    encoder_outputs, _ = self.listener(x,lx)

    # We want to compute keys and values ahead of the decoding step, as they are constant for all timesteps
    # Set keys and values using the encoder outputs
    self.attend.set_key_value(encoder_outputs)

    # Decode text with the speller using context from the attention
    raw_outputs, attention_plots = self.speller(y=y,teacher_forcing_ratio=teacher_forcing_ratio)

    return raw_outputs.to(device), attention_plots

# Model Setup 

In [255]:
# Baseline LAS has the following configuration:
# Encoder bLSTM/pbLSTM Hidden Dimension of 512 (256 per direction)
# Decoder Embedding Layer Dimension of 256
# Decoder Hidden Dimension of 512 
# Attention Projection Size of 128
# Feel Free to Experiment with this 

model = LAS(
    # Initialize your model 
    # Read the paper and think about what dimensions should be used
    # You can experiment on these as well, but they are not requried for the early submission
    # Remember that if you are using weight tying, some sizes need to be the same
    listener_input_size = 64,
    listener_hidden_size = 512,
    speller_embedding_size = 256,
    speller_hidden_size = 256,
    attn_proj_size = 128
)

model = model.to(device)
print(model)
summary(model, 
        x.to(device), 
        x_len, 
        y.to(device))

LAS(
  (augmentations): Sequential(
    (0): PermuteBlock()
    (1): TimeMasking()
    (2): FrequencyMasking()
    (3): PermuteBlock()
  )
  (listener): Listener(
    (permute): PermuteBlock()
    (embedding): Conv1d(27, 64, kernel_size=(3,), stride=(1,), padding=(1,))
    (pBLSTM1): pBLSTM(
      (blstm): LSTM(128, 512, batch_first=True, bidirectional=True)
    )
    (pBLSTM2): pBLSTM(
      (blstm): LSTM(2048, 512, batch_first=True, bidirectional=True)
    )
    (ldp): LockedDropout(p=0.3)
  )
  (attend): Attention(
    (Wv): Linear(in_features=1024, out_features=128, bias=True)
    (Wk): Linear(in_features=1024, out_features=128, bias=True)
    (Wq): Linear(in_features=256, out_features=128, bias=True)
  )
  (speller): Speller(
    (attend): Attention(
      (Wv): Linear(in_features=1024, out_features=128, bias=True)
      (Wk): Linear(in_features=1024, out_features=128, bias=True)
      (Wq): Linear(in_features=256, out_features=128, bias=True)
    )
    (embedding): Embedding(31, 

Unnamed: 0_level_0,Kernel Shape,Output Shape,Params,Mult-Adds
Layer,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0_augmentations.PermuteBlock_0,-,"[128, 27, 1680]",,
1_augmentations.TimeMasking_1,-,"[128, 27, 1680]",,
2_augmentations.FrequencyMasking_2,-,"[128, 27, 1680]",,
3_augmentations.PermuteBlock_3,-,"[128, 1680, 27]",,
4_listener.PermuteBlock_permute,-,"[128, 27, 1680]",,
...,...,...,...,...
2644_speller.attend.Linear_Wq,"[256, 128]","[128, 128]",,32768.0
2645_speller.attend.Linear_Wq,"[256, 128]","[128, 128]",,32768.0
2646_speller.Linear_output_to_char,"[384, 256]","[128, 256]",,98304.0
2647_speller.Tanh_activation,-,"[128, 256]",,


# Loss Function, Optimizers, Scheduler

In [256]:
optimizer   = torch.optim.Adam(model.parameters(), lr= config['lr']) # Feel free to experiment if needed
criterion   = torch.nn.CrossEntropyLoss(reduction='mean',ignore_index=config['pad']) #check how would you fill these values : https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html
scaler      = torch.cuda.amp.GradScaler()
scheduler   = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer,factor=0.2,patience=2,threshold=0.01)

# Optional (but Recommended): Create a custom class for a Teacher Force Schedule

# Levenshtein Distance

In [257]:
# We have given you this utility function which takes a sequence of indices and converts them to a list of characters
def indices_to_chars(indices, vocab):
    tokens = []
    for i in indices: # This loops through all the indices
        if int(i) == config['sos']: # If SOS is encountered, dont add it to the final list
            continue
        elif int(i) == config['eos']: # If EOS is encountered, stop the decoding process
            break
        else:
            tokens.append(vocab[int(i)])
    return tokens

# To make your life more easier, we have given the Levenshtein distantce / Edit distance calculation code
def calc_edit_distance(predictions, y, ly, vocab= config['vocab'], print_example= False):

    dist                = 0
    batch_size, seq_len = predictions.shape

    for batch_idx in range(batch_size): 

        y_sliced    = indices_to_chars(y[batch_idx,0:ly[batch_idx]], vocab)
        pred_sliced = indices_to_chars(predictions[batch_idx], vocab)
        
        # Strings - When you are using characters from the AudioDataset
        y_string    = ''.join(y_sliced)
        pred_string = ''.join(pred_sliced)
        
        dist        += Levenshtein.distance(pred_string, y_string)
        # Comment the above and uncomment below for toy dataset, as the toy dataset has a list of phonemes to compare
        # dist      += Levenshtein.distance(y_sliced, pred_sliced)

    if print_example: 
        # Print y_sliced and pred_sliced if you are using the toy dataset
        print("Ground Truth : ", y_string)
        print("Prediction   : ", pred_string)
        
    dist/=batch_size
    return dist

# Train and Validation functions 


In [258]:
def train(model, dataloader, criterion, optimizer, teacher_forcing_rate):

    model.train()
    batch_bar = tqdm(total=len(dataloader), dynamic_ncols=True, leave=False, position=0, desc='Train')

    running_loss        = 0.0
    running_perplexity  = 0.0
    
    for i, (x, y, lx, ly) in enumerate(dataloader):

        optimizer.zero_grad()

        x, y, lx, ly = x.to(device), y.to(device), lx, ly

        with torch.cuda.amp.autocast():

            raw_predictions, attention_plot = model(x, lx, y= y, teacher_forcing_ratio=teacher_forcing_rate)

            # Predictions are of Shape (batch_size, timesteps, vocab_size). 
            # Transcripts are of shape (batch_size, timesteps) Which means that you have batch_size amount of batches with timestep number of tokens.
            # So in total, you have batch_size*timesteps amount of characters.
            # Similarly, in predictions, you have batch_size*timesteps amount of probability distributions.
            # How do you need to modify transcipts and predictions so that you can calculate the CrossEntropyLoss? Hint: Use Reshape/View and read the docs
            # Also we recommend you plot the attention weights, you should get convergence in around 10 epochs, if not, there could be something wrong with 
            # your implementation
            raw_predictions = torch.reshape(raw_predictions,(raw_predictions.shape[0]*raw_predictions.shape[1],raw_predictions.shape[2])).to(torch.float32)
            y = torch.reshape(y,(y.shape[0]*y.shape[1],)).to(torch.int64)
            loss        =  criterion(raw_predictions, y)

            perplexity  = torch.exp(loss) # Perplexity is defined the exponential of the loss

            running_loss        += loss.item()
            running_perplexity  += perplexity.item()
        
        # Backward on the masked loss
        scaler.scale(loss).backward()

        # Optional: Use torch.nn.utils.clip_grad_norm to clip gradients to prevent them from exploding, if necessary
        # If using with mixed precision, unscale the Optimizer First before doing gradient clipping
        
        scaler.step(optimizer)
        scaler.update()
        

        batch_bar.set_postfix(
            loss="{:.04f}".format(running_loss/(i+1)),
            perplexity="{:.04f}".format(running_perplexity/(i+1)),
            lr="{:.04f}".format(float(optimizer.param_groups[0]['lr'])),
            tf_rate='{:.02f}'.format(teacher_forcing_rate))
        batch_bar.update()

        del x, y, lx, ly
        torch.cuda.empty_cache()

    running_loss /= len(dataloader)
    running_perplexity /= len(dataloader)
    batch_bar.close()

    return running_loss, running_perplexity, attention_plot

In [259]:
def validate(model, dataloader):

    model.eval()

    batch_bar = tqdm(total=len(dataloader), dynamic_ncols=True, position=0, leave=False, desc="Val")

    running_lev_dist = 0.0

    for i, (x, y, lx, ly) in enumerate(dataloader):

        x, y, lx, ly = x.to(device), y.to(device), lx, ly

        with torch.inference_mode():
            raw_predictions, attentions = model(x, lx, y = None)

        # Greedy Decoding
        greedy_predictions   = torch.argmax(raw_predictions, dim=2) # TODO: How do you get the most likely character from each distribution in the batch?

        # Calculate Levenshtein Distance
        running_lev_dist    += calc_edit_distance(greedy_predictions, y, ly, config['vocab'], print_example = False) # You can use print_example = True for one specific index i in your batches if you want

        batch_bar.set_postfix(
            dist="{:.04f}".format(running_lev_dist/(i+1)))
        batch_bar.update()

        del x, y, lx, ly
        torch.cuda.empty_cache()

    batch_bar.close()
    running_lev_dist /= len(dataloader)

    return running_lev_dist

# Experiment

In [260]:
# Login to Wandb
# Initialize your Wandb Run Here
# Save your model architecture in a txt file, and save the file to Wandb
import wandb
wandb.login(key="17b33e5165b64dc340be46f11e70984b346ab965")



True

In [261]:
def plot_attention(attention): 
    # Function for plotting attention
    # You need to get a diagonal plot
    plt.clf()
    sns.heatmap(attention, cmap='GnBu')
    plt.show()

In [262]:
run = wandb.init(
    name = str(config), ## Wandb creates random run names if you skip this field
    reinit = True, ### Allows reinitalizing runs when you re-run this cell
    # run_id = ### Insert specific run id here if you want to resume a previous run
    # resume = "must" ### You need this to resume previous runs, but comment out reinit = True when using this
    project = "hw4p2-ablations", ### Project should be created in your wandb account 
    config = config ### Wandb Config for your run
)

In [264]:
def save_model(model, optimizer, scheduler, metric, epoch, path):
    torch.save(
        {'model_state_dict'         : model.state_dict(),
         'optimizer_state_dict'     : optimizer.state_dict(),
         'scheduler_state_dict'     : scheduler.state_dict(),
         metric[0]                  : metric[1], 
         'epoch'                    : epoch}, 
         path
    )

def load_model(path, model, metric= 'valid_acc', optimizer= None, scheduler= None):

    checkpoint = torch.load(path)
    model.load_state_dict(checkpoint['model_state_dict'])

    if optimizer != None:
        optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    if scheduler != None:
        scheduler.load_state_dict(checkpoint['scheduler_state_dict'])
        
    epoch   = checkpoint['epoch']
    metric  = checkpoint[metric]

    return [model, optimizer, scheduler, epoch, metric]

In [265]:
# This is for checkpointing, if you're doing it over multiple sessions

last_epoch_completed = 0
start = last_epoch_completed
end = config["epochs"]
best_lev_dist = float("inf") # if you're restarting from some checkpoint, use what you saw there.
epoch_model_path = 'hw4p2_current_checkpoint.pth'
best_model_path = 'hw4p2_best_checkpoint.pth'

In [266]:
best_lev_dist = float("inf")
tf_rate = 1.0

for epoch in range(0, config['epochs']):
    
    print("\nEpoch: {}/{}".format(epoch+1, config['epochs']))

    curr_lr = float(optimizer.param_groups[0]['lr'])

    # Call train and validate, get attention weights from training
    train_loss, train_perplexity, attention_plot = train(model, train_loader, criterion, optimizer, tf_rate)
    valid_dist  = validate(model, val_loader)

    print("\tTrain Loss {:.04f}\t Learning Rate {:.07f}".format(train_loss, curr_lr))
    print("\tVal Dist {:.04f}".format(valid_dist))

    # Print your metrics

    # Plot Attention for a single item in the batch
    # plot_attention(attention_plot[0].cpu().detach().numpy())

    # Log metrics to Wandb

    # Optional: Scheduler Step / Teacher Force Schedule Step
    tf_rate = 1.0+(-0.5/config['epochs'])*(epoch+1)
    scheduler.step(valid_dist)

    wandb.log({
        'train_loss': train_loss,  
        'valid_dist': valid_dist, 
        'lr'        : curr_lr
    })
    save_model(model, optimizer, scheduler, ['valid_dist', valid_dist], epoch, epoch_model_path)
    wandb.save(epoch_model_path)
    print("Saved epoch model")

    if valid_dist <= best_lev_dist:
        best_lev_dist = valid_dist
        save_model(model, optimizer, scheduler, ['valid_dist', valid_dist], epoch, best_model_path)
        wandb.save(best_model_path)
        print("Saved best model")


Epoch: 1/50




	Train Loss 1.8219	 Learning Rate 0.0010000
	Val Dist 453.8790
Saved epoch model
Saved best model

Epoch: 2/50




	Train Loss 1.4733	 Learning Rate 0.0010000
	Val Dist 453.2832
Saved epoch model
Saved best model

Epoch: 3/50




	Train Loss 1.4090	 Learning Rate 0.0010000
	Val Dist 452.8393
Saved epoch model
Saved best model

Epoch: 4/50




	Train Loss 1.1205	 Learning Rate 0.0010000
	Val Dist 164.5333
Saved epoch model
Saved best model

Epoch: 5/50




	Train Loss 0.6068	 Learning Rate 0.0010000
	Val Dist 79.2804
Saved epoch model
Saved best model

Epoch: 6/50




	Train Loss 0.4429	 Learning Rate 0.0010000
	Val Dist 42.1219
Saved epoch model
Saved best model

Epoch: 7/50




	Train Loss 0.3684	 Learning Rate 0.0010000
	Val Dist 38.1652
Saved epoch model
Saved best model

Epoch: 8/50




	Train Loss 0.3422	 Learning Rate 0.0010000
	Val Dist 28.0758
Saved epoch model
Saved best model

Epoch: 9/50




	Train Loss 0.3054	 Learning Rate 0.0010000
	Val Dist 24.4261
Saved epoch model
Saved best model

Epoch: 10/50




	Train Loss 0.2834	 Learning Rate 0.0010000
	Val Dist 21.1009
Saved epoch model
Saved best model

Epoch: 11/50




	Train Loss 0.2592	 Learning Rate 0.0010000
	Val Dist 20.0424
Saved epoch model
Saved best model

Epoch: 12/50




	Train Loss 0.2502	 Learning Rate 0.0010000
	Val Dist 19.3586
Saved epoch model
Saved best model

Epoch: 13/50




	Train Loss 0.2388	 Learning Rate 0.0010000
	Val Dist 19.2235
Saved epoch model
Saved best model

Epoch: 14/50




	Train Loss 0.2268	 Learning Rate 0.0010000
	Val Dist 16.7446
Saved epoch model
Saved best model

Epoch: 15/50




	Train Loss 0.2249	 Learning Rate 0.0010000
	Val Dist 17.6948
Saved epoch model

Epoch: 16/50




	Train Loss 0.2148	 Learning Rate 0.0010000
	Val Dist 17.5859
Saved epoch model

Epoch: 17/50




	Train Loss 0.2091	 Learning Rate 0.0010000
	Val Dist 15.8962
Saved epoch model
Saved best model

Epoch: 18/50




	Train Loss 0.2051	 Learning Rate 0.0010000
	Val Dist 13.7506
Saved epoch model
Saved best model

Epoch: 19/50




	Train Loss 0.2058	 Learning Rate 0.0010000
	Val Dist 13.9039
Saved epoch model

Epoch: 20/50




	Train Loss 0.2001	 Learning Rate 0.0010000
	Val Dist 13.7793
Saved epoch model

Epoch: 21/50




	Train Loss 0.1951	 Learning Rate 0.0010000
	Val Dist 13.2180
Saved epoch model
Saved best model

Epoch: 22/50




	Train Loss 0.1935	 Learning Rate 0.0010000
	Val Dist 13.0245
Saved epoch model
Saved best model

Epoch: 23/50




	Train Loss 0.1942	 Learning Rate 0.0010000
	Val Dist 12.4890
Saved epoch model
Saved best model

Epoch: 24/50




	Train Loss 0.2098	 Learning Rate 0.0010000
	Val Dist 13.1968
Saved epoch model

Epoch: 25/50




	Train Loss 0.2096	 Learning Rate 0.0010000
	Val Dist 11.9528
Saved epoch model
Saved best model

Epoch: 26/50




	Train Loss 0.1815	 Learning Rate 0.0010000
	Val Dist 10.8460
Saved epoch model
Saved best model

Epoch: 27/50




	Train Loss 0.1788	 Learning Rate 0.0010000
	Val Dist 11.0284
Saved epoch model

Epoch: 28/50




	Train Loss 0.1790	 Learning Rate 0.0010000
	Val Dist 11.2143
Saved epoch model

Epoch: 29/50




	Train Loss 0.1740	 Learning Rate 0.0010000
	Val Dist 11.3738
Saved epoch model

Epoch: 30/50




	Train Loss 0.1393	 Learning Rate 0.0002000
	Val Dist 9.3057
Saved epoch model
Saved best model

Epoch: 31/50




	Train Loss 0.1337	 Learning Rate 0.0002000
	Val Dist 9.0717
Saved epoch model
Saved best model

Epoch: 32/50




	Train Loss 0.1302	 Learning Rate 0.0002000
	Val Dist 8.9223
Saved epoch model
Saved best model

Epoch: 33/50




	Train Loss 0.1308	 Learning Rate 0.0002000
	Val Dist 9.0722
Saved epoch model

Epoch: 34/50




	Train Loss 0.1318	 Learning Rate 0.0002000
	Val Dist 8.9805
Saved epoch model

Epoch: 35/50




	Train Loss 0.1288	 Learning Rate 0.0002000
	Val Dist 9.3799
Saved epoch model

Epoch: 36/50




	Train Loss 0.1206	 Learning Rate 0.0000400
	Val Dist 9.1480
Saved epoch model

Epoch: 37/50




	Train Loss 0.1220	 Learning Rate 0.0000400
	Val Dist 9.2994
Saved epoch model

Epoch: 38/50




	Train Loss 0.1306	 Learning Rate 0.0000400
	Val Dist 8.7832
Saved epoch model
Saved best model

Epoch: 39/50




	Train Loss 0.1331	 Learning Rate 0.0000400
	Val Dist 8.7342
Saved epoch model
Saved best model

Epoch: 40/50




	Train Loss 0.1327	 Learning Rate 0.0000400
	Val Dist 8.9344
Saved epoch model

Epoch: 41/50




	Train Loss 0.1398	 Learning Rate 0.0000400
	Val Dist 8.6219
Saved epoch model
Saved best model

Epoch: 42/50




	Train Loss 0.1351	 Learning Rate 0.0000400
	Val Dist 8.7165
Saved epoch model

Epoch: 43/50




	Train Loss 0.1317	 Learning Rate 0.0000400
	Val Dist 8.6247
Saved epoch model

Epoch: 44/50




	Train Loss 0.1326	 Learning Rate 0.0000400
	Val Dist 8.5463
Saved epoch model
Saved best model

Epoch: 45/50




	Train Loss 0.1415	 Learning Rate 0.0000080
	Val Dist 8.5915
Saved epoch model

Epoch: 46/50




	Train Loss 0.1459	 Learning Rate 0.0000080
	Val Dist 8.6600
Saved epoch model

Epoch: 47/50




	Train Loss 0.1492	 Learning Rate 0.0000080
	Val Dist 8.4993
Saved epoch model
Saved best model

Epoch: 48/50




	Train Loss 0.1605	 Learning Rate 0.0000080
	Val Dist 8.5521
Saved epoch model

Epoch: 49/50




	Train Loss 0.1617	 Learning Rate 0.0000080
	Val Dist 8.5797
Saved epoch model

Epoch: 50/50




	Train Loss 0.1626	 Learning Rate 0.0000080
	Val Dist 8.5323
Saved epoch model


In [267]:
run.finish()

VBox(children=(Label(value='336.870 MB of 336.870 MB uploaded (0.000 MB deduped)\r'), FloatProgress(value=1.0,…

0,1
lr,████████████████████████▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁
train_loss,█▇▆▅▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
valid_dist,███▃▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁

0,1
lr,1e-05
train_loss,0.16258
valid_dist,8.53229


# Testing

In [270]:
best_state_dict = torch.load('hw4p2_best_checkpoint.pth')
best_model = LAS(listener_input_size = 64, listener_hidden_size = 512, speller_embedding_size = 256, speller_hidden_size = 256, attn_proj_size = 128).to(device)
best_model.load_state_dict(best_state_dict['model_state_dict'])

<All keys matched successfully>

In [314]:
a = torch.tensor([1,1,1,1,2,3,4])
print(len(a))
# print(a[0:torch.where(a==3)[0]+1])

7


In [320]:
def decode_prediction(pred):
  pred_str = []
  for i in range(pred.shape[0]):
    vocab = config['vocab']
    text = pred[i]
    eos_ind = 0
    for j in range(len(text)):
      if text[j] == config['eos']:
        eos_ind = j
        break
    # print(torch.nonzero(text==config['eos']))
    prediction = [vocab[ind] for ind in text[1:eos_ind]]
    prediction = "".join(prediction)
    pred_str.append(prediction)
  return pred_str

In [321]:
results = []
best_model.eval()
print("Testing")
running_lev_dist = 0.0
for data in tqdm(test_loader):

    x, lx   = data
    x       = x.to(device)

    with torch.no_grad():
        raw_predictions, attentions = best_model(x, lx, y=None)
    greedy_predictions   = torch.argmax(raw_predictions, dim=2)
    prediction_string = decode_prediction(greedy_predictions)
    
    #TODO save the output in results array.
    results.append(prediction_string)
    # print(prediction_string)
    
    del x, lx, raw_predictions, attentions
    torch.cuda.empty_cache()
    # break

Testing


100%|██████████| 21/21 [00:30<00:00,  1.44s/it]


In [322]:
RESULT = []
for i in range(len(results)):
  for j in range(len(results[i])):
    RESULT.append(results[i][j])

In [288]:
print(RESULT)

["<sos>HE HOPED THERE WOULD BE STOOL FOR DINNER TURNIPS AND CHARACTES AND BRUISED POTATOES AND FAT MUDDEN PIECES TO BE LATELED OUT IN THICK PEPPERED FLOWER FATTEN'S SAUCE<eos><eos><eos><eos> HE HOPED THERE WOULD BE STOOL FOR DINNER TURNIPS AND CHARACTE  AND BRUISED POTATOES AND FAT MUDDEN PIECES TO BE LATELED OUT IN THICK PEPPERED FLOWER FATTEN'S SAUCE<eos><eos><eos><eos> HE HOPED THERE WOULD BE STOOL FOR DINNER TURNIPS AND CARRITS AND BRUISED POTATOES AND FAT MUDDEN PIECES TO BE LATELED OUT IN THICK PEPPERED FLOWER FATTEN'S SAUCE<eos><eos><eos><eos> HE HOPED THERE ", '<sos>STUFFED INTO YOU HIS BELLY COUNCILED HIM<eos>TH<eos><eos><eos> THE  AND THE  AND THE  AND THE  AND THE  AND THE  AND THE  AND THE  AND THE  AND THE  AND THE  AND THE  AND THE  AND THE  AND THE  AND THE  AND THE  AND THE  AND THE  AND THE  AND THE  AND THE  AND THE  AND THE  AND THE  AND THE  AND THE  AND THE  AND THE  AND THE  AND THE  AND THE  AND THE  AND THE  AND THE  AND THE  AND THE  AND THE  AND THE  AND THE  

In [323]:
for i in range(len(RESULT)):
  print(RESULT[i])

HE HOPED THERE WOULD BE STOOL FOR DINNER TURNIPS AND CHARACTES AND BRUISED POTATOES AND FAT MUDDEN PIECES TO BE LATELED OUT IN THICK PEPPERED FLOWER FATTEN'S SAUCE
STUFFED INTO YOU HIS BELLY COUNCILED HIM
AFTER EARLY KNIGHT FALL THE YELLOW LAMPS WHICH LIGHT HUP HERE AND THERE THE SQUALID QUARTER OF THE BROAFFLES
AND OUT BURITAY AND HE GOOD IN YOUR MIND
NUMBER DEN FRESH NELLIER'S WAITING ON YOU COULD NIGHT HUSBAND
THE MUSIC CAME NEARER AND HE RECALLED THE WORDS THE WORDS OF SHELLIES FRAGMENT UPON THE MOON WANDERING COMPANIONALIS PALE FOR WEARINESS
THE DULL LIGHT FELL MORE FAINTLY UPON THE PAGE WEAR ON ANOTHER REQUASION BEGAN TO ONE FOLD ITSELF SLOWLY AND TO SPREAD ABROAD ITS WIDENING TAIL
A COLD LUCID INDIFFERENCE WERE RAINED IN HIS SOUL
THE CHAOS IN WHICH HIS ARDOUR EXTINGUISHED ITSELF WAS A COLD INDIFFERENT KNOWLEDGE OF HIMSELF
AT MOST BY AN ALMS GIVEN TO A BEGGAR WHOSE BLESSING HE FLED FROM HE MIGHT HOPE WEARILY TO WIN FOR HIMSELF SOME MEASURE OF ACTUAL GRACE
WELL NOW INNOCE I DECLAR

In [324]:
data_dir = "hw4p2_submission.csv"
df = pd.DataFrame(columns=["index", "label"])
df.to_csv("hw4p2_submission.csv", index=False)
df = pd.read_csv(data_dir)
df['label'] = RESULT
df['index'] = np.array(range(2620))
df.to_csv("hw4p2_submission.csv", index=False)

!kaggle competitions submit -c attention-based-speech-recognition-slack -f hw4p2_submission.csv -m "I made it!"

100% 287k/287k [00:00<00:00, 552kB/s]
Successfully submitted to Attention-Based Speech Recognition (Slack)