# Homework 2: CTC Speech Recognition System
You can complete this notebook in Google Colab or in DataSphere (if you are brave enough).

### Grading criteria

```
[ ] (10 points) Implement a Prefix Decoder
[ ] (10 points) Train ASR System, WER criteria: 60-50 -- 3 points, 50-40 -- 5 points, 40-35 -- 7 points, <=35 -- 10 points. + Bonus point per 1% WER below 30
[ ] (5 points) Compare performance of DNN, RNN and BiRNN models in terms of WER, training time and other properties
[ ] (5 points) Compare alignments obtained from DNN, RNN and BiRNN models
```

The results of this task are two artifacts:
1. this Jupyter notebook (`.ipynb`) with completed cells, training progress, and the final score;
2. a file with the predictions of your best model on the test data.

Save the artifacts to a directory named `{your last name}_{your first name}_hw2` and pack them into a `.zip` archive.


In [None]:
#!L
#pip install torch==1.8.0+cu101
%pip install torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html
%pip install https://github.com/kpu/kenlm/archive/master.zip
%pip install dulwich
    
%enable_full_walk

## Clone github repo

In [None]:
#!L

import dulwich.client
from dulwich.repo import Repo
from dulwich import index

import os
import shutil

def git_clone(src, target):
    client, path = dulwich.client.get_transport_and_path(src)
    if os.path.isdir(target):
        shutil.rmtree(target)
    os.makedirs(target)
    r = Repo.init(target)

    remote_refs = client.fetch(src, r)
    r[b"HEAD"] = remote_refs.refs[b"HEAD"]

    index.build_index_from_tree(r.path, r.index_path(), r.object_store, r[b'HEAD'].tree)

src = "https://github.com/yandexdataschool/speech_course"
target = "./speech_course"

git_clone(src, target)
os.listdir(target)

week_05_path = './speech_course/week_05' # Change this path, if it is different in your case

In [None]:
#!L
import importlib
import collections
import os
import math
import numpy as np
import time

from speech_course.week_05.utils import *

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.data as data
import torchaudio
from torch import optim
import kenlm

import matplotlib
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm

In [None]:
#!L
# Download LibriSpeech 100hr training and test data

if not os.path.isdir("./data"):
    os.makedirs("./data")

train_dataset = torchaudio.datasets.LIBRISPEECH("./data", url="train-clean-100", download=True)
test_dataset = torchaudio.datasets.LIBRISPEECH("./data", url="test-clean", download=True)

## Tokenizer Class

In [None]:
#!L
# Class to transform text into sequences of token indices
class Tokenizer:
    """Maps characters to integers and vice versa"""
    def __init__(self):
        char_map_str = """
        ' 0
        _ 1
        a 2
        b 3
        c 4
        d 5
        e 6
        f 7
        g 8
        h 9
        i 10
        j 11
        k 12
        l 13
        m 14
        n 15
        o 16
        p 17
        q 18
        r 19
        s 20
        t 21
        u 22
        v 23
        w 24
        x 25
        y 26
        z 27
        """
        self.char_map = {}
        self.index_map = {}
        for line in char_map_str.strip().split('\n'):
            ch, index = line.split()
            self.char_map[ch] = int(index)
            self.index_map[int(index)] = ch
        self.index_map[1] = ' '

    def text_to_indecies(self, text):
        """ Use a character map and convert text to an integer sequence """
        int_sequence = []
        for c in text:
            if c == ' ':
                ch = self.char_map['_']
            else:
                ch = self.char_map[c]
            int_sequence.append(ch)
        return int_sequence

    def indecies_to_text(self, labels):
        """ Use a character map and convert integer labels to a text sequence """
        string = []
        for i in labels:
            string.append(self.index_map[i])
        return ''.join(string).replace('_', ' ')
tokenizer = Tokenizer()

In [None]:
#!L

INSERT GREEDY DECODER CODE

In [None]:
#!L
# TESTING THE GREEDY DECODER 

#Load numpy matrix, add axis [batch,classes,time]
matrix = np.loadtxt(os.path.join(week_05_path, 'test_matrix.txt'))[np.newaxis,:,:]

# Turn into Torch Tensor of shape [batch, time, classes]
matrix = torch.Tensor(matrix).transpose(1,2)

# Create a list of torch tensors
labels_indecies = [torch.Tensor(tokenizer.text_to_indecies('there seems no good reason for believing that it will change'))]

# Run the Decoder
decodes, targets = GreedyDecoder(matrix, labels_indecies, [len(labels_indecies[0])])

assert decodes[0] == 'there se ms no good reason for believing that twillc ange'
assert targets[0] == 'there seems no good reason for believing that it will change'
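
Greedy (best-path) CTC decoding takes the most likely class in every frame, collapses consecutive repeats, and drops the blank symbol. The sketch below shows only that core step in plain Python; it is not the `GreedyDecoder` expected by the test above (which also converts labels back to text), and the function name `greedy_ctc_decode` and the choice of blank index are assumptions made for illustration.

```python
def greedy_ctc_decode(frames, blank):
    """Best-path CTC decoding for a single utterance.

    frames: list of per-frame score lists, shape [time][classes]
    blank:  index of the CTC blank symbol (an assumption of this sketch)
    """
    # Most likely class index in every frame.
    best_path = [max(range(len(f)), key=f.__getitem__) for f in frames]

    decoded = []
    previous = None
    for idx in best_path:
        # Keep a symbol only if it starts a new run and is not the blank.
        if idx != previous and idx != blank:
            decoded.append(idx)
        previous = idx
    return decoded


# Toy example with 3 classes, where class 2 plays the role of the blank.
toy_frames = [
    [0.9, 0.05, 0.05],
    [0.8, 0.1, 0.1],
    [0.1, 0.1, 0.8],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.1, 0.8],
]
print(greedy_ctc_decode(toy_frames, blank=2))
```

The blank between the two runs of class 0 is what allows the same symbol to appear twice in a row in the output.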



## Implement Prefix Decoding With LM (10 points)
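
As a starting point, here is a minimal sketch of CTC prefix beam search without a language model. It works on plain probabilities (a real implementation would more likely use log-probabilities to avoid underflow on long utterances), and the name `prefix_beam_search`, its signature, and the default beam size are assumptions of this sketch, not the interface of the `BeamSearchDecoder` called in the evaluation code.

```python
from collections import defaultdict


def prefix_beam_search(probs, blank, beam_size=8):
    """CTC prefix beam search for one utterance, no language model.

    probs: list of per-frame probability lists, shape [time][classes]
    Returns the best label prefix (a tuple) and its total probability.
    """
    # Each prefix maps to (p_blank, p_non_blank): the probability mass of all
    # alignments that yield this prefix and end in a blank / a non-blank.
    beams = {(): (1.0, 0.0)}
    for frame in probs:
        next_beams = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_b, p_nb) in beams.items():
            for c, p in enumerate(frame):
                if c == blank:
                    # A blank never changes the prefix.
                    next_beams[prefix][0] += (p_b + p_nb) * p
                elif prefix and prefix[-1] == c:
                    # Repeat right after the same symbol: prefix is unchanged.
                    next_beams[prefix][1] += p_nb * p
                    # Repeat after a blank: the symbol is emitted again.
                    next_beams[prefix + (c,)][1] += p_b * p
                else:
                    next_beams[prefix + (c,)][1] += (p_b + p_nb) * p
        # Keep only the most probable prefixes.
        ranked = sorted(next_beams.items(), key=lambda kv: sum(kv[1]), reverse=True)
        beams = {prefix: tuple(scores) for prefix, scores in ranked[:beam_size]}
    best_prefix, (p_b, p_nb) = max(beams.items(), key=lambda kv: sum(kv[1]))
    return best_prefix, p_b + p_nb


# Two frames, classes: 0 = 'a', 1 = blank.
toy = [[0.4, 0.6], [0.4, 0.6]]
print(prefix_beam_search(toy, blank=1))
```

In this toy case the best single path is blank-blank (probability 0.36), yet the prefix `(0,)` wins because three alignments (`aa`, `a_`, `_a`) sum to 0.64. This is the situation in which prefix search and greedy decoding disagree. A language-model score would typically be folded in when a prefix is extended past a word boundary; that part is left out here.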

In [1]:
SKELETON IN PROGRESS


## Deep Learning part

## Create a Dataloader

In [None]:
#!L
# For training, you can add SpecAugment-style data augmentation here.
train_audio_transforms = nn.Sequential(
    torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=<YOUR CODE HERE>),
    ADD YOUR AUGMENTS HERE
)

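# NOTE: the defaults here are sample_rate=16000 and n_mels=128; the test features
# must have the same n_mels as the training features.
test_audio_transforms = torchaudio.transforms.MelSpectrogram()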

tokenizer = Tokenizer()




def data_processing(data, data_type="train"):
    spectrograms = []
    labels = []
    input_lengths = []
    label_lengths = []
    for (waveform, _, utterance, _, _, _) in data:
        if data_type == 'train':
            spec = train_audio_transforms(waveform).squeeze(0).transpose(0, 1)
        elif data_type == 'test':
            spec = test_audio_transforms(waveform).squeeze(0).transpose(0, 1)
        else:
            raise Exception('data_type should be train or test')
        spectrograms.append(spec)
        label = torch.Tensor(tokenizer.text_to_indecies(utterance.lower()))
        labels.append(label)
        input_lengths.append(spec.shape[0] // 2)
        label_lengths.append(len(label))

    spectrograms = nn.utils.rnn.pad_sequence(spectrograms, batch_first=True).unsqueeze(1).transpose(2, 3)
    labels = nn.utils.rnn.pad_sequence(labels, batch_first=True)

    return spectrograms, labels, input_lengths, label_lengths
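
The `input_lengths` entries above are `spec.shape[0] // 2` because the example model's first convolution uses stride 2 along the time axis. The small check below, assuming the kernel size 3, stride 2, padding 1 configuration of the example `SpeechRecognitionModel` (the helper name `conv_output_length` is hypothetical), confirms that this value never exceeds the number of frames the model actually emits, which CTC loss requires. If you change the stride or add more strided layers, the divisor has to change with it.

```python
def conv_output_length(length, kernel=3, stride=2, padding=1, dilation=1):
    """Output length of a convolution along one axis (standard formula)."""
    return (length + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


for n_frames in (100, 101, 1234):
    emitted = conv_output_length(n_frames)  # frames produced by the strided conv
    assumed = n_frames // 2                 # value stored in input_lengths
    assert assumed <= emitted
    print(n_frames, emitted, assumed)
```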

## Implement a Neural Network Model

You should try out a few different model types:
- Feed-Forward Model (DNN)
- Recurrent Model (GRU or LSTM)
- Bidirectional Recurrent Model (bi-GRU or bi-LSTM)
- Something different for bonus points

Before any of these models you can use convolutional layers, as shown in the example below.

After your experiments, write a report comparing the models in terms of, for example, parameter count, training speed, resulting quality, spectrogram properties, and data augmentations. Remember that full marks require a good WER.

WER criteria: 60-50 -- 3 points, 50-40 -- 5 points, 40-35 -- 7 points, <= 35 -- 10 points
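
For reference, WER is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The evaluation code below relies on the `wer` helper imported from the course `utils`; the sketch here is an independent illustration with hypothetical names (`edit_distance`, `word_error_rate`) and may differ from that helper in edge cases.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        # Convention chosen for this sketch: an empty reference counts as
        # fully wrong unless the hypothesis is empty too.
        return 0.0 if not hyp_words else 1.0
    return edit_distance(ref_words, hyp_words) / len(ref_words)


reference = 'there seems no good reason for believing that it will change'
hypothesis = 'there se ms no good reason for believing that twillc ange'
print(f'WER: {100 * word_error_rate(reference, hypothesis):.1f}%')
```

On this pair, which is the greedy-decoder test sentence from earlier in the notebook, there are 5 word errors against 11 reference words.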

### Our model classes are just examples, you can change them as you want

In [None]:
#!L
# Define model
class CNNLayerNorm(nn.Module):
    """Layer normalization for CNN inputs"""

    def __init__(self, n_feats):
        super(CNNLayerNorm, self).__init__()
        self.layer_norm = nn.LayerNorm(n_feats)

    def forward(self, x):
        # x (batch, channel, feature, time)
        x = x.transpose(2, 3).contiguous()  # (batch, channel, time, feature)
        x = self.layer_norm(x)
        return x.transpose(2, 3).contiguous()  # (batch, channel, feature, time)


class ResidualCNN(nn.Module):
    """Residual CNN inspired by https://arxiv.org/pdf/1603.05027.pdf
        except with layer norm instead of batch norm
    """

    def __init__(self, in_channels, out_channels, kernel, stride, dropout, n_feats):
        super(ResidualCNN, self).__init__()

        self.cnn1 = nn.Conv2d(in_channels, out_channels, kernel, stride, padding=kernel // 2)
        self.cnn2 = nn.Conv2d(out_channels, out_channels, kernel, stride, padding=kernel // 2)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        self.layer_norm1 = CNNLayerNorm(n_feats)
        self.layer_norm2 = CNNLayerNorm(n_feats)

    def forward(self, x):
        residual = x  # (batch, channel, feature, time)
        <YOUR CODE HERE>
        x += residual
        return x  # (batch, channel, feature, time)


class SpeechRecognitionModel(nn.Module):

    def __init__(self, n_cnn_layers, n_rnn_layers, rnn_dim, n_class, n_feats, stride=2, dropout=0.1):
        super(SpeechRecognitionModel, self).__init__()
        n_feats = n_feats // 2
        self.cnn = nn.Conv2d(1, 32, 3, stride=stride, padding=3 // 2)  # cnn for extracting hierarchical features

        # n residual cnn layers with filter size of 32
        self.rescnn_layers = nn.Sequential(*[
            ResidualCNN(32, 32, kernel=3, stride=1, dropout=dropout, n_feats=n_feats)
            for _ in range(n_cnn_layers)
        ])
        self.fully_connected = <YOUR CODE HERE>
        self.birnn_layers = <YOUR CODE HERE>
        self.classifier = <YOUR CODE HERE>

    def forward(self, x):
        x = self.cnn(x)
        x = self.rescnn_layers(x)
        sizes = x.size()
        x = x.view(sizes[0], sizes[1] * sizes[2], sizes[3])  # (batch, feature, time)
        x = x.transpose(1, 2)  # (batch, time, feature)
        x = self.fully_connected(x)
        x = self.birnn_layers(x)
        x = self.classifier(x)
        return x

## Training and Evaluation Code

In [None]:
#!L
from tqdm import tqdm_notebook

In [None]:
#!L
def train(model, device, train_loader, criterion, optimizer, scheduler, epoch):
    model.train()
    data_len = len(train_loader.dataset)
    for batch_idx, _data in enumerate(train_loader):
        spectrograms, labels, input_lengths, label_lengths = _data
        spectrograms, labels = spectrograms.to(device), labels.to(device)

        optimizer.zero_grad()

        output = model(spectrograms)  # (batch, time, n_class)
        output = F.log_softmax(output, dim=2)
        output = output.transpose(0, 1)  # (time, batch, n_class)

        loss = criterion(output, labels, input_lengths, label_lengths)
        loss.backward()

        optimizer.step()
        scheduler.step()
        if batch_idx % 100 == 0 or batch_idx == len(train_loader) - 1:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(spectrograms), data_len,
                       100. * batch_idx / len(train_loader), loss.item()))


def test(model, device, test_loader, criterion, epoch, decode='Greedy', lm=None):
    print('Beginning eval...')
    model.eval()
    test_loss = 0
    test_cer, test_wer = [], []
    with torch.no_grad():
        start = time.time()
        for i, _data in enumerate(test_loader):
            spectrograms, labels, input_lengths, label_lengths = _data
            spectrograms, labels = spectrograms.to(device), labels.to(device)
            
            matrix = model(spectrograms)  # (batch, time, n_class)
            matrix = F.log_softmax(matrix, dim=2)
            probs = F.softmax(matrix,dim=2)
            matrix = matrix.transpose(0, 1)  # (time, batch, n_class)
                
            loss = criterion(matrix, labels, input_lengths, label_lengths)
            test_loss += loss.item() / len(test_loader)

            if decode == 'Greedy':
                decoded_preds, decoded_targets = GreedyDecoder(matrix.transpose(0, 1), labels, label_lengths)
            elif decode == 'BeamSearch':
                ## THIS IS A CLASS YOU SHOULD IMPLEMENT
                decoded_preds, decoded_targets = BeamSearchDecoder(probs, labels, label_lengths, input_lengths, lm=lm)
            for j in range(len(decoded_preds)):
                test_cer.append(cer(decoded_targets[j], decoded_preds[j]))
                test_wer.append(wer(decoded_targets[j], decoded_preds[j]))

    avg_cer = sum(test_cer) / len(test_cer)
    avg_wer = sum(test_wer) / len(test_wer)

    print(
        'Epoch: {:d}, Test set: Average loss: {:.4f}, Average CER: {:.4f} Average WER: {:.4f}\n'.format(
            epoch, test_loss, avg_cer, avg_wer))

In [None]:
#!L
#pragma async 
# PRAGMA ASYNC IS NECESSARY FOR TRAINING!
torch.manual_seed(7)
if torch.cuda.is_available():
    print('GPU found! 🎉')
    device = 'cuda'
else:
    print('Only CPU found! 💻')
    device = 'cpu'

verbose=False

# Hyperparameters for your model
hparams = {
    "n_cnn_layers": 3,
    "n_rnn_layers": <YOUR CODE HERE>,
    "rnn_dim": <YOUR CODE HERE>,
    "n_class": 29,
    "n_feats": <YOUR CODE HERE>,
    "stride": 2,
    "dropout": 0.1,
    "learning_rate":  5e-4,
    "batch_size":  10,
    "epochs": 20
}

# Define your training and test data loaders
kwargs = {'num_workers': 1, 'pin_memory': True} if device=='cuda' else {}
train_loader = <YOUR_CODE>
test_loader = <YOUR_CODE>

# Define ASR Model 
model = SpeechRecognitionModel(
    hparams['n_cnn_layers'], hparams['n_rnn_layers'], hparams['rnn_dim'],
    hparams['n_class'], hparams['n_feats'], hparams['stride'], hparams['dropout']
).to(device)

model.to(device)

if verbose:
    print(model)
print('Num Model Parameters', sum([param.nelement() for param in model.parameters()]))

# Define optimizer, criterion, scheduler
optimizer = <YOUR CODE>
criterion = <YOUR CODE>
scheduler = <YOUR CODE>  # I suggest OneCycleLR for speed.


#iter_meter = IterMeter()
start = time.time()
print("Start training...")
for epoch in range(1, hparams['epochs'] + 1):
    ep_start = time.time()
    train(model, device, train_loader, criterion, optimizer, scheduler, epoch)
    #if epoch % 2 == 0:
    save_checkpoint(model, checkpoint_name=f'model_epoch{epoch}.tar')
    load_checkpoint(model, checkpoint_name=f'model_epoch{epoch}.tar', path='./', device=device)
    test(model, device, test_loader, criterion, epoch)
    print(f"Time for epoch: {round(time.time() - ep_start, 0)} sec.")
save_checkpoint(model, checkpoint_name=f'model.tar')
duration = time.time() - start
print(f'Training took {np.round(duration / 60.0, 1)} min.')

In [None]:
#!L
# Test the model in Prefix Decode Mode - only do after you have implemented your prefix decoder

test(model, device, test_loader, criterion, epoch, decode='BeamSearch', lm=None)

lm = kenlm.Model('3-gram.pruned.1e-7.arpa')
test(model, device, test_loader, criterion, epoch, decode='BeamSearch', lm=lm)

## Compare different models: DNN, GRU/LSTM, bi-GRU/bi-LSTM (5 points)

## Analyze CTC Alignments (5 points)

In this section you should compare alignments obtained from different models.

For example, you can show:

* Examples of alignments and their analysis
* Differences in the properties of alignment distributions over the dataset
* Dynamics of alignments during training (from checkpoints)
* Connection between alignments and model loss
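
As background for the link between alignments and loss: the CTC loss is the negative log of the total probability of all alignments that collapse to the target, and that total is what the forward (alpha) recursion computes. The sketch below is a plain-Python illustration on probabilities, with a hypothetical name `ctc_forward_probability`; it is not the seminar's `soft_alignment`, which additionally needs the backward pass.

```python
def ctc_forward_probability(probs, labels, blank):
    """Total probability of all CTC alignments of `labels`.

    probs:  list of per-frame probability lists, shape [time][classes]
    labels: list of target label indices (without blanks)
    """
    # Extended target: blanks around and between the labels.
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    n_states = len(ext)

    # Initialisation: a path may start in the first blank or the first label.
    alpha = [0.0] * n_states
    alpha[0] = probs[0][ext[0]]
    if n_states > 1:
        alpha[1] = probs[0][ext[1]]

    for frame in probs[1:]:
        new_alpha = [0.0] * n_states
        for s in range(n_states):
            total = alpha[s]              # stay in the same state
            if s >= 1:
                total += alpha[s - 1]     # advance by one state
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new_alpha[s] = total * frame[ext[s]]
        alpha = new_alpha

    # Valid paths end in the last label or in the trailing blank.
    return alpha[-1] + (alpha[-2] if n_states > 1 else 0.0)


# Two frames, classes: 0 = 'a', 1 = blank.
toy = [[0.4, 0.6], [0.4, 0.6]]
print(ctc_forward_probability(toy, [0], blank=1))
```

Taking `-log` of the returned value gives the per-utterance CTC loss under these assumptions, which is one way to relate sharp or diffuse alignments to the loss values you observe.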


In [None]:
#!L
ADD YOUR CODE FOR CTC FORWARD BACKWARD FROM SEMINAR

In [None]:
#!L
# Test your implementation of CTC
# Load numpy matrix of shape [classes, time]
matrix = np.loadtxt(os.path.join(week_05_path, 'test_matrix.txt'))
# Create label_sequence
tokenizer = Tokenizer()
labels_indecies = tokenizer.text_to_indecies('there se ms no good reason for believing that twillc ange')

align = soft_alignment(labels_indecies, matrix)

ref_align = np.loadtxt(os.path.join(week_05_path, 'soft_alignment.txt'))

assert np.allclose(ref_align, align)

In [None]:
#Example
model.eval()
_data = next(iter(test_loader))
spectrograms, labels, input_lengths, label_lengths = _data
spectrograms, labels = spectrograms.to(device), labels.to(device)

matrix = model(spectrograms).transpose(1,2)  # (batch, n_class, time)

In [None]:
# Example of alignment calculation (padding is trimmed from the labels first):
with torch.no_grad():
    target = labels[0][:label_lengths[0]].int().cpu().numpy()
    align = soft_alignment(target, F.softmax(matrix[0], dim=0).cpu().numpy())

In [None]:
plt.figure(dpi=150)
plt.imshow(align, aspect='auto', interpolation='nearest')
plt.colorbar()

plt.figure(dpi=150)
plt.imshow(np.log(align), aspect='auto', interpolation='nearest')
plt.colorbar()

### Conclusions 🧑‍🎓

* What challenges did you encounter while completing this task?
* What skills have you acquired while doing this task?
* How difficult did you find this task (on a scale from 0 to 10), and why?
* What did you like in this homework, and what didn't?