<a href="https://colab.research.google.com/github/deepikadhiman5517/speechRecognition/blob/main/speech.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Frame Level Speech Recognition with Neural Networks

In this coursework you will take your knowledge of feedforward neural networks and apply it to the task of speech recognition.

You are provided a dataset of audio recordings (utterances) and their phoneme state (subphoneme) labels. The data comes from articles published in the Wall Street Journal (WSJ) that are read aloud and labelled using the original text. If you have not encountered speech data before or have not heard of phonemes or spectrograms, we will clarify these here:

## Phonemes and Phoneme States

As letters are the atomic elements of written language, phonemes are the atomic elements of speech. It is crucial for us to have a means to distinguish different sounds in speech that may or may not represent the same letter or combinations of letters in the written alphabet. For example, the words "jet" and "ridge" both contain the same sound and we refer to this elemental sound as the phoneme "JH". For this challenge we will consider 46 phonemes in the English language.

["+BREATH+", "+COUGH+", "+NOISE+", "+SMACK+", "+UH+", "+UM+", "AA", "AE", "AH", "AO", "AW", "AY", "B", "CH", "D", "DH", "EH", "ER", "EY", "F", "G", "HH", "IH", "IY", "JH", "K", "L", "M", "N", "NG", "OW", "OY", "P", "R", "S", "SH", "SIL", "T", "TH", "UH", "UW", "V", "W", "Y", "Z", "ZH"]

A powerful technique in speech recognition is to model speech as a Markov process with unobserved states. This model considers observed speech to be dependent on unobserved state transitions. We refer to these unobserved states as phoneme states or subphonemes. Each phoneme has 3 phoneme states, so our 46 phonemes give 46 × 3 = 138 phoneme states. The transition graph of the phoneme states for a given phoneme is as follows:


Hidden Markov Models (HMMs) estimate the parameters of this unobserved Markov process (transition and emission probabilities) that maximize the likelihood of the observed speech data.
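
The state count above can be checked with a few lines of Python. This is only a sketch: it assumes, purely for illustration, that the three states of each phoneme occupy consecutive label indices (state `3 * p + s` for phoneme `p` and sub-state `s`). The text does not state the actual ordering of the 138 labels, and `state_to_phoneme` is a hypothetical helper, not part of the provided code.

```python
# The 46 phoneme symbols listed above.
PHONEMES = ["+BREATH+", "+COUGH+", "+NOISE+", "+SMACK+", "+UH+", "+UM+",
            "AA", "AE", "AH", "AO", "AW", "AY", "B", "CH", "D", "DH", "EH",
            "ER", "EY", "F", "G", "HH", "IH", "IY", "JH", "K", "L", "M", "N",
            "NG", "OW", "OY", "P", "R", "S", "SH", "SIL", "T", "TH", "UH",
            "UW", "V", "W", "Y", "Z", "ZH"]
STATES_PER_PHONEME = 3
NUM_STATES = len(PHONEMES) * STATES_PER_PHONEME


def state_to_phoneme(state_index):
    """Map a state label to (phoneme, sub-state).

    ASSUMPTION: states of one phoneme are stored consecutively. The real
    label layout of the dataset may differ.
    """
    if not 0 <= state_index < NUM_STATES:
        raise ValueError("state index out of range: %d" % state_index)
    phoneme_index, sub_state = divmod(state_index, STATES_PER_PHONEME)
    return PHONEMES[phoneme_index], sub_state


print(NUM_STATES)             # 138
print(state_to_phoneme(0))    # ('+BREATH+', 0)
print(state_to_phoneme(137))  # ('ZH', 2)
```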

Your task is to instead take a model-free approach and classify mel spectrogram frames using a neural network that takes a frame (plus optional context) and outputs class probabilities for all 138 phoneme states. Performance on the task will be measured by classification accuracy on a held-out set of labelled mel spectrogram frames. Training/dev labels are provided as integers [0-137].
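
To make the evaluation metric concrete, here is a minimal pure-Python sketch of turning per-frame scores into class probabilities and computing frame-level classification accuracy. It uses 3 classes instead of 138 to stay readable, the scores are made up for illustration, and `softmax` and `frame_accuracy` are hypothetical helper names.

```python
import math


def softmax(scores):
    """Convert raw scores for one frame into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def frame_accuracy(score_rows, labels):
    """Fraction of frames whose highest-scoring class equals the label."""
    correct = 0
    for scores, label in zip(score_rows, labels):
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == label)
    return correct / len(labels)


# Made-up scores for 4 frames and 3 classes.
scores = [[2.0, 0.5, -1.0],
          [0.1, 0.2, 3.0],
          [1.5, 1.0, 0.0],
          [0.0, 2.0, 1.0]]
labels = [0, 2, 1, 1]

probs = softmax(scores[0])
accuracy = frame_accuracy(scores, labels)
print(accuracy)  # 0.75
```

Because softmax preserves the ordering of the scores, the predicted class can be read directly from the raw network outputs, which is what the training and validation code in this notebook does.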

## Representing speech

As a first step, the speech must be converted into a feature representation that can be fed into the network.

In our representation, utterances have been converted to "mel spectrograms", which are pictorial representations that characterize how the frequency content of the signal varies with time. The frequency-domain representation of the audio signal provides more useful features for distinguishing phonemes than the raw waveform does.

For a more intuitive understanding, consider attempting to determine which instruments are playing in an orchestra given an audio recording of a performance. By looking only at the amplitude of the signal of the orchestra over time, it is nearly impossible to distinguish one source from another. But if the signal is transformed into the frequency domain, we can use our knowledge that flutes produce higher frequency sounds and bassoons produce lower frequency sounds. In speech, a similar phenomenon is observed when the vocal tract produces sounds at varying frequencies.

To convert the speech to a mel spectrogram, it is segmented into little "frames", each 25ms wide, where the "stride" between adjacent frames is 10ms. Thus we get 100 such frames per second of speech.

From each frame, we compute a single "mel spectral" vector, where the components of the vector represent the (log) energy in the signal in different frequency bands. In the data we have given you, we have 40-dimensional mel-spectral vectors, i.e. we have computed energies in 40 frequency bands.

Thus, we get 100 40-dimensional mel spectral (row) vectors per second of speech in the recording. Each one of these vectors is referred to as a frame. The details of how mel spectrograms are computed from speech are explained in the attached blog.

Thus, for a T-second recording, the entire spectrogram is a 100T x 40 matrix, comprising 100T 40-dimensional vectors (at 100 vectors (frames) per second).
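
The frame arithmetic can be sketched in a few lines. The constants come from the description above; `spectrogram_shape` is a hypothetical helper, and it follows the text's approximation of 100 frames per second, ignoring the edge effect of the 25 ms window at the end of a recording.

```python
FRAME_WIDTH_MS = 25    # width of each analysis frame
FRAME_STRIDE_MS = 10   # hop between the starts of adjacent frames
NUM_MEL_BANDS = 40     # length of each mel-spectral vector

FRAMES_PER_SECOND = 1000 // FRAME_STRIDE_MS


def spectrogram_shape(duration_seconds):
    """Approximate (frames, bands) shape for a recording of the given length."""
    num_frames = round(FRAMES_PER_SECOND * duration_seconds)
    return (num_frames, NUM_MEL_BANDS)


# Adjacent frames overlap because the width exceeds the stride.
overlap_ms = FRAME_WIDTH_MS - FRAME_STRIDE_MS

print(FRAMES_PER_SECOND)       # 100
print(overlap_ms)              # 15
print(spectrogram_shape(2))    # (200, 40)
print(spectrogram_shape(3.5))  # (350, 40)
```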

The training data comprise:

- Speech recordings (raw mel spectrogram frames)
- Frame-level phoneme state labels

The test data comprise:

- Speech recordings (raw mel spectrogram frames)
- Phoneme state labels are not given

Your job is to identify the phoneme state label for each frame in the test data set. It is important to note that utterances are of variable length. We are providing you code to load and parse the raw files into the expected format. For now we are only providing dev data files as the training file is very large.

## Feature Files

[train|dev|test].npy contain a numpy object array of shape [utterances]. Each utterance is a float32 ndarray of shape [time, frequency], where time is the length of the utterance. Frequency dimension is always 40 but time dimension is of variable length.

## Label Files

[train|dev]_labels.npy contain a numpy object array of shape [utterances]. Each element in the array is an int32 array of shape [time] and provides the phoneme state label for each frame. There are 138 distinct labels [0-137], one for each subphoneme.
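
The layout of these files can be mimicked with synthetic data, which is handy for testing a loader without the real files. The sketch below builds random utterances of lengths 5, 8 and 3 frames (arbitrary values chosen for illustration), round-trips them through `np.save` and `np.load` using an in-memory buffer, and checks the shapes described above.

```python
import io

import numpy as np

rng = np.random.default_rng(0)
lengths = [5, 8, 3]  # hypothetical utterance lengths, in frames

# Object arrays of shape [utterances], one variable-length entry each.
features = np.empty(len(lengths), dtype=object)
labels = np.empty(len(lengths), dtype=object)
for i, t in enumerate(lengths):
    features[i] = rng.standard_normal((t, 40)).astype(np.float32)
    labels[i] = rng.integers(0, 138, size=t, dtype=np.int32)

# Object arrays are pickled on disk, so loading needs allow_pickle=True.
buffer = io.BytesIO()
np.save(buffer, features, allow_pickle=True)
buffer.seek(0)
loaded = np.load(buffer, allow_pickle=True)

# Every utterance has one label per frame and 40 frequency bands.
for x, y in zip(loaded, labels):
    assert x.shape == (len(y), 40)

total_frames = sum(x.shape[0] for x in loaded)
print(loaded.shape)   # (3,)
print(total_frames)   # 16
```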




In [None]:
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
import torch.optim as optim
from torch.optim.lr_scheduler import MultiStepLR

import logging
import argparse
import os
import pandas as pd
import datetime


In [None]:
current_time = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
current_time

In [None]:
parser = argparse.ArgumentParser(description='speech_recognition')
parser.add_argument('--lr', default=0.001, type=float, help='learning rate')
parser.add_argument('--batch_size', default=512, type=int, help='batch size')
parser.add_argument('--context_size', default=12, type=int, help='context size')
parser.add_argument('--input_size', default=1000, type=int, help='input size')
parser.add_argument('--output_size', default=138, type=int, help='output size')
parser.add_argument('--num_epochs', default=18, type=int, help='epoch number')
parser.add_argument('--decay_steps', default='7, 12', type=str,
                    help='The step where learning rate decay by 0.1')
parser.add_argument('--save_step', default=5, type=int, help='step for saving model')
parser.add_argument('--eval_step', default=1, type=int, help='step for validation')
parser.add_argument('--train_data_path', default='/content/drive/MyDrive/dev.npy')
parser.add_argument('--train_label_path', default='/content/drive/MyDrive/dev_labels.npy')
parser.add_argument('--val_data_path', default='/content/drive/MyDrive/dev.npy', type=str)
parser.add_argument('--val_label_path', default='/content/drive/MyDrive/dev_labels.npy', type=str)
parser.add_argument('--test_data_path', default='../data/test.npy', type=str)
parser.add_argument('--checkpoint_dir', default='../checkpoints/', help='checkpoint folder root')
parser.add_argument('--result_file_name', default='hw1p2_test_result.csv', type=str, help='testing result save path')

parser.add_argument("-f", "--fff", help="a dummy argument to fool ipython", default="1")
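
The `--lr` and `--decay_steps` defaults together define a stepwise learning-rate schedule. The following plain-Python sketch reproduces the multi-step rule (the base rate is multiplied by 0.1 once for every milestone that has been reached) without needing PyTorch. `parse_decay_steps` and `lr_after_steps` are hypothetical helpers, and `scheduler_steps` means the number of scheduler updates performed so far.

```python
def parse_decay_steps(text):
    """Turn a string such as '7, 12' into a list of integer milestones."""
    return [int(part) for part in text.split(',')]


def lr_after_steps(base_lr, milestones, gamma, scheduler_steps):
    """Learning rate once `scheduler_steps` scheduler updates have happened."""
    num_decays = sum(1 for m in milestones if scheduler_steps >= m)
    return base_lr * gamma ** num_decays


milestones = parse_decay_steps('7, 12')
for steps in (0, 6, 7, 11, 12):
    lr = lr_after_steps(0.001, milestones, 0.1, steps)
    print('{:2d} scheduler steps -> lr {:.6f}'.format(steps, lr))
```

Note that `int()` tolerates surrounding whitespace, which is why the default string `'7, 12'` parses correctly despite the space after the comma.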

In [None]:
args = parser.parse_args()
args.expr_dir = os.path.join(args.checkpoint_dir, current_time)
os.makedirs(args.expr_dir)


In [None]:
# Create the log
log_path = os.path.join(args.expr_dir, 'speech_recognition_{}.log'.format(current_time))
logging.basicConfig(filename=log_path, level=logging.INFO)

In [None]:
# Modify the result save path
args.result_file_name = os.path.join(args.expr_dir, args.result_file_name)

In [None]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print(device)

In [None]:
def save_log(message):
    print(message)
    logging.info(message)

In [None]:
class load_dataset(Dataset):
    def __init__(self, data_path, label_path=None):
        # Data and labels have the same time length for each utterance
        # Data shape: (utterance, seq_len, 40), Label shape: (utterance, seq_len)
        self.data = np.load(data_path, encoding='bytes', allow_pickle=True)
        if label_path:
            self.label = np.load(label_path,encoding='bytes',allow_pickle=True)
        else:
            self.label = None

        self.idx_map = []
        for i, xs in enumerate(self.data):
            for j in range(xs.shape[0]):
                self.idx_map.append((i, j))

    def __getitem__(self, index):
        i, j = self.idx_map[index]
        # Take context_size frames before and after the current frame;
        # mode='clip' repeats the first/last frame at utterance boundaries
        x = self.data[i].take(range(j - args.context_size, j + args.context_size + 1), mode='clip', axis=0).flatten()
        # Normalize
        # x = (x - x.mean()) / x.std()
        # Select the phoneme state label for the current frame
        y = np.int32(self.label[i][j]).reshape(1) if self.label is not None else np.int32(-1).reshape(1)
        return torch.from_numpy(x).float(), torch.LongTensor(y)

    def __len__(self):
        return len(self.idx_map)
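
The context window built in `__getitem__` can be examined in isolation. The sketch below uses a toy 5-frame utterance and k = 12 to show that the flattened input has (2k + 1) × 40 = 1000 values, matching the `--input_size` default, and that `mode='clip'` fills out-of-range positions by repeating the edge frames. The zero-padded variant is shown only for contrast. It is not what the dataset class above does, and both function names are hypothetical.

```python
import numpy as np


def context_window_clip(utterance, j, k):
    """Frames j-k..j+k, repeating the first/last frame past the edges."""
    return utterance.take(range(j - k, j + k + 1), mode='clip', axis=0).flatten()


def context_window_zero_pad(utterance, j, k):
    """Alternative: frames j-k..j+k with zeros past the edges."""
    padded = np.pad(utterance, ((k, k), (0, 0)))  # constant zero padding
    return padded[j:j + 2 * k + 1].flatten()


# Toy utterance: 5 frames, 40 bands, frame t filled with the value t + 1.
utterance = np.repeat(np.arange(1, 6, dtype=np.float32)[:, None], 40, axis=1)
k = 12

clipped = context_window_clip(utterance, 0, k)
zero_padded = context_window_zero_pad(utterance, 0, k)

print(clipped.shape)    # (1000,)
print(clipped[0])       # 1.0, a copy of the first frame
print(zero_padded[0])   # 0.0, padding
```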

In [None]:
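###
# * Layers -> [input_size, 2048, 2048, 2048, 1024, 1024, 1024, 1024, 512, 512, 512, output_size]
# * ReLU activation followed by BatchNorm1d after every hidden linear layer
# * Context size k = 12 frames on both sides (the --context_size default)
# * Adam optimizer, with the default learning rate 1e-3
# * Edge frames are repeated (index clipping) to fill the context at utterance boundaries
###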
class MLP(nn.Module):
    def __init__(self, input_size, output_size):
        super(MLP, self).__init__()
        self.net = nn.Sequential(nn.Linear(input_size, 2048),
                                 nn.ReLU(inplace=True),
                                 nn.BatchNorm1d(2048),
                                 nn.Linear(2048, 2048),
                                 nn.ReLU(inplace=True),
                                 nn.BatchNorm1d(2048),
                                 nn.Linear(2048, 2048),
                                 nn.ReLU(inplace=True),
                                 nn.BatchNorm1d(2048),
                                 nn.Linear(2048, 1024),
                                 nn.ReLU(inplace=True),
                                 nn.BatchNorm1d(1024),
                                 nn.Linear(1024, 1024),
                                 nn.ReLU(inplace=True),
                                 nn.BatchNorm1d(1024),
                                 nn.Linear(1024, 1024),
                                 nn.ReLU(inplace=True),
                                 nn.BatchNorm1d(1024),
                                 nn.Linear(1024, 1024),
                                 nn.ReLU(inplace=True),
                                 nn.BatchNorm1d(1024),
                                 nn.Linear(1024, 512),
                                 nn.ReLU(inplace=True),
                                 nn.BatchNorm1d(512),
                                 nn.Linear(512, 512),
                                 nn.ReLU(inplace=True),
                                 nn.BatchNorm1d(512),
                                 nn.Linear(512, 512),
                                 nn.ReLU(inplace=True),
                                 nn.BatchNorm1d(512),
                                 nn.Linear(512, output_size)
                                 )

    def forward(self, x):
        return self.net(x)


In [None]:
def train(net, loader, optimizer, criterion, epoch):
    net.train()

    running_batch = 0
    running_loss = 0.0
    running_corrects = 0

    # Iterate over mini-batches of frames.
    for i, (data, label) in enumerate(loader):
        data = data.to(device)
        label = label.to(device)
        output = net(data)
        _, label_pred = torch.max(output, 1)
        loss = criterion(output, label.view(-1))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        running_batch += label.size(0)
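        # criterion returns the batch-mean loss, so the running value logged below
        # (sum of batch means / number of frames) is roughly the mean loss / batch_size
        running_loss += loss.item()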
        running_corrects += torch.sum(label_pred == label.view(-1)).item()

        if (i + 1) % 20 == 0:  # print every 20 mini-batches
            message = '[%d, %5d] loss: %.3f accuracy: %.3f' % (
            epoch, i + 1, running_loss / running_batch, running_corrects / running_batch)
            save_log(message)
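
The loss value logged by `train` deserves a note. `nn.CrossEntropyLoss` uses mean reduction by default, so each `loss.item()` is already a per-frame average for its batch, and dividing the sum of those averages by the number of frames yields a number about `batch_size` times smaller than the per-frame loss. The sketch below contrasts the two quantities on made-up batch losses. Both function names are hypothetical.

```python
def logged_running_loss(batch_mean_losses, batch_sizes):
    """What the training loop reports: sum of batch means / total frames."""
    return sum(batch_mean_losses) / sum(batch_sizes)


def per_frame_mean_loss(batch_mean_losses, batch_sizes):
    """Average loss per frame, weighting each batch mean by its batch size."""
    total = sum(loss * n for loss, n in zip(batch_mean_losses, batch_sizes))
    return total / sum(batch_sizes)


# Made-up values: two full batches of 512 frames.
batch_mean_losses = [4.0, 3.0]
batch_sizes = [512, 512]

logged = logged_running_loss(batch_mean_losses, batch_sizes)
per_frame = per_frame_mean_loss(batch_mean_losses, batch_sizes)
print(logged)     # 0.0068359375
print(per_frame)  # 3.5
```

The accuracy figure is unaffected, since it divides a count of correct frames by the total number of frames.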


In [None]:
def validate(net, loader, criterion, epoch):
    net.eval()

    running_batch = 0
    running_loss = 0.0
    running_corrects = 0

    with torch.no_grad():
        message = '*' * 40
        save_log(message)
        for i, (data, label) in enumerate(loader):
            data = data.to(device)
            label = label.to(device)
            output = net(data)

            # label_pred = torch.nn.functional.softmax(output, dim=1)
            _, label_pred = torch.max(output, 1)

            loss = criterion(output, label.view(-1))
            running_batch += label.size(0)
            running_loss += loss.item()
            running_corrects += torch.sum(label_pred == label.view(-1)).item()

        running_loss /= running_batch
        acc = running_corrects / running_batch
        message = 'Epoch: %d, testing Loss %.3f, testing accuracy: %.3f' % (epoch, running_loss, acc)
        save_log(message)
        message = '*' * 40
        save_log(message)

    return acc

In [None]:
def test(net, loader):
    net.eval()
    label = []
    running_batch = 0
    with torch.no_grad():
        for i, (data, _) in enumerate(loader):
            data = data.to(device)
            output = net(data)
            _, label_pred = torch.max(output, 1)
            label.extend(label_pred.cpu().numpy())
            running_batch += data.size(0)
    return running_batch, label


In [None]:
def save_networks(net, which_epoch):
    save_filename = '%s_net.pth' % (which_epoch)
    save_path = os.path.join(args.expr_dir, save_filename)
    if torch.cuda.is_available():
        try:
            # Unwrap nn.DataParallel, which keeps the real model in .module
            torch.save(net.module.cpu().state_dict(), save_path)
        except AttributeError:
            torch.save(net.cpu().state_dict(), save_path)
    else:
        torch.save(net.cpu().state_dict(), save_path)


def weights_init(m, init_type='kaiming'):
    classname = m.__class__.__name__
    if classname.find('Linear') != -1 or classname.find('Conv2d') != -1:
        if init_type == 'xavier':
            nn.init.xavier_normal_(m.weight)
        elif init_type == 'kaiming':
            nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
        elif init_type == 'orthogonal':
            nn.init.orthogonal_(m.weight)
        elif init_type == 'gaussian':
            m.weight.data.normal_(0, 0.01)
        if m.bias is not None:
            m.bias.data.zero_()
    elif isinstance(m, nn.BatchNorm2d):
        nn.init.constant_(m.weight, 1)
        nn.init.constant_(m.bias, 0)
    elif isinstance(m, nn.BatchNorm1d):
        nn.init.constant_(m.weight, 1)
        nn.init.constant_(m.bias, 0)


In [None]:
if __name__ == '__main__':

    net = MLP(input_size=args.input_size, output_size=args.output_size)
    net.apply(weights_init)
    criterion = nn.CrossEntropyLoss()
    criterion.to(device)
    optimizer = optim.Adam(net.parameters(), lr=args.lr)

    str_steps = args.decay_steps.split(',')
    args.decay_steps = []
    for str_step in str_steps:
        str_step = int(str_step)
        args.decay_steps.append(str_step)
    scheduler = MultiStepLR(optimizer, milestones=args.decay_steps, gamma=0.1)

    save_log('Logging data')
    train_data = load_dataset(args.train_data_path, args.train_label_path)
    train_loader = DataLoader(dataset=train_data, num_workers=4, batch_size=args.batch_size, pin_memory=True,
                              shuffle=True)
    val_data = load_dataset(args.val_data_path, args.val_label_path)
    val_loader = DataLoader(dataset=val_data, num_workers=4, batch_size=args.batch_size, pin_memory=True,
                            shuffle=False)
    save_log('Data is loaded')
    cur_acc = 0
    for epoch in range(1, args.num_epochs + 1):
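        net.to(device)

        lr = optimizer.param_groups[0]['lr']
        message = '{}: {}/{} , {}: {:.4f}'.format('epoch', epoch, args.num_epochs, 'lr', lr)
        save_log(message)
        save_log('-' * 10)

        train(net, train_loader, optimizer, criterion, epoch)
        if epoch % args.eval_step == 0:
            val_acc = validate(net, val_loader, criterion, epoch)

            if val_acc > cur_acc:
                save_networks(net, epoch)
                cur_acc = val_acc

        # Step the scheduler after the epoch's optimizer updates (the order
        # expected since PyTorch 1.1.0). Milestones count completed epochs, so
        # with [7, 12] the decayed rates apply from epochs 8 and 13 onward.
        scheduler.step()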

        # if epoch % args.save_step == 0:
        #     save_networks(epoch)
    save_networks(net, epoch)

    # ------------------------
    # Start Testing
    # ------------------------
    save_log('Loading test data')
    test_data = load_dataset(args.test_data_path)
    test_loader = DataLoader(dataset=test_data, num_workers=4, batch_size=args.batch_size, pin_memory=True, shuffle=False)
    save_log('Test data is loaded')
    net.to(device)
    test_num, test_label = test(net, test_loader)
    d = {'id': list(range(test_num)), 'label': test_label}
    df = pd.DataFrame(data=d)
    # args has no file_name attribute; the output path is args.result_file_name
    df.to_csv(args.result_file_name, header=True, index=False)
    save_log('Testing is done, result is saved to {}'.format(args.result_file_name))


Logging data
Data is loaded
epoch: 1/18 , lr: 0.0010
----------




[1,    20] loss: 0.011 accuracy: 0.094
[1,    40] loss: 0.009 accuracy: 0.127
[1,    60] loss: 0.008 accuracy: 0.151
[1,    80] loss: 0.008 accuracy: 0.172
[1,   100] loss: 0.007 accuracy: 0.187
[1,   120] loss: 0.007 accuracy: 0.201
[1,   140] loss: 0.007 accuracy: 0.214
[1,   160] loss: 0.007 accuracy: 0.224
[1,   180] loss: 0.007 accuracy: 0.234
[1,   200] loss: 0.006 accuracy: 0.244
[1,   220] loss: 0.006 accuracy: 0.252
[1,   240] loss: 0.006 accuracy: 0.260
[1,   260] loss: 0.006 accuracy: 0.268
[1,   280] loss: 0.006 accuracy: 0.274
[1,   300] loss: 0.006 accuracy: 0.281
[1,   320] loss: 0.006 accuracy: 0.288
[1,   340] loss: 0.006 accuracy: 0.294
[1,   360] loss: 0.006 accuracy: 0.300
[1,   380] loss: 0.006 accuracy: 0.306
[1,   400] loss: 0.006 accuracy: 0.311
[1,   420] loss: 0.005 accuracy: 0.315
[1,   440] loss: 0.005 accuracy: 0.320
[1,   460] loss: 0.005 accuracy: 0.324
[1,   480] loss: 0.005 accuracy: 0.329
[1,   500] loss: 0.005 accuracy: 0.333
[1,   520] loss: 0.005 ac