In this notebook, I'll provide a broad overview of Automatic Speech Recognition (ASR) architecture, its standard pipelines, and how to train a model for it!

### What is Automatic Speech Recognition?
Automatic Speech Recognition (ASR) is the conversion of spoken language into text by a computer. This typically means only turning audio into a sequence of characters (disregarding grammar).

It has many uses:
+ closed captioning
+ mobile phone voice assistants
+ interfaces for people with disabilities
+ preserving endangered languages

However, ASR is very difficult to solve due to:
+ Environment (Is there noise? Other speakers?)
+ Style of speech (Is it casual, formal, poetic, etc?)
+ Style of speaker (Talks fast? Accents? Gender bias?)

Modern ASR models work quite well in ideal, noiseless conditions. However, in noisy conditions they are far from perfect. Below, we can see typical error rates for state-of-the-art ASR models in 2020!

(Note: To measure the accuracy of ASR models, we use the 'Word Error Rate' (WER). It counts the word substitutions, deletions, and insertions needed to turn the model's transcript into the reference transcript, divided by the number of words in the reference. Lower is better.)
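
To make the Word Error Rate concrete, here is a minimal sketch that computes it as a word-level edit distance divided by the reference length. The function name `word_error_rate` and the example sentences are my own illustrations, not part of any ASR toolkit.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two transcripts, divided by
    the number of words in the reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER = {wer:.3f}")
```

Here the hypothesis has one substituted word and one missing word against a 6-word reference, so the WER is 2/6.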

![word error rate](https://drive.google.com/uc?export=view&id=1iUz9koIqQErBVB0a4db0gSZVdfsfwh9L)



Architecture:

![timeline](https://drive.google.com/uc?export=view&id=1n2ID2ML5x9TFSBJhfdvj5AGI4u-9Mp4m)  
[source](https://youtu.be/q67z7PTGRi8)


From the 1980s to the 2010s, ASR was predominantly performed with Hidden Markov Model (HMM) architectures. HMMs are probabilistic models for linear sequence classification problems. Given the linear nature of speech, HMMs seem like a natural fit for ASR.
However, using HMMs for ASR comes with 4 downsides:
+ HMMs are built on the Markov Assumption, which assumes that the probability of the next state depends only on the current state, and not on any prior states. This doesn't make sense for language processing, which is highly contextual (see Coarticulation)
+ HMMs' state transition probabilities are 'baked in' and thus inflexible to changes in language
+ Classic HMM ASR pipelines require hand-tuned probability distributions for acoustic models and language models from linguistic experts
+ HMM-based ASR architecture is very complex, requiring 3 models as input. The figure below demonstrates an HMM-centric ASR architecture

![an HMM-based architecture](https://drive.google.com/uc?export=view&id=1loF8wbD-6DRO45Ly7lKnVVIcCn83H-B9)  
[source](https://youtu.be/q67z7PTGRi8)
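
To illustrate the Markov Assumption from the list above, here is a small sketch of a first-order Markov chain. It is deliberately simpler than a full HMM (there are no hidden states or emission probabilities), and the three phoneme-like states and all probabilities are made up for illustration.

```python
# Hypothetical transition probabilities between three made-up phoneme states.
transitions = {
    "k":  {"k": 0.1, "ae": 0.8, "t": 0.1},
    "ae": {"k": 0.1, "ae": 0.2, "t": 0.7},
    "t":  {"k": 0.3, "ae": 0.3, "t": 0.4},
}
# Hypothetical probabilities for the first state of a sequence.
initial = {"k": 0.5, "ae": 0.3, "t": 0.2}


def sequence_probability(states):
    """Probability of a state sequence under the Markov Assumption:
    each step depends only on the state immediately before it."""
    prob = initial[states[0]]
    for prev, nxt in zip(states, states[1:]):
        prob *= transitions[prev][nxt]
    return prob


p = sequence_probability(["k", "ae", "t"])
print(f"P(k -> ae -> t) = {p:.2f}")
```

Notice that the probability of reaching "t" is the same no matter what came before "ae". That is exactly the loss of context the first downside describes.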

## Neural Networks
These days, Neural Network architectures such as Recurrent Neural Networks (RNNs) or RNN-HMM hybrids are favored for ASR because they address these problems. In essence, a Neural Network takes an arbitrary number of data points and tries to 'fit' a predictive model to those points. The introduction of Neural Networks to ASR immediately improved WER by ~30%. However, RNNs come at the cost of requiring a vast amount of training data.

![an RNN-only architecture](https://drive.google.com/uc?export=view&id=1_9McwMlHIqlPuJGYOkPwB-8h-t13P4pI)  
(an RNN-only model that has only 4 steps)  
[source](https://youtu.be/q67z7PTGRi8)

## Encoder-Decoder
ASR is typically performed with an Encoder-Decoder Recurrent Neural Network, which was developed as a way of solving Sequence-to-Sequence prediction problems; essentially, this means taking an input sequence and returning an output sequence. These sequences can be of arbitrary length, making this architecture viable (and powerful!) for problems where the input and output lengths differ. This is precisely why it's used for ASR!

Encoder-Decoder models are composed of 3 parts:
+ an Encoder (a Neural Network)
+ a Context Vector (a vector holding 'a summary of the inputs')
+ a Decoder (a Neural Network)

The Encoder compresses and summarizes the input data into a Context Vector, which the Decoder later transforms into the output sequence. The Context Vector must be defined with a fixed length (usually 256, 512, or 1024).
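
As a rough sketch of what "fixed length" means here, the toy recurrent encoder below folds inputs of different lengths into a context vector of the same size. The weights are random and untrained, and the sizes (`CONTEXT_SIZE`, `FEATURE_SIZE`) and the `encode` helper are assumptions chosen only for illustration.

```python
import math
import random

CONTEXT_SIZE = 4   # toy value; the text mentions 256, 512, or 1024 in practice
FEATURE_SIZE = 3   # hypothetical number of features per input step

rng = random.Random(0)
# Random, untrained weights: input-to-hidden and hidden-to-hidden.
W_in = [[rng.uniform(-0.5, 0.5) for _ in range(FEATURE_SIZE)] for _ in range(CONTEXT_SIZE)]
W_rec = [[rng.uniform(-0.5, 0.5) for _ in range(CONTEXT_SIZE)] for _ in range(CONTEXT_SIZE)]


def encode(sequence):
    """Fold a variable-length sequence into one fixed-length context vector."""
    hidden = [0.0] * CONTEXT_SIZE
    for step in sequence:
        hidden = [
            math.tanh(
                sum(w * x for w, x in zip(W_in[i], step))
                + sum(w * h for w, h in zip(W_rec[i], hidden))
            )
            for i in range(CONTEXT_SIZE)
        ]
    return hidden


short_input = [[0.1, 0.2, 0.3]] * 5
long_input = [[0.1, 0.2, 0.3]] * 50
print(len(encode(short_input)), len(encode(long_input)))
```

Both calls return a vector of `CONTEXT_SIZE` values, however long the input was.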

## Acoustic Features as Input for RNN:
A classic HMM-based ASR model would require us to input 3 probabilistic models (Acoustic, Pronunciation, and Language). An RNN only *requires* 1 (Acoustic), but a hand-crafted Language model is still commonly used to inject rules and grammar into speech-to-text translations.

The Acoustic Model describes how Acoustic Features (from audio) are broken up into phonemes (the building blocks of speech) and other linguistic elements. We'll extract this data from the [LibriSpeech ASR Dataset](https://www.openslr.org/12).

The Language Model is a probabilistic model that concerns the stringing of words and other language elements together in a sensible way. In state-of-the-art architectures like those at Google, Language Models are integrated into deep learning models. In this notebook we won't be using one, and will instead build only an Acoustic Model.

To build an Acoustic Model, we first need to extract Acoustic Features from the audio files. We'll put these features in a vector and feed them into the RNN.
In order to extract the features from a given file, we need to perform the following steps:
```
1) Convert the FLAC file into a Waveform (giving us amplitude data over the time domain)
2) Split the Waveform into Windows using a Windowing Function (like Hamming)
3) For each Window:
    a) Compute its Fourier Transform
    b) Compute its Mel Spectrogram (also known as Mel Filter Bank)
    c) Compute the Log of each value in the Mel Spectrogram
    d) Append the result of 3c to the Features Vector
4) Return Features Vector
```
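
The steps above can be sketched without any dependencies. This is not the TensorFlow code used later in the notebook: it swaps in a naive DFT for a fast Fourier transform, uses a deliberately low made-up sample rate, and every helper name is hypothetical.

```python
import cmath
import math


def hamming(n):
    """Hamming window coefficients for a window of n samples."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def frame_signal(signal, size, step):
    """Split the signal into overlapping windows, dropping the ragged tail."""
    return [signal[s:s + size] for s in range(0, len(signal) - size + 1, step)]


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for the non-negative frequency bins."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10 ** (mel / 2595.0) - 1.0)


def mel_filter_bank(n_filters, n_bins, sample_rate):
    """Triangular filters whose edges are spaced evenly on the mel scale."""
    top = hz_to_mel(sample_rate / 2)
    edges_hz = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_hz = [k * (sample_rate / 2) / (n_bins - 1) for k in range(n_bins)]
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        bank.append([
            max(0.0, min((f - lo) / (mid - lo), (hi - f) / (hi - mid)))
            for f in bin_hz
        ])
    return bank


def log_mel_features(signal, sample_rate, size, step, n_filters):
    """Steps 2-4: window, Fourier transform, mel filter bank, log."""
    window = hamming(size)
    bank = mel_filter_bank(n_filters, size // 2 + 1, sample_rate)
    features = []
    for frame in frame_signal(signal, size, step):
        spectrum = magnitude_spectrum([x * w for x, w in zip(frame, window)])
        mel = [sum(w * s for w, s in zip(filt, spectrum)) for filt in bank]
        features.append([math.log(v + 1e-6) for v in mel])  # small offset avoids log(0)
    return features


# A 440 Hz tone at a made-up 2 kHz sample rate so the naive DFT stays fast.
rate = 2000
tone = [math.sin(2 * math.pi * 440 * t / rate) for t in range(rate // 4)]
feats = log_mel_features(tone, rate, size=50, step=20, n_filters=8)
print(len(feats), "windows x", len(feats[0]), "log-mel values")
```

At this sample rate, 50 samples is a 25ms window and 20 samples is a 10ms step.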

But before we do any computations, let's define some configuration constants:

In [1]:
import numpy as np
import librosa
from glob import glob
import joblib
from tqdm import tqdm # for pretty loading bars
import os
import random
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
!pip install tensorflow_io
import tensorflow_io as tfio
import string
!sudo apt install ffmpeg

Reading package lists... Done
Building dependency tree       
Reading state information... Done
ffmpeg is already the newest version (7:3.4.8-0ubuntu0.2).
0 upgraded, 0 newly installed, 0 to remove and 39 not upgraded.


In [2]:
sample_rate = 16000  # The Sample Rate of the LibriSpeech Dataset
window_size_frames = int(sample_rate * 0.025)  # The standard window size for ASR models is 25ms
window_step_frames = int(sample_rate * 0.010) # The standard window step for ASR models is 10ms
n_filters = 40

DATASETS = [
  "dev-clean",
  # "train-clean-100",
  # "train-clean-360",  # this exceeds Colab's Disk limit
  # "train-other-500",  # this exceeds Colab's Disk limit
]
DATASET_DIR = './data/'
FEATURES_DIR = './features'

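# SPECIAL_TOKENS = ['<PAD>', '<SPC>', '<SOS>', '<EOS>']  # Tags for text formatting
# Index 0 is padding, 2 is start-of-sentence, 3 is end-of-sentence, 4 is the space used by tokenize().
# The space at index 1 is a duplicate; the lookup dict maps ' ' to index 4.
SPECIAL_TOKENS = ["-", " ", "<", ">", " ", "'", '"']  # Tags for text formatting
VOCAB =  SPECIAL_TOKENS + list(string.ascii_uppercase[:26])

MAX_TEXT_LEN = 700 # > the length of the longest sentence. Used for padding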
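
As a quick sanity check on the window constants, this sketch works out how many windows a clip produces when no padding is added at the end. The `num_windows` helper and the 10-second clip are assumptions for illustration.

```python
sample_rate = 16000
window_size = int(sample_rate * 0.025)   # 25ms window, in samples
window_step = int(sample_rate * 0.010)   # 10ms step, in samples


def num_windows(num_samples, size, step):
    """Number of full windows that fit when the tail is not padded."""
    if num_samples < size:
        return 0
    return 1 + (num_samples - size) // step


ten_seconds = 10 * sample_rate
print(window_size, window_step, num_windows(ten_seconds, window_size, window_step))
```

A 10-second clip therefore yields just under 1000 feature vectors, roughly 100 per second of audio.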

And of course, download the dataset(s):

In [3]:


def download_datasets(datasets): 
  for dataset in datasets:
    dest = DATASET_DIR + dataset + '.tar.gz'
    !mkdir -p {DATASET_DIR}
    !wget -O {dest} https://www.openslr.org/resources/12/{dataset}.tar.gz

def extract_datasets(datasets):
  for dataset in datasets:
    src = DATASET_DIR + dataset + ".tar.gz"
    dest = DATASET_DIR + dataset + '/'
    !mkdir -p {DATASET_DIR + dataset}
    !tar -xf {src} -C {dest}
    print("Finished extraction")

def convert_datasets_to_wav(datasets):
  """ Iterate through datasets and convert .flac files to .wav. Needed because
  tf.audio.decode_wav (used for the Log Mel Spectrogram calculation) only reads WAV.
  """
  dataset_dirs = [DATASET_DIR + dataset + '/LibriSpeech/' + dataset for dataset in datasets]
  for dataset_dir in dataset_dirs:
    folders = glob(dataset_dir+"/**/**")
    print(folders)
    for path in folders:
      flac_paths = glob(path+"/*flac")
      for flac_path in flac_paths:
        target = flac_path[:-5]+".wav"  # swap the ".flac" extension for ".wav"
        os.system(f'ffmpeg -i {flac_path} {target}')

download_datasets(DATASETS)
extract_datasets(DATASETS)
convert_datasets_to_wav(DATASETS)


--2021-06-09 02:31:53--  https://www.openslr.org/resources/12/dev-clean.tar.gz
Resolving www.openslr.org (www.openslr.org)... 46.101.158.64
Connecting to www.openslr.org (www.openslr.org)|46.101.158.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 337926286 (322M) [application/x-gzip]
Saving to: ‘./data/dev-clean.tar.gz’


2021-06-09 02:32:08 (21.9 MB/s) - ‘./data/dev-clean.tar.gz’ saved [337926286/337926286]

Finished extraction
['./data/dev-clean/LibriSpeech/dev-clean/6241/61946', './data/dev-clean/LibriSpeech/dev-clean/6241/61943', './data/dev-clean/LibriSpeech/dev-clean/6241/66616', './data/dev-clean/LibriSpeech/dev-clean/3853/163249', './data/dev-clean/LibriSpeech/dev-clean/251/118436', './data/dev-clean/LibriSpeech/dev-clean/251/137823', './data/dev-clean/LibriSpeech/dev-clean/251/136532', './data/dev-clean/LibriSpeech/dev-clean/3752/4944', './data/dev-clean/LibriSpeech/dev-clean/3752/4943', './data/dev-clean/LibriSpeech/dev-clean/1462/170142', './data

Now let's get the Log Mel Spectrograms!

In [4]:


# def files_to_log_mel_spec(paths, sample_rate=16000, window_size_frames=400, window_step_frames=160, num_mel_bins=40):
def files_to_log_mel_spec(path):
  """ Returns the normalized Log Mel Spectrogram of the .wav file at path as a
  Tensor of shape (pad_len, n_filters), zero-padded or truncated along time.
  """
  audio = tf.io.read_file(path)
  audio, _ = tf.audio.decode_wav(audio, 1)
  audio = tf.squeeze(audio, axis=-1)
  # Calculate Spectrogram
  # NOTE: the hard-coded 200/80-sample window and step (12.5ms/5ms at 16kHz)
  # differ from window_size_frames/window_step_frames defined in the config cell.
  stfts = tf.signal.stft(audio, frame_length=200, frame_step=80, fft_length=256)
  spectrograms = tf.abs(stfts)
  # Warp the linear scale spectrograms into the Mel Scale.
  num_spectrogram_bins = stfts.shape[-1]
  lower_edge_hertz, upper_edge_hertz = 80.0, 7600.0
  linear_to_mel_weight_matrix = tf.signal.linear_to_mel_weight_matrix(
    n_filters, num_spectrogram_bins, sample_rate, lower_edge_hertz,
    upper_edge_hertz)
  mel_spectrograms = tf.tensordot(spectrograms, linear_to_mel_weight_matrix, 1)
  mel_spectrograms.set_shape(spectrograms.shape[:-1].concatenate(
    linear_to_mel_weight_matrix.shape[-1:]))
  # Compute a stabilized log to get log-magnitude mel-scale spectrograms.
  lmspec = tf.math.log(mel_spectrograms + 1e-6)

  # Normalize Log Mel Spectrogram
  means = tf.math.reduce_mean(lmspec, 1, keepdims=True)
  stddevs = tf.math.reduce_std(lmspec, 1, keepdims=True)
  lmspec = (lmspec - means) / stddevs

  # Pad to make consistent (required for Neural Network)
  pad_len = 4500 # Longest in dev-clean is 3263
  paddings = tf.constant([[0, pad_len], [0, 0]])
  lmspec = tf.pad(lmspec, paddings, "CONSTANT")[:pad_len, :]
  return lmspec
  
  

In [5]:
# # Example
# audio = files_to_log_mel_spec('./data/dev-clean/LibriSpeech/dev-clean/1272/128104/1272-128104-0000.wav')

Sometimes you may see MFCCs (Mel-Frequency Cepstral Coefficients) used as the ASR features instead of Mel Filter Banks. MFCCs are values that loosely represent the brain's capacity to filter out certain signals. MFCCs were popular with HMM-based models because they helped the model focus on human-perceptible information, but newer architectures can usually use Filter Banks and get similar or better accuracy. You can read more [here](https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html).

The diagram below shows the steps required for calculating the Filter Banks and MFCCs. As you can see, MFCCs require a few more calculations after the Filter Banks:

![features pipeline](https://drive.google.com/uc?export=view&id=15T0MgXcj9wA_ZQ_i4tCLukUCH-7T2FnB)

[source](https://www.youtube.com/watch?v=QTw-6GU5Mjs&t=319s)
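
The main extra calculation for MFCCs is a Discrete Cosine Transform applied to the log filter bank values of each window, after which only the first few coefficients are kept. Below is a minimal, unnormalized DCT-II sketch. The input values are made up, and real pipelines usually add scaling and other refinements that are omitted here.

```python
import math


def dct_ii(values):
    """Unnormalized type-II Discrete Cosine Transform."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (2 * i + 1) / (2 * n)) for i, v in enumerate(values))
        for k in range(n)
    ]


def mfcc_from_log_filter_banks(log_banks, num_coefficients):
    """Keep only the lowest-order DCT coefficients of one window."""
    return dct_ii(log_banks)[:num_coefficients]


# Made-up log filter bank values for a single window.
log_banks = [-1.2, -0.4, 0.3, 0.9, 0.5, -0.1, -0.8, -1.5]
mfcc = mfcc_from_log_filter_banks(log_banks, num_coefficients=4)
print([round(c, 3) for c in mfcc])
```

The first coefficient is simply the sum of the inputs, while higher coefficients capture progressively finer shape in the spectrum.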


Let's also add a helper function that'll allow us to map the spectrograms to their target text! Specifically, this function will Tokenize the target text (i.e. split the text by character) and Numericize it (i.e. map each character to an integer). We'll also insert some special boundary characters to help the NN read the string:

In [6]:
# This tokenization code was borrowed/modified from https://github.com/30stomercury/Automatic-Speech-Recognition/blob/master/utils/tokenizer.py
import string
def char2id(vocab):
    """
    Args:
        vocab: list of tokens (special characters followed by letters)
    Returns:
        token_to_id: dict, from character to index.
        id_to_token: dict, from index to character.
    """
    tokens = vocab
    token_to_id = {}
    id_to_token = {}
    for i, c in enumerate(tokens):
        token_to_id[c] = i
        id_to_token[i] = c
    return token_to_id, id_to_token

_char2id, _id2char = char2id(VOCAB)  # Lookup dictionaries

def tokenize(uppercase_sentence):
  """ Returns a Tokenization of the given sentence and its length """
  tokens = [_char2id[SPECIAL_TOKENS[2]]] # Put Start-of-Sentence character
  tokens += [_char2id[char] if char != ' ' else _char2id[SPECIAL_TOKENS[4]] for char in list(uppercase_sentence)]
  tokens += [_char2id[SPECIAL_TOKENS[3]]] # Put End-of-Sentence character
  return np.array(tokens, dtype=np.int32), len(tokens)

In [7]:

def tokenize_sentences(sentences):
  """ Iterate through a list of sentences and tokenize them.
  Returns: ( an array of tokens, their lengths )"""
  tokens = []
  lengths = []
  n_sentences = len(sentences)
  print("Tokenizing "+str(n_sentences)+" sentences")
  for sentence in sentences:
    # sentence = sentence.translate(str.maketrans('', '', string.punctuation))
    sentence_converted, length = tokenize(sentence)
    # Pad Tokenization
    to_pad = MAX_TEXT_LEN - length # max in dev_clean is 513
    sentence_converted = np.concatenate((sentence_converted, [_char2id[SPECIAL_TOKENS[0]]] * to_pad))
    tokens.append(sentence_converted)
    lengths.append(length)
  return np.array(tokens, dtype=np.int32), np.array(lengths).astype(np.int32)


In [8]:
# Show examples
print("Numericization dict: "+str(_char2id))
print("'TEST' tokenized is "+str(tokenize('TEST')[0])+" where the 1st token is < and the last token is >") 

Numericization dict: {'-': 0, ' ': 4, '<': 2, '>': 3, "'": 5, '"': 6, 'A': 7, 'B': 8, 'C': 9, 'D': 10, 'E': 11, 'F': 12, 'G': 13, 'H': 14, 'I': 15, 'J': 16, 'K': 17, 'L': 18, 'M': 19, 'N': 20, 'O': 21, 'P': 22, 'Q': 23, 'R': 24, 'S': 25, 'T': 26, 'U': 27, 'V': 28, 'W': 29, 'X': 30, 'Y': 31, 'Z': 32}
'TEST' tokenized is [ 2 26 11 25 26  3] where the 1st token is < and the last token is >
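
Going the other way is useful when reading model output. The sketch below is a hypothetical inverse of `tokenize()`. It rebuilds the same vocabulary layout locally so it can run on its own, and the `detokenize` name is my own.

```python
import string

# Same layout as the notebook's VOCAB: special tokens first, then A-Z.
special_tokens = ["-", " ", "<", ">", " ", "'", '"']
vocab = special_tokens + list(string.ascii_uppercase)
PAD_ID, START_ID, END_ID = 0, 2, 3


def detokenize(ids):
    """Hypothetical inverse of tokenize(): ids back to an uppercase string."""
    chars = []
    for idx in ids:
        if idx == END_ID:
            break                      # stop at the end-of-sentence token
        if idx in (PAD_ID, START_ID):
            continue                   # skip padding and the start token
        chars.append(vocab[idx])
    return "".join(chars)


print(detokenize([2, 26, 11, 25, 26, 3, 0, 0]))
```

Feeding it the tokenized 'TEST' from the cell above (plus two padding tokens) gives back `TEST`.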


Now we can extract features and target text from the whole LibriSpeech dataset!

In [9]:

def extract_tokens_and_paths(datasets, datasets_dir, feat_dir):
  """ Iterate through LibriSpeech dataset and save tokenized texts and their
      lengths to binaries under feat_dir. Returns the matching audio paths.
  """

  def get_texts_and_audio_paths(root_dataset_path):
    """ Iterate through folders in a directory.
    Return: Target texts, paths of corresponding audio recordings 
    """
    folders = glob(root_dataset_path+"/**/**")
    texts = []
    audio_path = []
    for path in folders:
      text_path = glob(path+"/*txt")[0]
      with open(text_path) as f:  # close the transcript file when done
        for line in f.readlines():
            line_ = line.split(" ")
            audio_path.append(path+"/"+line_[0]+".wav")
            texts.append(line[len(line_[0])+1:-1].replace("'",""))
    return texts, audio_path
    
  # Main extraction
  audio_paths_all = []
  for dataset_name in datasets:
    to_cat = dataset_name
    libri_path = datasets_dir + dataset_name + '/' + 'LibriSpeech' + '/' + dataset_name
    print("Extracting from "+libri_path)
    target_texts, audio_paths = get_texts_and_audio_paths(libri_path)
    audio_paths_all += audio_paths

    # Tokenize sentence
    tokens, token_lengths = tokenize_sentences(target_texts)

    # Save tokens and their lengths to files
    np.save(feat_dir+"/{}-{}s.npy".format(to_cat,'char'), tokens)
    np.save(feat_dir+"/{}-{}len.npy".format(to_cat,'char'), token_lengths)

  print("Finished extraction")
  return audio_paths_all


In [10]:

!mkdir -p {FEATURES_DIR}
audio_paths = extract_tokens_and_paths(DATASETS, DATASET_DIR, FEATURES_DIR)


Extracting from ./data/dev-clean/LibriSpeech/dev-clean
Tokenizing 2703 sentences
Finished extraction


In [11]:

def load_extracted_tokens(feature_dir, datasets):
  """ Load the data from files """
  dataset = datasets[0] # TODO
  tokens = np.load(feature_dir+"/{}-chars.npy".format(dataset), allow_pickle=True)
  token_lengths = np.load(feature_dir+"/{}-charlen.npy".format(dataset), allow_pickle=True)
  return tokens, token_lengths

# Load our saved data
tokens, token_lengths = load_extracted_tokens(FEATURES_DIR, DATASETS)

In [12]:
def create_tensorflow_datasets(features, tokens, batch_size, audio_paths):
  """ Return Tensorflow Dataset objects for features, tokens """
  features_ds = tf.data.Dataset.from_tensor_slices(audio_paths)
  features_ds = features_ds.map(
      files_to_log_mel_spec, num_parallel_calls=tf.data.experimental.AUTOTUNE
  )

  tokens_ds = tf.data.Dataset.from_tensor_slices(tokens)

  ds = tf.data.Dataset.zip((features_ds, tokens_ds))
  ds = ds.map(lambda x, y: {"source": x, "target": y})
  ds = ds.batch(batch_size)
  ds = ds.prefetch(tf.data.experimental.AUTOTUNE)
  return ds

split = int(len(tokens) * 0.99)
# Training dataset
train_features = [] #features[:split]  # junk for now
train_tokens = tokens[:split]
train_audio_paths = audio_paths[:split]
train_ds = create_tensorflow_datasets(train_features, train_tokens, 64, train_audio_paths)

# Testing dataset
test_features = [] #features[split:]
test_tokens = tokens[split:]
test_audio_paths = audio_paths[split:]
test_ds = create_tensorflow_datasets(test_features, test_tokens, 4, test_audio_paths)

## Current-day Encoder-Decoder Architectures for ASR

### Attention Optimization
A weakness of the plain encoder-decoder model is that prediction accuracy degrades with longer input sequences. The issue lies in the model's inability to use distant information when predicting words later in the sentence, which is caused by the Context Vector being fixed in size and unable to store long sentences. Because the beginning of a sentence strongly influences the rest of it, a solution to this weakness was needed. In 2014, a [landmark paper](https://arxiv.org/abs/1409.0473) (Bahdanau et al.) introduced the concept of an "attention mechanism." Intuitively, it allows the model to pay *attention* to any part of the input when making predictions. Structurally, we implement Attention as a vector of weights summing to 1, where each weight determines how much attention we pay to the corresponding input in the sequence.
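
Here is a minimal sketch of that weight vector. It scores each encoder state against a decoder query with a plain dot product (a simplification; the paper linked above uses a small learned scoring network instead), turns the scores into weights with softmax, and builds a context vector as the weighted sum. All names and numbers are illustrative.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, encoder_states):
    """Return (attention weights, context vector) for one decoder query."""
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    size = len(encoder_states[0])
    context = [
        sum(w * state[d] for w, state in zip(weights, encoder_states))
        for d in range(size)
    ]
    return weights, context


encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one vector per input step
query = [2.0, 0.0]                                     # made-up decoder state
weights, context = attend(query, encoder_states)
print([round(w, 3) for w in weights], [round(c, 3) for c in context])
```

The first and third input steps line up with the query, so they receive equal, larger weights than the second step.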

### Listen, Attend, and Spell
The ['Listen, Attend, and Spell'](https://arxiv.org/abs/1508.01211) (LAS) model was proposed by Chan et al. in 2015 as an ASR architecture that simplifies Hybrid architectures (such as DNN-HMMs or CTC-HMMs) and avoids the aforementioned independence assumptions. It was one of the first ASR models to use the attention mechanism to great success.

### Transformer
Transformers are the model of choice for modern NLP problems (used in projects like GPT and BERT). Proposed in the paper ["Attention Is All You Need"](https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf), the Transformer model relies on the attention mechanism more heavily than any earlier architecture: it outperforms past models in accuracy while being *faster to train* than Recurrent and Convolutional Neural Networks. It entirely replaces the Recurrent Layer used in the Encoder-Decoder architecture with 'multi-headed self-attention', in which every position in a sequence attends to every other position, allowing for parallelization of computation.

# Coding the Transformer for ASR

**This implementation of the Transformer was borrowed and modified almost in its entirety from [Keras' example](https://keras.io/examples/audio/transformer_asr/). I plan to annotate and explain it entirely**


## Define the Transformer Input Layer

When processing past target tokens for the decoder, we compute the sum of position embeddings and token embeddings.

When processing audio features, we apply convolutional layers to downsample them (via convolution strides) and process local relationships.

In [13]:

class TokenEmbedding(layers.Layer):
    def __init__(self, num_vocab=1000, maxlen=100, num_hid=64):
        super().__init__()
        self.emb = tf.keras.layers.Embedding(num_vocab, num_hid)
        self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=num_hid)

    def call(self, x):
        maxlen = tf.shape(x)[-1]
        x = self.emb(x)
        positions = tf.range(start=0, limit=maxlen, delta=1)
        positions = self.pos_emb(positions)
        return x + positions


class SpeechFeatureEmbedding(layers.Layer):
    def __init__(self, num_hid=64, maxlen=100):
        super().__init__()
        self.conv1 = tf.keras.layers.Conv1D(
            num_hid, 11, strides=2, padding="same", activation="relu"
        )
        self.conv2 = tf.keras.layers.Conv1D(
            num_hid, 11, strides=2, padding="same", activation="relu"
        )
        self.conv3 = tf.keras.layers.Conv1D(
            num_hid, 11, strides=2, padding="same", activation="relu"
        )
        self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=num_hid)

    def call(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        return self.conv3(x)
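
To see how much the three strided convolutions shorten the input, recall that a Keras `Conv1D` with `padding="same"` outputs `ceil(length / stride)` time steps. The sketch below applies that rule three times to the 4500-step padded spectrogram; the helper name is my own.

```python
import math


def conv_same_output_length(length, stride):
    """Output length of a 1D convolution with 'same' padding."""
    return math.ceil(length / stride)


length = 4500  # pad_len used when building the spectrograms
for layer in range(3):
    length = conv_same_output_length(length, stride=2)
    print(f"after conv{layer + 1}: {length} time steps")
```

The encoder's attention layers therefore see a sequence roughly 8 times shorter than the spectrogram, which matters because self-attention cost grows with the square of the sequence length.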


## Transformer Encoder Layer

In [14]:

class TransformerEncoder(layers.Layer):
    def __init__(self, embed_dim, num_heads, feed_forward_dim, rate=0.1):
        super().__init__()
        self.att = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.ffn = keras.Sequential(
            [
                layers.Dense(feed_forward_dim, activation="relu"),
                layers.Dense(embed_dim),
            ]
        )
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    def call(self, inputs, training=None):
        attn_output = self.att(inputs, inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)


## Transformer Decoder Layer



In [15]:

class TransformerDecoder(layers.Layer):
    def __init__(self, embed_dim, num_heads, feed_forward_dim, dropout_rate=0.1):
        super().__init__()
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm3 = layers.LayerNormalization(epsilon=1e-6)
        self.self_att = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.enc_att = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.self_dropout = layers.Dropout(0.5)
        self.enc_dropout = layers.Dropout(0.1)
        self.ffn_dropout = layers.Dropout(0.1)
        self.ffn = keras.Sequential(
            [
                layers.Dense(feed_forward_dim, activation="relu"),
                layers.Dense(embed_dim),
            ]
        )

    def causal_attention_mask(self, batch_size, n_dest, n_src, dtype):
        """Masks the upper half of the dot product matrix in self attention.

        This prevents flow of information from future tokens to current token.
        1's in the lower triangle, counting from the lower right corner.
        """
        i = tf.range(n_dest)[:, None]
        j = tf.range(n_src)
        m = i >= j - n_src + n_dest
        mask = tf.cast(m, dtype)
        mask = tf.reshape(mask, [1, n_dest, n_src])
        mult = tf.concat(
            [tf.expand_dims(batch_size, -1), tf.constant([1, 1], dtype=tf.int32)], 0
        )
        return tf.tile(mask, mult)

    def call(self, enc_out, target):
        input_shape = tf.shape(target)
        batch_size = input_shape[0]
        seq_len = input_shape[1]
        causal_mask = self.causal_attention_mask(batch_size, seq_len, seq_len, tf.bool)
        target_att = self.self_att(target, target, attention_mask=causal_mask)
        target_norm = self.layernorm1(target + self.self_dropout(target_att))
        enc_out = self.enc_att(target_norm, enc_out)
        enc_out_norm = self.layernorm2(self.enc_dropout(enc_out) + target_norm)
        ffn_out = self.ffn(enc_out_norm)
        ffn_out_norm = self.layernorm3(enc_out_norm + self.ffn_dropout(ffn_out))
        return ffn_out_norm
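
The comparison inside `causal_attention_mask` is easier to read with plain lists. This sketch reproduces only the `i >= j - n_src + n_dest` rule, without the batch tiling, and prints the mask for a 4-token sequence.

```python
def causal_mask(n_dest, n_src):
    """1 where destination position i may attend to source position j."""
    return [
        [1 if i >= j - n_src + n_dest else 0 for j in range(n_src)]
        for i in range(n_dest)
    ]


for row in causal_mask(4, 4):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

Each row is one target position: it can see itself and everything before it, but nothing after.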


## Complete the Transformer model

Our model takes audio spectrograms as inputs and predicts a sequence of characters. During training, we give the decoder the target character sequence without its final token as input, and ask it to predict the same sequence shifted one position to the left (i.e. the next character at every step). During inference, the decoder uses its own past predictions to predict the next token.
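
As a small illustration of how the training pairs are formed, the sketch below applies the same two slices that `train_step` uses (`target[:, :-1]` and `target[:, 1:]`) to the tokenized 'TEST' example from earlier, as a plain list rather than a batch.

```python
tokens = [2, 26, 11, 25, 26, 3]   # '<', 'T', 'E', 'S', 'T', '>'

dec_input = tokens[:-1]    # what the decoder is shown
dec_target = tokens[1:]    # what it should predict at each position

for given, expected in zip(dec_input, dec_target):
    print(f"decoder sees {given:>2} -> should predict {expected:>2}")
```

At every position the decoder is asked for the token that comes next, ending with the end-of-sentence token.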

In [16]:

class Transformer(keras.Model):
    def __init__(
        self,
        num_hid=64,
        num_head=2,
        num_feed_forward=128,
        source_maxlen=100,
        target_maxlen=100,
        num_layers_enc=4,
        num_layers_dec=1,
        num_classes=10,
    ):
        super().__init__()
        self.loss_metric = keras.metrics.Mean(name="loss")
        self.num_layers_enc = num_layers_enc
        self.num_layers_dec = num_layers_dec
        self.target_maxlen = target_maxlen
        self.num_classes = num_classes

        self.enc_input = SpeechFeatureEmbedding(num_hid=num_hid, maxlen=source_maxlen)
        self.dec_input = TokenEmbedding(
            num_vocab=num_classes, maxlen=target_maxlen, num_hid=num_hid
        )

        self.encoder = keras.Sequential(
            [self.enc_input]
            + [
                TransformerEncoder(num_hid, num_head, num_feed_forward)
                for _ in range(num_layers_enc)
            ]
        )

        for i in range(num_layers_dec):
            setattr(
                self,
                f"dec_layer_{i}",
                TransformerDecoder(num_hid, num_head, num_feed_forward),
            )

        self.classifier = layers.Dense(num_classes)

    def decode(self, enc_out, target):
        y = self.dec_input(target)
        for i in range(self.num_layers_dec):
            y = getattr(self, f"dec_layer_{i}")(enc_out, y)
        return y

    def call(self, inputs):
        source = inputs[0]
        target = inputs[1]
        x = self.encoder(source)
        y = self.decode(x, target)
        return self.classifier(y)

    @property
    def metrics(self):
        return [self.loss_metric]

    def train_step(self, batch):
        """Processes one batch inside model.fit()."""
        source = batch["source"]
        target = batch["target"]
        dec_input = target[:, :-1]
        dec_target = target[:, 1:]
        with tf.GradientTape() as tape:
            preds = self([source, dec_input])
            one_hot = tf.one_hot(dec_target, depth=self.num_classes)
            mask = tf.math.logical_not(tf.math.equal(dec_target, 0))
            loss = self.compiled_loss(one_hot, preds, sample_weight=mask)
        trainable_vars = self.trainable_variables
        gradients = tape.gradient(loss, trainable_vars)
        self.optimizer.apply_gradients(zip(gradients, trainable_vars))
        self.loss_metric.update_state(loss)
        return {"loss": self.loss_metric.result()}

    def test_step(self, batch):
        source = batch["source"]
        target = batch["target"]
        dec_input = target[:, :-1]
        dec_target = target[:, 1:]
        preds = self([source, dec_input])
        one_hot = tf.one_hot(dec_target, depth=self.num_classes)
        mask = tf.math.logical_not(tf.math.equal(dec_target, 0))
        loss = self.compiled_loss(one_hot, preds, sample_weight=mask)
        self.loss_metric.update_state(loss)
        return {"loss": self.loss_metric.result()}

    def generate(self, source, target_start_token_idx):
        """Performs inference over one batch of inputs using greedy decoding."""
        bs = tf.shape(source)[0]
        enc = self.encoder(source)
        dec_input = tf.ones((bs, 1), dtype=tf.int32) * target_start_token_idx
        dec_logits = []
        for i in range(self.target_maxlen - 1):
            dec_out = self.decode(enc, dec_input)
            logits = self.classifier(dec_out)
            logits = tf.argmax(logits, axis=-1, output_type=tf.int32)
            last_logit = tf.expand_dims(logits[:, -1], axis=-1)
            dec_logits.append(last_logit)
            dec_input = tf.concat([dec_input, last_logit], axis=-1)
        return dec_input
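
Greedy decoding can be sketched without a trained model. Below, `toy_scores` is a hypothetical stand-in for the decoder plus classifier: it always favors the next token of one fixed transcript. Unlike `generate()` above, this sketch stops as soon as the end token is produced.

```python
def greedy_decode(next_token_scores, start_id, end_id, max_len):
    """Repeatedly append the highest-scoring next token."""
    sequence = [start_id]
    for _ in range(max_len - 1):
        scores = next_token_scores(sequence)
        best = max(range(len(scores)), key=scores.__getitem__)  # argmax
        sequence.append(best)
        if best == end_id:
            break
    return sequence


# Hypothetical stand-in for the trained model: it "knows" one fixed transcript.
SCRIPT = [2, 26, 11, 25, 26, 3]   # '<', 'T', 'E', 'S', 'T', '>'


def toy_scores(prefix, vocab_size=33):
    scores = [0.0] * vocab_size
    position = len(prefix)
    next_id = SCRIPT[position] if position < len(SCRIPT) else 3
    scores[next_id] = 1.0
    return scores


decoded = greedy_decode(toy_scores, start_id=2, end_id=3, max_len=700)
print(decoded)
```

Greedy decoding commits to the single best token at each step. It is simple and fast, but it cannot recover from an early mistake.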


## Callbacks to display predictions

In [17]:

from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [18]:
MODELS_DIR = '/ASR_Trained_Models/'  # Where our model will be saved in Google Drive
MODEL_NAME = 'our-transformer'

dest_folder = "/content/gdrive/MyDrive" + MODELS_DIR

In [19]:
class DisplayOutputs(keras.callbacks.Callback):
    def __init__(
        self, batch, idx_to_token, target_start_token_idx=1, target_end_token_idx=2
    ):
        """Displays a batch of outputs after every epoch

        Args:
            batch: A test batch containing the keys "source" and "target"
            idx_to_token: A List containing the vocabulary tokens corresponding to their indices
            target_start_token_idx: A start token index in the target vocabulary
            target_end_token_idx: An end token index in the target vocabulary
        """
        self.batch = batch
        self.target_start_token_idx = target_start_token_idx
        self.target_end_token_idx = target_end_token_idx
        self.idx_to_char = idx_to_token

    def on_epoch_end(self, epoch, logs=None):
        # if epoch % 5 != 0:
        #     return
        source = self.batch["source"]
        target = self.batch["target"].numpy()
        bs = tf.shape(source)[0]
        preds = self.model.generate(source, self.target_start_token_idx)
        preds = preds.numpy()
        print(f"epoch:      {epoch}")
        for i in range(bs):
            target_text = "".join([self.idx_to_char[_] for _ in target[i, :]])
            prediction = ""
            for idx in preds[i, :]:
                prediction += self.idx_to_char[idx]
                if idx == self.target_end_token_idx:
                    break
            print(f"target:     {target_text.replace('-','')}")
            print(f"prediction: {prediction}\n")
        
        # Save Model! Use the model attached to this callback rather than a global.
        dest = "/content/gdrive/MyDrive" + MODELS_DIR + MODEL_NAME + f"-{epoch}"
        self.model.save_weights(dest, overwrite=True)


## Learning rate schedule

In [20]:

class CustomSchedule(keras.optimizers.schedules.LearningRateSchedule):
    def __init__(
        self,
        init_lr=0.00001,
        lr_after_warmup=0.001,
        final_lr=0.00001,
        warmup_epochs=15,
        decay_epochs=85,
        steps_per_epoch=203,
    ):
        super().__init__()
        self.init_lr = init_lr
        self.lr_after_warmup = lr_after_warmup
        self.final_lr = final_lr
        self.warmup_epochs = warmup_epochs
        self.decay_epochs = decay_epochs
        self.steps_per_epoch = steps_per_epoch

    def calculate_lr(self, epoch):
        """ linear warm up - linear decay """
        warmup_lr = (
            self.init_lr
            + ((self.lr_after_warmup - self.init_lr) / (self.warmup_epochs - 1)) * epoch
        )
        decay_lr = tf.math.maximum(
            self.final_lr,
            self.lr_after_warmup
            - (epoch - self.warmup_epochs)
            * (self.lr_after_warmup - self.final_lr)
            / (self.decay_epochs),
        )
        return tf.math.minimum(warmup_lr, decay_lr)

    def __call__(self, step):
        epoch = step // self.steps_per_epoch
        epoch = tf.cast(epoch, tf.float32)  # keep the arithmetic in calculate_lr in floats
        return self.calculate_lr(epoch)
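
To get a feel for the schedule, here is the same warm-up and decay arithmetic in plain Python, evaluated at a few epochs with the default values from the class above. The function name `linear_warmup_decay` is my own.

```python
def linear_warmup_decay(epoch, init_lr=0.00001, lr_after_warmup=0.001,
                        final_lr=0.00001, warmup_epochs=15, decay_epochs=85):
    """Linear warm-up followed by linear decay, floored at final_lr."""
    warmup_lr = init_lr + ((lr_after_warmup - init_lr) / (warmup_epochs - 1)) * epoch
    decay_lr = max(
        final_lr,
        lr_after_warmup
        - (epoch - warmup_epochs) * (lr_after_warmup - final_lr) / decay_epochs,
    )
    return min(warmup_lr, decay_lr)


for epoch in (0, 7, 14, 15, 50, 100):
    print(epoch, f"{linear_warmup_decay(epoch):.6f}")
```

The rate climbs from `init_lr` to `lr_after_warmup` over the first 15 epochs, then falls linearly until it reaches `final_lr` at epoch 100.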


## Create & train the end-to-end model

In [21]:
EPOCHS = 20

batch = next(iter(test_ds))

# The vocabulary to convert predicted indices into characters
display_cb = DisplayOutputs(
    batch, VOCAB, target_start_token_idx=2, target_end_token_idx=3
)  # set the arguments as per vocabulary index for '<' and '>'

model = Transformer(
    num_hid=200,
    num_head=2,
    num_feed_forward=400,
    target_maxlen=MAX_TEXT_LEN,
    num_layers_enc=4,
    num_layers_dec=1,
    num_classes=len(VOCAB),
)
loss_fn = tf.keras.losses.CategoricalCrossentropy(
    from_logits=True, label_smoothing=0.1,
)

In [22]:

learning_rate = CustomSchedule(
    init_lr=0.00001,
    lr_after_warmup=0.001,
    final_lr=0.00001,
    warmup_epochs=15,
    decay_epochs=85,
    steps_per_epoch=len(train_ds),
)
optimizer = keras.optimizers.Adam(learning_rate)
model.compile(optimizer=optimizer, loss=loss_fn)

history = model.fit(train_ds, validation_data=test_ds, callbacks=[display_cb], epochs=EPOCHS)


Epoch 1/20

KeyboardInterrupt: ignored

In [None]:
VOCAB

# References / Resources used

### Overviews  
[Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/)  
[End-to-End Models](https://youtu.be/q67z7PTGRi8)   
[Timeline of Architecture](https://www.youtube.com/watch?v=3MjIkWxXigM&)  
[Machine Translation Vis](https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/)  
[Audio Features](https://youtu.be/PPmNYwVbcts)   
[Attention Mechanism Intuition](https://youtu.be/SysgYptB198)  
[Comparison of Audio Features for ASR](https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html)  
[HMMs vs NNs](https://stats.stackexchange.com/questions/282987/hidden-markov-model-vs-recurrent-neural-network)

### Models
[Attention mechanism: Neural Machine Translation by Jointly Learning to Align and Translate ](https://arxiv.org/pdf/1409.0473.pdf)  
[Listen, Attend, and Spell](https://arxiv.org/abs/1508.01211)  
[Transformers: Attention Is All You Need](https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf)   

### Code  
[Listen, Attend, and Spell Implementation by 30stomercury](https://github.com/30stomercury/Automatic-Speech-Recognition)  
[ASR Transformer by Apoorv Nandan](https://keras.io/examples/audio/transformer_asr/)  

### More Readings
https://towardsdatascience.com/recognizing-speech-commands-using-recurrent-neural-networks-with-attention-c2b2ba17c837  
https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html  
