In this notebook, I'll provide a broad overview of Automatic Speech Recognition (ASR) architecture, its standard pipelines, and how to train a model for it!

### What is Automatic Speech Recognition?
Automatic Speech Recognition (ASR) is the transcription of spoken language into text by a computer.

It has many uses:
+ closed captioning
+ mobile phone voice assistants
+ accessible interfaces for people with disabilities
+ preserving endangered languages

However, ASR is very difficult to solve due to:
+ Environment (Is there noise? Other speakers?)
+ Style of speech (Is it casual, formal, poetic, etc?)
+ Style of speaker (Talks fast? Accents? Gender bias?)

https://www.youtube.com/watch?v=q67z7PTGRi8

Modern ASR models work quite well in ideal, noiseless conditions. However, in noisy conditions they are far from perfect. Below, we can see the typical accuracy for state-of-the-art ASR models in 2020!

(Note: To measure the accuracy of ASR models, we use the 'Word Error Rate' (WER): the number of word substitutions, deletions, and insertions needed to turn the model's transcript into the reference transcript, divided by the number of words in the reference. Lower is better.)
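
To make the metric concrete, here is a minimal sketch of a Word Error Rate calculation using a word-level edit distance. The function name `word_error_rate` and the example sentences are made up for illustration, and the sketch assumes the reference transcript is non-empty.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dist[i][j] = minimum number of edits needed to turn the first i
    # reference words into the first j hypothesis words
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution_cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                      # deletion
                dist[i][j - 1] + 1,                      # insertion
                dist[i - 1][j - 1] + substitution_cost,  # match or substitution
            )
    return dist[len(ref)][len(hyp)] / len(ref)

# Hypothetical reference and model output
wer = word_error_rate("THE CAT SAT ON THE MAT", "THE CAT SIT ON MAT")
print(f"WER: {wer:.2%}")
```

Here the model made 2 errors (one substitution, one deletion) against 6 reference words, so the WER is 2/6.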

![word error rate](https://drive.google.com/uc?export=view&id=1iUz9koIqQErBVB0a4db0gSZVdfsfwh9L)



Architecture:

![timeline](https://drive.google.com/uc?export=view&id=1n2ID2ML5x9TFSBJhfdvj5AGI4u-9Mp4m)


From the 1980s to the 2010s, ASR was predominantly performed with Hidden Markov Model (HMM) architectures. HMMs are probabilistic models for linear sequence classification problems. Given that speech and text are both linear sequences, HMMs seem like a natural fit for ASR.
However, using HMMs for ASR comes with 4 downsides:
+ HMMs are built on the Markov Assumption, which assumes that the probability of the next state depends only on the current state, and not on any prior states. This doesn't make sense for language processing, which is highly contextual (see Coarticulation)
+ HMMs' state transition probabilities are 'baked in' and thus inflexible to changes in language
+ Classic HMM ASR pipelines require hand-tuned probability distributions for acoustic models and language models from linguistic experts
+ HMM-based ASR architecture is very complex, requiring 3 models as input. The figure below demonstrates an HMM-centric ASR architecture

![an HMM-based architecture](https://drive.google.com/uc?export=view&id=1loF8wbD-6DRO45Ly7lKnVVIcCn83H-B9)
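
As a small illustration of the Markov Assumption from the list above, the sketch below scores a sequence of states with a first-order Markov chain. It is only the transition part of an HMM (there are no emissions), and the states and probabilities are invented for the example.

```python
# Hypothetical initial and transition probabilities between three phoneme-like states
initial = {"k": 0.6, "ae": 0.2, "t": 0.2}
transitions = {
    "k": {"k": 0.1, "ae": 0.8, "t": 0.1},
    "ae": {"k": 0.1, "ae": 0.2, "t": 0.7},
    "t": {"k": 0.3, "ae": 0.3, "t": 0.4},
}

def sequence_probability(states):
    """P(s_1, ..., s_n) under a first-order Markov chain."""
    prob = initial[states[0]]
    for prev, cur in zip(states, states[1:]):
        # Only the previous state matters; anything earlier is ignored
        prob *= transitions[prev][cur]
    return prob

p_cat = sequence_probability(["k", "ae", "t"])
print(p_cat)
```

Notice that the chance of moving from "ae" to "t" is the same no matter which states came before "ae". That is exactly the context the Markov Assumption throws away.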

Nowadays, Recurrent Neural Networks (RNNs) or RNN-HMM hybrids are favored for ASR because they address these problems. However, RNNs come at the cost of requiring a vast amount of training data.

![an RNN-only architecture](https://drive.google.com/uc?export=view&id=1_9McwMlHIqlPuJGYOkPwB-8h-t13P4pI)
(an RNN-only model that has only 4 steps)

https://stats.stackexchange.com/questions/282987/hidden-markov-model-vs-recurrent-neural-network
Jurafsky textbook

For our ASR model, we'll be using a 'Listen, Attend, and Spell' (LAS) RNN. LAS was proposed in 2016 to simplify hybrid ASR architectures (such as DNN-HMMs or CTC-HMMs) and to avoid the aforementioned independence assumptions. Since then, it has been at the forefront of ASR architecture.

https://storage.googleapis.com/pub-tools-public-publication-data/pdf/44926.pdf

The objective of LAS is to transform an arbitrary-length input sequence into an arbitrary-length output sequence. LAS is an encoder-decoder model, a class of RNNs that is well-suited for problems where the length of the inputs differs greatly from the length of the outputs (like ASR).

Encoder-Decoder models are composed of 3 parts:
+ an Encoder
+ a Context Vector
+ a Decoder

The Encoder compresses and summarizes the input data into a Context Vector, which is later transformed by the Decoder into the output sequence. The Context Vector must be defined with a fixed length (usually 256, 512, or 1024).
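
To illustrate the fixed-length Context Vector, here is a toy Encoder sketch: a tiny tanh RNN written in plain Python with random, untrained weights. The function `encode` and its parameters are hypothetical and are not part of the LAS model we build later; the only point is that the output size does not depend on the input length.

```python
import math
import random

def encode(inputs, context_size, seed=0):
    """Run a tiny tanh RNN over `inputs` and return its final hidden state."""
    rng = random.Random(seed)
    input_size = len(inputs[0])
    # Randomly initialised (untrained) weights, for illustration only
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(input_size)]
            for _ in range(context_size)]
    w_rec = [[rng.uniform(-0.5, 0.5) for _ in range(context_size)]
             for _ in range(context_size)]
    hidden = [0.0] * context_size
    for frame in inputs:
        # New hidden state mixes the current frame with the previous hidden state
        hidden = [
            math.tanh(
                sum(w * x for w, x in zip(w_in[k], frame))
                + sum(w * h for w, h in zip(w_rec[k], hidden))
            )
            for k in range(context_size)
        ]
    return hidden  # this plays the role of the Context Vector

short_clip = [[0.1, 0.2, 0.3]] * 5    # 5 timesteps
long_clip = [[0.1, 0.2, 0.3]] * 500   # 500 timesteps
context_short = encode(short_clip, context_size=8)
context_long = encode(long_clip, context_size=8)
print(len(context_short), len(context_long))
```

Both clips end up as a vector of the same size, even though one is 100 times longer than the other.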

Acoustic Information as Input for RNN:
A classic HMM-based ASR model would require us to input 3 probabilistic models (acoustic, pronunciation, and language), whereas an RNN only requires 1 (acoustic). The acoustic model captures important characteristics about how we perceive speech (see MFCC). We'll extract this data from the LibriSpeech ASR dataset.

https://www.openslr.org/12

We first need to extract features from the audio files. We'll put these features in a vector and feed them into the RNN.
In order to extract the features from a given file, we need to perform the following steps:
```
1) Convert the FLAC file into a Waveform (giving us amplitude data over the time domain)
2) Split the Waveform into Windows using a Windowing Function (like Hamming)
3) For each Window:
    a) Compute its Fourier Transform
    b) Compute its Mel Spectrogram (also known as Mel Filter Bank)
    c) Compute the Log of each value in the Mel Spectrogram
    d) Append the result of 3c to the Features Vector
4) Return Features Vector
```
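
Step 2 above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the synthetic sine wave stands in for a decoded FLAC file, and the helper names `hamming_window` and `split_into_windows` are made up. In the real pipeline further down, `tf.signal.stft` does the windowing internally (and its default window function is Hann rather than Hamming).

```python
import math

def hamming_window(size):
    """Hamming window coefficients: 0.54 - 0.46 * cos(2*pi*n / (size - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (size - 1)) for n in range(size)]

def split_into_windows(waveform, window_size, window_step):
    """Slice a waveform into overlapping windows and taper each one."""
    window = hamming_window(window_size)
    # Only keep windows that fit entirely inside the waveform
    n_windows = 1 + (len(waveform) - window_size) // window_step
    return [
        [sample * w for sample, w in zip(waveform[start:start + window_size], window)]
        for start in range(0, n_windows * window_step, window_step)
    ]

# One second of a 440 Hz sine wave sampled at 16 kHz (synthetic stand-in for real audio)
demo_sample_rate = 16000
waveform = [math.sin(2 * math.pi * 440 * t / demo_sample_rate)
            for t in range(demo_sample_rate)]

# 25 ms windows (400 samples) taken every 10 ms (160 samples)
windows = split_into_windows(waveform, window_size=400, window_step=160)
print(len(windows), len(windows[0]))
```

With these settings, one second of audio yields 98 windows of 400 samples each.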

But before we do any computations, let's define some configuration constants:

In [1]:
sample_rate = 16000  # The Sample Rate of the LibriSpeech Dataset
window_size_frames = int(sample_rate * 0.025)  # The standard window size for ASR models is 25ms (400 samples at 16 kHz)
window_step_frames = int(sample_rate * 0.010)  # The standard window step for ASR models is 10ms (160 samples at 16 kHz)
n_filters = 40  # Number of Mel filter banks computed per window

And of course, download the dataset(s):

In [2]:
DATASET_DIR = './data/'

DATASETS = [
  "dev-clean",
  # "train-clean-100",
  # "train-clean-360",  # this exceeds Colab's Disk limit
  # "train-other-500",  # this exceeds Colab's Disk limit
]

def download_datasets(datasets): 
  for dataset in datasets:
    dest = DATASET_DIR + dataset + '.tar.gz'
    !mkdir -p {DATASET_DIR}
    !wget -O {dest} https://www.openslr.org/resources/12/{dataset}.tar.gz

def extract_datasets(datasets):
  for dataset in datasets:
    src = DATASET_DIR + dataset + ".tar.gz"
    dest = DATASET_DIR + dataset + '/'
    !mkdir -p {DATASET_DIR + dataset}
    !tar -xf {src} -C {dest}
    print("Finished extraction")

download_datasets(DATASETS)
extract_datasets(DATASETS)

--2021-06-07 02:13:39--  https://www.openslr.org/resources/12/dev-clean.tar.gz
Resolving www.openslr.org (www.openslr.org)... 46.101.158.64
Connecting to www.openslr.org (www.openslr.org)|46.101.158.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 337926286 (322M) [application/x-gzip]
Saving to: ‘./data/dev-clean.tar.gz’


2021-06-07 02:13:53 (24.5 MB/s) - ‘./data/dev-clean.tar.gz’ saved [337926286/337926286]

Finished extraction


Now let's get the Log Mel Filter Bank energies! We'll be using TensorFlow's `tf.signal` module (with librosa to load the audio) for simplicity.

In [3]:
import numpy as np
import soundfile as sf
import tensorflow as tf
import librosa
from tqdm import tqdm # for pretty loading bars

In [17]:

def files_to_log_mel_spec(paths, sample_rate, window_size_frames, window_step_frames, num_mel_bins):
  """ Returns 
  ( Tensor of Log Mel Spectrograms of the audio files in paths, 
    numpy array of the Lengths of the Spectrograms ) 
  """
  n_files = len(paths)
  print(f"Converting {n_files} files to Log Mel Spectrograms")
  lmspecs = [] 
  lengths = []
  # longest = 0
  for i, path in enumerate(paths):
    if i % 250 == 0: print(f"{i}/{n_files} converted to Log Mel Spectrograms")
    # Load the audio as a float waveform, resampled to sample_rate if needed
    audio, _ = librosa.load(path, sr=sample_rate)
    audio = tf.convert_to_tensor(audio)
    # Get Spectrogram using Short-time Fourier Transform
    stfts = tf.signal.stft(audio, frame_length=window_size_frames, frame_step=window_step_frames, fft_length=1024)
    spectrograms = tf.abs(stfts)
    # Warp the linear scale spectrograms into the Mel Scale.
    num_spectrogram_bins = stfts.shape[-1]
    lower_edge_hertz, upper_edge_hertz = 80.0, 7600.0
    linear_to_mel_weight_matrix = tf.signal.linear_to_mel_weight_matrix(
      num_mel_bins, num_spectrogram_bins, sample_rate, lower_edge_hertz,
      upper_edge_hertz)
    mel_spectrograms = tf.tensordot(spectrograms, linear_to_mel_weight_matrix, 1)
    mel_spectrograms.set_shape(spectrograms.shape[:-1].concatenate(
      linear_to_mel_weight_matrix.shape[-1:]))
    # Compute a stabilized log to get log-magnitude mel-scale spectrograms.
    lmspec = tf.math.log(mel_spectrograms + 1e-6)

    # Normalize Log Mel Spectrogram
    means = tf.math.reduce_mean(lmspec, 1, keepdims=True)
    stddevs = tf.math.reduce_std(lmspec, 1, keepdims=True)
    lmspec = (lmspec - means) / stddevs

    # longest = longest if longest > len(lmspec) else len(lmspec)
    # Pad to make consistent
    pad_len = 4500 # Longest in dev-clean is 3263
    paddings = tf.constant([[0, pad_len], [0, 0]])
    lmspec = tf.pad(lmspec, paddings, "CONSTANT")[:pad_len, :]

    lmspecs.append(lmspec)
    # Note: this is the padded length (always pad_len), not the original number of frames
    lengths.append(len(lmspec))

  print(f"{n_files}/{n_files} converted to Log Mel Spectrograms")

  # print("Longest is "+str(longest))
  return lmspecs, np.array(lengths).astype(np.int32)


In [5]:
# Example
files_to_log_mel_spec(['./data/dev-clean/LibriSpeech/dev-clean/1272/128104/1272-128104-0000.flac'], sample_rate, window_size_frames, window_step_frames, n_filters)

Converting 1 files to Log Mel Spectrograms
0/1 converted to Log Mel Spectrograms


([<tf.Tensor: shape=(4500, 40), dtype=float32, numpy=
  array([[ 0.20061943, -1.0287393 , -1.509152  , ..., -1.2667471 ,
          -0.5122534 , -0.39129886],
         [-0.4862597 , -1.1197394 , -0.5595873 , ..., -0.9269718 ,
          -0.5736864 , -0.34323072],
         [ 0.6221455 , -0.00927406, -0.01342059, ..., -0.83187187,
          -0.71671116, -0.4780636 ],
         ...,
         [ 0.        ,  0.        ,  0.        , ...,  0.        ,
           0.        ,  0.        ],
         [ 0.        ,  0.        ,  0.        , ...,  0.        ,
           0.        ,  0.        ],
         [ 0.        ,  0.        ,  0.        , ...,  0.        ,
           0.        ,  0.        ]], dtype=float32)>],
 array([4500], dtype=int32))
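
To give some intuition for the Mel warping step in the function above, here is a rough sketch of where the edges of the 40 Mel filters fall between 80 Hz and 7600 Hz. It uses a commonly cited (HTK-style) Hz-to-Mel formula; the helper names are made up, and this is an illustration rather than a reimplementation of `tf.signal.linear_to_mel_weight_matrix`.

```python
import math

def hz_to_mel(hz):
    """A commonly used Hz-to-Mel conversion (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(n_bins, lower_hz, upper_hz):
    """Edges (in Hz) of n_bins triangular filters, evenly spaced on the Mel scale."""
    lower_mel, upper_mel = hz_to_mel(lower_hz), hz_to_mel(upper_hz)
    step = (upper_mel - lower_mel) / (n_bins + 1)
    return [mel_to_hz(lower_mel + i * step) for i in range(n_bins + 2)]

edges = mel_band_edges(40, 80.0, 7600.0)
print(f"first filter spans {edges[0]:.0f} to {edges[2]:.0f} Hz")
print(f"last filter spans {edges[-3]:.0f} to {edges[-1]:.0f} Hz")
```

The filters are narrow at low frequencies and wide at high frequencies, which loosely mirrors how human hearing resolves pitch.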

Sometimes you may see MFCCs (Mel-Frequency Cepstral Coefficients) used as the ASR features instead of Mel Filter Banks. MFCCs are a compressed, decorrelated version of the Filter Bank energies, which in turn loosely model how humans perceive sound. MFCCs were popular with HMM-based models, but newer architectures can usually use Filter Banks directly and get similar or better accuracy. You can read more [here](https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html).

The diagram below shows the steps required for calculating the Filter Banks and MFCCs. As you can see, MFCCs require a few more calculations after the Filter Banks:

![features pipeline](https://drive.google.com/uc?export=view&id=15T0MgXcj9wA_ZQ_i4tCLukUCH-7T2FnB)

https://www.youtube.com/watch?v=QTw-6GU5Mjs&t=319s
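
The main extra calculation for MFCCs is a Discrete Cosine Transform (DCT) applied to each frame of log Mel energies, keeping only the first few coefficients. Below is a minimal, unoptimized sketch. The helper names, the synthetic input frame, and the choice of 13 coefficients (a common convention) are assumptions for illustration.

```python
import math

def dct_ii(values):
    """Unnormalised DCT-II of a list of numbers."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
            for i, v in enumerate(values))
        for k in range(n)
    ]

def mfcc_from_log_mel(log_mel_frame, n_coefficients=13):
    """Keep the first few DCT coefficients of one frame of log Mel energies."""
    return dct_ii(log_mel_frame)[:n_coefficients]

# A synthetic frame of 40 log Mel energies (stand-in for one row of a real spectrogram)
log_mel_frame = [math.log(1.0 + i) for i in range(40)]
mfccs = mfcc_from_log_mel(log_mel_frame)
print(len(mfccs))
```

Each frame shrinks from 40 Filter Bank values to 13 coefficients, which is part of why MFCCs were attractive for older models.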


Let's also add a helper function that'll allow us to map the spectrograms to their target text! Specifically, this function will Tokenize the target text (i.e. split the text by character) and Numericize it (i.e. map each character to an integer). We'll also insert some special boundary characters to help the NN read the string:

In [6]:
# This tokenization code was borrowed/modified from https://github.com/30stomercury/Automatic-Speech-Recognition/blob/master/utils/tokenizer.py
import string
def char2id(special_tokens):
    """
    Args:
        special_tokens: special characters, <PAD>, <SOS>, <EOS>, <SPACE>
    Returns:
        char2id: dict, from character to index.
        id2char: dict, from index to character.
    """
    alphas = list(string.ascii_uppercase[:26])
    tokens = special_tokens + alphas
    token_to_id = {}
    id_to_token = {}
    for i, c in enumerate(tokens):
        token_to_id[c] = i
        id_to_token[i] = c
    return token_to_id, id_to_token

SPECIAL_TOKENS = ['<PAD>', '<SOS>', '<EOS>', '<SPACE>']
_char2id, _id2char = char2id(SPECIAL_TOKENS)  # Lookup dictionaries

def tokenize(uppercase_sentence):
  """ Returns a Tokenization of the given sentence and its length """
  tokens = [_char2id['<SOS>']]
  tokens += [_char2id[char] if char != ' ' else _char2id['<SPACE>'] for char in list(uppercase_sentence)]
  tokens += [_char2id['<EOS>']]
  return tokens, len(tokens)

In [7]:

def tokenize_sentences(sentences):
  """ Iterate through a list of sentences and tokenize them.
  Returns: ( an array of tokens, their lengths )"""
  tokens = []
  lengths = []
  n_sentences = len(sentences)
  print("Tokenizing "+str(n_sentences)+" sentences")
  for sentence in sentences:
    # sentence = sentence.translate(str.maketrans('', '', string.punctuation))
    sentence_converted, length = tokenize(sentence)
    # Pad Tokenization
    to_pad = 700 - length # max in dev_clean is 513
    sentence_converted += [_char2id["<PAD>"]] * to_pad
    tokens.append(sentence_converted)
    lengths.append(length)
  return np.array(tokens), np.array(lengths).astype(np.int32)


In [8]:
# Show examples
print("Numericization dict: "+str(_char2id))
print("'TEST' tokenized is "+str(tokenize('TEST')[0])+" where the 1st token is <SOS> and the last token is <EOS>") 

Numericization dict: {'<PAD>': 0, '<SOS>': 1, '<EOS>': 2, '<SPACE>': 3, 'A': 4, 'B': 5, 'C': 6, 'D': 7, 'E': 8, 'F': 9, 'G': 10, 'H': 11, 'I': 12, 'J': 13, 'K': 14, 'L': 15, 'M': 16, 'N': 17, 'O': 18, 'P': 19, 'Q': 20, 'R': 21, 'S': 22, 'T': 23, 'U': 24, 'V': 25, 'W': 26, 'X': 27, 'Y': 28, 'Z': 29}
'TEST' tokenized is [1, 23, 8, 22, 23, 2] where the 1st token is <SOS> and the last token is <EOS>
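
Later on we'll want to turn the model's integer predictions back into text. Here is a sketch of the reverse mapping. The function `detokenize` is hypothetical (it is not part of the borrowed tokenizer code), and the vocabulary is rebuilt locally so that the cell runs on its own.

```python
import string

# Rebuild the same vocabulary as above so this cell is self-contained
vocab = ['<PAD>', '<SOS>', '<EOS>', '<SPACE>'] + list(string.ascii_uppercase)
id_to_token = {i: token for i, token in enumerate(vocab)}

def detokenize(token_ids):
    """Map a (possibly padded) list of token ids back to an uppercase sentence."""
    chars = []
    for token_id in token_ids:
        token = id_to_token[token_id]
        if token == '<EOS>':
            break  # everything after <EOS> is padding
        if token in ('<SOS>', '<PAD>'):
            continue  # boundary and padding tokens carry no text
        chars.append(' ' if token == '<SPACE>' else token)
    return ''.join(chars)

padded = [1, 23, 8, 22, 23, 2, 0, 0, 0]  # 'TEST' followed by three <PAD> tokens
print(detokenize(padded))
```

This recovers 'TEST' from the padded token list shown in the example above.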


Now we can extract features and target text from the whole LibriSpeech dataset!

In [9]:
from glob import glob
import joblib
import logging

In [14]:
# The following code borrowed/modified from https://github.com/30stomercury/Automatic-Speech-Recognition/blob/master/preprocess.py

def get_texts_and_audio_paths(root_dataset_path):
  """ Iterate through folders in a directory.
  Return: Target texts, paths of corresponding audio recordings 
  """
  folders = glob(root_dataset_path+"/**/**")
  texts = []
  audio_path = []
  for path in folders:
    text_path = glob(path+"/*txt")[0]
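    # Each line of the transcript file is "<utterance-id> <TEXT>"
    with open(text_path) as f:  # the file is closed automatically
        for line in f:
            line_ = line.split(" ")
            audio_path.append(path+"/"+line_[0]+".flac")
            # Drop the utterance id and trailing newline, and remove apostrophes
            texts.append(line[len(line_[0])+1:-1].replace("'",""))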
  return texts, audio_path

# When number of audios in a set (usually training set) > threshold, divide set into several parts to avoid memory error.
_SAMPLE_THRESHOLD = 30000

def extract_inputs(datasets, datasets_dir, feat_dir):
  """ Iterate through LibriSpeech dataset and save features and texts to 
      binaries under feat_dir. 
  """
  def process_libri_feats(audio_path, cat, k):
      """When number of feats > threshold, divide feature
          into several parts to avoid memory error.
      """
      if len(audio_path) > _SAMPLE_THRESHOLD:
          featlen = []
          n = len(audio_path) // k + 1
          logging.info("Process {} audios...".format(cat))
          for i in tqdm(range(k)):
              feats, featlen_ = files_to_log_mel_spec(audio_path[i*n:(i+1)*n], sample_rate, window_size_frames, window_step_frames, n_filters)
              featlen += featlen_
              # Save the features into a file
              joblib.dump(feats, feat_dir+"/{}-feats-{}.pkl".format(cat, i))
              feats = []
      else:
          feats, featlen = files_to_log_mel_spec(audio_path, sample_rate, window_size_frames, window_step_frames, n_filters)
          joblib.dump(feats, feat_dir+"/{}-feats.pkl".format(cat))
      np.save(feat_dir+"/{}-featlen.npy".format(cat), featlen)

  for dataset_name in datasets:
    to_cat = dataset_name
    libri_path = datasets_dir + dataset_name + '/' + 'LibriSpeech' + '/' + dataset_name
    print("Extracting from "+libri_path)
    target_texts, audio_paths = get_texts_and_audio_paths(libri_path)

    # Tokenize sentence
    tokens, token_lengths = tokenize_sentences(target_texts)

    # Save tokens and their lengths to files
    np.save(feat_dir+"/{}-{}s.npy".format(to_cat,'char'), tokens)
    np.save(feat_dir+"/{}-{}len.npy".format(to_cat,'char'), token_lengths)

    # Extract and save features (in our case, the Log Mel Spectrograms)
    process_libri_feats(audio_paths, to_cat, len(audio_paths)//_SAMPLE_THRESHOLD)
  print("Finished extraction")


In [15]:

FEATURES_DIR = './features'
!mkdir -p {FEATURES_DIR}
extract_inputs(DATASETS, DATASET_DIR, FEATURES_DIR)

Extracting from ./data/dev-clean/LibriSpeech/dev-clean
Tokenizing 2703 sentences
Converting 2703 files to Log Mel Spectrograms
0/2703 converted to Log Mel Spectrograms
250/2703 converted to Log Mel Spectrograms
500/2703 converted to Log Mel Spectrograms
750/2703 converted to Log Mel Spectrograms
1000/2703 converted to Log Mel Spectrograms
1250/2703 converted to Log Mel Spectrograms
1500/2703 converted to Log Mel Spectrograms
1750/2703 converted to Log Mel Spectrograms
2000/2703 converted to Log Mel Spectrograms
2250/2703 converted to Log Mel Spectrograms
2500/2703 converted to Log Mel Spectrograms


In [20]:

def load_extracted_data(feature_dir, datasets):
  """ Load the data from files """
  dataset = datasets[0]
  features = joblib.load(feature_dir+"/{}-feats.pkl".format(dataset))
  feature_lengths = np.load(feature_dir+"/{}-featlen.npy".format(dataset), allow_pickle=True)
  tokens = np.load(feature_dir+"/{}-chars.npy".format(dataset), allow_pickle=True)
  token_lengths = np.load(feature_dir+"/{}-charlen.npy".format(dataset), allow_pickle=True)
  return features, feature_lengths, tokens, token_lengths

# Load our saved data
features, feature_lengths, tokens, token_lengths = load_extracted_data(FEATURES_DIR, DATASETS)

In [23]:
def create_tensorflow_datasets(features, tokens, batch_size = 64):
  """ Return Tensorflow Dataset objects for features, tokens """
  features_ds = tf.data.Dataset.from_tensor_slices(features)
  tokens_ds = tf.data.Dataset.from_tensor_slices(tokens)
  ds = tf.data.Dataset.zip((features_ds, tokens_ds))
  ds = ds.map(lambda x, y: {"source": x, "target": y})
  ds = ds.batch(batch_size)
  ds = ds.prefetch(tf.data.experimental.AUTOTUNE)
  return ds

create_tensorflow_datasets(features, tokens)

<PrefetchDataset shapes: {source: (None, 4500, 40), target: (None, 700)}, types: {source: tf.float32, target: tf.int64}>

In [None]:
!wget -O /content/data/test.wav https://file-examples-com.github.io/uploads/2017/11/file_example_WAV_1MG.wav

Even though we have the spectrograms of the audio recordings and their respective text transcriptions, the Neural Network will not be able to read them in this format. We'll need to define the following variables so that both the encoder and decoder can run:

In [None]:
# add function to AudioHandler
log_specgrams, \
target_text_chars, \
total_found_chars, \
num_target_text_tokens = extract_features_and_text_targets('./data_test/LibriSpeech/dev-clean/')
# https://www.kdnuggets.com/2017/12/audio-classifier-deep-neural-networks.html

We also need to Tokenize the target texts and Numericize them (i.e. assign integers to each token):

In [None]:
target_token_index = None  # A dict mapping an integer to each seen character
encoder_input_data = None
decoder_input_data = None  # An array of dicts mapping each character to 1 if seen, 0 if not seen
decoder_target_data = None  # An array of dicts mapping each character to 1 if seen, 0 if not seen


In [None]:

batch_size = 64  # Batch size for training.
epochs = 100  # Number of epochs to train for.
latent_dim = 256  # Latent dimensionality of the encoding space.
num_samples = 10000  # Number of samples to train on.
data_path = "./data/LibriSpeech/dev-clean/1272/128104/1272-128104-0000.flac"




In [None]:
import keras

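# Assumption: each encoder timestep is one spectrogram frame with n_filters Mel values
num_encoder_tokens = n_filters
encoder_inputs = keras.Input(shape=(None, num_encoder_tokens))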



Next: build encoder-decoder architecture, based on https://github.com/tensorflow/nmt

Next: plug into RNN, run it on Google Colab. Save it

Next: let user input voice data, and run it through the model

More sources:
https://towardsdatascience.com/recognizing-speech-commands-using-recurrent-neural-networks-with-attention-c2b2ba17c837
https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
https://arxiv.org/pdf/1409.0473.pdf


Further Optimizations:

The fixed length of the Context Vector turns out to be a performance bottleneck: given a long input sequence, the Encoder may not be able to fit everything it has seen into a single Context Vector.

To solve the bottleneck, we could use the 'attention' mechanism which involves the Decoder selecting from all hidden states provided by the Encoder. For the sake of simplicity we won't be doing this optimization.
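
For the curious, here is a rough sketch of the idea in plain Python. It uses simple dot-product scoring with made-up numbers; the attention used in LAS and in the papers linked above learns its scoring function, so treat this as an illustration of the mechanism only. The names `softmax` and `attend` are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """Dot-product attention: a weighted average of all Encoder hidden states."""
    # Score each Encoder state by its similarity to the current Decoder state
    scores = [sum(d * e for d, e in zip(decoder_state, state))
              for state in encoder_states]
    weights = softmax(scores)
    context = [
        sum(w * state[k] for w, state in zip(weights, encoder_states))
        for k in range(len(decoder_state))
    ]
    return context, weights

# Three hypothetical Encoder hidden states and one Decoder state
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.0, 2.0]
context, weights = attend(decoder_state, encoder_states)
print([round(w, 3) for w in weights])
```

Instead of one Context Vector for the whole input, the Decoder gets a fresh context at every output step, weighted toward the Encoder states that are most relevant at that moment.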
