In this notebook, I'll provide an wide overview on Automatic Speech Architecture, its standard pipelines, and how to train a model for it!

### What is Automatic Speech Recognition?
Automatic Speech Recognition (ASR) is the translation of spoken speech to text by a computer.

It has many uses:
+ closed captioning
+ mobile phone voice assistants
+ interface for handicapped individuals
+ preserve endangered languages

However, ASR is very difficult to solve due to:
+ Environment (Is there noise? Other speakers?)
+ Style of speech (Is it casual, formal, poetic, etc?)
+ Style of speaker (Talks fast? Accents? Gender bias?)

https://www.youtube.com/watch?v=q67z7PTGRi8

Modern ASR models work quite well in ideal, noiseless conditions. However, in noisy conditions they are far from perfect. Below, we can see the typical accuracy for state-of-the-art ASR models in 2020!

(Note: To calculate the accuracy of ASR models, we use a 'Word Error Rate.' This determines which percentage of spoken words were incorrectly translated to text by the ASR model. )

![word error rate](https://drive.google.com/uc?export=view&id=1iUz9koIqQErBVB0a4db0gSZVdfsfwh9L)



Architecture:

![timeline](https://drive.google.com/uc?export=view&id=1n2ID2ML5x9TFSBJhfdvj5AGI4u-9Mp4m)


From the 1980s to the 2010s, ASR was predominately performed by Hidden Markov Model (HMM) architecture. HMMs are probabilistic models for linear sequence classification problems. Given text's linear nature, HMMs seem like a natural fit for ASR.
However, using HMMs for ASR come with 3 downsides:
+ HMMs are built on the Markov Assumption, which assumes that the probability of the next state only depends on the current state, and not any states prior. This doesn't make sense for language processing, which is highly contextual (see Coarticulation)
+ HMMs' state transitions probabilities are 'baked in' and thus inflexible to changes in language
+ Classic HMM ASR pipelines require hand-tuned probability distributions for accoustic models and language models from linguistic experts
+ HMM-based ASR architecture is very complex, requiring 3 models as input. The figure below demonstrates an HMM-centric ASR architecture

![an HMM-based architecture](https://drive.google.com/uc?export=view&id=1loF8wbD-6DRO45Ly7lKnVVIcCn83H-B9)

Nowadays Recurrent Neural Networks (RNNs) or RNN-HMM hybrids are favored for ASR because they solve these problems. However, RNN's come at the cost of requiring a vast amount of training data.

![an RNN-only architecture](https://drive.google.com/uc?export=view&id=1_9McwMlHIqlPuJGYOkPwB-8h-t13P4pI)
(an RNN-only model that has only 4 steps)

https://stats.stackexchange.com/questions/282987/hidden-markov-model-vs-recurrent-neural-network
Jurafsky textbook

For the our ASR model, we'll be using an 'Listen, Attend, and Spell' (LAS) RNN. The LAS was proposed by in 2016 for ASR architecture in order to simplify Hybrid architecture (such as DNN-HMMs or CTC-HMMs) and avoid the aforementioned independence assumptions. Since then it has been the forefront of ASR architecture.

https://storage.googleapis.com/pub-tools-public-publication-data/pdf/44926.pdf

The objective of LAS is to transform an arbitrary-length input sequence to an arbitrary-length output sequence. LAS is an encoder-decoder model- a class of RNN's that is well-suited for problems where the length of the inputs greatly differs from the length of the outputs (like ASR).

Encoder-Decoder models are composed of 3 parts:
+ an Encoder
+ a Context Vector
+ a Decoder

The Encoder compresses and summarizes input data into a Context Vector, which is later transformed by the Decoder. The Context Vector must be defined with a fixed-length (usually 256, 512, or 1024).

Accoustic Information as Input for RNN:
A classic HMM-based ASR model would require us to input 3 probabilistic models (accoustic, pronunciation, and language), whereas an RNN only requires 1 (accoustic). The accoustic model captures important characteristics about how we perceive speech (see MFCC). We'll extract this data from the LibriSpeech ASR dataset.

https://www.openslr.org/12

We need to first extract features from the audio files- we'll be putting these features in a vector and feeding them into the RNN.
In order to extract the features from a given file, we need to perform the following steps:
```
1) Convert the FLAC file into a Waveform (giving us frequency data over the time domain)
2) Split the Waveform into Windows using a Windowing Function (like Hamming)
3) For each Window:
    a) Compute its Fourier Transform
    b) Compute its Mel Spectrogram (also known as Mel Filter Bank)
    c) Compute the Log of each value in the Mel Spectrogram
    d) Append the result of 3c to the Features Vector
4) Return Features Vector
```

But before we do any computations, let's define some configuration constants:

In [21]:
sample_rate = 16000  # The Sample Rate of the LibriSpeech Dataset
window_size = 0.025  # The standard window size for ASR models is 25ms
window_step = 0.010 # The standard window step for ASR models is 10ms
n_filters = 40

Now let's get the Log Mel Filter Bank energies! We'll be using the speechpy library for simplicity.

In [98]:
import numpy as np
import soundfile as sf
import speechpy
import tensorflow as tf

def files_to_log_mel_spec(filepaths, sample_rate, window_size, window_step, n_filters):  # LibriSpeech uses a 16kHz sample rate
    """ Convert an audio file to a Log Mel Spectrogram """
    log_mels = []
    lengths = []
    for filepath in filepaths:
        test_sig, _ = sf.read(filepath)
        # Perform Log Mel Filter Bank
        log_mel = speechpy.feature.lmfe(test_sig, sample_rate, frame_length=window_size, frame_stride=window_step, num_filters=n_filters).astype(np.float32)
        log_mels.append(log_mel)
        lengths.append(len(log_mels))
    log_mels = np.array(log_mels)
    # normalize it
    means = tf.math.reduce_mean(log_mels, 1, keepdims=True)
    stddevs = tf.math.reduce_std(log_mels, 1, keepdims=True)
    normalized = (log_mels - means) / stddevs
    return normalized, np.array(lengths).astype(np.int32)

files_to_log_mel_spec(['././data/LibriSpeech/dev-clean/1272/128104/1272-128104-0000.flac'], sample_rate, window_size, window_step, n_filters)

1.0


(<tf.Tensor: shape=(1, 583, 40), dtype=float32, numpy=
 array([[[-1.197623  , -1.5210508 , -2.1210625 , ..., -1.7393734 ,
          -1.6316619 , -1.3016486 ],
         [-1.2549098 , -1.6451805 , -1.731487  , ..., -1.6508245 ,
          -1.5869974 , -1.2539657 ],
         [-1.125568  , -1.5487683 , -2.0217886 , ..., -1.6577387 ,
          -1.5408138 , -1.3341299 ],
         ...,
         [-1.7144234 , -1.6876788 , -1.6496265 , ..., -0.95731646,
          -1.0773276 , -0.9335013 ],
         [-1.4120784 , -2.0117342 , -1.5904878 , ..., -0.9466566 ,
          -1.2478892 , -1.0783314 ],
         [-1.5031108 , -1.6242408 , -1.5319533 , ..., -0.78672004,
          -1.2694486 , -1.2399886 ]]], dtype=float32)>,
 array([1], dtype=int32))

Sometimes you may see MFCCs used as the ASR features instead of Mel Filter Banks. MFCCs are values that loosely represents the brain's capacity to filter out certain signals. MFCCs were popular with HMM-based models for helping model based on human-perceptible information, but newer architecture can usually use Filter Banks and get similar or better accuracy. You can read more [here](https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html).

The diagram below shows the steps required for calculating the Filter Banks and MFCCs. As you can see, MFCCs require a few more calculations after the Filter Banks:

![features pipeline](https://drive.google.com/uc?export=view&id=15T0MgXcj9wA_ZQ_i4tCLukUCH-7T2FnB)

https://www.youtube.com/watch?v=QTw-6GU5Mjs&t=319s


Let's also add a helper function that'll allow us to map the spectrograms to their target text! Specifically, this function will Tokenize the target text (i.e. split the text by character):

In [93]:
from project_utils.tokenizer import CharEncoder

tokenizer = CharEncoder()
def tokenize_sentences(sentences):
    tokens = []
    lengths = []
    for sentence in sentences:
        # sentence = sentence.translate(str.maketrans('', '', string.punctuation))
        sentence_converted = tokenizer.encode(sentence, with_eos=True)
        tokens.append(sentence_converted)
        lengths.append(len(sentence_converted))

    return np.array(tokens), np.array(lengths).astype(np.int32)

demo_dir_path = './data/LibriSpeech/dev-clean/1272/128104/'

Now we can extract features and target text from the whole LibriSpeech dataset!

In [94]:
# The following code borrowed from is https://github.com/30stomercury/Automatic-Speech-Recognition/blob/master/preprocess.py

from glob import glob
import joblib
import logging

def get_texts_and_audio_paths(root_path):
    folders = glob(root_path+"/**/**")
    texts = []
    audio_path = []
    for path in folders:
        text_path = glob(path+"/*txt")[0]
        f = open(text_path)
        for line in f.readlines():
            line_ = line.split(" ")
            audio_path.append(path+"/"+line_[0]+".flac")
            texts.append(line[len(line_[0])+1:-1].replace("'",""))
    return texts, audio_path

demo_root_path = './data/LibriSpeech/dev-clean/'
res = get_texts_and_audio_paths(demo_root_path)

# When number of audios in a set (usually training set) > threshold, divide set into several parts to avoid memory error.
_SAMPLE_THRESHOLD = 30000
train_100hr_corpus_dir = './null/'
train_360hr_corpus_dir = './null/'
train_500hr_corpus_dir ='./null/'
dev_data_dir = './data/LibriSpeech/dev-clean/'
test_data_dir = './null/'
def extract_inputs(feat_dir):
    def process_libri_feats(audio_path, cat, k):
        """When number of feats > threshold, divide feature
           into several parts to avoid memory error.
        """
        if len(audio_path) > _SAMPLE_THRESHOLD:
            featlen = []
            n = len(audio_path) // k + 1
            logging.info("Process {} audios...".format(cat))
            for i in range(k):
                feats, featlen_ = files_to_log_mel_spec(audio_path[i*n:(i+1)*n])
                featlen += featlen_
                # save
                joblib.dump(feats, feat_dir+"/{}-feats-{}.pkl".format(cat, i))
                feats = []
        else:
            feats, featlen = files_to_log_mel_spec(audio_path, sample_rate, window_size, window_step, n_filters)
            joblib.dump(feats, feat_dir+"/{}-feats.pkl".format(cat))
        np.save(feat_dir+"/{}-featlen.npy".format(cat), featlen)

    paths = [('train-100', train_100hr_corpus_dir), ('train-360', train_360hr_corpus_dir), ('train-500', train_500hr_corpus_dir), ('dev', dev_data_dir), ('test', test_data_dir)]
    path_pair = paths[3]
    # for path_pair in paths:
    if True:
        to_cat = path_pair[0]
        libri_path = path_pair[1]
        target_texts, audio_paths = get_texts_and_audio_paths(libri_path)

        tokens, token_lengths = tokenize_sentences(target_texts)

        np.save(feat_dir+"/{}-{}s.npy".format(to_cat,'char'), tokens)
        np.save(feat_dir+"/{}-{}len.npy".format(to_cat,'char'), token_lengths)

        # audios
        process_libri_feats(audio_paths, to_cat, len(audio_paths)//_SAMPLE_THRESHOLD)

extract_inputs('./test/')

[11, 8, 3, 11, 4, 7, 3, 26, 21, 12, 23, 23, 8, 17, 3, 4, 3, 17, 24, 16, 5, 8, 21, 3, 18, 9, 3, 5, 18, 18, 14, 22, 3, 11, 12, 16, 22, 8, 15, 9, 3, 4, 16, 18, 17, 10, 3, 23, 11, 8, 16, 3, 4, 3, 11, 12, 22, 23, 18, 21, 28, 3, 18, 9, 3, 7, 4, 17, 6, 12, 17, 10, 3, 4, 3, 11, 12, 22, 23, 18, 21, 28, 3, 18, 9, 3, 6, 18, 22, 23, 24, 16, 8, 3, 4, 3, 14, 8, 28, 3, 23, 18, 3, 22, 11, 4, 14, 8, 22, 19, 8, 4, 21, 8, 22, 3, 22, 18, 17, 17, 8, 23, 22, 3, 4, 3, 22, 23, 24, 7, 28, 3, 18, 9, 3, 23, 11, 8, 3, 19, 18, 8, 23, 21, 28, 3, 18, 9, 3, 8, 21, 17, 8, 22, 23, 3, 7, 18, 26, 22, 18, 17, 3, 8, 23, 3, 6, 8, 23, 8, 21, 4, 2]
[11, 24, 10, 11, 22, 3, 26, 21, 12, 23, 23, 8, 17, 3, 4, 3, 7, 8, 15, 12, 10, 11, 23, 9, 24, 15, 3, 19, 4, 21, 23, 3, 9, 18, 21, 3, 11, 8, 21, 3, 4, 17, 7, 3, 22, 11, 8, 22, 3, 20, 24, 12, 23, 8, 3, 12, 17, 8, 27, 19, 21, 8, 22, 22, 12, 5, 15, 8, 2]
[12, 3, 11, 4, 19, 19, 8, 17, 3, 23, 18, 3, 11, 4, 25, 8, 3, 16, 4, 6, 3, 6, 18, 17, 17, 8, 15, 15, 22, 3, 5, 18, 27, 3, 9, 18, 21, 3,

  return np.array(tokens), np.array(lengths).astype(np.int32)
  lmes = np.array(log_mel_energies)


ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type numpy.ndarray).

Even though we have the spectrograms of audio recordings and their respective text translations, the Neural Network will not be able to read them in this format. We'll need to define the following variables so that both the encoder and decoder can run:

In [None]:
num_encoder_tokens = None  #
num_decoder_tokens = None
encoder_input_data = None
decoder_input_data = None
max_encoder_seq_length = None
max_decoder_seq_length = None

decoder_target_data = None

In [None]:

    # target_token_index = dict([(char, i) for i, char in enumerate(total_found_chars)])

def get_decoder_data():  # populate 2 arrays, offset by 1
    decoder_input_data = np.zeros(
        (num_output, max_decoder_seq_length, num_decoder_tokens), dtype="float32"
    )
    decoder_target_data = np.zeros(
        (num_output, max_decoder_seq_length, num_decoder_tokens), dtype="float32"
    )
    for char in found_chars:
        idx =target_token_index[char]
        array[idx] = 1.0

        # How do we do input/output arrays of diffff length?
        # https://keras.io/examples/nlp/lstm_seq2seq/
        # https://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/nlp/ipynb/lstm_seq2seq.ipynb#scrollTo=Tp6YF0oHXgay
        # https://medium.com/deep-learning-with-keras/seq2seq-part-e-encoder-decoder-for-variable-input-output-size-with-teacher-forcing-92c476dd9b0



In [None]:
# add function to AudioHandler
log_specgrams, \
target_text_chars, \
total_found_chars, \
num_target_text_tokens = extract_features_and_text_targets('./data_test/LibriSpeech/dev-clean/')
# https://www.kdnuggets.com/2017/12/audio-classifier-deep-neural-networks.html

 and Numericize them (i.e. assign integers to each token)

In [122]:
target_token_index = None  # A dict mapping an integer to each seen character
encoder_input_data = None
decoder_input_data = None  # An array of dicts mapping characters to whether 1 if seen, 0 if not seen
decoder_target_data = None  # An array of dicts mapping characters to whether 1 if seen, 0 if not seen


1272-135031-0002
1272-135031-0003
1272-135031-0018
1272-135031-0019
1272-135031-0011
1272-135031-0022
1272-135031-0020
1272-135031-0007
1272-135031-0006
1272-135031-0009
1272-135031-0014
1272-135031-0010
1272-135031-0013
1272-135031-0017
1272-135031-0021
1272-135031-0000
1272-135031-0005
1272-135031-0023
1272-135031-0016
1272-135031-0008
1272-135031-0012
1272-135031-0001
1272-135031-0015
1272-135031-0004
1272-135031-0024
1272-128104-0005
1272-128104-0008
1272-128104-0014
1272-128104-0011
1272-128104-0001
1272-128104-0013
1272-128104-0012
1272-128104-0004
1272-128104-0007
1272-128104-0002
1272-128104-0003
1272-128104-0000
1272-128104-0009
1272-128104-0006
1272-128104-0010
1272-141231-0011
1272-141231-0018
1272-141231-0030
1272-141231-0013
1272-141231-0012
1272-141231-0010
1272-141231-0019
1272-141231-0022
1272-141231-0006
1272-141231-0032
1272-141231-0001
1272-141231-0025
1272-141231-0027
1272-141231-0026
1272-141231-0015
1272-141231-0003
1272-141231-0024
1272-141231-0028
1272-141231-00

[]

In [None]:

batch_size = 64  # Batch size for training.
epochs = 100  # Number of epochs to train for.
latent_dim = 256  # Latent dimensionality of the encoding space.
num_samples = 10000  # Number of samples to train on.
data_path = "./data/LibriSpeech/dev-clean/1272/128104/1272-128104-0000.flac"




In [109]:
import keras

encoder_inputs = keras.Input(shape=(None, num_encoder_tokens))


<module 'keras.datasets.boston_housing' from '/home/a/.local/lib/python3.8/site-packages/keras/datasets/boston_housing.py'>



Next: build encoder-decoder architecture, based on https://github.com/tensorflow/nmt

Next: plug into RNN, run it on Google Colab. Save it

Next: let user input voice data, and run it through the model

More soucres
https://towardsdatascience.com/recognizing-speech-commands-using-recurrent-neural-networks-with-attention-c2b2ba17c837
https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
https://arxiv.org/pdf/1409.0473.pdf


Further Optimizations:

The Context Vector being fixed-length turns out to be a performance bottleneck; given a long input sentences, the the Encoder may not be able to store all of its output in the Context Vector in one timestep.

To solve the bottleneck, we could use the 'attention' mechanism which involves the Decoder selecting from all hidden states provided by the Encoder. For the sake of simplicity we won't be doing this optimization.
