# End-to-End Models for Speech Processing

## Classical Speech Recognition

You build a statistical model of speech that goes from text sequences to audio features.

![12_classic_speech_recognition](./assets/12_classic_speech_recognition.png)

Each stage of the pipeline above uses a different statistical model.

* Speech preprocessing uses classical signal processing
* Language model uses an n-gram model
* Pronunciation uses a pronunciation table
* Acoustic model uses a Gaussian mixture model

You look at the waveform, compute some features for it, and then perform inference with your model to figure out what was said.
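One common way to write this inference step, assuming the standard generative formulation (the notes above do not spell it out), is to pick the transcript that best explains the audio. Here `X` stands for the audio features and `Y` for a candidate text sequence:

$$
\hat{Y} = \arg\max_{Y} P(X \mid Y) \, P(Y)
$$

Under this reading, `P(Y)` is the role of the language model, while `P(X|Y)` is covered by the pronunciation and acoustic models together.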

## Motivation

Although people later discovered that each of the models above can be improved significantly by using deep neural networks and recurrent neural networks, the problem is that these models may not fit well together. This drives people to seek an end-to-end model that performs all the tasks above in one go.

## Connectionist Temporal Classification

Given an audio signal,

$$
X = x_{1}, x_{2}, ..., x_{T}
$$

where each `x` is a frame of the signal, and a corresponding output text,

$$
Y = y_{1}, y_{2}, ..., y_{L}
$$

where each `y` could be a word or a character.

We want to learn the following probabilistic model, where `T` > `L` (there are more audio frames than output symbols):

$$
P(Y|X)
$$

`Y` is just a text sequence or transcript and `X` is the audio/processed spectrogram.

![ctc](./assets/12_ctc.png)

Here's how the frame predictions map to an output sequence.

Each timestep can produce a symbol or letter through the softmax output. Some tokens may be duplicated like

```
cc<b>aa<b>t<b>
```

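The original transcript maps to all possible paths in the duplicated space. Each path below collapses to the same transcript once repeated symbols are merged and the `<b>` symbols are removed.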

```
cc<b>aa<b>t<b> => cat
cc<b><b>a<b>t<b> => cat
cccc<b>aaaa<b>tttt<b> => cat
cccccc<b>aa<b>tt<b> => cat
```
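The mapping above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper names `tokenize` and `ctc_collapse` are made up for this example, and the paths are assumed to be written as strings with `<b>` for the blank symbol, as in the listing above.

```python
from itertools import groupby

BLANK = "<b>"  # blank symbol, written the same way as in the examples above


def tokenize(text):
    """Split a string such as 'cc<b>aa<b>t<b>' into one symbol per frame."""
    tokens, i = [], 0
    while i < len(text):
        if text.startswith(BLANK, i):
            tokens.append(BLANK)
            i += len(BLANK)
        else:
            tokens.append(text[i])
            i += 1
    return tokens


def ctc_collapse(path):
    """Collapse a frame-level path: merge adjacent repeats, then drop blanks."""
    merged = [symbol for symbol, _ in groupby(path)]
    return "".join(symbol for symbol in merged if symbol != BLANK)


for path in ["cc<b>aa<b>t<b>", "cc<b><b>a<b>t<b>", "cccccc<b>aa<b>tt<b>"]:
    print(path, "=>", ctc_collapse(tokenize(path)))
```

Because repeats are merged before blanks are removed, a blank between two identical letters keeps them separate: `l<b>l` collapses to `ll`, while `ll` collapses to `l`.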

The score of any path is the sum of the scores of the individual categories at each time step (in log space, which corresponds to multiplying the per-step probabilities). The probability of any transcript is the sum of the probabilities of all paths that correspond to that transcript. `<b>` is known as the blank symbol.
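To make the "sum over paths" concrete, here is a brute-force sketch that enumerates every path for a tiny example. The per-frame probabilities are made-up numbers and the function names are hypothetical. Enumeration grows exponentially with the number of frames, so practical CTC implementations compute the same sum with dynamic programming instead.

```python
from itertools import groupby, product

BLANK = "<b>"


def collapse(path):
    """Merge adjacent repeated symbols, then remove blanks."""
    merged = [symbol for symbol, _ in groupby(path)]
    return "".join(symbol for symbol in merged if symbol != BLANK)


def transcript_probability(frame_probs, transcript):
    """Sum the probability of every path that collapses to `transcript`.

    frame_probs: one dict per timestep, mapping symbol -> softmax probability.
    """
    symbols = list(frame_probs[0])
    total = 0.0
    for path in product(symbols, repeat=len(frame_probs)):
        if collapse(path) != transcript:
            continue
        # Path probability is the product of the per-frame probabilities.
        path_prob = 1.0
        for t, symbol in enumerate(path):
            path_prob *= frame_probs[t][symbol]
        total += path_prob
    return total


# Toy softmax outputs for 3 frames over the alphabet {a, <b>} (made-up values).
frame_probs = [
    {"a": 0.6, BLANK: 0.4},
    {"a": 0.5, BLANK: 0.5},
    {"a": 0.3, BLANK: 0.7},
]

print(transcript_probability(frame_probs, "a"))  # about 0.77
```

With three frames, the only transcripts reachable here are the empty string, `a`, and `aa`, and their probabilities add up to 1.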

![ctc_prediction](./assets/12_ctc_prediction.png)

### Language Model

The end result will produce transcripts that sound correct but lack correct spelling and grammar. Although more training data can help, eventually a language model is required to fix these problems. With simple language model rescoring, the word error rate goes from 30.1% to 8.7%. Google's CTC implementation fixes these problems by integrating a language model into CTC during training.
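As a rough illustration of what rescoring means, the sketch below re-ranks candidate transcripts by adding a weighted language model score to the acoustic score. Everything here is hypothetical: the toy unigram model, the candidate scores, and the `lm_weight` value are invented for the example, and this is not the setup behind the error rates quoted above.

```python
import math

# Hypothetical toy unigram language model (made-up probabilities).
TOY_LM = {"the": 0.5, "cat": 0.3, "sat": 0.19, "kat": 0.01}


def toy_lm_log_prob(text):
    """Log-probability of a sentence under the toy unigram model."""
    return sum(math.log(TOY_LM.get(word, 1e-6)) for word in text.split())


def rescore(candidates, lm_log_prob, lm_weight=0.5):
    """Return the (text, acoustic_log_prob) pair with the best combined score."""
    return max(
        candidates,
        key=lambda cand: cand[1] + lm_weight * lm_log_prob(cand[0]),
    )


# Candidate transcripts with made-up acoustic log-probabilities.
candidates = [("the kat sat", -4.0), ("the cat sat", -4.2)]

best_text, _ = rescore(candidates, toy_lm_log_prob)
print(best_text)
```

On acoustic score alone the misspelled candidate wins, since both sound alike. The language model term pushes the correctly spelled transcript to the top.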

## Sequence-to-Sequence 

In the CTC model, the model makes each frame's prediction based only on the input data. Once a prediction is made, there is no room for adjustment. The next improvement we can make is to use a sequence-to-sequence (S2S) model that passes a hidden state forward. The prediction at each timestep then factors in the predictions from previous timesteps as well as the current waveform frame input.

$$
P(y_{i} \mid y_{1}, \dots, y_{i-1}, x)
$$
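By the chain rule of probability, multiplying these per-step terms gives the probability of the whole transcript. Here `L` is the transcript length and `x` is the full input, following the notation used earlier:

$$
P(Y \mid x) = \prod_{i=1}^{L} P(y_{i} \mid y_{1}, \dots, y_{i-1}, x)
$$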

The challenge with S2S training is that audio sequences can be very long. Each second is made up of 100 frames, so a 10-second audio input has a thousand input frames. Even for an LSTM, this stretches its limit. Therefore, we must use **attention** to guide where to look for the relevant input.

![attention](./assets/12_attention.gif)

### Listen, Attend, and Spell

*Neural Machine Translation by Jointly Learning to Align and Translate*, 2015. 

First we have a bi-directional RNN as an encoder that acts as the listener. For every time step of the input, it produces a vector representation `h[t]` that encodes the input. A decoder then generates the next character at every timestep. You take the state vector of the decoder and compare it with each of the encoder's hidden states. The state vector is known as a **query**. We want to compute the similarity score between the decoder query `s` and each encoder hidden state `h[t]`. The score tells us where to find the data we are looking for. This is where the attention is.

$$
e_{t} = f\left([h_{t}, s]\right)
$$
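A minimal sketch of one attention step is shown below. The notes leave the scoring function `f` unspecified, so this example swaps in a plain dot product between `h[t]` and `s` as an assumption. The softmax normalization and the weighted sum into a context vector are the usual next steps, and the names `attend` and `softmax` and all numbers are illustrative only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(encoder_states, query):
    """Compute attention weights and a context vector for one decoder step.

    encoder_states: list of vectors h[t], one per input time step.
    query: decoder state vector s.
    Assumption: f is a dot product; the scoring function could also be learned.
    """
    scores = [
        sum(h_d * s_d for h_d, s_d in zip(h, query)) for h in encoder_states
    ]
    weights = softmax(scores)
    # Context vector: attention-weighted average of the encoder states.
    context = [
        sum(w * h[d] for w, h in zip(weights, encoder_states))
        for d in range(len(query))
    ]
    return weights, context


# Toy encoder states for 3 time steps and a toy decoder query (made-up values).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [0.0, 2.0]

weights, context = attend(encoder_states, query)
print(weights)
print(context)
```

In this toy example the second encoder state is most similar to the query, so it receives the largest weight and dominates the context vector.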

![las_acoustic_model](./assets/12_las_acoustic_model.png)

The decoder is another recurrent neural network, which computes the next symbol of the sequence by learning where to look for it in the audio signal.

![las_architecture](./assets/12_las_architecture.png)

### Limitation

This is not an online model. All inputs must be received before transcripts can be produced. Attention is a computational bottleneck since every output token attends to every input time step. The length of the input has a big impact on accuracy as well.

## Neural Transducer