In [2]:
# Access local .py files in the deepspeech dir
%rehashx

### Building an acoustic model

Raw audio can be described as a 1D vector:

$X = [x_1, x_2, \ldots]$

Pre-processing
+ Two ways to start:
  + Minimally pre-process, for example with a simple spectrogram
  + This is going away with autoencoding-style deep networks that can go directly from an audio source to a label
  
Spectrogram idea:
+ take a small window, say 20ms of waveform
+ $\log \lvert \mathrm{FFT}(X) \rvert^2$

This can then describe the frequency content in a local window.
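
The spectrogram recipe above can be sketched in NumPy. This is a minimal sketch, not the preprocessing of any particular library: the function name `log_spectrogram`, the 10 ms hop between windows, the Hann taper, and the small `eps` added before the log are all assumptions made for illustration.

```python
import numpy as np


def log_spectrogram(x, sample_rate, window_ms=20, step_ms=10, eps=1e-10):
    """Return log |FFT|^2 for overlapping windows of a 1D waveform x."""
    window = int(sample_rate * window_ms / 1000)  # samples per window
    step = int(sample_rate * step_ms / 1000)      # hop between windows (assumed)
    taper = np.hanning(window)                    # Hann taper (assumed)
    frames = np.stack([x[start:start + window] * taper
                       for start in range(0, len(x) - window + 1, step)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(power + eps)                    # eps avoids log(0)


# One second of a 440 Hz tone sampled at 16 kHz
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
features = log_spectrogram(np.sin(2 * np.pi * 440 * t), sample_rate)
print(features.shape)  # (number of frames, number of frequency bins)
```

With a 20 ms window at 16 kHz each frame holds 320 samples, so the real FFT yields 161 frequency bins per frame, the same size as the default `input_dim=161` in the model code further down.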

The goal is to create a DNN/RNN from which we can extract a transcription, trained from labeled pairs.

**_Main issue_**: length(x) != length(y)

We don't know how symbols in y map to frames of audio

### Connectionist Temporal Classification (CTC)

1. RNN output neurons $c$ encode distro over symbols. length(c) == length(x)
    + For phoneme-based model: $c \in \{AA,AE,AX,...,ER1,blank\}$
    
    + For grapheme-based model: $c \in \{A,B,C,D,...,Z,blank,space\}$

2. Define a mapping $\beta(c) \rightarrow y$

3. Max likelihood of $y^*$ under this model

**_Encoding_**

Output neurons define a distro over whole character seqs $c$, assuming independence between frames, via the following formula:

$P(c \mid x) \equiv \prod_{i=1}^{N} P(c_i \mid x)$

**_Mapping_**

Given a specific char seq $c$, squeeze out duplicates and blanks to yield a useful transcription.

This mapping implies a distro over _possible transcriptions_ $y$.
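
The encoding and mapping steps can be made concrete with a brute-force sketch. The names `collapse`, `sequence_prob`, and `transcription_prob`, the `_` blank symbol, and the toy two-frame probabilities are assumptions for illustration. Enumerating every character sequence is exponential in the number of frames, so this is only usable on tiny inputs.

```python
from itertools import product

import numpy as np

BLANK = "_"  # illustrative blank symbol


def collapse(c):
    """beta(c): merge repeated symbols, then drop blanks."""
    out = []
    prev = None
    for ch in c:
        if ch != prev and ch != BLANK:
            out.append(ch)
        prev = ch
    return "".join(out)


def sequence_prob(probs, alphabet, c):
    """P(c | x) = prod_i P(c_i | x) under the independence assumption."""
    return float(np.prod([probs[i, alphabet.index(ch)]
                          for i, ch in enumerate(c)]))


def transcription_prob(probs, alphabet, y):
    """P(y | x): sum of P(c | x) over every c with beta(c) == y."""
    num_frames = probs.shape[0]
    return sum(sequence_prob(probs, alphabet, c)
               for c in product(alphabet, repeat=num_frames)
               if collapse(c) == y)


print(collapse("HH_E_LL_LO"))  # HELLO

# Toy network output: 2 frames, alphabet of blank and "a"
alphabet = [BLANK, "a"]
probs = np.array([[0.6, 0.4],
                  [0.6, 0.4]])
print(transcription_prob(probs, alphabet, "a"))  # sums "_a", "a_" and "aa"
print(transcription_prob(probs, alphabet, ""))   # only "__" collapses to ""
```

In this toy case three character sequences collapse to "a" (0.24 + 0.24 + 0.16 = 0.64) while only the all-blank sequence collapses to the empty transcription (0.36).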

We can now understand the basics of the CTC model:

In [11]:
%%script false
# This won't work outside the actual lib, hence the magic script call
"""
Define functions used to construct a multilayer GRU CTC model, and
functions for training and testing it.
"""

import ctc
import logging
import keras.backend as K

from keras.layers import (BatchNormalization, Convolution1D, Dense,
                          Input, GRU, TimeDistributed)
from keras.models import Model
# from keras.optimizers import SGD
import lasagne

from utils import conv_output_length

logger = logging.getLogger(__name__)


def compile_train_fn(model, learning_rate=2e-4):
    """ Build the CTC training routine for speech models.
    Args:
        model: A keras model (built=True) instance
    Returns:
        train_fn (theano.function): Function that takes in acoustic inputs,
            and updates the model. Returns network outputs and ctc cost
    """
    logger.info("Building train_fn")
    acoustic_input = model.inputs[0]
    network_output = model.outputs[0]
    output_lens = K.placeholder(ndim=1, dtype='int32')
    label = K.placeholder(ndim=1, dtype='int32')
    label_lens = K.placeholder(ndim=1, dtype='int32')
    network_output = network_output.dimshuffle((1, 0, 2))

    ctc_cost = ctc.cpu_ctc_th(network_output, output_lens,
                              label, label_lens).mean()
    # Alternative Keras optimizer, kept for reference:
    # optimizer = SGD(nesterov=True, lr=learning_rate, momentum=0.9,
    #                 clipnorm=100)
    # updates = optimizer.get_updates(trainable_vars, [], ctc_cost)
    trainable_vars = model.trainable_weights
    grads = K.gradients(ctc_cost, trainable_vars)
    grads = lasagne.updates.total_norm_constraint(grads, 100)
    updates = lasagne.updates.nesterov_momentum(grads, trainable_vars,
                                                learning_rate, 0.99)
    train_fn = K.function([acoustic_input, output_lens, label, label_lens,
                           K.learning_phase()],
                          [network_output, ctc_cost],
                          updates=updates)
    return train_fn


def compile_test_fn(model):
    """ Build a testing routine for speech models.
    Args:
        model: A keras model (built=True) instance
    Returns:
        val_fn (theano.function): Function that takes in acoustic inputs,
            and calculates the loss. Returns network outputs and ctc cost
    """
    logger.info("Building val_fn")
    acoustic_input = model.inputs[0]
    network_output = model.outputs[0]
    output_lens = K.placeholder(ndim=1, dtype='int32')
    label = K.placeholder(ndim=1, dtype='int32')
    label_lens = K.placeholder(ndim=1, dtype='int32')
    network_output = network_output.dimshuffle((1, 0, 2))

    ctc_cost = ctc.cpu_ctc_th(network_output, output_lens,
                              label, label_lens).mean()
    val_fn = K.function([acoustic_input, output_lens, label, label_lens,
                        K.learning_phase()],
                        [network_output, ctc_cost])
    return val_fn


def compile_output_fn(model):
    """ Build a function that simply calculates the output of a model
    Args:
        model: A keras model (built=True) instance
    Returns:
        output_fn (theano.function): Function that takes in acoustic inputs,
            and returns network outputs
    """
    logger.info("Building output_fn")
    acoustic_input = model.inputs[0]
    network_output = model.outputs[0]
    network_output = network_output.dimshuffle((1, 0, 2))

    output_fn = K.function([acoustic_input, K.learning_phase()],
                           [network_output])
    return output_fn


def compile_gru_model(input_dim=161, output_dim=29, recur_layers=3, nodes=1024,
                      conv_context=11, conv_border_mode='valid', conv_stride=2,
                      initialization='glorot_uniform', batch_norm=True):
    """ Build a recurrent network (CTC) for speech with GRU units """
    logger.info("Building gru model")
    # Main acoustic input
    acoustic_input = Input(shape=(None, input_dim), name='acoustic_input')

    # Setup the network
    conv_1d = Convolution1D(nodes, conv_context, name='conv1d',
                            border_mode=conv_border_mode,
                            subsample_length=conv_stride, init=initialization,
                            activation='relu')(acoustic_input)
    if batch_norm:
        output = BatchNormalization(name='bn_conv_1d', mode=2)(conv_1d)
    else:
        output = conv_1d

    for r in range(recur_layers):
        output = GRU(nodes, activation='relu',
                     name='rnn_{}'.format(r + 1), init=initialization,
                     return_sequences=True)(output)
        if batch_norm:
            bn_layer = BatchNormalization(name='bn_rnn_{}'.format(r + 1),
                                          mode=2)
            output = bn_layer(output)

    # We don't softmax here because CTC does that
    network_output = TimeDistributed(Dense(
        output_dim, name='dense', activation='linear', init=initialization,
    ))(output)
    model = Model(input=acoustic_input, output=network_output)
    model.conv_output_length = lambda x: conv_output_length(
        x, conv_context, conv_border_mode, conv_stride)
    return model


### Finding max likelihood of $\theta$

$\theta^* = \arg\max_\theta \sum_i \log P(y^{*(i)} \mid x^{(i)})$

which is

$\theta^* = \arg\max_\theta \sum_i \log \sum_{c:\beta(c) = y^{*(i)}} P(c \mid x^{(i)})$

[The CTC paper](http://www.cs.toronto.edu/~graves/icml_2006.pdf) provides a DP algorithm to compute the inner sum & its gradient

Libraries do this efficiently for us these days, so there is no need to write it by hand:

+ Warp CTC [`baidu-research/warp-ctc`](https://github.com/baidu-research/warp-ctc)
+ Stanford CTC [`amaas/stanford-ctc`](https://github.com/amaas/stanford-ctc)
+ TensorFlow `tf.nn.ctc_loss`

These compute the following loss function and provide its gradient $\nabla$ with respect to _c_:

$$L(\theta) = \log P(y^{*(i)} \mid x^{(i)}) = CTC(c^{(i)}, y^{*(i)})$$
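
For intuition only, the inner sum can be computed with the forward recursion from the CTC paper. The sketch below is an assumption-laden illustration, not the code of any library listed above: the name `ctc_log_likelihood` and the toy inputs are made up, it works in plain probability space (real implementations use log space or rescaling to avoid underflow), and it does not compute the gradient.

```python
import numpy as np


def ctc_log_likelihood(probs, label, blank=0):
    """log P(label | x) via the CTC forward (alpha) recursion.

    probs: (T, V) array of per-frame symbol probabilities.
    label: list of symbol indices, without blanks.
    """
    num_frames = probs.shape[0]
    # Interleave blanks: [blank, l1, blank, l2, ..., blank]
    ext = [blank]
    for s in label:
        ext += [s, blank]
    size = len(ext)

    alpha = np.zeros((num_frames, size))
    alpha[0, 0] = probs[0, ext[0]]
    if size > 1:
        alpha[0, 1] = probs[0, ext[1]]

    for t in range(1, num_frames):
        for s in range(size):
            total = alpha[t - 1, s]              # stay on the same symbol
            if s >= 1:
                total += alpha[t - 1, s - 1]     # advance by one position
            # Skipping a blank is allowed only between two different symbols
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[t - 1, s - 2]
            alpha[t, s] = total * probs[t, ext[s]]

    # Valid paths end on the last symbol or on the trailing blank
    likelihood = alpha[-1, -1]
    if size > 1:
        likelihood += alpha[-1, -2]
    return np.log(likelihood)


# Toy network output: 2 frames, symbols (blank, "a"); label is "a"
probs = np.array([[0.6, 0.4],
                  [0.6, 0.4]])
log_p = ctc_log_likelihood(probs, [1])
print(np.exp(log_p))  # total probability of all paths collapsing to "a"
```

The recursion costs time proportional to the number of frames times the label length, instead of enumerating every character sequence.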

![](https://storage.googleapis.com/personal-notes/tricks.png)

Shorter utterances unroll into fewer RNN time steps, which helps reduce underflow/overflow and other training issues

**_Decoding_**

Approximate solution is so-called _max decoding_:

$$\beta(\arg\max_c P(c \mid x))$$

The above is "often terrible" but useful as a diagnostic
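
Max decoding is short enough to sketch directly. The function name `max_decode`, the `_` blank symbol, and the toy probabilities are assumptions for illustration.

```python
import numpy as np


def max_decode(probs, alphabet, blank=0):
    """Pick the most probable symbol per frame, then apply the mapping beta."""
    best_path = probs.argmax(axis=1)
    out = []
    prev = None
    for s in best_path:
        if s != prev and s != blank:  # drop repeats and blanks
            out.append(alphabet[s])
        prev = s
    return "".join(out)


alphabet = ["_", "a"]  # index 0 is the blank
probs = np.array([[0.6, 0.4],
                  [0.6, 0.4]])
print(repr(max_decode(probs, alphabet)))
```

In this toy case the single best path is all blanks (probability 0.36), so max decoding returns the empty string, even though the three paths that collapse to "a" sum to 0.64. This gap between the best path and the best transcription is one reason the approximation can perform badly.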

This diagram roughly matches the code we looked at earlier:

![](https://storage.googleapis.com/personal-notes/dlspeech_example.png)

**_Language models_** are a useful tool

+ Basic strategy: use beam search to maximize the combined acoustic model and language model probability of the transcription

Pseudo-implementation of the decoding process:

```py
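# Pseudocode: the helper functions below are placeholders, not a real API.
# A is a set of candidate transcript prefixes, T is the sequence of time
# steps (audio frames), LM is a language model function and AM is an
# acoustic model function.
def decode(AM, LM, A, T):
    for t in T:
        A_prime = []  # new candidates for this time step
        for c in A:
            # 1. Add a blank: prefix unchanged, rescored by the acoustic model
            A_prime.append(update_probability(add_blank(c), AM))
            # 2. Add a space: rescored by the language model
            A_prime.append(update_probability(add_space(c), LM))
            # 3. Add a character: rescored by the acoustic model
            A_prime.append(update_probability(add_char(c), AM))
        # Keep only the k most probable prefixes for the next step
        A = k_most_probable(A_prime)
    return A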
```
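
A runnable version of this loop is CTC prefix beam search. The sketch below is one common formulation, not necessarily the variant used in the lecture or in any library. The name `prefix_beam_search`, the beam width `k`, and the optional `lm` and `space` arguments are assumptions; `lm` is a hypothetical callable that returns a weight for a prefix and is applied only when a space is appended.

```python
from collections import defaultdict

import numpy as np


def prefix_beam_search(probs, alphabet, k=4, blank=0, lm=None, space=None):
    """Return the most probable transcription and its probability.

    probs: (T, V) array of per-frame symbol probabilities.
    lm:    optional hypothetical callable mapping a prefix (tuple of symbol
           indices) to a weight, applied when the space symbol is appended.
    """
    # Each prefix tracks (prob. of ending in blank, prob. of ending in non-blank)
    beam = {(): (1.0, 0.0)}
    for frame in probs:
        candidates = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_b, p_nb) in beam.items():
            for s, p in enumerate(frame):
                if s == blank:
                    # Add blank: prefix unchanged, scored by the acoustic model
                    candidates[prefix][0] += (p_b + p_nb) * p
                    continue
                extended = prefix + (s,)
                weight = lm(extended) if lm is not None and s == space else 1.0
                if prefix and s == prefix[-1]:
                    # A repeat merges into the prefix unless a blank separates it
                    candidates[prefix][1] += p_nb * p
                    candidates[extended][1] += p_b * p * weight
                else:
                    candidates[extended][1] += (p_b + p_nb) * p * weight
        # Keep the k most probable prefixes for the next frame
        ranked = sorted(candidates.items(),
                        key=lambda item: sum(item[1]), reverse=True)
        beam = {prefix: tuple(scores) for prefix, scores in ranked[:k]}
    best, scores = max(beam.items(), key=lambda item: sum(item[1]))
    return "".join(alphabet[s] for s in best), sum(scores)


alphabet = ["_", "a"]  # index 0 is the blank
probs = np.array([[0.6, 0.4],
                  [0.6, 0.4]])
text, prob = prefix_beam_search(probs, alphabet)
print(repr(text), prob)
```

Because probabilities are accumulated per prefix rather than per path, this toy example returns "a" with total probability of about 0.64, where picking the best symbol per frame would return the empty string.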

Rescoring with a neural LM can improve on an N-gram LM trained from big corpora.

Application design is very important for a successful deep speech pipeline.

We want to find data that matches our goals.

|Styles of speech|Issues|Applications|
|-|-|-|
|Read|Disfluency/stuttering|Dictation|
|Conversational|Noise|Meeting transcription|
|Spontaneous|Mic quality/#channels|Call centers|
|Command/control|Far field|Device control|
||Reverb/echo|Mobile texting|
||[Lombard effect](https://en.wikipedia.org/wiki/Lombard_effect)|Home/IoT/Cars|
||Speaker accents||

Additive noise can help by synthesizing noisy environments.
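
One way to synthesize a noisy environment is to mix a noise recording into clean speech at a chosen signal-to-noise ratio. This is a minimal sketch: the function name `add_noise`, the SNR parameterization, and the synthetic signals standing in for real recordings are all assumptions.

```python
import numpy as np


def add_noise(clean, noise, snr_db):
    """Mix noise into a clean waveform at the requested SNR (in dB)."""
    noise = np.resize(noise, clean.shape)  # repeat or trim noise to match length
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that clean_power / scaled_noise_power == 10^(snr_db / 10)
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise


rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # stand-in for speech
noise = rng.normal(size=4000)                               # stand-in for noise
noisy = add_noise(clean, noise, snr_db=10)

achieved = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
print(achieved)  # SNR actually achieved, in dB
```

Drawing a different noise clip and SNR for each training example yields many noisy variants of the same labeled utterance.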

Engineer the data pipeline to be robust against noise, **_not the recognition pipeline!_**

Be aware of inefficiencies in off-the-shelf (OTS) code, and pay special attention to minibatch sizes.