<a href="https://colab.research.google.com/github/desai-nitin/BootstrapPortfolioProject/blob/master/speech_Recognition_Tensorflow_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Speech recognition using a CNN

This is a simple example notebook for a speech recognition task using a 2D Convolutional Neural Network. This architecture may not provide state-of-the-art results, but it is easy to understand, fast, and lightweight (244.2K parameters and 9.7M calculations). The same CNN model could be useful for pattern recognition in 1D signals other than sound, such as vibration or earthquake signals, especially when the training dataset is small.

# Notebook Index
- [Architecture overview](#overview)
- [Example dataset](#dataset)
- [Data preprocessing](#preprocessing)
- [Model](#model)
- [Training and evaluation](#training)
- [Predictions for streaming audio](#serving)
- [References](#references)

## Architecture overview<a id='overview'></a>

To train a speech recognition model, we need audio recordings of speech and the corresponding word labels. An audio recording is a single-channel or multi-channel sequence of amplitude values recorded at equal intervals over time (the [sample rate](https://en.wikipedia.org/wiki/Sampling_(signal_processing)) is the number of values per second).
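To make the relationship between duration, sample rate, and number of samples concrete, here is a minimal NumPy sketch that synthesizes a one-second tone. The 440 Hz tone and the variable names are illustrative assumptions, not part of the dataset or the model.

```python
import numpy as np

sample_rate = 16000      # samples per second, the rate used later in this notebook
duration_s = 1.0         # clip length in seconds
tone_hz = 440.0          # illustrative test tone (assumption)

# Time stamp of every sample: equally spaced, 1 / sample_rate apart.
t = np.arange(int(sample_rate * duration_s)) / sample_rate
# Mono signal: one amplitude value per time step, in the range [-1, 1].
mono = np.sin(2 * np.pi * tone_hz * t).astype(np.float32)
# A two-channel recording stacks one such sequence per channel.
stereo = np.stack([mono, mono], axis=-1)

print(mono.shape, stereo.shape)
```

One second of mono audio at 16 kHz is therefore a vector of 16,000 values, which is the input length the rest of the notebook assumes.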

Sequence-to-sequence models are trained on multi-word audio and produce multi-word predictions at inference time. They can produce better results but are more computationally expensive. This implementation instead uses a fixed-length window of audio as input, sized to fit a single word. The training dataset is a collection of individual word recordings and their labels. To process streaming audio, we use an overlapping moving window technique.

### Imports

In [0]:
!pip install matplotlib numpy scipy scikit-learn pandas tensorflow

In [0]:
import matplotlib.pyplot as plt
import os
import tensorflow as tf
import numpy as np
import IPython.display as ipd
from tensorflow.contrib.framework.python.ops import audio_ops
from scipy.io import wavfile
from sklearn.metrics import confusion_matrix
import pandas as pd

## Parameters

In [0]:
# parameters

data_url = 'http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz'
data_dir = 'speech_dataset'
model_dir = 'model_dir'
vocabulary = ['__other__', 'up', 'down', 'left', 'right', 'on', 'off', 'stop', 'go']
test_size = .1
batch_size = 128
learning_rate = 0.001
max_training_steps = 300000
sample_rate = 16000
window_size_ms = 30.
window_stride_ms = 10.

tf.logging.set_verbosity(tf.logging.WARN)

## Example dataset<a id='dataset'></a>

This example uses the [speech commands dataset v0.02](https://arxiv.org/pdf/1804.03209.pdf). It is a collection of 105,000 single-word [WAVE](https://en.wikipedia.org/wiki/WAV) audio files stored in folders whose names are the labels. You can easily record your own dataset following the same folder convention (recordings of the word "cat" go inside a folder named "cat"). If the recordings are in a format other than WAVE, you will have to modify the decoding function.
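The folder convention can be read back with a few lines of standard-library Python. The sketch below builds a throwaway directory with empty files to show the layout; `list_labeled_files` and the demo file names are hypothetical and only illustrate the idea.

```python
import os
import tempfile

def list_labeled_files(root):
    """Return (path, label) pairs where the label is the parent folder name."""
    pairs = []
    for label in sorted(os.listdir(root)):
        label_dir = os.path.join(root, label)
        if not os.path.isdir(label_dir):
            continue  # skip loose files such as README.md or LICENSE
        for name in sorted(os.listdir(label_dir)):
            if name.lower().endswith('.wav'):
                pairs.append((os.path.join(label_dir, name), label))
    return pairs

# Build a tiny throwaway dataset (empty files) to demonstrate the layout.
demo_root = tempfile.mkdtemp()
for label, name in [('cat', 'a.wav'), ('cat', 'b.wav'), ('dog', 'c.wav')]:
    os.makedirs(os.path.join(demo_root, label), exist_ok=True)
    open(os.path.join(demo_root, label, name), 'wb').close()

demo_pairs = list_labeled_files(demo_root)
print([label for _, label in demo_pairs])
```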

### Download and extract the speech commands dataset v0.02

The archive file is about 2.3 GB and is downloaded and extracted the first time you run the next code cell.
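Outside a notebook, where the `!wget` and `!tar` shell magics are not available, the same "download only once" logic could be written with the standard library. This is a sketch under that assumption; `maybe_download_and_extract` is a hypothetical helper name.

```python
import os
import tarfile
import urllib.request

def maybe_download_and_extract(url, dest_dir):
    """Download and unpack a .tar.gz archive unless dest_dir already exists."""
    if os.path.exists(dest_dir):
        return False  # assume the data was downloaded by an earlier run
    os.makedirs(dest_dir)
    archive_path = os.path.join(dest_dir, os.path.basename(url))
    urllib.request.urlretrieve(url, archive_path)
    with tarfile.open(archive_path, 'r:gz') as archive:
        archive.extractall(dest_dir)
    return True
```

It would be called as `maybe_download_and_extract(data_url, data_dir)`. Like the cell below, it only checks that the directory exists, so an interrupted download has to be cleaned up by hand.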

In [4]:
# download and extract the dataset if it is not already present
if not os.path.exists(data_dir):
    !wget $data_url -P $data_dir
    print('Extracting ...')
    !tar -xzf {data_dir}/speech_commands_v0.02.tar.gz -C $data_dir

display(ipd.Audio(os.path.join(data_dir, 'happy/0227998e_nohash_1.wav')))
!ls speech_dataset

--2019-10-02 14:48:36--  http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz
Resolving download.tensorflow.org (download.tensorflow.org)... 74.125.69.128, 2607:f8b0:4001:c08::80
Connecting to download.tensorflow.org (download.tensorflow.org)|74.125.69.128|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2428923189 (2.3G) [application/gzip]
Saving to: ‘speech_dataset/speech_commands_v0.02.tar.gz’


2019-10-02 14:49:19 (54.1 MB/s) - ‘speech_dataset/speech_commands_v0.02.tar.gz’ saved [2428923189/2428923189]

Extracting ...


_background_noise_  four     on				   tree
backward	    go	     one			   two
bed		    happy    README.md			   up
bird		    house    right			   validation_list.txt
cat		    learn    seven			   visual
dog		    left     sheila			   wow
down		    LICENSE  six			   yes
eight		    marvin   speech_commands_v0.02.tar.gz  zero
five		    nine     stop
follow		    no	     testing_list.txt
forward		    off      three


### Stratified train-evaluation split

This implementation is limited to a vocabulary of eight words plus one special class labeled "\_\_other\_\_". The "\_\_other\_\_" class is a random sample of other words that are not in the vocabulary. Every word is represented by an integer ID given by its position in the vocabulary list. The only input feature for the model is the WAVE filename.
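The word-to-ID mapping described above can be sketched in a few lines. `label_for` is a hypothetical helper used only for illustration; the notebook itself builds the IDs with `enumerate` and `vocabulary.index`.

```python
vocabulary = ['__other__', 'up', 'down', 'left', 'right', 'on', 'off', 'stop', 'go']

# The position in the list is the integer class ID.
word_to_id = {word: i for i, word in enumerate(vocabulary)}

def label_for(word):
    """Map any folder name to a class ID; unknown words fall back to '__other__'."""
    return word_to_id.get(word, word_to_id['__other__'])

print(label_for('up'), label_for('left'), label_for('off'), label_for('dog'))
```

A word such as "dog" is not in the vocabulary, so it ends up in class 0 together with all other out-of-vocabulary words.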

In [5]:
from sklearn.model_selection import train_test_split

word_files = []
word_ids = []
for word_id, word in enumerate(vocabulary):
    if word == '__other__':
        continue
    word_dir = os.path.join(data_dir, word)
    files = [os.path.join(data_dir, word, f) for f in os.listdir(word_dir)]
    assert len(files)
    word_files += files
    word_ids += [word_id] * len(files)

if '__other__' in vocabulary:
    all_other_dirs = [i for i in os.listdir(data_dir) if os.path.isdir(os.path.join(data_dir, i))]
    all_other_dirs = [i for i in all_other_dirs if i not in vocabulary]
    other_files = []
    for word_dir in all_other_dirs:
        other_files += [os.path.join(data_dir, word_dir, f) for f in os.listdir(os.path.join(data_dir, word_dir))]
    np.random.seed(0)
    np.random.shuffle(other_files)
    average_examples_per_word = len(word_ids) // (len(vocabulary)-1)
    other_count = average_examples_per_word * 2
    word_files += other_files[:other_count]
    word_ids += [vocabulary.index('__other__')] * other_count

word_files = np.array(word_files)
word_ids = np.array(word_ids)

train_inputs, eval_inputs, train_labels, eval_labels = train_test_split(
    word_files, word_ids, test_size=test_size, stratify=word_ids, random_state=0)

print('Train size: {}, Evaluation size:{}'.format(len(train_inputs), len(eval_inputs)))
print('*' * 50)
print('Sample of evaluation inputs:')
print(eval_inputs[:5])
print('*' * 50)
print('Sample of evaluation labels:')
print(eval_labels[:5])

Train size: 34380, Evaluation size:3821
**************************************************
Sample of evaluation inputs:
['speech_dataset/up/879a2b38_nohash_0.wav'
 'speech_dataset/left/cee22275_nohash_1.wav'
 'speech_dataset/left/9a7c1f83_nohash_2.wav'
 'speech_dataset/dog/01d22d03_nohash_1.wav'
 'speech_dataset/off/332d33b1_nohash_2.wav']
**************************************************
Sample of evaluation labels:
[1 3 3 0 6]


# Data preprocessing<a id='preprocessing'></a>

The example model requires some input preprocessing.

## WAVE contents

The first step is to read the raw binary contents of the input files.

In [6]:
wav_contents = tf.read_file(os.path.join(data_dir, 'happy/0227998e_nohash_1.wav'))
with tf.Session() as sess:
    wav_contents_val = sess.run(wav_contents)

print('This is what the contents of a WAVE file look like:')
wav_contents_val[:100] + ' ... + {} more bytes'.format(len(wav_contents_val[100:])).encode()

This is what the contents of a WAVE file look like:


b'RIFFpn\x00\x00WAVEfmt \x10\x00\x00\x00\x01\x00\x01\x00\x80>\x00\x00\x00}\x00\x00\x02\x00\x10\x00dataLn\x00\x00\xf9\xff\xf5\xff\xf5\xff\xf5\xff\xfe\xff#\x00p\x00\xc8\x00\xfa\x00\xcb\x00M\x00\xf2\xff\xd7\xff\xd2\xff\xd4\xff\xce\xff\xe1\xff\xf6\xff\x0b\x001\x00O\x00g\x00~\x00\xa8\x00\xa2\x00M\x00\xc6\xff=\xff ... + 28180 more bytes'

## Decode WAVE contents to a float32 tensor

A decoded WAVE file is a sequence of numbers representing amplitude values recorded at equal intervals over time.

![](https://storage.googleapis.com/kf-pipeline-contrib-public/release-0.1.6/kfp-components/notebooks/speech_recognition/assets/audio.png)
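To show what decoding involves, here is a sketch that decodes 16-bit PCM WAVE bytes with the standard-library `wave` module and NumPy instead of TensorFlow. It assumes 16-bit samples scaled to roughly [-1, 1] by dividing by 32768; `decode_pcm16_wav` is a hypothetical helper and not the function used by the model.

```python
import io
import wave
import numpy as np

def decode_pcm16_wav(wav_bytes):
    """Decode 16-bit PCM WAVE bytes to (float32 samples, sample rate)."""
    with wave.open(io.BytesIO(wav_bytes), 'rb') as wav:
        assert wav.getsampwidth() == 2, 'this sketch handles 16-bit PCM only'
        channels = wav.getnchannels()
        rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    # Samples are little-endian signed 16-bit integers, interleaved by channel.
    samples = np.frombuffer(raw, dtype='<i2').reshape(-1, channels)
    return samples.astype(np.float32) / 32768.0, rate

# Round trip: write three known samples to an in-memory WAVE file and decode them.
buffer = io.BytesIO()
with wave.open(buffer, 'wb') as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(16000)
    wav.writeframes(np.array([0, 16384, -32768], dtype='<i2').tobytes())

decoded, rate = decode_pcm16_wav(buffer.getvalue())
print(decoded[:, 0], rate)
```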

In [7]:
def decode_wav(wav_contents, desired_samples=16000):
    audio = audio_ops.decode_wav(wav_contents, desired_channels=1, desired_samples=desired_samples)[0]
    return tf.reshape(audio, [1, -1])

print('Function decode_wav returns:', decode_wav(tf.constant(b"foo")))

Function decode_wav returns: Tensor("Reshape:0", shape=(1, 16000), dtype=float32)


## Audio augmentation

The goal here is to randomly modify the input audio in a way that mimics the natural variation of voice. Much more can be done, such as adding background noise or randomly changing the speed.
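The idea can be sketched in plain NumPy before looking at the TensorFlow version: scale the volume by a random gain, trim a random amount from both ends, and pad back to the original length so the word lands at a different position. This is an illustration under simplifying assumptions (a plain normal distribution instead of a truncated one, and a hypothetical `augment` helper), not a drop-in replacement.

```python
import numpy as np

def augment(audio, rng, max_trim=4000):
    """Random gain, random trim at both ends, random re-padding to the original length."""
    n = len(audio)
    # Random volume: gain drawn around 1.0.
    out = audio * rng.normal(loc=1.0, scale=0.2)
    # Trim a random number of samples from the start and from the end.
    start = rng.integers(0, max_trim)
    end = rng.integers(0, max_trim)
    out = out[start:n - end]
    # Re-pad with zeros, splitting the removed length randomly between both sides,
    # which shifts the word within the fixed-length window.
    removed = start + end
    left = rng.integers(0, removed + 1)
    padded = np.concatenate([np.zeros(left), out, np.zeros(removed - left)])
    return padded.astype(np.float32)

rng = np.random.default_rng(0)
clip = np.ones(16000, dtype=np.float32)
augmented = augment(clip, rng)
print(augmented.shape)
```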

In [8]:
def audio_augmentation(audio):
    # this input is expected to have shape: [1, 16000]
    audio = tf.reshape(audio, [16000], name='augmentation_input_reshape')
    # random volume adjustment
    audio = audio * tf.random.truncated_normal(shape=[], mean=1., stddev=.2)
    # random trim from the start and end of the audio
    start_offset = tf.random.uniform(shape=[], minval=0, maxval=16000//4, dtype=tf.int32)
    end_offset = tf.random.uniform(shape=[], minval=1, maxval=16000//4, dtype=tf.int32)
    # negative indexing such as audio[:-end_offset] does not work as expected when end_offset is 0,
    # so append a dummy 0 at the end and keep end_offset at least 1
    audio = tf.concat([audio, [0.]], axis=0)
    audio = audio[start_offset:-end_offset]
    # change the center of the audio
    max_move = start_offset + end_offset
    start_pad = tf.random.uniform(shape=[], minval=0, maxval=max_move, dtype=tf.int32)
    end_pad = max_move - start_pad - 1
    audio = tf.concat([tf.zeros(start_pad), audio, tf.zeros(end_pad)], axis=0)

    return tf.reshape(audio, [1, 16000], name='augmentation_output_reshape')

print('Function augmentation returns:', audio_augmentation(tf.zeros(shape=(1, 16000), dtype=tf.float32)))

Function augmentation returns: Tensor("augmentation_output_reshape:0", shape=(1, 16000), dtype=float32)


### Create spectrogram

A [spectrogram](https://en.wikipedia.org/wiki/Spectrogram) shows how the frequency content of the decoded audio changes over time, and it can be treated as an image. In other words, we convert every WAVE file to a one-channel image, which allows us to treat this problem as grayscale image classification.

The function `create_spectrogram` creates a spectrogram and then converts it to 40 mel-frequency cepstral coefficients (MFCCs) per time frame, which changes the resolution and frequency range of the spectrogram to suit human speech.

![](https://storage.googleapis.com/kf-pipeline-contrib-public/release-0.1.6/kfp-components/notebooks/speech_recognition/assets/spectrogram.png)
![](https://storage.googleapis.com/kf-pipeline-contrib-public/release-0.1.6/kfp-components/notebooks/speech_recognition/assets/audio_s.png)

In [9]:
def create_spectrogram(audio):
    window_size = int(sample_rate * window_size_ms / 1000)
    stride = int(sample_rate * window_stride_ms / 1000)
    dct_coefficient_count = 40

    spectrogram = audio_ops.audio_spectrogram(
        input=tf.reshape(audio, [-1, 1]),
        window_size=window_size,
        stride=stride,
        magnitude_squared=True)

    speech_optimized_spectrogram = audio_ops.mfcc(
        spectrogram,
        sample_rate=sample_rate,
        dct_coefficient_count=dct_coefficient_count)
    
    return tf.expand_dims(tf.squeeze(speech_optimized_spectrogram, axis=0), axis=-1)

print('Function create_spectrogram returns:', create_spectrogram(tf.zeros(shape=(1, 16000), dtype=tf.float32)))

Function create_spectrogram returns: Tensor("ExpandDims:0", shape=(98, 40, 1), dtype=float32)
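The 98 time frames in the shape above follow directly from the window and stride parameters. The NumPy sketch below reproduces the frame count and computes a plain power spectrogram. It leaves out the MFCC step, and the Hann window and `fft_length=512` are assumptions of this sketch rather than documented behaviour of `audio_ops`.

```python
import numpy as np

sample_rate = 16000
window_size = int(sample_rate * 30.0 / 1000)   # 30 ms -> 480 samples
stride = int(sample_rate * 10.0 / 1000)        # 10 ms -> 160 samples

def frame_count(num_samples, window_size, stride):
    """Number of full windows that fit when sliding by `stride` samples."""
    return 1 + (num_samples - window_size) // stride

def power_spectrogram(audio, window_size, stride, fft_length=512):
    """Short-time power spectrum of a 1D signal, one row per time frame."""
    n_frames = frame_count(len(audio), window_size, stride)
    # Index matrix of shape (n_frames, window_size) built by broadcasting.
    idx = np.arange(window_size)[None, :] + stride * np.arange(n_frames)[:, None]
    frames = audio[idx] * np.hanning(window_size)
    return np.abs(np.fft.rfft(frames, n=fft_length, axis=-1)) ** 2

spec = power_spectrogram(np.zeros(16000, dtype=np.float32), window_size, stride)
print(frame_count(16000, window_size, stride), spec.shape)
```

With a 480-sample window moved in 160-sample steps, one second of audio yields 1 + (16000 - 480) // 160 = 98 frames.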


## Reading input data

The input function takes a list of filenames and applies the following transformations:
- optionally shuffle the filenames
- read the contents of a WAVE file into a raw sequence of bytes
- decode the file contents to a numerical sequence (amplitude measurements)
- create a spectrogram (98x40x1 grayscale image) from the audio
- group a number of spectrograms into a batch (Nx98x40x1)

In [10]:
def input_fn(filenames, labels=None, batch_size=64, repeat=1, shuffle=False, augmentation=False):
    if labels is None:
        labels = np.zeros(len(filenames))
    if not shuffle:
        num_parallel_calls = None
    else:
        num_parallel_calls = 4
    
    dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
    
    if shuffle:
        dataset = dataset.shuffle(buffer_size=200)

    # read file contents
    dataset = dataset.map(
        lambda filename, label: (tf.read_file(filename), label),
        num_parallel_calls=num_parallel_calls)

    # decode wav files
    dataset = dataset.map(
        lambda wav_contents, label: (decode_wav(wav_contents), label),
        num_parallel_calls=num_parallel_calls)

    # augmentation of the audio
    if augmentation:
        dataset = dataset.map(
            lambda audio, label: (audio_augmentation(audio), label),
            num_parallel_calls=num_parallel_calls)

    # spectrogram
    dataset = dataset.map(
        lambda audio, label: ({'spectrogram': create_spectrogram(audio)}, label),
        num_parallel_calls=num_parallel_calls)

    return dataset.batch(batch_size).repeat(repeat)

# showing dataset output
input_fn(train_inputs, train_labels)

<DatasetV1Adapter shapes: ({spectrogram: (?, 98, 40, 1)}, (?,)), types: ({spectrogram: tf.float32}, tf.int64)>

## Model<a id='model'></a>

The model is based on the research paper [Convolutional Neural Networks for Small-footprint Keyword Spotting](http://www.isca-speech.org/archive/interspeech_2015/papers/i15_1478.pdf). In essence, it chains `Conv -> MaxPool -> Conv -> Dense` layers. This architecture has been shown to outperform DNNs with far fewer parameters.
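It can help to trace how the tensor shape changes through the `Conv -> MaxPool -> Conv -> Dense` chain. The sketch below does this with plain arithmetic, assuming `'same'` padding, stride-1 convolutions with 64 filters, and a stride-2 pooling layer as in the code cell that follows. The helper names are hypothetical.

```python
import math

def same_conv(shape, filters):
    """'same' convolution with stride 1 keeps height and width, changes channels."""
    height, width, _ = shape
    return (height, width, filters)

def same_pool(shape, stride):
    """'same' pooling rounds the output size up: ceil(size / stride)."""
    height, width, channels = shape
    return (math.ceil(height / stride), math.ceil(width / stride), channels)

shape = (98, 40, 1)                   # one spectrogram
shape = same_conv(shape, filters=64)  # (98, 40, 64)
shape = same_pool(shape, stride=2)    # (49, 20, 64)
shape = same_conv(shape, filters=64)  # (49, 20, 64)
flat_size = shape[0] * shape[1] * shape[2]
num_classes = 9                       # eight words plus '__other__'

print(shape, flat_size, num_classes)
```

The dense layer therefore maps a flattened vector of 62,720 values to 9 logits, one per class.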

In [11]:
def model(spectrograms, is_training):
    
    net = tf.layers.conv2d(spectrograms, filters=64, kernel_size=(20, 8), padding='same', activation=tf.nn.relu)
    net = tf.layers.max_pooling2d(net, pool_size=2, strides=2, padding='same')
    net = tf.layers.conv2d(net, filters=64, kernel_size=(10, 4), padding='same', activation=tf.nn.relu)
    net = tf.layers.flatten(net)
    logits = tf.layers.dense(net, len(vocabulary), activation=None)

    return logits

print('Function model returns:', model(tf.zeros(shape=[1, 98, 40, 1]), True))

Function model returns: Tensor("dense/BiasAdd:0", shape=(1, 9), dtype=float32)


## Speech classifier using the Estimator API

In [0]:
def model_fn(features, labels, mode, params):
    
    logits = model(
        spectrograms=features['spectrogram'],
        is_training=mode==tf.estimator.ModeKeys.TRAIN)
    
    predictions = tf.argmax(logits, 1)
    
    if mode == tf.estimator.ModeKeys.PREDICT:
        predictions = {
            'class_probability': tf.reshape(tf.nn.softmax(logits), [-1, len(vocabulary)]),
            'class_id': tf.reshape(predictions, [-1, 1]),
        }
        print(predictions)
        return tf.estimator.EstimatorSpec(mode, predictions=predictions)

    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    
    if mode == tf.estimator.ModeKeys.EVAL:
        accuracy = tf.metrics.accuracy(labels=labels, predictions=predictions)
        metrics = {'accuracy': accuracy}
        return tf.estimator.EstimatorSpec(mode, loss=loss, eval_metric_ops=metrics)

    assert mode == tf.estimator.ModeKeys.TRAIN

    optimizer = tf.train.AdagradOptimizer(learning_rate=params['learning_rate'])
    train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
    
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

estimator = tf.estimator.Estimator(
    model_fn=model_fn,
    model_dir=model_dir,
    params = {'learning_rate': learning_rate},
)
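The `model_fn` above turns logits into class probabilities with a softmax, picks the class with `argmax`, and trains on a sparse softmax cross-entropy loss. The NumPy sketch below shows what these three operations compute on made-up logits. It assumes the loss is averaged over the batch and is meant as an illustration, not as the TensorFlow implementation.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def sparse_cross_entropy(logits, labels):
    """Mean negative log-probability of the true class, given integer labels."""
    probs = softmax(logits)
    true_class_probs = probs[np.arange(len(labels)), labels]
    return float(-np.log(true_class_probs).mean())

# Two made-up examples over three classes.
logits = np.array([[2.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0]])
labels = np.array([0, 2])

class_probability = softmax(logits)
class_id = class_probability.argmax(axis=1)
loss = sparse_cross_entropy(logits, labels)
print(class_id, loss)
```

"Sparse" means the labels are integer class IDs rather than one-hot vectors, which matches the integer IDs produced by the train-evaluation split.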

# Training<a id='training'></a>

Training is not practical without a GPU. Training for 300,000 steps (`max_training_steps`) takes about 4 hours.

![](https://storage.googleapis.com/kf-pipeline-contrib-public/release-0.1.6/kfp-components/notebooks/speech_recognition/assets/tensorboard.png)

In [0]:
train_spec = tf.estimator.TrainSpec(
    input_fn=lambda: input_fn(
        train_inputs,
        train_labels,
        batch_size=batch_size,
        repeat=-1,
        shuffle=True,
        augmentation=True),
    max_steps=max_training_steps)

eval_spec = tf.estimator.EvalSpec(
    input_fn=lambda: input_fn(
        eval_inputs,
        eval_labels,
        batch_size=batch_size,
        repeat=1,
        shuffle=False,
        augmentation=False)
)

tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)



## Evaluation

In [0]:
estimator.evaluate(
    input_fn=lambda: input_fn(
        eval_inputs, eval_labels, batch_size=batch_size, repeat=1, shuffle=False))

## Single-word input prediction

In [0]:
predict_iterator = estimator.predict(
    input_fn=lambda: input_fn(eval_inputs),
    yield_single_examples=False,
)

predicted_labels = []

for pred in predict_iterator:
    predicted_labels += pred['class_id'].flatten().tolist()

print('Predicted labels:', predicted_labels[:10], 'True labels:', eval_labels[:10].tolist())

## Confusion matrix

In [0]:
pd.DataFrame(confusion_matrix(eval_labels, predicted_labels), columns=vocabulary, index=vocabulary)
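In the table produced above, rows are the true classes and columns are the predicted classes, so per-class recall is the diagonal divided by the row sums. The sketch below shows the computation on made-up labels using only NumPy; `confusion_counts` is a hypothetical stand-in for `confusion_matrix`.

```python
import numpy as np

def confusion_counts(true_ids, predicted_ids, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(matrix, (true_ids, predicted_ids), 1)
    return matrix

# Made-up labels for three classes, only to illustrate the computation.
true_ids = np.array([0, 0, 1, 1, 1, 2])
predicted_ids = np.array([0, 1, 1, 1, 0, 2])

matrix = confusion_counts(true_ids, predicted_ids, num_classes=3)
# Recall per class: correct predictions divided by the number of true examples.
recall = matrix.diagonal() / matrix.sum(axis=1)
print(matrix)
print(recall)
```

Large off-diagonal entries point to pairs of words that the model confuses with each other.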

# Predictions from streaming audio<a id='serving'></a>

We have a model that can predict a single label for fixed-length audio. To assign multiple labels to a stream of audio, we can slice the input signal into overlapping windows, make predictions on every window with this model, and aggregate the overlapping predictions. This implementation trains the model with single-word audio clips and applies the moving window technique to enable prediction on streaming audio.

### Make combined audio file

This file will be used to simulate streaming audio.

In [0]:
from scipy.io import wavfile

audios = []
for filename in eval_inputs[:10]:
    rate, audio = wavfile.read(filename)
    audios.append(audio)

audios = np.concatenate(audios)
wavfile.write('combined.wav', 16000, audios)

print('This file combines:', eval_inputs[:10])
print('Labels for combined.wav:', eval_labels[:10])
ipd.Audio('combined.wav')

#### Sliding window technique

The following graph demonstrates how a sliding window is taking overlapping patches from the input audio.

![](https://storage.googleapis.com/kf-pipeline-contrib-public/release-0.1.6/kfp-components/notebooks/speech_recognition/assets/sliding_window.png)

#### Sliding window function

The following function is a vectorized implementation of a sliding window. For the test case, we have input audio with shape=(4*16000, ) and the returned windowed audio will have shape=(13, 16000).

In [0]:
def sliding_window(audio, window_size=16000, step_size=4000):
    audio = tf.reshape(audio, [-1])
    audio_len = tf.cast(tf.reshape(tf.shape(audio), []), tf.int32)

    # make the audio length a multiple of the step size
    pad = tf.floormod(-audio_len, step_size)
    audio = tf.concat([audio, tf.zeros(pad)], axis=0)
    audio_len = tf.cast(tf.reshape(tf.shape(audio), []), tf.int32)

    # audio shorter than one window is left-padded below, so always produce at least one window
    number_windows = 1 + (audio_len - window_size) // step_size
    number_windows = tf.maximum(number_windows, 1)

    # left-pad audio that is shorter than one window
    padding = tf.zeros(tf.reduce_max([window_size - audio_len, tf.constant(0)]))
    audio = tf.concat([padding, audio], axis=0)

    indexes = tf.range(start=0, limit=window_size, delta=1)
    row_offsets = tf.range(start=0, limit=number_windows, delta=1)

    # broadcasted summation
    indexes = indexes[None, :] + (row_offsets[:, None] * step_size)
    windowed_audio = tf.gather(audio, indexes, validate_indices=False)
    return windowed_audio

with tf.Graph().as_default():
    print(sliding_window(audio=tf.zeros(shape=(4*16000))))
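The broadcasting trick behind the vectorized implementation is easier to inspect in NumPy, where the result can be printed directly. The following is a sketch of the same idea, not a line-by-line port: `sliding_window_np` is a hypothetical name, and the padding choices (zeros at the end up to a multiple of the step size, zeros at the front for clips shorter than one window) are assumptions of this sketch.

```python
import numpy as np

def sliding_window_np(audio, window_size=16000, step_size=4000):
    """Return overlapping windows of `audio` as a (num_windows, window_size) array."""
    audio = np.asarray(audio).reshape(-1)
    # Zero-pad the end so the length is a multiple of the step size.
    tail = np.zeros((-len(audio)) % step_size, dtype=audio.dtype)
    audio = np.concatenate([audio, tail])
    # Zero-pad the front if the clip is shorter than a single window.
    head = np.zeros(max(window_size - len(audio), 0), dtype=audio.dtype)
    audio = np.concatenate([head, audio])
    num_windows = 1 + (len(audio) - window_size) // step_size
    # Row i holds the indexes i * step_size ... i * step_size + window_size - 1.
    indexes = np.arange(window_size)[None, :] + step_size * np.arange(num_windows)[:, None]
    return audio[indexes]

windows = sliding_window_np(np.arange(4 * 16000, dtype=np.float32))
print(windows.shape)
```

Adding a row vector of within-window offsets to a column vector of window start positions yields the full index matrix in one step, so no Python loop over windows is needed.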

### A new input with sliding window

Note that this input function takes only one file at a time.

In [0]:
def sliding_window_input_fn(filename):
    wav_content = tf.read_file(filename)
    audio = decode_wav(wav_content, desired_samples=-1)
    windowed_audio = sliding_window(audio)    
    spectrograms = tf.map_fn(create_spectrogram, windowed_audio, dtype=tf.float32)
    return {'spectrogram': spectrograms}
    
sliding_window_input_fn('combined.wav')

## Windowed prediction for a multi-word input audio file

Note that the raw predictions come from overlapping sliding windows, so a single spoken word can appear in several consecutive windows.

In [0]:
prediction_iterator = estimator.predict(
    input_fn=lambda: sliding_window_input_fn('combined.wav'),
    yield_single_examples=False,
)

result = next(prediction_iterator)
raw_predictions = result['class_id'].flatten()
raw_predictions

## Final streaming prediction aggregation

To reduce the raw predictions to a final streaming prediction, we can use a heuristic rule: take the rolling mode and merge up to five consecutive identical IDs into one.

In [0]:
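def mode_of_non_zeros(values):
    # most frequent class ID in the window, ignoring the '__other__' class
    other_index = vocabulary.index('__other__')
    non_other = values[values != other_index]
    if len(non_other) == 0:
        return other_index
    # np.bincount counts every ID; argmax returns the smallest ID among ties
    return np.bincount(non_other).argmax()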
    
max_word_distance = 5

predictions = []
while len(raw_predictions):
    window = raw_predictions[:10]
    prediction = mode_of_non_zeros(window)
    if len(predictions) and prediction == predictions[-1] and distance < max_word_distance:
        distance += 1
    else:
        distance = 1
        predictions.append(prediction)
    raw_predictions = raw_predictions[1:]
predictions = np.array(predictions)

print('Streaming audio predictions:', predictions)
print('Streaming audio true labels:', eval_labels[:10])

## References<a id='references'></a>

- [TensorFlow: Simple audio recognition tutorial](https://www.tensorflow.org/tutorials/sequences/audio_recognition)
- [Speech_commands dataset v0.02](https://storage.cloud.google.com/download.tensorflow.org/data/speech_commands_v0.02.tar.gz)
- [Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition](https://arxiv.org/abs/1804.03209)
- [WAV](https://en.wikipedia.org/wiki/WAV)