# Speech Recognition with AI
Brought to you by Daniel Sikar - daniel.sikar@city.ac.uk
and
City Data Science Society - https://www.datasciencesociety.city/

## Natural Language Processing with Convolutional Neural Networks

Notebook: https://github.com/dsikar/natural-language-processing/blob/master/NaturalLanguageProcessing.ipynb

TensorFlow's Speech Commands dataset: http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz

Consisting of:
* 65,000 one-second long utterances
* 30 short words plus a background noise set
* Thousands of different people

This workshop uses a subset ("yes" and "no") of TensorFlow's Speech Commands dataset. The full set consists of 30 words plus a background noise set: `_background_noise_`, bed, bird, cat, dog, down, eight, five, four, go, happy, house, left, marvin, no, nine, off, on, one, right, seven, sheila, six, stop, three, tree, two, up, wow, yes, zero.

Note: In this workshop, we will **not** use the full dataset.

In [None]:
# Get the data subset
# Install PyDrive
!pip install PyDrive

#Import modules
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

#Authenticate and create the PyDrive client
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

# Get the shareable link e.g. https://drive.google.com/file/d/1OebaOg7YlGHa1UYQDIpPOuNWNTn2AHlq/view?usp=sharing
# Get the id from the link 1OebaOg7YlGHa1UYQDIpPOuNWNTn2AHlq
downloaded = drive.CreateFile({'id':"1OebaOg7YlGHa1UYQDIpPOuNWNTn2AHlq"})   
downloaded.GetContentFile('nlp-dataset.tar.gz')   
# Alternatively, if you are running the notebook locally, file can be downloaded by pasting shareable link
# into browser

In [None]:
# list
# !ls
# unpack
# !tar xvf nlp-dataset.tar.gz
# !ls dataset/training
# !ls dataset/training/no
# !ls dataset/training/yes

In [None]:
import os
import librosa
import IPython.display as ipd
import matplotlib.pyplot as plt
import numpy as np
from scipy.io import wavfile
import warnings
warnings.filterwarnings("ignore")

In [None]:
# what is the space available?
!df -h

**Data Exploration and Visualization**

Exploring and visualizing the data helps us understand both the data itself and the preprocessing steps it needs.

**Visualization of the audio signal in the time domain**

Now, we’ll visualize the audio signal in the time domain:

In [None]:
# Humans can detect sounds in a frequency range from about 20 Hz to 20 kHz.
train_audio_path = 'dataset/training/'
samples, sample_rate = librosa.load(train_audio_path + 'yes/8d8d9855_nohash_0.wav', sr = 8000)
fig = plt.figure(figsize=(14, 8))
ax1 = fig.add_subplot(211)
ax1.set_title('"Yes" - waveform for file ' + 'dataset/training/yes/8d8d9855_nohash_0.wav')
ax1.set_xlabel('Time (s)')
ax1.set_ylabel('Amplitude')
# Time axis: one point per sample, spanning the clip duration in seconds
ax1.plot(np.linspace(0, len(samples)/sample_rate, len(samples)), samples)
ipd.Audio(samples, rate=sample_rate)

In [None]:
train_audio_path = 'dataset/training/'
samples, sample_rate = librosa.load(train_audio_path + 'no/8a194ee6_nohash_0.wav', sr = 8000)
fig = plt.figure(figsize=(14, 8))
ax1 = fig.add_subplot(211)
ax1.set_title('"No" - waveform for file ' + 'dataset/training/no/8a194ee6_nohash_0.wav')
ax1.set_xlabel('Time (s)')
ax1.set_ylabel('Amplitude')
# Time axis: one point per sample, spanning the clip duration in seconds
ax1.plot(np.linspace(0, len(samples)/sample_rate, len(samples)), samples)
ipd.Audio(samples, rate=sample_rate)

In [None]:
print(type(samples))
print(samples.shape)

In [None]:
# Human voice:
# In telephony, the usable voice frequency band ranges from approximately 300 to 3400 Hz.
# The bandwidth allocated for a single voice-frequency transmission channel is usually 4 kHz.
# Per the Nyquist–Shannon sampling theorem, the sampling frequency must be at least twice
# the highest frequency component of the signal for it to be reconstructed faithfully.
# Band-limiting the voice signal to 4 kHz by filtering before sampling therefore allows
# a sampling frequency of 8 kHz.

# TODO plot frequency x amplitude
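
The TODO above asks for a frequency-versus-amplitude view. As a minimal sketch of the underlying computation, the block below takes the magnitude spectrum of a synthetic one-second 440 Hz tone sampled at 8 kHz using `np.fft.rfft`. The tone, its frequency, and the variable names are illustrative assumptions, not part of the notebook; with a real clip, the `samples` and `sample_rate` returned by `librosa.load` would take the place of the synthetic signal.

```python
import numpy as np

sample_rate = 8000                         # Hz, same rate the notebook loads audio at
t = np.arange(sample_rate) / sample_rate   # one second of sample times
tone = np.sin(2 * np.pi * 440 * t)         # hypothetical 440 Hz test tone

# Magnitude spectrum of a real-valued signal
spectrum = np.abs(np.fft.rfft(tone))
freqs = np.fft.rfftfreq(len(tone), d=1 / sample_rate)

peak_freq = freqs[np.argmax(spectrum)]     # frequency bin with the largest magnitude
nyquist = freqs[-1]                        # highest representable frequency, sample_rate / 2
print(peak_freq, nyquist)
```

Plotting `freqs` against `spectrum` with `plt.plot` would give the frequency-amplitude picture; note that the frequency axis ends at half the sampling rate.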

In [None]:
labels=os.listdir(train_audio_path)
print("Audio labels: ", labels)

In [None]:
#find count of each label and plot bar graph
no_of_recordings=[]
for label in labels:
    waves = [f for f in os.listdir(train_audio_path + '/'+ label) if f.endswith('.wav')]
    no_of_recordings.append(len(waves))
    
#plot
plt.figure()
index = np.arange(len(labels))
plt.bar(index, no_of_recordings)
plt.xlabel('Commands', fontsize=12)
plt.ylabel('No of recordings', fontsize=12)
plt.xticks(index, labels, fontsize=15, rotation=60)
plt.title('No. of recordings for each command')
plt.show()

**Duration of recordings**

What’s next? A look at the distribution of the duration of recordings:

In [None]:
duration_of_recordings=[]
for label in labels:
    waves = [f for f in os.listdir(train_audio_path + '/'+ label) if f.endswith('.wav')]
    for wav in waves:
        sample_rate, samples = wavfile.read(train_audio_path + '/' + label + '/' + wav)
        duration_of_recordings.append(float(len(samples)/sample_rate))
    
plt.hist(np.array(duration_of_recordings))

**Preprocessing the audio waves**

In the data exploration part earlier, we saw that a few recordings are shorter than 1 second and that the sampling rate is higher than we need. So, let us read the audio files and apply the preprocessing steps below to deal with this.

Here are the two steps we’ll follow:

* Resampling
* Removing commands shorter than 1 second

Let us define these preprocessing steps in the code snippet below:

In [None]:
# 743s execution time
train_audio_path = 'dataset/training'

all_wave = []
all_label = []
for label in labels:
    print(label)
    waves = [f for f in os.listdir(train_audio_path + '/'+ label) if f.endswith('.wav')]
    for wav in waves:
        samples, sample_rate = librosa.load(train_audio_path + '/' + label + '/' + wav, sr = 8000)
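        # librosa.load already resamples to 8000 Hz (sr=8000), so no separate resample call is needed
        # Keep only clips that are exactly one second long (8000 samples at 8 kHz)
        if len(samples) == 8000: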
            all_wave.append(samples)
            all_label.append(label)
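
The length check in the loop above can be illustrated on synthetic data. In this sketch the `clips` list and its lengths are made up; it only shows how a clip's duration follows from its sample count and how clips that are not exactly one second long get dropped.

```python
import numpy as np

sample_rate = 8000  # Hz, the rate used after resampling

# Hypothetical clips: two full-length, one too short
clips = [np.zeros(8000), np.zeros(6500), np.zeros(8000)]

durations = [len(c) / sample_rate for c in clips]     # seconds per clip
kept = [c for c in clips if len(c) == sample_rate]    # exactly one second

print(durations)
print(len(kept))
```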

Encode the output labels as integers:

In [91]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y=le.fit_transform(all_label)
classes= list(le.classes_)
type(labels)

list

Now, convert the integer-encoded labels to one-hot vectors, since this is a multi-class classification problem:

In [None]:
# keras.utils.np_utils is not available in recent Keras releases; import to_categorical directly
from keras.utils import to_categorical
y=to_categorical(y, num_classes=len(labels))
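
To make the two encoding steps concrete, here is a NumPy-only sketch of what they produce. It is not how `LabelEncoder` or `to_categorical` are implemented; `np.unique` and `np.eye` are used as stand-ins, and the `example_labels` list is hypothetical.

```python
import numpy as np

example_labels = ['yes', 'no', 'no', 'yes']  # hypothetical labels

# Integer encoding: classes are sorted, each label maps to its class index
classes_sorted, y_int = np.unique(example_labels, return_inverse=True)

# One-hot encoding: row i of the identity matrix represents class i
y_onehot = np.eye(len(classes_sorted))[y_int]

print(classes_sorted)
print(y_int)
print(y_onehot)
```

Because the classes are sorted alphabetically, "no" maps to index 0 and "yes" to index 1.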

Reshape the 2D array to 3D, since the input to Conv1D must be a 3D array of shape (samples, timesteps, channels):

In [None]:
all_wave = np.array(all_wave).reshape(-1,8000,1)
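
As a small illustration of the reshape, the sketch below stacks a few hypothetical one-second clips and adds the trailing channel axis. The clip count of 3 is arbitrary.

```python
import numpy as np

# Three hypothetical mono clips of 8000 samples each
waves = [np.zeros(8000, dtype=np.float32) for _ in range(3)]

batch_2d = np.array(waves)                 # shape: (clips, samples)
batch_3d = batch_2d.reshape(-1, 8000, 1)   # shape: (clips, timesteps, channels)

print(batch_2d.shape, batch_3d.shape)
```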

**Split into train and validation set**

Next, we will train the model on 80% of the data and validate on the remaining 20%:


In [None]:
from sklearn.model_selection import train_test_split
x_tr, x_val, y_tr, y_val = train_test_split(np.array(all_wave),np.array(y),stratify=y,test_size = 0.2,random_state=777,shuffle=True)
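
The `stratify=y` argument keeps the class proportions the same in both splits. The following is a simplified NumPy sketch of that idea, not scikit-learn's implementation; the function name `stratified_split` and the toy label array are assumptions made for illustration.

```python
import numpy as np

def stratified_split(y_int, test_size=0.2, seed=777):
    """Return (train_idx, val_idx) with per-class proportions preserved."""
    rng = np.random.default_rng(seed)
    train_idx, val_idx = [], []
    for cls in np.unique(y_int):
        idx = np.flatnonzero(y_int == cls)   # positions of this class
        rng.shuffle(idx)                     # shuffle within the class
        n_val = int(round(len(idx) * test_size))
        val_idx.extend(idx[:n_val])
        train_idx.extend(idx[n_val:])
    return np.array(train_idx), np.array(val_idx)

# Toy labels: 60 examples of class 0 and 40 of class 1
y_toy = np.array([0] * 60 + [1] * 40)
tr, val = stratified_split(y_toy)
print(len(tr), len(val), y_toy[val].mean())
```

In the toy data 40% of the examples belong to class 1, and the same share shows up in the validation indices.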

**Model Architecture for this problem**

We will build the speech-to-text model using Conv1D, a convolutional layer that performs the convolution along only one dimension (here, time).
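
To illustrate what convolving along one dimension means, the sketch below slides a small kernel over a 1D signal with NumPy. As Keras convolution layers do, it computes a cross-correlation (the kernel is not flipped). The signal and kernel values are made up, and a real `Conv1D` layer additionally learns many such kernels plus a bias.

```python
import numpy as np

signal = np.array([0., 1., 2., 3., 4., 5.])   # hypothetical 1D input
kernel = np.array([1., 0., -1.])              # hypothetical kernel of width 3

# 'valid' padding with stride 1: the kernel only visits positions where it fits
out = np.array([np.dot(signal[i:i + len(kernel)], kernel)
                for i in range(len(signal) - len(kernel) + 1)])

print(out)
print(len(out))  # len(signal) - len(kernel) + 1
```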

**Model building**

Let us implement the model using the Keras functional API.

In [None]:
from keras.layers import Dense, Dropout, Flatten, Conv1D, Input, MaxPooling1D
from keras.models import Model
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras import backend as K
K.clear_session()

inputs = Input(shape=(8000,1))

#First Conv1D layer
conv = Conv1D(8,13, padding='valid', activation='relu', strides=1)(inputs)
conv = MaxPooling1D(3)(conv)
conv = Dropout(0.3)(conv)

#Second Conv1D layer
conv = Conv1D(16, 11, padding='valid', activation='relu', strides=1)(conv)
conv = MaxPooling1D(3)(conv)
conv = Dropout(0.3)(conv)

#Third Conv1D layer
conv = Conv1D(32, 9, padding='valid', activation='relu', strides=1)(conv)
conv = MaxPooling1D(3)(conv)
conv = Dropout(0.3)(conv)

#Fourth Conv1D layer
conv = Conv1D(64, 7, padding='valid', activation='relu', strides=1)(conv)
conv = MaxPooling1D(3)(conv)
conv = Dropout(0.3)(conv)

#Flatten layer
conv = Flatten()(conv)

#Dense Layer 1
conv = Dense(256, activation='relu')(conv)
conv = Dropout(0.3)(conv)

#Dense Layer 2
conv = Dense(128, activation='relu')(conv)
conv = Dropout(0.3)(conv)

outputs = Dense(len(labels), activation='softmax')(conv)

model = Model(inputs, outputs)
model.summary()
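
The shapes and parameter counts that `model.summary()` reports can be checked by hand. The sketch below redoes that arithmetic in plain Python, assuming 'valid' padding, stride-1 convolutions, non-overlapping pooling windows of size 3 (the Keras default when only the pool size is given), and two output classes as in this workshop's yes/no subset. The helper names are made up for this example.

```python
def conv_len(n, k):
    """Output length of a 'valid' stride-1 convolution."""
    return n - k + 1

def pool_len(n, p):
    """Output length of non-overlapping max pooling with 'valid' padding."""
    return n // p

length, channels, total_params = 8000, 1, 0
for filters, kernel in [(8, 13), (16, 11), (32, 9), (64, 7)]:
    total_params += kernel * channels * filters + filters  # weights + biases
    length = pool_len(conv_len(length, kernel), 3)
    channels = filters

flat_size = length * channels        # number of features after Flatten
features = flat_size
for units in [256, 128, 2]:          # two dense layers, then the 2-class softmax
    total_params += features * units + units
    features = units

print(length, channels, flat_size, total_params)
```

Dropout, pooling, and flatten layers contribute no trainable parameters, so they only affect the shape bookkeeping.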

Define the loss function to be categorical cross-entropy, since this is a multi-class classification problem:

In [None]:
model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
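
For a single example, categorical cross-entropy reduces to the negative log of the probability assigned to the true class. A minimal NumPy sketch follows, using a two-class softmax output and a one-hot target chosen for illustration; the function below is a stand-in written for this example, not the Keras implementation.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean cross-entropy between one-hot targets and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)             # avoid log(0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=-1)))

y_true = np.array([[1.0, 0.0]])                    # one-hot target for class 0
y_pred = np.array([[0.42285758, 0.5771424]])       # illustrative softmax output

loss = categorical_crossentropy(y_true, y_pred)
print(loss)  # equals -log(0.42285758)
```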

Early stopping and model checkpointing are callbacks that stop training once the validation loss stops improving and save the best model seen so far:

In [None]:
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=5, min_delta=0.0001) 
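# The metric is registered as 'accuracy', so the validation metric is logged as 'val_accuracy'
mc = ModelCheckpoint('best_model.hdf5', monitor='val_accuracy', verbose=1, save_best_only=True, mode='max')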
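
The patience logic can be sketched in plain Python. This is a simplified illustration of the idea behind `EarlyStopping` with `mode='min'`, not the Keras source; the function name and the list of validation losses are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.0001):
    """Return the index of the epoch at which training would stop, or None."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:   # counts as an improvement
            best = loss
            wait = 0
        else:
            wait += 1                 # one more epoch without improvement
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation losses, one per epoch
losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.70, 0.705, 0.73, 0.9]
stop_epoch = early_stopping_epoch(losses)
print(stop_epoch)
```

Here the loss last improves at epoch index 2, so five epochs without improvement have accumulated by epoch index 7.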

Let us train the model on a batch size of 32 and evaluate the performance on the holdout set:

In [None]:
# execution time 258s
history=model.fit(x_tr, y_tr ,epochs=10, callbacks=[es,mc], batch_size=32, validation_data=(x_val,y_val))

In [None]:
# save model
model.save('nlp-model.h5')

In [None]:
# verify 
!ls -lh nlp-model.h5

**Diagnostic plot**

I’m going to lean on visualization again to understand how the model's performance changes over the training epochs:

In [None]:
# Loss and accuracy curves for the training and validation sets
plt.plot(history.history['loss'], label='train_loss')
plt.plot(history.history['val_loss'], label='val_loss')
plt.plot(history.history['accuracy'], label='train_acc')
plt.plot(history.history['val_accuracy'], label='val_acc')
plt.xlabel('Epoch')
plt.legend()
plt.show()

**Loading the best model**

In [None]:
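from keras.models import load_model
# Loads the model saved after the final epoch; the ModelCheckpoint callback
# writes its best checkpoint to 'best_model.hdf5', which can be loaded instead.
model=load_model('nlp-model.h5')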

Define the function that predicts text for the given audio:

In [None]:
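def predict(audio):
    """Return the class probabilities and the predicted label for one 8000-sample clip."""
    prob=model.predict(audio.reshape(1,8000,1))  # add batch and channel axes
    index=np.argmax(prob[0])                     # class with the highest probability
    return prob, classes[index]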
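
The step from softmax output to label is just an arg-max lookup. Here is a small sketch with an illustrative probability vector; `class_names` stands in for the notebook's `classes` list and assumes the alphabetical order `['no', 'yes']`.

```python
import numpy as np

class_names = ['no', 'yes']                    # assumed sorted class order
prob = np.array([[0.42285758, 0.5771424]])     # illustrative softmax output, shape (1, 2)

index = int(np.argmax(prob[0]))                # position of the highest probability
predicted = class_names[index]
print(index, predicted)
```

The two probabilities sum to approximately 1, and the label is whichever class holds the larger share.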

Prediction time! Make predictions on the validation data:

In [128]:
import random
print("Number of testing examples - len(x_val):", len(x_val))
index=random.randint(0,len(x_val)-1)
print("Random index selected:", index)
samples=x_val[index].ravel() # x_val[index] shape: (8000, 1), "samples" shape: (8000,)
print("Shape of samples: ", samples.shape)
# prob, classes = predict(samples)
print("Data - x_val[index]:", x_val[index])
print("Randomly selected audio:",classes[np.argmax(y_val[index])])

pred, predClass = predict(samples)
print("Prediction output: ", pred)
print("Predicted class:", predClass)

ipd.Audio(samples, rate=8000)

Number of testing examples - len(x_val): 76
Random index selected: 58
Shape of samples:  (8000,)
Data - x_val[index]: [[-3.0517578e-05]
 [-2.4414062e-04]
 [-4.2724609e-04]
 ...
 [ 8.2397461e-04]
 [ 8.2397461e-04]
 [ 5.1879883e-04]]
Randomly selected audio: no
Prediction output:  [[0.42285758 0.5771424 ]]
Predicted class: yes


In [None]:
# TODO
# 1. Loop through training examples, get accuracy on 200 examples
# 2. GitHub DSS login
# 3. Confirm Conv2D number of parameters
# 4. Softmax prediction output
# 5. Datasets x3 20/200/2000 examples
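
For the first TODO item, accuracy over a batch can be computed by comparing the arg-max of the predictions with the arg-max of the one-hot labels. This is a sketch on made-up arrays; in the notebook, `model.predict(x_val)` and `y_val` would presumably supply the real values.

```python
import numpy as np

# Hypothetical softmax outputs and one-hot labels for four examples
probs = np.array([[0.9, 0.1], [0.3, 0.7], [0.6, 0.4], [0.2, 0.8]])
y_onehot = np.array([[1, 0], [0, 1], [0, 1], [0, 1]])

pred_classes = np.argmax(probs, axis=1)        # predicted class per example
true_classes = np.argmax(y_onehot, axis=1)     # true class per example
accuracy = float(np.mean(pred_classes == true_classes))
print(accuracy)
```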

In [61]:
print(prob.shape)
print(prob)
print(classes)
y_val[index]
labels
print(samples.shape)

(1, 2)
[[0.9933895  0.00661047]]
y
(8000,)


In [None]:
# x_val[index].shape
# (8000, 1)
# x_val[index]: x_val[index]: [[ 0.00422602]
# [ 0.01268432]
# [ 0.00283716]
# ...
# xr = x_val[index].ravel()
# type(xr) # numpy.ndarray
# type(x_val[index]) # numpy.ndarray
# xr.shape # (8000,)
# xr: array([-0.00036546, -0.00062576,  0.00048751, ..., -0.0003528 ,
y_val[index]