# Building a Speech Recognition Engine in Keras

In this Notebook we will train a Convolutional Neural Network (CNN) on short speech signals, each containing a single spoken word. After training, we will use the model to predict which word is spoken in a .wav-file.

The task is to classify audio clips into three classes: bed, cat and happy. In this 
<a href="http://taiwan.thomasmore.be/pr2/koen/bedcathappy.rar">rar-file</a> you will find three folders. Download this rar-file and extract the folders into a sub-folder `data`. The name of each sub-folder is the label of the audio files in it. Each folder contains approximately 1700 audio files. Play a few audio files at random to get an overall idea.

<img src="./resources/single.png"  style="height: 200px"/>

As you know from the previous Notebook, directly feeding a speech signal to a ConvNet model won't do the job. There are some preprocessing steps you'll need to take. Basically we turn sound waves into numbers so that they can be used as input for a neural network.

## 1. Preprocess our sound data

In the previous Notebook we talked about the Fourier transform, which turns a sound wave into a spectrogram. Two common representations are used for this: a spectrogram computed directly with the FFT (Fast Fourier Transform), and MFCCs (Mel-Frequency Cepstral Coefficients), which are derived from such a spectrogram. The code below uses MFCCs to preprocess the .wav-files in our three folders and to create our training and test sets.
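To make the idea of a spectrogram concrete, here is a minimal NumPy sketch of a short-time Fourier transform on a synthetic tone. The sample rate, frame length and hop size are assumptions chosen for illustration; they are not the values used by the preprocessing library. MFCCs go a few steps further (mel filter bank, logarithm, discrete cosine transform), which Librosa handles in practice.

```python
import numpy as np

def magnitude_spectrogram(signal, frame_len=256, hop=128):
    """Short-time Fourier transform magnitudes, shape (n_freq_bins, n_frames)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Cut the signal into overlapping, windowed frames
    frames = np.stack([signal[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft keeps only the non-negative frequencies of a real signal
    return np.abs(np.fft.rfft(frames, axis=1)).T

sample_rate = 8000                          # assumed sample rate in Hz
t = np.arange(sample_rate) / sample_rate    # one second of audio
tone = np.sin(2 * np.pi * 1000 * t)         # synthetic 1000 Hz tone

spec = magnitude_spectrogram(tone)
freqs = np.fft.rfftfreq(256, d=1 / sample_rate)
peak_hz = freqs[spec.mean(axis=1).argmax()]  # frequency bin with the most energy
print(spec.shape, peak_hz)
```

Each column of `spec` describes the frequency content of one short time frame, which is why the result can be treated like a one-channel image.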

First install Librosa, a Python package for music and audio analysis.

In [None]:
pip install librosa==0.6.3

Import the necessary Keras modules and a Python code library with some functions we will use later. By the way, you don't need to know the details of this code library.

In [None]:
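import sys
import numpy as np  # used later for np.argmax

# sys.path entries must be directories, not files:
# add the folder that contains the `library` package
sys.path.append('.')
from library.preprocess import *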

import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPooling2D
from keras.utils import to_categorical

Now run the code below to preprocess our sound data. Since computing MFCCs is time-consuming, we do it only once and save the computed values in one .npy file per label, named after that label. After running the code you can find those .npy-files in the root folder of this lesson.

In [None]:
# second dimension of each MFCC feature matrix
# (passed as max_len, so presumably the number of time frames kept per clip)
feature_dim_2 = 11

# save data to array file first
save_data_to_array(max_len=feature_dim_2)
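The internals of `save_data_to_array` are not shown in this Notebook. The sketch below only illustrates what a `max_len` parameter typically does, under the assumption that every MFCC matrix is zero-padded or cut along its time axis so that all clips get the same shape, and that the stacked result is cached with `np.save`. The function name `fix_length` and the temporary file path are hypothetical.

```python
import os
import tempfile
import numpy as np

def fix_length(mfcc, max_len=11):
    """Zero-pad or cut along the time axis so the result has max_len columns."""
    n_frames = mfcc.shape[1]
    if n_frames < max_len:
        return np.pad(mfcc, ((0, 0), (0, max_len - n_frames)), mode='constant')
    return mfcc[:, :max_len]

short_clip = np.ones((20, 7))    # stand-in MFCC matrix with too few frames
long_clip = np.ones((20, 30))    # stand-in MFCC matrix with too many frames

features = np.stack([fix_length(short_clip), fix_length(long_clip)])

# Cache the features so they only have to be computed once
path = os.path.join(tempfile.mkdtemp(), 'bed.npy')
np.save(path, features)
restored = np.load(path)
print(features.shape, restored.shape)
```

A fixed shape matters because the network defined later expects every input to have exactly the same dimensions.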

## 2. Prepare the train and test set

We'll take advantage of the scikit-learn function `train_test_split`, which automatically splits the whole dataset into a training set and a test set.
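The helper `get_train_test` is part of the code library and is not shown here, so the following is only a sketch of how `train_test_split` is typically called. The stand-in data, the `test_size` of 0.25 and the `random_state` are assumptions for illustration, and the variable names carry a `demo` suffix so they do not clash with the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 36 fake clips of shape (20, 11), 12 per class
X_demo = np.random.rand(36, 20, 11)
y_demo = np.repeat([0, 1, 2], 12)

# stratify keeps the class proportions equal in both subsets
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, shuffle=True,
    stratify=y_demo)

print(X_tr_demo.shape, X_te_demo.shape)
print(np.bincount(y_te_demo))   # number of test clips per class
```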

In [None]:
# loading train set and test set
X_train, X_test, y_train, y_test = get_train_test()

# feature dimension
feature_dim_1 = 20
channel = 1

# reshape for 2D convolution; there is only one channel (images normally have 3: RGB)
X_train = X_train.reshape(X_train.shape[0], feature_dim_1, feature_dim_2, channel)
X_test = X_test.reshape(X_test.shape[0], feature_dim_1, feature_dim_2, channel)

# one-hot encoding (already explained in the Computer Vision lesson, CIFAR-10)
y_train_hot = to_categorical(y_train)
y_test_hot = to_categorical(y_test)
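To see what the reshape and the one-hot encoding do to the data, here is a small NumPy-only sketch on stand-in arrays. The batch of five zero matrices and the label values are made up for illustration. `to_categorical` yields the same kind of matrix as the identity-matrix trick used here.

```python
import numpy as np

batch = np.zeros((5, 20, 11))                          # 5 stand-in MFCC matrices
batch_4d = batch.reshape(batch.shape[0], 20, 11, 1)    # append a channel axis

labels = np.array([0, 2, 1, 2, 0])   # integer class indices
one_hot = np.eye(3)[labels]          # row i of the identity matrix encodes class i

print(batch_4d.shape)
print(one_hot)
```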

## 3. Build the model and train it

Finally, it is time to build our CNN and train it on the training data. By now the code below should hold no secrets.

In [None]:
epochs = 50
batch_size = 100
verbose = 1

class_names = ['bed', 'cat', 'happy']
num_classes = len(class_names)

model = Sequential()

model.add(Conv2D(32, kernel_size=(2, 2), activation='relu', input_shape=(feature_dim_1, feature_dim_2, channel)))
model.add(Conv2D(48, kernel_size=(2, 2), activation='relu'))
model.add(Conv2D(120, kernel_size=(2, 2), activation='relu'))

model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.25))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.4))
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])

model.fit(X_train, y_train_hot, batch_size=batch_size, epochs=epochs, verbose=verbose, validation_data=(X_test, y_test_hot))
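It can help to trace how the input shape changes through the layers above. The sketch below does this by hand, assuming the Keras defaults for these layers: `padding='valid'` and stride 1 for `Conv2D`, and a stride equal to the pool size for `MaxPooling2D`. The helper `conv_out` is hypothetical and not part of Keras; `model.summary()` reports the actual shapes.

```python
def conv_out(size, kernel, stride=1):
    """Output length of a 'valid' convolution or pooling along one axis."""
    return (size - kernel) // stride + 1

height, width = 20, 11          # feature_dim_1, feature_dim_2
for filters in (32, 48, 120):   # the three Conv2D layers, 2x2 kernels, stride 1
    height, width = conv_out(height, 2), conv_out(width, 2)
    print(f"Conv2D({filters}): {height} x {width} x {filters}")

# MaxPooling2D(pool_size=(2, 2)) moves in steps of 2 by default
height, width = conv_out(height, 2, 2), conv_out(width, 2, 2)
flattened = height * width * 120
print(f"MaxPooling2D: {height} x {width} x 120, flattened: {flattened}")
```

The flattened length is the number of inputs that the first `Dense` layer receives.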

## 4. Predict

In the folder `data_test` you will find some .wav-files to test your model with. Since we achieved an accuracy of about 94%, most of these files should be classified correctly. Try recording the words yourself and check whether the model still recognizes them.

In [None]:
sample = wav2mfcc('./data_test/bed_0.wav')
sample_reshaped = sample.reshape(1, feature_dim_1, feature_dim_2, channel)

predicted_index = np.argmax(model.predict(sample_reshaped))

print(class_names[predicted_index])
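`model.predict` returns one softmax probability per class, and `np.argmax` picks the index of the largest one. The sketch below shows that step in isolation, using a made-up probability vector in place of a real model output, so the numbers are purely hypothetical.

```python
import numpy as np

class_names = ['bed', 'cat', 'happy']
# Hypothetical softmax output for a single clip, shape (1, 3)
probabilities = np.array([[0.91, 0.06, 0.03]])

# Show the score of every class, not only the winner
for name, p in zip(class_names, probabilities[0]):
    print(f"{name}: {p:.2f}")

predicted = class_names[int(np.argmax(probabilities[0]))]
print("prediction:", predicted)
```

Printing all three probabilities is useful when you test your own recordings, because it shows how confident the model is rather than only which class it picked.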

## 5. Exercise

In the `data_fruit` folder you will find another dataset for speech recognition. Use this dataset as input to build and train a model. Can you achieve a high accuracy? Try to predict words you record yourself.