# Classifying Urban sounds using Deep Learning

## 3 Model Training and Evaluation 

### Load Preprocessed data 

In [1]:
# retrieve the preprocessed data from previous notebook

%store -r x_train 
%store -r x_test 
%store -r y_train 
%store -r y_test 
%store -r yy 
%store -r le

### Initial model architecture - MLP

We will start by constructing a Multilayer Perceptron (MLP) neural network using Keras with a TensorFlow backend. 

We use a `Sequential` model so that we can build the model layer by layer. 

We will begin with a simple model architecture consisting of three layers: an input layer, a hidden layer and an output layer. All three layers will be of the `Dense` layer type, a fully connected layer that is the standard building block of many neural networks. 

The first layer receives the input shape. Each sample contains 40 MFCCs (or columns), giving a shape of (1x40), so we start with an input shape of 40. 
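
As a rough illustration of where the 40-value input comes from, the sketch below averages a placeholder MFCC matrix over its time axis, mirroring the feature extraction used in this project. The random values and the frame count of 173 are assumptions for illustration only, not data from the dataset.

```python
import numpy as np

# Placeholder MFCC matrix: 40 coefficients x 173 time frames (frame count is arbitrary)
rng = np.random.default_rng(0)
mfccs = rng.normal(size=(40, 173))

# Average each coefficient over time, giving one 40-value vector per clip
feature_vector = np.mean(mfccs.T, axis=0)

# Wrap in an outer array so it looks like a batch containing a single sample
batch = np.array([feature_vector])

print(feature_vector.shape)  # (40,)
print(batch.shape)           # (1, 40)
```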

The first two layers will have 256 nodes. The activation function for these two layers is `ReLU` (Rectified Linear Unit), which passes positive values through unchanged and sets negative values to zero. It is cheap to compute and is a widely used default for hidden layers.
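
A minimal NumPy sketch of what ReLU does to a layer's outputs is shown below. The input values are made up, and the `relu` helper is a hypothetical stand-in for the Keras activation rather than its actual implementation.

```python
import numpy as np

def relu(x):
    # Element-wise max(0, x): negatives become zero, positives pass through
    return np.maximum(0.0, x)

pre_activations = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])  # made-up values
activations = relu(pre_activations)
print(activations)
```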

We will also apply a `Dropout` value of 50% on our first two layers. This will randomly exclude nodes from each update cycle which in turn results in a network that is capable of better generalisation and is less likely to overfit the training data.
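
The idea behind dropout can be sketched in a few lines of NumPy. This is an illustrative version of "inverted" dropout, assuming a rate of 0.5 and a layer of 256 units. The `dropout` helper is hypothetical, and Keras applies the equivalent operation only during training.

```python
import numpy as np

def dropout(activations, rate, rng):
    # Keep each unit with probability (1 - rate)
    keep_mask = rng.random(activations.shape) >= rate
    # Scale the surviving units so the expected activation is unchanged
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(42)
hidden = np.ones(256)  # made-up activations for a 256-node layer
dropped = dropout(hidden, rate=0.5, rng=rng)

print("units kept:", int(np.count_nonzero(dropped)))
print("value of kept units:", np.unique(dropped[dropped > 0]))
```

Roughly half of the units are zeroed on each call, and the rest are doubled to compensate.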

Our output layer will have 10 nodes (`num_labels`), which matches the number of possible classifications. The activation for our output layer is `softmax`. Softmax makes the outputs sum to 1, so they can be interpreted as probabilities. The model then makes its prediction based on which class has the highest probability.
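
To make the softmax step concrete, here is a small sketch that turns raw scores into probabilities and picks the most likely class. The three scores are made up, and the `softmax` helper is a plain NumPy stand-in rather than the Keras implementation.

```python
import numpy as np

def softmax(logits):
    # Subtracting the max improves numerical stability without changing the result
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / np.sum(exp)

logits = np.array([2.0, 1.0, 0.1])  # made-up raw scores for three classes
probs = softmax(logits)
predicted_index = int(np.argmax(probs))

print(probs, probs.sum())
print("predicted class index:", predicted_index)
```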

In [2]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Convolution2D, MaxPooling2D
from keras.optimizers import Adam
from keras.utils import np_utils
from sklearn import metrics 

num_labels = yy.shape[1]
filter_size = 2

# Construct model 
model = Sequential()

model.add(Dense(256, input_shape=(40,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))

model.add(Dense(256))
model.add(Activation('relu'))
model.add(Dropout(0.5))

model.add(Dense(num_labels))
model.add(Activation('softmax'))

Using TensorFlow backend.


### Compiling the model 

For compiling our model, we will use the following three parameters: 

* Loss function - we will use `categorical_crossentropy`. This is the standard choice for multi-class classification with one-hot encoded labels. A lower score indicates that the model is performing better.

* Metrics - we will use the `accuracy` metric which will allow us to view the accuracy score on the validation data when we train the model. 

* Optimizer - here we will use `adam` which is a generally good optimizer for many use cases.
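
To see why a lower loss means a better model, the sketch below computes categorical cross-entropy by hand for one confident and one unsure prediction. The probabilities are made up, and the helper is a simplified NumPy version rather than the Keras implementation.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # Clip to avoid log(0), then average -sum(y_true * log(y_pred)) over samples
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))

y_true = np.array([[0, 1, 0]])               # one-hot label: the true class is index 1
confident = np.array([[0.05, 0.90, 0.05]])   # most probability on the true class
unsure = np.array([[0.40, 0.30, 0.30]])      # probability spread across classes

loss_confident = categorical_crossentropy(y_true, confident)
loss_unsure = categorical_crossentropy(y_true, unsure)

print(loss_confident, loss_unsure)
```

The confident, correct prediction yields the smaller loss.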


In [3]:
# Compile the model
model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam') 

In [4]:
# Display model architecture summary 
model.summary()

# Calculate pre-training accuracy 
score = model.evaluate(x_test, y_test, verbose=0)
accuracy = 100*score[1]

print("Pre-training accuracy: %.4f%%" % accuracy)

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 256)               10496     
_________________________________________________________________
activation_1 (Activation)    (None, 256)               0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 256)               65792     
_________________________________________________________________
activation_2 (Activation)    (None, 256)               0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 10)               
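
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The sketch below applies that rule to the three dense layers, assuming 10 output classes (`num_labels`). The `dense_params` helper is hypothetical.

```python
def dense_params(n_inputs, n_units):
    # One weight per (input, unit) pair, plus one bias per unit
    return n_inputs * n_units + n_units

# (inputs, units) for each dense layer; 10 is the assumed number of classes
layer_sizes = [(40, 256), (256, 256), (256, 10)]
counts = [dense_params(n_in, n_out) for n_in, n_out in layer_sizes]

print(counts)       # [10496, 65792, 2570]
print(sum(counts))  # 78858
```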

### Training 

Here we will train the model. 

We will start with 100 epochs, where an epoch is one full pass of the model through the training data. The model typically improves with each epoch until the validation loss stops decreasing. 

We will also start with a low batch size, as having a large batch size can reduce the generalisation ability of the model. 
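
As a quick back-of-the-envelope check, the number of weight updates implied by these settings can be worked out as follows. The figure of 6985 training samples is taken from the training log in this notebook. The batch size and epoch count match the values used in the training cell.

```python
import math

num_train_samples = 6985  # from the "Train on 6985 samples" line of the training log
num_batch_size = 32
num_epochs = 100

# Each epoch processes every sample once, in batches of 32 (the last batch is smaller)
steps_per_epoch = math.ceil(num_train_samples / num_batch_size)
last_batch_size = num_train_samples - (steps_per_epoch - 1) * num_batch_size
total_updates = steps_per_epoch * num_epochs

print(steps_per_epoch)  # 219
print(last_batch_size)  # 9
print(total_updates)    # 21900
```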

In [5]:
from keras.callbacks import ModelCheckpoint 
from datetime import datetime 

num_epochs = 100
num_batch_size = 32

checkpointer = ModelCheckpoint(filepath='saved_models/weights.best.basic_mlp.hdf5', 
                               verbose=1, save_best_only=True)
start = datetime.now()

model.fit(x_train, y_train, batch_size=num_batch_size, epochs=num_epochs, validation_data=(x_test, y_test), callbacks=[checkpointer], verbose=1)


duration = datetime.now() - start
print("Training completed in time: ", duration)

Train on 6985 samples, validate on 1747 samples
Epoch 1/100

Epoch 00001: val_loss improved from inf to 2.18623, saving model to saved_models/weights.best.basic_mlp.hdf5
Epoch 2/100

Epoch 00002: val_loss improved from 2.18623 to 2.00029, saving model to saved_models/weights.best.basic_mlp.hdf5
Epoch 3/100

Epoch 00003: val_loss improved from 2.00029 to 1.83673, saving model to saved_models/weights.best.basic_mlp.hdf5
Epoch 4/100

Epoch 00004: val_loss improved from 1.83673 to 1.66261, saving model to saved_models/weights.best.basic_mlp.hdf5
Epoch 5/100

Epoch 00005: val_loss improved from 1.66261 to 1.56322, saving model to saved_models/weights.best.basic_mlp.hdf5
Epoch 6/100

Epoch 00006: val_loss improved from 1.56322 to 1.44242, saving model to saved_models/weights.best.basic_mlp.hdf5
Epoch 7/100

Epoch 00007: val_loss improved from 1.44242 to 1.29420, saving model to saved_models/weights.best.basic_mlp.hdf5
Epoch 8/100

Epoch 00008: val_loss improved from 1.29420 to 1.21578, savin


Epoch 00033: val_loss improved from 0.58846 to 0.58490, saving model to saved_models/weights.best.basic_mlp.hdf5
Epoch 34/100

Epoch 00034: val_loss improved from 0.58490 to 0.55106, saving model to saved_models/weights.best.basic_mlp.hdf5
Epoch 35/100

Epoch 00035: val_loss did not improve from 0.55106
Epoch 36/100

Epoch 00036: val_loss improved from 0.55106 to 0.54734, saving model to saved_models/weights.best.basic_mlp.hdf5
Epoch 37/100

Epoch 00037: val_loss improved from 0.54734 to 0.54116, saving model to saved_models/weights.best.basic_mlp.hdf5
Epoch 38/100

Epoch 00038: val_loss did not improve from 0.54116
Epoch 39/100

Epoch 00039: val_loss did not improve from 0.54116
Epoch 40/100

Epoch 00040: val_loss did not improve from 0.54116
Epoch 41/100

Epoch 00041: val_loss improved from 0.54116 to 0.53371, saving model to saved_models/weights.best.basic_mlp.hdf5
Epoch 42/100

Epoch 00042: val_loss improved from 0.53371 to 0.52995, saving model to saved_models/weights.best.basic_


Epoch 00070: val_loss did not improve from 0.45448
Epoch 71/100

Epoch 00071: val_loss did not improve from 0.45448
Epoch 72/100

Epoch 00072: val_loss improved from 0.45448 to 0.44828, saving model to saved_models/weights.best.basic_mlp.hdf5
Epoch 73/100

Epoch 00073: val_loss improved from 0.44828 to 0.43458, saving model to saved_models/weights.best.basic_mlp.hdf5
Epoch 74/100

Epoch 00074: val_loss did not improve from 0.43458
Epoch 75/100

Epoch 00075: val_loss did not improve from 0.43458
Epoch 76/100

Epoch 00076: val_loss did not improve from 0.43458
Epoch 77/100

Epoch 00077: val_loss did not improve from 0.43458
Epoch 78/100

Epoch 00078: val_loss did not improve from 0.43458
Epoch 79/100

Epoch 00079: val_loss did not improve from 0.43458
Epoch 80/100

Epoch 00080: val_loss improved from 0.43458 to 0.43117, saving model to saved_models/weights.best.basic_mlp.hdf5
Epoch 81/100

Epoch 00081: val_loss improved from 0.43117 to 0.42347, saving model to saved_models/weights.best.
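
The "improved" and "did not improve" messages in the log come from `save_best_only=True`: weights are written only when the validation loss beats the best value seen so far. The pure-Python sketch below mimics that bookkeeping. The loss values are made up for illustration, and the `best_checkpoints` helper is hypothetical, not part of Keras.

```python
def best_checkpoints(val_losses):
    """Return (epoch, val_loss) pairs at which a best-only checkpoint would be written."""
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best validation loss: this is where the weights would be saved
            best = loss
            saved.append((epoch, loss))
    return saved

# Made-up validation losses; epochs 3 and 5 do not improve on the best so far
val_losses = [2.19, 2.00, 2.05, 1.84, 1.90, 1.66]
saved = best_checkpoints(val_losses)
print(saved)  # [(1, 2.19), (2, 2.0), (4, 1.84), (6, 1.66)]
```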

### Test the model 

Here we will review the accuracy of the model on both the training and test data sets. 

In [6]:
# Evaluating the model on the training and testing set
score = model.evaluate(x_train, y_train, verbose=0)
print("Training Accuracy: ", score[1])

score = model.evaluate(x_test, y_test, verbose=0)
print("Testing Accuracy: ", score[1])

Training Accuracy:  0.9317107796669006
Testing Accuracy:  0.8826559782028198


The initial training and testing accuracy scores are quite high (about 93% and 88% respectively). The gap between them is small (about 5 percentage points), which suggests that the model has not suffered badly from overfitting. 

### Predictions  

Here we will build a method which will allow us to test the model's predictions on a specified audio .wav file. 

In [7]:
import librosa 
import numpy as np 

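def extract_feature(file_name):
   
    try:
        # Load the audio and compute 40 MFCCs per time frame
        audio_data, sample_rate = librosa.load(file_name, res_type='kaiser_fast') 
        mfccs = librosa.feature.mfcc(y=audio_data, sr=sample_rate, n_mfcc=40)
        # Average each coefficient over time to get a single 40-value vector
        mfccsscaled = np.mean(mfccs.T, axis=0)
        
    except Exception as e:
        print("Error encountered while parsing file: ", file_name, e)
        return None

    # Shape (1, 40): a batch containing one sample
    return np.array([mfccsscaled])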




In [8]:
def print_prediction(file_name):
    prediction_feature = extract_feature(file_name) 

    # predict returns one row of class probabilities per input sample
    predicted_proba = model.predict(prediction_feature)[0]

    # The predicted class is the one with the highest probability
    predicted_class = le.inverse_transform([np.argmax(predicted_proba)]) 
    print("The predicted class is:", predicted_class[0], '\n') 

    for i in range(len(predicted_proba)): 
        category = le.inverse_transform(np.array([i]))
        print(category[0], "\t\t : ", format(predicted_proba[i], '.32f') )

### Validation 

#### Test with sample data 

As an initial sanity check, we verify the predictions using a subset of the sample audio files we explored in the first notebook. We expect the bulk of these to be classified correctly. 

In [9]:
# Class: Air Conditioner

filename = '../UrbanSound Dataset sample/audio/100852-0-0-0.wav' 
print_prediction(filename) 

The predicted class is: air_conditioner 

air_conditioner 		 :  0.99999964237213134765625000000000
car_horn 		 :  0.00000011712788960949183092452586
children_playing 		 :  0.00000002885492378368326171766967
dog_bark 		 :  0.00000003877984511291288072243333
drilling 		 :  0.00000005908338351900965790264308
engine_idling 		 :  0.00000005422520388265184010379016
gun_shot 		 :  0.00000000267914401774760335683823
jackhammer 		 :  0.00000001554728257247006695251912
siren 		 :  0.00000000094775920445044903317466
street_music 		 :  0.00000014979642060097830835729837


In [10]:
# Class: Drilling

filename = '../UrbanSound Dataset sample/audio/103199-4-0-0.wav'
print_prediction(filename) 

The predicted class is: drilling 

air_conditioner 		 :  0.00000000008224136649470636939441
car_horn 		 :  0.00000214497208617103751748800278
children_playing 		 :  0.00009362694981973618268966674805
dog_bark 		 :  0.00001217670069308951497077941895
drilling 		 :  0.97418028116226196289062500000000
engine_idling 		 :  0.00000000075200978777445470768725
gun_shot 		 :  0.00000000740195993387260386953130
jackhammer 		 :  0.00000000894259688521970019792207
siren 		 :  0.00000000957716750349391077179462
street_music 		 :  0.02571182325482368469238281250000


In [11]:
# Class: Street music 

filename = '../UrbanSound Dataset sample/audio/101848-9-0-0.wav'
print_prediction(filename) 

The predicted class is: street_music 

air_conditioner 		 :  0.09999072551727294921875000000000
car_horn 		 :  0.00305506144650280475616455078125
children_playing 		 :  0.09950152784585952758789062500000
dog_bark 		 :  0.02582867257297039031982421875000
drilling 		 :  0.00509325042366981506347656250000
engine_idling 		 :  0.00916280318051576614379882812500
gun_shot 		 :  0.00549275847151875495910644531250
jackhammer 		 :  0.03270008042454719543457031250000
siren 		 :  0.00361734302714467048645019531250
street_music 		 :  0.71555775403976440429687500000000


In [12]:
# Class: Car Horn 

filename = '../UrbanSound Dataset sample/audio/100648-1-0-0.wav'
print_prediction(filename) 

The predicted class is: car_horn 

air_conditioner 		 :  0.00188611494377255439758300781250
car_horn 		 :  0.68632853031158447265625000000000
children_playing 		 :  0.01224335655570030212402343750000
dog_bark 		 :  0.16461659967899322509765625000000
drilling 		 :  0.05645351111888885498046875000000
engine_idling 		 :  0.00212736334651708602905273437500
gun_shot 		 :  0.00211420282721519470214843750000
jackhammer 		 :  0.00372551172040402889251708984375
siren 		 :  0.00587591761723160743713378906250
street_music 		 :  0.06462877988815307617187500000000


#### Observations 

From this brief sanity check the model seems to predict well. In the run shown above, all four samples are classified correctly. 

The least confident prediction is the car horn (about 69%), where dog bark is the runner-up at about 16%. This is consistent with our earlier observation that a dog bark and a car horn are similar in spectral shape. 
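
To read the runner-up classes off a prediction, the probabilities can be sorted in descending order. The sketch below does this for the car horn output shown above, with the values rounded to four decimal places. The variable names are hypothetical.

```python
class_names = ["air_conditioner", "car_horn", "children_playing", "dog_bark",
               "drilling", "engine_idling", "gun_shot", "jackhammer",
               "siren", "street_music"]

# Car horn probabilities from the output above, rounded to four decimals
probabilities = [0.0019, 0.6863, 0.0122, 0.1646, 0.0565,
                 0.0021, 0.0021, 0.0037, 0.0059, 0.0646]

# Sort (class, probability) pairs from most to least likely
ranked = sorted(zip(class_names, probabilities), key=lambda pair: pair[1], reverse=True)
top_three = ranked[:3]

for name, p in top_three:
    print(f"{name:<18}{p:.2%}")
```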

### Other audio

Here we will use a sample of various copyright-free sounds that were not part of either our test or training data to further validate our model. 

In [11]:
filename = '../Evaluation audio/dog_bark_1.wav'
print_prediction(filename) 

The predicted class is: dog_bark 

air_conditioner 		 :  0.00041168121970258653163909912109
car_horn 		 :  0.00089477357687428593635559082031
children_playing 		 :  0.09841609746217727661132812500000
dog_bark 		 :  0.62324690818786621093750000000000
drilling 		 :  0.00877229031175374984741210937500
engine_idling 		 :  0.00002467866761435288935899734497
gun_shot 		 :  0.03237913921475410461425781250000
jackhammer 		 :  0.00010647259477991610765457153320
siren 		 :  0.00025971565628424286842346191406
street_music 		 :  0.23548822104930877685546875000000


In [12]:
filename = '../Evaluation audio/drilling_1.wav'

print_prediction(filename) 

The predicted class is: drilling 

air_conditioner 		 :  0.03187213838100433349609375000000
car_horn 		 :  0.00031004374613985419273376464844
children_playing 		 :  0.00008082303247647359967231750488
dog_bark 		 :  0.00045101894647814333438873291016
drilling 		 :  0.91103124618530273437500000000000
engine_idling 		 :  0.00066664698533713817596435546875
gun_shot 		 :  0.00014731122064404189586639404297
jackhammer 		 :  0.05533346161246299743652343750000
siren 		 :  0.00000345394050782488193362951279
street_music 		 :  0.00010387785005150362849235534668


In [13]:
filename = '../Evaluation audio/gun_shot_1.wav'

print_prediction(filename) 

# sample data weighted towards gun shot - peak in the dog barking sample is similar in shape to the gun shot sample

The predicted class is: dog_bark 

air_conditioner 		 :  0.15668992698192596435546875000000
car_horn 		 :  0.00028948130784556269645690917969
children_playing 		 :  0.00210997764952480792999267578125
dog_bark 		 :  0.54222160577774047851562500000000
drilling 		 :  0.00530693493783473968505859375000
engine_idling 		 :  0.01762679778039455413818359375000
gun_shot 		 :  0.00704973144456744194030761718750
jackhammer 		 :  0.00019645661814138293266296386719
siren 		 :  0.00757358316332101821899414062500
street_music 		 :  0.26093551516532897949218750000000


In [14]:
filename = '../Evaluation audio/siren_1.wav'

print_prediction(filename) 

The predicted class is: siren 

air_conditioner 		 :  0.00025223082047887146472930908203
car_horn 		 :  0.00478093931451439857482910156250
children_playing 		 :  0.00389584968797862529754638671875
dog_bark 		 :  0.08564702421426773071289062500000
drilling 		 :  0.00010684659355320036411285400391
engine_idling 		 :  0.19492025673389434814453125000000
gun_shot 		 :  0.00535361655056476593017578125000
jackhammer 		 :  0.00035517592914402484893798828125
siren 		 :  0.67424273490905761718750000000000
street_music 		 :  0.03044527582824230194091796875000


#### Observations 

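The performance of our initial model is satisfactory: three of the four new recordings were classified correctly, which suggests it generalises reasonably well to new audio data. The exception is the gun shot sample, which was predicted as a dog bark (about 54%), with `gun_shot` receiving under 1%. 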

### *In the next notebook we will refine our model*