# Classifying Urban sounds using Deep Learning

## 3 Model Training and Evaluation 

### Load Preprocessed data 

In [24]:
# retrieve the preprocessed data from previous notebook
import os
import numpy as np
from random import shuffle

np.random.seed(1)

path = '../../panotti/Preproc/'
train_path = os.path.join(path, 'Train')
test_path = os.path.join(path, 'Test')
class_names = sorted(os.listdir(train_path))

In [25]:
def load_data(data_dir):
    class_names = sorted(os.listdir(data_dir))
    nb_classes = len(class_names)
    print("class_names = ",class_names)

    for (dirpath, dirnames, filenames) in os.walk(os.path.join(data_dir, class_names[0])):
        with np.load(os.path.join(data_dir, class_names[0], filenames[0])) as sample_file:
            mel_dims = sample_file['melgram'].shape

    total_load = 0
    for classname in class_names:
        files = os.listdir(os.path.join(data_dir, classname))
        n_files = len(files)
        total_load += n_files

    print(" melgram dimensions: ",mel_dims)
    X = np.zeros((total_load, mel_dims[1], mel_dims[2], mel_dims[3]))
    Y = np.zeros((total_load, nb_classes))
    paths = []

    load_count = 0
    num_classes = len(class_names)
    label_smoothing = 0.005

    for idx, classname in enumerate(class_names):

        idx = class_names.index(classname)
        vec = np.zeros(num_classes)
        vec[idx] = 1
        vec = vec * (1 - label_smoothing) + label_smoothing / num_classes

        this_Y = np.array(vec)
        this_Y = this_Y[np.newaxis,:]
        file_list = os.listdir(os.path.join(data_dir, classname))
        shuffle(file_list)  # just to remove any special ordering

        for _, infilename in enumerate(file_list):   # Load files in a particular class
            audio_path = os.path.join(data_dir, classname, infilename)

            with np.load(audio_path) as data:
                melgram = data['melgram']
            if melgram.shape != mel_dims:
                raise ValueError('Dimension mismatch: expected {}, got {}'.format(mel_dims, melgram.shape))

            # usually it's the 2nd dimension of melgram.shape that is affected by audio file length
            X[load_count,:,:] = melgram[:,:,:]
            #X[load_count,:,:] = melgram
            Y[load_count,:] = this_Y
            paths.append(audio_path)
            load_count += 1
        print('Successfully processed {} files for class {}'
              .format(len(file_list), classname))


    assert (X.shape[0] == Y.shape[0] )
    #print("shuffle_XY_paths: Y.shape[0], len(paths) = ",Y.shape[0], len(paths))
    idx = np.array(range(Y.shape[0]))
    np.random.shuffle(idx)
    newX = np.copy(X)
    newY = np.copy(Y)
    for i in range(len(idx)):
        newX[i] = X[idx[i],:,:]
        newY[i] = Y[idx[i],:]

    return newX, newY

In [26]:

X_train, Y_train = load_data(train_path)
X_test, Y_test = load_data(test_path)

class_names =  ['bed', 'bird', 'cat', 'dog', 'down', 'eight', 'five', 'four', 'go', 'happy', 'house', 'left', 'marvin', 'nine', 'no', 'off', 'on', 'one', 'right', 'seven', 'sheila', 'six', 'stop', 'three', 'tree', 'two', 'up', 'wow', 'yes', 'zero']
 melgram dimensions:  (1, 96, 87, 1)
Successfully processed 1368 files for class bed
Successfully processed 1384 files for class bird
Successfully processed 1386 files for class cat
Successfully processed 1396 files for class dog
Successfully processed 1887 files for class down
Successfully processed 1881 files for class eight
Successfully processed 1885 files for class five
Successfully processed 1897 files for class four
Successfully processed 1897 files for class go
Successfully processed 1393 files for class happy
Successfully processed 1400 files for class house
Successfully processed 1882 files for class left
Successfully processed 1396 files for class marvin
Successfully processed 1891 files for class nine
Successfully processed 1900 
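
`load_data` does not store hard one-hot targets: each label vector is smoothed with `label_smoothing = 0.005`. The following is a minimal sketch of that single step in plain Python. The helper name `smoothed_one_hot` is introduced here for illustration and is not part of the notebook.

```python
def smoothed_one_hot(index, num_classes, label_smoothing=0.005):
    """Return a one-hot vector with label smoothing applied.

    Mirrors the expression used in load_data:
    vec * (1 - label_smoothing) + label_smoothing / num_classes
    """
    vec = [0.0] * num_classes
    vec[index] = 1.0
    return [v * (1 - label_smoothing) + label_smoothing / num_classes for v in vec]


# Index 14 is 'no' in the sorted list of 30 class names.
target = smoothed_one_hot(index=14, num_classes=30)
print(round(target[14], 6))   # weight on the true class
print(round(target[0], 6))    # weight on every other class
print(round(sum(target), 6))  # the vector still sums to 1
```

The true class keeps a weight of roughly 0.995 and the remainder is spread evenly over all classes, so the position of the largest entry is unchanged.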

### Initial model architecture - CNN

We will construct a Convolutional Neural Network (CNN) using Keras with a TensorFlow backend.

We use a `Sequential` model so we can build the network layer by layer.

The first layer receives the input shape. Each sample is a mel-spectrogram with 96 mel bands, 87 time frames and 1 channel, so the input shape is (96, 87, 1).

The network has four convolutional blocks (`nb_layers = 4`). Each block is a `Conv2D` layer with 32 filters and a 3x3 kernel (`padding='same'`) followed by 2x2 max pooling. The first block uses the `ReLU` (Rectified Linear Unit) activation followed by batch normalisation; the remaining three blocks use the `ELU` activation.

We apply a `Dropout` value of 50% after each of the last three convolutional blocks and 60% after the dense layer. Dropout randomly excludes nodes from each update cycle, which helps the network generalise and makes it less likely to overfit the training data.

The feature maps are then flattened and passed to a `Dense` layer with 128 nodes and `ELU` activation.

Our output layer has 30 nodes, one per entry in `class_names`. The activation for our output layer is `softmax`, which makes the outputs sum to 1 so they can be interpreted as probabilities. The model then makes its prediction based on which class has the highest probability.

In [8]:
from keras import backend as K
from keras.models import Sequential,  load_model, save_model
from keras.layers import Input, Dense, Dropout, Activation
from keras.layers import Convolution2D, MaxPooling2D, Flatten, Conv2D
from keras.layers import BatchNormalization

nb_layers=4
# Here's where one might 'swap out' different neural network 'model' choices
K.set_image_data_format('channels_last')                   # SHH changed on 3/1/2018 b/c tensorflow prefers channels_last

nb_filters = 32  # number of convolutional filters = "feature maps"
kernel_size = (3, 3)  # convolution kernel size
pool_size = (2, 2)  # size of pooling area for max pooling
cl_dropout = 0.5    # conv. layer dropout
dl_dropout = 0.6    # dense layer dropout
X_shape = X_train.shape

print(" CNN: X_shape = ",X_shape,", channels = ",X_shape[3])
input_shape = (X_shape[1], X_shape[2], X_shape[3])
model = Sequential()
model.add(Conv2D(nb_filters, kernel_size, padding='same', input_shape=input_shape, name="Input"))
model.add(MaxPooling2D(pool_size=pool_size))
model.add(Activation('relu'))        # Leave this relu & BN here.  ELU is not good here (my experience)
model.add(BatchNormalization(axis=-1))  # axis=1 for 'channels_first'; but tensorflow prefers channels_last (axis=-1)

for layer in range(nb_layers-1):   # add more layers than just the first
    model.add(Conv2D(nb_filters, kernel_size, padding='same'))
    model.add(MaxPooling2D(pool_size=pool_size))
    model.add(Activation('elu'))
    model.add(Dropout(cl_dropout))

model.add(Flatten())
model.add(Dense(128))            # 128 is 'arbitrary' for now
#model.add(Activation('relu'))   # relu (no BN) works ok here, however ELU works a bit better...
model.add(Activation('elu'))
model.add(Dropout(dl_dropout))
model.add(Dense(len(class_names)))
model.add(Activation("softmax",name="Output"))



 CNN: X_shape =  (51762, 96, 87, 1) , channels =  1
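
The effect of the four `Conv2D` + `MaxPooling2D` blocks on the feature-map size can be checked by hand. The sketch below assumes what the model cell specifies: `padding='same'` convolutions (which keep height and width) and 2x2 max pooling with its default behaviour of dropping any odd remainder. The helper `pooled_shape` is hypothetical and only does the arithmetic.

```python
def pooled_shape(height, width, n_blocks, pool=2):
    """Track (height, width) through n_blocks of Conv2D('same') + MaxPooling2D.

    'same' convolutions keep the spatial size; each pooling layer
    floor-divides both dimensions by the pool size.
    """
    shapes = [(height, width)]
    for _ in range(n_blocks):
        height, width = height // pool, width // pool
        shapes.append((height, width))
    return shapes


nb_filters = 32
shapes = pooled_shape(96, 87, n_blocks=4)
print(shapes)

# Number of values that Flatten() hands to the Dense(128) layer
flat_features = shapes[-1][0] * shapes[-1][1] * nb_filters
print("inputs to Dense(128):", flat_features)
```

The first two pooled sizes, (48, 43) and (24, 21), agree with the output shapes Keras reports in the model summary.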


### Compiling the model 

For compiling our model, we will use the following three parameters: 

* Loss function - we will use `categorical_crossentropy`, a standard choice for multi-class classification with one-hot (or smoothed) targets. A lower score indicates that the model is performing better.

* Metrics - we will use the `accuracy` metric, which lets us view the accuracy score on the training and validation data while we train the model.

* Optimizer - here we use `adadelta`, an optimizer that adapts the learning rate for each parameter during training.
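
To make the loss concrete, here is a small self-contained sketch of softmax followed by categorical cross-entropy. It is written in plain Python for illustration and is not the Keras implementation; the logits are made up, and `eps` is an assumed guard against taking the logarithm of zero.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(target, predicted, eps=1e-7):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))


# Toy 3-class example: the true class is index 1.
target = [0.0, 1.0, 0.0]
confident = softmax([0.5, 4.0, 0.1])   # most of the mass on the true class
unsure = softmax([1.0, 1.2, 0.9])      # mass spread across all classes

loss_confident = categorical_crossentropy(target, confident)
loss_unsure = categorical_crossentropy(target, unsure)
print(loss_confident, loss_unsure)
```

The confident prediction yields the lower loss, which is the sense in which a lower score means a better model.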


In [9]:
# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adadelta', metrics=['accuracy'])
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
Input (Conv2D)               (None, 96, 87, 32)        320       
_________________________________________________________________
max_pooling2d_5 (MaxPooling2 (None, 48, 43, 32)        0         
_________________________________________________________________
activation_6 (Activation)    (None, 48, 43, 32)        0         
_________________________________________________________________
batch_normalization_2 (Batch (None, 48, 43, 32)        128       
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 48, 43, 32)        9248      
_________________________________________________________________
max_pooling2d_6 (MaxPooling2 (None, 24, 21, 32)        0         
_________________________________________________________________
activation_7 (Activation)    (None, 24, 21, 32)       
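
The `Param #` column of the summary can be reproduced with a little arithmetic. The helpers below are hypothetical names introduced for this sketch; they use the usual counting rules for a convolution layer with bias and for batch normalisation.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return (kernel_h * kernel_w * in_channels + 1) * filters


def batchnorm_params(channels):
    """gamma, beta, moving mean and moving variance for each channel."""
    return 4 * channels


first_conv = conv2d_params(3, 3, 1, 32)    # single-channel spectrogram input
later_conv = conv2d_params(3, 3, 32, 32)   # 32 feature maps in, 32 out
batch_norm = batchnorm_params(32)
print(first_conv, later_conv, batch_norm)
```

These values match the 320, 9248 and 128 parameters listed for `Input`, `conv2d_4` and `batch_normalization_2`.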

In [10]:
from keras.callbacks import ModelCheckpoint 
from keras.models import load_model

# Hold out the last 20% of the training data for validation
val_split = 0.2
split_index = int(X_train.shape[0]*(1-val_split))
X_val_data, Y_val_data = X_train[split_index:], Y_train[split_index:]
X_train_data, Y_train_data = X_train[:split_index-1], Y_train[:split_index-1]
weights_file='weights.hdf5'

if os.path.isfile(weights_file):
    loaded_model = load_model(weights_file)   # resume from the previously saved checkpoint
    model.set_weights( loaded_model.get_weights() )  
    print('Loading Weights from file {}'.format(weights_file))

checkpointer = ModelCheckpoint(filepath=weights_file, 
                               verbose=1, save_best_only=True)

model.fit(X_train_data, Y_train_data, batch_size=32, epochs=2, shuffle=True,  callbacks=[checkpointer],
              verbose=1, validation_data=(X_val_data, Y_val_data))



Loading Weights from file weights.hdf5
Train on 41408 samples, validate on 10353 samples
Epoch 1/2

Epoch 00001: val_loss improved from inf to 0.46442, saving model to weights.hdf5
Epoch 2/2

Epoch 00002: val_loss did not improve from 0.46442


<keras.callbacks.callbacks.History at 0x7fb5c77d0c88>

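### Training 

The cell above trains the model. 

We train for 2 epochs, where an epoch is one full pass of the model through the training data. The last 20% of the training set is held out for validation, and `ModelCheckpoint` saves the model to `weights.hdf5` whenever the validation loss improves. Because previously saved weights are loaded before training, each run continues from the last checkpoint. Validation loss typically improves for a number of epochs and then plateaus; in the run above it did not improve in the second epoch. 

We use a small batch size of 32, as a large batch size can reduce the generalisation ability of the model. 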
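
The sample counts in the training log follow from the hold-out split and the batch size. The sketch below redoes that arithmetic; `holdout_split` is a hypothetical helper, and the sample count and slicing are taken from the notebook's own cells.

```python
import math


def holdout_split(n_samples, val_split):
    """Return (split_index, n_val) when the tail of the data is kept for validation."""
    split_index = int(n_samples * (1 - val_split))
    return split_index, n_samples - split_index


n_samples = 51762  # X_train.shape[0] as printed by the model cell
split_index, n_val = holdout_split(n_samples, val_split=0.2)

# The training cell slices X_train[:split_index-1], so one sample is left unused.
n_train = split_index - 1
batches_per_epoch = math.ceil(n_train / 32)

print(split_index, n_train, n_val, batches_per_epoch)
```

The resulting training and validation sizes agree with the "Train on 41408 samples, validate on 10353 samples" line in the log.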

### Test the model 

Here we will review the accuracy of the model on both the training and test data sets. 

In [11]:
# Evaluating the model on the training and testing set
score = model.evaluate(X_train, Y_train, verbose=0)
print("Training Accuracy: ", score[1])

score = model.evaluate(X_test, Y_test, verbose=0)
print("Testing Accuracy: ", score[1])

Training Accuracy:  0.8819790482521057
Testing Accuracy:  0.8679478168487549


The initial training and testing accuracy scores are reasonably high (about 88.2% and 86.8%). The gap between them is small (about 1.4 percentage points), which suggests that the model has not overfitted the training data. 
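
As far as the metric is concerned, accuracy for categorical targets compares the position of the largest predicted probability with the position of the largest target value, so the lightly smoothed labels used here should score the same as hard one-hot labels. A minimal sketch with invented toy data (the helpers `argmax` and `accuracy` are illustrative, not Keras functions):

```python
def argmax(values):
    """Index of the largest value in a sequence."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(targets, predictions):
    """Fraction of samples whose most probable class matches the target's."""
    hits = sum(argmax(t) == argmax(p) for t, p in zip(targets, predictions))
    return hits / len(targets)


# Toy data: three samples, three classes, smoothed targets.
targets = [[0.99, 0.005, 0.005], [0.005, 0.99, 0.005], [0.005, 0.005, 0.99]]
predictions = [[0.7, 0.2, 0.1], [0.4, 0.5, 0.1], [0.6, 0.3, 0.1]]
print(accuracy(targets, predictions))

# Gap between the reported training and testing accuracy, in percentage points.
gap = (0.8819790482521057 - 0.8679478168487549) * 100
print(round(gap, 2))
```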

### Predictions  

Here we will build a method which will allow us to test the model's predictions on a specified audio .wav file. 

In [12]:
from sklearn.preprocessing import LabelEncoder
from keras.utils import to_categorical

# Convert the class names into a numpy array
y = np.array(class_names)

# Encode the classification labels
le = LabelEncoder()
yy = to_categorical(le.fit_transform(y)) 

In [13]:
import librosa 
import numpy as np 

def extract_melgram(file_name):
    # Load at 44.1 kHz, keeping all channels if the file is not mono
    signal, sr = librosa.load(file_name, mono=False, sr=44100)
    if len(signal.shape) == 1:
        signal = np.reshape(signal, (1, signal.shape[0]))
    # Mel-spectrogram of the first channel in dB, shaped (1, n_mels, frames, 1) to match the model input
    melgram = librosa.amplitude_to_db(librosa.feature.melspectrogram(y=signal[0], sr=sr, n_mels=96))[np.newaxis,:,:,np.newaxis] 
    melgram = melgram.astype(np.float16)
    return melgram
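
The model expects inputs of shape (96, 87, 1), so it helps to see where the 87 time frames come from. The sketch below assumes one-second clips and librosa's default settings for `melspectrogram` (a hop length of 512 samples with centred frames), which `extract_melgram` does not override. `n_frames` is a hypothetical helper.

```python
def n_frames(n_samples, hop_length=512):
    """Frames produced by a centred STFT: one per full hop, plus the initial frame."""
    return 1 + n_samples // hop_length


sr = 44100            # sample rate passed to librosa.load in extract_melgram
clip_seconds = 1.0    # assumption: one-second clips
frames = n_frames(int(sr * clip_seconds))
print((1, 96, frames, 1))  # batch, n_mels, frames, channels
```

A clip of a different length would produce a different number of frames and would need padding or trimming before being passed to the model.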


In [14]:
def print_prediction(file_name):
    prediction_feature = extract_melgram(file_name) 

    # predict() returns the softmax output, one row per input sample
    predicted_proba = model.predict(prediction_feature, verbose=0)[0]
    # Index of the largest probability mapped back to its class name
    predicted_class = le.inverse_transform([np.argmax(predicted_proba)]) 

    # Print the probability assigned to every class
    for i in range(len(predicted_proba)): 
        category = le.inverse_transform(np.array([i]))
        print(category[0], "\t\t : ", format(predicted_proba[i], '.32f') )
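
A listing of 30 probabilities is hard to scan, so a short ranked view can be useful. The sketch below is a hypothetical helper, `top_k`, that works on any list of probabilities and labels; the sample values are a rounded subset of the output for the `no` file shown later in this notebook.

```python
def top_k(probabilities, labels, k=3):
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(zip(labels, probabilities), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Rounded subset of the per-class output for the 'no' sample.
labels = ['dog', 'down', 'go', 'house', 'no']
probabilities = [0.0000166, 0.0056916, 0.2275096, 0.0001591, 0.7659060]

for label, p in top_k(probabilities, labels):
    print("{:<8s} {:.4f}".format(label, p))
```

Inside `print_prediction`, the same helper could be applied to the model's probability vector together with `le.classes_`.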

### Validation 

#### Test with sample data 

Initial sanity check to verify the predictions using a subset of the sample audio files we explored in the first notebook. We expect the bulk of these to be classified correctly. 

In [15]:
# Class: no

filename = '../../googleData' + '/no/14_6c968bd9_nohash_2.wav'
print_prediction(filename) 

bed 		 :  0.00000004052580848679099290166050
bird 		 :  0.00000000679806300141194697062019
cat 		 :  0.00000001439836516681225475622341
dog 		 :  0.00001661814894760027527809143066
down 		 :  0.00569163775071501731872558593750
eight 		 :  0.00000000772307551244466594653204
five 		 :  0.00000000219587259486786479101283
four 		 :  0.00000010190364463369405712001026
go 		 :  0.22750957310199737548828125000000
happy 		 :  0.00000000720958093225476659426931
house 		 :  0.00015906471526250243186950683594
left 		 :  0.00000000133494137966039261300466
marvin 		 :  0.00000006934050844620287534780800
nine 		 :  0.00000939214714890113100409507751
no 		 :  0.76590603590011596679687500000000
off 		 :  0.00000001844883712465161806903780
on 		 :  0.00000001816302130919211776927114
one 		 :  0.00000012959999651229736628010869
right 		 :  0.00000001137163607722868619021028
seven 		 :  0.00000001087525713927561810123734
sheila 		 :  0.00000016204155883769999491050839
six 		 :  0.000000000042555434870417

In [18]:
# Class: right

filename = '../../googleData' + '/right/18_3411cf4b_nohash_0.wav'
print_prediction(filename) 

bed 		 :  0.00000000020194251826310960495903
bird 		 :  0.00000000030625743607792799139133
cat 		 :  0.00000000000002218492621867355213
dog 		 :  0.00000000000340034424228807807822
down 		 :  0.00000000000001644698950870136095
eight 		 :  0.00000000947302858378407108830288
five 		 :  0.00000050353168035144335590302944
four 		 :  0.00000000000704127728345937953236
go 		 :  0.00000000002691116426922768312124
happy 		 :  0.00000000000085788743730413896671
house 		 :  0.00000000039158967735097860440874
left 		 :  0.00000019342060397775640012696385
marvin 		 :  0.00000000025054236463262213874259
nine 		 :  0.00000090260050455981399863958359
no 		 :  0.00000000003136934115244294218883
off 		 :  0.00000000000012417260416511949339
on 		 :  0.00000000000309874261666953643157
one 		 :  0.00000027808297886622312944382429
right 		 :  0.99998676776885986328125000000000
seven 		 :  0.00000000000024133662733169525261
sheila 		 :  0.00000000000014602113463692278916
six 		 :  0.000000003770439960248950

In [21]:
# Class: three

filename = '../../googleData2/three/23_38d78313_nohash_2.wav'
print_prediction(filename) 

bed 		 :  0.00000000129506505519572101547965
bird 		 :  0.00000015873901304530591005459428
cat 		 :  0.00000000000703198653820291674776
dog 		 :  0.00000000003155511615893225041418
down 		 :  0.00000000005554279952635354788981
eight 		 :  0.00017474179912824183702468872070
five 		 :  0.00000039951615349309577140957117
four 		 :  0.00000026758473836707707960158587
go 		 :  0.00000000062846861048626578849507
happy 		 :  0.00000024444116775157453957945108
house 		 :  0.00000000154519030903799148291000
left 		 :  0.00000000021475429767825460203312
marvin 		 :  0.00000000352613094278808603121433
nine 		 :  0.00000319332457365817390382289886
no 		 :  0.00000000023936472248742290958035
off 		 :  0.00000000004671573544667850796941
on 		 :  0.00000043983840214423253200948238
one 		 :  0.00000025243829782084503676742315
right 		 :  0.00000413649104302749037742614746
seven 		 :  0.00000000091512836197793490100594
sheila 		 :  0.00000000920826259687146375654265
six 		 :  0.000000020265105149519513

In [23]:
# Class: tree

filename = '../../googleData2' + '/tree/24_07363607_nohash_0.wav'
print_prediction(filename) 

bed 		 :  0.00000113291150682925945147871971
bird 		 :  0.00000210597454497474245727062225
cat 		 :  0.00000027005157221537956502288580
dog 		 :  0.00000000285830048518675994273508
down 		 :  0.00000016647804557123890845105052
eight 		 :  0.00528699019923806190490722656250
five 		 :  0.00000028968324272682366427034140
four 		 :  0.00000354614871866942849010229111
go 		 :  0.00000092270693130558356642723083
happy 		 :  0.00001443670316803036257624626160
house 		 :  0.00000026161319510720204561948776
left 		 :  0.00000000640821795627743995282799
marvin 		 :  0.00000010081701162789613590575755
nine 		 :  0.00000373409511666977778077125549
no 		 :  0.00000010192528065999795217067003
off 		 :  0.00000001023202589323091160622425
on 		 :  0.00000080969545024345279671251774
one 		 :  0.00000025621864097047364339232445
right 		 :  0.00000016550166037632152438163757
seven 		 :  0.00000018194423034856299636885524
sheila 		 :  0.00000962884769251104444265365601
six 		 :  0.000010148657565878238528

#### Observations 

From this brief sanity check the model seems to predict well. In the outputs shown above, the `no` sample is assigned to `no` with a probability of about 0.77, and the `right` sample is assigned to `right` with a probability above 0.9999. 

The main competitor for the `no` sample is `go`, at about 0.23. The two words rhyme, so some confusion between them is plausible. 

### *In the next notebook we will refine our model*