# Mel Spectrogram CNN Classifier Demo

This is a demo on constructing a CNN classifier to classify speech as one of two words (zero or one) based on a training set of Mel spectra (images) obtained from speech. We shall be using the same Librosa library as we did in other demos to obtain the Mel spectra from wav files.

Our dataset will consist of two folders, one with 1000 speech samples of the word 'one', and another with 1000 samples of the word 'zero'. The first thing we have to do is pre-process the wav files to generate spectrographic data for all of them. We will be using the same code as in the Mel spectrum generation demo for this purpose.

Starting with some imports:

In [1]:
import os
import matplotlib.pyplot as plt
import librosa
import librosa.display
import numpy as np

We will also need a constant for the frame size for Mel spectrogram processing.

In [2]:
mel_spec_frame_size = 512

We shall define some useful functions that will help our preprocessing stage. The first function scans the data folders we have to determine how many classes are present. We have two subfolders, one per class. The following function returns the subfolder list, as well as the associated label for each of these classes.

In [3]:
def get_classes_in_datapath(datapath='./data_notebook_2'):
    """
    Returns a list of sub-folders in the data path, assuming each subfolder is a separate class. Each sub-folder
    is associated with a numeric class label, which is also returned.
    :param datapath: Main data path
    :return: Tuple of individual class folder paths, and the corresponding numeric class label for each folder
    """
    # os.scandir returns entries in arbitrary order, so sort to keep labels stable between runs
    subfolders = sorted(f.path for f in os.scandir(datapath) if f.is_dir())
    class_labels = np.arange(0,len(subfolders))
    return subfolders, class_labels
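
To see the folder-to-label mapping in action, here is a small self-contained sketch that builds a throwaway directory tree and lists its classes. The folder names `zero` and `one` and the helper name `list_classes` are assumptions made for illustration. Sorting is included so that each folder gets the same label on every run, since `os.scandir` does not guarantee an order.

```python
import os
import tempfile

import numpy as np


def list_classes(datapath):
    """Hypothetical helper: one numeric label per sub-folder, in sorted order."""
    subfolders = sorted(f.path for f in os.scandir(datapath) if f.is_dir())
    class_labels = np.arange(len(subfolders))
    return subfolders, class_labels


with tempfile.TemporaryDirectory() as root:
    # assumed folder names, one per class
    for name in ("zero", "one"):
        os.makedirs(os.path.join(root, name))
    folders, labels = list_classes(root)
    class_names = [os.path.basename(f) for f in folders]

print(dict(zip(class_names, labels.tolist())))
```

With alphabetical ordering, the folder `one` would receive label 0 and `zero` would receive label 1, so it is worth checking the mapping before interpreting predictions.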

The next function we need returns the list of audio files in a particular folder:

In [4]:
def get_wav_files_in_path(datapath):
    """
    Returns the list of .wav files in a directory.
    :param datapath: Directory to search for wav files in.
    :return: List of paths to wav files.
    """
    files = os.listdir(datapath)
    files_wav = [i for i in files if i.endswith('.wav')]
    return files_wav

One function you should already be familiar with extracts a Mel spectrogram from an audio file.

In [5]:
def get_melspec_from_wav(wavfile, n_mels=64, plot=False):
    """
    Given a path to a wav file, returns a melspectrogram array.
    np.ndarray [shape=(n_mels, t)]
    :param wavfile: The input wav file.
    :param n_mels: The number of mel spectrogram filters.
    :param plot: Flag to either plot the spectrogram or not.
    :return: Returns a tuple of np.ndarray [shape=(n_mels, t)] and fs
    """
    sig, fs = librosa.load(wavfile,sr=None)

    # Normalize audio to between -1.0 and +1.0
    sig /= np.max(np.abs(sig), axis=0)

    if len(sig) < fs: # pad if less than a second
        shape = np.shape(sig)
        padded_array = np.zeros(fs)
        padded_array[:shape[0]] = sig
        sig = padded_array

    melspec = librosa.feature.melspectrogram(y=sig,
                                             sr=fs,
                                             center=True,
                                             n_fft=mel_spec_frame_size,
                                             hop_length=int(mel_spec_frame_size/2),
                                             n_mels=n_mels)

    if plot:
        plt.figure(figsize=(8, 6))
        plt.xlabel('Time')
        plt.ylabel('Mel-Frequency')
        librosa.display.specshow(librosa.power_to_db(melspec, ref=np.max),
                                 y_axis='mel',
                                 fmax=fs/2,
                                 sr=fs,
                                 hop_length=int(mel_spec_frame_size / 2),
                                 x_axis='time')
        plt.colorbar(format='%+2.0f dB')
        plt.title('Mel spectrogram')
        plt.tight_layout()
        plt.show()

    return melspec, fs
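
As a quick sanity check on the shape returned above, the number of time frames can be estimated from the signal length and the hop size. The sketch below assumes a 16 kHz sampling rate (the notebook loads files at their native rate, so this value is an assumption) and relies on librosa's centred framing producing `1 + len(sig) // hop_length` frames.

```python
mel_spec_frame_size = 512
hop_length = mel_spec_frame_size // 2

fs = 16000          # assumed sampling rate, not taken from the dataset
n_samples = fs      # signals shorter than one second are zero-padded to fs samples

# with center=True the signal is padded so that frames are centred on multiples of hop_length
n_frames = 1 + n_samples // hop_length
print(n_frames)  # 63
```

Under this assumption a one-second clip gives 63 frames, just short of the 64 columns used later, which is consistent with the pre-processing step padding the time axis up to 64.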

We want a way to export the Mel spectra and not have to process the training set every time we change our CNN architecture. We can do this in two ways: one is to export the raw spectrum values in a compressed numpy file (NPZ) - which can be done with a numpy call. The other method (and this will be used just for viewing images easily) will be to export the actual image as a PNG file with this function:

In [6]:
def save_image(filepath, fig=None):
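    r'''Save the current image with no whitespace
    Example filepath: "myfig.png" or r"C:\myfig.pdf"
    '''
    if fig is None:
        fig = plt.gcf()

    # adjust the figure that is being saved, not whichever figure happens to be current
    fig.subplots_adjust(left=0, bottom=0, right=1, top=1, wspace=0, hspace=0)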
    for ax in fig.axes:
        ax.axis('off')
        ax.margins(0,0)
        ax.xaxis.set_major_locator(plt.NullLocator())
        ax.yaxis.set_major_locator(plt.NullLocator())
    fig.savefig(filepath, pad_inches = 0, bbox_inches='tight')
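
The NPZ route mentioned above can be sketched as a simple save and load round trip. The array here is random data standing in for a real spectrogram, and the sampling rate is an assumed value. One detail worth knowing is that `np.savez` appends `.npz` to the file name when it is missing.

```python
import os
import tempfile

import numpy as np

# stand-in for a real Mel spectrogram: 64 mel bands x 64 time frames
melspec = np.random.rand(64, 64).astype(np.float32)
fs = 16000  # assumed sampling rate

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "sample.mel")
    # np.savez appends '.npz' when the name does not already end with it
    np.savez(target, melspec=melspec, fs=fs)
    saved_files = os.listdir(tmp)

    # arrays are read back by the keyword names used when saving
    with np.load(target + ".npz") as npzfile:
        restored = npzfile["melspec"]
        restored_fs = int(npzfile["fs"])

print(saved_files, restored.shape, restored_fs)
```

This is why a file saved as `sample.mel` ends up on disk as `sample.mel.npz`, and why a later scan for files ending in `.npz` finds it.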

We now have all the basic building blocks to run a data pre-processing routine to:
1) Get a list of classes in our dataset

2) For each class, get a list of wav files

3) For each wav file, get a normalized Mel spectrogram

4) Save the Mel spectrogram as an NPZ file and as a PNG file

We shall be saving the NPZ and PNG files adjacent to the WAV files, with the same filename but different extensions. The function below does all of the above. It will take some time to execute for our 2000 files, but once completed, we can re-use the data for our CNN experiments:

In [7]:
def preprocessing():
    """
    Performs initial data preprocessing - extracting Mel spectra for all files and padding them appropriately
    :return:
    """
    subfolders, class_labels = get_classes_in_datapath()

    for folder in subfolders:
        files_wav = get_wav_files_in_path(datapath=folder)
        for file in files_wav:
            # get melspec
            melspec, fs = get_melspec_from_wav(wavfile=os.path.join(folder, file),
                                               plot=False,
                                               n_mels=64)

            melspec = librosa.power_to_db(melspec, ref=1.0)
            melspec = melspec / 80.0 # scale by max dB

            #check we have 64 time samples (and pad)
            if melspec.shape[1] < 64:
                shape = np.shape(melspec)
                padded_array = np.zeros((shape[0],64))-1
                padded_array[0:shape[0],:shape[1]] = melspec
                melspec = padded_array

            # save melspec
            melspec_filename = (os.path.join(folder, file)).replace('.wav', '.mel')
            np.savez(melspec_filename, melspec=melspec, fs=fs)

            # save melspec image
            melspec_image_filename = (os.path.join(folder, file)).replace('.wav', '.png')
            fig = plt.figure(figsize=(10, 4))
            plt.imshow(melspec, origin='lower')
            plt.tight_layout()
            save_image(filepath=melspec_image_filename,fig=fig)
            plt.close()
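
The scaling and padding steps inside the function can be tried in isolation on synthetic data. The sketch below uses a plain numpy dB conversion instead of `librosa.power_to_db` (so there is no clipping of the dynamic range), and `pad_time_axis` is a hypothetical helper name. The fill value of -1 corresponds to -80 dB after dividing by 80, so the padded columns look like very quiet frames.

```python
import numpy as np


def pad_time_axis(melspec, n_frames=64, fill_value=-1.0):
    """Hypothetical helper: right-pad the time axis to n_frames columns."""
    n_mels, t = melspec.shape
    if t >= n_frames:
        return melspec
    padded = np.full((n_mels, n_frames), fill_value, dtype=melspec.dtype)
    padded[:, :t] = melspec
    return padded


# synthetic power spectrogram with 63 frames, values between 1e-8 and 1
power = np.random.uniform(1e-8, 1.0, size=(64, 63))

# plain-numpy dB conversion relative to 1.0
db = 10.0 * np.log10(power)
scaled = db / 80.0  # roughly maps -80..0 dB onto -1..0

padded = pad_time_axis(scaled)
print(padded.shape, padded[:, -1].min(), padded[:, -1].max())
```

Note that the notebook's function only pads. A spectrogram with more than 64 frames would pass through unchanged in this sketch as well.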

Let's call the function and let it run.

In [9]:
do_preprocessing = False  # change to True if preprocessing has not been performed.

print('Starting pre-processing...')
if do_preprocessing:
    preprocessing()
print('Pre-processing finished.')

Starting pre-processing...
Pre-processing finished.


## CNN Classifier Construction
We shall now create a Python class that will construct and execute a CNN classifier for our Mel spectrogram data. Let's start with the class definition.

In [19]:
import tensorflow
# import everything from tensorflow.keras so that layers and optimizer come from the same package
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, ReLU, Dense, Flatten, Dropout
from tensorflow.keras.optimizers import Adam
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

class SpeechCNNClassifier():
    def __init__(self):
        # Input shape
        self.img_rows = 64
        self.img_cols = 64
        self.channels = 1
        self.img_shape = (self.img_rows, self.img_cols, self.channels)

        # Scan Data Set, shuffle it, and split into training/validation
        images, labels = self.get_trainingset()
        images, labels = shuffle(images, labels)
        self.X_train, self.X_test, self.y_train, self.y_test = \
            train_test_split(images, labels, test_size = 0.33, random_state = 42)

        # Build and compile the classifier
        optimizer = Adam(learning_rate=0.0001)
        self.classifier = self.build_classifier()
        # a single metric keeps the (loss, accuracy) unpacking in train() valid
        self.classifier.compile(loss='binary_crossentropy',
                                optimizer=optimizer,
                                metrics=['binary_accuracy'])

    def get_trainingset(self):
        """
        Returns a list of training data files (NPZ files) and their respective labels
        :return:
        """
        def get_npz_files_in_path(datapath):
            files = os.listdir(datapath)
            files_npz = [i for i in files if i.endswith('.npz')]
            return files_npz

        training_images = []
        labels = []

        subfolders, class_labels = get_classes_in_datapath()

        class_idx = 0
        for folder in subfolders:
            label = class_labels[class_idx]
            files_mel = get_npz_files_in_path(datapath=folder)

            temp_labels = np.empty(shape=(len(files_mel)),dtype=int)
            temp_labels[:] = label
            labels.extend(temp_labels)
            class_idx += 1

            for file in files_mel:
                training_images.append((os.path.join(folder, file)))

        return training_images, labels

The constructor above contains a number of definitions that will come in handy. The first parameters define the size of the image in terms of rows, columns and channels. We will be working with 64x64 pixel images (the Mel spectra), each containing just one channel of information. Together these values define the image shape. We shall also be supplying the classifier with a training set, which is a list of image file paths and their respective labels. This is done with a method called get_trainingset, which scans our directory structure and prepares the dataset.

We then split the data into training and validation sets, with around 1/3 of the data kept for validation. In the complete class listed later in this notebook, the labels are also one-hot encoded before the split, for categorical classification purposes.
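
For readers who have not met one-hot encoding before, here is a minimal numpy sketch of the idea. The notebook's complete class uses scikit-learn's `LabelEncoder` and `OneHotEncoder`, so this is an equivalent illustration on made-up labels rather than the code used here.

```python
import numpy as np

labels = np.array([0, 1, 1, 0])   # integer class labels, one per file (made-up values)
n_classes = 2

# row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(n_classes)[labels]
print(one_hot)

# argmax along the class axis recovers the integer labels
recovered = np.argmax(one_hot, axis=1)
```

Each label becomes a vector with a single 1 in the position of its class, which is the target format expected by a softmax output trained with categorical cross-entropy.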

Furthermore, the constructor also compiles the CNN for us. This consists of:
1) Defining an optimization function. We shall be using an Adam optimizer, with a learning rate of 0.0001. Adam is a widely used alternative to plain stochastic gradient descent for training deep learning models. It combines ideas from the AdaGrad and RMSProp algorithms, which helps it handle sparse gradients on noisy problems. It is a common default choice when training CNNs.

2) We shall call an internal method that builds the classifier (CNN) for us. We shall be discussing this method soon. This method will be the main place where the architecture of the CNN is defined/modified.

3) We combine the constructed CNN and the optimizer into a compilation that works to minimize binary cross-entropy loss. A very good explanation of this loss function can be found here: https://towardsdatascience.com/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a
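
To make the loss in step 3 concrete, here is a small numpy sketch of mean binary cross-entropy on two made-up predictions. The function name and the `eps` clipping value are choices made for this illustration, not taken from Keras.

```python
import numpy as np


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; eps guards against log(0)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred)))


# predictions close to the true labels give a small loss
confident_right = binary_cross_entropy([1, 0], [0.9, 0.2])
# predictions far from the true labels give a large loss
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.8])
print(confident_right, confident_wrong)
```

The first case evaluates to roughly 0.164 and the second to roughly 1.956, showing how the loss penalizes confident mistakes much more heavily than near-correct outputs.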

Now let us define our CNN architecture, the nuts and bolts of this class:

In [20]:
def build_classifier(self):
    """
    Defines the classifier network.
    :return:
    """
    model = Sequential()
    model.add(Conv2D(16, kernel_size=3, strides=1, padding='same', input_shape=self.img_shape))
    model.add(ReLU())

    model.add(Conv2D(32, kernel_size=3, strides=2, padding='same'))
    model.add(ReLU())

    model.add(Conv2D(64, kernel_size=3, strides=2, padding='same'))
    model.add(ReLU())

    model.add(Conv2D(128, kernel_size=3, strides=2, padding='same'))
    model.add(ReLU())

    model.add(Flatten())
    model.add(Dropout(0.4))
    model.add(Dense(1, activation='sigmoid'))

    model.summary()

    return model

The CNN is a sequential network made up of 4 convolutional blocks, each followed by a ReLU activation function. The convolutional layers all use a 3x3 kernel. The first layer uses a stride of 1 and the remaining three use a stride of 2, which halves the spatial resolution each time. The number of filters doubles at every layer, starting from 16 and going up to 128.

After the convolutional blocks, the output is flattened into a 1-D vector, dropout is applied, and a final sigmoid layer with a single node gives us a continuous value between 0 and 1. We will take the proximity of this value to 0 or 1 as an indicator of which of the two classes the image belongs to.
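
The layer sizes described above can be checked with a little arithmetic. The sketch below assumes the standard rules for a 2D convolution with `padding='same'` (output size is the input size divided by the stride, rounded up) and counts one bias per filter.

```python
import math

height = width = 64
in_channels = 1
layers = [(16, 1), (32, 2), (64, 2), (128, 2)]  # (filters, stride) per Conv2D block

shapes, params = [], []
for filters, stride in layers:
    # padding='same' gives ceil(input / stride) along each spatial axis
    height = math.ceil(height / stride)
    width = math.ceil(width / stride)
    # 3x3 kernel weights per input channel, plus one bias per filter
    params.append((3 * 3 * in_channels + 1) * filters)
    shapes.append((height, width, filters))
    in_channels = filters

# number of values fed into the final dense layer after flattening
flattened = height * width * in_channels
print(shapes, params, flattened)
```

The spatial size goes from 64 to 32 to 16 to 8 while the channel count grows to 128, so the flattened vector has 8 x 8 x 128 = 8192 values.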

We now need to define our training loop. Since we are dealing with quite a lot of data, we shall not use the typical Keras tutorial approach of loading all data in memory and running a 'fit' with defined epochs and batch size. Instead, we shall customize the way data is fed into the CNN for training.

We need to train our CNN over a number of epochs. An epoch is defined as a full pass of the CNN over the training data. Since we will be training our CNN with a particular batch size, we need to calculate how many steps are to be executed in our epoch (dataset size / batch size). We then prepare our validation dataset in memory, as we want to run a validation test at the end of every epoch.

Training then takes place by looping over the required number of epochs and, within each epoch, over all of its steps. The batch of data required for a particular step is loaded into memory (read from NPZ files) together with the associated labels. The training batch is passed on to the CNN for a batch update. When all steps in an epoch are complete, the validation metrics are calculated. Training completes when all epochs have been executed.

In [21]:
def train(self, epochs, batch_size=64):
    # steps per epoch
    steps_per_epoch = int(len(self.X_train)/batch_size)

    # prepare validation set images and labels
    validation_imgs = []
    for file_path in self.X_test:
        npzfile = np.load(file_path)
        melspec = npzfile['melspec']
        validation_imgs.append(melspec)
    validation_imgs = np.asarray(validation_imgs)
    validation_imgs = np.expand_dims(validation_imgs, axis=3)
    validation_lbls = np.asarray(self.y_test)

    for epoch in range(epochs):
        for step in range(steps_per_epoch):
            # ---------------------
            #  Train CNN
            # ---------------------

            # Select the next batch of images (indexes are taken in order, no re-shuffling)
            idx = np.arange(step*batch_size,(step*batch_size)+batch_size)
            imgs = []
            for file_path in np.asarray(self.X_train)[idx]:
                npzfile = np.load(file_path)
                melspec = npzfile['melspec']
                imgs.append(melspec)
            imgs = np.asarray(imgs)
            imgs = np.expand_dims(imgs, axis=3)
            lbls = np.asarray(self.y_train)[idx]

            # batch-train CNN
            loss, accuracy = self.classifier.train_on_batch(imgs,lbls)
            # Output the progress
            print("Epoch: (%d/%d) Step: (%d/%d) [Loss: %f, Accuracy: %f]"
                % (epoch, epochs - 1, step, steps_per_epoch - 1, loss, accuracy))

        # After all steps in this epoch are complete, run validation
        v_loss, v_accuracy = self.classifier.evaluate(x=validation_imgs,y=validation_lbls)
        # Output the progress
        print("Epoch: (%d/%d) Validation [Loss: %f, Accuracy: %f]"
              % (epoch, epochs - 1, v_loss, v_accuracy))
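
The batch bookkeeping in the loop above can be sketched on its own. The training-set size used here is an assumption (2000 files with roughly one third held out), and `batch_slices` is a hypothetical helper name.

```python
def batch_slices(n_samples, batch_size):
    """Hypothetical helper: yield (start, stop) index pairs for each full batch."""
    steps_per_epoch = n_samples // batch_size
    for step in range(steps_per_epoch):
        start = step * batch_size
        yield start, start + batch_size


n_train = 1340   # assumed: 2000 files with a 0.33 validation split
batch_size = 32

slices = list(batch_slices(n_train, batch_size))
# samples after the last full batch are not visited
unused = n_train - slices[-1][1]
print(len(slices), slices[0], slices[-1], unused)
```

Because the division is rounded down and the order of the training files is fixed after the initial shuffle, the samples beyond the last full batch are never used for training. Re-shuffling the training indexes at the start of every epoch would be one way to address that.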

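The complete CNN class is defined here. Note that this version one-hot encodes the labels, ends in a two-node softmax layer, and is compiled with categorical cross-entropy loss: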

In [22]:
class SpeechCNNClassifier():
    def __init__(self):
        # Input shape
        self.img_rows = 64
        self.img_cols = 64
        self.channels = 1
        self.img_shape = (self.img_rows, self.img_cols, self.channels)

        # Scan Data Set, shuffle it, and split into training/validation
        images, labels = self.get_trainingset()
        # note: scikit-learn 1.2 and later rename the `sparse` argument to `sparse_output`
        onehot_encoder = OneHotEncoder(sparse=False)
        label_encoder = LabelEncoder()
        integer_encoded = label_encoder.fit_transform(labels)
        integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
        labels = onehot_encoder.fit_transform(integer_encoded)

        images, labels = shuffle(images, labels)
        self.X_train, self.X_test, self.y_train, self.y_test = \
            train_test_split(images, labels, test_size = 0.33, random_state = 42)

        # Build and compile the classifier
        optimizer = Adam(learning_rate=0.0001)
        self.classifier = self.build_classifier()
        self.classifier.compile(loss='categorical_crossentropy',
                                        optimizer=optimizer,
                                        metrics=['accuracy'])

    def get_trainingset(self):
        """
        Returns a list of training data files (NPZ files) and their respective labels
        :return:
        """
        def get_npz_files_in_path(datapath):
            files = os.listdir(datapath)
            files_npz = [i for i in files if i.endswith('.npz')]
            return files_npz

        training_images = []
        labels = []

        subfolders, class_labels = get_classes_in_datapath()

        class_idx = 0
        for folder in subfolders:
            label = class_labels[class_idx]
            files_mel = get_npz_files_in_path(datapath=folder)

            temp_labels = np.empty(shape=(len(files_mel)),dtype=int)
            temp_labels[:] = label
            labels.extend(temp_labels)
            class_idx += 1

            for file in files_mel:
                training_images.append((os.path.join(folder, file)))

        return training_images, labels

    def build_classifier(self):
        """
        Defines the classifier network.
        :return:
        """
        model = Sequential()
        model.add(Conv2D(16, kernel_size=3, strides=1, padding='same', input_shape=self.img_shape))
        model.add(ReLU())

        model.add(Conv2D(32, kernel_size=3, strides=2, padding='same'))
        model.add(ReLU())

        model.add(Conv2D(64, kernel_size=3, strides=2, padding='same'))
        model.add(ReLU())

        model.add(Conv2D(128, kernel_size=3, strides=2, padding='same'))
        model.add(ReLU())

        model.add(Flatten())
        model.add(Dropout(0.4))
        model.add(Dense(2, activation='softmax'))

        model.summary()

        return model

    def train(self, epochs, batch_size=64):
        # steps per epoch
        steps_per_epoch = int(len(self.X_train)/batch_size)

        # prepare validation set images and labels
        validation_imgs = []
        for file_path in self.X_test:
            npzfile = np.load(file_path)
            melspec = npzfile['melspec']
            validation_imgs.append(melspec)
        validation_imgs = np.asarray(validation_imgs)
        validation_imgs = np.expand_dims(validation_imgs, axis=3)
        validation_lbls = np.asarray(self.y_test)

        for epoch in range(epochs):
            for step in range(steps_per_epoch):
                # ---------------------
                #  Train CNN
                # ---------------------

                # Select the next batch of images (indexes are taken in order, no re-shuffling)
                idx = np.arange(step*batch_size,(step*batch_size)+batch_size)
                imgs = []
                for file_path in np.asarray(self.X_train)[idx]:
                    npzfile = np.load(file_path)
                    melspec = npzfile['melspec']
                    imgs.append(melspec)
                imgs = np.asarray(imgs)
                imgs = np.expand_dims(imgs, axis=3)
                lbls = np.asarray(self.y_train)[idx]

                # batch-train CNN
                loss, accuracy = self.classifier.train_on_batch(imgs,lbls)
                # Output the progress, uncomment to see
#                 print("Epoch: (%d/%d) Step: (%d/%d) [Loss: %f, Accuracy: %f]"
#                     % (epoch, epochs - 1, step, steps_per_epoch - 1, loss, accuracy))

            # After all steps in this epoch are complete, run validation
            v_loss, v_accuracy = self.classifier.evaluate(x=validation_imgs,y=validation_lbls)
            # Output the progress
            print("Epoch: (%d/%d) Validation [Loss: %f, Accuracy: %f]"
                  % (epoch, epochs - 1, v_loss, v_accuracy))

Let's give it a go:

In [23]:
# Build CNN
cnn = SpeechCNNClassifier()

# Train CNN
batch_size = 32
cnn.train(epochs=25, batch_size=batch_size)


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d (Conv2D)             (None, 64, 64, 16)        160       
                                                                 
 re_lu (ReLU)                (None, 64, 64, 16)        0         
                                                                 
 conv2d_1 (Conv2D)           (None, 32, 32, 32)        4640      
                                                                 
 re_lu_1 (ReLU)              (None, 32, 32, 32)        0         
                                                                 
 conv2d_2 (Conv2D)           (None, 16, 16, 64)        18496     
                                                                 
 re_lu_2 (ReLU)              (None, 16, 16, 64)        0         
                                                                 
 conv2d_3 (Conv2D)           (None, 8, 8, 128)         7

## Homework

You should have a go at modifying this notebook in order to achieve the following:

1) Plot loss and accuracy curves at the end of epoch training. This will allow you to diagnose your CNN's performance and training, and possibly identify problems.

2) Train the system for more epochs. Does performance improve? When do improvements reach a plateau? 

Plateaus at around 21 epochs

3) Modify the CNN architecture to try and achieve better classification results on the validation set.
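
For the first homework item, one possible approach is sketched below. It assumes the training loop has been changed to collect one value per epoch into a dictionary with the keys `loss`, `val_loss`, `accuracy` and `val_accuracy`. These key names and the helper `plot_history` are assumptions for illustration, and the numbers are placeholders rather than results from this notebook.

```python
import matplotlib.pyplot as plt


def plot_history(history):
    """Hypothetical helper: plot per-epoch loss and accuracy side by side."""
    epochs = range(1, len(history["loss"]) + 1)
    fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(10, 4))

    ax_loss.plot(epochs, history["loss"], label="training")
    ax_loss.plot(epochs, history["val_loss"], label="validation")
    ax_loss.set_xlabel("Epoch")
    ax_loss.set_ylabel("Loss")
    ax_loss.legend()

    ax_acc.plot(epochs, history["accuracy"], label="training")
    ax_acc.plot(epochs, history["val_accuracy"], label="validation")
    ax_acc.set_xlabel("Epoch")
    ax_acc.set_ylabel("Accuracy")
    ax_acc.legend()

    fig.tight_layout()
    return fig


# placeholder values for illustration only, not results from this notebook
example_history = {
    "loss": [0.7, 0.5, 0.4],
    "val_loss": [0.72, 0.55, 0.5],
    "accuracy": [0.5, 0.7, 0.8],
    "val_accuracy": [0.5, 0.65, 0.72],
}
fig = plot_history(example_history)
```

A validation curve that flattens or turns upward while the training curve keeps improving is a typical sign of overfitting, and also a practical way to spot where further epochs stop helping.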

In [23]:
class SpeechCNNClassifier():
    def __init__(self):
        # Input shape
        self.img_rows = 64
        self.img_cols = 64
        self.channels = 1
        self.img_shape = (self.img_rows, self.img_cols, self.channels)

        # Scan Data Set, shuffle it, and split into training/validation
        images, labels = self.get_trainingset()
        # note: scikit-learn 1.2 and later rename the `sparse` argument to `sparse_output`
        onehot_encoder = OneHotEncoder(sparse=False)
        label_encoder = LabelEncoder()
        integer_encoded = label_encoder.fit_transform(labels)
        integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
        labels = onehot_encoder.fit_transform(integer_encoded)

        images, labels = shuffle(images, labels)
        self.X_train, self.X_test, self.y_train, self.y_test = \
            train_test_split(images, labels, test_size = 0.33, random_state = 42)

        # Build and compile the classifier
        optimizer = Adam(learning_rate=0.0001)
        self.classifier = self.build_classifier()
        self.classifier.compile(loss='categorical_crossentropy',
                                        optimizer=optimizer,
                                        metrics=['accuracy'])

    def get_trainingset(self):
        """
        Returns a list of training data files (NPZ files) and their respective labels
        :return:
        """
        def get_npz_files_in_path(datapath):
            files = os.listdir(datapath)
            files_npz = [i for i in files if i.endswith('.npz')]
            return files_npz

        training_images = []
        labels = []

        subfolders, class_labels = get_classes_in_datapath()

        class_idx = 0
        for folder in subfolders:
            label = class_labels[class_idx]
            files_mel = get_npz_files_in_path(datapath=folder)

            temp_labels = np.empty(shape=(len(files_mel)),dtype=int)
            temp_labels[:] = label
            labels.extend(temp_labels)
            class_idx += 1

            for file in files_mel:
                training_images.append((os.path.join(folder, file)))

        return training_images, labels

    def build_classifier(self):
        """
        Defines the classifier network.
        :return:
        """
        model = Sequential()
        model.add(Conv2D(16, kernel_size=3, strides=1, padding='same', input_shape=self.img_shape))
        model.add(ReLU())

        model.add(Conv2D(32, kernel_size=3, strides=2, padding='same'))
        model.add(ReLU())

        model.add(Conv2D(64, kernel_size=3, strides=2, padding='same'))
        model.add(ReLU())

        model.add(Conv2D(128, kernel_size=3, strides=2, padding='same'))
        model.add(ReLU())

        model.add(Flatten())
        model.add(Dropout(0.4))
        model.add(Dense(2, activation='softmax'))

        model.summary()

        return model

    def train(self, epochs, batch_size=64):
        # steps per epoch
        steps_per_epoch = int(len(self.X_train)/batch_size)

        # prepare validation set images and labels
        validation_imgs = []
        for file_path in self.X_test:
            npzfile = np.load(file_path)
            melspec = npzfile['melspec']
            validation_imgs.append(melspec)
        validation_imgs = np.asarray(validation_imgs)
        validation_imgs = np.expand_dims(validation_imgs, axis=3)
        validation_lbls = np.asarray(self.y_test)

        # per-epoch metrics, collected for plotting
        history = {'loss': [], 'accuracy': [], 'val_loss': [], 'val_accuracy': []}

        for epoch in range(epochs):
            step_losses, step_accuracies = [], []
            for step in range(steps_per_epoch):
                # ---------------------
                #  Train CNN
                # ---------------------

                # Select the next batch of images (indexes are taken in order, no re-shuffling)
                idx = np.arange(step*batch_size,(step*batch_size)+batch_size)
                imgs = []
                for file_path in np.asarray(self.X_train)[idx]:
                    npzfile = np.load(file_path)
                    melspec = npzfile['melspec']
                    imgs.append(melspec)
                imgs = np.asarray(imgs)
                imgs = np.expand_dims(imgs, axis=3)
                lbls = np.asarray(self.y_train)[idx]

                # batch-train CNN (a single update per batch; calling fit() here as well
                # would train on the same batch again and overwrite the history each step)
                loss, accuracy = self.classifier.train_on_batch(imgs, lbls)
                step_losses.append(loss)
                step_accuracies.append(accuracy)

            # After all steps in this epoch are complete, run validation
            v_loss, v_accuracy = self.classifier.evaluate(x=validation_imgs,y=validation_lbls)

            # store the epoch averages for training and the validation results
            history['loss'].append(float(np.mean(step_losses)))
            history['accuracy'].append(float(np.mean(step_accuracies)))
            history['val_loss'].append(float(v_loss))
            history['val_accuracy'].append(float(v_accuracy))

            # Output the progress
            print("Epoch: (%d/%d) Validation [Loss: %f, Accuracy: %f]"
                  % (epoch, epochs - 1, v_loss, v_accuracy))
        print(history.keys())
        return history
        

In [None]:
cnn = SpeechCNNClassifier()

# Train CNN
batch_size = 8  # 32
cnn.train(epochs=5, batch_size=batch_size)  # 25 epochs