# Classification for voice control
The current MOPS implementation uses a combination of Hidden Markov Models (HMM) and Gaussian Mixture Models (GMM) to classify the commands found by the voice activity detection. This notebook uses TensorFlow instead, in order to practice applying TensorFlow to classification tasks.

In [1]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm

import os
os.chdir('../Python')
import TrainingsDataInterface
import TrainingsInterface
import Train
import DatasetAugmentation
import Constants

TempFolder = "NeuralNetworks/VoiceControlByTensorflow"
FilenameData = TempFolder + '/Data.npz'
os.makedirs(TempFolder, exist_ok=True)  # also creates missing parent folders; no error if the folder exists

print('installed version of Tensorflow: ', tf.__version__)

installed version of Tensorflow:  2.14.0


## Evaluate the input data
In the following code block, the MOPS software is used to perform the following three steps:

1) Read the audio data: x, Fs, bits = ATrainingsDataInterface.GetWaveOfCommandInstance(CommandIndex, InstanceIndex)

2) Optionally apply dataset augmentation: y, Timestretchfactor = ADatasetAugmentation.GenerateSingleDistortion(DistortionIndex)

3) Evaluate the MFCC, $\Delta$MFCC and $\Delta\Delta$MFCC of z, which is the signal y cropped or zero-padded to a constant length: Feature = TrainingsInterface.SamplesToFeature(z, Fs)

The evaluated features are stored in three different datasets:

1) train_images,

2) validation_images and

3) test_images.

The corresponding command indices are stored as ground truth in

1) train_labels,

2) validation_labels and

3) test_labels.

Warning: This code is slow. It only needs to be run if the training data has to be re-evaluated. Otherwise, the next code block loads everything directly from the hard disk.

In [2]:
AudioDataLengthInMilliseconds = Constants.theConstants.getWordLengthInMilliseconds()
NumberOfTestSamples = 20
NumberOfValidationSamples = 20

ATrainingsDataInterface = TrainingsDataInterface.CTrainingsDataInterface()
def GetNumberOfTrainingsData():
    res = 0
    for CommandIndex in range(ATrainingsDataInterface.GetNumberOfCommands()):
        command = ATrainingsDataInterface.GetCommandString(CommandIndex)
        if command in Train.VOCABULARY:
            NewSamples = ATrainingsDataInterface.GetNumberOfCommandInstances(CommandIndex)
            NewSamples -= NumberOfTestSamples
            NewSamples -= NumberOfValidationSamples
            assert NewSamples > 0, str('not enough training samples for command ' + command)
            res += NewSamples
    return res

def GetAudioWithConstantLength(x, Fs):
    LengthInSamples = int(AudioDataLengthInMilliseconds * Fs / 1000)
    if x.shape[0] <= LengthInSamples:  # '<=' avoids an empty energy window when the lengths match
        y = np.concatenate((x, np.zeros((LengthInSamples - x.shape[0]))), axis = 0)
    else:
        E_cumsum = np.cumsum(x**2)
        tmp = E_cumsum[LengthInSamples:]
        tmp -= E_cumsum[:tmp.shape[0]]
        MaxIndex = np.argmax(tmp)
        y = x[MaxIndex:MaxIndex + LengthInSamples]
    assert np.abs(y.shape[0] - LengthInSamples) < 1e-1, 'wrong output length'
    return y

def IsTraining(InstanceIndex):
    return not (IsTest(InstanceIndex) or IsValidation(InstanceIndex))

def IsValidation(InstanceIndex):
    return (not IsTest(InstanceIndex)) and (InstanceIndex < (NumberOfTestSamples + NumberOfValidationSamples))

def IsTest(InstanceIndex):
    return InstanceIndex < NumberOfTestSamples

def EvaluateAllData():
    Constants.theConstants.SetUseVAD(False)
    train_images = None
    train_counter = 0
    test_counter = 0
    validation_counter = 0
    for CommandIndex in tqdm(range(ATrainingsDataInterface.GetNumberOfCommands())):
        command = ATrainingsDataInterface.GetCommandString(CommandIndex)
        if command in Train.VOCABULARY:
            for n in range(len(Train.VOCABULARY)):
                if Train.VOCABULARY[n] == command:
                    commandlabel = n
            for InstanceIndex in range(ATrainingsDataInterface.GetNumberOfCommandInstances(CommandIndex)):
                x, Fs, bits = ATrainingsDataInterface.GetWaveOfCommandInstance(CommandIndex, InstanceIndex)
                assert np.abs(Constants.theConstants.getSamplingFrequencyMicrofone() - Fs) < 1e-3, 'wrong sampling rate'
                ADatasetAugmentation = DatasetAugmentation.CAudioDatasetAugmentation(x, Fs)
                NumberOfDistortions = 1#ADatasetAugmentation.GetNumberOfResults()
                if IsTraining(InstanceIndex):
                    MaxDistortionIndex = NumberOfDistortions
                else:
                    MaxDistortionIndex = 1
                for DistortionIndex in range(MaxDistortionIndex):
                    y, Timestretchfactor = ADatasetAugmentation.GenerateSingleDistortion(DistortionIndex)
                    z = GetAudioWithConstantLength(y, Fs)
                    Feature = TrainingsInterface.SamplesToFeature(z, Fs)
                    if train_images is None:
                        train_images = np.zeros((GetNumberOfTrainingsData()*NumberOfDistortions, Feature.shape[0], Feature.shape[1]))
                        test_images = np.zeros((NumberOfTestSamples*len(Train.VOCABULARY), Feature.shape[0], Feature.shape[1]))
                        validation_images = np.zeros((NumberOfValidationSamples*len(Train.VOCABULARY), Feature.shape[0], Feature.shape[1]))
                        train_labels = np.zeros((train_images.shape[0]))
                        test_labels = np.zeros((test_images.shape[0]))
                        validation_labels = np.zeros((validation_images.shape[0]))
                    if IsTraining(InstanceIndex):
                        train_images[train_counter, :, :] = Feature
                        train_labels[train_counter] = commandlabel
                        train_counter += 1   
                    elif IsTest(InstanceIndex):
                        test_images[test_counter, :, :] = Feature
                        test_labels[test_counter] = commandlabel
                        test_counter += 1   
                    else:
                        validation_images[validation_counter, :, :] = Feature
                        validation_labels[validation_counter] = commandlabel
                        validation_counter += 1                         
    return train_images, train_labels, test_images, test_labels, validation_images, validation_labels

train_images, train_labels, test_images, test_labels, validation_images, validation_labels = EvaluateAllData()
np.savez(FilenameData, x0 = train_images, x1 = train_labels, x2 = test_images, x3 = test_labels, x4 = validation_images, x5 = validation_labels)

100%|██████████████████████████████████████████████████████████████████████████████████| 47/47 [01:13<00:00,  1.56s/it]
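
The constant-length step in GetAudioWithConstantLength keeps the part of a long recording that carries the most energy and zero-pads short recordings. The following is a minimal standalone sketch of that idea on synthetic data. The function name `crop_to_max_energy` and the test signal are assumptions made for illustration, and the sketch uses prefix sums with a leading zero, so it is a variant of the notebook's function rather than a copy of it.

```python
import numpy as np

def crop_to_max_energy(x, length):
    """Return the `length`-sample window of x with the highest energy (zero-pad if x is shorter)."""
    if x.shape[0] <= length:
        return np.concatenate((x, np.zeros(length - x.shape[0])))
    # Prefix sums with a leading zero: c[k] = sum(x[:k]**2)
    c = np.concatenate(([0.0], np.cumsum(x ** 2)))
    # The energy of x[i:i+length] is c[i+length] - c[i]
    window_energy = c[length:] - c[:-length]
    start = int(np.argmax(window_energy))
    return x[start:start + length]

# Synthetic signal: silence with a short burst at samples 30..39
x = np.zeros(100)
x[30:40] = 1.0
y = crop_to_max_energy(x, 10)
print(y.shape, y.sum())
```

The cumulative sum makes the search linear in the signal length, because each window energy is a difference of two prefix sums instead of a fresh sum over the window.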


## Load the data
In the following code block, the pre-evaluated training, validation and test data are loaded.

In [3]:
def GetAllData():    
    try:
        data = np.load(FilenameData)
        train_images = data['x0']
        train_labels = data['x1']
        test_images = data['x2']
        test_labels = data['x3']
        validation_images = data['x4']
        validation_labels = data['x5']
    except Exception:
        # The file is missing or unreadable: re-evaluate the features and cache them
        train_images, train_labels, test_images, test_labels, validation_images, validation_labels = EvaluateAllData()
        np.savez(FilenameData, x0 = train_images, x1 = train_labels, x2 = test_images, x3 = test_labels, x4 = validation_images, x5 = validation_labels)
    return train_images, train_labels, test_images, test_labels, validation_images, validation_labels

train_images, train_labels, test_images, test_labels, validation_images, validation_labels = GetAllData()
print('number of trainings samples: ', train_images.shape[0])
print('number of validation samples: ', validation_images.shape[0])
print('number of test samples: ', test_images.shape[0])

number of trainings samples:  1565
number of validation samples:  180
number of test samples:  180
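
The sample counts follow from the per-command split used in IsTest, IsValidation and IsTraining: the first 20 instances of each command go to the test set, the next 20 to the validation set, and the remainder to the training set. With 20 test instances per command, the 180 test samples correspond to 9 commands. The sketch below restates this rule in a standalone form. The function name `split_of` and the command with 50 recordings are hypothetical.

```python
NUM_TEST = 20        # mirrors NumberOfTestSamples
NUM_VALIDATION = 20  # mirrors NumberOfValidationSamples

def split_of(instance_index):
    """Name of the split an instance index falls into."""
    if instance_index < NUM_TEST:
        return "test"
    if instance_index < NUM_TEST + NUM_VALIDATION:
        return "validation"
    return "train"

# Hypothetical command with 50 recordings
counts = {"train": 0, "validation": 0, "test": 0}
for i in range(50):
    counts[split_of(i)] += 1
print(counts)  # {'train': 10, 'validation': 20, 'test': 20}
```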


## Output and loss
In a classification task, the last layer of the neural network should output values that can be interpreted as a probability distribution. This is done by the so-called softmax layer:

a) Evaluate the exponential function.

b) Normalize the output values to a sum of $1$.

$y_j = \frac{e^{x_j}}{\sum_{l=0}^{J-1} e^{x_l}}$

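The softmax layer is usually combined with the cross entropy as loss function, where $o_j$ is the one-hot encoded ground truth and $y_j$ is the output of the softmax layer:

$L=-\sum_{j=0}^{J-1} o_j\cdot\log y_j$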
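
The two formulas above can be checked numerically. The following NumPy sketch is only an illustration: the input activations and the one-hot ground truth are arbitrary example values, and subtracting the maximum before the exponential is a common numerical safeguard that is not part of the formula itself.

```python
import numpy as np

def softmax(x):
    # Subtracting the maximum does not change the result but avoids overflow in exp
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

def cross_entropy(target, y):
    # target: one-hot ground truth, y: softmax output
    return -np.sum(target * np.log(y))

x = np.array([2.0, 1.0, 0.1])       # arbitrary example activations
y = softmax(x)
target = np.array([1.0, 0.0, 0.0])  # the true class is index 0
print(y, y.sum(), cross_entropy(target, y))
```

With a one-hot ground truth, the sum reduces to the negative logarithm of the probability assigned to the true class. This is why a loss such as SparseCategoricalCrossentropy can work directly with integer class labels.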

## Model architecture
In the following block, the model architecture is defined:

In [4]:
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(train_images.shape[1], train_images.shape[2])),
    tf.keras.layers.Dense(units = 100, activation='LeakyReLU'),
    tf.keras.layers.Dense(units = 50, activation='LeakyReLU'),
    tf.keras.layers.Dense(len(Train.VOCABULARY), activation='softmax')
])

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              metrics=['accuracy'])

## Callbacks
The training should stop when the accuracy no longer increases. The early stopping callback can be used for this. The accuracy on the training data usually keeps increasing throughout the training, so it is a poor stopping criterion. The validation accuracy is a better control measure for early stopping.

As an additional callback, the training progress is saved in so-called checkpoints, which can be reloaded for later use.
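
The patience rule behind early stopping can be sketched in plain Python: training stops once the monitored value has not improved for `patience` consecutive epochs. This is a simplified illustration and not the Keras implementation. The function name `stopping_epoch` and the validation accuracies are made up for the example.

```python
def stopping_epoch(val_accuracy, patience):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float("-inf")
    wait = 0
    for epoch, acc in enumerate(val_accuracy, start=1):
        if acc > best:
            # Improvement: remember the new best value and reset the counter
            best = acc
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Made-up validation accuracies: the best value 0.70 is first reached in epoch 3
history = [0.50, 0.62, 0.70, 0.69, 0.70, 0.68, 0.69, 0.70, 0.66]
print(stopping_epoch(history, patience=5))  # 8
```

Reaching the best value again does not count as an improvement here, so the counter keeps running after epoch 3 and the training stops in epoch 8.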

In [5]:
checkpoint_path = TempFolder + "/cp.ckpt"
checkpoint_dir = os.path.dirname(checkpoint_path)
cbCheckpoints = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
                                                 save_weights_only=True,
                                                 verbose=1)
cbEarlyStopping = tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=5)

try:
    model.load_weights(checkpoint_path)
except:
    print('problem loading old weights, starting with scratch new network')

problem loading old weights, starting with scratch new network


## Training
Because of the early stopping callback, a very large number of epochs can be specified for the training. After the training has finished, the test accuracy is evaluated.

The test data is not seen by the algorithm during the training process. Therefore, the test accuracy is a good estimate of how well the model generalizes to new data.

In [6]:
history = model.fit(train_images, train_labels, epochs=1000,
                    validation_data=(validation_images, validation_labels),
                    callbacks=[cbEarlyStopping, cbCheckpoints], verbose = 1)
print('training finished after ', len(history.history['loss']), ' epochs')

test_loss, test_acc = model.evaluate(test_images,  test_labels, verbose=2)

print('\nTest accuracy:', test_acc)

Epoch 1/1000
Epoch 1: saving model to NeuralNetworks/VoiceControlByTensorflow\cp.ckpt
Epoch 2/1000
Epoch 2: saving model to NeuralNetworks/VoiceControlByTensorflow\cp.ckpt
Epoch 3/1000
Epoch 3: saving model to NeuralNetworks/VoiceControlByTensorflow\cp.ckpt
Epoch 4/1000
Epoch 4: saving model to NeuralNetworks/VoiceControlByTensorflow\cp.ckpt
Epoch 5/1000
Epoch 5: saving model to NeuralNetworks/VoiceControlByTensorflow\cp.ckpt
Epoch 6/1000
Epoch 6: saving model to NeuralNetworks/VoiceControlByTensorflow\cp.ckpt
Epoch 7/1000
Epoch 7: saving model to NeuralNetworks/VoiceControlByTensorflow\cp.ckpt
Epoch 8/1000
Epoch 8: saving model to NeuralNetworks/VoiceControlByTensorflow\cp.ckpt
Epoch 9/1000
Epoch 9: saving model to NeuralNetworks/VoiceControlByTensorflow\cp.ckpt
Epoch 10/1000
Epoch 10: saving model to NeuralNetworks/VoiceControlByTensorflow\cp.ckpt
Epoch 11/1000
Epoch 11: saving model to NeuralNetworks/VoiceControlByTensorflow\cp.ckpt
Epoch 12/1000
Epoch 12: saving model to NeuralNetw

## Programming exercise

The log-mel spectrogram used as input feature can be interpreted as an image. Search the web for standard image classification layers and models. Implement model as a sequential stack of layers that is typical for image classification tasks.
Compile the model and fit it in order to increase the test accuracy to the maximum possible value.
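
One practical detail when treating the features as images: 2-D convolution layers such as tf.keras.layers.Conv2D expect input of the shape (batch, height, width, channels), while the feature arrays in this notebook have three dimensions. A trailing channel axis of size 1 can be appended as sketched below. The array size of 30 x 39 is a placeholder and not the actual feature size.

```python
import numpy as np

# Stand-in for train_images: 4 samples of a hypothetical 30 x 39 feature matrix
features = np.zeros((4, 30, 39))

# Append a channel axis of size 1, like a grayscale image
features_4d = features[..., np.newaxis]
print(features_4d.shape)  # (4, 30, 39, 1)
```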

In [7]:
model = None
### solution begins

### solution ends

import unittest

class TestProgrammingExercise(unittest.TestCase):

    def test_1(self):
        test_loss, test_acc = model.evaluate(test_images,  test_labels, verbose=2)
        print('\nTest accuracy:', test_acc)        
        self.assertGreater(test_acc, 0.8)

unittest.main(argv=[''], verbosity=2, exit=False)

Epoch 1/1000
Epoch 1: saving model to NeuralNetworks/VoiceControlByTensorflow\cp.ckpt
Epoch 2/1000
Epoch 2: saving model to NeuralNetworks/VoiceControlByTensorflow\cp.ckpt
Epoch 3/1000
Epoch 3: saving model to NeuralNetworks/VoiceControlByTensorflow\cp.ckpt
Epoch 4/1000
Epoch 4: saving model to NeuralNetworks/VoiceControlByTensorflow\cp.ckpt
Epoch 5/1000
Epoch 5: saving model to NeuralNetworks/VoiceControlByTensorflow\cp.ckpt
Epoch 6/1000
Epoch 6: saving model to NeuralNetworks/VoiceControlByTensorflow\cp.ckpt
Epoch 7/1000
Epoch 7: saving model to NeuralNetworks/VoiceControlByTensorflow\cp.ckpt
Epoch 8/1000
Epoch 8: saving model to NeuralNetworks/VoiceControlByTensorflow\cp.ckpt
Epoch 9/1000
Epoch 9: saving model to NeuralNetworks/VoiceControlByTensorflow\cp.ckpt
Epoch 10/1000
Epoch 10: saving model to NeuralNetworks/VoiceControlByTensorflow\cp.ckpt
Epoch 11/1000
Epoch 11: saving model to NeuralNetworks/VoiceControlByTensorflow\cp.ckpt
Epoch 12/1000
Epoch 12: saving model to NeuralNetw

test_1 (__main__.TestProgrammingExercise.test_1) ... 

training finished after  28  epochs
6/6 - 0s - loss: 0.4073 - accuracy: 0.8944 - 246ms/epoch - 41ms/step


ok

----------------------------------------------------------------------
Ran 1 test in 0.336s

OK



Test accuracy: 0.894444465637207


<unittest.main.TestProgram at 0x1fa60782710>

## Exam preparation

1) Assume a dataset with nine different classes and the corresponding counts: label0: 136 times, label1: 180 times, label2: 180 times, label3: 180 times, label4: 180 times, label5: 213 times, label6: 180 times, label7: 136 times and label8: 180 times. Evaluate the accuracy of the simplest possible classifier. Is the dataset balanced?

2) Evaluate the derivative of the softmax layer: $\frac{dy_j}{dx_i}$.

3) Evaluate the derivative of the loss given by the cross entropy with respect to the input of the softmax layer: $\frac{dL}{dx_i}=\sum_j\frac{dL}{dy_j}\cdot\frac{dy_j}{dx_i}$.