# CNN Speaker Recognition using Keras/TensorFlow
## Advanced ML Final Part 1
### By: Daniel Hill

#### Business Use Case:
This project is part of an MS in Business Analytics program, and the assignment is meant to be applied to a business use case. For Part 1 of my final, I will use the following hypothetical scenario:

A business has begun recording its meetings in light of recent advances in AI. The recordings are run through a transcription tool, and the transcripts are fed to an LLM that produces a summary of each meeting. As a next step, the company would like to know who is speaking at what time. It has recorded each of its employees saying 20 phrases. The first 10 phrases are the same for every employee, and the second 10 are all different. The company would like to see whether a model trained on the 10 shared phrases can identify the speaker in the second 10 phrases.

#### This Notebook:

I will be reimplementing the code here: [Keras Speaker Recognition Example](https://keras.io/examples/audio/speaker_recognition_using_cnn/).

I'm using a different dataset. The dataset I'm using can be obtained here: [Voice-Based Human Identity Recognition Dataset](https://data.mendeley.com/datasets/zw4p4p7sdh/1).

Uncomment the following cell to install required packages:

In [1]:
# Uncomment to install the required packages:
# !pip install -r requirements.txt

In [5]:
# Uncomment to download the dataset
# !curl -O https://prod-dcd-datasets-cache-zipfiles.s3.eu-west-1.amazonaws.com/zw4p4p7sdh-1.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  959M  100  959M    0     0   359k      0  0:45:33  0:45:33 --:--:--  434k

In [9]:
# Uncomment to rename the zipped file
# !mv zw4p4p7sdh-1.zip AudioData.zip

In [10]:
# Uncomment to unzip the dataset after you download
# !unzip -qq AudioData.zip

In [71]:
# imports
import os
import shutil
import io
import subprocess

import numpy as np
import setuptools as _setuptools
import tensorflow as tf
import keras

from pathlib import Path
from IPython.display import display, Audio
import soundfile as sf

The dataset used in the Keras example includes a folder of noise samples. The example also keeps all speaker data in a single folder and performs the train-test split within that data. I will be modifying this approach slightly.

## Setup

In [65]:
# Define global variables

DATA_FOLDER = "DataFolder"

# Sampling rate in Hz. The Keras example uses 16000 because its clips are
# one second long at 16 kHz, so each clip also holds 16000 samples.
# TODO: confirm the sampling rate and clip length of this dataset.
SAMPLING_RATE = 16000

# Seed to use when shuffling the dataset
SHUFFLE_SEED = 43

# Batch size for model training
BATCH_SIZE = 128

# Set epochs for model training
EPOCHS = 1
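
The open question about `SAMPLING_RATE` can be checked against the data once the clips are available as WAV files. The sketch below uses Python's standard `wave` module to report a file's sampling rate, sample count, and channel count. `describe_wav` is a hypothetical helper name, and the demo writes one second of silence to a temporary file only so that the snippet runs on its own.

```python
import os
import tempfile
import wave


def describe_wav(path):
    """Return (sampling_rate_hz, num_samples, num_channels) for a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        return (
            wav_file.getframerate(),
            wav_file.getnframes(),
            wav_file.getnchannels(),
        )


# Demo: write one second of 16-bit mono silence at 16 kHz, then inspect it.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
    wav_file.setframerate(16000)
    wav_file.writeframes(b"\x00\x00" * 16000)

rate, num_samples, channels = describe_wav(demo_path)
print(rate, num_samples, channels)
```

Running `describe_wav` on a few files in `DATA_FOLDER` would show whether 16000 matches this dataset and how long the clips are.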

## Dataset Generation

In [60]:
def convert_flac_to_wav(source_folder, target_folder):
    """Converts every .flac file under source_folder to a .wav file in target_folder.

    Note: all .wav files are written directly into target_folder, so the
    sub-folder structure of source_folder is not preserved.
    """
    os.makedirs(target_folder, exist_ok=True)

    for subdir, dirs, files in os.walk(source_folder):
        for file in files:
            if file.endswith('.flac'):
                flac_path = os.path.join(subdir, file)
                wav_path = os.path.join(target_folder, os.path.splitext(file)[0] + '.wav')

                # Convert flac to wav. -y overwrites an existing file instead of
                # waiting for a prompt, and check=True raises if ffmpeg fails.
                subprocess.run(
                    ['ffmpeg', '-y', '-i', flac_path, wav_path],
                    stdout=subprocess.PIPE,
                    stderr=subprocess.PIPE,
                    check=True,
                )

# Example usage
source_folder = 'A Dataset for Voice-Based Human Identity Recognition'  # Path to the source folder
target_folder = DATA_FOLDER  # Path to the target folder where .wav files will be saved
convert_flac_to_wav(source_folder, target_folder)
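
`tf.audio.decode_wav` expects 16-bit PCM WAV input, and the rest of the notebook assumes mono audio at `SAMPLING_RATE`. As a sketch, the ffmpeg call could request that format explicitly. The helper name `build_ffmpeg_command` and the file names in the demo are hypothetical. `-y`, `-ar`, `-ac`, and `-c:a pcm_s16le` are standard ffmpeg options for overwriting the output, setting the sampling rate, setting the channel count, and choosing 16-bit PCM encoding.

```python
def build_ffmpeg_command(flac_path, wav_path, sampling_rate=16000, channels=1):
    """Build an ffmpeg argument list that converts one file to 16-bit PCM WAV."""
    return [
        "ffmpeg",
        "-y",                       # overwrite wav_path if it already exists
        "-i", flac_path,            # input file
        "-ar", str(sampling_rate),  # resample to the requested rate (Hz)
        "-ac", str(channels),       # convert to the requested channel count
        "-c:a", "pcm_s16le",        # 16-bit little-endian PCM
        wav_path,
    ]


# Illustrative paths only; the real source file names may differ.
command = build_ffmpeg_command("105-4.flac", "DataFolder/105-4.wav")
print(" ".join(command))
```

The returned list can be passed to `subprocess.run` in place of the inline argument list.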

In [66]:
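def path_to_audio(path):
    """Reads a 16-bit PCM WAV file and decodes it as mono audio."""
    audio = tf.io.read_file(path)
    # desired_samples fixes the length of every decoded clip to SAMPLING_RATE samples
    audio, _ = tf.audio.decode_wav(
        audio, desired_channels=1, desired_samples=SAMPLING_RATE
    )
    return audio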
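
Because `tf.audio.decode_wav` only accepts WAV data, a file that is really FLAC fails with the "Expected RIFF but found fLaC" message that appears in the training log later in this notebook. A quick way to check a file before it reaches the pipeline is to look at its first bytes. This is a minimal sketch: `detect_audio_container` is a hypothetical helper, and the demo files contain headers only, not playable audio.

```python
import os
import tempfile


def detect_audio_container(path):
    """Guess the container format from the first 12 bytes of a file."""
    with open(path, "rb") as f:
        header = f.read(12)
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "wav"
    if header[:4] == b"fLaC":
        return "flac"
    return "unknown"


# Demo with two tiny stand-in files (headers only).
tmp_dir = tempfile.mkdtemp()
wav_like = os.path.join(tmp_dir, "a.wav")
flac_like = os.path.join(tmp_dir, "b.wav")  # wrong extension on purpose
with open(wav_like, "wb") as f:
    f.write(b"RIFF\x00\x00\x00\x00WAVE")
with open(flac_like, "wb") as f:
    f.write(b"fLaC\x00\x00\x00\x22")

print(detect_audio_container(wav_like), detect_audio_container(flac_like))
```

The check relies on file contents rather than the extension, so it also catches FLAC data saved under a `.wav` name.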

In [67]:
def paths_and_labels_to_dataset(audio_paths, labels):
    """Constructs a dataset of audios and labels."""
    path_ds = tf.data.Dataset.from_tensor_slices(audio_paths)
    audio_ds = path_ds.map(
        lambda x: path_to_audio(x), num_parallel_calls=tf.data.AUTOTUNE
    )
    label_ds = tf.data.Dataset.from_tensor_slices(labels)
    return tf.data.Dataset.zip((audio_ds, label_ds))

In [68]:
def audio_to_fft(audio):
    # Since tf.signal.fft applies FFT on the innermost dimension,
    # we need to squeeze the dimensions and then expand them again
    # after FFT
    audio = tf.squeeze(audio, axis=-1)
    fft = tf.signal.fft(
        tf.cast(tf.complex(real=audio, imag=tf.zeros_like(audio)), tf.complex64)
    )
    fft = tf.expand_dims(fft, axis=-1)

    # Return the absolute value of the first half of the FFT
    # which represents the positive frequencies
    return tf.math.abs(fft[:, : (audio.shape[1] // 2), :])
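
To see what `audio_to_fft` returns, here is a NumPy sketch of the same steps applied to a synthetic signal. `audio_to_fft_numpy` is a hypothetical counterpart written for illustration, not part of the pipeline. A sine wave with 5 cycles across the window should produce a peak at frequency bin 5, and the output keeps only the first half of the bins.

```python
import numpy as np


def audio_to_fft_numpy(audio):
    """NumPy counterpart: magnitude of the positive-frequency half of the FFT.

    audio has shape (batch, samples, 1); the result has shape
    (batch, samples // 2, 1).
    """
    signal = np.squeeze(audio, axis=-1)           # (batch, samples)
    spectrum = np.fft.fft(signal, axis=-1)        # FFT along the innermost axis
    spectrum = np.expand_dims(spectrum, axis=-1)  # (batch, samples, 1)
    return np.abs(spectrum[:, : signal.shape[1] // 2, :])


# A sine wave with 5 cycles over 64 samples.
num_samples = 64
t = np.arange(num_samples) / num_samples
audio = np.sin(2 * np.pi * 5 * t).reshape(1, num_samples, 1)

magnitudes = audio_to_fft_numpy(audio)
peak_bin = int(np.argmax(magnitudes[0, :, 0]))
print(magnitudes.shape, peak_bin)
```

This is why the model input shape used later is `(SAMPLING_RATE // 2, 1)`: a clip of `SAMPLING_RATE` samples yields half as many positive-frequency bins.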

In [69]:
class_names = os.listdir(DATA_FOLDER)
print(
    "Train class names: {}".format(
        class_names,
    )
)

Train class names: ['105-4.wav', '82-8.wav', '78-11.wav', '140-4.wav', '23-15.wav', '139-11.wav', '123-7.wav', '91-2.wav', '71-7.wav', '24-20.wav', '137-14.wav', '34-7.wav', '12-4.wav', '57-4.wav', '76-14.wav', '118-1.wav', '144-11.wav', '87-12.wav', '143-18.wav', '59-20.wav', '116-19.wav', '29-2.wav', '118-20.wav', '50-10.wav', '57-19.wav', '129-14.wav', '142-6.wav', '107-6.wav', '34-19.wav', '68-14.wav', '121-5.wav', '33-10.wav', '61-18.wav', '36-5.wav', '73-5.wav', '66-11.wav', '120-18.wav', '55-6.wav', '127-11.wav', '10-6.wav', '108-19.wav', '97-17.wav', '47-20.wav', '49-19.wav', '15-14.wav', '109-9.wav', '106-20.wav', '40-15.wav', '101-15.wav', '48-3.wav', '99-12.wav', '95-6.wav', '127-3.wav', '31-15.wav', '134-9.wav', '66-9.wav', '38-19.wav', '64-14.wav', '23-9.wav', '125-14.wav', '30-3.wav', '36-20.wav', '75-3.wav', '88-3.wav', '151-18.wav', '95-12.wav', '10-18.wav', '17-11.wav', '45-19.wav', '19-14.wav', '42-10.wav', '104-19.wav', '68-6.wav', '103-10.wav', '21-10.wav', '125-1.w

In [70]:
# Get all labels and process the audio.
# convert_flac_to_wav writes every clip directly into DATA_FOLDER, so the
# entries in `class_names` are files, not speaker folders.
# Assumption: each file is named "<speaker id>-<phrase number>.wav",
# for example "105-4.wav".

wav_files = sorted(name for name in class_names if name.endswith(".wav"))
speaker_ids = sorted({name.split("-")[0] for name in wav_files})
speaker_to_label = {speaker: label for label, speaker in enumerate(speaker_ids)}

audio_paths = []
labels = []
for name in wav_files:
    audio_paths.append(os.path.join(DATA_FOLDER, name))
    labels.append(speaker_to_label[name.split("-")[0]])

print(
    "Found {} files belonging to {} classes.".format(len(audio_paths), len(speaker_ids))
)
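
The business case calls for training on the 10 shared phrases and testing on the other 10. Assuming the flat file names listed above follow a `<speaker>-<phrase>.wav` pattern (as in `105-4.wav`) and that phrases 1 to 10 are the shared ones, the split could be sketched as follows. `split_by_phrase` is a hypothetical helper, and both assumptions should be checked against the dataset's documentation.

```python
def split_by_phrase(filenames, last_shared_phrase=10):
    """Split "<speaker>-<phrase>.wav" names into train and test lists.

    Assumption: phrases 1..last_shared_phrase are the shared phrases (train)
    and higher phrase numbers are the speaker-specific phrases (test).
    """
    speakers = sorted({name.split("-")[0] for name in filenames}, key=int)
    label_of = {speaker: label for label, speaker in enumerate(speakers)}

    train, test = [], []
    for name in sorted(filenames):
        speaker, phrase = name[: -len(".wav")].split("-")
        target = train if int(phrase) <= last_shared_phrase else test
        target.append((name, label_of[speaker]))
    return speakers, train, test


speakers, train, test = split_by_phrase(
    ["105-4.wav", "82-8.wav", "78-11.wav", "105-20.wav"]
)
print(speakers)
print(train)
print(test)
```

Each list holds `(file name, integer label)` pairs, and the labels index into `speakers`, so the same speaker gets the same label in both splits.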

In [50]:
# Get all testing labels and process the audio

test_audio_paths = []
test_labels = []
for label, name in enumerate(test_class_names):
    print(
        "Processing speaker {}".format(
            name,
        )
    )
    dir_path = Path(DATASET_TEST_PATH) / name
    speaker_sample_paths = [
        os.path.join(dir_path, filepath)
        for filepath in os.listdir(dir_path)
        if filepath.endswith(".flac")
    ]
    test_audio_paths += speaker_sample_paths
    test_labels += [label] * len(speaker_sample_paths)

print(
    "Found {} files belonging to {} classes.".format(len(test_audio_paths), len(test_class_names))
)

Processing speaker 135
Processing speaker 61
Processing speaker 95
Processing speaker 132
Processing speaker 59
Processing speaker 92
Processing speaker 66
Processing speaker 104
Processing speaker 50
Processing speaker 68
Processing speaker 103
Processing speaker 57
Processing speaker 150
Processing speaker 32
Processing speaker 35
Processing speaker 102
Processing speaker 69
Processing speaker 56
Processing speaker 105
Processing speaker 51
Processing speaker 58
Processing speaker 133
Processing speaker 67
Processing speaker 93
Processing speaker 134
Processing speaker 94
Processing speaker 60
Processing speaker 34
Processing speaker 33
Processing speaker 20
Processing speaker 18
Processing speaker 27
Processing speaker 9
Processing speaker 145
Processing speaker 11
Processing speaker 142
Processing speaker 7
Processing speaker 29
Processing speaker 16
Processing speaker 129
Processing speaker 42
Processing speaker 89
Processing speaker 116
Processing speaker 45
Processing speaker 11

In [51]:
# Shuffle training set
rng = np.random.RandomState(SHUFFLE_SEED)
rng.shuffle(train_audio_paths)

rng = np.random.RandomState(SHUFFLE_SEED)
rng.shuffle(train_labels)


# Shuffle testing set
rng = np.random.RandomState(SHUFFLE_SEED)
rng.shuffle(test_audio_paths)

rng = np.random.RandomState(SHUFFLE_SEED)
rng.shuffle(test_labels)
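
The shuffle cell re-creates the random generator with the same seed before each call so that the paths and the labels receive the same permutation and stay aligned. The toy example below illustrates this with made-up file names.

```python
import numpy as np

SHUFFLE_SEED = 43

# Made-up paths and labels of equal length.
paths = ["a.wav", "b.wav", "c.wav", "d.wav", "e.wav"]
labels = [0, 1, 2, 3, 4]
original_pairs = set(zip(paths, labels))

# Re-seeding before each shuffle gives both lists the same permutation.
rng = np.random.RandomState(SHUFFLE_SEED)
rng.shuffle(paths)
rng = np.random.RandomState(SHUFFLE_SEED)
rng.shuffle(labels)

print(list(zip(paths, labels)))
```

This only holds when both lists have the same length. If they differ, the permutations differ and the pairing is lost.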

In [58]:
# Create 2 datasets, one for training and the other for validation
train_ds = paths_and_labels_to_dataset(train_audio_paths, train_labels)
train_ds = train_ds.shuffle(buffer_size=BATCH_SIZE * 8, seed=SHUFFLE_SEED).batch(
    BATCH_SIZE
)

valid_ds = paths_and_labels_to_dataset(test_audio_paths, test_labels)
valid_ds = valid_ds.shuffle(buffer_size=32 * 8, seed=SHUFFLE_SEED).batch(32)

AttributeError: in user code:

    File "/var/folders/wt/222lk3fx0yl_nndvq_nx48gw0000gn/T/ipykernel_93072/1601657103.py", line 5, in None  *
        lambda x: path_to_audio(x)
    File "/var/folders/wt/222lk3fx0yl_nndvq_nx48gw0000gn/T/ipykernel_93072/2988448067.py", line 4, in path_to_audio  *
        audio_content = audio_binary.numpy()  # Convert to numpy array

    AttributeError: 'SymbolicTensor' object has no attribute 'numpy'


In [40]:
# Transform audio wave to the frequency domain using `audio_to_fft`
train_ds = train_ds.map(
    lambda x, y: (audio_to_fft(x), y), num_parallel_calls=tf.data.AUTOTUNE
)
train_ds = train_ds.prefetch(tf.data.AUTOTUNE)

valid_ds = valid_ds.map(
    lambda x, y: (audio_to_fft(x), y), num_parallel_calls=tf.data.AUTOTUNE
)
valid_ds = valid_ds.prefetch(tf.data.AUTOTUNE)

## Model Definition

In [41]:
def residual_block(x, filters, conv_num=3, activation="relu"):
    # Shortcut
    s = keras.layers.Conv1D(filters, 1, padding="same")(x)
    for i in range(conv_num - 1):
        x = keras.layers.Conv1D(filters, 3, padding="same")(x)
        x = keras.layers.Activation(activation)(x)
    x = keras.layers.Conv1D(filters, 3, padding="same")(x)
    x = keras.layers.Add()([x, s])
    x = keras.layers.Activation(activation)(x)
    return keras.layers.MaxPool1D(pool_size=2, strides=2)(x)


def build_model(input_shape, num_classes):
    inputs = keras.layers.Input(shape=input_shape, name="input")

    x = residual_block(inputs, 16, 2)
    x = residual_block(x, 32, 2)
    x = residual_block(x, 64, 3)
    x = residual_block(x, 128, 3)
    x = residual_block(x, 128, 3)

    x = keras.layers.AveragePooling1D(pool_size=3, strides=3)(x)
    x = keras.layers.Flatten()(x)
    x = keras.layers.Dense(256, activation="relu")(x)
    x = keras.layers.Dense(128, activation="relu")(x)

    outputs = keras.layers.Dense(num_classes, activation="softmax", name="output")(x)

    return keras.models.Model(inputs=inputs, outputs=outputs)
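
As a sanity check on the architecture, the sequence length can be traced by hand. Each residual block ends in `MaxPool1D(pool_size=2, strides=2)`, and the final `AveragePooling1D` uses a pool size and stride of 3. The sketch below assumes the default "valid" padding of both pooling layers and an input of `SAMPLING_RATE // 2` FFT bins. `pooled_length` is a hypothetical helper.

```python
def pooled_length(length, pool_size, strides):
    """Output length of a 1D pooling layer with "valid" padding."""
    return (length - pool_size) // strides + 1


length = 16000 // 2          # FFT bins fed to the model (SAMPLING_RATE // 2)
for _ in range(5):           # five residual blocks, each ending in MaxPool1D(2, 2)
    length = pooled_length(length, pool_size=2, strides=2)
after_blocks = length

length = pooled_length(length, pool_size=3, strides=3)  # AveragePooling1D(3, 3)
flattened = length * 128     # the last residual block has 128 filters

print(after_blocks, length, flattened)
```

The flattened size is what the first `Dense(256)` layer receives, and it should match the `Flatten` row in `model.summary()`.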

In [42]:
# If this breaks, maybe check the train_class_names unique values and compare to test_class_names values.
model = build_model((SAMPLING_RATE // 2, 1), len(train_class_names))

model.summary()

# Compile the model using Adam's default learning rate
model.compile(
    optimizer="Adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Add callbacks:
# 'EarlyStopping' to stop training when the model is no longer improving
# 'ModelCheckpoint' to always keep the model that has the best val_accuracy
model_save_filename = "model.keras"

earlystopping_cb = keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
mdlcheckpoint_cb = keras.callbacks.ModelCheckpoint(
    model_save_filename, monitor="val_accuracy", save_best_only=True
)
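
Two details of these callbacks are worth noting. `EarlyStopping` monitors `val_loss` by default, while the checkpoint monitors `val_accuracy`, so the two track different metrics. Also, with `patience=10` and `EPOCHS = 1`, training ends before early stopping can take effect. The function below is a simplified illustration of patience-based stopping, not Keras's implementation, and `early_stopping_epoch` is a hypothetical name.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which patience runs out, or None.

    Simplified model of patience-based stopping: stop once the monitored
    value has failed to improve on its best for `patience` epochs in a row.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


print(early_stopping_epoch([0.9, 0.8, 0.85, 0.82, 0.81], patience=3))
print(early_stopping_epoch([0.9], patience=10))
```

In the second call a single epoch can never exhaust a patience of 10, which mirrors the current `EPOCHS = 1` setting.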

## Training

In [43]:
history = model.fit(
    train_ds,
    epochs=EPOCHS,
    validation_data=valid_ds,
    callbacks=[earlystopping_cb, mdlcheckpoint_cb],
)

2024-06-03 11:08:05.519767: W tensorflow/core/framework/op_kernel.cc:1839] OP_REQUIRES failed at decode_wav_op.cc:55 : INVALID_ARGUMENT: Header mismatch: Expected RIFF but found fLaC

InvalidArgumentError: Graph execution error:

Detected at node DecodeWav defined at (most recent call last):
<stack traces unavailable>
Header mismatch: Expected RIFF but found fLaC
	 [[{{node DecodeWav}}]]
	 [[IteratorGetNext]] [Op:__inference_one_step_on_iterator_7649]

In [None]:
print(model.evaluate(valid_ds))

## Demonstration

In [None]:
SAMPLES_TO_DISPLAY = 10

test_ds = paths_and_labels_to_dataset(test_audio_paths, test_labels)
test_ds = test_ds.shuffle(buffer_size=BATCH_SIZE * 8, seed=SHUFFLE_SEED).batch(
    BATCH_SIZE
)

for audios, labels in test_ds.take(1):
    # Get the signal FFT
    ffts = audio_to_fft(audios)
    # Predict
    y_pred = model.predict(ffts)
    # Take random samples; index by the actual batch size, because a batch
    # may hold fewer than BATCH_SIZE clips
    rnd = np.random.randint(0, audios.shape[0], SAMPLES_TO_DISPLAY)
    audios = audios.numpy()[rnd, :, :]
    labels = labels.numpy()[rnd]
    y_pred = np.argmax(y_pred, axis=-1)[rnd]

    for index in range(SAMPLES_TO_DISPLAY):
        # For every sample, print the true and predicted label
        # and play the audio clip
        print(
            "Speaker:\33{} {}\33[0m\tPredicted:\33{} {}\33[0m".format(
                "[92m" if labels[index] == y_pred[index] else "[91m",
                train_class_names[labels[index]],
                "[92m" if labels[index] == y_pred[index] else "[91m",
                train_class_names[y_pred[index]],
            )
        )
        display(Audio(audios[index, :, :].squeeze(), rate=SAMPLING_RATE))
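
The format string in the loop above wraps each name in ANSI escape codes: `\33[92m` switches the terminal to green, `\33[91m` to red, and `\33[0m` resets the color. The same idea can be written as a small helper, sketched here with a hypothetical name `format_prediction` and made-up speaker IDs.

```python
GREEN = "\33[92m"
RED = "\33[91m"
RESET = "\33[0m"


def format_prediction(true_name, predicted_name):
    """Format one result line, green when the prediction is correct, red otherwise."""
    color = GREEN if true_name == predicted_name else RED
    return "Speaker:{} {}{}\tPredicted:{} {}{}".format(
        color, true_name, RESET, color, predicted_name, RESET
    )


print(format_prediction("105", "105"))
print(format_prediction("105", "82"))
```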