# Speaker Identification

## Installation & Imports

[VGGish Embedding Colab](https://colab.research.google.com/drive/1E3CaPAqCai9P9QhJ3WYPNCVmrJU4lAhF)

In [1]:
!pip3 install tensorflow==2.8.0
!pip3 install tensorflow-io==0.25.0
!pip install numpy scipy
!pip install resampy tensorflow
!pip install tf_slim
!rm -rf models
!git clone https://github.com/tensorflow/models.git

Cloning into 'models'...
remote: Enumerating objects: 90080, done.
remote: Counting objects: 100% (113/113), done.
remote: Compressing objects: 100% (60/60), done.
remote: Total 90080 (delta 61), reused 100 (delta 50), pack-reused 89967
Receiving objects: 100% (90080/90080), 606.63 MiB | 25.46 MiB/s, done.
Resolving deltas: 100% (64896/64896), done.


In [2]:
# Check where we are in the kernel's file system.
!pwd

/content


In [3]:
# Grab the VGGish model
!curl -O https://storage.googleapis.com/audioset/vggish_model.ckpt
!curl -O https://storage.googleapis.com/audioset/vggish_pca_params.npz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  277M  100  277M    0     0  38.3M      0  0:00:07  0:00:07 --:--:-- 43.5M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 73020  100 73020    0     0   164k      0 --:--:-- --:--:-- --:--:--  164k


In [4]:
# Make sure we got the model data.
!ls

drive		 requirements.txt	   vggish_model.ckpt	  vggish_smoke_test.py
mel_features.py  sample_data		   vggish_params.py	  vggish_train_demo.py
models		 vggish_export_tfhub.py    vggish_pca_params.npz
__pycache__	 vggish_inference_demo.py  vggish_postprocess.py
README.md	 vggish_input.py	   vggish_slim.py


In [5]:
# Verify the location of the AudioSet source files
!ls models/research/audioset/vggish

mel_features.py   vggish_export_tfhub.py    vggish_params.py	   vggish_smoke_test.py
README.md	  vggish_inference_demo.py  vggish_postprocess.py  vggish_train_demo.py
requirements.txt  vggish_input.py	    vggish_slim.py


In [6]:
# Copy the source files to the current directory.
!cp models/research/audioset/vggish/* .

In [7]:
# Make sure the source files got copied correctly.
!ls

drive		 requirements.txt	   vggish_model.ckpt	  vggish_smoke_test.py
mel_features.py  sample_data		   vggish_params.py	  vggish_train_demo.py
models		 vggish_export_tfhub.py    vggish_pca_params.npz
__pycache__	 vggish_inference_demo.py  vggish_postprocess.py
README.md	 vggish_input.py	   vggish_slim.py


In [8]:
# Run the test, which also loads all the necessary functions.
from vggish_smoke_test import *


Testing your install of VGGish

Resampling via resampy works!
Log Mel Spectrogram example:  [[-4.48313252 -4.27083405 -4.17064267 ... -4.60069383 -4.60098887
  -4.60116305]
 [-4.48313252 -4.27083405 -4.17064267 ... -4.60069383 -4.60098887
  -4.60116305]
 [-4.48313252 -4.27083405 -4.17064267 ... -4.60069383 -4.60098887
  -4.60116305]
 ...
 [-4.48313252 -4.27083405 -4.17064267 ... -4.60069383 -4.60098887
  -4.60116305]
 [-4.48313252 -4.27083405 -4.17064267 ... -4.60069383 -4.60098887
  -4.60116305]
 [-4.48313252 -4.27083405 -4.17064267 ... -4.60069383 -4.60098887
  -4.60116305]]




VGGish embedding:  [-2.72986382e-01 -1.80314153e-01  5.19921184e-02 -1.43571526e-01
 -1.04673728e-01 -4.96598154e-01 -1.75267965e-01  4.23147976e-01
 -8.22126150e-01 -2.16801405e-01 -1.17509276e-01 -6.70077026e-01
  1.43174574e-01 -1.44183934e-01  8.73491913e-03 -8.71972442e-02
 -1.84393525e-01  5.96655607e-01 -3.43809605e-01 -5.79104424e-02
 -1.65071294e-01  4.22911644e-02 -2.55293399e-01 -2.36356765e-01
  1.80295616e-01  3.02612185e-01  1.08356833e-01 -4.48398024e-01
  1.22757629e-01 -2.99955189e-01 -5.55934191e-01  5.05966544e-01
  2.05210358e-01  8.87591839e-01  9.03702497e-01 -2.10566416e-01
 -3.27462405e-02  1.38691410e-01 -2.27416530e-01  1.14804000e-01
  5.95410109e-01 -4.76971269e-01  2.28232622e-01  1.54627025e-01
  1.64934218e-01  7.19252825e-01  1.24101830e+00  5.61996222e-01
  2.73531973e-01  3.09788287e-02  2.10977703e-01 -6.09551668e-01
 -3.15282375e-01  1.76392645e-01 -8.96190405e-02 -4.26822364e-01
  3.12993884e-01 -1.56592295e-01  3.31673503e-01  1.29436389e-01
  1.66

Necessary imports

In [9]:
import sys
import os
import random
from datetime import datetime
from pathlib import Path
from typing import Tuple

import librosa
import numpy as np
import keras
import matplotlib.pyplot as plt
from matplotlib import cm

import tensorflow as tf
import tensorflow_io as tfio
print("TensorFlow version:", tf.__version__)
print("TensorFlow IO version:", tfio.__version__)
print(tf.executing_eagerly())
%load_ext tensorboard

TensorFlow version: 2.8.0
TensorFlow IO version: 0.25.0
True


In [10]:
from google.colab import drive
drive.mount('/content/drive')
ROOT_DIR='/content/drive/MyDrive/College/Research/Linh_2023_Research'

DATASET_PATH=ROOT_DIR+'/test_data/vox'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Dataset preparation
[Keras Speaker Recognition](https://keras.io/examples/audio/speaker_recognition_using_cnn/)

In [11]:
SAMPLING_RATE = 16000
BATCH_SIZE = 128
SHUFFLE_SEED = 43
TRAIN_VALID_SPLIT = 0.2
EPOCHS = 100

def paths_and_labels_to_dataset(audio_paths, labels):
    """Constructs a dataset of audios and labels."""
    path_ds = tf.data.Dataset.from_tensor_slices(audio_paths)
    audio_ds = path_ds.map(lambda x: path_to_audio(x))
    label_ds = tf.data.Dataset.from_tensor_slices(labels)
    return tf.data.Dataset.zip((audio_ds, label_ds))


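def path_to_audio(path):
    """Reads and decodes a mono WAV file into a fixed-length tensor."""
    audio = tf.io.read_file(path)
    # The third argument of decode_wav is `desired_samples` (an output length),
    # not a sample rate. Passing SAMPLING_RATE requests exactly one second of
    # audio, which assumes the files are recorded at 16 kHz.
    audio, _ = tf.audio.decode_wav(
        audio, desired_channels=1, desired_samples=SAMPLING_RATE
    )
    return audio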

### FFT & MFCC Pre-processing


In [12]:
N_MFCC = 13

def audio_to_fft(audio):
    # Since tf.signal.fft applies FFT on the innermost dimension,
    # we need to squeeze the dimensions and then expand them again
    # after FFT
    audio = tf.squeeze(audio, axis=-1)
    fft = tf.signal.fft(
        tf.cast(tf.complex(real=audio, imag=tf.zeros_like(audio)), tf.complex64)
    )
    fft = tf.expand_dims(fft, axis=-1)

    # Return the absolute value of the first half of the FFT
    # which represents the positive frequencies
    return tf.math.abs(fft[:, : (audio.shape[1] // 2), :])

def audio_to_mfcc(audio):
    audio = tf.squeeze(audio, axis=-1)
    # Compute the short-time Fourier transform; the MFCCs are derived from it below
    stfts = tf.signal.stft(audio, frame_length=1024, frame_step=256, fft_length=1024)
    spectrograms = tf.abs(stfts)

    # Warp the linear scale spectrograms into the mel-scale
    num_spectrogram_bins = stfts.shape[-1]
    lower_edge_hertz, upper_edge_hertz = 80.0, 7600.0
    linear_to_mel_weight_matrix = tf.signal.linear_to_mel_weight_matrix(
        N_MFCC, num_spectrogram_bins, SAMPLING_RATE, lower_edge_hertz, upper_edge_hertz)
    mel_spectrograms = tf.tensordot(spectrograms, linear_to_mel_weight_matrix, 1)
    mel_spectrograms.set_shape(spectrograms.shape[:-1].concatenate(linear_to_mel_weight_matrix.shape[-1:]))

    # Compute a stabilized log to get log-magnitude mel-scale spectrograms
    log_mel_spectrograms = tf.math.log(mel_spectrograms + 1e-6)

    # Compute MFCCs from log_mel_spectrograms and take the first N_MFCC
    mfccs = tf.signal.mfccs_from_log_mel_spectrograms(log_mel_spectrograms)[..., :N_MFCC]
    # print(mfccs)

    return mfccs
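
To make the "first half of the FFT" comment in `audio_to_fft` concrete, here is a minimal NumPy sketch. NumPy stands in for `tf.signal.fft`, and the 440 Hz test tone is an assumption chosen purely for illustration. For a one-second clip at 16 kHz each FFT bin is 1 Hz wide, so the positive-frequency half has 8000 bins and a pure tone should peak at the bin matching its frequency.

```python
import numpy as np

SAMPLING_RATE = 16000  # Hz, same value as in the notebook
TONE_HZ = 440          # hypothetical test tone, for illustration only

# One second of a pure sine wave.
t = np.arange(SAMPLING_RATE) / SAMPLING_RATE
audio = np.sin(2 * np.pi * TONE_HZ * t)

# Magnitude of the positive-frequency half of the spectrum,
# mirroring what audio_to_fft returns for each clip.
spectrum = np.abs(np.fft.fft(audio))[: len(audio) // 2]

# Each bin is SAMPLING_RATE / len(audio) Hz wide (1 Hz here).
peak_hz = int(np.argmax(spectrum)) * SAMPLING_RATE // len(audio)
print(spectrum.shape, peak_hz)
```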

### VGGish Embeddings

In [31]:
import vggish_slim
import vggish_params
import vggish_input


def load_vggish_slim_checkpoint(session, checkpoint_path):
    """Defines VGGish in the session's graph and loads a pre-trained checkpoint.

    Only variables defined by the inference-mode VGGish model are restored.

    Args:
        session: a tf.compat.v1.Session whose graph will hold the model.
        checkpoint_path: path to a file containing a checkpoint that is
          compatible with the VGGish model definition.
    """
    with session.graph.as_default():
        # Define the inference-mode model; in a fresh graph, every global
        # variable that exists afterwards belongs to VGGish.
        vggish_slim.define_vggish_slim(training=False)
        vggish_vars = tf.compat.v1.global_variables()

        # Use a TF1 Saver to restore just the variables selected above.
        # Saver.restore needs the session as well as the checkpoint path.
        saver = tf.compat.v1.train.Saver(vggish_vars, name='vggish_load_pretrained')
        saver.restore(session, checkpoint_path)


def create_vggish_network(hop_size=0.96):   # Hop size is in seconds.
    """Define VGGish model, load the checkpoint, and return a dictionary that points
    to the different tensors defined by the model.
    """
    checkpoint_path = 'vggish_model.ckpt'
    vggish_params.EXAMPLE_HOP_SECONDS = hop_size

    # VGGish is a TF1-style model (graph + session), so it is built in its own
    # graph instead of inside a @tf.function.
    g = tf.Graph()
    session = tf.compat.v1.Session(graph=g)
    load_vggish_slim_checkpoint(session, checkpoint_path)

    features_tensor = g.get_tensor_by_name(vggish_params.INPUT_TENSOR_NAME)
    embedding_tensor = g.get_tensor_by_name(vggish_params.OUTPUT_TENSOR_NAME)

    layers = {'conv1': 'vggish/conv1/Relu:0',
              'pool1': 'vggish/pool1/MaxPool:0',
              'conv2': 'vggish/conv2/Relu:0',
              'pool2': 'vggish/pool2/MaxPool:0',
              'conv3': 'vggish/conv3/conv3_2/Relu:0',
              'pool3': 'vggish/pool3/MaxPool:0',
              'conv4': 'vggish/conv4/conv4_2/Relu:0',
              'pool4': 'vggish/pool4/MaxPool:0',
              'fc1': 'vggish/fc1/fc1_2/Relu:0',
              'embedding': 'vggish/embedding:0',
              'features': 'vggish/input_features:0',
              }

    # Replace each tensor name with the tensor itself.
    for k in layers:
        layers[k] = g.get_tensor_by_name(layers[k])

    return {'features': features_tensor,
            'embedding': embedding_tensor,
            'layers': layers,
            'graph': g,
            'session': session,
            }


def embeddings_from_vggish(vgg, x, sr=SAMPLING_RATE):
    """Run the VGGish model, starting with a sound (x) at sample rate
    (sr). Return a dictionary of embeddings from the different layers
    of the model."""
    input_batch = vggish_input.waveform_to_examples(x, sr)

    layer_names = list(vgg['layers'].keys())
    tensors = [vgg['layers'][k] for k in layer_names]

    # Reuse the session that holds the restored weights. Running
    # global_variables_initializer() here would overwrite them.
    results = vgg['session'].run(tensors, feed_dict={vgg['features']: input_batch})

    return dict(zip(layer_names, results))
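
The `hop_size` argument controls how many 0.96-second examples VGGish extracts from a clip, and therefore how many embedding vectors come back. The sketch below reproduces that framing arithmetic in plain Python. The constants (25 ms STFT window, 10 ms STFT hop, 0.96 s example window) are the upstream `vggish_params.py` defaults as I understand them; treat them as assumptions and check your copy of the file. The helper names are hypothetical.

```python
def num_frames(num_samples, window, hop):
    """Number of full windows of length `window` taken every `hop` samples."""
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // hop


def num_vggish_examples(duration_s, hop_s, sr=16000,
                        stft_window_s=0.025,    # assumed upstream default
                        stft_hop_s=0.010,       # assumed upstream default
                        example_window_s=0.96): # assumed upstream default
    """Estimate how many examples (embedding rows) a clip produces."""
    # Step 1: the waveform is framed into log-mel spectrogram frames.
    spec_frames = num_frames(int(round(duration_s * sr)),
                             int(round(stft_window_s * sr)),
                             int(round(stft_hop_s * sr)))
    # Step 2: spectrogram frames are grouped into fixed-size examples.
    frames_per_second = 1.0 / stft_hop_s
    return num_frames(spec_frames,
                      int(round(example_window_s * frames_per_second)),
                      int(round(hop_s * frames_per_second)))


# A one-second clip with the default hop versus the 0.01 s hop.
print(num_vggish_examples(1.0, 0.96), num_vggish_examples(1.0, 0.01))
```

Under these assumptions a one-second clip gives a single example at the default hop, so a smaller hop is one way to get several overlapping embeddings from short utterances.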

In [18]:
audio_file_path = DATASET_PATH + '/id10004/BOAd7pybyZw00003.wav'
x = path_to_audio(audio_file_path)
print(type(x))
print(tf.executing_eagerly())

<class 'tensorflow.python.framework.ops.EagerTensor'>
True


In [None]:
def audio_to_embeddings(audio):
    """Returns the VGGish embeddings for a waveform given as a NumPy array."""
    # VGGish runs in a TF1-style session, so this function is called eagerly
    # rather than wrapped in @tf.function.
    vgg = create_vggish_network(0.01)
    resdict = embeddings_from_vggish(vgg, audio)
    return resdict['embedding']

print('outer', tf.executing_eagerly())
embedding = audio_to_embeddings(x.numpy())
print(type(embedding))
print(embedding.shape)

### Process data

In [None]:
N_CLASS = 5
class_names = os.listdir(DATASET_PATH)
random.shuffle(class_names)
# Keep only the first N_CLASS speakers so that the summary printed below
# counts the classes actually used, not every directory in the dataset.
class_names = class_names[:N_CLASS]
audio_paths = []
labels = []
for label, name in enumerate(class_names):
    print("Processing speaker {}".format(name))
    dir_path = Path(DATASET_PATH) / name
    speaker_sample_paths = [
        os.path.join(dir_path, filepath)
        for filepath in os.listdir(dir_path)
        if filepath.endswith(".wav")
    ]
    audio_paths += speaker_sample_paths
    labels += [label] * len(speaker_sample_paths)

print(
    "Found {} files belonging to {} classes.".format(len(audio_paths), len(class_names))
)

# Shuffle
rng = np.random.RandomState(SHUFFLE_SEED)
rng.shuffle(audio_paths)
rng = np.random.RandomState(SHUFFLE_SEED)
rng.shuffle(labels)

# Split into training and validation
num_val_samples = int(TRAIN_VALID_SPLIT * len(audio_paths))
print("Using {} files for training.".format(len(audio_paths) - num_val_samples))
train_audio_paths = audio_paths[:-num_val_samples]
train_labels = labels[:-num_val_samples]

print("Using {} files for validation.".format(num_val_samples))
valid_audio_paths = audio_paths[-num_val_samples:]
valid_labels = labels[-num_val_samples:]

# Create 2 datasets, one for training and the other for validation
train_ds = paths_and_labels_to_dataset(train_audio_paths, train_labels)
train_ds = train_ds.shuffle(buffer_size=BATCH_SIZE * 8, seed=SHUFFLE_SEED).batch(
    BATCH_SIZE
)

valid_ds = paths_and_labels_to_dataset(valid_audio_paths, valid_labels)
valid_ds = valid_ds.shuffle(buffer_size=32 * 8, seed=SHUFFLE_SEED).batch(32)

# Transform audio wave to the frequency domain using `audio_to_mfcc`
train_ds = train_ds.map(
    lambda x, y: (audio_to_mfcc(x), y), num_parallel_calls=tf.data.AUTOTUNE
)
train_ds = train_ds.prefetch(tf.data.AUTOTUNE)

valid_ds = valid_ds.map(
    lambda x, y: (audio_to_mfcc(x), y), num_parallel_calls=tf.data.AUTOTUNE
)
valid_ds = valid_ds.prefetch(tf.data.AUTOTUNE)
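
The shuffle step above relies on two `RandomState` objects created with the same seed applying the same permutation to two lists of equal length. A minimal sketch with made-up file names (the paths and labels are hypothetical) shows that each path keeps its label and how the 80/20 split falls out:

```python
import numpy as np

SHUFFLE_SEED = 43
TRAIN_VALID_SPLIT = 0.2

# Hypothetical paths and labels, for illustration only.
paths = ["clip_{}.wav".format(i) for i in range(10)]
labels = [i % 5 for i in range(10)]
expected = dict(zip(paths, labels))

# Same seed, same list length: both lists receive the same permutation.
np.random.RandomState(SHUFFLE_SEED).shuffle(paths)
np.random.RandomState(SHUFFLE_SEED).shuffle(labels)
aligned = all(expected[p] == l for p, l in zip(paths, labels))

# The last 20% of the shuffled data becomes the validation set.
num_val = int(TRAIN_VALID_SPLIT * len(paths))
train_paths, valid_paths = paths[:-num_val], paths[-num_val:]
print(aligned, len(train_paths), len(valid_paths))
```

Note that the slice `[:-num_val]` is empty when `num_val` is 0, so very small datasets need a guard.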

In [None]:
# for i, data in enumerate(train_ds):
#     if i > 2: break
#     x, y = data
#     # Extract the chosen sample
#     selected_sample_np = x[i].numpy()

#     # Display the MFCC for the selected sample
#     plt.imshow(selected_sample_np, cmap='viridis', origin='lower', aspect='auto')
#     plt.title(f'MFCC for Sample {i}')
#     plt.xlabel('MFCC Coefficient')
#     plt.ylabel('Time Step')
#     plt.colorbar(label='Magnitude')
#     plt.show()


## Model

MFCC

FFT (focus on low freq) ---> CNN (max pool in one direction)

Open question: is a speaker's identity more distinctive in consonants or in vowels?

To try: CIFAR, transfer learning

### Simple ResNet

In [None]:
def residual_block(x, filters, conv_num=3, activation="relu"):
    # Shortcut
    s = keras.layers.Conv1D(filters, 1, padding="same")(x)
    for i in range(conv_num - 1):
        x = keras.layers.Conv1D(filters, 3, padding="same")(x)
        x = keras.layers.Activation(activation)(x)
    x = keras.layers.Conv1D(filters, 3, padding="same")(x)
    x = keras.layers.Add()([x, s])
    x = keras.layers.Activation(activation)(x)
    return keras.layers.MaxPool1D(pool_size=2, strides=2)(x)


def simple_resnet(input_shape, num_classes):
    inputs = keras.layers.Input(shape=input_shape, name="input")

    x = residual_block(inputs, 16, 2)
    x = residual_block(x, 32, 2)
    x = residual_block(x, 64, 3)
    x = residual_block(x, 128, 3)
    x = residual_block(x, 128, 3)

    x = keras.layers.Flatten()(x)
    # x = keras.layers.Dense(256, activation="relu")(x)
    x = keras.layers.Dense(128, activation="relu")(x)

    outputs = keras.layers.Dense(num_classes, activation="softmax", name="output")(x)

    return keras.models.Model(inputs=inputs, outputs=outputs)
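
Each `residual_block` ends with `MaxPool1D(pool_size=2, strides=2)`, which roughly halves the time axis. The plain-Python sketch below traces the sequence length through the five blocks of `simple_resnet`, assuming Keras' default `"valid"` pooling padding and an input of 59 time steps. The helper name is hypothetical.

```python
def pooled_lengths(time_steps, num_blocks=5, pool_size=2, stride=2):
    """Time-axis length after each block's max-pooling ('valid' padding)."""
    lengths = []
    for _ in range(num_blocks):
        time_steps = (time_steps - pool_size) // stride + 1
        lengths.append(time_steps)
    return lengths


lengths = pooled_lengths(59)
# The last block has 128 filters, so Flatten sees lengths[-1] * 128 units.
flatten_units = lengths[-1] * 128
print(lengths, flatten_units)
```

With only one time step left after the fifth block, adding a sixth block would not be possible for this input length.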

In [None]:


# model = simple_resnet((SAMPLING_RATE//2, 1), N_CLASS)
model = simple_resnet((59, N_MFCC), N_CLASS)

model.summary()

# Compile the model using Adam's default learning rate
model.compile(
    optimizer="Adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"]
)

# Add callbacks:
# 'EarlyStopping' to stop training once the model stops improving
# 'ModelCheckpoint' to always keep the model that has the best val_accuracy
model_save_filename = "model.h5"

earlystopping_cb = keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
mdlcheckpoint_cb = keras.callbacks.ModelCheckpoint(
    model_save_filename, monitor="val_accuracy", save_best_only=True
)
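
The input shape `(59, N_MFCC)` passed to `simple_resnet` follows from the STFT settings in `audio_to_mfcc`. A short worked calculation, assuming one-second clips at 16 kHz and the default `pad_end=False` of `tf.signal.stft` (the helper name is hypothetical):

```python
def stft_frame_count(num_samples, frame_length=1024, frame_step=256):
    """Number of full STFT frames when the signal is not padded at the end."""
    return 1 + (num_samples - frame_length) // frame_step


frames = stft_frame_count(16000)  # time steps per one-second clip
fft_bins = 1024 // 2 + 1          # frequency bins for fft_length=1024
print(frames, fft_bins)
```

If the clip length, `frame_length`, or `frame_step` changes, the first dimension of the model input has to change with it.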

## Training

`fit()` is for training the model with the given inputs (and corresponding training labels).

`evaluate()` is for evaluating the already trained model using the validation (or test) data and the corresponding labels. Returns the loss value and metrics values for the model.

`predict()` is for the actual prediction. It generates output predictions for the input samples.
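
Because the model ends in a softmax layer, `predict()` returns one row of class probabilities per sample rather than class indices. A minimal NumPy sketch of turning such rows into predicted speakers and an accuracy figure; the probabilities and labels below are made up for illustration:

```python
import numpy as np

# Hypothetical predict() output: 3 samples, 5 classes.
probs = np.array([
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.10, 0.20, 0.60, 0.05, 0.05],
    [0.20, 0.50, 0.10, 0.10, 0.10],
])
true_labels = np.array([0, 2, 3])  # hypothetical ground truth

# The predicted class is the index of the largest probability in each row.
predicted = np.argmax(probs, axis=1)
accuracy = float(np.mean(predicted == true_labels))
print(predicted, accuracy)
```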

In [None]:
# Define the Keras TensorBoard callback
logdir="logs/fit/" + datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = keras.callbacks.TensorBoard(log_dir=logdir)

# Train the model
history = model.fit(
    train_ds,
    epochs=EPOCHS,
    validation_data=valid_ds,
    callbacks=[earlystopping_cb, mdlcheckpoint_cb, tensorboard_callback],
)
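
With `patience=10`, the `EarlyStopping` callback stops training once the monitored value (validation loss by default) has gone 10 epochs without improving. The function below is a simplified plain-Python model of that rule, not Keras' actual implementation: it assumes `min_delta=0` and a loss-like metric, and its name is hypothetical.

```python
def early_stopping_epoch(val_losses, patience=10):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses."""
    best = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # Improvement: remember it and reset the patience counter.
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Training ran to the end without triggering early stopping.
    return len(val_losses) - 1, best_epoch


# Hypothetical loss curve: improves for 3 epochs, then plateaus.
losses = [1.0, 0.8, 0.7] + [0.9] * 20
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=10)
print(stop_epoch, best_epoch)
```

Since the notebook sets `restore_best_weights=True`, the weights from the best epoch are the ones kept when training stops early.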

In [None]:
%tensorboard --logdir logs

## Evaluate

In [None]:
print(model.evaluate(valid_ds))

In [None]:
print(model.predict(valid_ds))