# Speaker Identification: Text Independent Context

The human speech signal conveys many levels of information. At the base level it carries a message in words. At other levels it conveys information about the language, dialect, emotion, gender and identity of the speaker. While speech recognition systems aim to identify the words spoken, the goal of a speaker recognition system is to extract the identity of the speaker from the speech signal.

The broad area of speaker recognition encompasses two fundamental tasks. Speaker verification (also known as speaker authentication) is the task of determining whether a person is who she claims to be. Speaker identification is the task of determining who is speaking from a known set of speakers. The unknown speaker makes no identity claim, so the system must perform a 1:N classification.
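
The difference between the two tasks can be sketched with a toy scoring example. The snippet assumes each enrolled speaker already has a similarity score against the unknown utterance; the function names `identify` and `verify`, the scores, and the threshold are illustrative and not part of this notebook.

```python
import numpy as np

def identify(scores):
    """1:N identification: index of the best-scoring enrolled speaker."""
    return int(np.argmax(scores))

def verify(scores, claimed_idx, threshold=0.5):
    """1:1 verification: accept only if the claimed speaker's score reaches the threshold."""
    return bool(scores[claimed_idx] >= threshold)

# hypothetical similarity scores of one utterance against three enrolled speakers
scores = np.array([0.10, 0.75, 0.15])
print(identify(scores))   # no identity claim: pick the most likely speaker
print(verify(scores, 1))  # identity claim "I am speaker 1"
print(verify(scores, 2))  # identity claim "I am speaker 2"
```

Identification always returns one of the N known speakers, whereas verification can reject a claim outright.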

These tasks can be further divided into text dependent and text independent categories. In a text dependent system, the recognition system has prior knowledge of the text being spoken. In a text independent system, the recognition system is agnostic to the text.

Our focus is the problem of speaker identification in the text independent context.  Further, we will concentrate this study on short speeches (usually 2-5 seconds) from a large number of speakers.


## Dataset and Data Exploration

Our audio dataset of choice is the open source VoxForge dataset, which is freely available under the GNU General Public License. VoxForge was set up to collect transcribed speech for use in Open Source Speech Recognition Engines ("SRE"s). The dataset contains multiple audio files in wav format from each of 1216 unique speakers. Each recording is of short duration (2-10 seconds).

The VoxForge dataset contains a few samples whose speakers are not known and which are therefore grouped under an anonymous category. We excluded these samples from our project since they would only impede learning. During the pre-processing stage, the wav files are converted to Mel-frequency cepstral coefficient (MFCC) matrices of shape 20x196x1. MFCCs approximate the human auditory system's response more closely than the linearly spaced frequency bands used in the normal cepstrum. We also experimented with filter bank energies as an alternative, but our findings indicate that MFCCs provide better accuracy for the speaker recognition task.
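
To illustrate how mel-spaced bands differ from linearly spaced ones, here is a small sketch of the commonly used HTK-style Hz-to-mel conversion. The helper names are illustrative, and librosa's default mel scale uses a slightly different (Slaney) formula, so treat this as an approximation of the idea rather than the exact filter bank used in this notebook.

```python
import numpy as np

def hz_to_mel(f_hz):
    """HTK-style conversion from frequency in Hz to mels."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

# ten band edges equally spaced on the mel scale between 0 Hz and 8 kHz
edges_mel = np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), num=10)
edges_hz = mel_to_hz(edges_mel)
print(np.round(edges_hz))
# band widths in Hz grow with frequency: fine resolution at low
# frequencies, coarse resolution at high frequencies
print(np.round(np.diff(edges_hz)))
```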

In [39]:
import os
import tarfile
from glob import glob
from pathlib import Path

%matplotlib inline
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import librosa
import librosa.display
import tensorflow as tf
from tensorflow import keras
from sklearn import preprocessing

In [40]:
# voxforge_root points to the root folder of the voxforge dataset.
voxforge_root = Path('dataset')

In [49]:
def is_valid(file_path):
    """
    Returns True for a regular file that belongs to a named speaker.
    Returns False for hidden files and for anonymous speakers.
    """
    # paths look like <root>/<speaker>-<date>-<code>/wav/<file>.wav
    parts = tf.strings.split(file_path, os.sep)
    file_name = parts[-1]
    is_hidden = tf.equal(tf.strings.substr(file_name, 0, 1), '.')
    speaker = tf.strings.split(parts[-3], '-')[0]
    is_anonymous = tf.equal(tf.strings.substr(speaker, 0, 9), 'anonymous')
    return tf.logical_not(tf.logical_or(is_hidden, is_anonymous))

In [53]:
# collect every wav file under the dataset root, one dataset element per path.
# from_tensor_slices yields each path separately; from_tensors would pack
# the whole list into a single element and break the filter.
files = glob(str(voxforge_root / '**' / '*.wav'), recursive=True)
list_ds = tf.data.Dataset.from_tensor_slices(files)
list_ds = list_ds.filter(is_valid)
for f in list_ds.take(3):
    print(f.numpy())

In [None]:
def extract_speaker(file_path):
    ''' extract speaker name from the file path '''
    # use the platform's path separator so this also works on Windows
    sc = tf.strings.split(file_path, os.sep)[-3]
    return tf.strings.split(sc, '-')[0]

In [None]:
# each folder under root contains audio files for one speaker.
# the folder name is the speaker name, a date and (optionally) a three-character code, separated by dashes.
# let's print a few sample speaker names.
speaker_ds = list_ds.map(extract_speaker)
for speaker in speaker_ds.take(3):
    print(speaker.numpy())
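
The folder naming convention can also be checked without TensorFlow. The following is a minimal pure-Python sketch of the same parsing; the helper names are illustrative, and the `anonymous-...` path is a made-up example rather than a real dataset entry.

```python
from pathlib import PurePosixPath

def speaker_from_path(path):
    """Return the speaker name encoded in <root>/<speaker>-<date>-<code>/wav/<file>.wav."""
    folder = PurePosixPath(path).parts[-3]
    return folder.split('-')[0]

def is_named_speaker(path):
    """False for clips that are filed under an anonymous speaker folder."""
    return not speaker_from_path(path).startswith('anonymous')

paths = [
    'dataset/mk-20120531-ctv/wav/a0369.wav',
    'dataset/chocoholic-20070523/wav/rom0001.wav',
    'dataset/anonymous-20080101-abc/wav/a0001.wav',  # hypothetical path
]
for p in paths:
    print(speaker_from_path(p), is_named_speaker(p))
```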

In [None]:
# encode speaker names as integer class ids (used with sparse categorical cross-entropy below)
speaker_encoder = preprocessing.LabelEncoder()
speaker_idx = speaker_encoder.fit_transform([s.numpy().decode() for s in speaker_ds])
encoded_speaker_ds = tf.data.Dataset.from_tensor_slices(speaker_idx)
unique_speakers = len(speaker_encoder.classes_)
for es in encoded_speaker_ds.take(3):
    print(es)
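
For intuition, the mapping produced by `LabelEncoder` can be reproduced with NumPy alone: classes are sorted, and each name is replaced by its position in the sorted list. The speaker list below is a tiny illustrative sample, not the real dataset.

```python
import numpy as np

# toy list of speaker names, one entry per audio clip
names = np.array(['mk', 'rocketman768', 'chocoholic', 'mk'])

# classes are the sorted unique names; idx holds one integer id per clip
classes, idx = np.unique(names, return_inverse=True)
print(classes)
print(idx)

# the inverse mapping recovers the original names from the ids
print(classes[idx])
```

The labels are plain integer ids rather than one-hot vectors, which is the form that a sparse categorical cross-entropy loss expects.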

### Let's listen to a clip from the dataset.

In [None]:
sample_audio = os.path.join(voxforge_root, 'chocoholic-20070523/wav/rom0001.wav')
import IPython.display as ipd
ipd.Audio(sample_audio)

### Plot the audio array

In [None]:
x, sr = librosa.load(sample_audio)
plt.figure(figsize=(14, 5))
# waveshow replaces waveplot, which was removed in librosa 0.10
librosa.display.waveshow(x, sr=sr)

### Display a spectrogram 

In [None]:
X = librosa.stft(x)
Xdb = librosa.amplitude_to_db(abs(X))
plt.figure(figsize=(14, 5))
librosa.display.specshow(Xdb, sr=sr, x_axis='time', y_axis='hz')

In [None]:
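def wav2mfcc(file_path, max_pad_len=196):
    """ convert wav file to mfcc matrix with truncation and padding """
    # sr=None keeps the file's native sample rate
    wave, sample_rate = librosa.load(file_path, mono=True, sr=None)
    # recent librosa releases require keyword arguments; n_mfcc=20 is the default
    mfcc = librosa.feature.mfcc(y=wave, sr=sample_rate, n_mfcc=20)
    # truncate long clips, then zero-pad short ones to max_pad_len frames
    mfcc = mfcc[:, :max_pad_len]
    pad_width = max_pad_len - mfcc.shape[1]
    mfcc = np.pad(mfcc, pad_width=((0, 0), (0, pad_width)), mode='constant')
    return mfcc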
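
The truncate-then-pad step can be tried on synthetic arrays without loading any audio. This is a minimal sketch; the helper name `pad_or_truncate` is illustrative, and the duration estimate assumes 16 kHz audio with librosa's default hop length of 512 samples, whereas the clips in this notebook are loaded at their native sample rate.

```python
import numpy as np

def pad_or_truncate(features, target_len=196):
    """Force a (coefficients, frames) matrix to exactly target_len frames."""
    features = features[:, :target_len]       # drop frames beyond target_len
    pad = target_len - features.shape[1]      # zero for long inputs
    return np.pad(features, ((0, 0), (0, pad)), mode='constant')

short_clip = np.ones((20, 50))    # fewer frames than the target
long_clip = np.ones((20, 400))    # more frames than the target
print(pad_or_truncate(short_clip).shape)
print(pad_or_truncate(long_clip).shape)

# rough duration covered by 196 frames under the stated assumptions
hop_length, sample_rate = 512, 16000
print(196 * hop_length / sample_rate, 'seconds')
```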

In [None]:
def extract_mfcc(file_path):
    """ returns 3D tensor of the mfcc coding from the wav file """
    file_name = bytes.decode(file_path.numpy())
    mfcc = tf.convert_to_tensor(wav2mfcc(file_name))
    mfcc = tf.expand_dims(mfcc, 2)
    return mfcc

In [None]:
def create_audio_ds(list_ds):
    """ creates audio dataset containing audio tensors from file list dataset """
    batch = []
    for f in list_ds:
        audio = extract_mfcc(f)
        batch.append(audio)
    return tf.data.Dataset.from_tensor_slices(batch)

In [None]:
%time audio_ds = create_audio_ds(list_ds)

In [None]:
# Audio (input) tensor is 3D tensor.
# 20x196 is MFCC encoding. Converting it to 3D for use in CNN layers.
for a in audio_ds.take(1):
    print(a.numpy().shape)

In [None]:
# Finally, zip the input and labels to a single dataset.
complete_labeled_ds = tf.data.Dataset.zip((audio_ds, encoded_speaker_ds))

In [None]:
input_shape = None
for audio, speaker in complete_labeled_ds.take(1):
    input_shape = audio.shape
    print('input_shape', audio.shape)
    print('output_shape', speaker.shape)

In [None]:
# for testing, use just a few samples.
#labeled_ds = complete_labeled_ds.take(3000)
# for complete run with all samples.
labeled_ds = complete_labeled_ds

In [None]:
# create train, validation and test datasets.
data_size = sum([1 for _ in labeled_ds])
train_size = int(data_size * 0.9)
val_size = int(data_size * 0.05)
test_size = data_size - train_size - val_size
print('all samples: {}'.format(data_size))
print('training samples: {}'.format(train_size))
print('validation samples: {}'.format(val_size))
print('test samples: {}'.format(test_size))

In [None]:
# create batched datasets
batch_size = 32
# shuffle once and keep that order: reshuffling on every iteration would let
# samples move between the train, validation and test splits
labeled_ds = labeled_ds.shuffle(data_size, seed=42, reshuffle_each_iteration=False)
train_ds = labeled_ds.take(train_size).shuffle(1000).batch(batch_size).prefetch(1)
val_ds = labeled_ds.skip(train_size).take(val_size).batch(batch_size).prefetch(1)
test_ds = labeled_ds.skip(train_size + val_size).take(test_size).batch(batch_size).prefetch(1)
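
The `take`/`skip` pattern partitions the data into three disjoint slices, provided the underlying order is the same every time the dataset is iterated. A plain-Python sketch of the same 90/5/5 split on a list makes the arithmetic easy to verify; the helper name `split_sizes` and the toy data are illustrative.

```python
def split_sizes(n, train_frac=0.9, val_frac=0.05):
    """Return (train, validation, test) sizes; the test set takes the remainder."""
    train = int(n * train_frac)
    val = int(n * val_frac)
    return train, val, n - train - val

items = list(range(1000))                      # stand-in for the labeled samples
train_n, val_n, test_n = split_sizes(len(items))
train = items[:train_n]                        # like take(train_n)
val = items[train_n:train_n + val_n]           # like skip(train_n).take(val_n)
test = items[train_n + val_n:]                 # like skip(train_n + val_n)
print(len(train), len(val), len(test))
```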

## Model and training

In [None]:
def create_model():
    dropout_rate = 0.25
    regularization = 0.001
    audio_input = keras.layers.Input(shape=input_shape)
    # three conv -> max-pool -> batch-norm blocks with 16, 32 and 64 filters
    conv1 = keras.layers.Conv2D(16, kernel_size=(3, 3), padding='same',
                                activation='relu')(audio_input)
    maxpool1 = keras.layers.MaxPooling2D(pool_size=(2, 2), strides=2)(conv1)
    batch1 = keras.layers.BatchNormalization()(maxpool1)
    conv2 = keras.layers.Conv2D(32, kernel_size=(3, 3), padding='same',
                                activation='relu')(batch1)
    maxpool2 = keras.layers.MaxPooling2D(pool_size=(2, 2), strides=2)(conv2)
    batch2 = keras.layers.BatchNormalization()(maxpool2)
    conv3 = keras.layers.Conv2D(64, kernel_size=(3, 3), padding='same',
                                activation='relu')(batch2)
    maxpool3 = keras.layers.MaxPooling2D(pool_size=(2, 2), strides=2)(conv3)
    batch3 = keras.layers.BatchNormalization()(maxpool3)
    # classifier head: one L2-regularized hidden layer, then a softmax over speakers
    flt = keras.layers.Flatten()(batch3)
    drp1 = keras.layers.Dropout(dropout_rate)(flt)
    dense1 = keras.layers.Dense(unique_speakers * 2, activation='relu',
                                kernel_regularizer=keras.regularizers.l2(regularization))(drp1)
    drp2 = keras.layers.Dropout(dropout_rate)(dense1)
    output = keras.layers.Dense(unique_speakers, activation='softmax', name='speaker')(drp2)
    model = keras.Model(inputs=audio_input, outputs=output)
    # integer speaker ids are used as labels, hence the sparse loss
    model.compile(loss=keras.losses.sparse_categorical_crossentropy,
                  optimizer=keras.optimizers.Adam(),
                  metrics=['acc'])
    return model
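
The size of this network can be estimated by hand, which helps when reading `model.summary()`. The sketch below walks a 20x196x1 input through the three conv/pool/batch-norm blocks and the two dense layers; the helper name and the speaker count of 1000 are illustrative assumptions, and the real count is whatever `unique_speakers` evaluates to.

```python
def model_param_count(n_speakers, input_hw=(20, 196)):
    """Return (flattened feature size, total parameter count) for the CNN above."""
    h, w = input_hw
    channels = 1
    total = 0
    for filters in (16, 32, 64):
        total += (3 * 3 * channels + 1) * filters  # 3x3 conv kernels plus biases
        total += 4 * filters                       # batch norm: gamma, beta, moving mean/variance
        # 'same' padding keeps h and w; 2x2 pooling with stride 2 halves them (rounding down)
        h, w, channels = h // 2, w // 2, filters
    flat = h * w * channels
    hidden = 2 * n_speakers
    total += (flat + 1) * hidden                   # first dense layer
    total += (hidden + 1) * n_speakers             # softmax output layer
    return flat, total

flat, total = model_param_count(n_speakers=1000)   # hypothetical speaker count
print('flattened features:', flat)
print('total parameters:', total)
```

Almost all parameters sit in the two dense layers, whose size grows with the number of speakers.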

In [None]:
# if a previously trained model is on disk, use it without training.
# the model has millions of parameters and training for 32 epochs
# takes 12hrs+ on my mac.
train_model = False
model_name = 'spr_model.h5'
model_path = os.path.join('.', model_name)
model = None
if os.path.exists(model_path):
    model = keras.models.load_model(model_path)
else:
    model = create_model()
    train_model = True

In [None]:
model.summary()

In [None]:
# if training, you can view the tensorboard.
# type on command line to start tensorboard: tensorboard --logdir=./spr_logs --port=6006
# view details: http://localhost:6006
if train_model:
    root_logdir = os.path.join(os.curdir, "spr_logs")
    def get_run_dir():
        import time
        run_id = time.strftime("run%Y_%m_%d-%H_%M_%S")
        return os.path.join(root_logdir, run_id)
    run_logdir = get_run_dir()
    tensorboard_cb = keras.callbacks.TensorBoard(run_logdir, update_freq='batch')
    history = model.fit(train_ds, epochs=8, validation_data=val_ds, callbacks=[tensorboard_cb])

In [None]:
model.evaluate(test_ds)

In [None]:
if train_model:
    model.save(model_name)

Let's test a few files ourselves.

In [None]:
sample_file = [os.path.join(voxforge_root,'mk-20120531-ctv/wav/a0369.wav'),
               os.path.join(voxforge_root,'rocketman768-20080408-axr/wav/b0220.wav')]
sample_ds = tf.data.Dataset.from_tensor_slices(sample_file)
sample_input = create_audio_ds(sample_ds).batch(2)
output = model.predict(sample_input)

In [None]:
speaker_ids = output.argmax(axis=1)
speakers = speaker_encoder.inverse_transform(speaker_ids)
print(speakers)
print(output)
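
Since each row of the prediction is a softmax distribution over speakers, it can be useful to look at the top few candidates rather than only the argmax. This is a minimal sketch with made-up probabilities and class names; with the real model, `probs` would be one row of `output` and `class_names` would be `speaker_encoder.classes_`.

```python
import numpy as np

def top_k(probs, class_names, k=3):
    """Return the k most probable (speaker, probability) pairs, best first."""
    order = np.argsort(probs)[::-1][:k]
    return [(class_names[i], float(probs[i])) for i in order]

class_names = ['chocoholic', 'mk', 'rocketman768']  # illustrative classes
probs = np.array([0.05, 0.85, 0.10])                # hypothetical softmax row
for name, p in top_k(probs, class_names, k=2):
    print(name, p)
```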

Thank you for your time.