This repository contains the source code to train and use speech embedding models that identify whether two audio samples come from the same speaker or from different speakers.
You may also want to check the related Data Science post.
Environment: I recommend setting up a separate environment to run this code. Install requirements.txt with conda and pip_requirements.txt with pip (basically, the TensorFlow dependencies).
Data: This repo uses the Mozilla Common Voice dataset. I suggest downloading the English corpus (cv-corpus-7.0-2021-07-21-en.tar.gz) into the data folder.
Import the TensorFlow libraries and the repository file with auxiliary functions:

```python
import os

import tensorflow as tf
import tensorflow_io as tfio

import train_speech_id_model
```
Load the model and perform inference on a file:

```python
# Load a trained model checkpoint
model = tf.keras.models.load_model('speech-id-model-110')

# `file` is the path to an audio file (e.g., an MP3 from Common Voice)
cur_data = tfio.audio.AudioIOTensor(file)
audio_data = cur_data.to_tensor()[:, 0]  # keep only the first channel

# set batch size to 1, extract first element
audio_embedding = model.predict(
    tf.expand_dims(audio_data, axis=0)
)[0]
```
Note that the expected sampling rate is 48 kHz, the rate used in the training data. An example is provided in the script quick_inference.py, which loads and compares the embeddings of all MP3 files in the current folder. The output should be:
```
Comparing ['sample_1a.mp3', 'sample_1b.mp3', 'sample_2.mp3']
Processing sample_1a.mp3 with sample rate of 48000
Processing sample_1b.mp3 with sample rate of 48000
Processing sample_2.mp3 with sample rate of 48000
sample_1a.mp3 and sample_1b.mp3: Same person: 0.7631306052207947
sample_1a.mp3 and sample_2.mp3: Different people: 1.1691123247146606
sample_1b.mp3 and sample_2.mp3: Different people: 1.3037692308425903
```
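If your audio is not sampled at 48 kHz, resample it before running inference. Below is a minimal sketch using tfio.audio.resample; the file name is hypothetical and quick_inference.py may handle this differently.

```python
import tensorflow as tf
import tensorflow_io as tfio

TARGET_RATE = 48000  # sampling rate expected by the model

model = tf.keras.models.load_model('speech-id-model-110')

cur_data = tfio.audio.AudioIOTensor('my_recording.mp3')  # hypothetical file
audio_data = tf.cast(cur_data.to_tensor()[:, 0], tf.float32)
source_rate = tf.cast(cur_data.rate, tf.int64)

# Resample only when the source rate differs from the expected 48 kHz
if source_rate != TARGET_RATE:
    audio_data = tfio.audio.resample(
        audio_data, rate_in=source_rate, rate_out=TARGET_RATE
    )

audio_embedding = model.predict(tf.expand_dims(audio_data, axis=0))[0]
```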
This model achieves a specificity of 98.3% and a sensitivity of 60.2% at a distance threshold of 0.83. The threshold is easy to tune to application-specific needs: the lower the threshold, the more similar two audio samples need to be (in embedding space) to be considered the "same" person.
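As an illustration, here is a hedged sketch of the comparison step. The helper function and the embedding variables are hypothetical (not part of the repo); it assumes the Euclidean distance between L2-normalized embeddings, consistent with the distances printed above.

```python
import numpy as np

THRESHOLD = 0.83  # decision threshold reported above


def same_person(embedding_a, embedding_b, threshold=THRESHOLD):
    """Return (decision, distance) for two L2-normalized embeddings."""
    distance = np.linalg.norm(np.asarray(embedding_a) - np.asarray(embedding_b))
    return distance < threshold, distance


# Hypothetical usage with two embeddings computed as in the inference snippet
is_same, distance = same_person(embedding_1a, embedding_1b)
print(f'Same person: {is_same} (distance = {distance:.3f})')
```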
Embedding distance between two audio samples when the speaker is the same vs. different (English test set). Note that the median distance is significantly lower when the speaker is the same. Although there are cases where the distance for different speakers is low, those are rare and can be considered outliers in the distribution.
Download data: Go to the Mozilla Common Voice dataset page, download the language of your interest, and place it in the data folder. The results presented here were obtained with the English corpus, version 7 (cv-corpus-7.0-2021-07-21-en.tar.gz).
Prepare tfrecords: Run python create_audio_tfrecords.py data to process all .tar and .tar.gz archives in the data folder and prepare the TFRecords files (train/dev/test). Note: I had some issues when preparing the TFRecords on an AWS EC2 instance with a GPU. If this happens, try using a CPU-only instance.
In the case of the English corpus, the generated files are:
- cv-corpus-7.0-2021-07-21-en.tar.gz_dev.tfrecords.gzip, 701 MB
- cv-corpus-7.0-2021-07-21-en.tar.gz_test.tfrecords.gzip, 693 MB
- cv-corpus-7.0-2021-07-21-en.tar.gz_train.tfrecords.gzip, 30.8 GB
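The generated files can be read back with tf.data. Below is a minimal sketch that counts the records in the dev split, assuming GZIP compression (as the file names suggest); the actual feature parsing lives in the repository scripts.

```python
import tensorflow as tf

# Path of one of the generated files (adjust to your data folder)
dev_path = 'data/cv-corpus-7.0-2021-07-21-en.tar.gz_dev.tfrecords.gzip'

dataset = tf.data.TFRecordDataset(dev_path, compression_type='GZIP')

# Count serialized examples without parsing them
num_records = sum(1 for _ in dataset)
print(f'{num_records} records in {dev_path}')
```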
Since the English corpus is already very large, we did not worry about adding extra languages to it. If you do need multiple languages, note that client_id restarts from 0 in each TFRecords file, so the ids need to be corrected before grouping the files together, i.e., each user must be assigned a globally unique id, as in the sketch below.
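One way to build such a mapping is sketched below; the corpus names and the in-memory dictionary are illustrative assumptions, not the repository's actual format.

```python
# Map (corpus_name, local_client_id) pairs to a globally unique integer id
global_ids = {}


def to_global_id(corpus_name, local_client_id):
    """Assign a new global id the first time a (corpus, local id) pair is seen."""
    key = (corpus_name, local_client_id)
    if key not in global_ids:
        global_ids[key] = len(global_ids)
    return global_ids[key]


# Example: local id 0 in two different corpora maps to two different speakers
print(to_global_id('cv-corpus-7.0-en', 0))  # 0
print(to_global_id('cv-corpus-7.0-pt', 0))  # 1
```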
Train model: Run python train_speech_id_model.py to learn speech embeddings from the train TFRecords file in the data folder. If more than one train file is available, the script will pick only one. Checkpoints are saved in the temp folder whenever the loss decreases.
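This checkpointing behavior can be approximated with a standard Keras callback; the sketch below is illustrative, and train_speech_id_model.py may implement it differently.

```python
import tensorflow as tf

# Save a checkpoint in the temp folder only when the training loss improves
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath='temp/speech-id-model-{epoch:03d}',
    monitor='loss',
    save_best_only=True,
)

# Hypothetical call: `model` and `train_dataset` come from the training script
# model.fit(train_dataset, epochs=100, callbacks=[checkpoint_callback])
```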
create_audio_tfrecords.py:
- class AudioTarReader, with methods that enable reading the tar files in the Common Voice dataset;
- class PersonIdAudio, with methods that enable reading and writing TFRecords files.
train_speech_id_model.py:
- def BaseSpeechEmbeddingModel, which defines the embedding model;
- code to train this model using the triplet loss (see the sketch below).
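Training with the TripletSemiHardLoss from TensorFlow Addons (linked below) can look roughly like this. The tiny stand-in model, the optimizer settings, and the commented fit call are illustrative assumptions, not the repository's exact code.

```python
import tensorflow as tf
import tensorflow_addons as tfa

# Toy stand-in for the embedding model (see the architecture sketch below)
embedding_model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, input_shape=(16,)),
    tf.keras.layers.Lambda(lambda t: tf.math.l2_normalize(t, axis=1)),
])

# TripletSemiHardLoss expects L2-normalized embeddings and integer speaker labels
embedding_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss=tfa.losses.TripletSemiHardLoss(),
)

# Hypothetical training call: the dataset is assumed to yield
# (input_batch, speaker_id_batch) tuples
# embedding_model.fit(train_dataset, epochs=10)
```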
The proposed model takes the time-domain audio as input, computes a mel-frequency spectrogram, extracts features with a stack of two LSTM layers, and applies a Dense layer that outputs the audio embedding. The model has 641k trainable parameters. The output is L2-normalized, which is a requirement of the triplet loss.
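For reference, here is a minimal Keras sketch of such an architecture. The frame parameters, layer sizes, and embedding dimension are illustrative assumptions and do not necessarily reproduce the 641k-parameter model defined in train_speech_id_model.py.

```python
import tensorflow as tf


def log_mel_spectrogram(audio, sample_rate=48000, n_mels=64):
    """Convert a batch of time-domain waveforms into log-mel spectrograms."""
    stft = tf.signal.stft(audio, frame_length=1024, frame_step=512)
    magnitude = tf.abs(stft)
    mel_weights = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=n_mels,
        num_spectrogram_bins=magnitude.shape[-1],
        sample_rate=sample_rate,
    )
    mel = tf.matmul(magnitude, mel_weights)
    return tf.math.log(mel + 1e-6)


def build_embedding_model(embedding_dim=128):
    """Mel spectrogram -> two stacked LSTMs -> Dense -> L2 normalization."""
    audio_input = tf.keras.Input(shape=(None,), dtype=tf.float32)
    x = tf.keras.layers.Lambda(log_mel_spectrogram)(audio_input)
    x = tf.keras.layers.LSTM(128, return_sequences=True)(x)
    x = tf.keras.layers.LSTM(128)(x)
    x = tf.keras.layers.Dense(embedding_dim)(x)
    embeddings = tf.keras.layers.Lambda(
        lambda t: tf.math.l2_normalize(t, axis=1)
    )(x)
    return tf.keras.Model(audio_input, embeddings)
```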
References:
- Librosa - https://librosa.org/doc/latest/index.html
- TensorFlow IO Audio - https://www.tensorflow.org/io/api_docs/python/tfio/audio
- TensorFlow Addons Losses: TripletSemiHardLoss - https://www.tensorflow.org/addons/tutorials/losses_triplet
- FaceNet: A Unified Embedding for Face Recognition and Clustering - https://arxiv.org/pdf/1503.03832.pdf