The goal of this research is to evaluate speech anti-spoofing techniques.

As an example of an open-source speech synthesis tool I used [Real-Time-Voice-Cloning](https://github.com/CorentinJ/Real-Time-Voice-Cloning).

The experiment is built as follows:
   1. I recorded 5 samples of my voice using an iPhone 13
   2. I imported my samples into the `Real-Time-Voice-Cloning` tool referenced above to compute an embedding representation of my voice
   3. I generated 5 samples with the same content as my recordings, synthesized from text in the `Real-Time-Voice-Cloning` tool (below these samples are called `fake paired`)
   4. I generated 5 samples with different content (below these samples are called `fake unpaired`)

**Goal**: Explore potential solutions for discriminating my real samples from the fake ones

Here are examples of my samples I am going to experiment with:

In [31]:
import os
from pathlib import Path
from utils import load_signal
from IPython.display import Audio, display

RATE = 16_000  # audio sampling rate will be 16kHz everywhere

my_samples_root = Path('/data/MyVoiceSamples')

real_samples = {s.split('.')[0]: load_signal(my_samples_root / 'RealSamples' / s) for s in os.listdir(my_samples_root / 'RealSamples')}
fake_paired_samples = {s.split('.')[0]: load_signal(my_samples_root / 'FakeSamples' / s) for s in os.listdir(my_samples_root / 'FakeSamples')}

for utt_name, au in real_samples.items():

    print(utt_name)
    print('Real audio:')
    display(Audio(au, rate=RATE))
    print('Fake audio:')
    display(Audio(fake_paired_samples[utt_name], rate=RATE))

    print()
    print('-' * 100)
    print()

[Output: paired real and fake audio players for utt2, utt1, utt5, utt4 and utt3]

The first natural thing to try is an existing speaker verification solution. The standard procedure works as follows:

 1. A model is trained to produce a speaker embedding: a fixed-dimensional vector that represents speaker information extracted from an audio sample
 2. Given an example of a speaker's voice, we pass it through the network to obtain a speaker representation and store it as a reference
 3. Given another recording, we compute its embedding with the same model and compare it to the reference
 4. If the similarity between the two is sufficient (cosine similarity is frequently used), the two recordings can be considered as coming from the same speaker
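
Steps 3 and 4 above can be sketched in a few lines. The snippet below is a minimal illustration in plain Python, not the code used in this notebook: the 3-dimensional vectors are made-up stand-ins for real speaker embeddings, and the `is_same_speaker` helper and its 0.9 threshold are assumptions chosen only for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def is_same_speaker(reference, candidate, threshold):
    """Hypothetical decision rule: accept if similarity reaches the threshold."""
    return cosine_similarity(reference, candidate) >= threshold


# toy "embeddings" (assumption: real x-vectors have hundreds of dimensions)
enrolled = [0.9, 0.1, 0.4]      # stored reference of the target speaker
same_voice = [0.8, 0.2, 0.5]    # points in nearly the same direction
other_voice = [0.1, 0.9, -0.3]  # points in a very different direction

print(is_same_speaker(enrolled, same_voice, threshold=0.9))
print(is_same_speaker(enrolled, other_voice, threshold=0.9))
```

In practice the threshold is tuned on held-out data for the chosen model, since the range of similarities differs from one embedding model to another.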

The [Transformers](https://huggingface.co/docs/transformers/index)
library provides several pretrained models for obtaining such embeddings, e.g.
[WavLMForXVector](https://huggingface.co/microsoft/wavlm-base-plus-sv) and
[UniSpeechSatForXVector](https://huggingface.co/microsoft/unispeech-sat-base-plus-sv)

To demonstrate how it works I will take some random samples from different speakers from the [asvspoof2019](https://datashare.ed.ac.uk/handle/10283/3336) dataset. I will get back to this data later; for now it is only important to know that these are speakers other than me.

First we must create a speech pre-processing class that is responsible for loading and normalizing samples before they are fed into the verification model

In [32]:
import librosa
import warnings
from transformers import Wav2Vec2FeatureExtractor

class SpeechProcessor:

    def __init__(self, exp_name, **kwargs):

        # we must instantiate preprocessing object as it was used in original experiment
        self.feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(exp_name, **kwargs)
        self.sr = self.feature_extractor.sampling_rate

    def __call__(self, path):
        """
        A method to read and pre-process audio sample
        :param path: path to audio file
        :return: torch tensor of audio with shape (1, num_samples)
        """

        # read sample; by default librosa downmixes to one channel (mono=True)
        # and resamples to 22050 Hz on load
        with warnings.catch_warnings():
            warnings.filterwarnings('ignore')
            sig, sr = librosa.load(str(path), mono=True)

        # resample to the rate expected by the model (16kHz)
        if sr != self.sr:
            sig = librosa.resample(sig, orig_sr=sr, target_sr=self.sr)

        # run through feature extractor to properly normalize audio
        return self.feature_extractor(sig, sampling_rate=self.sr, return_tensors='pt')['input_values']

Next we must initialize the audio processor and the verification model

In [33]:
from transformers import UniSpeechSatForXVector

experiment_name = 'microsoft/unispeech-sat-base-plus-sv'

processor = SpeechProcessor(experiment_name)
model = UniSpeechSatForXVector.from_pretrained(experiment_name, cache_dir='/data/external_models')

_ = model.eval()

Load, process and listen to some random samples from the `asvspoof2019` dataset

In [34]:
import pandas as pd
import numpy as np

seed = 1
np.random.seed(seed)

# read full list of dataset training samples
spoof_samples_meta_data = pd.read_csv(
    '/data/asvspoof19/LA/ASVspoof2019_LA_cm_protocols/ASVspoof2019.LA.cm.train.trn.txt',
    header=None, sep=' ', names=['speaker', 'sample_id', '-', 'system', 'label']
)

# sample 10 random speakers
speakers_list = np.random.choice(spoof_samples_meta_data.speaker.unique(), 10, replace=False)

# pick and load one random sample from each speaker
samples_root = Path('/data/asvspoof19/LA/ASVspoof2019_LA_train/flac')
other_speakers_samples = []
for speaker in speakers_list:

    sample = spoof_samples_meta_data[
        (spoof_samples_meta_data.speaker == speaker) &
        (spoof_samples_meta_data.label == 'bonafide')
    ].sample(1, random_state=seed).iloc[0].sample_id

    audio = processor(samples_root / f'{sample}.flac')

    display(Audio(audio, rate=RATE))

    other_speakers_samples.append(audio)


Now let's load and process my (real and fake) samples in the same way

In [35]:
real_samples = [processor(my_samples_root / 'RealSamples' / s) for s in os.listdir(my_samples_root / 'RealSamples')]
fake_paired_samples = [processor(my_samples_root / 'FakeSamples' / s) for s in os.listdir(my_samples_root / 'FakeSamples')]
fake_unpaired_samples = [processor(my_samples_root / 'FakeUnpairedSamples' / s) for s in os.listdir(my_samples_root / 'FakeUnpairedSamples')]

Calculate embeddings for all sets of samples

In [36]:
import torch
from torch.nn.functional import normalize

with torch.no_grad():

    real_embeddings = [normalize(model(s).embeddings) for s in real_samples]
    fake_paired_embeddings = [normalize(model(s).embeddings) for s in fake_paired_samples]
    fake_unpaired_embeddings = [normalize(model(s).embeddings) for s in fake_unpaired_samples]
    other_speakers_embeddings = [normalize(model(s).embeddings) for s in other_speakers_samples]

Now let's imagine that we have one sample of my voice as a reference. First we will check how similar other random speakers are to me, using cosine similarity (the higher it is, the more similar two speakers are), compared to how similar my own samples are to each other

In [37]:
from torch.nn import CosineSimilarity

cosine_sim = CosineSimilarity(dim=-1)

my_voice_example = real_embeddings[0]

other_speakers_similarities = [cosine_sim(my_voice_example, emb).numpy()[0] for emb in other_speakers_embeddings]
myself_similarities = [cosine_sim(my_voice_example, emb).numpy()[0] for emb in real_embeddings]

print('How other samples of mine are similar')
print(myself_similarities)

print()
print('How other speakers are similar to me')
print(other_speakers_similarities)

How other samples of mine are similar
[1.0, 0.9748723, 0.9669921, 0.9655709, 0.9401013]

How other speakers are similar to me
[0.79962164, 0.821054, 0.63995963, 0.69750607, 0.5955784, 0.72785574, 0.79534537, 0.77658194, 0.5382721, 0.48320544]


Even though this (or a similar) model would normally need to be fine-tuned on the target data, it is already quite useful for speaker verification out of the box. My samples have a similarity of at least 0.94, while other speakers score no more than about 0.82. So with a threshold of 0.9 this model can tell me apart from the others (at least in this tiny experiment).

Next we should check whether it is possible to discriminate me from 'fake' me in the same way

In [23]:
fake_paired_similarities = [cosine_sim(my_voice_example, emb).numpy()[0] for emb in fake_paired_embeddings]
fake_unpaired_similarities = [cosine_sim(my_voice_example, emb).numpy()[0] for emb in fake_unpaired_embeddings]

print('How other samples of mine are similar')
print(myself_similarities)

print()
print('How paired samples of mine are similar to me')
print(fake_paired_similarities)

print()
print('How unpaired samples of mine are similar to me')
print(fake_unpaired_similarities)

How other samples of mine are similar
[1.0, 0.9748723, 0.9669921, 0.9655709, 0.9401013]

How paired samples of mine are similar to me
[0.90169847, 0.90036255, 0.87100327, 0.9308988, 0.90426695]

How unpaired samples of mine are similar to me
[0.91090757, 0.9563506, 0.9140636, 0.92291135, 0.9240033]


As we can see, a threshold of 0.9 no longer works: 9 of the 10 fake samples score at or above it, and one reaches 0.956, which is higher than one of my real samples (0.940). Similar results appear with other pretrained embedding models and when different samples of mine are used as the anchor
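
The effect of the threshold can be checked directly on the numbers printed above. The snippet below is a small sketch that copies those similarity values by hand (my anchor sample, which scores 1.0 against itself, is left out); the `accepted` helper is an illustrative name, not part of the notebook code.

```python
# similarity scores copied from the outputs above
real = [0.9748723, 0.9669921, 0.9655709, 0.9401013]  # anchor itself excluded
others = [0.79962164, 0.821054, 0.63995963, 0.69750607, 0.5955784,
          0.72785574, 0.79534537, 0.77658194, 0.5382721, 0.48320544]
fake_paired = [0.90169847, 0.90036255, 0.87100327, 0.9308988, 0.90426695]
fake_unpaired = [0.91090757, 0.9563506, 0.9140636, 0.92291135, 0.9240033]


def accepted(scores, threshold):
    """Count scores that would pass verification at the given threshold."""
    return sum(s >= threshold for s in scores)


THRESHOLD = 0.9
for name, scores in [('real', real), ('other speakers', others),
                     ('fake paired', fake_paired), ('fake unpaired', fake_unpaired)]:
    print(f'{name}: {accepted(scores, THRESHOLD)}/{len(scores)} accepted')

# a single threshold can separate real from fake only if every real score
# is higher than every fake score
separable = min(real) > max(fake_paired + fake_unpaired)
print('real and fake separable by one threshold:', separable)
```

Raising the threshold does not help here: any value high enough to reject the best fake sample would also reject at least one of my real recordings.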

An alternative approach is to train a separate model specialised in spoofed-signal detection. There is already a body of research dedicated to this topic, e.g. [here](https://paperswithcode.com/paper/ur-channel-robust-synthetic-speech-detection), but I decided to try my own model (mainly to give an example of my approach to project organisation and my code style).

I used the [asvspoof2019](https://datashare.ed.ac.uk/handle/10283/3336) dataset for training. It consists of two parts: LA (logical access) and PA (physical access). Logical access means that spoofing was done with TTS and voice conversion systems, while physical access is about device/environment manipulations. I used only the train part of LA for my model. This is a relatively small dataset of ~2500 real samples accompanied by ~23000 fake samples.

My model is based on the [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base-960h) encoder followed by pooling and classification layers. The training target was to classify fake vs real samples.

My project with model implementation and training pipeline can be found at `/data/w2v_spoof_detector`

The dataset can be found at `/data/asvspoof19/LA`

A trained model was stored at `/data/w2v_spoof_detector/results/base_la_spoof_classifier`

First we will load and initialize the trained detector model

In [26]:
from hyperpyyaml import load_hyperpyyaml
from speechbrain.utils.checkpoints import average_checkpoints

EXP_ROOT = Path('/data/w2v_spoof_detector/results')
EXP_NAME = 'base_la_spoof_classifier'
DEVICE = torch.device('cpu')


def load_model(params):

    model, checkpointer = params['model'], params['checkpointer']

    checkpoints = checkpointer.find_checkpoints(min_key='valid_loss', max_num_checkpoints=1)
    model.load_state_dict(average_checkpoints(checkpoints, "model"))

    return model.eval()


# load model training params
params = load_hyperpyyaml(EXP_ROOT / EXP_NAME / 'hyperparams.yaml')

# init model object
model = load_model(params)
model.eval()
model.to(DEVICE)

Some weights of the model checkpoint at facebook/wav2vec2-base were not used when initializing Wav2Vec2Model: ['quantizer.codevectors', 'quantizer.weight_proj.weight', 'quantizer.weight_proj.bias', 'project_hid.weight', 'project_q.bias', 'project_hid.bias', 'project_q.weight']
- This IS expected if you are initializing Wav2Vec2Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2Model from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

Let's apply the trained spoof detector to the samples we considered above

In [30]:
def is_real_prob(model, au, beta=2580/22800):

    au = torch.Tensor(au).to(DEVICE)
    score = model(au).detach().cpu().numpy()[0][0]
    p = 1 / (1 + np.exp(-score))

    return np.round(p / (p + (1 - p) * beta), 3)  # here I do some calibration to mitigate dataset imbalance

with torch.no_grad():

    for set_name, samples in zip(
            ('other speakers', 'my samples', 'fake paired', 'fake unpaired'),
            (other_speakers_samples, real_samples, fake_paired_samples, fake_unpaired_samples)
    ):

        probs = [is_real_prob(model, s) for s in samples]
        print(f'{set_name}: ', probs)

other speakers:  [0.999, 0.999, 0.999, 0.999, 0.999, 0.999, 0.999, 0.999, 0.999, 0.999]
my samples:  [0.162, 0.129, 0.551, 0.473, 0.222]
fake paired:  [0.07, 0.06, 0.046, 0.04, 0.04]
fake unpaired:  [0.047, 0.054, 0.053, 0.046, 0.04]
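
The calibration inside `is_real_prob` can be read as a shift of the logit. Dividing the odds of the raw sigmoid output by `beta` (the ratio of real to fake training samples) is the same as subtracting `log(beta)` from the score, which corresponds to rescaling the prediction to equal class priors. The sketch below checks this identity numerically; `calibrated_prob` and `calibrated_prob_logit_shift` are illustrative names, and the scores are arbitrary test values rather than model outputs.

```python
import math

BETA = 2580 / 22800  # real / fake ratio, same value as the default in is_real_prob


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def calibrated_prob(score, beta=BETA):
    """Same formula as in is_real_prob, without the rounding."""
    p = sigmoid(score)
    return p / (p + (1.0 - p) * beta)


def calibrated_prob_logit_shift(score, beta=BETA):
    """Equivalent form: shift the logit by -log(beta) before the sigmoid."""
    return sigmoid(score - math.log(beta))


for score in (-4.0, -2.0, 0.0, 2.0):  # arbitrary example scores
    raw = sigmoid(score)
    cal = calibrated_prob(score)
    shifted = calibrated_prob_logit_shift(score)
    print(f'score={score:+.1f}  raw={raw:.3f}  calibrated={cal:.3f}  shifted={shifted:.3f}')
```

Because `beta` is smaller than 1, the shift is positive (about 2.18), so every calibrated probability is higher than the raw one; the ranking of samples is unchanged.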


The other speakers' samples are here only as a sanity check: since the model saw all of them during training, they naturally all get a high probability of being real.

My real samples have a relatively low 'realness' probability. In fact only one has a more than 50% chance of being real. This is odd and requires additional research. It probably has something to do with the recording conditions or the audio format, or it is just a consequence of a small training set with no augmentations applied.

What is more important here is that all fake samples have much lower probabilities, no more than 7%. So with a threshold of 0.1 they can be clearly separated from my real samples, the lowest of which scores 0.129.
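
The gap between the two groups can be verified on the printed probabilities. This is a small check on values copied by hand from the output above, with 0.1 taken as the threshold mentioned in the text; the variable names are illustrative.

```python
# probabilities copied from the detector output above
my_samples = [0.162, 0.129, 0.551, 0.473, 0.222]
fake_paired = [0.07, 0.06, 0.046, 0.04, 0.04]
fake_unpaired = [0.047, 0.054, 0.053, 0.046, 0.04]

fakes = fake_paired + fake_unpaired
lowest_real, highest_fake = min(my_samples), max(fakes)
print(f'lowest real: {lowest_real}, highest fake: {highest_fake}')

THRESHOLD = 0.1
real_kept = sum(p >= THRESHOLD for p in my_samples)
fakes_rejected = sum(p < THRESHOLD for p in fakes)
print(f'real kept: {real_kept}/{len(my_samples)}, fakes rejected: {fakes_rejected}/{len(fakes)}')
```

With only 15 samples the margin between 0.07 and 0.129 is narrow, so the threshold would need to be re-checked on a larger set before being relied on.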

**Conclusions**

 1. A separate spoof detector seems to be a more promising direction than the standard speaker verification approach
 2. A speaker verification model may achieve comparable or even better results after some fine-tuning with an additional spoof detection task during training
 3. For the spoof detection task, the training data can and should be significantly extended with real samples, augmentations and samples from modern speech generation tools
 4. There are other [Synthetic Speech Detection](https://paperswithcode.com/task/synthetic-speech-detection#benchmarks) approaches to try