In [None]:
"""
You can either run this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.

Instructions for setting up Colab are as follows:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub (File -> Upload Notebook -> "GITHUB" tab -> copy/paste GitHub URL)
3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select "GPU" for hardware accelerator)
4. Run this cell to set up dependencies.
"""
# If you're using Google Colab and not running locally, run this cell.
import os
!pip install wget
!apt-get install sox

!git clone https://github.com/NVIDIA/NeMo.git
os.chdir('NeMo')
!bash reinstall.sh

!pip install unidecode

# **SPEAKER RECOGNITION** 

Speaker Recognition (SR) is a broad research area that addresses two major tasks: speaker identification (who is speaking?) and speaker verification (is the speaker who she claims to be?). In this work, we focus on far-field, text-independent speaker recognition, where the identity of the speaker is determined by how the speech is spoken, not necessarily by what is being said. Typically such SR systems operate on unconstrained speech utterances, which are converted into vectors of fixed length, called speaker embeddings. Speaker embeddings are also used in automatic speech recognition (ASR) and speech synthesis.

Most speaker-related systems aim to produce speaker-level embeddings that distinguish one speaker from another. We first train these embeddings end to end, optimizing a [QuartzNet](https://arxiv.org/abs/1910.10261)-based encoder model with cross-entropy loss. We modify the original QuartzNet-based decoder so that it produces fixed-size embeddings regardless of the length of the input audio, using mean- and variance-based statistics pooling to obtain them.
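
To make the pooling step concrete, here is a minimal NumPy sketch of statistics pooling. It is an illustration only, not NeMo's decoder code: the function name `stats_pooling`, the 512-channel feature maps, and the use of standard deviation (rather than raw variance) are assumptions.

```python
import numpy as np


def stats_pooling(features: np.ndarray) -> np.ndarray:
    """Pool a (channels, time) feature map into one fixed-size vector.

    The per-channel mean and standard deviation over time are concatenated,
    so the output has 2 * channels entries no matter how many frames exist.
    """
    mean = features.mean(axis=-1)
    std = features.std(axis=-1)
    return np.concatenate([mean, std], axis=-1)


rng = np.random.default_rng(0)
short_utt = rng.standard_normal((512, 120))  # hypothetical: 512 channels, 120 frames
long_utt = rng.standard_normal((512, 900))   # same channels, 900 frames

pooled_short = stats_pooling(short_utt)
pooled_long = stats_pooling(long_utt)
print(pooled_short.shape, pooled_long.shape)
```

Both utterances yield a vector of the same length even though their durations differ, which is what allows a fixed-size embedding layer to follow the pooling step.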

In this tutorial we first train these embeddings on speaker-related datasets and then extract speaker embeddings from a pretrained network for a new dataset. Since Google Colab has very slow read-write speeds, please run this notebook locally when training on [hi-mia](https://arxiv.org/abs/1912.01231).

We use the [get_hi-mia_data.py](https://github.com/NVIDIA/NeMo/blob/master/scripts/get_hi-mia_data.py) script to download the necessary files, extract them, and resample any audio that is not already at 16 kHz. At the end of the tutorial we also provide scripts to score these embeddings on a speaker-verification task such as the hi-mia dataset.

In [None]:
data_dir = 'scripts/data/'
!mkdir -p $data_dir

# Download and process dataset. This will take a few moments...
!python scripts/get_hi-mia_data.py --data_root=$data_dir

After download and conversion, your `data` folder should contain directories with manifest files such as:

* `data/<set>/train.json`
* `data/<set>/dev.json`
* `data/<set>/<set>_all.json`

For each set we also create utt2spk files, which are used later in PLDA training.

Each line in a manifest file describes one training sample: `audio_filepath` contains the path to the wav file, `duration` is its duration in seconds, and `label` is the speaker class label:

`{"audio_filepath": "<absolute path to dataset>/data/train/SPEECHDATA/wav/SV0184/SV0184_6_04_N3430.wav", "duration": 1.22, "label": "SV0184"}` 

`{"audio_filepath": "<absolute path to dataset>/data/train/SPEECHDATA/wav/SV0184/SV0184_5_03_F2037.wav", "duration": 1.375, "label": "SV0184"}`
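
As a quick sanity check on a manifest, the sketch below parses JSON lines of this shape and reports the number of utterances and total duration per speaker. It works on an in-memory list so it runs without the dataset; the shortened `/data/...` paths and the helper name `summarize_manifest` are placeholders, not part of NeMo.

```python
import json
from collections import defaultdict

# Two manifest lines in the format shown above (paths shortened for illustration).
manifest_lines = [
    '{"audio_filepath": "/data/train/SPEECHDATA/wav/SV0184/SV0184_6_04_N3430.wav", "duration": 1.22, "label": "SV0184"}',
    '{"audio_filepath": "/data/train/SPEECHDATA/wav/SV0184/SV0184_5_03_F2037.wav", "duration": 1.375, "label": "SV0184"}',
]


def summarize_manifest(lines):
    """Return {speaker_label: (num_utterances, total_seconds)}."""
    counts = defaultdict(int)
    seconds = defaultdict(float)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        counts[entry["label"]] += 1
        seconds[entry["label"]] += entry["duration"]
    return {spk: (counts[spk], seconds[spk]) for spk in counts}


summary = summarize_manifest(manifest_lines)
print(summary)
```

To run it on a real manifest, pass the open file object instead of the list, since iterating over a file yields one line at a time.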



Import necessary packages

In [None]:
from ruamel.yaml import YAML

import nemo
import nemo.collections.asr as nemo_asr
import copy
from functools import partial

# Building Training and Evaluation DAGs with NeMo
Building a model using NeMo consists of two steps:

1.  Instantiating the neural modules we need.
2.  Specifying the DAG by linking them together.

In NeMo, the training and inference pipelines are managed by a NeuralModuleFactory, which takes care of checkpointing, callbacks, and logs, along with other details in training and inference. We set its log_dir argument to specify where our model logs and outputs will be written, and can set other training and inference settings in its constructor. For instance, if we were resuming training from a checkpoint, we would set the argument checkpoint_dir=`<path_to_checkpoint>`.

In addition to the NeMo logs, you can optionally write TensorBoard logs by passing create_tb_writer=True to the NeuralModuleFactory. By default all TensorBoard log files are stored in {log_dir}/tensorboard, but you can change this with the tensorboard_dir argument. To view the logs, run tensorboard --logdir=`<path_to_tensorboard dir>` in the terminal.

In [None]:
exp_name = 'quartznet3x2_hi-mia'
work_dir = './myExps/'
neural_factory = nemo.core.NeuralModuleFactory(
    log_dir=work_dir+"/hi-mia_logdir/",
    checkpoint_dir="./myExps/checkpoints/" + exp_name,
    create_tb_writer=True,
    random_seed=42,
    tensorboard_dir=work_dir+'/tensorboard/',
)

Now that we have our neural module factory, we can specify our **neural modules and instantiate them**. Here, we load the parameters for each module from the configuration file. 

In [None]:
logging = nemo.logging
yaml = YAML(typ="safe")
with open('examples/speaker_recognition/configs/quartznet_spkr_3x2x512_xvector.yaml') as f:
    spkr_params = yaml.load(f)

sample_rate = spkr_params["sample_rate"]
time_length = spkr_params.get("time_length", 8)
logging.info("max time length considered for each file is {} sec".format(time_length))

We instantiate the training data layer using the config arguments. Setting `labels = None` creates the output labels automatically from the manifest file; if you would rather pass the speaker names yourself, use the `labels` option. When instantiating the eval data layer, we pass the training labels to the class so that its speaker output labels match those of the training data layer. This comes in handy when training on multiple datasets with more than one manifest file.
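
The reason for sharing labels can be seen in a small pure-Python sketch. It assumes class indices are positions in a sorted list of unique speaker names; how NeMo orders labels internally is not shown here, and the speaker IDs `SV0001` and `SV0042` are made up for illustration.

```python
def build_label_list(manifest_labels):
    """Sorted unique speaker names; the list position acts as the class index."""
    return sorted(set(manifest_labels))


def encode(labels, label_list):
    """Map speaker names to class indices using a given label list."""
    index = {name: i for i, name in enumerate(label_list)}
    return [index[name] for name in labels]


train_speakers = ["SV0184", "SV0001", "SV0184", "SV0042"]
dev_speakers = ["SV0042", "SV0184"]

train_labels = build_label_list(train_speakers)

# Reusing the training label list keeps dev indices aligned with the decoder outputs.
shared = encode(dev_speakers, train_labels)
# Building labels from the dev set alone assigns different indices to the same speakers.
independent = encode(dev_speakers, build_label_list(dev_speakers))

print(train_labels, shared, independent)
```

With independently built labels, the same speaker would map to a different output unit at evaluation time, so the reported accuracy would be meaningless.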

In [None]:
train_dl_params = copy.deepcopy(spkr_params["AudioToSpeechLabelDataLayer"])
train_dl_params.update(spkr_params["AudioToSpeechLabelDataLayer"]["train"])
del train_dl_params["train"]
del train_dl_params["eval"]

batch_size=64
data_layer_train = nemo_asr.AudioToSpeechLabelDataLayer(
        manifest_filepath=data_dir+'/train/train.json',
        labels=None,
        batch_size=batch_size,
        time_length=time_length,
        **train_dl_params,
    )

eval_dl_params = copy.deepcopy(spkr_params["AudioToSpeechLabelDataLayer"])
eval_dl_params.update(spkr_params["AudioToSpeechLabelDataLayer"]["eval"])
del eval_dl_params["train"]
del eval_dl_params["eval"]

data_layer_eval = nemo_asr.AudioToSpeechLabelDataLayer(
    manifest_filepath=data_dir+'/train/dev.json',
    labels=data_layer_train.labels,
    batch_size=batch_size,
    time_length=time_length,
    **eval_dl_params,
)

data_preprocessor = nemo_asr.AudioToMelSpectrogramPreprocessor(
        sample_rate=sample_rate, **spkr_params["AudioToMelSpectrogramPreprocessor"],
    )
encoder = nemo_asr.JasperEncoder(**spkr_params["JasperEncoder"],)

decoder = nemo_asr.JasperDecoderForSpkrClass(
        feat_in=spkr_params["JasperEncoder"]["jasper"][-1]["filters"],
        num_classes=data_layer_train.num_classes,
        pool_mode=spkr_params["JasperDecoderForSpkrClass"]['pool_mode'],
        emb_sizes=spkr_params["JasperDecoderForSpkrClass"]["emb_sizes"].split(","),
    )

xent_loss = nemo_asr.CrossEntropyLossNM(weight=None)

The next step is to assemble our training DAG by specifying the inputs to each neural module.

In [None]:
audio_signal, audio_signal_len, label, label_len = data_layer_train()
processed_signal, processed_signal_len = data_preprocessor(input_signal=audio_signal, length=audio_signal_len)
encoded, encoded_len = encoder(audio_signal=processed_signal, length=processed_signal_len)
logits, _ = decoder(encoder_output=encoded)
loss = xent_loss(logits=logits, labels=label)

We would like to be able to evaluate our model on the dev set, as well, so let's set up the evaluation DAG.

Our evaluation DAG will reuse most of the parts of the training DAG with the exception of the data layer, since we are loading the evaluation data from a different file but evaluating on the same model. Note that if we were using data augmentation in training, we would also leave that out in the evaluation DAG.

In [None]:
audio_signal_test, audio_len_test, label_test, _ = data_layer_eval()
processed_signal_test, processed_len_test = data_preprocessor(
            input_signal=audio_signal_test, length=audio_len_test
        )
encoded_test, encoded_len_test = encoder(audio_signal=processed_signal_test, length=processed_len_test)
logits_test, _ = decoder(encoder_output=encoded_test)
loss_test = xent_loss(logits=logits_test, labels=label_test)

# Creating Callbacks

We would like to be able to monitor our model while it's training, so we use callbacks. In general, callbacks are functions that are called at specific intervals over the course of training or inference, such as at the start or end of every n iterations, epochs, etc. The callbacks we'll be using for this are the SimpleLossLoggerCallback, which reports the training loss (or another metric of your choosing, such as \% accuracy for speaker recognition tasks), and the EvaluatorCallback, which regularly evaluates the model on the dev set. Both of these callbacks require you to pass in the tensors to be evaluated--these would be the final outputs of the training and eval DAGs above.
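
For reference, the accuracy figure such callbacks report can be computed from logits and labels as in the NumPy sketch below. This is an independent illustration of top-k accuracy, not the code inside NeMo's helper functions; the function name `top_k_accuracy` and the toy logits are assumptions.

```python
import numpy as np


def top_k_accuracy(logits, labels, k=1):
    """Fraction of rows whose true label is among the k largest logits."""
    top_k = np.argsort(logits, axis=1)[:, -k:]       # indices of the k highest scores
    hits = (top_k == labels[:, None]).any(axis=1)    # is the true label among them?
    return float(hits.mean())


# Toy batch: 4 utterances, 3 speaker classes.
logits = np.array([
    [2.0, 0.5, 0.1],
    [0.2, 0.3, 1.5],
    [0.9, 1.0, 0.8],
    [0.1, 0.2, 0.3],
])
labels = np.array([0, 2, 0, 1])

top1 = top_k_accuracy(logits, labels, k=1)
top2 = top_k_accuracy(logits, labels, k=2)
print(top1, top2)
```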

Another useful callback is the CheckpointCallback, for saving checkpoints at set intervals. We create one here just to demonstrate how it works.

In [None]:
from nemo.collections.asr.helpers import (
    monitor_classification_training_progress,
    process_classification_evaluation_batch,
    process_classification_evaluation_epoch,
)
from nemo.utils.lr_policies import CosineAnnealing

train_callback = nemo.core.SimpleLossLoggerCallback(
        tensors=[loss, logits, label],
        print_func=partial(monitor_classification_training_progress, eval_metric=[1]),
        step_freq=1000,
        get_tb_values=lambda x: [("train_loss", x[0])],
        tb_writer=neural_factory.tb_writer,
    )

callbacks = [train_callback]

chpt_callback = nemo.core.CheckpointCallback(
            folder="./myExps/checkpoints/" + exp_name,
            load_from_folder="./myExps/checkpoints/" + exp_name,
            step_freq=1000,
        )
callbacks.append(chpt_callback)

tagname = "hi-mia_dev"
eval_callback = nemo.core.EvaluatorCallback(
            eval_tensors=[loss_test, logits_test, label_test],
            user_iter_callback=partial(process_classification_evaluation_batch, top_k=1),
            user_epochs_done_callback=partial(process_classification_evaluation_epoch, tag=tagname),
            eval_step=1000,  # How often we evaluate the model on the dev set
            tb_writer=neural_factory.tb_writer,
        )

callbacks.append(eval_callback)

Now that we have our model and callbacks set up, how do we run it?

Once we have created our neural factory and the callbacks for the information we want to see, we can start training by simply calling the train function on the tensors we want to optimize and our callbacks! This notebook is meant to get you started: because the dataset is small, the model should reach high accuracy quickly. For better models, use bigger datasets.

In [None]:
# train model
num_epochs=25
N = len(data_layer_train)
steps_per_epoch = N // batch_size

logging.info("Number of steps per epoch {}".format(steps_per_epoch))

neural_factory.train(
        tensors_to_optimize=[loss],
        callbacks=callbacks,
        lr_policy=CosineAnnealing(
            num_epochs * steps_per_epoch, warmup_steps=0.1 * num_epochs * steps_per_epoch,
        ),
        optimizer="novograd",
        optimization_params={
            "num_epochs": num_epochs,
            "lr": 0.02,
            "betas": (0.95, 0.5),
            "weight_decay": 0.001,
            "grad_norm_clip": None,
        }
    )
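
The training call above combines a cosine-annealing learning-rate policy with a warmup phase covering the first 10% of steps. The sketch below shows the general shape of such a schedule in plain Python. It is a generic formulation under stated assumptions (linear warmup, decay to zero, and a made-up `steps_per_epoch`), not a reproduction of NeMo's `CosineAnnealing` class.

```python
import math


def cosine_with_warmup(step, total_steps, warmup_steps, base_lr, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay towards min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / (warmup_steps + 1)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


num_epochs, steps_per_epoch = 25, 100  # steps_per_epoch is a hypothetical value
total_steps = num_epochs * steps_per_epoch
warmup_steps = int(0.1 * total_steps)

schedule = [
    cosine_with_warmup(s, total_steps, warmup_steps, base_lr=0.02)
    for s in range(total_steps + 1)
]
print(schedule[0], schedule[warmup_steps], schedule[-1])
```

The learning rate rises during warmup, peaks at the base rate of 0.02 once warmup ends, and then decays smoothly over the remaining steps.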

Now that we have trained the model, we extract the embeddings using the checkpoint saved at `checkpoint_dir`. As the neural architecture shows, we extract the embeddings after the `emb1` layer.
![Speaker Recognition Layers](./speaker_reco.jpg)

Now use the test manifest to get the embeddings. As before, we create a new `data_layer` for the test set, reuse the previously instantiated modules, and attach them to form the DAG.

In [None]:
eval_dl_params = copy.deepcopy(spkr_params["AudioToSpeechLabelDataLayer"])
eval_dl_params.update(spkr_params["AudioToSpeechLabelDataLayer"]["eval"])
del eval_dl_params["train"]
del eval_dl_params["eval"]
eval_dl_params['shuffle'] = False  # To grab  the file names without changing data_layer

test_dataset = data_dir+'/test/test_all.json'  # no trailing comma: this must be a string, not a tuple
data_layer_test = nemo_asr.AudioToSpeechLabelDataLayer(
        manifest_filepath=test_dataset,
        labels=None,
        batch_size=batch_size,
        **eval_dl_params,
    )

audio_signal_test, audio_len_test, label_test, _ = data_layer_test()
processed_signal_test, processed_len_test = data_preprocessor(
    input_signal=audio_signal_test, length=audio_len_test)
encoded_test, _ = encoder(audio_signal=processed_signal_test, length=processed_len_test)
_, embeddings = decoder(encoder_output=encoded_test)

Now get the embeddings using the neural_factory infer command, which simply runs a forward pass through all our modules, and save the embeddings in `<work_dir>/embeddings`.

In [None]:
import numpy as np
import json
eval_tensors = neural_factory.infer(tensors=[embeddings, label_test], checkpoint_dir="./myExps/checkpoints/" + exp_name)

inf_emb, inf_label = eval_tensors
whole_embs = []
whole_labels = []
with open(test_dataset, 'r') as f:  # close the manifest file after reading
    manifest = f.readlines()

for line in manifest:
    line = line.strip()
    dic = json.loads(line)
    filename = dic['audio_filepath'].split('/')[-1]
    whole_labels.append(filename)

for idx in range(len(inf_label)):
    whole_embs.extend(inf_emb[idx].numpy())

embedding_dir = './myExps/embeddings/'
if not os.path.exists(embedding_dir):
    os.mkdir(embedding_dir)

filename = os.path.basename(test_dataset).split('.')[0]
name = embedding_dir + filename

np.save(name + '.npy', np.asarray(whole_embs))
np.save(name + '_labels.npy', np.asarray(whole_labels))
logging.info("Saved embedding files to {}".format(embedding_dir))


In [None]:
!ls $embedding_dir
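
To check the saved files, you can load them back and pair each file name with its embedding. The sketch below mimics the two `.npy` files with random data in a temporary directory so it runs on its own; the array shape, the file names `utt_a.wav` etc., and the variable names are placeholders.

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp_dir:
    name = os.path.join(tmp_dir, "test_all")

    # Stand-ins for the arrays written by the cell above (random values, made-up names).
    fake_embs = np.random.default_rng(0).standard_normal((3, 1024)).astype(np.float32)
    fake_labels = np.asarray(["utt_a.wav", "utt_b.wav", "utt_c.wav"])
    np.save(name + ".npy", fake_embs)
    np.save(name + "_labels.npy", fake_labels)

    # Load both files back, as a downstream scoring script would.
    embs = np.load(name + ".npy")
    labels = np.load(name + "_labels.npy")

# There should be exactly one label per embedding row.
assert embs.shape[0] == labels.shape[0]
emb_by_file = dict(zip(labels.tolist(), embs))
print(len(emb_by_file), embs.shape)
```

A mismatch between the number of labels and the number of embedding rows usually means the data layer reordered or dropped samples, which is why `shuffle` is set to `False` above.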

# Cosine Similarity Scoring

Here we provide a script for scoring on hi-mia, whose trial file has the structure `<speaker_name1> <speaker_name2> <target/nontarget>`. First copy the `trails_1m` file from the test folder to our embeddings directory.

In [None]:
!cp $data_dir/test/trails_1m $embedding_dir/

The command below outputs the EER (%) based on cosine similarity scores.

In [None]:
!python examples/speaker_recognition/hi-mia_eval.py --data_root $embedding_dir --emb $embedding_dir/test_all.npy --emb_labels $embedding_dir/test_all_labels.npy --emb_size 1024
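
To illustrate what cosine scoring with an EER involves, here is a small self-contained sketch. It is not the implementation in `hi-mia_eval.py`: the toy two-dimensional embeddings, the utterance names, and the simple threshold sweep used to approximate the EER are all assumptions made for illustration.

```python
import numpy as np


def cosine_score(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def equal_error_rate(scores, is_target):
    """Approximate EER: sweep thresholds over the observed scores and return
    the error rate where false-accept and false-reject rates are closest."""
    scores = np.asarray(scores, dtype=float)
    is_target = np.asarray(is_target, dtype=bool)
    best_gap, eer = np.inf, 1.0
    for threshold in np.sort(scores):
        far = np.mean(scores[~is_target] >= threshold)  # nontargets accepted
        frr = np.mean(scores[is_target] < threshold)    # targets rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return float(eer)


# Toy embeddings and trials in the <name1> <name2> <target/nontarget> layout.
embeddings = {
    "utt1": np.array([1.0, 0.0]),
    "utt2": np.array([0.9, 0.1]),
    "utt3": np.array([0.0, 1.0]),
    "utt4": np.array([0.1, 0.9]),
}
trials = [
    ("utt1", "utt2", "target"),
    ("utt3", "utt4", "target"),
    ("utt1", "utt3", "nontarget"),
    ("utt2", "utt4", "nontarget"),
]

scores = [cosine_score(embeddings[a], embeddings[b]) for a, b, _ in trials]
targets = [kind == "target" for _, _, kind in trials]
eer = equal_error_rate(scores, targets)
print(scores, eer)
```

In this toy case every target pair scores higher than every nontarget pair, so a threshold exists that separates them perfectly and the EER is zero. Real trial lists overlap, and the EER reports the error rate at the threshold where both error types balance.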


# PLDA Backend
To refine our speaker embeddings further, we use Kaldi PLDA scripts to train a PLDA model and evaluate with it. From this point on, please make sure that Kaldi is installed and added to your path as KALDI_ROOT.

To train PLDA, we can use either the dev set or the training set. Let's use the training set embeddings to train PLDA and then use the trained PLDA model to score the test embeddings. To do that, we need embeddings for our training data as well. Following the same steps as above, generate the train embeddings.

In [None]:
test_dataset = data_dir+'/train/train.json'  # no trailing comma: this must be a string, not a tuple

data_layer_test = nemo_asr.AudioToSpeechLabelDataLayer(
        manifest_filepath=test_dataset,
        labels=None,
        batch_size=batch_size,
        **eval_dl_params,
    )

audio_signal_test, audio_len_test, label_test, _ = data_layer_test()
processed_signal_test, processed_len_test = data_preprocessor(
    input_signal=audio_signal_test, length=audio_len_test)
encoded_test, _ = encoder(audio_signal=processed_signal_test, length=processed_len_test)
_, embeddings = decoder(encoder_output=encoded_test)

eval_tensors = neural_factory.infer(tensors=[embeddings, label_test], checkpoint_dir="./myExps/checkpoints/" + exp_name)

inf_emb, inf_label = eval_tensors
whole_embs = []
whole_labels = []
with open(test_dataset, 'r') as f:  # close the manifest file after reading
    manifest = f.readlines()

for line in manifest:
    line = line.strip()
    dic = json.loads(line)
    filename = dic['audio_filepath'].split('/')[-1]
    whole_labels.append(filename)

for idx in range(len(inf_label)):
    whole_embs.extend(inf_emb[idx].numpy())

if not os.path.exists(embedding_dir):
    os.mkdir(embedding_dir)

filename = os.path.basename(test_dataset).split('.')[0]
name = embedding_dir + filename

np.save(name + '.npy', np.asarray(whole_embs))
np.save(name + '_labels.npy', np.asarray(whole_labels))
logging.info("Saved embedding files to {}".format(embedding_dir))


Among the files Kaldi requires, we need the `utt2spk` and `spk2utt` files to create the ark file for PLDA training. To get them, copy the generated utt2spk file from the `data_dir` train folder and create the spk2utt file using

`utt2spk_to_spk2utt.pl  $data_dir/train/utt2spk > $embedding_dir/spk2utt`
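
For readers unfamiliar with these files: utt2spk lists one `<utterance-id> <speaker-id>` pair per line, and spk2utt lists each speaker followed by all of its utterances. The Python sketch below performs the same inversion as the Perl script on an in-memory list. The utterance IDs are illustrative (`SV0001_1_01_N0001` is made up), and the exact ID format in the generated hi-mia utt2spk file is an assumption.

```python
from collections import defaultdict


def utt2spk_to_spk2utt(utt2spk_lines):
    """Invert '<utt> <spk>' lines into '<spk> <utt1> <utt2> ...' lines."""
    spk2utt = defaultdict(list)
    for line in utt2spk_lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        utt, spk = parts
        spk2utt[spk].append(utt)
    return [f"{spk} {' '.join(utts)}" for spk, utts in sorted(spk2utt.items())]


utt2spk = [
    "SV0184_6_04_N3430 SV0184",
    "SV0184_5_03_F2037 SV0184",
    "SV0001_1_01_N0001 SV0001",
]

spk2utt_lines = utt2spk_to_spk2utt(utt2spk)
print("\n".join(spk2utt_lines))
```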

Then run the Python script below to get the EER score using PLDA backend scoring. This script performs the data preparation for Kaldi, followed by PLDA scoring.

In [None]:
!python examples/speaker_recognition/kaldi_plda.py --root $embedding_dir --train_embs $embedding_dir/train.npy --train_labels $embedding_dir/train_labels.npy --eval_embs $embedding_dir/all_embs_himia.npy --eval_labels $embedding_dir/all_ids_himia.npy --stage=1

Here `--stage=1` trains the PLDA model. If you already have a trained PLDA model, you can evaluate with it directly using the `--stage=2` option.

This should output an EER of 6.32% with minDCF: 0.455

# Performance Improvement

To improve the performance of your embeddings:

* Add more data and train longer (100 epochs)

* Try adding augmentation (see the config file)

* Use a larger model

* Train on several GPUs and use mixed precision (on NVIDIA Volta and Turing GPUs)

* Start with pre-trained checkpoints