Introduction

This repository contains the code release for our paper "Learning Speaker Embedding with Momentum Contrast".

The code was developed using Kaldi and PyTorch. Kaldi is used only for feature extraction and post-processing. All neural networks are trained with PyTorch.

The purpose of the project is to make research on neural-network-based speaker verification easier. We also try to reproduce some of the results from our papers.

Requirements

The code has been tested with the following dependency versions:

Python: 3.6.8
Kaldi: 5.5
kaldi-io: 0.9.4
numpy: 1.16.4
Pillow: 6.2.1
scikit-learn: 0.22.2.post1
six: 1.13.0
tensorboardX: 1.1
torch: 1.2.0

Kaldi must be installed separately. The remaining requirements can be installed via pip:
pip install -r requirements.txt

Usage

Prerequisites

Set KALDI_ROOT in path.sh to the root of your Kaldi installation. Then link $KALDI_ROOT/egs/wsj/s5/utils to utils in this directory.
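
The linking step can also be scripted. The following is a minimal sketch, assuming a symbolic link is acceptable and that the recipe directory is the repository root. The helper name `link_kaldi_utils` is hypothetical and not part of this repository.

```python
import os


def link_kaldi_utils(kaldi_root, recipe_dir="."):
    """Create recipe_dir/utils as a symlink to Kaldi's egs/wsj/s5/utils.

    Hypothetical helper for illustration; it is not part of this repository.
    """
    target = os.path.join(kaldi_root, "egs", "wsj", "s5", "utils")
    if not os.path.isdir(target):
        raise FileNotFoundError("Kaldi utils directory not found: " + target)
    link = os.path.join(recipe_dir, "utils")
    # Leave an existing link or directory untouched.
    if not os.path.lexists(link):
        os.symlink(target, link)
    return link
```

Calling `link_kaldi_utils(os.environ["KALDI_ROOT"])` from the repository root would create the link, provided KALDI_ROOT is exported in the current environment.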

Step 1: Train the MoCo Model

To train the MoCo model with SpecAugment, run:

sh train_moco.sh \
  --voxceleb1_root [voxceleb1 dir] \
  --voxceleb2_root [voxceleb2 dir] \
  --rirs_noises_root [rirs_noises dir] \
  --musan_root [musan dir] \
  --data [train data dir] \
  --exp [exp dir]
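
For readers unfamiliar with Momentum Contrast, the sketch below illustrates the momentum update that gives the method its name: the key encoder is not trained by backpropagation but slowly tracks the query encoder. This is a toy illustration on plain lists of floats, not the code run by train_moco.sh. The function name and the momentum value are assumptions.

```python
def momentum_update(key_params, query_params, momentum=0.999):
    """Return key-encoder parameters after one MoCo-style momentum update.

    Each key parameter moves a small step toward the matching query parameter:
    key <- momentum * key + (1 - momentum) * query.
    The default momentum is illustrative, not this repository's setting.
    """
    return [momentum * k + (1.0 - momentum) * q
            for k, q in zip(key_params, query_params)]


# Toy example with two scalar "parameters" per encoder.
key = [0.0, 1.0]
query = [1.0, 1.0]
key = momentum_update(key, query, momentum=0.9)
```

A momentum close to 1 keeps the key encoder nearly constant between steps, which keeps the keys stored in the queue consistent with one another.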

Step 2: Train AAM-Softmax

To train AAM-Softmax with the pretrained MoCo model, run:

sh train_xvector.sh \
  --voxceleb1_root [voxceleb1 dir] \
  --voxceleb2_root [voxceleb2 dir] \
  --rirs_noises_root [rirs_noises dir] \
  --musan_root [musan dir] \
  --moco_model [pretrained MoCo model] \
  --data [train data dir] \
  --exp [exp dir]

Note: If the MoCo model does not exist, train_xvector.sh trains the network as in the standard Xvector recipe.
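
As background, AAM-Softmax (additive angular margin softmax) adds a margin to the angle between an embedding and its true-class weight before applying softmax cross-entropy. The sketch below computes the loss for a single example in plain Python. It is an illustration of the general technique, not the implementation in this repository. The margin and scale values are assumptions.

```python
import math


def aam_softmax_loss(cosines, target, margin=0.2, scale=30.0):
    """Cross-entropy over cosine logits with an additive angular margin.

    cosines: cosine similarity between one embedding and each class weight.
    target:  index of the true class.
    margin and scale are illustrative values, not the repository's settings.
    Practical implementations also handle angle + margin exceeding pi;
    that case is omitted here.
    """
    logits = []
    for j, c in enumerate(cosines):
        c = max(-1.0, min(1.0, c))  # guard acos against rounding error
        if j == target:
            # Add the margin to the angle of the true class only.
            c = math.cos(math.acos(c) + margin)
        logits.append(scale * c)
    # Numerically stable negative log-softmax of the target logit.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]
```

With `margin=0.0` this reduces to a scaled cosine softmax. A positive margin makes the loss larger for the same embedding, which pushes classes further apart in angle.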

Step 3: Evaluate the performance of the trained model

After training, evaluate the performance on the test set by running:

sh test.sh \
  --data [test data dir] \
  --exp [exp] \
  --dir [trained model dir] \
  --mdl [model name, default final.pkl] \
  --plda_score [apply plda if true else apply cosine score, default true]
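
When `--plda_score` is false, trials are scored with cosine similarity instead of PLDA. The following sketch shows cosine scoring between two embeddings in plain Python. It is an illustration, not the code that test.sh runs, and the function name is hypothetical.

```python
import math


def cosine_score(enroll, test):
    """Cosine similarity between an enrollment and a test embedding."""
    dot = sum(a * b for a, b in zip(enroll, test))
    norm = (math.sqrt(sum(a * a for a in enroll))
            * math.sqrt(sum(b * b for b in test)))
    if norm == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / norm
```

The score depends only on the angle between the two embeddings, so rescaling either vector leaves it unchanged.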

Setting

Training data: All of Voxceleb2, plus the training portion of Voxceleb1.
Test data: The test portion of Voxceleb1.
Learning Rate: For standard Xvector and MoCo, the learning rate starts at 1e-4 and is gradually reduced to 1e-5 over the course of training. For AAM-Softmax, it starts at 1e-5 and is gradually reduced to 1e-6.
Chunk Size: 200 to 400.
Batch Size: 1024.
Backend Classifier: PLDA and Cosine.
Evaluated Model: We select the checkpoint at epoch 900 (checkpoint_e900.pkl) for evaluation, although AAM-Softmax trained from the pretrained MoCo model converges faster.
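
The learning-rate and chunk-size settings above can be sketched as follows. The settings do not state the shape of the decay or the unit of the chunk size, so this sketch assumes a geometric decay per step and a chunk length measured in feature frames. Both function names are hypothetical.

```python
import random


def learning_rate(step, total_steps, lr_start=1e-4, lr_end=1e-5):
    """Geometric decay from lr_start (step 0) to lr_end (last step).

    The decay shape is an assumption; the settings only say the rate is
    gradually reduced.
    """
    if total_steps < 2:
        return lr_start
    progress = min(max(step, 0), total_steps - 1) / (total_steps - 1)
    return lr_start * (lr_end / lr_start) ** progress


def sample_chunk(num_frames, rng, min_len=200, max_len=400):
    """Pick a random (start, length) chunk with length in [min_len, max_len].

    Lengths are assumed to be counted in feature frames.
    """
    if num_frames < min_len:
        raise ValueError("utterance is shorter than the minimum chunk length")
    length = rng.randint(min_len, min(max_len, num_frames))
    start = rng.randint(0, num_frames - length)
    return start, length
```

For AAM-Softmax, the same schedule would be called with `lr_start=1e-5` and `lr_end=1e-6`.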

Performance

PLDA

| Method | EER (%) | minDCF (0.01) | minDCF (0.001) |
| --- | --- | --- | --- |
| Ivector | 5.467 | 0.4859 | 0.6213 |
| Xvector | 3.34 | 0.3795 | 0.6138 |
| Xvector-AAM | 2.55 | 0.3464 | 0.5848 |
| Xvector-AAM-MoCo | 2.423 | 0.2856 | 0.3850 |

Cosine

| Method | EER (%) | minDCF (0.01) | minDCF (0.001) |
| --- | --- | --- | --- |
| Ivector | 14.65 | 0.7195 | 0.8661 |
| Xvector | 7.349 | 0.5799 | 0.7418 |
| Xvector-AAM | 2.306 | 0.2647 | 0.3372 |
| Xvector-AAM-MoCo | 2.402 | 0.2232 | 0.3573 |
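
For reference, the equal error rate (EER) reported in the tables is the operating point where the miss rate equals the false-alarm rate. The sketch below approximates it from lists of target and non-target trial scores. It is a simple quadratic-time illustration, not the scoring tool used to produce the numbers above, and the function name is hypothetical.

```python
def compute_eer(target_scores, nontarget_scores):
    """Approximate the equal error rate (EER) from two lists of trial scores.

    Sweeps every observed score as a threshold (accept if score >= threshold)
    and returns the mean of the miss and false-alarm rates at the threshold
    where the two are closest. The result is a fraction in [0, 1].
    """
    if not target_scores or not nontarget_scores:
        raise ValueError("need at least one target and one non-target score")
    best_gap, eer = None, None
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        miss = sum(s < threshold for s in target_scores) / len(target_scores)
        false_alarm = (sum(s >= threshold for s in nontarget_scores)
                       / len(nontarget_scores))
        gap = abs(miss - false_alarm)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (miss + false_alarm) / 2.0
    return eer
```

Multiply the returned value by 100 to express it as a percentage, as in the EER column.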

DET

Citation

If you use this code, please cite the following paper:
Ke Ding, Xuanji He, Guanglu Wan. Learning Speaker Embedding with Momentum Contrast. arXiv preprint arXiv:2001.01986 (2020)

Contact

If you have any questions, please feel free to contact us:

| Author | E-mail |
| --- | --- |
| Ke Ding | dingke02@meituan.com |
| Xuanji He | hexuanji@meituan.com |
| Guanglu Wan | wanguanglu@meituan.com |

License

The code is BSD-style licensed, as found in the LICENSE file.
