
Speaker Embeddings

Introduction

This is a PyTorch implementation of a self-attentive speaker embedding.
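
A self-attentive embedding replaces plain average pooling with learned attention weights over frames, so more informative frames contribute more to the utterance-level embedding. Below is a minimal sketch of that idea; the module and parameter names (SelfAttentivePooling, att_dim) are illustrative, not the repo's exact code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentivePooling(nn.Module):
    """Pool frame-level features into one utterance-level embedding
    using learned attention weights. A minimal sketch of the idea;
    the repo's actual module may differ."""

    def __init__(self, feat_dim, att_dim=128):
        super().__init__()
        self.w1 = nn.Linear(feat_dim, att_dim)
        self.w2 = nn.Linear(att_dim, 1)

    def forward(self, frames):
        # frames: (batch, time, feat_dim) frame-level features
        scores = self.w2(torch.tanh(self.w1(frames)))  # (batch, time, 1)
        alphas = F.softmax(scores, dim=1)              # weights over time
        return (alphas * frames).sum(dim=1)            # (batch, feat_dim)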

Dataset

VoxCeleb1 contains over 100,000 utterances for 1,251 celebrities, extracted from videos uploaded to YouTube.

                  dev       test
# of speakers     1,211     40
# of utterances   148,642   4,874

Download the following files into the "data" folder:

  • vox1_dev_wav_partaa
  • vox1_dev_wav_partab
  • vox1_dev_wav_partac
  • vox1_dev_wav_partad
  • vox1_test_wav.zip

Then concatenate the files using the command:

$ cat vox1_dev* > vox1_dev_wav.zip

Dependencies

  • Python 3.5.2
  • PyTorch 1.0.0

Usage

Data Pre-processing

Extract vox1_dev_wav.zip & vox1_test_wav.zip:

$ python extract.py
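
extract.py is not reproduced here; if you prefer to extract manually, the step amounts to the following sketch (paths assume the "data" folder described above):

import zipfile

# A minimal sketch of the extraction step; extract.py itself may differ.
for name in ('data/vox1_dev_wav.zip', 'data/vox1_test_wav.zip'):
    with zipfile.ZipFile(name) as archive:
        archive.extractall('data')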

Split the dev set into train and validation samples:

$ python pre_process.py
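
As a rough picture of what such a split involves, here is a hypothetical sketch; pre_process.py may use a different layout or split ratio, and the directory path below is an assumption.

import os
import random

# Hypothetical dev -> train/valid split. Assumes one sub-folder per
# speaker under the extracted dev wav root; the real script may differ.
dev_dir = 'data/vox1_dev_wav/wav'  # assumed layout after extraction

samples = []
for speaker in sorted(os.listdir(dev_dir)):
    for root, _, files in os.walk(os.path.join(dev_dir, speaker)):
        samples += [(os.path.join(root, f), speaker)
                    for f in files if f.endswith('.wav')]

random.shuffle(samples)
n_valid = len(samples) // 10  # hold out ~10% for validation
valid, train = samples[:n_valid], samples[n_valid:]
print('train: %d, valid: %d' % (len(train), len(valid)))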

Train

$ python train.py

If you want to visualize training progress, run in your terminal:

$ tensorboard --logdir runs

Performance

Model   Margin-s   Margin-m   Test (%)   Inference speed
1       10.0       0.2        88.48      18.18 ms
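
The Margin-s and Margin-m hyperparameters suggest an additive-angular-margin (ArcFace-style) classification head. The sketch below shows how such a loss uses s and m; it is an assumption about the training objective, not necessarily the exact code in train.py.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcMarginProduct(nn.Module):
    """ArcFace-style additive angular margin head. A hypothetical
    sketch matching the Margin-s / Margin-m values in the table
    above; not necessarily the repo's exact implementation."""

    def __init__(self, emb_dim, num_classes, s=10.0, m=0.2):
        super().__init__()
        self.s = s  # feature scale (Margin-s)
        self.m = m  # angular margin (Margin-m)
        self.weight = nn.Parameter(torch.empty(num_classes, emb_dim))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, embeddings, labels):
        # Cosine similarity between L2-normalized embeddings and weights.
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # Add the margin m to the target-class angle only.
        target = F.one_hot(labels, cosine.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.m), cosine)
        return F.cross_entropy(self.s * logits, labels)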

Demo

Visualize speaker embeddings from the test set:

$ python visualize.py
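
One common way to produce such a plot is a 2-D t-SNE projection colored by speaker, sketched below; visualize.py may load and plot the embeddings differently, and the file names here are hypothetical.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Hypothetical inputs: one embedding per test utterance plus its
# speaker label; the repo's script may store these differently.
embeddings = np.load('embeddings.npy')  # shape: (n_utterances, emb_dim)
labels = np.load('labels.npy')          # shape: (n_utterances,)

points = TSNE(n_components=2).fit_transform(embeddings)
plt.scatter(points[:, 0], points[:, 1], c=labels, s=5, cmap='tab20')
plt.title('Speaker embeddings (t-SNE)')
plt.savefig('embeddings.png')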

Embeddings

(figure: visualization of the speaker embeddings)

Theta_j Distribution

(figure: distribution of theta_j)
