Deep Speaker Recognition - Pytorch Implementation


This is the repository for investigating deep speaker recognition. The implementation is based on PyTorch, with Kaldi-style audio processing.

How To Install

# Clone this repository
git clone https://github.com/Wenhao-Yang/SpeakerVerifiaction-pytorch

# Install dependencies
pip install -r requirements.txt

Key Features

  • Knowledge Distillation
  • Mixup-based Augmentation
  • Domain Adversarial Learning
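The mixup-based augmentation listed above can be sketched as follows; the function name and defaults are illustrative, not this repository's actual API:

```python
import numpy as np

def mixup(features, labels, num_classes, alpha=0.2, rng=None):
    """Mix random pairs of examples and their one-hot labels.

    `alpha` parameterizes the Beta distribution that draws the mixing
    weight; names and defaults here are illustrative.
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)          # mixing weight in [0, 1]
    perm = rng.permutation(len(features)) # partner example for each row
    one_hot = np.eye(num_classes)[labels]
    mixed_x = lam * features + (1.0 - lam) * features[perm]
    mixed_y = lam * one_hot + (1.0 - lam) * one_hot[perm]
    return mixed_x, mixed_y, lam
```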

Datasets

Prepare data the Kaldi way: make features in Process_Data and store shuffled, random-length features in egs. The remaining stages are handled in this repository.
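The "shuffled features with random length" step can be sketched roughly like this; the length bounds and function name are illustrative, not the repository's actual settings:

```python
import numpy as np

def random_chunk(feats, min_len=200, max_len=400, rng=None):
    """Cut one random-length chunk of frames from a [frames, dim]
    feature matrix, as when variable-length examples are written to egs."""
    rng = np.random.default_rng() if rng is None else rng
    chunk_len = int(rng.integers(min_len, max_len + 1))
    chunk_len = min(chunk_len, len(feats))          # clamp to utterance length
    start = int(rng.integers(0, len(feats) - chunk_len + 1))
    return feats[start:start + chunk_len]
```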

  • Development & Test:

Voxceleb1, Voxceleb2, Aishell1&2, CN-celeb1&2, aidatatang_200zh, MAGICDATA, TAL_CSASR, ChiME5, VOiCES, CommonVoice, AMI, SITW, Librispeech, TIMIT

  • Augmentation:

MUSAN, RIRS

Pre-Processing

  • Resample
  • Butter Bandpass Filtering
  • Augmentation
  • LMS Filtering (known to be defective)
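As an example of the Butter bandpass filtering step above, a sketch using scipy; the cutoff frequencies and filter order are example values, not the repository's settings:

```python
import numpy as np
from scipy.signal import butter, lfilter

def butter_bandpass_filter(signal, low_hz, high_hz, sample_rate, order=4):
    """Design a Butterworth band-pass filter and apply it to a 1-D signal."""
    b, a = butter(order, [low_hz, high_hz], btype="band", fs=sample_rate)
    return lfilter(b, a, signal)
```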

Acoustic Features

  • MFCC
  • Fbank
  • Spectrogram
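For illustration, a plain-numpy magnitude spectrogram of the kind listed above; the frame sizes assume 16 kHz audio (20 ms window, 10 ms hop) and are not necessarily the repository's exact extractor:

```python
import numpy as np

def spectrogram(wave, n_fft=320, hop=160):
    """Log-magnitude spectrogram via a short-time FFT with a Hamming window."""
    window = np.hamming(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))   # [frames, n_fft // 2 + 1]
    return np.log(mag + 1e-8)
```

With these defaults each frame yields 161 frequency bins, matching the 161-dimensional spectrograms mentioned in the results below.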

Neural Networks

TDNN-based models

TDNN_v2 is adapted from https://github.com/cvqluu/TDNN/blob/master/tdnn.py. The TDNN_v4 layer is implemented with nn.Conv2d, and the TDNN_v5 layer with nn.Conv1d.

  • ETDNN
  • FTDNN
  • DTDNN
  • Aggregated-Residual TDNN
  • ECAPA-TDNN
  • LSTM
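A single TDNN layer in the nn.Conv1d style mentioned above might be sketched as follows; the context and channel sizes are illustrative, not the repository's exact configuration:

```python
import torch
import torch.nn as nn

class TDNNLayer(nn.Module):
    """One TDNN layer as a dilated 1-D convolution over [batch, dim, frames]."""
    def __init__(self, in_dim, out_dim, context=5, dilation=1):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, out_dim, kernel_size=context,
                              dilation=dilation)
        self.norm = nn.BatchNorm1d(out_dim)

    def forward(self, x):                  # x: [batch, dim, frames]
        return torch.relu(self.norm(self.conv(x)))
```

Stacking such layers with growing dilation widens the temporal context without pooling, which is the usual TDNN/x-vector design.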

ResNet-based models

  • ResNet34
  • ResCNN

Loss Type

Discriminative

  • A-Softmax
  • AM-Softmax
  • AAM-Softmax
  • Center Loss
  • Ring Loss
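As an example of the margin-based losses above, a minimal AM-Softmax head; the scale and margin are common defaults, not necessarily this repository's values:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AMSoftmax(nn.Module):
    """Additive-margin softmax: cosine logits with margin m subtracted
    from the target class, then scaled by s before cross-entropy."""
    def __init__(self, emb_dim, n_classes, s=30.0, m=0.2):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes, emb_dim))
        self.s, self.m = s, m

    def forward(self, emb, target):
        # Cosine similarity between normalized embeddings and class weights.
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))
        margin = F.one_hot(target, cos.size(1)) * self.m
        return F.cross_entropy(self.s * (cos - margin), target)
```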

End-to-End

  • Generalized End-to-End Loss
  • Triplet Loss
  • Contrastive Loss
  • Prototypical Loss
  • Angular Prototypical Loss

Pooling Type

  • Self-Attention
  • Statistic Pooling
  • Attention Statistic Pooling
  • GhostVLAD Pooling
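Statistic pooling, the simplest of the options above, can be sketched as:

```python
import torch

def statistics_pooling(frames, eps=1e-8):
    """Concatenate mean and standard deviation over the time axis of
    [batch, dim, frames], yielding a fixed-size [batch, 2*dim] vector."""
    mean = frames.mean(dim=2)
    std = torch.sqrt(frames.var(dim=2, unbiased=False) + eps)
    return torch.cat([mean, std], dim=1)
```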

Score with Normalization

  • Cosine
  • PLDA
  • DET
  • t-SNE
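Cosine scoring with symmetric score normalization (s-norm) might be sketched as follows; the plain full-cohort normalization here is a simplification (adaptive top-k selection is omitted):

```python
import numpy as np

def cosine_score(e1, e2):
    """Cosine similarity between two embedding vectors."""
    return float(e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2) + 1e-8))

def s_norm(score, enroll, test, cohort):
    """Standardize a raw trial score against each side's cohort
    score distribution, then average the two normalized scores."""
    ce = np.array([cosine_score(enroll, c) for c in cohort])
    ct = np.array([cosine_score(test, c) for c in cohort])
    return 0.5 * ((score - ce.mean()) / (ce.std() + 1e-8)
                  + (score - ct.mean()) / (ct.std() + 1e-8))
```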

Class Activation Mapping Analysis

  • Gradient
  • Grad-CAM
  • Grad-CAM++
  • Full-Grad
  • Integrated Gradients
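A minimal Grad-CAM pass, one of the CAM variants above, could look like this on a toy network; the split into a `conv` stage and a `head` stage is illustrative:

```python
import torch
import torch.nn as nn

def grad_cam(conv, head, x, target):
    """Weight the conv activation maps by the spatially averaged gradient
    of the target logit, sum over channels, then ReLU."""
    acts = conv(x)                          # [batch, ch, h, w]
    acts.retain_grad()
    logits = head(acts)
    logits[torch.arange(len(x)), target].sum().backward()
    weights = acts.grad.mean(dim=(2, 3), keepdim=True)  # GAP of gradients
    cam = torch.relu((weights * acts).sum(dim=1))       # [batch, h, w]
    return cam.detach()
```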

To do list

  • Self-supervised Pre-Trained Speech Representation Models

Miscellaneous

Mixed precision training test

Result: ThinResNet trained for one epoch on a single 2080 Ti.

Normal torch
> Running 10.9361 minutes for Epoch  1, GPU Memory-3799M: 
> Train Accuracy: 17.641670%, Avg loss: 4.453071.
  Valid Accuracy: 24.607762%, Avg loss: 3.693921.
  Train EER: 14.4528%, Threshold: 0.5362, mindcf-0.01: 0.8898, mindcf-0.001: 0.9166. 
  Test  EER: 12.7572%, Threshold: 0.5246, mindcf-0.01: 0.8408, mindcf-0.001: 0.9176.

Nvidia Apex
> Running 9.1049 minutes for Epoch  1, GPU Memory-3799M: 
> Train Accuracy: 17.752310%, Avg loss: 4.443472.
  Valid Accuracy: 23.905863%, Avg loss: 3.808081.
  Train EER: 13.8477%, Threshold: 0.5310, mindcf-0.01: 0.8706, mindcf-0.001: 0.9282. 
  Test  EER: 12.7784%, Threshold: 0.5301, mindcf-0.01: 0.8184, mindcf-0.001: 0.8848.

Torch amp
> Running 8.9960 minutes for Epoch  1, GPU Memory-3729M: 
> Train Accuracy: 17.992458%, Avg loss: 4.410025.
  Valid Accuracy: 30.470685%, Avg loss: 3.432679.
  Train EER: 14.8058%, Threshold: 0.4383, mindcf-0.01: 0.9010, mindcf-0.001: 0.9551. 
  Test  EER: 12.8420%, Threshold: 0.4092, mindcf-0.01: 0.7864, mindcf-0.001: 0.8909.
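The "Torch amp" run above follows the standard autocast + GradScaler recipe, roughly as sketched below; the helper name is illustrative, and the sketch falls back to full precision when CUDA is unavailable:

```python
import torch
import torch.nn as nn

def train_step(model, batch, target, optimizer, scaler, use_cuda):
    """One mixed-precision step: forward under autocast, scaled backward,
    scaler-driven optimizer step (no-op scaling when enabled=False)."""
    optimizer.zero_grad()
    with torch.autocast("cuda" if use_cuda else "cpu", enabled=use_cuda):
        loss = nn.functional.cross_entropy(model(batch), target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```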

Work accomplished so far:

  • Models implementation
  • Data pipeline implementation - "Voxceleb"
  • Project structure cleanup.
  • Trained a simple ResNet10 with softmax + triplet loss: softmax pre-training for 10 batches, then triplet loss for 18 epochs, resulting in accuracy ???
  • DET curve

Timeline

  • Extracted x-vectors from the trained neural network (20190626)
  • Code cleanup (factory model creation) (20200725)
  • Modified preprocessing
  • Modified models for ResNet34/50/101 (20190625)
  • Added cosine distance to the triplet loss (previously L2 distance) (20190703)
  • Added scoring for identification
  • Forked the PLDA classification method in Python from https://github.com/RaviSoji/plda/blob/master/plda/

Performance

Baseline

| Group | Model | Epoch | Loss Type | Loss | Set | Train/Test Accuracy (%) | EER (%) |
|-------|-------|-------|-----------|------|-----|-------------------------|---------|
| 1 | Resnet-10 | 1:22 | Triplet | 6.6420:0.0113 | | 0.8553/0.8431 | ... |
|   | ResNet-34 | 1:8 | CrossEntropy | 8.0285:0.0301 | | 0.8360/0.8302 | ... |
| 2 | TDNN | 40 | CrossEntropy | 3.1716:0.0412 | vox1 dev/test | 99.9994/99.5871 | 1.6700/5.4030 |
| 2 | TDNN | 40 | CrossEntropy | 3.0382:0.2196 | vox2 dev/vox1 test | 98.5265/98.2733 | 3.0800/3.0859 |
|   | ETDNN | ... | Softmax | 8.0285:0.0301 | | 0.8360/0.8302 | ... |
|   | FTDNN | ... | Softmax | 8.0285:0.0301 | | 0.8360/0.8302 | ... |
|   | DTDNN | ... | Softmax | 8.0285:0.0301 | | 0.8360/0.8302 | ... |
|   | ARETDNN | ... | Softmax | 8.0285:0.0301 | | 0.8360/0.8302 | ... |
| 3 | LSTM | ... | ... | ... | | ... | ... |
|   | LSTM | ... | Softmax | 8.0285:0.0301 | | 0.8360/0.8302 | ... |
|   | LSTM | ... | Softmax | 8.0285:0.0301 | | 0.8360/0.8302 | ... |
| 2 | ... | ... | ... | ... | | ... | ... |
| 2 | TDNN | ... | Softmax | 8.0285:0.0301 | | 0.8360/0.8302 | ... |
  • TDNN_v5, training set: voxceleb2 161-dimensional spectrograms, loss: arcsoft, scored with cosine similarity

    | Test Set | EER (%) | Threshold | MinDCF-0.01 | MinDCF-0.001 | Date |
    |----------|---------|-----------|-------------|--------------|------|
    | vox1 test | 2.3542 | 0.2698025 | 0.2192 | 0.2854 | 20210426 |
    | sitw dev | 2.8109 | 0.2630014 | 0.2466 | 0.4026 | 20210515 |
    | sitw eval | 3.2531 | 0.2642460 | 0.2984 | 0.4581 | 20210515 |
    | cnceleb test | 16.8276 | 0.2165570 | 0.6923 | 0.8009 | 20210515 |
    | aishell2 test | 10.8300 | 0.2786811 | 0.8212 | 0.9527 | 20210515 |
    | aidata test | 10.0972 | 0.2952531 | 0.7859 | 0.9520 | 20210515 |

Related

This project originally started as a fork of qqueing/DeepSpeaker-pytorch.

Reference

[1] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey and S. Khudanpur, "X-Vectors: Robust DNN Embeddings for Speaker Recognition," 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 2018, pp. 5329-5333.

[2] W. Cai, J. Chen and M. Li, "Analysis of Length Normalization in End-to-End Speaker Verification System," Interspeech 2018, pp. 3618-3622.