bashbaha/speakergan: Unofficial implementation of the paper SpeakerGAN: Speaker identification with conditional generative adversarial network

Introduction

This repository is an unofficial implementation of the paper SpeakerGAN, by Mingming Huang (dyyzhmm@163.com) and Tiezheng Wang (wtz920729@163.com). Thanks to TongFeng for advice.

SpeakerGAN paper

SpeakerGAN: Speaker identification with conditional generative adversarial network, by Liyang Chen, Yifeng Liu, Wendong Xiao, Yingxue Wang, Haiyong Xie.

Usage

Step 1: VAD preprocessing.

$ python vad.py filelist_with_absolute_path   # writes each VAD-processed file to the same directory, with '_vad' added to the filename
$ cat filelist_with_absolute_path
/datasdc/librispeech/train-clean-100/458/126305/458-126305-0041.wav
/datasdc/librispeech/train-clean-100/4051/11218/4051-11218-0009.wav
/datasdc/librispeech/train-clean-100/7635/105409/7635-105409-0022.wav
...
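The VAD step could look roughly like the sketch below. It is a minimal reconstruction, not the contents of vad.py: it assumes the `webrtcvad` Python package in mode 3 (the mode listed under the differences from the paper), 16-bit mono PCM wav input, 30 ms frames, and that the output name inserts `_vad` before the `.wav` extension. The helper names `vad_output_path`, `split_frames`, `keep_voiced` and `process_filelist` are hypothetical.

```python
import os
import wave
from typing import Callable, Iterator


def vad_output_path(path: str) -> str:
    """Assumed naming rule: '/dir/x.wav' -> '/dir/x_vad.wav' (same directory)."""
    root, ext = os.path.splitext(path)
    return f"{root}_vad{ext}"


def split_frames(pcm: bytes, sample_rate: int, frame_ms: int = 30) -> Iterator[bytes]:
    """Yield fixed-length frames of 16-bit mono PCM; a trailing partial frame is dropped."""
    frame_bytes = sample_rate * frame_ms // 1000 * 2  # 2 bytes per 16-bit sample
    for start in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        yield pcm[start:start + frame_bytes]


def keep_voiced(pcm: bytes, sample_rate: int,
                is_speech: Callable[[bytes, int], bool], frame_ms: int = 30) -> bytes:
    """Concatenate only the frames that the VAD callable classifies as speech."""
    return b"".join(frame for frame in split_frames(pcm, sample_rate, frame_ms)
                    if is_speech(frame, sample_rate))


def process_filelist(filelist: str, mode: int = 3) -> None:
    """Run VAD on every wav listed (one absolute path per line) and write *_vad.wav files."""
    import webrtcvad  # third-party; imported here so the helpers above work without it

    vad = webrtcvad.Vad(mode)  # mode 3 is the most aggressive setting
    with open(filelist) as fh:
        paths = [line.strip() for line in fh if line.strip()]
    for path in paths:
        with wave.open(path, "rb") as src:
            if src.getnchannels() != 1 or src.getsampwidth() != 2:
                raise ValueError(f"{path}: expected 16-bit mono PCM")
            sample_rate = src.getframerate()
            pcm = src.readframes(src.getnframes())
        voiced = keep_voiced(pcm, sample_rate, vad.is_speech)
        with wave.open(vad_output_path(path), "wb") as dst:
            dst.setnchannels(1)
            dst.setsampwidth(2)
            dst.setframerate(sample_rate)
            dst.writeframes(voiced)
```

Usage would be `process_filelist("filelist_with_absolute_path")`. Non-speech frames are simply dropped, and the VAD is passed to `keep_voiced` as a callable so the framing logic can be checked without `webrtcvad` installed.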

Step 2: train / test / generate:

$ python speakergan.py   # you may need to change the path to the VAD-preprocessed wav files

Training took us about 65 hours on a single NVIDIA Ampere A100 GPU, with the help of a Redis cache.

Our results

Accuracy: 98.1955% on the test set, evaluated on a fixed crop of the first 1.6 seconds of each test utterance, using model/2200_D.pkl.
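The fixed 1.6 s evaluation amounts to scoring only the first 160 frames of each test utterance, assuming a 10 ms frame shift so that 160 frames × 10 ms = 1.6 s, which matches the (160, 64) input shape. A minimal sketch of that scoring, where `predict` is a hypothetical stand-in for a forward pass through the discriminator loaded from model/2200_D.pkl that returns one posterior per speaker:

```python
from typing import Callable, Sequence

import numpy as np

N_FRAMES = 160  # 1.6 s at an assumed 10 ms frame shift


def first_segment(feats: np.ndarray, n_frames: int = N_FRAMES) -> np.ndarray:
    """Keep the first n_frames rows of a (frames, 64) fbank matrix."""
    return feats[:n_frames]


def accuracy(utterances: Sequence[np.ndarray], labels: Sequence[int],
             predict: Callable[[np.ndarray], np.ndarray]) -> float:
    """Fraction of utterances whose highest-scoring speaker equals the true label."""
    correct = 0
    for feats, label in zip(utterances, labels):
        probs = predict(first_segment(feats))  # shape: (n_speakers,)
        correct += int(np.argmax(probs) == label)
    return correct / len(labels)
```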

Samples generated by the Generator (model/2200_G.pkl):

Details of paper

The following are details about this paper.

================ input ==================

feature: fbank, 8000 Hz, 25 ms frames, 10 ms frame shift; shape (160, 64), i.e. 160 frames × 64 filterbank channels
dataset: LibriSpeech train-clean-100, POI: 251 speakers
data preprocessing: VAD, mean and variance normalization, shuffling.
split: 60% train, 40% test.
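A quick check that these settings fit the (160, 64) shape: with 25 ms windows stepped every 10 ms, 160 frames span 25 + 159 × 10 = 1615 ms, or 12920 samples at 8000 Hz, i.e. about 1.6 s. The sketch below encodes that arithmetic plus the normalization and split. Assumptions: normalization is per utterance and per filterbank channel, and the 60/40 split is taken within each speaker (closed-set identification over the 251 speakers needs every speaker in both sets). All function names are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

import numpy as np

SAMPLE_RATE = 8000                      # Hz, as listed above
FRAME_LEN = SAMPLE_RATE * 25 // 1000    # 25 ms -> 200 samples
FRAME_SHIFT = SAMPLE_RATE * 10 // 1000  # 10 ms -> 80 samples (assumed hop)


def num_frames(n_samples: int) -> int:
    """Frames produced without padding: one per full 25 ms window, stepping 10 ms."""
    return 1 + (n_samples - FRAME_LEN) // FRAME_SHIFT


def mvn(feats: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Per-utterance mean/variance normalization of a (frames, channels) matrix."""
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + eps)


def split_per_speaker(items: List[Tuple[str, int]], train_frac: float = 0.6,
                      seed: int = 0) -> Tuple[List[Tuple[str, int]], List[Tuple[str, int]]]:
    """Shuffle each speaker's utterances and cut them 60/40, so every speaker is in both sets."""
    by_speaker: Dict[int, List[str]] = defaultdict(list)
    for path, speaker in items:
        by_speaker[speaker].append(path)
    rng = random.Random(seed)
    train, test = [], []
    for speaker, paths in by_speaker.items():
        rng.shuffle(paths)
        cut = round(len(paths) * train_frac)
        train += [(p, speaker) for p in paths[:cut]]
        test += [(p, speaker) for p in paths[cut:]]
    return train, test
```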

================ model architecture ==================

dataflow: data -> feature extraction -> G & D
model architecture:

G: gated CNN, encoder-decoder, Huber loss + adversarial loss

D: ResNet blocks, temporal average pooling, FC, softmax, cross-entropy loss + adversarial loss
G: shuffler layer, GLU
D: ReLU
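The operations named in this list can be illustrated with NumPy on single (channels, time) arrays. This is a sketch of the operations only, not the repository's PyTorch modules. Assumptions: GLU splits the channels in half and gates one half with the sigmoid of the other (the same definition as `torch.nn.functional.glu`), the "shuffler layer" is a 1D pixel shuffler that trades channels for time resolution during upsampling, and D's pooling averages over the time axis before the FC layer and softmax.

```python
import numpy as np


def glu(x: np.ndarray, axis: int = 0) -> np.ndarray:
    """Gated linear unit: split channels into (a, b) and return a * sigmoid(b)."""
    a, b = np.split(x, 2, axis=axis)
    return a * (1.0 / (1.0 + np.exp(-b)))


def pixel_shuffle_1d(x: np.ndarray, r: int) -> np.ndarray:
    """(C*r, T) -> (C, T*r): input channel c*r + i fills output time steps t*r + i."""
    cr, t = x.shape
    c = cr // r
    return x.reshape(c, r, t).transpose(0, 2, 1).reshape(c, t * r)


def temporal_avg_pool(x: np.ndarray) -> np.ndarray:
    """(C, T) -> (C,): average each channel over time, giving a fixed-size vector."""
    return x.mean(axis=1)


def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a vector of speaker logits."""
    e = np.exp(z - z.max())
    return e / e.sum()
```

Temporal average pooling is what lets D map a variable-length feature sequence to a fixed-size vector for the FC layer, while the shuffler lets G's decoder upsample in time without transposed convolutions.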

================ training ==================

lr: 0-9, 0.0005 | 9-49, 0.0002
L(D): λ1 = λ2 = 1
batch_size: 128   # differs from the paper
epoch: 2200   # differs from the paper
D_train steps / G_train steps = 4
L_adv: label smoothing, target 1 -> 0.7 ~ 1.0, target 0 -> 0.0 ~ 0.3
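The training settings can be read as the schedule below. Assumptions, since the list does not spell them out: the lr ranges are epochs and half-open (epoch 9 belongs to the second range), epochs past 49 keep 0.0002 because no later rate is given, the smoothed targets are drawn uniformly from the stated intervals, L(D) is a weighted sum of its adversarial and cross-entropy terms, and "= 4" means four D updates per G update. `d_step` and `g_step` are hypothetical callbacks, not functions from speakergan.py.

```python
import random
from typing import Callable

D_STEPS_PER_G_STEP = 4  # D_train steps / G_train steps


def learning_rate(epoch: int) -> float:
    """Assumed half-open ranges: [0, 9) -> 5e-4, then 2e-4 from epoch 9 onward."""
    return 0.0005 if epoch < 9 else 0.0002


def smoothed_label(target: int, rng: random.Random) -> float:
    """Label smoothing for L_adv: target 1 -> U(0.7, 1.0), target 0 -> U(0.0, 0.3)."""
    return rng.uniform(0.7, 1.0) if target == 1 else rng.uniform(0.0, 0.3)


def d_loss(adv: float, ce: float, lam1: float = 1.0, lam2: float = 1.0) -> float:
    """Assumed weighted-sum form of L(D); with lambda1 = lambda2 = 1 it is a plain sum."""
    return lam1 * adv + lam2 * ce


def train(n_epochs: int, batches_per_epoch: int,
          d_step: Callable[[float], None], g_step: Callable[[float], None]) -> None:
    """Update D on every batch and G on every 4th batch, using the epoch's learning rate."""
    for epoch in range(n_epochs):
        lr = learning_rate(epoch)
        for it in range(batches_per_epoch):
            d_step(lr)
            if (it + 1) % D_STEPS_PER_G_STEP == 0:
                g_step(lr)
```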

======== uncertainties and differences from the paper ========

weight and bias initialization: we use xavier_uniform and zeros, respectively.
Huber loss: we use PyTorch's huber_loss.
shorter wav files: the paper pads with zeros; we pad by repeating the features.
gated CNN architecture: our structure may differ from the paper.
VAD: we use webrtcvad mode 3 for VAD preprocessing.
Paper error 1: we think a plus sign is missing in formula (5).
Paper error 2: we think the structure of conv6 in the Generator is wrong; its number of output channels should be 64.
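Some of these choices are simple to state in code. A NumPy sketch, assuming "pad by repeating the features" means tiling an utterance's own frames until the fixed length is reached; `huber_loss` mirrors the default behaviour of PyTorch's `torch.nn.functional.huber_loss` (delta = 1.0, mean reduction), and `xavier_uniform` uses the bound of `torch.nn.init.xavier_uniform_` with gain 1 for a 2-D (fan_out, fan_in) weight. All three function names are hypothetical.

```python
import numpy as np


def pad_by_repeat(feats: np.ndarray, n_frames: int = 160) -> np.ndarray:
    """Tile a (frames, channels) matrix along time, then cut to n_frames (paper: zero padding)."""
    reps = -(-n_frames // len(feats))  # ceiling division
    return np.tile(feats, (reps, 1))[:n_frames]


def huber_loss(pred: np.ndarray, target: np.ndarray, delta: float = 1.0) -> float:
    """Quadratic for |error| <= delta, linear beyond; averaged over all elements."""
    err = np.abs(pred - target)
    loss = np.where(err <= delta, 0.5 * err ** 2, delta * (err - 0.5 * delta))
    return float(loss.mean())


def xavier_uniform(fan_out: int, fan_in: int, rng: np.random.Generator,
                   gain: float = 1.0) -> np.ndarray:
    """Sample a (fan_out, fan_in) weight from U(-a, a), a = gain * sqrt(6 / (fan_in + fan_out))."""
    bound = gain * np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-bound, bound, size=(fan_out, fan_in))
```

Biases would then simply be `np.zeros(fan_out)`. Compared with zero padding, repeating the features keeps every input frame drawn from real speech, at the cost of an artificial loop in the time axis.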
