CorticoAI/denoising_DIHARD18

A quick-use package for speech enhancement based on our DIHARD18 system

Original founder: @staplesinLA

Major contributors: @nryant, @mmmaat (many thanks!)

The repository provides tools to reproduce the enhancement results of the speech pre-processing part of our DIHARD18 system [1]. The deep-learning-based denoising model is trained on 400 hours of English and Mandarin audio; for full details see [1, 2, 3]. Currently the tools accept 16 kHz, 16-bit, single-channel WAV files, so please convert your audio to this format in advance.
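
If your recordings are not already in that format, a minimal conversion sketch using librosa and scipy (both installed in step 2 below) might look like the following; the file names are placeholders.

    # Convert an audio file to 16 kHz, 16-bit, mono WAV (file names are placeholders).
    import librosa
    import numpy as np
    from scipy.io import wavfile

    audio, sr = librosa.load("input_44k_stereo.wav", sr=16000, mono=True)  # resample + downmix
    pcm16 = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16)           # float [-1, 1] -> int16
    wavfile.write("output_16k_mono.wav", sr, pcm16)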

Additionally, this package integrates a voice activity detection (VAD) module based on py-webrtcvad, which provides a Python interface to the WebRTC VAD. The default parameters are tuned on the development set of DIHARD18.
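
The run_eval.sh script described below drives the VAD for you, but as a rough illustration of how py-webrtcvad is called on a 16 kHz, 16-bit mono signal, a minimal sketch could look like this (the file name and frame length are only examples):

    # Minimal py-webrtcvad sketch: label each 30 ms frame as speech / non-speech.
    import webrtcvad
    from scipy.io import wavfile

    sample_rate, samples = wavfile.read("example_16k_mono.wav")  # int16 samples expected
    vad = webrtcvad.Vad(3)            # aggressiveness mode, 0 (least) to 3 (most)

    frame_ms = 30                     # WebRTC VAD accepts 10, 20 or 30 ms frames
    frame_len = int(sample_rate * frame_ms / 1000)

    labels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len].tobytes()       # raw 16-bit PCM bytes
        labels.append(vad.is_speech(frame, sample_rate))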

[1] Sun, Lei, et al. "Speaker Diarization with Enhancing Speech for the First DIHARD Challenge." Proc. Interspeech 2018 (2018): 2793-2797.

[2] Gao, Tian, et al. "Densely Connected Progressive Learning for LSTM-based Speech Enhancement." 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.

[3] Sun, Lei, et al. "Multiple-target Deep Learning for LSTM-RNN Based Speech Enhancement." 2017 Hands-free Speech Communications and Microphone Arrays (HSCMA). IEEE, 2017.

Main Prerequisites

  • Python with pip
  • Open MPI (openmpi-bin, required by CNTK)
  • CNTK (cntk-gpu)
  • numpy, scipy, librosa
  • py-webrtcvad
  • wurlitzer
  • joblib

How to use it?

  1. Download the speech enhancement repository:

     git lfs clone https://github.com/staplesinLA/denoising_DIHARD18.git
    
  2. Install all dependencies (note that you need Python and pip already installed on your system):

     sudo apt-get install openmpi-bin
     pip install numpy scipy librosa
     pip install cntk-gpu
     pip install webrtcvad
     pip install wurlitzer
     pip install joblib
    

    Make sure the CNTK engine was installed successfully by querying its version (a fuller device check is sketched after this list):

     python -c "import cntk; print(cntk.__version__)"
    
  3. Move to the directory:

     cd ./denoising_DIHARD18
    
  4. Specify parameters in run_eval.sh:

    • For the speech enhancement tool:

        WAV_DIR=<path to original wavs>
        SE_WAV_DIR=<path to output dir>
        USE_GPU=<true|false, if false use CPU, default=true>
        GPU_DEVICE_ID=<GPU device id on your machine, default=0>
        TRUNCATE_MINUTES=<audio chunk length in minutes, default=10>
      

      We recommend using a GPU for decoding as it is much faster than a CPU. If decoding fails with a CUDA out-of-memory error, reduce the value of TRUNCATE_MINUTES (the chunking idea behind this parameter is sketched after this list).

    • For the VAD tool:

        VAD_DIR=<path to output dir>
        HOPLENGTH=<duration in milliseconds of VAD frame size, default=30>
        MODE=<WebRTC aggressiveness, default=3>
        NJOBS=<number of parallel processes, default=1>
      
  5. Execute run_eval.sh:

     ./run_eval.sh
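
Beyond the version check in step 2, you can also confirm that CNTK actually sees your GPU before setting USE_GPU and GPU_DEVICE_ID in step 4. A minimal sketch (device id 0 is only an example):

    # List the devices CNTK can use and optionally pin computation to one GPU.
    # The device id is an example; use the one you plan to set as GPU_DEVICE_ID.
    import cntk
    from cntk.device import all_devices, gpu, try_set_default_device

    print(cntk.__version__)
    for dev in all_devices():              # shows the CPU and any visible GPUs
        print(dev.type(), dev.id())

    try_set_default_device(gpu(0))         # returns False if GPU 0 cannot be used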
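
As for TRUNCATE_MINUTES, the idea is simply that long recordings are decoded in fixed-length chunks so the GPU memory footprint stays bounded. A rough numpy illustration of that chunking idea (not the repository's actual implementation; enhance() is a hypothetical stand-in for the per-chunk decoding):

    # Rough illustration of the chunking idea behind TRUNCATE_MINUTES: process a
    # long 16 kHz waveform in fixed-length pieces so GPU memory use stays bounded.
    # Not the repository's implementation; enhance() is a hypothetical stand-in.
    import numpy as np

    def iter_chunks(samples, sample_rate=16000, truncate_minutes=10):
        chunk_len = int(truncate_minutes * 60 * sample_rate)
        for start in range(0, len(samples), chunk_len):
            yield samples[start:start + chunk_len]

    # e.g. enhanced = np.concatenate([enhance(chunk) for chunk in iter_chunks(wav)])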
    

Use within Docker

  1. Install Docker

  2. Install nvidia-docker, a plugin that lets you use your GPUs within Docker

  3. Build the image using the provided Dockerfile:

     docker build -t dihard18 .
    
  4. Run the evaluation script within docker with the following commands:

     docker run -it --rm --runtime=nvidia -v /abs/path/to/dihard/data:/data dihard18 /bin/bash
     # you are now inside the Docker container
     ./run_eval.sh  # before launching the script, you can edit it to modify the parameters
    
    • The option --runtime=nvidia enables the use of GPUs within Docker.

    • The option -v /abs/path/to/dihard/data:/data mounts the folder where the data are stored into the container at /data. The directory /abs/path/to/dihard/data must contain a wav/ subdirectory. The results will be stored in the directories wav_pn_enhanced/ and vad/.

Details

  1. Speech enhancement model

    The scripts accept 16 kHz, 16-bit, single-channel WAV files; please convert your audio to this format in advance. The input feature is the log-power spectrum (LPS), which makes it easy to rebuild the waveform. The model has two outputs, an ideal ratio mask ("IRM") and an enhanced "LPS"; the final enhancement uses the "IRM" target, which directly applies a mask to the original speech. Compared with the "LPS" output, it yields better speech intelligibility and fewer distortions. A rough sketch of this masking step is given after this list.

  2. VAD module

    The optional parameters of the WebRTC VAD are the aggressiveness mode (default=3) and the hop length (default=30 ms). The default settings were tuned on the development set of the First DIHARD challenge. For the development set, here is the comparison between original speech and processed speech in terms of VAD metrics:

    VAD (default, %)   Original_Dev   Processed_Dev
    Miss               11.85          7.21
    FA                 6.12           6.17
    Total              17.97          13.38

    And the performance on the evaluation set:

    VAD (default, %)   Original_Eval   Processed_Eval
    Miss               17.49           8.89
    FA                 6.36            6.40
    Total              23.85           15.29
  3. Effectiveness

    The contribution of a single sub-module to the final speaker diarization performance is difficult to analyze in isolation. However, it can be seen clearly that the enhancement-based pre-processing benefits at least the VAD performance. Users can also tune the default VAD parameters to obtain the desired trade-off between Miss and False Alarm rates.
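
To make the "IRM" output described in item 1 above a bit more concrete, here is a rough sketch of mask-based enhancement: a time-frequency mask is applied to the noisy magnitude spectrum and the waveform is rebuilt with the noisy phase. The mask below is a placeholder for what the network would actually predict, and the STFT settings and file name are arbitrary:

    # Rough sketch of mask-based enhancement: apply a time-frequency mask to the
    # noisy magnitude spectrum and resynthesize with the noisy phase.
    # predicted_irm is a placeholder for the network's IRM output.
    import librosa
    import numpy as np

    noisy, sr = librosa.load("noisy_16k.wav", sr=16000, mono=True)

    n_fft, hop = 512, 256
    spec = librosa.stft(noisy, n_fft=n_fft, hop_length=hop)       # complex spectrogram
    mag, phase = np.abs(spec), np.angle(spec)

    predicted_irm = np.ones_like(mag)                             # placeholder mask in [0, 1]
    enhanced_spec = (predicted_irm * mag) * np.exp(1j * phase)    # mask magnitude, keep phase
    enhanced = librosa.istft(enhanced_spec, hop_length=hop, length=len(noisy))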
