
Two-Dimensional Convolutional Recurrent Neural Networks for Speech Activity Detection

Created by Eleftherios Fanioudakis and Anastasios Vafeiadis

Introduction

This repository contains the code for Task 1 (Speech Activity Detection) of the Fearless Steps Challenge. More details about the challenge can be found at Fearless Steps.

You can also check our paper, accepted at INTERSPEECH 2019: Two-Dimensional Convolutional Recurrent Neural Networks for Speech Activity Detection.

Explanation of the Speech Activity Detection (SAD) Task

Four system output possibilities are considered:

  • True Positive (TP) – the system correctly identifies the start-stop times of speech segments compared to the reference (manual annotation),
  • True Negative (TN) – the system correctly identifies the start-stop times of non-speech segments compared to the reference,
  • False Positive (FP) – the system incorrectly identifies speech in a segment where the reference identifies the segment as non-speech, and
  • False Negative (FN) – the system fails to identify speech in a segment where the reference identifies the segment as speech.

SAD error rates measure the amount of time that is misclassified by the system's segmentation of the test audio files. Missing, or failing to detect, actual speech is considered a more serious error than misidentifying its start and end times.

The following link explains the Decision Cost Function (DCF) metric, as well as the '.txt' output file format: Evaluation Plan. In particular, see pages 14-16 and 25.
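As a rough illustration of how the four outcomes above feed the metric, the sketch below computes a frame-level miss rate, false-alarm rate, and weighted DCF from per-sample label arrays. The function name sad_dcf is hypothetical, and the 0.75/0.25 weights are an assumption in the style of the challenge's evaluation plan; verify the exact weighting against pages 14-16 of the plan.

```python
import numpy as np

def sad_dcf(reference, hypothesis, w_fn=0.75, w_fp=0.25):
    """Frame-level Decision Cost Function sketch for SAD.

    reference, hypothesis: 1-D arrays of per-sample labels
    (0 = non-speech, 1 = speech). The weights are assumed from
    the evaluation plan and should be double-checked there.
    """
    ref = np.asarray(reference, dtype=bool)
    hyp = np.asarray(hypothesis, dtype=bool)

    fn = np.sum(ref & ~hyp)           # missed speech samples (FN)
    fp = np.sum(~ref & hyp)           # false-alarm samples (FP)
    speech = np.sum(ref)              # total reference speech samples
    nonspeech = ref.size - speech     # total reference non-speech samples

    p_fn = fn / speech if speech else 0.0
    p_fp = fp / nonspeech if nonspeech else 0.0
    return w_fn * p_fn + w_fp * p_fp
```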

Explanation of the Python Scripts

Library Prerequisites

extract_sad.py

This script splits the 30-minute training and evaluation recordings into 1 s chunks (8000 samples at 8 kHz). We treat the task as a multi-label problem: although there are only two classes (0: non-speech and 1: speech), each 1 s wav file receives 8000 per-sample labels. The script saves a NumPy array for each 1 s chunk together with a corresponding NumPy array of its labels. Run it as python extract_sad.py train for the Train files and python extract_sad.py test for the Eval files.
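A minimal sketch of this chunking step is shown below. It is not the repository's actual implementation: the function name, file layout, and the assumption that labels arrive as a per-sample 0/1 array (rasterized from the reference start/stop annotations) are all illustrative.

```python
import numpy as np
from scipy.io import wavfile

def chunk_recording(wav_path, labels, out_prefix, sr=8000):
    """Split one long recording into 1 s chunks of `sr` samples,
    each paired with its per-sample 0/1 speech labels.

    labels: 1-D 0/1 array aligned sample-by-sample with the audio.
    Output filenames below are hypothetical placeholders.
    """
    rate, audio = wavfile.read(wav_path)
    assert rate == sr, "expected 8 kHz audio"

    n_chunks = len(audio) // sr
    for i in range(n_chunks):
        seg = audio[i * sr:(i + 1) * sr]       # 8000 audio samples
        lab = labels[i * sr:(i + 1) * sr]      # 8000 matching labels
        np.save("{}_{:05d}_audio.npy".format(out_prefix, i), seg)
        np.save("{}_{:05d}_labels.npy".format(out_prefix, i), lab)
```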

audio_to_spectrograms.py

This script creates a 129x126 grayscale spectrogram image for each 1 s wav file produced by extract_sad.py. These spectrogram images are used as input to our 2D CRNN.
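One STFT parameterization that reproduces the 129x126 shape is n_fft=256 (129 frequency bins) with hop_length=64 (126 centered frames over 8000 samples). The sketch below uses those assumed parameters with librosa; they yield the quoted shape but are not necessarily the repository's exact settings.

```python
import numpy as np
import librosa

def wav_to_spectrogram(chunk, n_fft=256, hop_length=64):
    """Log-magnitude spectrogram of a 1 s, 8000-sample chunk.

    n_fft=256 gives 129 frequency bins; hop_length=64 with
    centered frames gives 126 time steps, i.e. a 129x126 image.
    These parameters are assumptions that match the stated shape.
    """
    stft = librosa.stft(chunk.astype(np.float32),
                        n_fft=n_fft, hop_length=hop_length)
    spec = np.log1p(np.abs(stft))          # compress dynamic range
    spec = spec / (spec.max() + 1e-8)      # normalize to [0, 1] grayscale
    return spec                            # shape: (129, 126)
```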

Hardware

The proposed algorithm was trained on an Intel Core i5-7600K (4 cores, 4 threads) clocked at 4.2 GHz. The GPU was an Nvidia GTX 1080 Ti Founders Edition with 11 GB of GDDR5X memory, 3584 CUDA cores, and 11.34 TFLOPS of FP32 performance. The PC had 32 GB of DDR4 RAM, and the algorithm was developed in Keras 2.2.4 with the TensorFlow 1.13.1 backend, CUDA 10.0, and cuDNN 7.5.
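For orientation, here is a minimal Keras 2D CRNN in the spirit of the title: convolutional blocks summarize the frequency axis of the 129x126 spectrogram, a recurrent layer models time, and a sigmoid dense layer emits 8000 per-sample speech posteriors. Layer counts and sizes are placeholders; for the actual architecture, see the paper.

```python
from keras.models import Sequential
from keras.layers import (Conv2D, MaxPooling2D, BatchNormalization,
                          Permute, Reshape, Bidirectional, GRU, Dense)

def build_crnn(input_shape=(129, 126, 1), n_outputs=8000):
    """Illustrative 2D CRNN sketch, not the paper's exact model."""
    model = Sequential([
        Conv2D(32, (3, 3), activation='relu', padding='same',
               input_shape=input_shape),
        MaxPooling2D((4, 1)),                  # pool frequency only
        BatchNormalization(),
        Conv2D(64, (3, 3), activation='relu', padding='same'),
        MaxPooling2D((4, 1)),
        BatchNormalization(),
        Permute((2, 1, 3)),                    # -> (time, freq, channels)
        Reshape((126, -1)),                    # time steps x features
        Bidirectional(GRU(64)),                # summarize the sequence
        Dense(n_outputs, activation='sigmoid') # one label per audio sample
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy')
    return model
```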
