This repository contains the code corresponding to the following paper. If you use this code or part of it, please cite:

Andres Perez-Lopez, Eduardo Fonseca, Xavier Serra, "A Hybrid Parametric-Deep Learning Approach for Sound Event Localization and Detection", Submitted to DCASE2019 Challenge.


The method implemented represents a novel approach for the Sound Event Localization and Detection (SELD) task, which is Task 3 of DCASE2019 Challenge. We use the TAU Spatial Sound Events 2019 - Ambisonic dataset, which provides First-Order Ambisonic (FOA) recordings. For more details about the task setup and dataset, please check the corresponding DCASE website.

The method implemented is based on four systems: DOA estimation, association, beamforming and classification, as shown in the following Figure. The first three systems are conceptually grouped into the so-called frontend, while the classification stage constitutes the backend.

[Figure: system architecture]

  • parametric frontend:

    1. DOA estimation: The input data is preprocessed by a parametric spatial audio analysis, which yields time-frequency DOA estimations. A schematic representation of the method is shown in the following Figure.

    [Figure: DOA system]

    2. Association: The spatial-temporal information is grouped into _events_, each with specific onset/offset times. The next Figure depicts the algorithm.

    [Figure: association system]

    3. Beamforming: Given the angular position and the temporal context, each event is segmented into a monophonic signal by beamforming on the input ambisonic scene.
  • classification backend: The estimated event signals are finally labelled by a multi-class CRNN, illustrated in the next Figure.

[Figure: classification backend]
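As a rough illustration of how the first two frontend stages fit together, the sketch below estimates per-frame DOAs from a FOA recording via the time-averaged pseudo-intensity vector, then groups temporally contiguous, spatially consistent frames into events. This is only an illustrative sketch, not the repository's algorithm: the ACN channel order (W, Y, Z, X), SN3D normalization, frame sizes and all thresholds are assumptions.

```python
import numpy as np

def frame_doas(foa, frame_len=2048, hop=1024):
    """Per-frame DOA estimates from a broadband FOA signal.

    foa: (4, n_samples) array, assumed ACN/SN3D order (W, Y, Z, X).
    Returns (azimuth, elevation) in degrees per frame, from the
    time-averaged pseudo-intensity vector I = mean(W * [X, Y, Z]).
    """
    w, y, z, x = foa
    doas = []
    for start in range(0, foa.shape[1] - frame_len + 1, hop):
        sl = slice(start, start + frame_len)
        ix = np.mean(w[sl] * x[sl])
        iy = np.mean(w[sl] * y[sl])
        iz = np.mean(w[sl] * z[sl])
        azi = np.degrees(np.arctan2(iy, ix))
        ele = np.degrees(np.arctan2(iz, np.hypot(ix, iy)))
        doas.append((azi, ele))
    return np.array(doas)

def associate(doas, hop_s, angle_thr=20.0, min_frames=3):
    """Naive association: group consecutive frames whose DOA stays within
    angle_thr degrees into events with onset/offset times. Note this
    simple per-coordinate check ignores azimuth wraparound at +/-180."""
    events = []
    start = 0
    for i in range(1, len(doas) + 1):
        if i == len(doas) or np.abs(doas[i] - doas[i - 1]).max() > angle_thr:
            if i - start >= min_frames:
                events.append({"onset": start * hop_s,
                               "offset": i * hop_s,
                               "doa": doas[start:i].mean(axis=0)})
            start = i
    return events
```

For a plane wave on the horizontal plane, X is proportional to cos(azi) times W and Y to sin(azi) times W, so the intensity estimate recovers the azimuth directly.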

Please, refer to the publication for a more detailed explanation of the method, including evaluation metrics and discussion.

Implementation and Usage

The SELD task is internally implemented in two different stages. First, the parametric frontend estimates DOAs and audio segments from the dataset and writes the results as csv files; second, the classification backend labels the resulting segments. The same code is used for both development and evaluation sets.

The signals estimated by the frontend are organized as an intermediate dataset of monophonic sound events (although some leakage is expected in some of them). Two versions of this intermediate dataset must be computed:

  • using an ideal frontend. Audio clips are obtained using the ground truth DOAs and onset/offset times as inputs to the beamformer; hence beamforming is the only processing carried out. Use preprocess_metadata_files.py for this. Files are stored in data/mono_data/wav/dev/

  • using the proposed complete frontend. Use the corresponding frontend script for this. Files are stored in data/mono_data/wav/dev_param_Q/
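For intuition about the beamforming step shared by both versions, a basic signal-independent first-order beamformer pointed at a given direction can be written as a weighted sum of the four ambisonic channels. This is a minimal sketch, not necessarily the beamformer used in the repository; the ACN/SN3D channel convention is an assumption.

```python
import numpy as np

def foa_beamform(foa, azi_deg, ele_deg):
    """Steer a first-order virtual-cardioid beam toward (azi, ele) and
    return a mono estimate of the source.

    foa: (4, n_samples) FOA signal, assumed ACN/SN3D order (W, Y, Z, X);
    adapt the channel order to your data if it differs.
    """
    azi, ele = np.radians(azi_deg), np.radians(ele_deg)
    # Unit vector toward the source, matching the SN3D dipole patterns.
    ux = np.cos(azi) * np.cos(ele)
    uy = np.sin(azi) * np.cos(ele)
    uz = np.sin(ele)
    w, y, z, x = foa
    # 0.5 * (W + u . [X, Y, Z]) has unity gain on-axis and a null at
    # the opposite direction (cardioid pattern).
    return 0.5 * (w + ux * x + uy * y + uz * z)
```

For an SN3D-encoded plane wave arriving exactly from the steering direction, the weighted sum reproduces the source signal at unity gain, while a source from the opposite direction is cancelled.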

These scripts require the development set (400 wav files of TAU Spatial Sound Events 2019 - Ambisonic) to be placed in data/foa_dev/, and the accompanying metadata to be placed in data/metadata_dev/. The outcome of these scripts is:

  • the deterministic list of audio clips stored, and
  • the corresponding ground truth csv files to access the clips subsequently (namely gt_dev.csv and gt_dev_parametric_Q.csv), which must be placed in data/mono_data for further processing.

During development, the classification backend is trained using the provided four-fold cross-validation setup on the development set. The CRNN is always trained on the outcome of the ideal frontend, and tested on the outcomes of both the ideal frontend and the proposed complete frontend. See classif/ for more details. You can run it with

CUDA_VISIBLE_DEVICES=0 KERAS_BACKEND=tensorflow python -p params.yaml &> output_logs.out

During evaluation mode (i.e., challenge submission), the CRNN is trained on the entire development set processed by the ideal frontend. Then, we predict on the evaluation set previously processed by the proposed complete frontend.

Results are written in two different csv conventions: metadata and output. Metadata files follow the ground truth convention: each row corresponds to a sound event, and information about class, onset/offset time (in seconds), and angular position is provided. Conversely, output files are frame-based (with a row for each frame with activity), and thus the onset/offset information is not explicitly stated.
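The relationship between the two conventions can be sketched as an expansion of event-based rows into frame-based rows. The 20 ms frame length and the dict schema below ('class', 'onset', 'offset', 'azi', 'ele') are assumptions standing in for the actual challenge csv columns.

```python
FRAME_S = 0.02  # assumed 20 ms frame hop

def metadata_to_output(events):
    """Expand event-based (metadata-style) rows into frame-based
    (output-style) rows.

    events: iterable of dicts with hypothetical keys 'class',
    'onset', 'offset' (seconds), 'azi', 'ele' (degrees).
    Returns a sorted list of (frame_index, class, azi, ele) rows,
    one per active frame; onset/offset become implicit.
    """
    rows = []
    for ev in events:
        first = int(round(ev["onset"] / FRAME_S))
        last = int(round(ev["offset"] / FRAME_S))
        for frame in range(first, last):
            rows.append((frame, ev["class"], ev["azi"], ev["ele"]))
    return sorted(rows)
```

Overlapping events simply contribute multiple rows for the same frame index, which is why the frame-based format handles polyphony naturally.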

In evaluation mode, DOA estimations are written in the corresponding results folder (results_metadata or results_output), within the /doa folder. These first annotations do not contain the classification estimation. Then, after running the deep learning classification backend, the final result files are written into the /classif folder. Therefore, the result files following the required challenge format are stored in results_output/[DATASET_MODE_PRESET]/classif.

Results reported in the paper correspond to the configuration "Q". Therefore, the result files can be found in results_output/foa_[MODE]_Q/classif.

Project Structure


Code for DOA and utils:

  • Main DOA estimation loop. Based on the provided parameter configuration, it iterates over the files computing localization and segmentation. As the process output, result files (in both output and metadata formats) are written in the corresponding folders.
  • Script to compute DOA metrics on the analysis results. Adapted from seld_dcase2019_master/.
  • The source localization and segmentation functions.
  • Some handy functions to read/write files in the required formats.
  • The different run configurations, described as a dictionary. There is only one method, get_params(preset_string=None), which defines all possible values that parameters can take when running the algorithm. Different configuration presets can be identified by the preset_string.
  • The code to extract monophonic estimates of the sources given the development dataset and the ground truth. Beamforming is the only processing carried out (ideal frontend). Files are stored in data/mono_data/wav/dev/.
  • The code to extract monophonic estimates of the sources given the development dataset WITHOUT any ground truth. The complete proposed frontend is applied. Files are stored in data/mono_data/wav/dev_param_Q/.
  • This file (README).
  • Some convenience mathematical functions and classes.
  • Script to visualize DOA metrics on the analysis results. Adapted from seld_dcase2019_master/misc_files/.


Folder where the dataset should be located. Due to the size, the contents of this folder are not included in git.


The algorithm output is stored here, in form of .csv files, as in the required challenge output format. Each folder corresponds to the output with a different dataset type, mode and configuration preset. Within each folder, folder doa contains localization and timing estimations, and classif adds the source classification.


The doa-estimation output in human-readable style, formatted for the source classification step. Each folder corresponds to the output with a different dataset type, mode and configuration preset. Within each folder, folder doa contains localization and timing estimations, and classif adds the source classification.


This is just a fork of the baseline method, as of 28th March 2019 (commit c11188984875600a607d85f98ca05958ad9287ab). Some of the methods (evaluation, plot) are taken from there. It should also be possible to train and run it...


Deep learning classification backend code:

  • The main script. After training and testing the CRNN using multi-class classification accuracy, SELD metrics are computed, first using the proposed complete frontend and also using the ideal frontend.
  • The data generators and mixup code.
  • Feature extraction code.
  • The CRNN architecture.
  • Some basic utilities.
  • Evaluation code (only for computing accuracy in the multi-class classification problem during training; nothing to do with SELD metrics).
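Since the backend's data generators include mixup, here is a minimal sketch of mixup augmentation (blending pairs of examples and their labels with a Beta-distributed weight). The alpha=0.2 default and the exact pairing scheme are assumptions, not necessarily the repository's settings.

```python
import numpy as np

def mixup_batch(x, y, alpha=0.2, rng=None):
    """Mixup augmentation: blend each example (and its one-hot label)
    with a randomly paired example from the same batch.

    x: (batch, ...) features; y: (batch, n_classes) one-hot labels.
    A single lambda drawn from Beta(alpha, alpha) is applied to the
    whole batch, a common simplification.
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(x))
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y + (1.0 - lam) * y[perm]
    return x_mix, y_mix
```

Because the labels are mixed with the same weight as the features, each mixed label remains a convex combination summing to one, which standard cross-entropy training handles directly.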
