This is the implementation of the BirdCLEF 2018 submission by OFAI within the aMOBY project.
It allows training an ensemble of neural networks to recognize 1500 South American bird species in audio recordings, with an option to factor in metadata about the recording date, time and location.
It contains the code for preparing the dataset (converting audio files and parsing metadata), for training a set of different models on audio recordings and/or metadata, for finding weights to form an ensemble of those models, and for producing the predictions on the test set for submission to the challenge.
For a detailed description of the approach, please refer to the paper "Bird Identification from Timestamped, Geotagged Audio Recordings" by Jan Schlüter included in the CLEF Working Notes 2018. [Paper, BibTeX]
The code requires the following software:
- Python 2.7+ or 3.4+
- Python packages: numpy, scipy, Theano, Lasagne
- bash or a compatible shell
- ffmpeg
For better performance, the following Python packages are recommended:
- pyfftw (for much faster spectrogram computation)
Before installing the dependencies, if desired, create and activate an environment using pyenv and/or virtualenv, or using conda.
Install the bleeding-edge versions of Theano and Lasagne from github:
pip install --upgrade --no-deps
pip install --upgrade --no-deps
(If not in an environment, add --user
to install in your home directory, or
to install globally.)
For GPU support, also install libgpuarray, following its installation instructions. For a more complete guide including CUDA and cuDNN, please refer to the From Zero to Lasagne guides.
For faster FFTs, install libfftw3 and pyfftw. On Ubuntu, this can be done with:
sudo apt-get install libfftw3-dev
pip install pyfftw
Under conda, it would be:
conda install -c conda-forge pyfftw
For preparing the experiments, clone the repository somewhere:
git clone
If you do not have git available, download the code from the repository page and extract it.
The experiments rely on the BirdCLEF 2018 dataset. First download the files (specifically, BirdCLEF2017TrainingSetPart1.tar.gz, BirdCLEF2017TrainingSetPart2.tar.gz, BirdCLEF2018MonophoneTest.tar.gz, BirdCLEF2018SoundscapesTest.tar.gz, BirdCLEF2018SoundscapesValidation.tar.gz) and extract them to a common directory. If you were not a BirdCLEF participant, ask the organizers if they are willing to share the URLs.
Then open the cloned or extracted repository in a bash terminal and execute the following:
It will tell you that you need to specify the path to the extracted files, but it will also display some useful hints on how to organize the placement of the converted audio files. This script will call other scripts to convert the audio to 22 kHz mono files (this saves time during training), build the file lists for training and testing, and extract the ground truth and metadata from the XML files.
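The conversion step can be pictured as one ffmpeg call per recording. The following is a minimal sketch, not the repository's actual script: the function names, the example paths, and the exact sample rate of 22050 Hz (read from "22 kHz" above) are assumptions made for illustration.

```python
import subprocess


def ffmpeg_convert_cmd(src, dst, sample_rate=22050):
    """Build an ffmpeg command that downmixes `src` to mono and resamples it.

    The 22050 Hz default is an assumption based on "22 kHz" in the text.
    """
    return [
        "ffmpeg",
        "-loglevel", "error",     # only report errors
        "-y",                     # overwrite dst if it already exists
        "-i", src,                # input file (any format ffmpeg can decode)
        "-ac", "1",               # one audio channel (mono)
        "-ar", str(sample_rate),  # output sample rate in Hz
        dst,
    ]


def convert(src, dst, sample_rate=22050):
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.check_call(ffmpeg_convert_cmd(src, dst, sample_rate))


if __name__ == "__main__":
    # Hypothetical paths, shown only to illustrate the command that would run.
    print(" ".join(ffmpeg_convert_cmd("input/rec0001.mp3", "audio/rec0001.wav")))
```

Converting once up front means training reads uncompressed .wav files directly instead of decoding compressed audio in every epoch.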
Finally, for all following commands, go into the experiments directory:
cd experiments
To train all models for the ensemble, simply run:
To use a GPU, either set up a .theanorc file in your home directory, or run:
THEANO_FLAGS=device=cuda,floatX=float32,gpuarray.preallocate=11000 ./
This will train 17 audio networks, 19 metadata networks and one combined
network. On an Nvidia Titan X Pascal GPU, a single training run will take up
to 10 hours for audio networks, and 50 minutes for metadata networks. If your
GPU does not have enough memory, reduce or remove the
gpuarray.preallocate=11000 setting, and reduce the batch size that is set in
the defaults.vars file.
If you have multiple GPUs, you can distribute runs over these GPUs by running
the script multiple times in multiple terminals with different target devices,
e.g., THEANO_FLAGS=device=cuda1 ./
. If you have multiple servers
that can access the same directory via NFS, you can also run the script on
each server for further distribution of runs (lockfiles keep a run from being
started more than once).
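The lockfile idea can be illustrated with a short sketch. This is not the repository's script: the function names, the `.lock` suffix, and the run names are hypothetical, and the sketch assumes a file system where exclusive file creation is atomic (which may not hold on very old NFS setups).

```python
import errno
import os
import tempfile


def try_lock(lockfile):
    """Atomically create `lockfile`; return True if we now own the run."""
    try:
        fd = os.open(lockfile, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except OSError as exc:
        if exc.errno == errno.EEXIST:
            return False  # another process already claimed this run
        raise
    os.close(fd)
    return True


def claim_runs(run_names, lock_dir):
    """Yield each run that no other worker has claimed yet."""
    for name in run_names:
        if try_lock(os.path.join(lock_dir, name + ".lock")):
            yield name


if __name__ == "__main__":
    lock_dir = tempfile.mkdtemp()
    runs = ["audio_1", "audio_2", "meta_1"]  # hypothetical run names
    first_worker = list(claim_runs(runs, lock_dir))
    second_worker = list(claim_runs(runs, lock_dir))
    print(first_worker, second_worker)
```

Because every worker walks the same list and skips runs whose lockfile already exists, no central scheduler is needed: starting the script on another GPU or server is enough to pick up the remaining runs.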
The script will also compute network predictions after each training run. If this failed for some jobs for any reason, run:
This will compute any missing network predictions (if none are missing, nothing happens).
To obtain results for all networks trained so far, run:
This will print the Mean Average Precision (MAP) against the foreground species, the MAP against the background species, and the top-k accuracy for the foreground species for k between 1 and 5, all on the validation set (the test set is kept secret by the organizers of the BirdCLEF challenge).
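To make the reported measures concrete, here is a minimal pure-Python sketch of mean average precision and top-k accuracy. It is not the repository's evaluation code: the function names and the dictionary-based data layout are assumptions, and MAP is assumed here to be the mean over recordings of the average precision of each recording's ranked species list.

```python
def average_precision(scores, positives):
    """Average precision of one recording's ranking.

    scores: dict mapping class name -> predicted score
    positives: set of class names that are correct for this recording
    """
    ranking = sorted(scores, key=scores.get, reverse=True)
    hits, precisions = 0, []
    for rank, name in enumerate(ranking, start=1):
        if name in positives:
            hits += 1
            precisions.append(hits / float(rank))
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(all_scores, all_positives):
    """Mean of the average precision over recordings with at least one label."""
    pairs = [(s, p) for s, p in zip(all_scores, all_positives) if p]
    return sum(average_precision(s, p) for s, p in pairs) / len(pairs)


def top_k_accuracy(all_scores, all_labels, k):
    """Fraction of recordings whose true class is among the k best scores."""
    hits = 0
    for scores, label in zip(all_scores, all_labels):
        top = sorted(scores, key=scores.get, reverse=True)[:k]
        hits += label in top
    return hits / float(len(all_labels))


if __name__ == "__main__":
    # Made-up scores for two recordings and three classes.
    scores = [{"a": 0.7, "b": 0.2, "c": 0.1}, {"a": 0.5, "b": 0.1, "c": 0.4}]
    foreground = ["a", "c"]
    print(mean_average_precision(scores, [{label} for label in foreground]))
    print([top_k_accuracy(scores, foreground, k) for k in (1, 2)])
```

With a single foreground species per recording, average precision reduces to the reciprocal of the rank at which that species appears; background species would be scored the same way with larger positive sets.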
After all models have been trained, you can run hyperopt to find an optimal
linear combination of models based on the validation set performance. Install
it with:
pip install hyperopt
We can now run
to do the actual optimization. The commands are
documented in comments in
. For example, for the audio-only
ensemble, run:
./ --dataset=birdclef --labelfile-background=bg.tsv --strategy=hyperopt \
In the end, it will produce a list of selected models and combination weights
that can be directly copied to
, preceded by submit
and a name
for the ensemble. It can also be used directly as arguments to ./
evaluate the ensemble.
Finally, to create the CSV files for submission, run:
Prefix the command with a THEANO_FLAGS=... setting if needed.
This will compute predictions on the test set for all models participating in
any of the ensembles, combine the predictions according to the weights, and
produce a CSV file for each ensemble.
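Combining the predictions according to the weights amounts to a weighted sum over models. The sketch below illustrates this with plain lists; the function name, the model names, and the data layout are hypothetical, and the repository may apply the weights to a different representation of the predictions.

```python
def combine_predictions(model_preds, weights):
    """Weighted sum of per-model predictions.

    model_preds: dict model name -> list of rows, one row of class scores
                 per recording (all models must share the same shape)
    weights: dict model name -> combination weight
    """
    names = sorted(weights)
    n_rows = len(model_preds[names[0]])
    n_classes = len(model_preds[names[0]][0])
    combined = [[0.0] * n_classes for _ in range(n_rows)]
    for name in names:
        weight = weights[name]
        for i, row in enumerate(model_preds[name]):
            for j, score in enumerate(row):
                combined[i][j] += weight * score
    return combined


if __name__ == "__main__":
    # Made-up predictions of two hypothetical models for one recording.
    preds = {"audio_net": [[0.8, 0.2]], "meta_net": [[0.4, 0.6]]}
    weights = {"audio_net": 0.75, "meta_net": 0.25}
    print(combine_predictions(preds, weights))
```

Models that are not listed in `weights` are simply ignored, which mirrors an ensemble that only includes the selected models.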
Datasets can be added to the datasets
directory and their name be passed as
the --dataset
argument of
, if needed). Each dataset directory must contain:
- an audio subdirectory with .wav files (this is a strict requirement, since they are accessed as memory maps),
- a directory with at least a train file listing the file names relative to the audio directory, and
- a directory with a fg.tsv file listing the training and validation file names along with their class labels, with a tab character in between, and a labelset file listing all class names to give them a fixed order.
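As an illustration of the label files described above, the following sketch reads a labelset and a tab-separated label file into dictionaries. It is not the repository's loader: the function names are hypothetical, and it assumes one class name per line in the labelset and exactly one label per file name.

```python
import io


def read_labelset(f):
    """Map each class name to a fixed index, in file order."""
    names = [line.rstrip("\r\n") for line in f if line.strip()]
    return {name: idx for idx, name in enumerate(names)}


def read_label_file(f, class_index):
    """Map each file name to the index of its class label."""
    labels = {}
    for line in f:
        line = line.rstrip("\r\n")
        if not line:
            continue
        # File name and class label are separated by a single tab character.
        filename, label = line.split("\t", 1)
        labels[filename] = class_index[label]
    return labels


if __name__ == "__main__":
    # In-memory stand-ins for the labelset and fg.tsv files; contents are made up.
    labelset = io.StringIO(u"species_a\nspecies_b\n")
    fg = io.StringIO(u"rec0001.wav\tspecies_b\nrec0002.wav\tspecies_a\n")
    classes = read_labelset(labelset)
    print(read_label_file(fg, classes))
```

Reading the class order from a separate file keeps the mapping from class name to network output index stable across training runs and models.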
The implementation makes some use of features unique to Lasagne, so it is not trivial to port completely to another framework. Some parts may be interesting to take out, though:
contains code for fast spectrogram computation, and a WavFile class for masquerading a .wav file as a numpy array that is lazily mapped to memory when
, which provides a way to yield random excerpts from a set of audio files with wildly different lengths. Each mini-batch will have same-length excerpts, with the length bounded between a given minimum and maximum length, and files drawn from buckets to avoid excessive cropping or
contains a learnable mel filterbank, a learnable magnitude transformation, PCEN, and log-mean-exp
implements a conversion of a CNN that classifies excerpts to a fully-convolutional network with dilated convolutions and dilated max-pooling that efficiently processes a full recording, keeping the full output resolution
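Of the parts listed above, log-mean-exp pooling is easy to show in isolation. The sketch below uses the common definition (1/a) * log(mean(exp(a * x))) in plain Python; the function and parameter names are assumptions, the sharpness `a` is a fixed positive number here, and the repository implements the operation as a Lasagne layer instead.

```python
import math


def log_mean_exp(values, sharpness=1.0):
    """Pool `values` with log-mean-exp; `sharpness` must be positive.

    Small sharpness behaves like the mean, large sharpness like the maximum.
    """
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(values)
    mean_exp = sum(math.exp(sharpness * (v - peak)) for v in values) / len(values)
    return peak + math.log(mean_exp) / sharpness


if __name__ == "__main__":
    activations = [0.0, 1.0]  # made-up per-frame scores of one class
    for a in (1e-6, 1.0, 100.0):
        print(a, log_mean_exp(activations, a))
```

Pooling frame-wise predictions this way lets a short, confident detection dominate the recording-level score without discarding the remaining frames entirely, as plain max-pooling would.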