CNN-based singing voice detection experiments

Exploring Data Augmentation for Improved Singing Voice Detection with Neural Networks

This is a reimplementation of the experiments presented in the paper "Exploring Data Augmentation for Improved Singing Voice Detection with Neural Networks" by Jan Schlüter and Thomas Grill at the 16th International Society for Music Information Retrieval Conference (ISMIR 2015). [Paper, BibTeX]

For follow-up experiments described in my PhD thesis, see the phd_extra branch, and for experiments on training a network to not be irritated by wiggly lines, see the unhorse branch. For a demonstration on how the networks can be fooled with hand-drawn wiggly lines, see the singing_horse repository.


The code requires the following software:

  • Python 2.7+ or 3.4+
  • Python packages: numpy, scipy, Theano, Lasagne
  • bash or a compatible shell with wget and tar
  • ffmpeg or avconv

For better performance, the following Python packages are recommended:

  • pyfftw (for much faster spectrogram computation)
  • scipy version 0.15+ (to allow time stretching and pitch shifting augmentations to be parallelized by multithreading, not only by multiprocessing)

For Theano and Lasagne, you may need the bleeding-edge versions from github. In short, they can be installed with:

pip install --upgrade --no-deps
pip install --upgrade --no-deps

(Add --user to install in your home directory, or sudo to install globally.) For more complete installation instructions including GPU setup, please refer to the From Zero to Lasagne guides.

On Ubuntu, pyfftw can be installed with the following two commands:

sudo apt-get install libfftw3-dev
pip install pyfftw


For preparing the experiments, clone the repository somewhere:

git clone

If you do not have git available, download the code from and extract it.

The experiments use the public Jamendo dataset by Mathieu Ramona. To download and prepare it, open the cloned or extracted repository in a bash terminal and execute the following scripts (in this order):



Table 1 in the paper shows results for Jamendo without data augmentation, with train-time augmentation (combining pitch-shifting, time-stretching and random frequency filtering), with test-time augmentation (pitch-shifting only) and with both train-time and test-time augmentation.
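To give a rough idea of what pitch-shifting a spectrogram by a percentage can mean, here is a minimal sketch that rescales the frequency axis by linear interpolation. It assumes a linear-frequency spectrogram shaped (frames, bins); the function name `pitch_shift` and the interpolation scheme are illustrative assumptions, not the repository's implementation.

```python
import numpy as np


def pitch_shift(spect, percent):
    """Rescale the frequency axis of `spect` (frames x bins) by `percent`.

    Output bin k reads the input at position k / factor, so a positive
    percentage moves energy towards higher bins. Positions outside the
    input range are filled with zeros.
    """
    factor = 1.0 + percent / 100.0
    num_bins = spect.shape[1]
    bins = np.arange(num_bins)
    source = bins / factor
    return np.apply_along_axis(
        lambda row: np.interp(source, bins, row, left=0.0, right=0.0),
        1, spect)


# Toy spectrogram with all energy in bin 20.
spect = np.zeros((4, 64))
spect[:, 20] = 1.0

shifted = pitch_shift(spect, +10)
print(shifted.argmax(axis=1))  # position of the peak after shifting
```

Time stretching can be sketched in the same way by rescaling the time axis instead of the frequency axis.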

w/o augmentation

To reproduce results without augmentation, run the following in a terminal in the cloned or extracted repository:

cd experiments
python --no-augment --cache=/tmp jamendo_noaugment.npz
python --cache=/tmp jamendo_noaugment.{,pred.}npz
python jamendo_noaugment.pred.npz

The --cache=/tmp option stores the spectrograms in /tmp so they do not have to be recomputed for further runs. You can pass any directory there, or omit this option to always compute them on the fly (this adds less than a minute to training, and less than half a minute to computing the predictions). Total space requirements for the spectrograms are about 3.2 GiB.
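The compute-or-load pattern behind such a cache can be sketched as follows. The helper name `cached_spectrogram`, the `.npy` file naming and the `compute` callback are assumptions made for this illustration; the repository may organize its cache differently.

```python
import os
import tempfile

import numpy as np


def cached_spectrogram(name, compute, cache_dir=None):
    """Return a spectrogram, loading it from `cache_dir` when available.

    `compute` is a zero-argument callable producing the spectrogram.
    With `cache_dir=None`, the spectrogram is always recomputed.
    """
    if cache_dir is None:
        return compute()
    path = os.path.join(cache_dir, name + ".npy")
    if os.path.exists(path):
        return np.load(path)
    spect = compute()
    np.save(path, spect)
    return spect


# Hypothetical stand-in for an expensive spectrogram computation.
def compute_dummy():
    return np.random.RandomState(0).rand(100, 80)


cache_dir = tempfile.mkdtemp()
first = cached_spectrogram("song01", compute_dummy, cache_dir)   # computed
second = cached_spectrogram("song01", compute_dummy, cache_dir)  # loaded
```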

The training code will produce two files: jamendo_meanstd.npz, storing the statistics needed to standardize the data, computed on the training set, and jamendo_noaugment.npz, storing the weights of the trained network.
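As a minimal sketch of how such standardization statistics might be computed, stored and applied: the function names, the key names `mean` and `std`, and the (frames, bins) array layout are assumptions for illustration, not necessarily what the repository uses.

```python
import os
import tempfile

import numpy as np


def compute_meanstd(spectrograms):
    """Per-bin mean and standard deviation over all training frames."""
    frames = np.concatenate(spectrograms, axis=0)
    return frames.mean(axis=0), frames.std(axis=0)


def standardize(spect, mean, std, eps=1e-8):
    """Shift and scale each bin to zero mean and unit variance."""
    return (spect - mean) / (std + eps)


# Toy training set: three "files" with 40 frequency bins each.
rng = np.random.RandomState(0)
train = [rng.rand(n, 40) * 5 + 2 for n in (100, 150, 120)]

mean, std = compute_meanstd(train)

# Store and reload the statistics (a stand-in for jamendo_meanstd.npz).
path = os.path.join(tempfile.mkdtemp(), "meanstd.npz")
np.savez(path, mean=mean, std=std)
with np.load(path) as f:
    mean, std = f["mean"], f["std"]

normalized = standardize(np.concatenate(train), mean, std)
```

Computing the statistics on the training set only, and reusing them for validation and test data, keeps information about the evaluation data out of training.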

The second command reads the network weights, computes predictions for all files of the validation and test set, and stores them in jamendo_noaugment.pred.npz.

Finally, the third command reads the predictions, preprocesses them, optimizes the threshold on the validation set and reports results on the test set.
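The threshold selection step can be illustrated with a small sketch on synthetic data. It assumes framewise predictions in [0, 1], binary labels and classification error as the criterion; the function names and the candidate grid are hypothetical, and the preprocessing of predictions is omitted.

```python
import numpy as np


def classification_error(preds, labels, threshold):
    """Fraction of frames misclassified when binarizing at `threshold`."""
    return np.mean((preds > threshold) != labels)


def optimize_threshold(preds, labels, candidates=None):
    """Return the candidate threshold with the lowest error."""
    if candidates is None:
        candidates = np.linspace(0, 1, 101)
    errors = [classification_error(preds, labels, t) for t in candidates]
    return candidates[int(np.argmin(errors))]


rng = np.random.RandomState(42)


def toy_split(n):
    """Synthetic labels and noisy predictions, for demonstration only."""
    labels = rng.rand(n) > 0.5
    preds = np.clip(0.3 + 0.4 * labels + 0.15 * rng.randn(n), 0, 1)
    return preds, labels


valid_preds, valid_labels = toy_split(2000)
test_preds, test_labels = toy_split(2000)

# Choose the threshold on the validation set, then apply it to the test set.
threshold = optimize_threshold(valid_preds, valid_labels)
test_error = classification_error(test_preds, test_labels, threshold)
```

Tuning the threshold on the validation set rather than the test set keeps the reported test error unbiased by the threshold choice.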

Each command can be run with --help for documentation on further options.

train augmentation

To reproduce results with train-time augmentation, run:

OMP_NUM_THREADS=1 python --augment --cache=/tmp jamendo_augment.npz
python --cache=/tmp jamendo_augment{,.pred}.npz
python jamendo_augment.pred.npz

The only change is that --augment is activated for training. By default, data augmentation will happen on CPU in three background threads running in parallel to the training thread. Change bg_threads or bg_processes in if this is not what you want (this is not exposed as a command line argument). The OMP_NUM_THREADS=1 environment variable setting prevents the background threads from using multi-threaded BLAS routines, which would slow things down.
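The general idea of producing augmented batches in the background while the main thread trains can be sketched with a bounded queue. This is a simplified single-thread illustration using only the Python 3 standard library; the names `generate_in_background` and `augmented_batches` are hypothetical and do not refer to the repository's code.

```python
import queue
import threading


def generate_in_background(generator, num_cached=10):
    """Run `generator` in a background thread, yielding its items in order.

    Up to `num_cached` items are computed ahead of the consumer.
    """
    q = queue.Queue(maxsize=num_cached)
    sentinel = object()  # marks the end of the stream

    def producer():
        for item in generator:
            q.put(item)
        q.put(sentinel)

    thread = threading.Thread(target=producer, daemon=True)
    thread.start()
    while True:
        item = q.get()
        if item is sentinel:
            break
        yield item


def augmented_batches(num_batches):
    """Hypothetical stand-in for an expensive augmentation pipeline."""
    for i in range(num_batches):
        yield [i * 2, i * 2 + 1]


# The consumer (e.g., a training loop) iterates as over any generator.
batches = list(generate_in_background(augmented_batches(5), num_cached=2))
```

With several background threads feeding one queue, batches would arrive in nondeterministic order, which is one reason results are not exactly reproducible even with fixed seeds.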

test augmentation

For test-time augmentation, run:

python --cache=/tmp --pitchshift=+10 jamendo_noaugment{,_p10.pred}.npz
python --cache=/tmp --pitchshift=+20 jamendo_noaugment{,_p20.pred}.npz
python --cache=/tmp --pitchshift=-10 jamendo_noaugment{,_m10.pred}.npz
python --cache=/tmp --pitchshift=-20 jamendo_noaugment{,_m20.pred}.npz
python jamendo_noaugment{,_p10,_p20,_m10,_m20}.pred.npz

This computes predictions for the first network with files pitch-shifted by +10, +20, -10 and -20 percent, then bags the predictions (along with the non-shifted ones) for evaluation.
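Bagging can be illustrated as follows, under the assumption that it amounts to a plain arithmetic mean of the framewise predictions of all variants. The function name `bag_predictions` and the toy numbers are made up for this sketch.

```python
import numpy as np


def bag_predictions(prediction_sets):
    """Average framewise predictions of several variants of the same file."""
    return np.mean(np.stack(prediction_sets, axis=0), axis=0)


# Hypothetical predictions for three frames of one file, obtained from the
# unshifted input and from four pitch-shifted versions.
variants = {
    "unshifted": np.array([0.9, 0.2, 0.6]),
    "p10": np.array([0.8, 0.1, 0.4]),
    "p20": np.array([0.7, 0.3, 0.5]),
    "m10": np.array([0.9, 0.2, 0.3]),
    "m20": np.array([0.7, 0.2, 0.7]),
}

bagged = bag_predictions(list(variants.values()))
```

Averaging requires that all variants yield the same number of frames per file, which holds for pitch shifting because it does not change the duration.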

train/test augmentation

For train-time and test-time augmentation, run:

python --cache=/tmp --pitchshift=+10 jamendo_augment{,_p10.pred}.npz
python --cache=/tmp --pitchshift=+20 jamendo_augment{,_p20.pred}.npz
python --cache=/tmp --pitchshift=-10 jamendo_augment{,_m10.pred}.npz
python --cache=/tmp --pitchshift=-20 jamendo_augment{,_m20.pred}.npz
python jamendo_augment{,_p10,_p20,_m10,_m20}.pred.npz

Similar to the previous step, this uses a network we trained before and applies it to different pitch-shifted versions.


... the code

This is not the code used for the original paper, but a compacted reimplementation. It is not perfectly identical (e.g., the original experiments use zero-padding of input files during training, while this implementation discards the borders for training and only pads for testing), but very close. It is written to be easy to read and to pick out parts for reuse (obeying the license), not so much as a generic starting point for your own experiments. For a more feature-complete starting point, see the phd_extra branch.

... the results

Results will vary depending on the random initialization of the networks. Even with fixed random seeds, results will not be exactly reproducible due to the multi-threaded data augmentation. Furthermore, when training on GPU with cuDNN, the backward pass is nondeterministic by default, introducing further noise. For more reliable comparison between the four variants, each experiment should be repeated at least five times, to compute averages and confidence intervals.
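One way to summarize repeated runs is a mean with a 95% confidence interval based on Student's t distribution. The sketch below assumes independent, roughly normally distributed results; the error values are hypothetical placeholders, not results from the paper or this code.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, by degrees of freedom.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228}


def mean_confidence_interval(values):
    """Return (mean, half_width) of a 95% confidence interval."""
    n = len(values)
    if n - 1 not in T_CRIT_95:
        raise ValueError("this sketch supports 2 to 11 values")
    mean = statistics.mean(values)
    # Standard error of the mean, from the sample standard deviation.
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean, T_CRIT_95[n - 1] * sem


# Hypothetical classification errors (in percent) from five repetitions.
errors = [7.0, 8.0, 9.0, 8.0, 8.0]
mean, half_width = mean_confidence_interval(errors)
print("%.2f +/- %.2f" % (mean, half_width))
```

Two variants whose intervals overlap substantially cannot be told apart reliably from that number of runs.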