A Wavenet for Music Source Separation

A neural network for end-to-end music source separation, as described in End-to-end music source separation: is it possible in the waveform domain?

Listen to separated samples here

What is a Wavenet for Music Source Separation?

The Wavenet for Music Source Separation is a fully convolutional neural network that directly operates on the raw audio waveform.

It is an adaptation of Wavenet that turns the original causal model (that is generative and slow), into a non-causal model (that is discriminative and parallelizable). This idea was originally proposed by Rethage et al. for speech denoising and now it is adapted for monaural music source separation. Their code is reused.

The main difference between the original Wavenet and the non-causal adaptation used, is that some samples from the future can be used to predict the present one. As a result of removing the autoregressive causal nature of the original Wavenet, this fully convolutional model is now able to predict a target field instead of one sample at a time – due to this parallelization, it is possible to run the model in real-time on a GPU.

See the diagram below for a summary of the network architecture.

Installation

git clone https://github.com/francesclluis/source-separation-wavenet.git
Install conda
conda env create -f environment.yml
source activate sswavenet

Currently the project requires Keras 2.1 and Theano 1.0.1, the large dilations present in the architecture are not supported by the current version of Tensorflow

Usage

A pre-trained multi-instrument model (best-performing model described in the paper) can be found in sessions/multi-instrument/checkpoints and is ready to be used out-of-the-box. The parameterization of this model is specified in sessions/multi-instrument/config.json

A pre-trained singing-voice model (best-performing model described in the paper) can be found in sessions/singing-voice/checkpoints and is ready to be used out-of-the-box. The parameterization of this model is specified in sessions/singing-voice/config.json

Download the dataset as described below

Source Separation:

Example (multi-instrument): THEANO_FLAGS=device=cuda python main.py --mode inference --config sessions/multi-instrument/config.json --mixture_input_path audio/

Example (singing-voice): THEANO_FLAGS=device=cuda python main.py --mode inference --config sessions/singing-voice/config.json --mixture_input_path audio/

Speedup

To achieve faster source separation, one can increase the target-field length by use of the optional --target_field_length argument. This defines the amount of samples that are separated in a single forward propagation, saving redundant calculations. In the following example, it is increased 10x that of when the model was trained, the batch_size is reduced to 4.

Faster Example: THEANO_FLAGS=device=cuda python main.py --mode inference --target_field_length 16001 --batch_size 4 --config sessions/multi-instrument/config.json --mixture_input_path audio/

Training:

Example (multi-instrument): THEANO_FLAGS=device=cuda python main.py --mode training --target multi-instrument --config config_multi_instrument.json

Example (singing-voice): THEANO_FLAGS=device=cuda python main.py --mode training --target singing-voice --config config_singing_voice.json

Configuration

A detailed description of all configurable parameters can be found in config.md

Optional command-line arguments:

Argument	Valid Inputs	Default	Description
mode	[training, inference]	training
target	[multi-instrument, singing-voice]	multi-instrument	Target of the model to train
config	string	config.json	Path to JSON-formatted config file
print_model_summary	bool	False	Prints verbose summary of the model
load_checkpoint	string	None	Path to hdf5 file containing a snapshot of model weights

Additional arguments during source separation:

Argument	Valid Inputs	Default	Description
one_shot	bool	False	Separates each audio file in a single forward propagation
target_field_length	int	as defined in config.json	Overrides parameter in config.json for separating with different target-field lengths than used in training
batch_size	int	as defined in config.json	# of samples per batch

Dataset

The MUSDB18 is used for training the model. It is provided by the Community-Based Signal Separation Evaluation Campaign (SISEC).

Download here
Decode dataset to WAV format as explained here
Extract to data/MUSDB

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
audio		audio
img		img
sessions		sessions
LICENSE		LICENSE
README.md		README.md
config.md		config.md
config_multi_instrument.json		config_multi_instrument.json
config_singing_voice.json		config_singing_voice.json
datasets.py		datasets.py
environment.yml		environment.yml
layers.py		layers.py
main.py		main.py
models.py		models.py
separate.py		separate.py
util.py		util.py

License

francesclluis/source-separation-wavenet

Folders and files

Latest commit

History

Repository files navigation

A Wavenet for Music Source Separation

What is a Wavenet for Music Source Separation?

Installation

Usage

Source Separation:

Speedup

Training:

Configuration

Optional command-line arguments:

Additional arguments during source separation:

Dataset

About

Topics

Resources

License

Stars

Watchers

Forks

Languages