
VIMSS Visually-Informed Music Source Separation

Visually-informed Music Source Separation project @ Jeju 2018 Deep Learning Summer Camp


Unsupervised and weakly-supervised audio-visual deep learning models have recently emerged, with applications to many tasks such as classification [1, 6], speech separation [4], and audio source separation and localisation [2, 3].

In this project, we focus on audio-visual music source separation. Taking the models proposed in [2] and [3] as a basis, we would like to:

  1. Reproduce the pipeline and results of [2, 3];
  2. Extend them to more than two sources, as both works consider only the case of one or two audio and visual sources;
  3. Take advantage of integrating more advanced audio source separation models [5] into the audio-visual pipeline.

Proposed Framework


Evaluation datasets

  1. URMP dataset
  2. Clarinet4Science dataset
  3. Home-made Sound-of-Pixels dataset (by Juan Montesinos)
  4. MUSDB18 as a reference


  1. AudioSet
  2. Youtube-8M dataset

Audio Baselines


  • Reproduce Wave-U-Net baseline with MUSDB (GPU/TPU)
  • URMP dataset preprocessing
  • Wave-U-Net extension for URMP dataset (multiple sources)
  • Wave-U-Net conditioning for URMP dataset (with concatenation || multiplicative/additive attention)
  • Segmentation and feature estimation tasks from video frames
  • Writing, dissemination, demo

Model Architecture


Our work builds on the Wave-U-Net [8] implementation, which performs end-to-end audio source separation directly on raw audio samples in the time domain. Wave-U-Net adapts the original U-Net [9] to one-dimensional signals: a series of 1-D convolutions with down-sampling is followed by a series of up-sampling steps, with skip connections from encoder to decoder at each feature level.

Wave-U-Net Architecture

The input to the network is a single-channel audio mix, and the desired output is K separated channels, one per individual source, where K is the number of sources present in the mix. We split each 2 to 3 minute music track into segments of 147443 samples, which correspond to wav files about 6 seconds long. Each segment passes through 12 successive 1-D convolutional down-sampling layers, and decimation at each layer halves the time resolution. At the very bottom of the Wave-U-Net, the signal is only 9 samples long. On the way back up, up-sampling is done by linear interpolation instead of transposed strided convolutions, which preserves temporal continuity and avoids high-frequency noise in the final result. Other works zero-pad the feature maps and the input before convolving in order to keep the original dimensions. In Wave-U-Net, however, convolutions are performed without implicit padding, because padding causes audio artifacts at segment borders. As the price of computing every output sample with correct audio context, the output (16389 samples) is much shorter than the input (147443 samples). In the final layer, K convolutional filters are applied to the features to extract the K separate source outputs.
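The segment lengths quoted above can be checked with a few lines of arithmetic. The sketch below is only an illustration, not the project's code. It assumes a down-sampling kernel of size 15 and an up-sampling kernel of size 5, as in the Wave-U-Net paper [8]; neither value is stated in this README. It also assumes that decimation keeps every second sample and that linear interpolation inserts one midpoint between each pair of neighbouring samples. Under these assumptions a 147443-sample input shrinks to 9 samples at the bottleneck and yields 16389 output samples, the output size reported for Wave-U-Net [8].

```python
# Length bookkeeping for a Wave-U-Net-style network with unpadded convolutions.
# Kernel sizes are assumptions taken from the Wave-U-Net paper, not from this README.
NUM_LAYERS = 12
DOWN_KERNEL = 15  # assumed size of the down-sampling convolutions
UP_KERNEL = 5     # assumed size of the up-sampling convolutions


def valid_conv_len(n, kernel):
    """Length after a stride-1 convolution without padding."""
    return n - kernel + 1


def decimate_len(n):
    """Length after keeping every second sample (x[::2])."""
    return (n + 1) // 2


def interp_len(n):
    """Length after inserting one interpolated sample between neighbours."""
    return 2 * n - 1


def upsample_linear(x):
    """Linear interpolation by a factor of two on a non-empty list of floats."""
    out = []
    for left, right in zip(x, x[1:]):
        out.extend([left, (left + right) / 2])
    out.append(x[-1])
    return out


def trace_lengths(n_in):
    """Return (bottleneck length, output length) for an input of n_in samples."""
    n = n_in
    for _ in range(NUM_LAYERS):
        n = decimate_len(valid_conv_len(n, DOWN_KERNEL))
    bottleneck = valid_conv_len(n, DOWN_KERNEL)
    n = bottleneck
    for _ in range(NUM_LAYERS):
        n = valid_conv_len(interp_len(n), UP_KERNEL)
    return bottleneck, n


bottleneck_len, output_len = trace_lengths(147443)
print(bottleneck_len, output_len)        # 9 16389
print(upsample_linear([0.0, 1.0, 3.0]))  # [0.0, 0.5, 1.0, 2.0, 3.0]
```

Because every output sample is computed from real context only, consecutive input segments have to overlap if the predictions are to cover a whole track without gaps.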


Feature-wise Transformation (Conditioning)

A significant addition to the source separation pipeline would be the aggregation of multiple sources of information.

Our problem space using video data involves several different modalities of information:

  1. Audio

  2. Images (frames)

  3. Temporal evolution in the video, captured by optical flow or a more advanced model

  4. Scores and MIDI (optional)

We want the model to extract context from the images and use it while training the audio model. One way to fuse these different sources of information is to apply a feature-wise transformation [10]. These transformations can be computed in different ways and applied at different stages of the network.


  • Simple concatenation
  • Additive conditioning
  • Multiplicative conditioning


  • Conditioning at every convolutional layer
  • Conditioning at the bottleneck
  • Conditioning at the output layer
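The three conditioning methods listed above can be sketched with NumPy on a toy bottleneck feature map. This is a minimal illustration, not the project's implementation. The tensor sizes, the multi-hot instrument labels, the random projection matrices `w_beta` and `w_gamma`, and all function names are hypothetical. In a trained model the projections would be learned jointly with the separation network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a bottleneck feature map and a multi-hot instrument label.
batch, time_steps, channels = 2, 9, 8
num_instruments = 13

features = rng.standard_normal((batch, time_steps, channels))
labels = np.zeros((batch, num_instruments))
labels[0, [1, 4]] = 1.0  # first mix contains instruments 1 and 4
labels[1, [7]] = 1.0     # second mix contains instrument 7


def concat_conditioning(x, c):
    """Repeat the condition at every time step and append it as extra channels."""
    tiled = np.repeat(c[:, None, :], x.shape[1], axis=1)
    return np.concatenate([x, tiled], axis=-1)


def additive_conditioning(x, c, w_beta):
    """Add a per-channel bias computed from the condition."""
    beta = c @ w_beta              # (batch, channels)
    return x + beta[:, None, :]    # broadcast over time


def multiplicative_conditioning(x, c, w_gamma):
    """Scale each channel by a factor computed from the condition."""
    gamma = c @ w_gamma            # (batch, channels)
    return x * gamma[:, None, :]   # broadcast over time


# Random stand-ins for projections that would normally be learned.
w_beta = 0.1 * rng.standard_normal((num_instruments, channels))
w_gamma = 0.1 * rng.standard_normal((num_instruments, channels))

concat_out = concat_conditioning(features, labels)
add_out = additive_conditioning(features, labels, w_beta)
mul_out = multiplicative_conditioning(features, labels, w_gamma)

print(concat_out.shape, add_out.shape, mul_out.shape)
```

Concatenation widens the feature map, so the next layer needs more input channels, while the additive and multiplicative variants keep its shape. That makes the latter two easy to apply at any of the stages listed above.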

The most naive approach is to extract the labels of the instruments present in a video and condition the source separation on those labels. Even though concatenation sounds like the simplest solution, we have little intuition for why it should work. We experimented with multiplicative conditioning on ground-truth labels, applied at the bottleneck of Wave-U-Net. It results in a slightly lower but noisier loss:

Loss curve for conditioned Wave-U-Net on URMP dataset


  • batch size
  • bfloat16
  • learning rate
  • exponential decay
  • number of sources



  • Leo Kim (@leoybkim), University of Waterloo
  • Olga Slizovskaia (@veleslavia), Pompeu Fabra University

Paper link

Related work

[1]. Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, Antonio Torralba. The Sound of Pixels

[2]. Andrew Owens, Alexei A. Efros. Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

[3]. Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, In So Kweon. Learning to Localize Sound Source in Visual Scenes

[4]. Ariel Ephrat, Inbar Mosseri, Oran Lang, Tali Dekel, Kevin Wilson, Avinatan Hassidim, William T. Freeman, Michael Rubinstein. Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation

[5]. Relja Arandjelovic, Andrew Zisserman. Objects that Sound

[6]. Ruohan Gao, Rogerio Feris, Kristen Grauman. Learning to Separate Object Sounds by Watching Unlabeled Video

[7]. Sanjeel Parekh, Slim Essid, Alexey Ozerov, Ngoc Q. K. Duong, Patrick Pérez, Gaël Richard. Weakly Supervised Representation Learning for Unsynchronized Audio-Visual Events

[8]. Daniel Stoller, Sebastian Ewert, Simon Dixon. Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation

[9]. Olaf Ronneberger, Philipp Fischer, Thomas Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation

[10]. Vincent Dumoulin, Ethan Perez, Nathan Schucher, Florian Strub, Harm de Vries, Aaron Courville, Yoshua Bengio. Feature-wise transformations


This work was supported by Deep Learning Camp Jeju 2018, which was organized by the TensorFlow Korea User Group. Olga also acknowledges support from the Spanish Ministry of Economy and Competitiveness under the Maria de Maeztu Units of Excellence Programme (MDM-2015-0502).


This project is licensed under the GNU GPL v3 License - see the file for details

