VIMSS Visually-Informed Music Source Separation
Visually-informed Music Source Separation project @ Jeju 2018 Deep Learning Summer Camp
Unsupervised and weakly-supervised audio-visual deep learning models have emerged recently, with applications in many tasks such as classification [1, 6], speech separation, and audio source separation and localisation [2, 3].
- Reproduce the pipeline and results of [2, 3];
- Extend the approach to more than two sources, as both works focus only on the case of one or two audio and visual sources;
- Take advantage of integrating more advanced audio source separation models into the audio-visual pipeline.
- URMP dataset https://datadryad.org//resource/doi:10.5061/dryad.ng3r749
- Clarinet4Science dataset
- Home-made Sound-of-Pixels dataset (by Juan Montesinos)
- MUSDB18 as a reference
- YouTube-8M dataset
- Reproduce Wave-U-Net baseline with MUSDB (GPU/TPU)
- URMP dataset preprocessing
- Wave-U-Net extension for URMP dataset (multiple sources)
- Wave-U-Net conditioning for URMP dataset (with concatenation || multiplicative/additive attention)
- Segmentation and feature estimation tasks from video frames
- Writing, dissemination, demo
The basis of our work stems from the Wave-U-Net model implementation, which performs end-to-end audio source separation on raw audio samples in the time domain. The Wave-U-Net model is an adaptation of the original U-Net that performs a series of 1-D convolutions and a series of up-sampling steps, with skip connections from the encoder to the decoder layer at each feature level.
The input to this network is a single-channel audio mix, and the desired output is K separated channels of individual audio sources, where K is the number of sources present in the mix. We split each 2 to 3 minute music track into segments of 147443 samples, which gives wav files about 6 seconds long. Each segment then goes through 12 successive layers of 1-D convolution and down-sampling, where decimation halves the time resolution at each layer. At the very bottom of the Wave-U-Net, the feature map is only about 9 samples long. Going back up the U-Net, linear interpolation is used for upsampling instead of transposed strided convolutions. This preserves temporal continuity and avoids high-frequency noise in the final result. Other works zero-pad the feature maps and the input before convolving to keep the original dimensions. In Wave-U-Net, however, convolutions are performed without implicit padding, because padding causes audio artifacts at segment borders. As a result, the output (16389 samples) is much shorter than the input (147443 samples); this is the price of computing every output sample with correct audio context. In the final layer, K convolutional filters are applied to the features to produce the K separate source outputs.
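The length bookkeeping described above can be checked with a few lines of arithmetic. This is a minimal sketch, not the project's code: the function name `wave_u_net_lengths` is hypothetical, and the filter sizes (15 for the down-sampling path, 5 for the up-sampling path) are assumptions taken from the Wave-U-Net paper, since this README does not state them.

```python
def wave_u_net_lengths(input_len, num_layers=12, down_filter=15, up_filter=5):
    """Track the time-axis length through an unpadded Wave-U-Net.

    Assumed (not stated in this README): filter size 15 on the way down,
    filter size 5 on the way up, as in the Wave-U-Net paper.
    Returns (bottleneck_length, output_length).
    """
    n = input_len
    for _ in range(num_layers):
        n -= down_filter - 1   # unpadded ("valid") convolution shortens the signal
        n = (n + 1) // 2       # decimation: keep every second sample
    n -= down_filter - 1       # bottleneck convolution, also unpadded
    bottleneck = n
    for _ in range(num_layers):
        n = 2 * n - 1          # linear interpolation adds one sample between neighbours
        n -= up_filter - 1     # unpadded convolution again
    # The final layer with K filters is assumed to be 1x1, so it keeps the length.
    return bottleneck, n


bottleneck_len, output_len = wave_u_net_lengths(147443)
print("bottleneck:", bottleneck_len, "output:", output_len)
```

Under these assumptions a 147443-sample input reaches the bottleneck with 9 samples, which agrees with the figure quoted above. Changing `input_len` shows how sensitive the output length is to the segment size when no padding is used.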
Feature-wise Transformation (Conditioning)
A significant addition to the source separation pipeline is the aggregation of multiple sources of information.
Our problem space using video data involves several different modalities of information:
- Temporal evolution in video by optical flow or another advanced model
- Optional scores and MIDI
We want our model to learn the context provided by the images and to refer to it while training the audio model. One way to fuse these different sources of information is to apply a feature-wise transformation. There are different methods for these transformations and different stages at which they can be applied.
- Simple concatenation
- Additive conditioning
- Multiplicative conditioning
- Conditioning at every convolutional layer
- Conditioning at the bottleneck
- Conditioning at the output layer
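The three conditioning methods listed above can be illustrated on a toy feature map. This is only a sketch in plain Python, not the project's implementation: the helper names (`project`, `concat_condition`, `additive_condition`, `multiplicative_condition`), the feature values, and the weight matrix are all hypothetical, and in a real model the projection weights would be learned.

```python
def project(labels, weights):
    """Hypothetical linear layer: map K instrument labels to one value per channel."""
    return [sum(w * l for w, l in zip(row, labels)) for row in weights]


def concat_condition(features, cond):
    """Simple concatenation: add each conditioning value as an extra channel,
    repeated along the time axis."""
    steps = len(features[0])
    return features + [[float(c)] * steps for c in cond]


def additive_condition(features, beta):
    """Additive conditioning: shift every channel by its own conditioning value."""
    return [[x + b for x in channel] for channel, b in zip(features, beta)]


def multiplicative_condition(features, gamma):
    """Multiplicative conditioning: scale every channel by its own conditioning value."""
    return [[x * g for x in channel] for channel, g in zip(features, gamma)]


# Toy feature map: 2 channels x 3 time steps (illustrative values only).
features = [[1.0, 2.0, 3.0],
            [4.0, 5.0, 6.0]]
# Multi-hot labels for 3 hypothetical instruments; the first and third are present.
labels = [1, 0, 1]
# Hypothetical projection weights (channels x instruments).
weights = [[0.5, 1.0, 1.5],
           [2.0, 0.0, -1.0]]

gamma = project(labels, weights)                     # one value per channel
scaled = multiplicative_condition(features, gamma)   # same shape as features
shifted = additive_condition(features, gamma)        # same shape as features
stacked = concat_condition(features, labels)         # channel count grows by K
print(len(scaled), len(shifted), len(stacked))
```

Additive and multiplicative conditioning keep the number of channels unchanged, so they can be applied at any of the stages listed above without altering the layers that follow. Concatenation increases the channel count, so the next layer has to be sized accordingly.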
The most naive approach is to extract the labels of the instruments present in a video and condition the source separation on those labels. Even though concatenation sounds like the simplest solution, we have little intuition for why it should work. We experimented with multiplicative conditioning on ground-truth labels, applied at the bottleneck of Wave-U-Net. It results in a slightly lower but noisier loss.
- batch size
- learning rate
- exponential decay
- number of sources
- Leo Kim (@leoybkim), University of Waterloo
- Olga Slizovskaia (@veleslavia), Pompeu Fabra University
Ariel Ephrat, Inbar Mosseri, Oran Lang, Tali Dekel, Kevin Wilson, Avinatan Hassidim, William T. Freeman, Michael Rubinstein. Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation.
This work was supported by Deep Learning Camp Jeju 2018, which was organized by TensorFlow Korea User Group. Olga also acknowledges support from the Spanish Ministry of Economy and Competitiveness under the Maria de Maeztu Units of Excellence Programme (MDM-2015-0502).
This project is licensed under the GNU GPL v3 License - see the LICENSE.md file for details