Code for the paper: Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

[Paper] [Project page]

This repository contains code for the paper:

Andrew Owens, Alexei A. Efros. Audio-Visual Scene Analysis with Self-Supervised Multisensory Features. arXiv, 2018


This release includes code and models for:

  • On/off-screen source separation: separating the speech of an on-screen speaker from background sounds.
  • Blind source separation: audio-only source separation using a u-net trained with permutation invariant training (PIT).
  • Sound source localization: visualizing the parts of a video that correspond to sound-making actions.
  • Self-supervised audio-visual features: a pretrained 3D CNN that can be used for downstream tasks (e.g. action recognition, source separation).


pip install tensorflow     # for CPU evaluation only
pip install tensorflow-gpu # for GPU support

We used TensorFlow version 1.8, which can be installed with:

pip install tensorflow-gpu==1.8
  • Install the other Python dependencies:
pip install numpy matplotlib pillow scipy
  • Download the pretrained models and sample data

Pretrained audio-visual features

We have provided the features for our fused audio-visual network. These features were learned through self-supervised learning. Please see for a simple example that uses these pretrained features.

Audio-visual source separation

To try the on/off-screen source separation model, run:

python ../data/translator.mp4 --model full --duration_mult 4 --out ../results/

This separates the on-screen speaker's voice from that of an off-screen speaker. It writes the separated video files to ../results/ and also displays them in a local webpage for easier viewing. This produces the following videos (click to watch):

Input On-screen Off-screen

We can visually mask out one of the two on-screen speakers, thereby removing their voice:

python ../data/crossfire.mp4 --model full --mask l --out ../results/
python ../data/crossfire.mp4 --model full --mask r --out ../results/

This produces the following videos (click to watch):

Source Left Right
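
For intuition, visually masking a speaker can be thought of as blanking out one side of every frame before the video is passed to the network. The sketch below illustrates that idea only. The function name `mask_half`, the choice of exactly half the frame width, and the zero fill value are all assumptions made for illustration, and the repository's --mask option may be implemented differently.

```python
import numpy as np

def mask_half(frames, side):
    """Black out the left ('l') or right ('r') half of every frame.

    frames: array of shape (time, height, width, channels).
    Returns a masked copy; the input array is left unchanged.
    NOTE: hypothetical helper, not taken from the repository.
    """
    if side not in ("l", "r"):
        raise ValueError("side must be 'l' or 'r'")
    out = frames.copy()
    mid = frames.shape[2] // 2  # column index that splits the frame in two
    if side == "l":
        out[:, :, :mid] = 0
    else:
        out[:, :, mid:] = 0
    return out

# Toy clip: 2 frames of 4x6 RGB pixels, all set to one.
clip = np.ones((2, 4, 6, 3), dtype=np.uint8)
masked_left = mask_half(clip, "l")
masked_right = mask_half(clip, "r")
```

With the left half blanked, the network no longer has visual evidence for the left speaker, which is consistent with that speaker's voice being removed from the on-screen output.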

Blind (audio-only) source separation

This baseline trains a u-net model to minimize a permutation invariant loss.

python ../data/translator.mp4 --model unet_pit --duration_mult 4 --out ../results/

The model will write the two separated streams in an arbitrary order.
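
The arbitrary ordering follows from how a permutation invariant loss works: the loss is computed for every assignment of predicted streams to ground-truth sources, and only the smallest value is used, so the network is never told which output slot belongs to which source. A minimal NumPy sketch of that idea is given below. The function name `pit_loss` and the use of a mean absolute error are assumptions for illustration and may not match the loss used in this repository.

```python
import itertools
import numpy as np

def pit_loss(pred, target):
    """Permutation invariant loss for a small number of sources.

    pred, target: arrays of shape (num_sources, ...).
    Returns (loss, perm), where perm[i] is the index of the target
    that is matched to pred[i] under the best assignment.
    NOTE: hypothetical helper; mean absolute error is an assumed choice.
    """
    num_sources = pred.shape[0]
    best_loss, best_perm = None, None
    for perm in itertools.permutations(range(num_sources)):
        # Reorder the targets according to this candidate assignment.
        loss = float(np.mean(np.abs(pred - target[list(perm)])))
        if best_loss is None or loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm

# Two toy "sources" and a prediction that recovers them in swapped order.
a = np.array([1.0, 0.0, 1.0, 0.0])
b = np.array([0.0, 2.0, 0.0, 2.0])
target = np.stack([a, b])
pred_swapped = np.stack([b, a])

loss, perm = pit_loss(pred_swapped, target)
```

In this toy case the swapped prediction incurs zero loss, which is why a model trained this way has no reason to prefer one output order over the other.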

Visualizing the locations of sound sources

To view the self-supervised network's class activation map (CAM), use the --cam flag:

python ../data/translator.mp4 --model full --cam --out ../results/

This produces a video in which the CAM is overlaid as a heat map.
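
As background, a class activation map is commonly computed as a weighted sum of the final convolutional feature maps, using the classifier weights of the class of interest, and is then rescaled for display. The sketch below shows this for a single frame. It is a simplified illustration under stated assumptions: the names `class_activation_map` and `overlay` are hypothetical, the feature map is assumed to already match the frame resolution, and the repository's network operates on spatio-temporal features rather than single frames.

```python
import numpy as np

def class_activation_map(features, weights):
    """Weighted sum of feature channels, rescaled to [0, 1].

    features: (height, width, channels) final convolutional features.
    weights:  (channels,) classifier weights for the class of interest.
    NOTE: hypothetical helper, not taken from the repository.
    """
    cam = np.tensordot(features, weights, axes=([2], [0]))
    cam = cam - cam.min()
    peak = cam.max()
    return cam / peak if peak > 0 else cam

def overlay(frame, cam, alpha=0.5):
    """Blend a CAM into the red channel of an RGB frame.

    frame: (height, width, 3) floats in [0, 1].
    cam:   (height, width) floats in [0, 1], same spatial size as frame.
    """
    heat = np.zeros_like(frame)
    heat[..., 0] = cam
    return (1 - alpha) * frame + alpha * heat

# Toy example: only the top-right location has a non-zero response.
features = np.zeros((2, 2, 3))
features[0, 1, :] = [1.0, 2.0, 3.0]
weights = np.ones(3)

cam = class_activation_map(features, weights)
blended = overlay(np.zeros((2, 2, 3)), cam)
```

In practice the map has a lower resolution than the video and would need to be upsampled before blending; that step is omitted here.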

Action recognition and fine-tuning

We have provided example code for training an action recognition model (e.g. on the UCF-101 dataset) in This involves fine-tuning our pretrained, audio-visual network. It is also possible to train this network with only visual data (no audio).


If you use this code in your research, please consider citing our paper:

  title={Audio-Visual Scene Analysis with Self-Supervised Multisensory Features},
  author={Owens, Andrew and Efros, Alexei A},
  journal={arXiv preprint arXiv:1804.03641},


  • 11/08/18: Fixed a bug in the class activation map example code. Added TensorFlow 1.9 compatibility.


Our u-net code draws from this implementation of pix2pix.