TensorFlow implementation of "SoundNet".



TensorFlow implementation of "SoundNet" that learns rich natural sound representations.

Code for the paper "SoundNet: Learning Sound Representations from Unlabeled Video" by Yusuf Aytar, Carl Vondrick, and Antonio Torralba, NIPS 2016.

from soundnet


Prerequisites

  • Linux
  • NVIDIA GPU + CUDA 8.0 + CuDNNv5.1
  • Python 2.7 with numpy or Python 3.5
  • TensorFlow 1.0.0 (up to 1.3.0)
  • librosa

Getting Started

  • Clone this repo:
git clone git@github.com:eborboihuc/SoundNet-tensorflow.git
cd SoundNet-tensorflow
  • Pretrained Model

I provide pre-trained models ported from soundnet. You can download the 8-layer model here. Place it at ./models/sound8.npy in your working copy.

  • Data

Prepare your input mp3 files and place them under ./data/

Generate an input file list in txt format and place it under ./


Then follow the steps under Feature Extraction.
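One way to generate the input list is a short script. This is a minimal sketch that assumes the list is plain text with one mp3 path per line; `build_file_list` and the output name `input.txt` are hypothetical, so check `python extract_feat.py -h` for the format the loader actually expects.

```python
import glob
import os


def build_file_list(data_dir="./data", out_path="./input.txt"):
    """Write the path of every mp3 under data_dir to out_path, one per line.

    Assumption: the loader reads one audio path per line. Adjust this if the
    repo expects a different layout (for example, paths relative to data_dir).
    """
    paths = sorted(glob.glob(os.path.join(data_dir, "*.mp3")))
    with open(out_path, "w") as f:
        for path in paths:
            f.write(path + "\n")
    return paths
```

Calling `build_file_list()` with no arguments scans ./data/ and writes ./input.txt; the resulting file name can then be passed with the -t option.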

  • NOTE

If an audio file has a nonzero start offset (the start value reported by FFmpeg), the waveform loaded by torch audio and the one loaded by librosa can differ substantially. In that case, convert the file with the following command:

sox {input.mp3} {output.mp3} trim 0

After this conversion, the results may be much closer.
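For a whole directory of files, the same sox call can be scripted. The following is a minimal sketch, not part of this repo: `sox_trim_command` and `convert_all` are hypothetical helpers, and the script assumes sox with mp3 support is installed and on the PATH.

```python
import os
import subprocess


def sox_trim_command(src, dst):
    """Build the sox call from the note: sox {input.mp3} {output.mp3} trim 0."""
    return ["sox", src, dst, "trim", "0"]


def convert_all(src_dir, dst_dir, dry_run=False):
    """Re-encode every mp3 in src_dir into dst_dir and return the commands."""
    if not os.path.isdir(dst_dir):
        os.makedirs(dst_dir)
    commands = []
    for name in sorted(os.listdir(src_dir)):
        if not name.lower().endswith(".mp3"):
            continue  # skip anything that is not an mp3 file
        cmd = sox_trim_command(os.path.join(src_dir, name),
                               os.path.join(dst_dir, name))
        commands.append(cmd)
        if not dry_run:
            # Raises CalledProcessError if sox exits with an error.
            subprocess.check_call(cmd)
    return commands
```

Passing `dry_run=True` only builds and returns the commands, which is a cheap way to review them before anything is re-encoded. Writing to a separate output directory keeps the original files intact.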


For a demo, follow these steps:

i) Download the converted npy file demo.npy and place it under ./data/

ii) Extract features from multiple layers of the pretrained model, using the sound track loaded by Torch (Lua) audio. This sound track is identical to the one used by the Torch version.

python extract_feat.py -m {start layer number} -x {end layer number} -s

You can then compare the outputs with the Torch ones.

Feature Extraction

Minimum example

i) Download input file demo.mp3 and place it under ./data/

ii) Prepare a file list in txt format (demo.txt) that includes the input mp3 file(s) and place it under ./


iii) Extract features from the raw waveforms listed in demo.txt (make sure the demo mp3 is at ./data/demo.mp3):

python extract_feat.py -m {start layer number} -x {end layer number} -s -p extract -t demo.txt

More options

To extract features from multiple layers of a pretrained model using a downloaded mp3 dataset:

python extract_feat.py -t {dataset_txt_name} -m {start layer number} -x {end layer number} -s -p extract

For example, to extract layers 4 through 17 and save them as ./sound_out/tf_fea%02d.npy:

python extract_feat.py -o sound_out -m 4 -x 17 -s -p extract
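To check which files such a run should leave behind, the file names can be reconstructed from the pattern. This is a minimal sketch, assuming the %02d slot is filled with the layer index and that the -m and -x bounds are both inclusive; `expected_feature_paths` and `missing_features` are hypothetical helpers, not part of the repo.

```python
import os


def expected_feature_paths(out_dir, start_layer, end_layer):
    """List the tf_fea%02d.npy paths for layers start_layer..end_layer.

    Assumptions: the %02d field is the layer index and the range is inclusive.
    """
    return [os.path.join(out_dir, "tf_fea%02d.npy" % layer)
            for layer in range(start_layer, end_layer + 1)]


def missing_features(out_dir, start_layer, end_layer):
    """Return the expected feature files that are not on disk yet."""
    return [path
            for path in expected_feature_paths(out_dir, start_layer, end_layer)
            if not os.path.exists(path)]
```

Under these assumptions, `missing_features("sound_out", 4, 17)` returns an empty list once the extraction has written all fourteen files.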

More details are in:

python extract_feat.py -h


To train from an existing model:

python main.py 


To train from scratch:

python main.py -p train

To extract features:

python main.py -p extract -m {start layer number} -x {end layer number} -s

More details are in:

python main.py -h


TODO

  • Change audio loader to soundnet format
  • Make it compatible with Python 3
  • Batch Norm behaviour different from Torch
  • Fix conv8 padding issue in training phase
  • Change all config into tf.app.flags
  • Change dummy distribution of scene and object to useful placeholder
  • Add sound and feature loader from Data section

Known issues

  • Loaded audio length is not consistent between torch7 audio and librosa. Here is the issue
  • Training with very short audio clips makes conv8 fail because its output size would be negative
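The conv8 failure follows from the standard output-length formula for a strided 1-D convolution, floor((L + 2p - k) / s) + 1, which drops to zero or below once the padded input is shorter than the kernel. Here is a minimal sketch of that arithmetic; the kernel size 8 and stride 2 are illustrative assumptions, not values read from this repo's model definition.

```python
def conv_output_length(length, kernel, stride, padding=0):
    """Output length of a 1-D convolution: floor((L + 2p - k) / s) + 1."""
    return (length + 2 * padding - kernel) // stride + 1


def is_long_enough(length, kernel, stride, padding=0):
    """True if the layer yields at least one output step for this input."""
    return conv_output_length(length, kernel, stride, padding) >= 1


# Illustrative values only (assumed, not taken from the repo):
# an input of 5 steps cannot feed a kernel of size 8 without padding.
too_short = conv_output_length(5, kernel=8, stride=2)    # non-positive
just_enough = conv_output_length(8, kernel=8, stride=2)  # exactly one step
```

Applying the same function layer by layer, with the real kernel, stride, and padding of each layer, gives the shortest clip that still reaches conv8 with a positive length.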


  • Why does my loaded sound wave differ between torch7 audio and librosa? See my Wiki


Code is ported from soundnet, and the Torch7-TensorFlow loader is from tf_videogan. Thanks for their excellent work!


Hou-Ning Hu / @eborboihuc