
DAVEnet PyTorch

Implementation in PyTorch of the DAVEnet (Deep Audio-Visual Embedding network) model, as described in

David Harwath, Adrià Recasens, Dídac Surís, Galen Chuang, Antonio Torralba, and James Glass, "Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input," ECCV 2018
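The model in the paper above embeds spoken captions and images into a shared space, and local similarities between audio frames and image regions form a "matchmap". As a rough illustration of that idea only (toy dimensions, pure Python, not the model's actual code), the matchmap is the dot product between every audio-frame embedding and every image-cell embedding:

```python
# Toy sketch of the matchmap idea from Harwath et al. (ECCV 2018):
# similarity between every audio frame and every spatial image cell.
# All dimensions and values here are illustrative only.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# audio: T frames x D dims; image: H*W cells x D dims (flattened grid)
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]              # T=3, D=2
image = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]  # H*W=4, D=2

matchmap = [[dot(a, v) for v in image] for a in audio]    # T x (H*W)

# One way to pool the matchmap into a single clip/image similarity:
# average the per-frame maxima over image cells.
score = sum(max(row) for row in matchmap) / len(matchmap)
print(score)  # -> 1.333... for these toy values
```

In practice the embeddings come from the audio and image branches of the trained network; this sketch only shows how the local similarities combine into one score.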

Requirements
  • pytorch
  • torchvision
  • librosa

Data
You will need the PlacesAudio400k spoken caption corpus in addition to the Places205 image dataset:

Please follow the instructions provided in the PlacesAudio400k download package on how to configure and specify the dataset .json files.
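For orientation only, a dataset .json of this kind typically maps image files to their spoken-caption audio files. The field names below are hypothetical; the actual schema is defined by the PlacesAudio400k download package, so defer to its instructions.

```python
import json

# Hypothetical shape of a PlacesAudio400k-style dataset .json file.
# All field names here are illustrative, not the real schema.
example = {
    "image_base_path": "/data/places205/images/",
    "audio_base_path": "/data/placesaudio400k/wavs/",
    "data": [
        {"image": "a/abbey/00000001.jpg", "wav": "utt_00000001.wav"},
        {"image": "b/beach/00000002.jpg", "wav": "utt_00000002.wav"},
    ],
}

# Round-trip through JSON, as a dataloader would when reading the file.
loaded = json.loads(json.dumps(example))
pairs = loaded["data"]
print(len(pairs))  # -> 2 image/audio pairs
```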

Model Training

python run.py --data-train train.json --data-val val.json

where train.json and val.json are the metadata files included in the PlacesAudio400k dataset.

See the script for more training options.
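As a sketch of how such a command line is typically parsed, the snippet below uses argparse with the two flags shown above. Only --data-train and --data-val come from this README; the extra --batch-size option and its default are purely illustrative, so check the script itself for the real options.

```python
import argparse

# Hedged sketch of a training entry point's argument parsing.
# --data-train / --data-val match the command above; --batch-size
# is an invented example option, not necessarily the script's.
parser = argparse.ArgumentParser(description="DAVEnet training (sketch)")
parser.add_argument("--data-train", type=str, required=True,
                    help="path to the training .json file")
parser.add_argument("--data-val", type=str, required=True,
                    help="path to the validation .json file")
parser.add_argument("--batch-size", type=int, default=100,
                    help="illustrative option only")

# Parse the example command line from this README.
args = parser.parse_args(["--data-train", "train.json",
                          "--data-val", "val.json"])
print(args.data_train, args.data_val, args.batch_size)
```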
