Xudong Xu · Dejan Marković · Jacob Sandakly · Todd Keebler · Steven Krenn · Alexander Richard
Advances in Neural Information Processing Systems 36 (NeurIPS 2023)
The Sounding Bodies dataset is hosted on AWS S3. We recommend using the AWS command line interface (see AWS CLI installation instructions).
To download the dataset run:
aws s3 cp --recursive --no-sign-request s3://fb-baas-f32eacb9-8abb-11eb-b2b8-4857dd089e15/SoundingBodies/ SoundingBodies/
or use sync to avoid transferring existing files:
aws s3 sync --no-sign-request s3://fb-baas-f32eacb9-8abb-11eb-b2b8-4857dd089e15/SoundingBodies/ SoundingBodies/
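If you prefer to trigger the download from Python (for example inside a setup script), the sync command above can be wrapped as in the sketch below. The helper names build_sync_command and download are hypothetical and not part of this repository; the bucket URI and flags are copied from the command above.

```python
import shutil
import subprocess

# Bucket path copied verbatim from the command above.
S3_URI = "s3://fb-baas-f32eacb9-8abb-11eb-b2b8-4857dd089e15/SoundingBodies/"


def build_sync_command(dest="SoundingBodies/", s3_uri=S3_URI):
    """Return the argv list for the `aws s3 sync` call shown above."""
    return ["aws", "s3", "sync", "--no-sign-request", s3_uri, dest]


def download(dest="SoundingBodies/"):
    """Run the sync; raises CalledProcessError if the AWS CLI exits non-zero."""
    if shutil.which("aws") is None:
        raise RuntimeError("AWS CLI not found on PATH; install it first")
    subprocess.run(build_sync_command(dest), check=True)
```

Calling download("/data/SoundingBodies/") would sync into that directory; the path is only an example.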
The dataset takes around 680GB of space. If necessary, adjust data_dir and mic_loc_file in configs/config_main.py to point to your download location.
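Given the size of the dataset, it can be worth checking free disk space before starting the download. The following is a minimal sketch using only the Python standard library; has_room is a hypothetical helper, and the 680 GB figure is the approximate size stated above.

```python
import shutil

REQUIRED_GB = 680  # approximate dataset size stated above


def has_room(path=".", required_gb=REQUIRED_GB):
    """True if the filesystem containing `path` has at least `required_gb` GB free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


if __name__ == "__main__":
    print("enough space" if has_room(".") else "not enough free space")
```

Pass the directory you intend to use as data_dir instead of "." to check the right filesystem.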
NOTE: The published dataset does not include speech data from subject7 and has no data from subject8. Relative to the data used in the paper, this reduces the total capture time from 4.4 hours to 3.6 hours. Below we provide a pretrained model and updated evaluation numbers for the published dataset.
Third-party dependencies:
- tqdm
- numpy
- gitpython
- mmcv
- torch
- torchaudio
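To see which of the dependencies listed above are missing from the current environment, a small check like the one below can help. It is a sketch, not part of the repository; missing_dependencies is a hypothetical helper. Note that the gitpython package is imported under the module name git.

```python
import importlib.util

# pip package name -> module name used in `import`
# (gitpython is imported as `git`; the others match their package names).
DEPENDENCIES = {
    "tqdm": "tqdm",
    "numpy": "numpy",
    "gitpython": "git",
    "mmcv": "mmcv",
    "torch": "torch",
    "torchaudio": "torchaudio",
}


def missing_dependencies(deps=DEPENDENCIES):
    """Return the pip names of packages whose modules cannot be found."""
    return [pkg for pkg, module in deps.items()
            if importlib.util.find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_dependencies()
    print("missing:", ", ".join(missing) if missing else "none")
```

The check only tests whether each module can be located; it does not verify versions.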
To train the network, run:
python train.py --config configs/config_main.py
To evaluate the performance of the model, set test_info_file in configs/config_main.py to the desired test set (./data_info/test/nonspeech_data.json for non-speech data, or ./data_info/test/speech_data.json for speech data) and run:
python evaluate.py --config configs/config_main.py --test_epoch best-accumulated_loss --out_name test
To save the output .wav files, add the --save option, for example:
python evaluate.py --config configs/config_main.py --test_epoch epoch-100 --out_name test --save
We provide the model trained on the published training set in ./checkpoint/neurips/pretrained/. To evaluate the model, run:
python evaluate.py --config configs/config_pretrained.py --test_epoch best-accumulated_loss --out_name neurips_evaluation
The updated evaluation metrics are:
NON-SPEECH
SDR: 3.052
amplitude (x10^3): 0.832
phase: 0.314
SPEECH
SDR: 9.635
amplitude (x10^3): 0.701
phase: 0.464
NOTE: For the speech metrics reported in the paper, speech audio data was erroneously amplified by a factor of 10. As a result, the amplitude error was multiplied by 10, and the phase error was higher because more silence/noise segments passed the energy threshold.
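The effect of a constant gain on an amplitude-type error can be illustrated with a toy example. The sketch below is not the evaluation code of this repository: mean_abs_error is a simplified time-domain stand-in for the amplitude metric, sdr_db uses a common textbook definition of SDR, the signals are synthetic, and it assumes that reference and prediction are scaled by the same gain. The thresholding effect on the phase error is not modeled.

```python
import math


def sdr_db(ref, est):
    """Signal-to-distortion ratio in dB: 10*log10(|ref|^2 / |ref - est|^2)."""
    signal = sum(r * r for r in ref)
    noise = sum((r - e) ** 2 for r, e in zip(ref, est))
    return 10.0 * math.log10(signal / noise)


def mean_abs_error(ref, est):
    """Simplified stand-in for an amplitude error (linear in signal scale)."""
    return sum(abs(r - e) for r, e in zip(ref, est)) / len(ref)


# Toy signals, not dataset audio.
ref = [math.sin(0.1 * n) for n in range(1000)]
est = [0.9 * math.sin(0.1 * n + 0.05) for n in range(1000)]

gain = 10.0  # the erroneous amplification factor mentioned above
ref_amp = [gain * x for x in ref]
est_amp = [gain * x for x in est]

ratio = mean_abs_error(ref_amp, est_amp) / mean_abs_error(ref, est)
sdr_shift = sdr_db(ref_amp, est_amp) - sdr_db(ref, est)
print(f"error ratio: {ratio:.3f}, SDR change: {sdr_shift:.3f} dB")
```

Under these assumptions the linear error scales with the gain, while a ratio-based measure such as this SDR is unchanged up to floating-point rounding.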
If you use this code or the dataset, please cite:
@inproceedings{xu2023soundingbodies,
title={Sounding Bodies: Modeling 3D Spatial Sound of Humans Using Body Pose and Audio},
author={Xu, Xudong and Markovic, Dejan and Sandakly, Jacob and Keebler, Todd and Krenn, Steven and Richard, Alexander},
booktitle={Conference on Neural Information Processing Systems},
year={2023}
}
The code and dataset are released under the CC-BY-NC 4.0 license.