Audio-Visual Spatial Integration and Recursive Attention for Robust Sound Source Localization (SIRA-SSL)

Official PyTorch implementation of the paper "Audio-Visual Spatial Integration and Recursive Attention for Robust Sound Source Localization".

Accepted at ACMMM 2023:

[Paper] [arXiv]

Abstract

The objective of the sound source localization task is to enable machines to detect the location of sound-making objects within a visual scene. While the audio modality provides spatial cues to locate the sound source, existing approaches only use audio in an auxiliary role to compare spatial regions of the visual modality. Humans, on the other hand, utilize both audio and visual modalities as spatial cues to locate sound sources. In this paper, we propose an audio-visual spatial integration network that integrates spatial cues from both modalities to mimic human behavior when detecting sound-making objects. Additionally, we introduce a recursive attention network to mimic the human behavior of iteratively focusing on objects, resulting in more accurate attention regions. To effectively encode spatial information from both modalities, we propose an audio-visual pair matching loss and a spatial region alignment loss. By utilizing the spatial cues of audio-visual modalities and recursively focusing on objects, our method can perform more robust sound source localization. Comprehensive experimental results on the Flickr SoundNet and VGG-Sound Source datasets demonstrate the superiority of our proposed method over existing approaches.

Dataset

Flickr SoundNet

Download the Flickr SoundNet dataset from here

VGG-Sound Source

Download the VGG-Sound Source dataset from here

Training

Modify train.sh

trainset="dataset_to_train"
testset="dataset_to_test"
train_data_path="path_to_trainset"
gt_path="path_to_ground_truth"
etc...
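
Before launching a long run, it can help to confirm that the configured paths exist. The following is a minimal sketch, not part of this repository: the helper `check_path` is hypothetical, the variable names are taken from the list above, and the values are the same placeholders, to be replaced with your own paths.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check; this helper is NOT part of the repository.
# Variable names mirror those listed for train.sh; values are placeholders.
train_data_path="path_to_trainset"
gt_path="path_to_ground_truth"

check_path() {
    # $1: variable name (used in the message), $2: path to verify
    if [ -e "$2" ]; then
        echo "ok: $1 -> $2"
    else
        echo "missing: $1 -> $2" >&2
        return 1
    fi
}

# status stays 0 only if every configured path exists
status=0
check_path train_data_path "$train_data_path" || status=1
check_path gt_path "$gt_path" || status=1
```

If `status` is non-zero after these checks, at least one path still needs to be corrected in train.sh.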

Run train.sh

./train.sh

Evaluation

Modify test.sh

testset="dataset_to_test"
test_data_path="path_to_testset"
gt_path="path_to_ground_truth"
etc...

Run test.sh

./test.sh

Citation

If you find this code useful for your research, please cite our paper:

@inproceedings{um2023audio,
  title={Audio-Visual Spatial Integration and Recursive Attention for Robust Sound Source Localization},
  author={Um, Sung Jin and Kim, Dongjin and Kim, Jung Uk},
  booktitle={Proceedings of the 31st ACM International Conference on Multimedia},
  pages={3507--3516},
  year={2023}
}

Acknowledgement

Our code is based on Attention, HardWay, and HearTheFlow. We thank the authors for their great work and for sharing their code.
