SoundSpaces Challenge 2023

This repository contains the starter code for the 2023 challenge, details of the tasks, and the training and evaluation setups. For an overview of the SoundSpaces Challenge, visit soundspaces.org/challenge.

This year, we are hosting two challenges. The first is on the audio-visual navigation task [1], in which an agent must find a sound-making object in unmapped 3D environments using visual and auditory perception. The second is on the active audio-visual source separation task [3], in which an agent must separate a target sound-making object emitting a time-varying sound from an audio mixture of spatial, time-varying sounds from multiple sources.

Task

In AudioGoal navigation (AudioNav), an agent is spawned at a random starting position and orientation in an unseen environment. A sound-emitting object is also spawned at a random location in the same environment. At each time step, the agent receives a one-second audio input in the form of a waveform and must navigate to the target location. No ground-truth map is available; the agent must use only its sensory input (audio and RGB-D) to navigate.
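
As a rough sketch of what the agent sees at each step: the observation keys below ("audiogoal", "rgb", "depth") follow common SoundSpaces conventions but are assumptions here; the exact sensor names are set by the task config.

    def summarize(observations):
        # Assumed observation keys; check the challenge configs for the
        # exact sensor names used in your setup.
        waveform = observations["audiogoal"]  # binaural waveform for the last second
        rgb = observations["rgb"]             # egocentric color image
        depth = observations["depth"]         # egocentric depth image
        return waveform.shape, rgb.shape, depth.shape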

In Active Audio-Visual Separation (active AV separation), an agent is spawned at a random starting position and orientation in an unseen environment. Multiple sound-emitting objects, each emitting a time-varying sound, are also spawned at random locations in the same environment. At each time step, the agent receives a one-second audio input in the form of a waveform, which is a mixture of the spatial sounds from all sources, and must navigate so as to separate the audio of a target source, denoted by a target class label, at every step of its motion. No ground-truth map is available; the agent must use only its sensory input (audio and RGB) to navigate. The current version of the challenge considers separation scenarios like speech vs. speech and speech vs. music.
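
Purely as an illustration of the input (not the repository's data pipeline), the per-step audio observation can be thought of as the sum of the individually spatialized source waveforms:

    import numpy as np

    # Placeholder sources: in the real task, each is a spatialized
    # (binaural) one-second rendering of a time-varying source.
    sample_rate = 16000  # assumed for illustration
    sources = [np.random.randn(sample_rate) for _ in range(2)]
    mixture = np.sum(sources, axis=0)
    # At every step the agent must output an estimate of the target
    # source (identified by its class label) from this mixture.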

Dataset

The challenge will be conducted on the SoundSpaces Dataset, which is based on AI Habitat, Matterport3D, and Replica. For this challenge, we use the Matterport3D dataset due to the diversity and scale of its environments. This challenge focuses on evaluating agents' ability to generalize to unheard sounds and unseen environments. For AudioNav, the training and validation splits are the same as those used in the Unheard Sound experiments reported in the SoundSpaces paper; they can be downloaded from the SoundSpaces dataset page (including minival). For active AV separation, the training and validation splits are the same as those used in the Unheard Sound experiments reported in the Active AV Dynamic Separation paper.

Evaluation

For AudioNav, after calling the STOP action, the agent is evaluated using the 'Success weighted by Path Length' (SPL) metric [2]. An episode is deemed successful if, on calling the STOP action, the agent is within 0.36 m (twice the agent's radius) of the goal position.
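
SPL, as defined in [2], weights each episode's binary success by the ratio of the shortest-path distance to the distance the agent actually traveled. A minimal reference computation (a sketch for intuition, not the challenge's evaluation code, which lives in eval.py):

    import numpy as np

    def spl(successes, shortest_dists, agent_dists):
        # SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i), where S_i is
        # binary success, l_i the shortest-path distance from start to
        # goal, and p_i the length of the path the agent actually took.
        s = np.asarray(successes, dtype=float)
        l = np.asarray(shortest_dists, dtype=float)
        p = np.asarray(agent_dists, dtype=float)
        return float(np.mean(s * l / np.maximum(p, l)))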

For active AV separation, the agent is evaluated using the 'Scale-invariant source-to-distortion ratio' (SI-SDR) metric, averaged over the whole agent trajectory.
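
SI-SDR measures separation quality independent of the estimate's overall scale: the estimate is projected onto the reference to obtain an optimally scaled target, and the target energy is compared against the residual energy. A minimal sketch (not the challenge's evaluation code):

    import numpy as np

    def si_sdr(estimate, reference, eps=1e-8):
        # Zero-mean both signals, as is standard for SI-SDR.
        estimate = estimate - estimate.mean()
        reference = reference - reference.mean()
        # Optimal scaling of the reference toward the estimate.
        alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
        target = alpha * reference
        residual = estimate - target
        return 10.0 * np.log10((np.sum(target ** 2) + eps) / (np.sum(residual ** 2) + eps))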

Participation Guidelines

Participate in the contest by registering on the EvalAI challenge page and creating a team. Participants will upload JSON files containing the evaluation metric values for both challenges. For AudioNav, participants will also upload the trajectories executed by their model, which will be used to validate the submitted performance values. For active AV separation, the winning teams will later be asked to turn in their code and checkpoints for inspection. Suspicious submissions will be reviewed, and if necessary, the participating team will be disqualified. Instructions for evaluation and online submission are provided below.

Evaluation

For AudioNav,

  1. Clone the challenge repository:

    git clone https://github.com/facebookresearch/soundspaces-challenge.git
    cd soundspaces-challenge
  2. Implement your own agent or try one of ours. We provide an agent in agent.py that takes random actions:

    import os

    import numpy

    import habitat
    import soundspaces

    class RandomAgent(habitat.Agent):
        def __init__(self, task_config):
            # Read the action space from the task config so valid
            # action indices can be sampled.
            self._POSSIBLE_ACTIONS = task_config.TASK.POSSIBLE_ACTIONS

        def reset(self):
            pass

        def act(self, observations):
            # Ignore the observations and pick a random action index.
            return numpy.random.choice(len(self._POSSIBLE_ACTIONS))

    def main():
        # The challenge config is supplied via the CHALLENGE_CONFIG_FILE
        # environment variable (see step 4 below).
        config = habitat.get_config(os.environ["CHALLENGE_CONFIG_FILE"])
        agent = RandomAgent(task_config=config)
        challenge = soundspaces.Challenge()
        challenge.submit(agent)
  3. Follow the instructions for downloading the SoundSpaces dataset and place all data under the data/ folder.

  4. Evaluate the random agent locally:

    env CHALLENGE_CONFIG_FILE="configs/challenge_random.local.yaml" python agent.py 

    This calls eval.py, which dumps a JSON file containing a Python dictionary of the following form:

    eval_dict = {
        "ACTIONS": {
            f"{scene_id_1}_{episode_id_1}": [action_1_1, ..., 0],
            f"{scene_id_2}_{episode_id_2}": [action_2_1, ..., 0],
        },
        "SPL": average_spl,
        "SOFT_SPL": average_softspl,
        "DISTANCE_TO_GOAL": average_distance_to_goal,
        "SUCCESS": average_success,
    }

    Make sure the JSON dumped when evaluating your agent has exactly this structure. The easiest way to ensure that is to not modify eval.py.
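
If you do end up modifying the evaluation pipeline, a quick sanity check of the dumped file before submitting can save a failed upload. A minimal sketch (the file path is a placeholder; the keys come from the dictionary above):

    import json

    with open("eval_output.json") as f:  # placeholder path
        eval_dict = json.load(f)

    # Top-level keys from the expected dictionary format.
    for key in ("ACTIONS", "SPL", "SOFT_SPL", "DISTANCE_TO_GOAL", "SUCCESS"):
        assert key in eval_dict, f"missing key: {key}"

    # Every trajectory is keyed by "{scene_id}_{episode_id}" and must end
    # with the STOP action (0).
    for episode_key, actions in eval_dict["ACTIONS"].items():
        assert actions and actions[-1] == 0, f"{episode_key} does not end with STOP"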

For active AV separation, follow instructions in the challenge branch of the active-AV-dynamic-separation repository.

Online submission

Follow the instructions in the submit tab of the EvalAI challenge page (will open soon!) to upload your evaluation JSON file.

Valid challenge phases are AudioNav {Minival, Test-Standard} Phase and AudioSep Test-Standard Phase.

The challenge consists of the following phases:

  1. AudioNav Minival Phase: This split is the same as the one used in ./test_locally_audionav_rgbd.sh. The purpose of this phase/split is sanity checking -- to confirm that your online submission to EvalAI doesn't run into any issue during evaluation. Each team is allowed a maximum of 30 submissions per day for this phase.
  2. AudioNav Test-Standard Phase: The purpose of this phase is to serve as the public leaderboard establishing the state of the art for AudioNav; this is what should be used to report results in papers. The relevant split for this phase is test_multiple_unheard. Each team is allowed a maximum of 10 submissions per day for this phase. As a reminder, the submitted trajectories will be used to validate the submitted performance values. Suspicious submissions will be reviewed, and if necessary, the participating team will be disqualified.
  3. AudioSep Test-Standard Phase: The purpose of this phase is to serve as the public leaderboard establishing the state of the art for active AV separation; this is what should be used to report results in papers. The relevant split for this phase is testUnheard_1000episodes. Each team is allowed a maximum of 30 submissions per day for this phase. As a reminder, the winning teams of the active AV separation challenge will later be asked to turn in their code and checkpoints for inspection. Suspicious submissions will be reviewed, and if necessary, the participating team will be disqualified.

Note: If you face any issues or have questions, reach out by emailing the organizers or opening an issue on this repository.

Baselines and Starter Code

  1. AudioNav: We have included configs and Python scripts for both av-nav and av-wan. Note that the MapNav environment used by av-wan is baked into the environment container and can't be changed. If you want to modify mapping or planning, we suggest re-writing that planning loop in the agent code.

  2. Active AV Separation: We have included configs and Python scripts in the challenge branch of the active-AV-dynamic-separation repository.

Acknowledgments

We thank the Habitat team for the challenge template.

References

[1] SoundSpaces: Audio-Visual Navigation in 3D Environments. Changan Chen*, Unnat Jain*, Carl Schissler, Sebastia Vicenc Amengual Gari, Ziad Al-Halah, Vamsi Krishna Ithapu, Philip Robinson, Kristen Grauman. ECCV, 2020.

[2] On evaluation of embodied navigation agents. Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, Amir R. Zamir. arXiv:1807.06757, 2018.

[3] Active Audio-Visual Separation of Dynamic Sound Sources. Sagnik Majumder, Kristen Grauman. ECCV, 2022.

License

This repo is MIT licensed, as found in the LICENSE file.
