🌸 SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information
The official GitHub page of the paper "SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information"
- Authors: Chih-Kai Yang, Neo S. Ho*, Yen-Ting Piao*, Hung-yi Lee (*Equal Contribution)
- Affiliation: National Taiwan University
- Accepted to Interspeech 2025
- Paper link: https://arxiv.org/abs/2505.13237
TL;DR: We propose SAKURA, a benchmark for the multi-hop reasoning of LALMs, and reveal that current LALMs are deficient in this skill.
Large audio-language models (LALMs) extend large language models with multimodal understanding of speech, audio, etc. While their performance on speech and audio-processing tasks has been extensively studied, their reasoning abilities remain underexplored. In particular, their multi-hop reasoning, the ability to recall and integrate multiple facts, lacks systematic evaluation. Existing benchmarks focus on general speech and audio-processing tasks, conversational abilities, and fairness but overlook this aspect. To bridge this gap, we introduce SAKURA, a benchmark assessing LALMs’ multi-hop reasoning based on speech and audio information. Results show that LALMs struggle to integrate speech/audio representations for multi-hop reasoning, even when they extract the relevant information correctly, highlighting a fundamental challenge in multimodal reasoning. Our findings expose a critical limitation in LALMs, offering insights and resources for future research.
🌟 Key findings
- LALMs exhibit blind spots for certain speech/audio attributes, failing to recognize them accurately.
- LALMs are unable to perform multi-hop reasoning based on attribute information they have correctly extracted from audio.
- We find that they can reason well in the textual modality, indicating that the above failures are not due to a lack of knowledge, but rather insufficient integration of multimodal information.
- [2025.06.09] Our dataset is released on HuggingFace! The data for all four tracks is now available there.
- [2025.05.22] Our paper is now available on arXiv.
- [2025.05.19] Our paper is accepted to Interspeech 2025! See you in Rotterdam!
Here are the baseline models we included in the paper. The implementations were based on the official ones.
- LTU-AS
- GAMA-IT
- SALMONN
- DeSTA2
- Qwen-Audio-Chat
- Qwen2-Audio-Instruct
- GPT-4o Audio
- Gemini-1.5-flash and Gemini-1.5-pro
SAKURA uses the LLM-as-a-judge approach for automatic evaluation. We provide the evaluation code, models, and prompts in the evaluation/ directory. For more details on how to evaluate your models with SAKURA, please refer to this directory.
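As a rough illustration of the LLM-as-a-judge idea, the sketch below scores a list of model responses with a pluggable judge. It is not the code in evaluation/: the prompt template, the `question`/`reference`/`response` field names, the one-word verdict format, and the `judge` callable are all assumptions made for this example. Use the official prompts and evaluator in evaluation/ for reportable scores.

```python
from typing import Callable, Iterable

# Hypothetical prompt template; the official prompts live in evaluation/.
JUDGE_TEMPLATE = (
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Model response: {response}\n"
    "Does the model response agree with the reference answer? "
    "Reply with exactly one word: correct or incorrect."
)


def parse_verdict(judge_output: str) -> bool:
    """Map the judge's free-text reply to a boolean verdict."""
    text = judge_output.strip().lower()
    # Check "incorrect" first because it contains "correct" as a substring.
    if text.startswith("incorrect"):
        return False
    return text.startswith("correct")


def accuracy(examples: Iterable[dict], judge: Callable[[str], str]) -> float:
    """Percentage of examples that the judge marks as correct."""
    verdicts = [
        parse_verdict(judge(JUDGE_TEMPLATE.format(**example)))
        for example in examples
    ]
    return 100.0 * sum(verdicts) / len(verdicts)
```

Here `judge` is any function that takes the filled-in prompt and returns the judge model's reply as a string, so the same loop works with whichever LLM backend you wrap.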
Here is the SAKURA leaderboard, which currently contains only the baseline LALMs from the paper.
| Model | Size | Gender (Single) | Gender (Multi) | Language (Single) | Language (Multi) | Emotion (Single) | Emotion (Multi) | Animal (Single) | Animal (Multi) | Average (Single) | Average (Multi) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LTU-AS | 7B | 52.4 | 19.6 | 16.8 | 11.4 | 28.6 | 19.6 | 65.6 | 21.8 | 40.9 | 18.1 |
| GAMA-IT | 7B | 76.4 | 39.8 | 5.6 | 19.4 | 5.6 | 24.2 | 85.2 | 51.4 | 43.2 | 33.7 |
| SALMONN | 7.5B | 59.8 | 48.6 | 21.8 | 29.6 | 19.8 | 28.2 | 68.6 | 34.6 | 42.5 | 35.3 |
| DeSTA2 | 8.3B | 88.4 | 85.2 | 94.2 | 75.4 | 34.8 | 36.4 | 34.4 | 31.2 | 63.0 | 57.1 |
| Qwen-Audio-Chat | 8.4B | 49.6 | 43.8 | 87.6 | 40.6 | 63.2 | 37.0 | 92.2 | 66.0 | 73.2 | 46.9 |
| Qwen2-Audio-Instruct | 8.4B | 88.0 | 47.2 | 83.8 | 48.0 | 64.2 | 39.8 | 88.8 | 61.4 | 81.2 | 49.1 |
| GPT-4o Audio | - | - | - | 95.2 | 83.6 | 38.2 | 23.8 | 80.6 | 55.4 | 71.3 | 54.3 |
| Gemini-1.5-flash | - | 77.0 | 24.2 | 98.2 | 79.8 | 24.6 | 19.4 | 27.2 | 16.2 | 56.8 | 34.9 |
| Gemini-1.5-pro | - | 74.0 | 43.4 | 97.2 | 90.6 | 39.2 | 24.0 | 42.0 | 28.6 | 63.1 | 46.6 |
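The Average columns in the table appear to be the unweighted mean over the tracks that have a reported score (for GPT-4o Audio, the three tracks other than Gender). The sketch below reproduces two rows under that assumption; the function names are illustrative and not part of the repository.

```python
from statistics import mean
from typing import Optional, Sequence


def track_average(scores: Sequence[Optional[float]]) -> float:
    """Mean over tracks with a reported score; None marks a missing track."""
    reported = [s for s in scores if s is not None]
    return round(mean(reported), 1)


def hop_gap(single: Sequence[Optional[float]], multi: Sequence[Optional[float]]) -> float:
    """Drop from the single-hop average to the multi-hop average."""
    return round(track_average(single) - track_average(multi), 1)


# Track order: Gender, Language, Emotion, Animal (values copied from the table).
qwen2_single = [88.0, 83.8, 64.2, 88.8]
qwen2_multi = [47.2, 48.0, 39.8, 61.4]
gpt4o_single = [None, 95.2, 38.2, 80.6]
gpt4o_multi = [None, 83.6, 23.8, 55.4]

print(track_average(qwen2_single), track_average(qwen2_multi))  # 81.2 49.1
print(track_average(gpt4o_single), track_average(gpt4o_multi))  # 71.3 54.3
print(hop_gap(qwen2_single, qwen2_multi))                       # 32.1
```

The gap between the two averages is one simple way to read the table: a large drop means the model often identifies the attribute in the single-hop setting but fails once a second reasoning step is required.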
We sincerely invite you to participate in SAKURA! To submit your model, follow these steps:
- Download the data in data/ and generate predictions with your models. It is recommended to save the predictions in the format specified in evaluation/ so that the evaluation code can be used directly without significant modifications.
- Obtain the scores for your models using the evaluation code in evaluation/. We highly recommend using the LLM evaluator and the prompts provided in evaluation/ to ensure consistency in evaluation.
- You can open an issue and report your model’s scores. We prefer that you also provide either the model checkpoints (along with instructions on how to use them) or the prediction files to help us verify the reported scores.
If you find SAKURA helpful for your research, please consider citing our paper.
@article{yang2025sakuramultihopreasoninglarge,
title={SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information},
author={Chih-Kai Yang and Neo Ho and Yen-Ting Piao and Hung-yi Lee},
year={2025},
eprint={2505.13237},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2505.13237},
}