
🌸 SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information

The official GitHub page of the paper "SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information"

  • Authors: Chih-Kai Yang, Neo S. Ho*, Yen-Ting Piao*, Hung-yi Lee (*Equal Contribution)
  • Affiliation: National Taiwan University
  • Accepted to Interspeech 2025
  • Paper link: https://arxiv.org/abs/2505.13237

Overview

[SAKURA logo]

Abstract

TL;DR: We propose SAKURA, a benchmark for evaluating the multi-hop reasoning of large audio-language models (LALMs), and reveal their deficiencies in such reasoning skills.

Large audio-language models (LALMs) extend large language models with multimodal understanding in speech, audio, etc. While their performance on speech and audio-processing tasks is extensively studied, their reasoning abilities remain underexplored. In particular, their multi-hop reasoning, the ability to recall and integrate multiple facts, lacks systematic evaluation. Existing benchmarks focus on general speech and audio-processing tasks, conversational abilities, and fairness but overlook this aspect. To bridge this gap, we introduce SAKURA, a benchmark assessing LALMs’ multi-hop reasoning based on speech and audio information. Results show that LALMs struggle to integrate speech/audio representations for multi-hop reasoning, even when they extract the relevant information correctly, highlighting a fundamental challenge in multimodal reasoning. Our findings expose a critical limitation in LALMs, offering insights and resources for future research.

🌟 Key findings

  • LALMs exhibit blind spots for certain speech/audio attributes, failing to recognize them accurately.
  • LALMs are unable to perform multi-hop reasoning based on attribute information they have correctly extracted from audio.
  • We find that they can reason well in the textual modality, indicating that the above failures are not due to a lack of knowledge, but rather insufficient integration of multimodal information.

News

  • [2025.06.09] Our dataset is released on HuggingFace! You can now access the data of all four tracks there.
  • [2025.05.22] Our paper is now available on arXiv.
  • [2025.05.19] Our paper is accepted to Interspeech 2025! See you in Rotterdam!

Baselines

Here are the baseline models included in the paper. Our implementations are based on the official releases.

  • LTU-AS
  • GAMA-IT
    • GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities [arXiv, GitHub]
  • SALMONN
    • SALMONN: Towards Generic Hearing Abilities for Large Language Models [arXiv, GitHub]
  • DeSTA2
    • DeSTA2: Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [arXiv, GitHub]
  • Qwen-Audio-Chat
    • Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models [arXiv, GitHub]
  • Qwen2-Audio-Instruct
  • GPT-4o Audio
  • Gemini-1.5-flash and Gemini-1.5-pro
    • Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context [arXiv, website]

Evaluation

SAKURA uses the LLM-as-a-judge approach for automatic evaluation. We provide the evaluation code, models, and prompts in the evaluation/ directory. For more details on how to evaluate your models with SAKURA, please refer to this directory.
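
For illustration, here is a minimal sketch of an LLM-as-a-judge scoring step. The judge model name and prompt wording below are placeholders of our own; for official scores, please use the evaluator models and prompts provided in evaluation/.

```python
# Minimal LLM-as-a-judge sketch. The judge model and prompt text are placeholders;
# the official evaluator models and prompts live in evaluation/.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def judge(question: str, reference: str, prediction: str) -> bool:
    """Ask the judge LLM whether a model's prediction matches the reference answer."""
    prompt = (
        "You are grading an answer to a question.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {prediction}\n"
        "Reply with exactly 'correct' or 'incorrect'."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower().startswith("correct")


# A track's score is then the fraction of questions judged correct.
```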

Leaderboard

Here is the leaderboard of SAKURA, which currently contains only the baseline LALMs from the paper. Each track reports scores on single-hop (Single) and multi-hop (Multi) questions.

| Model | Size | Gender (Single) | Gender (Multi) | Language (Single) | Language (Multi) | Emotion (Single) | Emotion (Multi) | Animal (Single) | Animal (Multi) | Average (Single) | Average (Multi) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LTU-AS | 7B | 52.4 | 19.6 | 16.8 | 11.4 | 28.6 | 19.6 | 65.6 | 21.8 | 40.9 | 18.1 |
| GAMA-IT | 7B | 76.4 | 39.8 | 5.6 | 19.4 | 5.6 | 24.2 | 85.2 | 51.4 | 43.2 | 33.7 |
| SALMONN | 7.5B | 59.8 | 48.6 | 21.8 | 29.6 | 19.8 | 28.2 | 68.6 | 34.6 | 42.5 | 35.3 |
| DeSTA2 | 8.3B | 88.4 | 85.2 | 94.2 | 75.4 | 34.8 | 36.4 | 34.4 | 31.2 | 63.0 | 57.1 |
| Qwen-Audio-Chat | 8.4B | 49.6 | 43.8 | 87.6 | 40.6 | 63.2 | 37.0 | 92.2 | 66.0 | 73.2 | 46.9 |
| Qwen2-Audio-Instruct | 8.4B | 88.0 | 47.2 | 83.8 | 48.0 | 64.2 | 39.8 | 88.8 | 61.4 | 81.2 | 49.1 |
| GPT-4o Audio | - | - | - | 95.2 | 83.6 | 38.2 | 23.8 | 80.6 | 55.4 | 71.3 | 54.3 |
| Gemini-1.5-flash | - | 77.0 | 24.2 | 98.2 | 79.8 | 24.6 | 19.4 | 27.2 | 16.2 | 56.8 | 34.9 |
| Gemini-1.5-pro | - | 74.0 | 43.4 | 97.2 | 90.6 | 39.2 | 24.0 | 42.0 | 28.6 | 63.1 | 46.6 |
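
As a reading aid: the Average columns appear to be the unweighted mean of the four track scores, computed over the available tracks when a cell is missing ("-"). A quick check:

```python
# Sanity check: the Average column matches the unweighted mean over the tracks
# that have a score (cells shown as "-" are skipped).
def track_average(scores):
    """scores: the four per-track values of one row, with None for missing cells."""
    available = [s for s in scores if s is not None]
    return round(sum(available) / len(available), 1)

# Example: GPT-4o Audio, Multi columns (Gender is missing).
print(track_average([None, 83.6, 23.8, 55.4]))  # -> 54.3
```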

How to Participate

We sincerely invite you to participate in SAKURA! Here are the steps to join:

  • Download the data in data/ and generate predictions with your models (see the sketch after this list for a minimal example). We recommend saving the predictions in the format specified in evaluation/ so that the evaluation code can be used directly without significant modifications.
  • Obtain the scores for your models using the evaluation code in evaluation/. We highly recommend using the LLM evaluator and the prompts provided in evaluation/ to ensure consistency in evaluation.
  • You can open an issue and report your model’s scores. We prefer that you also provide either the model checkpoints (along with instructions on how to use them) or the prediction files to help us verify the reported scores.
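
Below is a minimal sketch of the first two steps. The HuggingFace dataset ID, split name, field names, and output file layout here are hypothetical placeholders; the exact data layout and prediction format are documented in data/ and evaluation/.

```python
# Minimal sketch of generating and saving predictions for one track.
# NOTE: the dataset ID, split, field names, and output layout are hypothetical
# placeholders; follow data/ and evaluation/ for the format SAKURA actually expects.
import json

from datasets import load_dataset


def run_inference(generate_fn, dataset_id="your-org/sakura-animal", split="test"):
    """generate_fn(audio, question) -> str wraps your LALM's inference call."""
    dataset = load_dataset(dataset_id, split=split)
    predictions = []
    for example in dataset:
        answer = generate_fn(example["audio"], example["question"])  # placeholder fields
        predictions.append({"id": example["id"], "prediction": answer})
    return predictions


if __name__ == "__main__":
    preds = run_inference(lambda audio, question: "dummy answer")  # plug in your model
    with open("predictions_animal.json", "w") as f:
        json.dump(preds, f, ensure_ascii=False, indent=2)
```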

Citation

If you find SAKURA helpful for your research, please consider citing our paper.

@article{yang2025sakuramultihopreasoninglarge,
      title={SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information}, 
      author={Chih-Kai Yang and Neo Ho and Yen-Ting Piao and Hung-yi Lee},
      year={2025},
      eprint={2505.13237},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2505.13237}, 
}
