GaMMA: Towards Joint Global-Temporal Music Understanding in Large Multimodal Models

Overview

We introduce GaMMA, a large multimodal model designed to jointly handle global music understanding and temporal music reasoning within a unified parameter space. Built on a streamlined encoder-decoder paradigm, GaMMA combines a language model with dual audio experts and a gated fusion mechanism to capture both non-temporal musical semantics and time-dependent musical structure. A progressive training pipeline of pretraining, supervised fine-tuning, and reinforcement learning further strengthens instruction following, full-song understanding, and temporal reasoning.
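To make the gated-fusion idea concrete, here is a minimal, dependency-free sketch of how a learned gate can blend features from two expert streams. All names, shapes, and the per-dimension gating form are illustrative assumptions, not GaMMA's actual implementation; see the paper and code for the real operator.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(global_feat, temporal_feat, gate_weights, gate_bias):
    """Blend features from a global expert and a temporal expert.

    Per dimension i:
        gate_i  = sigmoid(w_i[0] * global_i + w_i[1] * temporal_i + b_i)
        fused_i = gate_i * global_i + (1 - gate_i) * temporal_i

    gate_weights and gate_bias stand in for learned parameters.
    """
    fused = []
    for i in range(len(global_feat)):
        z = (gate_weights[i][0] * global_feat[i]
             + gate_weights[i][1] * temporal_feat[i]
             + gate_bias[i])
        g = sigmoid(z)
        fused.append(g * global_feat[i] + (1.0 - g) * temporal_feat[i])
    return fused
```

With a strongly positive bias the gate saturates toward the global expert, and with a strongly negative bias toward the temporal expert, so the model can learn per-dimension how much each expert contributes.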

GaMMA demo

News

Installation

  • We use Python 3.12 and CUDA 12.4.
  • The inference code was tested on NVIDIA A100 GPUs with 80 GB of memory. Make sure enough free VRAM is available to avoid out-of-memory (OOM) errors.
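A back-of-the-envelope way to judge whether your GPU is large enough: model weights take roughly `params × bytes-per-param`, plus headroom for activations and the KV cache. The parameter count and overhead ratio below are placeholders for illustration, not GaMMA's actual figures.

```python
def estimate_inference_vram_gb(num_params_b, dtype_bytes=2, overhead_ratio=0.3):
    """Rough lower bound on VRAM (GiB) needed to run a model for inference.

    num_params_b:   parameter count in billions (hypothetical; check the
                    released checkpoint for the real size).
    dtype_bytes:    bytes per parameter (2 for bf16/fp16, 4 for fp32).
    overhead_ratio: extra headroom for activations and the KV cache.
    """
    weights_gb = num_params_b * 1e9 * dtype_bytes / (1024 ** 3)
    return weights_gb * (1 + overhead_ratio)
```

For example, a hypothetical 7B-parameter checkpoint in bf16 needs about 13 GiB for weights alone, roughly 17 GiB with headroom, which fits comfortably on an 80 GB A100.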

PyTorch Inference

conda create -n gamma python=3.12 -y
conda activate gamma

pip install --extra-index-url https://download.pytorch.org/whl/cu121 torch==2.5.1 torchaudio==2.5.1
pip install -r requirements.txt

export CUDA_HOME=/usr/local/cuda
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH

# flash-attn build can take a long time in a fresh environment.
pip install --no-build-isolation --no-cache-dir flash-attn==2.8.0.post2

Optional: vLLM Inference

conda create -n gamma_vllm python=3.12 -y
conda activate gamma_vllm

export CUDA_HOME=/usr/local/cuda
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH

pip install torch==2.10.0+cu128 torchaudio==2.10.0+cu128 --index-url https://download.pytorch.org/whl/cu128
pip install torchvision==0.25.0+cu128 --index-url https://download.pytorch.org/whl/cu128
pip install numpy==2.2.6
pip install -r requirements-vllm.txt
pip install transformers==4.57.6
pip install --no-build-isolation --no-cache-dir flash-attn==2.8.0.post2 causal-conv1d==1.6.0
pip install setuptools-scm "cmake>=3.26.1" uv
cd vllm
mkdir -p vllm/vllm_flash_attn
export SETUPTOOLS_SCM_PRETEND_VERSION=0.17.1
export UV_LINK_MODE=copy
export UV_CACHE_DIR=/tmp/build_tmp/uv-cache
export TMPDIR=/tmp/build_tmp
python use_existing_torch.py --prefix
uv pip install -r requirements/build.txt
uv pip install --no-build-isolation -e .
cd ..

Quick Start

PyTorch CLI

python demo/web_demo_mmaudio.py \
  --mode cli \
  --model-path /absolute/path/to/your/checkpoint \
  --audio-file /absolute/path/to/your/audio.mp3 \
  --question "Please briefly describe the style, mood and main vocal characteristics of this song."

PyTorch Web Demo

python demo/web_demo_mmaudio.py \
  --mode web \
  --model-path /absolute/path/to/your/checkpoint

vLLM CLI

python demo/web_demo_mmaudio_vllm.py \
  --gpu-memory-utilization 0.9 \
  --max-num-seqs 2 \
  --max-model-len 32768 \
  --model-path /absolute/path/to/your/checkpoint \
  --audio-file /absolute/path/to/your/audio.mp3 \
  --question "Please briefly describe the style, mood and main vocal characteristics of this song."
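The `--max-model-len` and `--max-num-seqs` flags bound how much KV cache vLLM must reserve out of the `--gpu-memory-utilization` budget. A rough sketch of that worst-case cost, with hypothetical layer/head counts (not GaMMA's actual architecture):

```python
def kv_cache_gib(max_model_len, max_num_seqs, num_layers=32,
                 num_kv_heads=8, head_dim=128, dtype_bytes=2):
    """Worst-case KV-cache size in GiB.

    Two tensors (K and V) are stored per layer per token; each is
    num_kv_heads * head_dim values of dtype_bytes each. The architecture
    defaults here are placeholders for illustration.
    """
    bytes_total = (2 * num_layers * num_kv_heads * head_dim * dtype_bytes
                   * max_model_len * max_num_seqs)
    return bytes_total / (1024 ** 3)
```

Under these assumed defaults, two concurrent sequences at 32768 tokens would need about 8 GiB of KV cache; lower `--max-model-len` or `--max-num-seqs` if vLLM fails to allocate its cache at startup.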

Acknowledgments

GaMMA is built upon LLaVA-NeXT, Qwen, and vLLM. We thank the authors for their remarkable work.

Citation

@article{gamma2026you,
  title={GaMMA: Towards Joint Global-Temporal Music Understanding in Large Multimodal Models},
  author={You, Zuyao and Yu, Zhesong and Liu, Mingyu and Zhu, Bilei and Wan, Yuan and Wu, Zuxuan},
  journal={arXiv},
  year={2026}
}

About

Official Implementation of "GaMMA: Towards Joint Global-Temporal Music Understanding in Large Multimodal Models"
