MMAE: A Massive Multitask Audio Editing Benchmark

📖 arXiv | 🎬 MMAE Demo Video | 🛠️ GitHub Code | 🔊 HuggingFace Audio Download

Overview of MMAE

We introduce MMAE, a Massive Multitask Audio Editing benchmark, the first comprehensive evaluation testbed designed for general-purpose instruction-based audio editing. MMAE covers a broad spectrum of real-world scenarios, encompassing 7 distinct audio modalities, including sound, speech, music, and their mixtures. MMAE establishes a comprehensive taxonomy spanning 6 levels of task complexity (from basic modifications to multi-hop reasoning and multi-round editing), 2 levels of granularity, and 8 distinct operation types. Curated through human-agent collaboration, MMAE comprises 2,000 high-fidelity samples paired with a rubric-based evaluation framework. By decomposing free-form tasks into 17,741 verifiable criteria, this rubric-based paradigm enables a precise, multi-dimensional assessment of both instruction following and context consistency. We hope MMAE will serve as a catalyst for future advances in the intelligent creation community, providing a clear diagnostic roadmap and establishing a standardized, long-lasting evaluation paradigm for next-generation audio editing systems.

Examples of the MMAE benchmark:

Distribution of the MMAE benchmark across three taxonomy dimensions: modality, difficulty, and operation.

Data Curation Pipeline

MMAE is constructed through a systematic five-stage pipeline designed to ensure both the diversity and the quality of the benchmark:

1. Brainstorming.
2. Taxonomy & Paradigm Construction.
3. Instruction-Centric Data Collection.
4. Rubrics Annotation.
5. Quality Inspection.

Evaluation

We use Qwen3-Omni as the judge model to evaluate audio editing outputs against our rubric-based criteria.

Step 1: Deploy Qwen3-Omni

Clone the official repository and set up the environment following their instructions:

git clone https://github.com/QwenLM/Qwen3-Omni.git
cd Qwen3-Omni
# Follow the official README to install dependencies

Then launch the vLLM serving instances. We provide a reference deployment script launch_qwen3_omni.sh.

This starts two Qwen3-Omni instances (tensor-parallel=4 each) on 8 GPUs, serving on ports 8001 and 8002. Edit MODEL_DIR in the script to point to your local model weights.
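Before scoring, it can help to confirm that both instances are reachable. The following is a minimal sketch, assuming the reference script launches vLLM's OpenAI-compatible server, which lists served models at `GET <base_url>/models`. The helper names (`parse_base_urls`, `list_served_models`, `check_endpoints`) are hypothetical and not part of this repository.

```python
from __future__ import annotations

import json
import urllib.request


def parse_base_urls(base_urls: str) -> list[str]:
    """Split a comma-separated endpoint string into clean base URLs."""
    return [u.strip().rstrip("/") for u in base_urls.split(",") if u.strip()]


def list_served_models(base_url: str, timeout: float = 10.0) -> list[str]:
    """Return the model IDs reported by one OpenAI-compatible endpoint.

    Assumption: the server answers GET <base_url>/models with a JSON
    object whose "data" field is a list of {"id": ...} records.
    """
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        payload = json.load(resp)
    return [m["id"] for m in payload.get("data", [])]


def check_endpoints(base_urls: str) -> dict[str, list[str] | None]:
    """Map each endpoint to its served model IDs, or None if unreachable."""
    status: dict[str, list[str] | None] = {}
    for url in parse_base_urls(base_urls):
        try:
            status[url] = list_served_models(url)
        except (OSError, ValueError):
            # OSError covers connection and HTTP errors; ValueError covers bad JSON.
            status[url] = None
    return status
```

Calling `check_endpoints("http://localhost:8001/v1,http://localhost:8002/v1")` should return a list of model IDs for each endpoint that is up and `None` for any that is not. The reported ID is presumably the name to pass as `--model` in Step 3.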

Step 2: Prepare Predictions

Run your audio editing model on the MMAE benchmark inputs (the metadata). Then append an assistant turn to each entry's original ChatML-format messages, pointing to the output audio path, e.g.,

[
  {
    "id": "69e897fbf1844435bec75eca",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Extract the music component from the audio."},
          {"type": "audio", "audio_url": "wav/69e897fbf1844435bec75eca/audio1.wav"}
        ]
      },
      {
        "role": "assistant",
        "content": [
          {"type": "audio", "audio_url": "your_output_wav_path"}
        ]
      }
    ]
  }
]

The audio_url paths can be absolute or relative to the predictions file's parent directory (or the --audio_root you specify). Save the modified metadata as a separate JSON file containing your model's predictions.
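The preparation step can be scripted. The sketch below assumes the metadata file is a JSON list of entries shaped like the example above (an `id` plus a `messages` list). The functions `append_prediction`, `resolve_audio_path`, and `build_predictions`, as well as the `wav_for_id` callback, are hypothetical names introduced for illustration. `resolve_audio_path` follows the path rule as described in this section and is not taken from the evaluation code itself.

```python
from __future__ import annotations

import copy
import json
from pathlib import Path
from typing import Callable


def append_prediction(entry: dict, output_wav: str) -> dict:
    """Return a copy of one metadata entry with an assistant audio turn appended."""
    result = copy.deepcopy(entry)  # leave the original metadata untouched
    result["messages"].append(
        {"role": "assistant", "content": [{"type": "audio", "audio_url": output_wav}]}
    )
    return result


def resolve_audio_path(
    audio_url: str, predictions_file: str, audio_root: str | None = None
) -> Path:
    """Resolve an audio_url the way the section describes.

    Absolute paths are returned as-is; relative paths are joined to
    audio_root if given, otherwise to the predictions file's parent directory.
    """
    path = Path(audio_url)
    if path.is_absolute():
        return path
    base = Path(audio_root) if audio_root is not None else Path(predictions_file).parent
    return base / path


def build_predictions(
    metadata_path: str, predictions_path: str, wav_for_id: Callable[[str], str]
) -> int:
    """Write a predictions file and return the number of entries written.

    wav_for_id maps a sample id to the path of your model's output audio.
    """
    entries = json.loads(Path(metadata_path).read_text(encoding="utf-8"))
    predictions = [append_prediction(e, wav_for_id(e["id"])) for e in entries]
    Path(predictions_path).write_text(
        json.dumps(predictions, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return len(predictions)
```

For example, `build_predictions("MMAE-meta.json", "my_model.json", lambda i: f"outputs/{i}.wav")` would write one entry per sample, assuming your outputs are stored under that (illustrative) naming scheme.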

Step 3: Run Evaluation

python -m eval.score \
  --predictions path/to/your_predictions.json \
  --base_urls "http://localhost:8001/v1,http://localhost:8002/v1" \
  --audio_root path/to/audio_root \
  --output_dir outputs/your_model \
  --concurrency 8

Arguments:

Argument	Description
`--predictions`	(required) Path to your predictions JSON file.
`--base_urls`	(required) Comma-separated Qwen3-Omni endpoint URLs.
`--metadata`	Path to MMAE metadata. Default: `MMAE-meta.json`.
`--audio_root`	Base directory for resolving relative audio paths. Default: parent directory of predictions file.
`--output_dir`	Where to write results. Default: `outputs/scores`.
`--concurrency`	Number of samples scored in parallel. Default: 16.
`--retries`	Number of valid judge responses to collect per rubric. Should be set to 3.
`--max_attempts`	Max total attempts (including failures) per rubric. Default: 10.
`--timeout`	Timeout in seconds per judge request. Default: 300.
`--model`	Model name served by vLLM. Default: `Qwen3Omni-Instruct`.
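The interplay between `--retries` and `--max_attempts` can be pictured as a bounded collection loop: keep querying the judge until enough valid responses have arrived or the attempt budget is spent. The sketch below is one reading of the argument descriptions above, not the actual logic in `eval.score`. `collect_valid_responses` and `ask_judge` are hypothetical names, and treating `None` or an exception as an invalid response is an assumption.

```python
from __future__ import annotations

from typing import Callable, Optional


def collect_valid_responses(
    ask_judge: Callable[[], Optional[str]],
    retries: int = 3,
    max_attempts: int = 10,
) -> list[str]:
    """Collect up to `retries` valid judge responses in at most `max_attempts` calls.

    ask_judge returns a parsed choice, or None when the reply is unusable;
    exceptions (e.g. a timeout) also count as failed attempts.
    """
    valid: list[str] = []
    attempts = 0
    while len(valid) < retries and attempts < max_attempts:
        attempts += 1
        try:
            answer = ask_judge()
        except Exception:
            continue  # failed attempt: counted against max_attempts, nothing kept
        if answer is not None:
            valid.append(answer)
    return valid
```

Under this reading, a rubric that never yields a valid response costs at most `max_attempts` judge calls, and a well-behaved judge is queried exactly `retries` times.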

Output files (written to --output_dir):

File	Description
`results.jsonl`	Per-rubric detailed results: each rubric's 3 judge responses, per-attempt choices, scores, and raw model outputs.
`per_sample.json`	Per-sample aggregated scores: Instruction Following Rate, Consistency Rate, and Exact Match Rate for each data entry.
`taxonomy.json`	Scores grouped by modality, complexity, cross dimensions, and operation type.
