A public benchmark that asks language models to make Charlie-Parker-style improvisational choices over real chord progressions, scored against Parker's actual recorded solos using formal music-theoretic metrics.
Blog post on why and how this exists: I built my own Claude eval.
Most LLM evals test verbal, mathematical, or coding reasoning. Few test reasoning under the soft, multi-constraint, judgeable conditions that human experts navigate intuitively. Jazz improvisation fits that description: it is bounded (chord changes, key, time), judgeable (the actual solo serves as ground truth, and formal scoring methods exist), and cognitively rich (constraint satisfaction, style, and creativity, all in real time).
This benchmark turns the question "can the model improvise like Parker?" into a measurable reasoning task.
Given:
- The first N chord segments of one of Parker's choruses (chord symbols, key, pitch classes actually played)
- The chord changes for the next M segments
Predict, for each of those M future segments, the pitch-class set Parker actually played over that chord.
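To make the task shape concrete, here is a minimal sketch of how one task and its prediction could be represented and checked. The names `Segment`, `Task`, and `validate_prediction` are hypothetical and are not taken from this repo's scripts; the only assumption about the data is that pitch classes are integers 0 to 11 (0 = C).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    chord: str                     # chord symbol, e.g. "F7"
    pitch_classes: frozenset[int]  # pitch classes played over the chord


@dataclass(frozen=True)
class Task:
    key: str
    context: list[Segment]         # the first N segments, pitch classes included
    future_chords: list[str]       # chord symbols for the next M segments


def validate_prediction(task: Task, prediction: list[list[int]]) -> list[frozenset[int]]:
    """Check that a prediction has one pitch-class set per future segment.

    Hypothetical helper: raises ValueError on a malformed prediction and
    returns the prediction as a list of frozensets otherwise.
    """
    if len(prediction) != len(task.future_chords):
        raise ValueError(
            f"expected {len(task.future_chords)} segments, got {len(prediction)}"
        )
    cleaned = []
    for pcs in prediction:
        for pc in pcs:
            if not isinstance(pc, int) or not 0 <= pc <= 11:
                raise ValueError(f"invalid pitch class: {pc!r}")
        cleaned.append(frozenset(pcs))
    return cleaned
```

A model answer for a task with M = 2 would then be a list of two pitch-class lists, one per upcoming chord.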
For each predicted segment vs. ground truth:
- PC Jaccard — Jaccard similarity between predicted and actual pitch-class sets
- IV distance — Euclidean distance in interval-class space (ic1–ic6)
- Complexity delta — Absolute error on Parker's complexity metric (iv_sum)
- Dissonance delta — Absolute error on the dissonance metric (ic1 + 0.5·ic2 + 0.8·ic6)
- Forte-class match — Exact-match rate on the set-class label
All metrics aggregate across the M segments per task, then across all tasks.
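The metrics above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code in `scripts/04_score.py`: `iv_sum` is assumed to be the sum of the six interval-vector entries, aggregation is assumed to be a plain mean, and the Forte-class check is approximated by comparing a canonical form under transposition and inversion rather than looking up Forte labels. All function names are hypothetical.

```python
import math
from itertools import combinations


def jaccard(pred, actual):
    """Jaccard similarity between two pitch-class sets."""
    pred, actual = set(pred), set(actual)
    union = pred | actual
    return len(pred & actual) / len(union) if union else 1.0


def interval_vector(pcs):
    """Count interval classes ic1..ic6 over all pairs of pitch classes."""
    iv = [0] * 6
    for a, b in combinations(sorted(set(pcs)), 2):
        d = (b - a) % 12
        iv[min(d, 12 - d) - 1] += 1
    return tuple(iv)


def complexity(pcs):
    # Assumption: iv_sum is the sum of the interval-vector entries.
    return sum(interval_vector(pcs))


def dissonance(pcs):
    ic1, ic2, _, _, _, ic6 = interval_vector(pcs)
    return ic1 + 0.5 * ic2 + 0.8 * ic6


def set_class_key(pcs):
    """Canonical representative under transposition and inversion.

    Assumption: two sets share a Forte label exactly when these keys match.
    """
    pcs = set(pcs)
    inverted = {(-p) % 12 for p in pcs}
    return min(
        tuple(sorted((p - t) % 12 for p in form))
        for form in (pcs, inverted)
        for t in range(12)
    )


def score_segment(pred, actual):
    return {
        "pc_jaccard": jaccard(pred, actual),
        "iv_distance": math.dist(interval_vector(pred), interval_vector(actual)),
        "complexity_delta": abs(complexity(pred) - complexity(actual)),
        "dissonance_delta": abs(dissonance(pred) - dissonance(actual)),
        "forte_match": float(set_class_key(pred) == set_class_key(actual)),
    }


def score_task(preds, actuals):
    """Mean of each metric over the M segments of one task (assumed aggregation)."""
    rows = [score_segment(p, a) for p, a in zip(preds, actuals)]
    return {name: sum(r[name] for r in rows) / len(rows) for name in rows[0]}
```

For instance, predicting a C major triad `{0, 4, 7}` where Parker played a dominant seventh `{0, 4, 7, 10}` shares three of four pitch classes, so the PC Jaccard for that segment is 0.75.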
- Random-from-chord-vocab — Sample uniformly from the empirical distribution of PC sets Parker used over the same chord function.
- Most-frequent-per-chord — Always emit the modal PC set for the given chord function.
- 1st-order Markov on IVs — Transition probabilities from the previous segment's interval vector.
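As a rough illustration of the first two baselines, the sketch below builds a per-chord-function vocabulary of pitch-class sets and draws predictions from it. It is not the code in `scripts/02_run_baselines.py`: the function names, the chord-function labels such as `"V7"`, and the empty-set fallback for unseen chord functions are all hypothetical. Sampling is assumed to be proportional to how often Parker used each set. The Markov baseline is left out because its mapping from interval vectors back to pitch-class sets is not described here.

```python
import random
from collections import Counter, defaultdict


def build_vocab(training_segments):
    """Map chord function -> Counter of pitch-class sets observed over it.

    `training_segments` is an iterable of (chord_function, pitch_classes) pairs.
    """
    vocab = defaultdict(Counter)
    for chord_function, pcs in training_segments:
        vocab[chord_function][frozenset(pcs)] += 1
    return dict(vocab)


def most_frequent(vocab, chord_function):
    """Most-frequent-per-chord: always emit the modal set for the function."""
    counts = vocab.get(chord_function)
    if not counts:
        return frozenset()  # hypothetical fallback for unseen chord functions
    return counts.most_common(1)[0][0]


def random_from_vocab(vocab, chord_function, rng):
    """Random-from-chord-vocab: sample a set in proportion to its frequency."""
    counts = vocab.get(chord_function)
    if not counts:
        return frozenset()  # hypothetical fallback for unseen chord functions
    sets = list(counts)
    weights = list(counts.values())
    return rng.choices(sets, weights=weights, k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` keeps the random baseline reproducible across runs.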
- Claude Haiku 4.5
- Claude Sonnet 4.6
- Claude Opus 4.7
This benchmark builds on derived features from the Charlie Parker Time Series Analysis Pipeline, which processes the Aligned Charlie Parker Digital Omnibook (Déguernel, Vincent, Assayag / Inria, STMS Lab Ircam/CNRS/UPMC; CC BY-NC-SA 2.0). Raw transcriptions are not redistributed in this repo. Derived numerical features (interval vectors, pitch classes per chord, timestamps) are released here under CC BY-NC-SA 2.0 for non-commercial research use.
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt
# Generate the v0 task set (~100 examples)
python scripts/01_build_tasks.py
# Run baselines (no API key needed)
python scripts/02_run_baselines.py
(a) Reproducible / public path — direct Anthropic API
cp .env.example .env # then add your ANTHROPIC_API_KEY
python scripts/03_run_claude.py --model claude-sonnet-4-6 --limit 10
python scripts/03_run_claude.py --model claude-haiku-4-5-20251001
python scripts/03_run_claude.py --model claude-opus-4-7
python scripts/04_score.py
python scripts/05_report.py
This path uses prompt caching and is the canonical way to reproduce the benchmark numbers in the paper.
(b) Local / no-key path — Claude Code skill
From inside this directory, with Claude Code:
/jazzbench-run # haiku, 10 tasks (defaults)
/jazzbench-run sonnet 20
/jazzbench-run opus 5
/jazzbench-run all 10 # haiku + sonnet + opus sequentially
The skill spawns one workflow subagent per task, enforces the prediction schema, and writes results to results/runs/claude-<tier>-via-cc.jsonl. No ANTHROPIC_API_KEY is needed because Claude Code's own session is used. The same 04_score.py / 05_report.py consume the output.
The two paths produce separate model rows in the comparison table (claude-sonnet-4-6 vs. claude-sonnet-via-cc), so direct-API and Claude Code results can be compared side by side.
- Single artist, single style era (bebop). Generalization to other improvisers is future work.
- Pitch-class only. Rhythm, articulation, and register are not evaluated.
- Predictions are scored on harmonic content, not narrative arc.
- Markov baseline is 1st-order; higher orders are future work.
- Models see no audio, only symbolic data.
@software{rubini2026jazzbench,
author={Rubini, Mike},
title={JazzBench v0: Improvisational Decision-Making as an LLM Reasoning Benchmark},
year={2026},
url={https://github.com/code91/claude-impro-eval}
}
Builds on:
@software{rubini2025parker_code,
author={Rubini, Mike},
title={Charlie Parker Time Series Analysis Pipeline},
year={2025},
publisher={Zenodo},
doi={10.5281/zenodo.18037822}
}