JazzBench v0: Improvisational Decision-Making as an LLM Reasoning Benchmark

A public benchmark that asks language models to make Charlie-Parker-style improvisational choices over real chord progressions, scored against Parker's actual recorded solos using formal music-theoretic metrics.

Blog post on why and how this exists: I built my own Claude eval.

Why This Exists

Most LLM evals test verbal, mathematical, or coding reasoning. None test reasoning under the kind of soft, multi-constraint, judgeable conditions that human experts navigate intuitively. Jazz improvisation is exactly that: bounded (chord changes, key, time), judgeable (we have the actual solo as ground truth, plus formal scoring methods), and cognitively rich (constraint satisfaction + style + creativity, all in real time).

This benchmark turns the question "can the model improvise like Parker?" into a measurable reasoning task.

The Task

Given:

The first N chord segments of one of Parker's choruses (chord symbols, key, pitch classes actually played)
The chord changes for the next M segments

Predict, for each of those M future segments, the pitch-class set Parker actually played over that chord.

Metrics

For each predicted segment vs. ground truth:

PC Jaccard — Jaccard similarity between predicted and actual pitch-class sets
IV distance — Euclidean distance in interval-class space (ic1–ic6)
Complexity delta — Absolute error on Parker's complexity metric (iv_sum)
Dissonance delta — Absolute error on the dissonance metric (ic1 + 0.5·ic2 + 0.8·ic6)
Forte-class match — Exact-match rate on the set-class label

All metrics aggregate across the M segments per task, then across all tasks.

Baselines

Random-from-chord-vocab — Sample uniformly from the empirical distribution of PC sets Parker used over the same chord function.
Most-frequent-per-chord — Always emit the modal PC set for the given chord function.
1st-order Markov on IVs — Transition probabilities from the previous segment's interval vector.

Models Evaluated

Claude Haiku 4.5
Claude Sonnet 4.6
Claude Opus 4.7

Data Provenance

This benchmark builds on derived features from the Charlie Parker Time Series Analysis Pipeline, which processes the Aligned Charlie Parker Digital Omnibook (Déguernel, Vincent, Assayag / Inria, STMS Lab Ircam/CNRS/UPMC; CC BY-NC-SA 2.0). Raw transcriptions are not redistributed in this repo. Derived numerical features (interval vectors, pitch classes per chord, timestamps) are released here under CC BY-NC-SA 2.0 for non-commercial research use.

Quick Start

python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# Generate the v0 task set (~100 examples)
python scripts/01_build_tasks.py

# Run baselines (no API key needed)
python scripts/02_run_baselines.py

Two ways to run Claude on the eval

(a) Reproducible / public path — direct Anthropic API

cp .env.example .env  # then add your ANTHROPIC_API_KEY
python scripts/03_run_claude.py --model claude-sonnet-4-6 --limit 10
python scripts/03_run_claude.py --model claude-haiku-4-5-20251001
python scripts/03_run_claude.py --model claude-opus-4-7

python scripts/04_score.py
python scripts/05_report.py

This path uses prompt caching and is the canonical way to reproduce the benchmark numbers in the paper.

(b) Local / no-key path — Claude Code skill

From inside this directory, with Claude Code:

/jazzbench-run               # haiku, 10 tasks (defaults)
/jazzbench-run sonnet 20
/jazzbench-run opus 5
/jazzbench-run all 10        # haiku + sonnet + opus sequentially

The skill spawns one workflow subagent per task, enforces the prediction schema, and writes results to results/runs/claude-<tier>-via-cc.jsonl. No ANTHROPIC_API_KEY is needed because Claude Code's own session is used. The same 04_score.py / 05_report.py consume the output.

The two paths produce separate model rows in the comparison table (claude-sonnet-4-6 vs. claude-sonnet-via-cc), so you can compare them if you want.

Limitations (v0)

Single artist, single style era (bebop). Generalization to other improvisers is future work.
Pitch-class only. Rhythm, articulation, and register are not evaluated.
Predictions are scored on harmonic content, not narrative arc.
Markov baseline is 1st-order; higher orders are future work.
Models see no audio, only symbolic data.

Citation

@software{rubini2026jazzbench,
  author={Rubini, Mike},
  title={JazzBench v0: Improvisational Decision-Making as an LLM Reasoning Benchmark},
  year={2026},
  url={https://github.com/code91/claude-impro-eval}
}

Builds on:

@software{rubini2025parker_code,
  author={Rubini, Mike},
  title={Charlie Parker Time Series Analysis Pipeline},
  year={2025},
  publisher={Zenodo},
  doi={10.5281/zenodo.18037822}
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

JazzBench v0: Improvisational Decision-Making as an LLM Reasoning Benchmark

Why This Exists

The Task

Metrics

Baselines

Models Evaluated

Data Provenance

Quick Start

Two ways to run Claude on the eval

Limitations (v0)

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
.claude		.claude
data		data
jazzbench		jazzbench
paper		paper
results		results
scripts		scripts
.env.example		.env.example
.gitignore		.gitignore
README.md		README.md
pyrightconfig.json		pyrightconfig.json
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

JazzBench v0: Improvisational Decision-Making as an LLM Reasoning Benchmark

Why This Exists

The Task

Metrics

Baselines

Models Evaluated

Data Provenance

Quick Start

Two ways to run Claude on the eval

Limitations (v0)

Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages