JazzBench v0: Improvisational Decision-Making as an LLM Reasoning Benchmark

A public benchmark that asks language models to make Charlie-Parker-style improvisational choices over real chord progressions, scored against Parker's actual recorded solos using formal music-theoretic metrics.

Blog post on why and how this exists: I built my own Claude eval.

Why This Exists

Most LLM evals test verbal, mathematical, or coding reasoning. Few test reasoning under the soft, multi-constraint conditions that human experts navigate intuitively. Jazz improvisation fits that description: it is bounded (chord changes, key, time), judgeable (the actual solo serves as ground truth, and formal scoring methods exist), and cognitively rich (constraint satisfaction, style, and creativity, all in real time).

This benchmark turns the question "can the model improvise like Parker?" into a measurable reasoning task.

The Task

Given:

  • The first N chord segments of one of Parker's choruses (chord symbols, key, pitch classes actually played)
  • The chord changes for the next M segments

Predict, for each of those M future segments, the pitch-class set Parker actually played over that chord.
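
As a rough illustration of the task shape, a single example could be modeled as below. This is a sketch only: the class names, field names, and the validity check are hypothetical and are not taken from the repo's scripts, and the chord symbols and pitch classes in the example are made up for illustration, not drawn from the dataset. Pitch classes are assumed to be integers 0-11 with C = 0.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List

PitchClassSet = FrozenSet[int]  # assumption: integers 0-11, C = 0


@dataclass(frozen=True)
class Segment:
    chord: str                    # chord symbol, e.g. "F7"
    pitch_classes: PitchClassSet  # pitch classes played over that chord


@dataclass(frozen=True)
class Task:
    key: str
    context: List[Segment]        # first N segments, fully visible to the model
    future_chords: List[str]      # chord changes for the next M segments
    targets: List[PitchClassSet]  # ground truth for those M segments (hidden)


def is_valid_prediction(task: Task, prediction: List[Iterable[int]]) -> bool:
    """Check that there is one PC set per future segment, each using only 0-11."""
    if len(prediction) != len(task.future_chords):
        return False
    return all(
        isinstance(pc, int) and 0 <= pc <= 11
        for pcs in prediction
        for pc in pcs
    )


# Illustrative values only (not real Parker data).
example_task = Task(
    key="F",
    context=[Segment("F7", frozenset({5, 9, 0, 3}))],
    future_chords=["Bb7", "F7"],
    targets=[frozenset({10, 2, 5, 8}), frozenset({5, 9, 0})],
)
```

A prediction for `example_task` would then be a list of two pitch-class sets, one per entry in `future_chords`.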

Metrics

For each predicted segment vs. ground truth:

  • PC Jaccard — Jaccard similarity between predicted and actual pitch-class sets
  • IV distance — Euclidean distance in interval-class space (ic1–ic6)
  • Complexity delta — Absolute error on Parker's complexity metric (iv_sum)
  • Dissonance delta — Absolute error on the dissonance metric (ic1 + 0.5·ic2 + 0.8·ic6)
  • Forte-class match — Exact-match rate on the set-class label

All metrics are aggregated first across the M segments of each task, then across all tasks.
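
The per-segment metrics above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the code in scripts/04_score.py: the function names are hypothetical, `iv_sum` is assumed to mean the sum of the six interval-vector entries, and returning 1.0 for the Jaccard similarity of two empty sets is a convention chosen here. Forte-class matching is omitted because it needs a set-class lookup table.

```python
from itertools import combinations
from math import sqrt


def interval_vector(pcs):
    """Return [ic1, ..., ic6] counts for a collection of pitch classes (0-11)."""
    iv = [0] * 6
    for a, b in combinations(sorted(set(pcs)), 2):
        d = (b - a) % 12
        ic = min(d, 12 - d)  # fold intervals 7-11 onto interval classes 5-1
        iv[ic - 1] += 1
    return iv


def pc_jaccard(pred, actual):
    """Jaccard similarity between predicted and actual pitch-class sets."""
    pred, actual = set(pred), set(actual)
    if not pred and not actual:
        return 1.0  # assumption: two empty sets count as identical
    return len(pred & actual) / len(pred | actual)


def iv_distance(pred, actual):
    """Euclidean distance between the two interval vectors (ic1-ic6)."""
    p, a = interval_vector(pred), interval_vector(actual)
    return sqrt(sum((x - y) ** 2 for x, y in zip(p, a)))


def complexity(pcs):
    """Assumed definition of iv_sum: sum of the interval-vector entries."""
    return sum(interval_vector(pcs))


def dissonance(pcs):
    """ic1 + 0.5*ic2 + 0.8*ic6, as given in the metric list."""
    iv = interval_vector(pcs)
    return iv[0] + 0.5 * iv[1] + 0.8 * iv[5]


def score_segment(pred, actual):
    """Bundle the four numeric metrics for one predicted segment."""
    return {
        "pc_jaccard": pc_jaccard(pred, actual),
        "iv_distance": iv_distance(pred, actual),
        "complexity_delta": abs(complexity(pred) - complexity(actual)),
        "dissonance_delta": abs(dissonance(pred) - dissonance(actual)),
    }
```

For example, scoring a predicted C major triad {0, 4, 7} against an actual C7 chord {0, 4, 7, 10} gives a PC Jaccard of 0.75, a complexity delta of 3, and a dissonance delta of 1.3 (the added pitch class contributes one ic2 and one ic6).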

Baselines

  • Random-from-chord-vocab — Sample uniformly from the empirical distribution of PC sets Parker used over the same chord function.
  • Most-frequent-per-chord — Always emit the modal PC set for the given chord function.
  • 1st-order Markov on IVs — Transition probabilities from the previous segment's interval vector.
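
To make the three baselines concrete, here is a small sketch of how they might be built from (chord function, PC set) observations. All names are hypothetical and this is not the code in scripts/02_run_baselines.py. Two assumptions are worth flagging: the random baseline is read here as frequency-weighted sampling from the observed PC sets, and the Markov sketch only estimates transition probabilities between interval vectors, without showing how a vector is mapped back to a PC set.

```python
import random
from collections import Counter, defaultdict


def build_vocab(observations):
    """Count PC sets per chord function from (chord_fn, pcs) pairs."""
    vocab = defaultdict(Counter)
    for chord_fn, pcs in observations:
        vocab[chord_fn][frozenset(pcs)] += 1
    return dict(vocab)  # plain dict: unseen chord functions raise KeyError


def most_frequent(vocab, chord_fn):
    """Most-frequent-per-chord: the modal PC set for this chord function."""
    return vocab[chord_fn].most_common(1)[0][0]


def random_from_vocab(vocab, chord_fn, rng):
    """Random-from-chord-vocab, assumed here to be frequency-weighted."""
    pc_sets, counts = zip(*vocab[chord_fn].items())
    return rng.choices(pc_sets, weights=counts, k=1)[0]


def build_iv_transitions(iv_sequences):
    """Count first-order transitions between consecutive interval vectors."""
    trans = defaultdict(Counter)
    for seq in iv_sequences:
        for prev_iv, next_iv in zip(seq, seq[1:]):
            trans[tuple(prev_iv)][tuple(next_iv)] += 1
    return dict(trans)


def transition_probs(trans, prev_iv):
    """Normalize the counts out of prev_iv into probabilities."""
    counts = trans[tuple(prev_iv)]
    total = sum(counts.values())
    return {next_iv: c / total for next_iv, c in counts.items()}


# Toy observations for illustration only (not real Parker data).
toy_obs = [
    ("V7", {0, 4, 7, 10}),
    ("V7", {0, 4, 7, 10}),
    ("V7", {0, 2, 4}),
]
toy_vocab = build_vocab(toy_obs)
```

With the toy data, `most_frequent(toy_vocab, "V7")` returns the set {0, 4, 7, 10}, since it occurs twice and the alternative once.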

Models Evaluated

  • Claude Haiku 4.5
  • Claude Sonnet 4.6
  • Claude Opus 4.7

Data Provenance

This benchmark builds on derived features from the Charlie Parker Time Series Analysis Pipeline, which processes the Aligned Charlie Parker Digital Omnibook (Déguernel, Vincent, Assayag / Inria, STMS Lab Ircam/CNRS/UPMC; CC BY-NC-SA 2.0). Raw transcriptions are not redistributed in this repo. Derived numerical features (interval vectors, pitch classes per chord, timestamps) are released here under CC BY-NC-SA 2.0 for non-commercial research use.

Quick Start

python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# Generate the v0 task set (~100 examples)
python scripts/01_build_tasks.py

# Run baselines (no API key needed)
python scripts/02_run_baselines.py

Two ways to run Claude on the eval

(a) Reproducible / public path — direct Anthropic API

cp .env.example .env  # then add your ANTHROPIC_API_KEY
python scripts/03_run_claude.py --model claude-sonnet-4-6 --limit 10
python scripts/03_run_claude.py --model claude-haiku-4-5-20251001
python scripts/03_run_claude.py --model claude-opus-4-7

python scripts/04_score.py
python scripts/05_report.py

This path uses prompt caching and is the canonical way to reproduce the benchmark numbers in the paper.

(b) Local / no-key path — Claude Code skill

From inside this directory, with Claude Code:

/jazzbench-run               # haiku, 10 tasks (defaults)
/jazzbench-run sonnet 20
/jazzbench-run opus 5
/jazzbench-run all 10        # haiku + sonnet + opus sequentially

The skill spawns one workflow subagent per task, enforces the prediction schema, and writes results to results/runs/claude-<tier>-via-cc.jsonl. No ANTHROPIC_API_KEY is needed because Claude Code's own session is used. The same 04_score.py / 05_report.py consume the output.

The two paths produce separate model rows in the comparison table (for example, claude-sonnet-4-6 vs. claude-sonnet-via-cc), so API and Claude Code results can be compared directly.

Limitations (v0)

  • Single artist, single style era (bebop). Generalization to other improvisers is future work.
  • Pitch-class only. Rhythm, articulation, and register are not evaluated.
  • Predictions are scored on harmonic content, not narrative arc.
  • Markov baseline is 1st-order; higher orders are future work.
  • Models see no audio, only symbolic data.

Citation

@software{rubini2026jazzbench,
  author={Rubini, Mike},
  title={JazzBench v0: Improvisational Decision-Making as an LLM Reasoning Benchmark},
  year={2026},
  url={https://github.com/code91/claude-impro-eval}
}

Builds on:

@software{rubini2025parker_code,
  author={Rubini, Mike},
  title={Charlie Parker Time Series Analysis Pipeline},
  year={2025},
  publisher={Zenodo},
  doi={10.5281/zenodo.18037822}
}
