Creative is an installable Codex skill that pushes an LLM away from default answers while keeping the answer useful, feasible, and simple.
This repo contains Creative, benchmark code, a default task set, and generated benchmark results.
Current result: Creative made answers much more original than baseline and won most overall comparisons in the committed 80-task real run.
| Plain-English Question | Answer |
|---|---|
| Does Creative make answers more original? | Yes, strongly. |
| Does Creative usually beat the baseline? | Often. |
| Does Creative win without losing feasibility? | Mixed. |
| Does Creative become too complicated? | No. |
The headline result: Creative changes answers in the intended direction. The strongest evidence is originality; the main caveat is a feasibility tradeoff.
Judge-question results:
The table below shows how often Creative won overall and how often it won each judge question.
| Judge Question | What It Measures | Creative Won | 95% CI |
|---|---|---|---|
| Which answer was better overall? | Overall judgment | 78.75% | 68.6%-86.3% |
| Which answer is less default and brings a better non-obvious angle? | Originality | 87.50% | 78.5%-93.1% |
| Which answer gives the user a better next move? | Usefulness | 42.50% | 32.3%-53.4% |
| Which answer is more realistic to execute as written? | Feasibility | 12.50% | 6.9%-21.5% |
| Which answer gives more concrete details, tradeoffs, or next actions? | Specificity | 38.75% | 28.8%-49.7% |
| Which answer avoids unnecessary complexity, ceremony, or bloat? | Simplicity | 33.75% | 24.3%-44.6% |
CreativeBench is an automated benchmark for testing whether Creative changes LLM outputs in a useful direction. It compares normal answers against answers generated with Creative, then judges the pair blindly.
Creative asks the model to reject the first clean answer, generate stronger alternatives with specific moves, and keep only the idea that is more useful than the obvious/default answer. The installable file lives at skills/creative/SKILL.md.
Prompts that ask for originality can drift into novelty without utility. CreativeBench tests whether Creative produces less-default answers while preserving usefulness, feasibility, specificity, and simplicity.
The benchmark is automated and uses blind pairwise judging. The headline signals are originality win rate, overall win rate, valid win rate, feasibility loss rate, and overcomplication rate. The benchmark punishes answers that are weird but useless.
CreativeBench is not a replacement for human evaluation, but it is a fast first-pass test.
For each task, CreativeBench generates two answers:
- Baseline mode: answer normally.
- Creative mode: answer using Creative.
It randomizes whether the baseline or creative answer appears as answer A. The judge does not see the labels. The judge scores originality, usefulness, feasibility, specificity, and simplicity, then chooses an overall winner or tie.
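A minimal sketch of that pairing step, with hypothetical field names (the actual schema lives in creative_bench/ and results/pairs.jsonl):

```python
import random

def make_blind_pair(task_id: str, baseline_answer: str, creative_answer: str) -> dict:
    """Randomly assign the two answers to slots A and B so the judge never
    sees which mode produced which answer. Illustrative only; these field
    names are hypothetical, not the repo's actual pair schema."""
    creative_is_a = random.random() < 0.5
    return {
        "task_id": task_id,
        "answer_a": creative_answer if creative_is_a else baseline_answer,
        "answer_b": baseline_answer if creative_is_a else creative_answer,
        "creative_slot": "A" if creative_is_a else "B",  # stored for scoring, never shown to the judge
    }
```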
The judge uses five plain-English questions:
| Judge Question | What It Measures |
|---|---|
| Which answer is less default and brings a better non-obvious angle? | Originality |
| Which answer gives the user a better next move? | Usefulness |
| Which answer is more realistic to execute as written? | Feasibility |
| Which answer gives more concrete details, tradeoffs, or next actions? | Specificity |
| Which answer avoids unnecessary complexity, ceremony, or bloat? | Simplicity |
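A rough sketch of how these questions could be assembled into the blind judge prompt; the wording and structure of the real prompt in creative_bench/ may differ:

```python
JUDGE_QUESTIONS = {
    "originality": "Which answer is less default and brings a better non-obvious angle?",
    "usefulness": "Which answer gives the user a better next move?",
    "feasibility": "Which answer is more realistic to execute as written?",
    "specificity": "Which answer gives more concrete details, tradeoffs, or next actions?",
    "simplicity": "Which answer avoids unnecessary complexity, ceremony, or bloat?",
}

def build_judge_prompt(task_prompt: str, answer_a: str, answer_b: str) -> str:
    # The judge only ever sees "A" and "B", never "baseline" or "creative".
    questions = "\n".join(f"- {q} (A, B, or tie)" for q in JUDGE_QUESTIONS.values())
    return (
        f"Task:\n{task_prompt}\n\n"
        f"Answer A:\n{answer_a}\n\n"
        f"Answer B:\n{answer_b}\n\n"
        f"Answer each question, then pick an overall winner (A, B, or tie):\n{questions}"
    )
```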
The public result is the Markdown report. JSON and JSONL files are kept as machine-readable evidence so the benchmark can be reproduced.
```
python -m pip install -e ".[dev]"
```

Use skills/creative/SKILL.md as the installable Creative skill file. In this repo, the benchmark code reads that exact file when it runs Creative mode.
Real runs require OPENAI_API_KEY. This is the run that tests whether Creative actually changes model outputs compared with baseline.
```
cp .env.example .env
# Add your API key to .env
python -m creative_bench.cli run
```

By default the real benchmark uses gpt-5.5 for both generation and judging with medium reasoning effort.
It runs API calls in parallel with MAX_WORKERS=12 by default. Lower this value in .env if you hit rate limits.
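A minimal .env using only the variables mentioned above (copy .env.example for the full template; the key value is a placeholder):

```
OPENAI_API_KEY=sk-your-key-here
MAX_WORKERS=12
```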
Human-facing output:

- results/report.md

Machine-readable outputs:

- results/baseline.jsonl
- results/creative.jsonl
- results/pairs.jsonl
- results/judgments.jsonl
- results/summary.json
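For quick programmatic checks, the machine-readable files can be loaded directly. A sketch, assuming summary.json exposes the win-rate metrics described further down:

```python
import json
from pathlib import Path

# Illustrative only; the exact keys depend on the benchmark version.
summary = json.loads(Path("results/summary.json").read_text())
judgments = [
    json.loads(line)
    for line in Path("results/judgments.jsonl").read_text().splitlines()
    if line.strip()
]
print(len(judgments), "judged pairs")
print("overall win rate:", summary.get("creative_overall_win_rate"))
```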
Edit data/tasks.jsonl. Each row must include:
- task_id
- category
- prompt
Prompts should describe situations where the obvious/default answer is likely too generic, too safe, or failing.
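A sketch of adding a row, with illustrative values for the three required fields:

```python
import json

# task_id and category values below are made up for illustration.
task = {
    "task_id": "naming-001",
    "category": "naming",
    "prompt": "Our product launch plan feels generic and safe. Propose a stronger direction.",
}
with open("data/tasks.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(task) + "\n")
```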
creative_valid_win_rate is the strictest "did Creative really win?" metric. It counts cases where the Creative answer wins overall, is not nonsense, and does not lose feasibility. Higher is better.
creative_overall_win_rate measures whether the blind judge preferred Creative overall. Higher is better.
creative_originality_win_rate measures whether Creative is less default than baseline. Higher is better.
creative_feasibility_loss_rate catches cases where Creative becomes less practical than baseline. Lower is better.
creative_overcomplication_rate catches answers that become elaborate without earning the complexity. Lower is better.
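A rough sketch of how these rates could be derived from the per-pair judgments, with hypothetical field names for the judge's verdicts:

```python
def summarize(judgments: list[dict]) -> dict:
    """Compute the headline rates from judged pairs. The field names
    (overall_winner, feasibility_winner, is_nonsense, overcomplicated)
    are illustrative, not the repo's actual judgment schema."""
    n = len(judgments)
    if n == 0:
        return {}
    overall_wins = sum(j["overall_winner"] == "creative" for j in judgments)
    originality_wins = sum(j["originality_winner"] == "creative" for j in judgments)
    feasibility_losses = sum(j["feasibility_winner"] == "baseline" for j in judgments)
    overcomplicated = sum(bool(j.get("overcomplicated")) for j in judgments)
    valid_wins = sum(
        j["overall_winner"] == "creative"
        and not j.get("is_nonsense", False)
        and j["feasibility_winner"] != "baseline"
        for j in judgments
    )
    return {
        "creative_overall_win_rate": overall_wins / n,
        "creative_originality_win_rate": originality_wins / n,
        "creative_valid_win_rate": valid_wins / n,
        "creative_feasibility_loss_rate": feasibility_losses / n,
        "creative_overcomplication_rate": overcomplicated / n,
    }
```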
The report does not use hard pass/fail thresholds. It reports observed rates and 95% confidence intervals so readers can see the effect size and uncertainty.
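The committed intervals are consistent with the standard Wilson score interval for a proportion; a minimal sketch:

```python
import math

def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a win rate of wins out of n pairs."""
    if n == 0:
        return (0.0, 0.0)
    p = wins / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (center - half, center + half)

# 63 overall wins out of 80 pairs (78.75%) gives roughly (0.686, 0.863),
# matching the 68.6%-86.3% interval in the table above.
print(wilson_interval(63, 80))
```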
```
creative_bench/
  README.md
  LICENSE
  pyproject.toml
  .gitignore
  .env.example
  CODEX.md
  .github/workflows/ci.yml
  skills/creative/SKILL.md
  creative_bench/
  data/tasks.jsonl
  results/summary.json
  results/report.md
  tests/
```
Automated judging can be biased by the judge model, and results may change across model versions. The current committed result covers 80 judged pairs, which gives a real signal but still carries uncertainty. The benchmark measures a useful signal, not final product quality.