Creative is an installable Codex skill that pushes an LLM away from default answers while keeping the answer useful, feasible, and simple.
This repo contains Creative, benchmark code, a default task set, and generated benchmark results.
Current result: Creative made answers much more original than baseline and won most overall comparisons in the committed 80-task real run.
| Plain-English Question | Answer |
|---|---|
| Does Creative make answers more original? | Yes, strongly. |
| Does Creative usually beat the baseline? | Often. |
| Does Creative win without losing feasibility? | Mixed. |
| Does Creative become too complicated? | No. |
The headline result: Creative changes answers in the intended direction. The strongest evidence is originality; the main caveat is a feasibility tradeoff.
Judge-question results:
The table below shows how often Creative won overall and how often it won each judge question.
| Judge Question | What It Measures | Creative Won | 95% CI |
|---|---|---|---|
| Which answer was better overall? | Overall judgment | 78.75% | 68.6%-86.3% |
| Which answer is less default and brings a better non-obvious angle? | Originality | 87.50% | 78.5%-93.1% |
| Which answer gives the user a better next move? | Usefulness | 42.50% | 32.3%-53.4% |
| Which answer is more realistic to execute as written? | Feasibility | 12.50% | 6.9%-21.5% |
| Which answer gives more concrete details, tradeoffs, or next actions? | Specificity | 38.75% | 28.8%-49.7% |
| Which answer avoids unnecessary complexity, ceremony, or bloat? | Simplicity | 33.75% | 24.3%-44.6% |
CreativeBench is an automated benchmark for testing whether Creative changes LLM outputs in a useful direction. It compares normal answers against answers generated with Creative, then judges the pair blindly.
Creative asks the model to reject the first clean answer, generate stronger alternatives with specific moves, and keep only the idea that is more useful than the obvious/default answer. The installable file lives at skills/creative/SKILL.md.
Prompts that ask for originality can drift into novelty without utility. CreativeBench tests whether Creative produces less-default answers while preserving usefulness, feasibility, specificity, and simplicity.
The benchmark is automated and uses blind pairwise judging. The headline signals are originality win rate, overall win rate, valid win rate, feasibility loss rate, and overcomplication rate. The benchmark punishes answers that are weird but useless.
CreativeBench is not a replacement for human evaluation, but it is a fast first-pass test.
For each task, CreativeBench generates two answers:
- Baseline mode: answer normally.
- Creative mode: answer using Creative.
It randomizes whether the baseline or creative answer appears as answer A. The judge does not see the labels. The judge scores originality, usefulness, feasibility, specificity, and simplicity, then chooses an overall winner or tie.
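A minimal sketch of that pairing step, with hypothetical field names (the actual schema lives in creative_bench/ and results/pairs.jsonl):

```python
import random

def make_blind_pair(task_id: str, baseline_answer: str, creative_answer: str) -> dict:
    """Randomly assign the two answers to slots A and B so the judge never
    sees which mode produced which answer. Illustrative only; these field
    names are hypothetical, not the repo's actual pair schema."""
    creative_is_a = random.random() < 0.5
    return {
        "task_id": task_id,
        "answer_a": creative_answer if creative_is_a else baseline_answer,
        "answer_b": baseline_answer if creative_is_a else creative_answer,
        "creative_slot": "A" if creative_is_a else "B",  # stored for scoring, never shown to the judge
    }
```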
The judge uses five plain-English questions:
| Judge Question | What It Measures |
|---|---|
| Which answer is less default and brings a better non-obvious angle? | Originality |
| Which answer gives the user a better next move? | Usefulness |
| Which answer is more realistic to execute as written? | Feasibility |
| Which answer gives more concrete details, tradeoffs, or next actions? | Specificity |
| Which answer avoids unnecessary complexity, ceremony, or bloat? | Simplicity |
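A rough sketch of how these questions could be assembled into the blind judge prompt; the wording and structure of the real prompt in creative_bench/ may differ:

```python
JUDGE_QUESTIONS = {
    "originality": "Which answer is less default and brings a better non-obvious angle?",
    "usefulness": "Which answer gives the user a better next move?",
    "feasibility": "Which answer is more realistic to execute as written?",
    "specificity": "Which answer gives more concrete details, tradeoffs, or next actions?",
    "simplicity": "Which answer avoids unnecessary complexity, ceremony, or bloat?",
}

def build_judge_prompt(task_prompt: str, answer_a: str, answer_b: str) -> str:
    # The judge only ever sees "A" and "B", never "baseline" or "creative".
    questions = "\n".join(f"- {q} (A, B, or tie)" for q in JUDGE_QUESTIONS.values())
    return (
        f"Task:\n{task_prompt}\n\n"
        f"Answer A:\n{answer_a}\n\n"
        f"Answer B:\n{answer_b}\n\n"
        f"Answer each question, then pick an overall winner (A, B, or tie):\n{questions}"
    )
```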
The public result is the Markdown report. JSON and JSONL files are kept as machine-readable evidence so the benchmark can be reproduced.
```
python -m pip install -e ".[dev]"
```

Use skills/creative/SKILL.md as the installable Creative skill file. In this repo, the benchmark code reads that exact file when it runs Creative mode.
Real runs require OPENAI_API_KEY. This is the run that tests whether Creative actually changes model outputs compared with baseline.
```
cp .env.example .env
# Add your API key to .env
python -m creative_bench.cli run
```

By default the real benchmark uses gpt-5.5 for both generation and judging with medium reasoning effort.
It runs API calls in parallel with MAX_WORKERS=12 by default. Lower this value in .env if you hit rate limits.
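A minimal .env using only the variables mentioned above (copy .env.example for the full template; the key value is a placeholder):

```
OPENAI_API_KEY=sk-your-key-here
MAX_WORKERS=12
```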
Human-facing output:

- results/report.md

Machine-readable outputs:

- results/baseline.jsonl
- results/creative.jsonl
- results/pairs.jsonl
- results/judgments.jsonl
- results/summary.json
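For quick programmatic checks, the machine-readable files can be loaded directly. A sketch, assuming summary.json exposes the win-rate metrics described further down:

```python
import json
from pathlib import Path

# Illustrative only; the exact keys depend on the benchmark version.
summary = json.loads(Path("results/summary.json").read_text())
judgments = [
    json.loads(line)
    for line in Path("results/judgments.jsonl").read_text().splitlines()
    if line.strip()
]
print(len(judgments), "judged pairs")
print("overall win rate:", summary.get("creative_overall_win_rate"))
```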
Edit data/tasks.jsonl. Each row must include:
- task_id
- category
- prompt
Prompts should describe situations where the obvious/default answer is likely too generic, too safe, or failing.
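A sketch of adding a row, with illustrative values for the three required fields:

```python
import json

# task_id and category values below are made up for illustration.
task = {
    "task_id": "naming-001",
    "category": "naming",
    "prompt": "Our product launch plan feels generic and safe. Propose a stronger direction.",
}
with open("data/tasks.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(task) + "\n")
```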
creative_valid_win_rate is the strictest "did Creative really win?" metric. It counts cases where the Creative answer wins overall, is not nonsense, and does not lose feasibility. Higher is better.
creative_overall_win_rate measures whether the blind judge preferred Creative overall. Higher is better.
creative_originality_win_rate measures whether Creative is less default than baseline. Higher is better.
creative_feasibility_loss_rate catches cases where Creative becomes less practical than baseline. Lower is better.
creative_overcomplication_rate catches answers that become elaborate without earning the complexity. Lower is better.
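A rough sketch of how these rates could be derived from the per-pair judgments, with hypothetical field names for the judge's verdicts:

```python
def summarize(judgments: list[dict]) -> dict:
    """Compute the headline rates from judged pairs. The field names
    (overall_winner, feasibility_winner, is_nonsense, overcomplicated)
    are illustrative, not the repo's actual judgment schema."""
    n = len(judgments)
    if n == 0:
        return {}
    overall_wins = sum(j["overall_winner"] == "creative" for j in judgments)
    originality_wins = sum(j["originality_winner"] == "creative" for j in judgments)
    feasibility_losses = sum(j["feasibility_winner"] == "baseline" for j in judgments)
    overcomplicated = sum(bool(j.get("overcomplicated")) for j in judgments)
    valid_wins = sum(
        j["overall_winner"] == "creative"
        and not j.get("is_nonsense", False)
        and j["feasibility_winner"] != "baseline"
        for j in judgments
    )
    return {
        "creative_overall_win_rate": overall_wins / n,
        "creative_originality_win_rate": originality_wins / n,
        "creative_valid_win_rate": valid_wins / n,
        "creative_feasibility_loss_rate": feasibility_losses / n,
        "creative_overcomplication_rate": overcomplicated / n,
    }
```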
The report does not use hard pass/fail thresholds. It reports observed rates and 95% confidence intervals so readers can see the effect size and uncertainty.
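The committed intervals are consistent with the standard Wilson score interval for a proportion; a minimal sketch:

```python
import math

def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a win rate of wins out of n pairs."""
    if n == 0:
        return (0.0, 0.0)
    p = wins / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (center - half, center + half)

# 63 overall wins out of 80 pairs (78.75%) gives roughly (0.686, 0.863),
# matching the 68.6%-86.3% interval in the table above.
print(wilson_interval(63, 80))
```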
```
creative_bench/
  README.md
  LICENSE
  pyproject.toml
  .gitignore
  .env.example
  CODEX.md
  .github/workflows/ci.yml
  skills/creative/SKILL.md
  creative_bench/
  data/tasks.jsonl
  results/summary.json
  results/report.md
  tests/
```
Automated judging can be biased by the judge model, and results may change across model versions. The current committed result covers 80 judged pairs, which gives a real signal but still carries uncertainty. The benchmark measures a useful signal, not final product quality.