Stop reading model reviews. Build your own benchmark.
MyBench is a 45-minute interview that turns your actual work into a private, saturation-resistant AI benchmark suite, plus a methodology for running it across model × harness combinations so you know what to use, when.
Inspired by Nate B. Jones' private benchmark (Dingo / Splash Brothers / Artemis II). Built on the seven-principle scoring methodology from Don't Let the LLM Pick a Number. By Codefi.
Open personal-benchmark/SKILL.md (or visit mybench.codefiworks.com), paste the interview prompt into any AI agent with shell + file access (Claude Code, Codex, Cursor, Paperclip, OpenCode, …), and answer the interview. The agent writes a benchmark suite to your working directory.
`npx skills add CodefiLabs/mybench`

Now any skills-aware agent (Claude Code, Cursor, Goose, OpenCode, …) can invoke MyBench by name.
After the interview, your working directory has:
benchmarks/
  _profile.md
  {your-benchmark-1}/
    prompt.md
    inputs/
    expected/
    evidence-guide.md
    traps.md
    meta.yaml
  {your-benchmark-2}/
    ...
Three to five benchmarks tuned to your work, each with a copy-pasteable prompt, realistic input files (with planted traps), an evidence guide for scoring, and a meta.yaml that captures the capability axis and criterion weights.
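As a rough illustration of that last file, a meta.yaml might look like the sketch below; the field names, capability axis, and weights are hypothetical, not the skill's actual schema.

```yaml
# benchmarks/{your-benchmark-1}/meta.yaml  (hypothetical shape, not the real schema)
capability_axis: long-context synthesis over messy inputs
criteria:
  - name: correctness
    weight: 0.4
  - name: trap_handling
    weight: 0.3
  - name: clarity
    weight: 0.2
  - name: reproducibility
    weight: 0.1        # weights sum to 1.0
```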
The LLM never picks a number. Scoring LLMs collect discrete evidence items, each with an impact drawn from {+5, +3, +2, +1, -1, -2, -3, -5}, organized in a 5-perspective × 5-criterion matrix. The formula then computes the score:
# per criterion
normalized = net_impact / sqrt(total_items)
raw = clamp(50 + normalized * 8.0, 0, 100)
multiplier = 0.75 + 0.25 * clamp(total_items / 20, 0, 1) # never > 1.0
final = round(50 + (raw - 50) * multiplier)
confidence = clamp(total_items / 20, 0, 1)
# overall
overall_score = sum(c.final * c.weight for c in criteria)
overall_confidence = min(c.confidence for c in criteria)
Sparse evidence is visibly low-confidence. Sqrt normalization punishes evidence farming. The 5×5 matrix forces multi-stakeholder evaluation. Independent passes by different model families catch contradictions.
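To make the arithmetic concrete, here is a minimal runnable sketch of the same formula; the `clamp` helper, the zero-evidence fallback, and the representation of evidence as plain lists of impacts are assumptions for illustration, not the skill's actual implementation.

```python
from math import sqrt

def clamp(x, lo, hi):
    return max(lo, min(hi, x))

def criterion_score(impacts):
    """Score one criterion from its evidence items (impacts drawn from
    {+5, +3, +2, +1, -1, -2, -3, -5}); returns (final, confidence)."""
    total_items = len(impacts)
    if total_items == 0:
        return 50, 0.0                              # assumption: no evidence -> neutral score, zero confidence
    net_impact = sum(impacts)
    normalized = net_impact / sqrt(total_items)     # sqrt damping punishes evidence farming
    raw = clamp(50 + normalized * 8.0, 0, 100)
    multiplier = 0.75 + 0.25 * clamp(total_items / 20, 0, 1)   # never > 1.0
    final = round(50 + (raw - 50) * multiplier)
    confidence = clamp(total_items / 20, 0, 1)
    return final, confidence

def overall(criteria):
    """criteria: list of (impacts, weight) pairs, weights summing to 1.0."""
    scored = [(criterion_score(impacts), weight) for impacts, weight in criteria]
    overall_score = sum(final * weight for (final, _), weight in scored)
    overall_confidence = min(conf for (_, conf), _ in scored)
    return overall_score, overall_confidence
```

For example, `overall([([5, 3, -1], 0.5), ([2, 1], 0.5)])` returns `(69.0, 0.1)`: a respectable score, but visibly low confidence because both criteria are evidence-sparse.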
Full methodology: pickanumber.codefiworks.com.
MyBench scores model × harness, not just model:
| | Claude Code | Codex | Cursor | pi.dev | OpenCode | Paperclip | OpenClaw | raw API | raw chat |
|---|---|---|---|---|---|---|---|---|---|
| Claude Opus | | | | | | | | | |
| GPT-5.5 | | | | | | | | | |
| Gemini 3.1 Pro | | | | | | | | | |
Run the same prompt across the grid. The leaderboard tells you which combination to reach for when your messy work hits your desk on Tuesday.
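One way to drive a run over the grid is sketched below; `run_one` and `score_transcript` stand in for however you launch each harness and score its output with the evidence-based scorer, and are assumptions rather than anything MyBench ships.

```python
from itertools import product

MODELS = ["Claude Opus", "GPT-5.5", "Gemini 3.1 Pro"]
HARNESSES = ["Claude Code", "Codex", "Cursor", "pi.dev", "OpenCode",
             "Paperclip", "OpenClaw", "raw API", "raw chat"]

def run_grid(prompt, run_one, score_transcript):
    """Run one benchmark prompt over every model × harness cell.

    run_one(model, harness, prompt)  -> transcript (you supply this per harness)
    score_transcript(transcript)     -> (score, confidence) from the scorer
    """
    leaderboard = []
    for model, harness in product(MODELS, HARNESSES):
        transcript = run_one(model, harness, prompt)
        score, confidence = score_transcript(transcript)
        leaderboard.append({"model": model, "harness": harness,
                            "score": score, "confidence": confidence})
    # highest score first; ties broken by confidence
    return sorted(leaderboard,
                  key=lambda row: (row["score"], row["confidence"]),
                  reverse=True)
```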
mybench/
├── personal-benchmark/     # the skill (skills.sh indexes this folder)
│   └── SKILL.md
├── src/                    # SvelteKit web app (mybench.codefiworks.com)
│   ├── routes/+page.svelte
│   └── lib/prompt.js       # the interview prompt, single source of truth
├── static/
├── package.json
└── README.md
npm install
npm run dev # http://localhost:5173
npm run build # static build → ./build
npm run preview # serve the production build

The site is static: @sveltejs/adapter-static builds to build/ and deploys anywhere.
- Source video — Nate B. Jones, GPT-5.5 vs Claude vs Gemini: The Real Difference Nobody's Talking About
- Scoring methodology — Don't Let the LLM Pick a Number, calibrated on 18 hackathon submissions and 342 BLS occupations across 9 models
- Built by — Codefi
- Pairwise diagnostic pattern — borrows from lechmazur/writing-style
- Skill ecosystem — skills.sh
MIT — see LICENSE.