
MyBench

Stop reading model reviews. Build your own benchmark.

MyBench is a 45-minute interview that turns your actual work into a private, saturation-resistant AI benchmark suite, plus a methodology for running it across model × harness combinations so you know what to use, when.

Inspired by Nate B. Jones' private benchmark (Dingo / Splash Brothers / Artemis II). Built on the seven-principle scoring methodology from Don't Let the LLM Pick a Number. By Codefi.

Quickstart

Option A — Copy the prompt

Open personal-benchmark/SKILL.md (or visit mybench.codefiworks.com), paste the interview prompt into any AI agent with shell + file access (Claude Code, Codex, Cursor, Paperclip, OpenCode, …), and answer the interview. The agent writes a benchmark suite to your working directory.

Option B — Install as a skill via skills.sh

npx skills add CodefiLabs/mybench

Now any skills-aware agent (Claude Code, Cursor, Goose, OpenCode, …) can invoke MyBench by name.

What you get

After the interview, your working directory has:

benchmarks/
  _profile.md
  {your-benchmark-1}/
    prompt.md
    inputs/
    expected/
    evidence-guide.md
    traps.md
    meta.yaml
  {your-benchmark-2}/
    ...

Three to five benchmarks tuned to your work, each with a copy-pasteable prompt, realistic input files (with planted traps), an evidence guide for scoring, and a meta.yaml that captures the capability axis and criterion weights.
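
For illustration, a quick Python check of what a generated meta.yaml might contain. The field names capability_axis and criterion_weights are assumptions for this sketch, not a documented schema, and the path uses the placeholder benchmark name from the tree above.

from pathlib import Path
import yaml  # PyYAML

meta = yaml.safe_load(Path("benchmarks/your-benchmark-1/meta.yaml").read_text())

# Hypothetical fields; the interview decides the real schema
axis    = meta["capability_axis"]      # e.g. "multi-file refactoring under constraints"
weights = meta["criterion_weights"]    # e.g. {"correctness": 0.4, "traps_caught": 0.3, ...}

assert abs(sum(weights.values()) - 1.0) < 1e-9, "criterion weights should sum to 1"
print(f"{axis}: {len(weights)} weighted criteria")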

How scoring works

The LLM never picks a number. Scoring LLMs collect evidence items with discrete impacts drawn from {+5, +3, +2, +1, -1, -2, -3, -5}, organized in a 5-perspective × 5-criterion matrix. The formula then computes the score:

# per criterion
normalized = net_impact / sqrt(total_items)
raw        = clamp(50 + normalized * 8.0, 0, 100)
multiplier = 0.75 + 0.25 * clamp(total_items / 20, 0, 1)   # never > 1.0
final      = round(50 + (raw - 50) * multiplier)
confidence = clamp(total_items / 20, 0, 1)

# overall
overall_score      = sum(c.final * c.weight for c in criteria)
overall_confidence = min(c.confidence for c in criteria)

Sparse evidence is visibly low-confidence. Sqrt normalization punishes evidence farming. The 5×5 matrix forces multi-stakeholder evaluation. Independent passes by different model families catch contradictions.
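
As a worked sketch of the per-criterion arithmetic (the evidence list is made up, and clamp / criterion_score are helpers defined only for this example): six evidence items with a net impact of +9 land at a solid but damped score, and the thin evidence base shows up directly in the confidence.

from math import sqrt

def clamp(x, lo, hi):
    return max(lo, min(hi, x))

def criterion_score(impacts):
    """Score one criterion from its list of discrete evidence impacts."""
    total_items = len(impacts)
    net_impact  = sum(impacts)
    normalized  = net_impact / sqrt(total_items)
    raw         = clamp(50 + normalized * 8.0, 0, 100)
    multiplier  = 0.75 + 0.25 * clamp(total_items / 20, 0, 1)   # never > 1.0
    final       = round(50 + (raw - 50) * multiplier)
    confidence  = clamp(total_items / 20, 0, 1)
    return final, confidence

print(criterion_score([+5, +3, +2, +1, -1, -1]))   # roughly (74, 0.3)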

Full methodology: pickanumber.codefiworks.com.

Two dimensions

MyBench scores model × harness, not just model:

Harnesses (columns): Claude Code · Codex · Cursor · pi.dev · OpenCode · Paperclip · OpenClaw · raw API · raw chat
Models (rows): Claude Opus · GPT-5.5 · Gemini 3.1 Pro

The same prompt across the grid. The leaderboard tells you which combination to reach for when your messy work hits your desk on Tuesday.
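
A rough sketch of that loop, assuming you have some way to drive each combination. run_benchmark is a hypothetical stand-in for "paste the benchmark's prompt.md into this model on this harness, then score the transcript with the formulas above"; the model and harness names are just examples from the grid.

from itertools import product

MODELS    = ["Claude Opus", "GPT-5.5", "Gemini 3.1 Pro"]
HARNESSES = ["Claude Code", "Codex", "Cursor", "OpenCode", "raw API", "raw chat"]

def run_benchmark(benchmark_dir, model, harness):
    # Hypothetical: run prompt.md through this model x harness combination,
    # collect evidence per the evidence guide, and apply the scoring formulas.
    raise NotImplementedError

leaderboard = []
for model, harness in product(MODELS, HARNESSES):
    score, confidence = run_benchmark("benchmarks/your-benchmark-1", model, harness)
    leaderboard.append((score, confidence, model, harness))

for score, confidence, model, harness in sorted(leaderboard, reverse=True):
    print(f"{score:3d}  conf={confidence:.2f}  {model} + {harness}")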

Repo layout

mybench/
├── personal-benchmark/        # the skill (skills.sh indexes this folder)
│   └── SKILL.md
├── src/                       # SvelteKit web app (mybench.codefiworks.com)
│   ├── routes/+page.svelte
│   └── lib/prompt.js          # the interview prompt, single source of truth
├── static/
├── package.json
└── README.md

Develop

npm install
npm run dev          # http://localhost:5173
npm run build        # static build → ./build
npm run preview      # serve the production build

The site is static — @sveltejs/adapter-static builds to build/ and deploys anywhere.

Credits

Benchmark concept inspired by Nate B. Jones' private benchmark (Dingo / Splash Brothers / Artemis II). Scoring methodology from Don't Let the LLM Pick a Number. Built by Codefi.

License

MIT — see LICENSE.
