akkii2006/SpecBench

SpecBench

Custom eval suite generator for language models. Describe your model, get a targeted benchmark, score your outputs, and compare architectures and training runs.

Built for ML engineers who build small models from scratch.

Live: https://specbench.vercel.app


What it does

General-purpose benchmarks such as MMLU or HellaSwag say little about a custom small model. SpecBench generates an eval suite specific to what your model actually does, scores your outputs against a rubric, and reports where the model is weak and why.

Eval Suite Generation: Describe your model and dataset. SpecBench decomposes the description into capability axes and generates targeted prompts of three types: standard, adversarial, and edge case. Multilingual datasets are supported. Parallel workers speed up generation for large suites.
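To make the axis-by-type layout concrete, here is a minimal sketch of how a suite plan could be enumerated. It is an illustration only, not SpecBench's internal format: `EvalSlot`, `plan_suite`, and the axis names are hypothetical, and only the three prompt types come from the description above.

```python
from dataclasses import dataclass
from itertools import product

# Prompt types named in this README. Everything else here is hypothetical.
PROMPT_TYPES = ("standard", "adversarial", "edge_case")


@dataclass(frozen=True)
class EvalSlot:
    axis: str         # capability axis, e.g. "arithmetic" (made-up example)
    prompt_type: str  # one of PROMPT_TYPES


def plan_suite(axes, per_slot=1):
    """Return per_slot slots for every (axis, prompt type) pair."""
    return [
        EvalSlot(axis, prompt_type)
        for axis, prompt_type in product(axes, PROMPT_TYPES)
        for _ in range(per_slot)
    ]


slots = plan_suite(["arithmetic", "instruction_following"], per_slot=2)
print(len(slots))  # 2 axes x 3 prompt types x 2 prompts each
```

Each slot would then be filled with one generated prompt, so suite size grows as axes times prompt types times prompts per slot.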

Scoring: Paste your model outputs or upload a JSON batch file. SpecBench scores each output from 1 to 5 against a generated rubric and produces a per-axis report that calls out weaknesses and gives concrete recommendations.
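As a rough picture of what a per-axis report summarizes, the sketch below averages 1 to 5 scores by axis and flags low-scoring axes. The record fields (`axis`, `score`), the `weak_below` threshold, and the function name are assumptions made for illustration; the actual batch file schema is not documented here.

```python
import json
from collections import defaultdict
from statistics import mean

# Hypothetical scored records; the real SpecBench batch schema may differ.
batch_json = """[
  {"axis": "arithmetic", "score": 4},
  {"axis": "arithmetic", "score": 2},
  {"axis": "summarization", "score": 5}
]"""


def per_axis_report(records, weak_below=3.5):
    """Group 1-5 scores by axis and flag axes whose mean is below weak_below."""
    by_axis = defaultdict(list)
    for rec in records:
        score = rec["score"]
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range: {score}")
        by_axis[rec["axis"]].append(score)
    report = {}
    for axis, scores in by_axis.items():
        avg = mean(scores)
        report[axis] = {"mean": avg, "n": len(scores), "weak": avg < weak_below}
    return report


report = per_axis_report(json.loads(batch_json))
for axis, row in report.items():
    print(axis, row)
```

A mean threshold is the simplest possible weakness signal; the rubric-based report described above presumably carries more detail than this.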

Eval Script: Paste your model architecture and SpecBench generates a runnable Python eval script tailored to your implementation.

Diff: Compare two model output sets, architectures, eval suites, or training configs. Get an AI verdict on which is better and why.


Getting started

SpecBench runs entirely on your own API key. No key is stored on any server.

Supported providers:

  • Google AI Studio (Gemini 3.1 Flash Lite recommended)
  • Groq (LLaMA 3.3 70B recommended)

Get your key:

Open https://specbench.vercel.app, paste your key on the setup screen, and start.


Stack

Frontend: React, Vite, deployed on Vercel

Backend: Node.js, Express, deployed on Render

The backend runs on Render's free tier and may take 30 to 50 seconds to respond on the first request after a period of inactivity. Subsequent requests are fast.
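If you call the backend from your own scripts, the cold start can be absorbed by polling until the server answers. The following is a minimal sketch using only the Python standard library; `wait_for_backend` and its defaults are hypothetical, and no specific backend route is assumed (any HTTP response, including an error status, is treated as "awake").

```python
import time
import urllib.error
import urllib.request


def wait_for_backend(url, attempts=6, timeout=15.0, pause=5.0,
                     fetch=None, sleep=time.sleep):
    """Poll url until the server responds; return True if it did, else False."""
    if fetch is None:
        def fetch(u):
            urllib.request.urlopen(u, timeout=timeout).close()
    for attempt in range(attempts):
        try:
            fetch(url)
            return True
        except urllib.error.HTTPError:
            # The server answered with an error status, so it is awake.
            return True
        except OSError:
            # Covers URLError, timeouts, and refused connections.
            if attempt < attempts - 1:
                sleep(pause)
    return False


# Example (not executed here): wait_for_backend("http://localhost:4000")
```

With the defaults, the total wait budget is roughly two minutes, which leaves headroom over the 30 to 50 second cold start mentioned above.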


Running locally

Backend

cd backend
npm install
npm run dev

Starts on http://localhost:4000.

Frontend

cd frontend
npm install
npm run dev

Create a .env file in the frontend directory:

VITE_BACKEND_URL=http://localhost:4000

Starts on http://localhost:3000.


Screenshots

[Screenshots: Spec screen, Eval suite, Diff]

