Custom eval suite generator for language models. Describe your model, get a targeted benchmark, score your outputs, and compare architectures and training runs.
Built for ML engineers who train small models from scratch.
Live: https://specbench.vercel.app
General-purpose benchmarks such as MMLU or HellaSwag say little about a custom small model trained for a narrow task. SpecBench generates an eval suite specific to what your model actually does, scores your outputs against a rubric, and tells you where it is weak and why.
Eval Suite Generation
Describe your model and dataset. SpecBench decomposes it into capability axes and generates targeted prompts across standard, adversarial, and edge case types. Supports multilingual datasets. Parallel workers speed up generation for large suites.
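To make the axes-by-prompt-type idea concrete, here is a minimal sketch of fanning generation tasks out to parallel workers. It is not SpecBench's implementation: the axis names, `generate_prompt`, and `build_suite` are hypothetical, and the stand-in generator returns a placeholder string where a real one would call an LLM.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Hypothetical axes for an imagined code-completion model. SpecBench derives
# its own axes from your description; these are only for illustration.
AXES = ["syntax validity", "identifier reuse", "long-context recall"]
PROMPT_TYPES = ["standard", "adversarial", "edge case"]


def generate_prompt(axis: str, prompt_type: str) -> dict:
    """Stand-in for one LLM call; returns a placeholder prompt record."""
    return {
        "axis": axis,
        "type": prompt_type,
        "prompt": f"[{prompt_type}] probe for {axis}",
    }


def build_suite(axes, prompt_types, workers=4):
    """Create one generation task per (axis, type) pair and run them in a pool."""
    tasks = list(product(axes, prompt_types))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in task order even though calls run concurrently.
        return list(pool.map(lambda task: generate_prompt(*task), tasks))


suite = build_suite(AXES, PROMPT_TYPES)
print(len(suite), "prompts generated")
```

Threads suit this kind of workload because each task would spend most of its time waiting on a network response rather than computing.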
Scoring
Paste your model outputs or upload a JSON batch file. SpecBench scores each output 1 to 5 against a generated rubric and produces a per-axis report with weaknesses called out and concrete recommendations.
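As an illustration of what a per-axis report boils down to, the sketch below averages 1 to 5 scores by axis and flags weak ones. The record layout (`axis`, `output`, `score`), the `per_axis_report` helper, and the 3.0 threshold are assumptions made for this example, not SpecBench's actual batch format or scoring logic.

```python
import json
from collections import defaultdict
from statistics import mean

# Hypothetical scored records: one per prompt, with the axis it targets,
# the model output, and a 1-5 rubric score.
batch_json = """
[
  {"axis": "syntax validity", "output": "def f(): pass", "score": 5},
  {"axis": "syntax validity", "output": "def f(: pass", "score": 2},
  {"axis": "long-context recall", "output": "...", "score": 1},
  {"axis": "long-context recall", "output": "...", "score": 2}
]
"""


def per_axis_report(records, weak_below=3.0):
    """Group scores by axis and flag axes whose mean falls below a threshold."""
    by_axis = defaultdict(list)
    for record in records:
        score = record["score"]
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range 1-5: {score}")
        by_axis[record["axis"]].append(score)
    return {
        axis: {
            "mean": mean(scores),
            "n": len(scores),
            "weak": mean(scores) < weak_below,
        }
        for axis, scores in by_axis.items()
    }


report = per_axis_report(json.loads(batch_json))
for axis, stats in report.items():
    print(axis, stats)
```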
Eval Script
Paste your model architecture and SpecBench generates a runnable Python eval script tailored to your implementation.
Diff
Compare two model output sets, architectures, eval suites, or training configs. Get an AI verdict on which is better and why.
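SpecBench's verdict comes from an LLM, but a plain numeric comparison is a useful sanity check alongside it. The sketch below assumes you already have per-axis mean scores for two runs. The `diff_runs` helper and the sample numbers are hypothetical.

```python
def diff_runs(run_a: dict, run_b: dict) -> dict:
    """Per-axis score delta (B minus A) for axes present in both runs."""
    shared = sorted(run_a.keys() & run_b.keys())
    return {axis: round(run_b[axis] - run_a[axis], 2) for axis in shared}


# Made-up per-axis means for two training runs.
run_a = {"syntax validity": 3.5, "long-context recall": 1.5}
run_b = {"syntax validity": 3.0, "long-context recall": 2.5}

deltas = diff_runs(run_a, run_b)
for axis, delta in deltas.items():
    direction = "improved" if delta > 0 else "regressed" if delta < 0 else "unchanged"
    print(f"{axis}: {delta:+.2f} ({direction})")
```

Restricting the comparison to shared axes avoids treating an axis that exists in only one suite as a regression.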
SpecBench runs entirely on your own API key. No key is stored on any server.
Supported providers:
- Google AI Studio (Gemini 3.1 Flash Lite recommended)
- Groq (LLaMA 3.3 70B recommended)
Get your key:
Open https://specbench.vercel.app, paste your key on the setup screen, and start.
Frontend: React, Vite, deployed on Vercel
Backend: Node.js, Express, deployed on Render
The backend runs on Render's free tier and may take 30 to 50 seconds to respond on the first request after a period of inactivity. Subsequent requests are fast.
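If you call the backend from a script, one way to absorb the cold start is to poll until the server answers before sending real work. This is a minimal sketch, not part of SpecBench: `wait_for_backend`, its defaults, and the example URL are assumptions, and the demo uses a fake fetch so it runs offline.

```python
import time
import urllib.error
import urllib.request


def wait_for_backend(url, attempts=6, delay=10.0, fetch=None, sleep=time.sleep):
    """Poll `url` until it answers, or give up after `attempts` tries.

    Returns the 1-based attempt number that succeeded, or None on failure.
    """
    if fetch is None:
        def fetch(target):
            # Any HTTP response, even a 404, means the server is awake.
            try:
                urllib.request.urlopen(target, timeout=60).close()
            except urllib.error.HTTPError:
                pass

    for attempt in range(1, attempts + 1):
        try:
            fetch(url)
            return attempt
        except OSError:
            # URLError, timeouts, and connection errors all derive from OSError.
            if attempt < attempts:
                sleep(delay)
    return None


# Offline demo: a fake fetch that fails twice, then succeeds.
calls = {"n": 0}


def flaky_fetch(_url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("backend still waking up")


# "https://backend.example" is a placeholder, not the real backend URL.
print(wait_for_backend("https://backend.example", fetch=flaky_fetch, sleep=lambda s: None))
```

With the defaults, six attempts spaced ten seconds apart comfortably cover a 30 to 50 second wake-up.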
Backend
cd backend
npm install
npm run dev
Starts on http://localhost:4000.
Frontend
cd frontend
npm install
npm run dev
Create a .env file in the frontend directory:
VITE_BACKEND_URL=http://localhost:4000
Starts on http://localhost:3000.


