Convex is an open-source, reactive database that's the best platform for full-stack AI coding.
We ensure that Convex performs well with a large set of models by continuously running evals. Each eval has a set of prompts for coding a Convex backend, a set of human-curated solutions, and a script for evaluating the LLM's output. These evals are split up into seven different categories:
- Fundamentals
- Data Modeling
- Queries
- Mutations
- Actions
- Idioms
- Clients
The most up-to-date eval runs can be found on our website.
Detailed results from production runs can be visualized at convex-evals.netlify.app.
We use these evals to tune our Convex Guidelines, which greatly improve model performance writing Convex code and decrease hallucinations.
First, install dependencies:
npm install -g bun
bun install
echo "ANTHROPIC_API_KEY=<your ANTHROPIC_API_KEY>" > .env
echo "OPENAI_API_KEY=<your OPENAI_API_KEY>" >> .env
The easiest way to run evals is with the interactive CLI:
bun run evals
This launches an interactive menu where you can:
- Run all evals
- Select specific categories to run
- Select individual evals
- Re-run failed evals from your last run
- Choose which model(s) to use

| Command | Description |
|---|---|
| `bun run evals` | Interactive mode |
| `bun run evals list` | List all available evals by category |
| `bun run evals status` | Show results from last run |
| `bun run evals status --failed` | Show only failed evals |
| `bun run evals models` | List available models |
| `bun run evals:failed` | Re-run only failed evals from last run |

Run evals directly without interactive mode:
# Run specific categories
bun run evals run -c 000-fundamentals 002-queries
# Run with a specific model
bun run evals run -m claude-sonnet-4-5 -c 005-idioms
# Run with multiple models
bun run evals run -m claude-sonnet-4-5 -m gpt-5 -f "000-fundamentals"
# Re-run failed evals
bun run evals run --failed
# Filter by regex pattern
bun run evals run -f "pagination"
# Post results to Convex database
bun run evals run --post-to-convex -c 000-fundamentals
You can run the eval runner directly:
bun run runner/index.ts
You can specify a test filter regex via an environment variable:
TEST_FILTER='data_modeling' bun run runner/index.ts
The test will also print out what temporary directory it's using for storing the generated files. You can override this with the `OUTPUT_TEMPDIR` environment variable.
OUTPUT_TEMPDIR=/tmp/convex-codegen-evals bun run runner/index.ts

| Variable | Description |
|---|---|
| `MODELS` | Comma-separated list of models to run |
| `TEST_FILTER` | Regex pattern to filter evals |
| `OUTPUT_TEMPDIR` | Directory for generated output files |
| `CONVEX_EVAL_URL` | Convex deployment URL (e.g. `https://xxx.convex.cloud`) |
| `CONVEX_AUTH_TOKEN` | Auth token for the Convex backend |

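As a rough illustration of how the variables in the table could be consumed, the sketch below parses `MODELS` and `TEST_FILTER` from a plain environment object. `parseEvalEnv`, `EvalEnvConfig`, and the sample eval names are hypothetical and are not taken from the runner's source; the real parsing logic may differ.

```typescript
// Hypothetical helper: a sketch of env parsing, not the runner's actual code.
interface EvalEnvConfig {
  models: string[]; // from MODELS (comma-separated)
  testFilter: RegExp | null; // from TEST_FILTER (regex pattern)
  outputTempdir: string | null; // from OUTPUT_TEMPDIR
}

function parseEvalEnv(env: Record<string, string | undefined>): EvalEnvConfig {
  // Split on commas, trim whitespace, and drop empty entries.
  const models = (env.MODELS ?? "")
    .split(",")
    .map((m) => m.trim())
    .filter((m) => m.length > 0);
  // An unset or empty TEST_FILTER means "no filtering".
  const testFilter = env.TEST_FILTER ? new RegExp(env.TEST_FILTER) : null;
  const outputTempdir = env.OUTPUT_TEMPDIR ?? null;
  return { models, testFilter, outputTempdir };
}

// Example input; in a real script this would be process.env.
const config = parseEvalEnv({
  MODELS: "claude-sonnet-4-5, gpt-5",
  TEST_FILTER: "data_modeling",
});

// Made-up eval names, used only to show the filter being applied.
const evalNames = ["001-data_modeling/000-example", "002-queries/000-example"];
const selected = evalNames.filter((n) => config.testFilter?.test(n) ?? true);
```

Taking the environment as a parameter rather than reading a global keeps the helper easy to test with fixed inputs.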
- Per-step progress lines with the eval id
- Per-eval result with pass/fail status and a clickable output dir
Note that test or category names cannot contain dashes.
- Create a new directory under `evals/<category>/<name>/`
- Add a `TASK.txt` file describing what the LLM should do
- Add an `answer/` directory with the human-curated solution
- Add a `grader.test.ts` file with unit tests
- Run the eval to verify it works

- Create `schema.ts` first
- Run codegen to generate types: `cd evals/<category>/<eval>/answer && bunx convex codegen`
- Implement solution files
- Run codegen again after any schema changes

These evals measure whether a model understands Convex - not whether it can follow detailed instructions. This creates a deliberate tension when writing tasks:
- Be explicit about the shape of the problem: schema, function names, argument types, return structure, which files to create.
- Don't over-specify Convex implementation details that are covered in the guidelines (e.g. when to use `internalMutation`, how to call functions via `internal.*`, how to export queries alongside an HTTP router). If a model needs the task to spell those out, it's failing the eval for the right reason.
When reviewing a failure, the first question should be: "Is this something the guidelines already cover?" If yes, it's a model fault - not a task problem. Only add detail to a task when the requirement is genuinely ambiguous or the model's interpretation was reasonable given the guidelines.
- Be explicit about schema - always provide the complete schema in the prompt using TypeScript code blocks
- Clear requirements - for each function, specify:
  - Exact function name
  - Required arguments and their types
  - Expected return type/structure
  - Any specific behaviors or edge cases to handle
- Scope the context - describe what the feature does, but trust the model to know how to implement it in Convex. Don't assume knowledge of the problem domain; do assume knowledge of Convex patterns from the guidelines.
- Implementation constraints - specify what files to create, what NOT to do, and any performance considerations that aren't obvious from the guidelines.

Common pitfalls to avoid:
- Ambiguous requirements - don't leave function names unspecified; don't use vague terms like "appropriate" without context; always specify exact field names and types
- Over-complication - don't test multiple concepts in one eval; keep schemas focused on the tested concept
- Missing context - describe the problem domain clearly, but don't explain Convex mechanics that are already in the guidelines
- Untestable requirements - make success criteria measurable; specify exact return types; include specific test cases
- Over-specification - spelling out every Convex detail (e.g. which function type to use, how the internal API works) defeats the purpose of the eval; if a model needs that hand-holding, that's a meaningful signal

Each eval directory contains:
- `TASK.txt` - the prompt sent to the model
- `answer/` - the human-curated reference solution
- `grader.test.ts` - Vitest tests that score the model's output

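The layout above lends itself to a quick sanity check before running a new eval. The following is a minimal sketch, assuming the caller supplies the directory's entry names with a trailing `/` on subdirectories; `missingEvalEntries` is a hypothetical helper, not part of the repo.

```typescript
// The three entries every eval directory is described as containing.
const REQUIRED_ENTRIES = ["TASK.txt", "answer/", "grader.test.ts"] as const;

// Hypothetical helper: returns the required entries that are absent.
// `entries` is assumed to list names relative to the eval directory,
// with a trailing "/" marking subdirectories.
function missingEvalEntries(entries: string[]): string[] {
  const present = new Set(entries);
  return REQUIRED_ENTRIES.filter((e) => !present.has(e));
}

// Example: an eval that has a task and answer but no grader yet.
const missing = missingEvalEntries(["TASK.txt", "answer/"]);
```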
- Data modeling - table relationships, index design, schema validation
- Query patterns - CRUD, index usage, filtering, joins, pagination, aggregation
- Actions - external calls, storage, node runtime, HTTP endpoints
- Idioms - internal functions, file organisation, batch patterns, code reuse
Grader tests can include a lightweight AI-based assessment that reviews the generated project and provides concise reasoning on pass/fail.
The grader builds a prompt from TASK.txt plus a manifest of files from the generated output directory and asks a model to decide pass/fail with reasoning. On failure, the reasoning appears directly in the test output and in run.log.
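To make the prompt-building step concrete, here is a minimal sketch of combining the task text with a file manifest. The function name, the `GeneratedFile` shape, the manifest format, and the instruction wording are all assumptions made for illustration; the grader's actual prompt is not shown in this document.

```typescript
// Hypothetical shape for one entry in the manifest of generated files.
interface GeneratedFile {
  path: string; // path relative to the output directory
  bytes: number; // file size, included only as an example detail
}

// Hypothetical sketch: joins the task and a file manifest into one prompt.
function buildGraderPrompt(task: string, files: GeneratedFile[]): string {
  const manifest = files
    .map((f) => `- ${f.path} (${f.bytes} bytes)`)
    .join("\n");
  return [
    "You are grading a generated Convex project.",
    "Task:",
    task.trim(),
    "Files in the output directory:",
    manifest,
    'Reply with "pass" or "fail" followed by one short paragraph of reasoning.',
  ].join("\n\n");
}

// Example input with made-up task text and file sizes.
const prompt = buildGraderPrompt("Create a query that lists messages.\n", [
  { path: "convex/schema.ts", bytes: 120 },
  { path: "convex/messages.ts", bytes: 340 },
]);
```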
Add a single standardised test using the helper:
import { createAIGraderTest } from "../../../grader/aiGrader";
// Basic usage (default name and 60s timeout)
createAIGraderTest(import.meta.url);
// Optional: custom name/timeout
createAIGraderTest(import.meta.url, "AI grader assessment", 60000);
bun run build:release
This will generate guideline files in the `dist/` directory for various AI coding assistants.
bun run list:models
bun run scripts/listModels.ts --format json
bun run scripts/listModels.ts --due-only --format json
The repo has one scheduled periodic eval workflow:
- `periodic_evals.yml` runs every 4 hours
- each run unions candidates from curated models, top-day non-curated OpenRouter models, and top OpenRouter benchmark models
- the combined candidate list is deduped before the workflow matrix expands
The periodic workflow uses the same scheduling policy before it actually queues a model:
- if we have never run a model before, it is due immediately
- otherwise we look at the model's stored OpenRouter first-seen timestamp
- the target interval starts at `24h`, grows with model age, hits about `30d` at one year old, and approaches `60d` for very old models
- the due check uses the latest completed default-experiment leaderboard run, so failed runs and `no_guidelines` runs do not delay the next periodic default run
The OpenRouter-derived selectors also do a lightweight preflight check so obviously dead models are skipped before entering the matrix.
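The scheduling policy above can be sketched with a saturating curve. The exact formula the workflow uses is not given here, so the curve below is an assumption chosen only to match the three stated anchor points (24h for a new model, about 30d at one year, approaching 60d for very old models). `targetIntervalDays` and `isDue` are hypothetical names.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Anchor points taken from the policy description.
const MIN_DAYS = 1; // 24h for a brand-new model
const MAX_DAYS = 60; // upper limit approached by very old models

// Assumed curve shape: MIN + (MAX - MIN) * age / (age + K).
// K is chosen so that a 365-day-old model gets a 30-day interval.
const K_DAYS = (365 * (MAX_DAYS - 30)) / (30 - MIN_DAYS);

function targetIntervalDays(ageDays: number): number {
  const age = Math.max(0, ageDays);
  return MIN_DAYS + ((MAX_DAYS - MIN_DAYS) * age) / (age + K_DAYS);
}

// firstSeenMs: the model's stored OpenRouter first-seen timestamp.
// lastCompletedRunMs: latest completed default-experiment run, or null
// if the model has never been run.
function isDue(
  firstSeenMs: number,
  lastCompletedRunMs: number | null,
  nowMs: number,
): boolean {
  if (lastCompletedRunMs === null) return true; // never run: due immediately
  const ageDays = (nowMs - firstSeenMs) / DAY_MS;
  const intervalMs = targetIntervalDays(ageDays) * DAY_MS;
  return nowMs - lastCompletedRunMs >= intervalMs;
}
```

Because failed runs and `no_guidelines` runs are excluded before `lastCompletedRunMs` is computed, they never push the next default run further out.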
