CloudEval is a CLI for running model evals, comparing models, and generating shareable reports.
It is designed for:
- Cloudflare dogfooding
- public feedback loops
- small, opinionated team evals
- eventually, a broader OSS audience
When you want to compare a model like `workers-ai/@cf/zai-org/glm-5.1` against a baseline, you should be able to:
- run the same dataset against both models
- score the outputs consistently
- generate a report your team can read quickly
- explain the result in plain English
- share the run in Braintrust when needed
CloudEval does that.
Quickstart:

```sh
cd cloudeval
cp .env.example .env
source ~/.nvm/nvm.sh && nvm use 22
node ./bin/cloudeval.mjs doctor
node ./bin/cloudeval.mjs run --dataset agent-quality --models workers-ai/@cf/zai-org/glm-5.1,baseline
```

Commands:

- `cloudeval doctor` — validate Node, config, and env
- `cloudeval init` — scaffold a starter config and sample datasets
- `cloudeval run` — run an eval locally and write a JSON result
- `cloudeval report` — render a JSON result as markdown
- `cloudeval explain` — turn a JSON result into a plain-English summary
- `cloudeval compare` — compare two result files
- `cloudeval run --braintrust` — generate and execute Braintrust evals
To run against Braintrust:

```sh
node ./bin/cloudeval.mjs run \
  --dataset agent-quality \
  --models workers-ai/@cf/zai-org/glm-5.1,baseline \
  --braintrust
```

That will:
- generate Braintrust eval scripts
- run the task model(s)
- score the outputs
- write a shareable summary to `.cloudeval/braintrust/`
CloudEval looks for `evals.config.mjs`.
If it is missing, it falls back to the built-in Cloudflare preset.
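The config schema isn't documented above, so as a rough sketch, a minimal `evals.config.mjs` might default-export something like this (all field names here are assumptions; check the built-in preset under `src/presets/` for the real shape):

```js
// evals.config.mjs — illustrative shape, not the canonical schema
const config = {
  datasets: ['agent-quality'],                         // names of modules under src/datasets/
  models: ['workers-ai/@cf/zai-org/glm-5.1', 'baseline'],
  scorers: ['exact-match'],                            // names registered in src/scorers/registry.mjs
  output: '.cloudeval/results',                        // where JSON results are written
};

export default config;
```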
Relevant env vars:
- `CLOUDFLARE_ACCOUNT_ID`
- `CLOUDFLARE_API_TOKEN`
- `BRAINTRUST_API_KEY`
To add a dataset:
- create a file under `src/datasets/`
- export `{ name, rows }`
- reference it from `evals.config.mjs`
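The steps above fit in one small module. A sketch, assuming each row is an `{ input, expected }` pair (the row shape is an assumption; mirror an existing dataset under `src/datasets/`):

```js
// src/datasets/support-triage.mjs — hypothetical dataset module
// Exports the { name, rows } shape the runner looks for.
export const name = 'support-triage';

export const rows = [
  { input: 'My Worker returns a 1102 error under load.', expected: 'escalate' },
  { input: 'How do I bind a KV namespace in wrangler.toml?', expected: 'docs' },
];
```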
To add a scorer:
- add a rubric in `src/scorers/registry.mjs`
- wire it into the runner/generator
- add a test
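The registry's exact API isn't shown above; as a hypothetical sketch, a rubric entry could map a scorer name to a rubric string plus a scoring function returning 0–1 (the `scorers` map and the `score({ output, expected })` signature are assumptions):

```js
// src/scorers/registry.mjs — hypothetical entry; the real registry shape may differ
export const scorers = {
  'exact-match': {
    rubric: 'Score 1 if the output matches the expected label exactly, else 0.',
    score: ({ output, expected }) =>
      output.trim().toLowerCase() === expected.trim().toLowerCase() ? 1 : 0,
  },
};
```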
To add a provider:
- add an adapter under `src/providers/`
- keep the provider boundary thin
- preserve the local/reporting flow
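As an illustration of a thin provider boundary, here is a hypothetical Workers AI adapter; the `complete(model, prompt)` signature and the response normalization are assumptions, though the URL follows Cloudflare's documented `accounts/{account_id}/ai/run/{model}` REST pattern:

```js
// src/providers/workers-ai.mjs — illustrative adapter; the signature is an assumption
export function createWorkersAiProvider({ accountId, apiToken }) {
  return {
    name: 'workers-ai',
    async complete(model, prompt) {
      const url = `https://api.cloudflare.com/client/v4/accounts/${accountId}/ai/run/${model}`;
      const res = await fetch(url, {
        method: 'POST',
        headers: {
          Authorization: `Bearer ${apiToken}`,
          'Content-Type': 'application/json',
        },
        body: JSON.stringify({ messages: [{ role: 'user', content: prompt }] }),
      });
      if (!res.ok) throw new Error(`workers-ai: HTTP ${res.status}`);
      const data = await res.json();
      // Normalize to a plain string so runners and reporting stay provider-agnostic.
      return data.result?.response ?? '';
    },
  };
}
```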
Repository layout:

- `src/cli.mjs` — command entrypoint
- `src/runners/` — local eval execution
- `src/report/` — markdown + explanation output
- `src/providers/` — model/provider adapters
- `src/scorers/` — judging logic and rubrics
- `src/braintrust/` — Braintrust script generation
- `src/datasets/` — sample datasets
- `src/presets/` — Cloudflare and generic presets
Run the test suite with:

```sh
node --test
```

License: MIT