Run structured LLM evaluations across 14 tasks — quality, reasoning, vision, safety, and performance — from a single CLI.
Tweety evaluates language models across five capability groups using an LLM-as-a-Judge pipeline. Point it at a model, run `tweety run`, and get a scored report with per-task breakdowns, profiling metrics, and a readable HTML summary.
Backends: transformers · vllm · bitsandbytes · ollama · litellm (cloud)
Judge: any model via LiteLLM — defaults to gpt-4.1
1. Install
```bash
python -m venv .venv && source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -e ".[dev]"
```

2. Set your judge API key
```bash
export OPENAI_API_KEY="sk-..."   # OpenAI (default judge: gpt-4.1)
# or ANTHROPIC_API_KEY / GEMINI_API_KEY / AZURE_API_KEY — any LiteLLM provider works
```

3. Preprocess eval data (one-time, requires judge API key)
```bash
tweety preprocess --all
```

4. Run an evaluation
```bash
# minimal — model name and backend are all that's required
tweety run --model meta-llama/Llama-3-8B --backend transformers

# with a config file
tweety run -c config.yaml
```

Results land in `results/<timestamp>/` — a `scorecard.json`, per-task `verdicts.jsonl`, and a `report.md` / `report.html`.
All flags have CLI equivalents, but a config file is cleaner for repeated runs:
```yaml
model:
  name: "meta-llama/Llama-3-8B"
  backend: "transformers"    # transformers | vllm | bitsandbytes | ollama | litellm
  precision: "float16"
  device: "auto"

judge:
  model: "gpt-4.1"           # any LiteLLM model string

tasks:
  tasks: ["all"]             # run everything, or e.g. ["A", "b1_math_logic", "D"]
  skip: []

output:
  formats: ["markdown", "html"]
  directory: "results"

profiling:
  enabled: false             # set true to measure latency, throughput, memory
```

| Group | Tasks | What's measured |
|---|---|---|
| A — Text | Multilingual QA, Needle in Haystack, Adversarial QA, Multi-Turn, Summarization | Factual accuracy, hallucination, coherence, language fidelity |
| B — Reasoning | Math & Logic, Instruction Following | Correct answers, valid reasoning chains, constraint satisfaction |
| C — Visual | Visual QA, Scene Understanding, OCR | Image comprehension, spatial reasoning, text extraction from images |
| D — Structured Output | JSON Generation, Function Calling | Schema compliance, field accuracy, tool selection |
| E — Safety | Refusal, Prompt Injection Resistance | Appropriate refusals, resistance to jailbreaks and injections |
Each sample is scored by a hybrid judge — deterministic checks where possible, LLM scoring for open-ended factors. The final quality score is a weighted average across all groups.
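For intuition, here is a minimal sketch of that weighted average. The group scores and weights below are made-up illustrative numbers, not Tweety's actual defaults or output:

```python
# Illustrative only: scores and weights are invented values, not Tweety's real
# defaults. Each group score is assumed to be a 0-1 aggregate over its tasks.
group_scores = {"A": 0.82, "B": 0.74, "C": 0.69, "D": 0.91, "E": 0.88}
group_weights = {"A": 0.30, "B": 0.20, "C": 0.20, "D": 0.15, "E": 0.15}

quality = sum(group_scores[g] * group_weights[g] for g in group_scores)
quality /= sum(group_weights.values())  # normalize in case weights don't sum to 1
print(f"Overall quality score: {quality:.3f}")
```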
```bash
tweety judge --results-dir results/20260421_080719    # re-judge without re-running inference
tweety report --results-dir results/20260421_080719   # regenerate report from saved scorecard
tweety preprocess --task A                             # preprocess a single group
tweety list-tasks                                      # print all task IDs and groups
tweety list-backends                                   # print available backends
```

Resume an interrupted run without re-running completed tasks:

```bash
tweety run -c config.yaml --resume results/20260421_080719
```

→ Full flag reference: docs/CLI.md
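To benchmark several models back to back, a thin wrapper over the CLI works. The sketch below only uses the `tweety run --model ... --backend ...` invocation shown above; the model list and backend choice are examples, not recommendations:

```python
# Hypothetical sweep script: invokes the documented `tweety run` CLI once per
# model. Model names and the backend here are examples only.
import subprocess

models = ["meta-llama/Llama-3-8B", "mistralai/Mistral-7B-v0.3"]

for model in models:
    subprocess.run(
        ["tweety", "run", "--model", model, "--backend", "transformers"],
        check=True,  # stop the sweep if one run fails
    )
```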
```
results/<run_id>/
├── config.json          # config snapshot
├── run_meta.json        # hardware, timing, versions
├── profiling.json       # latency / memory / throughput (if enabled)
├── scorecard.json       # all scores, factor pass-rates, group weights
├── report.md            # human-readable markdown summary
├── report.html          # standalone HTML report
└── <task_id>/
    ├── responses.jsonl  # model responses, token counts, errors
    └── verdicts.jsonl   # per-factor boolean scores + judge reasoning
```
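If you want to post-process results yourself, the JSON/JSONL files are straightforward to load. A minimal sketch follows; the file locations come from the layout above, but any field names inside them are not documented here, so the code only inspects structure and counts:

```python
# Minimal post-processing sketch. File locations match the layout above; the
# internal schema of scorecard.json / verdicts.jsonl is not assumed — inspect
# a real run for the actual field names.
import json
from pathlib import Path

run_dir = Path("results/20260421_080719")  # example run ID from this README

scorecard = json.loads((run_dir / "scorecard.json").read_text())
print(json.dumps(scorecard, indent=2)[:500])  # peek at the structure

# verdicts.jsonl: one JSON object per judged sample
for task_dir in sorted(run_dir.iterdir()):
    verdicts_file = task_dir / "verdicts.jsonl"
    if not verdicts_file.is_file():
        continue
    with verdicts_file.open() as f:
        verdicts = [json.loads(line) for line in f]
    print(f"{task_dir.name}: {len(verdicts)} judged samples")
```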
MIT