Run structured LLM evaluations across 14 tasks — quality, reasoning, vision, safety, and performance — from a single CLI.
Tweety evaluates language models across five capability groups using an LLM-as-a-Judge pipeline. Point it at a model, run `tweety run`, and get a scored report with per-task breakdowns, profiling metrics, and a readable HTML summary.
Backends: transformers · vllm · bitsandbytes · ollama · litellm (cloud)
Judge: any model via LiteLLM — defaults to gpt-4.1
1. Install
```bash
python -m venv .venv && source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -e ".[dev]"
```

2. Set your judge API key
```bash
export OPENAI_API_KEY="sk-..."   # OpenAI (default judge: gpt-4.1)
# or ANTHROPIC_API_KEY / GEMINI_API_KEY / AZURE_API_KEY — any LiteLLM provider works
```

3. Preprocess eval data (one-time, requires judge API key)
```bash
tweety preprocess --all
```

4. Run an evaluation
```bash
# minimal — model name and backend are all that's required
tweety run --model meta-llama/Llama-3-8B --backend transformers

# with a config file
tweety run -c config.yaml
```

Results land in `results/<timestamp>/` — a `scorecard.json`, per-task `verdicts.jsonl`, and a `report.md` / `report.html`.
All flags have CLI equivalents, but a config file is cleaner for repeated runs:
```yaml
model:
  name: "meta-llama/Llama-3-8B"
  backend: "transformers"    # transformers | vllm | bitsandbytes | ollama | litellm
  precision: "float16"
  device: "auto"

judge:
  model: "gpt-4.1"           # any LiteLLM model string

tasks:
  tasks: ["all"]             # run everything, or e.g. ["A", "b1_math_logic", "D"]
  skip: []

output:
  formats: ["markdown", "html"]
  directory: "results"

profiling:
  enabled: false             # set true to measure latency, throughput, memory
```

| Group | Tasks | What's measured |
|---|---|---|
| A — Text | Multilingual QA, Needle in Haystack, Adversarial QA, Multi-Turn, Summarization | Factual accuracy, hallucination, coherence, language fidelity |
| B — Reasoning | Math & Logic, Instruction Following | Correct answers, valid reasoning chains, constraint satisfaction |
| C — Visual | Visual QA, Scene Understanding, OCR | Image comprehension, spatial reasoning, text extraction from images |
| D — Structured Output | JSON Generation, Function Calling | Schema compliance, field accuracy, tool selection |
| E — Safety | Refusal, Prompt Injection Resistance | Appropriate refusals, resistance to jailbreaks and injections |
Each sample is scored by a hybrid judge — deterministic checks where possible, LLM scoring for open-ended factors. The final quality score is a weighted average across all groups.
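For intuition, here is a minimal sketch of that weighted average. The group scores and weights below are made-up illustrative numbers, not Tweety's actual defaults or output:

```python
# Illustrative only: scores and weights are invented values, not Tweety's real
# defaults. Each group score is assumed to be a 0-1 aggregate over its tasks.
group_scores = {"A": 0.82, "B": 0.74, "C": 0.69, "D": 0.91, "E": 0.88}
group_weights = {"A": 0.30, "B": 0.20, "C": 0.20, "D": 0.15, "E": 0.15}

quality = sum(group_scores[g] * group_weights[g] for g in group_scores)
quality /= sum(group_weights.values())  # normalize in case weights don't sum to 1
print(f"Overall quality score: {quality:.3f}")
```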
```bash
tweety judge --results-dir results/20260421_080719    # re-judge without re-running inference
tweety report --results-dir results/20260421_080719   # regenerate report from saved scorecard
tweety preprocess --task A                             # preprocess a single group
tweety list-tasks                                      # print all task IDs and groups
tweety list-backends                                   # print available backends
```

Resume an interrupted run without re-running completed tasks:

```bash
tweety run -c config.yaml --resume results/20260421_080719
```

→ Full flag reference: docs/CLI.md
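To benchmark several models back to back, a thin wrapper over the CLI works. The sketch below only uses the `tweety run --model ... --backend ...` invocation shown above; the model list and backend choice are examples, not recommendations:

```python
# Hypothetical sweep script: invokes the documented `tweety run` CLI once per
# model. Model names and the backend here are examples only.
import subprocess

models = ["meta-llama/Llama-3-8B", "mistralai/Mistral-7B-v0.3"]

for model in models:
    subprocess.run(
        ["tweety", "run", "--model", model, "--backend", "transformers"],
        check=True,  # stop the sweep if one run fails
    )
```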
```
results/<run_id>/
├── config.json          # config snapshot
├── run_meta.json        # hardware, timing, versions
├── profiling.json       # latency / memory / throughput (if enabled)
├── scorecard.json       # all scores, factor pass-rates, group weights
├── report.md            # human-readable markdown summary
├── report.html          # standalone HTML report
└── <task_id>/
    ├── responses.jsonl  # model responses, token counts, errors
    └── verdicts.jsonl   # per-factor boolean scores + judge reasoning
```
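If you want to post-process results yourself, the JSON/JSONL files are straightforward to load. A minimal sketch follows; the file locations come from the layout above, but any field names inside them are not documented here, so the code only inspects structure and counts:

```python
# Minimal post-processing sketch. File locations match the layout above; the
# internal schema of scorecard.json / verdicts.jsonl is not assumed — inspect
# a real run for the actual field names.
import json
from pathlib import Path

run_dir = Path("results/20260421_080719")  # example run ID from this README

scorecard = json.loads((run_dir / "scorecard.json").read_text())
print(json.dumps(scorecard, indent=2)[:500])  # peek at the structure

# verdicts.jsonl: one JSON object per judged sample
for task_dir in sorted(run_dir.iterdir()):
    verdicts_file = task_dir / "verdicts.jsonl"
    if not verdicts_file.is_file():
        continue
    with verdicts_file.open() as f:
        verdicts = [json.loads(line) for line in f]
    print(f"{task_dir.name}: {len(verdicts)} judged samples")
```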
MIT