Tweety

Run structured LLM evaluations across 14 tasks — quality, reasoning, vision, safety, and performance — from a single CLI.


Tweety evaluates language models across five capability groups using an LLM-as-a-Judge pipeline. Point it at a model, run tweety run, and get a scored report with per-task breakdowns, profiling metrics, and a readable HTML summary.

Backends: transformers · vllm · bitsandbytes · ollama · litellm (cloud)
Judge: any model via LiteLLM — defaults to gpt-4.1


Quick Start

1. Install

python -m venv .venv && source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -e ".[dev]"

2. Set your judge API key

export OPENAI_API_KEY="sk-..."          # OpenAI (default judge: gpt-4.1)
# or ANTHROPIC_API_KEY / GEMINI_API_KEY / AZURE_API_KEY — any LiteLLM provider works

3. Preprocess eval data (one-time, requires judge API key)

tweety preprocess --all

4. Run an evaluation

# minimal — model name and backend are all that's required
tweety run --model meta-llama/Llama-3-8B --backend transformers

# with a config file
tweety run -c config.yaml

Results land in results/<timestamp>/ — a scorecard.json, per-task verdicts.jsonl, and a report.md / report.html.
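
The scorecard is plain JSON, so the same results can be inspected programmatically. A minimal Python sketch (swap in your own timestamped run directory; no particular key names are assumed):

import json
from pathlib import Path

# Load the scorecard from a finished run (replace the run id with your own).
scorecard = json.loads(Path("results/20260421_080719/scorecard.json").read_text())

# Print whatever top-level entries the scorecard holds (scores, pass rates, weights).
for key, value in scorecard.items():
    print(key, value)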


Config File

Every setting here has a CLI flag equivalent, but a config file is cleaner for repeated runs:

model:
  name: "meta-llama/Llama-3-8B"
  backend: "transformers"   # transformers | vllm | bitsandbytes | ollama | litellm
  precision: "float16"
  device: "auto"

judge:
  model: "gpt-4.1"          # any LiteLLM model string

tasks:
  tasks: ["all"]            # run everything, or e.g. ["A", "b1_math_logic", "D"]
  skip: []

output:
  formats: ["markdown", "html"]
  directory: "results"

profiling:
  enabled: false            # set true to measure latency, throughput, memory

What Gets Evaluated

| Group | Tasks | What's measured |
|---|---|---|
| A — Text | Multilingual QA, Needle in Haystack, Adversarial QA, Multi-Turn, Summarization | Factual accuracy, hallucination, coherence, language fidelity |
| B — Reasoning | Math & Logic, Instruction Following | Correct answers, valid reasoning chains, constraint satisfaction |
| C — Visual | Visual QA, Scene Understanding, OCR | Image comprehension, spatial reasoning, text extraction from images |
| D — Structured Output | JSON Generation, Function Calling | Schema compliance, field accuracy, tool selection |
| E — Safety | Refusal, Prompt Injection Resistance | Appropriate refusals, resistance to jailbreaks and injections |

Each sample is scored by a hybrid judge — deterministic checks where possible, LLM scoring for open-ended factors. The final quality score is a weighted average across all groups.
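
As a toy illustration of that aggregation (the scores and weights below are made-up placeholders, not Tweety's actual values):

# Hypothetical per-group scores and weights -- placeholders, not Tweety's real configuration.
group_scores  = {"A": 0.82, "B": 0.74, "C": 0.68, "D": 0.91, "E": 0.95}
group_weights = {"A": 0.30, "B": 0.20, "C": 0.20, "D": 0.15, "E": 0.15}

# Weighted average across the five groups gives the single quality score.
quality = sum(group_scores[g] * group_weights[g] for g in group_scores)
print(f"quality: {quality:.3f}")   # 0.809 with these placeholder numbers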


Other Commands

tweety judge  --results-dir results/20260421_080719   # re-judge without re-running inference
tweety report --results-dir results/20260421_080719   # regenerate report from saved scorecard
tweety preprocess --task A                            # preprocess a single group
tweety list-tasks                                     # print all task IDs and groups
tweety list-backends                                  # print available backends

Resume an interrupted run without re-running completed tasks:

tweety run -c config.yaml --resume results/20260421_080719

→ Full flag reference: docs/CLI.md


Output Structure

results/<run_id>/
├── config.json          # config snapshot
├── run_meta.json        # hardware, timing, versions
├── profiling.json       # latency / memory / throughput (if enabled)
├── scorecard.json       # all scores, factor pass-rates, group weights
├── report.md            # human-readable markdown summary
├── report.html          # standalone HTML report
└── <task_id>/
    ├── responses.jsonl  # model responses, token counts, errors
    └── verdicts.jsonl   # per-factor boolean scores + judge reasoning
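
Because each verdict is one JSON object per line, per-factor pass rates are easy to recompute yourself. A minimal sketch (the "factor" and "passed" field names are assumptions about the schema, not documented keys; check a real verdicts.jsonl and adjust):

import json
from collections import defaultdict
from pathlib import Path

run_dir = Path("results/20260421_080719")           # replace with your own run directory
passes, totals = defaultdict(int), defaultdict(int)

# Walk every task directory and tally verdicts per factor.
# NOTE: "factor" and "passed" are assumed field names, not a documented schema.
for verdicts_file in run_dir.glob("*/verdicts.jsonl"):
    for line in verdicts_file.read_text().splitlines():
        verdict = json.loads(line)
        totals[verdict["factor"]] += 1
        passes[verdict["factor"]] += bool(verdict["passed"])

for factor in sorted(totals):
    print(f"{factor}: {passes[factor] / totals[factor]:.0%} pass rate")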

License

MIT
