Qualitative LLM evaluation — a lightweight framework for testing open-source models side-by-side.
Built for vibe-coding benchmarks, visual quality checks, and performance comparisons of small OSS models served via llama.cpp.
- Multi-model evaluation — run the same prompt against multiple models sequentially
- Server orchestration — auto-start/stop `llama-server` instances with per-model configs
- Performance benchmarking — captures prompt processing speed, generation speed (tk/s), token counts (see the sketch after this list)
- Structured results — every run saved with metadata, raw output, extracted code, and perf metrics
- Comparison plots — dark-themed bar charts for gen speed, prompt speed, token counts
- Blind evaluation — anonymize model identities for unbiased human scoring (`--blind`)
- Screenshot capture — auto-run generated code and capture window screenshots (`--run-outputs`)
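The performance metrics come straight from the server's own timing report. Below is a minimal sketch of how a client can collect them from a running `llama-server` instance; the `/completion` endpoint and timing field names match recent llama.cpp builds, but treat this as an illustration of the idea rather than `run_eval.py`'s actual code.

```python
import requests

# Minimal sketch: send one prompt to a running llama-server and read back
# its self-reported timings. Field names follow recent llama.cpp builds
# and may differ on yours; this is illustrative, not run_eval.py's code.
def query_model(prompt: str, port: int = 8001, max_tokens: int = 16384) -> dict:
    resp = requests.post(
        f"http://127.0.0.1:{port}/completion",
        json={"prompt": prompt, "n_predict": max_tokens},
        timeout=3600,
    )
    resp.raise_for_status()
    data = resp.json()
    timings = data.get("timings", {})
    return {
        "text": data.get("content", ""),
        "prompt_tokens": timings.get("prompt_n"),
        "prompt_speed_tps": timings.get("prompt_per_second"),
        "gen_tokens": timings.get("predicted_n"),
        "gen_speed_tps": timings.get("predicted_per_second"),
    }
```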
```bash
# Install dependencies
pip install -r requirements.txt

# Dry run (validate everything, generate dummy plots)
python run_eval.py \
  --models model_configs/gpt_oss_120b.yaml model_configs/qwen3_coder_next.yaml \
  --prompt prompts/bouncing_balls.txt \
  --dry-run

# Real run (servers already running)
python run_eval.py \
  --models model_configs/qwen3_coder_next.yaml \
  --prompt prompts/bouncing_balls.txt \
  --no-start-server

# Full orchestration (start → eval → stop, sequentially)
python run_eval.py \
  --models model_configs/gpt_oss_120b.yaml model_configs/qwen3_coder_next.yaml \
  --prompt prompts/bouncing_balls.txt

# Blind evaluation with auto-screenshots
python run_eval.py \
  --models model_configs/gpt_oss_120b.yaml model_configs/qwen3_coder_next.yaml \
  --prompt prompts/bouncing_balls.txt \
  --blind --run-outputs --display-seconds 8
```

Each run creates a timestamped directory:
```
results/bouncing_balls_20260218_153306/
├── metadata.json        # Run metadata
├── comparison.png       # Performance comparison chart
├── summary_table.png    # Summary table image
├── blind_key.json       # Model identity mapping (only with --blind)
├── gpt_oss_120b/
│   ├── output.txt       # Raw model output
│   ├── output.py        # Extracted Python code
│   ├── perf.json        # Performance metrics
│   └── screenshot.png   # Window capture (only with --run-outputs)
└── qwen3_coder_next/
    ├── output.txt
    ├── output.py
    ├── perf.json
    └── screenshot.png
```
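Because everything is written as plain JSON and text, runs are easy to post-process outside the framework. Here is a minimal sketch that walks a run directory and tabulates whatever metrics each model's perf.json recorded; no specific key schema is assumed.

```python
import json
from pathlib import Path

# Minimal sketch: walk one run directory and print each model's captured
# metrics. The exact keys inside perf.json are whatever the run recorded;
# no specific schema is assumed here.
def summarize_run(run_dir: str) -> None:
    run = Path(run_dir)
    metadata = json.loads((run / "metadata.json").read_text())
    print(f"Run: {run.name}")
    print(f"Metadata keys: {sorted(metadata)}")
    for model_dir in sorted(p for p in run.iterdir() if p.is_dir()):
        perf_file = model_dir / "perf.json"
        if not perf_file.exists():
            continue
        perf = json.loads(perf_file.read_text())
        metrics = ", ".join(f"{k}={v}" for k, v in perf.items())
        print(f"  {model_dir.name}: {metrics}")

summarize_run("results/bouncing_balls_20260218_153306")
```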
New model — create a YAML file in `model_configs/`:
```yaml
name: "My-Model"
alias: "org/model-name"
model_path: "/path/to/model.gguf"
port: 8001
server_args:
  fit: "on"
  threads: 10
  flash-attn: "on"
```
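How these fields turn into an actual server invocation is up to the runner. The sketch below shows one plausible mapping from a config file to a `llama-server` command line; the flag mapping and the `my_model.yaml` filename are assumptions for illustration, not the project's real orchestration code.

```python
import subprocess
import yaml

# Sketch: turn a model config into a llama-server command line and launch it.
# The flag mapping is an assumption, not run_eval.py's real code; check your
# llama.cpp build for the flags it actually accepts.
def build_server_command(config_path: str) -> list[str]:
    with open(config_path) as f:
        cfg = yaml.safe_load(f)
    cmd = [
        "llama-server",
        "--model", cfg["model_path"],
        "--port", str(cfg["port"]),
        "--alias", cfg.get("alias", cfg["name"]),
    ]
    for key, value in cfg.get("server_args", {}).items():
        cmd += [f"--{key}", str(value)]  # e.g. threads: 10 -> --threads 10
    return cmd

server = subprocess.Popen(build_server_command("model_configs/my_model.yaml"))
```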
New prompt — drop a `.txt` file in `prompts/`:

```text
Write a Python program that simulates a double pendulum...
```
| Flag | Description |
|---|---|
| `--models` / `-m` | Model config YAML paths (one or more) |
| `--prompt` / `-p` | Prompt file path (`.txt` or `.yaml`) |
| `--no-start-server` | Skip server orchestration (servers already running) |
| `--dry-run` | Validate configs, generate dummy results |
| `--blind` | Anonymize model names for unbiased evaluation |
| `--run-outputs` | Run generated code + capture screenshots |
| `--display-seconds` | Time to wait before capturing screenshot (default: 5) |
| `--output-dir` / `-o` | Custom results directory |
| `--max-tokens` | Max generation tokens (default: 16384) |
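For reference, the table above maps naturally onto a standard argparse definition. The sketch below mirrors the documented flags, types, and defaults; it is a reconstruction for illustration, not necessarily how `run_eval.py` actually declares its parser.

```python
import argparse

# Sketch of an argparse surface matching the flags table above. It mirrors
# the documented options and defaults; the real parser may differ.
def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(description="Qualitative LLM evaluation")
    p.add_argument("--models", "-m", nargs="+", required=True,
                   help="Model config YAML paths (one or more)")
    p.add_argument("--prompt", "-p", required=True,
                   help="Prompt file path (.txt or .yaml)")
    p.add_argument("--no-start-server", action="store_true",
                   help="Skip server orchestration (servers already running)")
    p.add_argument("--dry-run", action="store_true",
                   help="Validate configs, generate dummy results")
    p.add_argument("--blind", action="store_true",
                   help="Anonymize model names for unbiased evaluation")
    p.add_argument("--run-outputs", action="store_true",
                   help="Run generated code + capture screenshots")
    p.add_argument("--display-seconds", type=int, default=5,
                   help="Time to wait before capturing screenshot")
    p.add_argument("--output-dir", "-o", default=None,
                   help="Custom results directory")
    p.add_argument("--max-tokens", type=int, default=16384,
                   help="Max generation tokens")
    return p
```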
- Python 3.10+
- `requests`, `pyyaml`, `matplotlib`
- Optional: `xdotool` + `imagemagick` (for `--run-outputs` screenshot capture)
- llama.cpp `llama-server` binary
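Before kicking off a long evaluation it can be worth checking that the external binaries are actually reachable. A small preflight sketch follows; the binary names are the conventional ones (`import` is ImageMagick's screenshot tool) and may differ on your system.

```python
import shutil

# Preflight sketch: check that required and optional external binaries are
# on PATH before starting a long evaluation run. Binary names are the
# conventional ones; adjust if yours are installed under different names.
def check_binaries() -> None:
    required = ["llama-server"]
    optional = ["xdotool", "import"]  # `import` ships with ImageMagick
    for name in required:
        if shutil.which(name) is None:
            raise SystemExit(f"missing required binary: {name}")
    for name in optional:
        if shutil.which(name) is None:
            print(f"warning: {name} not found; --run-outputs screenshots will not work")

check_binaries()
```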