Coming soon.
VABench is a public, reproducible benchmark for AI agents that produce video advertisements end-to-end. We will release the full harness (briefs, evaluation scorers, baseline runners, and aggregation scripts) as open source under the Apache 2.0 license in a future update.
VABench grades AI video-ad systems on three independent evaluation arms:
- Arm 1 — Capability. Pairwise head-to-head against general-purpose video agents on the brief.
- Arm 2 — Ad Quality. Pairwise against frontier text-to-video models (single-scene and multi-scene configurations) on eight ad-rubric dimensions.
- Arm 3 — Video Production Quality. Brief-blind grading on production polish, with a frame-level hallucination rubric folded in at 3× weight.
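As a rough illustration of what "folded in at 3× weight" could mean for Arm 3, the sketch below computes a weighted mean in which the hallucination score counts three times as much as any single polish dimension. The dimension names, the 0 to 1 score scale, and the equal weighting of polish dimensions are assumptions made for illustration; the released scorer may aggregate differently.

```python
from typing import Mapping

HALLUCINATION_WEIGHT = 3.0  # from the README: hallucination rubric at 3x weight
POLISH_WEIGHT = 1.0         # assumption: every polish dimension weighs the same


def arm3_score(polish: Mapping[str, float], hallucination: float) -> float:
    """Weighted mean of polish scores and one hallucination score.

    All scores are assumed to lie in [0, 1], higher is better
    (so a hallucination score of 1.0 means no hallucinations found).
    """
    if not polish:
        raise ValueError("need at least one polish dimension")
    weighted_sum = POLISH_WEIGHT * sum(polish.values())
    weighted_sum += HALLUCINATION_WEIGHT * hallucination
    total_weight = POLISH_WEIGHT * len(polish) + HALLUCINATION_WEIGHT
    return weighted_sum / total_weight


# Hypothetical dimension names, for illustration only.
example = arm3_score({"lighting": 1.0, "pacing": 1.0, "audio_mix": 1.0}, 0.0)
print(example)
```

With three perfect polish scores and a failing hallucination score, the result is pulled down to the midpoint, which shows how strongly the 3× term dominates under these assumptions.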
The brief set covers every major performance-ad pattern (problem-solution, before-after, demo, testimonial, lifestyle) plus targeted stress tests against the hardest axes of multi-scene production (anchor storytelling, persona consistency, brand-asset reuse, structured CTA text, persuasion-arc compliance).
The headline leaderboard and methodology are public now:
- Leaderboard: https://creatify.ai/research/agent-benchmarks
- Technical report: https://creatify.ai/research/agent
The harness release will include:
- Brief set + JSON Schema
- Pairwise judge implementation (Claude Opus 4.7 + GPT-5, position-swap consensus)
- Frame-level hallucination scorer
- Structural-compliance checks (duration, aspect, audio, OCR)
- Baseline runners for the raw text-to-video frontier (Veo 3.1, Kling 3.0, Seedance 2.0) and general-purpose video agents (HeyGen V3)
- Aggregation scripts that emit the published leaderboard tables
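Until the judge implementation is published, the following sketch shows one way a position-swap consensus can work: each judge sees the pair twice with the order reversed, a verdict counts only if it survives the swap, and the judges must then agree. The `Judge` callable, the `"first"`/`"second"`/`"tie"` return values, and the unanimity rule are all assumptions for illustration, not the harness's actual interface.

```python
from typing import Callable, Sequence

# Hypothetical judge interface: given two candidates in display order,
# return "first", "second", or "tie".
Judge = Callable[[str, str], str]


def swap_consistent_vote(judge: Judge, a: str, b: str) -> str:
    """Return "A", "B", or "tie". A pick that flips with position is a tie."""
    forward = judge(a, b)   # A shown first
    backward = judge(b, a)  # B shown first
    if forward == "first" and backward == "second":
        return "A"
    if forward == "second" and backward == "first":
        return "B"
    return "tie"


def consensus(judges: Sequence[Judge], a: str, b: str) -> str:
    """Assumed rule: a winner is declared only if every judge agrees."""
    votes = {swap_consistent_vote(judge, a, b) for judge in judges}
    return votes.pop() if len(votes) == 1 else "tie"


# Toy judges standing in for real model calls.
def prefers_longer(x: str, y: str) -> str:
    if len(x) == len(y):
        return "tie"
    return "first" if len(x) > len(y) else "second"


def always_first(x: str, y: str) -> str:
    return "first"  # a purely position-biased judge


print(consensus([prefers_longer, prefers_longer], "x", "yyy"))
print(consensus([prefers_longer, always_first], "x", "yyy"))
```

The swap step filters out position bias: the `always_first` judge never produces a winner, so it cannot tip a comparison on its own.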
The harness will be released under the Apache 2.0 license (see LICENSE).
Watch this repository to be notified when the harness lands.