Code coming soon!
This repository will host the code and dataset for GT-HarmBench, a benchmark of 2,009 high-stakes scenarios for evaluating AI alignment in multi-agent environments.
Frontier AI systems are increasingly capable and deployed in high-stakes multi-agent environments. However, existing AI safety benchmarks largely evaluate single agents, leaving multi-agent risks such as coordination failure and conflict poorly understood. GT-HarmBench addresses this gap with scenarios spanning game-theoretic structures including the Prisoner's Dilemma, Stag Hunt, and Chicken, drawn from realistic AI risk contexts in the MIT AI Risk Repository.
- Across 15 frontier models, agents choose socially beneficial actions in only 62% of cases, frequently leading to harmful outcomes.
- Models show sensitivity to game-theoretic prompt framing and ordering.
- Game-theoretic interventions improve socially beneficial outcomes by up to 18%.
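To make the game-theoretic structures named above concrete, here is a minimal sketch of the three 2×2 games using conventional textbook payoffs. These numbers and names are illustrative only, not the values or data format used in GT-HarmBench:

```python
# Illustrative 2x2 payoff matrices for the three structures named above.
# Keys are (row action, column action); values are (row payoff, column payoff).
# Actions: "C" = cooperate (socially beneficial), "D" = defect.
# Payoff values are standard textbook examples, NOT GT-HarmBench data.

GAMES = {
    # Prisoner's Dilemma: T > R > P > S. Defection dominates for each
    # player, yet mutual defection (1, 1) is worse than mutual
    # cooperation (3, 3) -- the canonical coordination failure.
    "prisoners_dilemma": {
        ("C", "C"): (3, 3), ("C", "D"): (0, 5),
        ("D", "C"): (5, 0), ("D", "D"): (1, 1),
    },
    # Stag Hunt: R > T >= P > S. Two pure-strategy equilibria; mutual
    # cooperation pays best but is risky if the other player defects.
    "stag_hunt": {
        ("C", "C"): (4, 4), ("C", "D"): (0, 3),
        ("D", "C"): (3, 0), ("D", "D"): (2, 2),
    },
    # Chicken: T > R > S > P. Mutual defection (a head-on crash) is the
    # worst outcome for both players, so each wants the other to yield.
    "chicken": {
        ("C", "C"): (3, 3), ("C", "D"): (1, 4),
        ("D", "C"): (4, 1), ("D", "D"): (0, 0),
    },
}

def is_prisoners_dilemma(game):
    """Check the defining payoff ordering T > R > P > S for the row player."""
    r = game[("C", "C")][0]  # reward for mutual cooperation
    s = game[("C", "D")][0]  # sucker's payoff
    t = game[("D", "C")][0]  # temptation to defect
    p = game[("D", "D")][0]  # punishment for mutual defection
    return t > r > p > s
```

Only the Prisoner's Dilemma matrix satisfies the T > R > P > S ordering; the Stag Hunt and Chicken matrices fail it, which is what distinguishes the three scenario types.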